So now we're going to talk about a specific part of big data:
the velocity part. Huge amounts of data are being generated all the time, which essentially is a data stream.
That's a flow of instances: you could have a flow of images coming in, a flow of
video coming in, or just a flow of lines to go into a database. The thing about this dynamic data
is that the patterns within it can change. So if we've got, for example, a static machine learning model,
that's not going to deal very well with a changing pattern in the data.
We build a single model at the start and use it to make predictions on later data, and the model's
accuracy can degrade over time as that data changes.
The problem of designing algorithms to deal with this real-time data
has been a research topic for several years now, and there are several real-world applications as well.
If you think about
banks trying to detect fraud: as patterns change and different frauds occur,
they want their models to be able to update all the time. It's similar for intrusion detection systems in computer networks:
they want to be able to update
and keep on top of what is happening.
Ideally, you would want this to happen automatically, with minimal interference from humans, because otherwise they've got to spot when changes are happening.
We just want the machines to be able to do it by themselves.
So if you think about a traditional classification problem on a static batch of data,
you assume you have all of that data there already. You have your training and test sets, and you have
instances with
features, which we call X, and then there's some unknown
function f(X) which gives you the class label. You want to find a
hypothesis that gives you the best prediction possible,
so one that approximates this function as well as possible.
Say you have a red class and a green class, and we have instances that look like this. Our function f(X) may create
a class boundary that looks like this, so anything on this side is red and anything on this side is green.
Our model doesn't know that, but we use standard machine learning techniques, decision trees, neural networks,
whatever you want, and it learns a boundary
that looks like that. That will do okay on whatever data we have.
It's not perfect, but it may get the results that we want. This is static classification: we already have all our data.
So we've got our data and we've done our machine learning.
This is the decision boundary that we've learnt, and the dotted line is the actual boundary. This gives okay results.
Let's now say that this is happening in a data stream. We get this data originally and we build this model.
Later on we have a similar distribution of instances arriving.
However, what now happens is that some of these instances are, in reality, in a different class.
So the true boundary is now here, but we still have our
model with this decision boundary, and so we're now predicting instances here and here into the wrong class if we use that
exact same model. What we would see in this case, if we plot
percentage accuracy over time, is that at this change point
accuracy would plummet. This problem is called real concept drift. What has effectively happened
is that the unknown function has changed but we've kept our hypothesis, our machine learning model, exactly the same, and so
it starts to perform badly.
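The accuracy drop at the change point can be sketched roughly as follows. This is only an illustration under assumptions that are not in the talk: a one-dimensional feature, a true boundary that jumps from 0.5 to 0.8, and made-up function names such as `static_model` and `accuracy_per_block`.

```python
import random

def true_label(x, boundary):
    """The unknown function f(X): class 1 above the boundary, class 0 below."""
    return 1 if x > boundary else 0

def static_model(x):
    """Hypothesis learnt once on the early data and never updated."""
    return 1 if x > 0.5 else 0

def accuracy_per_block(n_instances=2000, change_point=1000, block=500, seed=0):
    """Accuracy of the static model, measured on consecutive blocks of the stream."""
    rng = random.Random(seed)
    accuracies, correct = [], 0
    for t in range(n_instances):
        x = rng.random()
        # Assumed real concept drift: the true boundary moves at the change point.
        boundary = 0.5 if t < change_point else 0.8
        correct += static_model(x) == true_label(x, boundary)
        if (t + 1) % block == 0:
            accuracies.append(correct / block)
            correct = 0
    return accuracies

acc = accuracy_per_block()
print(acc)  # the first two blocks are before the drift, the last two after it
```

Before the change point the static threshold matches the true boundary exactly, so every prediction is right; afterwards every instance between 0.5 and 0.8 is misclassified.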
We can also have a similar problem called virtual drift. What would happen in this case is
that the
target decision boundary has stayed the same as the original,
but the instances we now see in the stream are somewhere else in the feature space. Let's say we now see
data
like this. Even though the
optimal decision boundary is in exactly the same place, we now have different data. That means that our predicted boundary
is going to get this instance wrong, because we haven't got a way of incorporating
information from this instance into the original model that we built. Both of these will create this decrease in accuracy. We can also
look at drift in data streams in terms of the speed at which it happens. Something that gives us an accuracy plot that
looks like this is called sudden drift: we go straight from one concept in the data stream,
so one decision boundary, straight to another one. Another possible thing that could happen
is that our accuracy looks like this.
So rather than this sudden switch, the decision boundary gradually shifts. In real life, if we're looking at a very, very oversimplified
intrusion detection system, we have only two features that we're looking at in the original dataset.
Anything with these features is a
security
problem, an intrusion; anything on this side is good. In this case,
what happens is that suddenly there's a new way of attacking the network, and so suddenly
what was here is now not good. We see those patterns and we say, okay,
no, that counts as an intrusion. In this case,
what it means is that we see something we've not seen before, so the model hasn't been trained with any similar data.
So it could get it right: it could fall somewhere up here and we correctly say this is bad.
But it could also fall in an area where we didn't learn the decision boundary so well, so
we get that prediction wrong. We just looked at what
the problems are with using a single static model when we're dealing with incoming data.
Over time the distribution changes and we start to see a decrease in accuracy on whatever model we built.
So what happens in a stream machine learning algorithm would be this. First of all,
you've got X arriving. This is your instance; in our previous example, this would just have two values associated with it.
What first happens is we make a prediction. In the classification example, we classify this: yes,
it's an intrusion, or no, it's not an intrusion, using the current model that we have. Then we update whatever model we have
using information from X, and we'll talk about some of the ways that this is done in a second.
One of the caveats with stream machine learning is that for this to happen you
need to have
the real class label, if you're doing classification.
In order to incorporate information from this instance into whatever model you've got, you need to have that label there. Now, in some cases
it's very easy to say: we've seen this data, this is what it's classified as,
and we can do that immediately. If we're thinking about
making weather predictions, we can almost immediately say, yes, this is what the weather is like. It may be a day's delay,
but that's a pretty immediate thing. For things like fraud detection, for example,
you may see a pattern of data,
you may
predict it as not being fraud, and then two days later this person figures out that actually there's something wrong with their bank account.
They phone up and it does turn out to be fraud.
So we'd only have the label for that data after that has happened.
The final bit is to update the model
at this point. The goal of updating the model over time is that, rather than having a performance plot
that looks like this, where we go from 95% accuracy down to 20% accuracy,
we instead end up with something where, okay,
we may drift a little bit here and have a tiny performance decrease,
but the model should very quickly recover back to the original level and we still have high performance.
So that's the goal of this model update. There are various approaches we can take. The first one is explicit drift handling,
which means that we first of all detect when a drift happens in the data stream.
To do that,
we have drift detection methods, and these are usually statistical tests that look at some aspect of the data arriving.
So if the distribution of the data we see arriving, or the distribution of the classes we see, is changing,
they flag that as a drift. Some of these will also look at the accuracy of the classifier.
So if the classifier performance suddenly drops we can say, well, we've probably got a drift here.
We need to do something to the model to mitigate this.
Who spots that though? Is there an algorithm that actually spots that something's different to what it should be?
Yes, there are various statistical tests that will do this,
that will measure things like the mean of the data arriving and be able to spot things that have changed, basically.
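As a rough illustration of a test that watches the mean of the arriving data, here is a minimal sketch. It is not one of the published drift detectors: `MeanShiftDetector`, its window size and its threshold are all assumptions made for the example. It compares the mean of the newest values with the mean of an older reference window and flags a drift when they differ by more than a chosen number of standard errors.

```python
from collections import deque
import math
import random

class MeanShiftDetector:
    """Hypothetical detector comparing a recent window with an older one."""
    def __init__(self, window=100, threshold=5.0):
        self.window = window
        self.threshold = threshold               # in standard errors (assumed value)
        self.reference = deque(maxlen=window)    # older values
        self.recent = deque(maxlen=window)       # newest values

    def add(self, value):
        """Feed one value (a feature, or 1/0 for a wrong/right prediction).
        Returns True when a drift is flagged."""
        if len(self.recent) == self.window:
            # The oldest recent value is about to fall out: keep it as reference.
            self.reference.append(self.recent[0])
        self.recent.append(value)
        if len(self.reference) < self.window:
            return False                         # not enough history yet
        mean_ref = sum(self.reference) / self.window
        mean_new = sum(self.recent) / self.window
        var_ref = sum((v - mean_ref) ** 2 for v in self.reference) / self.window
        var_new = sum((v - mean_new) ** 2 for v in self.recent) / self.window
        std_err = math.sqrt((var_ref + var_new) / self.window) or 1e-12
        return abs(mean_new - mean_ref) / std_err > self.threshold

# Assumed toy stream: the mean of the data jumps from 0 to 2 at instance 1000.
rng = random.Random(0)
detector = MeanShiftDetector()
first_alarm = None
for t in range(2000):
    value = rng.gauss(0.0, 1.0) if t < 1000 else rng.gauss(2.0, 1.0)
    if detector.add(value) and first_alarm is None:
        first_alarm = t
print(first_alarm)  # index of the first instance at which drift was flagged
```

Feeding it the classifier's 0/1 errors instead of a raw feature gives the performance-based variant mentioned above.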
So once we've detected that a drift has happened,
we then want to take some action. The first thing we could do is a complete replacement of the model.
We get rid of whatever model we had before,
we
take a chunk of recent data,
and we retrain the model on that and continue using it for predictions until we hit another drift.
This is okay, but it means that we could be getting rid of some information in the previous model
that may still be useful in the future.
So then there are also methods that will look at specific parts of the model and say, okay, this specific part of it is
causing a performance decrease, so let's get rid of it. We can then
learn from new instances something to replace it that will do better, basically.
If you think of a decision tree,
if you can detect that there are certain branches in that decision tree that are no longer
making good predictions, you can get rid of them and regrow the tree to perform better. Prune it? Yeah, exactly.
It is called pruning. You prune the branches off the tree
that are no longer performing as you want them to. The alternative to explicit handling is implicit drift handling.
So rather than looking at the data or the performance and saying something has changed, we need to take action,
we're just continually taking action. There are various approaches to implicit drift handling.
The first and probably simplest one is to use a sliding window.
If we imagine we have the data stream with instances arriving like this,
we could say we have a sliding window of three instances and we learn a model off of them. We then
take the next three and learn a model off of them. So as each instance arrives we get rid of the oldest instance.
This makes the assumption that the oldest instances are the least relevant. This is usually the case;
it's a valid assumption to make, so this performs
okay.
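A sliding window of this kind could be sketched as below. The class name `SlidingWindowClassifier` and the nearest-class-mean model are assumptions chosen to keep the example short; any learner that can be retrained on the window contents would do.

```python
from collections import deque

class SlidingWindowClassifier:
    """Hypothetical learner rebuilt from only the newest `size` instances."""
    def __init__(self, size=3):
        self.window = deque(maxlen=size)  # the oldest instance falls out automatically
        self.means = {}

    def update(self, x, y):
        self.window.append((x, y))
        # Rebuild the model from scratch using only what is in the window.
        sums, counts = {}, {}
        for xi, yi in self.window:
            sums[yi] = sums.get(yi, 0.0) + xi
            counts[yi] = counts.get(yi, 0) + 1
        self.means = {c: sums[c] / counts[c] for c in sums}

    def predict(self, x):
        if not self.means:
            return None                   # nothing learnt yet
        return min(self.means, key=lambda c: abs(x - self.means[c]))

clf = SlidingWindowClassifier(size=3)
for x, y in [(0.1, "green"), (0.9, "red"), (0.2, "green"), (0.8, "red")]:
    clf.update(x, y)
print(list(clf.window))   # the first instance, (0.1, "green"), has been dropped
print(clf.predict(0.3))   # green
```

Note that the whole model is rebuilt on every update, which is exactly the cost that motivates single-pass methods.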
The problem with this, though, is that it provides a crisp cut-off point: every
instance within this window is treated as having exactly the same
impact on the classifier. They are weighted the same. So we can introduce instance weighting,
so that older instances will have a lower weight and their impact on the classifier will be less.
So again, the more recent instances will have the largest impact on the current model.
And then these algorithms that use instance weighting will usually have
some threshold, so once the weight gets below a certain point they say that's the instance gone,
we delete it. Presumably the windows can be larger or smaller?
Yes, setting the window size is a pretty important parameter.
If you have a window that is too large, then,
okay, you're getting a lot of data to construct your model from, which is good in a sense, because learning from more data is usually good.
But it also means that if there are very short-term drifts,
so a drift happens and then ends, we don't learn from that drift, if that makes sense, because we see it all as one
chunk of the data. Again,
if you set the window to be too small, we can react very well to very short-term drifts in the stream,
but you then have a very limited amount of data to work on to construct the model.
So there are methods that will automatically adjust the window size. During times of drift the window size will get smaller,
since we want to be changing the model very rapidly, and then during times when everything is very stable the
window will grow to be as large as possible so that we can
use as much data to construct this model as possible.
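The shrink-on-drift, grow-when-stable behaviour can be illustrated with a very simple rule. This is a made-up heuristic for illustration only, not one of the published adaptive-window algorithms; `adjust_window_size` and all of its parameters are hypothetical.

```python
def adjust_window_size(size, recent_error, baseline_error,
                       min_size=10, max_size=1000, tolerance=0.1):
    """Hypothetical rule: halve the window when the recent error rate is clearly
    worse than the baseline, otherwise let it grow by one instance."""
    if recent_error > baseline_error + tolerance:
        return max(min_size, size // 2)   # suspected drift: forget old data faster
    return min(max_size, size + 1)        # stable period: keep more data

print(adjust_window_size(200, recent_error=0.40, baseline_error=0.05))  # 100
print(adjust_window_size(200, recent_error=0.06, baseline_error=0.05))  # 201
```

The `min_size` and `max_size` bounds stop the window from collapsing to nothing during a long drift or growing without limit on a stable stream.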
The problem with sliding windows and instance weighting is that you need all of those instances available to construct the model
continuously. Every time you add a new instance and delete another one you need to reconstruct that model.
The way we can get around this is by using single-pass algorithms.
We see each instance once, use it to update the model, and then get rid of that instance.
It's probably still in long-term permanent storage, but in terms of what is being accessed to construct the model,
it's gone. So in that respect you've got the information out of the instance, but you don't need the instance itself? Yeah, exactly.
We see the instance, we incorporate what we can from it into the current model,
we get rid of it, and that instance's impact is still in the model. An example would be a decision tree.
Decision trees are constructed by splitting nodes where we're going to get a lot of information gain
from making a split on a certain attribute.
As the data stream changes, the information gain that we might get at some of these nodes may change.
So say we get a new instance and it says, okay,
now this actually makes this a split worth making.
We can make that split, continue growing the tree, and then that instance can go. We don't need it anymore,
but we still have the information from it in our model.
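An incremental decision tree is too long to show here, so the sketch below swaps in a much simpler single-pass technique to make the same point: a running mean and variance (Welford's algorithm). Each value is seen once, folded into the summary, and then discarded. The class name `RunningStats` is just a label for this example.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's algorithm)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one value into the summary; x is not stored anywhere."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)   # after this call the value itself is no longer needed
print(stats.mean, stats.variance)
```

The memory used stays constant however long the stream runs, which is the property a streaming decision tree relies on when it keeps per-node statistics instead of the instances themselves.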
So we've got our implicit and explicit drift handling approaches. You can also have hybrid approaches.
Explicit drift handling is very good at spotting sudden drift. Any time there's a sudden change,
there'll be a sudden drop in performance that's very easy to pick up on with a simple statistical test.
But when we then add in implicit drift handling on top of that,
it means that we can also deal very well with gradual drift.
Gradual drift is a bit more difficult to identify,
simply because if you look at the previous instance, or say the 10 previous instances,
with a gradual drift you're not going to see a significant change.
So it's a lot harder to detect. By combining the implicit and explicit
drift handling methods we end up with a performance plot that would look something like this:
we maintain pretty good performance for the entire duration of the data that's arriving. The problems of a changing data distribution
are not the only problems with streams.
If you can imagine a very high-volume,
high-speed stream, you've got a lot of data arriving in a very short amount of time. If
you take a single instance of that data stream and it takes you, say, five seconds to process it,
but in those five seconds you've had 10 more instances arrive, you're going to get a backlog of instances very, very quickly.
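To put numbers on that, here is a small back-of-the-envelope calculation using the figures just mentioned (10 arrivals every 5 seconds, 5 seconds of processing per instance). It assumes perfectly steady arrivals and a single worker, and `backlog_after` is a made-up helper name.

```python
def backlog_after(seconds, arrivals_per_second, seconds_per_instance):
    """Instances still waiting after `seconds`, assuming steady arrivals and
    one instance processed at a time."""
    arrived = arrivals_per_second * seconds
    processed = seconds / seconds_per_instance
    return max(0.0, arrived - processed)

# 10 arrivals per 5 seconds is 2 per second; each instance takes 5 seconds.
print(backlog_after(60, arrivals_per_second=2, seconds_per_instance=5))  # 108.0
```

After one minute, 120 instances have arrived but only 12 have been processed, so the queue grows by roughly 1.8 instances every second.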
So the model update stage needs to be very quick to avoid getting any backlog. The second problem is that with
these algorithms we're not going to have the entire history of the stream available
to create the current model.
So the models need to be,
for example, single-pass algorithms that can say: we don't need the historical data, because we have the information we need from it,
and we don't need to access those instances again.
Because otherwise you just end up with huge, huge datasets
having to be used to create these models all the time.
And again, these streams are potentially infinite.
We don't know when they're going to end and we don't know how much data they're going to end up containing.
Most of the well-known machine learning algorithms have been adapted in various ways to be suitable for streams,
so they now include update mechanisms and are more dynamic methods. This includes decision trees, neural networks and
k-nearest neighbours. Clustering algorithms have also been adapted. So basically for any classic algorithm you can think of, there are
multiple streaming versions of it now. If you are interested in these streaming algorithms,
there are a few bits of software that you could look at.
For example, there's the
MOA suite of algorithms, which interfaces with the WEKA data mining toolkit.
This is free to download and use and includes implementations of a lot of popular streaming algorithms. It also
includes ways to synthesize data streams, so to generate essentially a stream of data
that you can then run the algorithms on.
You can control the amount of drift that you get, how sudden it is, and things like that, and
that's quite good to play around with to see the effects that
different kinds of drift can have on accuracy. In terms of big data streams
specifically, there's software such as the Spark Streaming module for Apache Spark,
and
there's also the more recent Apache Flink. These are designed to process very high-volume data streams very quickly.
You just mentioned some software that people can download and have a play with, but in the real world, in industry and
the websites and services that we use every day,
who is using these streaming algorithms? A lot of the big companies, or most companies to be honest, will be generating data
constantly that they want to model. So for example,
Amazon recommendations, like what to watch next or what to buy next: they want to
understand changing patterns so that they can keep updating
whatever model they have to get the best
recommendations. Again,
optimizing which ads to suggest based on
whatever
search history you have, that's another thing that is being done via this. So yeah, there are a lot of real-world applications for this stuff.