
  • Kylie Ying has worked at many interesting places such as MIT, CERN, and Free Code Camp.

  • She's a physicist, engineer, and basically a genius. And now she's going to teach you

  • about machine learning in a way that is accessible to absolute beginners.

  • What's up you guys? So welcome to Machine Learning for Everyone. If you are someone who

  • is interested in machine learning and you consider yourself part of everyone, then this video

  • is for you. In this video, we'll talk about supervised and unsupervised learning models,

  • we'll go through maybe a little bit of the logic or math behind them, and then we'll also see how

  • we can program it on Google CoLab. If there are certain things that I have done, and you know,

  • you're somebody with more experience than me, please feel free to correct me in the comments

  • and we can all as a community learn from this together. So with that, let's just dive right in.

  • Without wasting any time, let's just dive straight into the code and I will be teaching you guys

  • concepts as we go. So this here is the UCI machine learning repository. And basically,

  • they just have a ton of data sets that we can access. And I found this really cool one called

  • the magic gamma telescope data set. So in this data set, if you want to read all this information,

  • to summarize what I think is going on: there's this gamma telescope, and we have all

  • these high energy particles hitting the telescope. Now there's a camera, there's a detector that

  • actually records certain patterns of you know, how this light hits the camera. And we can use

  • properties of those patterns in order to predict what type of particle caused that radiation. So

  • whether it was a gamma particle, or some other particle, like a hadron. Down here, these are all of

  • the attributes of those patterns that we collect in the camera. So you can see that there's, you

  • know, some length, width, size, asymmetry, etc. Now we're going to use all these properties to

  • help us discriminate the patterns and whether or not they came from a gamma particle or hadron.

  • So in order to do this, we're going to come up here, go to the data folder. And you're going

  • to click this magic04.data file, and we're going to download that. Now over here, I have a Colab

  • notebook open. So you go to colab dot research dot google.com, you start a new notebook. And

  • I'm just going to call this the magic data set. So actually, I'm going to call this FreeCodeCamp

  • magic example. Okay. So with that, I'm going to first start with some imports. So I will import,

  • you know, I always import NumPy, I always import pandas. And I always import matplotlib.

  • And then we'll import other things as we go. So yeah,

  • we run that in order to run the cell, you can either click this play button here, or you can

  • on my computer, it's just shift enter, and that will run the cell. And here, I'm just going

  • to, you know, let you guys know, okay, this is where I found the data set.

  • So I've copied and pasted this actually, but this is just where I found the data set.
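
A rough sketch of what that first import cell might look like (the data-source comment stands in for the link she pastes in):

```python
# imports used throughout the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# data source: UCI Machine Learning Repository, MAGIC Gamma Telescope data set
```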

  • And in order to import that downloaded file that we got from the computer, we're going to go

  • over here to this folder thing. And I am literally just going to drag and drop that file into here.

  • Okay. So in order to take a look at, you know, what does this file consist of,

  • do we have the labels? Do we not? I mean, we could open it on our computer, but we can also just do

  • pandas read CSV. And we can pass in the name of this file.

  • And let's see what it returns. So it doesn't seem like we have the label. So let's go back to here.

  • I'm just going to make the columns, the column labels, all of these attribute names over here.

  • So I'm just going to take these values and make that the column names.

  • All right, how do I do that? So basically, I will come back here, and I will create a list called

  • cols. And I will type in all of those things, with fSize, fConc. And we also have fConc1.

  • We have fAsym, fM3Long, fM3Trans, fAlpha. Let's see, we have fDist and class.

  • Okay, great. Now in order to label those as these columns down here in our data frame.

  • So basically, this command here just reads some CSV file that you pass in. CSV stands for comma

  • separated values, and turns that into a pandas data frame object. So now if I pass in a names here,

  • then it basically assigns these labels to the columns of this data set. So I'm going to set

  • this data frame equal to df. And then if we call head, that's just like, give me the first five things.

  • Now you'll see that we have labels for all of these. Okay.
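
A sketch of that read-in cell, assuming the uploaded file kept its default name, magic04.data; the full column list is the attribute list documented on the UCI page:

```python
cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
        "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

# read the comma-separated file and attach the column names
df = pd.read_csv("magic04.data", names=cols)
df.head()  # show the first five rows, now with labeled columns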

  • All right, great. So one thing that you might notice is that over here, the class labels,

  • we have G and H. So if I actually go down here, and I do data frame class unique,

  • you'll see that I have either G's or H's, and these stand for gammas or hadrons.

  • And our computer is not so good at understanding letters, right? Our computer is really good at

  • understanding numbers. So what we're going to do is we're going to convert this to zero for G and

  • one for H. So here, I'm going to set this equal to this, whether or not that equals G. And then

  • I'm just going to say as type int. So what this should do is convert this entire column,

  • if it equals G, then this is true. So I guess that would be one. And then if it's H, it would

  • be false. So that would be zero, but I'm just converting G and H to one and zero, it doesn't

  • really matter, like, if G is one and H is zero or vice versa.
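
A sketch of that conversion cell (whether the letters in the raw file are upper- or lowercase is an assumption here):

```python
# convert the class letters to numbers: "g" (gamma) -> 1, "h" (hadron) -> 0
df["class"] = (df["class"] == "g").astype(int)
```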

  • Let me just take a step back right now and talk about this data set. So here I have some data frame, and I have all of these different

  • values for each entry. Now, you know, each of these is one sample, it's one example,

  • it's one item in our data set, it's one data point, all of these things are kind of the same

  • thing when I mentioned, oh, this is one example, or this is one sample or whatever. Now, each of

  • these samples, they have, you know, one quality for each or one value for each of these labels

  • up here, and then it has the class. Now what we're going to do in this specific example is try to

  • predict for future, you know, samples, whether the class is G for gamma or H for hadron. And

  • that is something known as classification. Now, all of these up here, these are known as our features,

  • and features are just things that we're going to pass into our model in order to help us predict

  • the label, which in this case is the class column. So for you know, sample zero, I have

  • 10 different features. So I have 10 different values that I can pass into some model.

  • And I can spit out, you know, the class, the label, and I know the true label here is G. So this

  • is actually supervised learning. All right. So before I move on, let me just give you a quick

  • little crash course on what I just said. This is machine learning for everyone. Well, the first

  • question is, what is machine learning? Well, machine learning is a sub domain of computer science

  • that focuses on certain algorithms, which might help a computer learn from data, without a

  • programmer being there telling the computer exactly what to do. That's what we call explicit

  • programming. So you might have heard of AI and ML and data science, what is the difference between

  • all of these. So AI is artificial intelligence. And that's an area of computer science, where the

  • goal is to enable computers and machines to perform human like tasks and simulate human behavior.

  • Now machine learning is a subset of AI that tries to solve one specific problem and make predictions

  • using certain data. And data science is a field that attempts to find patterns and draw insights

  • from data. And that might mean we're using machine learning. So all of these fields kind of overlap,

  • and all of them might use machine learning. So there are a few types of machine learning.

  • The first one is supervised learning. And in supervised learning, we're using labeled inputs.

  • So this means whatever input we get, we have a corresponding output label, in order to train

  • models and to learn outputs of different new inputs that we might feed our model. So for example,

  • I might have these pictures, okay, to a computer, all these pictures are pixels, they're pixels

  • with a certain color. Now in supervised learning, all of these inputs have a label associated with

  • them, this is the output that we might want the computer to be able to predict. So for example,

  • over here, this picture is a cat, this picture is a dog, and this picture is a lizard.

  • Now there's also unsupervised learning. And in unsupervised learning, we use unlabeled data

  • to learn about patterns in the data. So here are my input data points. Again, they're just

  • images, they're just pixels. Well, okay, let's say I have a bunch of these different pictures.

  • And what I can do is I can feed all these to my computer. And I might not, you know,

  • my computer is not going to be able to say, Oh, this is a cat, dog and lizard in terms of,

  • you know, the output. But it might be able to cluster all these pictures, it might say,

  • Hey, all of these have something in common. All of these have something in common. And then these

  • down here have something in common, that's finding some sort of structure in our unlabeled data.

  • And finally, we have reinforcement learning. And in reinforcement learning, well, usually

  • there's an agent that is learning in some sort of interactive environment, based on rewards and

  • penalties. So let's think of a dog, we can train our dog, but there's not necessarily, you know,

  • any wrong or right output at any given moment, right? Well, let's pretend that dog is a computer.

  • Essentially, what we're doing is we're giving rewards to our computer, and telling our computer,

  • Hey, this is probably something good that you want to keep doing. Well, computer agent terminology.

  • But in this class today, we'll be focusing on supervised learning and unsupervised learning

  • and learning different models for each of those. Alright, so let's talk about supervised learning

  • first. So this is kind of what a machine learning model looks like you have a bunch of inputs

  • that are going into some model. And then the model is spitting out an output, which is our prediction.

  • So all these inputs, this is what we call the feature vector. Now there are different types

  • of features that we can have, we might have qualitative features. And qualitative means

  • categorical data, there's either a finite number of categories or groups. So one example of a

  • qualitative feature might be gender. And in this case, there's only two here, it's for the sake of

  • the example, I know this might be a little bit outdated. Here we have a girl and a boy, there are

  • two genders, there are two different categories. That's a piece of qualitative data. Another

  • example might be okay, we have, you know, a bunch of different nationalities, maybe a nationality or

  • a nation or a location, that might also be an example of categorical data. Now, in both of

  • these, there's no inherent order. It's not like, you know, we can rank the US as one, France as two, Japan as

  • three, etc. Right? There's not really any inherent order built into either of these categorical

  • data sets. That's why we call this nominal data. Now, for nominal data, the way that we want

  • to feed it into our computer is using something called one hot encoding. So let's say that, you

  • know, I have a data set, some of the items in our data, some of the inputs might be from the US,

  • some might be from India, then Canada, then France. Now, how do we get our computer to recognize that

  • we have to do something called one hot encoding. And basically, one hot encoding is saying, okay,

  • well, if it matches some category, make that a one. And if it doesn't just make that a zero.

  • So for example, if your input were from the US, you might have [1, 0, 0, 0]. India, you know,

  • would be [0, 1, 0, 0]. Canada, okay, well, the slot representing Canada is one, and then France, the slot representing

  • France is one. And then you can see that the rest are zeros. That's one hot encoding.
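
As a quick aside, pandas' get_dummies is one common way to do this kind of encoding; this is just an illustration, not code from the video:

```python
import pandas as pd

# one-hot encode a nominal feature: each row gets a 1 in the column for its
# category and 0s everywhere else
countries = pd.DataFrame({"country": ["US", "India", "Canada", "France"]})
one_hot = pd.get_dummies(countries["country"])
print(one_hot)
```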

  • Now, there are also a different type of qualitative feature. So here on the left,

  • there are different age groups, there's babies, toddlers, teenagers, young adults,

  • adults, and so on, right. And on the right hand side, we might have different ratings. So maybe

  • bad, not so good, mediocre, good, and then like, great. Now, these are known as ordinal pieces of

  • data, because they have some sort of inherent order, right? Like, being a toddler is a lot closer to

  • being a baby than being an elderly person, right? Or good is closer to great than it is to really

  • bad. So these have some sort of inherent ordering system. And so for these types of data sets,

  • we can actually just mark them from, you know, one to five, or we can just say, hey, for each of these,

  • let's give it a number. And this makes sense. Because, like, for example, the thing that I

  • just said, how good is closer to great than good is close to not good at all. Well, four is closer

  • to five than four is close to one. So this actually kind of makes sense, and it'll make sense for the

  • computer as well.
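
And a sketch of mapping those ordinal ratings onto one through five (again just an illustration, not code from the video):

```python
import pandas as pd

# ordinal categories have an order, so a simple integer mapping preserves it
rating_order = {"bad": 1, "not so good": 2, "mediocre": 3, "good": 4, "great": 5}
ratings = pd.Series(["good", "great", "mediocre"])
encoded = ratings.map(rating_order)
print(encoded.tolist())  # [4, 5, 3] -- closeness in value mirrors closeness in meaning
```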

  • Alright, there are also quantitative pieces of data, and quantitative pieces of data are numerical valued pieces of data. So this could be discrete, which means,

  • you know, they might be integers, or it could be continuous, which means all real numbers.

  • So for example, the length of something is a quantitative piece of data, it's a quantitative

  • feature, the temperature of something is a quantitative feature. And then maybe how many

  • Easter eggs I collected in my basket, this Easter egg hunt, that is an example of discrete quantitative

  • feature. Okay, so these are continuous. And this one over here is discrete. So those are the things

  • that go into our feature vector, those are our features that we're feeding this model, because

  • our computers are really, really good at understanding math, right at understanding numbers,

  • they're not so good at understanding things that humans might be able to understand.

  • Well, what are the types of predictions that our model can output? So in supervised learning,

  • there are some different tasks, there's one classification, and basically classification,

  • just saying, okay, predict discrete classes. And that might mean, you know, this is a hot dog,

  • this is a pizza, and this is ice cream. Okay, so there are three distinct classes and any other

  • pictures of hot dogs, pizza or ice cream, I can put under these labels: hot dog, pizza, ice cream.

  • This is something known as multi class classification. But there's also

  • binary classification. And binary classification, you might have hot dog, or not hot dog. So there's

  • only two categories that you're working with: something that is the thing, and something that

  • isn't. That's binary classification. Okay, so yeah, other examples. So if something has positive or negative

  • sentiment, that's binary classification. Maybe you're predicting your pictures of their cats or

  • dogs. That's binary classification. Maybe, you know, you are writing an email filter, and you're

  • trying to figure out if an email is spam or not spam. So that's also binary classification.

  • Now for multi class classification, you might have, you know, cat, dog, lizard, dolphin, shark,

  • rabbit, etc. We might have different types of fruits like orange, apple, pear, etc. And then

  • maybe different plant species. But multi class classification just means more than two. Okay,

  • and binary means we're predicting between two things. There's also something called regression

  • when we talk about supervised learning. And this just means we're trying to predict continuous

  • values. So instead of just trying to predict different categories, we're trying to come up

  • with a number that you know, is on some sort of scale. So some examples. So some examples might

  • be the price of Ethereum tomorrow, or it might be, okay, what is going to be the temperature?

  • Or it might be what is the price of this house? Right? So these things don't really fit into

  • discrete classes. We're trying to predict a number that's as close to the true value as possible

  • using different features of our data set. So that's exactly what our model looks like in

  • supervised learning. Now let's talk about the model itself. How do we make this model learn?

  • Or how can we tell whether or not it's even learning? So before we talk about the models,

  • let's talk about how can we actually like evaluate these models? Or how can we tell

  • whether something is a good model or bad model? So let's take a look at this data set. So this data

  • set has this is from a diabetes, a Pima Indian diabetes data set. And here we have different

  • number of pregnancies, different glucose levels, blood pressure, skin thickness, insulin, BMI,

  • age, and then the outcome whether or not they have diabetes one for they do zero for they don't.

  • So here, all of these are quantitative features, right, because they're all on some scale.

  • So each row is a different sample in the data. So it's a different example, it's one person's data,

  • and each row represents one person in this data set. Now this column, each column represents a

  • different feature. So this one here is some measure of blood pressure levels. And this one

  • over here, as we mentioned is the output label. So this one is whether or not they have diabetes.

  • And as I mentioned, this is what we would call a feature vector, because these are all of our

  • features in one sample. And this is what's known as the target, or the output for that feature

  • vector. That's what we're trying to predict. And all of these together is our features matrix x.

  • And over here, this is our labels or targets vector y. So I've condensed this to a chocolate

  • bar to kind of talk about some of the other concepts in machine learning. So over here,

  • we have our x, our features matrix, and over here, this is our label y. So each row of this

  • will be fed into our model, right. And our model will make some sort of prediction. And what we do

  • is we compare that prediction to the actual value of y that we have in our label data set, because

  • that's the whole point of supervised learning is we can compare what our model is outputting to,

  • oh, what is the truth, actually, and then we can go back and we can adjust some things. So the next

  • iteration, we get closer to what the true value is. So that whole process here, the tinkering that,

  • okay, what's the difference? Where did we go wrong? That's what's known as training the model.

  • Alright, so take this whole, you know, chunk right here, do we want to really put our entire

  • chocolate bar into the model to train our model? Not really, right? Because if we did that, then

  • how do we know that our model can do well on new data that we haven't seen? Like, if I were to

  • create a model to predict whether or not someone has diabetes, let's say that I just train all my

  • data, and I see that all my training data does well, I go to some hospital, I'm like, here's my

  • model. I think you can use this to predict if somebody has diabetes. Do we think that would

  • be effective or not? Probably not, right? Because we haven't assessed how well our model can

  • generalize. Okay, it might do well after you know, our model has seen this data over and over and

  • over again. But what about new data? Can our model handle new data? Well, how do we how do we get our

  • model to assess that? So we actually break up our whole data set that we have into three different

  • types of data sets, we call it the training data set, the validation data set and the testing data

  • set. And you know, you might have 60% here, 20%, and 20%, or 80, 10, and 10. It really depends on how

  • much data you have; I think either of those would be acceptable. So what we do is then we feed

  • the training data set into our model, we come up with, you know, this might be a vector of predictions

  • corresponding with each sample that we put into our model, we figure out, okay, what's the difference

  • between our prediction and the true values, this is something known as loss. Loss is, you know,

  • what's the difference here, in some numerical quantity, of course. And then we make adjustments,

  • and that's what we call training. Okay. So then, once you know, we've made a bunch of adjustments,

  • we can put our validation set through this model. And the validation set is kind of used as a reality

  • check during or after training to ensure that the model can handle unseen data still. So every

  • single time after we train one iteration, we might stick the validation set in and see, hey, what's

  • the loss there. And then after our training is over, we can assess the validation set and ask,

  • hey, what's the loss there. But one key difference here is that we don't have that training step,

  • this loss never gets fed back into the model, right, that feedback loop is not closed.

  • Alright, so let's talk about loss really quickly. So here, I have four different types of models,

  • I have some sort of data that's being fed into the model, and then some output. Okay, so this output

  • here is pretty far from you know, this truth that we want. And so this loss is going to be high. In

  • model B, again, this is pretty far from what we want. So this loss is also going to be high,

  • let's give it 1.5. Now this one here, it's pretty close, I mean, maybe not exact, but pretty close

  • to this one. So that might have a loss of 0.5. And then this one here is maybe further than this,

  • but still better than these two. So that loss might be 0.9. Okay, so which of these model

  • performs the best? Well, model C has the smallest loss, so it's probably model C. Okay, now let's

  • take model C. After you know, we've come up with these, all these models, and we've seen, okay, model

  • C is probably the best model. We take model C, and we run our test set through this model. And this

  • test set is used as a final check to see how generalizable that chosen model is. So if I,

  • you know, finish training my diabetes data set, then I could run it through some chunk of the

  • data and I can say, oh, like, this is how we perform on data that it's never seen before at

  • any point during the training process. Okay. And that loss, that's the final reported performance

  • of my test set, or this would be the final reported performance of my model. Okay.

  • So let's talk about this thing called loss, because I think I kind of just glossed over it,

  • right? So loss is the difference between your prediction and the actual, like, label.

  • So this would give a slightly higher loss than this. And this would even give a higher loss,

  • because it's even more off. In computer science, we like formulas, right? We like formulaic ways

  • of describing things. So here are some examples of loss functions and how we can actually come

  • up with numbers. This here is known as L one loss. And basically, L one loss just takes the

  • absolute value of whatever your you know, real value is, whatever the real output label is,

  • subtracts the predicted value, and takes the absolute value of that. Okay. So the absolute

  • value is a function that looks something like this. So the further off you are, the greater your losses,

  • right in either direction. So if your real value is off from your predicted value by 10,

  • then your loss for that point would be 10. And then this sum here just means, hey,

  • we're taking all the points in our data set. And we're trying to figure out the sum of how far

  • everything is. Now, we also have something called L two loss. So this loss function is quadratic,

  • which means that if it's close, the penalty is very minimal. And if it's off by a lot,

  • then the penalty is much, much higher. Okay. And here, instead of the absolute value, we just square

  • the difference between the two. Now, there's also something called binary cross entropy loss.

  • It looks something like this. And this is for binary classification; this might be the

  • loss that we use. So this loss, you know, I'm not going to really go through it too much.
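
For reference, a rough NumPy sketch of the three loss functions she just described (these are the formulas written out, not cells from the video):

```python
import numpy as np

def l1_loss(y_real, y_pred):
    # sum of absolute differences
    return np.sum(np.abs(y_real - y_pred))

def l2_loss(y_real, y_pred):
    # sum of squared differences: small errors barely count, big errors count a lot
    return np.sum((y_real - y_pred) ** 2)

def binary_cross_entropy(y_real, y_pred, eps=1e-12):
    # y_real is 0 or 1, y_pred is a predicted probability between 0 and 1
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_real * np.log(y_pred) + (1 - y_real) * np.log(1 - y_pred))
```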

  • But you just need to know that loss decreases as the performance gets better. So there are some

  • other measures of accuracy or performance as well. So for example, accuracy. What is accuracy?

  • So let's say that these are pictures that I'm feeding my model, okay. And these predictions

  • might be apple, orange, orange, apple, okay, but the actual is apple, orange, apple, apple. So

  • three of them were correct. And one of them was incorrect. So the accuracy of this model is

  • three quarters or 75%. Alright, coming back to our colab notebook, I'm going to close this a little

  • bit. Again, we've imported stuff up here. And we've already created our data frame right here. And

  • this is this is all of our data. This is what we're going to use to train our models. So down here,

  • again, if we now take a look at our data set, you'll see that our classes are now zeros and ones.

  • So now this is all numerical, which is good, because our computer can now understand that.

  • Okay. And you know, it would probably be a good idea to maybe kind of plot, hey, do these things

  • have anything to do with the class. So here, I'm going to go through all the labels. So for label

  • in the columns of this data frame. So this just gets me the list. Actually, we have the list,

  • right? It's called cols, so let's just use that, it might be less confusing, of everything up to the last

  • thing, which is the class. So I'm going to take all these 10 different features. And I'm going

  • to plot them as a histogram. So basically, if I

  • take that data frame, and I say, okay, for everything where the class is equal to one, so these are all

  • of our gammas, remember, now, for that portion of the data frame, if I look at this label, so now

  • these, okay, what this part here is saying is, inside the data frame, get me everything where

  • the class is equal to one. So that's all all of these would fit into that category, right?

  • And now let's just look at the label column. So the first label would be f length, which would

  • be this column. So this command here is getting me all the different values that belong to class one

  • for this specific label. And that's exactly what I'm going to put into the histogram. And now I'm

  • just going to tell, you know, matplotlib, make the color blue, label this as, you know, gamma,

  • set alpha, why do I keep doing that, alpha equal to 0.7. So that's just like the transparency.

  • And then I'm going to set density equal to true, so that when we compare it to

  • the hadrons here, we'll have a baseline for comparing them. Okay, so the density being true

  • just basically normalizes these distributions. So you know, if you have 200 of one type,

  • and then 50 of another type, well, if you drew the histograms, it would be hard to compare because

  • one of them would be a lot bigger than the other, right. But by normalizing them, we kind of are

  • distributing them over how many samples there are. Alright, and then I'm just going to put a title

  • on here and make that the label, the y label. So because it's density, the y label is probability.

  • And the x label is just going to be the label.

  • What is going on? And I'm going to include a legend, and plt.show() just means, okay, display

  • the plot. So if I run that... oops, this should just be up to the last item. So we want a list, right, not just the last item.
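
Putting that loop together, a sketch might look like this (the exact styling may differ from the notebook in the video):

```python
for label in cols[:-1]:  # every feature column, skipping "class" at the end
    # class 1 = gamma, class 0 = hadron; density=True normalizes both histograms
    plt.hist(df[df["class"] == 1][label], color="blue", label="gamma",
             alpha=0.7, density=True)
    plt.hist(df[df["class"] == 0][label], color="red", label="hadron",
             alpha=0.7, density=True)
    plt.title(label)
    plt.ylabel("Probability")
    plt.xlabel(label)
    plt.legend()
    plt.show()
```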

  • And now we can see that we're plotting all of these. So here we have the length. Oh, and I

  • made this gamma. So this should be hadron. Okay, so the gammas in blue, the hadrons are in red. So

  • here we can already see that, you know, maybe if the length is smaller, it's probably more likely

  • to be gamma, right. And we can kind of you know, these all look somewhat similar. But here, okay,

  • clearly, if there's more asymmetry, or if you know, this asymmetry measure is larger, then it's

  • probably hadron. Okay, oh, this one's a good one. So f alpha seems like hadrons are pretty evenly

  • distributed. Whereas if this is smaller, it looks like there's more gammas in that area.

  • Okay, so this is kind of what the data that we're working with, we can kind of see what's going on.

  • Okay, so the next thing that we're going to do here is we are going to create our train,

  • our validation, and our test data sets. I'm going to set train valid and test to be equal to

  • this. So NumPy dot split, I'm just splitting up the data frame. And if I do this sample,

  • where I'm sampling everything, this will basically shuffle my data. Now, I want to pass in where

  • exactly I'm splitting my data set, so the first split is going to be maybe at 60%. So I'm going

  • to say 0.6 times the length of this data frame, and then cast that to an integer. That's going

  • to be the first place where you know, I cut it off, and that'll be my training data. Now, if I

  • then go to 0.8, this basically means everything between 60% and 80% of the length of the data

  • set will go towards validation. And then everything from 80 to 100 will go towards

  • my test data. So I can run that.
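
A sketch of that split cell (sampling with frac=1 shuffles the rows before the cuts at 60% and 80%):

```python
# np.split cuts the shuffled data frame at the 60% and 80% marks,
# giving a 60/20/20 train/validation/test split
train, valid, test = np.split(df.sample(frac=1),
                              [int(0.6 * len(df)), int(0.8 * len(df))])
```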

  • And now, if we go up here and we inspect this data, we'll see that these columns seem to have values

  • in, like, the hundreds, whereas this one is 0.03, right? So the scale of

  • all these numbers is way off. And sometimes that will affect our results. So one thing that we would want to do

  • is scale these so that they are, you know, so that it's now relative to maybe the mean and the

  • standard deviation of that specific column. I'm going to create a function called scale data set.

  • And I'm going to pass in the data frame. And that's what I'll do for now. Okay, so the x values are

  • going to be, you know, I take the data frame. And let's assume that the columns are going to be,

  • you know, that the label will always be the last thing in the data frame. So what I can do is say

  • data frame, dot columns all the way up to the last item, and get those values. Now for my y,

  • well, it's the last column. So I can just do this, I can just index into that last column,

  • and then get those values. Now, I'm actually going to import something known as

  • the StandardScaler from scikit-learn. So if I come up here, I can go to sklearn dot preprocessing,

  • and I'm going to import StandardScaler. I have to run that cell, and I'm going to come back down here.

  • And now I'm going to create a scaler, and that scaler is the StandardScaler.

  • And with the scaler, what I can do is actually just fit and transform x. So here, I can say x

  • is equal to scaler dot fit_transform of x. So what that's doing is saying, okay, take x and

  • fit the standard scaler to x, and then transform all those values. And that's

  • going to be our new x. Alright. And then I'm also going to just create, you know, the whole data as

  • one huge 2d NumPy array. And in order to do that, I'm going to call H stack. So H stack is saying,

  • okay, take an array, and another array and horizontally stack them together. That's what

  • the H stands for. So horizontally stacking them together just means, like, put them side by side,

  • okay, not on top of each other. So what am I stacking? Well, I have to pass in something

  • so that it can stack x and y. And now, okay, so NumPy is very particular about dimensions,

  • right? So in this specific case, our x is a two dimensional object, but y is only a one dimensional

  • thing, it's only a vector of values. So in order to now reshape it into a 2d item, we have to call

  • NumPy dot reshape. And we can pass in the dimensions of its reshape. So if I pass in negative

  • one comma one, that just means okay, make this a 2d array, where the negative one just means infer

  • what this dimension value would be, which ends up being the length of y. This would be the

  • same as literally doing this. But the negative one is easier because we're making the computer

  • do the hard work. So if I stack that, I'm going to then return the data x and y. Okay. So one more

  • thing is that if we go into our training data set, okay, again, this is our training data set.

  • And we get the length of the training data set, but where the training data set's class is one,

  • so remember that this is the gammas. And then if we print that, and we do the same thing, but zero,

  • we'll see that, you know, there's around 7000 of the gammas, but only around 4000 of the hadrons.

  • So that might actually become an issue. And instead, what we want to do is we want to oversample

  • our training data set. So that means that we want to increase the number of these values,

  • so that these kind of match better. And surprise, surprise, there is something that we can import

  • that will help us do that. So I'm going to go to from imblearn dot over_sampling. And I'm

  • going to import this random oversampler, run that cell, and come back down here. So I will actually

  • add in this parameter called oversample, and set that to false by default. So

  • if I do want to oversample,

  • then I'm going to create this ROS and set it equal to this random oversampler. And then for x and y,

  • I'm just going to say, okay, just fit and resample x and y. And what that's doing is saying, okay,

  • take more of the smaller class. So take the smaller class and keep sampling from there to increase

  • the size of our data set of that smaller class so that they now match.
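
Pulling those pieces together, the scale_dataset function described over the last several cells might look roughly like this (details may differ slightly from the notebook in the video):

```python
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

def scale_dataset(dataframe, oversample=False):
    X = dataframe[dataframe.columns[:-1]].values  # all feature columns
    y = dataframe[dataframe.columns[-1]].values   # the last column is the class

    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    if oversample:
        # resample the smaller class until both classes have the same count
        ros = RandomOverSampler()
        X, y = ros.fit_resample(X, y)

    # put the features and labels back side by side as one 2D array
    data = np.hstack((X, np.reshape(y, (-1, 1))))
    return data, X, y

# the validation and test sets get the same treatment later, with oversample=False
train, X_train, y_train = scale_dataset(train, oversample=True)
```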

  • So if I do this, and I call scale_dataset, and I pass in the training data set where oversample is true. So let's say this

  • is train, and then X_train, y_train. Oops, what's going on? These should be columns. So basically,

  • what I'm doing now is I'm just saying, okay, what is the length of y train? Okay, now it's

  • 14,800, whatever. And now let's take a look at how many of these are type one. So actually,

  • we can just sum that up. And then we'll also see that if we instead switch the label and ask how

  • many of them are the other type, it's the same value. So now these have been evenly, you know,

  • rebalanced. Okay, well, okay. So here, I'm just going to make this the validation data set. And

  • then the next one, I'm going to make this the test data set. Alright, and we're actually going to

  • switch oversample here to false. Now, the reason why I'm switching that to false is because my

  • validation and my test sets are for the purpose of you know, if I have data that I haven't seen yet,

  • how does my model perform on those? And I don't want to oversample for that right now. Like,

  • I don't care about balancing those I'm, I want to know if I have a random set of data that's

  • unlabeled, can I trust my model, right? So that's why I'm not oversampling. I run that. And again,

  • what is going on? Oh, it's because we already have this train. So I have to come up here and split

  • that data frame again. And now let's run these. Okay. So now we have our data properly formatted.

  • And we're going to move on to different models now. And I'm going to tell you guys a little bit

  • about each of these models. And then I'm going to show you how we can do that in our code. So the

  • first model that we're going to learn about is KNN or K nearest neighbors. Okay, so here, I've

  • already drawn a plot on the y axis, I have the number of kids that a family might have. And then

  • on the x axis, I have their income in terms of thousands per year. So, you know, if someone's

  • making 40,000 a year, that's where this would be. And if somebody's making 320,000, that's where that

  • would be somebody has zero kids, it'd be somewhere along this axis. Somebody has five, it'd be

  • somewhere over here. Okay. And now I have these plus signs and these minus signs on here. So what

  • I'm going to represent here is the plus sign means that they own a car. And the minus sign is going

  • to represent no car. Okay. So your initial thought should be okay, I think this is binary

  • classification because all of our points all of our samples have labels. So this is a sample with

  • the plus label. And this here is another sample with the minus label. This is an abbreviation for

  • "with" that I'll use. Alright, so we have this entire data set. And maybe around half the people

  • own a car and maybe around half the people don't own a car. Okay, well, what if I had some new

  • point, let me choose a different color, I'll use this nice green. Well, what if I have a new

  • point over here? So let's say that somebody makes 40,000 a year and has two kids. What do we think

  • that would be? Well, just logically looking at this plot, you might think, okay, it seems like

  • they wouldn't have a car, right? Because that kind of matches the pattern of everybody else around

  • them. So that's a whole concept of this nearest neighbors is you look at, okay, what's around you.

  • And then you're basically like, okay, I'm going to take the label of the majority that's around me.

  • So the first thing that we have to do is we have to define a distance function. And a lot of times

  • in, you know, 2d plots like this, our distance function is something known as Euclidean distance.

  • And Euclidean distance is basically just this straight line distance like this. Okay. So this

  • would be the Euclidean distance, it seems like there's this point, there's this point, there's

  • that point, etc. So the length of this line, this green line that I just drew, that is what's known

  • as Euclidean distance. If we want to get technical with that, the exact formula is this: the distance,

  • let me zoom in, is equal to the square root of (one point's x minus the other point's x) squared,

  • plus, under that same square root, the same thing for y, so (the first point's y minus the other

  • point's y) squared. Okay, so we're basically taking the differences

  • between the x's and the y's, squaring each of those, summing them up, and taking the square root. Okay.
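
Written as a formula, and extended to n features as she describes a bit later:

$$ d(p, q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}, \qquad d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \ \text{ in } n \text{ dimensions.} $$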

  • So I'm going to erase this so it doesn't clutter my drawing. But anyways, now going back to this plot,

  • so here in the nearest neighbor algorithm, we see that there is a K, right? And this K is basically

  • telling us, okay, how many neighbors do we use in order to judge what the label is? So usually,

  • we use a K of maybe, you know, three or five, depends on how big our data set is. But here,

  • I would say, maybe a logical number would be three or five. So let's say that we take K to be equal

  • to three. Okay, well, of this data point that I drew over here, let me use green to highlight this.

  • Okay, so of this data point that I drew over here, it looks like the three closest points are definitely

  • this one, this one. And then this one has a length of four. And this one seems like it'd be a little

  • bit further than four. So actually, these would be our three points. Well, all those

  • points are blue. So chances are, my prediction for this point is going to be blue, it's going to be

  • probably don't have a car. All right, now what if my point is somewhere? What if my point is

  • somewhere over here, let's say that a couple has four kids, and they make 240,000 a year. All right,

  • well, now my closest points are this one, probably a little bit over that one. And then this one,

  • right? Okay, still all pluses. Well, this one is more than likely to be plus. Right? Now,

  • let me get rid of some of these just so that it looks a little bit more clear. All right,

  • let's go through one more. What about a point that might be right here? Okay, let's see. Well,

  • definitely this is the closest, right? This one's also closest. And then it's really close between

  • the two of these. But if we actually do the mathematics, it seems like if we zoom in,

  • this one is right here. And this one is in between these two. So this one here is actually shorter

  • than this one. And that means that that top one is the one that we're going to take. Now,

  • what is the majority of the points that are close by? Well, we have one plus here, we have one plus

  • here, and we have one minus here, which means that the pluses are the majority. And that means

  • that this label is probably somebody with a car. Okay. So this is how K nearest neighbors would

  • work. It's that simple. And this can be extrapolated to further dimensions to higher dimensions. You

  • know, if you have here, we have two different features, we have the income, and then we have

  • the number of kids. But let's say we have 10 different features, we can expand our distance

  • function so that it includes all 10 of those dimensions, we take the square root of everything,

  • and then we figure out which one is the closest to the point that we desire to classify. Okay. So

  • that's K nearest neighbors. So now we've learned about K nearest neighbors. Let's see how we would

  • be able to do that within our code. So here, I'm going to label the section K nearest neighbors.

  • And we're actually going to use a package from SK learn. So the reason why we, you know, use these

  • packages and so that we don't have to manually code all these things ourselves, because it would

  • be really difficult. And chances are the way that we would code it, either would have bugs,

  • or it'd be really slow, or I don't know a whole bunch of issues. So what we're going to do is

  • hand it off to the pros. From here, I can say, okay, from SK learn, which is this package dot

  • neighbors, I'm going to import K neighbors classifier, because we're classifying. Okay,

  • so I run that. And our KNN model is going to be this K neighbors classifier. And we can pass in

  • a parameter of how many neighbors, you know, we want to use. So first, let's see what happens if

  • we just use one. So now if I do knn model dot fit, I can pass in my x training set and my

  • y train data. Okay. So that effectively fits this model. And let's get all the predictions. So

  • y pred, I guess, yeah, let's do y predictions. And my y predictions are going to be knn model

  • dot predict. So let's use the test set, x test. Okay. Alright, so if I call y pred, you'll see

  • that we have those. But if I get my truth values for that test set, you'll see that this is what

  • we actually have. So just looking at this, we got five out of six of them. Okay, great. So let's

  • actually take a look at something called the classification report that's offered by SK learn.

  • So if I go to from SK learn dot metrics, import classification report, what I can actually do is

  • say, hey, print out this classification report for me. And let's check, you know, I'm giving you the

  • y test and the y predictions. We run this, and we see we get this whole entire chart.
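
Collecting those cells, a sketch of this KNN section might look like this:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

knn_model = KNeighborsClassifier(n_neighbors=1)  # later she tries 3 and 5
knn_model.fit(X_train, y_train)

y_pred = knn_model.predict(X_test)
print(classification_report(y_test, y_pred))
```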

  • So I'm going to tell you guys a few things about this chart. Alright, this accuracy is 82%, which is actually

  • pretty good. That's just saying, hey, if we just look at, you know, what each of these new points,

  • what it's closest to, then we actually get an 82% accuracy, which means how many do we get right

  • versus how many total are there. Now, precision is saying, okay, you might see that we have it

  • for class one, or class zero and class one. What precision is saying was, let's go to this Wikipedia

  • diagram over here, because I actually kind of like this diagram. So here, this is our entire data set.

  • And on the left over here, we have everything that we know is positive. So everything that is

  • actually truly positive, that we've labeled positive in our original data set. And over here,

  • this is everything that's truly negative. Now in the circle, we have things that are positive that

  • were labeled positive by our model. On the left here, we have things that are truly positive,

  • because you know, this side is the positive side and the side is the negative side. So these are

  • truly positive. Whereas all these ones out here, well, they should have been positive, but they

  • are labeled as negative. And in here, these are the ones that we've labeled positive, but they're

  • actually negative. And out here, these are truly negative. So precision is saying, okay, out of all

  • the ones we've labeled as positive, how many of them are true positives? And recall is saying,

  • okay, out of all the ones that we know are truly positive, how many do we actually get right? Okay,

  • so going back to this over here, our precision score, so again, precision, out of all the ones

  • that we've labeled as the specific class, how many of them are actually that class: it's 77% and 84%. Now,

  • recall: out of all the ones that are actually this class, how many of them did we get? This

  • is 68% and 89%. Alright, so not too shabby, we can clearly see that this recall and precision for

  • like this, the class zero is worse than class one, right? So that means it's worse for

  • hadrons than for our gammas. This f1 score over here is kind of a combination of the precision and

  • recall score. So we're actually going to mostly look at this one because we have an unbalanced

  • test data set. So here we have a measure of 72 and 87 or point seven two and point eight seven,

  • which is not too shabby. All right. Well, what if we, you know, made this three. So we actually see

  • that, okay, so what was it originally with one? We saw that our f1 score was

  • 0.72 and then 0.87, and then our accuracy was 82%. So if I change that to

  • three. Alright, so we've kind of increased zero at the cost of one and then our overall accuracy

  • is 81. So let's actually just make this five. Alright, so you know, again, very similar numbers,

  • we have 82% accuracy, which is pretty decent for a model that's relatively simple. Okay,

  • the next type of model that we're going to talk about is something known as naive Bayes. Now,

  • in order to understand the concepts behind naive Bayes, we have to be able to understand

  • conditional probability and Bayes rule. So let's say I have some sort of data set that's shown in

  • this table right here. People who have COVID are over here in this red row. And people who do not

  • have COVID are down here in this green row. Now, what about the COVID test? Well, people who have

  • tested positive are over here in this column. And people who have tested negative are over here in

  • this column. Okay. Yeah, so basically, our categories are people who have COVID and test positive,

  • people who don't have COVID, but test positive, so a false positive, people who have COVID

  • and test negative, which is a false negative, and people who don't have COVID and test negative,

  • which good means you don't have COVID. Okay, so let's make this slightly more legible. And here,

  • in the margins, I've written down the sums of whatever it's referring to. So this here is the

  • sum of this entire row. And this here might be the sum of this column over here. Okay. So the first

  • question that I have is, what is the probability of having COVID given that you have a positive

  • test? And in probability, we write that out like this. So the probability of COVID given, so this

  • line, that vertical line means given that, you know, some condition, so given a positive test,

  • okay, so what is the probability of having COVID given a positive test? So what this is asking is

  • saying, okay, let's go into this condition. So the condition of having a positive test, that is this

  • slice of the data, right? That means if you're in this slice of data, you have a positive test. So

  • given that we have a positive test, given in this condition, in this circumstance, we have a positive

  • test. So what's the probability that we have COVID? Well, if we're just using this data, the number

  • of people that have COVID and a positive test is 531. So I'm gonna say that there's 531 such people. And then

  • now we divide that by the total number of people that have a positive test, which is 551. Okay,

  • so that's the probability and doing a quick division, we get that this is equal to around

  • 96.4%. So according to this data set, which is data that I made up off the top of my head, so it's

  • not actually real COVID data. But according to this data, the probability of having COVID given

  • that you tested positive is 96.4%.
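
As a worked fraction, using the counts from that (made-up) table:

$$ P(\text{COVID} \mid +) = \frac{531}{551} \approx 0.964 $$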

  • Alright, now with that, let's talk about Bayes rule, which is this section here. Let's ignore this bottom part for now. So Bayes rule is asking, okay, what is

  • the probability of some event A happening, given that B happened. So this, we already know has

  • happened. This is our condition, right? Well, what if we don't have data for that, right? Like, what

  • if we don't know what the probability of A given B is? Well, Bayes rule is saying, okay, well, you

  • can actually go and calculate it, as long as you have a probability of B given A, the probability

  • of A and the probability of B. Okay. And this is just a mathematical formula for that.
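
That formula is:

$$ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} $$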

  • Alright, so here we have Bayes rule. And let's actually see Bayes rule in action. Let's use it on an example.

  • So here, let's say that we have some disease statistics, okay. So not COVID different disease.

  • And we know that the probability of obtaining a false positive is 0.05 probability of obtaining a

  • false negative is 0.01. And the probability of the disease is 0.1. Okay, what is the probability of

  • the disease given that we got a positive test? Hmm, how do we even go about solving this? So

  • what what do I mean by false positive? What's a different way to rewrite that? A false positive

  • is when you test positive, but you don't actually have the disease. So this here is a probability

  • that you have a positive test given no disease, right? And similarly for the false negative,

  • it's a probability that you test negative given that you actually have the disease. So if I put

  • that into a chart, for example, and this might be my positive and negative tests, and this might

  • be my diseases, disease and no disease. Well, the probability that I test positive, but actually

  • have no disease, okay, that's 0.05 over here. And then the false negative is up here at 0.01. So I'm

  • testing negative, but I actually do have the disease. So the probability that you test

  • positive, and you don't have the disease, plus a probability that you test negative, given that you

  • don't have the disease, that should sum up to one. Okay, because if you don't have the disease,

  • then you should have some probability that you're testing positive and some probability that you're

  • testing negative. But that probability, in total should be one. So that means that the probability

  • of negative given no disease, this should be the complement, the opposite. So it

  • should be 0.95 because it's one minus whatever this probability is. And then similarly, oops,

  • up here, this should be 0.99 because the probability that we, you know,

  • test negative given that we have the disease, plus the probability that we test positive given that we have the

  • disease, should equal one. So this is our probability chart. And now, this probability of disease

  • being 0.1 just means I have a 10% probability of actually having the disease, right? Like,

  • in the general population, the probability that I have the disease is 0.1. Okay, so what is the

  • probability that I have the disease given that I got a positive test? Well, remember that we

  • can write this out in terms of Bayes rule, right? So if I use this rule up here, this is the

  • probability of a positive test given that I have the disease times the probability of the disease

  • divided by the probability of the evidence, which is my positive test.

  • Alright, now let's plug in some numbers for that. The probability of having a positive test given

  • that I have the disease is 0.99. And then the probability that I have the disease is this value

  • over here 0.1. Okay. And then the probability that I have a positive test at all should be okay,

  • what is the probability that I have a positive test given that I actually have the disease

  • times the probability of having the disease. And then the other case, where the probability of me having a

  • negative test given, or sorry, positive test given no disease, times the probability of not actually

  • having a disease. Okay, so I can expand that probability of having a positive test out into

  • these two different cases, I have a disease, and then I don't. And then what's the probability of

  • having positive tests in either one of those cases. So the denominator would become 0.99 times 0.1,

  • plus 0.05, so that's the probability that I'm testing positive but don't have the disease,

  • times the probability that I don't actually have the disease. So that's one minus

  • 0.1; the probability that the population doesn't have the disease is 90%, so 0.9. And let's do that

  • arithmetic. And I get an answer of 0.6875, or 68.75%. Okay.
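
Written out in full:

$$ P(\text{disease} \mid +) = \frac{0.99 \times 0.1}{0.99 \times 0.1 + 0.05 \times 0.9} = \frac{0.099}{0.144} = 0.6875 $$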

  • Alright, so we can actually expand Bayes rule and apply it to classification. And this is what we call naive

  • Bayes. So first, a little terminology. So the posterior is this over here, because it's asking,

  • Hey, what is the probability of some class CK? So by CK, I just mean, you know, the different

  • categories, so C for category or class or whatever. So category one might be cats, category two,

  • dogs, category three, lizards, all the way, we have k categories, k is just some number. Okay.

  • So what is the probability of having of this specific sample x, so this is our feature vector

  • of this one sample. What is the probability of x fitting into category 123 for whatever, right,

  • so that that's what this is asking, what is the probability that, you know, it's actually from

  • this class, given all this evidence that we see the x's. So the likelihood is this quantity over

  • here, it's saying, Okay, well, given that, you know, assume, assume we are, assume that this

  • class is class CK, okay, assume that this is a category. Well, what is the likelihood of

  • actually seeing x, all these different features from that category. And then this here is the

  • prior. So like in the entire population of things, what are the probabilities? What is the

  • probability of this class in general? Like if I have, you know, in my entire data set, what is the

  • percentage? What is the chance that this image is a cat? How many cats do I have? Right. And then this

  • down here is called the evidence because what we're trying to do is we're changing our prior,

  • we're creating this new posterior probability built upon the prior by using some sort of evidence,

  • right? And that evidence is a probability of x. So that's some vocab. And this here

  • is a rule for naive Bayes. Whoa, okay, let's digest that a little bit. Okay. So what is

  • let me use a different color. What is this side of the equation asking? It's asking,

  • what is the probability that we are in some class K, CK, given that, you know, this is my first

  • input, this is my second input, this is, you know, my third, fourth, this is my nth input. So let's

  • say that our classification is, do we play soccer today or not? Okay, and let's say our x's are,

  • okay, how much wind is there? How much rain is there? And what day of the week is it? So let's

  • say that it's raining, it's not windy, but it's Wednesday, do we play soccer? Do we not?

  • So let's use Bayes rule on this. So this here

  • is equal to the probability of x one, x two, all these joint probabilities, given class K

  • times the probability of that class, all over the probability of this evidence.

  • Okay. So what is this fancy symbol over here, this means proportional to

  • so how our equal sign means it's equal to this like little squiggly sign means that this is

  • proportional to okay, and this denominator over here, you might notice that it has no impact on

  • the class like this, that number doesn't depend on the class, right? So this is going to be constant

  • for all of our different classes. So what I'm going to do is make things simpler. So I'm just

  • going to say that this probability x one, x two, all the way to x n, this is going to be proportional

  • to the numerator, I don't care about the denominator, because it's the same for every

  • single class. So this is proportional to x one, x two, x n given class K times the probability of

  • that class. Okay. All right. So in naive Bayes, the point of it being naive is that, for this

  • joint probability, we're just assuming that all of these different things

  • are all independent. So in my soccer example, you know, the probability that we're playing soccer,

  • or the probability that, you know, it's windy, and it's rainy, and, and it's Wednesday, all these

  • things are independent, we're assuming that they're independent. So that means that I can

  • actually write this part of the equation here as this. So each term in here, I can just multiply

  • all of them together. So the probability of the first feature, given that it's class K,

  • times the probability of the second feature given that it's class K, all the way up

  • until, you know, the nth feature given that it's class K. So this expands to

  • all of this. All right, which means that this here is now proportional to the thing that we just

  • expanded times this. So I'm going to write that out. So the probability of that class.

  • And I'm actually going to use this symbol. So what this means is it's a huge multiplication,

  • it means multiply everything to the right of this. So this probability x, given some class K,

  • but do it for all the i's. So what is i? Okay, we're going to go from the first

  • x i all the way to the nth. So that means for every single i, we're just multiplying

  • these probabilities together. And that's where this up here comes from. So to wrap this up,

  • oops, this should be a line to wrap this up in plain English. Basically, what this is saying

  • is a probability that you know, we're in some category, given that we have all these different

  • features is proportional to the probability of that class in general, times the probability of

  • each of those features, given that we're in this one class that we're testing. So the probability

  • of it, you know, of us playing soccer today, given that it's rainy, not windy, and and it's

  • Wednesday, is proportional to Okay, well, what is what is the probability that we play soccer

  • anyways, and then times the probability that it's rainy, given that we're playing soccer,

  • times the probability that it's not windy, given that we're playing soccer. So how many times are

  • we playing soccer when it's not windy, and then what's the probability that it's Wednesday, given

  • that we're playing soccer. Okay. So how do we use this in order to make a

  • classification. So that's where this comes in our y hat, our predicted y is going to be equal to

  • something called the arg max. And then this expression over here, because we want to take

  • the arg max. Well, we want. So okay, if I write out this, again, this means the probability of

  • being in some class CK given all of our evidence. Well, we're going to take the K that maximizes

  • this expression on the right. That's what arg max means. So if k is in one through K,

  • so this is how many categories there are, we're going to go through each k. And we're going

  • to solve this expression over here and find the K that makes that the largest. Okay. And remember

  • that instead of writing this, we have now a formula, thanks to Bayes rule for helping us

  • approximate that, right, with something that maybe we have the evidence for,

  • we have the answers for that based on our training set. So this principle of going through each of

  • these and finding whatever class whatever category maximizes this expression on the right,

  • this is something known as MAP for short, or maximum a posteriori.
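
Written out as an equation, the naive Bayes decision rule just described, with the conditional independence assumption, is:

```latex
\hat{y} = \underset{k \in \{1, \dots, K\}}{\arg\max} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
```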

  • Pick the hypothesis. So pick the K that is the most probable so that we minimize the probability

  • of misclassification. Right. So that is MAP. That is naive Bayes. Back to the notebook. So

  • just like how I imported k nearest neighbor, k neighbors classifier up here for naive Bayes,

  • I can go to SK learn naive Bayes. And I can import Gaussian naive Bayes.

  • Right. And here I'm going to say my naive Bayes model is equal. This is very similar to what we

  • had above. And I'm just going to say with this model, we are going to fit x train and y train.
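
As a rough sketch, the steps being described here might look like the following, assuming the X_train, y_train, X_test, and y_test splits created earlier in the notebook:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# fit a Gaussian naive Bayes model on the training data
nb_model = GaussianNB()
nb_model = nb_model.fit(X_train, y_train)

# predict on the test set and print precision, recall, and F1
y_pred = nb_model.predict(X_test)
print(classification_report(y_test, y_pred))
```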

  • All right, just like above. So this, I might actually, so I'm going to set that. And

  • exactly, just like above, I'm going to make my prediction. So here, I'm going to instead use my

  • naive Bayes model. And of course, I'm going to run the classification report again. So I'm actually

  • just going to put these in the same cell. But here we have the y the new y prediction and then y test

  • is still our original test data set. So if I run this, you'll see that. Okay, what's going on here,

  • we get worse scores, right? Our precision, our recall, our f1 score, they all look slightly worse

  • for all the different categories. And our total accuracy, I mean, it's still 72%, which is not too

  • shabby, but it's also not that great. Okay, so let's move on to logistic regression.

  • Here, I've drawn a plot, I have y. So this is my label on one axis. And then this is maybe one of

  • my features. So let's just say I only have one feature in this case, x zero, right? Well,

  • we see that, you know, I have a few of one class type down here. And we know it's one class type

  • because it's zero. And then we have our other class type one up here. And then we have our

  • y. Okay. So many of you guys are familiar with regression. So let's start there. If I were to

  • draw a regression line through this, it might look something like like this. Right? Well, this

  • doesn't seem to be a very good model. Like, why would we use this specific line to predict why?

  • Right? It's, it's iffy. Okay. For example, we might say, okay, well, it seems like, you know,

  • everything from here downwards would be one class type, and from here upwards would be another class type.

  • But when you look at this, you're just you, you visually can tell, okay, like, that line doesn't

  • make sense. Things are not those dots are not along that line. And the reason is because we

  • are doing classification, not regression. Okay. Well, first of all, let's start here, we know that

  • this model, if we just use this line, it equals m x plus b, where b is the y intercept, right?

  • And m is the slope. But when we use this for classification, is what we're estimating actually

  • y hat? No, it's not, right? What we're actually estimating in our model is a probability, a probability

  • between zero and one, that it is class zero or class one. So here, let's rewrite this as p equals m x plus b.

  • Okay, well, m x plus b, that can range, you know, from negative infinity to infinity,

  • right? For any for any value of x, it goes from negative infinity to infinity.

  • But probability, we know probably one of the rules of probability is that probability has to stay

  • between zero and one. So how do we fix this? Well, maybe instead of just setting the probability

  • equal to that, we can set the odds equal to this. So by that, I mean, okay, let's do probability

  • divided by one minus the probability. Okay, so now becomes this ratio. Now this ratio is allowed to

  • take on infinite values. But there's still one issue here. Let me move this over a bit.

  • The one issue here is that m x plus b, that can still be negative, right? Like if you know,

  • I have a negative slope, if I have a negative b, if I have some negative x's in there, I don't know,

  • but that can be that's allowed to be negative. So how do we fix that? We do that by actually taking

  • the log of the odds. Okay. So now I have the log of you know, some probability divided by one minus

  • the probability. And now that is on a range of negative infinity to infinity, which is good

  • because the range of log should be negative infinity to infinity. Now how do I solve for P

  • the probability? Well, the first thing I can do is take, you know, I can remove the log by taking

  • e to the power of whatever is on both sides. So that gives me the probability

  • over the one minus the probability is now equal to e to the m x plus b. Okay. So let's multiply

  • that out. So the probability is equal to one minus the probability, times e to the m x plus b. So P is equal to

  • e to the m x plus b minus P times e to the m x plus b. And now we have we can move like terms to

  • one side. So if I do P, so basically, I'm moving this over, so I'm adding P. So now P one plus e

  • to the m x plus b is equal to e to the m x plus b and let me change this parentheses make it a

  • little bigger. So now my probability can be e to the m x plus b divided by one plus e to the m x plus b.

  • Okay, well, let me just rewrite this really quickly, I want a numerator of one on top.

  • Okay, so what I'm going to do is I'm going to multiply the top by e to the negative m x plus b,

  • and then also the bottom by e to the negative m x plus b, and I'm allowed to do that because

  • this over this is one. So now my probability is equal to one over

  • one plus e to the negative m x plus b. And now why did I rewrite it like that?

  • It's because this is actually a form of a special function, which is called the sigmoid

  • function. And for the sigmoid function, it looks something like this. So s of x sigmoid, you know,

  • that some x is equal to one over one plus e to the negative x. So essentially, what I just did up here

  • is rewrite this in some sigmoid function, where the x value is actually m x plus b.
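
To summarize the derivation just walked through, in equation form:

```latex
\log\left(\frac{p}{1 - p}\right) = mx + b
\;\Rightarrow\;
\frac{p}{1 - p} = e^{mx + b}
\;\Rightarrow\;
p = \frac{e^{mx + b}}{1 + e^{mx + b}} = \frac{1}{1 + e^{-(mx + b)}} = \sigma(mx + b),
\qquad
\sigma(y) = \frac{1}{1 + e^{-y}}
```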

  • So maybe I'll change this to y just to make that a bit more clear, it doesn't matter what

  • the variable name is. But this is our sigmoid function. And visually, what our sigmoid function

  • looks like is it goes from zero. So this here is zero to one. And it looks something like this

  • curved s, which I didn't draw too well. Let me try that again. It's hard to draw

  • something if I can draw this right. Like that. Okay, so it goes in between zero and one.

  • And you might notice that this form fits our shape up here.

  • Oops, let's draw it sharper. But it fits our shape up there a lot better, right?

  • Alright, so that is what we call logistic regression, we're basically trying to fit our data

  • to the sigmoid function. Okay. And when we only have, you know, one feature x,

  • that's what we call simple logistic regression. But then if we have, you know,

  • so that's only x zero, but then if we have x zero, x one, all the way to x n, we call this

  • multiple logistic regression, because there are multiple features that we're considering

  • when we're building our model, logistic regression. So I'm going to put that here.

  • And again, from SK learn this linear model, we can import logistic regression. All right.

  • And just like how we did above, we can repeat all of this. So here, instead of NB, I'm going to call

  • this log model, or LG logistic regression. I'm going to change this to logistic regression.

  • So I'm just going to use the default logistic regression. But actually, if you look here,

  • you see that you can use different penalties. So right now we're using

  • an L2 penalty. And L2 is a quadratic penalty. Okay, so that means that for,

  • you know, outliers, it would really penalize that. For all these other things, you know,

  • you can toggle these different parameters, and you might get slightly different results.

  • If I were building a production level logistic regression model, then I would want to go and

  • figure out, you know, what are the best parameters to pass into here,

  • based on my validation data. But for now, we'll just use this out of the box.
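
A minimal sketch of what this step might look like, again assuming the same X_train, y_train, X_test, and y_test splits and the default parameters:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# fit a logistic regression model with the default (L2) penalty
lg_model = LogisticRegression()
lg_model = lg_model.fit(X_train, y_train)

# predict on the test set and print the classification report
y_pred = lg_model.predict(X_test)
print(classification_report(y_test, y_pred))
```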

  • So again, I'm going to fit the X train and the Y train. And I'm just going to predict again,

  • so I can just call this again. And instead of LG, NB, I'm going to use LG. So here, this is decent

  • precision of 65%, recall of 71, F1 of 68 (or 82 for the other class), total accuracy of 77. Okay, so it performs slightly

  • better than naive Bayes, but it's still not as good as KNN. Alright, so the last model for

  • classification that I wanted to talk about is something called support vector machines,

  • or SVMs for short. So what exactly is an SVM model, I have two different features x zero and

  • x one on the axes. And then I've told you if it's you know, class zero or class one based on the

  • blue and red labels, my goal is to find some sort of line between these two labels that best divides

  • the data. Alright, so this line is our SVM model. So I call it a line here because in 2d, it's a

  • line, but in 3d, it would be a plane and then you can also have more and more dimensions. So the

  • proper term is actually I want to find the hyperplane that best differentiates these two

  • classes. Let's see a few examples. Okay, so first, between these three lines, let's say A, B, and C,

  • which one is the best divider of the data, which one has, you know, all the data on one side

  • or the other, or at least if it doesn't, which one divides it the most, right, like which one

  • is has the most defined boundary between the two different groups. So this this question should be

  • pretty straightforward. It should be A, right, because A has a clear, distinct line between where, you

  • know, everything on this side of a is one label, it's negative and everything on this side of a

  • is the other label, it's positive. So what if I have a but then what if I had drawn my B

  • like this, and my C, maybe like this, sorry, they're kind of the labels are kind of close together.

  • But now which one is the best? So I would argue that it's still a, right? And why is it still a?

  • Because in these other two, look at how close this is to that,

  • to these points. Right? So if I had some new point that I wanted to estimate, okay,

  • say I didn't have A or B. So let's say we're just working with C. Let's say I have some new point

  • that's right here. Or maybe a new point that's right there. Well, it seems like just logically

  • looking at this. I mean, without the boundary, that would probably go under the positives,

  • right? I mean, it's pretty close to that other positive. So one thing that we care about in SVM

  • is something known as the margin. Okay, so not only do we want to separate the two classes really

  • well, we also care about the boundary in between where the points in those classes in our data set

  • are, and the line that we're drawing. So in a line like this, the closest values to this line

  • might be like here. And I'm trying to draw these perpendicular. Right? And so this effectively,

  • if I switch over to these dotted lines, if I can draw this right. So these effectively

  • are what's known as the margins. Okay, so these both here, these are our margins in our SVMs.

  • And our goal is to maximize those margins. So not only do we want the line that best separates the

  • two different classes, we want the line that has the largest margin. And the data points that lie

  • on the margin lines, so basically, these are the data points that are helping us define our

  • divider. These are what we call support vectors. Hence the name support vector machines. Okay,

  • so the issue with SVM sometimes is that they're not so robust to outliers. Right? So for example,

  • if I had one outlier, like this up here, that would totally change where I want my support

  • vector to be, even though that might be my only outlier. Okay. So that's just something to keep

  • in mind. As you know, when you're working with SVM is, it might not be the best model if there

  • are outliers in your data set. Okay, so another example of SVMs might be, let's say that we have

  • data like this, I'm just going to use a one dimensional data set for this example. Let's

  • say we have a data set that looks like this. Well, our, you know, separators should be

  • perpendicular to this line. But it should be somewhere along this line. So it could be

  • anywhere like this. You might argue, okay, well, there's one here. And then you could also just

  • draw another one over here, right? And then maybe you can have two SVMs. But that's not really how

  • SVMs work. But one thing that we can do is we can create some sort of projection. So I realize here

  • that one thing I forgot to do was to label where zero was. So let's just say zero is here.

  • Now, what I'm going to do is I'm going to say, okay, I'm going to have x, and then I'm going to

  • have x, sorry, x zero and x one. So x zero is just going to be my original x. But I'm going to make

  • x one equal to, let's say, x squared. So whatever this is, squared, right? So now, my negatives would be,

  • you know, maybe somewhere here, here, just pretend that it's somewhere up here.

  • Right. And now my pluses might be something like

  • that. And I'm going to run out of space over here. So I'm just going to draw these together,

  • use your imagination. But once I draw it like this, well, it's a lot easier to apply a boundary,

  • right? Now our SVM could be maybe something like this, this. And now you see that we've divided

  • our data set. Now it's separable where one class is this way. And the other class is that way.

  • Okay, so that's known as SVMs. I do highly suggest that, you know, any of these models that we just

  • mentioned, if you're interested in them, do go more in depth mathematically into them. Like how

  • do we how do we find this hyperplane? Right? I'm not going to go over that in this specific course,

  • because you're just learning what an SVM is. But it's a good idea to know, oh, okay, this is the

  • technique behind finding, you know, how exactly do you define the hyperplane

  • that we're going to use. So anyways, this transformation that we did down here, this is known

  • as the kernel trick. So when we go from x to some coordinate x, and then x squared,

  • what we're doing is we are applying a kernel. So that's why it's called the kernel trick.
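
A tiny sketch of that projection, using made-up 1D points purely for illustration: each point x gets mapped to the pair (x, x squared), which is what makes the two groups separable by a straight line in the new space.

```python
import numpy as np

# made-up 1D data: negatives near zero, positives further out on either side
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([   1,    1,    0,   0,   0,   1,   1])  # 1 = positive class, 0 = negative class

# kernel-trick style projection: map x -> (x, x^2)
features = np.column_stack([x, x ** 2])
print(features)
# in this 2D space, the horizontal line x1 = 2 separates the two classes
```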

  • So SVMs are actually really powerful. And you'll see that here. So from sk learn.svm, we are going

  • to import SVC. And SVC is our support vector classifier. So with this, with our SVM model, we are

  • going to, you know, create the SVC model. And we are going to, again, fit this to X train and y train.

  • I could have just copied and pasted this from above, I should have probably done that.

  • Okay, taking a bit longer. All right. Let's predict using our SVM model. And here,

  • let's see if I can hover over this. Right. So again, you see a lot of these different

  • parameters here that you can go back and change if you were creating a production level model. Okay,

  • but in this specific case, we'll just use it out of the box again. So if I make predictions,

  • you'll note that Wow, the accuracy actually jumps to 87% with the SVM. And even with class zero,

  • there's nothing less than, you know, point eight, which is great. And for class one,

  • I mean, everything's at 0.9, which is higher than anything that we had seen to this point.
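
As with the other models, a minimal sketch of the SVC steps just described, with the same assumed splits and the out-of-the-box parameters:

```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# fit a support vector classifier (default RBF kernel)
svm_model = SVC()
svm_model = svm_model.fit(X_train, y_train)

# predict on the test set and print the classification report
y_pred = svm_model.predict(X_test)
print(classification_report(y_test, y_pred))
```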

  • So so far, we've gone over four different classification models, we've done SVM,

  • logistic regression, naive Bayes and cannon. And these are just simple ways on how to implement

  • them. Each of these they have different, you know, they have different hyper parameters that you can

  • go and you can toggle. And you can try to see if that helps later on or not. But for the most part,

  • they perform, they give us around 70 to 80% accuracy. Okay, with SVM being the best. Now,

  • let's see if we can actually beat that using a neural net. Now the final type of model that

  • I wanted to talk about is known as a neural net or neural network. And neural nets look something

  • like this. So you have an input layer, this is where all your features would go. And they have

  • all these arrows pointing to some sort of hidden layer. And then all these arrows point to some

  • sort of output layer. So what does all this mean? Each of these circles in here is something known as

  • a neuron. Okay, so that's a neuron in a neural net. These are all of our

  • features that we're inputting into the neural net. So that might be x zero x one all the way through

  • x n. Right. And these are the features that we talked about there, they might be you know,

  • the pregnancy, the BMI, the age, etc. Now all of these get weighted by some value. So they

  • are multiplied by some w number that applies to that one specific

  • feature. So these two get multiplied. And the sum of all of these goes into that neuron. Okay,

  • so basically, I'm taking w zero times x zero. And then I'm adding x one times w one and then

  • I'm adding you know, x two times w two, etc, all the way to x n times w n. And that's getting

  • input into the neuron. Now I'm also adding this bias term, which just means okay, I might want

  • to shift this by a little bit. So I might add five or I might add 0.1 or I might subtract 100,

  • I don't know. But we're going to add this bias term. And the output of all these things. So

  • the sum of this, this, this and this, go into something known as an activation function,

  • okay. And then after applying this activation function, we get an output. And this is what a

  • neuron would look like. Now a whole network of them would look something like this.
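
A minimal numpy sketch of the single-neuron computation just described: a weighted sum of the inputs plus a bias, pushed through an activation function (sigmoid here, just as an example; the inputs and weights are made up):

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

x = np.array([2.0, -1.0, 0.5])   # inputs x0, x1, x2 (made up)
w = np.array([0.3,  0.8, -0.5])  # weights w0, w1, w2 (made up)
b = 0.1                          # bias term

# weighted sum of the inputs plus the bias, then the activation function
neuron_output = sigmoid(np.dot(w, x) + b)
print(neuron_output)
```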

  • So I kind of gloss over this activation function. What exactly is that? This is how a neural net

  • looks like if we have all our inputs here. And let's say all of these arrows represent some sort

  • of addition, right? Then what's going on is we're just adding a bunch of times, right? We're adding

  • some sort of weight times these input values a bunch of times. And then if we were to go back

  • and factor that all out, then this entire neural net is just a linear combination of these input

  • layers, which I don't know about you, but that just seems kind of useless, right? Because we could

  • literally just write that out in a formula, why would we need to set up this entire neural network,

  • we wouldn't. So the activation function is introduced, right? So without an activation

  • function, this just becomes a linear model. An activation function might look something like

  • this. And as you can tell, these are not linear. And the reason why we introduce these is so that

  • our entire model doesn't collapse on itself and become a linear model. So over here, this is

  • something known as a sigmoid function, it runs between zero and one, tanh runs between negative

  • one all the way to one. And this is ReLU, which anything less than zero is zero, and then anything

  • greater than zero is linear. So with these activation functions, every single output of a neuron

  • is no longer just the linear combination of these, it's some sort of altered linear state, which means

  • that the input into the next neuron is, you know, it doesn't it doesn't collapse on itself, it doesn't

  • become linear, because we've introduced all these nonlinearities. So this is a training set, the

  • model, the loss, right? And then we do this thing called training, where we have to feed the loss

  • back into the model, and make certain adjustments to the model to improve this predicted output.

  • Let's talk a little bit about the training, what exactly goes on during that step.

  • Let's go back and take a look at our L2 loss function. This is what our L2 loss function

  • looks like it's a quadratic formula, right? Well, up here, the error is really, really, really, really

  • large. And our goal is to get somewhere down here, where the loss is decreased, right? Because that

  • means that our predicted value is closer to our true value. So that means that we want to go

  • this way. Okay. And thanks to a lot of properties of math, something that we can do is called

  • gradient descent, in order to follow this slope down this way. This quadratic has different

  • slopes with respect to different values. Okay, so the loss with respect to some weight

  • w zero, versus w one versus w n, they might all be different. Right? So some way that I kind of

  • think about it is, to what extent is this value contributing to our loss. And we can actually

  • figure that out through some calculus, which we're not going to touch up on in this specific course.

  • But if you want to learn more about neural nets, you should probably also learn some calculus

  • and figure out what exactly back propagation is doing, in order to actually calculate, you know,

  • how much do we have to backstep by. So the thing is here, you might notice that this follows

  • this curve at all of these different points. And the closer we get to the bottom, the smaller

  • this step becomes. Now stick with me here. So my new value, this is what we call a weight update,

  • I'm going to take w zero, and I'm going to set some new value for w zero. And what I'm going to

  • set for that is the old value of w zero, plus some factor, which I'll just call alpha for now,

  • times whatever this arrow is. So that's basically saying, okay, take our old w zero, our old weight,

  • and just decrease it this way. So I guess increase it in this direction, right, like take a step in

  • this direction. But this alpha here is telling us, okay, don't don't take a huge step, right,

  • just in case we're wrong, take a small step, take a small step in that direction, see if we get any

  • closer. And for those of you who, you know, do want to look more into the mathematics of things,

  • the reason why I use a plus here is because this here is the negative gradient, right, if this were

  • the actual gradient instead, this should be a minus.
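
Written as an equation, the weight update being described is:

```latex
w_0 \leftarrow w_0 - \alpha \, \frac{\partial L}{\partial w_0}
```

where alpha is the learning rate and the partial derivative is the gradient of the loss with respect to w zero; the same update is applied to every other weight as well.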

  • Now this alpha is something that we call the learning rate. Okay, and that adjusts how quickly

  • we're taking steps. And that, you know, will ultimately control

  • how long it takes for our neural net to converge. Or sometimes if you set it too high, it might even

  • diverge. But with all of these weights, so here I have w zero, w one, and then w n. We make the same

  • update to all of them after we calculate the loss, the gradient of the loss with respect to that

  • weight. So that's how back propagation works. And that is everything that's going on here. After we

  • calculate the loss, we're calculating gradients, making adjustments in the model. So we're setting

  • all the all the weights to something adjusted slightly. And then we're going to calculate the

  • gradient. And then we're saying, Okay, let's take the training set and run it through the model

  • again, and go through this loop all over again. So for machine learning, we already have seen some

  • libraries that we use, right, we've already seen SK learn. But when we start going into neural

  • networks, this is kind of what we're trying to program. And it's not very fun to try to

  • do this from scratch, because not only will we probably have a lot of bugs, but also probably

  • not going to be fast enough, right? Wouldn't it be great if there are just some, you know,

  • full time professionals that are dedicated to solving this problem, and they could literally

  • just give us their code that's already running really fast? Well, the answer is, yes, that exists.

  • And that's why we use TensorFlow. So TensorFlow makes it really easy to define these models. But

  • we also have enough control over what exactly we're feeding into this model. So for example,

  • this line here is basically saying, Okay, let's create a sequential neural net. So sequential is

  • just, you know, what we've seen here, it just goes one layer to the next. And a dense layer means that

  • all of them are interconnected. So here, this is interconnected with all of these

  • nodes, and this one's all these, and then this one gets connected to all of the next ones, and so on.

  • So we're going to create 16 dense nodes with relu activation functions. And then we're going

  • to create another layer of 16 dense nodes with relu activation. And then our output layer is going

  • to be just one node. Okay. And that's how easy it is to define something in TensorFlow. So TensorFlow

  • is an open source library that helps you develop and train your ML models. Let's implement this

  • for a neural net. So we're using a neural net for classification. Now, so our neural net model,

  • we are going to use TensorFlow, and I don't think I imported that up here. So we are going to import

  • that down here. So I'm going to import TensorFlow as TF. And enter. Cool. So my neural net model

  • is going to be, I'm going to use this. So essentially, this is saying take all these layers that

  • I'm about to pass in and stack them, as a linear stack of layers, into a model.

  • And what that means is I can pass in

  • some sort of layer, and I'm just going to use a dense layer.

  • Oops, dot dense. And let's say we have 32 units. Okay, I will also

  • set the activation as relu. And for the first layer we have to specify the input shape. So here we have

  • (10,). Alright, so that's our first layer. Now our next layer, I'm just going to have

  • another dense layer of 32 units all using relu. And that's it. So for the final layer, this is

  • just going to be my output layer, it's going to just be one node. And the activation is going to

  • be sigmoid. So if you recall from our logistic regression, what happened there was when we had

  • a sigmoid, it looks something like this, right? So by creating a sigmoid activation on our last layer,

  • we're essentially projecting our predictions to be zero or one, just like in logistic regression.

  • And that's going to help us, you know, we can just round to zero or one and classify that way.

  • Okay. So this is my neural net model. And I'm going to compile this. So in TensorFlow,

  • we have to compile it. It's really cool, because I can just literally pass in what type of optimizer

  • I want, and it'll do it. So here, if I go to optimizers, I'm actually going to use Adam.

  • And you'll see that, you know, the learning rate is 0.001. So I'm just going to use that default.

  • So 0.001. And my loss is going to be binary cross entropy. And the metrics that I'm also going to

  • include on here, so it already will consider loss, but I'm, I'm also going to tack on accuracy.

  • So we can actually see that in a plot later on. Alright, so I'm going to run this.
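
Putting the pieces just described together, the model definition and compile step might look like this sketch, assuming the 10 input features and the layer sizes mentioned above:

```python
import tensorflow as tf

# sequential model: two hidden dense layers of 32 relu units, one sigmoid output node
nn_model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Adam optimizer with the default 0.001 learning rate, binary cross entropy loss,
# and accuracy tracked as an extra metric
nn_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                 loss='binary_crossentropy',
                 metrics=['accuracy'])
```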

  • And one thing that I'm going to also do is I'm going to define these plot definitions. So I'm

  • actually copying and pasting this, I got these from TensorFlow. So if you go on to some TensorFlow

  • tutorial, they actually have these, this like, defined. And that's exactly what I'm doing here.

  • So I'm actually going to move this cell up, run that. So we're basically plotting the loss

  • over all the different epochs; epochs means like training cycles. And we're also going

  • to plot the accuracy over all the epochs.
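
The two helper functions being copied in aren't shown in the transcript; a sketch of roughly what they look like in the TensorFlow tutorials (the names plot_loss and plot_accuracy match how they're called later):

```python
import matplotlib.pyplot as plt

def plot_loss(history):
    # training and validation loss over the epochs
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.xlabel('Epoch')
    plt.ylabel('Binary cross entropy')
    plt.legend()
    plt.grid(True)
    plt.show()

def plot_accuracy(history):
    # training and validation accuracy over the epochs
    plt.plot(history.history['accuracy'], label='accuracy')
    plt.plot(history.history['val_accuracy'], label='val_accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.grid(True)
    plt.show()
```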

  • Alright, so we have our model. And now all that's left is, let's train it. Okay.

  • So I'm going to say history. So TensorFlow is great, because it keeps track of the history

  • of the training, which is why we can go and plot it later on. Now I'm going to set that equal to

  • this neural net model. And fit that with x train, y train, I'm going to make the number of epochs

  • equal to let's say just let's just use 100 for now. And the batch size, I'm going to set equal to,

  • let's say 32. Alright. And the validation split. So what the validation split does, if it's down

  • here somewhere. Okay, so yeah, this validation split is just the fraction of the training data

  • to be used as validation data. So essentially, every single epoch, what's going on is TensorFlow

  • saying, if this is 0.2, then leave 20% out. And we're going to test how the

  • model performs on that 20% that we've left out. Okay, so it's basically like our validation data

  • set. But TensorFlow does it on our training data set during the training. So we have now a measure

  • outside of just our validation data set to see, you know, what's going on. So validation split,

  • I'm going to make that 0.2. And we can run this. So if I run that, all right, and I'm actually going

  • to set verbose equal to zero, which means, okay, don't print anything, because printing something

  • for 100 epochs might get kind of annoying. So I'm just going to let it run, let it train,

  • and then we'll see what happens. Cool, so it finished training. And now what I can do is

  • because you know, I've already defined these two functions, I can go ahead and I can plot the loss,

  • oops, loss of that history. And I can also plot the accuracy throughout the training.
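
Put together, the training call and the plotting just described might look like this sketch (verbose=0 silences the per-epoch printout):

```python
# train for 100 epochs, holding out 20% of the training data as a validation split
history = nn_model.fit(X_train, y_train,
                       epochs=100, batch_size=32,
                       validation_split=0.2, verbose=0)

# visualize how the loss and accuracy evolved over the epochs
plot_loss(history)
plot_accuracy(history)
```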

  • So this is a little bit ish what we're looking for. We definitely are looking for a steadily

  • decreasing loss and an increasing accuracy. So here we do see that, you know, our validation

  • accuracy improves from around 0.77 or something all the way up to somewhere around

  • maybe 0.81. And our loss is decreasing. So this is good. It is expected that the validation

  • loss and accuracy is performing worse than the training loss or accuracy. And that's because

  • our model is training on that data. So it's adapting to that data. Whereas the validation stuff is,

  • you know, stuff that it hasn't seen yet. So, so that's why. So in machine learning, as we saw above,

  • we could change a bunch of the parameters, right? Like I could change this to 64. So now it'd be

  • a row of 64 nodes, and then 32, and then one. So I can change some of these parameters.

  • And a lot of machine learning is trying to find, hey, what do we set these hyper parameters to?

  • So what I'm actually going to do is I'm going to rewrite this so that we can do something what's

  • known as a grid search. So we can search through an entire space of hey, what happens if, you know,

  • we have 64 nodes and 64 nodes, or 16 nodes and 16 nodes, and so on. And then on top of all that,

  • we can, you know, we can change this learning rate, we can change how many epochs we can change,

  • you know, the batch size, all these things might affect our training. And just for kicks,

  • I'm also going to add what's known as a dropout layer in here. And what dropout is doing is

  • saying, hey, randomly choose, at this rate, certain nodes, and don't train them in, you know,

  • in a certain iteration. So this helps prevent overfitting. Okay, so I'm actually going to

  • define this as a function called train model, we're going to pass in x train, y train,

  • the number of nodes, the dropout probability that we just talked about, the learning rate

  • (so I'm actually going to say lr), the batch size. And we can also pass in the number of epochs,

  • right? I mentioned that as a parameter. So indent this, so it goes under here. And with these two,

  • I'm going to set this equal to number of nodes. And now with the two dropout layers, I'm going

  • to set dropout prob. So now you know, the probability of turning off a node during the training

  • is equal to dropout prob. And I'm going to keep the output layer the same. Now I'm compiling it,

  • but this here is now going to be my learning rate. And I still want binary cross entropy and

  • accuracy. We are actually going to train our model inside of this function. But here we can do the

  • epochs equal epochs, and this is equal to whatever, you know, we're passing in x train,

  • y train belong right here. Okay, so those are getting passed in as well. And finally, at the

  • end, I'm going to return this model and the history of that model. Okay. So now what I'll do

  • is let's just go through all of these. So let's keep epochs at 100. And now what I can

  • do is I can say, hey, for the number of nodes in, let's say, 16, 32, and 64, to see what

  • happens; for the different dropout probabilities, and I mean, zero would be nothing, let's use 0.2 also,

  • to see what happens; for the learning rate, 0.005, 0.001, and, you know, maybe we want to throw on

  • 0.1 in there as well; and then for the batch size, let's do 16, 32, 64 as well, actually, let's also

  • throw in 128 and get rid of 16, so 32, 64, and 128 in there. That should be 0.1. I'm going to record

  • the model and history using this

  • train model here. So we're going to do x train y train, the number of nodes is going to be,

  • you know, the number of nodes that we've defined here, dropout, prob, LR, batch size, and epochs.

  • Okay. And then now we have both the model and the history. And what I'm going to do is again,

  • I want to plot the loss for the history. I'm also going to plot the accuracy.

  • Probably should have done them side by side, that probably would have been easier.

  • Okay, so what I'm going to do is split up, split this up. And that will be

  • the subplots. So now this is just saying, okay, I want one row and two columns in that row for my

  • plots. Okay, so I'm going to plot on my axis one, the loss. I don't actually know if this is going to

  • work. Okay, we don't care about the grid. Yeah, let's keep the grid. And then now my other.

  • So now on here, I'm going to plot all the accuracies on the second plot.

  • I might have to debug this a bit.

  • We should be able to get rid of that. If we run this, we already have history saved as a variable

  • in here. So if I just run it on this, okay, it has no attribute x label. Oh, I think it's because

  • it's like set_xlabel or something. Okay, yeah, so it's set_xlabel and set_ylabel instead of just xlabel and ylabel.

  • So let's see if that works. All right, cool. Um, and let's actually make this a bit larger.

  • Okay, so we can actually change the figure size that I'm gonna set. Let's see what happens if I

  • set that to. Oh, that's not the way I wanted it. Okay, so that looks reasonable.

  • And that's just going to be my plot history function. So now I can plot them side by side.

  • Here, I'm going to plot the history. And what I'm actually going to do first is print out all these

  • parameters. So I'm going to use an f-string to print out all of this stuff. So here, I'm printing out

  • how many nodes, the dropout probability, and the learning rate.

  • And we already know how many epochs we're running, so I'm not even going to bother with that.

  • So once we plot this, let's actually also figure out what the validation loss is on our validation

  • set, the one that we created all the way back up here. Alright, so remember, we created three data

  • sets. Let's take our model and evaluate what the loss on the validation data set would be. And I actually want to record,

  • let's say I want to record whatever model has the least validation loss. So

  • first, I'm going to initialize that to infinity so that you know, any model will beat that score.

  • So if I do float infinity, that will set that to infinity. And maybe I'll keep

  • track of the parameters. Actually, it doesn't really matter. I'm just going to keep track of

  • the model. And I'm gonna set that to none. So now down here, if the validation loss is ever

  • less than the least validation loss, then I am going to simply come down here and say,

  • Hey, this validation for this least validation loss is now equal to the validation loss.

  • And the least loss model is whatever this model is that just earned that validation loss. Okay.

  • So we are actually just going to let this run for a while. And then we're going to get our least

  • last model after that. So let's just run. All right, and now we wait.
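
Pulling the whole grid search together, a rough sketch under the assumptions above: the hyperparameter grid is the one just discussed, X_valid and y_valid stand for whatever names the earlier train/validation/test split used, and plot_loss/plot_accuracy are the helpers sketched earlier, shown here as separate calls rather than side-by-side subplots.

```python
import tensorflow as tf

def train_model(X_train, y_train, num_nodes, dropout_prob, lr, batch_size, epochs):
    # two hidden layers with dropout after each, and a sigmoid output node
    nn_model = tf.keras.Sequential([
        tf.keras.layers.Dense(num_nodes, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dropout(dropout_prob),
        tf.keras.layers.Dense(num_nodes, activation='relu'),
        tf.keras.layers.Dropout(dropout_prob),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    nn_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                     loss='binary_crossentropy', metrics=['accuracy'])
    history = nn_model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                           validation_split=0.2, verbose=0)
    return nn_model, history

least_val_loss = float('inf')   # any model will beat infinity
least_loss_model = None
epochs = 100

for num_nodes in [16, 32, 64]:
    for dropout_prob in [0, 0.2]:
        for lr in [0.005, 0.001, 0.1]:
            for batch_size in [32, 64, 128]:
                print(f"{num_nodes} nodes, dropout {dropout_prob}, lr {lr}, batch size {batch_size}")
                model, history = train_model(X_train, y_train, num_nodes,
                                             dropout_prob, lr, batch_size, epochs)
                plot_loss(history)
                plot_accuracy(history)
                # evaluate returns [loss, accuracy]; keep the model with the lowest val loss
                val_loss = model.evaluate(X_valid, y_valid, verbose=0)[0]
                if val_loss < least_val_loss:
                    least_val_loss = val_loss
                    least_loss_model = model
```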

  • All right, so we've finally finished training. And you'll notice that okay, down here, the loss

  • actually gets to like 0.29. The accuracy is around 88%, which is pretty good. So you might be wondering,

  • okay, why is this accuracy different from this one? Like, these are both validation numbers. So this accuracy here

  • is on the validation data set that we've defined at the beginning, right? And this one here,

  • this is actually taking 20% of our training set every time during the training,

  • and saying, Okay, how much of it do I get right now? You know, after this one step where I didn't

  • train with any of that. So they're slightly different. And actually, I realized later on

  • that, you know, probably what I should have done is over here, when we were defining

  • the model fit, instead of the validation split, you can define the validation data.

  • And you can pass in the validation data, I don't know if this is the proper syntax. But

  • that's probably what I should have done. But instead, you know, we'll just stick with what

  • we have here. So you'll see at the end, you know, with the 64 nodes, it seems like this is our best

  • performance 64 nodes with a dropout of 0.2, a learning rate of 0.001, and a batch size of 64.

  • And it does seem like yes, the validation, you know, the fake validation, but the validation

  • loss is decreasing, and then the accuracy is increasing, which is a good sign. Okay,

  • so finally, what I'm going to do is I'm actually just going to predict. So I'm going to take

  • this model, which we've called our least loss model, I'm going to take this model,

  • and I'm going to predict x test on that. And you'll see that it gives me some values that

  • are really close to zero and some that are really close to one. And that's because we have a sigmoid

  • output. So if I do this, and what I can do is I can cast them. So I'm going to say anything that's

  • greater than 0.5, set that to one. So let's see what happens if I do this.

  • Oh, okay, so I have to cast that as type. And so now you'll see that it's ones and zeros. And I'm

  • actually going to transform this into a column as well. So here I'm going to Oh, oops, I didn't

  • mean to do that. Okay, no, I wanted to just reshape it to that. So now it's one dimensional.
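
A sketch of the prediction and thresholding step just described, using the best model kept from the grid search:

```python
from sklearn.metrics import classification_report

# sigmoid outputs are probabilities; threshold at 0.5 to get hard 0/1 labels
y_pred = least_loss_model.predict(X_test)
y_pred = (y_pred > 0.5).astype(int).reshape(-1,)

print(classification_report(y_test, y_pred))
```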

  • Okay. And using that we can actually just rerun the classification report based on these this

  • neural net output. And you'll see that, okay, the F1s and the accuracy give us 87%. So it

  • seems like what happened here is the precision on class zero. So the hadrons has increased a bit,

  • but the recall decreased. But the F one score is still at a good point eight one. And for the other

  • class, it looked like the precision decreased a bit and the recall increased, for an overall F1 score

  • that's also increased. I think I interpreted that properly. I mean, we went through all this

  • work and we got a model that performs actually very, very similarly to the SVM model that we

  • had earlier. And the whole point of this exercise was to demonstrate, okay, these are how you can

  • define your models. But it's also to say, hey, maybe, you know, neural nets are very, very

  • powerful, as you can tell. But sometimes, you know, an SVM or some other model might actually be more

  • appropriate. But in this case, I guess it didn't really matter which one we use at the end. An 87%

  • accuracy score is still pretty good. So yeah, let's now move on to regression.

  • We just saw a bunch of different classification models. Now let's shift gears into regression,

  • the other type of supervised learning. If we look at this plot over here, we see a bunch of scattered

  • data points. And here we have our x value for those data points. And then we have the corresponding y

  • value, which is now our label. And when we look at this plot, well, our goal in regression is to find

  • the line of best fit that best models this data. Essentially, we're trying to let's say we're given

  • some new value of x that we don't have in our sample, we're trying to say, okay, what would my

  • prediction for y be for that given x value. So that, you know, might be somewhere around there.

  • I don't know. But remember, in regression that, you know, given certain features,

  • we're trying to predict some continuous numerical value for y.

  • In linear regression, we want to take our data and fit a linear model to this data. So in this case,

  • our linear model might look something along the lines of here. Right. So this here would be

  • considered as maybe our line of best fit. And this line is modeled by the equation, I'm going to write

  • it down here, y equals b zero, plus b one x. Now b zero just means it's this y intercept. So if we

  • extend this line down here, this value here is b zero, and then b one defines the slope of this

  • line. Okay. All right. So that's the formula

  • for linear regression. And how exactly do we come up with that formula? What are we trying to do

  • with this linear regression? You know, we could just eyeball where the line should be, but humans are

  • not very good at eyeballing certain things like that. I mean, we can get close, but a computer is

  • better at giving us a precise value for b zero and b one. Well, let's introduce the concept of

  • something known as a residual. Okay, so residual, you might also hear this being called the error.

  • And what that means is, let's take some data point in our data set. And we're going to evaluate how

  • far off is our prediction from a data point that we already have. So this here is our y, let's say,

  • this is 1, 2, 3, 4, 5, 6, 7, 8. So this is y eight, let's call it; you'll see that I use this y i

  • in order to represent, hey, just one of these points. Okay. So this here is y eight, and this here

  • would be the prediction. Oops, this here would be the prediction for y eight, which I've labeled

  • with this hat. Okay, if it has a hat on it, that means hey, this is what this is my guess this is

  • my prediction for you know, this specific value of x. Okay. Now the residual would be this distance

  • here between y eight and y hat eight. So y eight minus y hat eight. All right, because that would

  • give us this here. And I'm just going to take the absolute value of this. Because what if it's below

  • the line, right, then you would get a negative value, but distance can't be negative. So we're

  • just going to put absolute value bars around this quantity.

  • And that gives us the residual or the error. So let me rewrite that. And you know, to generalize

  • to all the points, I'm going to say the residual can be calculated as y i minus y hat of i. Okay.

  • So this just means the distance between some given point, and its prediction, its corresponding

  • prediction on the line. So now, with this residual, this line of best fit is generally trying to

  • decrease these residuals as much as possible. So now that we have some value for the error,

  • our line of best fit is trying to decrease the error as much as possible for all of the different

  • data points. And that might mean, you know, minimizing the sum of all the residuals. So this

  • here, this is the sum symbol. And if I just stick the residual calculation in there,

  • it looks something like that, right. And I'm just going to say, okay, for all of the eyes in our

  • data set, so for all the different points, we're going to sum up all the residuals. And I'm going

  • to try to decrease that with my line of best fit. So I'm going to find the B0 and B1, which gives

  • me the lowest value of this. Okay. Now in other, you know, sometimes in different circumstances,

  • we might attach a squared to that. So we're trying to decrease the sum of the squared residuals.

  • And what that does is it just, you know, it adds a higher penalty for how far off we are from,

  • you know, points that are further off. So that is linear regression, we're trying to find

  • this equation, some line of best fit that will help us decrease this measure of error

  • with respect to all the data points that we have in our data set, and try to come up with

  • the best prediction for all of them. This is known as simple linear regression.
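
In equation form, what was just described: the residual for point i, the simple linear regression model, and the quantity the line of best fit tries to minimize (here the sum of squared residuals):

```latex
\text{residual}_i = |y_i - \hat{y}_i|,
\qquad
\hat{y} = b_0 + b_1 x,
\qquad
\min_{b_0,\, b_1} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
```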

  • And basically, that means, you know, our equation looks something like this. Now, there's also

  • multiple linear regression, which just means that hey, if we have more than one value for x, so like

  • think of our feature vectors, we have multiple values in our x vector, then our predictor might

  • look something more like this. Actually, I'm just going to say etc, plus b n, x n. So now I'm coming

  • up with some coefficient for all of the different x values that I have in my vector. Now you guys

  • might have noticed that I have some assumptions over here. And you might be asking, okay, Kylie,

  • what in the world do these assumptions mean? So let's go over them.

  • The first one is linearity.

  • And what that means is, let's say I have a data set. Okay.

  • Linearity just means, okay, my does my data follow a linear pattern? Does y increase as x

  • increases? Or does y decrease as x increases? So if y increases or decreases at a constant

  • rate as x increases, then you're probably looking at something linear. So what's the example of a

  • nonlinear data set? Let's say I had data that might look something like that. Okay. So now just

  • visually judging this, you might say, okay, seems like the line of best fit might actually be some

  • curve like this. Right. And in this case, we don't satisfy that linearity assumption anymore.

  • So with linearity, we basically just want our data set to follow some sort of linear trajectory.

  • And independence, our second assumption

  • just means this point over here, it should have no influence on this point over here,

  • or this point over here, or this point over here. So in other words, all the points,

  • all the samples in our data set should be independent. Okay, they should not rely on

  • one another, they should not affect one another.

  • Okay, now, normality and homoscedasticity, those are concepts which use this residual. Okay. So if

  • I have a plot that looks something like this, and I have a plot that looks like this. Okay,

  • something like this. And my line of best fit is somewhere here, maybe it's something like that.

  • In order to look at these normality and homoscedasticity assumptions, let's look at

  • the residual plot. Okay. And what that means is I'm going to keep my same x axis. But instead

  • of plotting now where they are relative to this y, I'm going to plot these errors. So now I'm

  • going to plot y minus y hat like this. Okay. And now you know, this one is slightly positive,

  • so it might be here, this one down here is negative, it might be here. So our residual plot,

  • it's literally just a plot of how you know, the values are distributed around our line of best

  • fit. So it looks like it might, you know, look something like this. Okay. So this might be our

  • residual plot. And what normality means, so our assumptions are normality and homoscedasticity,

  • I might have butchered that spelling, I don't really know. But what normality is saying is

  • saying, okay, these residuals should be normally distributed. Okay, around this line of best fit,

  • it should follow a normal distribution. And now what homoscedasticity says is, okay, the variance

  • of these points should remain constant throughout. So this spread here should be approximately the

  • same as this spread over here. Now, what's an example of where you know, homoscedasticity is

  • not held? Well, let's say that our original plot actually looks something like this.

  • Okay, so now if we looked at the residuals for that, it might look something

  • like that. And now if we look at this spread of the points, it decreases, right? So now the spread

  • is not constant, which means that homoscedasticity, this assumption would not be fulfilled, and it

  • might not be appropriate to use linear regression. So that's just linear regression. Basically,

  • we have a bunch of data points, we want to predict some y value for those. And we're trying to come

  • up with this line of best fit that best describes, hey, given some value x, what would be my best

  • guess of what y is. So let's move on to how do we evaluate a linear regression model. So the first

  • measure that I'm going to talk about is known as mean absolute error, or MAE

  • for short, okay. And mean absolute error is basically saying, all right, let's take

  • all the errors. So all these residuals that we talked about, let's sum up the distance

  • for all of them, and then take the average. And then that can describe, you know, how far off are

  • we. So the mathematical formula for that would be, okay, let's take all the residuals.

  • Alright, so this is the distance. Actually, let me redraw a plot down here. So

  • suppose I have a data set, look like this. And here are all my data points, right. And now let's

  • say my line looks something like that. So my mean absolute error would be summing up all of these

  • values. This was a mistake. So summing up all of these, and then dividing by how many data points

  • I have. So what would be all the residuals, it would be y i, right, so every single point,

  • minus y hat i, so the prediction for that on here. And then we're going to sum over all of

  • all of the different i's in our data set. Right, so i, and then we divide by the number of points

  • we have. So actually, I'm going to rewrite this to make it a little clearer. So i is equal to

  • whatever the first data point is all the way through the nth data point. And then we divide

  • it by n, which is how many points there are. Okay, so this is our measure of mae. And this is basically

  • telling us, okay, on average, this is the distance between our predicted value and the

  • actual value in our training set. Okay. And mae is good because it allows us to, you know, when we

  • get this value here, we can literally directly compare it to whatever units the y value is in.

  • So let's say y is we're talking, you know, the prediction of the price of a house, right, in

• dollars. Once we calculate the MAE, we can literally say, oh, on average, how much we're off by is literally this many dollars. Okay. So that's the

  • mean absolute error. An evaluation technique that's also closely related to that is called the mean

  • squared error. And this is MSE for short. Okay. Now, if I take this plot again, and I duplicated

  • and move it down here, well, the gist of mean squared error is kind of the same, but instead

  • of the absolute value, we're going to square. So now the MSE is something along the lines of,

  • okay, let's sum up something, right, so we're going to sum up all of our errors.

  • So now I'm going to do y i minus y hat i. But instead of absolute valuing them,

  • I'm going to square them all. And then I'm going to divide by n in order to find the mean. So

  • basically, now I'm taking all of these different values, and I'm squaring them first before I add

  • them to one another. And then I divide by n. And the reason why we like using mean squared error

  • is that it helps us punish large errors in the prediction. And later on, MSE might be important

  • because of differentiability, right? So a quadratic equation is differentiable, you know,

  • if you're familiar with calculus, a quadratic equation is differentiable, whereas the absolute

  • value function is not totally differentiable everywhere. But if you don't understand that,

  • don't worry about it, you won't really need it right now. And now one downside of mean squared

  • error is that once I calculate the mean squared error over here, and I go back over to y, and I

  • want to compare the values. Well, it gets a little bit trickier to do that because now my mean squared

• error is in terms of y squared, right? This is now squared. So instead of just how many dollars off am I, I'm talking about how many dollars squared off am I. And that,

  • you know, to humans, it doesn't really make that much sense. Which is why we have created

  • something known as the root mean squared error. And I'm just going to copy this diagram over here

  • because it's very, very similar to mean squared error. Except now we take a big squared root.

• Okay, so this is our MSE, and we take the square root of that mean squared error. And so now the

  • term in which you know, we're defining our error is now in terms of that dollar sign symbol again.

  • So that's a pro of root mean squared error is that now we can say, okay, our error according

• to this metric is this many dollars off from our prediction. Okay, so it's in the same units, which is nice. Both formulas are written out below.
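Using the same notation as before:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$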

• And now finally, there is the coefficient

  • of determination, or r squared. And this is a formula for r squared. So r squared is equal

  • to one minus RSS over TSS. Okay, so what does that mean? Basically, RSS stands for the sum

  • of the squared residuals. So maybe it should be SSR instead, but

  • RSS sum of the squared residuals, and this is equal to if I take the sum of all the values,

  • and I take y i minus y hat, i, and square that, that is my RSS, right, it's a sum of the squared

  • residuals. Now TSS, let me actually use a different color for that.

  • So TSS is the total sum of squares.

  • And what that means is that instead of being with respect to this prediction,

  • we are instead going to

  • take each y value and just subtract the mean of all the y values, and square that.

  • Okay, so if I drew this out,

  • and if this were my

  • actually, let's use a different color. Let's use green. If this were my predictor,

  • so RSS is giving me this measure here, right? It's giving me some estimate of how far off we are from

  • our regressor that we predicted. Actually, I'm gonna take this one, and I'm gonna take this one,

  • and actually, I'm going to use red for that. Well, TSS, on the other hand, is saying, okay,

  • how far off are these values from the mean. So if we literally didn't do any calculations for the

  • line of best fit, if we just took all the y values and average all of them, and said, hey,

  • this is the average value for every single x value, I'm just going to predict that average value

  • instead, then it's asking, okay, how far off are all these points from that line?

  • Okay, and remember that this square means that we're punishing larger errors, right? So even if

• they look somewhat close in terms of distance, the further away a few data points are, the larger our total sum of squares is going to be. Sorry, that was my dog. So the total sum of

  • squares is taking all of these values and saying, okay, what is the sum of squares, if I didn't do

  • any regressor, and I literally just calculated the average of all the y values in my data set,

  • and for every single x value, I'm just going to predict that average, which means that okay,

  • like, that means that maybe y and x aren't associated with each other at all. Like the

  • best thing that I can do for any new x value, just predict, hey, this is the average of my data set.

  • And this total sum of squares is saying, okay, well, with respect to that average,

• what is our error? Right? So up here, the sum of the squared residuals, this is telling us what our error is with respect to this line of best fit. Well, our total sum of squares is

  • saying what is the error with respect to, you know, just the average y value. And if our line

• of best fit is a better fit than just predicting that average, that means that, you know, this numerator is going to be smaller than this denominator, right?

  • And if our errors in our line of best fit are much smaller, then that means that this ratio

  • of the RSS over TSS is going to be very small, which means that R squared is going to go towards

  • one. And now when R squared is towards one, that means that that's usually a sign that we have a

• good predictor. It's one of the signs, not the only one. Putting all of those pieces into symbols:
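$$\text{RSS} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \text{TSS} = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2, \qquad R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}$$

where $\bar{y}$ is the mean of all the $y$ values.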

• Now, over here, I also have this adjusted R squared. And what that does is adjust for the number of terms.

  • So x1, x2, x3, etc. It adjusts for how many extra terms we add, because usually when we,

  • you know, add an extra term, the R squared value will increase because that'll help us predict

• y some more. But the value for the adjusted R squared will only increase if the new term actually

  • improves this model fit more than expected, you know, by chance. So that's what adjusted

  • R squared is. I'm not, you know, it's out of the scope of this one specific course.

  • And now that's linear regression. Basically, I've covered the concept of residuals or errors.

  • And, you know, how do we use that in order to find the line of best fit? And you know,

  • our computer can do all the calculations for us, which is nice. But behind the scenes,

  • it's trying to minimize that error, right? And then we've gone through all the different

  • ways of actually evaluating a linear regression model and the pros and cons of each one.
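If you would rather not code these metrics by hand, scikit-learn ships all of them. A small self-contained sketch (the numbers here are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                    # back in the same units as y
r2 = r2_score(y_true, y_pred)          # 1 - RSS/TSS

print(mae, mse, rmse, r2)
```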

  • So now let's look at an example. So we're still on supervised learning. But now we're just going to

  • talk about regression. So what happens when you don't just want to predict, you know, type 123?

  • What happens if you actually want to predict a certain value? So again, I'm on the UCI machine

  • learning repository. And here I found this data set about bike sharing in Seoul, South Korea.

• So this data set is for predicting rental bike count, that is, the count of bikes rented at each

  • hour. So what we're going to do, again, you're going to go into the data folder, and you're going

  • to download this CSV file. And we're going to move over to collab again. And here I'm going to name

  • this FCC bikes and regression. I don't remember what I called the last one. But yeah, FCC bikes

  • regression. Now I'm going to import a bunch of the same things that I did earlier. And, you know,

  • I'm going to also continue to import the oversampler and the standard scaler. And then I'm actually

• also just going to let you guys know that I have a few more things I want to import. So this is a library that lets us copy things. Seaborn is a wrapper over matplotlib. So it also allows us

  • to plot certain things. And then just letting you know that we're also going to be using

  • TensorFlow. Okay, so one more thing that we're also going to be using, we're going to use the

  • sklearn linear model library. Actually, let me make my screen a little bit bigger. So yeah,

  • awesome. Run this and that'll import all the things that we need. So again, I'm just going to,

  • you know, give some credit to where we got this data set. So let me copy and paste this UCI thing.

  • And I will also give credit to this here.

  • Okay, cool. All right, cool. So this is our data set. And again, it tells us all the different

  • attributes that we have right here. So I'm actually going to go ahead and paste this in here.

  • Feel free to copy and paste this if you want me to read it out loud, so you can type it.

• It's bike count, hour, temperature, humidity, wind, visibility, dew point temperature, radiation, rain, snow, and functional, whatever that means. Okay, so I'm going to come over here and import my data

  • by dragging and dropping. All right. Now, one thing that you guys might actually need to do is

• you might actually have to open up the CSV, because there were, at first, a few forbidden characters in mine, at least. So you might have to get rid of some, I think there was a degree symbol here,

  • but my computer wasn't recognizing it. So I got rid of that. So you might have to go through

  • and get rid of some of those labels that are incorrect. I'm going to do this. Okay. But

• after we've done that and imported it in here, I'm going to create a data frame from that. So,

  • all right, so now what I can do is I can read that CSV file and I can get the data into here.

  • So so like data dot CSV. Okay, so now if I call data dot head, you'll see that I have all the

  • various labels, right? And then I have the data in there. So I'm going to from here, I'm actually

  • going to get rid of some of these columns that, you know, I don't really care about. So here,

• I'm going to, when I type this in, I'm going to drop maybe the date, whether or not it's a

• holiday, and the various seasons. So I'm just not going to care about these things. Axis equals one means drop it from the columns. So now you'll see that, okay, we still have, I mean,

• I guess you don't really notice it. But if I set the data frame's columns equal to that list of dataset columns,

  • and I look at, you know, the first five things, then you'll see that this is now our data set.

  • It's a lot easier to read. So another thing is, I'm actually going to

  • df functional. And we're going to create this. So remember that our computers are not very good

• at language, we want it to be in zeros and ones. So here, I will convert that: if this is equal to yes, then it gets mapped to one, and then we set the type to integer. All right.

• Great. Cool. So the thing is, right now, these bike counts are for every hour of the day. So to make this example simpler, I'm just going to index on a single hour and only use that specific hour.

• So this data frame is only going to be the data frame where the hour equals 12. Okay, so it's noon. All right. So now you'll see that all the hours are equal to 12. And I'm actually going to now drop that column, with axis equals one. Alright,

• so we run this cell. Okay, so now we got rid of the hour column, and we just have the bike count,

• the temperature, humidity, wind, visibility, and so on. The data prep so far is sketched below.
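Roughly, the data prep up to this point looks like the sketch below; the file name, the original column names being dropped, and the literal "Yes" value are assumptions based on the narration and the UCI page.

```python
import pandas as pd

dataset_cols = ["bike_count", "hour", "temp", "humidity", "wind", "visibility",
                "dew_pt_temp", "radiation", "rain", "snow", "functional"]

df = pd.read_csv("SeoulBikeData.csv").drop(["Date", "Holiday", "Seasons"], axis=1)
df.columns = dataset_cols
df["functional"] = (df["functional"] == "Yes").astype(int)   # text -> 0/1
df = df[df["hour"] == 12]                                    # keep only noon
df = df.drop(["hour"], axis=1)
print(df.head())
```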

• Alright, so what I want to do next is to plot each of these features against the bike count. Since bike count is my first column, I'm going to say: for each label in the data frame's

  • columns, everything after the first thing, so that would give me the temperature and

  • onwards. So these are all my features, right? I'm going to just scatter. So I want to see how that

  • label how that specific data, how that affects the by count. So I'm going to plot the bike count on

  • the y axis. And I'm going to plot, you know, whatever the specific label is on the x axis.

  • And I'm going to title this, whatever the label is. And, you know, make my y label, the bike count

• at noon. And the x label as just the label. Okay, now, I guess we don't even need the legend, so just show that plot. All right. So it seems like functional doesn't really give us any utility, and neither do snow or rain. Radiation,

• you know, is fairly linear, and so is dew point temperature. Visibility and wind don't really seem to do much. Humidity maybe has kind of an inverse relationship. But the temperature definitely

  • looks like there's a relationship between that and the number of bikes, right. So what I'm actually

  • going to do is I'm going to drop some of the ones that don't don't seem like they really matter. So

• maybe wind, you know, visibility. Yeah, so I'm going to get rid of wind, visibility, and functional.

  • So now data frame, and I'm going to drop wind, visibility, and functional. All right. And the

  • axis again is the column. So that's one. So if I look at my data set, now, I have just the

  • temperature, the humidity, the dew point temperature, radiation, rain, and snow. So again,

  • what I want to do is I want to split this into my training, my validation and my test data set,

  • just as we talked before. Here, we can use the exact same thing that we just did. And we can say

• numpy dot split, sample the whole data frame to shuffle it, and then create our train, validation, and test splits of the data frame. Okay.
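The split pattern being reused is roughly the one below; the 60/20/20 fractions are an assumption carried over from the earlier part of the video.

```python
import numpy as np

# shuffle the rows, then cut at 60% and 80% to get train / validation / test
train, val, test = np.split(df.sample(frac=1), [int(0.6 * len(df)), int(0.8 * len(df))])
```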

• So I don't really care about, you know, the full stacked array that gets returned. So I'm just going to

  • use an underscore for that variable. But I will get my training x and y's. And actually, I don't

  • have a function for getting the x and y's. So here, I'm going to write a function defined,

  • get x y. And I'm going to pass in the data frame. And I'm actually going to pass in what the name

• of the y label is, and what specific x labels I want to look at. So here, if that's none, then I'm going to get everything from the data set that's not the y label. So here, I'm actually going to first make a deep copy of my data frame.

  • And that basically means I'm just copying everything over. If, if like x labels is none,

  • so if not x labels, then all I'm going to do is say, all right, x is going to be whatever this

  • data frame is. And I'm just going to take all the columns. So C for C, and data frame, dot columns,

  • if C does not equal the y label, right, and I'm going to get the values from that. But if there

  • is the x labels, well, okay, so in order to index only one thing, so like, let's say I pass in only

  • one thing in here, then my data frame is, so let me make a case for that. So if the length of x

  • labels is equal to one, then what I'm going to do is just say that this is going to be x labels,

  • and add that just that label values, and I actually need to reshape to make this 2d.

  • So I'm going to pass in negative one comma one there. Now, otherwise, if I have like a list of

  • specific x labels that I want to use, then I'm actually just going to say x is equal to data

  • frame of those x labels, dot values. And that should suffice. Alright, so now that's just me

  • extracting x. And in order to get my y, I'm going to do y equals data frame, and then passing the y

  • label. And at the very end, I'm going to say data equals NP dot h stack. So I'm stacking them horizontally

  • one next to each other. And I'll take x and y, and return that. Oh, but this needs to be values.

  • And I'm actually going to reshape this to make it 2d as well so that we can do this h stack.

• And I will return data, X, and y. A consolidated sketch of roughly that helper is below.
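Here is roughly that helper, reconstructed from the narration, so treat the exact names as approximations:

```python
import copy
import numpy as np

def get_xy(dataframe, y_label, x_labels=None):
    dataframe = copy.deepcopy(dataframe)
    if x_labels is None:
        X = dataframe[[c for c in dataframe.columns if c != y_label]].values
    elif len(x_labels) == 1:
        # a single feature needs a reshape so X stays 2-D
        X = dataframe[x_labels[0]].values.reshape(-1, 1)
    else:
        X = dataframe[x_labels].values

    y = dataframe[y_label].values.reshape(-1, 1)
    data = np.hstack((X, y))
    return data, X, y
```

A call like `_, X_train_temp, y_train_temp = get_xy(train, "bike_count", x_labels=["temp"])` would then pull out just the temperature column (again, these names are assumptions).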

• So now I should be able to call get x y on the data frame. My y label is the bike count. And for the x labels, I'm actually

• going to just use one dimension right now. Earlier, I got rid of the plots, but we

  • had seen that maybe, you know, the temperature dimension does really well. And we might be able

• to use that to predict y. So I'm going to label this also that, you know, it's just using the

• temperature. And I am also going to do this again for, oh, this should be train, this should be validation, and this should be test. Alright, so we run this and now we have our training, validation, and test

  • data sets for just the temperature. So if I look at x train temp, it's literally just the temperature.

  • Okay, and I'm doing this first to show you simple linear regression. Alright, so right now I can

  • create a regressor. So I can say the temp regressor here. And then I'm going to, you know, make a

• linear regression model. And just like before, I can simply fit my x train temp, y train temp in order to train this linear regression model. Alright, and then I can also print

• this regressor's coefficients and the intercept. So if I do that, okay, this is the coefficient

• for whatever the temperature is, and then the y intercept. All

• right. And I can, you know, score, so I can get the R squared score. So I can score on x test

  • and y test. All right, so it's an r squared of around point three eight, which is better than

  • zero, which would mean, hey, there's absolutely no association. But it's also not, you know, like,

  • good, it depends on the context. But, you know, the higher that number, it means the higher that

  • the two variables would be correlated, right? Which here, it's all right. It just means there's

• maybe some association between the two. A condensed sketch of those few cells is below.
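A condensed sketch of those cells, assuming the train/test arrays came out of the helper above:

```python
from sklearn.linear_model import LinearRegression

temp_reg = LinearRegression()
temp_reg.fit(X_train_temp, y_train_temp)

print(temp_reg.coef_, temp_reg.intercept_)        # slope and y-intercept
print(temp_reg.score(X_test_temp, y_test_temp))   # R squared on the test set
```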

• But the reason why I wanted to do this in one dimension was to show you what it would look like plotted. So if I create a scatterplot,

  • and let's take the training. So this is our data. And then let's make it blue. And then if I

• also plotted, so something that I can do is say, you know, the x range I'm going to plot is a linspace, and this data goes from about negative 20 to 40, so let's take 100 evenly spaced values from there. So I'm going to plot x, and I'm going to take this temperature regressor and predict on x with that.

• Okay, and I'm going to label that the fit. And this color, let's make this red. And let's actually set the line width, so I can change how thick that line is. Okay. Now at the very end, let's create a legend. And let's,

  • all right, let's also create, you know, title, all these things that matter, in some sense. So

  • here, let's just say, this would be the bikes, versus the temperature, right? And the y label

  • would be number of bikes. And the x label would be the temperature. So I actually think that this

  • might cause an error. Yeah. So it's expecting a 2d array. So we actually have to reshape this.

  • Okay, there we go. So I just had to make this an array and then reshape it. So it was 2d. Now,

  • we see that, all right, this increases. But again, remember those assumptions that we had about

  • linear regression, like this, I don't really know if this fits those assumptions, right? I just

• wanted to show you guys though, that like, all right, this is what a line of best fit through this

  • data would look like. Okay. Now, we can do multiple linear regression, right. So I'm going to go ahead

  • and do that as well. Now, if I take my data set, and instead of the labels, it's actually what's

• my current data set right now. Alright, so let's just use all of these columns except for the bike count, right. So I'm going to just say, for the x labels, let's take the data frame's columns and remove the bike count. So does that work? This check should be whether x labels is None, and then this should work now.

• Oops, sorry. Okay. Oh, but this here, because it's not just the temperature anymore, we should actually name this, let's say, all, right. So I'm just going to quickly

  • rerun this piece here so that we have our temperature only data set. And now we have our

  • all data set. Okay. And this regressor, I can do the same thing. So I can do the all regressor.

  • And I'm going to make this the linear regression. And I'm going to fit this to x train all and y

  • train all. Okay. Alright, so let's go ahead and also score this regressor. And let's see how the

  • R squared performs now. So if I test this on the test data set, what happens? Alright, so our R

• squared seems to improve, it went from roughly point four to point five two, which is a good sign. Okay.

• And I can't necessarily plot, you know, every single dimension. But this is just to say, okay, this has improved, right? Alright, so one cool thing that you can do with

  • tensorflow is you can actually do regression, but with the neural net. So here, I'm going

• to. We already have our training data for just the temperature and, you know, for all the

  • different columns. So I'm not going to bother with splitting up the data again, I'm just going to go

  • ahead and start building the model. So in this linear regression model, typically, you know,

  • it does help if we normalize it. So that's very easy to do with tensorflow, I can just create some

  • normalizer layer. So I'm going to do tensorflow Keras layers, and get the normalization layer.

  • And the input shape for that will just be one because let's just do it again on just the

• temperature, and the axis I will make None. Now for this temp normalizer, and I should have had an equal sign there, I'm going to adapt this to x train temp, reshaped to just a single vector.

• So that should work great. Now with this model, so temp neural net model, what I can do is use, you know, tf.keras.Sequential. And I'm going to pass in this normalizer layer. And then I'm

  • going to say, hey, just give me one single dense layer with one single unit. And what that's doing

  • is saying, all right, well, one single node just means that it's linear. And if you don't add any

  • sort of activation function to it, the output is also linear. So here, I'm going to have tensorflow

  • Keras layers dot dense. And I'm just going to have one unit. And that's going to be my model. Okay.

  • So with this model, let's compile. And for our optimizer, let's use,

• let's use Adam again, and we have to pass in the learning rate. So for our learning rate, let's do 0.01, actually, let's make this one 0.1. And for the loss, I'm going to do mean squared error. Okay, so we run that, we've compiled it, okay, great.

  • And just like before, we can call history. And I'm going to fit this model. So here,

  • if I call fit, I can just fit it, and I'm going to take the x train with the temperature,

  • but reshape it. Y train for the temperature. And I'm going to set verbose equal to zero so

  • that it doesn't, you know, display stuff. I'm actually going to set epochs equal to, let's do

  • 1000. And the validation data should be let's pass in the validation data set here

  • as a tuple. And I know I spelled that wrong. So let's just run this.
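Pulling those cells together, the single-node TensorFlow model looks roughly like this; the variable names are assumptions carried over from earlier cells, and the exact import path of the Normalization layer can differ between TensorFlow versions:

```python
import tensorflow as tf

temp_normalizer = tf.keras.layers.Normalization(input_shape=(1,), axis=None)
temp_normalizer.adapt(X_train_temp.reshape(-1))

temp_nn_model = tf.keras.Sequential([
    temp_normalizer,
    tf.keras.layers.Dense(1),   # one unit, no activation -> a linear model
])
temp_nn_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss="mean_squared_error",
)

history = temp_nn_model.fit(
    X_train_temp.reshape(-1), y_train_temp,
    validation_data=(X_val_temp, y_val_temp),
    epochs=1000, verbose=0,
)
```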

• And up here, I've copied and pasted the plot loss function from before, but changed the y label to MSE, because now we're dealing with mean squared error. And I'm going to plot the loss of this history after it's done.

• So let's just wait for this to finish training and then plot. Okay, so this actually looks pretty good. We see that the values are converging. So now what I can do is

  • I'm going to go back up and take this plot. And we are going to just run that plot again. So

  • here, instead of this temperature regressor, I'm going to use the neural net regressor.

  • This neural net model.

  • And if I run that, I can see that, you know, this also gives me a linear regressor,

• you'll notice that this fit is not entirely the same as the one

  • up here. And that's due to the training process of, you know, of this neural net. So just two

• different ways to try to find the best linear regressor. Okay, but here we're using back

  • propagation to train a neural net node, whereas in the other one, they probably are not doing that.

• Okay, they're probably just computing the line of best fit directly. So, okay, given this,

• well, we can repeat the exact same exercise with our multiple linear regressions. Okay,

  • but I'm actually going to skip that part. I will leave that as an exercise to the viewer. Okay,

  • so now what would happen if we use a neural net, a real neural net instead of just, you know,

  • one single node in order to predict this. So let's start on that code, we already have our

  • normalizer. So I'm actually going to take the same setup here. But instead of, you know, this

  • one dense layer, I'm going to set this equal to 32 units. And for my activation, I'm going to use

  • Relu. And now let's duplicate that. And for the final output, I just want one answer. So I just

• want one unit. And this activation is also going to be ReLU, because I can't ever have fewer than zero bikes. So I'm just going to set that as ReLU. I'm just going to name this the neural net model.

• Okay. And at the bottom, I'm going to take this neural net model and compile it. I will actually use the same compile settings here, but instead of a learning rate of 0.01, I'll use 0.001. Okay. And I'm going to train this here.
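As a sketch, assuming the same temperature normalizer is reused in front, the deeper model being described is roughly:

```python
nn_model = tf.keras.Sequential([
    temp_normalizer,
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="relu"),   # ReLU output: bike counts can't go below zero
])
nn_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="mean_squared_error",
)
```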

  • So the history is this neural net model. And I'm going to fit that against x train temp, y train

• temp, and for the validation data, I'm going to set this again equal to x val temp and y val temp.

  • Now, for the verbose, I'm going to say equal to zero epochs, let's do 100. And here for the batch

  • size, actually, let's just not do a batch size right now. Let's just try it. Let's see what happens

  • here. And again, we can plot the loss of this history after it's done training. So let's just

  • run this. And that's not what we're supposed to get. So what is going on? Here is sequential,

  • we have our temperature normalizer, which I'm wondering now if we have to redo that.

• Okay, so we do see this decline, it's an interesting curve, but we do see it eventually.

  • So this is our loss, which all right, if decreasing, that's a good sign.

  • And actually, what's interesting is let's just let's plot this model again. So here instead of that.

  • And you'll see that we actually have this like, curve that looks something like this. So actually,

  • what if I got rid of this activation? Let's train this again. And see what happens.

• Alright, so even when I got rid of that ReLU at the end, it kind of knows what's going on. Hey, you know, it's not the best model, and maybe if we had one more layer in here... these are just things that you have

  • to play around with. When you're, you know, working with machine learning, it's like, you don't really

  • know what the best model is going to be. For example, this also is not brilliant. But I guess

  • it's okay. So my point is, though, that with a neural net, I mean, this is not brilliant, but also

  • there's like no data down here, right? So it's kind of hard for our model to predict. In fact,

  • we probably should have started the prediction somewhere around here. My point, though, is that

  • with this neural net model, you can see that this is no longer a linear predictor, but yet we still

• get an estimate of the value, right? And we can repeat this exact same exercise with the multiple inputs. So here,

  • if I now pass in all of the data, so this is my all normalizer,

  • and I should just be able to pass in that. So let's move this to the next cell. Here,

  • I'm going to pass in my all normalizer. And let's compile it. Yeah, those parameters look good.

  • Great. So here with the history, when we're trying to fit this model, instead of temp,

  • we're going to use our larger data set with all the features. And let's just train that.

  • And of course, we want to plot the loss.

  • Okay, so that's what our loss looks like. So an interesting curve, but it's decreasing.

• So before we saw that our R squared score was around point five two. Well, we don't really have

  • that with a neural net anymore. But one thing that we can measure is hey, what is the mean squared

  • error, right? So if I come down here, and I compare the two mean squared errors, so

• so I can predict on x test all. So these are my predictions using that linear regressor, well, the multiple linear regressor. So these are my linear regression predictions.

  • Okay. I'm actually going to do that at the bottom. So let me just copy and paste that cell and bring

  • it down here. So now I'm going to calculate the mean squared error for both the linear regressor

  • and the neural net. Okay, so this is my linear and this is my neural net. So if I do my neural net

  • model, and I predict x test all, I get my two, you know, different y predictions. And I can calculate

  • the mean squared error, right? So if I want to get the mean squared error, and I have y prediction

• and y real, I can do numpy dot square of the y prediction minus, you know, the real value. So this is basically squaring everything, and this should be a vector. So if I just take this entire thing and take the mean of that, that should give me the MSE. So let's just try that out.

  • And the y real is y test all, right? So that's my mean squared error for the linear regressor.

  • And this is my mean squared error for the neural net. So that's interesting. I will debug this live,

  • I guess. So my guess is that it's probably coming from this normalization layer. Because this input

  • shape is probably just six. And okay, so that works now. And the reason why is because, like,

• my inputs are, for every sample, just a one-dimensional vector of length six. So I should have had (6,) from the start, which is a tuple containing one element, a six. Okay, so it's actually interesting that my neural

• net results seem like they have a larger mean squared error than my linear regressor.
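A condensed sketch of that comparison, with hypothetical names `all_reg` for the multiple linear regressor and `nn_model_all` for the net trained on all the features:

```python
import numpy as np

def mse(y_pred, y_real):
    # flatten both so an (n, 1) vs (n,) shape mismatch can't silently broadcast
    return np.mean(np.square(np.ravel(y_pred) - np.ravel(y_real)))

y_pred_lr = all_reg.predict(X_test_all)        # multiple linear regression
y_pred_nn = nn_model_all.predict(X_test_all)   # neural net on all features

print("linear regression MSE:", mse(y_pred_lr, y_test_all))
print("neural net MSE:       ", mse(y_pred_nn, y_test_all))
```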

• One thing that we can look at is, we can actually plot the real results versus what the predictions are. So if I make some axes with plt dot axes and set the aspect equal, then I can scatter the test y values, so the actual values are on the x axis, and the predictions are on the y axis. Okay. And I can

  • label this as the linear regression predictions. Okay, so then let me just label my axes. So the

  • x axis, I'm going to say is the true values. The y axis is going to be my linear regression predictions.

  • Or actually, let's plot. Let's just make this predictions.

  • And then at the end, I'm going to plot. Oh, let's set some limits.

  • Because I think that's like approximately the max number of bikes.

  • So I'm going to set my x limit to this and my y limit to this.

  • So here, I'm going to pass that in here too. And all right, this is what we actually get for our

• linear regressor. You see that actually, they align quite well, I mean, to some extent. So 2500 is probably too much; it looks like maybe 1800 would be enough here for our limits.

  • And I'm actually going to label something else, the neural net predictions.

  • Let's add a legend. So you can see that our neural net for the larger values, it seems like

  • it's a little bit more spread out. And it seems like we tend to underestimate a little bit down

  • here in this area. Okay. And for some reason, these are way off as well.
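A sketch of that true-versus-predicted plot, reusing the hypothetical prediction arrays from the previous sketch:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_aspect("equal")
ax.scatter(y_test_all, y_pred_lr, label="linear regression preds")
ax.scatter(y_test_all, y_pred_nn, label="neural net preds")
lims = [0, 1800]                  # roughly the max bike count observed
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.plot(lims, lims, c="red")      # points on this line would be perfect predictions
ax.set_xlabel("true values")
ax.set_ylabel("predictions")
ax.legend()
plt.show()
```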

  • But yeah, so we've basically used a linear regressor and a neural net. Honestly, there are

  • sometimes where a neural net is more appropriate and a linear regressor is more appropriate.

  • I think that it just comes with time and trying to figure out, you know, and just literally seeing

  • like, hey, what works better, like here, a linear, a multiple linear regressor might actually work

  • better than a neural net. But for example, with the one dimensional case, a linear regressor would

  • never be able to see this curve. Okay. I mean, I'm not saying this is a great model either, but I'm

  • just saying like, hey, you know, sometimes it might be more appropriate to use something that's not

  • linear. So yeah, I will leave regression at that. Okay, so we just talked about supervised learning.

• And in supervised learning, we have data, we have a bunch of features for a bunch of

  • different samples. But each of those samples has some sort of label on it, whether that's a number,

  • a category, a class, etc. Right, we were able to use that label in order to try to predict

  • right, we were able to use that label in order to try to predict new labels of other points that

  • we haven't seen yet. Well, now let's move on to unsupervised learning. So with unsupervised

  • learning, we have a bunch of unlabeled data. And what can we do with that? You know, can we learn

  • anything from this data? So the first algorithm that we're going to discuss is known as k means

  • clustering. What k means clustering is trying to do is it's trying to compute k clusters from the data.

  • So in this example below, I have a bunch of scattered points. And you'll see that this

  • is x zero and x one on the two axes, which means I'm actually plotting two different features,

  • right of each point, but we don't know what the y label is for those points. And now, just looking

  • at these scattered points, we can kind of see how there are different clusters in the data set,

  • right. So depending on what we pick for k, we might have different clusters. Let's say k equals two,

  • right, then we might pick, okay, this seems like it could be one cluster, but this here is also

  • another cluster. So those might be our two different clusters. If we have k equals three,

  • for example, then okay, this seems like it could be a cluster. This seems like it could be a

  • cluster. And maybe this could be a cluster, right. So we could have three different clusters in the

  • data set. Now, this k here is predefined, if I can spell that correctly, by the person who's running

  • the model. So that would be you. All right. And let's discuss how you know, the computer actually

  • goes through and computes the k clusters. So I'm going to write those steps down here.

  • Now, the first step that happens is we actually choose well, the computer chooses three random

• points on this plot to be the centroids. And by centroids, I just mean the centers of the clusters.

  • Okay. So three random points, let's say we're doing k equals three, so we're choosing three

  • random points to be the centroids of the three clusters. If it were two, we'd be choosing two

  • random points. Okay. So maybe the three random points I'm choosing might be here.

  • Here, here, and here. All right. So we have three different points. And the second thing that we do

  • is we actually calculate

  • the distance for each point to those centroids. So between all the points and the centroid.

  • So basically, I'm saying, all right, this is this distance, this distance, this distance,

• all of these distances, I'm computing between the points and the centroids, oops, not those two, not between the centroids themselves. So I'm computing the distances from all of these points to each of the centroids.

  • Okay. And that comes with also assigning those points to the closest centroid.

  • What do I mean by that? So let's take this point here, for example, so I'm computing

  • this distance, this distance, and this distance. And I'm saying, okay, it seems like the red one

  • is the closest. So I'm actually going to put this into the red centroid. So if I do that for

  • all of these points, it seems slightly closer to red, and this one seems slightly closer to red,

  • right? Now for the blue, I actually wouldn't put any blue ones in here, but we would probably

  • actually, that first one is closer to red. And now it seems like the rest of them are probably

  • closer to green. So let's just put all of these into green here, like that. And cool. So now we

• have, you know, our three centroids, technically. So there's this group here, there's

  • this group here. And then blue is kind of just this group here, it hasn't really touched any

  • of the points yet. So the next step, three that we do is we actually go and we recalculate the

  • centroid. So we compute new centroids based on the points that we have in all the centroids.

  • And by that, I just mean, okay, well, let's take the average of all these points. And where is that

  • new centroid? That's probably going to be somewhere around here, right? The blue one, we don't have

• any points in there, so we won't touch it, and then the green one, we can put that probably somewhere

  • over here, oops, somewhere over here. Right. So now if I erase all of the previously computed centroids,

  • I can go and I can actually redo step two over here, this calculation.

  • Alright, so I'm going to go back and I'm going to iterate through everything again,

  • and I'm going to recompute my three centroids. So let's see, we're going to take this red point,

  • these are definitely all red, right? This one still looks a bit red. Now,

  • this part, we actually start getting closer to the blues.

  • So this one still seems closer to a blue than a green, this one as well. And I think the rest

• would belong to green. Okay, so now our three clusters would be this, this, and then this, right? And so now we go back and we compute

  • the three centroids. So I'm going to get rid of this, this and this. And now where would this

  • red be centered, probably closer, you know, to this point here, this blue might be closer to

  • up here. And then this green would probably be somewhere. It's pretty similar to what we had

  • before. But it seems like it'd be pulled down a bit. So probably somewhere around there for green.

  • All right. And now, again, we go back and we compute the distance between all the points

  • and the centroids. And then we assign them to the closest centroid. Okay. So the reds are all here,

• it's very clear. Actually, let me just circle that. And it actually seems like this point is closer to this blue now. So the blues seem like they would

  • be maybe this point looks like it'd be blue. So all these look like they would be blue now.

  • And the greens would probably be this cluster right here. So we go back, we compute the centroids,

  • bam. This one probably like almost here, bam. And then the green looks like it would be probably

• here-ish. Okay. And now we go back and we compute the clusters again.

  • So red, still this blue, I would argue is now this cluster here. And green is this cluster here.

  • Okay, so we go and we recompute the centroids, bam, bam. And, you know, bam. And now if I were

  • to go and assign all the points to clusters again, I would get the exact same thing. Right. And so

  • that's when we know that we can stop iterating between steps two and three is when we've

  • converged on some solution when we've reached some stable point. And so now because none of

  • these points are really changing out of their clusters anymore, we can go back to the user

  • and say, Hey, these are our three clusters. Okay. And this process, something known as

  • expectation maximization. This part where we're assigning the points to the closest centroid,

  • this is something this is our expectation step. And this part where we're computing the new

  • centroids, this is our maximization step. Okay, so that's expectation maximization.

  • And we use this in order to compute the centroids, assign all the points to clusters,

  • according to those centroids. And then we're recomputing all that over again, until we reach

  • some stable point where nothing is changing anymore. Alright, so that's our first example

  • of unsupervised learning. And basically, what this is doing is trying to find some structure,

  • some pattern in the data. So if I came up with another point, you know, might be somewhere here,

  • I can say, Oh, it looks like that's closer to if this is a, b, c, it looks like that's closest to

  • cluster B. And so I would probably put it in cluster B. Okay, so we can find some structure

• in the data based on just how the points are scattered relative to one another. A bare-bones sketch of those two alternating steps is below.
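Here is a bare-bones numpy sketch of the two alternating steps; it is only to make the idea concrete, since scikit-learn's KMeans, used later, handles all of this for you:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1: random centroids
    for _ in range(n_iters):
        # expectation: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # maximization: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # stable -> stop iterating
            break
        centroids = new_centroids
    return labels, centroids
```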

• Now, the second unsupervised learning technique that I'm going to discuss with you guys is something known as

  • principal component analysis. And the point of principal component analysis is very often it's

  • used as a dimensionality reduction technique. So let me write that down. It's used for dimensionality

  • reduction. And what do I mean by dimensionality reduction is if I have a bunch of features like

  • x1 x2 x3 x4, etc. Can I just reduce that down to one dimension that gives me the most information

  • about how all these points are spread relative to one another. And that's what PCA is for. So PCA

  • principal component analysis. Let's say I have some points in the x zero and x one feature space.

  • Okay, so these points might be spread, you know, something like this.

  • Okay. So for example, if this were something to do with housing prices, right,

  • this here might be x zero might be hey, years since built, right, since the house was built,

  • and x one might be square footage of the house. Alright, so like years since built, I mean, like

  • right now it's been, you know, 22 years since a house in 2000 was built. Now principal component

  • analysis is just saying, alright, let's say we want to build a model, or let's say we want to,

• you know, display something about our data, but we don't have two axes to show it on. How do we demonstrate that this point is further away from this point than from this other point? We can do that using principal component analysis. So

  • take what you know about linear regression and just forget about it for a second. Otherwise,

  • you might get confused. PCA is a way of trying to find direction in the space with the largest

  • variance. So this principal component, what that means is basically the component.

  • So some direction in this space with the largest variance, okay, it tells us the most about our

• data set without using the two separate dimensions. Like, let's say we have these two different dimensions, and somebody's telling us, hey, you only get one dimension in order to show your data set.

• What dimension do you want to show us? What do we do? We want to project our data onto a single dimension.

  • Alright, so that in this case might be a dimension that looks something like

  • this. And you might say, okay, we're not going to talk about linear regression, okay.

• We don't have a y value. In linear regression, this axis would be y, but this is not y, okay, we don't have a label for that. Instead, what we're doing is we're taking the right angle projection. So for all of these points, that's not very visible, but take this right angle projection onto this line.

  • And what PCA is doing is saying, okay, map all of these points onto this one dimensional space.

  • So the transformed data set would be here.

• This one's already on the line, so we just put that there. But now this would be our new one dimensional data set. Okay, it's not our prediction or anything. This is our new data set.

• If somebody came to us and said, you only get one number to represent each of these 2D points, what number would you give? This would be the number that we gave. Okay, in this direction,

  • this is where our points are the most spread out. Right? If I took this plot,

  • and let me actually duplicate this so I don't have to rewrite anything.

  • Or so I don't have to erase and then redraw anything. Let me get rid of some of this stuff.

  • And I just got rid of a point there too. So let me draw that back.

  • Alright, so if this were my original data point, what if I had taken, you know, this to be

  • the PCA dimension? Okay, well, I then would have points that let me actually do that in different

  • color. So if I were to draw a right angle to this for every point, my points would look something

  • like this. And so just intuitively looking at these two different plots, this top one and this one,

• we can see that the points are squished a little bit closer together. Right? Which means that this is not the direction with the largest variance. The thing about the largest variance is that it gives us the most discrimination between all of these points: the larger the variance, the further spread out these points will likely be.

• And so that's the dimension that we should project onto. A different way to look at it, like what is the dimension with the largest variance, is that it also happens to be the dimension that minimizes the residuals. So if we take all the points, and

• we take the residual from that, the projection residual. So in linear regression, we were looking only at the vertical residual, the difference between y and y hat. It's not that here: in principal component analysis, we're taking the difference between our current point in two dimensional space and its projected point. Okay, so we're

  • taking that dimension. And we're saying, alright, how much, you know, how much distance is there

  • between that projection residual, and we're trying to minimize that for all of these points. So that

  • actually equates to this largest variance dimension, this dimension here, the PCA dimension,

• you can either look at it as minimizing, let me get rid of this, the projection residuals, so that's the stuff in orange, or as maximizing the variance between the points.

  • Okay. And we're not really going to talk about, you know, the method that we need in order to

  • calculate out the principal components, or like what that projection would be, because you will

  • need to understand linear algebra for that, especially eigenvectors and eigenvalues, which

  • I'm not going to cover in this class. But that's how you would find the principal components. Okay,

  • now, with this two dimensional data set here, sorry, this one dimensional data set, we started

• from a 2d data set, and we've now boiled it down to one dimension. Well, we can go and take that

  • dimension, and we can do other things with it. Right, we can, like if there were a y label,

  • then we can now show x versus y, rather than x zero and x one in different plots with that y.

  • Now we can just say, oh, this is a principal component. And we're going to plot that with

  • the y. Or for example, if there were 100 different dimensions, and you only wanted to take five of

  • them, well, you could go and you could find the top five PCA dimensions. And that might be a lot

  • more useful to you than 100 different feature vector values. Right. So that's principal component

  • analysis. Again, we're taking, you know, certain data that's unlabeled, and we're trying to make

  • some sort of estimation, like some guess about its structure from that original data set, if we

  • wanted to take, you know, a 3d thing, so like a sphere, but we only have a 2d surface to draw it

  • on. Well, what's the best approximation that we can make? Oh, it's a circle. Right PCA is kind of

  • the same thing. It's saying if we have something with all these different dimensions, but we can't

  • show all of them, how do we boil it down to just one dimension? How do we extract the most

  • information from that multiple dimensions? And that is exactly either you minimize the projection

  • residuals, or you maximize the variance. And that is PCA. So we'll go through an example of that.
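As a quick sketch of what that looks like in practice, scikit-learn's PCA will do the projection for you (the eigen-decomposition happens under the hood); the points here are made up:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1], [5.0, 9.8]])

pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)            # every 2-D point becomes a single number

print(pca.components_)                 # the direction of largest variance
print(pca.explained_variance_ratio_)   # how much of the spread it captures
print(X_1d.ravel())
```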

  • Now, finally, let's move on to implementing the unsupervised learning part of this class.

  • Here, again, I'm on the UCI machine learning repository. And I have a seeds data set where,

• you know, I have a bunch of kernels that belong to three different types of wheat. So there's Kama, Rosa, and Canadian. And the different features that we have access to are, you know,

• geometric parameters of those wheat kernels. So the area, perimeter, compactness, length, width, asymmetry, and the length of the kernel groove. Okay, so all of these are real values,

  • which is easy to work with. And what we're going to do is we're going to try to predict,

  • or I guess we're going to try to cluster the different varieties of the wheat.

• So let's get started. I have a colab notebook open again. Oh, you're gonna have to, you know, go to the data folder and download this. So the first thing to do is to import our seeds data set into our colab

  • notebook. So I've done that here. Okay, and then we're going to import all the classics again,

• so pandas. And then I'm also going to import seaborn because I'm going to want that for this

  • specific class. Okay. Great. So now our columns that we have in our seed data set are the area,

• the perimeter, the compactness, the length, the width, the asymmetry, and the groove length, I mean, I'm just going to call it groove. And then the class, right, the wheat kernel's class. So now we have to import this,

  • I'm going to do that using pandas read CSV. And it's called seeds data.csv. So I'm going to turn

  • that into a data frame. And the names are equal to the columns over here. So what happens if I just

• do that? Oops, what did I call this, seeds_dataset.txt? Alright, so if we actually look at our

  • data frame right now, you'll notice something funky. Okay. And here, you know, we have all the

• stuff under the area column. And these are all our numbers with some \t stuck onto them. So the reason is because we haven't actually told pandas what the separator is, which we can do like this. That \t is just a tab.

• And to make sure that all whitespace gets recognized as a separator, we can pass a separator that matches any whitespace, so spaces and tabs both get treated as data separators. So if I run that, now this is a lot better. Okay.
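The load ends up being something along these lines; the file name and the exact separator argument are assumptions based on the narration:

```python
import pandas as pd

cols = ["area", "perimeter", "compactness", "length",
        "width", "asymmetry", "groove", "class"]
df = pd.read_csv("seeds_dataset.txt", names=cols, sep=r"\s+")  # split on any whitespace
print(df.head())
```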

  • So now let's actually go and like visualize this data. So what I'm actually going to do is plot

  • each of these against one another. So in this case, pretend that we don't have access to the

  • class, right? Pretend that so this class here, I'm just going to show you in this example,

  • that like, hey, we can predict our classes using unsupervised learning. But for this example,

  • in unsupervised learning, we don't actually have access to the class. So I'm going to just try to

• plot these against one another and see what happens. So for some i in range of the number of columns minus one, because the class is in the columns, I'm just going to say, for j in range from i plus one until the end of the columns.

  • So this will give us basically a grid of all the different like combinations. And our x label is

• going to be columns i, our y label is going to be columns j. So those are our labels up here.

• And I'm going to use seaborn this time. And I'm going to say scatter plot my data. So our x is going

  • to be our x label. Or y is going to be our y label. And our data is going to be the data frame that

  • we're passing in. So what's interesting here is that we can say hue. And what this will do is say,

  • like if I give this class, it's going to separate the three different classes into three different

  • hues. So now what we're doing is we're basically comparing the area and the perimeter or the area

  • and the compactness. But we're going to visualize, you know, what classes they're in. So let's go

• ahead and run it, and I might have to call show. Great. So basically, for perimeter and area we get these three groups, for area and compactness we get these three groups, and so on. So these all

  • kind of look honestly like somewhat similar. Right, so Wow, look at this one. So this one,

  • we have the compactness and the asymmetry. And it looks like there's not really I mean,

  • it just looks like they're blobs, right? Sure, maybe class three is over here more, but

  • one and two kind of look like they're on top of each other. Okay. I mean, there are some that

  • might look slightly better in terms of clustering. But let's go through some of the some of the

  • clustering examples that we talked about, and try to implement those. The first thing that we're

  • going to do is just straight up clustering. So what we learned about was k means clustering.

  • So from SK learn, I'm going to import k means. Okay. And just for the sake of being able to run,

  • you know, any x and any y, I'm just going to say, hey, let's use some x. What's a good one, maybe.

  • I mean, perimeter asymmetry could be a good one. So x could be perimeter, y could be asymmetry.

  • Okay. And for this, the x values, I'm going to just extract those specific values.

  • Alright, well, let's make a k means algorithm, or let's, you know, define this. So k means,

  • and in this specific case, we know that the number of clusters is three. So let's just use that. And

  • I'm going to fit this against this x that I've just defined right here. Right. So, you know, if I

  • create this clusters, so one thing, one cool thing is I can actually go to this clusters, and I can

• say k means dot labels. And if I can type correctly, it'll give me what its

  • predictions for all the clusters are. And our actual, oops, not that. If we go to the data frame,

  • and we get the class, and the values from those, we can actually compare these two and say, hey,

  • like, you know, everything in general, most of the zeros that it's predicted, are the ones, right.

  • And in general, the twos are the twos here. And then this third class one, okay, that corresponds

  • to three. Now remember, these are separate classes. So the labels, what we actually call them don't

  • really matter. We could map zero to one, map two to two, and map one to three. Okay, and our,

  • you know, our mapping would do fairly well. But we can actually visualize this. And in order to do

  • that, I'm going to create this cluster cluster data frame. So I'm going to create a data frame.

  • And I'm going to pass in a horizontally stacked array with x, so my values for x and y. And then

  • the clusters that I have here, but I'm going to reshape them. So it's 2d.

  • Okay. And the columns, the labels for that, are going to be x, y, and class. Okay. So I'm going

  • to go ahead and do that same seaborne scatter plot. Again, where x is x, y is y. And now,

  • the hue is again the class. And the data is now this cluster data frame. Alright, so this here,

  • this here is my k means like, I guess classes.

  • So k means kind of looks like this. If I come down here and I plot, you know, my original data frame,

  • this is my original classes with respect to this specific x and y. And you'll see that, honestly,

  • like it doesn't do too poorly. Yeah, there's I mean, the colors are different, but that's fine.
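  • (A sketch of those comparison plots, reusing X, kmeans, and the assumed column names from the previous snippet.)

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# stack the 2-D points with the k-means labels so seaborn can color by "class"
cluster_df = pd.DataFrame(np.hstack((X, kmeans.labels_.reshape(-1, 1))),
                          columns=[x_label, y_label, "class"])

# k-means clusters
sns.scatterplot(x=x_label, y=y_label, hue="class", data=cluster_df)
plt.show()

# original (true) classes for the same two features
sns.scatterplot(x=x_label, y=y_label, hue="class", data=df)
plt.show()
```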

  • For the most part, it gets information of the clusters, right. And now we can do that with

  • higher dimensions. So with the higher dimensions, if we make x equal to, you know, all the columns,

  • except for the last one, which is our class, we can do the exact same thing.

  • So here, we can

  • predict this. But now, our columns are equal to our data frame columns all the way to the last one.

  • And then with this class, actually, so we can literally just say data frame columns.

  • And we can fit all of this. And now, if I want to plot the k means classes.
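  • (A rough sketch of the higher-dimensional version: fit k-means on every column except the class, then rebuild the cluster data frame with all of the original column names.)

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# all feature columns, i.e. everything except the last ("class") column
X_all = df[df.columns[:-1]].values

kmeans = KMeans(n_clusters=3)
kmeans.fit(X_all)

# same trick as before, but now with all 7 features plus the cluster labels
cluster_df = pd.DataFrame(np.hstack((X_all, kmeans.labels_.reshape(-1, 1))),
                          columns=df.columns)
```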

  • Alright, so this is my clustered and my original. So actually, let me see if I can

  • get these on the same page. So yeah, I mean, pretty similar to what we just saw. But what's

  • actually really cool is even something like, you know, if we change. So what's one of them

  • where they were like on top of each other? Okay, so compactness and asymmetry, this one's messy.

  • Right. So if I come down here, and I say compactness and asymmetry, and I'm trying to do this in 2d,

  • this is what my scatterplot looks like. So this is what, you know, my k means is telling me for these two

  • dimensions for compactness and asymmetry, if we just look at those two, these are our three classes,

  • right? And we know that the original looks something like this. And are these two remotely

  • alike? No. Okay, so now if I come back down here, and I rerun this higher dimensions one,

  • but actually, for this clusters variable, I need to get the labels of the k means again.
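  • (A sketch of that step: take the labels from the all-feature k-means fit and look at just the compactness/asymmetry slice — again, the column names here are assumptions.)

```python
import seaborn as sns
import matplotlib.pyplot as plt

x_label, y_label = "compactness", "asymmetry"   # assumed column names

# k-means run on ALL features, but plotted along only these two dimensions
sns.scatterplot(x=x_label, y=y_label, hue="class", data=cluster_df)
plt.show()

# the true classes along the same two dimensions, for comparison
sns.scatterplot(x=x_label, y=y_label, hue="class", data=df)
plt.show()
```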

  • Okay, so if I rerun this with higher dimensions,

  • well, if we zoom out, and we take a look at these two, sure, the colors are mixed up. But in general,

  • there are the three groups are there, right? This does a much better job at assessing, okay,

  • what group is what. So, for example, we could relabel the one in the original class to two.

  • And then we could make sorry, okay, this is kind of confusing. But for example, if this light pink

  • were projected onto this darker pink here, and then this dark one was actually the light pink,

  • and this light one was this dark one, then you kind of see like these correspond to one another,

  • right? Like even these two up here are the same class as all the other ones over here, which are

  • the same in the same color. So you don't want to compare the two colors between the plots,

  • you want to compare which points are in what colors in each of the plots. So that's one cool

  • application. So this is how k means functions: it's basically taking all the data points and saying,

  • All right, where are my clusters given these pieces of data? And then the next thing that we

  • talked about is PCA. So PCA, we're reducing the dimension, but we're mapping all these like,

  • you know, seven dimensions. I don't know if there are seven, I made that number up, but we're

  • mapping multiple dimensions into a lower dimension number. Right. And so let's see how that works.

  • So from SK learn decomposition, I can import PCA and that will be my PCA model.

  • So if I do PCA component, so this is how many dimensions you want to map it into.

  • And you know, for this exercise, let's do two. Okay, so now I'm taking the top two dimensions.

  • And my transformed x is going to be PCA dot fit transform, and the same x that I had up here.

  • Okay, so all the values basically: area,

  • perimeter, compactness, length, width, asymmetry, groove. Okay. So let's run that. And we've

  • transformed it. So let's look at what the shape of x used to be. So they're okay. So seven was right,

  • I had 210 samples, each seven features long, basically. And now my transformed x

  • is 210 samples, but only of length two, which means that I only have two dimensions now that

  • I'm plotting. And we can actually even take a look at, you know, the first five things.

  • Okay, so now we see each one is a two dimensional point;

  • each sample is now a two dimensional point in our new dimensions.
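  • (A minimal sketch of that PCA step, assuming X_all still holds the seven feature columns from above.)

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)                 # keep only the top 2 components
transformed_x = pca.fit_transform(X_all)  # X_all: all 7 feature columns

print(X_all.shape)          # roughly (210, 7)
print(transformed_x.shape)  # roughly (210, 2)
print(transformed_x[:5])    # each sample is now a 2-D point
```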

  • So what's cool is I can actually scatter these, so column

  • zero and column one of transformed x. So I actually have to

  • take the columns here. And if I show that,

  • basically, we've just taken this like seven dimensional thing, and we've made it into a

  • single or I guess to a two dimensional representation. So that's a point of PCA.
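  • (And that quick scatter of the two components might look like this.)

```python
import matplotlib.pyplot as plt

# the 7-dimensional data squashed down to a single 2-D picture
plt.scatter(transformed_x[:, 0], transformed_x[:, 1])
plt.show()
```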

  • And actually, let's go ahead and do the same clustering exercise as we did up here. If I take

  • the k means with this PCA data, let's construct a data frame out of that. And the data

  • frame is going to be an H stack: I'm going to take this transformed x and the clusters, reshaped.

  • So actually, instead of clusters, I'm going to use k means dot labels. And I need to reshape this.

  • So it's 2d. So we can do the H stack. And for the columns, I'm going to set this to PCA one PCA two,

  • and the class. All right. So now if I take this, I can also do the same for the truth.

  • But instead of the k means labels, I want from the data frame the original classes.

  • And I'm just going to take the values from that. And so now I have a data frame for the k means

  • with PCA and then a data frame for the truth with also the PCA. And I can now plot these similarly

  • to how I plotted these up here. So let me actually take these two.
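  • (Putting those pieces together, a rough sketch — reusing kmeans from the all-feature fit, transformed_x from the PCA step, and the assumed "class" column name; the pca1/pca2 labels are just illustrative.)

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

kmeans_pca_df = pd.DataFrame(
    np.hstack((transformed_x, kmeans.labels_.reshape(-1, 1))),
    columns=["pca1", "pca2", "class"])

truth_pca_df = pd.DataFrame(
    np.hstack((transformed_x, df["class"].values.reshape(-1, 1))),
    columns=["pca1", "pca2", "class"])

# k-means clusters in PCA space
sns.scatterplot(x="pca1", y="pca2", hue="class", data=kmeans_pca_df)
plt.show()

# true classes in PCA space
sns.scatterplot(x="pca1", y="pca2", hue="class", data=truth_pca_df)
plt.show()
```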

  • Instead of the cluster data frame, I want the k means PCA data frame. This is still going

  • to be class, but now x and y are going to be the two PCA dimensions. Okay. So these are my two PCA

  • dimensions. And you can see that, you know, they're pretty spread

  • out. And then here, I'm going to go to my truth classes. Again, it's PCA one PCA two, but instead

  • of k means this should be truth PCA data frame. So you can see that like in the truth data frame

  • along these two dimensions, we actually are doing fairly well in terms of separation, right? It does

  • seem like this is slightly more separable than the other like dimensions that we had been looking at

  • up here. So that's a good sign. And up here, you can see that hey, some of these correspond to one

  • another. I mean, for the most part, our algorithm or unsupervised clustering algorithm is able to

  • give us, is able to spit out, you know, what the proper labels are, if you map these

  • specific labels to the different types of kernels. But for example, these ones might all be the Kama

  • kernels, and same here. And then these might all be the Rosa kernels. And these might all

  • be the Canadian kernels. So it does struggle a little bit with, you know, where they overlap.

  • But for the most part, our algorithm is able to find the three different categories, and do a

  • fairly good job at predicting them without any information from us; we haven't given our

  • algorithm any labels. So that's the gist of unsupervised learning. I hope you guys enjoyed

  • this course. I hope you know, a lot of these examples made sense. If there are certain things

  • that I have done, and you know, you're somebody with more experience than me, please let me know

  • in the comments and we can all as a community learn from this together. So thank you all for watching.
