  • What is going on, everybody?

  • Welcome to part six of the Python data analysis and data science with pandas tutorial series. In this last installment, we're going to be talking about applying machine learning to a pandas DataFrame.

  • So basically, what is the typical workflow when doing machine learning with pandas datasets?

  • Generally, if I'm going to do machine learning, I do all my data preprocessing in pandas first anyway, so it's just convenient to know.

  • Okay, well, what's the next step?

  • Once we've got our features, how do we feed it through a model?

  • So that's what we're doing here.

  • We're gonna be grabbing a new data set and we're gonna grab the diamonds data set.

  • So go ahead and download that.

  • Put it in your data sets directory.

  • So what's our objective gonna be?

  • Well, we're going to be predicting the price of diamonds. So basically, this table contains a bunch of values here.

  • We've got the carat, cut, color, clarity, depth, table.

  • I don't know what that means, but it's table percent.

  • We've got the price, and the x, y, z, which are the length, width, and depth of that diamond.

  • So the curiosity is: can we take all of those values except for price, feed those through a regression model, and predict the price of that diamond, so that in the future, when we get diamonds and we don't know how much to pay for them, we could just run them through a model?

  • So typical regression task here.

  • Nothing too fancy.

  • And the question is, are these features all descriptive enough to give us the price of this diamond?

  • Probably. We shall find out.

  • You can use another dataset if you'd like.

  • This one has almost 54,000 rows, which is quite a few samples that we can actually work with here, which is pretty important for traditional ML.

  • You're gonna want probably more than 10,000 rows, I would think.

  • And then, if you want to do, like, deep learning, more than 100,000 rows. We're using regular machine learning models.

  • Nothing too fancy.

  • We're not going deep learning here.

  • But if you want to learn more about what we're doing, or want to learn more about machine learning, or just scikit-learn, or just basic ML models in general, you can check out this initial machine learning tutorial series here.

  • This one, basically what we do is we go through starting with regression, which is what we're gonna do here talking about how regression works.

  • We do an applied example using scikit-learn.

  • And then we actually write a regression model ourselves.

  • And really, we do that with all of them.

  • So we do that with regression, k-nearest neighbors, support vector machines, and so on.

  • So the idea is to learn about the model, then learn how to apply it with scikit-learn.

  • And then how do we write that ourselves from scratch?

  • We do use NumPy and stuff like that, but we're not using some sort of machine learning library.

  • So it's a pretty cool series.

  • Then, if you wanna learn about deep learning, check this one out.

  • So anyway, if you want to learn more about that kind of stuff and parameters, because there's a lot of parameters here that we're gonna be working with, and if you feel like you're still kind of in the gray area as to what all this stuff means, you can check that series out.

  • So now what we want to do is let's say you just are a complete amateur.

  • You can check out this choosing the right estimator chart.

  • Basically, if you just Google choosing the right estimator, you will find this.

  • This is from scikit-learn, by the way.

  • And this is just how you can pick a specific estimator. An estimator, or classifier, I mean, it's like your model.

  • How do you pick the right one for what you're working with now?

  • So for us, we have more than 50 samples.

  • We do not want to predict a category.

  • We do want to predict a quantity.

  • So we're doing regression, because we want to predict price, which is a regression task, as opposed to classification, where you're trying to predict, you know, one out of five classes. In this case it's not classification, because it could be any price, right?

  • We're just trying to come up with some sort of calculation for price.

  • Do we have less than 100,000 samples?

  • If no, so if we have a lot of samples... that's kind of confusing.

  • If you have more than 100,000 samples, go with SGDRegressor.

  • Otherwise, you should go either here or with support vector regression with a linear kernel.

  • So we'll probably use SVR with a linear kernel.

  • And again if you don't know what that is, you just click on it, right, And this will tell you.

  • Okay, here's what you need to do.

  • So Okay.

  • I understand.

  • Right?

  • Okay.

  • from sklearn import svm. Got it.

  • Here's your classifier. Got it. Do a fit. Got it. Easy, right?
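
The import, the SVR object, and the fit call mentioned above can be sketched roughly like this. The toy numbers are made up purely for illustration and are not from the diamonds data; only `svm.SVR`, `kernel="linear"`, `fit`, and `predict` come from scikit-learn itself.

```python
import numpy as np
from sklearn import svm

# Toy data, invented for this sketch: y is roughly 3 * x.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 5.9, 9.2, 11.8, 15.1])

# Support vector regression with a linear kernel, as the chart suggests.
clf = svm.SVR(kernel="linear")
clf.fit(X, y)

# Predict for one unseen sample; the result is a 1-element array.
predictions = clf.predict(np.array([[6.0]]))
print(predictions)
```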

  • So let's get started.

  • So we're gonna import pandas as pd.

  • We are going to say df = pd.read_csv("datasets/diamonds.csv"). It's not a capital D in datasets.

  • Then we also want to say index_col=0, because this dataset graces us with a lovely... that's avocado... with a lovely index.

  • Not only is it useless, it's also in string form.

  • Fascinating.

  • Uh, okay.

  • So come back over here.

  • So we're gonna say index_col=0, because that way we don't generate duplicate indexes.

  • In this case, the index column is completely useless, but there's one thing you always want to take note of. Let's go ahead and just do df.head() here.
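
As a rough sketch of that loading step: since the real file lives at datasets/diamonds.csv on your machine, the snippet below reads a few illustrative rows in the same column layout from an in-memory string instead. The `csv_text` stand-in is an assumption for the sake of a runnable example; with the real file you would pass the path to `pd.read_csv` directly.

```python
import io
import pandas as pd

# Stand-in for datasets/diamonds.csv: a few illustrative rows in the same
# layout, including the unnamed, quoted index column the file ships with.
csv_text = '''"","carat","cut","color","clarity","depth","table","price","x","y","z"
"1",0.23,"Ideal","E","SI2",61.5,55,326,3.95,3.98,2.43
"2",0.21,"Premium","E","SI1",59.8,61,326,3.89,3.84,2.31
"3",0.23,"Good","E","VS1",56.9,65,327,4.05,4.07,2.31
'''

# index_col=0 tells pandas to use the first column as the index
# instead of keeping it as a feature column.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df.head())
```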

  • Any time you're doing machine learning, it's really easy to cheat.

  • Even when you're trying not to cheat, and you're doing your very, very best not to, it is easy to.

  • So looking at this data set, it's easy to cheat.

  • Anyway, looking at this dataset, basically all of the columns here are meaningful columns, except price.

  • Well, price is meaningful as well, but price is what we're trying to predict.

  • So when we go to build this model, we actually want to use all of the columns except for price.

  • But when we get to that point, I'm gonna point something out.

  • And if I forget, someone comment below, because it's important.

  • OK, so that's our data set.

  • Now, one thing with machine learning concerns all of the data that you pass into your model.

  • At the end of the day, basically all machine learning is is linear algebra.

  • Okay, so everything has to be numbers.

  • We have to convert everything to numbers, and ideally they're meaningful numbers, because if they're not meaningful, they're useless.

  • So we have cut, color, and clarity.

  • All of these need to be converted to numerical values. In pandas, as well as probably scikit-learn, and I know it's in Keras and TensorFlow, there are always, like, to-categorical methods that you can call.

  • So, for example, what if we just did df["cut"].unique()?

  • Okay, there's not very many here, but you can imagine a scenario where there's a lot.

  • So this presents a problem.

  • We need to convert these to numerical values.

  • Well, one option you have is df["cut"]... anyway, I was trying to see if that had a meaningful order. I don't believe it does.

  • But anyway, df["cut"], and then we could say .astype, and we can convert this to type categorical. Actually, I think it's "category", and then .cat.code.

  • Let me run that real quick.

  • Uh, codes.

  • There we go.

  • Okay, so this will take cuts.

  • It will figure out, okay, how many uniques are there, and then it just assigns, you know, the first one it finds is a zero, the second one is a one, then two, three, and it just keeps doing that until it's done and has reached the maximum number. So it just assigns an arbitrary code to our cut.
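
Here is a small sketch of the category-codes approach on a handful of made-up cut values. One detail worth knowing: when pandas infers the categories from strings, it sorts them, so the codes follow alphabetical order of the labels rather than the order in which they first appear. Either way, the codes carry no quality ordering.

```python
import pandas as pd

# A few made-up rows; only the cut labels matter for this sketch.
df = pd.DataFrame({"cut": ["Ideal", "Premium", "Good", "Premium", "Fair", "Very Good"]})

# The distinct labels, in order of first appearance.
print(df["cut"].unique())

# Convert to a categorical dtype and read off the integer codes.
# Inferred categories are sorted: Fair, Good, Ideal, Premium, Very Good.
codes = df["cut"].astype("category").cat.codes
print(codes.tolist())  # [2, 3, 1, 3, 0, 4]
```

Note that "Very Good" ends up with the highest code even though it is not the best cut, which is why this encoding is a poor fit for ordered features.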

  • The problem is, we're doing regression.

  • Likely linear regression.

  • And we would prefer our values here to have meaning behind them.

  • Because this isn't arbitrary, right?

  • Premium is better than Fair, and so on, so we want to preserve that order.

  • So we're gonna preserve that order.

  • So just know this exists so later.

  • If you're doing classification, you just need arbitrary classes.

  • So when you're doing classification, uh, you could use this totally to make your classes into codes.

  • But in this case for features, we want our features to be meaningful, so we're not gonna do that.

  • So instead, what I'm gonna do is create dictionaries for all these things.

  • And in fact, I'm gonna copy and paste these from the text-based tutorial, because there's nothing to glean here.

  • So, copy, paste.

  • Whoa, let me zoom out a little bit there, a little far in there, and then copy and paste this one.

  • So this will just be dictionaries that we're gonna map.

  • How did I know this? Am I a diamond aficionado?

  • No.

  • I found these keys from the description here of the data set.

  • They ordered them and stuff for us.

  • So, um yeah.

  • So now, interestingly, it started this at 0. It shouldn't matter, but that's funny.

  • I kinda wanna fix it.

  • Let's fix it.

  • I don't know how I just noticed that now, but anyway, let me fix that real quick.

  • Four, five, six, and seven.

  • I don't really want to pass a value of zero to my regression model if I can avoid it.

  • I'll fix that in the text-based version.

  • Okay, so now that we have these, we just want to map them. So I'm just gonna come in here and say df["cut"] = df["cut"].map(cut_class_dict).

  • Right?

  • That's it.

  • And then we're gonna do the exact same thing for clarity and color.

  • So I'm just gonna copy, paste, copy, paste, do this, copy, paste, paste.

  • And then it's just clarity_dict and then color_dict.

  • So color is all I want: color, color, and color.

  • Awesome.

  • So with all those, let's just check it with a df.head() at the very end.

  • Great.
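
A minimal sketch of that mapping step follows. The dictionary names match the ones used in the video, but the exact numeric values here are an assumption: they simply number each grade from worst to best starting at 1, using the standard orderings for this dataset (cut from Fair to Ideal, color from J to D, clarity from I1 to IF). The sample rows are made up.

```python
import pandas as pd

# Ordered worst-to-best; the specific integers are assumed for illustration.
cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
clarity_dict = {"I1": 1, "SI2": 2, "SI1": 3, "VS2": 4, "VS1": 5, "VVS2": 6, "VVS1": 7, "IF": 8}
color_dict = {"J": 1, "I": 2, "H": 3, "G": 4, "F": 5, "E": 6, "D": 7}

# Made-up rows standing in for the real DataFrame.
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "J", "D"],
    "clarity": ["SI2", "VS1", "IF"],
})

# Replace each label with its ordered numeric value.
df["cut"] = df["cut"].map(cut_class_dict)
df["clarity"] = df["clarity"].map(clarity_dict)
df["color"] = df["color"].map(color_dict)

print(df.head())
```

Any label missing from a dictionary would come out as NaN after `map`, so it is worth checking for nulls on the real data.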

  • Our dataset is now basically ready.

  • It's been converted.

  • We're ready to pass this through an actual model.

  • Sort of.

  • We'll talk about why not in a second. First, we need scikit-learn.

  • Let's go to our favorite command prompt and do a pip install scikit-learn.

  • We will grab scikit-learn.

  • While we're doing that, I'm just gonna come back over, I think, and just start typing. So, cool.

  • So we're going to import sklearn, and then we're gonna go from sklearn import svm.

  • So I want to run this, but I can't because it probably hasn't installed yet. Oh, we have.

  • Cool.

  • So now can I do that?

  • Awesome.

  • Okay, so what we want to do first: any time you've got data, you probably wanna shuffle that data, because often we're gonna train in, like, order.

  • And the latest thing is gonna be more biasing than the first thing that model saw.

  • So it's usually a good idea to shuffle the data, especially if that dataset is sorted in any way. And is this dataset sorted in any way? Well, if we come over here and scroll to the tippy top, this dataset appears to be ordered by price, so that's a problem.

  • And then also, it could become a problem later on.

  • So first I'm gonna shuffle this data set just to get it over with.

  • So there's many ways to shuffle a data frame.

  • There's a way in pandas to do it by shuffling by index.

  • But I don't want to do that because that's ugly.

  • I just don't like the way pandas has you do it.

  • With scikit-learn it's really simple, and we're using scikit-learn already.

  • So I'm gonna say df = sklearn.utils.shuffle(df). Done.

  • That data frame is shuffled.
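
A short sketch of the shuffle on a tiny made-up frame. The `random_state=0` argument is an addition here so the example is reproducible; the video calls `shuffle` without it. The point to notice is that whole rows move together, so each price stays attached to its own features.

```python
import pandas as pd
import sklearn.utils

# Made-up frame, sorted by price like the real file appears to be.
df = pd.DataFrame({
    "row_id": [0, 1, 2, 3, 4],
    "price": [100, 200, 300, 400, 500],
})

# Shuffle the rows; random_state is only here to make the sketch repeatable.
df = sklearn.utils.shuffle(df, random_state=0)
print(df)
```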

  • Now we want to assign values for X and y. In machine learning, generally, capital X is your feature set.

  • Lowercase y (sometimes you'll see people use uppercase Y) is your labels.

  • So what is X? X is the feature set.

  • So this is the list of features that points to that label.

  • What's the label? Price.

  • So the list of features is basically everything except for price.

  • Right.

  • So this is pretty simple. In this case, we just do df.drop.

  • And we're just gonna drop that price column on axis 1, because we're not trying to drop rows.

  • We're trying to drop columns.

  • Then convert that to .values, and then the next thing is y, which is just df["price"].values.
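
The feature/label split described above can be sketched like this, using a small made-up frame in place of the full diamonds data.

```python
import pandas as pd

# Made-up rows with two features and the price label.
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": [5, 4, 2],
    "price": [326, 326, 334],
})

# X: every column except price, as a NumPy array (axis=1 drops a column).
X = df.drop("price", axis=1).values

# y: just the price column, as a NumPy array.
y = df["price"].values

print(X.shape, y.shape)
```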

  • Now, I did say that I wanted to point something out, so let's just run that real quick.

  • And then let's print... let's just look at X really quickly.

  • So look at how X compares to up here.

  • For example.

  • It appears that this is indeed Kera... carat.

  • I'm thinking Keras.

  • Anyway, this is carat right now.

  • One thing to watch out for: imagine that at the very beginning we did not do index_col.

  • If we hadn't done that, this index value here, of course, is probably a string. Pandas might have fixed that for us, but imagine it didn't.

  • And now one of your columns was actually gonna be the index, which is just an incremental number, one to whatever.

  • Would that be a problem, or would that just be noise?

  • Well, that would be a problem that would inform the model of price.

  • Why?

  • Because this dataset is sorted by index, apparently, because the index just increments by one.

  • And it was sorted by price.

  • So that is one way that you could, just unbeknownst to you, have cheated.

  • And that's the kind of stuff you gotta watch out for.
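
To make that leak concrete, here is a rough sketch with made-up rows that are already sorted by price. It simulates forgetting index_col=0 by turning the row counter into an ordinary column, then compares how well that counter and a genuine feature track price by rank. The column names and numbers are illustrative only.

```python
import pandas as pd

# Made-up rows, already sorted by price like the diamonds file appears to be.
df = pd.DataFrame({
    "carat": [0.9, 0.3, 0.5, 0.4, 0.7],
    "price": [326, 340, 355, 410, 520],
})

# Simulate forgetting index_col=0: the row counter becomes an "index" column.
leaky = df.reset_index()

# Rank correlation with price, computed as Pearson correlation on ranks.
price_rank = leaky["price"].rank()
index_corr = leaky["index"].rank().corr(price_rank)
carat_corr = leaky["carat"].rank().corr(price_rank)

# The row counter tracks price perfectly here, so a model could use it to cheat.
print(index_corr, carat_corr)
```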

  • It's super difficult. You're gonna find this kind of stuff happens all the time.

  • It's really possible that I'm cheating in some way already on this dataset.

  • I didn't spend too long trying to make sure I don't, but if I do, let me know below.

  • Um But anyway, uh, that's the kind of stuff you gotta watch out for because it will get you.

  • So anyways, it appears that we've done it right.

  • One way you could always be certain is, instead of, you know, blacklisting just the price column,