Have you heard about this concept called "machine learning", and you're trying to figure out
exactly what that means? Or maybe you've checked out a few machine learning competitions on
Kaggle.com, but you don't know how to get started? If so, I'm here to help.
My name is Kevin Markham, and I'm a data science instructor in Washington, DC. This is my brand
new video series about how to use the scikit-learn library in Python for machine learning. This
is material that I love to teach, and I can't wait to share it with you.
In this series, I'm going to cover scikit-learn from the basics all the way through advanced
techniques. I'm not going to presume any familiarity with machine learning, and in fact,
we're going to spend the next few videos talking about machine learning before we write
any code. The reason is that there's really no point in using scikit-learn if you don't know
how to do proper machine learning.
You will need to have at least minimal experience with the Python programming language, but
I'll suggest some resources in the next video if you don't yet know Python.
So with that, let's get started!
In this video, I'll be covering the following topics: What is machine learning? What are
the two main categories of machine learning? What are some examples of machine learning?
And, how does machine learning "work"?
So, what exactly is machine learning? There's no universal definition, but at a high level,
I would define machine learning as the semi-automated extraction of knowledge from data. Let's break
that down into three component parts:
First, machine learning always starts with data, and your goal is to extract knowledge
or insight from that data. You have a question you're trying to answer, and you hypothesize
that your question might be answerable using the data.
Second, machine learning involves some amount of automation. Rather than trying to gather
your insights from the data manually, you are applying some process or algorithm to
the data using a computer so that the computer can help to provide the insight.
Third, machine learning is not a fully automated process. As any practitioner can tell you,
machine learning requires you to make many smart decisions in order for the process to
be successful. We'll cover many of those decisions throughout this video series.
Next, let's talk about the two main categories of machine learning, which are supervised
learning and unsupervised learning.
Supervised learning, also known as predictive modeling, is the process of making predictions
using data. For example, if my dataset is a series of email messages, my supervised
learning task might be to predict whether each email message is spam or non-spam, which
is also known as "ham". This is supervised learning because there is a specific outcome
we are trying to predict, namely ham or spam.
In contrast, unsupervised learning is the process of extracting structure from data
or learning how to best represent data. For example, if my dataset was the characteristics
and purchasing behavior of shoppers at a grocery store, my unsupervised learning task might
be to segment the shoppers into groups or "clusters" that exhibit similar behaviors.
I might find that college students, parents with young children, and older adults have
characteristic shopping behaviors that are similar within each group but dissimilar from
the other two groups. This is an unsupervised learning task because there is no right or
wrong answer about how many clusters can be found in the data, which people belong in which
cluster, or even how to describe each cluster.
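To make the clustering idea concrete, here is a minimal sketch in plain Python rather than scikit-learn. It groups shoppers by a single made-up attribute, weekly spend, using a simple k-means style loop. The spend values, the choice of three clusters, and the starting centroids are all assumptions for illustration, and choosing them differently could produce different groups, which is exactly why there is no single right answer here.

```python
# Toy 1-D k-means: group shoppers by weekly spend (hypothetical numbers).
def kmeans_1d(values, centroids, iterations=10):
    """Alternate between assigning each value to its nearest centroid and
    moving each centroid to the mean of the values assigned to it."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# No labels here: the data is just the shoppers' spend.
weekly_spend = [22, 25, 27, 88, 92, 95, 150, 155, 160]

# Assumption: we ask for three clusters and pick the starting centroids by hand.
centroids, clusters = kmeans_1d(weekly_spend, centroids=[0, 100, 200])

for center, members in zip(centroids, clusters):
    print(round(center, 1), members)
```

Notice that the algorithm only returns groups of numbers. Deciding that a group means "college students" or "older adults" is still an interpretation made by a person.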
Let's do a quick quiz. This is the Kaggle website, which is a popular platform for machine learning
competitions. This is their well-known Titanic competition, and the goal is to predict which
passengers survived the tragic sinking of the Titanic.
Is this supervised or unsupervised learning?
This is supervised learning, because your goal is to predict a specific outcome (namely
survival) for each passenger.
In this video series, I'm going to primarily focus on supervised learning, though I may
cover unsupervised learning in later videos.
We've talked about what supervised learning is, but we haven't yet talked about how it works.
So, how does it actually work?
At a high level, here are the two main steps of supervised learning:
First, you train a machine learning model using your existing labeled data. Labeled
data is data which has been labeled with the outcome, which in the case of the email example,
is whether each message is ham or spam. This is called "model training" because the model
is learning the relationship between the attributes of the data and the outcome. These attributes
might include the message text, the number of embedded links, the length of the message,
and so on.
Second, you make predictions on new data for which you don't know the true outcome. In
other words, when a new email message arrives, you want your trained model to accurately predict
whether the email is ham or spam without a human examining it.
To summarize these two steps, you could say that the model is learning from past examples,
made up of inputs and outputs, and then applying what it has learned to future inputs
in order to predict future outputs.
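The two steps above can be sketched in a few lines of plain Python. This is a toy illustration, not scikit-learn and not a realistic spam filter: the feature values are made up, and the "model" is a simple nearest-centroid rule that I chose only because it is easy to read. The function names `train` and `predict` are hypothetical.

```python
# Step 1: train on labeled data. Each row is (number_of_links, message_length)
# and each label is the known outcome. All values are made up.
train_features = [(8, 120), (6, 90), (7, 150), (0, 400), (1, 350), (0, 500)]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

def train(features, labels):
    """Learn one average feature vector (a centroid) per label."""
    grouped = {}
    for row, label in zip(features, labels):
        grouped.setdefault(label, []).append(row)
    return {label: tuple(sum(col) / len(rows) for col in zip(*rows))
            for label, rows in grouped.items()}

def predict(model, row):
    """Predict the label whose centroid is closest to the new row."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

model = train(train_features, train_labels)

# Step 2: predict on new messages whose true label is unknown.
new_messages = [(9, 100), (0, 450)]
predictions = [predict(model, row) for row in new_messages]
print(predictions)
```

The inputs are the attributes, the outputs are the labels, and the trained model is whatever the training step learned about the relationship between them.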
Because you are making predictions on unseen data, which is data that was not used to train
the model, it is often said that the primary goal of supervised learning is to build models
that generalize. In other words, you want to build machine learning models that accurately predict
the labels of your future emails, rather than accurately predicting the labels
of emails you have already received.
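Here is a small sketch of why accuracy on emails you have already seen can be misleading. It uses a model that simply memorizes its training data (it copies the label of the closest training example) and compares its accuracy on the training data with its accuracy on held-out data. The data is hypothetical, with two deliberately "noisy" labels, and the helper names are my own.

```python
def predict_1nn(train_rows, x):
    """Return the label of the closest training example (a memorizing model)."""
    nearest = min(train_rows, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(train_rows, eval_rows):
    """Fraction of eval_rows whose label the model predicts correctly."""
    correct = sum(predict_1nn(train_rows, x) == label for x, label in eval_rows)
    return correct / len(eval_rows)

# (number_of_links, label). Hypothetical data: messages with more links tend
# to be spam, but (2, "spam") and (7, "ham") are noisy exceptions.
train_rows = [(0, "ham"), (1, "ham"), (2, "spam"), (3, "ham"),
              (6, "spam"), (7, "ham"), (8, "spam"), (9, "spam")]

# Held-out messages that were NOT used for training.
test_rows = [(0, "ham"), (2, "ham"), (4, "ham"), (7, "spam"), (9, "spam")]

train_accuracy = accuracy(train_rows, train_rows)
test_accuracy = accuracy(train_rows, test_rows)
print("accuracy on training data:", train_accuracy)
print("accuracy on unseen data:  ", test_accuracy)
```

The model looks perfect on the data it memorized, noise included, but does noticeably worse on messages it has never seen. The second number is the one that tells you how well the model generalizes.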
This simplified description of machine learning might raise some questions in your mind, such as:
How do I choose which attributes of my data to include in the model? How do I choose
which model to use? How do I optimize this model for best performance? How do I ensure
that I'm building a model that will generalize to unseen data? Can I estimate how well my
model is likely to perform on unseen data?
These are excellent questions, and hint at the complexity of doing effective machine
learning! All of these issues will be addressed later in the video series.
If you'd like a more in-depth introduction to machine learning, I recommend two resources,
both of which are linked below the video. The first resource is my favorite book
on machine learning, "An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie, and Rob Tibshirani.
It's available as a free PDF download, and section 2.1 introduces machine learning in
a thorough yet accessible way.
The second resource I recommend is a 13-minute video from Caltech's "Learning From Data" course,
which uses some excellent examples to compare supervised and unsupervised learning, and
also introduces another type of machine learning called reinforcement learning.
In the next video in this series, I'll be covering the benefits and drawbacks of scikit-learn,
as well as my recommended way to set up Python for machine learning.
In the meantime, I'd love to hear from you in the YouTube comments if you have a
question about machine learning, or if you just have a cool example of
machine learning that you'd like to share. Please do subscribe on YouTube if you'd like to be notified the moment
my next video comes out. Thanks for watching, and I'll see you soon.