JOSH GORDON: Classifiers are only
as good as the features you provide.
That means coming up with good features
is one of your most important jobs in machine learning.
But what makes a good feature, and how can you tell?
If you're doing binary classification,
then a good feature makes it easy to decide
between two different things.
For example, imagine we wanted to write a classifier
to tell the difference between two types of dogs--
greyhounds and Labradors.
Here we'll use two features-- the dog's height in inches
and their eye color.
Just for this toy example, let's make a couple assumptions
about dogs to keep things simple.
First, we'll say that greyhounds are usually
taller than Labradors.
Next, we'll pretend that dogs have only two eye
colors-- blue and brown.
And we'll say the color of their eyes
doesn't depend on the breed of dog.
This means that one of these features is useful
and the other tells us nothing.
To understand why, we'll visualize them using a toy
dataset I'll create.
Let's begin with height.
How useful do you think this feature is?
Well, on average, greyhounds tend
to be a couple inches taller than Labradors, but not always.
There's a lot of variation in the world.
So when we think of a feature, we
have to consider how it looks for different values
in a population.
Let's head into Python for a programmatic example.
I'm creating a population of 1,000
dogs-- 50-50 greyhound Labrador.
I'll give each of them a height.
For this example, we'll say that greyhounds
are on average 28 inches tall and Labradors are 24.
Now, all dogs are a bit different.
Let's say that height is normally distributed,
so we'll make both of these plus or minus 4 inches.
This will give us two arrays of numbers,
and we can visualize them in a histogram.
I'll add a parameter so greyhounds are in red
and Labradors are in blue.
Now we can run our script.
This shows how many dogs in our population have a given height.
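The script itself is not shown in the transcript. A minimal sketch of what it describes might look like the following. It uses only the Python standard library and prints a text histogram instead of a colored plot; the function names, the random seed, the bin width, and the reading of "plus or minus 4 inches" as a standard deviation of 4 are all assumptions of this sketch, not details from the video.

```python
import random
from collections import Counter


def make_population(n_per_breed=500, seed=0):
    """Return (greyhound_heights, lab_heights) in inches."""
    rng = random.Random(seed)
    # Means of 28 and 24 inches come from the transcript; using 4 as the
    # standard deviation is this sketch's reading of "plus or minus 4 inches".
    greyhounds = [rng.gauss(28, 4) for _ in range(n_per_breed)]
    labs = [rng.gauss(24, 4) for _ in range(n_per_breed)]
    return greyhounds, labs


def text_histogram(greyhounds, labs, bin_width=2):
    """Print one row per height bin: G marks greyhounds, L marks Labradors."""
    grey_bins = Counter(int(h // bin_width) * bin_width for h in greyhounds)
    lab_bins = Counter(int(h // bin_width) * bin_width for h in labs)
    for start in sorted(set(grey_bins) | set(lab_bins)):
        # Scale down so each symbol stands for roughly five dogs.
        bar = "G" * (grey_bins[start] // 5) + "L" * (lab_bins[start] // 5)
        print(f"{start:>3}-{start + bin_width:<3} {bar}")


greyhounds, labs = make_population()
text_histogram(greyhounds, labs)
```

Rows near the bottom of the height range should be mostly L and rows near the top mostly G, with a mixed band in between.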
There's a lot of data on the screen,
so let's simplify it and look at it piece by piece.
We'll start with dogs on the far left
of the distribution-- say, who are about 20 inches tall.
Imagine I asked you to predict whether a dog with this height
was a lab or a greyhound.
What would you do?
Well, you could figure out the probability of each type
of dog given their height.
Here, it's more likely the dog is a lab.
On the other hand, if we go all the way
to the right of the histogram and look
at a dog who is 35 inches tall, we
can be pretty confident they're a greyhound.
Now, what about a dog in the middle?
You can see the graph gives us less information
here, because the probability of each type of dog is close.
So height is a useful feature, but it's not perfect.
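To put rough numbers on the three cases just described, here is a small sketch that computes the probability a dog is a greyhound given its height. It assumes the toy model from earlier (heights normally distributed with means 28 and 24 inches), a standard deviation of 4 inches, and equally common breeds; the name `prob_greyhound` is made up for this illustration.

```python
from statistics import NormalDist

# Toy height models; the standard deviation of 4 inches is an assumption.
GREYHOUND = NormalDist(mu=28, sigma=4)
LAB = NormalDist(mu=24, sigma=4)


def prob_greyhound(height):
    """P(greyhound | height), assuming both breeds are equally common."""
    g = GREYHOUND.pdf(height)
    lab = LAB.pdf(height)
    return g / (g + lab)


for height in (20, 26, 35):
    print(f"{height} in: P(greyhound) = {prob_greyhound(height):.2f}")
```

Under these assumptions the values come out to 0.18 at 20 inches, 0.50 at 26 inches, and 0.90 at 35 inches, which matches the picture: the edges of the histogram are informative and the middle is not.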
That's why in machine learning, you almost always
need multiple features.
Otherwise, you could just write an if statement
instead of bothering with the classifier.
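As an illustration of that point, the sketch below writes the single if statement and measures how often it is right on the toy population. The 26 inch cutoff (halfway between the two means), the seed, and the function name are assumptions made for this example.

```python
import random


def classify_by_height(height, threshold=26):
    # The single if statement; the 26 inch cutoff is an assumption.
    if height > threshold:
        return "greyhound"
    return "lab"


rng = random.Random(0)
# Same toy population as before: 500 dogs per breed, standard deviation 4.
dogs = [("greyhound", rng.gauss(28, 4)) for _ in range(500)]
dogs += [("lab", rng.gauss(24, 4)) for _ in range(500)]

correct = sum(classify_by_height(h) == breed for breed, h in dogs)
accuracy = correct / len(dogs)
print(f"accuracy with height alone: {accuracy:.2f}")
```

With these distributions the rule is right only about 69% of the time in expectation, because so many dogs fall in the overlapping middle of the histogram. That gap is what additional features are meant to close.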
To figure out what types of features you should use,
do a thought experiment.
Pretend you're the classifier.
If you were trying to figure out if this dog is
a lab or a greyhound, what other things would you want to know?
You might ask about their hair length,
or how fast they can run, or how much they weigh.
Exactly how many features you should use
is more of an art than a science,
but as a rule of thumb, think about how many you'd
need to solve the problem.
Now let's look at another feature like eye color.
Just for this toy example, let's imagine
dogs have only two eye colors, blue and brown.
And let's say the color of their eyes
doesn't depend on the breed of dog.
Here's what a histogram might look like for this example.
For most values, the distribution is about 50/50.
So this feature tells us nothing,
because it doesn't correlate with the type of dog.
Including a useless feature like this in your training
data can hurt your classifier's accuracy.
That's because there's a chance they might appear useful purely
by accident, especially if you have only a small amount
of training data.
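One way to see the "useful purely by accident" effect is to simulate it. In this sketch, eye color is drawn independently of breed, so the share of greyhounds among blue-eyed dogs should be about one half. The function name, sample sizes, and seeds are hypothetical choices for illustration.

```python
import random


def greyhound_share_among_blue_eyed(n_dogs, seed):
    """Fraction of blue-eyed dogs that are greyhounds in one random sample."""
    rng = random.Random(seed)
    blue_eyed = []
    for _ in range(n_dogs):
        breed = rng.choice(["greyhound", "lab"])
        # Eye color is drawn independently of breed, as the toy example assumes.
        eye = rng.choice(["blue", "brown"])
        if eye == "blue":
            blue_eyed.append(breed)
    if not blue_eyed:
        return None  # no blue-eyed dogs happened to be drawn
    return blue_eyed.count("greyhound") / len(blue_eyed)


# Five independent samples at each size.
for n in (10, 100, 10_000):
    shares = [greyhound_share_among_blue_eyed(n, seed) for seed in range(5)]
    print(n, [None if s is None else round(s, 2) for s in shares])
```

With 10,000 dogs the shares should sit very close to 0.5. With only 10 dogs they can swing widely from sample to sample, and a classifier trained on one of those small samples could mistake that noise for signal.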
You also want your features to be independent.
And independent features give you
different types of information.
Imagine we already have a feature-- height in inches--
in our dataset.
Ask yourself, would it be helpful
if we added another feature, like height in centimeters?
No, because it's perfectly correlated with one
we already have.
It's good practice to remove highly correlated features
from your training data.
That's because a lot of classifiers
aren't smart enough to realize that height in inches
and centimeters are the same thing,
so they might double count how important this feature is.
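A quick way to spot this kind of redundancy is to measure the correlation between two candidate features. The sketch below computes the Pearson correlation by hand for height in inches and the same height converted to centimeters (2.54 cm per inch). The helper name `pearson` and the sample heights are assumptions for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
height_in = [rng.gauss(26, 4) for _ in range(1000)]
# The second feature is the first one in different units.
height_cm = [h * 2.54 for h in height_in]

r = pearson(height_in, height_cm)
print(f"correlation between inches and centimeters: {r:.4f}")
```

The correlation comes out as 1.0000, which signals that the second column adds no new information and one of the two can be dropped.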
Last, you want your features to be easy to understand.
For a new example, imagine you want
to predict how many days it will take
to mail a letter between two different cities.
The farther apart the cities are, the longer it will take.
A great feature to use would be the distance
between the cities in miles.
A much worse pair of features to use
would be the cities' locations given by their latitude
and longitude.
And here's why.
I can look at the distance and make
a good guess of how long it will take the letter to arrive.
But learning the relationship between latitude, longitude,
and time is much harder and would require many more
examples in your training data.
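If the raw data only contains coordinates, one option is to compute the easier feature yourself before training. The sketch below turns two latitude and longitude pairs into a single distance in miles using the haversine formula. The function name, the Earth radius constant, and the rounded city coordinates are assumptions for illustration, not something given in the video.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius of the Earth


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Rounded coordinates, used purely for illustration.
new_york = (40.71, -74.01)
los_angeles = (34.05, -118.24)
print(f"{distance_miles(*new_york, *los_angeles):.0f} miles")
```

Four hard-to-interpret numbers become one number that relates directly to delivery time, so the classifier no longer has to discover the geometry from examples.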
Now, there are techniques you can
use to figure out exactly how useful your features are,
and even what combinations of them are best,
so you never have to leave it to chance.
We'll get to those in a future episode.
Coming up next time, we'll continue building our intuition
for supervised learning.
We'll show how different types of classifiers
can be used to solve the same problem and dive a little bit
deeper into how they work.
Thanks very much for watching, and I'll see you then.