
• In cluster analysis, that’s how a cluster would look in two-dimensional space.

• There are two dimensions, or two features based on which we are performing clustering.

• For instance, ‘Age’ and ‘Money spent’.

• Certainly, it makes no sense to have only one cluster, so let me zoom out of this graph.

• Here’s a nice picture of clusters.

• We can clearly see two clusters.

• I’ll also indicate their centroids.

• If we want to identify three clusters, this is the result we obtain.

• And that’s more or less how clustering works graphically.

• Okay.

• How do we perform clustering in practice?

• There are different methods we can apply to identify clusters.

• The most popular one is k-means, so that’s where we will start.

• Let’s simplify this scatter to 15 points, so we can get a better grasp of what happens.

• Cool.

• Here’s how k-means works.

• First, we must choose how many clusters we’d like to have.

• That’s where this method gets its name from.

• K stands for the number of clusters we are trying to identify.

• The next step is to specify the cluster seeds.

• A seed is basically a starting centroid.

• It is chosen at random or is specified by the data scientist based on prior knowledge about the data.

• One of the clusters will be the green cluster, the other one the orange cluster.

• And these are the seeds.

• The following step is to assign each point on the graph to a seed, which is done based on proximity.

• For instance, this point is closer to the green seed than to the orange one.

• Therefore, it will belong to the green cluster.

• This point, on the other hand, is closer to the orange seed; therefore, it will be a part of the orange cluster.

• In this way, we can color all points on the graph based on their Euclidean distance from the seeds.
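
The proximity rule can be sketched in a few lines of NumPy. The coordinates below are made up purely for illustration; they are not the points from the graph.

```python
import numpy as np

# Hypothetical coordinates for one point and the two seeds
point = np.array([2.0, 3.0])
green_seed = np.array([1.0, 2.0])
orange_seed = np.array([8.0, 7.0])

# Euclidean distance from the point to each seed
d_green = np.linalg.norm(point - green_seed)
d_orange = np.linalg.norm(point - orange_seed)

# The point joins the cluster of the nearer seed
cluster = "green" if d_green < d_orange else "orange"
print(cluster)  # → green
```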

• Great!

• The final step is to calculate the centroid of the green points and the orange points.

• The green seed will move closer to the green points to become their centroid, and the orange one will do the same for the orange points.

• From here, we would repeat the last two steps.

• Let’s recalculate the distances.

• All the green points are obviously closer to the green centroid, and the orange points are closer to the orange centroid.

• What about these two?

• Both of them are closer to the green centroid, so at this step we will reassign them to the green cluster.

• Finally, we must recalculate the centroids.

• That’s the new result.

• Now, all the green points are closest to the green centroid and all the orange ones to the orange one.

• We can no longer reassign points, which completes the clustering process.

• This is the two-cluster solution.
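
The whole loop just described — assign each point to its nearest centroid, recompute the centroids, and repeat until nothing moves — can be sketched with NumPy. This is a toy illustration of the idea, not the code we will write later; the points and seeds are made up.

```python
import numpy as np

def kmeans_sketch(points, seeds, max_iter=100):
    """Toy k-means: alternate assignment and centroid updates."""
    centroids = seeds.astype(float).copy()
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        new_centroids = np.array([points[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])
        if np.allclose(new_centroids, centroids):
            break  # no more adjustments: clustering is complete
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of points and two made-up seeds
points = np.array([[0., 0.], [0., 1.], [1., 0.],
                   [10., 10.], [10., 11.], [11., 10.]])
seeds = np.array([[2., 2.], [8., 8.]])
labels, centroids = kmeans_sketch(points, seeds)
print(labels)  # → [0 0 0 1 1 1]
```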

• Alright.

• So, that’s the whole idea behind clustering.

• In order to solidify your understanding, we will redo the process.

• In the beginning, we said that with k-means clustering we must specify the number of clusters prior to clustering, right?

• What if we want to obtain 3 clusters?

• The first step involves selecting the seeds.

• Let’s have another seed.

• We’ll use red for this one.

• Next, we must associate each of the points with the closest seed.

• Finally, we calculate the centroids of the colored points.

• We already know that k-means is an iterative process.

• So, we go back to the step where we associate each of the points with the closest seed.

• All orange points are settled, so no movement there.

• What about these two points?

• Now they are closer to the red seed, so they will go into the red cluster.

• That’s the only change in the whole graph.

• In the end, we recalculate the centroids and reach a situation where no more adjustments are necessary, using the k-means algorithm.

• We have reached a three-cluster solution.

• This is the exact algorithm that was used to find the solution to the problem you saw at the beginning of the lesson.

• Here’s a Python-generated graph with the three clusters colored.

• I’m sorry the colors are not the same, but you get the point.

• That’s how we would usually represent the clusters graphically.

• Great!

• I think we have a good basis to start coding!

• We are going to cluster these countries using k-means in Python.

• Plus, we’ll learn a couple of nice tricks along the way.

• Cool.

• Let’s import the relevant libraries.

• They are pandas, NumPy, matplotlib, dot, pyplot, and seaborn.

• As usual, I will set the style of all graphs to the Seaborn one.

• In this course, we will rely on scikit-learn for the actual clustering.

• Let’s import KMeans from sklearn, dot, cluster.

• Note that both the ‘K’ and the ‘M’ in k-means are capital.
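
Put together, the imports described so far look roughly like this (a sketch of the setup, with seaborn’s style applied to all plots):

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  # use the seaborn style for all graphs

# Both the 'K' and the 'M' in KMeans are capital
from sklearn.cluster import KMeans
```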

• Next, we will create a variable called ‘data’, where we will load the CSV file: ‘3.01. Country clusters’.

• Let’s see what’s inside.

• We’ve got Country, Latitude, Longitude, and Language.

• Let’s see how we gathered that data.

• Country and language are clear.

• What about the latitude and longitude values?

• These entries correspond to the geographic centers of the countries in our dataset.

• That is one way to represent location.

• I’ll quickly give an example.

• If you Google ‘geographic center of US’, you’ll get a Wikipedia article indicating it to be some point in South Dakota, with a latitude of 44 degrees and 58 minutes North and a longitude of 103 degrees and 46 minutes West.

• Then we can convert them to ‘decimal degrees’ using some online converter, like the one provided by latlong.net.

• It’s important to know that the convention is such that North and East are positive, while West and South are negative.
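
As a sketch, the conversion works like this (the helper name is made up; the coordinates are the geographic center of the US from the example):

```python
def dms_to_decimal(degrees, minutes, direction):
    """Convert degrees and minutes to decimal degrees.
    Convention: North and East positive, South and West negative."""
    sign = -1 if direction in ("S", "W") else 1
    return sign * (degrees + minutes / 60)

lat = dms_to_decimal(44, 58, "N")    # 44°58' N
lon = dms_to_decimal(103, 46, "W")   # 103°46' W
print(round(lat, 4), round(lon, 4))  # → 44.9667 -103.7667
```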

• Okay.

• So that’s what we did.

• We got the decimal degrees of the geographic centers of the countries in the sample.

• That’s not optimal, as the choice of South Dakota was biased by Alaska and Hawaii, but you’ll see that won’t matter too much for the clustering.

• Right.

• Let’s quickly plot the data.

• If we want our data to resemble a map, we must set the axes to reflect the natural domain of latitude and longitude.

• Done.

• If I put the actual map next to this one, you will quickly notice that this methodology, while simple, is not bad at all.
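
A plot along these lines might look as follows; the rows are illustrative stand-ins for the CSV (approximate geographic centers), and longitude goes on the x-axis so the scatter resembles a map:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative stand-in for the '3.01. Country clusters' data
data = pd.DataFrame({
    'Country':   ['USA', 'Canada', 'France', 'UK', 'Germany', 'Australia'],
    'Latitude':  [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    'Longitude': [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
})

plt.scatter(data['Longitude'], data['Latitude'])
plt.xlim(-180, 180)  # natural domain of longitude
plt.ylim(-90, 90)    # natural domain of latitude
plt.show()
```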

• Alright.

• Let’s do some clustering.

• As we did earlier, our inputs will be contained in a variable called ‘x’.

• We will start by clustering based on location.

• So, we want ‘x’ to contain the latitude and the longitude.

• I’ll use the pandas method ‘iloc’.

• We haven’t mentioned it before and you probably don’t know it, but ‘iloc’ is a method which slices a data frame.

• The first argument indicates the row indices we want to keep, while the second indicates the column indices.

• I want to keep all rows, so I’ll put a colon as the first argument.

• Okay.

• Remember that pandas indices start from 0.

• From the columns, I need ‘Latitude’ and ‘Longitude’, or columns 1 and 2.

• So, the appropriate argument is: 1, colon, 3.

• This will slice the 1st and the 2nd columns out of the data frame.

• Let’s print x, to see the result.

• Exactly as we wanted it.
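
Here’s that slicing sketched on a small stand-in data frame (values illustrative):

```python
import pandas as pd

data = pd.DataFrame({
    'Country':   ['USA', 'Canada', 'Australia'],
    'Latitude':  [44.97, 62.40, -25.45],
    'Longitude': [-103.77, -96.80, 133.11],
    'Language':  ['English', 'English', 'English'],
})

# iloc[rows, columns]: the colon keeps all rows; 1:3 keeps columns
# 1 and 2 (the upper bound is exclusive), i.e. Latitude and Longitude
x = data.iloc[:, 1:3]
print(x.columns.tolist())  # → ['Latitude', 'Longitude']
```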

• Next, I’ll declare a variable called ‘kmeans’.

• ‘kmeans’ is equal to capital ‘K’, capital ‘M’, and lowercase ‘eans’, brackets, 2.

• The right-hand side is actually the KMeans method that we imported from sklearn.

• The value in brackets is the number of clusters we want to produce.

• So, our variable ‘kmeans’ is now an object which we will use for the clustering itself.

• Similar to what we’ve seen with regressions, the clustering itself happens using the ‘fit’ method.

• kmeans, dot, fit, of x.

• That’s all we need to write.

• This line of code will apply k-means clustering with 2 clusters to the input data in ‘x’.

• The output indicates that the clustering has been completed with the following parameters.

• Usually, though, we don’t need to just perform the clustering but are interested in the clusters themselves.

• We can obtain the predicted clusters for each observation using the ‘fit_predict’ method.

• Let’s declare a new variable called ‘identified_clusters’, equal to: kmeans, dot, fit_predict, with input x. I’ll also print this variable.

• The result is an array containing the predicted clusters.

• There are two clusters indicated by 0 and 1.

• You can clearly see that the first five observations are in the same cluster, zero, while the last one is in cluster one.
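
The clustering steps just described fit in a few lines; the data frame below is an illustrative stand-in for the CSV:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Illustrative stand-in for the country data
data = pd.DataFrame({
    'Country':   ['USA', 'Canada', 'France', 'UK', 'Germany', 'Australia'],
    'Latitude':  [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    'Longitude': [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
})
x = data.iloc[:, 1:3]   # Latitude and Longitude

kmeans = KMeans(2)      # ask for two clusters
kmeans.fit(x)           # perform the clustering
identified_clusters = kmeans.fit_predict(x)
print(identified_clusters)  # e.g. [0 0 0 0 0 1] (labels may be swapped)
```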

• Okay.

• Let’s create a data frame so we can see things more clearly.

• I’ll call this data frame ‘data_with_clusters’ and it will be equal to ‘data’.

• Then I’ll add an additional column to it called ‘Cluster’, equal to ‘identified_clusters’.

• As you can see, we have our table with the countries, latitude, longitude, and language, but also the cluster.

• It seems that the USA, Canada, France, UK, and Germany are in cluster zero, while Australia is alone in cluster one.
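
That step might look like this; I use a copy so the original frame stays untouched, and the cluster labels are the illustrative ones from before:

```python
import pandas as pd

data = pd.DataFrame({
    'Country':  ['USA', 'Canada', 'France', 'UK', 'Germany', 'Australia'],
    'Language': ['English', 'English', 'French', 'English', 'German', 'English'],
})
identified_clusters = [0, 0, 0, 0, 0, 1]

# Copy the frame and attach the cluster labels as a new column
data_with_clusters = data.copy()
data_with_clusters['Cluster'] = identified_clusters
print(data_with_clusters)
```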

• Cool!

• Finally, let’s plot all this on a scatter plot.

• In order to resemble the map of the world, the x-axis will be the longitude, while the y-axis the latitude.

• But that’s the same graph as before, isn’t it?

• Let’s use the first trick.

• In matplotlib, we can set the color to be determined by a variable.
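
The trick is matplotlib’s `c` argument: pass the column of cluster labels and each point is colored by its cluster, with a colormap mapping label values to colors (data again illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

data_with_clusters = pd.DataFrame({
    'Latitude':  [44.97, 62.40, 46.75, 54.01, 51.15, -25.45],
    'Longitude': [-103.77, -96.80, 2.40, -2.53, 10.40, 133.11],
    'Cluster':   [0, 0, 0, 0, 0, 1],
})

# c= colors each point by its cluster label; cmap picks the palette
sc = plt.scatter(data_with_clusters['Longitude'],
                 data_with_clusters['Latitude'],
                 c=data_with_clusters['Cluster'], cmap='rainbow')
plt.xlim(-180, 180)
plt.ylim(-90, 90)
plt.show()
```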