
  • In cluster analysis, that’s how a cluster would look in two-dimensional space.

  • There are two dimensions, or two features based on which we are performing clustering.

  • For instance, ‘Age’ and ‘Money spent’.

  • Certainly, it makes no sense to have only one cluster, so let me zoom out of this graph.

  • Here’s a nice picture of clusters.

  • We can clearly see two clusters.

  • I’ll also indicate their centroids.

  • If we want to identify three clusters, this is the result we obtain.

  • And that’s more or less how clustering works graphically.

  • Okay.

  • How do we perform clustering in practice?

  • There are different methods we can apply to identify clusters.

  • The most popular one is k-means, so that’s where we will start.

  • Let’s simplify this scatter to 15 points, so we can get a better grasp of what happens.

  • Cool.

  • Here’s how k-means works.

  • First, we must choose how many clusters we’d like to have.

  • That’s where this method gets its name from.

  • K stands for the number of clusters we are trying to identify.

  • I’ll start with two clusters.

  • The next step is to specify the cluster seeds.

  • A seed is basically a starting centroid.

  • It is chosen at random or is specified by the data scientist based on prior knowledge

  • about the data.

  • One of the clusters will be the green cluster, the other one the orange cluster.

  • And these are the seeds.

  • The following step is to assign each point on the graph to a seed.

  • This is done based on proximity.

  • For instance, this point is closer to the green seed than to the orange one.

  • Therefore, it will belong to the green cluster.

  • This point, on the other hand, is closer to the orange seed, therefore, it will be a part

  • of the orange cluster.

  • In this way, we can color all points on the graph, based on their Euclidean distance from

  • the seeds.

  • Great!

  • The final step is to calculate the centroid of the green points and the orange points.

  • The green seed will move closer to the green points to become their centroid and the orange

  • will do the same for the orange points.

  • From here, we would repeat the last two steps.

  • Let’s recalculate the distances.

  • All the green points are obviously closer to the green centroid, and the orange points

  • are closer to the orange centroid.

  • What about these two?

  • Both of them are closer to the green centroid, so at this step we will reassign them to the

  • green cluster.

  • Finally, we must recalculate the centroids.

  • That’s the new result.

  • Now, all the green points are closest to the green centroid and all the orange ones to

  • the orange.

  • We can no longer reassign points, which completes the clustering process.

  • This is the two-cluster solution.
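  • If you’d like to see these steps written out, here’s a minimal sketch of the algorithm in NumPy; the points and the random seeding below are made up purely for illustration, not taken from the lesson’s data.

```python
import numpy as np

# Toy 2D data ('Age', 'Money spent'), made up for illustration
points = np.array([[20, 10], [22, 12], [25, 15], [21, 11], [24, 14],
                   [60, 80], [62, 85], [65, 90], [61, 82], [64, 88],
                   [40, 40], [42, 45], [45, 50], [41, 42], [44, 48]], dtype=float)

k = 2                                                      # number of clusters
rng = np.random.default_rng(0)
seeds = points[rng.choice(len(points), k, replace=False)]  # random starting centroids

while True:
    # Step 1: assign each point to its closest seed (Euclidean distance)
    distances = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 2: recalculate each centroid as the mean of its assigned points
    # (a real implementation would also guard against empty clusters)
    new_seeds = np.array([points[labels == c].mean(axis=0) for c in range(k)])
    if np.allclose(new_seeds, seeds):                      # nothing moved: done
        break
    seeds = new_seeds
```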

  • Alright.

  • So that’s the whole idea behind clustering.

  • In order to solidify your understanding, we will redo the process.

  • In the beginning we said that with k-means clustering, we must specify the number of

  • clusters prior to clustering, right?

  • What if we want to obtain 3 clusters?

  • The first step involves selecting the seeds.

  • Let’s have another seed.

  • We’ll use red for this one.

  • Next, we must associate each of the points with the closest seed.

  • Finally, we calculate the centroids of the colored points.

  • We already know that k-means is an iterative process.

  • So, we go back to the step where we associate each of the points with the closest seed.

  • All orange points are settled, so no movement there.

  • What about these two points?

  • Now they are closer to the red seed, so they will go into the red cluster.

  • That’s the only change in the whole graph.

  • In the end, we recalculate the centroids and reach a situation where no more adjustments

  • are necessary using the k-means algorithm.

  • We have reached a three-cluster solution.

  • This is the exact algorithm that was used to find the solution to the problem you saw

  • at the beginning of the lesson.

  • Here’s a Python-generated graph with the three clusters colored.

  • I’m sorry the colors are not the same as before, but you get the point.

  • That’s how we would usually represent the clusters graphically.

  • Great!

  • I think we have a good basis to start coding!

  • We are going to cluster these countries using k-means in Python.

  • Plus, we’ll learn a couple of nice tricks along the way.

  • Cool.

  • Let’s import the relevant libraries.

  • They are pandas, NumPy, matplotlib, dot, pyplot, and seaborn.

  • As usual, I will set the style of all graphs to the Seaborn one.

  • In this course, we will rely on scikit-learn for the actual clustering.

  • Let’s import KMeans from sklearn, dot, cluster.

  • Note that both the ‘K’ and the ‘M’ in k-means are capital.
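  • Put together, the setup cell would look roughly like this (a sketch of what’s described above):

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()                            # set the style of all graphs to the seaborn one

from sklearn.cluster import KMeans   # both the 'K' and the 'M' are capital
```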

  • Next, we will create a variable called ‘data’, where we will load the CSV file: ‘3.01.

  • Country clusters’.

  • Let’s see what’s inside.

  • We’ve got Country, Latitude, Longitude, and Language.
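  • In code, that would be something like this; the file name follows the lesson, so adjust the path to wherever your copy of the CSV lives:

```python
# Load the country clusters data set into a pandas data frame
data = pd.read_csv('3.01. Country clusters.csv')
data
```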

  • Let’s see how we gathered that data.

  • Country and language are clear.

  • What about the latitude and longitude values?

  • These entries correspond to the geographic centers of the countries in our dataset.

  • That is one way to represent location.

  • I’ll quickly give an example.

  • If you Google: ‘geographic center of US’, you’ll get a Wikipedia article, indicating

  • it to be some point in South Dakota with a latitude of 44 degrees and 58 minutes North,

  • and a longitude of 103 degrees and 46 minutes West.

  • Then we can convert them to ‘decimal degrees’ using some online converter like the one provided

  • by latlong.net.

  • It’s important to know that the convention is such that North and East are positive,

  • while West and South are negative.
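  • If you’d rather skip the online converter, the arithmetic is simple enough to sketch in Python (the helper name here is hypothetical, not something from the lesson):

```python
def to_decimal_degrees(degrees, minutes, direction):
    """Convert degrees and minutes to decimal degrees.
    By convention, North and East are positive; South and West are negative."""
    value = degrees + minutes / 60
    return -value if direction in ('S', 'W') else value

# Geographic center of the US, per the Wikipedia figures quoted above
lat = to_decimal_degrees(44, 58, 'N')    # about  44.97
lon = to_decimal_degrees(103, 46, 'W')   # about -103.77
```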

  • Okay.

  • So that’s what we did.

  • We got the decimal degrees of the geographic centers of the countries in the sample.

  • That’s not optimal as the choice of South Dakota was biased by Alaska and Hawaii, but

  • you’ll see that won’t matter too much for the clustering.

  • Right.

  • Let’s quickly plot the data.

  • If we want our data to resemble a map, we must set the axes to reflect the natural domain

  • of latitude and longitude.

  • Done.
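  • Here’s a sketch of that plot, assuming the column names we saw above:

```python
plt.scatter(data['Longitude'], data['Latitude'])
plt.xlim(-180, 180)   # natural domain of longitude
plt.ylim(-90, 90)     # natural domain of latitude
plt.show()
```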

  • If I put the actual map next to this one, you will quickly notice that this methodology,

  • while simple, is not bad at all.

  • Alright.

  • Let’s do some clustering.

  • As we did earlier, our inputs will be contained in a variable called ‘x’.

  • We will start by clustering based on location.

  • So, we want ‘x’ to contain the latitude and the longitude.

  • I’ll use the pandas method ‘iloc’.

  • We haven’t mentioned it before, so you probably don’t know it, but ‘iloc’ is a method

  • which slices a data frame.

  • The first argument indicates the row indices we want to keep, while the second indicates the

  • column indices.

  • I want to keep all rows, so I’ll put a colon as the first argument.

  • Okay.

  • Remember that pandas indices start from 0.

  • From the columns, I need ‘Latitude’ and ‘Longitude’, or columns 1 and 2.

  • So, the appropriate argument is: 1, colon, 3.

  • This will slice the 1st and the 2nd columns out of the data frame.
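  • As a one-line sketch:

```python
# Keep all rows (:), and slice out columns 1 and 2 (the stop index 3 is excluded)
x = data.iloc[:, 1:3]
```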

  • Let’s print x, to see the result.

  • Exactly as we wanted it.

  • Next, I’ll declare a variable called ‘kmeans’.

  • kmeans is equal to capital ‘K’, capital ‘M’, and lowercase ‘eans’, brackets, 2.

  • The right side is actually the KMeans method that we imported from sklearn.

  • The value in brackets is the number of clusters we want to produce.

  • So, our variable ‘kmeans’ is now an object which we will use for the clustering itself.
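  • In code:

```python
kmeans = KMeans(2)   # the argument is the number of clusters we want to produce
```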

  • Similar to what we’ve seen with regressions, the clustering itself happens using the ‘fit’

  • method.

  • kmeans, dot, fit, of x.

  • That’s all we need to write.
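  • That single line, as it would appear in the notebook:

```python
kmeans.fit(x)   # performs the clustering on the inputs in x
```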

  • This line of code will apply k-means clustering with 2 clusters to the input data in x.

  • The output indicates that the clustering has been completed with the following parameters.

  • Usually though, we don’t need to just perform the clustering but are interested in the clusters

  • themselves.

  • We can obtain the predicted clusters for each observation using the ‘fit_predict’ method.

  • Let’s declare a new variable called ‘identified_clusters’, equal to: kmeans, dot, fit_predict,

  • with input, x. I’ll also print this variable.
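  • In code, with the output the lesson describes (note that the cluster labels can swap between runs, since the starting seeds are random):

```python
identified_clusters = kmeans.fit_predict(x)
identified_clusters
# e.g. array([0, 0, 0, 0, 0, 1])
```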

  • The result is an array containing the predicted clusters.

  • There are two clusters indicated by 0 and 1.

  • You can clearly see that the first five observations are in the same cluster, zero, while the last

  • one is in cluster one.

  • Okay.

  • Let’s create a data frame so we can see things more clearly.

  • I’ll call this data frame ‘data_with_clusters’ and it will be equal to ‘data’.

  • Then I’ll add an additional column to it called ‘Cluster’, equal to ‘identified_clusters’.
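  • A sketch of that cell; the .copy() is a common precaution so the original ‘data’ frame stays intact, though the lesson doesn’t spell it out:

```python
data_with_clusters = data.copy()                       # equal to 'data'
data_with_clusters['Cluster'] = identified_clusters    # add the new column
data_with_clusters
```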

  • As you can see, we have our table with the countries, latitude, longitude, language,

  • but also cluster.

  • It seems that the USA, Canada, France, UK, and Germany are in cluster: zero, while Australia

  • is alone in cluster: one.

  • Cool!

  • Finally, let’s plot all this on a scatter plot.

  • In order to resemble the map of the world, the y-axis will be the latitude, while the x-axis will be the longitude.

  • But that’s the same graph as before, isn’t it?

  • Let’s use the first trick.

  • In matplotlib, we can set the color to be determined by a variable.
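  • For instance (the ‘rainbow’ colormap is just one reasonable choice, assumed here for illustration):

```python
plt.scatter(data_with_clusters['Longitude'], data_with_clusters['Latitude'],
            c=data_with_clusters['Cluster'], cmap='rainbow')   # color set by the Cluster variable
plt.xlim(-180, 180)
plt.ylim(-90, 90)
plt.show()
```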