In cluster analysis, that's how a cluster would look in two-dimensional space. There are two dimensions, or two features, based on which we are performing clustering. For instance, 'Age' and 'Money spent'. Certainly, it makes no sense to have only one cluster, so let me zoom out of this graph. Here's a nice picture of clusters. We can clearly see two clusters. I'll also indicate their centroids. If we want to identify three clusters, this is the result we obtain. And that's more or less how clustering works graphically.

Okay. How do we perform clustering in practice? There are different methods we can apply to identify clusters. The most popular one is k-means, so that's where we will start. Let's simplify this scatter plot to 15 points, so we can get a better grasp of what happens.

Cool. Here's how k-means works. First, we must choose how many clusters we'd like to have. That's where this method gets its name from: k stands for the number of clusters we are trying to identify. I'll start with two clusters. The next step is to specify the cluster seeds. A seed is basically a starting centroid. It is chosen at random or is specified by the data scientist based on prior knowledge about the data. One of the clusters will be the green cluster, the other one the orange cluster, and these are the seeds.

The following step is to assign each point on the graph to a seed, which is done based on proximity. For instance, this point is closer to the green seed than to the orange one, so it will belong to the green cluster. This point, on the other hand, is closer to the orange seed, therefore it will be a part of the orange cluster. In this way, we can color all points on the graph based on their Euclidean distance from the seeds.

Great! The final step is to calculate the centroid of the green points and the orange points. The green seed will move closer to the green points to become their centroid, and the orange seed will do the same for the orange points. From here, we would repeat the last two steps. Let's recalculate the distances. All the green points are obviously closer to the green centroid, and the orange points are closer to the orange centroid. What about these two? Both of them are closer to the green centroid, so at this step we will reassign them to the green cluster. Finally, we must recalculate the centroids. That's the new result. Now all the green points are closest to the green centroid and all the orange ones to the orange. We can no longer reassign points, which completes the clustering process. This is the two-cluster solution.

Alright. So that's the whole idea behind k-means clustering. In order to solidify your understanding, we will redo the process. In the beginning we said that with k-means clustering we must specify the number of clusters prior to clustering, right? What if we want to obtain 3 clusters? The first step involves selecting the seeds. Let's have another seed; we'll use red for this one. Next, we must associate each of the points with the closest seed. Finally, we calculate the centroids of the colored points. We already know that k-means is an iterative process, so we go back to the step where we associate each of the points with the closest seed. All orange points are settled, so no movement there. What about these two points? Now they are closer to the red seed, so they will go into the red cluster. That's the only change in the whole graph. Before we wrap up this three-cluster example, let's see the assign-and-recalculate loop written down as code.
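Here is a minimal NumPy sketch of the loop we have been applying by eye: assign each point to its nearest seed, move each seed to the centroid of its points, and repeat until nothing changes. It is purely illustrative; the 15 random points and the random seed choice are made up for the example, and this is not the scikit-learn implementation we will use in a moment.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 10, size=(15, 2))  # 15 hypothetical 2-D observations
k = 2
# Cluster seeds: k points chosen at random as starting centroids
seeds = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(100):                       # repeat the two steps
    # Step 1: assign each point to the closest seed (Euclidean distance)
    distances = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 2: move each seed to the centroid of its assigned points
    # (assuming no cluster ends up empty)
    new_seeds = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_seeds, seeds):      # no seed moved, so we are done
        break
    seeds = new_seeds
```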
Back to the graph. In the end, we recalculate the centroids and reach a situation where no more adjustments are necessary: the k-means algorithm has converged, and we have reached a three-cluster solution. This is the exact algorithm which was used to find the solution of the problem you saw at the beginning of the lesson. Here's a Python-generated graph with the three clusters colored. I am sorry the colors are not the same, but you get the point. That's how we would usually represent the clusters graphically.

Great! I think we have a good basis to start coding. We are going to cluster these countries using k-means in Python. Plus, we'll learn a couple of nice tricks along the way.

Cool. Let's import the relevant libraries. They are pandas, NumPy, matplotlib, dot, pyplot, and Seaborn. As usual, I will set the style of all graphs to the Seaborn one. In this course, we will rely on scikit-learn for the actual clustering. Let's import KMeans from sklearn, dot, cluster. Note that both the 'K' and the 'M' in KMeans are capital.

Next, we will create a variable called 'data', where we will load the CSV file: '3.01. Country clusters'. Let's see what's inside. We've got Country, Latitude, Longitude, and Language.

Let's see how we gathered that data. Country and language are clear. What about the latitude and longitude values? These entries correspond to the geographic centers of the countries in our dataset. That is one way to represent location. I'll quickly give an example. If you Google 'geographic center of US', you'll get a Wikipedia article indicating it to be some point in South Dakota with a latitude of 44 degrees and 58 minutes North, and a longitude of 103 degrees and 46 minutes West. Then we can convert these to decimal degrees using some online converter, like the one provided by latlong.net. It's important to know that the convention is such that North and East are positive, while South and West are negative. For example, 44 degrees and 58 minutes North becomes 44 + 58/60, or about 44.97, while 103 degrees and 46 minutes West becomes -(103 + 46/60), or about -103.77. Okay. So that's what we did: we got the decimal degrees of the geographic centers of the countries in the sample. That's not optimal, as the choice of the South Dakota point was biased by Alaska and Hawaii, but you'll see that won't matter too much for the clustering.

Right. Let's quickly plot the data. If we want our data to resemble a map, we must set the axes to reflect the natural domain of latitude and longitude. Done. If I put the actual map next to this one, you will quickly notice that this methodology, while simple, is not bad at all.

Alright. Let's do some clustering. As we did earlier, our inputs will be contained in a variable called 'x'. We will start by clustering based on location, so we want 'x' to contain the latitude and the longitude. I'll use the pandas method 'iloc'. We haven't mentioned it before, so you probably don't know it, but 'iloc' is a method which slices a data frame. The first argument indicates the row indices we want to keep, while the second indicates the column indices. I want to keep all rows, so I'll put a colon as the first argument. Okay. Remember that pandas indices start from 0. From the columns, I need 'Latitude' and 'Longitude', or columns 1 and 2. So the appropriate argument is: 1, colon, 3. The upper bound, 3, is not included, so this will slice the 1st and the 2nd columns out of the data frame. Let's print x to see the result. Exactly as we wanted it.

Next, I'll declare a variable called 'kmeans'. Kmeans is equal to capital 'K', capital 'M', and lowercase 'eans', brackets, 2. The right-hand side is actually the KMeans class that we imported from sklearn.
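Put together, the code dictated so far looks roughly like this. The CSV file name follows the transcript; this is a sketch, and the exact calls may vary slightly with your library versions.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()                               # set the Seaborn style for all graphs

from sklearn.cluster import KMeans

# Load the lesson's dataset
data = pd.read_csv('3.01. Country clusters.csv')

# All rows (the colon), columns 1 up to but not including 3,
# i.e. 'Latitude' and 'Longitude'
x = data.iloc[:, 1:3]

# A KMeans object configured to look for 2 clusters
kmeans = KMeans(2)
```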
The value in brackets is the number of clusters we want to produce. So our variable 'kmeans' is now an object which we will use for the clustering itself. Similar to what we've seen with regressions, the clustering itself happens using the 'fit' method: kmeans, dot, fit, of x. That's all we need to write. This line of code will apply k-means clustering with 2 clusters to the input data from x. The output indicates that the clustering has been completed with the following parameters.

Usually, though, we don't just need to perform the clustering; we are interested in the clusters themselves. We can obtain the predicted cluster for each observation using the 'fit predict' method. Let's declare a new variable called 'identified clusters', equal to: kmeans, dot, fit predict, with input x. I'll also print this variable. The result is an array containing the predicted clusters. There are two clusters, indicated by 0 and 1. You can clearly see that the first five observations are in the same cluster, zero, while the last one is in cluster one.

Okay. Let's create a data frame so we can see things more clearly. I'll call this data frame 'data with clusters', and it will be equal to 'data'. Then I'll add an additional column to it called 'Cluster', equal to 'identified clusters'. As you can see, we have our table with the countries, latitude, longitude, and language, but also the cluster. It seems that the USA, Canada, France, the UK, and Germany are in cluster zero, while Australia is alone in cluster one.

Cool! Finally, let's plot all this on a scatter plot. In order to resemble the map of the world, the x-axis will be the longitude, while the y-axis will be the latitude. But that's the same graph as before, isn't it? Let's use the first trick: in matplotlib, we can set the color to be determined by a variable.
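In code, these last steps might look like the sketch below. The .copy() call and the 'rainbow' colormap are my additions rather than something stated in the transcript; assigning the 'Cluster' column directly to 'data' would work just as well for this lesson.

```python
# Run the clustering, then obtain the predicted cluster of each observation
kmeans.fit(x)
identified_clusters = kmeans.fit_predict(x)
print(identified_clusters)              # e.g. [0 0 0 0 0 1]

# A copy of the original table with the cluster labels attached
data_with_clusters = data.copy()
data_with_clusters['Cluster'] = identified_clusters

# Longitude on the x-axis and latitude on the y-axis, so the scatter
# resembles a world map; the color ('c') is driven by the cluster label
plt.scatter(data_with_clusters['Longitude'],
            data_with_clusters['Latitude'],
            c=data_with_clusters['Cluster'], cmap='rainbow')
plt.xlim(-180, 180)
plt.ylim(-90, 90)
plt.show()
```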