K-means clustering using Scikit-learn in Python
In this tutorial, we will implement K-means clustering using the scikit-learn library in Python. Let us first understand what clustering is.
What is Clustering?
Clustering is a technique for finding groups of similar objects in a dataset. It belongs to unsupervised learning in machine learning, which analyzes unlabeled datasets and looks for structure in them without any predefined target values.
What is K-means clustering?
K-means clustering is a clustering algorithm that partitions an unlabeled dataset into K clusters. It is centroid-based, meaning each cluster is associated with a centroid. K is the number of clusters, and it must be chosen before the algorithm runs.
A centroid is simply the center of a cluster. The goal of K-means is to minimize the within-cluster sum of squares, i.e., the sum of squared distances between each point and the centroid of its cluster. Now, let’s see how the algorithm works. The k-means clustering algorithm has four main steps:
Step 1: Randomly select k samples from the dataset as the initial centroids.
Step 2: Compute the distance from each sample point to every centroid and assign the point to its closest centroid.
Step 3: Recompute each centroid as the mean of the points assigned to it, so that it sits in the middle of its cluster.
Step 4: Repeat steps 2 and 3 until the cluster assignments no longer change between iterations.
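The four steps above can be sketched in plain NumPy. This is only an illustrative sketch, not how scikit-learn implements K-means internally: the function name kmeans_sketch, its parameters, and the tiny four-point dataset are all assumptions made for this example, and an empty cluster simply keeps its previous centroid.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Hypothetical helper that follows the four steps literally."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        # Simplification: an empty cluster keeps its previous centroid.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # Within-cluster sum of squares, the quantity K-means tries to minimize.
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, wcss

# Tiny made-up dataset: two obvious groups of two points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, labels, wcss = kmeans_sketch(points, k=2)
print(centroids)
print(labels)
print(wcss)
```

Which group ends up with label 0 and which with label 1 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.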
K-means clustering using Scikit-learn
If you don’t have the scikit-learn library, you can install it using the following command:
pip install -U scikit-learn
Now let’s implement K-means clustering using this library.
Let’s first import all the required Python libraries:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
The make_blobs() function in the scikit-learn library generates blobs of points with a Gaussian distribution. Now let’s create a dataset using this function: 250 two-dimensional samples spread around 4 centers, each blob with a standard deviation of 0.7.
X, y = make_blobs(
    n_samples=250, n_features=2,
    centers=4, cluster_std=0.7,
    shuffle=True, random_state=0
)
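Before plotting, it can help to check what make_blobs() returned. The short check below is a sketch that repeats the call above so it runs on its own; the use of NumPy here is an addition for inspection only. Note that y holds the index of the blob each point was generated from, and K-means never sees it.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=250, n_features=2,
    centers=4, cluster_std=0.7,
    shuffle=True, random_state=0
)

print(X.shape)         # (250, 2): 250 samples with 2 features each
print(y.shape)         # (250,): one generating-blob index per sample
print(np.unique(y))    # the distinct blob indices
print(np.bincount(y))  # how many samples were drawn from each blob
```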
Now let’s plot the dataset which we have created. All points are drawn in a single color, since no clusters have been assigned yet:
plt.scatter(
    X[:, 0], X[:, 1],
    c='red', marker='o',
    edgecolor='black', s=50
)
plt.show()
Now let’s apply K-means clustering to this sample data.
k_means = KMeans(
    n_clusters=4, init='random',
    n_init=10, max_iter=250,
    tol=1e-04, random_state=0
)
# fit the model and get the cluster index (0 to 3) assigned to each sample
ykmeans = k_means.fit_predict(X)
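Once the model is fitted, a few attributes are worth inspecting. The sketch below repeats the setup so it runs on its own; the points in new_points are made up purely for illustration. It also recomputes the within-cluster sum of squares by hand, which should agree with inertia_ up to floating-point error.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(
    n_samples=250, n_features=2,
    centers=4, cluster_std=0.7,
    shuffle=True, random_state=0
)
k_means = KMeans(
    n_clusters=4, init='random',
    n_init=10, max_iter=250,
    tol=1e-04, random_state=0
)
ykmeans = k_means.fit_predict(X)

print(k_means.cluster_centers_.shape)  # (4, 2): one centroid per cluster
print(k_means.n_iter_)                 # iterations used by the best run
print(k_means.inertia_)                # within-cluster sum of squares

# The same quantity computed by hand from the labels and centroids.
manual_wcss = ((X - k_means.cluster_centers_[ykmeans]) ** 2).sum()
print(manual_wcss)

# Assign made-up points to the nearest learned centroid.
new_points = np.array([[0.0, 0.0], [2.0, 1.0]])
new_labels = k_means.predict(new_points)
print(new_labels)
```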
Now let’s plot the four clusters found by the K-means clustering algorithm, together with their centroids, using matplotlib:
# plotting the 4 clusters created by kmeans
plt.scatter(
    X[ykmeans == 0, 0], X[ykmeans == 0, 1],
    s=50, c='red',
    marker='o', label='cluster 1'
)
plt.scatter(
    X[ykmeans == 1, 0], X[ykmeans == 1, 1],
    s=50, c='blue',
    marker='s', label='cluster 2'
)
plt.scatter(
    X[ykmeans == 2, 0], X[ykmeans == 2, 1],
    s=50, c='green',
    marker='s', label='cluster 3'
)
plt.scatter(
    X[ykmeans == 3, 0], X[ykmeans == 3, 1],
    s=50, c='brown',
    marker='s', label='cluster 4'
)
# plotting the 4 centroids of the 4 clusters
plt.scatter(
    k_means.cluster_centers_[:, 0], k_means.cluster_centers_[:, 1],
    s=250, marker='*',
    c='gray', label='centroids'
)
plt.legend(scatterpoints=1)
plt.grid()
plt.show()