K-Means Algorithm in TensorFlow
K-Means is one of the simplest unsupervised learning algorithms. Its aim is to group n observations into k clusters, with each observation assigned to exactly one cluster: the one whose centroid is nearest. In this tutorial, we will implement the K-Means algorithm in Python using TensorFlow.
The Algorithm
The K-Means algorithm works as follows:
- First, select K randomly chosen points as the initial centroids.
- Form K clusters by assigning every point to its closest centroid.
- Recompute each centroid as the mean of the points assigned to it.
- Finally, repeat the assignment and update steps until the centroids stop changing.
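The four steps above can be sketched in plain NumPy before moving to TensorFlow. This is a minimal illustration rather than the implementation used in the rest of this tutorial: the function name `kmeans_numpy`, its `max_iter` and `seed` parameters, the two-blob test data, and the rule that an empty cluster keeps its old centroid are all assumptions made for the example.

```python
import numpy as np


def kmeans_numpy(points, k, max_iter=100, seed=0):
    """Plain NumPy K-Means sketch; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: squared distance from every point to every centroid, shape (n, k),
        # then assign each point to its closest centroid.
        distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignments = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        # An empty cluster keeps its previous centroid (a choice made for this sketch).
        new_centroids = np.array([
            points[assignments == c].mean(axis=0) if np.any(assignments == c) else centroids[c]
            for c in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignments


# Two well-separated blobs of 50 points each (illustrative data).
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(10.0, 0.5, (50, 2))])
centroids, labels = kmeans_numpy(data, k=2)
print(centroids)
```

With blobs this far apart, the two centroids should settle near the blob centres. On less clearly separated data the outcome depends on which points are drawn as initial centroids.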
It is easy to implement and usually converges within a few iterations, although the final clusters depend on which points are chosen as the initial centroids. TensorFlow has some built-in functions that will help us implement it. In contrast to supervised learning, unsupervised learning helps us find patterns in data without labels, which makes it useful for exploring raw or unknown data. Dimensionality reduction is also an important part of unsupervised learning. Now let's code and plot this.
Creating the Model
First, we will import the necessary Python libraries.
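    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    # The code in this tutorial uses the TensorFlow 1.x graph API (tf.Session, tf.assign).
    # On TensorFlow 2.x, use instead:
    #     import tensorflow.compat.v1 as tf; tf.disable_v2_behavior()
    import tensorflow as tf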
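Next, we will generate random data points and store them in a constant tensor. We will also pick the initial centers at random from those data points and store them in a variable, since they change on every iteration. Note that the two tensors have different shapes: the points have shape (points_n, 2) and the centers have shape (clusters_n, 2). To subtract one from the other, we use TensorFlow's "expand_dims" function to add an axis of size 1 to each, so that the two shapes broadcast against each other.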
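    # Problem size: clusters_n matches the 3 clusters in the final plot;
    # points_n is an assumed, illustrative value
    points_n = 200
    clusters_n = 3

    # Initialize variables
    points = tf.constant(np.random.uniform(0, 10, (points_n, 2)))
    centroids = tf.Variable(tf.slice(tf.random_shuffle(points), [0, 0], [clusters_n, -1]))

    # Equate dimensions so that the two tensors broadcast against each other
    points_expanded = tf.expand_dims(points, 0)        # shape (1, points_n, 2)
    centroids_expanded = tf.expand_dims(centroids, 1)  # shape (clusters_n, 1, 2)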
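To see why the extra axes are needed, the same shapes can be inspected in NumPy, whose broadcasting rules TensorFlow follows. This is a small sketch with assumed sizes (5 points, 3 centers), not part of the TensorFlow graph itself.

```python
import numpy as np

points_n, clusters_n = 5, 3  # small illustrative sizes
points = np.random.uniform(0, 10, (points_n, 2))
centroids = points[:clusters_n]

points_expanded = np.expand_dims(points, 0)        # shape (1, 5, 2)
centroids_expanded = np.expand_dims(centroids, 1)  # shape (3, 1, 2)

# Broadcasting stretches the size-1 axes, giving one difference
# vector for every (centroid, point) pair.
diff = points_expanded - centroids_expanded        # shape (3, 5, 2)
print(points_expanded.shape, centroids_expanded.shape, diff.shape)
```

Here `diff[c, p]` holds the vector from centroid `c` to point `p`, which is the quantity the distance calculation needs.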
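Finally, we will calculate the squared distance from every point to every centroid and assign each point to its nearest centroid. Then, in a for loop over the clusters, we will recalculate each centroid as the mean of the points assigned to it.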
    # Calculating distances: squared Euclidean distance, shape (clusters_n, points_n)
    distances = tf.reduce_sum(tf.square(tf.subtract(points_expanded, centroids_expanded)), 2)
    # Index of the nearest centroid for each point
    assignments = tf.argmin(distances, 0)

    # Updating centroids: mean of the points assigned to each cluster
    means = []
    for c in range(clusters_n):
        cluster_indices = tf.reshape(tf.where(tf.equal(assignments, c)), [1, -1])
        means.append(tf.reduce_mean(tf.gather(points, cluster_indices), axis=[1]))

    new_centroids = tf.concat(means, 0)
    update_centroids = tf.assign(centroids, new_centroids)
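A single assignment and update step is easy to check by hand on a tiny example. The sketch below mirrors the operations above in NumPy, with four made-up points and two made-up starting centroids; a boolean mask stands in for the tf.where and tf.gather pair.

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

# distances[c, p] = squared distance from centroid c to point p, shape (2, 4)
distances = np.sum(np.square(points[None, :, :] - centroids[:, None, :]), axis=2)
# Nearest centroid for each point: the first two points go to centroid 0,
# the last two to centroid 1.
assignments = np.argmin(distances, axis=0)

# New centroid = mean of the points assigned to that cluster:
# (0, 1) for cluster 0 and (10, 11) for cluster 1.
new_centroids = np.stack(
    [points[assignments == c].mean(axis=0) for c in range(len(centroids))]
)
print(distances)
print(assignments)
print(new_centroids)
```

For instance, the squared distance from centroid (1, 1) to point (0, 0) is 1 + 1 = 2, while from centroid (9, 9) it is 81 + 81 = 162, so that point is assigned to the first centroid.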
Results
To plot the results, we need the Matplotlib library, specifically its Pyplot module. On each iteration, we run the centroid update and fetch the current centroids, points, and assignments. After the last iteration, we draw a scatter plot of the points, colored by cluster, and mark the centroids on top.
    iteration_n = 100  # assumed, illustrative number of iterations
    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        sess.run(init)
        for step in range(iteration_n):
            [_, centroid_values, points_values, assignment_values] = sess.run(
                [update_centroids, centroids, points, assignments])
        print("centroids", centroid_values)

    # Final plot: points colored by cluster, centroids marked with a black x
    plt.scatter(points_values[:, 0], points_values[:, 1], c=assignment_values, s=50, alpha=0.5)
    plt.plot(centroid_values[:, 0], centroid_values[:, 1], 'kx', markersize=15)
    plt.show()
Finally, a plot like this will be generated:
We were able to divide the points into 3 clusters.
This algorithm can now be applied to other numeric data sets. Some applications of K-Means are:
- Employee Absenteeism for a particular company
- Heat Maps for Pandemics like COVID-19
- Player Stat Analysis
- Customer segmentation for various products
K-Means is a very useful algorithm for analyzing data. However, it struggles with intermixed data, where clusters overlap; in such cases we can also turn to hierarchical clustering.
Feel free to ask your questions in the comments section.