K-Means Algorithm in TensorFlow

K-Means is one of the simplest unsupervised learning algorithms. Its aim is to group n observations into k clusters, with each point belonging to exactly one cluster: the one whose centroid is nearest. In this tutorial, we will implement the K-Means algorithm in Python using TensorFlow.

The Algorithm

The K-Means algorithm works as follows:

  1. First, select K randomly chosen points as the initial centroids.
  2. Form K clusters by assigning every point to its closest centroid.
  3. Recompute each centroid as the mean of the points assigned to it.
  4. Repeat steps 2 and 3 until the centroids stop changing.
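
The four steps above can be sketched in plain Python before bringing in TensorFlow. This is only an illustrative sketch: the function name kmeans, its parameters, and the toy data are made up for this example, and keeping the previous centroid for an empty cluster is one possible choice among several.

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain-Python K-Means on a list of (x, y) tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                new_centroids.append((sum(p[0] for p in members) / len(members),
                                      sum(p[1] for p in members) / len(members)))
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(c)
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Toy data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, clusters = kmeans(data, k=2)
print("centroids", centroids)
```

The order of the returned centroids depends on the random initial pick, so compare them after sorting if you need a stable result.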

It is easy to implement and usually converges quickly, although the result depends on the initial centroids. TensorFlow has built-in functions that help us implement it. In contrast to supervised learning, unsupervised learning finds patterns in unlabeled data, which makes it useful for exploring raw or unknown data. Dimensionality reduction is another important part of unsupervised learning. Now let's code and plot this.

Creating the Model

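First, we will import the necessary Python libraries. The code in this tutorial uses the TensorFlow 1.x graph-and-session API (tf.Session, tf.assign); on TensorFlow 2.x the same calls are available through the tf.compat.v1 module.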

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import pandas as pd

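Next, we will generate random data points and assign them to a constant tensor. We will also pick random initial centroids from the data points. Note that the points tensor has shape (points_n, 2) while the centroids tensor has shape (clusters_n, 2), so they cannot be subtracted directly. We will use TensorFlow's "expand_dims" function to add an axis to each so that they broadcast against each other.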

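#Problem sizes (example values, adjust as needed)
points_n = 200
clusters_n = 3
iteration_n = 100

#Initialize Variables
points = tf.constant(np.random.uniform(0, 10, (points_n, 2)))
centroids = tf.Variable(tf.slice(tf.random_shuffle(points), [0, 0], [clusters_n, -1]))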

#Equating Dimensions
points_expanded = tf.expand_dims(points, 0)        #shape (1, points_n, 2)
centroids_expanded = tf.expand_dims(centroids, 1)  #shape (clusters_n, 1, 2)
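
To see what the extra axes do, here is a small NumPy sketch of the same idea on hand-made data. NumPy follows the same broadcasting rules as TensorFlow for this case; the four points and two centroids are made-up values chosen so the result is easy to check by hand.

```python
import numpy as np

# Tiny hand-made data so the shapes and results are easy to follow.
points = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 9.0], [10.0, 9.0]])  # shape (4, 2)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])                      # shape (2, 2)

points_expanded = np.expand_dims(points, 0)        # shape (1, 4, 2)
centroids_expanded = np.expand_dims(centroids, 1)  # shape (2, 1, 2)

# Broadcasting pairs every centroid with every point: shape (2, 4, 2).
diff = points_expanded - centroids_expanded

# Sum the squared differences over the coordinate axis: shape (2, 4).
distances = np.sum(np.square(diff), axis=2)

# For each point (column), pick the row index of the nearest centroid.
assignments = np.argmin(distances, axis=0)

print(diff.shape, distances.shape)
print(assignments)
```

Row i of distances holds the squared distance from centroid i to every point, which is why the nearest centroid is found by taking argmin along axis 0.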

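Finally, we will calculate the squared Euclidean distance from every point to every centroid, assign each point to its nearest centroid, and recalculate each centroid in a for loop as the mean of the points assigned to it.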

#Calculating Distances
distances = tf.reduce_sum(tf.square(tf.subtract(points_expanded, centroids_expanded)), 2)
assignments = tf.argmin(distances, 0)

#Updating Centroids
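means = []
for c in range(clusters_n):
    #Indices of the points assigned to cluster c, reshaped to shape (1, m)
    cluster_indices = tf.reshape(tf.where(tf.equal(assignments, c)), [1, -1])
    #Mean of those points; the result has shape (1, 2)
    means.append(tf.reduce_mean(tf.gather(points, cluster_indices), axis=[1]))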

new_centroids = tf.concat(means, 0)
update_centroids = tf.assign(centroids, new_centroids)
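
The centroid update is easier to read outside the graph API. The following NumPy sketch shows the same "mean of the assigned points" step on made-up data; the helper name update_centroids_np is hypothetical. It also guards against an empty cluster, a case in which the mean of zero points would typically come out as NaN. Keeping the old centroid in that case is an assumption of this sketch, not something the code above does.

```python
import numpy as np

def update_centroids_np(points, assignments, centroids):
    """Return new centroids as the mean of the points assigned to each one."""
    new_centroids = centroids.copy()
    for c in range(len(centroids)):
        members = points[assignments == c]  # rows assigned to cluster c
        if len(members) > 0:
            new_centroids[c] = members.mean(axis=0)
        # Assumption: an empty cluster keeps its previous centroid.
    return new_centroids

points = np.array([[0.0, 0.0], [2.0, 0.0], [9.0, 9.0]])
assignments = np.array([0, 0, 1])  # no point is assigned to cluster 2
centroids = np.array([[1.0, 1.0], [8.0, 8.0], [5.0, 5.0]])

new_centroids = update_centroids_np(points, assignments, centroids)
print(new_centroids)
```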

Results

To plot the results, we use Pyplot from the Matplotlib library. In each iteration, we run the centroid update and fetch the current centroids, points, and assignments. We then draw a scatter plot of the points, colored by cluster, with the centroids marked on top.

#Op that initializes the centroids variable
init = tf.global_variables_initializer()

with tf.Session() as sess:
  sess.run(init)
  for step in range(iteration_n):
    [_, centroid_values, points_values, assignment_values] = sess.run([update_centroids, centroids, points, assignments])

print("centroids", centroid_values)

#Plot the points colored by cluster, with the centroids as black crosses
plt.scatter(points_values[:, 0], points_values[:, 1], c=assignment_values, s=50, alpha=0.5)
plt.plot(centroid_values[:, 0], centroid_values[:, 1], 'kx', markersize=15)
plt.show()

Finally, a plot like this will be generated:

We were able to divide the points into 3 clusters.

This algorithm can be applied to many kinds of numeric data sets. Some applications of K-Means are:

  1. Employee Absenteeism for a particular company
  2. Heat Maps for Pandemics like COVID-19
  3. Player Stat Analysis
  4. Customer division for various products

This makes K-Means a very useful algorithm for analyzing data. Because it struggles when clusters are intermixed, hierarchical clustering can be used as an alternative in such cases.

Feel free to ask questions in the comments section.
