Customer Segmentation using K-Means Clustering in Machine Learning with Python

This tutorial walks through customer segmentation with the k-means clustering algorithm, step by step, in Python.

K-means clustering is an unsupervised machine learning algorithm: it groups data points by similarity without needing labeled examples.

You can check more details about k-means clustering.
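To give a feel for what the algorithm does internally, here is a minimal sketch of k-means written with NumPy only. It is not the scikit-learn implementation used later in this tutorial; the function name kmeans_sketch, its parameters, and the toy data are all made up for illustration.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Minimal k-means loop (assign, then update); illustrative only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the assignment is stable
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points each
toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                [8.0, 1.0], [8.5, 2.0], [9.0, 1.0]])
toy_labels, toy_centroids = kmeans_sketch(toy, k=2)
print(toy_labels)
print(toy_centroids)
```

The two steps inside the loop (assign each point to its nearest centroid, then recompute the centroids) are the whole idea; library implementations add smarter initialization and several restarts on top of it.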

Customer segmentation means clustering customers into different groups. One group may represent customers who tend to purchase more in the mall, while another may represent those who don’t purchase much. Having these groups helps the mall make better business decisions and build better marketing strategies.

You can download the dataset from here: Mall_Customers.

So let’s implement the code.

Import the necessary Python libraries.

#Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

The first step in any machine learning project is:

Data Collection

1. Load data

#loading the data from csv file to a pandas DataFrame
customer_data=pd.read_csv("C:\\Users\\users drive\\OneDrive\\Desktop\\Mall_Customers.csv")
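The path above is specific to one machine, so the file may not be found on yours. As a small sketch, a helper like the one below can fail early with a clear message when the path is wrong. The name load_customers is hypothetical and not part of pandas.

```python
from pathlib import Path

import pandas as pd

def load_customers(csv_path):
    """Read a customers CSV into a DataFrame, failing early if the file is missing."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Could not find the data set at: {path}")
    return pd.read_csv(path)
```

You would call it with wherever you saved the download, for example load_customers("Mall_Customers.csv") if the file sits next to your script.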

To check that the dataset has loaded, print the first five rows using the head method.

#printing first five rows of the dataframe
customer_data.head()


Now print the last five rows using the tail method.

#printing last five rows of the dataframe
customer_data.tail()


Keep in mind that any machine learning problem needs this kind of data exploration first.

Check the size of the dataset.

#Finding the number of rows and columns
customer_data.shape


(200, 5)

Now let’s see details about the dataset.

#Getting some information about the dataset
customer_data.info()


Let’s check the missing values in the dataset.

#Checking the missing values
customer_data.isnull().sum()
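If you do not have the file at hand, the same checks can be tried on a tiny stand-in table. The rows and column names below are made up for illustration and are not read from Mall_Customers; one value is left missing on purpose so the missing-value count has something to report.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; column names and values are assumptions, not the real data
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [30, 42, np.nan, 25],
    "Annual Income (k$)": [40, 65, 20, 90],
    "Spending Score (1-100)": [50, 30, 85, 70],
})

print(sample.shape)              # (number of rows, number of columns)
missing = sample.isnull().sum()  # count of missing values in each column
print(missing)
```

A column with a non-zero count would need handling (dropping or filling the rows) before clustering.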


In the given dataset the customer ID is not useful for clustering, because it only identifies a customer and says nothing about how they spend.

The customers will be grouped based on their annual income and spending score.

Choose the annual income column and spending score column.

#Columns 3 and 4 hold the annual income and spending score
X=customer_data.iloc[:,[3,4]].values
print(X) #In each row the first value represents annual income and the second value represents spending score
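Selecting by position works as long as the column order never changes. As a hedged alternative, the same two columns can be picked by name. The column names in this sketch are assumptions about the file header, and the three rows are invented.

```python
import pandas as pd

# Hypothetical frame laid out in the column order described above (names are assumed)
frame = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": ["Male", "Female", "Female"],
    "Age": [30, 42, 25],
    "Annual Income (k$)": [40, 65, 20],
    "Spending Score (1-100)": [50, 30, 85],
})

by_position = frame.iloc[:, [3, 4]].values  # columns 3 and 4 by index
by_name = frame[["Annual Income (k$)", "Spending Score (1-100)"]].values  # same columns by name

print(by_position.shape)
print((by_position == by_name).all())
```

Selecting by name is a little longer to type, but it keeps working if a column is added or reordered.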



Choosing the number of clusters.

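Use the WCSS (within-cluster sum of squares) method to compare different numbers of clusters.

Finding WCSS values for different numbers of clusters.

#WCSS (inertia) for 1 to 10 clusters; n_init=10 and random_state=42 are assumed settings for repeatable runs
wcss=[KMeans(n_clusters=i,n_init=10,random_state=42).fit(X).inertia_ for i in range(1,11)]

Let’s try the visualization with the elbow graph.

#Plot an elbow graph
plt.plot(range(1,11),wcss)
plt.title("The Elbow point Graph")
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()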


It is also called a cutoff point graph: the elbow is the point where the curve stops dropping sharply as more clusters are added.

Here the optimum number of clusters is 5.
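To make the quantity behind the elbow graph concrete, the sketch below computes the within-cluster sum of squares by hand and compares it with the inertia_ attribute that scikit-learn stores on a fitted KMeans model. The data is randomly generated stand-in data, and n_init and random_state are assumed settings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical 2-D data: three loose blobs standing in for (income, score) pairs
demo = np.vstack([
    rng.normal(loc=center, scale=1.0, size=(30, 2))
    for center in ([0, 0], [10, 10], [0, 10])
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(demo)

# Look up, for every point, the centroid of the cluster it was assigned to
own_centroids = model.cluster_centers_[model.labels_]
# Sum of squared distances from each point to its own centroid
wcss_by_hand = float(((demo - own_centroids) ** 2).sum())

print(wcss_by_hand)
print(model.inertia_)
```

The two printed numbers should agree up to floating point rounding, which is why inertia_ can be read directly as the value plotted for each number of clusters.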

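#Training the k means clustering model (n_init=10 and random_state=0 are assumed settings for repeatable runs)
kmeans=KMeans(n_clusters=5,n_init=10,random_state=0)

#Return a label for each data point based on their cluster
Y=kmeans.fit_predict(X)

Visualizing all the clusters.

Plot all 5 clusters (labels 0, 1, 2, 3, 4) and their centroids.

plt.scatter(X[Y==0,0],X[Y==0,1],s=50,c='violet',label='Cluster 1')
plt.scatter(X[Y==1,0],X[Y==1,1],s=50,c='green',label='Cluster 2')
plt.scatter(X[Y==2,0],X[Y==2,1],s=50,c='red',label='Cluster 3')
plt.scatter(X[Y==3,0],X[Y==3,1],s=50,c='black',label='Cluster 4')
plt.scatter(X[Y==4,0],X[Y==4,1],s=50,c='orange',label='Cluster 5')

#Plot the centroids (the marker size and color are assumed choices)
plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],s=100,c='cyan',label='Centroids')

plt.title('Customer Groups')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.show()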


Looking at the above graph, there are 5 different clusters, each with its own centroid. You can read them as 5 different customer groups: for example, one group may be customers who buy things frequently, and another may be customers who visit only once.
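A graph gives the overall picture, but a small table per group is often easier to act on. The sketch below is one possible way to summarize each cluster by its size and average values. It runs on synthetic points rather than the mall data, and the column names income and score are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic stand-in for (annual income, spending score): five blobs of 20 points each
centers = [(20, 20), (20, 80), (55, 50), (90, 20), (90, 80)]
points = np.vstack([rng.normal(loc=c, scale=3.0, size=(20, 2)) for c in centers])

# n_init and random_state are assumed settings for repeatable runs
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(points)

summary = (
    pd.DataFrame(points, columns=["income", "score"])
    .assign(cluster=labels)
    .groupby("cluster")
    .agg(
        customers=("income", "size"),    # how many points fell in the cluster
        mean_income=("income", "mean"),  # average income in the cluster
        mean_score=("score", "mean"),    # average spending score in the cluster
    )
)
print(summary.round(1))
```

Reading the means side by side makes it easier to name the groups, for example high income with low spending versus low income with high spending.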

This is how malls can improve their marketing, for example by offering discounts to a particular group of customers.

In the above example, there are some groups of customers where the customers are buying and leaving.

This kind of customer grouping is often used together with market basket analysis, which studies which items customers buy together.
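Once the groups exist, a new customer can be placed into one of them without retraining. The sketch below uses the predict method of a fitted KMeans model on a few invented (income, score) pairs. The variable names, the data, and the choice of two clusters are assumptions made only to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented past customers: three low-income/high-score and three high-income/low-score
history = np.array([[15, 80], [18, 75], [20, 85],
                    [85, 15], [90, 20], [88, 10]], dtype=float)

# n_init and random_state are assumed settings for repeatable runs
segmenter = KMeans(n_clusters=2, n_init=10, random_state=0).fit(history)

# Two new customers, each resembling one of the existing groups
new_customers = np.array([[17.0, 78.0], [87.0, 12.0]])
new_labels = segmenter.predict(new_customers)  # nearest-centroid assignment

print(new_labels)
```

Each new customer receives the label of the nearest centroid, so the marketing rules already defined for that group can be applied straight away.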
