Understanding KNN algorithm using Iris Dataset with Python

Hey folks,

Let’s learn about a lazy algorithm that can be used for both classification and regression.

It is the K-Nearest Neighbors (KNN) algorithm.

Instance-Based Learning

The KNN algorithm goes by several names, such as lazy learning, instance-based learning, and case-based learning (locally weighted regression is a closely related instance-based method). It is called lazy because it has no separate training phase: it simply stores the training data and defers all computation until a prediction is requested. In other words, it uses the entire training set at prediction time. Another property of the KNN algorithm is that it is non-parametric, meaning it makes no prior assumption about the distribution of the data.

It is called instance-based learning because we do not process the training examples to build a model. Instead, we store them, and whenever we need to classify a new example, we retrieve the most similar stored instances and use their labels to produce the result (a majority vote for classification, an average for regression). The algorithm relies on one assumption: “Similar things exist in close proximity”.
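
To make the "store, then look up similar instances" idea concrete, here is a minimal from-scratch sketch using only the standard library. The function name `knn_predict`, the toy points, and the choice of Euclidean distance are assumptions made for illustration; this is not how scikit-learn implements it internally.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every stored training example (Euclidean).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest examples; nothing was "trained" before this call.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two small clusters in 2-D.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

prediction_near_a = knn_predict(points, labels, (1.1, 1.0), k=3)
prediction_near_b = knn_predict(points, labels, (5.1, 5.0), k=3)
print(prediction_near_a, prediction_near_b)
```

Notice that all the work happens inside `knn_predict` at query time, which is exactly why the algorithm is called lazy.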

Now that we know a few things about the knn algorithm, let’s dive into the code part.

Implementation using Iris Dataset in Python

This dataset contains three classes of the iris flower. One class (Iris setosa) is linearly separable from the other two, whereas the other two are not linearly separable from each other. For the implementation, we will use the scikit-learn library.
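
If you do not have the CSV file at hand, scikit-learn ships its own copy of the Iris data. The short sketch below loads it with `load_iris` just to look at its shape and class names. Note that this is an assumption for illustration: the column names and order in your own CSV may differ from this bundled copy.

```python
from sklearn.datasets import load_iris

# Load scikit-learn's bundled copy of the Iris data (not the CSV used in this post).
iris = load_iris()

print(iris.data.shape)            # (number of samples, number of features)
print(list(iris.feature_names))   # names of the four measurements
print(list(iris.target_names))    # names of the three classes
```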

Let’s import the needed Python libraries.

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import matplotlib.pyplot as plt

Now we will read the contents of the dataset to check if they are in the required format.

# Raw string (r"...") so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"D:\swapnali\Engineering\Third Year\sem 5\Machine Learning\Practical\iris.csv")
df.head()

The lines above display the first five entries of the dataset, which includes a column of text values, as shown below:

 

Text Based Values

We need to convert those values to numbers for easy calculations. To do this we use LabelEncoder.

iris_fl = LabelEncoder()
# Add a numeric label column next to the text label column
df['iris_fl_n'] = iris_fl.fit_transform(df['iris_fl'])
# The features are everything except the two label columns
X = df.drop(['iris_fl','iris_fl_n'],axis='columns')
print(X.head())

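The DataFrame df now has a new column, iris_fl_n, which holds a numeric value (0-2) for each class of iris flower. X holds only the feature columns: both the text column and the new numeric label column have been dropped from it, so print(X.head()) shows the first five rows of measurements only.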

Number valued column
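
To see what LabelEncoder does on its own, here is a small sketch on a hand-written list of labels. The list `species` is made up for illustration and stands in for the text column of the CSV.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical text labels standing in for the CSV's text column.
species = ["setosa", "virginica", "versicolor", "setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ is sorted, and each label's code is its position in classes_.
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}

# inverse_transform turns the numeric codes back into the original text labels.
decoded = encoder.inverse_transform(codes)

print(list(codes))
print(mapping)
print(list(decoded))
```

Keeping the fitted encoder around is handy later on, because `inverse_transform` lets us turn numeric predictions back into readable class names.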

After this, we will store our target values in a separate variable and split the data into training and test sets. We also set up the range of k values to try and the containers for their scores.

y = df['iris_fl_n']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=4)
k_range = range(1,26)
scores = {}
scores_list = []

To check how the accuracy of the model changes with the value of k, we use the loop below and store the accuracy score of the model for each k. This step is only for inspecting the accuracy and can be omitted.

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train,y_train)
    y_pred=knn.predict(X_test)
    # Compute the accuracy once and keep it in both containers
    scores[k]=metrics.accuracy_score(y_test,y_pred)
    scores_list.append(scores[k])

Now we plot the accuracy of the model with respect to changing values of k.

plt.plot(k_range,scores_list)
plt.xlabel('Value of k for kNN')
plt.ylabel('Testing Accuracy')
plt.show()

Accuracy Graph
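
Besides reading the graph, we can also pull the best-scoring values of k straight out of the scores dictionary. The sketch below repeats the loop in a self-contained way; it uses scikit-learn's bundled `load_iris` data as a stand-in for the CSV (an assumption), and the names `best_score` and `best_ks` are introduced here for illustration.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: scikit-learn's bundled Iris copy instead of the CSV file.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Same idea as the loop above: one accuracy score per value of k.
scores = {}
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores[k] = metrics.accuracy_score(y_test, knn.predict(X_test))

# Several values of k may tie for the best accuracy, so collect all of them.
best_score = max(scores.values())
best_ks = [k for k, score in scores.items() if score == best_score]
print("best accuracy:", best_score, "reached at k =", best_ks)
```

Keep in mind that with test_size=0.2 the test set has only 30 flowers, so one misclassified flower moves the accuracy by more than three percentage points. Small differences between values of k should not be over-interpreted.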

Now for the most important part of our program: the final model. We store it in a variable named knn, use k = 5, and fit it on the full dataset.

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X,y)

Now, let’s introduce new values to the model to see whether it gives the expected results. For this, we will make a list of test values and pass them to the model to generate predictions.

x_test = [[5,4,3,4],[5,4,4,5]]
y_predict = knn.predict(x_test)

Now we will print the class into which our model has classified these values.

# Use x_test, the list defined above (x_new was never defined)
print("\n\nprediction for values:",x_test[0],"is: ",y_predict[0])
print("prediction for values:",x_test[1],"is: ",y_predict[1])

We get the output as:

prediction for values: [5, 4, 3, 4] is:  1
prediction for values: [5, 4, 4, 5] is:  2
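
Since KNN answers by looking up stored instances, it can be instructive to see which neighbours actually voted. The sketch below uses the `kneighbors` method of `KNeighborsClassifier` for that. It again relies on scikit-learn's bundled Iris copy as a stand-in for the CSV (an assumption), so the feature order, and possibly the predictions, may not match the output above.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: scikit-learn's bundled Iris copy instead of the CSV file.
iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(iris.data, iris.target)

new_points = [[5, 4, 3, 4], [5, 4, 4, 5]]

# For each new point: distances to, and row indices of, its 5 nearest training samples.
distances, indices = knn.kneighbors(new_points)
predictions = knn.predict(new_points)

for point, dist_row, idx_row, pred in zip(new_points, distances, indices, predictions):
    neighbour_classes = [str(iris.target_names[iris.target[i]]) for i in idx_row]
    print(point, "->", iris.target_names[pred])
    print("  neighbour classes:", neighbour_classes)
    print("  distances:", dist_row.round(2))
```

Large distances are a hint that a new point lies far from anything the model has stored, in which case the prediction deserves less trust.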

Conclusion

Our model assigns each new instance to one of the three classes, and we can cross-check these predictions against the dataset. The KNN algorithm can be used in many other scenarios as well, such as detecting patterns and classifying handwritten digits.

Thank You.

Further reading :

Learning to classify wines using scikit-learn
