Understanding KNN algorithm using Iris Dataset with Python
Hey folks,
Let’s learn about a lazy algorithm that can be used for both classification and regression.
It is the K-Nearest Neighbors (KNN) algorithm.
Instance-Based Learning
The KNN algorithm is known by many names, such as lazy learning, instance-based learning, or case-based learning, and it is closely related to locally weighted regression. It is called lazy because it does not build a model during training. In other words, it simply keeps all the training data and defers the real work until a prediction is requested. Another property of the KNN algorithm is that it is non-parametric, meaning it makes no prior assumptions about the distribution of the data.
It is called instance-based learning because we do not process the training examples to train a model. Instead, we store them, and whenever we need to classify a new example, we retrieve a set of similar instances to generate the result. To find the result, the algorithm relies on the idea that “similar things exist in close proximity”.
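To make the store-then-retrieve idea concrete, here is a minimal sketch of KNN classification in plain Python. It is not how scikit-learn implements it; the function name knn_predict, the Euclidean distance, the simple majority vote, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest stored examples."""
    # "Training" is only storage: distances are computed at prediction time.
    distances = [
        (math.dist(x, query), label)
        for x, label in zip(train_X, train_y)
    ]
    # Sort by distance and keep the k closest (most similar) instances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbours is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated groups.
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, [1.1, 1.0], k=3))  # a
print(knn_predict(train_X, train_y, [5.1, 5.0], k=3))  # b
```

Notice that nothing is learned ahead of time: all the work happens inside knn_predict, which is why the algorithm is called lazy.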
Now that we know a few things about the knn algorithm, let’s dive into the code part.
Implementation using Iris Dataset in Python
This dataset contains three classes of the iris flower. One class is linearly separable from the other two, whereas the other two are not linearly separable from each other. For the implementation, we will use the scikit-learn library.
Let’s import the needed Python libraries.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import matplotlib.pyplot as plt
Now we will read the contents of the dataset to check if they are in the required format.
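# Raw string (r"...") so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"D:\swapnali\Engineering\Third Year\sem 5\Machine Learning\Practical\iris.csv")
df.head()
The above lines display the first 5 entries of the dataset. Note that the class column (iris_fl) contains text values.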
We need to convert those values to numbers for easy calculations. To do this we use LabelEncoder.
iris_fl = LabelEncoder()
df['iris_fl_n'] = iris_fl.fit_transform(df['iris_fl'])
X = df.drop('iris_fl', axis='columns')
X = X.drop('iris_fl_n', axis='columns')
print(X.head())
The DataFrame df now has a new column, iris_fl_n, which holds a numeric value (0-2) for each class of iris flower. The feature matrix X drops both the text column and this encoded column, so its first five entries contain only the measurements.
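For intuition, the encoding step can be sketched in plain Python. LabelEncoder assigns codes according to the sorted order of the distinct labels, and the sketch below mimics that. The label strings are hypothetical, since the exact text in iris.csv may differ.

```python
# Hypothetical class labels; the strings in the actual iris.csv may differ.
labels = ["setosa", "virginica", "versicolor", "setosa", "virginica"]

# Sorted unique labels define the codes: position in this list = numeric code.
classes = sorted(set(labels))
to_code = {name: code for code, name in enumerate(classes)}

encoded = [to_code[name] for name in labels]
decoded = [classes[code] for code in encoded]  # reverse mapping, code -> text

print(classes)            # ['setosa', 'versicolor', 'virginica']
print(encoded)            # [0, 2, 1, 0, 2]
print(decoded == labels)  # True
```

With the scikit-learn encoder, the same reverse mapping is available through iris_fl.inverse_transform, which is handy for turning numeric predictions back into class names.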
After this, we store the target values in a separate variable and split the data into training and test sets.
y = df['iris_fl_n']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
k_range = range(1, 26)
scores = {}
scores_list = []
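The idea behind the split can be illustrated with a simplified sketch in plain Python: shuffle the row indices with a fixed seed, then hold out 20% of them for testing. The helper simple_train_test_split and the stand-in data are hypothetical, and the sketch will not reproduce the exact rows that scikit-learn's train_test_split selects.

```python
import random

def simple_train_test_split(rows, targets, test_size=0.2, seed=4):
    """Shuffle row indices, then hold out a fraction of them as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed gives the same split every run
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    rows_train = [rows[i] for i in train_idx]
    rows_test = [rows[i] for i in test_idx]
    targets_train = [targets[i] for i in train_idx]
    targets_test = [targets[i] for i in test_idx]
    return rows_train, rows_test, targets_train, targets_test

# Stand-in data: 150 rows, the size of the classic Iris dataset.
rows = [[i, i + 0.5] for i in range(150)]
targets = [i % 3 for i in range(150)]

r_train, r_test, t_train, t_test = simple_train_test_split(rows, targets)
print(len(r_train), len(r_test))  # 120 30
```

Fixing the seed plays the same role as random_state=4 above: the split is random, but repeatable.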
To check how the accuracy of the model changes with the value of k, we use this loop and store the accuracy score of the model for each value of k. This is just to check the accuracy and can be omitted.
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores[k] = metrics.accuracy_score(y_test, y_pred)
    scores_list.append(scores[k])
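As a rough illustration of what the loop records, accuracy is simply the fraction of test predictions that match the true labels, and the stored scores can then be scanned for the best k. The helper accuracy and the numbers in example_scores are made up for illustration; real values come from running the loop on your own data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))  # 3 of 4 correct: 0.75

# Hypothetical accuracy per k, in the same shape as the `scores` dictionary.
example_scores = {1: 0.93, 3: 0.97, 5: 0.97, 7: 0.95}

best_score = max(example_scores.values())
# Several k values can tie, so list them all rather than silently picking one.
best_ks = [k for k, score in example_scores.items() if score == best_score]

print(best_score)  # 0.97
print(best_ks)     # [3, 5]
```

When several values of k tie, the choice among them is a judgment call, so it helps to look at the whole curve rather than a single maximum.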
Now we plot the accuracy of the model with respect to changing values of k.
plt.plot(k_range, scores_list)
plt.xlabel('Value of k for kNN')
plt.ylabel('Testing Accuracy')
plt.show()  # needed to display the plot when running as a script
Let us go ahead and build the final model, the most important part of our program. We store the model in a variable named knn and fit it on the full dataset with k = 5.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
Now, let’s introduce new values to the model to see whether it gives the expected results. For this, we make a list of test values and pass them to the model to generate results.
x_new = [[5, 4, 3, 4], [5, 4, 4, 5]]
y_predict = knn.predict(x_new)
Now we will print the class into which our model has classified these values.
print("\n\nprediction for values:", x_new[0], "is: ", y_predict[0])
print("prediction for values:", x_new[1], "is: ", y_predict[1])
We get the output as:
prediction for values: [5, 4, 3, 4] is: 1
prediction for values: [5, 4, 4, 5] is: 2
Conclusion
We can cross-check that our model classifies new instances into their respective classes. This algorithm can be used in different scenarios, such as detecting patterns or classifying handwritten digits.
Thank You.
Further reading :
Learning to classify wines using scikit-learn