Skin disease classification using random forest algorithm in Python
Hello everyone! In this post, we will learn how to classify various skin diseases from a set of attributes using the random forest algorithm.
By the end of the post, you will understand the implementation of the random forest algorithm.
- The random forest algorithm is a supervised machine learning algorithm used for classification and regression problems.
- In this algorithm, n random subsets (bootstrap samples, drawn with replacement) are taken from the training dataset, and a decision tree is built on each subset.
- Each test record is passed to every decision tree, and the class predicted by the majority of the trees is taken as the final output.
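The two ideas above, sampling a subset for each tree and taking a majority vote, can be sketched in plain Python. This is only an illustration, not how scikit-learn implements them internally: the helper names `bootstrap_sample` and `majority_vote`, the record names and the class labels "A" and "B" are all made up for the example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the class label that most trees voted for."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
training_rows = ["r0", "r1", "r2", "r3", "r4", "r5"]  # hypothetical records

# Each tree would be trained on its own sample; some records can
# repeat in a sample and others can be left out.
sample = bootstrap_sample(training_rows, rng)
print(sample)

# Hypothetical predictions of five trees for a single test record
tree_votes = ["A", "B", "A", "A", "B"]
final_prediction = majority_vote(tree_votes)
print(final_prediction)  # A
```

Because every tree sees a slightly different sample, the trees make different mistakes, and the vote tends to cancel them out.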
Implementation of Random Forest Algorithm
- To implement the random forest algorithm in Python, we need a few libraries and modules.
- The first step is importing the dataset.
import pandas as pd
import numpy as np

data = pd.read_csv('dataset.csv')
data.head()
- The above code will display the first 5 rows of the dataset.
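- The next step is separating the output column from the other attributes.
# Features: every column except the last one
x = data.iloc[:, :-1].values
# Target: assumes the class label is the last column (index 11 in this dataset)
y = data.iloc[:, -1].values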
- Next, split the dataset into training and testing sets.
- This can be done with the train_test_split function, which takes the independent variables, the dependent variable and test_size as parameters.
- For example, test_size=0.30 means that 30% of the records are used for testing and 70% for training.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=0)
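To see what a 70/30 split does to the records, here is a rough standard-library sketch. The function `split_rows` is a hypothetical helper written for this illustration. It is not scikit-learn's implementation, and its shuffling will not reproduce the exact rows that train_test_split picks.

```python
import random


def split_rows(rows, test_size=0.30, seed=0):
    """Shuffle the row indices, then cut them into train and test parts."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test


rows = list(range(10))  # stand-ins for 10 records
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 7 3
```

Fixing the seed plays the same role as random_state: the same records end up in the same part on every run.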
- The next step is feature scaling, followed by fitting the random forest classifier on the training data.
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree

# Fit the scaler on the training data only, then apply it to both sets
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

# Build a forest of 25 trees that split on entropy
classifier = RandomForestClassifier(n_estimators=25, criterion="entropy")
classifier.fit(x_train, y_train)
- The parameter n_estimators sets the number of decision trees, and criterion="entropy" tells each tree to measure the quality of a split by information gain, that is, the reduction in entropy.
- The output of the above code is the description of a random forest classifier.
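To give a feel for what the entropy criterion measures, the sketch below computes the entropy of a set of class labels and the information gain of a split, using only the standard library. The functions `entropy` and `information_gain` and the label lists are hypothetical, written for this illustration, and are not part of scikit-learn.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of its two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = [1, 1, 1, 1, 2, 2, 2, 2]  # two classes, evenly mixed

# A split that separates the classes perfectly removes all the uncertainty
perfect_gain = information_gain(parent, [1, 1, 1, 1], [2, 2, 2, 2])

# A split that leaves both children as mixed as the parent gains nothing
useless_gain = information_gain(parent, [1, 1, 2, 2], [1, 1, 2, 2])

print(perfect_gain, useless_gain)
```

With this criterion, a tree prefers the split with the higher gain at each node.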
- For a graphical representation of the collection of decision trees, we use the matplotlib library.
import matplotlib.pyplot as mtp

# Draw every tree of the forest on its own figure so the plots do not overlap
for estimator in classifier.estimators_:
    mtp.figure(figsize=(5, 5))
    tree.plot_tree(estimator, filled=True)
mtp.show()
- The output of the above code is a tree structure.
- The last step is to evaluate the model with a confusion matrix.
- It shows the number of correct and wrong predictions by comparing the variables y_pred and y_test.
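To make the comparison concrete, here is a small sketch that tallies a confusion matrix by hand. The label lists `y_true_demo` and `y_pred_demo` are made up for the example and are not results from the skin disease dataset. Rows stand for the actual class and columns for the predicted class, the same layout scikit-learn uses.

```python
from collections import Counter

# Hypothetical actual and predicted classes for six records
y_true_demo = [1, 1, 2, 2, 3, 3]
y_pred_demo = [1, 2, 2, 2, 3, 1]

# Count how often each (actual, predicted) pair occurs
pair_counts = Counter(zip(y_true_demo, y_pred_demo))
labels = sorted(set(y_true_demo) | set(y_pred_demo))

# Row = actual class, column = predicted class
matrix = [[pair_counts[(actual, predicted)] for predicted in labels]
          for actual in labels]
for row in matrix:
    print(row)

# Correct predictions lie on the diagonal
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(y_true_demo)
print(correct, "of", len(y_true_demo), "correct")
```

Every off-diagonal entry is a misclassification, and its position tells you which class was mistaken for which.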
from sklearn.metrics import confusion_matrix

y_pred = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
cm
- The confusion matrix is as follows: