RandomForest Classifier and Regressor in Machine Learning in Python

A Random Forest is a collection of Decision Trees; in other words, the Decision Tree is the basic unit of a Random Forest.
To understand Random Forests, we first need to understand Decision Trees. A Decision Tree can be a classification or a regression model. At each step a decision is made (often in the form of a ‘yes’ or ‘no’), and with each decision we move further down the tree until we reach a conclusion at a leaf. The following diagram illustrates a Decision Tree.

[Diagram: a Decision Tree, with yes/no decision nodes branching down to leaf conclusions]

A Random Forest is called random because each tree is trained on a random bootstrap sample of the data, and each split considers only a random subset of the features. With that in mind, let's look at a quick sketch and then take up a practical problem to understand it more clearly.
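Before the real dataset, here is a minimal sketch, assuming a synthetic toy dataset, of how a single Decision Tree compares with a Random Forest in scikit-learn; the max_features='sqrt' setting is what makes each split look at only a random subset of the features.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Toy data: 1000 samples with 10 features (purely illustrative).
X_toy, y_toy = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# A single decision tree considers all features at every split.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# A random forest builds many trees on bootstrap samples and considers
# only a random subset of features at each split (max_features='sqrt').
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0).fit(X_tr, y_tr)

print("Single tree accuracy  :", tree.score(X_te, y_te))
print("Random forest accuracy:", forest.score(X_te, y_te))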

RandomForest Classifier and Regressor Model

We will use a weather dataset from Kaggle on the Delhi climate. To give an overview, the data contains weather conditions (smoke, clear, haze), humidity, precipitation, pressure, and so on. The dataset contains a number of NaN values, which can be replaced with the mean value of their column; columns that are mostly NaN can be dropped entirely. After this, we convert the pandas DataFrame to NumPy arrays so that they can be fed to scikit-learn. A sketch of these steps is shown below, followed by the feature scaling.
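A rough sketch of this preprocessing is shown here; the file name and the 'condition' target column are assumptions, since the exact Kaggle file layout is not reproduced in this article.

import numpy as np
import pandas as pd

# Hypothetical file and column names for the Delhi climate dataset.
df = pd.read_csv('delhi_weather.csv')

# Drop columns that are mostly NaN (here: more than half missing), then fill
# the remaining gaps with each column's mean.
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))
numeric_cols = df.select_dtypes(include=np.number).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Convert numeric features and the target to NumPy arrays for scikit-learn
# ('condition' is an assumed name for the target column).
X = df.drop(columns=['condition']).select_dtypes(include=np.number).to_numpy()
y = df['condition'].to_numpy()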

from sklearn.preprocessing import StandardScaler
# Standardize every feature to zero mean and unit variance.
X = StandardScaler().fit(X).transform(X)

StandardScaler() standardizes each feature to zero mean and unit variance, which often improves model accuracy. Next, the dataset is split into training and testing sets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=4)

Now, we have our training and test dataset. We can directly train our model as shown.

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)

And the output will be something like this.

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

After training the model on the training dataset, we can make predictions as shown.

rfc_predict = rfc.predict(X_test)

Now we use a confusion matrix and a classification report to check the quality of the predictions from the classification model.

from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, rfc_predict))
print(classification_report(y_test, rfc_predict))

And the output will be something like this.

[[7576  533  118 ...    0    0    0]
 [ 783 3087   30 ...    0    0    0]
 [ 107   40  491 ...    0    0    0]
 ...
 [   0    0    0 ...    0    0    0]
 [   0    0    0 ...    0    0    0]
 [   1    0    0 ...    0    0    0]]

              precision    recall  f1-score   support

         1.0       0.80      0.90      0.85      8431
         2.0       0.79      0.78      0.78      3981
         3.0       0.70      0.74      0.72       662
         4.0       0.58      0.45      0.51       634
         5.0       0.63      0.45      0.52       498
         6.0       0.77      0.80      0.78       565
         7.0       0.48      0.27      0.34       422
         8.0       0.50      0.27      0.35       438
         9.0       0.70      0.71      0.71       371
        10.0       0.35      0.22      0.27       294
        11.0       0.66      0.90      0.76       208
        12.0       0.42      0.37      0.40        43
        13.0       0.57      0.48      0.52        66
        14.0       0.67      0.86      0.75        76
        15.0       0.59      0.51      0.55        37
        16.0       0.36      0.07      0.11        73
        17.0       0.45      0.19      0.26        81
        18.0       0.72      0.40      0.52        84
        19.0       0.69      0.41      0.52        75
        20.0       0.43      0.18      0.25        67
        21.0       0.74      0.91      0.82        35
        22.0       0.50      0.31      0.38        36
        23.0       0.25      0.04      0.07        25
        24.0       0.50      0.20      0.29         5
        25.0       0.00      0.00      0.00        12
        26.0       0.00      0.00      0.00         3
        27.0       1.00      0.12      0.22         8
        28.0       0.00      0.00      0.00         2
        29.0       0.00      0.00      0.00         1
        31.0       0.00      0.00      0.00         1
        33.0       0.00      0.00      0.00         1
        35.0       0.00      0.00      0.00         1

    accuracy                           0.76     17236
   macro avg       0.46      0.36      0.38     17236
weighted avg       0.74      0.76      0.75     17236

We can also compute the accuracy score directly, as shown below with its output.

print("Random Forest Accuracy: ",metrics.accuracy_score(y_test,rfc_predict))
Random Forest Accuracy: 0.7622998375493154

The code for the regression model is shown below. (For regression, the target y should be a continuous quantity, such as temperature, rather than a class label.)

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=200)
rfr.fit(X_train, y_train)

Now we make predictions on the test data.

predictions = rfr.predict(X_test)

We can calculate the absolute errors, the mean absolute error, the mean absolute percentage error, and an accuracy figure as shown.

import numpy as np

errors = abs(predictions - y_test)                                 # absolute error per sample
average_absolute_error = round(np.mean(errors), 2)                 # mean absolute error
mean_absolute_percentage_error = np.mean(100 * (errors / y_test))  # MAPE
accuracy = 100 - mean_absolute_percentage_error                    # a rough accuracy figure
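As a usage example, the results can be printed like this (the exact numbers depend on the target, the split, and the random seed):

print("Mean absolute error:", average_absolute_error)
print("Mean absolute percentage error: {:.2f}%".format(mean_absolute_percentage_error))
print("Accuracy: {:.2f}%".format(accuracy))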

The differences between the classification and the regression model are:

  1. The classification model predicts one class from a fixed list of classes.
  2. The regression model can predict any continuous value (see the sketch below).
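The following minimal sketch, using tiny illustrative arrays rather than the weather data, shows that difference in output: the classifier returns one of the known class labels, while the regressor returns a continuous value.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X_demo = np.array([[0], [1], [2], [3], [4], [5]])
y_labels = np.array([0, 0, 0, 1, 1, 1])               # discrete class labels
y_values = np.array([0.2, 0.9, 2.1, 2.9, 4.2, 5.1])   # continuous target

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_labels)
reg = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_values)

print(clf.predict([[2.5]]))  # one class from the list, e.g. [0] or [1]
print(reg.predict([[2.5]]))  # any continuous value, e.g. [2.37]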
