Wine Quality Prediction using Random Forest Classification

This tutorial discusses wine quality prediction using a random forest classifier in Python.

The goal of this tutorial is to predict the quality of a wine from the given dataset.

The data set has 12 columns, and the last column is the quality of the wine. If the quality is greater than or equal to 7, we treat the wine as good quality.

If the value is less than or equal to 6, we treat it as not good quality.

You can download the data set from here: winequality-red

So let’s implement the code.

Import the necessary libraries.

#import the necessary libraries (the dependencies)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns #it is useful for the data visualization
from sklearn.model_selection import train_test_split
#the important one: import the random forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score #it is useful to find how well our model is performing

Load the data set.

#Data collection 
#loading the data set to a  pandas data frame
wine_dataset=pd.read_csv("C:\\Users\\users drive\\OneDrive\\Desktop\\winequality-red.csv")
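The absolute Windows path above only works on one machine. Here is a minimal sketch of a more portable load step, assuming a comma-separated copy of the file (some copies of this data set use semicolons instead). The helper name `load_wine_csv` and the two sample rows are hypothetical and only serve to illustrate the check.

```python
import io
import pandas as pd

def load_wine_csv(source, sep=","):
    """Read the wine CSV and fail early if the 'quality' column is missing."""
    df = pd.read_csv(source, sep=sep)
    if "quality" not in df.columns:
        raise ValueError("no 'quality' column found; check the separator")
    return df

# Tiny in-memory stand-in for the real file (values are illustrative only).
sample_csv = io.StringIO(
    "fixed acidity,volatile acidity,alcohol,quality\n"
    "7.4,0.70,9.4,5\n"
    "7.3,0.65,10.0,7\n"
)
sample = load_wine_csv(sample_csv)
print(sample.shape)  # (2, 4)
```

With the real file in the working directory, the call would look like `wine_dataset = load_wine_csv("winequality-red.csv")`.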

Check how many rows and columns there are in the dataset.

#now checking the rows and columns in the data set
wine_dataset.shape

OUTPUT:-

(1599, 12)

To check whether the data set loaded correctly, print the first five rows and the last five rows of the data frame.

#First five rows of the data set
wine_dataset.head()

OUTPUT:-

Check the last five rows.

#Now the last 5 rows
wine_dataset.tail()

OUTPUT:-

Check for missing values in the data.

wine_dataset.isnull().sum()

OUTPUT:-

Generate some statistical values for the data, such as the mean, standard deviation, and max.

wine_dataset.describe() 
#we will get the count, mean, standard deviation, min, max and percentiles of each column
#These values are very helpful to see the range of the values in each column

OUTPUT:-

To see the number of wines for each quality value, we will use a seaborn function.

sns.catplot(x='quality',data=wine_dataset,kind='count')
#These are the different quality values

OUTPUT:-

The above figure represents count vs quality.
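The same counts can be read as numbers rather than from a plot. This is a small sketch using a hypothetical `quality` Series in place of `wine_dataset['quality']`; the scores are made up for illustration.

```python
import pandas as pd

# Hypothetical scores standing in for wine_dataset['quality'].
quality = pd.Series([5, 6, 5, 7, 6, 5, 8, 3], name="quality")

# Rows per quality score, ordered by score rather than by frequency.
counts = quality.value_counts().sort_index()
print(counts)

# Share of wines that would be labelled good under the >= 7 rule.
good_share = (quality >= 7).mean()
print(good_share)  # 0.25
```

Knowing the share of good wines early is useful because it shows how balanced the two classes will be after binarization.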

Now we can compare the volatile acidity and the quality columns.

plot=plt.figure(figsize=(5,5))
sns.barplot(x='quality',y='volatile acidity',data=wine_dataset)
#we see that volatile acidity and quality are inversely related

OUTPUT:-

If volatile acidity is high then the quality is low.
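By default, `sns.barplot` draws the mean of `y` for each value of `x`, so the bar heights can be reproduced with a group-by. The sketch below uses a hypothetical six-row frame rather than the real data.

```python
import pandas as pd

# Made-up rows with the same column names as the wine data set.
toy = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7, 7],
    "volatile acidity": [0.70, 0.60, 0.50, 0.54, 0.40, 0.36],
})

# Mean volatile acidity per quality score: the numbers behind the bars.
mean_by_quality = toy.groupby("quality")["volatile acidity"].mean()
print(mean_by_quality)
```

On the real data, replacing `toy` with `wine_dataset` should give the values plotted above.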

Citric acid vs quality

sns.barplot(x='quality',y='citric acid',data=wine_dataset)
#if the citric acid content is higher we tend to get higher quality wine; if it is low the quality is not high

OUTPUT:-

This is the advantage of the data analysis part: it helps us understand which columns are most related to our label. Volatile acidity is inversely related to quality, and citric acid is directly related.

Chlorides vs quality

sns.barplot(x='quality',y='chlorides',data=wine_dataset)

OUTPUT:-

Quality vs alcohol

sns.barplot(x='quality',y='alcohol',data=wine_dataset)
#directly proportional

OUTPUT:-

Correlation Values

#Now we will find the correlation between all the columns and the quality column
#There are two types of correlation: positive and negative
correlation=wine_dataset.corr()
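To read the relationship with the label directly, one option is to pull the `quality` column out of the correlation matrix and sort it. This sketch assumes a small made-up frame; `quality_corr` is a name introduced here for illustration.

```python
import pandas as pd

# Made-up rows with a few of the wine data set's column names.
toy = pd.DataFrame({
    "volatile acidity": [0.70, 0.60, 0.50, 0.54, 0.40, 0.36],
    "alcohol": [9.4, 9.8, 10.5, 10.2, 11.5, 12.0],
    "quality": [5, 5, 6, 6, 7, 7],
})

correlation = toy.corr()

# Correlation of every feature with quality, most negative first.
quality_corr = correlation["quality"].drop("quality").sort_values()
print(quality_corr)
```

Negative values sit at the top of the sorted output and positive values at the bottom, which mirrors the two types of correlation mentioned in the comment.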

Construction of heat map:-

#constructing a heat map to understand the correlation between the columns
plt.figure(figsize=(10,10))
sns.heatmap(correlation,cbar=True,square=True,fmt='.2f',annot=True,annot_kws={'size':8},cmap='Greens')

OUTPUT:-

Separation of columns:-

#we need to separate the feature columns from the quality column
#because quality is the label we want to predict
#separate data and label
X=wine_dataset.drop('quality',axis=1)
print(X)

OUTPUT:-

Before training, we need to binarize the label: if the quality is >=7 the wine is good (1); otherwise (<=6) it is bad (0).

#need to store the quality column in a different variable
#label binarization: quality >=7 becomes 1 (good), anything else becomes 0 (bad)
Y=wine_dataset['quality'].apply(lambda y_value: 1 if y_value>=7 else 0)
print(Y)# we have 0 and 1

OUTPUT:-

Split the data into train and test data.

#for that we need to create 4 variables
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=2)
print(Y.shape,Y_train.shape,Y_test.shape) #to check how many values are in each split

OUTPUT:-

(1599,) (1279,) (320,)
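Because good wines are the smaller class, a plain random split can leave the train and test sets with slightly different shares of good wines. One option, not used in this tutorial, is the `stratify` argument of `train_test_split`. The sketch below runs on synthetic labels, and every `_toy` name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 20 of them labelled good (1).
X_toy = pd.DataFrame({"feature": np.arange(100)})
Y_toy = pd.Series([1] * 20 + [0] * 80)

# stratify=Y_toy keeps the good/bad ratio the same in both splits.
X_train_toy, X_test_toy, Y_train_toy, Y_test_toy = train_test_split(
    X_toy, Y_toy, test_size=0.2, random_state=2, stratify=Y_toy
)
print(Y_train_toy.mean(), Y_test_toy.mean())
```

Switching the tutorial's split to a stratified one would change which rows land in the test set, so the accuracy reported below would not be expected to match exactly.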

MODEL TRAINING

model=RandomForestClassifier()
model.fit(X_train,Y_train)#it will fit the model
#y contains quality values
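`RandomForestClassifier()` without a `random_state` builds a slightly different forest on every run, so scores can vary between runs. The sketch below fixes the seed and also reads `feature_importances_`, which complements the plots from the analysis step. It runs on synthetic data; the column names `signal` and `noise` and all `_demo` variables are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on the "signal" column.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
Y_demo = (X_demo["signal"] > 0).astype(int)

# A fixed random_state makes the fitted forest reproducible.
model_demo = RandomForestClassifier(n_estimators=50, random_state=2)
model_demo.fit(X_demo, Y_demo)

# Importances sum to 1; larger values mean the forest relied on that column more.
importances = pd.Series(model_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Applied to the wine model, the same `pd.Series(model.feature_importances_, index=X.columns)` pattern would rank the 11 input columns.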

ACCURACY SCORES

#we need to evaluate on test data 
#Accuracy on test data
X_test_prediction=model.predict(X_test)
#accuracy_score expects the true labels first, then the predictions
test_data_accuracy=accuracy_score(Y_test,X_test_prediction)
print('Accuracy : ',test_data_accuracy)

OUTPUT:-

Accuracy :  0.921875

This means our model classifies about 92% of the test samples correctly, which is a good result.

Next we predict for one wine at a time. If we pass the whole data set at once, the model still produces output, but the user cannot easily tell which prediction belongs to which row. That is why we take one row at a time and generate its output.

Take some values from the data set.
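Accuracy is easier to judge next to a baseline, because labelling every wine as bad already scores well when bad wines are the majority. The sketch below shows that comparison on hypothetical labels (17 bad, 3 good) using scikit-learn's `DummyClassifier`. The numbers are illustrative and are not results from the wine data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical imbalanced labels: 17 bad wines (0), 3 good wines (1).
y_true = np.array([0] * 17 + [1] * 3)
X_dummy = np.zeros((20, 1))  # features are ignored by this baseline

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_pred = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, baseline_pred)
print(baseline_accuracy)  # 0.85

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, baseline_pred))
```

On the wine data, running `confusion_matrix(Y_test, X_test_prediction)` would show how many good wines the forest actually finds, which accuracy alone does not reveal.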

#Building a predictive system
input_data=(7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0)
#changing the input data to a numpy array
input_data_as_numpy_array=np.asarray(input_data)
#we are predicting for only one input, so we need to reshape the data
#reshape to (1, n_features) because the model expects a 2D array
input_data_reshaped=input_data_as_numpy_array.reshape(1,-1)
prediction=model.predict(input_data_reshaped)
print(prediction)
if(prediction[0]==1):
    print('Good quality wine')
else:
    print('Bad quality wine')

OUTPUT:-

[1]
Good quality wine

#for the second input
input_data=(7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5)
#changing the input data to a numpy array
input_data_as_numpy_array=np.asarray(input_data)
#we are predicting for only one input, so we need to reshape the data
#reshape to (1, n_features) because the model expects a 2D array
input_data_reshaped=input_data_as_numpy_array.reshape(1,-1)
prediction=model.predict(input_data_reshaped)
print(prediction)
if(prediction[0]==1):
    print('Good quality wine')
else:
    print('Bad quality wine')

OUTPUT:-

[0]
Bad quality wine
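The two prediction snippets above repeat the same steps, so they could be wrapped in one function. This is a minimal sketch: `predict_quality` is a hypothetical helper, and the toy model with two made-up features only shows the call pattern. Building a one-row DataFrame with the training column names keeps the input in the same form the model was fitted on.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def predict_quality(model, feature_names, values):
    """Return a readable label for one wine given its feature values."""
    # One-row frame with the same column names used during training.
    row = pd.DataFrame([values], columns=feature_names)
    if model.predict(row)[0] == 1:
        return "Good quality wine"
    return "Bad quality wine"

# Toy model on two made-up features, only to demonstrate the helper.
features = ["alcohol", "volatile acidity"]
X_small = pd.DataFrame(
    [[12.5, 0.30], [12.0, 0.35], [9.0, 0.80], [9.5, 0.70]], columns=features
)
Y_small = pd.Series([1, 1, 0, 0])
toy_model = RandomForestClassifier(n_estimators=25, random_state=2)
toy_model.fit(X_small, Y_small)

print(predict_quality(toy_model, features, (12.4, 0.32)))
print(predict_quality(toy_model, features, (9.1, 0.78)))
```

With the tutorial's objects, the call would be `predict_quality(model, X.columns, input_data)`.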

 
