Alcohol Quality Prediction using Random Forest Classification
This tutorial walks through wine quality prediction with a random forest classifier in Python.
The goal is to predict the quality of a wine from the given dataset.
The dataset has 12 columns, and the last column is the quality score of the wine. If the quality is greater than or equal to 7, we treat the wine as good.
If the value is less than or equal to 6, we treat it as not good.
You can download the data set here: winequality-red
So let's implement the code.
Import the necessary dependencies.
# Import the dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # used for data visualization
from sklearn.model_selection import train_test_split
# The random forest classifier itself
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score  # measures how well the model performs
Load the data set.
# Data collection
# Load the data set into a pandas DataFrame
wine_dataset = pd.read_csv("C:\\Users\\users drive\\OneDrive\\Desktop\\winequality-red.csv")
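Depending on where the file was downloaded from, the fields may be separated by semicolons rather than commas, in which case a plain read_csv call would put everything into one column. A minimal sketch, assuming that situation, that asks pandas to infer the delimiter; the in-memory sample with made-up rows stands in for the real file:

```python
import io
import pandas as pd

# Tiny stand-in for the real file (illustrative rows, semicolon-delimited).
sample_csv = io.StringIO(
    "fixed acidity;volatile acidity;quality\n"
    "7.4;0.70;5\n"
    "7.8;0.88;5\n"
    "7.3;0.65;7\n"
)

# sep=None together with the python engine lets pandas sniff the delimiter.
sample_df = pd.read_csv(sample_csv, sep=None, engine="python")
print(sample_df.shape)
print(list(sample_df.columns))
```

If the shape printed for your own file shows a single column, the delimiter is the first thing to check.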
Check how many rows and columns the dataset has.
# Check the number of rows and columns in the data set
wine_dataset.shape
OUTPUT:-
(1599, 12)
To check whether the data set loaded correctly, print its first five rows and its last five rows.
# First five rows of the data set
wine_dataset.head()
OUTPUT:-
Check the last five rows.
# Last five rows of the data set
wine_dataset.tail()
OUTPUT:-
Check for missing values in the data.
wine_dataset.isnull().sum()
OUTPUT:-
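If the check ever reports missing values, one common option is to fill them with the column median. The sketch below uses a made-up frame, and the choice of median imputation is an assumption for illustration, not a step from the original tutorial:

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap in each column.
toy = pd.DataFrame({
    "alcohol": [9.4, np.nan, 10.0, 11.2],
    "pH": [3.51, 3.20, np.nan, 3.16],
})

# Same check as in the tutorial: count missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Fill each gap with the median of its own column.
toy_filled = toy.fillna(toy.median(numeric_only=True))
print(toy_filled.isnull().sum().sum())
```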
Generate summary statistics for the data, such as the mean, standard deviation, and maximum.
wine_dataset.describe()
# Returns the count, mean, standard deviation, min, percentiles, and max of each column
# These values help us see the range of values in each column
OUTPUT:-
To see the number of rows for each quality value, we use a seaborn count plot.
sns.catplot(x='quality', data=wine_dataset, kind='count')
# Each bar is one of the quality values in the data set
OUTPUT:-
The above figure represents count vs quality.
Now we can compare the volatile acidity and the quality columns.
plot = plt.figure(figsize=(5, 5))
sns.barplot(x='quality', y='volatile acidity', data=wine_dataset)
# Volatile acidity and quality are inversely related
OUTPUT:-
Higher volatile acidity tends to go with lower quality.
Citric acid vs quality
sns.barplot(x='quality', y='citric acid', data=wine_dataset)
# Wines with more citric acid tend to have higher quality; wines with less tend to have lower quality
OUTPUT:-
This is the advantage of the data analysis step: it helps us understand which columns are most related to our label. Volatile acidity is negatively related to quality, while citric acid is positively related.
Chlorides vs quality
sns.barplot(x='quality',y='chlorides',data=wine_dataset)
OUTPUT:-
Alcohol vs quality
sns.barplot(x='quality', y='alcohol', data=wine_dataset)
# Alcohol and quality are positively related
OUTPUT:-
Correlation Values
# Find the correlation between every pair of columns, including the quality column
# A correlation can be positive or negative
correlation = wine_dataset.corr()
Construction of heat map:-
# Construct a heat map to understand the correlation between the columns
plt.figure(figsize=(10, 10))
sns.heatmap(correlation, cbar=True, square=True, fmt='.2f', annot=True, annot_kws={'size': 8}, cmap='Greens')
OUTPUT:-
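A heat map of every column pair can be hard to read at a glance. As a complement, the sketch below ranks the features by their correlation with the label only. The frame is synthetic: the column names are borrowed from the wine data, but the values are generated here purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
quality = rng.integers(3, 9, size=n)  # synthetic scores from 3 to 8

# Synthetic columns: one rises with quality, one falls, one is unrelated.
toy = pd.DataFrame({
    "alcohol": quality + rng.normal(0, 1.0, n),
    "volatile acidity": -quality + rng.normal(0, 1.0, n),
    "noise": rng.normal(0, 1.0, n),
    "quality": quality,
})

# Keep only the 'quality' column of the correlation matrix, then sort it.
ranked = toy.corr()["quality"].drop("quality").sort_values()
print(ranked)
```

The most negative values appear first and the most positive last, which makes the strongest relationships easy to spot.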
Separation of columns:-
# Separate the features from the label,
# because quality is the column we want to predict
X = wine_dataset.drop('quality', axis=1)
print(X)
OUTPUT:-
Before storing the label we need to binarize it: if the quality is >= 7 the wine is good (1), otherwise (<= 6) it is bad (0).
# Store the quality column in a separate variable
# Label binarization: 1 if quality >= 7 (good), else 0 (bad)
Y = wine_dataset['quality'].apply(lambda y_value: 1 if y_value >= 7 else 0)
print(Y)  # the labels are now 0 and 1
OUTPUT:-
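After binarization it is worth checking how many rows fall into each class, because a lopsided label changes how an accuracy score should be read. A small sketch on made-up scores (not taken from the dataset):

```python
import pandas as pd

# Made-up quality scores for illustration.
scores = pd.Series([5, 6, 7, 5, 6, 8, 5, 6, 6, 5], name="quality")

# Same rule as in the tutorial: 1 if quality >= 7, else 0.
labels = scores.apply(lambda q: 1 if q >= 7 else 0)

# Share of rows in each class.
class_share = labels.value_counts(normalize=True)
print(class_share)
```

On the real data the same call would be Y.value_counts(normalize=True).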
Split the data into train and test data.
# train_test_split returns four variables
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
print(Y.shape, Y_train.shape, Y_test.shape)  # check how many values are in each split
OUTPUT:-
(1599,) (1279,) (320,)
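train_test_split also accepts a stratify argument that keeps the class proportions the same in the train and test sets. The split above does not use it; the sketch below shows the option on synthetic labels, as one possible variation rather than the tutorial's method:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 20% of them in the positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy preserves the 20/80 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```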
MODEL TRAINING
model=RandomForestClassifier()
model.fit(X_train, Y_train)  # fit the model; Y_train contains the binarized quality labels
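A random forest is built with random sampling, so two runs can give slightly different models unless random_state is fixed. A fitted forest also exposes feature_importances_, which indicates how much each column contributed. The sketch below shows both on synthetic data; the column names are reused from the wine data only for illustration, and the label rule is invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic features; the label depends on 'alcohol' only.
X_toy = pd.DataFrame({
    "alcohol": rng.normal(10, 1, n),
    "pH": rng.normal(3.3, 0.15, n),
})
y_toy = (X_toy["alcohol"] > 10.5).astype(int)

# random_state makes the fitted forest reproducible.
forest = RandomForestClassifier(n_estimators=100, random_state=2)
forest.fit(X_toy, y_toy)

# Importances sum to 1; larger means the column mattered more.
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```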
ACCURACY SCORES
# Evaluate the model on the test data
# Accuracy on test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # arguments are (true labels, predictions)
print('Accuracy : ',test_data_accuracy)
OUTPUT:-
Accuracy : 0.921875
This means the model classifies about 92% of the test samples correctly, which is a good result. The exact figure can vary between runs because the forest is built with random sampling and no random_state is set on the model.
Next we predict one sample at a time. If we pass the whole dataset at once, the model still produces output, but the user cannot tell which row produced which prediction, so we take one row at a time and generate its output.
Take some sample values from the data set.
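Accuracy on its own can look high when one class dominates the data. Two quick checks are a majority-class baseline (always predicting the most common label) and a confusion matrix. The sketch below uses made-up labels and predictions, not the tutorial's results:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions: 8 bad wines (0), 2 good wines (1).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

accuracy = accuracy_score(y_true, y_pred)

# Baseline: always predict the majority class (0).
baseline = accuracy_score(y_true, np.zeros_like(y_true))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(accuracy, baseline)
print(cm)
```

In this toy case the classifier is no more accurate than the baseline, even though it finds one of the two good wines, which is why the confusion matrix is worth reading alongside the score.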
# Building a predictive system
input_data = (7.3, 0.65, 0.0, 1.2, 0.065, 15.0, 21.0, 0.9946, 3.39, 0.47, 10.0)
# Change the input data to a NumPy array
input_data_as_numpy_array = np.asarray(input_data)
# Reshape the data because we are predicting the label for only one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)
prediction = model.predict(input_data_reshaped)
print(prediction)
if prediction[0] == 1:
    print('Good quality wine')
else:
    print('Bad quality wine')
OUTPUT:-
[1]
Good quality wine
# Second input
input_data = (7.5, 0.5, 0.36, 6.1, 0.071, 17.0, 102.0, 0.9978, 3.35, 0.8, 10.5)
# Change the input data to a NumPy array
input_data_as_numpy_array = np.asarray(input_data)
# Reshape the data because we are predicting the label for only one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)
prediction = model.predict(input_data_reshaped)
print(prediction)
if prediction[0] == 1:
    print('Good quality wine')
else:
    print('Bad quality wine')
OUTPUT:-
[0]
Bad quality wine
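The two prediction blocks repeat the same steps, so they could be wrapped in a helper. The sketch below uses a hypothetical function name, predict_quality, and passes the sample as a one-row DataFrame with the training column names, since recent scikit-learn versions warn when a model fitted on named columns receives a bare array. The model here is trained on synthetic two-column data so the example runs by itself; with the tutorial's model you would pass model and list(X.columns) instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def predict_quality(model, feature_names, values):
    """Return a readable label for one wine sample (hypothetical helper)."""
    # One-row DataFrame so the column names match those seen during fit.
    row = pd.DataFrame([values], columns=feature_names)
    if model.predict(row)[0] == 1:
        return "Good quality wine"
    return "Bad quality wine"


# Synthetic training data: wines with alcohol above 11 are labeled good.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "alcohol": rng.uniform(8, 14, 200),
    "volatile acidity": rng.uniform(0.1, 1.2, 200),
})
y_toy = (X_toy["alcohol"] > 11).astype(int)
toy_model = RandomForestClassifier(random_state=2).fit(X_toy, y_toy)

names = list(X_toy.columns)
print(predict_quality(toy_model, names, (13.0, 0.3)))
print(predict_quality(toy_model, names, (9.0, 0.9)))
```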