Learning to classify wines using scikit-learn

Hello everybody! In this post, we are going to learn how to classify wines on the basis of various features using scikit-learn in Python.

By the end of this post, you will have a working wine classifier. Let’s begin by introducing the project.

Introduction

Numerous wines are available, including dessert wines, sparkling wines, appetizer wines, pop wines, table wines, and vintage wines. You may wonder how to tell which wine is good and which is not. Machine learning can help answer that question by predicting a wine's quality from its measurable properties.

Numerous methods for the classification of the wines are available. Some of them are as follows:

  1. CART
  2. Logistic Regression
  3. Random forest
  4. Naïve Bayes
  5. Perceptron
  6. SVM
  7. KNN
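
For orientation, each method in the list above has a counterpart in scikit-learn. The sketch below maps the names to estimator classes and fits each one on synthetic data. The mapping (for example, GaussianNB for Naïve Bayes), the hyperparameters, and the data are assumptions for illustration, not choices made in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the wine dataset).
X, y = make_classification(
    n_samples=200, n_features=11, n_informative=5, random_state=0
)

# One scikit-learn estimator per method named in the list.
classifiers = {
    "CART": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Naive Bayes": GaussianNB(),
    "Perceptron": Perceptron(random_state=0),
    "SVM": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Fit every model and record its accuracy on the data it was trained on.
train_scores = {}
for name, model in classifiers.items():
    model.fit(X, y)
    train_scores[name] = model.score(X, y)
    print(f"{name}: training accuracy = {train_scores[name]:.2f}")
```

Training accuracy is only a sanity check that each model fits; a fair comparison needs held-out test data.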

Building the project involves several steps, which are shown in the flowchart below:

Step 1: Importing Modules

The code below displays all the modules I used in the classification process:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import normalize

Step 2: Dataset Preparation

Data Description

The dataset has 6497 observations and 12 features in total. No variable contains NaN values. The names and descriptions of the 12 features are as follows:

  1. Fixed acidity: Amount of acidity in the wine
  2. Volatile acidity: Amount of acetic acid present in the wine
  3. Citric acid: Amount of citric acid present in the wine
  4. Residual sugar: Amount of sugar after fermentation
  5. Chlorides: Amount of salts present in the wine
  6. Free sulfur dioxide: Amount of free form of SO2
  7. Total sulfur dioxide: Amount of free and bound forms of SO2
  8. Density: Density of the wine (mass/volume)
  9. pH: pH of the wine, measured on a scale from 0 to 14
  10. Sulphates: Amount of sulphates, an additive that can contribute to sulfur dioxide (SO2) levels in the wine
  11. Alcohol: Amount of alcohol present in the wine
  12. Quality: Final quality score of the wine (the target variable)

You can download the data easily here.

Data Loading and initial cleaning

The CSV file is loaded with pandas, as shown below:

data=pd.read_csv("./wine_dataset.csv")
data.head()

The first 5 observations of the initial dataset are shown below:

Next, the data is cleaned in two steps: dropping unnecessary columns and dropping NaN values (if any). The code is:

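# Drop the leftover index column written into the CSV
data = data.drop('Unnamed: 0', axis=1)
# dropna() returns a new DataFrame, so assign the result back
data = data.dropna()
data.head()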
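
One detail of the cleaning step is worth spelling out: DataFrame.dropna returns a new frame by default and leaves the original untouched, so its result has to be assigned to take effect. A minimal sketch with made-up values (the column names and numbers are illustrative only):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (hypothetical values) with one missing entry.
df = pd.DataFrame({"pH": [3.2, np.nan, 3.4], "quality": [5, 6, 7]})

# Count missing values per column before cleaning.
missing_per_column = df.isna().sum()

df.dropna()                      # result is discarded; df is unchanged
rows_without_assignment = len(df)

cleaned = df.dropna()            # keep the result by assigning it
rows_after_assignment = len(cleaned)

print(missing_per_column)
print(rows_without_assignment, rows_after_assignment)
```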

Now the first 5 observations of the cleaned dataset are as follows:

Data Visualization

It is important to visualize the data before processing it any further. I have done the visualization in two forms: histograms and a seaborn heatmap.

The code for the histograms is shown below:

plt.style.use('dark_background')
colors=['blue','green','red','cyan','magenta','yellow','blue','green','red','magenta','cyan','yellow']
plt.figure(figsize=(20,50))
for i in range(1,13):
    plt.subplot(6,6,i)
    plt.hist(data[data.columns[i-1]],color=colors[i-1])
    plt.xlabel(data.columns[i-1])
plt.show()

The output is a separate histogram for each column of the dataset, showing how many wines fall into each range of values.

The code for the second type of plot, a seaborn correlation heatmap, is:

import seaborn as sns
plt.figure(figsize=(10,10))
correlations = data[data.columns].corr(method='pearson')
sns.heatmap(correlations, annot = True)
plt.show()

The heatmap shows the pairwise Pearson correlation between the features in the dataset. The seaborn visualization is shown below:
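
The same correlation matrix can also be read numerically, for instance by ranking the features by the strength of their correlation with the target. The sketch below uses a synthetic frame in which the quality column is deliberately tied to alcohol; the data and the relationship are assumptions for illustration, not findings about the wine dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
alcohol = rng.normal(10.5, 1.0, size=200)
toy = pd.DataFrame({
    "alcohol": alcohol,
    "pH": rng.normal(3.2, 0.15, size=200),
    # Hypothetical target loosely tied to alcohol, for illustration only.
    "quality": alcohol + rng.normal(0.0, 1.0, size=200),
})

correlations = toy.corr(method="pearson")

# Correlation of each feature with the target, strongest magnitude first.
with_quality = correlations["quality"].drop("quality")
order = with_quality.abs().sort_values(ascending=False).index
ranked = with_quality.reindex(order)
print(ranked)
```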

Final Data Preparation

Once the data has been visualized, the next task is to prepare it for modeling. This involves the following steps:

  1. Test-Train Splitting of the dataset
  2. Normalizing the data

Test-Train Splitting of the dataset

There is no single optimal percentage for splitting the data into training and testing sets. A common rule of thumb is the 80/20 split, where 80% of the data goes to training and the remaining 20% goes to testing.
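
The code in this post splits the rows by position. If the file happens to be ordered (for example, by wine type or by quality), a positional split can leave the test set unrepresentative. As an alternative, not the approach used here, scikit-learn's train_test_split shuffles the rows and can preserve the class proportions. A minimal sketch on synthetic data, where the toy frame, the random seed, and the stratification are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine table (assumption: a 'quality' label column).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, size=100),
    "pH": rng.normal(3.2, 0.15, size=100),
    "quality": rng.integers(5, 8, size=100),  # labels 5, 6, or 7
})

X = toy.drop("quality", axis=1)
y = toy["quality"]

# 80/20 split with shuffling; stratify keeps class proportions similar
# in the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```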

Normalizing Data

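Normalizing puts the values on comparable scales so that no feature dominates simply because of its units. Note that the normalize function from sklearn.preprocessing, used below, rescales each sample (row) to unit norm by default rather than rescaling each feature column.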
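
To make the distinction between row-wise and column-wise scaling concrete, the sketch below compares normalize with StandardScaler on a few made-up values. StandardScaler is shown as a common feature-wise alternative, not as the method used in this post, and the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Made-up values: two columns on very different scales.
X_train = np.array([[30.0, 3.0], [60.0, 3.4], [90.0, 3.2]])
X_test = np.array([[45.0, 3.1]])

# normalize() works row by row: each sample is scaled to unit L2 norm.
row_norms = np.linalg.norm(normalize(X_train), axis=1)

# StandardScaler works column by column. Fit it on the training data only,
# then reuse the same statistics to transform the test data.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(row_norms)                 # every entry is 1.0 up to rounding
print(X_train_std.mean(axis=0))  # every entry is approximately 0
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step.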

The whole code for the final data preparation is shown below:

split=int(0.8*data.shape[0])
print("Split of data is at: ",split)
print("\n-------AFTER SPLITTING-------")
train_data=data[:split]
test_data=data[split:]
print('Shape of train data:',train_data.shape)
print('Shape of test data:',test_data.shape)
print("\n----CREATING X AND Y TRAINING TESTING DATA----")
y_train=train_data['quality']
y_test=test_data['quality']
x_train=train_data.drop('quality',axis=1)
x_test=test_data.drop('quality',axis=1)
print('Shape of x train data:',x_train.shape)
print('Shape of y train data:',y_train.shape)
print('Shape of x test data:',x_test.shape)
print('Shape of y test data:',y_test.shape)

nor_train=normalize(x_train)
nor_test=normalize(x_test)

The output of the code is shown below:

Split of data is at:  5197

-------AFTER SPLITTING-------
Shape of train data: (5197, 12)
Shape of test data: (1300, 12)

----CREATING X AND Y TRAINING TESTING DATA----
Shape of x train data: (5197, 11)
Shape of y train data: (5197,)
Shape of x test data: (1300, 11)
Shape of y test data: (1300,)

Step 3: Classifying Wines

I classified the wines with two algorithms, SVM and Logistic Regression, fitting each model on the training set and evaluating it on the testing set.

The code for Support Vector Machine (SVM) and Logistic Regression is shown below:

# Support Vector Machine with a linear kernel
clf = svm.SVC(kernel='linear')
clf.fit(nor_train, y_train)
y_pred_svm = clf.predict(nor_test)
print("Accuracy (SVM) :",metrics.accuracy_score(y_test, y_pred_svm)*100)

# Logistic Regression; note that this prints a mean absolute error, not an accuracy
logmodel = LogisticRegression()
logmodel.fit(nor_train, y_train)
y_pred_LR= logmodel.predict(nor_test)
print('Mean Absolute Error(Logistic Regression):', metrics.mean_absolute_error(y_test, y_pred_LR)*100)

The output is shown below:

Accuracy (SVM) : 50.30769230769231
Mean Absolute Error(Logistic Regression): 52.0

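The SVM reached an accuracy of around 50% on the testing set. The Logistic Regression figure is a mean absolute error scaled by 100, not an accuracy: a value of 52.0 means the predicted quality is off by 0.52 points on average, so the two numbers are not directly comparable. Other models, such as neural networks built with TensorFlow, may give a higher accuracy.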
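
Accuracy and mean absolute error measure different things, so it helps to compute them side by side for the same model. The sketch below does this for a Logistic Regression model on synthetic three-class data and also prints a confusion matrix. The data, the max_iter setting, and the resulting numbers are assumptions for illustration and say nothing about the wine results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the wine features (an assumption).
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)       # fraction of exact matches; higher is better
mae = mean_absolute_error(y_test, y_pred)  # average label distance; lower is better
cm = confusion_matrix(y_test, y_pred)      # rows: true class, columns: predicted class

print(f"Accuracy: {acc * 100:.1f}%")
print(f"Mean absolute error: {mae:.2f}")
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the total equals the accuracy.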

I hope you learned something from today’s post! The code for the same can be found here.

Stay tuned for a follow-up post on classification with TensorFlow, which aims for better accuracy!
