# Email Spam Classification using Scikit-Learn

Hello fellow learner! In this post, we will build a spam email classifier in Python using scikit-learn.

## Introduction to the Problem

Billions of spam emails are sent every day, and a large share of them are malicious. Don't spam emails get annoying? They sure do! Spam is more than an annoyance, though: some messages carry malware or phishing links, which makes them genuinely dangerous. So let's learn how to classify spam emails in this post.

The steps involved are as follows:

## Building the Classification Project

#### Step 1: Importing the Modules

First, we import all the required modules into our project. The code for the same is as follows:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import svm
from sklearn.model_selection import GridSearchCV
```

#### Step 2: Loading the Data

The next step is loading the data with the help of the pandas module imported earlier:

`data = pd.read_csv('./spam.csv')`

The dataset we use has 5572 email samples and 2 unique labels (spam and ham). After loading, we have to separate the data into training and testing sets.
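Before splitting, it helps to check the class balance of the labels. As a sketch (the column names `Label` and `EmailText` are assumed to match the loaded CSV, and the toy rows below are made up for illustration), `value_counts` shows how many ham and spam samples we have:

```python
import pandas as pd

# Toy frame shaped like the real dataset (illustrative rows only)
data = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham', 'ham'],
    'EmailText': [
        'See you at lunch?',
        'WINNER! Claim your free prize now',
        'Meeting moved to 3pm',
        'Can you send the report?',
    ],
})

# Count how many samples fall under each label
print(data['Label'].value_counts())
```

On the real dataset this reveals that ham heavily outnumbers spam, which is worth keeping in mind when judging accuracy later.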

#### Step 3: Testing and Training Data Preparation

The separation of data into training and testing data includes two steps:

1. Separating the x and y data (the email text and the label)
2. Splitting the x and y data into four different datasets, namely x_train, y_train, x_test, and y_test

The separation of data into x and y data is done in the following code:

```python
x_data = data['EmailText']
y_data = data['Label']
```

Next, we split the x and y data into four parts: x_train, y_train, x_test, and y_test, using an 80:20 train-test ratio. The code for the same is as follows:

```python
# data.shape is a tuple, so we take shape[0] (the number of rows)
split = int(0.8 * data.shape[0])
x_train = x_data[:split]
x_test = x_data[split:]
y_train = y_data[:split]
y_test = y_data[split:]
```
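Note that plain slicing keeps the rows in their original order. A common alternative is scikit-learn's `train_test_split`, which shuffles before splitting. A minimal sketch (the toy series below are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for x_data / y_data
x_data = pd.Series(['free prize now', 'see you soon', 'win cash', 'lunch today?', 'claim reward'])
y_data = pd.Series(['spam', 'ham', 'spam', 'ham', 'spam'])

# 80:20 split with shuffling; random_state makes it reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=42)

print(len(x_train), len(x_test))  # 4 1
```

Shuffling matters if the CSV happens to be ordered by label, since a plain slice could then put most of one class into the test set.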

#### Step 4: Extracting Features

The code to extract the necessary features is as follows:

```python
count_vector = CountVectorizer()
extracted_features = count_vector.fit_transform(x_train)
```

Here, CountVectorizer converts each email into a vector of word counts, which is what the classifier will actually learn from.
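To see what CountVectorizer produces, here is a tiny worked example (the two sample sentences are made up for illustration): `fit_transform` builds a vocabulary from the text and returns a sparse matrix of per-document word counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['free prize free', 'see you at lunch']

cv = CountVectorizer()
X = cv.fit_transform(docs)  # sparse document-term matrix

# Vocabulary maps each word to a column index (assigned alphabetically)
print(sorted(cv.vocabulary_))  # ['at', 'free', 'lunch', 'prize', 'see', 'you']

# Dense view: row 0 counts 'free' twice and 'prize' once
print(X.toarray())
```

Each row is one document and each column one vocabulary word, so the model never sees raw text, only these counts.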

#### Step 5: Building and Training the Model

The next step involves building and training the model on the dataset we created earlier. GridSearchCV tries every combination of the listed hyperparameters and keeps the best-scoring SVM. The code for the same is as follows:

```python
tuned_parameters = {'kernel': ['rbf', 'linear'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]}
model = GridSearchCV(svm.SVC(), tuned_parameters)
model.fit(extracted_features, y_train)

print("Model Trained Successfully!")
```
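After fitting, GridSearchCV exposes which hyperparameter combination won via `best_params_`. A small self-contained sketch (the toy texts, labels, and the reduced grid with `cv=2` are assumptions made so the example runs on a handful of samples):

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Toy training data for illustration
texts = ['win cash now', 'free prize claim', 'lunch at noon',
         'see you tomorrow', 'claim free cash', 'meeting at 3pm']
labels = ['spam', 'spam', 'ham', 'ham', 'spam', 'ham']

X = CountVectorizer().fit_transform(texts)

# Small grid; cv=2 because the toy set has only 3 samples per class
grid = GridSearchCV(svm.SVC(), {'kernel': ['rbf', 'linear'], 'C': [1, 10]}, cv=2)
grid.fit(X, labels)

print(grid.best_params_)  # the winning kernel/C combination
```

Inspecting `best_params_` (and `best_score_`) on the real data tells you which kernel, gamma, and C the search actually selected.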

#### Step 6: Calculating the Accuracy of the Model

The final step includes checking the overall accuracy of our model using the testing data. The code for the same is as follows:

`print("Accuracy of the model is: ",model.score(count_vector.transform(x_test),y_test)*100)`

The final accuracy we get is as follows:

`Accuracy of the model is:  98.7443946188341`

We achieved about 98.74% accuracy, which is pretty good. Congrats!
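As a side note, we imported MultinomialNB at the start but never used it. Multinomial Naive Bayes is a classic baseline for spam filtering on word counts, and swapping it in takes only a few lines. A minimal sketch (the toy emails and labels below are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data for illustration
train_texts = ['free prize now', 'claim your cash', 'lunch tomorrow?', 'see you at 5']
train_labels = ['spam', 'spam', 'ham', 'ham']

cv = CountVectorizer()
nb = MultinomialNB()
nb.fit(cv.fit_transform(train_texts), train_labels)

# Every word here appeared only in spam training emails
print(nb.predict(cv.transform(['claim your free prize'])))  # ['spam']
```

Naive Bayes trains much faster than a grid-searched SVM, so it is a good first model to try on the full dataset for comparison.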

## The Conclusion to the Project

We finally built a spam classification model and achieved high accuracy as well. Thank you for reading today's post!