Email Spam Classification using Scikit-Learn
Hello fellow learner! In this post, we will classify emails as spam or ham using scikit-learn in Python, starting from a dataset that we load with pandas.
Introduction to the Problem
Billions of spam emails are sent every day, and more than 90% of them are malicious. Don't spam emails get annoying? They sure do! Spam is more than a nuisance; it can also be genuinely dangerous.
Did you know? One out of every 3,000 emails contains malware. With that in mind, let's learn how to classify spam emails in this post.
The steps involved in building the classifier are as follows:
Building the Classification Project
Step 1: Importing the Modules
First, we import all the required modules into our project. The code for the same is as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import svm
from sklearn.model_selection import GridSearchCV
Step 2: Loading the Data
The next step involves loading the data with the help of the pandas module we imported earlier:
data = pd.read_csv('./spam.csv')
The dataset we use has 5572 email samples and 2 unique labels (spam and ham). After loading, we have to separate the data into training and testing sets. You can download the data here.
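Before splitting, it is worth taking a quick look at what we loaded. Here is a minimal sketch, assuming the CSV has the two columns Label and EmailText that we use throughout this post:

print(data.shape)                    # expected: (5572, 2)
print(data['Label'].value_counts())  # how many ham vs. spam samples
print(data.head())                   # peek at the first few rows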
Step 3: Preparing the Training and Testing Data
The separation of data into training and testing data includes two steps:
- Separating the x and y data (the email text and the label)
- Splitting the x and y data into four datasets, namely x_train, y_train, x_test, and y_test
The separation of data into x and y data is done in the following code:
x_data = data['EmailText']
y_data = data['Label']
Next, we split the x and y data into four parts: x_train, y_train, x_test, and y_test. The split follows the 80:20 rule, with 80% of the samples used for training and 20% for testing. The code for the same is as follows:
split = int(0.8 * data.shape[0])
x_train = x_data[:split]
x_test = x_data[split:]
y_train = y_data[:split]
y_test = y_data[split:]
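One thing to keep in mind: this slice-based split keeps the rows in their original order, so if the CSV happened to be sorted by label, the training and testing sets could end up unbalanced. As an alternative sketch, scikit-learn's train_test_split shuffles the data before splitting (random_state=0 here is just an arbitrary seed for reproducibility):

from sklearn.model_selection import train_test_split

# Shuffle the rows, then split 80:20 into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=0
)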
Step 4: Extracting Features
The code to extract the necessary features is as follows:
count_vector = CountVectorizer()
extracted_features = count_vector.fit_transform(x_train)
We made use of the CountVectorizer class for this. It builds a vocabulary from the training emails and turns each email into a vector of word counts, which is the numeric form the classifier needs.
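To get a feel for what CountVectorizer produces, here is a small illustrative sketch on two made-up sentences (on scikit-learn 1.0+ the vocabulary is listed with get_feature_names_out; older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["free prize click now", "meeting at noon tomorrow"]
cv = CountVectorizer()
counts = cv.fit_transform(toy)
print(cv.get_feature_names_out())  # the vocabulary learned from the toy data
print(counts.toarray())            # one row of word counts per sentence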
Step 5: Building and Training the Model
The next step involves building and training the model using the features we extracted earlier. The code for the same is as follows:
tuned_parameters = {
    'kernel': ['rbf', 'linear'],
    'gamma': [1e-3, 1e-4],
    'C': [1, 10, 100, 1000],
}
model = GridSearchCV(svm.SVC(), tuned_parameters)
model.fit(extracted_features, y_train)
print("Model Trained Successfully!")
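GridSearchCV tries every combination of the parameters above using cross-validation and keeps the best-performing one. After fitting, you can inspect what it picked:

print(model.best_params_)  # the winning kernel/gamma/C combination
print(model.best_score_)   # its mean cross-validated accuracy on the training data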
Step 6: Calculating the Accuracy of the Model
The final step is checking the overall accuracy of our model using the testing data. Note that we transform x_test with the CountVectorizer already fitted on the training data; refitting it would produce a different vocabulary than the one the model was trained on. The code for the same is as follows:
print("Accuracy of the model is: ",model.score(count_vector.transform(x_test),y_test)*100)
The final accuracy we get is as follows:
Accuracy of the model is: 98.7443946188341
We achieved 98.744% accuracy, which is pretty good. Congrats!
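With the trained model in hand, you can also classify a brand-new email. A quick sketch (the sample message below is invented for illustration; the prediction comes back as 'spam' or 'ham' since those are the values in our Label column):

sample = ["Congratulations! You have won a free prize. Click here to claim."]
features = count_vector.transform(sample)  # reuse the vectorizer fitted on the training data
print(model.predict(features))             # e.g. ['spam']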
Conclusion of the Project
We finally built a model and achieved high accuracy as well. Thank you for reading today's post!
Keep reading and building to learn more!
You can find the code for the project here.
Read More Classification Projects!
Learn Classification of clothing images using TensorFlow in Python
Learning to classify wines using scikit-learn
Classifying Radio Signals from Space using Keras in Python