Sentiment Analysis on Movie Reviews using Scikit-Learn

Hello learners! Today we will learn how to perform sentiment analysis on movie reviews using the Scikit-Learn library in Python. Scikit-Learn is a powerful machine-learning library that makes many common machine-learning tasks straightforward. In this tutorial, we will work on reviews of the movie ‘Vikram’, starring the famous Tamil actor Kamal Haasan.


Know your input dataset:

Here, we will use the IMDb review dataset for the movie Vikram: ‘Vikram_Train_Dataset.csv’ for training and ‘Vikram_Test_Dataset.csv’ for testing.

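The training dataset has a ‘Train Reviews’ column containing the review text and a ‘Sentiment’ column containing the label for each review.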

Steps to be followed (or procedure):

Here, we will use two modules from the Scikit-Learn library: feature_extraction.text, which provides CountVectorizer, and naive_bayes, which provides MultinomialNB.


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

Then we will import the training dataset:

import pandas as pd
train_dataset = pd.read_csv('Vikram_Train_Dataset.csv', encoding = 'unicode_escape')

Then we will extract train reviews from the dataset as:

train_reviews = train_dataset['Train Reviews'].tolist()

Next, we take the training targets (the sentiment labels) from the same dataset:

train_targets = train_dataset['Sentiment'].tolist()
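To see what these two lists look like without the CSV file at hand, here is a minimal sketch that reads a tiny in-memory stand-in. The three rows and the ‘positive’/‘negative’ label values are invented for illustration; only the column names come from this tutorial.

```python
import io
import pandas as pd

# Hypothetical stand-in for Vikram_Train_Dataset.csv (rows invented for illustration)
csv_text = """Train Reviews,Sentiment
A gripping action thriller,positive
Too long and too loud,negative
The lead performance is brilliant,positive
"""

sample_dataset = pd.read_csv(io.StringIO(csv_text))

# Same extraction steps as above, applied to the stand-in data
sample_reviews = sample_dataset['Train Reviews'].tolist()
sample_targets = sample_dataset['Sentiment'].tolist()

print(sample_reviews)
print(sample_targets)

# Checking the class balance is a useful habit before training
print(sample_dataset['Sentiment'].value_counts())
```

Both lists have one entry per row, and the entry at a given position in one list belongs to the entry at the same position in the other.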

Now, we will declare an instance ‘vectorizer1’ of the class CountVectorizer and apply fit_transform to the training reviews. This learns the vocabulary of the reviews and converts each review into a row of word counts:

vectorizer1 = CountVectorizer()
train_input = vectorizer1.fit_transform(train_reviews)

Now we create the instance ‘sentiment_model’ of MultinomialNB() and fit it on the vectorized training reviews and their targets:

sentiment_model = MultinomialNB()
sentiment_model.fit(train_input, train_targets)
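A model only becomes usable once fit has been called with the vectorized reviews and their labels. The sketch below shows the full fit-then-predict cycle on four invented reviews; the sentences, labels, and ‘toy_’ names are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration
toy_reviews = ["great film", "great acting", "boring film", "boring plot"]
toy_labels = ["positive", "positive", "negative", "negative"]

toy_vectorizer = CountVectorizer()
toy_input = toy_vectorizer.fit_transform(toy_reviews)

# Fit the Naive Bayes classifier on word counts and labels
toy_model = MultinomialNB()
toy_model.fit(toy_input, toy_labels)

# The labels the model has learned to choose between
print(toy_model.classes_)

# Predict the sentiment of an unseen review
toy_prediction = toy_model.predict(toy_vectorizer.transform(["a great film"]))
print(toy_prediction)
```

The word ‘great’ appears only in positive toy reviews, while ‘film’ appears in both classes, so the model leans towards ‘positive’ for the unseen review.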

In the next step, we will load the test dataset:

test_dataset = pd.read_csv('Vikram_Test_Dataset.csv', encoding = 'unicode_escape')

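Before predicting, we can check how many features the sentiment analysis model was trained on:

maximum = sentiment_model.n_features_in_

The test input must have exactly ‘maximum’ columns, with each column standing for the same word as in training. A second CountVectorizer fitted on the test reviews would learn a different vocabulary, so we reuse the already fitted ‘vectorizer1’ and call transform (not fit_transform) on the test reviews:

test_data = test_dataset['Test Reviews'].tolist()
test_input = vectorizer1.transform(test_data)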

Now we will predict the sentiments of the test dataset by writing:

sentiment_predicted = sentiment_model.predict(test_input)

Finally, we add the predictions to the Pandas data frame as a new column and print the table to see the results:

test_dataset['Predicted Sentiment Reviews'] = sentiment_predicted
print(test_dataset)
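The steps in this tutorial can also be chained with Scikit-Learn's make_pipeline, which keeps the vectorizer and the classifier together so that the test reviews always go through the same fitted vocabulary. This is an alternative arrangement, not the approach used above, and the toy data frames below are invented stand-ins for the two CSV files.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the training and test CSV files (rows invented for illustration)
toy_train = pd.DataFrame({
    'Train Reviews': ["loved every scene", "loved the music",
                      "hated the pacing", "hated every minute"],
    'Sentiment': ["positive", "positive", "negative", "negative"],
})
toy_test = pd.DataFrame({
    'Test Reviews': ["loved the pacing and the music",
                     "hated the music every minute"],
})

# Vectorizer and classifier are fitted together on the training reviews
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_train['Train Reviews'], toy_train['Sentiment'])

# predict() vectorizes the raw test text with the fitted vocabulary, then classifies it
toy_test['Predicted Sentiment Reviews'] = toy_pipeline.predict(toy_test['Test Reviews'])
print(toy_test)
```

With a pipeline there is a single object to fit and to call predict on, which removes the chance of vectorizing the test reviews differently from the training reviews.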



Hope you liked the simple tutorial. You can make wonderful projects by using this small but super powerful library called Scikit-Learn.

Till then, stay tuned for more such tutorials.
