Sentiment analysis using Keras in Python

Hey folks! In this blog, let us learn about “Sentiment analysis using Keras” along with a little NLP. We will build a sentiment analysis model that classifies a given review as positive, negative, or neutral.

To start with, let us import the necessary Python libraries and the data. We can download the Amazon review data from https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set

import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense

import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("C:/Users/username/Downloads/sentiment labelled sentences/amazon_cells_labelled.csv")
df.head(2)

 

Let us see what the data looks like.

   Review                                              Sentiment         Sentiment 1  Unnamed: 3  Unnamed: 4  Unnamed: 5
0  So there is no way for me to plug it in here i...  0                 NaN          NaN         NaN         NaN
1  Good case                                           Excellent value.  1            NaN         NaN         NaN

Here we can observe that the data is spread irregularly across the columns. Our goal now is to clean the data and separate the reviews and sentiments into two columns. Let us see how to do it!

 

Data preparation

Now let us combine the various sentiment values that are distributed across the unnamed columns. Let us use the “combine_first” function: it keeps the existing values in a column and fills its NaN entries with the values from another column. Because it returns a new column rather than modifying the original, the result must be assigned back. After that, let us drop the unnamed columns because the useful data has already been transferred to the “Sentiment 1” column.

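# combine_first returns a new Series, so assign the result back to the column
df['Sentiment 1'] = df['Sentiment 1'].combine_first(df['Unnamed: 3'])
df['Sentiment 1'] = df['Sentiment 1'].combine_first(df['Unnamed: 4'])
df['Sentiment 1'] = df['Sentiment 1'].combine_first(df['Unnamed: 5'])

# The unnamed columns are no longer needed
df = df.drop(columns=["Unnamed: 3", "Unnamed: 4", "Unnamed: 5"])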
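
To make the behaviour of combine_first concrete, here is a minimal sketch on two made-up columns. The names primary and fallback and their values are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns, not taken from the review data
primary = pd.Series([1.0, np.nan, np.nan, 0.0])
fallback = pd.Series([np.nan, 1.0, np.nan, 1.0])

# combine_first keeps the values of `primary` and fills only its NaN slots
# with the value found at the same position in `fallback`
combined = primary.combine_first(fallback)

# A new Series is returned; `primary` itself still has its two NaN entries
print(combined.tolist())
print(primary.isna().sum())
```

Note that the last entry stays 0.0 even though the fallback holds 1.0 there, and that a position that is NaN in both columns stays NaN.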

 

 

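Now let us append the review text that spilled over into the “Sentiment” column to the “Review” column. Then let us collect all the sentiment values in the “Sentiment 1” column. We use combine_first() again: it fills only the NaN entries of “Sentiment 1” with values from “Sentiment”. Any stray strings that get copied over this way are removed in a later step.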

df["Review"] = df['Review'] + df['Sentiment'] 

df["Sentiment 1"] = df['Sentiment 1'].combine_first(df['Sentiment'])

df.head(2)

The output will look like this:

   Review                                              Sentiment         Sentiment 1
0  So there is no way for me to plug it in here i...  0                 0
1  Good case Excellent value.                          Excellent value.  1

 

Now we have the sentiment labels in the “Sentiment 1” column and the corresponding reviews in the “Review” column, so let’s drop the remaining unwanted column and rename “Sentiment 1” to “Sentiment”.

df.drop(columns = "Sentiment", inplace = True)

df.rename(columns={"Sentiment 1": "Sentiment"},inplace = True)

df = df.dropna()

 

There might be some strings in the “Sentiment” column and there might be some numbers in the “Review” column. Let us write two functions to make our data suitable for processing.

Creating bag of words

Let us write the first function to eliminate the strings in the “Sentiment” column.

def Sentiment_process(sent):
    # Keep only the labels "0" and "1"; replace anything else with NaN
    cleaned = []
    for value in sent:
        if value != "0" and value != "1":
            cleaned.append(np.nan)  # marked for removal by dropna()
        else:
            cleaned.append(value)
    return cleaned

Explanation:

If a value in the “Sentiment” column is not a label (either 0 or 1), it is replaced with NaN so that it is easy to drop afterwards. If it is 0 or 1, the value is kept as is.

df["Sentiment"] = Sentiment_process(list(df["Sentiment"]))

df = df.dropna()

Now we only have numbers in the “Sentiment” column.
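
As an aside, the same filtering could probably be written without an explicit loop by using the pandas methods isin and where. The sketch below runs on a made-up Series (toy is hypothetical) and assumes, as the function above does, that the labels are stored as the strings "0" and "1".

```python
import pandas as pd

# Hypothetical sample of a messy label column
toy = pd.Series(["0", "1", "Excellent value.", "1"])

# where() keeps the entries for which the condition holds and sets the
# rest to NaN; dropna() then removes those rows
cleaned = toy.where(toy.isin(["0", "1"])).dropna()

print(cleaned.tolist())
```

On a DataFrame the mask would be applied to the “Sentiment” column and followed by df.dropna(), just as in the loop version.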

Let us write the second function to remove the special characters, stopwords and numbers from the “Review” column and turn each review into a bag of words. We will remove the numbers first, and then the stopwords such as “the” and “a”, which do not affect the sentiment.

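import string

import nltk
from nltk.corpus import stopwords

# Download the stopword list if it is not already available locally
nltk.download('stopwords', quiet=True)
# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def text_processing(text):
    # Drop punctuation and the leftover label digits 0 and 1
    nopunc = ''.join(char for char in text
                     if char not in string.punctuation and char not in "01")
    # Split into words and drop the stopwords
    return [word for word in nopunc.split() if word.lower() not in stop_words]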

Let us call the above function. We will first remove the numbers and then apply the text processing.

df["Review"] = df['Review'].str.replace(r'\d+', '', regex=True)
df["BagOfWords"] = df["Review"].apply(text_processing)
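
To see the whole cleaning step on a single sentence without downloading anything, here is a small self-contained sketch. The function name to_bag_of_words and the tiny STOP_WORDS set are hypothetical stand-ins for the function above and for the NLTK English stopword list, so the real output on the dataset may differ.

```python
import re
import string

# Tiny hypothetical stopword set standing in for NLTK's English list
STOP_WORDS = {"the", "a", "and", "was", "is", "it", "very"}

def to_bag_of_words(text):
    # Remove every run of digits
    text = re.sub(r"\d+", "", text)
    # Remove all punctuation characters in one pass
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and drop stopwords, ignoring case
    return [word for word in text.split() if word.lower() not in STOP_WORDS]

print(to_bag_of_words("The battery lasted 3 days, and it was great!"))
# ['battery', 'lasted', 'days', 'great']
```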

Now let us see what the data looks like:

df.loc[51:53]

Output:

                               Review	                     Sentiment	                BagOfWords
51	good protection and does not make phone too bu...	1	[good, protection, make, phone, bulky]
52	A usable keyboard actually turns a PDA into a ...	1	[usable, keyboard, actually, turns, PDA, realw...

 

Building the model

Let us define x and y to fit into the model and do the train and test split.

x = df["BagOfWords"]
df["Sentiment"] = df["Sentiment"].astype(str).astype(int)
y = df["Sentiment"]


from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)

 

Now let us tokenize the words. That is, we are going to convert the words into numbers so that they can be fed into the model.

We will keep only the 5000 most frequent words. The tokenizer is fitted on X_train only; we then convert the words in X_train into their corresponding indices and store the result back in X_train. Similarly, we will tokenize the X_test values.

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000)

tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
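
For intuition, the word-to-index mapping can be approximated in a few lines of plain Python. This is a simplified sketch of the idea, not the actual Keras implementation, and the helper names fit_word_index and to_sequences are hypothetical.

```python
from collections import Counter

def fit_word_index(token_lists):
    # Count every word in the training reviews, ignoring case
    counts = Counter(word.lower() for tokens in token_lists for word in tokens)
    # More frequent words get smaller indices; numbering starts at 1
    # so that 0 stays free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(token_lists, word_index, num_words):
    # Replace each known word by its index; unknown words and words whose
    # index is not below num_words are silently dropped
    return [[word_index[w.lower()] for w in tokens
             if w.lower() in word_index and word_index[w.lower()] < num_words]
            for tokens in token_lists]

train = [["good", "phone"], ["bad", "phone"], ["good", "case"]]
word_index = fit_word_index(train)
sequences = to_sequences([["good", "phone", "unknown"]], word_index, num_words=5000)
print(word_index)
print(sequences)
```

Because the index is built from the training reviews only, a word that appears only in the test reviews is simply skipped, which is why the third token above does not show up in the sequence.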

Let us make all the reviews equal in length. If a review is shorter than the desired length, it will be padded with zeros at the end. If a review is longer than the desired length, it will be cut short.

from keras.preprocessing import sequence

maxlen = 50
# Making the train and test statements to be of size 50 by truncating or padding accordingly

X_train = sequence.pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, padding='post', maxlen=maxlen)
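
What the padding step does to one sequence can be sketched in plain Python. The helper pad_post is hypothetical and only mirrors the settings used above: padding='post' together with the default truncating='pre', which keeps the last maxlen items of a sequence that is too long.

```python
def pad_post(seq, maxlen, value=0):
    # Too long: keep only the last `maxlen` items (truncate at the front)
    seq = list(seq)[-maxlen:]
    # Too short: append the padding value until the length is `maxlen`
    return seq + [value] * (maxlen - len(seq))

print(pad_post([7, 2, 9], maxlen=5))           # [7, 2, 9, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```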

Now let us build the Keras model.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D

model = Sequential([Embedding(10000, 17), 
                   GlobalAveragePooling1D(),
                   Dense(17,activation = "relu"),
                   Dense(12,activation = "relu"),
                   Dense(1,activation = "sigmoid")])

model.compile(
    loss = "binary_crossentropy",
    optimizer =  "adam",
    metrics = ["accuracy"])

model.summary()
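
To get a feel for what the first two layers compute, here is a rough NumPy sketch. The sizes are toy values and the random matrix merely stands in for the embedding weights that the real model learns, so this illustrates the shapes involved rather than reproducing the Keras layers.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 10, 4                 # toy sizes, not the ones used above
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))  # stand-in for learned weights

batch = np.array([[3, 7, 1, 0, 0]])           # one padded review of length 5

# Embedding: look up one vector per word index
embedded = embedding_matrix[batch]            # shape (1, 5, 4)

# Global average pooling: average the word vectors over the sequence axis,
# which yields one fixed-size vector per review
pooled = embedded.mean(axis=1)                # shape (1, 4)

print(embedded.shape, pooled.shape)
```

The pooled vector is what the Dense layers receive. In this sketch the padding positions (index 0) are averaged in together with the real words.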

Training and evaluation

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, verbose = 1)
loss, accuracy = model.evaluate(X_test, y_test)
print("Accuracy is : ",accuracy*100)

Output:

Accuracy is :  85.77847814559937

We see that we have achieved a good accuracy of about 85.8% on the test data.
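
For a sigmoid output, accuracy amounts to rounding each predicted probability at 0.5 and counting how often the result matches the label. The sketch below shows that arithmetic on made-up numbers; the arrays probs and labels are hypothetical and are not outputs of the model above.

```python
import numpy as np

probs = np.array([0.91, 0.12, 0.55, 0.40])   # made-up sigmoid outputs
labels = np.array([1, 0, 0, 0])              # made-up true sentiments

# A probability above 0.5 counts as a positive prediction
predicted = (probs > 0.5).astype(int)

# Accuracy is the fraction of predictions that match the labels
accuracy = (predicted == labels).mean()

print("Accuracy is : ", accuracy * 100)
```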

Now let us test it with a review.

sample = "The product was very good and satisfying."
sample = text_processing(sample)
sample

Output:

['product', 'good', 'satisfying']

 

Let us perform all the preprocessing required.

sample = tokenizer.texts_to_sequences(sample)
sample

simple_list = []
for sublist in sample:
    for item in sublist:
        simple_list.append(item)
simple_list = [simple_list]
sample_review = sequence.pad_sequences(simple_list, padding='post', maxlen=maxlen)

Because texts_to_sequences treats each word in the list as a separate text, the result is a list of sublists, one per word. We have flattened it into a single list so that the whole review is passed to the model as one sequence.
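
The flattening loop can also be written with itertools.chain from the standard library. In this sketch the nested list holds hypothetical indices, not the ones the fitted tokenizer would produce.

```python
from itertools import chain

# Hypothetical tokenizer output: one sublist per word of the review
nested = [[12], [3], [45]]

# Join all sublists into one flat sequence
flat = list(chain.from_iterable(nested))

# Wrap it again so that the model receives a batch containing one review
batch = [flat]

print(batch)
```

Wrapping the word list in an outer list before calling texts_to_sequences would likely produce the single sequence directly, since the tokenizer would then treat the whole list as one review.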

 

ans = model.predict(sample_review)
ans

Output:

array([[0.8325547]], dtype=float32)

 

Let us see if this is positive or negative.

score = float(ans[0][0])  # extract the single probability from the (1, 1) array
if score > 0.6:
    print("The review is positive")
elif score < 0.4:
    print("The review is negative")
else:
    print("The review is not too good nor too bad")

Output:

The review is positive
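
If the thresholding is needed in more than one place, it can be wrapped in a small helper. The function name label_from_score is hypothetical, and the cut-offs 0.4 and 0.6 are simply the ones used above rather than tuned values.

```python
def label_from_score(score, low=0.4, high=0.6):
    # Scores above `high` are positive, below `low` negative,
    # and anything in between is treated as neutral
    if score > high:
        return "positive"
    if score < low:
        return "negative"
    return "neutral"

print(label_from_score(0.8325547))  # positive
print(label_from_score(0.5))        # neutral
print(label_from_score(0.1))        # negative
```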

 

Hurray! We have predicted the sentiment of a given review. That is all about “Sentiment analysis using Keras”. We have learnt how to process the data properly and feed it into the model to predict the sentiment and get good results.

THANK YOU
