Sentiment analysis using Keras in Python
Hey folks! In this blog, let us learn about “Sentiment analysis using Keras” along with a little NLP. We will learn how to build a sentiment analysis model that can classify a given review as positive, negative, or neutral.
To start with, let us import the necessary Python libraries and the data. We can download the Amazon review data from https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("C:/Users/username/Downloads/sentiment labelled sentences/amazon_cells_labelled.csv")
df.head(2)
Let us see what the data looks like.
   Review  Sentiment  Sentiment 1  Unnamed: 3  Unnamed: 4  Unnamed: 5
0  So there is no way for me to plug it in here i...  0  NaN  NaN  NaN  NaN
1  Good case  Excellent value.1  NaN  NaN  NaN  NaN
Here we can observe that the data is spread irregularly across the columns. Our goal now is to clean the data and separate the reviews and the sentiments into two columns. Let us see how to do it!
Data preparation
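Now let us combine the sentiment values that are scattered across the unnamed columns. Let us use the “combine_first” function: it fills the NaN entries of one column with the values from another column and leaves the existing values untouched. Since combine_first returns a new Series instead of modifying the column in place, we assign the result back. Then let us drop the unnamed columns, because the useful data has already been transferred to the “Sentiment 1” column.
# Fill missing labels in "Sentiment 1" from each unnamed column in turn
df["Sentiment 1"] = df["Sentiment 1"].combine_first(df["Unnamed: 3"])
df["Sentiment 1"] = df["Sentiment 1"].combine_first(df["Unnamed: 4"])
df["Sentiment 1"] = df["Sentiment 1"].combine_first(df["Unnamed: 5"])
df = df.drop(columns=["Unnamed: 3", "Unnamed: 4", "Unnamed: 5"])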
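To make the behaviour of combine_first concrete, here is a minimal sketch on two toy Series. The names `primary` and `spill` and their values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy columns (hypothetical data): `primary` has gaps, `spill` holds some of the missing labels.
primary = pd.Series([1.0, np.nan, np.nan, 0.0])
spill = pd.Series([np.nan, 0.0, np.nan, 1.0])

# combine_first keeps the values already present in `primary` and fills only
# its NaN entries from `spill`. It returns a new Series; `primary` is unchanged.
combined = primary.combine_first(spill)
print(combined.tolist())
```

Note that the last entry stays 0.0: a value that is already present is never overwritten, and an entry missing in both Series stays NaN.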
Now let us append the review text that spilled into the “Sentiment” column to the “Review” column. After that, let us collect all the sentiment values in the “Sentiment 1” column. We use combine_first() again because it only fills the entries of “Sentiment 1” that are still NaN and keeps the labels that are already there.
df["Review"] = df['Review'] + df['Sentiment']
df["Sentiment 1"] = df['Sentiment 1'].combine_first(df['Sentiment'])
df.head(2)
The output looks like this:
   Review                                             Sentiment         Sentiment 1
0  So there is no way for me to plug it in here i...  0                 0
1  Good case Excellent value.                         Excellent value.  1
We now have the sentiment labels in the “Sentiment 1” column and the corresponding reviews in the “Review” column, so let’s drop the remaining unwanted column and rename “Sentiment 1” to “Sentiment”.
df.drop(columns="Sentiment", inplace=True)
df.rename(columns={"Sentiment 1": "Sentiment"}, inplace=True)
df = df.dropna()
There might still be some strings in the “Sentiment” column and some numbers in the “Review” column. Let us write two functions to make our data suitable for processing.
Creating bag of words
Let us write the first function to eliminate the strings in the “Sentiment” column.
def Sentiment_process(sent):
    # Keep the labels "0" and "1"; replace anything else with NaN so dropna() can remove it
    cleaned = []
    for value in sent:
        if value != "0" and value != "1":
            cleaned.append(np.nan)
        else:
            cleaned.append(value)
    return cleaned
Explanation:
If a value in the “Sentiment” column is not a label (either 0 or 1), it is replaced with NaN so that the row is easy to drop. If it is 0 or 1, the value is kept as it is.
df["Sentiment"] = Sentiment_process(list(df["Sentiment"]))
df = df.dropna()
Now we only have numbers in the “Sentiment” column.
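The same filtering can also be written without an explicit loop. The following is a minimal sketch on a toy DataFrame (the rows are invented for illustration), assuming the labels are stored as the strings "0" and "1"; it is an alternative to the function above, not the method used in the rest of this post.

```python
import pandas as pd

# Toy frame (hypothetical rows): labels arrive as strings, one of them is stray review text.
toy = pd.DataFrame({
    "Review": ["Great phone", "Bad battery", "Good case"],
    "Sentiment": ["1", "0", "Excellent value."],
})

# Keep only the rows whose label is exactly "0" or "1", then convert the labels to integers.
mask = toy["Sentiment"].isin(["0", "1"])
clean = toy[mask].copy()
clean["Sentiment"] = clean["Sentiment"].astype(int)
print(clean)
```

Series.isin builds a boolean mask in one step, so the invalid rows are filtered out directly instead of being marked with NaN first.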
Let us write the second function to remove the special characters, stopwords and numbers from the “Review” column and put the remaining words into a bag of words. We will remove the numbers first, and then the stopwords such as “the” and “a”, which do not affect the sentiment.
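import string

import nltk
from nltk.corpus import stopwords

# nltk.download("stopwords")  # run once if the stopword corpus is not installed yet
STOP_WORDS = set(stopwords.words('english'))  # build the lookup set once, not once per word

def text_processing(text):
    # Drop punctuation and the digits 0 and 1, character by character
    nopunc = [char for char in text
              if char not in string.punctuation and char not in ("0", "1")]
    nopunc = ''.join(nopunc)
    # Split into words and drop the English stopwords
    return [word for word in nopunc.split() if word.lower() not in STOP_WORDS]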
Let us call the above function. We will first remove the numbers and then apply the text processing.
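# regex=True is required in recent pandas versions for the pattern to be treated as a regular expression
df["Review"] = df['Review'].str.replace(r'\d+', '', regex=True)
df["BagOfWords"] = df["Review"].apply(text_processing)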
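To see what these cleaning steps do to a single sentence, here is a small self-contained sketch. The function name `clean_text` and the tiny stopword set are assumptions made for this example so that it runs without NLTK; the real pipeline above uses NLTK's full English stopword list.

```python
import re
import string

# Tiny stand-in stopword list (an assumption for this sketch, not NLTK's list).
STOPWORDS = {"the", "a", "and", "is", "it", "was", "very", "for"}

def clean_text(text):
    """Strip digits and punctuation, then drop stopwords."""
    text = re.sub(r"\d+", "", text)  # remove every run of digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return [word for word in text.split() if word.lower() not in STOPWORDS]

print(clean_text("The battery lasted 2 days, and it was great!"))
# ['battery', 'lasted', 'days', 'great']
```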
Now let us see what the data looks like:
df.loc[51:53]
Output:
    Review                                             Sentiment  BagOfWords
51  good protection and does not make phone too bu...  1          [good, protection, make, phone, bulky]
52  A usable keyboard actually turns a PDA into a ...  1          [usable, keyboard, actually, turns, PDA, realw...
Building the model
Let us define x and y to fit into the model and do the train and test split.
from sklearn.model_selection import train_test_split

x = df["BagOfWords"]
df["Sentiment"] = df["Sentiment"].astype(str).astype(int)
y = df["Sentiment"]

# Hold out 25% of the reviews for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
Now let us tokenize the words. That is, we are going to convert the words into numbers so that they can be fed into the model.
We will consider only the 5000 most frequent words. The tokenizer is fitted on X_train to build the word index; we then replace every word in X_train with its index and store the result back in X_train. We convert X_test in the same way, using the index learned from X_train.
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)  # build the word index from the training reviews only
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
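The idea behind this step can be sketched in plain Python. The helpers `build_word_index` and `to_sequences` below are hypothetical names written for this illustration; they only mimic the general behaviour of a frequency-ranked word index (indices start at 1, unseen words are dropped) and are not the Keras implementation.

```python
from collections import Counter

def build_word_index(token_lists, num_words=None):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(tok.lower() for toks in token_lists for tok in toks)
    ranked = [word for word, _ in counts.most_common()]
    index = {word: i for i, word in enumerate(ranked, start=1)}
    if num_words is not None:
        # Keep only the most frequent words, i.e. indices below num_words
        index = {word: i for word, i in index.items() if i < num_words}
    return index

def to_sequences(token_lists, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [[word_index[tok.lower()] for tok in toks if tok.lower() in word_index]
            for toks in token_lists]

train = [["good", "phone"], ["good", "case"], ["bad", "phone"]]
word_index = build_word_index(train)
print(word_index)
print(to_sequences([["good", "bad", "charger"]], word_index))
```

In the last call, “charger” never appeared in the training data, so it has no index and simply disappears from the sequence. This is why the index is built from X_train only and then reused for X_test.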
Let us make all the reviews equal in length. Reviews shorter than the chosen length are padded with zeros, and reviews longer than it are cut short.
from keras.preprocessing import sequence

maxlen = 50
# Make the train and test sequences exactly 50 items long by truncating or padding accordingly
X_train = sequence.pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, padding='post', maxlen=maxlen)
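As a rough illustration of what padding and truncating mean, here is a plain-Python sketch for a single sequence. The function `pad_post` is a hypothetical helper written for this example; it pads at the end with zeros and, when a sequence is too long, keeps the last `maxlen` items, which matches my understanding of the default truncating='pre' setting of pad_sequences.

```python
def pad_post(seq, maxlen, value=0):
    """Return a list of exactly `maxlen` items (maxlen must be positive)."""
    trimmed = seq[-maxlen:]  # too long: keep only the last `maxlen` items
    return trimmed + [value] * (maxlen - len(trimmed))  # too short: pad at the end

print(pad_post([5, 3, 9], maxlen=5))           # [5, 3, 9, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], maxlen=5))  # [2, 3, 4, 5, 6]
```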
Now let us build the Keras model.
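# Import from the public tensorflow.keras API rather than the private tensorflow.python path
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D

model = Sequential([
    Embedding(10000, 17),             # look up a 17-dimensional vector for each word index
    GlobalAveragePooling1D(),         # average the word vectors of a review into one vector
    Dense(17, activation="relu"),
    Dense(12, activation="relu"),
    Dense(1, activation="sigmoid"),   # output a score between 0 and 1
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()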
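To build some intuition for the first two layers, here is a NumPy sketch of an embedding lookup followed by average pooling. The random `embedding_table` is only a stand-in for the weights that the model learns during training, and the word indices in `batch` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, maxlen = 10000, 17, 50

# Random stand-in for the learned embedding weights (an assumption for this sketch).
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# Two padded reviews of length 50; index 0 is the padding value.
batch = np.zeros((2, maxlen), dtype=int)
batch[0, :3] = [12, 7, 431]
batch[1, :2] = [7, 99]

embedded = embedding_table[batch]  # one 17-dimensional vector per position
pooled = embedded.mean(axis=1)     # average over the 50 positions of each review
print(embedded.shape, pooled.shape)  # (2, 50, 17) (2, 17)
```

Because the average is taken over all 50 positions, the vector for the padding index 0 also contributes to the result in this sketch. After pooling, each review is a single fixed-size vector that the Dense layers can work with.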
Training and evaluation
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, verbose = 1)
loss, accuracy = model.evaluate(X_test, y_test)
print("Accuracy is : ", accuracy*100)
Output:
Accuracy is : 85.77847814559937
We see that the model reaches an accuracy of about 85.8% on the test set, which is a good result for such a small model.
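For a model with a single sigmoid output, accuracy is usually computed by thresholding the predicted score at 0.5 and counting how often the result matches the label. The sketch below shows that calculation with NumPy; the scores and labels are invented for illustration and are not outputs of the model above.

```python
import numpy as np

# Hypothetical predicted scores and true labels for four reviews.
probs = np.array([0.91, 0.12, 0.55, 0.40])
labels = np.array([1, 0, 0, 0])

preds = (probs > 0.5).astype(int)    # scores above 0.5 count as positive
accuracy = (preds == labels).mean()  # fraction of correct predictions
print(accuracy)  # 0.75
```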
Now let us test it with a review.
sample = "The product was very good and satisfying."
sample = text_processing(sample)
sample
Output:
['product', 'good', 'satisfying']
Let us perform all the preprocessing required.
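sample = tokenizer.texts_to_sequences(sample)
sample

# Flatten the per-word sublists into a single list of indices
simple_list = []
for sublist in sample:
    for item in sublist:
        simple_list.append(item)

# Wrap the list so that the model receives a batch containing one review
simple_list = [simple_list]
sample_review = sequence.pad_sequences(simple_list, padding='post', maxlen=maxlen)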
Since we pass a list of words, texts_to_sequences treats every word as a separate text and returns one sublist per word. We have flattened these sublists into a single list so that the model sees the review as one sequence and predicts the sentiment properly.
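The flattening step can also be written with the standard library. This is a minimal sketch with made-up indices in `per_word`; an empty sublist stands for a word that is missing from the tokenizer's vocabulary.

```python
from itertools import chain

# Hypothetical tokenizer output: one sublist per word, empty for an unknown word.
per_word = [[12], [3], [], [47]]

flat = list(chain.from_iterable(per_word))  # join all sublists into one list
print(flat)  # [12, 3, 47]

batch = [flat]  # a batch that contains a single review
```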
ans = model.predict(sample_review)
ans
Output:
array([[0.8325547]], dtype=float32)
Let us see if this is positive or negative.
score = float(ans[0][0])  # extract the single score from the (1, 1) array

if score > 0.6:
    print("The review is positive")
elif score < 0.4:
    print("The review is negative")
else:
    print("The review is neither too good nor too bad")
Output:
The review is positive
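If we want to label many reviews, the threshold logic can be wrapped in a small function. The name `label_from_score` and the label strings are assumptions made for this sketch; the cut-offs 0.4 and 0.6 are the ones used above.

```python
def label_from_score(score, low=0.4, high=0.6):
    """Map a sigmoid score to a coarse sentiment label."""
    if score > high:
        return "positive"
    if score < low:
        return "negative"
    return "neutral"  # scores from low to high fall in the middle band

print(label_from_score(0.8325547))  # positive
print(label_from_score(0.5))        # neutral
print(label_from_score(0.1))        # negative
```

The model itself is trained only on the labels 0 and 1, so the neutral band is a rule applied to the score afterwards and not a third class learned by the model.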
Hurray! We have predicted the sentiment of a sample review. That is all about “Sentiment analysis using Keras”. We have learnt how to clean and process the data, feed it into the model, and use the trained model to predict the sentiment of a new review.
THANK YOU