Building A Movie Review Classifier Using TensorFlow And Keras

In this article, we are going to explore the process of building a movie review classifier using TensorFlow and Keras in Python. For this task we will use the IMDB dataset. Our goal is to build a classifier with TensorFlow that categorizes movie reviews as positive (1) or negative (0).

We will use the version of the IMDB dataset built into the Keras API, which contains about 50,000 movie reviews.

Data Preparation

Let’s use the Keras API, which runs on top of TensorFlow. The IMDB dataset is already bundled with Keras, so it can be downloaded directly.

from tensorflow.keras.datasets import imdb

# Keep only the 10,000 most frequent words in the vocabulary
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = 10000)

OUTPUT

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 1s 0us/step


Also, let’s have a look at our training data.

print(x_train[0])

OUTPUT

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

Explanation: Here we can see that the training data is already in the form of integers. Each review has been tokenized: every word is replaced by an integer index based on how frequently that word occurs in the dataset, with lower indices corresponding to more frequent words. This step is necessary because the model can only train on numeric values, not raw text.
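
As a quick sanity check (a minimal sketch, not part of the original walkthrough), we can confirm that each review is a plain Python list of integer indices and that, because we passed num_words = 10000, no index exceeds 9999:

# Each review is a list of integer word indices; with num_words = 10000
# the largest index we expect to see is 9999
print(type(x_train[0]))
print(max(max(review) for review in x_train))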

print(y_train[0])

OUTPUT

1

Assigning class names

class_names = ['Negative', 'Positive']

# word_index maps each word in the dataset's vocabulary to its integer index
word_index = imdb.get_word_index()
print(word_index['hello'])

OUTPUT

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1646592/1641221 [==============================] - 0s 0us/step
4822

Decoding The Reviews

To read the reviews, we need to convert them back into strings. Since the text was tokenized, we will build a dictionary that reverses the key-value mapping of word_index, so that integers map back to words. The decoded reviews will then be in a human-readable form.

See the Python code below:

# Reverse the mapping so that integer indices map back to words
reverse_word_index = dict((value, key) for key, value in word_index.items())

def decode(review):
    # Look up each integer in the review and join the words with spaces
    text = ''
    for i in review:
        text += reverse_word_index[i]
        text += ' '
    return text

Now let’s decode the data.

decode(x_train[0])

OUTPUT

'the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room and it so heart shows to years of every never going and help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but and to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other and in of seen over landed for anyone of and br show's to whether from than out themselves history he name half some br of and odd was two most of mean for 1 any an boat she he should is thought frog but of script you not while history he heart to real at barrel but when from one bit then have two of script their with her nobody most that with wasn't to with armed acting watch an for with heartfelt film want an <'
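
Note that the decoded text reads oddly and starts with ‘the’. This is because load_data, by default, reserves the lowest indices for special tokens (padding, start-of-sequence, and out-of-vocabulary words) and offsets every word index by 3, while our simple decoder does not account for that. An offset-aware variant of the decoder (a sketch, assuming the default index_from = 3) would look like this:

# Offset-aware decoding (sketch): indices 0-2 are reserved for special tokens,
# so subtract the default offset of 3 before looking each word up
def decode_with_offset(review):
    return ' '.join(reverse_word_index.get(i - 3, '?') for i in review)

print(decode_with_offset(x_train[0]))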

def show_lengths():
    print('Length of 1st training example: ', len(x_train[0]))
    print('Length of 2nd training example: ',  len(x_train[1]))
    print('Length of 1st test example: ', len(x_test[0]))
    print('Length of 2nd test example: ',  len(x_test[1]))
    
show_lengths()

Let’s explore the lengths of the examples and notice that they differ from one another.

OUTPUT

Length of 1st training example:  218
Length of 2nd training example:  189
Length of 1st test example:  68
Length of 2nd test example:  260

Padding

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad (or truncate) every review to exactly 256 tokens, padding at the end
x_train = pad_sequences(x_train, padding = 'post', maxlen = 256)
x_test = pad_sequences(x_test, padding = 'post', maxlen = 256)

Now we apply padding to equalize the lengths of all the examples, specifying a maximum length of 256. Whether short or long, every example will now have the same length. Padding is important because the model must train on equal-sized sequences: shorter reviews are extended to the maximum length by appending padding tokens (zeros), and reviews longer than the specified length are truncated.
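
To make this behaviour concrete, here is a tiny illustrative example with a made-up maxlen of 5 (note that, by default, pad_sequences truncates over-long sequences from the beginning, i.e. truncating = 'pre'):

# Toy example: 'post' padding appends zeros, while truncation (by default)
# drops tokens from the start of sequences longer than maxlen
demo = pad_sequences([[1, 2, 3], [1, 2, 3, 4, 5, 6, 7]], padding = 'post', maxlen = 5)
print(demo)
# [[1 2 3 0 0]
#  [3 4 5 6 7]]

Re-running show_lengths() below confirms that every example now has a length of 256.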

show_lengths()

OUTPUT

Length of 1st training example:  256
Length of 2nd training example:  256
Length of 1st test example:  256
Length of 2nd test example:  256

Padding therefore equalized the lengths of the reviews, which makes training easier.

Training The Model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, GlobalAveragePooling1D

model = Sequential([
    Embedding(10000, 16),
    GlobalAveragePooling1D(),
    Dense(16, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])

model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = ['acc']
)

model.summary()

Let’s now understand the layers:

  • Embedding layer: This converts every word index into a 16-dimensional vector. During training, words with similar sentiment end up close to each other in this vector space, so positive words cluster together and negative words cluster together.
  • GlobalAveragePooling1D: This helps in reducing the dimensions. The embedding layer outputs one vector per token, which is a large amount of data, so the pooling layer averages those vectors into a single fixed-length vector per review (see the short sketch after this list).
  • Dense layers: The first dense layer uses a ReLU activation, and the final layer is a single neuron with a sigmoid activation. Since this is binary classification, that single output neuron is enough to categorize a review as positive or negative.
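
To see the dimensionality reduction in isolation, here is a small illustrative sketch (with made-up numbers, not part of the model above) showing how GlobalAveragePooling1D collapses a batch of embedded sequences from shape (batch, timesteps, features) to (batch, features):

import numpy as np
from tensorflow.keras.layers import GlobalAveragePooling1D

# A fake batch of 2 reviews, each 256 tokens long, embedded into 16 dimensions
dummy_embeddings = np.random.rand(2, 256, 16).astype('float32')
pooled = GlobalAveragePooling1D()(dummy_embeddings)
print(pooled.shape)  # (2, 16): one averaged 16-dimensional vector per review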

OUTPUT

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 16)          160000    
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0

Fitting The Model

from tensorflow.keras.callbacks import LambdaCallback

# Print the epoch number at the end of every epoch as a compact progress log
simple_logging = LambdaCallback(on_epoch_end = lambda e, l: print(e, end='.'))

E = 20

h = model.fit(
    x_train, y_train,
    validation_split = 0.2,
    epochs = E,
    callbacks = [simple_logging],
    verbose = False
)

OUTPUT

0.1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.

Prediction And Evaluation

Now we will visualise the model’s performance and see how the validation accuracy compares with the training accuracy.

import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(range(E), h.history['acc'], label = 'Training')
plt.plot(range(E), h.history['val_acc'], label = 'Validation')
plt.legend()
plt.show()

The movie review classifier is performing well on the training data, although the validation accuracy is not on par with it. Let’s quantify the results further.

loss, acc = model.evaluate(x_test, y_test)
print('Test set accuracy: ', acc * 100)

OUTPUT

782/782 [==============================] - 1s 1ms/step - loss: 0.8834 - acc: 0.8420
Test set accuracy:  84.19600129127502

The test accuracy is around 84% which is quite decent. Let’s now test the model’s prediction capability.

Predicting results

import numpy as np

# The model outputs a single sigmoid probability, so threshold it at 0.5
prediction = model.predict(np.expand_dims(x_test[0], axis = 0))
class_names = ['Negative', 'Positive']
print(class_names[int(prediction[0][0] > 0.5)])

OUTPUT

Negative

We can see that our classifier predicts the test review as negative. Let’s now decode the sentence and check.

print(decode(x_test[0]))

OUTPUT

the wonder own as by is sequence i i and and to of hollywood br of down shouting getting boring of ever it sadly sadly sadly i i was then does don't close faint after one carry as by are be favourites all family turn in does as three part in another some to be probably with world and her an have faint beginning own as is sequence the the the

We observe negative words like ‘sadly’ and ‘boring’, so it makes sense that the model classifies this movie review as ‘Negative’.
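
As a small extension (a sketch, not part of the original walkthrough), we can score the first few test reviews in a single call and compare the thresholded predictions with the true labels:

# Predict the first five test reviews at once and compare with the true labels
probs = model.predict(x_test[:5]).flatten()
for p, label in zip(probs, y_test[:5]):
    print('predicted:', class_names[int(p > 0.5)], '| actual:', class_names[label], '| p = %.3f' % p)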

Thanks for reading!
