Building A Movie Review Classifier Using TensorFlow And Keras
In this article, we will walk through building a movie review classifier using TensorFlow and Keras in Python. Our goal is to train a classifier that categorizes movie reviews as positive (1) or negative (0).
For this task, we will use the IMDB dataset built into the Keras API, which contains 50,000 movie reviews.
DATA PREPARATION
Let's use the Keras API, which runs on top of TensorFlow. The IMDB dataset is already built into the Keras API, so it can be downloaded directly.
from tensorflow.keras.datasets import imdb

# Load the reviews, keeping only the 10,000 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = 10000)
OUTPUT
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 1s 0us/step
Now, let's have a look at the training data.
print(x_train[0])
OUTPUT
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
Explanation: Notice that the training data is a sequence of integers rather than raw text. The reviews have already been tokenized: each word is replaced by an integer index that ranks it by how often it occurs in the dataset (num_words = 10000 keeps only the 10,000 most frequent words). This encoding is necessary because the model can only train on numeric values, not on raw text.
print(y_train[0])
OUTPUT
1
Assigning Class Names
class_names = ['Negative', 'Positive']
word_index = imdb.get_word_index()
print(word_index['hello'])
OUTPUT
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1646592/1641221 [==============================] - 0s 0us/step
4822
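Using this word index, we can also see how a raw sentence would be turned into the same kind of integer sequence. Below is a minimal sketch with a made-up sentence; note that load_data additionally shifts every index by 3 (reserving 0, 1 and 2 for padding, start and unknown tokens), so these raw ranks will not match the values in x_train exactly.
# Map each word of a made-up sentence to its frequency rank in the word index
# (0 if the word is unknown)
sample = 'the movie was great'
encoded = [word_index.get(word, 0) for word in sample.split()]
print(encoded)   # one integer per word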
Decoding The Reviews
To read the reviews, we need to convert them back into strings. Since the text has already been tokenized, we will build a dictionary that reverses the key-value mapping of the word index, so that each integer maps back to its word. The review is then in a human-readable form.
See the Python code below:
# Reverse the mapping so that integers map back to words
reverse_word_index = dict((value, key) for key, value in word_index.items())

def decode(review):
    text = ''
    for i in review:
        text += reverse_word_index[i]
        text += ' '
    return text
Now let’s decode the data.
decode(x_train[0])
OUTPUT
'the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room and it so heart shows to years of every never going and help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but and to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other and in of seen over landed for anyone of and br show's to whether from than out themselves history he name half some br of and odd was two most of mean for 1 any an boat she he should is thought frog but of script you not while history he heart to real at barrel but when from one bit then have two of script their with her nobody most that with wasn't to with armed acting watch an for with heartfelt film want an <'
Let's also check the lengths of a few examples and observe that they differ.
def show_lengths():
    print('Length of 1st training example: ', len(x_train[0]))
    print('Length of 2nd training example: ', len(x_train[1]))
    print('Length of 1st test example: ', len(x_test[0]))
    print('Length of 2nd test example: ', len(x_test[1]))

show_lengths()
OUTPUT
Length of 1st training example: 218
Length of 2nd training example: 189
Length of 1st test example: 68
Length of 2nd test example: 260
Padding
from tensorflow.keras.preprocessing.sequence import pad_sequences

x_train = pad_sequences(x_train, padding = 'post', maxlen = 256)
x_test = pad_sequences(x_test, padding = 'post', maxlen = 256)
Now we apply padding to equalize the lengths of all the examples, specifying a maximum length of 256. Padding is important because the model must train on equal-sized sequences: shorter reviews are padded with zeros at the end (padding = 'post') until they reach the specified length, and reviews longer than 256 tokens are truncated.
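As a quick illustration of this behaviour, here is a minimal sketch on two made-up toy sequences (not the IMDB data):
# Short sequences are padded with zeros at the end; longer ones are truncated
# (by default from the front) down to maxlen
toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(pad_sequences(toy, padding = 'post', maxlen = 5))
# [[1 2 3 0 0]
#  [5 6 7 8 9]]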
show_lengths()
OUTPUT
Length of 1st training example: 256
Length of 2nd training example: 256
Length of 1st test example: 256
Length of 2nd test example: 256
Padding has therefore equalized the length of the reviews, which makes training easier.
Building The Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, GlobalAveragePooling1D

model = Sequential([
    Embedding(10000, 16),
    GlobalAveragePooling1D(),
    Dense(16, activation = 'relu'),
    Dense(1, activation = 'sigmoid')
])

model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = ['acc']
)

model.summary()
Let's now understand the layers:
- Embedding layer: maps each of the 10,000 word indices to a 16-dimensional vector. During training, words that carry a similar sentiment end up close together in this vector space, so positive words cluster near other positive words and negative words near other negative ones.
- GlobalAveragePooling1D: helps reduce the dimensionality. After the embedding layer each review is a long sequence of vectors, so this layer averages them into a single vector per review (see the short shape check after the model summary below).
- Dense layers: a hidden Dense layer with ReLU activation, followed by a single output neuron with sigmoid activation. The sigmoid output serves the binary classification: values close to 1 mean positive and values close to 0 mean negative.
OUTPUT
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_1 (Embedding) (None, None, 16) 160000 _________________________________________________________________ global_average_pooling1d_1 ( (None, 16) 0 _________________________________________________________________ dense_2 (Dense) (None, 16) 272 _________________________________________________________________ dense_3 (Dense) (None, 1) 17 ================================================================= Total params: 160,289 Trainable params: 160,289 Non-trainable params: 0
Fitting The Model
from tensorflow.keras.callbacks import LambdaCallback

# Print only the epoch number at the end of each epoch to keep the log compact
simple_logging = LambdaCallback(on_epoch_end = lambda e, l: print(e, end='.'))

E = 20

h = model.fit(
    x_train, y_train,
    validation_split = 0.2,
    epochs = E,
    callbacks = [simple_logging],
    verbose = False
)
OUTPUT
0.1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.
Prediction And Evaluation
Now let's visualise the model's performance and see how the validation accuracy compares with the training accuracy.
import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(range(E), h.history['acc'], label = 'Training')
plt.plot(range(E), h.history['val_acc'], label = 'Validation')
plt.legend()
plt.show()
The movie review classifier performs well on the training data, although the validation accuracy is not quite on par with it.
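One simple way to keep that gap in check, if you want to experiment further, is to stop training once the validation metric stops improving. A minimal sketch, assuming the same model and callbacks as above (the patience value is an arbitrary choice for illustration):
from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation accuracy has not improved for 2 consecutive epochs and keep
# the best weights seen so far; pass this to model.fit alongside simple_logging
early_stop = EarlyStopping(monitor = 'val_acc', patience = 2, restore_best_weights = True)
Now let's quantify the results further on the test set.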
loss, acc = model.evaluate(x_test, y_test)
print('Test set accuracy: ', acc * 100)
OUTPUT
782/782 [==============================] - 1s 1ms/step - loss: 0.8834 - acc: 0.8420
Test set accuracy: 84.19600129127502
The test accuracy is around 84%, which is quite decent. Let's now test the model's prediction capability.
Predicting Results
import numpy as np

prediction = model.predict(np.expand_dims(x_test[0], axis = 0))
class_names = ['Negative', 'Positive']

# The sigmoid layer outputs a single probability, so threshold it at 0.5
print(class_names[int(prediction[0][0] > 0.5)])
OUTPUT
Negative
We can see that our classifier predicts this test review as negative. Let's now decode the review and check.
print(decode(x_test[0]))
OUTPUT
the wonder own as by is sequence i i and and to of hollywood br of down shouting getting boring of ever it sadly sadly sadly i i was then does don't close faint after one carry as by are be favourites all family turn in does as three part in another some to be probably with world and her an have faint beginning own as is sequence the the the
We can observe negative words such as 'sadly' and 'boring', which is consistent with the model classifying the review as 'Negative'.
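If you want to try the classifier on a brand-new review, the raw text has to go through the same encoding and padding pipeline. Below is a minimal sketch, assuming the word_index, pad_sequences, class_names and trained model from above; the sample sentence is made up, and the encoding relies on load_data's default behaviour of reserving index 1 for the start token, 2 for unknown words and shifting every word rank by 3.
def encode_review(text):
    # Mirror the dataset's encoding: start token, then each word's rank + 3,
    # with out-of-vocabulary (or too rare) words mapped to 2
    tokens = [1]
    for word in text.lower().split():
        index = word_index.get(word, -1) + 3
        tokens.append(index if 3 <= index < 10000 else 2)
    return pad_sequences([tokens], padding = 'post', maxlen = 256)

sample = 'the acting was wonderful and the story kept me interested'
probability = model.predict(encode_review(sample))[0][0]
print(class_names[int(probability > 0.5)])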
Thanks for reading!