Text Generation with Keras and TensorFlow using LSTM and Tokenization

Hello everyone, today we will explore the world of writing! In this tutorial, we will use Natural Language Processing (NLP) techniques from deep learning with the TensorFlow and Keras APIs in Python.

What is NLP?

To put it in simple terms, it is a field of machine learning in which the machine analyzes human language in the form of text and speech.

We will generate text by feeding a set of phrases as input. The model will predict the next set of words using the knowledge it gained from the training dataset.

We will be using Shakespeare's Julius Caesar for this.

Let’s start to code now!

IMPORTING PYTHON LIBRARIES

import re
from random import randint

import numpy as np
import pandas

from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Embedding, LSTM
from keras.utils import to_categorical

Now, we will download the required dataset. Rather than downloading it manually, let us use Python's nltk library to do the same. If you do not have nltk installed on your system, simply run pip install nltk in your terminal. After that, follow the commands below:

python   # start a Python interpreter in the terminal

import nltk

nltk.download('gutenberg')

We will be using the Gutenberg corpus, a selection of 18 classic texts from Project Gutenberg, and it includes the one we require, namely Julius Caesar ('shakespeare-caesar.txt').

Next, print the file names so that we can get a sneak peek into the contents of the dataset.

import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg as gut

print(gut.fileids())
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

We will now view some of the contents of the Caesar text by printing its first 500 characters.

from nltk.corpus.gutenberg import raw
caesar_text = raw('shakespeare-caesar.txt')
print(caesar_text[:500])

Check the output below:

[The Tragedie of Julius Caesar by William Shakespeare 1599]


Actus Primus. Scoena Prima.

Enter Flauius, Murellus, and certaine Commoners ouer the Stage.

  Flauius. Hence: home you idle Creatures, get you home:
Is this a Holiday? What, know you not
(Being Mechanicall) you ought not walke
Vpon a labouring day, without the signe
Of your Profession? Speake, what Trade art thou?
  Car. Why Sir, a Carpenter

   Mur. Where is thy Leather Apron, and thy Rule?
What dost thou with thy best Apparrell on

From the output, we can observe that the text contains numbers, punctuation, and special characters, so we need to filter them out before the data can be processed in a form the machine understands.

Data Cleaning:

Define a filter function preprocess_text() to remove the special characters and punctuation from the text.

def preprocess_text(sent):
    # remove everything except letters (punctuation, digits, special characters)
    nonPunctuationsSentence = re.sub('[^a-zA-Z]', ' ', sent)

    # remove stray single characters left between spaces
    nonCharSentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', nonPunctuationsSentence)

    # collapse multiple spaces into a single space
    noSpaceSentence = re.sub(r'\s+', ' ', nonCharSentence)

    return noSpaceSentence.lower()

We will now apply our filter function to the text and, as before, inspect the first 500 characters of the cleaned output.
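The original post does not show this step explicitly, but a minimal sketch would look like this (my assumption: the whole raw text is cleaned in place and we preview the first 500 characters of the result):

# assumption: clean the entire raw text, then preview the cleaned version
caesar_text = preprocess_text(caesar_text)
print(caesar_text[:500])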

Analyze the output:

the tragedie of julius caesar by william shakespeare actus primus scoena prima enter flauius murellus and certaine commoners ouer the stage flauius hence home you idle creatures get you home is this holiday what know you not being mechanicall you ought not walke vpon labouring day without the signe of your profession speake what trade art thou car why sir carpenter mur where is thy leather apron and thy rule what dost thou with thy best apparrell on you sir what trade are you cobl truely sir in'

Tokenization:

What is tokenization?

It is the process of splitting the given text into smaller units, such as individual words or numbers, by locating where one word ends and the next one begins.
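To see tokenization in isolation, here is a tiny illustrative example (the sentence is made up for demonstration; note that word_tokenize relies on nltk's punkt models, downloadable via nltk.download('punkt')):

from nltk.tokenize import word_tokenize

# illustrative only: split a short sentence into word tokens
print(word_tokenize("friends romans countrymen lend me your ears"))
# ['friends', 'romans', 'countrymen', 'lend', 'me', 'your', 'ears']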

Here, we want to convert individual words into integers, and to do that we first need to split the entire text into single words using word_tokenize() from nltk.tokenize.

Let us also print the number of unique words present in the tokenized text.

from nltk.tokenize import word_tokenize as tokenizer

caesar_text_words = tokenizer(caesar_text)
n_words = len(caesar_text_words)
print('Total Words: %d' % n_words)

unique_words = len(set(caesar_text_words))
print('Unique Words: %d' % unique_words)
Total Words: 19642
Unique Words: 3003

So far I have used the tokenizer provided by nltk. However, since the model will be trained with TensorFlow/Keras, we will use the Keras Tokenizer to fit on our Caesar text.

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=3003)
tokenizer.fit_on_texts(caesar_text_words)

To keep track of all the words, I made a variable to store the size of our vocabulary. Once that was done, I used the tokenizer's word_index mapping to convert words into integer indices.

vocab_size = len(tokenizer.word_index) + 1
word_2_index = tokenizer.word_index

print(caesar_text_words[500])
print(word_2_index[caesar_text_words[500]])

You can see the sample output below:

the
2

Modifying the shape of data

Text generation falls into the many-to-one sequence category of deep learning (the input is a sequence of words and the output is a single word), so we will use an LSTM for this.

I have covered LSTM extensively in one of my previous tutorials, which you can find here.

But first, we need to build the input sequences and the corresponding output words so they match the format the LSTM expects:

sequenceInput = []
wordOutput = []
seq_length = 100

# slide a 100-word window over the text: the window is the input,
# and the word immediately after it is the output
for i in range(0, n_words - seq_length, 1):
    in_seque = caesar_text_words[i:i + seq_length]
    out_seque = caesar_text_words[i + seq_length]
    sequenceInput.append([word_2_index[word] for word in in_seque])
    wordOutput.append(word_2_index[out_seque])

In the code above, I declared two lists for input sequences and output words. I converted the words to indices and stored the first 100 of them as an input sequence, with the 101st word treated as its output. The next iteration shifts the window by one, so words 2 to 101 become the input and the 102nd word becomes the output, and so on.
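As a quick optional check (my addition, not part of the original code), you can verify that consecutive windows really are shifted by one word:

# each window drops its first word and gains its predecessor's target word
assert sequenceInput[1][:-1] == sequenceInput[0][1:]
assert sequenceInput[1][-1] == wordOutput[0]
print(len(sequenceInput), len(wordOutput))  # both equal n_words - seq_length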

print(sequenceInput[0])
[2, 886, 5, 1357, 12, 37, 1358, 1359, 471, 1360, 1361, 1362, 54, 472, 887, 1, 394, 888, 297, 2, 1363, 472, 230, 231, 4, 694, 889, 249, 4, 231, 8, 16, 1364, 29, 49, 4, 7, 272, 1365, 4, 553, 7, 358, 67, 1366, 71, 473, 2, 554, 5, 23, 1367, 98, 29, 395, 130, 32, 1368, 75, 137, 1369, 474, 108, 8, 63, 890, 1370, 1, 63, 891, 29, 555, 32, 22, 63, 232, 1371, 43, 4, 137, 29, 395, 33, 4, 892, 893, 137, 9, 359, 5, 1372, 1373, 62, 26, 28, 4, 72, 93, 894, 474]

This is a sample input sequence in terms of indices. For the next step, the data needs to be reshaped to match the input requirements of our model.

X = np.reshape(sequenceInput, (len(sequenceInput), seq_length, 1))
X = X / float(vocab_size)

y = to_categorical(wordOutput)

We use NumPy's reshape function to turn the index lists into the 3D shape the LSTM expects. In the code above, I also normalized the data by dividing by the vocabulary size so the values lie between 0 and 1, which helps training. Lastly, wordOutput was passed to Keras's to_categorical function to one-hot encode the targets for training.

print("The shape for X :", X.shape)
print("The shape for Y :", y.shape)

I will now print the shapes of X and y just for reassurance. This step can be skipped, but I highly recommend cross-checking the data before training.

The shape for X : (19542, 100, 1)
The shape for Y : (19542, 3004)
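As another optional sanity check (again my own addition), you can confirm that the inputs are scaled to the 0-1 range and that each output row is a valid one-hot vector:

print(X.min(), X.max())    # values should lie between 0.0 and 1.0
print(y.sum(axis=1)[:5])   # each row of y should sum to exactly 1.0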

Model Training

I will define our model as a Sequential one with 3 stacked LSTM layers of 500, 300, and 200 units, and finish off with a softmax-activated Dense layer.

model = Sequential()
model.add(LSTM(500, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(300, return_sequences=True))
model.add(LSTM(200))
model.add(Dense(y.shape[1], activation='softmax'))

model.summary()

model.compile(loss='categorical_crossentropy', optimizer='adam')

I will use adam as my optimizer, as it is a solid all-rounder across different types of models. You can use whichever optimizer suits your use case.
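If you want more control than the string shortcut gives you, you can pass an optimizer object with an explicit learning rate instead. This is only a sketch of an alternative, not something the steps above require, and 0.001 is simply Adam's usual default rather than a tuned value:

from keras.optimizers import Adam

# same compile step, but with an explicit optimizer instance
# (older standalone Keras versions name this argument lr instead of learning_rate)
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001))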

The model summary would look like this:

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 100, 500)          1004000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 100, 300)          961200    
_________________________________________________________________
lstm_3 (LSTM)                (None, 200)               400800    
_________________________________________________________________
dense_1 (Dense)              (None, 3004)              603804    
=================================================================
Total params: 2,969,804
Trainable params: 2,969,804
Non-trainable params: 0

Now, to start the heavy lifting, we call the fit method on the model to begin training.

model.fit(X, y, batch_size=64, epochs=10, verbose=1)

After training, the output will look like this if all goes well.

Epoch 9/10
17150/17150 [==============================] - 191s 11ms/step - loss: 6.6049
Epoch 10/10
17150/17150 [==============================] - 193s 11ms/step - loss: 6.6033
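Since load_model was imported at the top but never used, a likely next step is saving the trained model so it can be reloaded later without retraining. A minimal sketch (the filename is my own choice):

# save the trained model to disk and load it back later for generation
model.save('caesar_lstm.h5')          # filename is arbitrary
model = load_model('caesar_lstm.h5')  # uses the load_model imported earlier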

Testing the model:

For testing, I used an input sequence picked at random from the Caesar text; the following code will pick and print your input sequence.

randomSeq_index = np.random.randint(0, len(sequenceInput)-1)
randomSeq = sequenceInput[randomSeq_index]

index_2_wordMap = map(reversed, word_2_index.items())
index_2_word = dict(index_2_wordMap)
wordSequence = [index_2_word[indexValue] for indexValue in randomSeq]
print(' '.join(wordSequence))

The output of the above code for me is this:

that spare cassius he reades much he is great obseruer and he lookes quite through the deeds of men he loues no playes as thou dost antony he heares no musicke seldome he smiles and smiles in such sort as if he mock himselfe and scorn his spirit that could be mou.....my right hand for this

Note: This will differ for you on each run as it is randomised.

Now I will take this 100-word sequence and feed it into our model to predict the next 100 words.

for i in range(100):
    # reshape and normalize the current sequence just like the training data
    sampleInt = np.reshape(randomSeq, (1, len(randomSeq), 1))
    sampleInt = sampleInt / float(vocab_size)

    # predict the next word and take the index with the highest probability
    indexPrediction = model.predict(sampleInt)
    wordIDPrediction = np.argmax(indexPrediction)

    # append the predicted word and slide the window forward by one
    wordSequence.append(index_2_word[wordIDPrediction])
    randomSeq.append(wordIDPrediction)
    randomSeq = randomSeq[1:len(randomSeq)]

final_output = ' '.join(wordSequence)
print(final_output)

In the above two code blocks, I have only used functions that were explained earlier, so be sure to reread any parts that seem confusing.

And the output looks like this:

that spare cassius he reades much he is great obseruer and he lookes quite through the deeds of men he loues no playes as thou dost antony he heares no musicke seldome he smiles and smiles in such sort as if he mock himselfe and scorn his spirit that could be mou to smile at any thing such men as he be neuer at hearts ease whiles they behold greater then themselues and therefore are they very dangerous rather tell thee what is to be fear then what feare for alwayes am caesar come on my right hand for this and and   and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and

Ouch. Well, honestly, this is better than we could have imagined, as it is still really difficult to imitate a normal human being, let alone a legendary writer. But with better data and more training, you may be able to extract better output. Best of luck!
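One possible refinement, not covered in this tutorial, is to sample the next word from the predicted probability distribution with a temperature parameter instead of always taking the argmax; this often breaks repetitive loops like the trailing "and and and" above. A hedged sketch:

def sample_next_word(probabilities, temperature=0.8):
    # rescale the predicted distribution, then draw a word index at random;
    # lower temperature stays close to argmax, higher adds more variety
    probs = np.log(np.asarray(probabilities, dtype='float64') + 1e-8) / temperature
    probs = np.exp(probs) / np.sum(np.exp(probs))
    return np.random.choice(len(probs), p=probs)

# inside the generation loop, replace np.argmax(indexPrediction) with:
# wordIDPrediction = sample_next_word(indexPrediction[0], temperature=0.8)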
