Text Generation with Keras and TensorFlow using LSTM and Tokenization
Hello everyone, today we will explore the world of writing! In this tutorial, we will apply Natural Language Processing (NLP) with deep learning, using the TensorFlow and Keras APIs in Python.
What is NLP?
To put it in simple terms, it is a set of techniques that lets a machine analyze human language in the form of text and speech. Here we will use it for a predictive task: generating text.
We will generate text by feeding a set of phrases as input. The model will predict the next words using the knowledge it gained from the training dataset.
We will be using Shakespeare’s Julius Caesar for this.
Let’s start to code now!
IMPORTING PYTHON LIBRARIES
import re
import numpy as np
import pandas
from random import randint
from keras.models import Sequential, load_model
from keras.layers import Dense, LSTM, Dropout, Embedding
from keras.utils import to_categorical
Now, we will download the required dataset. Rather than downloading it manually, let us use Python’s nltk library to do so. If you do not have nltk installed, simply run pip install nltk in your terminal. After that, follow the commands below:
python   # run python on the terminal

import nltk
nltk.download('gutenberg')
We will be using NLTK’s Gutenberg corpus, a small selection of texts from the Project Gutenberg archive (18 books from a dozen authors), and it contains the one we require, namely ‘shakespeare-caesar.txt’.
Next, print the file names so that we can get a sneak peek into the contents of the dataset.
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg as gut
print(gut.fileids())
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
We will now view some contents of the Caesar text by printing its first 500 characters.
caesar_text = gut.raw('shakespeare-caesar.txt')
print(caesar_text[:500])
Check the output below:
[The Tragedie of Julius Caesar by William Shakespeare 1599] Actus Primus. Scoena Prima. Enter Flauius, Murellus, and certaine Commoners ouer the Stage. Flauius. Hence: home you idle Creatures, get you home: Is this a Holiday? What, know you not (Being Mechanicall) you ought not walke Vpon a labouring day, without the signe Of your Profession? Speake, what Trade art thou? Car. Why Sir, a Carpenter Mur. Where is thy Leather Apron, and thy Rule? What dost thou with thy best Apparrell on
From the output, we can observe that the text contains numbers and special characters, so we need to filter them out to get the data into a form the machine can process.
Data Cleaning:
Define a filter function preprocess_text() to remove special characters and punctuation from the text.
def preprocess_text(sent):
    # first remove all the punctuation that can cause problems for the model
    nonPunctuationsSentence = re.sub('[^a-zA-Z]', ' ', sent)
    # now remove all the single characters from the sentence
    nonCharSentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', nonPunctuationsSentence)
    # multiple spaces can also cause problems for the model
    noSpaceSentence = re.sub(r'\s+', ' ', nonCharSentence)
    return noSpaceSentence.lower()
We will now apply our filter function to the text and, as before, inspect the first 500 characters of the cleaned result, as shown below.
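The post does not show the call itself, but judging by the output that follows, the filter is applied to the full raw text and the first 500 characters of the cleaned version are printed, roughly like this:

caesar_text = preprocess_text(caesar_text)  # clean the whole raw text (assumed step, not shown in the original)
print(caesar_text[:500])                    # inspect the first 500 characters of the result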
Analyze the output:
the tragedie of julius caesar by william shakespeare actus primus scoena prima enter flauius murellus and certaine commoners ouer the stage flauius hence home you idle creatures get you home is this holiday what know you not being mechanicall you ought not walke vpon labouring day without the signe of your profession speake what trade art thou car why sir carpenter mur where is thy leather apron and thy rule what dost thou with thy best apparrell on you sir what trade are you cobl truely sir in'
Tokenization:
What is tokenization?
It is a method in which we split the given text into smaller units, such as individual words or numbers, by locating where one word ends and the next begins.
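As a quick illustration (this snippet is not part of the original tutorial), nltk’s word_tokenize splits a short phrase from the play into word tokens:

# Illustrative snippet, not from the original post: how word tokenization splits text.
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # word_tokenize needs the punkt models ('punkt_tab' in newer NLTK)
print(word_tokenize("friends romans countrymen lend me your eares"))
# ['friends', 'romans', 'countrymen', 'lend', 'me', 'your', 'eares']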
Here, we will eventually convert individual words into integers. First we need to split the entire text into single words, for which we use word_tokenize() from nltk.tokenize.
Let us also print the number of unique words present in the tokenized text.
from nltk.tokenize import word_tokenize as tokenizer

nltk.download('punkt')  # word_tokenize needs the punkt tokenizer models

caesar_text_words = tokenizer(caesar_text)
n_words = len(caesar_text_words)
print('Total Words: %d' % n_words)

unique_words = len(set(caesar_text_words))
print('Unique Words: %d' % unique_words)
Total Words: 19642 Unique Words: 3003
So far I have used the tokenizer provided by nltk. However, since the model will be trained with TensorFlow/Keras, we will use the Keras Tokenizer to fit our Caesar text.
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3003)
tokenizer.fit_on_texts(caesar_text_words)
To keep track of all the words, I made a variable to store the size of our vocabulary. Once that was done, I used the tokenizer’s word_index to map each word to its integer index.
vocab_size = len(tokenizer.word_index) + 1
word_2_index = tokenizer.word_index

print(caesar_text_words[500])
print(word_2_index[caesar_text_words[500]])
You can see the sample output below:
the 2
Modifying the shape of data
Text generation is a many-to-one sequence problem (the input is a sequence of words and the output is a single word), so we will make use of an LSTM for this.
I have covered LSTMs extensively in one of my previous tutorials, which you can look up if you need a refresher.
But first, we need to build the input sequences and their corresponding output words:
sequenceInput = []
wordOutput = []
seq_length = 100

for i in range(0, n_words - seq_length, 1):
    in_seque = caesar_text_words[i:i + seq_length]
    out_seque = caesar_text_words[i + seq_length]
    sequenceInput.append([word_2_index[word] for word in in_seque])
    wordOutput.append(word_2_index[out_seque])
In the code above, I declared two lists, one for input sequences and one for output words. I converted the words to indices and stored the first 100 in an input sequence, with the 101st word treated as its output. The next iteration shifts the window by one, so the input covers words 2 to 101 and the output is the 102nd word, and so on.
print(sequenceInput[0])
[2, 886, 5, 1357, 12, 37, 1358, 1359, 471, 1360, 1361, 1362, 54, 472, 887, 1, 394, 888, 297, 2, 1363, 472, 230, 231, 4, 694, 889, 249, 4, 231, 8, 16, 1364, 29, 49, 4, 7, 272, 1365, 4, 553, 7, 358, 67, 1366, 71, 473, 2, 554, 5, 23, 1367, 98, 29, 395, 130, 32, 1368, 75, 137, 1369, 474, 108, 8, 63, 890, 1370, 1, 63, 891, 29, 555, 32, 22, 63, 232, 1371, 43, 4, 137, 29, 395, 33, 4, 892, 893, 137, 9, 359, 5, 1372, 1373, 62, 26, 28, 4, 72, 93, 894, 474]
This is a sample input sequence in terms of indices. For the next step, the data needs to be shaped to match the input requirements of our model.
X = np.reshape(sequenceInput, (len(sequenceInput), seq_length, 1))
X = X / float(vocab_size)
y = to_categorical(wordOutput)
We use NumPy’s reshape function to arrange the indices into the shape the LSTM expects. In the code above, I also normalized the data by dividing by the vocabulary size so the values lie between 0 and 1. Lastly, wordOutput was passed to Keras’ to_categorical function to one-hot encode the outputs for training.
print("The shape for X :", X.shape) print("The shape for Y :", y.shape)
I will now print the shapes of X and y just for reassurance. This step can be skipped, but I highly recommend cross-checking the data before training.
The shape for X : (19542, 100, 1)
The shape for Y : (19542, 3004)
Model Training
I will define our model as a Sequential with three LSTM layers of 500, 300, and 200 hidden units respectively, finished off with a softmax-activated dense layer.
model = Sequential()
model.add(LSTM(500, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(300, return_sequences=True))
model.add(LSTM(200))
model.add(Dense(y.shape[1], activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam')
I will use adam as my optimizer, as it is a solid all-rounder across many types of models. You can use whichever optimizer suits your use case.
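For instance, a hedged alternative (not from the original post) is to compile with RMSprop and an explicit learning rate:

# Illustrative alternative, not from the original post: swap adam for RMSprop.
from keras.optimizers import RMSprop

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(learning_rate=0.001))  # 'learning_rate' is 'lr' in older Keras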
The model summary would look like this:
Model: "sequential_11" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= lstm_1 (LSTM) (None, 100, 500) 1004000 _________________________________________________________________ lstm_2 (LSTM) (None, 100, 300) 961200 _________________________________________________________________ lstm_3 (LSTM) (None, 200) 400800 _________________________________________________________________ dense_1 (Dense) (None, 3004) 603804 ================================================================= Total params: 2,969,804 Trainable params: 2,969,804 Non-trainable params: 0
Now, to start the heavy work, we call the fit method on the model to begin training.
model.fit(X, y, batch_size=64, epochs=10, verbose=1)
After the training, this will be the output if all goes well.
Epoch 9/10
17150/17150 [==============================] - 191s 11ms/step - loss: 6.6049
Epoch 10/10
17150/17150 [==============================] - 193s 11ms/step - loss: 6.6033
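Ten epochs on this dataset take a while, so it can be worth saving weights as you go. This is not part of the original post, and the filename below is an arbitrary choice, but a checkpoint callback can simply be passed to fit:

# Optional addition, not in the original post: save weights whenever the training
# loss improves, so a long run is not lost if it gets interrupted.
from keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint('caesar_lstm_weights.h5',  # hypothetical filename
                             monitor='loss', save_best_only=True, verbose=1)
model.fit(X, y, batch_size=64, epochs=10, verbose=1, callbacks=[checkpoint])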
Testing the model:
For testing, I used an input sequence picked at random from the Caesar dataset; the following code will pick and print your input sequence.
randomSeq_index = np.random.randint(0, len(sequenceInput) - 1)
randomSeq = sequenceInput[randomSeq_index]

index_2_wordMap = map(reversed, word_2_index.items())
index_2_word = dict(index_2_wordMap)

wordSequence = [index_2_word[indexValue] for indexValue in randomSeq]
print(' '.join(wordSequence))
The output of the above code for me is this:
that spare cassius he reades much he is great obseruer and he lookes quite through the deeds of men he loues no playes as thou dost antony he heares no musicke seldome he smiles and smiles in such sort as if he mock himselfe and scorn his spirit that could be mou.....my right hand for this
Note: This will differ for you on each run as it is randomised.
Now I will feed this 100-word sequence into our model and repeatedly predict the following words.
for i in range(100):
    sampleInt = np.reshape(randomSeq, (1, len(randomSeq), 1))
    sampleInt = sampleInt / float(vocab_size)
    indexPrediction = model.predict(sampleInt)
    wordIDPrediction = np.argmax(indexPrediction)
    wordSequence.append(index_2_word[wordIDPrediction])
    randomSeq.append(wordIDPrediction)
    randomSeq = randomSeq[1:len(randomSeq)]

final_output = ""
for words in wordSequence:
    final_output = final_output + " " + words
print(final_output)
The two code blocks above only use functions explained earlier, so be sure to reread any parts that seem confusing.
And the output looks like this:
that spare cassius he reades much he is great obseruer and he lookes quite through the deeds of men he loues no playes as thou dost antony he heares no musicke seldome he smiles and smiles in such sort as if he mock himselfe and scorn his spirit that could be mou to smile at any thing such men as he be neuer at hearts ease whiles they behold greater then themselues and therefore are they very dangerous rather tell thee what is to be fear then what feare for alwayes am caesar come on my right hand for this and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and
Ouch. Well, honestly, this is better than we could have expected; it is still really difficult to imitate an ordinary human being, let alone a legendary writer. With better data and more training you can extract better output, and a smarter decoding strategy also helps, as sketched below.
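The trailing run of “and and and” is a typical symptom of greedy decoding, that is, always taking the single most likely next word with argmax. A common alternative, not covered in this tutorial, is to sample the next word from the predicted probabilities with a temperature; here is a minimal sketch (the function name and temperature value are arbitrary choices):

# Hedged sketch, not part of the original tutorial: sample the next word id from
# the softmax probabilities with a temperature instead of taking the argmax.
def sample_word_id(probabilities, temperature=0.8):
    probs = np.asarray(probabilities, dtype='float64')
    probs = np.log(probs + 1e-8) / temperature      # flatten or sharpen the distribution
    probs = np.exp(probs) / np.sum(np.exp(probs))   # renormalize to probabilities
    return np.random.choice(len(probs), p=probs)

# Inside the generation loop this would replace np.argmax(indexPrediction):
# wordIDPrediction = sample_word_id(indexPrediction[0])

Best of luck!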