Using an RNN to predict Shakespearean text
INTRODUCTION
In this tutorial, we will build a Recurrent Neural Network (RNN) model to generate Shakespearean text using the Keras API of TensorFlow in Python. We will train a custom-built RNN model on a Shakespearean text dataset hosted on GitHub and use it to generate new text.
TRAINING DATASET
We will use a set of passages from Shakespeare's works as our training dataset. The dataset is hosted on GitHub, and we download it directly from its URL in the code below.
IMPORTING THE LIBRARIES
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
import sweetviz as sw
import seaborn as sns

sns.set()
We have imported some of the libraries commonly used in deep learning. You may notice a newer library called sweetviz, which automates exploratory data analysis and is very useful for analysing a training dataset.
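Sweetviz works on pandas DataFrames rather than raw text, so it is not used further in this tutorial. As a minimal sketch (the DataFrame and output filename below are hypothetical), an analysis report can be generated like this:

# Minimal sweetviz sketch (hypothetical DataFrame, not part of this tutorial's pipeline)
df = pd.DataFrame({'line_length': [42, 17, 63], 'word_count': [9, 4, 12]})
report = sw.analyze(df)              # build the EDA report
report.show_html('eda_report.html')  # write an interactive HTML report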
shakespeare_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
filepath = keras.utils.get_file('shakespeare.txt', shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()
Now we have downloaded the dataset into our Python notebook. Before we use it for training, we need to preprocess the text.
PREPROCESSING THE DATASET
TOKENISATION
Tokenisation is the process of breaking longer strings of text into smaller chunks, or tokens. Larger pieces of text can be tokenised into sentences, and sentences into words; here we tokenise at the character level. Further preprocessing is done after tokenisation.
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)
max_id = len(tokenizer.word_index)       # Number of distinct characters
dataset_size = tokenizer.document_count  # Total number of characters
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1  # Encoding the dataset
Here, we have used the texts_to_sequences function, which encodes each character of the text as its integer ID using the mapping the tokenizer learned from the data. We subtract 1 so that the IDs run from 0 to max_id - 1 instead of 1 to max_id.
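For instance, the tokenizer can encode a short string into character IDs and decode them back. The exact IDs depend on character frequencies in the corpus, so the values below are only illustrative:

sample_ids = tokenizer.texts_to_sequences(['First'])    # e.g. [[20, 6, 9, 8, 3]] (IDs are corpus-dependent)
sample_text = tokenizer.sequences_to_texts(sample_ids)  # e.g. ['f i r s t'] (characters joined by spaces)
print(sample_ids, sample_text)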
PREPARING THE DATASET
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
Note that we used the tf.data.Dataset module, which is well suited to large collections of elements such as the text of an entire book. Using the module typically follows this pattern:
- Create a source dataset from the input dataset
- Apply dataset transformations to preprocess the dataset
- Iterate over the dataset and process the elements
We have used the from_tensor_slices function from this module, which takes the encoded array we created earlier and converts it into a dataset of tensors over which transformations can be applied.
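As a minimal sketch of this create–transform–iterate pattern (using a toy range of integers rather than our encoded text):

# Toy illustration of the tf.data pattern (not part of the tutorial's pipeline)
toy = tf.data.Dataset.from_tensor_slices(tf.range(5))  # 1. create a source dataset
toy = toy.map(lambda x: x * 2)                         # 2. apply a transformation
for element in toy:                                    # 3. iterate over the elements
    print(element.numpy())                             # prints 0, 2, 4, 6, 8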
n_steps = 100
window_length = n_steps + 1
dataset = dataset.repeat().window(window_length, shift=1, drop_remainder=True)
repeat() repeats the dataset a specified number of times, or indefinitely when called with no argument, as we do here. window() acts like a sliding window of window_length characters that is shifted by shift positions at each step, so consecutive windows overlap; drop_remainder=True discards the final windows that are shorter than window_length.
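To see what window() produces, here is a toy illustration on a small range of integers, assuming a window of length 3 shifted by 1 (our real dataset uses window_length = 101):

# Toy illustration of window() + flat_map() (not part of the tutorial's pipeline)
toy = tf.data.Dataset.range(6).window(3, shift=1, drop_remainder=True)
toy = toy.flat_map(lambda w: w.batch(3))  # turn each nested window dataset into a flat tensor
for window in toy:
    print(window.numpy())  # [0 1 2], [1 2 3], [2 3 4], [3 4 5]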
dataset = dataset.flat_map(lambda window: window.batch(window_length))  # Flatten nested window datasets into tensors
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))  # Input: first 100 chars, target: last 100 chars
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))  # One-hot encode the inputs
dataset = dataset.prefetch(1)
Now that we have prepared our dataset for training, let's build our model.
BUILDING OUR MODEL
model = keras.models.Sequential()
model.add(keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id]))
model.add(keras.layers.GRU(128, return_sequences=True))
model.add(keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax')))
Seems simple, right? Often, a single LSTM or GRU layer is powerful enough for an RNN model. Here, we have stacked two GRU layers and wrapped the output Dense layer in TimeDistributed so that it is applied at every time step, producing a probability distribution over the max_id characters at each position. Let's compile and train our model.
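If you want to check the layer output shapes before training, you can print a summary of the model we just built:

model.summary()  # each layer outputs (batch, time steps, features); the last layer gives max_id probabilities per step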
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
history = model.fit(dataset, steps_per_epoch=train_size // batch_size, epochs=1)
31370/31370 [==============================] - 344s 11ms/step - loss: 1.0911
We have used the 'adam' optimizer here. You can also use any of the other available optimizers to suit your needs. Now that we have trained our model, we will test it using some text inputs.
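For example, one could swap in a different optimizer or set an explicit learning rate when compiling (the learning rate below is only an illustrative value):

# Hypothetical alternative: RMSprop with an explicit learning rate
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.RMSprop(learning_rate=0.001))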
TESTING OUR MODEL
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model.predict(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]
def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text
We have defined some utility functions: when an input text is provided, preprocess prepares it in the format our model expects, next_char samples the next character from the model's predicted distribution, and complete_text repeatedly appends predicted characters up to the specified length. Let's test it with some input characters.
print(complete_text('t',temperature=0.2))
t and our of men these me some contake and faith hi
print(complete_text('j',temperature=0.2))
justress to see 'twixts i reved be such sonleous ha
Now, let's see how our model's predictions vary with the input temperature parameter. Higher temperatures flatten the predicted character distribution, giving more random text, while lower temperatures sharpen it, giving more conservative, repetitive text.
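As a minimal sketch of what the temperature does (using a made-up three-character probability vector rather than real model output):

# Toy illustration of temperature scaling (hypothetical probabilities)
p = tf.constant([[0.7, 0.2, 0.1]])
for T in (0.2, 1.0, 5.0):
    print(T, tf.nn.softmax(tf.math.log(p) / T).numpy())
# Low T sharpens the distribution towards the most likely character;
# high T flattens it towards uniform, giving more random samples.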
print(complete_text('S',n_chars=20,temperature=5))
Sko-t'll;asm, nmio:-n
print(complete_text('S',n_chars=20,temperature=0.5))
Still in her in: what
print(complete_text('S',n_chars=20,temperature=0.1))
S the other and in pr
CONCLUSION
In this tutorial, we have looked at how to train a custom-built RNN model on Shakespearean text and test it using some text inputs. We also looked at how the model's predictions vary with the input temperature parameter.