Using an RNN to predict Shakespearean text

INTRODUCTION

In this tutorial, we will look at building a Recurrent Neural Network (RNN) model to generate Shakespearean-style text using the Keras API of TensorFlow in Python. We will train a custom-built RNN model on a Shakespearean text dataset hosted on GitHub and use it to generate new text.

TRAINING DATASET

We will use a set of paragraphs from Shakespeare's works as our training dataset. The dataset is hosted on GitHub, and we download it directly in the code below.

IMPORTING THE LIBRARIES

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
import sweetviz as sw
import seaborn as sns
sns.set()

We have imported some of the libraries commonly used in deep learning. You may notice a less familiar one called sweetviz, which automates exploratory data analysis and is very useful for analysing a training dataset.

shakespeare_url='https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
filepath=keras.utils.get_file('shakespeare.txt',shakespeare_url)
with open(filepath) as f:
    shakespeare_text=f.read()

Now that we have downloaded the dataset into our Python notebook, we need to preprocess the text before using it for training.
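Optionally, we can take a quick look at the downloaded text first. Below is a minimal sketch of how sweetviz could be used here; since it analyses pandas DataFrames, we first summarise each line of the corpus as a row with a couple of columns of our own choosing (the column names and the report filename are just illustrative).

lines=shakespeare_text.splitlines()
df=pd.DataFrame({'line_length':[len(line) for line in lines],
                 'word_count':[len(line.split()) for line in lines]})
report=sw.analyze(df) #Build the EDA report
report.show_html('shakespeare_eda.html') #Writes an interactive HTML report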

PREPROCESSING THE DATASET

TOKENISATION

Tokenisation is the process in which longer strings of text are broken into smaller chunks, or tokens. Larger pieces of text can be tokenised into sentences, and sentences into words; here we tokenise the text into individual characters. Further preprocessing is done after tokenisation.
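For example, here is a toy illustration (using a made-up sentence) of word-level versus character-level tokenisation with the Keras Tokenizer we are about to use.

sample=["To be or not to be"]
word_tok=keras.preprocessing.text.Tokenizer() #Word-level tokenisation (the default)
word_tok.fit_on_texts(sample)
print(word_tok.texts_to_sequences(sample)) #[[1, 2, 3, 4, 1, 2]]
char_tok=keras.preprocessing.text.Tokenizer(char_level=True) #Character-level tokenisation
char_tok.fit_on_texts(sample)
print(char_tok.texts_to_sequences(sample)) #One integer ID per character, including spaces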

tokenizer=keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)
max_id=len(tokenizer.word_index) #Number of distinct characters
dataset_size=tokenizer.document_count #Total number of characters 
[encoded]=np.array(tokenizer.texts_to_sequences([shakespeare_text]))-1 #Encoding the dataset

Here, we have used the texts_to_sequences function, which maps each character to its integer ID. Because we set char_level=True, punctuation and newlines are kept as tokens, and the Tokenizer only lowercases the text by default. The IDs start at 1, so we subtract 1 to make them run from 0 to max_id - 1.
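As a quick sanity check (the sample string below is just an example), we can encode a short piece of text and decode it back; note that sequences_to_texts joins the decoded characters with spaces.

sample_ids=tokenizer.texts_to_sequences(["First Citizen"]) #1-based character IDs
print(sample_ids)
print(tokenizer.sequences_to_texts(sample_ids)) #Decodes back, lowercased and space-separated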

PREPARING THE DATASET

train_size=dataset_size*90//100
dataset=tf.data.Dataset.from_tensor_slices(encoded[:train_size])

Note that we used the tf.data.Dataset class, which is generally useful for large collections of elements such as the text of an entire book. A tf.data pipeline usually follows a typical pattern:

  1. Create a source dataset from the input dataset
  2. Apply dataset transformations to preprocess the dataset
  3. Iterate over the dataset and process the elements

We have used the from_tensor_slices method, which takes the encoded array we created earlier and converts it into a dataset whose elements are individual character IDs, over which transformations can be applied. A toy illustration of the pattern follows.
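Here is the three-step pattern on some made-up values, independent of our Shakespeare dataset.

toy=tf.data.Dataset.from_tensor_slices([10,20,30,40]) #1. Create a source dataset
toy=toy.map(lambda x: x*2) #2. Apply a transformation
for element in toy: #3. Iterate over the dataset
    print(element.numpy()) #20, 40, 60, 80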

n_steps=100
window_length=n_steps+1
dataset=dataset.repeat().window(window_length,shift=1,drop_remainder=True)

repeat() without an argument makes the dataset repeat indefinitely (we will limit the number of steps per epoch when we call fit()). window() is like a sliding window: it groups window_length consecutive characters into a window and shifts the window by one character at each step, while drop_remainder=True drops the final windows that would be shorter than window_length.
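To see what window() produces, here is a toy example on a small range of numbers; each window is itself a small dataset of consecutive elements, shifted by one each time.

toy=tf.data.Dataset.range(6).window(3,shift=1,drop_remainder=True)
for window in toy:
    print(list(window.as_numpy_iterator())) #[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]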

dataset=dataset.flat_map(lambda window: window.batch(window_length)) #Flatten the nested window datasets into plain tensors of length window_length

batch_size=32
dataset=dataset.shuffle(10000).batch(batch_size) #Shuffle the windows and group them into batches
dataset=dataset.map(lambda windows: (windows[:,:-1],windows[:,1:])) #Split each window into input (first 100 chars) and target (last 100 chars)
dataset=dataset.map(lambda X_batch,Y_batch: (tf.one_hot(X_batch,depth=max_id),Y_batch)) #One-hot encode the inputs
dataset=dataset.prefetch(1) #Prefetch the next batch while the current one is being processed
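As an optional sanity check, we can peek at the shapes of one training batch: each input is a batch of 32 one-hot encoded sequences of 100 characters, and each target is the same sequences shifted by one character.

for X_batch,Y_batch in dataset.take(1):
    print(X_batch.shape,Y_batch.shape) #(32, 100, max_id) and (32, 100)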

Now that we have prepared our dataset for training, let's build our model.

BUILDING OUR MODEL

model=keras.models.Sequential()
model.add(keras.layers.GRU(128,return_sequences=True,input_shape=[None,max_id]))
model.add(keras.layers.GRU(128,return_sequences=True))
model.add(keras.layers.TimeDistributed(keras.layers.Dense(max_id,activation='softmax')))

Seems simple, right? Generally, a stack of one or two LSTM or GRU layers is powerful enough for a character-level model like this. Here, we have used two GRU layers followed by a TimeDistributed Dense layer, which applies the same Dense softmax layer to every time step so that the model outputs a probability distribution over the max_id characters at each step. Let's compile and train our model.
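The model has not been trained yet, but we can already verify the shapes it works with by passing it a dummy batch built from the first n_steps characters of our encoded text (this check is optional).

dummy=tf.one_hot(np.array([encoded[:n_steps]]),depth=max_id) #Shape (1, 100, max_id)
print(model(dummy).shape) #(1, 100, max_id): one distribution over max_id characters per time step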

model.compile(loss='sparse_categorical_crossentropy',optimizer='adam')
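We compile with sparse_categorical_crossentropy because the targets are integer character IDs rather than one-hot vectors. If you prefer, you can also pass an optimizer object instead of the string name; the learning rate below is an illustrative value, not one tuned for this dataset.

model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.RMSprop(learning_rate=0.001))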

history=model.fit(dataset,steps_per_epoch=train_size // batch_size,epochs=1)
31370/31370 [==============================] - 344s 11ms/step - loss: 1.0911

We have used the 'adam' optimizer here; as shown above, you can also swap in other available optimizers to suit your needs. Now that we have trained our model, we will test it using some text inputs.

TESTING OUR MODEL

def preprocess(texts):
    #Encode the input text and one-hot encode it, just like the training data
    X=np.array(tokenizer.texts_to_sequences(texts))-1
    return tf.one_hot(X,max_id)

def next_char(text,temperature=1):
    #Predict the probability distribution of the next character and sample from it
    X_new=preprocess([text])
    y_proba=model.predict(X_new)[0,-1:,:]
    rescaled_logits=tf.math.log(y_proba)/temperature
    char_id=tf.random.categorical(rescaled_logits,num_samples=1)+1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

def complete_text(text,n_chars=50,temperature=1):
    #Repeatedly append the predicted next character to the text
    for _ in range(n_chars):
        text+=next_char(text,temperature)
    return text

We have defined some utility functions: when a text input is provided, they preprocess it into the format our model expects and predict the following characters, up to the specified number of characters. Let's test it with some input characters.

print(complete_text('t',temperature=0.2))
t and our of men these me some contake and faith hi
print(complete_text('j',temperature=0.2))
justress to see 'twixts i reved be such sonleous ha


Now, let's see how our model's predictions vary as we vary the input temperature parameter.
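Intuitively, the temperature rescales the predicted log-probabilities before sampling. Here is a toy illustration on a made-up three-character distribution: low temperature sharpens the distribution (more repetitive, "safer" text), while high temperature flattens it (more varied but error-prone text).

probs=tf.constant([[0.7,0.2,0.1]]) #Hypothetical next-character probabilities
for T in (0.2,1.0,5.0):
    rescaled=tf.math.log(probs)/T
    print(T,tf.nn.softmax(rescaled).numpy().round(3)) #Sharper at low T, flatter at high T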

print(complete_text('S',n_chars=20,temperature=5))
Sko-t'll;asm, nmio:-n
print(complete_text('S',n_chars=20,temperature=0.5))
Still in her in: 
what
print(complete_text('S',n_chars=20,temperature=0.1))
S the other and in pr

CONCLUSION

In this tutorial, we have looked at how to train a custom-built RNN model on Shakespearean text and how to test it using some text inputs. We also looked at how our model's predictions vary with the input temperature parameter.
