German to English Translator Using an LSTM Model with Keras and TensorFlow!


In this article, we will build an LSTM model that translates German text to English. The text is prepared with tokenization, and the model is trained with Keras on the TensorFlow backend. We have already worked with LSTMs once, when predicting stock prices for shares, so today we will expand on that use of LSTM.

The biggest problem for tourists is having to deal with the local language: with German in front of you, the need for a translator increases. Today we will see how to build a custom language translator step by step, with the pipeline being as follows:

  • Collecting Data
  • Preprocessing the data
  • Tokenizing the data
  • Training the model
  • Running Inference on Train and Test Data

Before we start, note that this tutorial uses Keras 2.3.1 and TensorFlow 2.1.0, and the code was written inside a Jupyter notebook.
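
If you want to double-check which versions you have installed, a quick check (assuming both packages are already installed) looks like this:

import keras
import tensorflow as tf

# this tutorial was written against Keras 2.3.1 and TensorFlow 2.1.0
print(keras.__version__)
print(tf.__version__)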

 

Prerequisites:

First, we will import all the libraries. If you don't have any of these, kindly install them with pip or conda from the command line.

 

import string
import re
from pickle import load,dump

from unicodedata import normalize
from numpy import array
from numpy.random import rand
from numpy.random import shuffle
from numpy import argmax

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.callbacks import ModelCheckpoint

from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu

Collecting Data:

For the data, we have used the dataset provided by the ManyThings.org website, which hosts a lot of great datasets of phrases and translations in various languages. You can download the tab-separated German-English file from the link provided here: https://raw.githubusercontent.com/jbrownlee/Datasets/master/deu.txt
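
If you would rather fetch the file from code instead of the browser, here is a minimal sketch using Python's standard library (assuming the URL above is still live):

from urllib.request import urlretrieve

# download the tab-separated English-German phrase file into the working directory
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/deu.txt'
urlretrieve(url, 'deu.txt')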

 

Cleaning Data:

Download the dataset and save it in the same directory as your code. We will now load the data with the function below:

def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

Now, if you observe the dataset inside the file, you will see that the data needs a lot of cleaning. First we will separate the English and German phrases; since they are separated by a tab character, we will use Python's strip and split functions.

def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in  lines]
    return pairs

After getting hold of each sentence pair, we will perform all the cleaning in the next function. The steps include:

  • Remove non-printable characters and symbols
  • Remove all the punctuations in the data
  • Normalize to ASCII
  • Make all the characters lowercase
  • Remove all the non-alphabetic characters

# clean a list of lines
def clean_pairs(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return array(cleaned)

And lastly, we will save this cleaned data to avoid going through cleaning each time.

def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

Execute all the above functions with the code below:

# load dataset
filename = 'deu.txt'
doc = load_doc(filename)
# split into english-german pairs
pairs = to_pairs(doc)
# clean sentences
cleaned = clean_pairs(pairs)
# save clean pairs to file
save_clean_data(cleaned, 'english-german.pkl')
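
It is worth spot-checking a few of the cleaned pairs to make sure the cleaning did what we expect; for example:

# print the first few cleaned English-German pairs
for i in range(5):
    print('[%s] => [%s]' % (cleaned[i][0], cleaned[i][1]))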

Tokenizing And Training:

Now that the cleaned English-German phrase pairs have been pickled, we will use the following function to load them back in for training.

# load a clean dataset
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))

This function returns the cleaned phrase pairs, with the English sentence in the first column and the German sentence in the second.

We will now reduce the dataset to a size that suits our use case, shuffle it, and split it into train and test sets, saving all three lists as pickles.

# load dataset
raw_dataset = load_clean_sentences('english-german.pkl')

# reduce dataset size
n_sentences = 10000
dataset = raw_dataset[:n_sentences, :]
# random shuffle
shuffle(dataset)
# split into train/test
train, test = dataset[:9000], dataset[9000:]
# save
save_clean_data(dataset, 'english-german-both.pkl')
save_clean_data(train, 'english-german-train.pkl')
save_clean_data(test, 'english-german-test.pkl')

Tokenizer:

To train an LSTM on text, we need a tokenizer. A tokenizer is a tool that builds a dictionary mapping each word to an integer index and converts strings into sequences of those indices; in Keras, the indices are assigned by how often each word appears in the corpus. We will use the following function to create a tokenizer:

def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer
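
To get a feel for what the tokenizer builds, here is a tiny illustrative example (the exact indices depend on the word frequencies in the texts you fit on):

# toy corpus: more frequent words get smaller indices
toy = create_tokenizer(['the cat sat', 'the dog sat'])
print(toy.word_index)                           # e.g. {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4}
print(toy.texts_to_sequences(['the cat sat']))  # e.g. [[1, 3, 2]]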

We also need the maximum sentence length in order to pad the sequences to a fixed size (and to shape the one-hot encoded targets), and for that, this function is used:

def max_length(lines):
    return max(len(line.split()) for line in lines)

What is one hot encoding?

One-hot encoding is the process of converting a categorical variable (here, a word index) into a binary vector that is all zeros except for a single one at the position of that category, a form that ML algorithms can work with more sensibly than raw integer labels. If you want to know more about one-hot encoding, please refer to this great blog post:

https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
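
Here is a quick illustration with Keras' to_categorical on a toy sequence of word indices (not from the real dataset):

# a toy padded target sequence with a vocabulary of size 5
seq = array([1, 3, 0, 0])
print(to_categorical(seq, num_classes=5))
# [[0. 1. 0. 0. 0.]
#  [0. 0. 0. 1. 0.]
#  [1. 0. 0. 0. 0.]
#  [1. 0. 0. 0. 0.]]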

We will write two functions: one to encode and pad the input sequences, and one to one-hot encode the output sequences. Both are listed below:

def encode_sequences(tokenizer, length, lines):
    X = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    X = pad_sequences(X, maxlen=length, padding='post')
    return X

def encode_output(sequences, vocab_size):
    ylist = list()
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        ylist.append(encoded)
    y = array(ylist)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y

Model Initialization And Training:

Time to build the LSTM model. We will use the Sequential model style with LSTM layers. I have covered LSTM in depth in a previous tutorial, so feel free to learn about these layers over there.

 

You can access my previous tutorial on LSTM here: https://mollickhub.xyz/stock-prices-prediction-with-tensorflow-keras/

The new layers here are the Embedding layer, which turns the encoded word indices into dense vectors, and the RepeatVector layer, which repeats the encoder's output vector once for each timestep of the output sequence.

A TimeDistributed Dense layer then produces the output word distribution at every timestep along the time axis.
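
To make the RepeatVector behaviour concrete, here is a minimal toy sketch (the shapes are illustrative only and are not those of the real model):

# toy model: a single vector of size 4 repeated 5 times along a new time axis
toy = Sequential()
toy.add(Dense(4, input_shape=(3,)))   # output shape: (batch, 4)
toy.add(RepeatVector(5))              # output shape: (batch, 5, 4)
print(toy.output_shape)               # (None, 5, 4)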

 

def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    return model

Now we will call all the above-defined helper functions to start the training process:

# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')

# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
print('English Vocabulary Size: %d' % eng_vocab_size)
print('English Max Length: %d' % (eng_length))

# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
print('German Vocabulary Size: %d' % ger_vocab_size)
print('German Max Length: %d' % (ger_length))

# prepare training data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
trainY = encode_output(trainY, eng_vocab_size)

# prepare validation data
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
testY = encode_output(testY, eng_vocab_size)

# define model
model = define_model(ger_vocab_size, eng_vocab_size, ger_length, eng_length, 256)
model.compile(optimizer='adam', loss='categorical_crossentropy')
# summarize defined model
print(model.summary())
plot_model(model, to_file='model.png', show_shapes=True)
# fit model
filename = 'model.h5'

checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
model.fit(trainX, trainY, epochs=30, batch_size=64, validation_data=(testX, testY), callbacks=[checkpoint], verbose=2)

The output of training will look something like this:

English Vocabulary Size: 2404
English Max Length: 5
German Vocabulary Size: 3856
German Max Length: 10
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 10, 256)           987136    
_________________________________________________________________
lstm_3 (LSTM)                (None, 256)               525312    
_________________________________________________________________
repeat_vector_2 (RepeatVecto (None, 5, 256)            0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 5, 256)            525312    
_________________________________________________________________
time_distributed_2 (TimeDist (None, 5, 2404)           617828    
=================================================================
Total params: 2,655,588
Trainable params: 2,655,588
Non-trainable params: 0

And if the training has been completed successfully, you will see this in the output:

Epoch 00029: val_loss improved from 2.15330 to 2.14423, saving model to model.h5
Epoch 30/30
 - 14s - loss: 0.5411 - val_loss: 2.1519

Epoch 00030: val_loss did not improve from 2.14423
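
Since the ModelCheckpoint callback saved the weights with the lowest val_loss to model.h5, you may want to reload that checkpoint before evaluating instead of using the weights from the final epoch:

# reload the best checkpoint written during training
model = load_model('model.h5')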

Now you have made a successful German to English translator using Keras; time to put the model to the test.

 

Testing the Model:

To test the model we will need a few more helper functions, mainly to turn the model's output back into strings and to calculate a score. These functions make use of the tokenizer and are listed below:

def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

For predicting sequences from the model, we will use NumPy's argmax to pick the most likely word index at each timestep and then look the word up in the tokenizer's dictionary:

def predict_sequence(model, tokenizer, source):
    prediction = model.predict(source, verbose=0)[0]
    integers = [argmax(vector) for vector in prediction]
    target = list()
    for i in integers:
        word = word_for_id(i, tokenizer)
        if word is None:
            break
        target.append(word)
    return ' '.join(target)

And lastly, to get a numeric measure of our model's performance, we will use the BLEU score. BLEU, or Bilingual Evaluation Understudy, is an evaluation method that compares a candidate translation (here, the English predicted by our model) against one or more reference translations (the actual English targets).
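
As a tiny standalone illustration of corpus_bleu (toy sentences, not from our dataset), a perfect match scores 1.0:

# one list of references per candidate; sentences are given as lists of tokens
references = [[['shall', 'we', 'begin']]]
candidates = [['shall', 'we', 'begin']]
print(corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))  # 1.0

The evaluate_model function below runs the model over a set of sources and reports the BLEU scores along with a few sample predictions: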

def evaluate_model(model, tokenizer, sources, raw_dataset):
    actual, predicted = list(), list()
    for i, source in enumerate(sources):
        # translate encoded source text
        source = source.reshape((1, source.shape[0]))
        translation = predict_sequence(model, tokenizer, source)
        raw_target, raw_src = raw_dataset[i]
        if i < 10:
            print('src=[%s], target=[%s], predicted=[%s]' % (raw_src, raw_target, translation))
        actual.append([raw_target.split()])
        predicted.append(translation.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

And now, for the final touch, we will call these functions and get our German-to-English output:

# test on some training sequences
print('train')
evaluate_model(model, eng_tokenizer, trainX, train)
# test on some test sequences
print('test')
evaluate_model(model, eng_tokenizer, testX, test)

The output of the above commands will look something like this (it may differ from model to model):

train
src=[gehort das dir], target=[is that yours], predicted=[is this yours]
src=[sollen wir beginnen], target=[shall we begin], predicted=[shall we begin]
src=[ich habe kohldampf], target=[im starving], predicted=[im starving]
src=[sie sind so gutig], target=[you are so kind], predicted=[you are you]
src=[ich werde punktlich da sein], target=[ill be on time], predicted=[ill be in]
src=[wurde sie gesehen], target=[was she seen], predicted=[did she seen]
src=[er ist nicht perfekt], target=[hes not perfect], predicted=[hes isnt perfect]
src=[verpiss dich], target=[go away], predicted=[get away]
src=[ich werde tom mitbringen], target=[ill bring tom], predicted=[ill ask tom]
src=[hast du ihn getroffen], target=[did you meet him], predicted=[did you meet him]
BLEU-1: 0.852658
BLEU-2: 0.791225
BLEU-3: 0.709834
BLEU-4: 0.461608
test
src=[wer fehlt], target=[who is absent], predicted=[whos missing]
src=[tom ist unverschamt], target=[tom is insolent], predicted=[tom is bald]
src=[nicht in panik ausbrechen], target=[dont panic], predicted=[dont be hurt]
src=[scher dich weg], target=[get lost], predicted=[get away]
src=[das meine ich ernst], target=[i am not kidding], predicted=[i feel mean]
src=[ich werde dich einladen], target=[ill treat you], predicted=[ill will you]
src=[ich muss blind sein], target=[i must be blind], predicted=[i must to go]
src=[tom fuhlte sich traurig], target=[tom felt sad], predicted=[tom felt]
src=[sie sind damlich], target=[youre silly], predicted=[youre silly]
src=[er ist im brunnen], target=[hes in the well], predicted=[hes is trouble]
BLEU-1: 0.508143
BLEU-2: 0.382528
BLEU-3: 0.316085
BLEU-4: 0.172019

Los geht's ("here we go")! We have just prepared a German to English translator using Keras and TensorFlow with the one-hot encoding technique. Be sure to play around with the model as you desire, for example by translating sentences of your own, as sketched below!
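
Here is a minimal sketch of how you might translate a new German sentence of your own (assuming the trained model and both tokenizers are still in memory, and remembering that the input must be cleaned the same way as the training data):

# encode an already-cleaned (lowercase, no punctuation) German sentence and translate it
new_src = encode_sequences(ger_tokenizer, ger_length, ['wer fehlt'])
print(predict_sequence(model, eng_tokenizer, new_src))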
