German to English Translator Using an LSTM Model with Keras and TensorFlow!

In this article, we will build an LSTM model that translates German text into English. We will tokenize the text and train the model using Keras with the TensorFlow backend. We have already worked with LSTMs once before, when predicting stock prices, so today we will extend their use to language translation.
The biggest problem for tourists is dealing with the local language. With German text constantly in front of you, the need for a translator grows. Today we will see how to build a custom language translator step by step, with the pipeline being as follows:
- Collecting Data
- Preprocessing the data
- Tokenizing the data
- Training the model
- Running Inference on Train and Test Data
Before we start: this tutorial uses keras==2.3.1 and tensorflow==2.1.0, and the code was written inside a Jupyter notebook.
Prerequisites:
First, we will import all the libraries. If you don't have any of these, install them with pip or conda from the command line.
import string
import re
from pickle import load, dump
from unicodedata import normalize
from numpy import array
from numpy.random import rand
from numpy.random import shuffle
from numpy import argmax
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from nltk.translate.bleu_score import corpus_bleu
Collecting Data:
For the data, we have used the dataset provided by ManyThings.org website. The website has a lot of great datasets for phrases and translations in various languages. You can download the data set from the link provided here: https://raw.githubusercontent.com/jbrownlee/Datasets/master/deu.txt
Cleaning Data:
Download the dataset and save it in the same folder as your code directory. We will now load the data into our code with the below code:
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
Now, if you observe the dataset inside the file, the data needs a lot of cleaning. First, we will separate the English and German phrases; since they are separated by a tab character, we will use Python's strip and split functions.
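For illustration, each line holds an English phrase and its German translation separated by a tab, so the split looks roughly like this (a hypothetical line; newer Tatoeba exports may append an extra attribution column, so check your downloaded file):

# hypothetical example line from deu.txt: English and German separated by a tab
line = "Hi.\tHallo!"
english, german = line.split('\t')[:2]  # keep only the first two columns
print(english, '->', german)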
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs
After getting hold of each sentence pair, we will perform all the cleaning in the next function. The things we will be working on include:
- Remove non-printable characters and symbols
- Remove all the punctuation in the data
- Normalize to ASCII
- Make all the characters lowercase
- Remove all the non-alphabetic characters
# clean a list of lines
def clean_pairs(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return array(cleaned)
And lastly, we will save this cleaned data to avoid going through cleaning each time.
def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)
Execute all the above functions with the code below:
# load dataset
filename = 'deu.txt'
doc = load_doc(filename)
# split into english-german pairs
pairs = to_pairs(doc)
# clean sentences
clean_pairs = clean_pairs(pairs)
# save clean pairs to file
save_clean_data(clean_pairs, 'english-german.pkl')
Tokenizing And Training:
After cleaning the English-German phrase pairs and saving them as a pickle, we will use the following function to load them back into the code for training.
# load a clean dataset
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))
This function returns the cleaned phrase pairs as an array.
We will now reduce the dataset to 10,000 sentence pairs for our use case and split it into train and test sets, saving three pickles in total.
# load dataset
raw_dataset = load_clean_sentences('english-german.pkl')
# reduce dataset size
n_sentences = 10000
dataset = raw_dataset[:n_sentences, :]
# random shuffle
shuffle(dataset)
# split into train/test
train, test = dataset[:9000], dataset[9000:]
# save
save_clean_data(dataset, 'english-german-both.pkl')
save_clean_data(train, 'english-german-train.pkl')
save_clean_data(test, 'english-german-test.pkl')
Tokenizer:
To train an LSTM on text, we need a tokenizer. A tokenizer is simply a tool that maps each word in the corpus to an integer index, building a word-to-index dictionary in which indices are assigned according to how often each word appears. We will use the following function to create a tokenizer:
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer
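As a quick sanity check (an illustrative toy example, not part of the training pipeline), you can inspect the dictionary a tokenizer builds; more frequent words get lower indices:

# toy example: the most frequent word gets the lowest index
toy = create_tokenizer(['shall we begin', 'we begin now'])
print(toy.word_index)                         # e.g. {'we': 1, 'begin': 2, 'shall': 3, 'now': 4}
print(toy.texts_to_sequences(['we begin']))   # e.g. [[1, 2]]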
We also need the maximum sentence length so that every sequence can be padded to the same size before encoding. For that, this function is used:
def max_length(lines):
    return max(len(line.split()) for line in lines)
What is one hot encoding?
One-hot encoding is a process by which categorical variables (here, the integer word indices) are converted into binary vectors, a form that ML algorithms can consume directly and often learn from more effectively. If you want to know more about one-hot encoding, please refer to this excellent blog:
https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
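As a tiny illustration (toy numbers, assuming a vocabulary of size 4), to_categorical turns each integer index into a binary vector:

from keras.utils import to_categorical

# toy example: indices 1 and 3 in a vocabulary of size 4
print(to_categorical([1, 3], num_classes=4))
# [[0. 1. 0. 0.]
#  [0. 0. 0. 1.]]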
We will make two functions for encoding the input and one-hot encoding the output sequences. Both of them are listed below:
def encode_sequences(tokenizer, length, lines):
    X = tokenizer.texts_to_sequences(lines)
    # pad sequences with 0 values
    X = pad_sequences(X, maxlen=length, padding='post')
    return X

def encode_output(sequences, vocab_size):
    ylist = list()
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        ylist.append(encoded)
    y = array(ylist)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y
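For intuition (another illustrative toy run, not part of the pipeline), encode_sequences converts each sentence to word indices and pads with zeros at the end up to the given length:

# toy example: three words padded out to length 5 with trailing zeros
toy = create_tokenizer(['shall we begin'])
print(encode_sequences(toy, 5, ['shall we begin']))
# e.g. [[1 2 3 0 0]]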
Model initializing And Training:
Time to build the LSTM model. We will use the Sequential model style with LSTM layers. I have covered LSTMs in depth in my previous tutorial, so feel free to learn about all these layers over there.
You can access my previous tutorial on LSTM here: https://valueml.com/stock-prices-prediction-with-tensorflow-keras/
The new layers here are an Embedding layer, which handles the encoded input by mapping each word index to a dense vector, and a RepeatVector layer, which simply repeats the encoder's output n times, once for each output timestep.
Finally, a TimeDistributed Dense layer produces the output, applying the same softmax layer at every step along the time axis; there is a quick shape check after the model definition below.
def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    return model
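To make the data flow concrete, here is a rough shape check (a toy sketch with made-up vocabulary sizes of 50 and 40; the real sizes are computed from the tokenizers in the next step):

# toy sanity check; the real vocabulary sizes and lengths come from the tokenizers below
toy_model = define_model(50, 40, 10, 5, 256)
toy_model.summary()
# Embedding               -> (None, 10, 256)   one 256-d vector per source word
# LSTM                    -> (None, 256)       encoder summary vector
# RepeatVector            -> (None, 5, 256)    summary repeated once per target word
# LSTM                    -> (None, 5, 256)
# TimeDistributed(Dense)  -> (None, 5, 40)     softmax over the target vocabulary at each step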
Now we will call all the above-defined helper functions to start the training process:
# load datasets
dataset = load_clean_sentences('english-german-both.pkl')
train = load_clean_sentences('english-german-train.pkl')
test = load_clean_sentences('english-german-test.pkl')

# prepare english tokenizer
eng_tokenizer = create_tokenizer(dataset[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
eng_length = max_length(dataset[:, 0])
print('English Vocabulary Size: %d' % eng_vocab_size)
print('English Max Length: %d' % (eng_length))

# prepare german tokenizer
ger_tokenizer = create_tokenizer(dataset[:, 1])
ger_vocab_size = len(ger_tokenizer.word_index) + 1
ger_length = max_length(dataset[:, 1])
print('German Vocabulary Size: %d' % ger_vocab_size)
print('German Max Length: %d' % (ger_length))

# prepare training data
trainX = encode_sequences(ger_tokenizer, ger_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])
trainY = encode_output(trainY, eng_vocab_size)

# prepare validation data
testX = encode_sequences(ger_tokenizer, ger_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])
testY = encode_output(testY, eng_vocab_size)

# define model
model = define_model(ger_vocab_size, eng_vocab_size, ger_length, eng_length, 256)
model.compile(optimizer='adam', loss='categorical_crossentropy')

# summarize defined model
print(model.summary())
plot_model(model, to_file='model.png', show_shapes=True)

# fit model
filename = 'model.h5'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
model.fit(trainX, trainY, epochs=30, batch_size=64, validation_data=(testX, testY), callbacks=[checkpoint], verbose=2)
The output of training will look something like this:
English Vocabulary Size: 2404
English Max Length: 5
German Vocabulary Size: 3856
German Max Length: 10
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 10, 256)           987136
_________________________________________________________________
lstm_3 (LSTM)                (None, 256)               525312
_________________________________________________________________
repeat_vector_2 (RepeatVecto (None, 5, 256)            0
_________________________________________________________________
lstm_4 (LSTM)                (None, 5, 256)            525312
_________________________________________________________________
time_distributed_2 (TimeDist (None, 5, 2404)           617828
=================================================================
Total params: 2,655,588
Trainable params: 2,655,588
Non-trainable params: 0
And if the training has been completed successfully, you will see this in the output:
Epoch 00029: val_loss improved from 2.15330 to 2.14423, saving model to model.h5
Epoch 30/30
 - 14s - loss: 0.5411 - val_loss: 2.1519
Epoch 00030: val_loss did not improve from 2.14423
Now that you have built a working German-to-English translator with Keras, it is time to put the model to the test.
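If you are testing in a fresh session, you can reload the best checkpoint saved by ModelCheckpoint during training (this assumes model.h5 sits in your working directory):

# reload the best weights saved during training
model = load_model('model.h5')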
Testing the Model:
For testing the model we will need some helper functions again, mainly to convert the model's output back into strings and to calculate an evaluation score. These functions make use of the tokenizer's word index; the first one maps an integer back to its word:
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
For predicting sequences from the model, we will use NumPy's argmax to pick the most probable word index at each timestep and look it up in our dictionary.
def predict_sequence(model, tokenizer, source):
    prediction = model.predict(source, verbose=0)[0]
    integers = [argmax(vector) for vector in prediction]
    target = list()
    for i in integers:
        word = word_for_id(i, tokenizer)
        if word is None:
            break
        target.append(word)
    return ' '.join(target)
And lastly, to measure our model's performance numerically, we will use the BLEU score. BLEU, or the Bilingual Evaluation Understudy, is an evaluation method that compares a candidate translation (here, the English produced by our model) against one or more reference translations (the ground-truth English).
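As a tiny illustration of how corpus_bleu is called (a toy example with made-up sentences, not from the dataset), each candidate token list is paired with a list of reference token lists:

from nltk.translate.bleu_score import corpus_bleu

# one candidate sentence with one reference; a perfect match scores 1.0 on unigrams
references = [[['shall', 'we', 'begin']]]
candidates = [['shall', 'we', 'begin']]
print(corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))  # 1.0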
def evaluate_model(model, tokenizer, sources, raw_dataset):
    actual, predicted = list(), list()
    for i, source in enumerate(sources):
        # translate encoded source text
        source = source.reshape((1, source.shape[0]))
        translation = predict_sequence(model, tokenizer, source)
        raw_target, raw_src = raw_dataset[i]
        if i < 10:
            print('src=[%s], target=[%s], predicted=[%s]' % (raw_src, raw_target, translation))
        actual.append([raw_target.split()])
        predicted.append(translation.split())
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
And now, for the final touch, we will call these functions and get our German-to-English output:
# test on some training sequences
print('train')
evaluate_model(model, eng_tokenizer, trainX, train)

# test on some test sequences
print('test')
evaluate_model(model, eng_tokenizer, testX, test)
The output of the above commands will look something like this; it may differ from model to model.
train
src=[gehort das dir], target=[is that yours], predicted=[is this yours]
src=[sollen wir beginnen], target=[shall we begin], predicted=[shall we begin]
src=[ich habe kohldampf], target=[im starving], predicted=[im starving]
src=[sie sind so gutig], target=[you are so kind], predicted=[you are you]
src=[ich werde punktlich da sein], target=[ill be on time], predicted=[ill be in]
src=[wurde sie gesehen], target=[was she seen], predicted=[did she seen]
src=[er ist nicht perfekt], target=[hes not perfect], predicted=[hes isnt perfect]
src=[verpiss dich], target=[go away], predicted=[get away]
src=[ich werde tom mitbringen], target=[ill bring tom], predicted=[ill ask tom]
src=[hast du ihn getroffen], target=[did you meet him], predicted=[did you meet him]
BLEU-1: 0.852658
BLEU-2: 0.791225
BLEU-3: 0.709834
BLEU-4: 0.461608
test
src=[wer fehlt], target=[who is absent], predicted=[whos missing]
src=[tom ist unverschamt], target=[tom is insolent], predicted=[tom is bald]
src=[nicht in panik ausbrechen], target=[dont panic], predicted=[dont be hurt]
src=[scher dich weg], target=[get lost], predicted=[get away]
src=[das meine ich ernst], target=[i am not kidding], predicted=[i feel mean]
src=[ich werde dich einladen], target=[ill treat you], predicted=[ill will you]
src=[ich muss blind sein], target=[i must be blind], predicted=[i must to go]
src=[tom fuhlte sich traurig], target=[tom felt sad], predicted=[tom felt]
src=[sie sind damlich], target=[youre silly], predicted=[youre silly]
src=[er ist im brunnen], target=[hes in the well], predicted=[hes is trouble]
BLEU-1: 0.508143
BLEU-2: 0.382528
BLEU-3: 0.316085
BLEU-4: 0.172019
Los geht's ("here we go")! We have just built a German-to-English translator using Keras and TensorFlow with the one-hot encoding technique. Be sure to play around with the model as you like!