Nationality predictor using LSTM

INTRODUCTION

In this tutorial, we will learn how to build a nationality predictor using an LSTM with the Keras API in TensorFlow. The model takes a person’s name as input and predicts their nationality.

GETTING THE DATASET

We will use a dataset from Kaggle that contains a set of people’s names and the nation each belongs to. The link to the dataset is here. Download the dataset from Kaggle and upload it to your Google Drive. If you work on Google Colab or use a Jupyter notebook, move the file to the home location of the Jupyter environment. I recommend using Colab because LSTMs take a really long time to train.
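If you are working in Colab, one simple option (assuming you are inside a Colab session) is to upload the file directly into the notebook’s working directory, so that the file name used later, name2lang.txt, is found where the code expects it:

# Optional: upload the dataset into the Colab session's working directory
from google.colab import files
uploaded=files.upload()   # choose name2lang.txt in the file dialog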

IMPORTING THE LIBRARIES

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM,GRU,RNN,Dense,Embedding

Here we have used LabelEncoder and OneHotEncoder to convert letters into vectors. On their own these encoders are limited: they assign fixed indices and binary arrays and do not learn any relationship between tokens or the context they appear in. Embedding is a more advanced technique that learns such relationships along with the rest of the model. The other libraries should already be familiar to you.
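As a quick illustration of the difference, here is a toy example (made-up letters, not part of the dataset pipeline; it reuses the imports above): LabelEncoder assigns an integer index, OneHotEncoder turns that index into a fixed binary vector, while an Embedding layer maps the index to a dense vector whose values are learned during training.

letters=['a','b','c']
enc=LabelEncoder().fit(letters)
print(enc.transform(['b']))                    # [1] -> integer index
ohe_demo=OneHotEncoder(sparse=False).fit(enc.transform(letters).reshape(-1,1))
print(ohe_demo.transform([[1]]))               # [[0. 1. 0.]] -> fixed binary vector
# An Embedding layer instead maps index 1 to a dense vector that is learned with the model
emb_demo=Embedding(input_dim=3,output_dim=4)
print(emb_demo(tf.constant([1])).numpy())      # random at first, tuned as training proceeds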

ANALYSING THE DATA

# Read the dataset: each line has the form "name, nationality"
f=open('name2lang.txt','r')
text=f.read().split('\n')
f.close()

# Allowed characters: the lowercase alphabet plus a padding token
ltrs="abcdefghijklmnopqrstuvwxyz"
ltrs=list(ltrs)
ltrs.append('##pad##')

Here we created a list that contains every lowercase letter of the alphabet, plus a special padding token.

names=[]   # cleaned names
lbl=[]     # nationality labels
freq={}    # frequency of each nationality
for i in text:
    i=i.split(',')
    if len(i)<2:      # skip blank or malformed lines
      continue
    name=''
    t=i[0].lower()
    for l in t:
      if l in ltrs:   # keep only lowercase letters
        name+=l
    names.append(name)
    lbl.append(i[1].strip())
    if lbl[-1] not in freq:
      freq[lbl[-1]]=1
    else:
      freq[lbl[-1]]+=1

We created two lists for storing the names and labels, and a dictionary for the frequency of each label.

plt.bar(list(freq.keys()),list(freq.values()),color='r')
plt.xticks(rotation=90)

This is why we built the frequency dictionary: it lets us visualise each label’s frequency as a bar chart.

plt.pie(freq.values(),labels=freq.keys(),radius=2,autopct='%1.1f%%')
plt.show()

We also visualised each label’s frequency using a pie chart.

lengths=[len(name) for name in names]

print('Average number of letters: ',np.mean(lengths))
print('Median no. of ltrs: ',np.median(lengths))
print('standard dev: ',np.std(lengths))
Average number of letters:  7.133665835411471
Median no. of ltrs:  7.0
standard dev:  2.0706207047158274

We analysed the lengths of the names and printed some basic descriptive statistics for them.

plt.hist(lengths,range=(2,20))
plt.show()

Histogram of the name lengths.

WORD TO VECTOR

max_len=9       # every name is padded/truncated to 9 characters
output_len=18   # number of nationality classes

# len(ltrs) is 27: 26 letters plus the padding token
le=LabelEncoder()
int_enc=le.fit_transform(ltrs)   # assign an integer index to every character
ohe=OneHotEncoder(sparse=False)
ohe.fit(int_enc.reshape(-1,1))   # map each index to a binary (one-hot) vector

Here, first, we used LabelEncoder to assign a numerical index to each character. Next, we fit OneHotEncoder on those indices; given an index, it produces a binary array with a 1 at that position and 0s everywhere else.
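The original post does not show the step that actually builds the arrays x and y used by train_test_split below, so here is a minimal sketch of what it could look like. It assumes the Embedding layer is fed plain integer indices (so only the LabelEncoder and padding are needed for x), every name is padded to max_len with the index of '##pad##' (which is 0, since it sorts before the letters), and the nationality labels are one-hot encoded into y via a second LabelEncoder called nat, the same nat that the evaluate() helper uses later for inverse_transform.

# Sketch (not shown in the original article): build x and y for train_test_split
pad_idx=le.transform(['##pad##'])[0]          # 0, consistent with mask_zero=True below
x=[]
for name in names:
  idx=list(le.transform(list(name[:max_len])))
  idx+=[pad_idx]*(max_len-len(idx))           # pad short names up to max_len
  x.append(idx)
x=np.asarray(x)

# one-hot encode the 18 nationality labels
nat=LabelEncoder()
y_int=nat.fit_transform(lbl)
y=tf.keras.utils.to_categorical(y_int,num_classes=output_len)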

TRAINING THE MODEL

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,shuffle=True,stratify=y)

x_train.shape
(16040, 9)

We split the data into a training set and a test set, holding out 20 per cent for testing. Now let’s build our LSTM model and train it on the training data.

hidden_size=256
model=Sequential([Embedding(input_dim=len(ltrs), output_dim=hidden_size,mask_zero=True),  # learns a vector per character; masks the padding index
                  LSTM(hidden_size,return_sequences=True),   # first LSTM layer, returns the full sequence
                  LSTM(50,return_sequences=False),           # second LSTM layer, returns only its final output
                  Dense(18,activation='softmax')             # one probability per nationality
    ])

It looks simple, right? But don’t underestimate this model: it is one of the most powerful architectures in Natural Language Processing. We built it with the Keras Sequential API. First comes an Embedding layer, which learns the character-to-vector representation along with the rest of the model. Next is an LSTM layer with hidden_size hidden units, and a second LSTM layer stacked on top of it; the output sequence of the first LSTM becomes the input of the second. Finally, a Dense layer with softmax activation takes the last LSTM’s output and produces a probability for each class. There are 18 nationalities, so the final output size is set to 18. Now, let us train the model on our dataset.

model.compile(loss='categorical_crossentropy',optimizer='Adam',metrics=['accuracy'])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 256)         6912      
_________________________________________________________________
lstm (LSTM)                  (None, None, 256)         525312    
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                61400     
_________________________________________________________________
dense (Dense)                (None, 18)                918       
=================================================================
Total params: 594,542
Trainable params: 594,542
Non-trainable params: 0
_________________________________________________________________
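Since we imported plot_model earlier, we can also (optionally) save this architecture as a diagram; note that this needs the pydot and graphviz packages installed.

# Optional: render the model architecture to an image file
plot_model(model,to_file='model.png',show_shapes=True)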
model.fit(x_train,y_train,epochs=50,batch_size=512,validation_split=0.2)
Epoch 35/50
26/26 [==============================] - 1s 25ms/step - loss: 0.4434 - accuracy: 0.8667 - val_loss: 0.6332 - val_accuracy: 0.8074
Epoch 36/50
26/26 [==============================] - 1s 23ms/step - loss: 0.4201 - accuracy: 0.8743 - val_loss: 0.6445 - val_accuracy: 0.8089
Epoch 37/50
26/26 [==============================] - 1s 23ms/step - loss: 0.3965 - accuracy: 0.8807 - val_loss: 0.6426 - val_accuracy: 0.8086
Epoch 38/50
26/26 [==============================] - 1s 24ms/step - loss: 0.3793 - accuracy: 0.8852 - val_loss: 0.6319 - val_accuracy: 0.8117
Epoch 39/50
26/26 [==============================] - 1s 23ms/step - loss: 0.3689 - accuracy: 0.8923 - val_loss: 0.6330 - val_accuracy: 0.8083
Epoch 40/50
26/26 [==============================] - 1s 23ms/step - loss: 0.3505 - accuracy: 0.8975 - val_loss: 0.6406 - val_accuracy: 0.8052
Epoch 41/50
26/26 [==============================] - 1s 24ms/step - loss: 0.3291 - accuracy: 0.8987 - val_loss: 0.6299 - val_accuracy: 0.8092
Epoch 42/50
26/26 [==============================] - 1s 23ms/step - loss: 0.3190 - accuracy: 0.9059 - val_loss: 0.6307 - val_accuracy: 0.8074
Epoch 43/50
26/26 [==============================] - 1s 23ms/step - loss: 0.2995 - accuracy: 0.9128 - val_loss: 0.6395 - val_accuracy: 0.8099
Epoch 44/50
26/26 [==============================] - 1s 23ms/step - loss: 0.2958 - accuracy: 0.9151 - val_loss: 0.6272 - val_accuracy: 0.8086
Epoch 45/50
26/26 [==============================] - 1s 23ms/step - loss: 0.2768 - accuracy: 0.9175 - val_loss: 0.6265 - val_accuracy: 0.8102
Epoch 46/50
26/26 [==============================] - 1s 24ms/step - loss: 0.2637 - accuracy: 0.9225 - val_loss: 0.6395 - val_accuracy: 0.8058
Epoch 47/50
26/26 [==============================] - 1s 24ms/step - loss: 0.2499 - accuracy: 0.9252 - val_loss: 0.6560 - val_accuracy: 0.7986
Epoch 48/50
26/26 [==============================] - 1s 23ms/step - loss: 0.2466 - accuracy: 0.9288 - val_loss: 0.6262 - val_accuracy: 0.8089
Epoch 49/50
26/26 [==============================] - 1s 23ms/step - loss: 0.2334 - accuracy: 0.9318 - val_loss: 0.6364 - val_accuracy: 0.8099
Epoch 50/50
26/26 [==============================] - 1s 23ms/step - loss: 0.2226 - accuracy: 0.9356 - val_loss: 0.6377 - val_accuracy: 0.8111

We trained the model for 50 epochs, and our final validation accuracy is about 81 per cent. Now let us evaluate the model on the test set and see how accurately it performs.

model.evaluate(x_test,y_test)
126/126 [==============================] - 1s 6ms/step - loss: 0.6686 - accuracy: 0.8105
[0.6685804128646851, 0.8104737997055054]

model.save_weights('model_weights.h5')

We saved the weights of our model for further use.
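To reuse these weights later, for example in a fresh session, you would rebuild the same architecture and then load the file back in. A minimal sketch:

# Rebuild an identical model and restore the trained weights
restored=Sequential([Embedding(input_dim=len(ltrs), output_dim=hidden_size,mask_zero=True),
                     LSTM(hidden_size,return_sequences=True),
                     LSTM(50,return_sequences=False),
                     Dense(18,activation='softmax')
    ])
restored.build(input_shape=(None,max_len))   # build the layers so the weights can be loaded
restored.load_weights('model_weights.h5')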

def process_ip(name):
  # lower-case the name and encode each character as its integer index
  name=name.lower()
  ip=le.transform(list(name))

  # pad with index 0 ('##pad##') or truncate so every input has length max_len
  pad_ip=[]
  for i in range(max_len):
    if i<len(ip):
      pad_ip.append(ip[i])
    else:
      pad_ip.append(0)

  # add a batch dimension so the model can consume a single name
  pad_ip=np.asarray(pad_ip)
  pad_ip=np.expand_dims(pad_ip,axis=0)
  return pad_ip

def evaluate(name):
  # run the model on the preprocessed name and pick the most probable class
  model_ip=process_ip(name)
  model_op=model.predict(model_ip)
  ans=np.asarray([np.argmax(model_op)])
  ans=nat.inverse_transform(ans)   # map the class index back to the nationality name
  ans=ans[0]

  return ans

These are the helper functions for testing. process_ip lower-cases the input name, converts each character to its index with the LabelEncoder, and pads the sequence to max_len, while evaluate feeds that vector to the model, picks the class with the highest softmax probability, and uses the nationality encoder to turn that index back into the corresponding nationality name.
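If you want more than the single best guess, the same softmax output can be sorted to show, say, the three most probable nationalities. A small sketch building on the helpers above:

# Sketch: return the top 3 predicted nationalities with their probabilities
def top3(name):
  probs=model.predict(process_ip(name))[0]
  best=np.argsort(probs)[::-1][:3]       # indices of the 3 highest scores
  return [(nat.inverse_transform([i])[0],float(probs[i])) for i in best]

For example, top3('sherlock') returns a list of three (nationality, probability) pairs, with the predicted class first.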

test_name=input()

print()
ans=evaluate(test_name)
print('Given Name  : ',test_name)
print('Nationality : ',ans)
sherlock

Given Name  :  sherlock
Nationality :  English

We read a person’s name as input, call the helper function on it, and finally print the predicted nationality.

CONCLUSION

In this tutorial, we learned how to build a nationality predictor using LSTM. This is just a basic NLP application, and as you go on, you will have to learn how to process sentences instead of words and represent them as a vector. There are many NLP applications, and things will really get interesting as you dig deeper into this subject.
