Nationality predictor using LSTM
INTRODUCTION
In this tutorial, we will learn how to build a nationality predictor using an LSTM with the Keras API of TensorFlow in Python. The model processes a person's name and predicts their nationality.
GETTING THE DATASET
We will use a dataset from Kaggle that contains a set of people's names and the nation each person belongs to. The link to the dataset is here. Download the dataset from Kaggle and upload it to your drive. If you work on Google Colab or use a Jupyter notebook, move the file to the home location of the Jupyter environment. I recommend using Colab because LSTM training takes a really long time.
IMPORTING THE LIBRARIES
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, RNN, Dense, Embedding
Here we use LabelEncoder and OneHotEncoder to convert letters into vectors. On their own, these encodings are not ideal because they do not learn any relationship between tokens or the context they appear in. Embedding is a more advanced technique that learns those relationships along with the model, as the sketch below illustrates. All the other libraries should already be well known to you.
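To make the difference concrete, here is a tiny standalone sketch (the layer sizes and indices are illustrative, not from the tutorial): an Embedding layer maps integer indices to dense, trainable vectors instead of sparse one-hot arrays.

import tensorflow as tf
from tensorflow.keras.layers import Embedding

# Illustrative only: 27 possible indices (26 letters + pad), 8-dimensional vectors.
emb = Embedding(input_dim=27, output_dim=8)
out = emb(tf.constant([[1, 2, 3]]))   # one sequence of three letter indices
print(out.shape)                      # (1, 3, 8): one dense vector per letter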
ANALYSING THE DATA
f = open('name2lang.txt', 'r')
text = f.read().split('\n')
ltrs="abcdefghijklmnopqrstuvwxyz" ltrs=list(ltrs) ltrs.append('##pad##')
Here we created a list containing all 26 lowercase letters, plus a special padding token.
names = []   # cleaned names
lbl = []     # nationality label for each name
freq = {}    # how many times each nationality appears
for i in text:
    i = i.split(',')
    name = ''
    t = i[0].lower()
    for l in t:
        if l in ltrs:        # keep only the letters a-z
            name += l
    names.append(name)
    lbl.append(i[1].strip())
    if lbl[-1] not in freq.keys():
        freq[lbl[-1]] = 1
    else:
        freq[lbl[-1]] += 1
We created two lists to store the names and labels, and a dictionary to count the frequency of each label.
plt.bar(list(freq.keys()), list(freq.values()), color='r')
plt.xticks(rotation=90)
This is why we built the frequency dictionary: it lets us visualise each label's frequency as a bar chart.
plt.pie(freq.values(), labels=freq.keys(), radius=2, autopct='%1.1f%%')
plt.show()
We also visualised the label frequencies as a pie chart.
lengths = [len(name) for name in names]
print('Average number of letters: ', np.mean(lengths))
print('Median no. of ltrs: ', np.median(lengths))
print('standard dev: ', np.std(lengths))
Average number of letters:  7.133665835411471
Median no. of ltrs:  7.0
standard dev:  2.0706207047158274
We analysed the lengths of the names and printed some basic statistics for them.
plt.hist(lengths, range=(2, 20))
plt.show()
This plots a histogram of the name lengths.
WORD TO VECTOR
max_len = 9       # names are padded/truncated to this length
output_len = 18   # number of nationality classes
len(ltrs)         # 27: 26 letters plus the pad token
le = LabelEncoder()
int_enc = le.fit_transform(ltrs)
ohe = OneHotEncoder(sparse=False)   # on scikit-learn >= 1.2, use sparse_output=False
ohe.fit(int_enc.reshape(-1, 1))
Here, first, we used LabelEncoder to assign a numerical index to each letter (and to the pad token). Next, we used OneHotEncoder to convert the assigned indices into binary arrays: it takes an index as input and outputs a binary array with 1 at that index and 0 in the remaining entries.
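For example (a hypothetical check, not part of the original code), encoding a single letter works like this:

idx = le.transform(['c'])                  # integer index of 'c', e.g. array([3])
vec = ohe.transform(idx.reshape(-1, 1))    # binary row: 1 at idx, 0 elsewhere
print(idx, vec.shape)                      # shape (1, 27): 26 letters + the pad token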
TRAINING THE DATASET
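Before we can split anything, we need a feature array x and a label array y. Here is a minimal sketch of how they can be built, assuming the names are integer-encoded with le, post-padded to max_len, and the labels one-hot encoded; the exact construction may differ from the original:

# Assumed construction: letters -> integer indices, padded to max_len.
# Index 0 is '##pad##' (it sorts first in LabelEncoder), matching mask_zero=True below.
x = [le.transform(list(name)) for name in names]
x = pad_sequences(x, maxlen=max_len, padding='post', value=0)
# Labels: nationality -> integer -> 18-way one-hot.
lbl_enc = LabelEncoder()
y = OneHotEncoder(sparse=False).fit_transform(lbl_enc.fit_transform(lbl).reshape(-1, 1))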
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=True, stratify=y)
x_train.shape
We split the data into a training set and a test set, holding out 20 percent for testing; stratify=y keeps the proportion of each nationality the same in both sets. Let's build our LSTM model and train it on the dataset.
hidden_size = 256
model = Sequential([
    Embedding(input_dim=len(ltrs), output_dim=hidden_size, mask_zero=True),
    LSTM(hidden_size, return_sequences=True),
    LSTM(50, return_sequences=False),
    Dense(18, activation='softmax')
])
It looks simple, right? But don't underestimate this model: the LSTM is one of the most powerful architectures in natural language processing. We built the model with the Keras Sequential API. First comes an Embedding layer, which learns the letter-to-vector representation along with the rest of the model. Next is an LSTM layer with hidden_size (256) hidden units; because return_sequences=True, it outputs its full sequence of hidden states. We stacked a second LSTM layer with 50 units on top of it, so the output sequence of the first block becomes the input of the second. Finally, a Dense layer with softmax activation takes the last LSTM block's output and produces a probability for each class. The dataset contains 18 nationalities, so the final output size is set to 18. Now, let us train the model on the dataset.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
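The plot_model utility we imported earlier can also draw the architecture as a diagram (it requires the pydot package and Graphviz to be installed):

# Optional: render the stacked architecture to an image file.
plot_model(model, to_file='model.png', show_shapes=True)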
model.fit(x_train,y_train,epochs=50,batch_size=512,validation_split=0.2)
We trained the model for 50 epochs, and our final validation accuracy is about 81 percent. Now let us evaluate it on the test set and see how well the model performs.
model.evaluate(x_test,y_test)
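Finally, we can try the trained model on a single name. The helper below is a hypothetical sketch that assumes the encoding scheme from the earlier sections (le for letters, lbl_enc for nationalities, post-padding to max_len):

# Hypothetical inference helper; assumes the name contains only the letters a-z.
def predict_nationality(name):
    seq = le.transform(list(name.lower()))                 # letters -> integer indices
    seq = pad_sequences([seq], maxlen=max_len, padding='post')
    probs = model.predict(seq)[0]                          # 18 class probabilities
    return lbl_enc.inverse_transform([np.argmax(probs)])[0]

print(predict_nationality('nakamura'))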
CONCLUSION
In this tutorial, we learned how to build a nationality predictor using LSTM. This is just a basic NLP application, and as you go on, you will have to learn how to process sentences instead of words and represent them as a vector. There are many NLP applications, and things will really get interesting as you dig deeper into this subject.