Introduction to Natural Language Processing (NLP) using TensorFlow in Python

Before we begin, consider a scenario where you want to communicate some very important information to a machine, but because the machine understands only a limited vocabulary, it fails to understand you.

Feeling stuck? The answer to this problem is Natural Language Processing!

Now, you may be wondering: what exactly is Natural Language Processing? You may have several questions about it. I am sure this article will address your concerns and questions!

Introduction to Natural Language Processing (NLP)

Natural Language Processing (NLP) gives computers the ability to perform tasks involving human language, which comes with a large and diverse vocabulary.

NLP is used in many places, including translating text between languages and extracting vital information from text.

The flowchart below represents a simple way to implement NLP:

Figure1: Flowchart representing basic Natural Language Processing

Let’s dive more into the NLP concepts.

Some Important NLP Concepts

There are several concepts involved when one talks about Natural Language Processing. They are as follows:

  1. Tokenization & Stopwords Removal
  2. Stemming
  3. Building a Vocabulary
  4. Vectorization

We will cover each of them in brief.

1. Tokenization & Stopwords Removal

Tokenization is simply breaking text down into separate words (tokens). Stopword removal means removing words that are not relevant to our processing, such as ‘the’, ‘is’, and ‘a’. It also makes our dataset smaller, and hence faster to process.
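The two steps above can be sketched in plain Python. The stopword list here is a tiny hand-picked sample for illustration; real projects usually rely on a curated list from a library.

```python
# A minimal sketch of tokenization and stopword removal.
# STOPWORDS is a tiny illustrative sample, not an official list.
STOPWORDS = {"the", "is", "a", "an", "and", "to", "of"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return text.lower().split()

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("The machine is learning to process text")
print(remove_stopwords(tokens))  # ['machine', 'learning', 'process', 'text']
```

Note how the stopwords ‘the’, ‘is’, and ‘to’ are gone, leaving a shorter list for later steps.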

2. Stemming

Stemming is converting words into their base forms by stripping suffixes. For instance, given the word ‘sleeping’, we can convert it to ‘sleep’ (the base form of the original word) to make our processing easier. A closely related but distinct process, which maps words to their dictionary forms using vocabulary and grammar rules rather than simple suffix stripping, is known as ‘Lemmatization’.
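A toy suffix-stripping stemmer makes the idea concrete. This is only a sketch; real projects typically use an established algorithm such as the Porter stemmer.

```python
# A toy stemmer: strip a common suffix if the remaining stem
# is still at least three characters long.
SUFFIXES = ("ing", "ed", "s")

def simple_stem(word):
    """Return the word with a common suffix stripped, if safe to do so."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(simple_stem("sleeping"))  # sleep
print(simple_stem("jumped"))    # jump
print(simple_stem("cat"))       # cat (left unchanged)
```

A real stemmer handles many more rules (e.g. ‘studies’ → ‘studi’), but the principle is the same: reduce surface variations of a word to one common stem.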

3. Building a Vocabulary

This concept builds a common vocabulary: a list of all the unique words in the original text. This is helpful later, when our code has to convert the words to numbers.
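A vocabulary of this kind can be built with a plain dictionary, assigning each new word the next available index (this mirrors what a tokenizer does internally, in a simplified form):

```python
# Build a word-to-index vocabulary from raw sentences.
sentences = ["the cat sat", "the dog sat down"]

vocab = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1  # indices start at 1

print(vocab)  # {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4, 'down': 5}
```

Each unique word appears exactly once, no matter how many times it occurs in the text.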

4. Vectorization

In this concept, we convert our words or sentences into vector form. The size of each vector equals the size of the vocabulary, which is usually larger than the length of any single sentence. After doing this, we can identify all the unique as well as the most frequent words in the vocabulary.
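A simple bag-of-words sketch shows why the vector length tracks the vocabulary size rather than the sentence length (index 0 is left unused here, matching the 1-based vocabulary above):

```python
# Bag-of-words vectorization: each sentence becomes a count vector
# whose length is the vocabulary size, not the sentence length.
vocab = {"the": 1, "cat": 2, "sat": 3, "dog": 4, "down": 5}

def vectorize(sentence, vocab):
    """Return a count vector of size len(vocab) + 1 (index 0 unused)."""
    vector = [0] * (len(vocab) + 1)
    for word in sentence.split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector

print(vectorize("the dog sat", vocab))  # [0, 1, 0, 1, 1, 0]
```

The three-word sentence produces a six-element vector: one slot per vocabulary word, plus the unused slot at index 0.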

Implementing some NLP Concepts in TensorFlow with Python

The first step is always to import the necessary Python modules. The code for this is shown below:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

Tokenization using TensorFlow

As mentioned earlier, tokenization is simply breaking sentences down into words. The code to implement this is shown below:

sample_texts = ['ValueML clears all my concepts',
                'I love clearing my coding concepts through ValueML',
                'All Machine Learning and Deep Learning concepts are available here']
token = Tokenizer(num_words=1000)
token.fit_on_texts(sample_texts)
words = token.word_index
print(words)

The code is described line by line below:

  • Lines 1–3: Create a list of sample sentences that we want to process further.
  • Line 4: Initialize a Tokenizer object, passing the maximum number of words to keep as a parameter.
  • Line 5: Fit the tokenizer created in line 4 on the sentences created in lines 1–3.
  • Line 6: Store the resulting word index so that the words can be processed later.
  • Line 7: Print the dictionary whose keys are the words and whose values are the numbers assigned to them by the tokenizer.

The output of the tokenizer is shown below:

{'concepts': 1, 'valueml': 2, 'all': 3, 'my': 4, 'learning': 5, 
'clears': 6, 'i': 7, 'love': 8, 'clearing': 9, 'coding': 10, 
'through': 11, 'machine': 12, 'and': 13, 'deep': 14, 'are': 15, 
'available': 16, 'here': 17}

Converting more sentences to numbers

For machines to understand the text, we have to convert it into numbers, i.e. a machine-understandable form. The code for this is shown below:

input_text = ["I love to go through ValueML",
              "ValueML is best for Machine Learning concepts"]
text_to_num = token.texts_to_sequences(input_text)
print(text_to_num)

The output of the code above is shown below where the input text sentences are converted to numerical form.

[[7, 8, 11, 2], [2, 12, 5, 1]]

BUT there is a problem with this approach: words that are not included in the original vocabulary are not handled properly. This is shown in the figure below:

Figure2: The problem of missing words in the vocabulary that remain unhandled

Handling the issue we are facing:

The issue can be handled in multiple ways:

  1. Having a huge vocabulary from the start
  2. Using the out-of-vocabulary (OOV) parameter when initializing the tokenizer (only efficient for small datasets)

The first approach is not feasible without a proper dataset, so for now we will make use of the out-of-vocabulary parameter. The code for this is shown below:

new_token = Tokenizer(num_words=1000, oov_token="<I am missing!>")
new_token.fit_on_texts(sample_texts)
words = new_token.word_index
print(words)

The ‘oov_token’ parameter sets the out-of-vocabulary (OOV) token. It can be assigned any value you wish, as long as it cannot be confused with an actual word from the dictionary, so make it as unique as possible.

This time the output contains the default value for the OOV words as shown below:

{'<I am missing!>': 1, 'concepts': 2, 'valueml': 3, 'all': 4, 
'my': 5, 'learning': 6, 'clears': 7, 'i': 8, 'love': 9, 
'clearing': 10, 'coding': 11, 'through': 12, 'machine': 13, 
'and': 14, 'deep': 15, 'are': 16, 'available': 17, 'here': 18}

Now, when the text is converted into numbers, every word that is missing from the vocabulary is assigned the OOV value from the dictionary shown above.

The output of the same is shown below:

[[8, 9, 1, 1, 12, 3], [3, 1, 1, 1, 13, 6, 2]]
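The OOV mapping can also be reproduced by hand from the word index shown earlier: every word missing from the dictionary falls back to index 1. This sketch mimics what `texts_to_sequences` does with an OOV token set:

```python
# Reproducing the OOV mapping by hand from the fitted word index.
word_index = {'<I am missing!>': 1, 'concepts': 2, 'valueml': 3, 'all': 4,
              'my': 5, 'learning': 6, 'clears': 7, 'i': 8, 'love': 9,
              'clearing': 10, 'coding': 11, 'through': 12, 'machine': 13,
              'and': 14, 'deep': 15, 'are': 16, 'available': 17, 'here': 18}

def to_sequence(sentence, word_index, oov_index=1):
    """Map each word to its index, falling back to the OOV index."""
    return [word_index.get(w.lower(), oov_index) for w in sentence.split()]

print(to_sequence("I love to go through ValueML", word_index))
# [8, 9, 1, 1, 12, 3]
```

Here ‘to’ and ‘go’ are absent from the vocabulary, so both map to 1 instead of being silently dropped.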

Summing up what we learned!

Congratulations! You have reached the end of the article!

Now let’s revisit what we learned today. We first gained an understanding of what exactly Natural Language Processing (NLP) is and why it is needed. Then we learned about a few concepts and terms related to NLP, and in the later sections, we learned to implement a few of them using TensorFlow.

And yes there is a lot more to explore in NLP!

Stay tuned to learn more!
