Sentence Similarity using Transformers with Natural Language Processing

In this lesson, you will learn how to identify whether two sentences are similar using Transformers for Natural Language Processing in Python.


Sentence similarity is the task of identifying how similar two texts are. Sentence similarity models turn input texts into vectors (embeddings) that capture semantic information, then calculate how close (similar) those vectors are. This is useful for tasks such as information retrieval and clustering/grouping.

The logic behind sentence similarity is this:

  • First, take a sentence and turn it into a vector.
  • Take many other sentences and turn them into vectors.
  • Find the sentences that have the shortest distance (Euclidean distance) or the smallest angle (cosine similarity) between them.
  • We now have a way to compare the semantic similarity of texts – simple!
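
The two measures named in the list above can be sketched in a few lines of plain Python. This is a minimal illustration on made-up three-element vectors rather than real embeddings; the function names euclidean_distance and cosine_sim are chosen for this example and do not come from any library.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (smaller means closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_sim(a, b):
    """Cosine of the angle between two vectors (closer to 1 means a smaller angle)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # points the same way as query, but is longer
orthogonal = [0.0, 0.0, 3.0]       # at a right angle to query

print(cosine_sim(query, same_direction))
print(cosine_sim(query, orthogonal))
print(euclidean_distance(query, same_direction))
```

Note that same_direction is not at zero distance from query even though the angle between them is zero. Cosine similarity ignores vector length, which is one reason it is a common choice for comparing embeddings.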

On a high level, there isn’t much else to it. But, of course, we want to understand what’s going on in greater depth and implement it in Python as well! So, let’s get started.

Why BERT Helps

BERT is widely seen as the MVP of NLP, and a large part of that comes from its ability to embed the meaning of words into densely packed vectors.

We call them dense vectors because every element carries a meaningful, typically non-zero value, as opposed to sparse vectors, such as one-hot encoded vectors, in which the majority of values are 0.
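
To make the contrast concrete, here is a small sketch that builds a one-hot (sparse) vector and sets it beside a dense one. The vocabulary, the helper one_hot, and the dense values are all invented for illustration; a real BERT vector would be learned and far longer.

```python
def one_hot(word, vocabulary):
    """Return a vector with a single 1 at the position of `word`, 0 elsewhere."""
    return [1 if entry == word else 0 for entry in vocabulary]

# Hypothetical five-word vocabulary
vocabulary = ["coffin", "fish", "jelly", "leprechaun", "walnut"]

sparse_vector = one_hot("jelly", vocabulary)

# Made-up dense vector: every position holds a non-zero value
dense_vector = [0.21, -0.47, 0.05, 0.88, -0.13]

print(sparse_vector)
print(sum(1 for v in sparse_vector if v != 0), "non-zero value in the sparse vector")
print(sum(1 for v in dense_vector if v != 0), "non-zero values in the dense vector")
```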

BERT excels at producing dense vectors, and each encoder layer (there are 12 in BERT base) generates a collection of dense vectors.

In BERT base, each of these vectors has 768 elements. These 768 values are the numerical representation of a single token, which we can use as a contextual word embedding.

Because each token is represented by one of these vectors (produced by each encoder), we are looking at a tensor whose dimensions are the number of tokens by 768.

We may use these tensors to generate semantic representations of the input sequence by transforming them. Then, using our similarity metrics, we can compute the degree of similarity between distinct sequences.

The last hidden state tensor is the simplest and most commonly used tensor, and it is readily produced by the BERT model.

Of course, at 512×768 (512 tokens, each with 768 values), this is a large tensor, and we need a single vector to apply our similarity metrics to.

To do this, we must first transform our last hidden states tensor to a vector of 768 dimensions.

Creating the Vector

We utilize a mean pooling technique to transform our last hidden states tensor into a vector.

Each of those 512 tokens is represented by 768 values. Mean pooling takes the mean of all token embeddings and compresses them into a single 768-dimensional vector, which becomes our sentence vector.

At the same time, we cannot simply take the mean over every position as it is. Padding tokens carry no content, so they must be excluded from the average.
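
The pooling step described above can be sketched as follows. This is a minimal plain-Python version under stated assumptions: the helper mean_pool is hypothetical, the embeddings are toy 3-dimensional lists standing in for the 768-dimensional ones, and padding is marked with an attention mask (1 for a real token, 0 for padding), the convention used by HuggingFace tokenizers.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padded positions."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    real_tokens = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            real_tokens += 1
            for i, value in enumerate(vector):
                sums[i] += value
    real_tokens = max(real_tokens, 1)  # guard against an all-padding input
    return [total / real_tokens for total in sums]

# Toy input: two real tokens followed by two padding positions.
# The padding rows hold arbitrary values to show that the mask removes them.
token_embeddings = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [9.0, 9.0, 9.0],
    [9.0, 9.0, 9.0],
]
attention_mask = [1, 1, 0, 0]

sentence_vector = mean_pool(token_embeddings, attention_mask)
print(sentence_vector)  # [2.0, 3.0, 4.0]
```

Averaging all four rows without the mask would pull the result toward the padding values, which is exactly the distortion the mask prevents.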

In Code

We’ll go through two approaches: the simple one and the slightly more complicated one.

The Simple Approach: Sentence-Transformers

The simplest way to accomplish what we just discussed is to use the sentence-transformers package, which wraps most of this procedure in a few lines of code.

First, we install the library with pip install sentence-transformers. It uses HuggingFace’s transformers behind the scenes, so sentence-transformers models can be found on the HuggingFace model hub.

We’ll use the bert-base-nli-mean-tokens model, which applies the same logic we have covered so far.

(It also uses 128 input tokens instead of 512.)

Let’s write some sentences, then initialize our model and encode them:

  • Write a few sentences to encode (sentences 0 and 2 are similar) and initialize the model:
from sentence_transformers import SentenceTransformer

# Sentences 0 and 2 say roughly the same thing in different words
sentences = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell."
]

# The pretrained model is downloaded on first use
model = SentenceTransformer('bert-base-nli-mean-tokens')


  • Encode the sentences and check the shape of the result:
sent_embeddings = model.encode(sentences)
sent_embeddings.shape

(4, 768)

We now have four sentence embeddings, each with 768 values.

Now we take those embeddings and compute the cosine similarity between them. So, for sentence 0:

  • Three years later, the coffin was still full of Jello.

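To find the most similar sentence, we compare the embedding of sentence 0 with the embeddings of the remaining sentences:

from sklearn.metrics.pairwise import cosine_similarity as cs
cs([sent_embeddings[0]], sent_embeddings[1:])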


The result shows that sentences[0] (“Three years later, the coffin was still full of Jello.”) and sentences[2] (“The person box was packed with jelly many dozens of months later.”) have a similarity score of 0.72, which is higher than the score of sentence 0 against either of the other sentences.
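
When sentence 0 is compared against sentences 1, 2 and 3, the scores come back in that order, so each score has to be mapped back to its sentence index. The sketch below shows one way to do that and to rank the candidates. The helper rank_candidates is hypothetical, and the scores are placeholders chosen for illustration, not real model output.

```python
def rank_candidates(scores, candidate_indices):
    """Pair each score with its sentence index and sort from most to least similar."""
    pairs = zip(candidate_indices, scores)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Placeholder scores for sentences 1, 2 and 3 (illustrative values only)
scores = [0.10, 0.90, 0.40]
candidate_indices = [1, 2, 3]

ranking = rank_candidates(scores, candidate_indices)
best_index, best_score = ranking[0]
print(f"Most similar to sentence 0: sentence {best_index} (score {best_score})")
```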
