A Brief about Subword Tokenization

In this tutorial, you will learn how to generate a subword vocabulary from a dataset and use it to build a text.BertTokenizer from that vocabulary.

The fundamental advantage of a subword tokenizer is that it interpolates between word-based and character-based tokenization. Common words get a slot in the vocabulary, while the tokenizer falls back on word fragments and individual characters for unfamiliar words.

Introduction

TensorFlow implementations of many popular tokenizers are included in the tensorflow_text package. The three subword tokenizers are as follows:

  • text.BertTokenizer – BertTokenizer is a higher-level interface. It contains BERT’s token splitting mechanism as well as a WordPieceTokenizer. It accepts sentences and returns token IDs.
  • text.WordpieceTokenizer – The WordpieceTokenizer class is a lower-level interface. It implements only the WordPiece algorithm, so you must standardize the text and split it into words before calling it. It accepts words as input and returns token IDs (a sketch of the difference between these two interfaces follows this list).
  • text.SentencepieceTokenizer – The SentencepieceTokenizer requires a more involved setup. Its initializer needs a pre-trained SentencePiece model; instructions for building one can be found in the google/sentencepiece repository. It can accept whole sentences as input when tokenizing.
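
To get a feel for the difference between the first two interfaces, here is a minimal sketch that tokenizes the same text both ways. The tiny vocabulary and example strings below are made up purely for illustration; a real vocabulary is generated later in this tutorial.

import tensorflow as tf
import tensorflow_text as text

# Toy vocabulary, purely illustrative. "##" marks word-internal pieces.
toy_vocab = ["[UNK]", "the", "un", "##known", "##able", "search"]
lookup = tf.lookup.StaticVocabularyTable(
    num_oov_buckets=1,
    initializer=tf.lookup.KeyValueTensorInitializer(
        keys=toy_vocab,
        values=tf.range(len(toy_vocab), dtype=tf.int64)))

# text.BertTokenizer accepts whole sentences and splits them into words itself.
bert_tok = text.BertTokenizer(lookup, lower_case=True)
print(bert_tok.tokenize(["the unknown searchable"]))

# text.WordpieceTokenizer expects the text to be split into words already.
wp_tok = text.WordpieceTokenizer(lookup, token_out_type=tf.int64)
print(wp_tok.tokenize([["the", "unknown", "searchable"]]))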

This tutorial builds a WordPiece vocabulary from the top down, starting with existing words. This approach does not work for Japanese, Chinese, or Korean, since those languages do not have clear multi-character units delimited by spaces. To tokenize those languages, consider using text.SentencepieceTokenizer, text.UnicodeCharTokenizer, or this approach.
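
As a quick illustration of one of those alternatives, text.UnicodeCharTokenizer needs no trained vocabulary at all: it simply splits strings into Unicode code points. The example string below is arbitrary and only meant as a sketch.

import tensorflow_text as text

# Split each string into Unicode code points; no vocabulary is required.
char_tokenizer = text.UnicodeCharTokenizer()
tokens = char_tokenizer.tokenize(["こんにちは"])
print(tokens)                             # one integer code point per character
print(char_tokenizer.detokenize(tokens))  # round-trips back to the original string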

Setup

Let’s import all the required Python libraries for subword tokenization.

import collections
import os
import pathlib
import re
import string
import sys
import tempfile
import time
import numpy as np
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds
import tensorflow_text as text
import tensorflow as tf

Import Dataset

Let’s download the Portuguese-English translation dataset from TensorFlow Datasets:

ex, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_ex, val_ex = ex['train'], ex['validation']

This dataset generates Portuguese/English sentence pairs:

for pt, eng in train_ex.take(1):
  print("Portuguese: ", pt.numpy().decode('utf-8'))
  print("English:   ", eng.numpy().decode('utf-8'))

Output

Portuguese:  e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .
English:    and when you improve searchability , you actually take away the one advantage of print , which is serendipity .

The above output shows that:

  • All the text is lowercase.
  • There are spaces around the punctuation marks.
  • It is unclear whether or not Unicode normalization is being applied (a quick check follows below).
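
As a rough sanity check of that last point, you can compare a sentence with its normalized form using text.normalize_utf8. This is only a sketch; the choice of NFKC here is arbitrary, and text.BertTokenizer with lower_case=True applies its own normalization later anyway.

for pt, en in train_ex.take(1):
  # Compare the raw sentence with its NFKC-normalized form
  normalized = text.normalize_utf8(pt, 'NFKC')
  print(bool(tf.reduce_all(tf.equal(normalized, pt)).numpy()))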

Generate the vocabulary

This section generates a WordPiece vocabulary from a dataset. If you already have a vocabulary file and just want to see how to build a text.BertTokenizer or text.WordpieceTokenizer with it, you can skip ahead to the tokenizer-building step below.

The tensorflow_text pip package contains the vocabulary generating code. It is not automatically imported; you must explicitly import it:

from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

The vocabulary will be generated via the bert_vocab.bert_vocab_from_dataset function.

There are several arguments you may use to modify its behavior. You’ll primarily be using the defaults in this lesson. If you want to understand more about the alternatives, read about the algorithm first, then examine the code.

Code

bert_tokenizer_params=dict(lower_case=True)
reserved_tokens=["[PAD]", "[UNK]", "[START]", "[END]"]

bert_vocab_args = dict(
    # The target vocabulary size
    vocab_size = 8000,
    # Reserved tokens that must be included in the vocabulary
    reserved_tokens=reserved_tokens,
    # Arguments for `text.BertTokenizer`
    bert_tokenizer_params=bert_tokenizer_params,
    # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
    learn_params={},
)

Let’s generate a vocabulary from the Portuguese data. Since the dataset yields (pt, en) pairs, first select just the Portuguese sentences:

# Keep only the Portuguese half of each (pt, en) pair
train_pt = train_ex.map(lambda pt, en: pt)

pt_vocab = bert_vocab.bert_vocab_from_dataset(
    train_pt.batch(1000).prefetch(2),
    **bert_vocab_args
)

print(pt_vocab[:10])
print(pt_vocab[100:110])
print(pt_vocab[1000:1010])
print(pt_vocab[-10:])

Output

['[PAD]', '[UNK]', '[START]', '[END]', '!', '#', '$', '%', '&', "'"]
['no', 'por', 'mais', 'na', 'eu', 'esta', 'muito', 'isso', 'isto', 'sao']
['90', 'desse', 'efeito', 'malaria', 'normalmente', 'palestra', 'recentemente', '##nca', 'bons', 'chave']
['##–', '##—', '##‘', '##’', '##“', '##”', '##⁄', '##€', '##♪', '##♫']

Write the Portuguese vocabulary to a file and save it as pt_vocab.txt:

def write_vocab_file(filepath, vocab):
  with open(filepath, 'w') as f:
    for token in vocab:
      print(token, file=f)

write_vocab_file('pt_vocab.txt', pt_vocab)

Generate another vocabulary from the English data, this time selecting the English half of each pair:

# Keep only the English half of each (pt, en) pair
train_en = train_ex.map(lambda pt, en: en)

en_vocab = bert_vocab.bert_vocab_from_dataset(
    train_en.batch(1000).prefetch(2),
    **bert_vocab_args
)

print(en_vocab[:10])
print(en_vocab[100:110])
print(en_vocab[1000:1010])
print(en_vocab[-10:])

Output

['[PAD]', '[UNK]', '[START]', '[END]', '!', '#', '$', '%', '&', "'"]
['as', 'all', 'at', 'one', 'people', 're', 'like', 'if', 'our', 'from']
['choose', 'consider', 'extraordinary', 'focus', 'generation', 'killed', 'patterns', 'putting', 'scientific', 'wait']
['##_', '##`', '##ย', '##ร', '##อ', '##–', '##—', '##’', '##♪', '##♫']

Write an English vocabulary file and save the file as en_vocab.txt:

write_vocab_file('en_vocab.txt', en_vocab)

The text.BertTokenizer can be initialized by passing the vocabulary file’s path as the first argument (an alternative based on tf.lookup is sketched below):

pt_tokenizers = text.BertTokenizer('pt_vocab.txt', **bert_tokenizer_params)
en_tokenizers = text.BertTokenizer('en_vocab.txt', **bert_tokenizer_params)
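
Here is a minimal sketch of the tf.lookup alternative mentioned above: instead of a file path, text.BertTokenizer also accepts an in-memory lookup table built from the vocabulary list (en_vocab is the Python list generated earlier).

lookup_table = tf.lookup.StaticVocabularyTable(
    num_oov_buckets=1,
    initializer=tf.lookup.KeyValueTensorInitializer(
        keys=en_vocab,
        values=tf.range(len(en_vocab), dtype=tf.int64)))
en_tokenizer_from_table = text.BertTokenizer(lookup_table, **bert_tokenizer_params)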

You may now use it to encode text. Take a batch of 3 examples from the English data:

Code

for pt_examples, en_examples in train_ex.batch(3).take(1):
  for ex in en_examples:
    print(ex.numpy())

Output

b'and when you improve searchability , you actually take away the one advantage of print , which is serendipity .'
b'but what if it were active ?'
b"but they did n't test for curiosity ."

Tokenize it using the BertTokenizer.tokenize method. Initially, this returns a tf.RaggedTensor with axes (batch, word, word-piece):

Code

token_batch = en_tokenizers.tokenize(en_examples)
# Merge the word and word-piece axes -> (batch, tokens)
token_batch = token_batch.merge_dims(-2,-1)

for ex in token_batch.to_list():
  print(ex)

Output

[72, 117, 79, 1259, 1491, 2362, 13, 79, 150, 184, 311, 71, 103, 2308, 74, 2679, 13, 148, 80, 55, 4840, 1434, 2423, 540, 15]
[87, 90, 107, 76, 129, 1852, 30]
[87, 83, 149, 50, 9, 56, 664, 85, 2512, 15]

If you replace the token IDs with their text representations (using tf.gather), you can see that in the first example the words “searchability” and “serendipity” have been decomposed into “search ##ability” and “s ##ere ##nd ##ip ##ity”.

Look up each token ID in the vocabulary and join the pieces with spaces:

txt_tokens = tf.gather(en_vocab, token_batch)
tf.strings.reduce_join(txt_tokens, separator=' ', axis=-1)

Output

<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'and when you improve search ##ability , you actually take away the one advantage of print , which is s ##ere ##nd ##ip ##ity .',
       b'but what if it were active ?',
       b"but they did n ' t test for curiosity ."], dtype=object)>

Use the BertTokenizer.detokenize method to re-assemble words from the extracted tokens:

Code

words = en_tokenizers.detokenize(token_batch)
tf.strings.reduce_join(words, separator=' ', axis=-1)

Output

<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'and when you improve searchability , you actually take away the one advantage of print , which is serendipity .',
       b'but what if it were active ?',
       b"but they did n ' t test for curiosity ."], dtype=object)>

 
