A comprehensive guide to Tokenizing with TF Text

Overview

Tokenization is the process of splitting a string into tokens. These tokens are often words, numbers, and/or punctuation. The tensorflow_text package includes a number of tokenizers for preparing the text required by text-based models. By performing tokenization in the TensorFlow graph, you do not need to worry about differences between the training and inference pipelines or about managing separate preprocessing scripts (a small sketch of in-graph tokenization follows the imports below).

This post walks through the tokenization options provided by TensorFlow Text, when you should choose one over another, and how these tokenizers are called from within your model.

Import Libraries

Let’s import all the required Python libraries.

import requests
import tensorflow as tf
import tensorflow_text as tf_text
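
As noted in the overview, TF Text tokenizers are ordinary TensorFlow ops, so they can run inside a tf.function and therefore inside an exported model's graph. The following is a minimal sketch of what that looks like (not one of the guide's own examples):

tokenizer = tf_text.WhitespaceTokenizer()

@tf.function
def preprocess(texts):
  # Tokenization runs in the TensorFlow graph; no separate preprocessing script.
  return tokenizer.tokenize(texts)

print(preprocess(tf.constant(["What you know you can't explain, but you can feel it."])))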

Tokenizers

This section presents the suite of tokenizers provided by TensorFlow Text. String inputs are assumed to be UTF-8 encoded. Please refer to the Unicode guide for more information on converting strings to UTF-8.
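
If your text is not already UTF-8, it can be transcoded before tokenizing. A small sketch using tf.strings.unicode_transcode, here assuming the input arrived as UTF-16-BE:

docs = tf.constant(["What you know you can't explain.".encode("UTF-16-BE")])
utf8_docs = tf.strings.unicode_transcode(docs, input_encoding="UTF-16-BE",
                                         output_encoding="UTF-8")
print(utf8_docs)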

These tokenizers attempt to split a string into words, which is the most natural way to split text.

  • WhitespaceTokenizer: The most basic tokenizer, it splits strings on ICU-defined whitespace characters (e.g. space, tab, newline). This is often useful for quickly building prototype models.

Code

tokenizers = tf_text.WhitespaceTokenizer()
tokens = tokenizers.tokenize(["What you know you can't explain, but you can feel it."])
print(tokens.to_list())

Output

[[b'What', b'you', b'know', b'you', b"can't", b'explain,', b'but', b'you', b'can', b'feel', b'it.']]

One limitation of this tokenizer is that punctuation is combined with the word to form a token.

  • UnicodeScriptTokenizer: The UnicodeScriptTokenizer splits strings based on Unicode script boundaries. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values (see http://icu-project.org/apiref/icu4c/uscript_8h.html). In practice, this is similar to the WhitespaceTokenizer, with the main difference being that it splits punctuation (USCRIPT_COMMON) from language texts (e.g. USCRIPT_LATIN, USCRIPT_CYRILLIC, etc.) and also separates language texts from one another. Note that this will also split contraction words into separate tokens.

Code

tokenizers = tf_text.UnicodeScriptTokenizer()
tokens = tokenizers.tokenize(["What you know you can explain, but you can't feel it."])
print(tokens.to_list())

Output

[[b'What', b'you', b'know', b'you', b'can', b'explain', b',', b'but', b'you', b'can', b"'", b't', b'feel', b'it', b'.']]

Subword Tokenizers

Subword tokenizers can be employed with a reduced vocabulary, allowing the model to learn about novel terms from the subwords that make them up.

We cover the subword tokenization options briefly here; the Subword Tokenization tutorial goes into more depth and also explains how to build the vocab files.

  • WordpieceTokenizer: WordPiece tokenization is a data-driven tokenization scheme that generates a set of sub-tokens. These sub-tokens may correspond to linguistic morphemes, but this is often not the case. The WordpieceTokenizer assumes that the input has already been split into tokens. Because of this requirement, you will usually want to split with the WhitespaceTokenizer or UnicodeScriptTokenizer first.

Code

tokenizers = tf_text.WhitespaceTokenizer()
tokens = tokenizers.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())

Output

[[b'What', b'you', b'know', b'you', b"can't", b'explain,', b'but', b'you', b'feel', b'it.']]

The WordpieceTokenizer may be used to break the string into subtokens once it has been split into tokens.

Code

url = "https://github.com/tensorflow/text/blob/master/tensorflow_text/python/ops/test_data/test_wp_en_vocab.txt?raw=true"
r = requests.get(url)
filepath = "vocab.txt"
open(filepath, 'wb').write(r.content)

subtokenizers = tf_text.WordpieceTokenizer(filepath)
subtokens = subtokenizers.tokenize(tokens)
print(subtokens.to_list())

Output

[[[b'What'], [b'you'], [b'know'], [b'you'], [b"can't"], [b'explain,'], [b'but'], [b'you'], [b'feel'], [b'it.']]]
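
To make the earlier point about subwords concrete, here is a toy sketch with a tiny hypothetical vocab file (not the one downloaded above): a word that never appears in the vocabulary as a whole is still covered by the pieces it is made of.

# Hypothetical three-piece vocabulary for illustration only.
with open("toy_vocab.txt", "w") as f:
  f.write("\n".join(["un", "##know", "##able", "[UNK]"]))

toy_tokenizer = tf_text.WordpieceTokenizer("toy_vocab.txt", token_out_type=tf.string)
print(toy_tokenizer.tokenize(["unknowable"]).to_list())
# Expected, via greedy longest-match-first: [[b'un', b'##know', b'##able']]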

 

Other Splitters

Splitter and SplitterWithOffsets are the main interfaces, with the single methods split and split_with_offsets respectively. The SplitterWithOffsets variant (which extends Splitter) includes an option for getting byte offsets. This lets the caller know which bytes in the original string a given token was built from.
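
Tokenizers that implement these interfaces also expose a tokenize_with_offsets method. A minimal sketch using the WhitespaceTokenizer (other tokenizers with offsets behave the same way):

tokenizers = tf_text.WhitespaceTokenizer()
(tokens, starts, ends) = tokenizers.tokenize_with_offsets(
    ["What you know you can't explain, but you can feel it."])
print(tokens.to_list())
print(starts.to_list())  # byte offset where each token begins
print(ends.to_list())    # byte offset just past the end of each token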

  • UnicodeCharTokenizer: This tokenizer splits a string into UTF-8 characters. It is useful for CJK languages that do not have spaces between words.

Code

tokenizers = tf_text.UnicodeCharTokenizer()
tokens = tokenizers.tokenize(["What you know you can explain, but you can't feel it."])
print(tokens.to_list())

Output

[[87, 104, 97, 116, 32, 121, 111, 117, 32, 107, 110, 111, 119, 32, 121, 111, 117, 32, 99, 97, 110, 32, 101, 120, 112, 108, 97, 105, 110, 44, 32, 98, 117, 116, 32, 121, 111, 117, 32, 99, 97, 110, 39, 116, 32, 102, 101, 101, 108, 32, 105, 116, 46]]

The output is in the form of Unicode codepoints. This can also be used to build character n-grams, such as bigrams. The code below converts the codepoints back into UTF-8 characters and joins adjacent characters into bigrams.

chars = tf.strings.unicode_encode(tf.expand_dims(tokens, -1), "UTF-8")
bigrams = tf_text.ngrams(chars, 2, reduction_type=tf_text.Reduction.STRING_JOIN, string_separator='')
print(bigrams.to_list())

 Output

[[b'Wh', b'ha', b'at', b't ', b' y', b'yo', b'ou', b'u ', b' k', b'kn', b'no', b'ow', b'w ', b' y', b'yo', b'ou', b'u ', b' c', b'ca', b'an', b'n ', b' e', b'ex', b'xp', b'pl', b'la', b'ai', b'in', b'n,', b', ', b' b', b'bu', b'ut', b't ', b' y', b'yo', b'ou', b'u ', b' c', b'ca', b'an', b"n'", b"'t", b't ', b' f', b'fe', b'ee', b'el', b'l ', b' i', b'it', b't.']]

  • HubModuleTokenizer: This is a wrapper around models deployed to TF Hub, which makes the calls easier since TF Hub currently does not support ragged tensors. Having a model perform tokenization is particularly useful for CJK languages when you want to split text into words but do not have spaces to provide a heuristic guide. At present, a single segmentation model is available, for Chinese.

Code

MODEL_HANDLE = "https://tfhub.dev/google/zh_segmentation/1"
segmenters = tf_text.HubModuleTokenizer(MODEL_HANDLE)
tokens = segmenters.tokenize(["新华社北京"])
print(tokens.to_list())

Output

[[b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe', b'\xe5\x8c\x97\xe4\xba\xac']]

The results are UTF-8 encoded byte strings, which can be difficult to read. Decoding the list values makes them easier to view.

def decode_lists(x):
  if type(x) is list:
    return list(map(decode_lists, x))
  return x.decode("UTF-8")

def decode_utf8_tensor(x):
  return list(map(decode_lists, x.to_list()))

print(decode_utf8_tensor(tokens))

Output

[['新华社', '北京']]

  • SplitMergeTokenizer: The SplitMergeTokenizer and SplitMergeFromLogitsTokenizer split a string based on provided values that indicate where it should be split. This is useful when building your own segmentation models, as in the preceding segmentation example. A value of 0 indicates that the character starts a new word, while a value of 1 indicates that the character is part of the current word. (A sketch of the logits-based variant follows the example below.)

Code

string = ["新华社北京"]
label = [[0, 1, 0, 0, 1]]
tokenizers = tf_text.SplitMergeTokenizer()
tokens = tokenizers.tokenize(string, label)
print(decode_utf8_tensor(tokens))

Output

[['新华', '社', '北京']]
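
A minimal sketch of the logits-based variant mentioned above, assuming the convention that each character gets a pair of scores [split, merge] and starts a new word when its split score is higher:

strings = ["新华社北京"]
logits = [[[5.0, -3.2],   # 新: split (start a new word)
           [0.2, 12.0],   # 华: merge
           [2.2, -1.0],   # 社: split
           [4.0, 0.5],    # 北: split
           [-3.0, 3.0]]]  # 京: merge
tokenizers = tf_text.SplitMergeFromLogitsTokenizer()
tokens = tokenizers.tokenize(strings, logits)
print(decode_utf8_tensor(tokens))  # expected: [['新华', '社', '北京']]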

 
