BertTokenizer¶
The BertTokenizer operator is adapted from the official Hugging Face BertTokenizerFast implementation.
Summary of the tokenizers (🤗 Huggingface)¶
Tokenizing a text means splitting it into words or subwords, which are then converted to ids through a look-up table. Converting words or subwords to ids is straightforward.
Space and punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined as splitting sentences into words. While it’s the most intuitive way to split texts into smaller chunks, this tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization usually generates a very big vocabulary (the set of all unique words and tokens used). E.g., Transformer XL uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!
While character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder for the model to learn meaningful input representations. E.g., learning a meaningful context-independent representation for the letter “t” is much harder than learning one for the word “today”. Therefore, character tokenization is often accompanied by a loss of performance. To get the best of both worlds, Transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.
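To make the idea concrete, here is a minimal sketch of WordPiece-style subword tokenization using greedy longest-match-first lookup. The toy vocabulary is hypothetical (real BERT vocabularies contain roughly 30,000 entries), and the real algorithm handles more edge cases, but the core loop is the same: repeatedly take the longest prefix of the remaining word that exists in the vocabulary, marking non-initial pieces with the `##` suffix indicator.

```python
from typing import List, Set

def wordpiece_tokenize(word: str, vocab: Set[str], suffix: str = '##') -> List[str]:
    """Split a single word into the longest subword pieces found in `vocab`."""
    tokens: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate span from the right until it is in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = suffix + piece  # non-initial pieces carry the suffix marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ['[UNK]']  # the word cannot be decomposed
        tokens.append(match)
        start = end
    return tokens

# Toy vocabulary for illustration only.
vocab = {'to', 'day', '##day', 'token', '##ization', '##ize'}
print(wordpiece_tokenize('today', vocab))         # ['to', '##day']
print(wordpiece_tokenize('tokenization', vocab))  # ['token', '##ization']
```

Because the vocabulary stores frequent subwords rather than whole words, rare words decompose into known pieces instead of inflating the vocabulary or falling back to `[UNK]`.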
There are three main types of tokenizers used in Transformers:
Byte-Pair Encoding (BPE)
WordPiece, used by BERT, DistilBERT, and Electra
SentencePiece
APIs¶
- class pyis.python.ops.BertTokenizer(self: ops.BertTokenizer, vocab_file: str, do_lower_case: bool = True, do_basic_tokenize: bool = True, cls_token: str = '[CLS]', sep_token: str = '[SEP]', unk_token: str = '[UNK]', pad_token: str = '[PAD]', mask_token: str = '[MASK]', tokenize_chinese_chars: bool = True, strip_accents: bool = False, suffix_indicator: str = '##') → None ¶
For tokenizing text into id sequences (same behavior as BertTokenizerFast)
Create a BertTokenizer instance
- Parameters
vocab_file (str) – path to the vocabulary file
do_lower_case (bool) – whether the tokenizer should lowercase the input. Defaults to True
do_basic_tokenize (bool) – whether the tokenizer should run basic tokenization first. Defaults to True
cls_token (str) – start token symbol. Defaults to '[CLS]'
sep_token (str) – end token symbol. Defaults to '[SEP]'
unk_token (str) – unknown token symbol. Defaults to '[UNK]'
pad_token (str) – padding token symbol. Defaults to '[PAD]'
mask_token (str) – mask token symbol. Defaults to '[MASK]'
tokenize_chinese_chars (bool) – whether the tokenizer should tokenize Chinese characters. Defaults to True
strip_accents (bool) – whether the tokenizer should strip accents. Defaults to False
suffix_indicator (str) – string prefix indicating that a token is a suffix of the previous token. Defaults to '##'
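BERT-style vocabulary files typically list one token per line, with a token's id given by its zero-based line number. The sketch below illustrates that mapping with a handful of made-up entries; the tokens shown are illustrative, not a real `vocab.txt`.

```python
# Hypothetical contents of a WordPiece vocabulary file: one token per line,
# id = zero-based line number. Real vocab files have ~30,000 lines.
vocab_lines = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', 'what', 'is', '##s']

# Build the two look-up tables the tokenizer needs.
token_to_id = {tok: i for i, tok in enumerate(vocab_lines)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

print(token_to_id['[CLS]'])  # 2
print(id_to_token[1])        # [UNK]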
- convert_id_to_token(self: ops.BertTokenizer, id: int) → str ¶
Convert token id to token text
- Parameters
id (int) – token id
- Returns
the token text corresponding to the given id
- convert_token_to_id(self: ops.BertTokenizer, token: str) → int ¶
Convert token text to token id
- Parameters
token (str) – token text
- Returns
the token id corresponding to the given token text
- decode(self: ops.BertTokenizer, ids: List[int]) → str ¶
Decode a list of token indices, turn them into a string.
- Parameters
ids (List[int]) – list of token indices
- Returns
the string corresponding to the given token indices
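As a hedged sketch of what decoding involves: WordPiece tokens carrying the `##` suffix indicator are glued onto the previous token, and everything else is joined with spaces. The `join_wordpieces` helper below is hypothetical; the operator's actual implementation may differ in whitespace and special-token handling.

```python
from typing import List

def join_wordpieces(tokens: List[str], suffix: str = '##') -> str:
    """Rejoin WordPiece tokens: suffix pieces attach to the previous token."""
    out: List[str] = []
    for tok in tokens:
        if tok.startswith(suffix) and out:
            out[-1] += tok[len(suffix):]  # glue the suffix piece onto the last word
        else:
            out.append(tok)
    return ' '.join(out)

print(join_wordpieces(['to', '##day', 'is', 'sun', '##ny']))  # today is sunny
```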
- encode(self: ops.BertTokenizer, query: str, max_length: int = 1000000000000000) → List[int] ¶
Tokenize the input string and return list of token indices (ids).
- Parameters
query (str) – input query to be tokenized
max_length (int) – maximum acceptable length; longer inputs are truncated. Defaults to 1e15
- Returns
output list of token indices (ids)
- encode2(self: ops.BertTokenizer, str1: str, str2: str, max_length: int = 1000000000000000, truncation_strategy: str = 'longest_first') → List[int] ¶
Tokenize two input strings and return list of token indices (ids).
- Parameters
str1 (str) – first input string to be tokenized
str2 (str) – second input string to be tokenized
max_length (int) – maximum acceptable length; longer inputs are truncated. Defaults to 1e15
truncation_strategy (str) – truncation strategy; one of 'longest_first' (default), 'longest_from_back', 'only_first', or 'only_second'
- Returns
output list of token indices (ids)
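The 'longest_first' strategy is commonly implemented by repeatedly removing one token from whichever sequence is currently longer until the combined length fits. The helper below is an illustrative sketch of that behavior, not the operator's actual code:

```python
from typing import List, Tuple

def truncate_longest_first(a: List[int], b: List[int],
                           max_length: int) -> Tuple[List[int], List[int]]:
    """Drop tokens from the end of the longer sequence until the pair fits."""
    a, b = list(a), list(b)
    while len(a) + len(b) > max_length:
        if len(a) >= len(b):
            a.pop()  # first sequence is longer (or tied): trim it
        else:
            b.pop()
    return a, b

print(truncate_longest_first([1, 2, 3, 4], [5, 6], 4))  # ([1, 2], [5, 6])
```

This keeps the two sequences balanced under a tight budget, unlike 'only_first'/'only_second', which trim a single designated sequence.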
- encode_plus(self: ops.BertTokenizer, str: str, max_length: int = 1000000000000000) → List[Tuple[int, int, int]] ¶
Tokenize the input string and return list of token indices (ids), type ids and attention mask.
- Parameters
str (str) – input string to be tokenized
max_length (int) – maximum acceptable length; longer inputs are truncated. Defaults to 1e15
- Returns
output list of tuple, each tuple contains (ids, type ids, attention mask)
- encode_plus2(self: ops.BertTokenizer, str1: str, str2: str, max_length: int = 1000000000000000, truncation_strategy: str = 'longest_first') → List[Tuple[int, int, int]] ¶
Tokenize two input strings and return list of token indices (ids), type ids and attention mask.
- Parameters
str1 (str) – first input string to be tokenized
str2 (str) – second input string to be tokenized
max_length (int) – maximum acceptable length; longer inputs are truncated. Defaults to 1e15
truncation_strategy (str) – truncation strategy; one of 'longest_first' (default), 'longest_from_back', 'only_first', or 'only_second'
- Returns
output list of tuple, each tuple contains (ids, type ids, attention mask)
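For a sentence pair, BERT-style encodings are laid out as `[CLS] A… [SEP] B… [SEP]`, with type id 0 for the first segment (including `[CLS]` and its `[SEP]`), type id 1 for the second, and attention mask 1 for every real token. The `pair_layout` helper below is a hypothetical sketch of that layout, assuming the operator packs each position as an (id, type id, attention mask) tuple:

```python
from typing import List, Tuple

def pair_layout(ids1: List[int], ids2: List[int],
                cls_id: int, sep_id: int) -> List[Tuple[int, int, int]]:
    """Lay out (id, type id, attention mask) tuples for a sentence pair."""
    # Segment A: [CLS] + first sequence + [SEP], all with type id 0.
    seq = [(cls_id, 0, 1)] + [(i, 0, 1) for i in ids1] + [(sep_id, 0, 1)]
    # Segment B: second sequence + closing [SEP], all with type id 1.
    seq += [(i, 1, 1) for i in ids2] + [(sep_id, 1, 1)]
    return seq

# Made-up ids for illustration; 101/102 are the usual [CLS]/[SEP] ids in
# the standard BERT vocabulary.
for tup in pair_layout([10, 11], [20], cls_id=101, sep_id=102):
    print(tup)
```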
- tokenize(self: ops.BertTokenizer, query: str) → List[str] ¶
Tokenize the input string and return list of tokens.
- Parameters
query (str) – input query to be tokenized
- Returns
output list of tokens
Example¶
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.
from typing import List
from pyis.python import ops
# Create a tokenizer from a WordPiece vocabulary file.
tokenizer: ops.BertTokenizer = ops.BertTokenizer('vocab.txt')
query: str = 'what is the time in US?'
# tokenize() returns token strings; encode() returns token ids.
tokens: List[str] = tokenizer.tokenize(query)
token_ids: List[int] = tokenizer.encode(query)