BertTokenizer

The BertTokenizer operator is adapted from the official Hugging Face BertTokenizerFast implementation.

Summary of the tokenizers (🤗 Hugging Face)

Tokenizing a text means splitting it into words or subwords, which are then converted to ids through a look-up table. Converting words or subwords to ids is straightforward.

Space and punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined as splitting sentences into words. While it’s the most intuitive way to split texts into smaller chunks, this tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization usually generates a very big vocabulary (the set of all unique words and tokens used). E.g., Transformer XL uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!

While character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder for the model to learn meaningful input representations. E.g. learning a meaningful context-independent representation for the letter “t” is much harder than learning a context-independent representation for the word “today”. Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds, Transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.

There are three main types of tokenizers used in Transformers:

  • Byte-Pair Encoding (BPE)

  • WordPiece, used by BERT, DistilBERT, and Electra

  • SentencePiece
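BERT’s WordPiece vocabulary keeps frequent words intact and splits rarer words into known subword pieces marked with a suffix indicator. As an illustrative sketch using the BertTokenizer operator documented below (the exact split depends entirely on the vocabulary file you load):

from pyis.python import ops

tokenizer = ops.BertTokenizer('vocab.txt')

# frequent words usually stay whole; rarer words are split into
# subword pieces prefixed with the '##' suffix indicator
print(tokenizer.tokenize('today'))        # e.g. ['today']
print(tokenizer.tokenize('annoyingly'))   # e.g. ['annoying', '##ly']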

APIs

class pyis.python.ops.BertTokenizer(self: ops.BertTokenizer, vocab_file: str, do_lower_case: bool = True, do_basic_tokenize: bool = True, cls_token: str = '[CLS]', sep_token: str = '[SEP]', unk_token: str = '[UNK]', pad_token: str = '[PAD]', mask_token: str = '[MASK]', tokenize_chinese_chars: bool = True, strip_accents: bool = False, suffix_indicator: str = '##') → None

Tokenizes text into id sequences (same as Hugging Face BertTokenizerFast).

Create a BertTokenizer instance

Parameters
  • vocab_file (str) – path to the vocabulary file

  • do_lower_case (bool) – whether the tokenizer should lowercase the input string, default to true

  • do_basic_tokenize (bool) – whether the tokenizer should run basic tokenization first, default to true

  • cls_token (str) – start token symbol, default to ‘[CLS]’

  • sep_token (str) – end token symbol, default to ‘[SEP]’

  • unk_token (str) – unknown token symbol, default to ‘[UNK]’

  • pad_token (str) – padding token symbol, default to ‘[PAD]’

  • mask_token (str) – mask token symbol, default to ‘[MASK]’

  • tokenize_chinese_chars (bool) – whether the tokenizer should tokenize Chinese characters, default to true

  • strip_accents (bool) – whether the tokenizer should strip accents, default to false

  • suffix_indicator (str) – prefix string indicating that a token is a continuation (suffix) of the previous token, default to ‘##’
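A minimal construction sketch, assuming a WordPiece vocabulary file named 'vocab.txt' is available locally and that the arguments shown in the signature above can be passed by name:

from pyis.python import ops

# default configuration: lowercasing, basic tokenization, '##' suffix indicator
tokenizer = ops.BertTokenizer('vocab.txt')

# case-sensitive variant that skips lowercasing
cased_tokenizer = ops.BertTokenizer('vocab.txt', do_lower_case=False)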

convert_id_to_token(self: ops.BertTokenizer, id: int) → str

Convert token id to token text

Parameters

id (int) – token id

Returns

the token text corresponding to the given id

convert_token_to_id(self: ops.BertTokenizer, token: str) → int

Convert token text to token id

Parameters

token (str) – token text

Returns

the token id corresponding to the given token text
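A short round-trip sketch of the two conversion helpers, again assuming 'vocab.txt' is available (the actual id values depend on the vocabulary):

from pyis.python import ops

tokenizer = ops.BertTokenizer('vocab.txt')

unk_id = tokenizer.convert_token_to_id('[UNK]')   # id of the unknown token
token = tokenizer.convert_id_to_token(unk_id)     # back to '[UNK]'
assert token == '[UNK]'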

decode(self: ops.BertTokenizer, ids: List[int]) → str

Decode a list of token indices into a string.

Parameters

ids (List[int]) – list of token indices

Returns

the string corresponding to the list of token indices
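A sketch of decode, using encode (documented below) to obtain the ids; how closely the decoded text matches the input depends on the vocabulary and on lowercasing/subword splits:

from pyis.python import ops
from typing import List

tokenizer = ops.BertTokenizer('vocab.txt')

ids: List[int] = tokenizer.encode('what is the time in US?')
text: str = tokenizer.decode(ids)   # roughly recovers the input text
print(text)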

encode(self: ops.BertTokenizer, query: str, max_length: int = 1000000000000000) → List[int]

Tokenize the input string and return a list of token indices (ids).

Parameters
  • query (str) – input query to be tokenized

  • max_length (int) – maximum acceptable length; longer sequences are truncated, default to 1e15

Returns

output list of token indices (ids)
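A usage sketch of encode with truncation, assuming 'vocab.txt'; ids beyond max_length are dropped:

from pyis.python import ops
from typing import List

tokenizer = ops.BertTokenizer('vocab.txt')

# full encoding of the query
ids: List[int] = tokenizer.encode('what is the time in US?')

# encoding truncated to at most 8 ids
short_ids: List[int] = tokenizer.encode('what is the time in US?', 8)
assert len(short_ids) <= 8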

encode2(self: ops.BertTokenizer, str1: str, str2: str, max_length: int = 1000000000000000, truncation_strategy: str = 'longest_first') → List[int]

Tokenize two input strings and return a combined list of token indices (ids).

Parameters
  • str1 (str) – first input string to be tokenized

  • str2 (str) – second input string to be tokenized

  • max_length (int) – maximum acceptable combined length; longer sequences are truncated, default to 1e15

  • truncation_strategy (str) – truncation strategy, could be ‘longest_first’ (default), ‘longest_from_back’, ‘only_first’ or ‘only_second’

Returns

output list of token indices (ids)
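A sketch of sentence-pair encoding, assuming 'vocab.txt'; the pair is packed into a single id sequence and truncated according to the chosen strategy:

from pyis.python import ops
from typing import List

tokenizer = ops.BertTokenizer('vocab.txt')

# encode a question/passage pair, truncating the longer side first
pair_ids: List[int] = tokenizer.encode2(
    'what is the time in US?',
    'the US spans multiple time zones',
    32,
    'longest_first')
assert len(pair_ids) <= 32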

encode_plus(self: ops.BertTokenizer, str: str, max_length: int = 1000000000000000) → List[Tuple[int, int, int]]

Tokenize the input string and return token indices (ids), type ids and attention mask as a list of tuples.

Parameters
  • str (str) – input string to be tokenized

  • max_length (int) – maximum acceptable length; longer sequences are truncated, default to 1e15

Returns

output list of tuples, each containing (token id, type id, attention mask)
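A sketch showing how to unpack the tuples returned by encode_plus into separate id, type-id and attention-mask lists, assuming 'vocab.txt':

from pyis.python import ops
from typing import List, Tuple

tokenizer = ops.BertTokenizer('vocab.txt')

features: List[Tuple[int, int, int]] = tokenizer.encode_plus('what is the time in US?')

# split the per-token tuples into three parallel lists
input_ids = [f[0] for f in features]
type_ids = [f[1] for f in features]
attention_mask = [f[2] for f in features]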

encode_plus2(self: ops.BertTokenizer, str1: str, str2: str, max_length: int = 1000000000000000, truncation_strategy: str = 'longest_first') → List[Tuple[int, int, int]]

Tokenize two input strings and return token indices (ids), type ids and attention mask as a list of tuples.

Parameters
  • str1 (str) – first input string to be tokenized

  • str2 (str) – second input string to be tokenized

  • max_length (int) – maximum acceptable combined length; longer sequences are truncated, default to 1e15

  • truncation_strategy (str) – truncation strategy, could be ‘longest_first’ (default), ‘longest_from_back’, ‘only_first’ and ‘only_second’

Returns

output list of tuples, each containing (token id, type id, attention mask)
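The pair variant works the same way; a sketch assuming 'vocab.txt', with the type ids distinguishing tokens of the first and second string:

from pyis.python import ops
from typing import List, Tuple

tokenizer = ops.BertTokenizer('vocab.txt')

features: List[Tuple[int, int, int]] = tokenizer.encode_plus2(
    'what is the time in US?',
    'the US spans multiple time zones',
    32,
    'only_second')

ids, type_ids, mask = zip(*features)   # three parallel tuples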

tokenize(self: ops.BertTokenizer, query: str) → List[str]

Tokenize the input string and return a list of tokens.

Parameters

query (str) – input query to be tokenized

Returns

output list of tokens

Example

# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.

from pyis.python import ops
from typing import List

tokenizer: ops.BertTokenizer = ops.BertTokenizer('vocab.txt')

query: str = 'what is the time in US?'

tokens: List[str] = tokenizer.tokenize(query)
token_ids: List[int] = tokenizer.encode(query)