BertTokenizer¶
The BertTokenizer operator is adapted from the official Hugging Face BertTokenizerFast implementation.
Summary of the tokenizers (🤗 Huggingface)¶
Tokenizing a text means splitting it into words or subwords, which are then converted to ids through a look-up table. Converting words or subwords to ids is straightforward.
Space and punctuation tokenization and rule-based tokenization are both examples of word tokenization, which is loosely defined as splitting sentences into words. While it’s the most intuitive way to split texts into smaller chunks, this tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization usually generates a very big vocabulary (the set of all unique words and tokens used). E.g., Transformer XL uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!
While character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder for the model to learn meaningful input representations. E.g., learning a meaningful context-independent representation for the letter “t” is much harder than learning one for the word “today”. Therefore, character tokenization is often accompanied by a loss of performance. To get the best of both worlds, Transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.
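To make the idea concrete, here is a minimal sketch of WordPiece-style subword tokenization using greedy longest-match-first lookup. The toy vocabulary is hypothetical (real BERT vocabularies contain roughly 30,000 entries), and the real algorithm handles more edge cases, but the core loop is the same: repeatedly take the longest prefix of the remaining word that exists in the vocabulary, marking non-initial pieces with the `##` suffix indicator.

```python
from typing import List, Set

def wordpiece_tokenize(word: str, vocab: Set[str], suffix: str = '##') -> List[str]:
    """Split a single word into the longest subword pieces found in `vocab`."""
    tokens: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate span from the right until it is in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = suffix + piece  # non-initial pieces carry the suffix marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ['[UNK]']  # the word cannot be decomposed
        tokens.append(match)
        start = end
    return tokens

# Toy vocabulary for illustration only.
vocab = {'to', 'day', '##day', 'token', '##ization', '##ize'}
print(wordpiece_tokenize('today', vocab))         # ['to', '##day']
print(wordpiece_tokenize('tokenization', vocab))  # ['token', '##ization']
```

Because the vocabulary stores frequent subwords rather than whole words, rare words decompose into known pieces instead of inflating the vocabulary or falling back to `[UNK]`.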
There are three main types of tokenizers used in Transformers:
Byte-Pair Encoding (BPE)
WordPiece, used by BERT, DistilBERT, and Electra
SentencePiece
APIs¶
- class pyis.python.ops.BertTokenizer(self: ops.BertTokenizer, vocab_file: str, do_lower_case: bool = True, do_basic_tokenize: bool = True, cls_token: str = '[CLS]', sep_token: str = '[SEP]', unk_token: str = '[UNK]', pad_token: str = '[PAD]', mask_token: str = '[MASK]', tokenize_chinese_chars: bool = True, strip_accents: bool = False, suffix_indicator: str = '##') → None ¶
For tokenizing text into id sequences (same behavior as BertTokenizerFast)
Create a BertTokenizer instance
- Parameters
vocab_file (str) – path to the vocabulary file
do_lower_case (bool) – whether the tokenizer should lowercase the input. Defaults to True
do_basic_tokenize (bool) – whether the tokenizer should run basic tokenization first. Defaults to True
cls_token (str) – start token symbol. Defaults to '[CLS]'
sep_token (str) – end token symbol. Defaults to '[SEP]'
unk_token (str) – unknown token symbol. Defaults to '[UNK]'
pad_token (str) – padding token symbol. Defaults to '[PAD]'
mask_token (str) – mask token symbol. Defaults to '[MASK]'
tokenize_chinese_chars (bool) – whether the tokenizer should tokenize Chinese characters. Defaults to True
strip_accents (bool) – whether the tokenizer should strip accents. Defaults to False
suffix_indicator (str) – string prefix indicating that a token is a suffix of the previous token. Defaults to '##'
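BERT-style vocabulary files typically list one token per line, with a token's id given by its zero-based line number. The sketch below illustrates that mapping with a handful of made-up entries; the tokens shown are illustrative, not a real `vocab.txt`.

```python
# Hypothetical contents of a WordPiece vocabulary file: one token per line,
# id = zero-based line number. Real vocab files have ~30,000 lines.
vocab_lines = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', 'what', 'is', '##s']

# Build the two look-up tables the tokenizer needs.
token_to_id = {tok: i for i, tok in enumerate(vocab_lines)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

print(token_to_id['[CLS]'])  # 2
print(id_to_token[1])        # [UNK]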
- convert_id_to_token(self: ops.BertTokenizer, id: int) → str ¶
Convert token id to token text
- Parameters
id (int) – token id
- Returns
the token text corresponding to the given id
- convert_token_to_id(self: ops.BertTokenizer, token: str) → int ¶
Convert token text to token id
- Parameters
token (str) – token text
- Returns
the token id corresponding to the given token text
- decode(self: ops.BertTokenizer, ids: List[int]) → str ¶
Decode a list of token indices, turn them into a string.
- Parameters
ids (List[int]) – list of token indices
- Returns
the string corresponding to the given token indices
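As a hedged sketch of what decoding involves: WordPiece tokens carrying the `##` suffix indicator are glued onto the previous token, and everything else is joined with spaces. The `join_wordpieces` helper below is hypothetical; the operator's actual implementation may differ in whitespace and special-token handling.

```python
from typing import List

def join_wordpieces(tokens: List[str], suffix: str = '##') -> str:
    """Rejoin WordPiece tokens: suffix pieces attach to the previous token."""
    out: List[str] = []
    for tok in tokens:
        if tok.startswith(suffix) and out:
            out[-1] += tok[len(suffix):]  # glue the suffix piece onto the last word
        else:
            out.append(tok)
    return ' '.join(out)

print(join_wordpieces(['to', '##day', 'is', 'sun', '##ny']))  # today is sunny
```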
- encode(self: ops.BertTokenizer, query: str, max_length: int = 1000000000000000) → List[int] ¶
Tokenize the input string and return list of token indices (ids).
- Parameters
query (str) – input query to be tokenized
max_length (int) – maximum acceptable length; longer inputs are truncated. Defaults to 1e15
- Returns
output list of token indices (ids)
- encode2(self: ops.BertTokenizer, str1: str, str2: str, max_length: int = 1000000000000000, truncation_strategy: str = 'longest_first') → List[int] ¶
Tokenize two input strings and return list of token indices (ids).
- Parameters
str1 (str) – first input string to be tokenized
str2 (str) – second input string to be tokenized
max_length (int) – maximum acceptable length; longer inputs are truncated. Defaults to 1e15
truncation_strategy (str) – truncation strategy; one of 'longest_first' (default), 'longest_from_back', 'only_first', or 'only_second'
- Returns
output list of token indices (ids)
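The 'longest_first' strategy is commonly implemented by repeatedly removing one token from whichever sequence is currently longer until the combined length fits. The helper below is an illustrative sketch of that behavior, not the operator's actual code:

```python
from typing import List, Tuple

def truncate_longest_first(a: List[int], b: List[int],
                           max_length: int) -> Tuple[List[int], List[int]]:
    """Drop tokens from the end of the longer sequence until the pair fits."""
    a, b = list(a), list(b)
    while len(a) + len(b) > max_length:
        if len(a) >= len(b):
            a.pop()  # first sequence is longer (or tied): trim it
        else:
            b.pop()
    return a, b

print(truncate_longest_first([1, 2, 3, 4], [5, 6], 4))  # ([1, 2], [5, 6])
```

This keeps the two sequences balanced under a tight budget, unlike 'only_first'/'only_second', which trim a single designated sequence.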
- encode_plus(self: ops.BertTokenizer, str: str, max_length: int = 1000000000000000) → List[Tuple[int, int, int]] ¶
Tokenize the input string and return list of token indices (ids), type ids and attention mask.
- Parameters
str (str) – input string to be tokenized
max_length (int) – maximum acceptable length; longer inputs are truncated. Defaults to 1e15
- Returns
output list of tuple, each tuple contains (ids, type ids, attention mask)
- encode_plus2(self: ops.BertTokenizer, str1: str, str2: str, max_length: int = 1000000000000000, truncation_strategy: str = 'longest_first') → List[Tuple[int, int, int]] ¶
Tokenize two input strings and return list of token indices (ids), type ids and attention mask.
- Parameters
str1 (str) – first input string to be tokenized
str2 (str) – second input string to be tokenized
max_length (int) – maximum acceptable length; longer inputs are truncated. Defaults to 1e15
truncation_strategy (str) – truncation strategy; one of 'longest_first' (default), 'longest_from_back', 'only_first', or 'only_second'
- Returns
output list of tuple, each tuple contains (ids, type ids, attention mask)
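For a sentence pair, BERT-style encodings are laid out as `[CLS] A… [SEP] B… [SEP]`, with type id 0 for the first segment (including `[CLS]` and its `[SEP]`), type id 1 for the second, and attention mask 1 for every real token. The `pair_layout` helper below is a hypothetical sketch of that layout, assuming the operator packs each position as an (id, type id, attention mask) tuple:

```python
from typing import List, Tuple

def pair_layout(ids1: List[int], ids2: List[int],
                cls_id: int, sep_id: int) -> List[Tuple[int, int, int]]:
    """Lay out (id, type id, attention mask) tuples for a sentence pair."""
    # Segment A: [CLS] + first sequence + [SEP], all with type id 0.
    seq = [(cls_id, 0, 1)] + [(i, 0, 1) for i in ids1] + [(sep_id, 0, 1)]
    # Segment B: second sequence + closing [SEP], all with type id 1.
    seq += [(i, 1, 1) for i in ids2] + [(sep_id, 1, 1)]
    return seq

# Made-up ids for illustration; 101/102 are the usual [CLS]/[SEP] ids in
# the standard BERT vocabulary.
for tup in pair_layout([10, 11], [20], cls_id=101, sep_id=102):
    print(tup)
```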
- tokenize(self: ops.BertTokenizer, query: str) → List[str] ¶
Tokenize the input string and return list of tokens.
- Parameters
query (str) – input query to be tokenized
- Returns
output list of tokens
Example¶
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT license.
from typing import List
from pyis.python import ops
# Create a tokenizer from a WordPiece vocabulary file.
tokenizer: ops.BertTokenizer = ops.BertTokenizer('vocab.txt')
query: str = 'what is the time in US?'
# tokenize() returns token strings; encode() returns token ids.
tokens: List[str] = tokenizer.tokenize(query)
token_ids: List[int] = tokenizer.encode(query)