Tokenizers
The tokenizers helper module provides a set of functions to split text into tokens.
Choosing your tokenizer
By default, the tokenizers module uses the large tokenizer. You can change the tokenizer by passing the model identifier.
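One way to picture a default tokenizer with an optional override is a lookup keyed by model identifier. The sketch below is only an illustration: the registry, the resolveTokenizer function, the "small" identifier, and both stand-in tokenizers are hypothetical and are not the module's actual API or tokenization rules.

```typescript
// Hypothetical sketch: none of these names come from the tokenizers module.
type Tokenizer = (text: string) => string[];

// Stand-in tokenizers keyed by illustrative model identifiers.
const registry: Record<string, Tokenizer> = {
  // Stand-in for the default: one token per whitespace-separated word.
  large: (text) => text.split(/\s+/).filter((t) => t.length > 0),
  // Stand-in for an alternative: one token per UTF-16 code unit.
  small: (text) => text.split(""),
};

const DEFAULT_MODEL = "large";

// Returns the tokenizer registered for `model`, or the default when omitted.
function resolveTokenizer(model: string = DEFAULT_MODEL): Tokenizer {
  if (!Object.prototype.hasOwnProperty.call(registry, model)) {
    throw new Error(`unknown tokenizer model: ${model}`);
  }
  return registry[model];
}
```

Calling resolveTokenizer() with no argument falls back to the default entry, while resolveTokenizer("small") selects the alternative, which mirrors the idea of passing a model identifier.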
count
Counts the number of tokens in a string.
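As a rough illustration of what counting tokens means, the sketch below counts with a simple stand-in tokenizer that treats each word and each punctuation mark as one token. The tokenize and count functions here are hypothetical; a model tokenizer typically splits text into subword units, so real counts will differ.

```typescript
// Hypothetical stand-in tokenizer: words and punctuation marks are tokens.
function tokenize(text: string): string[] {
  return text.match(/\w+|[^\s\w]/g) || [];
}

// Counting is then just the length of the token list.
function count(text: string): number {
  return tokenize(text).length;
}

const total = count("Hello, world!"); // tokens: "Hello", ",", "world", "!"
```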
truncate
Drops part of the string so that it fits within a token budget.
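A minimal sketch of truncating to a token budget follows. It assumes that whitespace-separated words stand in for tokens and that the end of the string is the part that gets dropped; the text above does not say which part is removed, so both are assumptions, and the truncate function shown is hypothetical.

```typescript
// Hypothetical sketch: keeps the first `maxTokens` whitespace-separated
// words and drops the rest. Real tokenizers count subword tokens instead.
function truncate(text: string, maxTokens: number): string {
  if (maxTokens <= 0) return "";
  const tokenPattern = /\S+/g;
  let match: RegExpExecArray | null;
  let seen = 0;
  while ((match = tokenPattern.exec(text)) !== null) {
    seen += 1;
    if (seen > maxTokens) {
      // Cut just before the first token over budget, then strip the
      // whitespace that separated it from the last kept token.
      return text.slice(0, match.index).replace(/\s+$/, "");
    }
  }
  // The text already fits within the budget.
  return text;
}
```

Slicing the original string, rather than rejoining tokens, keeps the spacing of the retained text unchanged.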
chunk
Splits the text into chunks of a given token size. The chunk function tries to find appropriate chunk boundaries based on the document type.
You can configure the chunk size and the overlap, and optionally add line numbers.
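The sketch below illustrates the three knobs mentioned above under simplifying assumptions: it only breaks on line boundaries (it does not look at the document type), it measures overlap in lines, and it counts whitespace-separated words as tokens. The option names chunkSize, overlapLines, and lineNumbers are hypothetical and may not match the module's options.

```typescript
// Hypothetical options; names are illustrative only.
interface ChunkOptions {
  chunkSize: number;     // maximum tokens per chunk
  overlapLines?: number; // lines repeated from the end of the previous chunk
  lineNumbers?: boolean; // prefix each line with its 1-based line number
}

// Stand-in token counter: whitespace-separated words.
function countTokens(line: string): number {
  return line.split(/\s+/).filter((t) => t.length > 0).length;
}

function chunk(text: string, options: ChunkOptions): string[] {
  const overlap = options.overlapLines || 0;
  const lines = text.split("\n");
  const chunks: string[] = [];
  let start = 0;
  while (start < lines.length) {
    // Always take one line (even if it alone exceeds the budget),
    // then extend while the next line still fits.
    let end = start + 1;
    let used = countTokens(lines[start]);
    while (end < lines.length && used + countTokens(lines[end]) <= options.chunkSize) {
      used += countTokens(lines[end]);
      end += 1;
    }
    // Line-number prefixes are added after budgeting and are not counted.
    const body = lines
      .slice(start, end)
      .map((line, i) => (options.lineNumbers ? `[${start + i + 1}] ${line}` : line));
    chunks.push(body.join("\n"));
    if (end >= lines.length) break;
    // Step back by the overlap, but always advance by at least one line.
    start = Math.max(end - overlap, start + 1);
  }
  return chunks;
}
```

For example, chunk("a b\nc d\ne f\ng h", { chunkSize: 4 }) yields two chunks of two lines each, and adding overlapLines: 1 yields three chunks, each sharing one line with its neighbor.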