Tokenizers

The tokenizers helper module provides a set of functions to split text into tokens.

const n = await tokenizers.count("hello world")

Choosing your tokenizer

By default, the tokenizers module uses the large tokenizer. You can select a different tokenizer by passing a model identifier in the options.

const n = await tokenizers.count("hello world", { model: "gpt-4o-mini" })
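
Token counts for the same string can differ between tokenizers, so it is worth counting against the model you actually target. A quick comparison (the model identifiers below are illustrative):

const a = await tokenizers.count("hello world", { model: "gpt-4o" })
const b = await tokenizers.count("hello world", { model: "gpt-4o-mini" })
console.log({ a, b }) // counts may differ between model tokenizers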

count

Counts the number of tokens in a string.

const n = await tokenizers.count("hello world")

truncate

Drops part of the string so that it fits within a given token budget.

const truncated = await tokenizers.truncate("hello world", 5)
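
If the string already fits within the budget, truncation should be a no-op. A minimal sketch, assuming truncate accepts the same { model } option as count:

const short = await tokenizers.truncate("hello world", 5)
// assumption: truncate takes a { model } option like count does
const short4o = await tokenizers.truncate("hello world", 5, { model: "gpt-4o" })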

chunk

Splits the text into chunks of a given token size. The chunker tries to find appropriate chunk boundaries based on the document type.

const chunks = await tokenizers.chunk(env.files[0])
for (const chunk of chunks) {
    ...
}
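
Each chunk carries the text of its slice and, for files, its position in the source. A minimal sketch, assuming chunks expose content, lineStart and lineEnd fields:

for (const chunk of chunks) {
    // assumption: chunk objects expose content and a line range
    console.log(`lines ${chunk.lineStart}-${chunk.lineEnd}: ${chunk.content.slice(0, 40)}`)
}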

You can configure the chunk size and overlap, and add line numbers.

const chunks = await tokenizers.chunk(env.files[0], {
    chunkSize: 128,
    chunkOverlap: 10,
    lineNumbers: true,
})
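
chunkOverlap repeats a few tokens from the end of one chunk at the start of the next, so context that would otherwise be cut at a boundary is preserved, and lineNumbers adds line numbers to the chunk content, which helps when you ask the model to reference specific lines.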