Tokenizers

The tokenizers helper module provides functions to count tokens, truncate text to a token budget, and split text into chunks.

const n = await tokenizers.count("hello world")

By default, the tokenizers module uses the large tokenizer. To use a different tokenizer, pass a model identifier in the options.

const n = await tokenizers.count("hello world", { model: "gpt-4o-mini" })
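The tokenizer matters because the same string can produce different token counts under different tokenization schemes. The sketch below illustrates this with two toy tokenizers; the names toyTokenizers, toyCount and the tokenizer option are hypothetical and are not part of the tokenizers module, which relies on real model tokenizers with subword units.

```javascript
// Two hypothetical tokenizers, for illustration only.
// Real model tokenizers split text into subword units instead.
const toyTokenizers = {
    words: (text) => text.split(/\s+/).filter((t) => t.length > 0),
    chars: (text) => Array.from(text),
}

// Counts tokens with the selected toy tokenizer ("words" by default).
function toyCount(text, { tokenizer = "words" } = {}) {
    const tokenize = toyTokenizers[tokenizer]
    if (!tokenize) throw new Error(`unknown tokenizer: ${tokenizer}`)
    return tokenize(text).length
}

console.log(toyCount("hello world")) // 2
console.log(toyCount("hello world", { tokenizer: "chars" })) // 11
```

Because counts differ between tokenizers, a token budget is only meaningful relative to the tokenizer of the model that will receive the text.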

The count function returns the number of tokens in a string.

const n = await tokenizers.count("hello world")

The truncate function drops part of the string so that it fits within a token budget.

const truncated = await tokenizers.truncate("hello world", 5)
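To make the idea of a token budget concrete, here is a minimal sketch of truncation. It assumes a whitespace tokenizer and drops tokens from the end; naiveTokenize and naiveTruncate are hypothetical helpers, and the real truncate function may count tokens and choose which part to drop differently.

```javascript
// Hypothetical stand-in tokenizer: splits on whitespace.
// Real model tokenizers use subword units, so counts will differ.
function naiveTokenize(text) {
    return text.split(/\s+/).filter((t) => t.length > 0)
}

// Keeps at most maxTokens tokens from the start of the text.
// Text that already fits the budget is returned unchanged.
function naiveTruncate(text, maxTokens) {
    const tokens = naiveTokenize(text)
    if (tokens.length <= maxTokens) return text
    return tokens.slice(0, maxTokens).join(" ")
}

console.log(naiveTruncate("the quick brown fox jumps", 3)) // "the quick brown"
console.log(naiveTruncate("hello world", 5)) // "hello world"
```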

The chunk function splits text into chunks of a given token size. It tries to find appropriate chunk boundaries based on the document type.

const chunks = await tokenizers.chunk(env.files[0])
for (const chunk of chunks) {
    // ...
}

You can configure the chunk size and the overlap between chunks, and add line numbers.

const chunks = await tokenizers.chunk(env.files[0], {
    chunkSize: 128,
    chunkOverlap: 10,
    lineNumbers: true
})
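The interaction between chunk size and overlap can be sketched with a fixed-size sliding window. This is an illustration under simplifying assumptions, not the library's algorithm: naiveChunk is a hypothetical helper that tokenizes on whitespace and ignores document-type boundaries and line numbers. Only the option names chunkSize and chunkOverlap are taken from the example above.

```javascript
// Hypothetical helper: fixed-size token windows with overlap.
// Each chunk starts (chunkSize - chunkOverlap) tokens after the previous one,
// so consecutive chunks share chunkOverlap tokens.
function naiveChunk(text, { chunkSize = 128, chunkOverlap = 0 } = {}) {
    if (chunkOverlap >= chunkSize)
        throw new Error("chunkOverlap must be smaller than chunkSize")
    const tokens = text.split(/\s+/).filter((t) => t.length > 0)
    const step = chunkSize - chunkOverlap
    const chunks = []
    for (let start = 0; start < tokens.length; start += step) {
        chunks.push(tokens.slice(start, start + chunkSize).join(" "))
        // Stop once the window has reached the end of the text.
        if (start + chunkSize >= tokens.length) break
    }
    return chunks
}

const parts = naiveChunk("a b c d e f g", { chunkSize: 4, chunkOverlap: 1 })
console.log(parts) // [ 'a b c d', 'd e f g' ]
```

A larger overlap preserves more context across chunk boundaries at the cost of producing more chunks for the same text.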