Tokenizers
The tokenizers helper module provides a set of functions to split text into tokens.
const n = await tokenizers.count("hello world")
Choosing your tokenizer
By default, the tokenizers module uses the large tokenizer. You can change the tokenizer by passing a model identifier.
const n = await tokenizers.count("hello world", { model: "gpt-4o-mini" })
count
Counts the number of tokens in a string.
const n = await tokenizers.count("hello world")
truncate
Truncates the string so that it fits within the given token budget.
const truncated = await tokenizers.truncate("hello world", 5)
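For example, you could cap a file's content before including it in a prompt. A small sketch, where the 1000-token budget is illustrative:
const capped = await tokenizers.truncate(env.files[0].content, 1000)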
chunk
Splits the text into chunks of a given token size. The chunker tries to find appropriate chunking boundaries based on the document type.
const chunks = await tokenizers.chunk(env.files[0])
for (const chunk of chunks) {
    // ...
}
You can configure the chunk size and overlap, and add line numbers.
const chunks = await tokenizers.chunk(env.files[0], { chunkSize: 128, chunkOverlap: 10, lineNumbers: true })
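Each chunk can then be processed on its own, for instance to count its tokens. A minimal sketch, assuming each chunk exposes its text as a content property:
const chunks = await tokenizers.chunk(env.files[0], { chunkSize: 128, chunkOverlap: 10 })
for (const chunk of chunks) {
    const size = await tokenizers.count(chunk.content)
    console.log(size)
}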