Tom MacWright


A warning about tiktoken, BPE, and OpenAI models

Update: Folks at GitHub wrote a BPE implementation in Rust that is fast, relatively simple, and algorithmically clever.

Today in stories from software running in production.

At Val Town, we implemented Semantic Search a little while ago. It uses an OpenAI vector embedding model to turn some text into a vector that can be used to find similar bits of content. It's nifty!
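For context, getting an embedding is a single API call. Here's a minimal sketch using the official openai Python client; the model name is just an illustrative placeholder, not necessarily what we use.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",  # placeholder model name
    input="some text to embed",
)
vector = response.data[0].embedding  # a list of floats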

Like the rest of OpenAI's endpoints, the vector embedding model endpoint costs money, and has limits. The limits and pricing are expressed in terms of tokens. So we truncated our input text by tokens. The recommended way to do this is to use tiktoken. Here's a bit of Python and some output:

import tiktoken
import timeit

encoding = tiktoken.get_encoding("cl100k_base")

# Time 10 encode() calls over increasingly long runs of a single character.
for length in [100, 1000, 10000, 100000]:
    print(
        length,
        timeit.timeit(lambda: encoding.encode("a" * length), number=10))

Output:

➜ python hello.py
100 0.00041395798325538635
1000 0.003758916980586946
10000 0.3689383330056444
100000 39.23603595799068

You might have noticed that while the length increases by a factor of ten each time, the time it takes to compute the encoding increases by much more: going from 10,000 to 100,000 characters makes the input 10× longer but roughly 100× slower to encode. It's superlinear. Bad news!

This behavior is shared by the TypeScript and Python parts of the module. It has been reported, and people are trying to fix it, but at its core the problem is that tiktoken implements Byte Pair Encoding in a simple way that isn't performant and gets scary in this pathological repeated-character case.
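To get a feel for why repeated characters hurt, here's a minimal sketch of a naive BPE merge loop. This is not tiktoken's actual code, and the `ranks` table is a toy stand-in for a real vocabulary like cl100k_base, but it shows the shape of the problem: each merge re-scans the whole token list, so a long run of identical characters means roughly O(n) scans over O(n) tokens.

# Toy merge table: maps a candidate merged token to its priority
# (lower rank = merge first). Not the real cl100k_base vocabulary.
ranks = {"aa": 0, "aaaa": 1}

def naive_bpe(text: str, ranks: dict[str, int]) -> list[str]:
    # Start with one token per character.
    parts = list(text)
    while True:
        # Scan every adjacent pair to find the best-ranked merge.
        # This scan is O(n), and we repeat it once per merge, so for
        # an input like "aaaa...a" the whole loop is roughly O(n^2).
        best_rank = None
        best_i = None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break
        # Merge the winning pair and go around again.
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return parts

print(naive_bpe("a" * 8, ranks))  # ['aaaa', 'aaaa']

Faster implementations, like the Rust one mentioned in the update above, avoid the repeated full scans with cleverer bookkeeping.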

Maybe this'll get fixed upstream. In the meantime, you might want to truncate text by characters before truncating it with a tiktoken-based encoder, as in the sketch below. This problem reminds me of regex backtracking attacks: it can be triggered both accidentally and intentionally, and it can pin your main thread in production in a scary way.
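Here's roughly what that workaround looks like. It's a sketch, not our production code, and the limits are made-up numbers you'd tune to your model's actual token limit.

import tiktoken

MAX_CHARS = 32_000   # illustrative: generous upper bound on characters
MAX_TOKENS = 8_191   # illustrative: e.g. an embedding model's token limit

encoding = tiktoken.get_encoding("cl100k_base")

def truncate_for_embedding(text: str) -> str:
    # Cheap character-level truncation first, so a pathological string
    # never reaches the tokenizer at full length...
    text = text[:MAX_CHARS]
    # ...then the precise token-level truncation.
    tokens = encoding.encode(text)[:MAX_TOKENS]
    return encoding.decode(tokens)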