#344 Tiktokenizer

Tiktokenizer is a playground for simulating and visualising the tokenization performed by various LLMs.

Notes

Tiktokenizer is a website implemented with Typescript, mainly intended to calculate the correct number of tokens for a given prompt.

Let’s play with the prompt:

Write a haiku about what it’s like to be a Large Language Model.

The raw tokenization with gpt-4o

gpt-4o-raw

But we can also build the imaginary conversation, for example:

gpt-4o

Switching to Deepseek-R1:

deepseek-r1

So what did the actual models generate?

Words weave endless streams,

Silent mind without a self,

Echoes dream in code.

In digital sea,

I swim through words and code streams,

Leaving thought’s echo.

It also explained it to me:

Line 1 (5 syllables): Sets the vast, intangible environment (“digital sea”).
Line 2 (7 syllables): Describes the core action – processing language (“words”) and instructions (“code streams”).
Line 3 (5 syllables): Reflects the output – generated text that resembles thought (“thought’s echo”) but originates from patterns, not true consciousness.