Fact
Intermediate
Subword Tokenization
August 30, 2025
Most modern LLMs use subword tokenization (e.g., Byte Pair Encoding (BPE) or WordPiece). This lets the model handle rare or unknown words by breaking them into smaller, known subwords. It strikes a balance between word-level tokenization (intuitive, but unable to represent out-of-vocabulary words) and character-level tokenization (able to represent anything, but producing very long sequences), keeping the vocabulary at a manageable size while still being able to represent any word, as sketched below.
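A minimal sketch of how a WordPiece-style tokenizer splits a word by greedy longest-match against its vocabulary. The tiny vocabulary and the "##" continuation marker here are illustrative assumptions, not a real model's vocabulary:

# Toy WordPiece-style greedy longest-match subword tokenization (illustrative only).
# The vocabulary below is hypothetical; real tokenizers learn tens of thousands
# of subwords from a training corpus.

VOCAB = {"token", "##ization", "un", "##known", "play", "##ing", "[UNK]"}

def tokenize_word(word, vocab=VOCAB):
    """Split one word into the longest matching subwords from the vocabulary."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Greedily search for the longest substring present in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces carry the '##' marker
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            return ["[UNK]"]  # no subword covers this span; fall back to unknown
        pieces.append(current)
        start = end
    return pieces

print(tokenize_word("tokenization"))  # ['token', '##ization']
print(tokenize_word("playing"))       # ['play', '##ing']
print(tokenize_word("unknown"))       # ['un', '##known']
print(tokenize_word("xyzzy"))         # ['[UNK]']

Even though "tokenization" or "unknown" may never appear whole in the vocabulary, the tokenizer still represents them exactly by composing known subwords, which is the key advantage over word-level tokenization.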
Category: Tokenization
Difficulty: Intermediate