Tokenization Explained with Code: BPE vs WordPiece vs SentencePiece

Tokenization

Tokenization

If neural networks understand numbers, how do they understand text?
The answer lies in tokenization — one of the most overlooked yet critical components of modern NLP systems.

In this blog, we’ll break down how tokenization works, why subword tokenization became the standard, and compare BPE, WordPiece, and SentencePiece with intuition and code.


1. Why Tokenization Exists

Neural networks cannot process raw text. They operate on numerical inputs.

So text must be converted into:

Text → Tokens → Token IDs → Embeddings

A naive approach would be word-level tokenization:

"I love machine learning" → ["I", "love", "machine", "learning"]

🚫 Problems with Word-Level Tokenization

  • Huge vocabulary size
  • Out-of-vocabulary (OOV) words
  • Cannot handle rare words, typos, or new terms

This led to subword tokenization.


2. Subword Tokenization: The Core Idea

Instead of treating each word as atomic, we split words into smaller units:

"unhappiness" → ["un", "happi", "ness"]

Why this works:

  • Common words remain whole
  • Rare words are decomposed
  • Vocabulary size stays manageable
  • Model can generalize better

The three most popular subword methods are:

  • BPE (Byte Pair Encoding)
  • WordPiece
  • SentencePiece

3. Byte Pair Encoding (BPE)

🧠 Intuition

BPE starts with characters and repeatedly merges the most frequent pair.

🔁 Algorithm

  1. Initialize vocabulary with characters
  2. Count most frequent adjacent pairs
  3. Merge the most frequent pair
  4. Repeat until vocab size is reached

✨ Example

Corpus:

low, lower, newest, widest

After merges:

low → low
lower → low + er
newest → new + est

💻 BPE with HuggingFace Tokenizers

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize tokenizer
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=1000)

tokenizer.train(files=["corpus.txt"], trainer=trainer)

output = tokenizer.encode("Tokenization is powerful")
print(output.tokens)

✅ Used In

  • GPT-2
  • GPT-3 (byte-level BPE)

4. WordPiece

🧠 Intuition

WordPiece is similar to BPE but optimizes for likelihood rather than frequency.

Instead of merging the most frequent pair, it merges the pair that maximizes training data likelihood.

🔑 Special Feature

  • Uses ## prefix to denote subwords
playing → play + ##ing

💻 WordPiece in Transformers

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization matters a lot"
output = tokenizer.tokenize(text)

print(output)

Example Output

['token', '##ization', 'matters', 'a', 'lot']

✅ Used In

  • BERT
  • DistilBERT
  • Electra

5. SentencePiece

🧠 Intuition

SentencePiece treats text as a raw stream, not words.

🚫 No whitespace assumptions

This makes it language-agnostic.

🔥 Why SentencePiece is Powerful

  • Works with languages without spaces (Chinese, Japanese)
  • Handles noisy text better
  • Uses special token to represent spaces
"I love NLP" → ['▁I', '▁love', '▁NLP']

💻 SentencePiece Example

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='spm',
    vocab_size=1000
)

sp = spm.SentencePieceProcessor()
sp.load('spm.model')

print(sp.encode("Tokenization is cool", out_type=str))

✅ Used In

  • T5
  • ALBERT
  • LLaMA

6. Key Differences at a Glance

FeatureBPEWordPieceSentencePiece
Uses frequency
Likelihood-based
Space-aware
Language-agnostic
Used inGPTBERTLLaMA, T5

7. Which Tokenizer Should You Use?

Use BPE if:

  • You’re building a GPT-like model
  • You want simplicity and speed

Use WordPiece if:

  • You want BERT-style masked LM
  • You care about likelihood-based merges

Use SentencePiece if:

  • You’re working with multilingual data
  • You want whitespace-independent tokenization

8. Why Tokenization Matters More Than You Think

Tokenization affects:

  • Context length
  • Memory usage
  • Training stability
  • Inference speed

Bad tokenization can cripple model performance even with a strong architecture.


9. Final Thoughts

Tokenization is not just a preprocessing step — it’s a design decision.

Understanding how BPE, WordPiece, and SentencePiece work gives you an edge when:

  • Fine-tuning LLMs
  • Building RAG systems
  • Designing multilingual models

Written by Jatin Chopra — AI Engineer & Technical Writer

📬 Want to connect or collaborate? Head over to the Contact page or find me on GitHub or LinkedIn

Add a Comment

Your email address will not be published. Required fields are marked *