Tokenization Explained with Code: BPE vs WordPiece vs SentencePiece

December 17, 2025

Tokenization

If neural networks understand numbers, how do they understand text?
The answer lies in tokenization — one of the most overlooked yet critical components of modern NLP systems.

In this blog, we’ll break down how tokenization works, why subword tokenization became the standard, and compare BPE, WordPiece, and SentencePiece with intuition and code.

1. Why Tokenization Exists

Neural networks cannot process raw text. They operate on numerical inputs.

So text must be converted into:

Text → Tokens → Token IDs → Embeddings

A naive approach would be word-level tokenization:

"I love machine learning" → ["I", "love", "machine", "learning"]

🚫 Problems with Word-Level Tokenization

Huge vocabulary size
Out-of-vocabulary (OOV) words
Cannot handle rare words, typos, or new terms

This led to subword tokenization.

2. Subword Tokenization: The Core Idea

Instead of treating each word as atomic, we split words into smaller units:

"unhappiness" → ["un", "happi", "ness"]

Why this works:

Common words remain whole
Rare words are decomposed
Vocabulary size stays manageable
Model can generalize better

The three most popular subword methods are:

BPE (Byte Pair Encoding)
WordPiece
SentencePiece

3. Byte Pair Encoding (BPE)

🧠 Intuition

BPE starts with characters and repeatedly merges the most frequent pair.

🔁 Algorithm

Initialize vocabulary with characters
Count most frequent adjacent pairs
Merge the most frequent pair
Repeat until vocab size is reached

✨ Example

Corpus:

low, lower, newest, widest

After merges:

low → low
lower → low + er
newest → new + est

💻 BPE with HuggingFace Tokenizers

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize tokenizer
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=1000)

tokenizer.train(files=["corpus.txt"], trainer=trainer)

output = tokenizer.encode("Tokenization is powerful")
print(output.tokens)

✅ Used In

GPT-2
GPT-3 (byte-level BPE)

4. WordPiece

🧠 Intuition

WordPiece is similar to BPE but optimizes for likelihood rather than frequency.

Instead of merging the most frequent pair, it merges the pair that maximizes training data likelihood.

🔑 Special Feature

Uses ## prefix to denote subwords

playing → play + ##ing

💻 WordPiece in Transformers

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenization matters a lot"
output = tokenizer.tokenize(text)

print(output)

Example Output

['token', '##ization', 'matters', 'a', 'lot']

✅ Used In

BERT
DistilBERT
Electra

5. SentencePiece

🧠 Intuition

SentencePiece treats text as a raw stream, not words.

🚫 No whitespace assumptions

This makes it language-agnostic.

🔥 Why SentencePiece is Powerful

Works with languages without spaces (Chinese, Japanese)
Handles noisy text better
Uses special token ▁ to represent spaces

"I love NLP" → ['▁I', '▁love', '▁NLP']

💻 SentencePiece Example

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='spm',
    vocab_size=1000
)

sp = spm.SentencePieceProcessor()
sp.load('spm.model')

print(sp.encode("Tokenization is cool", out_type=str))

✅ Used In

T5
ALBERT
LLaMA

6. Key Differences at a Glance

Feature	BPE	WordPiece	SentencePiece
Uses frequency	✅	❌	✅
Likelihood-based	❌	✅	❌
Space-aware	✅	✅	❌
Language-agnostic	❌	❌	✅
Used in	GPT	BERT	LLaMA, T5

7. Which Tokenizer Should You Use?

Use BPE if:

You’re building a GPT-like model
You want simplicity and speed

Use WordPiece if:

You want BERT-style masked LM
You care about likelihood-based merges

Use SentencePiece if:

You’re working with multilingual data
You want whitespace-independent tokenization

8. Why Tokenization Matters More Than You Think

Tokenization affects:

Context length
Memory usage
Training stability
Inference speed

Bad tokenization can cripple model performance even with a strong architecture.

9. Final Thoughts

Tokenization is not just a preprocessing step — it’s a design decision.

Understanding how BPE, WordPiece, and SentencePiece work gives you an edge when:

Fine-tuning LLMs
Building RAG systems
Designing multilingual models

Written by Jatin Chopra — AI Engineer & Technical Writer

📬 Want to connect or collaborate? Head over to the Contact page or find me on GitHub or LinkedIn

Decode AI with JC

Tokenization Explained with Code: BPE vs WordPiece vs SentencePiece

Tokenization

1. Why Tokenization Exists

🚫 Problems with Word-Level Tokenization

2. Subword Tokenization: The Core Idea

Why this works:

3. Byte Pair Encoding (BPE)

🧠 Intuition

🔁 Algorithm

✨ Example

💻 BPE with HuggingFace Tokenizers

✅ Used In

4. WordPiece

🧠 Intuition

🔑 Special Feature

💻 WordPiece in Transformers

Example Output

✅ Used In

5. SentencePiece

🧠 Intuition

🔥 Why SentencePiece is Powerful

💻 SentencePiece Example

✅ Used In

6. Key Differences at a Glance

7. Which Tokenizer Should You Use?

Use BPE if:

Use WordPiece if:

Use SentencePiece if:

8. Why Tokenization Matters More Than You Think

9. Final Thoughts

About The Author

jatinchopra1011@gmail.com

Add a Comment

Your Weekly Dose of Knowledge – Delivered