sentencepiece
SKILL.md
name: sentencepiece
description: Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.
version: 1.0.0

SentencePiece - Language-Independent Tokenization

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

When to use SentencePiece

Use SentencePiece when you:

  • Build multilingual models (no language-specific rules)
  • Work with CJK languages (Chinese, Japanese, Korean)
  • Need reproducible tokenization (deterministic vocabulary)
  • Want to train on raw text (no pre-tokenization needed)
  • Require lightweight deployment (6MB memory, 50k sentences/sec)

Performance:

  • Speed: 50,000 sentences/sec
  • Memory: ~6MB for loaded model
  • Languages: All (language-independent)

Use alternatives instead:

  • HuggingFace Tokenizers: Faster training, more flexibility
  • tiktoken: OpenAI models (GPT-3.5/4)
  • BERT WordPiece: English-centric tasks

Quick start

Installation

# Python
pip install sentencepiece

# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install

Train model

# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe

# Python API
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)

Training time: ~1-2 minutes for a 100 MB corpus

Encode and decode

import sentencepiece as spm

# Load model
sp = spm.SentencePieceProcessor(model_file='m.model')

# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces)  # ['▁This', '▁is', '▁a', '▁test']

# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids)  # [284, 47, 11, 1243]

# Decode
text = sp.decode(ids)
print(text)  # "This is a test"
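
A quick way to sanity-check a loaded model is to inspect its vocabulary directly. A short sketch using the sp processor loaded above (the specific IDs are model-dependent):

# Pieces and IDs are interconvertible; vocabulary size is fixed at training time
print(sp.get_piece_size())      # 8000, the vocab_size used during training
print(sp.piece_to_id('▁test'))  # ID for a given piece (model-dependent)
print(sp.id_to_piece(0))        # '<unk>' with default settings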

Language-independent design

Whitespace as symbol (▁)

text = "Hello world"
pieces = sp.encode(text, out_type=str)
print(pieces)  # ['▁Hello', '▁world']

# Decode preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded)  # "Hello world"

Key principle: text is treated as a raw Unicode stream; whitespace is escaped to the meta symbol ▁, so decoding restores the original string exactly.
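
Because whitespace is just another symbol, the same pipeline handles languages that do not use spaces at all. A hedged sketch, assuming the model was trained on a corpus that includes Japanese (the pieces shown are illustrative, not guaranteed):

# No pre-tokenization step is needed for Japanese
pieces = sp.encode('こんにちは世界', out_type=str)
print(pieces)  # e.g. ['▁', 'こんにちは', '世界'] -- depends on training data

# Round-trip is lossless because ▁ records exactly where spaces were
print(sp.decode(pieces) == 'こんにちは世界')  # True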

Tokenization algorithms

BPE (Byte-Pair Encoding)

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='bpe_model',
    vocab_size=16000,
    model_type='bpe'
)

Used by: mBART

Unigram (default)

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='unigram_model',
    vocab_size=8000,
    model_type='unigram'
)

Used by: T5, ALBERT, XLNet
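
The two algorithms segment the same input differently: BPE applies learned merges greedily, while Unigram picks the most probable segmentation under a learned piece distribution. A sketch comparing them, assuming both models above were trained on the same corpus:

import sentencepiece as spm

bpe = spm.SentencePieceProcessor(model_file='bpe_model.model')
uni = spm.SentencePieceProcessor(model_file='unigram_model.model')

text = 'internationalization'
print(bpe.encode(text, out_type=str))  # merge-based segmentation
print(uni.encode(text, out_type=str))  # likelihood-based segmentation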

Training configuration

Essential parameters

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=0.9995,  # 1.0 for CJK
    user_defined_symbols=['[SEP]', '[CLS]'],
    unk_piece='<unk>',
    num_threads=16
)
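
One practical effect of user_defined_symbols is that they are reserved as whole pieces and never split, so markers like [SEP] always map to a single, stable ID. A minimal sketch, assuming the model trained above (exact surrounding pieces depend on the model):

sp = spm.SentencePieceProcessor(model_file='m.model')
print(sp.encode('question [SEP] answer', out_type=str))
# e.g. ['▁question', '▁', '[SEP]', '▁answer'] -- '[SEP]' stays intact
print(sp.piece_to_id('[SEP]'))  # a fixed, reserved ID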

Character coverage

Language      | Coverage | Rationale
English       | 0.9995   | Most common characters
CJK (Chinese) | 1.0      | All characters needed
Multilingual  | 0.9995   | Balance across languages
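
For a CJK corpus this means setting character_coverage=1.0, so that no common character falls back to <unk>. A hedged sketch (ja_corpus.txt is a placeholder filename):

spm.SentencePieceTrainer.train(
    input='ja_corpus.txt',       # hypothetical Japanese corpus
    model_prefix='ja',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=1.0       # cover every character seen in training
)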

Encoding options

Subword regularization

# Sample different tokenizations
for _ in range(3):
    pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
    print(pieces)

# Output varies per run, e.g.:
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']

Use case: Data augmentation for robustness.
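
Beyond on-the-fly sampling, the candidate segmentations can also be enumerated directly, which helps when inspecting what the sampler can produce. A sketch using the n-best API of the Python bindings (n-best enumeration applies to unigram models only):

# List the 5 most probable segmentations (unigram models only)
for cand in sp.nbest_encode_as_pieces('tokenization', 5):
    print(cand)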

Common patterns

T5-style training

spm.SentencePieceTrainer.train(
    input='c4_corpus.txt',
    model_prefix='t5',
    vocab_size=32000,
    model_type='unigram',
    user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],
    unk_id=2,
    eos_id=1,
    pad_id=0
)
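
After training, the sentinel tokens and the fixed special IDs can be verified on the resulting model. A short check, assuming the t5.model file produced above:

sp = spm.SentencePieceProcessor(model_file='t5.model')
print(sp.piece_to_id('<extra_id_0>'))          # sentinels are reserved pieces
print(sp.pad_id(), sp.eos_id(), sp.unk_id())   # 0, 1, 2 as configured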

Integration with transformers

from transformers import T5Tokenizer

# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')
inputs = tokenizer('translate English to French: Hello', return_tensors='pt')
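
The underlying SentencePiece pieces are visible through the tokenizer's ID-to-token mapping, which can help when debugging prompts. A small follow-up to the snippet above:

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
print(tokens)  # pieces such as ['▁translate', '▁English', ...], typically ending with '</s>'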

Performance benchmarks

Training speed

Corpus | BPE (16k) | Unigram (8k)
100 MB | 1-2 min   | 3-4 min
1 GB   | 10-15 min | 30-40 min

Tokenization speed

  • SentencePiece: 50,000 sentences/sec
  • HF Tokenizers: 200,000 sentences/sec (4× faster)

Supported models

T5 family: t5-base, t5-large (32k vocab, Unigram)
ALBERT: albert-base-v2 (30k vocab, Unigram)
XLNet: xlnet-base-cased (32k vocab, Unigram)
mBART: facebook/mbart-large-50 (250k vocab, BPE)

