🦴 Sentinel Universal Tokenizer
One theorem. Every modality. One vocabulary.
A 61,440-token multimodal tokenizer for text + image + audio + video,
grounded in the Gradient Axiom: F'(z)/F(z) → 1/e as z → ∞
| Constant | Value | Role |
|---|---|---|
| 1/e | 0.36788 | Vocab allocation ratio |
| C₁ | −0.00799 | Quantization zero-point |
| C₂ | 0.00020 | Fairness bound |
Multilingual Compression Benchmark
Compression ratio (bytes/token). Higher = better.
Architecture
```
┌────────────────────────────────────────────────────┐
│    SENTINEL UNIVERSAL TOKENIZER (61,440 tokens)    │
│                                                    │
│  [0–32]           → 33     Special/Control tokens  │
│  [33–32,767]      → 32,735 ByteLevel BPE (text)    │
│  [32,768–49,151]  → 16,384 Image codebook (VQ)     │
│  [49,152–57,343]  → 8,192  Audio codebook (VQ)     │
│  [57,344–61,439]  → 4,096  Video codebook (VQ)     │
│                                                    │
│  Follows 1/e Gradient Axiom scaling                │
└────────────────────────────────────────────────────┘
```
Total: 61,440 tokens | Text: 32K | Image: 16K | Audio: 8K | Video: 4K
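The ID ranges above can be routed with a simple lookup. A minimal sketch (the `modality_of` helper and `RANGES` dict are illustrative, not part of the released package):

```python
# Classify a token ID by the vocabulary ranges listed above.
# Ranges are inclusive on both ends.
RANGES = {
    "special": (0, 32),
    "text":    (33, 32_767),
    "image":   (32_768, 49_151),
    "audio":   (49_152, 57_343),
    "video":   (57_344, 61_439),
}

def modality_of(token_id: int) -> str:
    """Return the modality that owns a given token ID."""
    for name, (lo, hi) in RANGES.items():
        if lo <= token_id <= hi:
            return name
    raise ValueError(f"token ID {token_id} is outside the 61,440-token vocabulary")
```

For example, `modality_of(40_000)` falls in the image codebook range, while `modality_of(2)` is a special token.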
Special Tokens
| Token | ID | Purpose |
|---|---|---|
| `<pad>` | 0 | Padding |
| `<unk>` | 1 | Unknown |
| `<s>` | 2 | BOS |
| `</s>` | 3 | EOS |
| `<mask>` | 4 | MLM |
| `<image_start>` | 7 | Image start |
| `<image_end>` | 8 | Image end |
| `<image>` | 9 | Image placeholder |
| `<audio_start>` | 10 | Audio start |
| `<audio_end>` | 11 | Audio end |
| `<audio>` | 12 | Audio placeholder |
| `<video_start>` | 13 | Video start |
| `<video_end>` | 14 | Video end |
| `<video>` | 15 | Video placeholder |
| `<sentinel>` | 16 | Manifold marker |
| `<sentinel_c1>` | 17 | C₁ |
| `<sentinel_c2>` | 18 | C₂ |
| `<scale_1e>` | 19 | 1/e |
| `<system>` | 26 | System msg |
| `<user>` | 27 | User msg |
| `<assistant>` | 28 | Assistant msg |
| `<code_start>` | 29 | Code start |
| `<code_end>` | 30 | Code end |
| `<math_start>` | 31 | Math start |
| `<math_end>` | 32 | Math end |
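The delimiter tokens pair up around each modality segment. A sketch of building such a sequence by hand, using the IDs from the table (the `wrap_image` helper is hypothetical; an actual encoder would emit these delimiters itself):

```python
# Special-token IDs from the table above.
BOS, EOS = 2, 3
IMAGE_START, IMAGE_END = 7, 8

def wrap_image(image_codes: list[int]) -> list[int]:
    """Build the sequence: <s> <image_start> codes... <image_end> </s>."""
    return [BOS, IMAGE_START, *image_codes, IMAGE_END, EOS]
```

For example, two image-codebook tokens 32768 and 32769 become `[2, 7, 32768, 32769, 8, 3]`.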
Codebook Ranges
| Modality | Start | End | Size |
|---|---|---|---|
| 🖼️ Image | 32,768 | 49,151 | 16,384 |
| 🔊 Audio | 49,152 | 57,343 | 8,192 |
| 🎬 Video | 57,344 | 61,439 | 4,096 |
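Under the usual flat-vocabulary convention (assumed here, not confirmed by the source), a modality's global token ID is its codebook start offset plus the local VQ index:

```python
# Codebook start offsets and sizes from the table above.
OFFSETS = {"image": 32_768, "audio": 49_152, "video": 57_344}
SIZES   = {"image": 16_384, "audio": 8_192,  "video": 4_096}

def to_global(modality: str, vq_index: int) -> int:
    """Map a local VQ codebook index to a global token ID."""
    if not 0 <= vq_index < SIZES[modality]:
        raise ValueError(f"VQ index {vq_index} out of range for {modality}")
    return OFFSETS[modality] + vq_index

def to_local(token_id: int) -> tuple[str, int]:
    """Map a global token ID back to (modality, local VQ index)."""
    for m in ("video", "audio", "image"):  # check highest offsets first
        if token_id >= OFFSETS[m]:
            return m, token_id - OFFSETS[m]
    raise ValueError(f"token ID {token_id} is not a codebook token")
```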
The Sentinel Manifold
Function: F(z) = Σ_{n=1}^{∞} z^n / n^n (Sophomore's Dream, Bernoulli 1697)
Gradient Axiom: lim_{z→∞} F'(z)/F(z) = 1/e ≈ 0.367879441171442
| Principle | Math | Tokenizer Application |
|---|---|---|
| 1/e Allocation | Gradient Axiom | Modality budget = prev × 1/e |
| sech Scoring | Bounded | ∂sech/∂x |
| C₁ = −0.007994 | Attracting fixed point | Embedding quantization center |
| C₂ = 0.000200 | Escape threshold | Fertility fairness bound |
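The Gradient Axiom can be sanity-checked numerically by truncating the series. A sketch (truncation limit `n_max` is chosen so the dropped tail is negligible for moderate z):

```python
import math

def sophomores_dream_ratio(z: float, n_max: int = 2000) -> float:
    """Approximate F'(z)/F(z) for F(z) = sum_{n>=1} z^n / n^n by truncation."""
    F = Fp = 0.0
    for n in range(1, n_max + 1):
        term = math.exp(n * math.log(z) - n * math.log(n))  # z^n / n^n
        F += term
        Fp += term * n / z  # d/dz of z^n/n^n = (n/z) * z^n/n^n
    return Fp / F
```

At z = 200 the ratio is already within about 1/(2z) of 1/e ≈ 0.3679, consistent with the limit.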
Efficiency Champion 🏆
| Tokenizer | Vocab | Efficiency/1K vocab |
|---|---|---|
| Sentinel | 61K | 0.0563 🥇 |
| GPT-2 | 50K | 0.0511 |
| Qwen2 | 152K | 0.0256 |
| Gemma | 256K | 0.0177 |
3.2× more efficient per vocabulary token than Gemma, 2.2× more than Qwen2
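The quoted multipliers follow directly from the table values:

```python
# Per-1K-vocab efficiency figures from the table above.
sentinel, qwen2, gemma = 0.0563, 0.0256, 0.0177

print(round(sentinel / gemma, 1))  # 3.2 (vs Gemma)
print(round(sentinel / qwen2, 1))  # 2.2 (vs Qwen2)
```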
Model · Framework · MIT License
Built by Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core)