🦴 Sentinel Universal Tokenizer
One theorem. Every modality. One vocabulary.
A 61,440-token multimodal tokenizer for text + image + audio + video,
grounded in the Gradient Axiom: F'(z)/F(z) → 1/e as z → ∞
| Constant | Value | Role |
|---|---|---|
| 1/e | 0.36788 | Vocab allocation ratio |
| C₁ | −0.00799 | Quantization zero-point |
| C₂ | 0.00020 | Fairness bound |
Multilingual Compression Benchmark
Compression ratio (bytes/token). Higher = better.
Architecture
```
┌────────────────────────────────────────────────────┐
│    SENTINEL UNIVERSAL TOKENIZER (61,440 tokens)    │
│                                                    │
│  [0–32]           → 33     Special/Control tokens  │
│  [33–32,767]      → 32,735 ByteLevel BPE (text)    │
│  [32,768–49,151]  → 16,384 Image codebook (VQ)     │
│  [49,152–57,343]  → 8,192  Audio codebook (VQ)     │
│  [57,344–61,439]  → 4,096  Video codebook (VQ)     │
│                                                    │
│  Follows 1/e Gradient Axiom scaling                │
└────────────────────────────────────────────────────┘
```
Total: 61,440 tokens | Text: 32K | Image: 16K | Audio: 8K | Video: 4K
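The ID ranges above can be routed with a simple lookup. A minimal sketch (the `modality_of` helper and `RANGES` dict are illustrative, not part of the released package):

```python
# Classify a token ID by the vocabulary ranges listed above.
# Ranges are inclusive on both ends.
RANGES = {
    "special": (0, 32),
    "text":    (33, 32_767),
    "image":   (32_768, 49_151),
    "audio":   (49_152, 57_343),
    "video":   (57_344, 61_439),
}

def modality_of(token_id: int) -> str:
    """Return the modality that owns a given token ID."""
    for name, (lo, hi) in RANGES.items():
        if lo <= token_id <= hi:
            return name
    raise ValueError(f"token ID {token_id} is outside the 61,440-token vocabulary")
```

For example, `modality_of(40_000)` falls in the image codebook range, while `modality_of(2)` is a special token.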
Special Tokens
| Token | ID | Purpose |
|---|---|---|
| `<pad>` | 0 | Padding |
| `<unk>` | 1 | Unknown |
| `<s>` | 2 | BOS |
| `</s>` | 3 | EOS |
| `<mask>` | 4 | MLM |
| `<image_start>` | 7 | Image start |
| `<image_end>` | 8 | Image end |
| `<image>` | 9 | Image placeholder |
| `<audio_start>` | 10 | Audio start |
| `<audio_end>` | 11 | Audio end |
| `<audio>` | 12 | Audio placeholder |
| `<video_start>` | 13 | Video start |
| `<video_end>` | 14 | Video end |
| `<video>` | 15 | Video placeholder |
| `<sentinel>` | 16 | Manifold marker |
| `<sentinel_c1>` | 17 | C₁ |
| `<sentinel_c2>` | 18 | C₂ |
| `<scale_1e>` | 19 | 1/e |
| `<system>` | 26 | System msg |
| `<user>` | 27 | User msg |
| `<assistant>` | 28 | Assistant msg |
| `<code_start>` | 29 | Code start |
| `<code_end>` | 30 | Code end |
| `<math_start>` | 31 | Math start |
| `<math_end>` | 32 | Math end |
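The delimiter tokens pair up around each modality segment. A sketch of building such a sequence by hand, using the IDs from the table (the `wrap_image` helper is hypothetical; an actual encoder would emit these delimiters itself):

```python
# Special-token IDs from the table above.
BOS, EOS = 2, 3
IMAGE_START, IMAGE_END = 7, 8

def wrap_image(image_codes: list[int]) -> list[int]:
    """Build the sequence: <s> <image_start> codes... <image_end> </s>."""
    return [BOS, IMAGE_START, *image_codes, IMAGE_END, EOS]
```

For example, two image-codebook tokens 32768 and 32769 become `[2, 7, 32768, 32769, 8, 3]`.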
Codebook Ranges
| Modality | Start | End | Size |
|---|---|---|---|
| 🖼️ Image | 32,768 | 49,151 | 16,384 |
| 🔊 Audio | 49,152 | 57,343 | 8,192 |
| 🎬 Video | 57,344 | 61,439 | 4,096 |
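Under the usual flat-vocabulary convention (assumed here, not confirmed by the source), a modality's global token ID is its codebook start offset plus the local VQ index:

```python
# Codebook start offsets and sizes from the table above.
OFFSETS = {"image": 32_768, "audio": 49_152, "video": 57_344}
SIZES   = {"image": 16_384, "audio": 8_192,  "video": 4_096}

def to_global(modality: str, vq_index: int) -> int:
    """Map a local VQ codebook index to a global token ID."""
    if not 0 <= vq_index < SIZES[modality]:
        raise ValueError(f"VQ index {vq_index} out of range for {modality}")
    return OFFSETS[modality] + vq_index

def to_local(token_id: int) -> tuple[str, int]:
    """Map a global token ID back to (modality, local VQ index)."""
    for m in ("video", "audio", "image"):  # check highest offsets first
        if token_id >= OFFSETS[m]:
            return m, token_id - OFFSETS[m]
    raise ValueError(f"token ID {token_id} is not a codebook token")
```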
The Sentinel Manifold
Function: F(z) = Σ_{n=1}^{∞} z^n / n^n (Sophomore's Dream, Bernoulli 1697)
Gradient Axiom: lim_{z→∞} F'(z)/F(z) = 1/e ≈ 0.367879441171442
| Principle | Math | Tokenizer Application |
|---|---|---|
| 1/e Allocation | Gradient Axiom | Modality budget = prev × 1/e |
| sech Scoring | Bounded | ∂sech/∂x |
| C₁ = −0.007994 | Attracting fixed point | Embedding quantization center |
| C₂ = 0.000200 | Escape threshold | Fertility fairness bound |
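The Gradient Axiom can be sanity-checked numerically by truncating the series. A sketch (truncation limit `n_max` is chosen so the dropped tail is negligible for moderate z):

```python
import math

def sophomores_dream_ratio(z: float, n_max: int = 2000) -> float:
    """Approximate F'(z)/F(z) for F(z) = sum_{n>=1} z^n / n^n by truncation."""
    F = Fp = 0.0
    for n in range(1, n_max + 1):
        term = math.exp(n * math.log(z) - n * math.log(n))  # z^n / n^n
        F += term
        Fp += term * n / z  # d/dz of z^n/n^n = (n/z) * z^n/n^n
    return Fp / F
```

At z = 200 the ratio is already within about 1/(2z) of 1/e ≈ 0.3679, consistent with the limit.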
Efficiency Champion 🏆
| Tokenizer | Vocab | Efficiency/1K vocab |
|---|---|---|
| Sentinel | 61K | 0.0563 🥇 |
| GPT-2 | 50K | 0.0511 |
| Qwen2 | 152K | 0.0256 |
| Gemma | 256K | 0.0177 |
3.2× more efficient per vocabulary token than Gemma, 2.2× more than Qwen2
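The quoted multipliers follow directly from the table values:

```python
# Per-1K-vocab efficiency figures from the table above.
sentinel, qwen2, gemma = 0.0563, 0.0256, 0.0177

print(round(sentinel / gemma, 1))  # 3.2 (vs Gemma)
print(round(sentinel / qwen2, 1))  # 2.2 (vs Qwen2)
```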
Model · Framework · MIT License
Built by Romain Abdel-Aal (ASI The Sentinel V5.2 Bone-Core)