v0.8.7 Multilingual Tokenizer Tax Calculator — anti-bullshit pack #13
Pain: tokenizers tax non-English text asymmetrically. The same
paragraph might be 100 tokens in English but 250+ tokens in Chinese
on a Latin-trained tokenizer (Llama, Phi). Both per-request cost
AND effective context degrade silently. tiktokenizer.vercel.app
covers OpenAI's cl100k only; nothing public compares Llama vs Qwen
vs Phi vs Gemma vs GPT vs Claude in one interface.
🌍 Token Tax (20th mode):
- Lazy-imports HuggingFace transformers.js (~750 KB, pinned to 3.0.2,
  jsdelivr CDN). The first open of the mode pays the download cost;
  subsequent runs are instant once the browser cache is warm.
- Tokenizes user-pasted text against 6 preset open-weight tokenizers
(Qwen/Qwen2.5-7B-Instruct, microsoft/Phi-3.5-mini-instruct,
unsloth/Meta-Llama-3.1-8B-Instruct, unsloth/gemma-2-9b-it,
Xenova/gpt-4 cl100k port, Xenova/claude-tokenizer community port).
All open — no HF auth required. Llama/Gemma use the unsloth open
mirrors that ship the byte-identical tokenizer.json (quantization
touches weights, not tokens).
- Output: per-tokenizer token count, chars-per-token, ratio vs
baseline, color-coded (red ≥1.5×, amber ≥1.15×, green within 5%).
Worst-tax interpretation surfaces the loudest mismatch
automatically.
- Auto-detects Unicode script blocks (Latin / CJK / Korean / Arabic
/ Cyrillic / Devanagari / Thai / Greek / Hebrew) so users see
"92% CJK" alongside "Phi-3.5 = 2.27×" → instantly understand
the WHY.
- 5 sample buttons (English / 中文 / عربى / mixed / code) for
one-click demo coverage.
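The script-detection step above can be sketched with Unicode property escapes. This is a minimal illustration, not the shipped `js/tokenizer_tax.js` code — the bucket names and the rounding rule are assumptions:

```javascript
// Count characters per Unicode script with \p{Script=...} escapes,
// then report percentages over script-bearing characters only
// (spaces and punctuation are ignored).
const SCRIPTS = {
  Latin: /\p{Script=Latin}/u,
  CJK: /\p{Script=Han}/u,
  Korean: /\p{Script=Hangul}/u,
  Arabic: /\p{Script=Arabic}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Devanagari: /\p{Script=Devanagari}/u,
  Thai: /\p{Script=Thai}/u,
  Greek: /\p{Script=Greek}/u,
  Hebrew: /\p{Script=Hebrew}/u,
};

function scriptBreakdown(text) {
  const counts = {};
  let total = 0;
  for (const ch of text) {
    for (const [name, re] of Object.entries(SCRIPTS)) {
      if (re.test(ch)) {
        counts[name] = (counts[name] || 0) + 1;
        total += 1;
        break; // first matching script wins for this character
      }
    }
  }
  const pct = {};
  for (const [name, n] of Object.entries(counts)) {
    pct[name] = Math.round((100 * n) / total);
  }
  return pct;
}
```

A string like "hello 世界" comes back as roughly 71% Latin / 29% CJK, which is the kind of breakdown shown next to the per-tokenizer ratios.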
Pure logic in `js/tokenizer_tax.js` (lazy CDN import + tokenizer
cache + parallel tokenize + script detection). 36 i18n keys × 4
langs (EN/ES/FR/ZH) = 144 keys, parity clean. Help modal v0.8.7
entry + Inventory + "Set up an eval correctly" task tile.
Privacy-by-design: all tokenization is local — pasted text never
leaves the browser. Status note explains first-load latency
(~5-15s for 6 tokenizers in parallel, then cached).
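The lazy-CDN-import-plus-cache behavior described above boils down to memoizing a loader so the expensive fetch runs at most once. A minimal sketch (the commented-out URL is illustrative, not a guaranteed endpoint):

```javascript
// Memoize a loader: every caller shares the same in-flight or
// resolved result, so the CDN fetch happens at most once.
// Sketch of the pattern only — not the shipped module.
function lazyOnce(loader) {
  let cached = null;
  return function () {
    if (cached === null) cached = loader();
    return cached;
  };
}

// Hypothetical usage — the real code pins transformers.js 3.0.2 on jsdelivr:
// const getTransformers = lazyOnce(() =>
//   import("https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.2"));
```

Because the cached value is the promise itself, concurrent first calls also share one download instead of racing.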
Verified locally: ZH sample (92% CJK) yields:
Qwen2.5 baseline 44 tokens (1.43 chars/tok) 1.00×
Phi-3.5 100 tokens (0.63 chars/tok) 2.27× ⚠
Llama-3.1 60 tokens (1.05 chars/tok) 1.36×
Gemma-2 49 tokens (1.29 chars/tok) 1.11×
GPT-4 cl100k 81 tokens (0.78 chars/tok) 1.84×
Claude (approx) 70 tokens (0.90 chars/tok) 1.59×
Phi's BPE (32k vocab, no CJK pre-training) charges 2.27× over
Qwen for the SAME Chinese paragraph. That is the silent tax this
tool surfaces.
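The table's arithmetic is easy to reproduce. A sketch of the ratio and color-band computation (the "neutral" band for the unspecified 1.05-1.15× gap is my assumption, not documented behavior):

```javascript
// Per-tokenizer comparison math as described above (illustrative
// sketch, not the shipped code). `counts` maps label -> token count
// for the same input text; `baselineLabel` names the 1.00x row.
function tokenTax(text, counts, baselineLabel) {
  const base = counts[baselineLabel];
  return Object.entries(counts).map(([label, tokens]) => {
    const ratio = tokens / base;
    let band;
    if (ratio >= 1.5) band = "red";         // severe tax
    else if (ratio >= 1.15) band = "amber"; // noticeable tax
    else if (Math.abs(ratio - 1) <= 0.05) band = "green"; // within 5%
    else band = "neutral";                  // assumed: gap not specified
    return {
      label,
      tokens,
      charsPerToken: Number((text.length / tokens).toFixed(2)),
      ratio: Number(ratio.toFixed(2)),
      band,
    };
  });
}
```

Feeding it 44 Qwen tokens vs 100 Phi tokens for the same text reproduces the 2.27× red-band row above.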
Refs:
- https://github.com/huggingface/transformers.js
- https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
- https://huggingface.co/Xenova/gpt-4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- index.html +35 -0
- js/i18n.js +156 -0
- js/main.js +163 -1
- js/tokenizer_tax.js +221 -0
index.html:

@@ -228,6 +228,9 @@
 <p><strong data-i18n="help.v085.speculative.title">🔬 Speculative-Decode Compatibility</strong></p>
 <p data-i18n="help.v085.speculative.body">Speculative decoding only works if target and draft share the exact same vocabulary. Mismatched vocabs cause every draft token to be rejected — you pay BOTH compute costs and get worse throughput than baseline. Worse, the system still emits correct output (just slower), so the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 / #12488 all surface variants. This tool fetches `tokenizer.json` from HF Hub for both model ids, compares tokenizer type, vocab size, full token→id map, special tokens, and added tokens, then estimates a speedup band based on param ratio and typical α=0.5/0.7/0.85 acceptance rates. <em>Use case</em>: before you launch a vLLM cluster with spec-dec enabled, verify the pair is actually compatible.</p>

+<p><strong data-i18n="help.v087.tax.title">🌍 Multilingual Tokenizer Tax</strong></p>
+<p data-i18n="help.v087.tax.body">Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ in Chinese on a Latin-trained tokenizer (Llama, Phi). Both cost-per-request AND effective context degrade silently. This tool loads HuggingFace transformers.js in your browser (~750 KB CDN) and tokenizes pasted text against 6 preset vendor tokenizers (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). <em>Use case</em>: 'My multilingual support added 30% to the bill — which language costs the most?' → paste real production text, see exact per-tokenizer breakdown.</p>
+
 <p><strong data-i18n="help.v081.hub.title">🧭 Solutions Hub</strong></p>
 <p data-i18n="help.v081.hub.body">tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'</p>

@@ -344,6 +347,7 @@
 <li data-i18n="inv.v083.peft"><strong>🔧 PEFT Lint</strong> — catches the silent <code>get_peft_model</code> base-load (peft #2115) + QLoRA order + target_modules / arch mismatch.</li>
 <li data-i18n="inv.v084.cache"><strong>🔁 Cache Diff</strong> — predicts whether a prompt edit invalidated the provider's prompt cache. Per-provider hit ratio + $ delta.</li>
 <li data-i18n="inv.v085.speculative"><strong>🔬 Spec-Decode</strong> — verifies tokenizer vocab compatibility between target + draft before you ship speculative decoding (the bug that gives WORSE throughput silently).</li>
+<li data-i18n="inv.v087.tax"><strong>🌍 Token Tax</strong> — real BPE encoding across 6 vendor tokenizers. Surfaces the silent cost asymmetry across languages (CJK / Arabic / mixed).</li>
 <li data-i18n="inv.v081.hub"><strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.</li>
 </ul>
 </details>

@@ -419,6 +423,7 @@
 <button data-mode-link="peft" data-i18n="modes.peft">🔧 PEFT Lint</button>
 <button data-mode-link="cache" data-i18n="modes.cache">🔁 Cache Diff</button>
 <button data-mode-link="speculative" data-i18n="modes.speculative">🔬 Spec-Decode</button>
+<button data-mode-link="tax" data-i18n="modes.tax">🌍 Token Tax</button>
 </div>
 </div>
 <div class="task-tile">

@@ -479,6 +484,7 @@
 <button class="mode-btn" data-mode="peft" role="tab" aria-selected="false" data-i18n="modes.peft">🔧 PEFT Lint</button>
 <button class="mode-btn" data-mode="cache" role="tab" aria-selected="false" data-i18n="modes.cache">🔁 Cache Diff</button>
 <button class="mode-btn" data-mode="speculative" role="tab" aria-selected="false" data-i18n="modes.speculative">🔬 Spec-Decode</button>
+<button class="mode-btn" data-mode="tax" role="tab" aria-selected="false" data-i18n="modes.tax">🌍 Token Tax</button>
 <button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
 </div>
 <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">

@@ -1143,6 +1149,35 @@
 <div id="spec-output" style="margin-top: 1em;"></div>
 </section>

+<!-- Multilingual Tokenizer Tax (mode=tax, v0.8.7 anti-bullshit pack #13) -->
+<section id="tax-section" style="display:none;">
+<h2><span data-i18n="tax.title">🌍 Multilingual Tokenizer Tax</span>
+<span class="info"><span class="tooltip" data-i18n="tax.tip">
+<strong>Why this matters</strong>: tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ tokens in Chinese on a Latin-trained tokenizer (Llama, Phi). Cost per request and effective context BOTH degrade silently. Paste your text, see actual token counts across vendor tokenizers — no estimation, real BPE encoding via transformers.js in your browser.
+</span></span>
+</h2>
+<p class="recipe-desc" data-i18n="tax.desc">
+<strong>Don't 3× your bill on Chinese support.</strong> Paste any text → real per-tokenizer BPE encoding across Qwen / Phi / Llama / Gemma / GPT-4 / Claude → see the cost asymmetry vs your baseline.
+</p>
+<div class="form-row">
+<label for="tax-input" data-i18n="tax.input_label">Text to tokenize:</label>
+<textarea id="tax-input" rows="8" style="width:100%;font-family:monospace;font-size:0.9em;" data-i18n-placeholder="tax.input.placeholder" placeholder="Paste any text — English, Chinese, Arabic, code, …"></textarea>
+</div>
+<div class="form-row">
+<button type="button" id="tax-tokenize-btn" data-i18n="tax.tokenize_btn">🔬 Tokenize all</button>
+<button type="button" id="tax-sample-en-btn" class="secondary" data-i18n="tax.sample_en_btn">↳ Sample: English</button>
+<button type="button" id="tax-sample-zh-btn" class="secondary" data-i18n="tax.sample_zh_btn">↳ Sample: 中文</button>
+<button type="button" id="tax-sample-ar-btn" class="secondary" data-i18n="tax.sample_ar_btn">↳ Sample: عربى</button>
+<button type="button" id="tax-sample-mixed-btn" class="secondary" data-i18n="tax.sample_mixed_btn">↳ Sample: mixed</button>
+<button type="button" id="tax-sample-code-btn" class="secondary" data-i18n="tax.sample_code_btn">↳ Sample: code</button>
+</div>
+<p id="tax-status" class="recipe-desc" style="font-size:0.92em;"></p>
+<div id="tax-output" style="margin-top: 1em;"></div>
+<p class="recipe-desc subtle" style="font-size:0.82em;margin-top:1em;" data-i18n="tax.firstload_note">
+💡 <strong>First-time load:</strong> the tool fetches transformers.js (~750 KB) + each tokenizer's vocab on demand (~5-15 MB per tokenizer, cached after). Subsequent runs are instant. All processing is local — your text never leaves the browser.
+</p>
+</section>
+
 <section id="hub-section" style="display:none;">
 <h2><span data-i18n="hub.title">🧭 Solutions Hub</span>
 <span class="info"><span class="tooltip" data-i18n="hub.tip">
|
@@ -714,6 +714,45 @@ export const TRANSLATIONS = {
|
|
| 714 |
"help.v085.speculative.title": "🔬 Speculative-Decode Compatibility",
|
| 715 |
"help.v085.speculative.body": "Speculative decoding only works if target and draft share the exact same vocabulary. Mismatched vocabs cause every draft token to be rejected — you pay BOTH compute costs and get worse throughput than baseline. Worse, the system still emits correct output (just slower), so the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 / #12488 all surface variants. This tool fetches `tokenizer.json` from HF Hub for both model ids, compares tokenizer type, vocab size, full token→id map, special tokens, and added tokens, then estimates a speedup band based on param ratio and typical α=0.5/0.7/0.85 acceptance rates. <em>Use case</em>: before you launch a vLLM cluster with spec-dec enabled, verify the pair is actually compatible.",
|
| 716 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 717 |
"inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.",
|
| 718 |
"help.v081.hub.title": "🧭 Solutions Hub",
|
| 719 |
"help.v081.hub.body": "tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'",
|
|
@@ -1887,6 +1926,45 @@ export const TRANSLATIONS = {
|
|
| 1887 |
"help.v085.speculative.title": "🔬 Compatibilidad de Speculative-Decode",
|
| 1888 |
"help.v085.speculative.body": "El speculative decoding solo funciona si target y draft comparten exactamente el mismo vocabulario. Vocabs mismatched hacen que cada token del draft sea rechazado — pagas AMBOS computes y obtienes peor throughput que baseline. Peor: el sistema sigue emitiendo output correcto (solo más lento), así que el bug es invisible en tests unitarios. vLLM #4570 / #16757 / #20409 / #12488 surfacen variantes. Esta tool hace fetch de `tokenizer.json` desde HF Hub para ambos ids, compara tipo de tokenizer, tamaño de vocab, mapa completo token→id, special tokens, y added tokens, luego estima una banda de speedup basada en ratio de params y tasas típicas α=0.5/0.7/0.85 de aceptación. <em>Caso de uso</em>: antes de lanzar un cluster vLLM con spec-dec habilitado, verifica que el par sea compatible.",
|
| 1889 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1890 |
"inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — cada pain documentado mapeado a un mode tafagent o herramienta externa curada. No reinventes — encuentra.",
|
| 1891 |
"help.v081.hub.title": "🧭 Solutions Hub",
|
| 1892 |
"help.v081.hub.body": "tafagent como integrador, no silo. 30+ pains en 7 categorías (eval reliability · diagnósticos · setup · training · retrieval · multimodal · observability), cada uno mapeado a (a) el mode tafagent que lo resuelve, si existe, y (b) las herramientas externas best-of-breed que la comunidad ya usa (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Caja de búsqueda matchea pain, scenario, y nombre de herramienta. <em>Caso de uso</em>: 'tengo problema X — ¿lo resuelve tafagent, y si no, quién?'",
|
|
@@ -2924,6 +3002,45 @@ export const TRANSLATIONS = {
|
|
| 2924 |
"help.v085.speculative.title": "🔬 Compatibilité Speculative-Decode",
|
| 2925 |
"help.v085.speculative.body": "Le speculative decoding ne marche que si target et draft partagent exactement le même vocabulaire. Des vocabs mismatched font que chaque token du draft est rejeté — vous payez LES DEUX coûts de compute et obtenez un pire débit que la baseline. Pire : le système émet toujours une sortie correcte (juste plus lente), donc le bug est invisible aux tests unitaires. vLLM #4570 / #16757 / #20409 / #12488 surfent les variantes. Cet outil récupère `tokenizer.json` depuis HF Hub pour les deux model ids, compare le type de tokenizer, la taille du vocab, la map complète token→id, les special tokens, et les added tokens, puis estime une bande de speedup basée sur le ratio de params et les taux α=0.5/0.7/0.85 d'acceptation typiques. <em>Cas d'usage</em> : avant de lancer un cluster vLLM avec spec-dec activé, vérifiez que la paire est compatible.",
|
| 2926 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2927 |
"inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — chaque pain documenté mappé à un mode tafagent ou outil externe curé. Ne réinventez pas — trouvez.",
|
| 2928 |
"help.v081.hub.title": "🧭 Solutions Hub",
|
| 2929 |
"help.v081.hub.body": "tafagent comme intégrateur, pas silo. 30+ pains à travers 7 catégories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), chacun mappé à (a) le mode tafagent qui le résout, s'il existe, et (b) les outils externes best-of-breed que la communauté utilise déjà (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). La barre de recherche matche pain, scénario, et nom d'outil. <em>Cas d'usage</em> : 'j'ai le problème X — tafagent le résout-il, et sinon, qui ?'",
|
|
@@ -3961,6 +4078,45 @@ export const TRANSLATIONS = {
|
|
| 3961 |
"help.v085.speculative.title": "🔬 Speculative-Decode 兼容性",
|
| 3962 |
"help.v085.speculative.body": "Speculative decoding 仅当 target 和 draft 共享完全相同的词汇表时才能工作。Vocab 不匹配导致每个 draft token 被拒绝——你支付双倍计算成本且吞吐量比 baseline 更差。更糟:系统仍输出正确(只是更慢),所以 bug 在单元测试中不可见。vLLM #4570 / #16757 / #20409 / #12488 都显示了变种。这个工具从 HF Hub 获取两个 model id 的 `tokenizer.json`,比较 tokenizer 类型、vocab 大小、完整 token→id 映射、special token 和 added token,然后基于参数比和典型 α=0.5/0.7/0.85 接受率估算 speedup 范围。<em>用例</em>:在启动启用了 spec-dec 的 vLLM 集群之前,验证这对模型是否真的兼容。",
|
| 3963 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 3964 |
"inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — 每个文档化的问题都映射到一个 tafagent 模式或精选外部工具。别重复发明 — 去找。",
|
| 3965 |
"help.v081.hub.title": "🧭 Solutions Hub",
|
| 3966 |
"help.v081.hub.body": "tafagent 作为集成者而非孤岛。30+ 问题跨 7 类别(评估可靠性 · 诊断 · 设置 · 训练 · 检索 · 多模态 · 可观测性),每个映射到(a)解决它的 tafagent 模式(若存在),以及(b)社区已信任的最佳外部工具(RAGAS、MTEB、HELM、MCP Schema Validator、llm-stats、llguidance、GlitchMiner 等)。搜索框匹配 pain、场景和工具名称。<em>用例</em>:'我有问题 X — tafagent 解决它吗,如果不,谁解决?'",
|
|
|
|
| 714 |
"help.v085.speculative.title": "🔬 Speculative-Decode Compatibility",
|
| 715 |
"help.v085.speculative.body": "Speculative decoding only works if target and draft share the exact same vocabulary. Mismatched vocabs cause every draft token to be rejected — you pay BOTH compute costs and get worse throughput than baseline. Worse, the system still emits correct output (just slower), so the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 / #12488 all surface variants. This tool fetches `tokenizer.json` from HF Hub for both model ids, compares tokenizer type, vocab size, full token→id map, special tokens, and added tokens, then estimates a speedup band based on param ratio and typical α=0.5/0.7/0.85 acceptance rates. <em>Use case</em>: before you launch a vLLM cluster with spec-dec enabled, verify the pair is actually compatible.",
|
| 716 |
|
| 717 |
+
// v0.8.7 — anti-bullshit pack #13: Multilingual Tokenizer Tax
|
| 718 |
+
"modes.tax": "🌍 Token Tax",
|
| 719 |
+
"mode_desc.tax": "Real BPE encoding (browser-side via transformers.js) of pasted text across 6 vendor tokenizers. Surfaces the silent cost asymmetry across languages.",
|
| 720 |
+
"tax.title": "🌍 Multilingual Tokenizer Tax",
|
| 721 |
+
"tax.tip": "Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ tokens in Chinese on a Latin-trained tokenizer (Llama, Phi). Cost per request and effective context BOTH degrade silently. Paste your text, see actual token counts across vendor tokenizers — no estimation, real BPE encoding via transformers.js in your browser.",
|
| 722 |
+
"tax.desc": "<strong>Don't 3× your bill on Chinese support.</strong> Paste any text → real per-tokenizer BPE encoding across Qwen / Phi / Llama / Gemma / GPT-4 / Claude → see the cost asymmetry vs your baseline.",
|
| 723 |
+
"tax.input_label": "Text to tokenize:",
|
| 724 |
+
"tax.input.placeholder": "Paste any text — English, Chinese, Arabic, code, …",
|
| 725 |
+
"tax.tokenize_btn": "🔬 Tokenize all",
|
| 726 |
+
"tax.sample_en_btn": "↳ Sample: English",
|
| 727 |
+
"tax.sample_zh_btn": "↳ Sample: 中文",
|
| 728 |
+
"tax.sample_ar_btn": "↳ Sample: عربى",
|
| 729 |
+
"tax.sample_mixed_btn": "↳ Sample: mixed",
|
| 730 |
+
"tax.sample_code_btn": "↳ Sample: code",
|
| 731 |
+
"tax.status.loading": "⏳ Loading transformers.js + tokenizers (first run can take 5-15s)…",
|
| 732 |
+
"tax.status.done": "✅ {n}/{total} tokenizers ran in {ms}ms",
|
| 733 |
+
"tax.col.tokenizer": "Tokenizer",
|
| 734 |
+
"tax.col.tokens": "Tokens",
|
| 735 |
+
"tax.col.cpt": "Chars/tok",
|
| 736 |
+
"tax.col.ratio": "Ratio",
|
| 737 |
+
"tax.summary.input": "Input: {chars} chars, {bytes} bytes",
|
| 738 |
+
"tax.script_breakdown": "scripts",
|
| 739 |
+
"tax.interp.worst": "{label} costs {pct}% more tokens than baseline for this text.",
|
| 740 |
+
"tax.interp.uniform": "✓ All tokenizers within ±5% — text is well-handled across vendors.",
|
| 741 |
+
"tax.hint.empty": "Paste some text and click Tokenize.",
|
| 742 |
+
"tax.all_failed": "All tokenizers failed to load.",
|
| 743 |
+
"tax.error.gated": "model gated (HF auth required — try the open mirror)",
|
| 744 |
+
"tax.error.not_found": "model id not found",
|
| 745 |
+
"tax.error.timeout": "timeout (large tokenizer or slow connection)",
|
| 746 |
+
"tax.error.network": "network error",
|
| 747 |
+
"tax.error.fetch_failed": "fetch failed",
|
| 748 |
+
"tax.error.invalid_input": "invalid input",
|
| 749 |
+
"tax.attribution": "Tokenizers via",
|
| 750 |
+
"tax.attribution.privacy": "Text is tokenized locally — never leaves the browser.",
|
| 751 |
+
"tax.firstload_note": "💡 <strong>First-time load:</strong> the tool fetches transformers.js (~750 KB) + each tokenizer's vocab on demand (~5-15 MB per tokenizer, cached after). Subsequent runs are instant. All processing is local — your text never leaves the browser.",
|
| 752 |
+
"inv.v087.tax": "<strong>🌍 Token Tax</strong> — real BPE encoding across 6 vendor tokenizers. Surfaces the silent cost asymmetry across languages (CJK / Arabic / mixed).",
|
| 753 |
+
"help.v087.tax.title": "🌍 Multilingual Tokenizer Tax",
|
| 754 |
+
"help.v087.tax.body": "Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ in Chinese on a Latin-trained tokenizer (Llama, Phi). Both cost-per-request AND effective context degrade silently. This tool loads HuggingFace transformers.js in your browser (~750 KB CDN) and tokenizes pasted text against 6 preset vendor tokenizers (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). Output: per-tokenizer token count + chars-per-token + ratio vs baseline + cost-asymmetry interpretation. Auto-detects script blocks (Latin / CJK / Arabic / Cyrillic / Devanagari / Thai / Greek / Hebrew / Korean) so users see why one tokenizer is 3× another. <em>Use case</em>: 'My multilingual support added 30% to the bill — which language costs the most?' → paste real production text, see exact per-tokenizer breakdown.",
|
| 755 |
+
|
| 756 |
"inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.",
|
| 757 |
"help.v081.hub.title": "🧭 Solutions Hub",
|
| 758 |
"help.v081.hub.body": "tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'",
|
|
|
|
| 1926 |
"help.v085.speculative.title": "🔬 Compatibilidad de Speculative-Decode",
|
| 1927 |
"help.v085.speculative.body": "El speculative decoding solo funciona si target y draft comparten exactamente el mismo vocabulario. Vocabs mismatched hacen que cada token del draft sea rechazado — pagas AMBOS computes y obtienes peor throughput que baseline. Peor: el sistema sigue emitiendo output correcto (solo más lento), así que el bug es invisible en tests unitarios. vLLM #4570 / #16757 / #20409 / #12488 surfacen variantes. Esta tool hace fetch de `tokenizer.json` desde HF Hub para ambos ids, compara tipo de tokenizer, tamaño de vocab, mapa completo token→id, special tokens, y added tokens, luego estima una banda de speedup basada en ratio de params y tasas típicas α=0.5/0.7/0.85 de aceptación. <em>Caso de uso</em>: antes de lanzar un cluster vLLM con spec-dec habilitado, verifica que el par sea compatible.",
|
| 1928 |
|
| 1929 |
+
// v0.8.7 — anti-bullshit pack #13: Multilingual Tokenizer Tax
|
| 1930 |
+
"modes.tax": "🌍 Token Tax",
|
| 1931 |
+
"mode_desc.tax": "BPE real (transformers.js en browser) sobre texto pegado a través de 6 tokenizers de vendor. Surface la asimetría de coste silenciosa entre idiomas.",
|
| 1932 |
+
"tax.title": "🌍 Impuesto de Tokenizer Multilingüe",
|
| 1933 |
+
"tax.tip": "Los tokenizers gravan el texto no-inglés de forma asimétrica. El mismo párrafo puede ser 100 tokens en inglés pero 250+ en chino en un tokenizer entrenado en Latin (Llama, Phi). Coste por request Y contexto efectivo degradan silenciosamente. Pega tu texto, ve token counts reales a través de tokenizers de vendor — sin estimación, BPE real vía transformers.js en tu navegador.",
|
| 1934 |
+
"tax.desc": "<strong>No 3× tu factura en soporte chino.</strong> Pega cualquier texto → BPE real por-tokenizer a través de Qwen / Phi / Llama / Gemma / GPT-4 / Claude → ve la asimetría de coste vs tu baseline.",
|
| 1935 |
+
"tax.input_label": "Texto a tokenizar:",
|
| 1936 |
+
"tax.input.placeholder": "Pega cualquier texto — inglés, chino, árabe, código, …",
|
| 1937 |
+
"tax.tokenize_btn": "🔬 Tokenizar todos",
|
| 1938 |
+
"tax.sample_en_btn": "↳ Ejemplo: English",
|
| 1939 |
+
"tax.sample_zh_btn": "↳ Ejemplo: 中文",
|
| 1940 |
+
"tax.sample_ar_btn": "↳ Ejemplo: عربى",
|
| 1941 |
+
"tax.sample_mixed_btn": "↳ Ejemplo: mixto",
|
| 1942 |
+
"tax.sample_code_btn": "↳ Ejemplo: código",
|
| 1943 |
+
"tax.status.loading": "⏳ Cargando transformers.js + tokenizers (primera ejecución puede tardar 5-15s)…",
|
| 1944 |
+
"tax.status.done": "✅ {n}/{total} tokenizers en {ms}ms",
|
| 1945 |
+
"tax.col.tokenizer": "Tokenizer",
|
| 1946 |
+
"tax.col.tokens": "Tokens",
|
| 1947 |
+
"tax.col.cpt": "Chars/tok",
|
| 1948 |
+
"tax.col.ratio": "Ratio",
|
| 1949 |
+
"tax.summary.input": "Entrada: {chars} caracteres, {bytes} bytes",
|
| 1950 |
+
"tax.script_breakdown": "scripts",
|
| 1951 |
+
"tax.interp.worst": "{label} cuesta {pct}% más tokens que baseline para este texto.",
|
| 1952 |
+
"tax.interp.uniform": "✓ Todos los tokenizers dentro de ±5% — texto bien manejado entre vendors.",
|
| 1953 |
+
"tax.hint.empty": "Pega texto y haz click en Tokenizar.",
|
| 1954 |
+
"tax.all_failed": "Todos los tokenizers fallaron.",
|
| 1955 |
+
+  "tax.error.gated": "modelo gated (auth HF requerida — prueba mirror open)",
+  "tax.error.not_found": "model id no encontrado",
+  "tax.error.timeout": "timeout (tokenizer grande o conexión lenta)",
+  "tax.error.network": "error de red",
+  "tax.error.fetch_failed": "fetch falló",
+  "tax.error.invalid_input": "entrada inválida",
+  "tax.attribution": "Tokenizers vía",
+  "tax.attribution.privacy": "El texto se tokeniza localmente — nunca sale del navegador.",
+  "tax.firstload_note": "💡 <strong>Primera carga:</strong> la tool descarga transformers.js (~750 KB) + el vocab de cada tokenizer bajo demanda (~5-15 MB por tokenizer, cacheados después). Ejecuciones siguientes son instantáneas. Todo el procesamiento es local — tu texto nunca sale del navegador.",
+  "inv.v087.tax": "<strong>🌍 Token Tax</strong> — BPE real sobre 6 tokenizers de vendor. Expone la asimetría de coste silenciosa entre idiomas (CJK / árabe / mixto).",
+  "help.v087.tax.title": "🌍 Impuesto de Tokenizer Multilingüe",
+  "help.v087.tax.body": "Los tokenizers gravan el texto no-inglés de forma asimétrica. El mismo párrafo puede ser 100 tokens en inglés pero 250+ en chino en un tokenizer entrenado en Latin (Llama, Phi). Tanto coste-por-request COMO contexto efectivo degradan silenciosamente. Esta tool carga HuggingFace transformers.js en tu navegador (~750 KB CDN) y tokeniza el texto pegado contra 6 tokenizers preset de vendor (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude aprox). Output: token count por tokenizer + chars-per-token + ratio vs baseline + interpretación de asimetría. Auto-detecta bloques de script (Latin / CJK / árabe / cirílico / devanagari / tailandés / griego / hebreo / coreano) para que veas por qué un tokenizer es 3× otro. <em>Caso de uso</em>: 'Mi soporte multilingüe añadió 30% a la factura — ¿qué idioma cuesta más?' → pega texto real de producción, ve breakdown exacto por tokenizer.",
+
   "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — cada pain documentado mapeado a un mode tafagent o herramienta externa curada. No reinventes — encuentra.",
   "help.v081.hub.title": "🧭 Solutions Hub",
   "help.v081.hub.body": "tafagent como integrador, no silo. 30+ pains en 7 categorías (eval reliability · diagnósticos · setup · training · retrieval · multimodal · observability), cada uno mapeado a (a) el mode tafagent que lo resuelve, si existe, y (b) las herramientas externas best-of-breed que la comunidad ya usa (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Caja de búsqueda matchea pain, scenario, y nombre de herramienta. <em>Caso de uso</em>: 'tengo problema X — ¿lo resuelve tafagent, y si no, quién?'",
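Several of the status strings above carry brace placeholders (e.g. `tax.status.done` = "✅ {n}/{total} tokenizers en {ms}ms") that the `tFmt` calls in main.js substitute at render time. The app's actual `tFmt` is not part of this diff, so the helper below is an assumed minimal sketch of that substitution:

```javascript
// Hypothetical stand-in for the app's tFmt helper (not shown in this diff):
// replace each {name} placeholder with the matching param; leave unknown
// placeholders untouched so a missing param stays visible in the UI.
function fmt(template, params) {
  return template.replace(/\{(\w+)\}/g, (match, name) =>
    name in params ? String(params[name]) : match
  );
}
```

For example, `fmt("✅ {n}/{total} tokenizers en {ms}ms", { n: 6, total: 6, ms: 4200 })` fills all three slots, while an unmatched `{foo}` passes through unchanged.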
   "help.v085.speculative.title": "🔬 Compatibilité Speculative-Decode",
   "help.v085.speculative.body": "Le speculative decoding ne marche que si target et draft partagent exactement le même vocabulaire. Des vocabs mismatched font que chaque token du draft est rejeté — vous payez LES DEUX coûts de compute et obtenez un pire débit que la baseline. Pire : le système émet toujours une sortie correcte (juste plus lente), donc le bug est invisible aux tests unitaires. vLLM #4570 / #16757 / #20409 / #12488 surfent les variantes. Cet outil récupère `tokenizer.json` depuis HF Hub pour les deux model ids, compare le type de tokenizer, la taille du vocab, la map complète token→id, les special tokens, et les added tokens, puis estime une bande de speedup basée sur le ratio de params et les taux α=0.5/0.7/0.85 d'acceptation typiques. <em>Cas d'usage</em> : avant de lancer un cluster vLLM avec spec-dec activé, vérifiez que la paire est compatible.",

+  // v0.8.7 — anti-bullshit pack #13: Multilingual Tokenizer Tax
+  "modes.tax": "🌍 Token Tax",
+  "mode_desc.tax": "Encodage BPE réel (côté navigateur via transformers.js) du texte collé sur 6 tokenizers de fournisseurs. Révèle l'asymétrie de coût silencieuse entre langues.",
+  "tax.title": "🌍 Taxe Tokenizer Multilingue",
+  "tax.tip": "Les tokenizers taxent le texte non-anglais de façon asymétrique. Le même paragraphe peut faire 100 tokens en anglais mais 250+ en chinois sur un tokenizer entraîné en Latin (Llama, Phi). Coût par requête ET contexte effectif dégradent silencieusement. Collez votre texte, voyez les vrais token counts à travers les tokenizers fournisseurs — pas d'estimation, BPE réel via transformers.js dans votre navigateur.",
+  "tax.desc": "<strong>Ne multipliez pas votre facture par 3 sur le support chinois.</strong> Collez n'importe quel texte → encodage BPE réel par tokenizer (Qwen / Phi / Llama / Gemma / GPT-4 / Claude) → voyez l'asymétrie de coût vs votre baseline.",
+  "tax.input_label": "Texte à tokenizer :",
+  "tax.input.placeholder": "Collez n'importe quel texte — anglais, chinois, arabe, code, …",
+  "tax.tokenize_btn": "🔬 Tokenizer tous",
+  "tax.sample_en_btn": "↳ Exemple : English",
+  "tax.sample_zh_btn": "↳ Exemple : 中文",
+  "tax.sample_ar_btn": "↳ Exemple : عربى",
+  "tax.sample_mixed_btn": "↳ Exemple : mixte",
+  "tax.sample_code_btn": "↳ Exemple : code",
+  "tax.status.loading": "⏳ Chargement transformers.js + tokenizers (la première exécution peut prendre 5-15s)…",
+  "tax.status.done": "✅ {n}/{total} tokenizers en {ms}ms",
+  "tax.col.tokenizer": "Tokenizer",
+  "tax.col.tokens": "Tokens",
+  "tax.col.cpt": "Chars/tok",
+  "tax.col.ratio": "Ratio",
+  "tax.summary.input": "Entrée : {chars} caractères, {bytes} octets",
+  "tax.script_breakdown": "scripts",
+  "tax.interp.worst": "{label} coûte {pct}% de tokens en plus que la baseline pour ce texte.",
+  "tax.interp.uniform": "✓ Tous les tokenizers à ±5% — texte bien géré par les fournisseurs.",
+  "tax.hint.empty": "Collez du texte puis Tokenizer.",
+  "tax.all_failed": "Tous les tokenizers ont échoué.",
+  "tax.error.gated": "modèle gated (auth HF requise — essayez le mirror open)",
+  "tax.error.not_found": "model id introuvable",
+  "tax.error.timeout": "timeout (gros tokenizer ou connexion lente)",
+  "tax.error.network": "erreur réseau",
+  "tax.error.fetch_failed": "fetch échoué",
+  "tax.error.invalid_input": "entrée invalide",
+  "tax.attribution": "Tokenizers via",
+  "tax.attribution.privacy": "Le texte est tokenizé localement — ne quitte jamais le navigateur.",
+  "tax.firstload_note": "💡 <strong>Premier chargement :</strong> l'outil récupère transformers.js (~750 KB) + le vocab de chaque tokenizer à la demande (~5-15 MB par tokenizer, mis en cache après). Les exécutions suivantes sont instantanées. Tout le traitement est local — votre texte ne quitte jamais le navigateur.",
+  "inv.v087.tax": "<strong>🌍 Token Tax</strong> — encodage BPE réel sur 6 tokenizers fournisseurs. Révèle l'asymétrie de coût silencieuse entre langues (CJK / arabe / mixte).",
+  "help.v087.tax.title": "🌍 Taxe Tokenizer Multilingue",
+  "help.v087.tax.body": "Les tokenizers taxent le texte non-anglais de façon asymétrique. Le même paragraphe peut faire 100 tokens en anglais mais 250+ en chinois sur un tokenizer entraîné en Latin (Llama, Phi). Coût-par-requête ET contexte effectif dégradent silencieusement. Cet outil charge HuggingFace transformers.js dans votre navigateur (~750 KB CDN) et tokenize le texte collé contre 6 tokenizers preset de fournisseurs (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). Sortie : token count par tokenizer + chars-per-token + ratio vs baseline + interprétation d'asymétrie. Auto-détecte les blocs de script (Latin / CJK / arabe / cyrillique / devanagari / thaï / grec / hébreu / coréen) pour voir pourquoi un tokenizer est 3× un autre. <em>Cas d'usage</em> : 'Mon support multilingue a ajouté 30% à la facture — quelle langue coûte le plus ?' → collez du texte de production réel, voyez le breakdown exact par tokenizer.",
+
   "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — chaque pain documenté mappé à un mode tafagent ou outil externe curé. Ne réinventez pas — trouvez.",
   "help.v081.hub.title": "🧭 Solutions Hub",
   "help.v081.hub.body": "tafagent comme intégrateur, pas silo. 30+ pains à travers 7 catégories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), chacun mappé à (a) le mode tafagent qui le résout, s'il existe, et (b) les outils externes best-of-breed que la communauté utilise déjà (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). La barre de recherche matche pain, scénario, et nom d'outil. <em>Cas d'usage</em> : 'j'ai le problème X — tafagent le résout-il, et sinon, qui ?'",
   "help.v085.speculative.title": "🔬 Speculative-Decode 兼容性",
   "help.v085.speculative.body": "Speculative decoding 仅当 target 和 draft 共享完全相同的词汇表时才能工作。Vocab 不匹配导致每个 draft token 被拒绝——你支付双倍计算成本且吞吐量比 baseline 更差。更糟:系统仍输出正确(只是更慢),所以 bug 在单元测试中不可见。vLLM #4570 / #16757 / #20409 / #12488 都显示了变种。这个工具从 HF Hub 获取两个 model id 的 `tokenizer.json`,比较 tokenizer 类型、vocab 大小、完整 token→id 映射、special token 和 added token,然后基于参数比和典型 α=0.5/0.7/0.85 接受率估算 speedup 范围。<em>用例</em>:在启动启用了 spec-dec 的 vLLM 集群之前,验证这对模型是否真的兼容。",

+  // v0.8.7 — anti-bullshit pack #13: Multilingual Tokenizer Tax
+  "modes.tax": "🌍 Token Tax",
+  "mode_desc.tax": "通过浏览器端 transformers.js 对粘贴文本进行 6 个供应商 tokenizer 的真实 BPE 编码。揭示语言间的静默成本不对称。",
+  "tax.title": "🌍 多语言 Tokenizer 税",
+  "tax.tip": "Tokenizer 对非英语文本的征税不对称。同一段落在英语中可能是 100 个 token,但在拉丁字母训练的 tokenizer(Llama、Phi)上的中文可能是 250+ 个 token。每次请求成本和有效上下文都会静默降级。粘贴你的文本,通过供应商 tokenizer 查看实际 token 数——没有估算,通过 transformers.js 在浏览器中真实 BPE 编码。",
+  "tax.desc": "<strong>不要让中文支持把账单变成 3 倍。</strong> 粘贴任意文本 → 通过 Qwen / Phi / Llama / Gemma / GPT-4 / Claude 的真实 BPE 编码 → 查看相对于 baseline 的成本不对称。",
+  "tax.input_label": "要 tokenize 的文本:",
+  "tax.input.placeholder": "粘贴任何文本——英语、中文、阿拉伯语、代码……",
+  "tax.tokenize_btn": "🔬 Tokenize 全部",
+  "tax.sample_en_btn": "↳ 示例:English",
+  "tax.sample_zh_btn": "↳ 示例:中文",
+  "tax.sample_ar_btn": "↳ 示例:عربى",
+  "tax.sample_mixed_btn": "↳ 示例:混合",
+  "tax.sample_code_btn": "↳ 示例:代码",
+  "tax.status.loading": "⏳ 加载 transformers.js + tokenizer(首次运行可能需要 5-15 秒)…",
+  "tax.status.done": "✅ {n}/{total} 个 tokenizer,用时 {ms}ms",
+  "tax.col.tokenizer": "Tokenizer",
+  "tax.col.tokens": "Token 数",
+  "tax.col.cpt": "字符/token",
+  "tax.col.ratio": "比率",
+  "tax.summary.input": "输入:{chars} 字符,{bytes} 字节",
+  "tax.script_breakdown": "脚本",
+  "tax.interp.worst": "{label} 对此文本的 token 数比 baseline 多 {pct}%。",
+  "tax.interp.uniform": "✓ 所有 tokenizer 在 ±5% 范围内——文本在各供应商间处理良好。",
+  "tax.hint.empty": "粘贴文本然后点击 Tokenize。",
+  "tax.all_failed": "所有 tokenizer 都失败了。",
+  "tax.error.gated": "模型受限(需要 HF auth——尝试 open mirror)",
+  "tax.error.not_found": "找不到 model id",
+  "tax.error.timeout": "超时(大 tokenizer 或慢速连接)",
+  "tax.error.network": "网络错误",
+  "tax.error.fetch_failed": "获取失败",
+  "tax.error.invalid_input": "无效输入",
+  "tax.attribution": "Tokenizer 通过",
+  "tax.attribution.privacy": "文本在本地 tokenize——永远不会离开浏览器。",
+  "tax.firstload_note": "💡 <strong>首次加载:</strong>工具按需获取 transformers.js(~750 KB)+ 每个 tokenizer 的词汇表(每个 ~5-15 MB,加载后缓存)。后续运行即时。所有处理都是本地的——你的文本永远不会离开浏览器。",
+  "inv.v087.tax": "<strong>🌍 Token Tax</strong> — 6 个供应商 tokenizer 的真实 BPE 编码。揭示语言间(CJK / 阿拉伯语 / 混合)的静默成本不对称。",
+  "help.v087.tax.title": "🌍 多语言 Tokenizer 税",
+  "help.v087.tax.body": "Tokenizer 对非英语文本的征税不对称。同一段落在英语中可能是 100 个 token,但在拉丁字母训练的 tokenizer(Llama、Phi)上的中文可能是 250+ 个 token。每次请求成本和有效上下文都会静默降级。这个工具在你的浏览器中加载 HuggingFace transformers.js(~750 KB CDN),并对粘贴的文本运行 6 个预设供应商 tokenizer(Qwen2.5、Phi-3.5、Llama-3.1、Gemma-2、GPT-4 cl100k、Claude 近似)的 tokenize。输出:每个 tokenizer 的 token 数 + 字符/token + 相对于 baseline 的比率 + 成本不对称解读。自动检测脚本块(拉丁/CJK/阿拉伯/西里尔/天城/泰/希腊/希伯来/韩文)让你看到为什么一个 tokenizer 是另一个的 3 倍。<em>用例</em>:『我的多语言支持给账单加了 30%——哪种语言成本最高?』→ 粘贴真实生产文本,查看每个 tokenizer 的精确分解。",
+
   "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — 每个文档化的问题都映射到一个 tafagent 模式或精选外部工具。别重复发明 — 去找。",
   "help.v081.hub.title": "🧭 Solutions Hub",
   "help.v081.hub.body": "tafagent 作为集成者而非孤岛。30+ 问题跨 7 类别(评估可靠性 · 诊断 · 设置 · 训练 · 检索 · 多模态 · 可观测性),每个映射到(a)解决它的 tafagent 模式(若存在),以及(b)社区已信任的最佳外部工具(RAGAS、MTEB、HELM、MCP Schema Validator、llm-stats、llguidance、GlitchMiner 等)。搜索框匹配 pain、场景和工具名称。<em>用例</em>:'我有问题 X — tafagent 解决它吗,如果不,谁解决?'",
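The commit claims 36 new keys × 4 locales = 144 keys with clean parity. That claim can be checked mechanically, assuming each locale bundle is a flat key→string map like the blocks above (the helper name here is illustrative, not part of the repo):

```javascript
// Return i18n keys that are missing from any locale; an empty array
// means key parity holds across all locales.
function i18nParityGaps(locales) {
  const names = Object.keys(locales);
  const allKeys = new Set(names.flatMap(n => Object.keys(locales[n])));
  const gaps = [];
  for (const key of allKeys) {
    const missing = names.filter(n => !(key in locales[n]));
    if (missing.length > 0) gaps.push({ key, missing });
  }
  return gaps;
}
```

Running this over the EN/ES/FR/ZH maps in CI catches a forgotten translation before it ships as a raw key in the UI.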
@@ -31,6 +31,10 @@ import { lintJsonCot, reorderJsonText, classifyFieldName } from "./json_cot_lint
 import { lintPeftCode, ARCH_TARGET_MODULES } from "./peft_anti_pattern.js";
 import { diffPromptCache, PROVIDERS as CACHE_PROVIDERS } from "./prompt_cache_diff.js";
 import { checkCompatibility as specCheckCompat, parseParamHint } from "./spec_decode_compat.js";
+import {
+  tokenizeAll, detectLanguageBlocks,
+  PRESET_TOKENIZERS as TAX_PRESETS, SAMPLE_TEXTS as TAX_SAMPLES,
+} from "./tokenizer_tax.js";

 // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
 // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.

@@ -224,6 +228,7 @@ document.addEventListener("click", (e) => {
     peft: "peft-section",
     cache: "cache-section",
     speculative: "speculative-section",
+    tax: "tax-section",
     hub: "hub-section",
   }[targetMode];
   if (sectionId) {

@@ -249,7 +254,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
     "diagnose-section", "phase-section", "unmask-section",
     "template-section", "arena-section", "contam-section",
     "quant-section", "drift-section", "niah-section",
-    "saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "hub-section"].forEach(id => {
+    "saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "tax-section", "hub-section"].forEach(id => {
     const el = $(id);
     if (el) el.style.display = "none";
   });

@@ -265,6 +270,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
     peft: "peft-section",
     cache: "cache-section",
     speculative: "speculative-section",
+    tax: "tax-section",
     hub: "hub-section",
   };
   const sectionId = sectionMap[mode];

@@ -276,6 +282,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
     if (mode === "peft") initPeft();
     if (mode === "cache") initCacheDiff();
     if (mode === "speculative") initSpeculative();
+    if (mode === "tax") initTax();
     if (mode === "hub") initHub();
   });
 });
@@ -4248,6 +4255,161 @@ $("spec-example-bad-btn")?.addEventListener("click", () => {
 // (HF autocomplete on spec-target-id / spec-draft-id is registered via
 // the known-id list in hf_autocomplete.js; no extra wiring needed here.)

+// ════════════════════════════════════════════════════════════════════
+// 🌍 Multilingual Tokenizer Tax (v0.8.7 anti-bullshit pack #13)
+// ════════════════════════════════════════════════════════════════════
+let __taxInited = false;
+
+function initTax() {
+  if (__taxInited) return;
+  __taxInited = true;
+  // No async preload — transformers.js + tokenizer.json are lazy-loaded
+  // on the first Tokenize click so users don't pay download cost just
+  // for opening the tab. Status string explains the wait.
+}
+
+function fmtBlocks(blocks) {
+  // Build a compact "60% latin · 35% cjk · 5% other" string from the
+  // detector output. Drops zero-counts and orders by descending size.
+  if (!blocks || !blocks.blocks || !blocks.total_chars) return "";
+  const total = blocks.total_chars;
+  const entries = Object.entries(blocks.blocks)
+    .filter(([, n]) => n > 0)
+    .sort((a, b) => b[1] - a[1]);
+  if (entries.length === 0) return "";
+  const parts = entries.map(([name, n]) => {
+    const pct = Math.round((n / total) * 100);
+    return `${pct}% ${name}`;
+  });
+  return parts.join(" · ");
+}
+
+function renderTaxResult(res, presetMeta) {
+  if (res.code === "empty_input") {
+    return `<div class="arena-result"><p>${t("tax.hint.empty") || "Paste some text and click Tokenize."}</p></div>`;
+  }
+  if (res.code === "all_failed") {
+    const errLines = res.results.map(r => {
+      const meta = presetMeta.find(p => p.id === r.modelId);
+      return `<li><code>${escapeHtml(r.modelId)}</code> ${meta ? `<span class="subtle">(${escapeHtml(meta.label)})</span>` : ""}: ${t(`tax.error.${r.error}`) || r.error}</li>`;
+    }).join("");
+    return `<div class="arena-result"><p style="color:#f85149;"><strong>❌ ${t("tax.all_failed") || "All tokenizers failed to load."}</strong></p><ul>${errLines}</ul></div>`;
+  }
+
+  const blocks = detectLanguageBlocks($("tax-input").value);
+  const ratioColor = (r) => {
+    if (r == null) return "#8b949e";
+    if (r >= 1.5) return "#f85149";  // big tax — red
+    if (r >= 1.15) return "#f0883e"; // moderate
+    if (r >= 0.85) return "#3fb950"; // about same
+    return "#58a6ff";                // BETTER than baseline (rare)
+  };
+  const fmtRatio = (r) => r == null ? "—" : `${r.toFixed(2)}×`;
+
+  const rows = res.results.map(r => {
+    const meta = presetMeta.find(p => p.id === r.modelId) || { label: r.modelId, family: "" };
+    if (!r.ok) {
+      return `<tr style="opacity:0.5;">
+        <td><strong>${escapeHtml(meta.label)}</strong><br><span class="subtle" style="font-size:0.8em;">${escapeHtml(meta.family)}</span></td>
+        <td colspan="3" style="color:#f0883e;">${t(`tax.error.${r.error}`) || r.error}</td>
+      </tr>`;
+    }
+    const isBaseline = r.modelId === res.baseline_id;
+    const baselineMark = isBaseline ? `<span class="subtle" style="font-size:0.8em;"> (baseline)</span>` : "";
+    return `<tr ${isBaseline ? 'style="background:#1f2933;"' : ""}>
+      <td><strong>${escapeHtml(meta.label)}</strong>${baselineMark}<br><span class="subtle" style="font-size:0.8em;">${escapeHtml(meta.family)}</span></td>
+      <td style="text-align:right;font-family:monospace;"><strong>${r.token_count.toLocaleString()}</strong></td>
+      <td style="text-align:right;font-family:monospace;">${r.chars_per_token != null ? r.chars_per_token.toFixed(2) : "—"}</td>
+      <td style="text-align:right;font-family:monospace;color:${ratioColor(r.ratio_vs_baseline)};"><strong>${fmtRatio(r.ratio_vs_baseline)}</strong></td>
+    </tr>`;
+  }).join("");
+
+  // Worst-tax explanation — warn when the worst tokenizer is ≥1.3× baseline.
+  const worst = res.results
+    .filter(r => r.ok && r.ratio_vs_baseline != null)
+    .sort((a, b) => b.ratio_vs_baseline - a.ratio_vs_baseline)[0];
+  let interpretation = "";
+  if (worst && worst.ratio_vs_baseline >= 1.3) {
+    const meta = presetMeta.find(p => p.id === worst.modelId);
+    const pct = Math.round((worst.ratio_vs_baseline - 1) * 100);
+    interpretation = `<p style="color:#f0883e;margin-top:0.5em;">⚠ <strong>${tFmt("tax.interp.worst", {
+      label: meta?.label || worst.modelId,
+      pct,
+    }) || `${meta?.label || worst.modelId} costs ${pct}% more tokens than baseline for this text.`}</strong></p>`;
+  } else if (worst && worst.ratio_vs_baseline <= 1.05) {
+    interpretation = `<p style="color:#3fb950;margin-top:0.5em;">${t("tax.interp.uniform") || "✓ All tokenizers within ±5% — text is well-handled across vendors."}</p>`;
+  }
+
+  return `<div class="arena-result">
+    <p>
+      <strong>${tFmt("tax.summary.input", { chars: res.chars.toLocaleString(), bytes: res.bytes.toLocaleString() }) || `Input: ${res.chars.toLocaleString()} chars, ${res.bytes.toLocaleString()} bytes`}</strong>
+      ${blocks.dominant ? `<span class="subtle"> · ${t("tax.script_breakdown") || "scripts"}: ${fmtBlocks(blocks)}</span>` : ""}
+    </p>
+    ${interpretation}
+    <table class="lean-table" style="margin-top:0.5em;width:100%;">
+      <thead><tr>
+        <th style="text-align:left;">${t("tax.col.tokenizer") || "Tokenizer"}</th>
+        <th style="text-align:right;">${t("tax.col.tokens") || "Tokens"}</th>
+        <th style="text-align:right;">${t("tax.col.cpt") || "Chars/tok"}</th>
+        <th style="text-align:right;">${t("tax.col.ratio") || "Ratio"}</th>
+      </tr></thead>
+      <tbody>${rows}</tbody>
+    </table>
+    <p class="recipe-desc subtle" style="font-size:0.82em;margin-top:1em;">
+      ${t("tax.attribution") || "Tokenizers via"}
+      <a href="https://github.com/huggingface/transformers.js" target="_blank" rel="noopener noreferrer">@huggingface/transformers</a>
+      (browser BPE runtime).
+      ${t("tax.attribution.privacy") || "Text is tokenized locally — never leaves the browser."}
+    </p>
+  </div>`;
+}
+
+async function runTaxTokenize() {
+  const text = $("tax-input")?.value || "";
+  if (!text) {
+    $("tax-status").textContent = t("tax.hint.empty") || "⚠ Paste some text first.";
+    return;
+  }
+  $("tax-status").textContent = t("tax.status.loading") || "⏳ Loading transformers.js + tokenizers (first run can take 5-15s)…";
+  $("tax-output").innerHTML = "";
+  const ids = TAX_PRESETS.map(p => p.id);
+  try {
+    const t0 = Date.now();
+    const res = await tokenizeAll(ids, text);
+    const ms = Date.now() - t0;
+    $("tax-output").innerHTML = renderTaxResult(res, TAX_PRESETS);
+    const okN = res.results.filter(r => r.ok).length;
+    $("tax-status").textContent = tFmt("tax.status.done", {
+      n: okN, total: ids.length, ms,
+    }) || `✅ ${okN}/${ids.length} tokenizers ran in ${ms}ms`;
+  } catch (e) {
+    $("tax-status").textContent = `❌ ${e.message || e}`;
+  }
+}
+
+$("tax-tokenize-btn")?.addEventListener("click", runTaxTokenize);
+$("tax-sample-en-btn")?.addEventListener("click", () => {
+  $("tax-input").value = TAX_SAMPLES.english;
+  runTaxTokenize();
+});
+$("tax-sample-zh-btn")?.addEventListener("click", () => {
+  $("tax-input").value = TAX_SAMPLES.chinese;
+  runTaxTokenize();
+});
+$("tax-sample-ar-btn")?.addEventListener("click", () => {
+  $("tax-input").value = TAX_SAMPLES.arabic;
+  runTaxTokenize();
+});
+$("tax-sample-mixed-btn")?.addEventListener("click", () => {
+  $("tax-input").value = TAX_SAMPLES.mixed;
+  runTaxTokenize();
+});
+$("tax-sample-code-btn")?.addEventListener("click", () => {
+  $("tax-input").value = TAX_SAMPLES.code;
+  runTaxTokenize();
+});
+
 // ════════════════════════════════════════════════════════════════════
 // Bootstrap
 // ════════════════════════════════════════════════════════════════════
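The color coding and interpretation in renderTaxResult hinge on a handful of ratio thresholds (≥1.5× red, ≥1.15× amber, ≥0.85 green; worst ≥1.3× warning, worst ≤1.05× uniform). For illustration, the same bands extracted as pure functions; the function names are mine, the thresholds mirror the code above:

```javascript
// Ratio-vs-baseline → severity band, mirroring ratioColor in renderTaxResult.
function taxBand(ratio) {
  if (ratio == null) return "unknown";
  if (ratio >= 1.5) return "red";    // heavy tokenizer tax
  if (ratio >= 1.15) return "amber"; // moderate tax
  if (ratio >= 0.85) return "green"; // roughly at baseline
  return "blue";                     // cheaper than baseline (rare)
}

// Worst-offender interpretation: warn at >=1.3x, call it uniform at <=1.05x.
function interpretWorst(ratios) {
  const worst = Math.max(...ratios);
  if (worst >= 1.3) return { kind: "worst", pctOver: Math.round((worst - 1) * 100) };
  if (worst <= 1.05) return { kind: "uniform" };
  return { kind: "none" };
}
```

Keeping the thresholds in pure functions like these makes the bands unit-testable without touching the DOM rendering.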
@@ -0,0 +1,221 @@
| 1 |
+
// Multilingual Tokenizer Tax Calculator (v0.8.7 anti-bullshit pack #13)
|
| 2 |
+
//
|
| 3 |
+
// Pain: "I bought 1M tokens of API credit for our English chatbot. Then
|
| 4 |
+
// we added Chinese support and the bill 3x'd overnight." The tokenizer
|
| 5 |
+
// tax is real and silently asymmetric across languages. tiktokenizer.
|
| 6 |
+
// vercel.app shows OpenAI's tokenizer; nothing public compares Llama vs
|
| 7 |
+
// Qwen vs Phi vs Gemma vs GPT for the SAME text in the SAME interface.
|
| 8 |
+
//
|
| 9 |
+
// This module loads HuggingFace's transformers.js (browser-side BPE
|
| 10 |
+
// runtime) lazily and tokenizes user-pasted text against a preset list
|
| 11 |
+
// of open-weight tokenizers. The output is REAL per-tokenizer token
|
| 12 |
+
// counts plus the cost asymmetry ratio (vs the user's chosen baseline).
|
| 13 |
+
//
|
| 14 |
+
// Pure logic + lazy CDN import. Codes/params only; main.js renders i18n.
|
| 15 |
+
|
| 16 |
+
// =============================================================================
|
| 17 |
+
// transformers.js lazy loader
|
| 18 |
+
// =============================================================================
|
| 19 |
+
//
|
| 20 |
+
// Pinned 3.x major because the API surface (AutoTokenizer.from_pretrained,
|
| 21 |
+
// .encode) is stable. Loaded from jsdelivr CDN — same pattern used
|
| 22 |
+
// across HF Spaces. ~3 MB compressed bundle, cached aggressively after
|
| 23 |
+
// first load.
|
| 24 |
+
|
| 25 |
+
const TRANSFORMERS_CDN_URL = "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.2/dist/transformers.min.js";
|
| 26 |
+
|
| 27 |
+
let _autoTokenizer = null;
|
| 28 |
+
let _loadPromise = null;
|
| 29 |
+
|
| 30 |
+
async function loadTransformersJs() {
|
| 31 |
+
if (_autoTokenizer) return _autoTokenizer;
|
| 32 |
+
if (_loadPromise) return _loadPromise;
|
| 33 |
+
_loadPromise = (async () => {
|
| 34 |
+
const mod = await import(TRANSFORMERS_CDN_URL);
|
| 35 |
+
_autoTokenizer = mod.AutoTokenizer;
|
| 36 |
+
return _autoTokenizer;
|
| 37 |
+
})();
|
| 38 |
+
return _loadPromise;
|
| 39 |
+
}
|
| 40 |
+
|
| 41 |
+
// =============================================================================
|
| 42 |
+
// Per-tokenizer cache (avoid re-downloading tokenizer.json on every encode)
|
| 43 |
+
// =============================================================================
|
| 44 |
+
|
| 45 |
+
const _tokenizerCache = new Map();
|
| 46 |
+
|
| 47 |
+
async function loadTokenizer(modelId) {
|
| 48 |
+
if (_tokenizerCache.has(modelId)) return _tokenizerCache.get(modelId);
|
| 49 |
+
const AT = await loadTransformersJs();
|
| 50 |
+
const tok = await AT.from_pretrained(modelId);
|
| 51 |
+
_tokenizerCache.set(modelId, tok);
|
| 52 |
+
return tok;
|
| 53 |
+
}
|
| 54 |
+
|
| 55 |
+
// =============================================================================
|
| 56 |
+
// Public: tokenize one model
|
| 57 |
+
// =============================================================================
|
| 58 |
+
|
| 59 |
+
export async function tokenizeWithModel(modelId, text) {
|
| 60 |
+
if (typeof text !== "string") {
|
| 61 |
+
return { ok: false, modelId, error: "invalid_input" };
|
| 62 |
+
}
|
| 63 |
+
try {
|
| 64 |
+
const tok = await loadTokenizer(modelId);
|
| 65 |
+
// transformers.js returns Int32Array | number[]. Use .length for count.
|
| 66 |
+
const ids = await tok.encode(text);
|
| 67 |
+
return { ok: true, modelId, token_count: ids.length };
|
| 68 |
+
} catch (e) {
|
| 69 |
+
return {
|
| 70 |
+
ok: false,
|
| 71 |
+
modelId,
|
| 72 |
+
error: classifyTokenizerError(e),
|
| 73 |
+
raw: String(e?.message || e).slice(0, 200),
|
| 74 |
+
};
|
| 75 |
+
}
|
| 76 |
+
}
|
| 77 |
+
|
| 78 |
+
function classifyTokenizerError(e) {
|
| 79 |
+
const msg = String(e?.message || e).toLowerCase();
|
| 80 |
+
if (msg.includes("401") || msg.includes("403") || msg.includes("gated")) return "gated";
|
| 81 |
+
if (msg.includes("404") || msg.includes("not found")) return "not_found";
|
| 82 |
+
if (msg.includes("timeout") || msg.includes("aborted")) return "timeout";
|
| 83 |
+
if (msg.includes("network") || msg.includes("failed to fetch")) return "network";
|
| 84 |
+
return "fetch_failed";
|
| 85 |
+
}
|
| 86 |
+
|
| 87 |
+
// =============================================================================
|
| 88 |
+
// Public: tokenize many models in parallel + compute ratios
|
| 89 |
+
// =============================================================================
|
| 90 |
+
|
| 91 |
+
export async function tokenizeAll(modelIds, text, baseline_idx = 0) {
  if (!Array.isArray(modelIds) || modelIds.length === 0 || typeof text !== "string") {
    return { code: "empty_input", results: [], baseline: null };
  }
  const results = await Promise.all(
    modelIds.map(id => tokenizeWithModel(id, text))
  );
  const okResults = results.filter(r => r.ok);
  if (okResults.length === 0) {
    return { code: "all_failed", results, baseline: null };
  }
  // Baseline: first OK tokenizer, or the user-specified index if it's OK.
  let baseline = okResults[0];
  if (baseline_idx >= 0 && baseline_idx < results.length && results[baseline_idx].ok) {
    baseline = results[baseline_idx];
  }
  // Stamp ratio vs baseline + chars-per-token for each.
  const charCount = text.length;
  const byteCount = new TextEncoder().encode(text).length;
  for (const r of results) {
    if (!r.ok) continue;
    r.chars_per_token = r.token_count > 0 ? charCount / r.token_count : null;
    r.bytes_per_token = r.token_count > 0 ? byteCount / r.token_count : null;
    r.ratio_vs_baseline = baseline.token_count > 0
      ? r.token_count / baseline.token_count
      : null;
  }
  return {
    code: "ok",
    results,
    baseline_id: baseline.modelId,
    baseline_count: baseline.token_count,
    chars: charCount,
    bytes: byteCount,
  };
}
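
A small UI-side sketch of how `ratio_vs_baseline` maps to the color bands described in the changelog (red at 1.5x and above, amber at 1.15x and above, green otherwise). `ratioColor` is a hypothetical helper, not part of this module, and the "gray" case for failed tokenizers is an assumption:

```javascript
// Hypothetical helper (not in the module): maps ratio_vs_baseline to a
// color band. Thresholds mirror the changelog: red >= 1.5x, amber >= 1.15x.
function ratioColor(ratio) {
  if (ratio == null) return "gray"; // tokenizer failed; no ratio available
  if (ratio >= 1.5) return "red";   // severe token tax vs baseline
  if (ratio >= 1.15) return "amber"; // noticeable tax
  return "green";                    // within normal variance
}

console.log(ratioColor(2.27)); // "red" (e.g. Phi-3.5 on the ZH sample)
console.log(ratioColor(0.98)); // "green"
```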

// =============================================================================
// Language detection — Unicode block analysis (no external deps)
// =============================================================================
//
// Surfaced as context next to the token counts so users see "this text
// is 60% CJK, 40% Latin" — explains why one tokenizer is 3× another.

const UNICODE_BLOCKS = [
  // [name, regex_class] (ranges as \u escapes to survive re-encoding)
  ["latin", /[A-Za-z]/g],
  // Hiragana + Katakana + CJK Unified Ideographs
  ["cjk", /[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]/g],
  // Hangul Jamo + Hangul Syllables
  ["korean", /[\u1100-\u11FF\uAC00-\uD7A3]/g],
  // Arabic + Arabic Supplement
  ["arabic", /[\u0600-\u06FF\u0750-\u077F]/g],
  ["cyrillic", /[\u0400-\u04FF]/g],
  ["devanagari", /[\u0900-\u097F]/g],
  ["thai", /[\u0E00-\u0E7F]/g],
  ["greek", /[\u0370-\u03FF]/g],
  ["hebrew", /[\u0590-\u05FF]/g],
];

export function detectLanguageBlocks(text) {
  if (typeof text !== "string" || !text) {
    return { total_chars: 0, blocks: {}, dominant: null };
  }
  const blocks = {};
  for (const [name, re] of UNICODE_BLOCKS) {
    // .match with a /g regex returns all matches (and ignores lastIndex).
    const m = text.match(re);
    blocks[name] = m ? m.length : 0;
  }
  const total = text.length;
  const dominant = Object.entries(blocks)
    .filter(([, n]) => n > 0)
    .sort((a, b) => b[1] - a[1])[0]?.[0] || null;
  return { total_chars: total, blocks, dominant };
}
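
A hypothetical follow-on helper (not part of the module) showing how a `detectLanguageBlocks()` result becomes the "92% CJK" readout; the `{ total_chars, blocks }` input shape matches the function above, and the rounding choice is an assumption:

```javascript
// Hypothetical helper: formats a detectLanguageBlocks() result as a
// dominant-block percentage, e.g. { name: "cjk", percent: 92 }.
function dominantPercent({ total_chars, blocks }) {
  if (!total_chars) return null;
  const top = Object.entries(blocks)
    .filter(([, n]) => n > 0)
    .sort((a, b) => b[1] - a[1])[0];
  return top
    ? { name: top[0], percent: Math.round((top[1] / total_chars) * 100) }
    : null;
}

// 46 of 50 chars fall in the CJK block:
console.log(dominantPercent({ total_chars: 50, blocks: { cjk: 46, latin: 4 } }));
// { name: "cjk", percent: 92 }
```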

// =============================================================================
// Preset tokenizer list — all open-weight (no HF auth required)
// =============================================================================
//
// Curated for breadth: one per major tokenizer family. For gated
// originals (Llama, Mistral, Gemma) the unsloth open mirror is used —
// tokenizer.json is byte-identical to the original because quantization
// touches weights, not tokens (see spec-decode docs for the same
// argument).

export const PRESET_TOKENIZERS = [
  {
    id: "Qwen/Qwen2.5-7B-Instruct",
    label: "Qwen2.5",
    family: "Qwen-BPE (152k vocab, CJK-aware)",
  },
  {
    id: "microsoft/Phi-3.5-mini-instruct",
    label: "Phi-3.5",
    family: "SentencePiece (32k, Llama-2 lineage)",
  },
  {
    id: "unsloth/Meta-Llama-3.1-8B-Instruct",
    label: "Llama-3.1",
    family: "Llama-3 BPE (128k)",
  },
  {
    id: "unsloth/gemma-2-9b-it",
    label: "Gemma-2",
    family: "SentencePiece (256k)",
  },
  {
    id: "Xenova/gpt-4",
    label: "GPT-4 (cl100k)",
    family: "OpenAI tiktoken cl100k_base",
  },
  {
    id: "Xenova/claude-tokenizer",
    label: "Claude (approx)",
    family: "Anthropic open approx (community port)",
  },
];
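
A minimal sanity-check sketch (hypothetical, not in the module): preset ids must be unique and shaped like `org/repo`, since they are passed straight to the tokenizer loader as HuggingFace repo ids:

```javascript
// Hypothetical validation helper: every preset id is unique and looks
// like a HuggingFace "org/repo" identifier.
function validatePresets(presets) {
  const ids = presets.map(p => p.id);
  const unique = new Set(ids).size === ids.length;
  const wellFormed = ids.every(id => /^[\w.-]+\/[\w.-]+$/.test(id));
  return unique && wellFormed;
}

console.log(validatePresets([
  { id: "Qwen/Qwen2.5-7B-Instruct" },
  { id: "Xenova/gpt-4" },
])); // true
```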

// Sample texts that demonstrate cost asymmetry — identical meaning
// across languages so the user sees per-language tax directly.
export const SAMPLE_TEXTS = {
  english: "The quick brown fox jumps over the lazy dog. " +
    "She sells seashells by the seashore. Pack my box with five dozen liquor jugs.",
  chinese: "敏捷的棕色狐狸跳过了懒狗。她在海边卖海贝壳。请用五打酒壶装满我的箱子。" +
    "中文用字符表示词义,所以一段文字所需的字符数远少于英文。",
  arabic: "الثعلب البني السريع يقفز فوق الكلب الكسول. " +
    "تبيع أصدافًا بحرية على شاطئ البحر. عبئ صندوقي بخمسين إبريقًا من الخمر.",
  mixed: "Hello world! 你好世界 مرحبا بالعالم Привет мир नमस्ते दुनिया",
  code: "def quick_brown_fox(jumps_over: int) -> str:\n" +
    "    return f'The fox jumped {jumps_over} times'\n\n" +
    "for i in range(10):\n    print(quick_brown_fox(i))",
};
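
A worked sketch of the per-token arithmetic: CJK characters cost 3 bytes each in UTF-8, which is why `tokenizeAll` reports `bytes_per_token` alongside `chars_per_token`. `perTokenStats` is a hypothetical standalone helper mirroring the `charCount`/`byteCount` math above, and the token count in the example is illustrative, not a measured value:

```javascript
// Hypothetical helper mirroring the chars/bytes math in tokenizeAll.
function perTokenStats(text, tokenCount) {
  const chars = text.length;                           // UTF-16 code units
  const bytes = new TextEncoder().encode(text).length; // UTF-8 bytes
  return {
    chars_per_token: chars / tokenCount,
    bytes_per_token: bytes / tokenCount,
  };
}

// 12 CJK chars = 36 UTF-8 bytes; an assumed 24 tokens means the tokenizer
// spent ~2 tokens per character (0.5 chars/token, 1.5 bytes/token).
console.log(perTokenStats("敏捷的棕色狐狸跳过了懒狗", 24));
// { chars_per_token: 0.5, bytes_per_token: 1.5 }
```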