karlexmarin Claude Opus 4.7 (1M context) commited on
Commit
cd27f27
·
1 Parent(s): e9f9ac5

v0.8.7 Multilingual Tokenizer Tax Calculator — anti-bullshit pack #13

Browse files

Pain: tokenizers tax non-English text asymmetrically. The same
paragraph might be 100 tokens in English but 250+ tokens in Chinese
on a Latin-trained tokenizer (Llama, Phi). Both per-request cost
AND effective context degrade silently. tiktokenizer.vercel.app
covers OpenAI's cl100k only; nothing public compares Llama vs Qwen
vs Phi vs Gemma vs GPT vs Claude in one interface.

🌍 Token Tax (20th mode):
- Lazy-imports HuggingFace transformers.js (~750 KB pinned to 3.0.2,
jsdelivr CDN). First-mode-open pays the cost; subsequent runs
instant after browser cache.
- Tokenizes user-pasted text against 6 preset open-weight tokenizers
(Qwen/Qwen2.5-7B-Instruct, microsoft/Phi-3.5-mini-instruct,
unsloth/Meta-Llama-3.1-8B-Instruct, unsloth/gemma-2-9b-it,
Xenova/gpt-4 cl100k port, Xenova/claude-tokenizer community port).
All open — no HF auth required. Llama/Gemma use the unsloth open
mirrors that ship the byte-identical tokenizer.json (quantization
touches weights, not tokens).
- Output: per-tokenizer token count, chars-per-token, ratio vs
baseline, color-coded (red ≥1.5×, amber ≥1.15×, green within 5%).
Worst-tax interpretation surfaces the loudest mismatch
automatically.
- Auto-detects Unicode script blocks (Latin / CJK / Korean / Arabic
/ Cyrillic / Devanagari / Thai / Greek / Hebrew) so users see
"92% CJK" alongside "Phi-3.5 = 2.27×" → instantly understand
the WHY.
- 5 sample buttons (English / 中文 / عربى / mixed / code) for
one-click demo coverage.

Pure logic in `js/tokenizer_tax.js` (lazy CDN import + tokenizer
cache + parallel tokenize + script detection). 36 i18n keys × 4
langs (EN/ES/FR/ZH) = 144 keys, parity clean. Help modal v0.8.7
entry + Inventory + "Set up an eval correctly" task tile.

Privacy-by-design: all tokenization is local — pasted text never
leaves the browser. Status note explains first-load latency
(~5-15s for 6 tokenizers in parallel, then cached).

Verified locally: ZH sample (92% CJK) yields:
Qwen2.5 baseline 44 tokens (1.43 chars/tok) 1.00×
Phi-3.5 100 tokens (0.63 chars/tok) 2.27× ⚠
Llama-3.1 60 tokens (1.05 chars/tok) 1.36×
Gemma-2 49 tokens (1.29 chars/tok) 1.11×
GPT-4 cl100k 81 tokens (0.78 chars/tok) 1.84×
Claude (approx) 70 tokens (0.90 chars/tok) 1.59×

Phi's BPE (32k vocab, no CJK pre-training) charges 2.27× over
Qwen for the SAME Chinese paragraph. That is the silent tax this
tool surfaces.

Refs:
- https://github.com/huggingface/transformers.js
- https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
- https://huggingface.co/Xenova/gpt-4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (4) hide show
  1. index.html +35 -0
  2. js/i18n.js +156 -0
  3. js/main.js +163 -1
  4. js/tokenizer_tax.js +221 -0
index.html CHANGED
@@ -228,6 +228,9 @@
228
  <p><strong data-i18n="help.v085.speculative.title">🔬 Speculative-Decode Compatibility</strong></p>
229
  <p data-i18n="help.v085.speculative.body">Speculative decoding only works if target and draft share the exact same vocabulary. Mismatched vocabs cause every draft token to be rejected — you pay BOTH compute costs and get worse throughput than baseline. Worse, the system still emits correct output (just slower), so the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 / #12488 all surface variants. This tool fetches `tokenizer.json` from HF Hub for both model ids, compares tokenizer type, vocab size, full token→id map, special tokens, and added tokens, then estimates a speedup band based on param ratio and typical α=0.5/0.7/0.85 acceptance rates. <em>Use case</em>: before you launch a vLLM cluster with spec-dec enabled, verify the pair is actually compatible.</p>
230
 
 
 
 
231
  <p><strong data-i18n="help.v081.hub.title">🧭 Solutions Hub</strong></p>
232
  <p data-i18n="help.v081.hub.body">tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'</p>
233
 
@@ -344,6 +347,7 @@
344
  <li data-i18n="inv.v083.peft"><strong>🔧 PEFT Lint</strong> — catches the silent <code>get_peft_model</code> base-load (peft #2115) + QLoRA order + target_modules / arch mismatch.</li>
345
  <li data-i18n="inv.v084.cache"><strong>🔁 Cache Diff</strong> — predicts whether a prompt edit invalidated the provider's prompt cache. Per-provider hit ratio + $ delta.</li>
346
  <li data-i18n="inv.v085.speculative"><strong>🔬 Spec-Decode</strong> — verifies tokenizer vocab compatibility between target + draft before you ship speculative decoding (the bug that gives WORSE throughput silently).</li>
 
347
  <li data-i18n="inv.v081.hub"><strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.</li>
348
  </ul>
349
  </details>
@@ -419,6 +423,7 @@
419
  <button data-mode-link="peft" data-i18n="modes.peft">🔧 PEFT Lint</button>
420
  <button data-mode-link="cache" data-i18n="modes.cache">🔁 Cache Diff</button>
421
  <button data-mode-link="speculative" data-i18n="modes.speculative">🔬 Spec-Decode</button>
 
422
  </div>
423
  </div>
424
  <div class="task-tile">
@@ -479,6 +484,7 @@
479
  <button class="mode-btn" data-mode="peft" role="tab" aria-selected="false" data-i18n="modes.peft">🔧 PEFT Lint</button>
480
  <button class="mode-btn" data-mode="cache" role="tab" aria-selected="false" data-i18n="modes.cache">🔁 Cache Diff</button>
481
  <button class="mode-btn" data-mode="speculative" role="tab" aria-selected="false" data-i18n="modes.speculative">🔬 Spec-Decode</button>
 
482
  <button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
483
  </div>
484
  <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
@@ -1143,6 +1149,35 @@
1143
  <div id="spec-output" style="margin-top: 1em;"></div>
1144
  </section>
1145
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1146
  <section id="hub-section" style="display:none;">
1147
  <h2><span data-i18n="hub.title">🧭 Solutions Hub</span>
1148
  <span class="info"><span class="tooltip" data-i18n="hub.tip">
 
228
  <p><strong data-i18n="help.v085.speculative.title">🔬 Speculative-Decode Compatibility</strong></p>
229
  <p data-i18n="help.v085.speculative.body">Speculative decoding only works if target and draft share the exact same vocabulary. Mismatched vocabs cause every draft token to be rejected — you pay BOTH compute costs and get worse throughput than baseline. Worse, the system still emits correct output (just slower), so the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 / #12488 all surface variants. This tool fetches `tokenizer.json` from HF Hub for both model ids, compares tokenizer type, vocab size, full token→id map, special tokens, and added tokens, then estimates a speedup band based on param ratio and typical α=0.5/0.7/0.85 acceptance rates. <em>Use case</em>: before you launch a vLLM cluster with spec-dec enabled, verify the pair is actually compatible.</p>
230
 
231
+ <p><strong data-i18n="help.v087.tax.title">🌍 Multilingual Tokenizer Tax</strong></p>
232
+ <p data-i18n="help.v087.tax.body">Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ in Chinese on a Latin-trained tokenizer (Llama, Phi). Both cost-per-request AND effective context degrade silently. This tool loads HuggingFace transformers.js in your browser (~750 KB CDN) and tokenizes pasted text against 6 preset vendor tokenizers (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). <em>Use case</em>: 'My multilingual support added 30% to the bill — which language costs the most?' → paste real production text, see exact per-tokenizer breakdown.</p>
233
+
234
  <p><strong data-i18n="help.v081.hub.title">🧭 Solutions Hub</strong></p>
235
  <p data-i18n="help.v081.hub.body">tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'</p>
236
 
 
347
  <li data-i18n="inv.v083.peft"><strong>🔧 PEFT Lint</strong> — catches the silent <code>get_peft_model</code> base-load (peft #2115) + QLoRA order + target_modules / arch mismatch.</li>
348
  <li data-i18n="inv.v084.cache"><strong>🔁 Cache Diff</strong> — predicts whether a prompt edit invalidated the provider's prompt cache. Per-provider hit ratio + $ delta.</li>
349
  <li data-i18n="inv.v085.speculative"><strong>🔬 Spec-Decode</strong> — verifies tokenizer vocab compatibility between target + draft before you ship speculative decoding (the bug that gives WORSE throughput silently).</li>
350
+ <li data-i18n="inv.v087.tax"><strong>🌍 Token Tax</strong> — real BPE encoding across 6 vendor tokenizers. Surfaces the silent cost asymmetry across languages (CJK / Arabic / mixed).</li>
351
  <li data-i18n="inv.v081.hub"><strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.</li>
352
  </ul>
353
  </details>
 
423
  <button data-mode-link="peft" data-i18n="modes.peft">🔧 PEFT Lint</button>
424
  <button data-mode-link="cache" data-i18n="modes.cache">🔁 Cache Diff</button>
425
  <button data-mode-link="speculative" data-i18n="modes.speculative">🔬 Spec-Decode</button>
426
+ <button data-mode-link="tax" data-i18n="modes.tax">🌍 Token Tax</button>
427
  </div>
428
  </div>
429
  <div class="task-tile">
 
484
  <button class="mode-btn" data-mode="peft" role="tab" aria-selected="false" data-i18n="modes.peft">🔧 PEFT Lint</button>
485
  <button class="mode-btn" data-mode="cache" role="tab" aria-selected="false" data-i18n="modes.cache">🔁 Cache Diff</button>
486
  <button class="mode-btn" data-mode="speculative" role="tab" aria-selected="false" data-i18n="modes.speculative">🔬 Spec-Decode</button>
487
+ <button class="mode-btn" data-mode="tax" role="tab" aria-selected="false" data-i18n="modes.tax">🌍 Token Tax</button>
488
  <button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
489
  </div>
490
  <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
 
1149
  <div id="spec-output" style="margin-top: 1em;"></div>
1150
  </section>
1151
 
1152
+ <!-- Multilingual Tokenizer Tax (mode=tax, v0.8.7 anti-bullshit pack #13) -->
1153
+ <section id="tax-section" style="display:none;">
1154
+ <h2><span data-i18n="tax.title">🌍 Multilingual Tokenizer Tax</span>
1155
+ <span class="info"><span class="tooltip" data-i18n="tax.tip">
1156
+ <strong>Why this matters</strong>: tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ tokens in Chinese on a Latin-trained tokenizer (Llama, Phi). Cost per request and effective context BOTH degrade silently. Paste your text, see actual token counts across vendor tokenizers — no estimation, real BPE encoding via transformers.js in your browser.
1157
+ </span></span>
1158
+ </h2>
1159
+ <p class="recipe-desc" data-i18n="tax.desc">
1160
+ <strong>Don't 3× your bill on Chinese support.</strong> Paste any text → real per-tokenizer BPE encoding across Qwen / Phi / Llama / Gemma / GPT-4 / Claude → see the cost asymmetry vs your baseline.
1161
+ </p>
1162
+ <div class="form-row">
1163
+ <label for="tax-input" data-i18n="tax.input_label">Text to tokenize:</label>
1164
+ <textarea id="tax-input" rows="8" style="width:100%;font-family:monospace;font-size:0.9em;" data-i18n-placeholder="tax.input.placeholder" placeholder="Paste any text — English, Chinese, Arabic, code, …"></textarea>
1165
+ </div>
1166
+ <div class="form-row">
1167
+ <button type="button" id="tax-tokenize-btn" data-i18n="tax.tokenize_btn">🔬 Tokenize all</button>
1168
+ <button type="button" id="tax-sample-en-btn" class="secondary" data-i18n="tax.sample_en_btn">↳ Sample: English</button>
1169
+ <button type="button" id="tax-sample-zh-btn" class="secondary" data-i18n="tax.sample_zh_btn">↳ Sample: 中文</button>
1170
+ <button type="button" id="tax-sample-ar-btn" class="secondary" data-i18n="tax.sample_ar_btn">↳ Sample: عربى</button>
1171
+ <button type="button" id="tax-sample-mixed-btn" class="secondary" data-i18n="tax.sample_mixed_btn">↳ Sample: mixed</button>
1172
+ <button type="button" id="tax-sample-code-btn" class="secondary" data-i18n="tax.sample_code_btn">↳ Sample: code</button>
1173
+ </div>
1174
+ <p id="tax-status" class="recipe-desc" style="font-size:0.92em;"></p>
1175
+ <div id="tax-output" style="margin-top: 1em;"></div>
1176
+ <p class="recipe-desc subtle" style="font-size:0.82em;margin-top:1em;" data-i18n="tax.firstload_note">
1177
+ 💡 <strong>First-time load:</strong> the tool fetches transformers.js (~750 KB) + each tokenizer's vocab on demand (~5-15 MB per tokenizer, cached after). Subsequent runs are instant. All processing is local — your text never leaves the browser.
1178
+ </p>
1179
+ </section>
1180
+
1181
  <section id="hub-section" style="display:none;">
1182
  <h2><span data-i18n="hub.title">🧭 Solutions Hub</span>
1183
  <span class="info"><span class="tooltip" data-i18n="hub.tip">
js/i18n.js CHANGED
@@ -714,6 +714,45 @@ export const TRANSLATIONS = {
714
  "help.v085.speculative.title": "🔬 Speculative-Decode Compatibility",
715
  "help.v085.speculative.body": "Speculative decoding only works if target and draft share the exact same vocabulary. Mismatched vocabs cause every draft token to be rejected — you pay BOTH compute costs and get worse throughput than baseline. Worse, the system still emits correct output (just slower), so the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 / #12488 all surface variants. This tool fetches `tokenizer.json` from HF Hub for both model ids, compares tokenizer type, vocab size, full token→id map, special tokens, and added tokens, then estimates a speedup band based on param ratio and typical α=0.5/0.7/0.85 acceptance rates. <em>Use case</em>: before you launch a vLLM cluster with spec-dec enabled, verify the pair is actually compatible.",
716
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
717
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.",
718
  "help.v081.hub.title": "🧭 Solutions Hub",
719
  "help.v081.hub.body": "tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'",
@@ -1887,6 +1926,45 @@ export const TRANSLATIONS = {
1887
  "help.v085.speculative.title": "🔬 Compatibilidad de Speculative-Decode",
1888
  "help.v085.speculative.body": "El speculative decoding solo funciona si target y draft comparten exactamente el mismo vocabulario. Vocabs mismatched hacen que cada token del draft sea rechazado — pagas AMBOS computes y obtienes peor throughput que baseline. Peor: el sistema sigue emitiendo output correcto (solo más lento), así que el bug es invisible en tests unitarios. vLLM #4570 / #16757 / #20409 / #12488 surfacen variantes. Esta tool hace fetch de `tokenizer.json` desde HF Hub para ambos ids, compara tipo de tokenizer, tamaño de vocab, mapa completo token→id, special tokens, y added tokens, luego estima una banda de speedup basada en ratio de params y tasas típicas α=0.5/0.7/0.85 de aceptación. <em>Caso de uso</em>: antes de lanzar un cluster vLLM con spec-dec habilitado, verifica que el par sea compatible.",
1889
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1890
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — cada pain documentado mapeado a un mode tafagent o herramienta externa curada. No reinventes — encuentra.",
1891
  "help.v081.hub.title": "🧭 Solutions Hub",
1892
  "help.v081.hub.body": "tafagent como integrador, no silo. 30+ pains en 7 categorías (eval reliability · diagnósticos · setup · training · retrieval · multimodal · observability), cada uno mapeado a (a) el mode tafagent que lo resuelve, si existe, y (b) las herramientas externas best-of-breed que la comunidad ya usa (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Caja de búsqueda matchea pain, scenario, y nombre de herramienta. <em>Caso de uso</em>: 'tengo problema X — ¿lo resuelve tafagent, y si no, quién?'",
@@ -2924,6 +3002,45 @@ export const TRANSLATIONS = {
2924
  "help.v085.speculative.title": "🔬 Compatibilité Speculative-Decode",
2925
  "help.v085.speculative.body": "Le speculative decoding ne marche que si target et draft partagent exactement le même vocabulaire. Des vocabs mismatched font que chaque token du draft est rejeté — vous payez LES DEUX coûts de compute et obtenez un pire débit que la baseline. Pire : le système émet toujours une sortie correcte (juste plus lente), donc le bug est invisible aux tests unitaires. vLLM #4570 / #16757 / #20409 / #12488 surfent les variantes. Cet outil récupère `tokenizer.json` depuis HF Hub pour les deux model ids, compare le type de tokenizer, la taille du vocab, la map complète token→id, les special tokens, et les added tokens, puis estime une bande de speedup basée sur le ratio de params et les taux α=0.5/0.7/0.85 d'acceptation typiques. <em>Cas d'usage</em> : avant de lancer un cluster vLLM avec spec-dec activé, vérifiez que la paire est compatible.",
2926
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2927
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — chaque pain documenté mappé à un mode tafagent ou outil externe curé. Ne réinventez pas — trouvez.",
2928
  "help.v081.hub.title": "🧭 Solutions Hub",
2929
  "help.v081.hub.body": "tafagent comme intégrateur, pas silo. 30+ pains à travers 7 catégories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), chacun mappé à (a) le mode tafagent qui le résout, s'il existe, et (b) les outils externes best-of-breed que la communauté utilise déjà (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). La barre de recherche matche pain, scénario, et nom d'outil. <em>Cas d'usage</em> : 'j'ai le problème X — tafagent le résout-il, et sinon, qui ?'",
@@ -3961,6 +4078,45 @@ export const TRANSLATIONS = {
3961
  "help.v085.speculative.title": "🔬 Speculative-Decode 兼容性",
3962
  "help.v085.speculative.body": "Speculative decoding 仅当 target 和 draft 共享完全相同的词汇表时才能工作。Vocab 不匹配导致每个 draft token 被拒绝——你支付双倍计算成本且吞吐量比 baseline 更差。更糟:系统仍输出正确(只是更慢),所以 bug 在单元测试中不可见。vLLM #4570 / #16757 / #20409 / #12488 都显示了变种。这个工具从 HF Hub 获取两个 model id 的 `tokenizer.json`,比较 tokenizer 类型、vocab 大小、完整 token→id 映射、special token 和 added token,然后基于参数比和典型 α=0.5/0.7/0.85 接受率估算 speedup 范围。<em>用例</em>:在启动启用了 spec-dec 的 vLLM 集群之前,验证这对模型是否真的兼容。",
3963
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3964
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — 每个文档化的问题都映射到一个 tafagent 模式或精选外部工具。别重复发明 — 去找。",
3965
  "help.v081.hub.title": "🧭 Solutions Hub",
3966
  "help.v081.hub.body": "tafagent 作为集成者而非孤岛。30+ 问题跨 7 类别(评估可靠性 · 诊断 · 设置 · 训练 · 检索 · 多模态 · 可观测性),每个映射到(a)解决它的 tafagent 模式(若存在),以及(b)社区已信任的最佳外部工具(RAGAS、MTEB、HELM、MCP Schema Validator、llm-stats、llguidance、GlitchMiner 等)。搜索框匹配 pain、场景和工具名称。<em>用例</em>:'我有问题 X — tafagent 解决它吗,如果不,谁解决?'",
 
714
  "help.v085.speculative.title": "🔬 Speculative-Decode Compatibility",
715
  "help.v085.speculative.body": "Speculative decoding only works if target and draft share the exact same vocabulary. Mismatched vocabs cause every draft token to be rejected — you pay BOTH compute costs and get worse throughput than baseline. Worse, the system still emits correct output (just slower), so the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 / #12488 all surface variants. This tool fetches `tokenizer.json` from HF Hub for both model ids, compares tokenizer type, vocab size, full token→id map, special tokens, and added tokens, then estimates a speedup band based on param ratio and typical α=0.5/0.7/0.85 acceptance rates. <em>Use case</em>: before you launch a vLLM cluster with spec-dec enabled, verify the pair is actually compatible.",
716
 
717
+ // v0.8.7 — anti-bullshit pack #13: Multilingual Tokenizer Tax
718
+ "modes.tax": "🌍 Token Tax",
719
+ "mode_desc.tax": "Real BPE encoding (browser-side via transformers.js) of pasted text across 6 vendor tokenizers. Surfaces the silent cost asymmetry across languages.",
720
+ "tax.title": "🌍 Multilingual Tokenizer Tax",
721
+ "tax.tip": "Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ tokens in Chinese on a Latin-trained tokenizer (Llama, Phi). Cost per request and effective context BOTH degrade silently. Paste your text, see actual token counts across vendor tokenizers — no estimation, real BPE encoding via transformers.js in your browser.",
722
+ "tax.desc": "<strong>Don't 3× your bill on Chinese support.</strong> Paste any text → real per-tokenizer BPE encoding across Qwen / Phi / Llama / Gemma / GPT-4 / Claude → see the cost asymmetry vs your baseline.",
723
+ "tax.input_label": "Text to tokenize:",
724
+ "tax.input.placeholder": "Paste any text — English, Chinese, Arabic, code, …",
725
+ "tax.tokenize_btn": "🔬 Tokenize all",
726
+ "tax.sample_en_btn": "↳ Sample: English",
727
+ "tax.sample_zh_btn": "↳ Sample: 中文",
728
+ "tax.sample_ar_btn": "↳ Sample: عربى",
729
+ "tax.sample_mixed_btn": "↳ Sample: mixed",
730
+ "tax.sample_code_btn": "↳ Sample: code",
731
+ "tax.status.loading": "⏳ Loading transformers.js + tokenizers (first run can take 5-15s)…",
732
+ "tax.status.done": "✅ {n}/{total} tokenizers ran in {ms}ms",
733
+ "tax.col.tokenizer": "Tokenizer",
734
+ "tax.col.tokens": "Tokens",
735
+ "tax.col.cpt": "Chars/tok",
736
+ "tax.col.ratio": "Ratio",
737
+ "tax.summary.input": "Input: {chars} chars, {bytes} bytes",
738
+ "tax.script_breakdown": "scripts",
739
+ "tax.interp.worst": "{label} costs {pct}% more tokens than baseline for this text.",
740
+ "tax.interp.uniform": "✓ All tokenizers within ±5% — text is well-handled across vendors.",
741
+ "tax.hint.empty": "Paste some text and click Tokenize.",
742
+ "tax.all_failed": "All tokenizers failed to load.",
743
+ "tax.error.gated": "model gated (HF auth required — try the open mirror)",
744
+ "tax.error.not_found": "model id not found",
745
+ "tax.error.timeout": "timeout (large tokenizer or slow connection)",
746
+ "tax.error.network": "network error",
747
+ "tax.error.fetch_failed": "fetch failed",
748
+ "tax.error.invalid_input": "invalid input",
749
+ "tax.attribution": "Tokenizers via",
750
+ "tax.attribution.privacy": "Text is tokenized locally — never leaves the browser.",
751
+ "tax.firstload_note": "💡 <strong>First-time load:</strong> the tool fetches transformers.js (~750 KB) + each tokenizer's vocab on demand (~5-15 MB per tokenizer, cached after). Subsequent runs are instant. All processing is local — your text never leaves the browser.",
752
+ "inv.v087.tax": "<strong>🌍 Token Tax</strong> — real BPE encoding across 6 vendor tokenizers. Surfaces the silent cost asymmetry across languages (CJK / Arabic / mixed).",
753
+ "help.v087.tax.title": "🌍 Multilingual Tokenizer Tax",
754
+ "help.v087.tax.body": "Tokenizers tax non-English text asymmetrically. The same paragraph might be 100 tokens in English but 250+ in Chinese on a Latin-trained tokenizer (Llama, Phi). Both cost-per-request AND effective context degrade silently. This tool loads HuggingFace transformers.js in your browser (~750 KB CDN) and tokenizes pasted text against 6 preset vendor tokenizers (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). Output: per-tokenizer token count + chars-per-token + ratio vs baseline + cost-asymmetry interpretation. Auto-detects script blocks (Latin / CJK / Arabic / Cyrillic / Devanagari / Thai / Greek / Hebrew / Korean) so users see why one tokenizer is 3× another. <em>Use case</em>: 'My multilingual support added 30% to the bill — which language costs the most?' → paste real production text, see exact per-tokenizer breakdown.",
755
+
756
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.",
757
  "help.v081.hub.title": "🧭 Solutions Hub",
758
  "help.v081.hub.body": "tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'",
 
1926
  "help.v085.speculative.title": "🔬 Compatibilidad de Speculative-Decode",
1927
  "help.v085.speculative.body": "El speculative decoding solo funciona si target y draft comparten exactamente el mismo vocabulario. Vocabs mismatched hacen que cada token del draft sea rechazado — pagas AMBOS computes y obtienes peor throughput que baseline. Peor: el sistema sigue emitiendo output correcto (solo más lento), así que el bug es invisible en tests unitarios. vLLM #4570 / #16757 / #20409 / #12488 surfacen variantes. Esta tool hace fetch de `tokenizer.json` desde HF Hub para ambos ids, compara tipo de tokenizer, tamaño de vocab, mapa completo token→id, special tokens, y added tokens, luego estima una banda de speedup basada en ratio de params y tasas típicas α=0.5/0.7/0.85 de aceptación. <em>Caso de uso</em>: antes de lanzar un cluster vLLM con spec-dec habilitado, verifica que el par sea compatible.",
1928
 
1929
+ // v0.8.7 — anti-bullshit pack #13: Multilingual Tokenizer Tax
1930
+ "modes.tax": "🌍 Token Tax",
1931
+ "mode_desc.tax": "BPE real (transformers.js en browser) sobre texto pegado a través de 6 tokenizers de vendor. Surface la asimetría de coste silenciosa entre idiomas.",
1932
+ "tax.title": "🌍 Impuesto de Tokenizer Multilingüe",
1933
+ "tax.tip": "Los tokenizers gravan el texto no-inglés de forma asimétrica. El mismo párrafo puede ser 100 tokens en inglés pero 250+ en chino en un tokenizer entrenado en Latin (Llama, Phi). Coste por request Y contexto efectivo degradan silenciosamente. Pega tu texto, ve token counts reales a través de tokenizers de vendor — sin estimación, BPE real vía transformers.js en tu navegador.",
1934
+ "tax.desc": "<strong>No 3× tu factura en soporte chino.</strong> Pega cualquier texto → BPE real por-tokenizer a través de Qwen / Phi / Llama / Gemma / GPT-4 / Claude → ve la asimetría de coste vs tu baseline.",
1935
+ "tax.input_label": "Texto a tokenizar:",
1936
+ "tax.input.placeholder": "Pega cualquier texto — inglés, chino, árabe, código, …",
1937
+ "tax.tokenize_btn": "🔬 Tokenizar todos",
1938
+ "tax.sample_en_btn": "↳ Ejemplo: English",
1939
+ "tax.sample_zh_btn": "↳ Ejemplo: 中文",
1940
+ "tax.sample_ar_btn": "↳ Ejemplo: عربى",
1941
+ "tax.sample_mixed_btn": "↳ Ejemplo: mixto",
1942
+ "tax.sample_code_btn": "↳ Ejemplo: código",
1943
+ "tax.status.loading": "⏳ Cargando transformers.js + tokenizers (primera ejecución puede tardar 5-15s)…",
1944
+ "tax.status.done": "✅ {n}/{total} tokenizers en {ms}ms",
1945
+ "tax.col.tokenizer": "Tokenizer",
1946
+ "tax.col.tokens": "Tokens",
1947
+ "tax.col.cpt": "Chars/tok",
1948
+ "tax.col.ratio": "Ratio",
1949
+ "tax.summary.input": "Entrada: {chars} caracteres, {bytes} bytes",
1950
+ "tax.script_breakdown": "scripts",
1951
+ "tax.interp.worst": "{label} cuesta {pct}% más tokens que baseline para este texto.",
1952
+ "tax.interp.uniform": "✓ Todos los tokenizers dentro de ±5% — texto bien manejado entre vendors.",
1953
+ "tax.hint.empty": "Pega texto y haz click en Tokenizar.",
1954
+ "tax.all_failed": "Todos los tokenizers fallaron.",
1955
+ "tax.error.gated": "modelo gated (auth HF requerida — prueba mirror open)",
1956
+ "tax.error.not_found": "model id no encontrado",
1957
+ "tax.error.timeout": "timeout (tokenizer grande o conexión lenta)",
1958
+ "tax.error.network": "error de red",
1959
+ "tax.error.fetch_failed": "fetch falló",
1960
+ "tax.error.invalid_input": "entrada inválida",
1961
+ "tax.attribution": "Tokenizers vía",
1962
+ "tax.attribution.privacy": "El texto se tokeniza localmente — nunca sale del navegador.",
1963
+ "tax.firstload_note": "💡 <strong>Primera carga:</strong> la tool descarga transformers.js (~750 KB) + el vocab de cada tokenizer bajo demanda (~5-15 MB por tokenizer, cacheados después). Ejecuciones siguientes son instantáneas. Todo el procesamiento es local — tu texto nunca sale del navegador.",
1964
+ "inv.v087.tax": "<strong>🌍 Token Tax</strong> — BPE real sobre 6 tokenizers de vendor. Surface la asimetría de coste silenciosa entre idiomas (CJK / árabe / mixto).",
1965
+ "help.v087.tax.title": "🌍 Impuesto de Tokenizer Multilingüe",
1966
+ "help.v087.tax.body": "Los tokenizers gravan el texto no-inglés de forma asimétrica. El mismo párrafo puede ser 100 tokens en inglés pero 250+ en chino en un tokenizer entrenado en Latin (Llama, Phi). Tanto coste-por-request COMO contexto efectivo degradan silenciosamente. Esta tool carga HuggingFace transformers.js en tu navegador (~750 KB CDN) y tokeniza el texto pegado contra 6 tokenizers preset de vendor (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude aprox). Output: token count por tokenizer + chars-per-token + ratio vs baseline + interpretación de asimetría. Auto-detecta bloques de script (Latin / CJK / árabe / cirílico / devanagari / tailandés / griego / hebreo / coreano) para que veas por qué un tokenizer es 3× otro. <em>Caso de uso</em>: 'Mi soporte multilingüe añadió 30% a la factura — ¿qué idioma cuesta más?' → pega texto real de producción, ve breakdown exacto por tokenizer.",
1967
+
1968
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — cada pain documentado mapeado a un mode tafagent o herramienta externa curada. No reinventes — encuentra.",
1969
  "help.v081.hub.title": "🧭 Solutions Hub",
1970
  "help.v081.hub.body": "tafagent como integrador, no silo. 30+ pains en 7 categorías (eval reliability · diagnósticos · setup · training · retrieval · multimodal · observability), cada uno mapeado a (a) el mode tafagent que lo resuelve, si existe, y (b) las herramientas externas best-of-breed que la comunidad ya usa (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Caja de búsqueda matchea pain, scenario, y nombre de herramienta. <em>Caso de uso</em>: 'tengo problema X — ¿lo resuelve tafagent, y si no, quién?'",
 
3002
  "help.v085.speculative.title": "🔬 Compatibilité Speculative-Decode",
3003
  "help.v085.speculative.body": "Le speculative decoding ne marche que si target et draft partagent exactement le même vocabulaire. Des vocabs mismatched font que chaque token du draft est rejeté — vous payez LES DEUX coûts de compute et obtenez un pire débit que la baseline. Pire : le système émet toujours une sortie correcte (juste plus lente), donc le bug est invisible aux tests unitaires. vLLM #4570 / #16757 / #20409 / #12488 surfent les variantes. Cet outil récupère `tokenizer.json` depuis HF Hub pour les deux model ids, compare le type de tokenizer, la taille du vocab, la map complète token→id, les special tokens, et les added tokens, puis estime une bande de speedup basée sur le ratio de params et les taux α=0.5/0.7/0.85 d'acceptation typiques. <em>Cas d'usage</em> : avant de lancer un cluster vLLM avec spec-dec activé, vérifiez que la paire est compatible.",
3004
 
3005
+ // v0.8.7 — anti-bullshit pack #13: Multilingual Tokenizer Tax
3006
+ "modes.tax": "🌍 Token Tax",
3007
+ "mode_desc.tax": "Encodage BPE réel (côté navigateur via transformers.js) du texte collé sur 6 tokenizers de fournisseurs. Révèle l'asymétrie de coût silencieuse entre langues.",
3008
+ "tax.title": "🌍 Taxe Tokenizer Multilingue",
3009
+ "tax.tip": "Les tokenizers taxent le texte non-anglais de façon asymétrique. Le même paragraphe peut faire 100 tokens en anglais mais 250+ en chinois sur un tokenizer entraîné en Latin (Llama, Phi). Coût par requête ET contexte effectif dégradent silencieusement. Collez votre texte, voyez les vrais token counts à travers les tokenizers fournisseurs — pas d'estimation, BPE réel via transformers.js dans votre navigateur.",
3010
+ "tax.desc": "<strong>Ne 3× pas votre facture sur le support chinois.</strong> Collez n'importe quel texte → encodage BPE réel par tokenizer (Qwen / Phi / Llama / Gemma / GPT-4 / Claude) → voyez l'asymétrie de coût vs votre baseline.",
3011
+ "tax.input_label": "Texte à tokenizer :",
3012
+ "tax.input.placeholder": "Collez n'importe quel texte — anglais, chinois, arabe, code, …",
3013
+ "tax.tokenize_btn": "🔬 Tokenizer tous",
3014
+ "tax.sample_en_btn": "↳ Exemple : English",
3015
+ "tax.sample_zh_btn": "↳ Exemple : 中文",
3016
+ "tax.sample_ar_btn": "↳ Exemple : عربى",
3017
+ "tax.sample_mixed_btn": "↳ Exemple : mixte",
3018
+ "tax.sample_code_btn": "↳ Exemple : code",
3019
+ "tax.status.loading": "⏳ Chargement transformers.js + tokenizers (la première exécution peut prendre 5-15s)…",
3020
+ "tax.status.done": "✅ {n}/{total} tokenizers en {ms}ms",
3021
+ "tax.col.tokenizer": "Tokenizer",
3022
+ "tax.col.tokens": "Tokens",
3023
+ "tax.col.cpt": "Chars/tok",
3024
+ "tax.col.ratio": "Ratio",
3025
+ "tax.summary.input": "Entrée : {chars} caractères, {bytes} octets",
3026
+ "tax.script_breakdown": "scripts",
3027
+ "tax.interp.worst": "{label} coûte {pct}% de tokens en plus que la baseline pour ce texte.",
3028
+ "tax.interp.uniform": "✓ Tous les tokenizers à ±5% — texte bien géré par les fournisseurs.",
3029
+ "tax.hint.empty": "Collez du texte puis Tokenizer.",
3030
+ "tax.all_failed": "Tous les tokenizers ont échoué.",
3031
+ "tax.error.gated": "modèle gated (auth HF requise — essayez le mirror open)",
3032
+ "tax.error.not_found": "model id introuvable",
3033
+ "tax.error.timeout": "timeout (gros tokenizer ou connexion lente)",
3034
+ "tax.error.network": "erreur réseau",
3035
+ "tax.error.fetch_failed": "fetch échoué",
3036
+ "tax.error.invalid_input": "entrée invalide",
3037
+ "tax.attribution": "Tokenizers via",
3038
+ "tax.attribution.privacy": "Le texte est tokenizé localement — ne quitte jamais le navigateur.",
3039
+ "tax.firstload_note": "💡 <strong>Premier chargement :</strong> l'outil récupère transformers.js (~750 KB) + le vocab de chaque tokenizer à la demande (~5-15 MB par tokenizer, mis en cache après). Les exécutions suivantes sont instantanées. Tout le traitement est local — votre texte ne quitte jamais le navigateur.",
3040
+ "inv.v087.tax": "<strong>🌍 Token Tax</strong> — encodage BPE réel sur 6 tokenizers fournisseurs. Révèle l'asymétrie de coût silencieuse entre langues (CJK / arabe / mixte).",
3041
+ "help.v087.tax.title": "🌍 Taxe Tokenizer Multilingue",
3042
+ "help.v087.tax.body": "Les tokenizers taxent le texte non-anglais de façon asymétrique. Le même paragraphe peut faire 100 tokens en anglais mais 250+ en chinois sur un tokenizer entraîné en Latin (Llama, Phi). Coût-par-requête ET contexte effectif dégradent silencieusement. Cet outil charge HuggingFace transformers.js dans votre navigateur (~750 KB CDN) et tokenize le texte collé contre 6 tokenizers preset de fournisseurs (Qwen2.5, Phi-3.5, Llama-3.1, Gemma-2, GPT-4 cl100k, Claude approx). Sortie : token count par tokenizer + chars-per-token + ratio vs baseline + interprétation d'asymétrie. Auto-détecte les blocs de script (Latin / CJK / arabe / cyrillique / devanagari / thaï / grec / hébreu / coréen) pour voir pourquoi un tokenizer est 3× un autre. <em>Cas d'usage</em> : 'Mon support multilingue a ajouté 30% à la facture — quelle langue coûte le plus ?' → collez du texte de production réel, voyez le breakdown exact par tokenizer.",
3043
+
3044
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — chaque pain documenté mappé à un mode tafagent ou outil externe curé. Ne réinventez pas — trouvez.",
3045
  "help.v081.hub.title": "🧭 Solutions Hub",
3046
  "help.v081.hub.body": "tafagent comme intégrateur, pas silo. 30+ pains à travers 7 catégories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), chacun mappé à (a) le mode tafagent qui le résout, s'il existe, et (b) les outils externes best-of-breed que la communauté utilise déjà (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). La barre de recherche matche pain, scénario, et nom d'outil. <em>Cas d'usage</em> : 'j'ai le problème X — tafagent le résout-il, et sinon, qui ?'",
 
4078
  "help.v085.speculative.title": "🔬 Speculative-Decode 兼容性",
4079
  "help.v085.speculative.body": "Speculative decoding 仅当 target 和 draft 共享完全相同的词汇表时才能工作。Vocab 不匹配导致每个 draft token 被拒绝——你支付双倍计算成本且吞吐量比 baseline 更差。更糟:系统仍输出正确(只是更慢),所以 bug 在单元测试中不可见。vLLM #4570 / #16757 / #20409 / #12488 都显示了变种。这个工具从 HF Hub 获取两个 model id 的 `tokenizer.json`,比较 tokenizer 类型、vocab 大小、完整 token→id 映射、special token 和 added token,然后基于参数比和典型 α=0.5/0.7/0.85 接受率估算 speedup 范围。<em>用例</em>:在启动启用了 spec-dec 的 vLLM 集群之前,验证这对模型是否真的兼容。",
4080
 
4081
+ // v0.8.7 — anti-bullshit pack #13: Multilingual Tokenizer Tax
4082
+ "modes.tax": "🌍 Token Tax",
4083
+ "mode_desc.tax": "通过浏览器端 transformers.js 对粘贴文本进行 6 个供应商 tokenizer 的真实 BPE 编码。揭示语言间的静默成本不对称。",
4084
+ "tax.title": "🌍 多语言 Tokenizer 税",
4085
+ "tax.tip": "Tokenizer 对非英语文本的征税不对称。同一段落在英语中可能是 100 个 token,但在拉丁字母训练的 tokenizer(Llama、Phi)上的中文可能是 250+ 个 token。每次请求成本和有效上下文都会静默降级。粘贴你的文本,通过供应商 tokenizer 查看实际 token 数——没有估算,通过 transformers.js 在浏览器中真实 BPE 编码。",
4086
+ "tax.desc": "<strong>不要因中文支持让账单 3 倍。</strong> 粘贴任意文本 → 通过 Qwen / Phi / Llama / Gemma / GPT-4 / Claude 的真实 BPE 编码 → 查看相对于 baseline 的成本不对称。",
4087
+ "tax.input_label": "要 tokenize 的文本:",
4088
+ "tax.input.placeholder": "粘贴任何文本——英语、中文、阿拉伯语、代码……",
4089
+ "tax.tokenize_btn": "🔬 Tokenize 全部",
4090
+ "tax.sample_en_btn": "↳ 示例:English",
4091
+ "tax.sample_zh_btn": "↳ 示例:中文",
4092
+ "tax.sample_ar_btn": "↳ 示例:عربى",
4093
+ "tax.sample_mixed_btn": "↳ 示例:混合",
4094
+ "tax.sample_code_btn": "↳ 示例:代码",
4095
+ "tax.status.loading": "⏳ 加载 transformers.js + tokenizer(首次运行可能需要 5-15 秒)…",
4096
+ "tax.status.done": "✅ {n}/{total} 个 tokenizer,用时 {ms}ms",
4097
+ "tax.col.tokenizer": "Tokenizer",
4098
+ "tax.col.tokens": "Token 数",
4099
+ "tax.col.cpt": "字符/token",
4100
+ "tax.col.ratio": "比率",
4101
+ "tax.summary.input": "输入:{chars} 字符,{bytes} 字节",
4102
+ "tax.script_breakdown": "脚本",
4103
+ "tax.interp.worst": "{label} 对此文本的 token 数比 baseline 多 {pct}%。",
4104
+ "tax.interp.uniform": "✓ 所有 tokenizer 在 ±5% 范围内——文本在各供应商间处理良好。",
4105
+ "tax.hint.empty": "粘贴文本然后点击 Tokenize。",
4106
+ "tax.all_failed": "所有 tokenizer 都失败了。",
4107
+ "tax.error.gated": "模型受限(需要 HF auth——尝试 open mirror)",
4108
+ "tax.error.not_found": "找不到 model id",
4109
+ "tax.error.timeout": "超时(大 tokenizer 或慢速连接)",
4110
+ "tax.error.network": "网络错误",
4111
+ "tax.error.fetch_failed": "获取失败",
4112
+ "tax.error.invalid_input": "无效输入",
4113
+ "tax.attribution": "Tokenizer 通过",
4114
+ "tax.attribution.privacy": "文本在本地 tokenize——永远不会离开浏览器。",
4115
+ "tax.firstload_note": "💡 <strong>首次加载:</strong>工具按需获取 transformers.js(~750 KB)+ 每个 tokenizer 的词汇表(每个 ~5-15 MB,加载后缓存)。后续运行即时。所有处理都是本地的——你的文本永远不会离开浏览器。",
4116
+ "inv.v087.tax": "<strong>🌍 Token Tax</strong> — 6 个供应商 tokenizer 的真实 BPE 编码。揭示语言间(CJK / 阿拉伯语 / 混合)的静默成本不对称。",
4117
+ "help.v087.tax.title": "🌍 多语言 Tokenizer 税",
4118
+ "help.v087.tax.body": "Tokenizer 对非英语文本的征税不对称。同一段落在英语中可能是 100 个 token,但在拉丁字母训练的 tokenizer(Llama、Phi)上的中文可能是 250+ 个 token。每次请求成本和有效上下文都会静默降级。这个工具在你的浏览器中加载 HuggingFace transformers.js(~750 KB CDN),并对粘贴的文本运行 6 个预设供应商 tokenizer(Qwen2.5、Phi-3.5、Llama-3.1、Gemma-2、GPT-4 cl100k、Claude 近似)的 tokenize。输出:每个 tokenizer 的 token 数 + 字符/token + 相对于 baseline 的比率 + 成本不对称解读。自动检测脚本块(拉丁/CJK/阿拉伯/西里尔/天城/泰/希腊/希伯来/韩文)让你看到为什么一个 tokenizer 是另一个的 3 倍。<em>用例</em>:『我的多语言支持给账单加了 30%——哪种语言成本最高?』→ 粘贴真实生产文本,查看每个 tokenizer 的精确分解。",
4119
+
4120
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — 每个文档化的问题都映射到一个 tafagent 模式或精选外部工具。别重复发明 — 去找。",
4121
  "help.v081.hub.title": "🧭 Solutions Hub",
4122
  "help.v081.hub.body": "tafagent 作为集成者而非孤岛。30+ 问题跨 7 类别(评估可靠性 · 诊断 · 设置 · 训练 · 检索 · 多模态 · 可观测性),每个映射到(a)解决它的 tafagent 模式(若存在),以及(b)社区已信任的最佳外部工具(RAGAS、MTEB、HELM、MCP Schema Validator、llm-stats、llguidance、GlitchMiner 等)。搜索框匹配 pain、场景和工具名称。<em>用例</em>:'我有问题 X — tafagent 解决它吗,如果不,谁解决?'",
js/main.js CHANGED
@@ -31,6 +31,10 @@ import { lintJsonCot, reorderJsonText, classifyFieldName } from "./json_cot_lint
31
  import { lintPeftCode, ARCH_TARGET_MODULES } from "./peft_anti_pattern.js";
32
  import { diffPromptCache, PROVIDERS as CACHE_PROVIDERS } from "./prompt_cache_diff.js";
33
  import { checkCompatibility as specCheckCompat, parseParamHint } from "./spec_decode_compat.js";
 
 
 
 
34
 
35
  // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
36
  // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
@@ -224,6 +228,7 @@ document.addEventListener("click", (e) => {
224
  peft: "peft-section",
225
  cache: "cache-section",
226
  speculative: "speculative-section",
 
227
  hub: "hub-section",
228
  }[targetMode];
229
  if (sectionId) {
@@ -249,7 +254,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
249
  "diagnose-section", "phase-section", "unmask-section",
250
  "template-section", "arena-section", "contam-section",
251
  "quant-section", "drift-section", "niah-section",
252
- "saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "hub-section"].forEach(id => {
253
  const el = $(id);
254
  if (el) el.style.display = "none";
255
  });
@@ -265,6 +270,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
265
  peft: "peft-section",
266
  cache: "cache-section",
267
  speculative: "speculative-section",
 
268
  hub: "hub-section",
269
  };
270
  const sectionId = sectionMap[mode];
@@ -276,6 +282,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
276
  if (mode === "peft") initPeft();
277
  if (mode === "cache") initCacheDiff();
278
  if (mode === "speculative") initSpeculative();
 
279
  if (mode === "hub") initHub();
280
  });
281
  });
@@ -4248,6 +4255,161 @@ $("spec-example-bad-btn")?.addEventListener("click", () => {
4248
  // (HF autocomplete on spec-target-id / spec-draft-id is registered via
4249
  // the known-id list in hf_autocomplete.js; no extra wiring needed here.)
4250
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4251
  // ════════════════��═══════════════════════════════════════════════════
4252
  // Bootstrap
4253
  // ════════════════════════════════════════════════════════════════════
 
31
  import { lintPeftCode, ARCH_TARGET_MODULES } from "./peft_anti_pattern.js";
32
  import { diffPromptCache, PROVIDERS as CACHE_PROVIDERS } from "./prompt_cache_diff.js";
33
  import { checkCompatibility as specCheckCompat, parseParamHint } from "./spec_decode_compat.js";
34
+ import {
35
+ tokenizeAll, detectLanguageBlocks,
36
+ PRESET_TOKENIZERS as TAX_PRESETS, SAMPLE_TEXTS as TAX_SAMPLES,
37
+ } from "./tokenizer_tax.js";
38
 
39
  // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
40
  // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
 
228
  peft: "peft-section",
229
  cache: "cache-section",
230
  speculative: "speculative-section",
231
+ tax: "tax-section",
232
  hub: "hub-section",
233
  }[targetMode];
234
  if (sectionId) {
 
254
  "diagnose-section", "phase-section", "unmask-section",
255
  "template-section", "arena-section", "contam-section",
256
  "quant-section", "drift-section", "niah-section",
257
+ "saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "tax-section", "hub-section"].forEach(id => {
258
  const el = $(id);
259
  if (el) el.style.display = "none";
260
  });
 
270
  peft: "peft-section",
271
  cache: "cache-section",
272
  speculative: "speculative-section",
273
+ tax: "tax-section",
274
  hub: "hub-section",
275
  };
276
  const sectionId = sectionMap[mode];
 
282
  if (mode === "peft") initPeft();
283
  if (mode === "cache") initCacheDiff();
284
  if (mode === "speculative") initSpeculative();
285
+ if (mode === "tax") initTax();
286
  if (mode === "hub") initHub();
287
  });
288
  });
 
4255
  // (HF autocomplete on spec-target-id / spec-draft-id is registered via
4256
  // the known-id list in hf_autocomplete.js; no extra wiring needed here.)
4257
 
4258
+ // ════════════════════════════════════════════════════════════════════
4259
+ // 🌍 Multilingual Tokenizer Tax (v0.8.7 anti-bullshit pack #13)
4260
+ // ════════════════════════════════════════════════════════════════════
4261
+ let __taxInited = false;
4262
+
4263
+ function initTax() {
4264
+ if (__taxInited) return;
4265
+ __taxInited = true;
4266
+ // No async preload — transformers.js + tokenizer.json are lazy-loaded
4267
+ // on the first Tokenize click so users don't pay download cost just
4268
+ // for opening the tab. Status string explains the wait.
4269
+ }
4270
+
4271
+ function fmtBlocks(blocks) {
4272
+ // Build a compact "60% latin · 35% cjk · 5% other" string from the
4273
+ // detector output. Drops zero-counts and orders by descending size.
4274
+ if (!blocks || !blocks.blocks || !blocks.total_chars) return "";
4275
+ const total = blocks.total_chars;
4276
+ const entries = Object.entries(blocks.blocks)
4277
+ .filter(([, n]) => n > 0)
4278
+ .sort((a, b) => b[1] - a[1]);
4279
+ if (entries.length === 0) return "";
4280
+ const parts = entries.map(([name, n]) => {
4281
+ const pct = Math.round((n / total) * 100);
4282
+ return `${pct}% ${name}`;
4283
+ });
4284
+ return parts.join(" · ");
4285
+ }
4286
+
4287
+ function renderTaxResult(res, presetMeta) {
4288
+ if (res.code === "empty_input") {
4289
+ return `<div class="arena-result"><p>${t("tax.hint.empty") || "Paste some text and click Tokenize."}</p></div>`;
4290
+ }
4291
+ if (res.code === "all_failed") {
4292
+ const errLines = res.results.map(r => {
4293
+ const meta = presetMeta.find(p => p.id === r.modelId);
4294
+ return `<li><code>${escapeHtml(r.modelId)}</code> ${meta ? `<span class="subtle">(${escapeHtml(meta.label)})</span>` : ""}: ${t(`tax.error.${r.error}`) || r.error}</li>`;
4295
+ }).join("");
4296
+ return `<div class="arena-result"><p style="color:#f85149;"><strong>❌ ${t("tax.all_failed") || "All tokenizers failed to load."}</strong></p><ul>${errLines}</ul></div>`;
4297
+ }
4298
+
4299
+ const baselineCount = res.baseline_count;
4300
+ const blocks = detectLanguageBlocks($("tax-input").value);
4301
+ const ratioColor = (r) => {
4302
+ if (r == null) return "#8b949e";
4303
+ if (r >= 1.5) return "#f85149"; // big tax — red
4304
+ if (r >= 1.15) return "#f0883e"; // moderate
4305
+ if (r >= 0.85) return "#3fb950"; // about same
4306
+ return "#58a6ff"; // BETTER than baseline (rare)
4307
+ };
4308
+ const fmtRatio = (r) => r == null ? "—" : `${r.toFixed(2)}×`;
4309
+
4310
+ const rows = res.results.map(r => {
4311
+ const meta = presetMeta.find(p => p.id === r.modelId) || { label: r.modelId, family: "" };
4312
+ if (!r.ok) {
4313
+ return `<tr style="opacity:0.5;">
4314
+ <td><strong>${escapeHtml(meta.label)}</strong><br><span class="subtle" style="font-size:0.8em;">${escapeHtml(meta.family)}</span></td>
4315
+ <td colspan="3" style="color:#f0883e;">${t(`tax.error.${r.error}`) || r.error}</td>
4316
+ </tr>`;
4317
+ }
4318
+ const isBaseline = r.modelId === res.baseline_id;
4319
+ const baselineMark = isBaseline ? `<span class="subtle" style="font-size:0.8em;"> (baseline)</span>` : "";
4320
+ return `<tr ${isBaseline ? 'style="background:#1f2933;"' : ""}>
4321
+ <td><strong>${escapeHtml(meta.label)}</strong>${baselineMark}<br><span class="subtle" style="font-size:0.8em;">${escapeHtml(meta.family)}</span></td>
4322
+ <td style="text-align:right;font-family:monospace;"><strong>${r.token_count.toLocaleString()}</strong></td>
4323
+ <td style="text-align:right;font-family:monospace;">${r.chars_per_token != null ? r.chars_per_token.toFixed(2) : "—"}</td>
4324
+ <td style="text-align:right;font-family:monospace;color:${ratioColor(r.ratio_vs_baseline)};"><strong>${fmtRatio(r.ratio_vs_baseline)}</strong></td>
4325
+ </tr>`;
4326
+ }).join("");
4327
+
4328
+ // Worst-tax explanation — find the tokenizer that scored ≥1.5× baseline.
4329
+ const worst = res.results
4330
+ .filter(r => r.ok && r.ratio_vs_baseline != null)
4331
+ .sort((a, b) => b.ratio_vs_baseline - a.ratio_vs_baseline)[0];
4332
+ let interpretation = "";
4333
+ if (worst && worst.ratio_vs_baseline >= 1.3) {
4334
+ const meta = presetMeta.find(p => p.id === worst.modelId);
4335
+ const pct = Math.round((worst.ratio_vs_baseline - 1) * 100);
4336
+ interpretation = `<p style="color:#f0883e;margin-top:0.5em;">⚠ <strong>${tFmt("tax.interp.worst", {
4337
+ label: meta?.label || worst.modelId,
4338
+ pct,
4339
+ }) || `${meta?.label || worst.modelId} costs ${pct}% more tokens than baseline for this text.`}</strong></p>`;
4340
+ } else if (worst && worst.ratio_vs_baseline <= 1.05) {
4341
+ interpretation = `<p style="color:#3fb950;margin-top:0.5em;">${t("tax.interp.uniform") || "✓ All tokenizers within ±5% — text is well-handled across vendors."}</p>`;
4342
+ }
4343
+
4344
+ return `<div class="arena-result">
4345
+ <p>
4346
+ <strong>${tFmt("tax.summary.input", { chars: res.chars.toLocaleString(), bytes: res.bytes.toLocaleString() }) || `Input: ${res.chars.toLocaleString()} chars, ${res.bytes.toLocaleString()} bytes`}</strong>
4347
+ ${blocks.dominant ? `<span class="subtle"> · ${t("tax.script_breakdown") || "scripts"}: ${fmtBlocks(blocks)}</span>` : ""}
4348
+ </p>
4349
+ ${interpretation}
4350
+ <table class="lean-table" style="margin-top:0.5em;width:100%;">
4351
+ <thead><tr>
4352
+ <th style="text-align:left;">${t("tax.col.tokenizer") || "Tokenizer"}</th>
4353
+ <th style="text-align:right;">${t("tax.col.tokens") || "Tokens"}</th>
4354
+ <th style="text-align:right;">${t("tax.col.cpt") || "Chars/tok"}</th>
4355
+ <th style="text-align:right;">${t("tax.col.ratio") || "Ratio"}</th>
4356
+ </tr></thead>
4357
+ <tbody>${rows}</tbody>
4358
+ </table>
4359
+ <p class="recipe-desc subtle" style="font-size:0.82em;margin-top:1em;">
4360
+ ${t("tax.attribution") || "Tokenizers via"}
4361
+ <a href="https://github.com/huggingface/transformers.js" target="_blank" rel="noopener noreferrer">@huggingface/transformers</a>
4362
+ (browser BPE runtime).
4363
+ ${t("tax.attribution.privacy") || "Text is tokenized locally — never leaves the browser."}
4364
+ </p>
4365
+ </div>`;
4366
+ }
4367
+
4368
+ async function runTaxTokenize() {
4369
+ const text = $("tax-input")?.value || "";
4370
+ if (!text) {
4371
+ $("tax-status").textContent = t("tax.hint.empty") || "⚠ Paste some text first.";
4372
+ return;
4373
+ }
4374
+ $("tax-status").textContent = t("tax.status.loading") || "⏳ Loading transformers.js + tokenizers (first run can take 5-15s)…";
4375
+ $("tax-output").innerHTML = "";
4376
+ const ids = TAX_PRESETS.map(p => p.id);
4377
+ try {
4378
+ const t0 = Date.now();
4379
+ const res = await tokenizeAll(ids, text);
4380
+ const ms = Date.now() - t0;
4381
+ $("tax-output").innerHTML = renderTaxResult(res, TAX_PRESETS);
4382
+ const okN = res.results.filter(r => r.ok).length;
4383
+ $("tax-status").textContent = tFmt("tax.status.done", {
4384
+ n: okN, total: ids.length, ms,
4385
+ }) || `✅ ${okN}/${ids.length} tokenizers ran in ${ms}ms`;
4386
+ } catch (e) {
4387
+ $("tax-status").textContent = `❌ ${e.message || e}`;
4388
+ }
4389
+ }
4390
+
4391
+ $("tax-tokenize-btn")?.addEventListener("click", runTaxTokenize);
4392
+ $("tax-sample-en-btn")?.addEventListener("click", () => {
4393
+ $("tax-input").value = TAX_SAMPLES.english;
4394
+ runTaxTokenize();
4395
+ });
4396
+ $("tax-sample-zh-btn")?.addEventListener("click", () => {
4397
+ $("tax-input").value = TAX_SAMPLES.chinese;
4398
+ runTaxTokenize();
4399
+ });
4400
+ $("tax-sample-ar-btn")?.addEventListener("click", () => {
4401
+ $("tax-input").value = TAX_SAMPLES.arabic;
4402
+ runTaxTokenize();
4403
+ });
4404
+ $("tax-sample-mixed-btn")?.addEventListener("click", () => {
4405
+ $("tax-input").value = TAX_SAMPLES.mixed;
4406
+ runTaxTokenize();
4407
+ });
4408
+ $("tax-sample-code-btn")?.addEventListener("click", () => {
4409
+ $("tax-input").value = TAX_SAMPLES.code;
4410
+ runTaxTokenize();
4411
+ });
4412
+
4413
  // ════════════════��═══════════════════════════════════════════════════
4414
  // Bootstrap
4415
  // ════════════════════════════════════════════════════════════════════
js/tokenizer_tax.js ADDED
@@ -0,0 +1,221 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ // Multilingual Tokenizer Tax Calculator (v0.8.7 anti-bullshit pack #13)
2
+ //
3
+ // Pain: "I bought 1M tokens of API credit for our English chatbot. Then
4
+ // we added Chinese support and the bill 3x'd overnight." The tokenizer
5
+ // tax is real and silently asymmetric across languages. tiktokenizer.
6
+ // vercel.app shows OpenAI's tokenizer; nothing public compares Llama vs
7
+ // Qwen vs Phi vs Gemma vs GPT for the SAME text in the SAME interface.
8
+ //
9
+ // This module loads HuggingFace's transformers.js (browser-side BPE
10
+ // runtime) lazily and tokenizes user-pasted text against a preset list
11
+ // of open-weight tokenizers. The output is REAL per-tokenizer token
12
+ // counts plus the cost asymmetry ratio (vs the user's chosen baseline).
13
+ //
14
+ // Pure logic + lazy CDN import. Codes/params only; main.js renders i18n.
15
+
16
+ // =============================================================================
17
+ // transformers.js lazy loader
18
+ // =============================================================================
19
+ //
20
+ // Pinned 3.x major because the API surface (AutoTokenizer.from_pretrained,
21
+ // .encode) is stable. Loaded from jsdelivr CDN — same pattern used
22
+ // across HF Spaces. ~3 MB compressed bundle, cached aggressively after
23
+ // first load.
24
+
25
+ const TRANSFORMERS_CDN_URL = "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.2/dist/transformers.min.js";
26
+
27
+ let _autoTokenizer = null;
28
+ let _loadPromise = null;
29
+
30
+ async function loadTransformersJs() {
31
+ if (_autoTokenizer) return _autoTokenizer;
32
+ if (_loadPromise) return _loadPromise;
33
+ _loadPromise = (async () => {
34
+ const mod = await import(TRANSFORMERS_CDN_URL);
35
+ _autoTokenizer = mod.AutoTokenizer;
36
+ return _autoTokenizer;
37
+ })();
38
+ return _loadPromise;
39
+ }
40
+
41
+ // =============================================================================
42
+ // Per-tokenizer cache (avoid re-downloading tokenizer.json on every encode)
43
+ // =============================================================================
44
+
45
+ const _tokenizerCache = new Map();
46
+
47
+ async function loadTokenizer(modelId) {
48
+ if (_tokenizerCache.has(modelId)) return _tokenizerCache.get(modelId);
49
+ const AT = await loadTransformersJs();
50
+ const tok = await AT.from_pretrained(modelId);
51
+ _tokenizerCache.set(modelId, tok);
52
+ return tok;
53
+ }
54
+
55
+ // =============================================================================
56
+ // Public: tokenize one model
57
+ // =============================================================================
58
+
59
+ export async function tokenizeWithModel(modelId, text) {
60
+ if (typeof text !== "string") {
61
+ return { ok: false, modelId, error: "invalid_input" };
62
+ }
63
+ try {
64
+ const tok = await loadTokenizer(modelId);
65
+ // transformers.js returns Int32Array | number[]. Use .length for count.
66
+ const ids = await tok.encode(text);
67
+ return { ok: true, modelId, token_count: ids.length };
68
+ } catch (e) {
69
+ return {
70
+ ok: false,
71
+ modelId,
72
+ error: classifyTokenizerError(e),
73
+ raw: String(e?.message || e).slice(0, 200),
74
+ };
75
+ }
76
+ }
77
+
78
+ function classifyTokenizerError(e) {
79
+ const msg = String(e?.message || e).toLowerCase();
80
+ if (msg.includes("401") || msg.includes("403") || msg.includes("gated")) return "gated";
81
+ if (msg.includes("404") || msg.includes("not found")) return "not_found";
82
+ if (msg.includes("timeout") || msg.includes("aborted")) return "timeout";
83
+ if (msg.includes("network") || msg.includes("failed to fetch")) return "network";
84
+ return "fetch_failed";
85
+ }
86
+
87
+ // =============================================================================
88
+ // Public: tokenize many models in parallel + compute ratios
89
+ // =============================================================================
90
+
91
+ export async function tokenizeAll(modelIds, text, baseline_idx = 0) {
92
+ if (!Array.isArray(modelIds) || modelIds.length === 0 || typeof text !== "string") {
93
+ return { code: "empty_input", results: [], baseline: null };
94
+ }
95
+ const results = await Promise.all(
96
+ modelIds.map(id => tokenizeWithModel(id, text))
97
+ );
98
+ const okResults = results.filter(r => r.ok);
99
+ if (okResults.length === 0) {
100
+ return { code: "all_failed", results, baseline: null };
101
+ }
102
+ // Baseline: first OK tokenizer, or the user-specified index if it's OK.
103
+ let baseline = okResults[0];
104
+ if (baseline_idx >= 0 && baseline_idx < results.length && results[baseline_idx].ok) {
105
+ baseline = results[baseline_idx];
106
+ }
107
+ // Stamp ratio vs baseline + chars-per-token for each.
108
+ const charCount = text.length;
109
+ const byteCount = new TextEncoder().encode(text).length;
110
+ for (const r of results) {
111
+ if (!r.ok) continue;
112
+ r.chars_per_token = r.token_count > 0 ? charCount / r.token_count : null;
113
+ r.bytes_per_token = r.token_count > 0 ? byteCount / r.token_count : null;
114
+ r.ratio_vs_baseline = baseline.token_count > 0
115
+ ? r.token_count / baseline.token_count
116
+ : null;
117
+ }
118
+ return {
119
+ code: "ok",
120
+ results,
121
+ baseline_id: baseline.modelId,
122
+ baseline_count: baseline.token_count,
123
+ chars: charCount,
124
+ bytes: byteCount,
125
+ };
126
+ }
127
+
128
+ // =============================================================================
129
+ // Language detection — Unicode block analysis (no external deps)
130
+ // =============================================================================
131
+ //
132
+ // Surfaced as context next to the token counts so users see "this text
133
+ // is 60% CJK, 40% Latin" — explains why one tokenizer is 3× another.
134
+
135
+ const UNICODE_BLOCKS = [
136
+ // [name, regex_class]
137
+ ["latin", /[A-z]/g],
138
+ ["cjk", /[぀-ゟ゠-ヿ一-鿿ヲ-ン]/g],
139
+ ["korean", /[가-힯ᄀ-ᇿ]/g],
140
+ ["arabic", /[؀-ۿݐ-ݿ]/g],
141
+ ["cyrillic", /[Ѐ-ӿ]/g],
142
+ ["devanagari", /[ऀ-ॿ]/g],
143
+ ["thai", /[฀-๿]/g],
144
+ ["greek", /[Ͱ-Ͽ]/g],
145
+ ["hebrew", /[֐-׿]/g],
146
+ ];
147
+
148
+ export function detectLanguageBlocks(text) {
149
+ if (typeof text !== "string" || !text) {
150
+ return { total_chars: 0, blocks: {}, dominant: null };
151
+ }
152
+ const blocks = {};
153
+ for (const [name, re] of UNICODE_BLOCKS) {
154
+ re.lastIndex = 0;
155
+ const m = text.match(re);
156
+ blocks[name] = m ? m.length : 0;
157
+ }
158
+ const total = text.length;
159
+ const dominant = Object.entries(blocks)
160
+ .filter(([, n]) => n > 0)
161
+ .sort((a, b) => b[1] - a[1])[0]?.[0] || null;
162
+ return { total_chars: total, blocks, dominant };
163
+ }
164
+
165
+ // =============================================================================
166
+ // Preset tokenizer list — all open-weight (no HF auth required)
167
+ // =============================================================================
168
+ //
169
+ // Curated for breadth: one per major tokenizer family. For gated
170
+ // originals (Llama, Mistral, Gemma) the unsloth open-mirror is used —
171
+ // tokenizer.json is byte-identical to the original because quantization
172
+ // touches weights, not tokens (see spec-decode docs for the same
173
+ // argument).
174
+
175
+ export const PRESET_TOKENIZERS = [
176
+ {
177
+ id: "Qwen/Qwen2.5-7B-Instruct",
178
+ label: "Qwen2.5",
179
+ family: "Qwen-BPE (152k vocab, CJK-aware)",
180
+ },
181
+ {
182
+ id: "microsoft/Phi-3.5-mini-instruct",
183
+ label: "Phi-3.5",
184
+ family: "tiktoken-style BPE (32k)",
185
+ },
186
+ {
187
+ id: "unsloth/Meta-Llama-3.1-8B-Instruct",
188
+ label: "Llama-3.1",
189
+ family: "Llama-3 BPE (128k)",
190
+ },
191
+ {
192
+ id: "unsloth/gemma-2-9b-it",
193
+ label: "Gemma-2",
194
+ family: "SentencePiece (256k)",
195
+ },
196
+ {
197
+ id: "Xenova/gpt-4",
198
+ label: "GPT-4 (cl100k)",
199
+ family: "OpenAI tiktoken cl100k_base",
200
+ },
201
+ {
202
+ id: "Xenova/claude-tokenizer",
203
+ label: "Claude (approx)",
204
+ family: "Anthropic open approx (community port)",
205
+ },
206
+ ];
207
+
208
+ // Sample texts that demonstrate cost asymmetry — identical meaning
209
+ // across languages so the user sees per-language tax directly.
210
+ export const SAMPLE_TEXTS = {
211
+ english: "The quick brown fox jumps over the lazy dog. " +
212
+ "She sells seashells by the seashore. Pack my box with five dozen liquor jugs.",
213
+ chinese: "敏捷的棕色狐狸跳过了懒狗。她在海边卖海贝壳。请用五打酒壶装满我的箱子。" +
214
+ "中文用字符表示词义,所以一段文字所需的字符数远少于英文。",
215
+ arabic: "الثعلب البني السريع يقفز فوق الكلب الكسول. " +
216
+ "تبيع أصدافًا بحرية على شاطئ البحر. عبئ صندوقي بخمسين إبريقًا من الخمر.",
217
+ mixed: "Hello world! 你好世界 مرحبا بالعالم Привет мир नमस्ते दुनिया",
218
+ code: "def quick_brown_fox(jumps_over: int) -> str:\n" +
219
+ " return f'The fox jumped {jumps_over} times'\n\n" +
220
+ "for i in range(10):\n print(quick_brown_fox(i))",
221
+ };