karlexmarin Claude Opus 4.7 (1M context) committed on
Commit a499fd5 · 1 Parent(s): 3d389cc

v0.8.5 Speculative-Decode Compatibility Checker — anti-bullshit pack #11

Speculative decoding (vLLM, SGLang, llama.cpp, transformers
`assistant_model`) requires the draft and target model to share an
EXACT vocabulary. If token IDs disagree, the target's verifier rejects
nearly every draft token: you pay BOTH compute costs AND get WORSE
throughput than baseline. The output stays correct (just slower), so
the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 /
#12488 all surface variants of this.
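
As a rough illustration of the check this motivates, the sketch below compares two token-to-id maps the way a compatibility checker might. It is not the code in `js/spec_decode_compat.js`: `compareVocabs` and its return shape are hypothetical, and the inputs are assumed to be plain objects shaped like the `model.vocab` map of a BPE `tokenizer.json`.

```javascript
// Hypothetical helper for illustration; not taken from js/spec_decode_compat.js.
// Each argument is assumed to be a plain { token: id } object.
function compareVocabs(targetVocab, draftVocab) {
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
  const targetSize = Object.keys(targetVocab).length;
  const draftSize = Object.keys(draftVocab).length;

  // Walk the smaller map and look each token up in the larger one.
  const [small, large] = draftSize <= targetSize
    ? [draftVocab, targetVocab]
    : [targetVocab, draftVocab];

  let checked = 0;
  let matched = 0;
  let firstMismatch = null;
  for (const [token, id] of Object.entries(small)) {
    checked += 1;
    if (has(large, token) && large[token] === id) {
      matched += 1;
    } else if (firstMismatch === null) {
      // Remember the first disagreement so a report can show a concrete example.
      firstMismatch = {
        token,
        smallId: id,
        largeId: has(large, token) ? large[token] : null,
      };
    }
  }

  return {
    sizesEqual: targetSize === draftSize,
    checked,
    matched,
    matchRatio: checked === 0 ? 0 : matched / checked,
    firstMismatch,
  };
}
```

A ratio of 1 with equal sizes corresponds to the "vocabs match" case; anything lower means at least one token maps to a different id on the two sides.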

🔬 Spec-Decode (19th mode):
- Two HF model id inputs with autocomplete (reuses v0.7.4 infra)
- Async fetch `tokenizer.json` from HF Hub for both ids; falls back
to `tokenizer_config.json` when the canonical artifact is missing
- Compares: tokenizer type (BPE/Unigram/WordPiece), vocab size,
full token→id sample on the smaller side, special tokens
(BOS/EOS/PAD/UNK), added tokens (chat-template extensions)
- Speedup band based on param ratio (parsed from id "8B"/"72B"
hints OR derived from `config.json` hidden_size × layers × vocab):
low (α=0.50), expected (α=0.70), high (α=0.85), capped at 3.5x
- 9 verdict codes: compatible / compatible_with_caveats /
partial_compatible / type_mismatch / vocab_size_mismatch /
incompatible / fetch_failed / identical_models / missing_input
- Per-side fetch error breakdown (gated/private/not_found/timeout/
parse_failed/network) so users debug the right input
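
The speedup band can be sketched with the walltime improvement factor from Leviathan et al. 2022, (1 − α^(γ+1)) / ((1 − α)(γc + 1)). This is a minimal sketch under stated assumptions, not the formula this commit ships: the cost ratio c is approximated by the draft/target parameter ratio, the lookahead γ defaults to a hypothetical 4, and the function names are made up for illustration. Only the three α values and the 3.5x cap come from the list above.

```javascript
// Illustrative only; the formula actually used in js/spec_decode_compat.js is not shown here.
// alpha: draft acceptance rate (must be < 1). costRatio: draft cost / target cost.
function expectedSpeedup(alpha, costRatio, lookahead = 4) {
  // Expected tokens produced per target verification pass.
  const expectedTokens = (1 - Math.pow(alpha, lookahead + 1)) / (1 - alpha);
  // Each pass costs `lookahead` draft steps plus one target step.
  return expectedTokens / (lookahead * costRatio + 1);
}

function speedupBand(targetParams, draftParams, lookahead = 4, cap = 3.5) {
  // A draft that is not smaller than the target cannot pay for itself.
  if (!(draftParams < targetParams)) return null;
  // Assumption: per-token cost scales with parameter count.
  const costRatio = draftParams / targetParams;
  const band = {};
  for (const [label, alpha] of [["low", 0.5], ["expected", 0.7], ["high", 0.85]]) {
    band[label] = Math.min(cap, expectedSpeedup(alpha, costRatio, lookahead));
  }
  return band;
}
```

Calling `speedupBand(70e9, 8e9)` returns an object with `low`, `expected`, and `high` entries; the numbers are only as good as the cost-ratio assumption.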

Pure logic + async fetch in `js/spec_decode_compat.js`. Comment +
string-literal stripping is unnecessary here (input is two ids, not
code). 55 i18n keys × 4 langs (EN/ES/FR/ZH) = 220 keys, parity clean.
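
A key-parity check like the one reported above could look roughly like this. It assumes the translations are one flat key-to-string table per language; `findMissingKeys` is a hypothetical name, not a function from `js/i18n.js`.

```javascript
// Hypothetical parity check; assumes { lang: { key: string } } with flat keys.
function findMissingKeys(translations) {
  // Collect the union of keys across every language table.
  const allKeys = new Set();
  for (const table of Object.values(translations)) {
    for (const key of Object.keys(table)) allKeys.add(key);
  }
  // Report, per language, the keys that exist elsewhere but not here.
  const missing = {};
  for (const [lang, table] of Object.entries(translations)) {
    const gaps = [...allKeys].filter(
      (key) => !Object.prototype.hasOwnProperty.call(table, key)
    );
    if (gaps.length > 0) missing[lang] = gaps;
  }
  return missing;
}
```

An empty result object means every language defines the same key set.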

Solutions Hub: NEW `speculative_decode_mismatch` entry under setup
category, mapped to this mode + 4 external refs (vLLM docs, vLLM
#4570, transformers assistant_model, Leviathan et al. 2022).
HF autocomplete known-id list extended with `spec-target-id` /
`spec-draft-id`. Help modal v0.8.5 entry + Inventory + Setup tile.

Verified: 7/7 logic cases (vocab compare identical / cross-family /
type-mismatch + 4 param-hint parses) + 220/220 i18n parity + headless
e2e (tab/section/inputs/empty-hint/identical-hint render). 20 mode
tabs total.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
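
The "8B"/"72B" parameter hints mentioned in the commit message could be parsed from a model id along these lines. The regular expression and the `parseParamHint` name are assumptions for illustration; the commit's own parser may accept different spellings.

```javascript
// Hypothetical parser; returns a parameter count or null when the id has no size hint.
function parseParamHint(modelId) {
  // Match a number followed by B or M that sits between separators, e.g. "-70B-" or "-0.5B".
  const match = /(?:^|[-_/])(\d+(?:\.\d+)?)([bm])(?=$|[-_.])/i.exec(modelId);
  if (!match) return null;
  const value = Number(match[1]);
  const scale = match[2].toLowerCase() === "b" ? 1e9 : 1e6;
  return value * scale;
}
```

Version fragments such as the "3.1" in "Llama-3.1" are skipped because they are not followed by a size suffix; ids without a recognizable hint return null, so a caller can fall back to `config.json`.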

Files changed (6)
  1. data/solutions_hub.json +14 -0
  2. index.html +33 -0
  3. js/hf_autocomplete.js +4 -1
  4. js/i18n.js +228 -0
  5. js/main.js +177 -1
  6. js/spec_decode_compat.js +372 -0
data/solutions_hub.json CHANGED
@@ -218,6 +218,20 @@
  "best_for": "Generating the exact `python cli/diagnose_model.py` command for your model.",
  "not_for": "Browser-only diagnosis — this mode is a builder, not an executor."
  },
+ {
+ "id": "speculative_decode_mismatch",
+ "category": "setup",
+ "pain": "Speculative decoding silently rejects all draft tokens when target+draft tokenizer vocabs differ — worse throughput than baseline, no error.",
+ "tafagent_mode": "🔬 Speculative-Decode Compatibility",
+ "external_tools": [
+ {"name": "vLLM speculative decoding docs", "url": "https://docs.vllm.ai/en/latest/serving/speculative_decoding.html", "type": "docs"},
+ {"name": "vLLM #4570 — vocab mismatch report", "url": "https://github.com/vllm-project/vllm/issues/4570", "type": "issue"},
+ {"name": "transformers assistant_model docs", "url": "https://huggingface.co/docs/transformers/main/en/llm_optims#speculative-decoding", "type": "docs"},
+ {"name": "Leviathan et al. 2022 — speculative-decoding paper", "url": "https://arxiv.org/abs/2211.17192", "type": "paper"}
+ ],
+ "best_for": "Verifying a (target, draft) pair before launching a vLLM/SGLang cluster with spec-dec enabled. Surfaces vocab-size, type, and special-token mismatches.",
+ "not_for": "Measuring real-world acceptance rate α — that requires running both models on your domain. The tool only predicts a band based on param ratio."
+ },
  {
  "id": "peft_loading",
  "category": "training",
index.html CHANGED
@@ -225,6 +225,9 @@
  <p><strong data-i18n="help.v084.cache.title">🔁 Prompt-Cache Diff Predictor</strong></p>
  <p data-i18n="help.v084.cache.body">Provider prompt caches each have different rules: Anthropic's <code>cache_control</code> breaks at the first token diff in the marked prefix; OpenAI auto-caches prefixes ≥1024 tokens; Gemini context caches require ≥32K tokens. A misplaced edit silently 10x's your bill — the API never warns you, and the cost only shows up on the next invoice. Paste old + new prompt, the predictor finds the longest common prefix, estimates tokens with three tokenizer profiles (English / code / CJK), and shows per-provider hit ratio + $ delta vs no-cache for Claude Opus/Sonnet/Haiku, GPT-5/mini, and Gemini 2.5 Pro. <em>Use case</em>: 'I tweaked the system prompt and the bill jumped — what broke?' → paste both prompts, see exactly which provider stopped caching.</p>
 
+ <p><strong data-i18n="help.v085.speculative.title">🔬 Speculative-Decode Compatibility</strong></p>
+ <p data-i18n="help.v085.speculative.body">Speculative decoding only works if target and draft share the exact same vocabulary. Mismatched vocabs cause every draft token to be rejected — you pay BOTH compute costs and get worse throughput than baseline. Worse, the system still emits correct output (just slower), so the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 / #12488 all surface variants. This tool fetches `tokenizer.json` from HF Hub for both model ids, compares tokenizer type, vocab size, full token→id map, special tokens, and added tokens, then estimates a speedup band based on param ratio and typical α=0.5/0.7/0.85 acceptance rates. <em>Use case</em>: before you launch a vLLM cluster with spec-dec enabled, verify the pair is actually compatible.</p>
+
  <p><strong data-i18n="help.v081.hub.title">🧭 Solutions Hub</strong></p>
  <p data-i18n="help.v081.hub.body">tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'</p>
 
@@ -340,6 +343,7 @@
  <li data-i18n="inv.v082.cot"><strong>📋 JSON CoT</strong> — lints structured-output schemas for the answer-before-reasoning anti-pattern that silently breaks Chain-of-Thought.</li>
  <li data-i18n="inv.v083.peft"><strong>🔧 PEFT Lint</strong> — catches the silent <code>get_peft_model</code> base-load (peft #2115) + QLoRA order + target_modules / arch mismatch.</li>
  <li data-i18n="inv.v084.cache"><strong>🔁 Cache Diff</strong> — predicts whether a prompt edit invalidated the provider's prompt cache. Per-provider hit ratio + $ delta.</li>
+ <li data-i18n="inv.v085.speculative"><strong>🔬 Spec-Decode</strong> — verifies tokenizer vocab compatibility between target + draft before you ship speculative decoding (the bug that gives WORSE throughput silently).</li>
  <li data-i18n="inv.v081.hub"><strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.</li>
  </ul>
  </details>
@@ -414,6 +418,7 @@
  <button data-mode-link="cot" data-i18n="modes.cot">📋 JSON CoT</button>
  <button data-mode-link="peft" data-i18n="modes.peft">🔧 PEFT Lint</button>
  <button data-mode-link="cache" data-i18n="modes.cache">🔁 Cache Diff</button>
+ <button data-mode-link="speculative" data-i18n="modes.speculative">🔬 Spec-Decode</button>
  </div>
  </div>
  <div class="task-tile">
@@ -473,6 +478,7 @@
  <button class="mode-btn" data-mode="cot" role="tab" aria-selected="false" data-i18n="modes.cot">📋 JSON CoT</button>
  <button class="mode-btn" data-mode="peft" role="tab" aria-selected="false" data-i18n="modes.peft">🔧 PEFT Lint</button>
  <button class="mode-btn" data-mode="cache" role="tab" aria-selected="false" data-i18n="modes.cache">🔁 Cache Diff</button>
+ <button class="mode-btn" data-mode="speculative" role="tab" aria-selected="false" data-i18n="modes.speculative">🔬 Spec-Decode</button>
  <button class="mode-btn" data-mode="hub" role="tab" aria-selected="false" data-i18n="modes.hub">🧭 Solutions</button>
  </div>
  <p id="mode-desc" class="recipe-desc" data-i18n="modes.desc">
@@ -1107,6 +1113,33 @@
  <div id="cache-output" style="margin-top: 1em;"></div>
  </section>
 
+ <!-- Speculative-Decode Compatibility Checker (mode=speculative, v0.8.5 #11) -->
+ <section id="speculative-section" style="display:none;">
+ <h2><span data-i18n="speculative.title">🔬 Speculative-Decode Compatibility</span>
+ <span class="info"><span class="tooltip" data-i18n="speculative.tip">
+ <strong>Why this matters</strong>: speculative decoding (vLLM, SGLang, llama.cpp, transformers) requires the draft and target model to share an EXACT vocabulary. Any token-id disagreement means the target rejects every draft token — you pay BOTH compute costs and get WORSE throughput than baseline. The system reports nominal output (just slower), so the bug is invisible in unit tests. This tool fetches `tokenizer.json` from HF Hub for both ids and compares.
+ </span></span>
+ </h2>
+ <p class="recipe-desc" data-i18n="speculative.desc">
+ <strong>Don't ship spec-dec with mismatched vocabs.</strong> Paste target + draft HF model ids → tool fetches tokenizers, compares vocab type, size, sampled token-ids, special tokens, added tokens → verdict + speedup estimate.
+ </p>
+ <div class="form-row">
+ <label for="spec-target-id" data-i18n="speculative.target_label">Target (large) model id:</label>
+ <input type="text" id="spec-target-id" placeholder="meta-llama/Llama-3.1-70B-Instruct" style="flex:1;" />
+ </div>
+ <div class="form-row">
+ <label for="spec-draft-id" data-i18n="speculative.draft_label">Draft (small) model id:</label>
+ <input type="text" id="spec-draft-id" placeholder="meta-llama/Llama-3.1-8B-Instruct" style="flex:1;" />
+ </div>
+ <div class="form-row">
+ <button type="button" id="spec-check-btn" data-i18n="speculative.check_btn">🔍 Check compatibility</button>
+ <button type="button" id="spec-example-good-btn" class="secondary" data-i18n="speculative.example_good_btn">↳ Example: Llama-3.1 8B/70B (good)</button>
+ <button type="button" id="spec-example-bad-btn" class="secondary" data-i18n="speculative.example_bad_btn">↳ Example: cross-family (bad)</button>
+ </div>
+ <p id="spec-status" class="recipe-desc" style="font-size:0.92em;"></p>
+ <div id="spec-output" style="margin-top: 1em;"></div>
+ </section>
+
  <section id="hub-section" style="display:none;">
  <h2><span data-i18n="hub.title">🧭 Solutions Hub</span>
  <span class="info"><span class="tooltip" data-i18n="hub.tip">
js/hf_autocomplete.js CHANGED
@@ -200,7 +200,10 @@ export function attachHfAutocomplete(inputEl, options = {}) {
  // Convenience: attach to all known HF-id inputs in TAF Agent.
  // NIAH was added in v0.7.6 — keep this list in sync when adding new modes.
  export function attachAllHfAutocompletes() {
- const ids = ["hf-id", "profile-hf-id", "unmask-id", "template-id", "quant-id", "niah-id"];
+ const ids = [
+ "hf-id", "profile-hf-id", "unmask-id", "template-id", "quant-id", "niah-id",
+ "spec-target-id", "spec-draft-id",
+ ];
  for (const id of ids) {
  const el = document.getElementById(id);
  if (el) attachHfAutocomplete(el);
js/i18n.js CHANGED
@@ -637,6 +637,63 @@ export const TRANSLATIONS = {
637
  "help.v084.cache.title": "🔁 Prompt-Cache Diff Predictor",
638
  "help.v084.cache.body": "Provider prompt caches each have different rules: Anthropic's <code>cache_control</code> breaks at the first token diff in the marked prefix; OpenAI auto-caches prefixes ≥1024 tokens; Gemini context caches require ≥32K tokens. A misplaced edit silently 10x's your bill — the API never warns you, and the cost only shows up on the next invoice. Paste old + new prompt, the predictor finds the longest common prefix, estimates tokens with three tokenizer profiles (English / code / CJK), and shows per-provider hit ratio + $ delta vs no-cache for Claude Opus/Sonnet/Haiku, GPT-5/mini, and Gemini 2.5 Pro. <em>Use case</em>: 'I tweaked the system prompt and the bill jumped — what broke?' → paste both prompts, see exactly which provider stopped caching.",
639
 
640
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.",
641
  "help.v081.hub.title": "🧭 Solutions Hub",
642
  "help.v081.hub.body": "tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'",
@@ -1733,6 +1790,63 @@ export const TRANSLATIONS = {
1733
  "help.v084.cache.title": "🔁 Predictor de Diff de Prompt-Cache",
1734
  "help.v084.cache.body": "Las prompt caches de cada proveedor tienen reglas distintas: el <code>cache_control</code> de Anthropic se rompe al primer token diferente del prefijo marcado; OpenAI auto-cachea prefijos ≥1024 tokens; las context caches de Gemini requieren ≥32K tokens. Una edición mal puesta silenciosamente 10x tu factura — la API no avisa, y el coste solo aparece en la siguiente factura. Pega prompt viejo + nuevo, el predictor halla el prefijo común más largo, estima tokens con tres perfiles de tokenizer (inglés / código / CJK), y muestra hit ratio por proveedor + delta $ vs sin caché para Claude Opus/Sonnet/Haiku, GPT-5/mini, y Gemini 2.5 Pro. <em>Caso de uso</em>: 'Tweaké el system prompt y la factura saltó — ¿qué se rompió?' → pega ambos prompts, ve exactamente qué proveedor dejó de cachear.",
1735
 
1736
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — cada pain documentado mapeado a un mode tafagent o herramienta externa curada. No reinventes — encuentra.",
1737
  "help.v081.hub.title": "🧭 Solutions Hub",
1738
  "help.v081.hub.body": "tafagent como integrador, no silo. 30+ pains en 7 categorías (eval reliability · diagnósticos · setup · training · retrieval · multimodal · observability), cada uno mapeado a (a) el mode tafagent que lo resuelve, si existe, y (b) las herramientas externas best-of-breed que la comunidad ya usa (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Caja de búsqueda matchea pain, scenario, y nombre de herramienta. <em>Caso de uso</em>: 'tengo problema X — ¿lo resuelve tafagent, y si no, quién?'",
@@ -2693,6 +2807,63 @@ export const TRANSLATIONS = {
2693
  "help.v084.cache.title": "🔁 Prédicteur de Diff Prompt-Cache",
2694
  "help.v084.cache.body": "Les caches prompt de chaque fournisseur ont des règles différentes : le <code>cache_control</code> d'Anthropic casse au premier token différent du préfixe marqué ; OpenAI auto-cache les préfixes ≥1024 tokens ; les context caches Gemini requièrent ≥32K tokens. Une édition mal placée 10x silencieusement votre facture — l'API ne prévient pas, et le coût n'apparaît qu'à la facture suivante. Collez ancien + nouveau prompt, le prédicteur trouve le plus long préfixe commun, estime les tokens avec trois profils de tokenizer (anglais / code / CJK), et montre le taux de hit par fournisseur + delta $ vs sans cache pour Claude Opus/Sonnet/Haiku, GPT-5/mini, et Gemini 2.5 Pro. <em>Cas d'usage</em> : 'J'ai modifié le system prompt et la facture a sauté — qu'est-ce qui a cassé ?' → collez les deux prompts, voyez exactement quel fournisseur a arrêté de cacher.",
2695
 
2696
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — chaque pain documenté mappé à un mode tafagent ou outil externe curé. Ne réinventez pas — trouvez.",
2697
  "help.v081.hub.title": "🧭 Solutions Hub",
2698
  "help.v081.hub.body": "tafagent comme intégrateur, pas silo. 30+ pains à travers 7 catégories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), chacun mappé à (a) le mode tafagent qui le résout, s'il existe, et (b) les outils externes best-of-breed que la communauté utilise déjà (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). La barre de recherche matche pain, scénario, et nom d'outil. <em>Cas d'usage</em> : 'j'ai le problème X — tafagent le résout-il, et sinon, qui ?'",
@@ -3653,6 +3824,63 @@ export const TRANSLATIONS = {
3653
  "help.v084.cache.title": "🔁 Prompt-Cache 差异预测器",
3654
  "help.v084.cache.body": "每个提供商的 prompt cache 有不同规则:Anthropic 的 <code>cache_control</code> 在标记前缀的第一个 token 差异处中断;OpenAI 自动缓存 ≥1024 token 的前缀;Gemini context cache 需要 ≥32K token。位置不当的编辑会悄悄使你的账单 10 倍——API 不会警告,成本只在下张账单上出现。粘贴新旧 prompt,预测器找到最长公共前缀,用三种 tokenizer 配置(英语/代码/CJK)估算 token,并显示每个提供商的命中率 + 与无缓存的 $ 差额,包括 Claude Opus/Sonnet/Haiku、GPT-5/mini 和 Gemini 2.5 Pro。<em>用例</em>:『我调整了 system prompt 后账单暴涨——什么坏了?』→ 粘贴两个 prompt,看到底哪个提供商停止缓存。",
3655
 
3656
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — 每个文档化的问题都映射到一个 tafagent 模式或精选外部工具。别重复发明 — 去找。",
3657
  "help.v081.hub.title": "🧭 Solutions Hub",
3658
  "help.v081.hub.body": "tafagent 作为集成者而非孤岛。30+ 问题跨 7 类别(评估可靠性 · 诊断 · 设置 · 训练 · 检索 · 多模态 · 可观测性),每个映射到(a)解决它的 tafagent 模式(若存在),以及(b)社区已信任的最佳外部工具(RAGAS、MTEB、HELM、MCP Schema Validator、llm-stats、llguidance、GlitchMiner 等)。搜索框匹配 pain、场景和工具名称。<em>用例</em>:'我有问题 X — tafagent 解决它吗,如果不,谁解决?'",
 
637
  "help.v084.cache.title": "🔁 Prompt-Cache Diff Predictor",
638
  "help.v084.cache.body": "Provider prompt caches each have different rules: Anthropic's <code>cache_control</code> breaks at the first token diff in the marked prefix; OpenAI auto-caches prefixes ≥1024 tokens; Gemini context caches require ≥32K tokens. A misplaced edit silently 10x's your bill — the API never warns you, and the cost only shows up on the next invoice. Paste old + new prompt, the predictor finds the longest common prefix, estimates tokens with three tokenizer profiles (English / code / CJK), and shows per-provider hit ratio + $ delta vs no-cache for Claude Opus/Sonnet/Haiku, GPT-5/mini, and Gemini 2.5 Pro. <em>Use case</em>: 'I tweaked the system prompt and the bill jumped — what broke?' → paste both prompts, see exactly which provider stopped caching.",
639
 
640
+ // v0.8.5 — anti-bullshit pack #11: Speculative-Decode Compatibility
641
+ "modes.speculative": "🔬 Spec-Decode",
642
+ "mode_desc.speculative": "Fetches `tokenizer.json` from HF Hub for two model ids and verifies vocab compatibility before you wire up speculative decoding. Catches the silent-mismatch bug that wastes draft compute.",
643
+ "speculative.title": "🔬 Speculative-Decode Compatibility",
644
+ "speculative.tip": "Speculative decoding (vLLM, SGLang, llama.cpp, transformers) requires the draft and target model to share an EXACT vocabulary. Any token-id disagreement means the target rejects every draft token — you pay BOTH compute costs and get WORSE throughput than baseline. The system reports nominal output (just slower), so the bug is invisible in unit tests. This tool fetches `tokenizer.json` from HF Hub for both ids and compares.",
645
+ "speculative.desc": "<strong>Don't ship spec-dec with mismatched vocabs.</strong> Paste target + draft HF model ids → tool fetches tokenizers, compares vocab type, size, sampled token-ids, special tokens, added tokens → verdict + speedup estimate.",
646
+ "speculative.target_label": "Target (large) model id:",
647
+ "speculative.draft_label": "Draft (small) model id:",
648
+ "speculative.target_label_short": "target",
649
+ "speculative.draft_label_short": "draft",
650
+ "speculative.check_btn": "🔍 Check compatibility",
651
+ "speculative.example_good_btn":"↳ Example: Llama-3.1 8B/70B (good)",
652
+ "speculative.example_bad_btn": "↳ Example: cross-family (bad)",
653
+ "speculative.status.fetching": "🔄 Fetching tokenizer.json from HF Hub for both models…",
654
+ "speculative.status.done": "✅ {verdict}",
655
+ "speculative.status.error": "❌ Error",
656
+ "speculative.type_mismatch_note": "tokenizer types differ; spec-dec impossible",
657
+ "speculative.vocab_size": "Vocab size",
658
+ "speculative.size_diff": "differ — every reused id is a misalignment",
659
+ "speculative.sampled": "Token-id sample match",
660
+ "speculative.first_mismatch": "First mismatch",
661
+ "speculative.special_diff": "Special-token differences",
662
+ "speculative.added_diff": "Added-token differences",
663
+ "speculative.added_diff_more": "+ more …",
664
+ "speculative.speedup.title": "Estimated speedup band",
665
+ "speculative.speedup.params": "target {target} / draft {draft} (param ratio {ratio})",
666
+ "speculative.speedup.low": "Low (α=0.50)",
667
+ "speculative.speedup.expected":"Expected (α=0.70)",
668
+ "speculative.speedup.high": "High (α=0.85)",
669
+ "speculative.speedup.disclaimer": "α = draft acceptance rate. Real speedup depends on prompt domain, lookahead K, and engine overhead. Bands assume ideal verifier batching.",
670
+ "speculative.speedup.draft_not_smaller": "Draft is not smaller than target — spec-dec is misuse here.",
671
+ "speculative.attribution": "Refs:",
672
+ "speculative.side.target": "Target",
673
+ "speculative.side.draft": "Draft",
674
+ "speculative.fetch_error.missing_model_id": "missing model id",
675
+ "speculative.fetch_error.gated_or_private": "model is gated or private — can't fetch tokenizer without auth",
676
+ "speculative.fetch_error.not_found": "model id not found on HF Hub",
677
+ "speculative.fetch_error.fetch_failed": "fetch failed (HTTP error)",
678
+ "speculative.fetch_error.parse_failed": "JSON parse failed (file malformed)",
679
+ "speculative.fetch_error.timeout": "timeout (>8s, large tokenizer or slow connection)",
680
+ "speculative.fetch_error.network": "network error",
681
+ "speculative.fetch_error.hint": "Check the model id spelling. For gated models you'll need to view the tokenizer file via your HF account — this tool can't auth.",
682
+ "speculative.hint.missing_input": "Enter both target and draft model ids, then Check.",
683
+ "speculative.hint.identical_models": "Target and draft are the same model — spec-dec is a no-op (and wasteful).",
684
+ "speculative.verdict.compatible": "✅ Compatible — vocabs match",
685
+ "speculative.verdict.compatible_with_caveats": "✅ Compatible — but special/added tokens differ (review)",
686
+ "speculative.verdict.partial_compatible": "⚠ Partial match (95-99.9% of sampled ids)",
687
+ "speculative.verdict.type_mismatch": "❌ Tokenizer types differ — spec-dec impossible",
688
+ "speculative.verdict.vocab_size_mismatch": "❌ Vocab sizes differ — id space misaligned",
689
+ "speculative.verdict.incompatible": "❌ Incompatible — too many id mismatches",
690
+ "speculative.verdict.fetch_failed": "ℹ Couldn't fetch tokenizer",
691
+ "speculative.verdict.identical_models": "ℹ Identical models — spec-dec is a no-op",
692
+ "speculative.verdict.missing_input": "ℹ Enter both ids",
693
+ "inv.v085.speculative": "<strong>🔬 Spec-Decode</strong> — verifies tokenizer vocab compatibility between target + draft before you ship speculative decoding (the bug that gives WORSE throughput silently).",
694
+ "help.v085.speculative.title": "🔬 Speculative-Decode Compatibility",
695
+ "help.v085.speculative.body": "Speculative decoding only works if target and draft share the exact same vocabulary. Mismatched vocabs cause every draft token to be rejected — you pay BOTH compute costs and get worse throughput than baseline. Worse, the system still emits correct output (just slower), so the bug is invisible in unit tests. vLLM #4570 / #16757 / #20409 / #12488 all surface variants. This tool fetches `tokenizer.json` from HF Hub for both model ids, compares tokenizer type, vocab size, full token→id map, special tokens, and added tokens, then estimates a speedup band based on param ratio and typical α=0.5/0.7/0.85 acceptance rates. <em>Use case</em>: before you launch a vLLM cluster with spec-dec enabled, verify the pair is actually compatible.",
696
+
697
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — every documented pain mapped to a tafagent mode or curated external tool. Don't reinvent — find.",
698
  "help.v081.hub.title": "🧭 Solutions Hub",
699
  "help.v081.hub.body": "tafagent as integrator, not silo. 30+ pains across 7 categories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), each mapped to (a) the tafagent mode that addresses it, if any, and (b) the best-of-breed external tools the community already trusts (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Search box matches across pain, scenario, and tool name. <em>Use case</em>: 'I have problem X — does tafagent solve it, and if not, who does?'",
 
1790
  "help.v084.cache.title": "🔁 Predictor de Diff de Prompt-Cache",
1791
  "help.v084.cache.body": "Las prompt caches de cada proveedor tienen reglas distintas: el <code>cache_control</code> de Anthropic se rompe al primer token diferente del prefijo marcado; OpenAI auto-cachea prefijos ≥1024 tokens; las context caches de Gemini requieren ≥32K tokens. Una edición mal puesta silenciosamente 10x tu factura — la API no avisa, y el coste solo aparece en la siguiente factura. Pega prompt viejo + nuevo, el predictor halla el prefijo común más largo, estima tokens con tres perfiles de tokenizer (inglés / código / CJK), y muestra hit ratio por proveedor + delta $ vs sin caché para Claude Opus/Sonnet/Haiku, GPT-5/mini, y Gemini 2.5 Pro. <em>Caso de uso</em>: 'Tweaké el system prompt y la factura saltó — ¿qué se rompió?' → pega ambos prompts, ve exactamente qué proveedor dejó de cachear.",
1792
 
1793
+ // v0.8.5 — anti-bullshit pack #11: Speculative-Decode Compatibility
1794
+ "modes.speculative": "🔬 Spec-Decode",
1795
+ "mode_desc.speculative": "Hace fetch del `tokenizer.json` desde HF Hub para dos model ids y verifica compatibilidad de vocab antes de cablear speculative decoding. Atrapa el bug de mismatch silencioso que desperdicia compute del draft.",
1796
+ "speculative.title": "🔬 Compatibilidad de Speculative-Decode",
1797
+ "speculative.tip": "El speculative decoding (vLLM, SGLang, llama.cpp, transformers) requiere que draft y target compartan vocabulario EXACTO. Cualquier desacuerdo de token-id hace que el target rechace cada token del draft — pagas AMBOS computes y obtienes PEOR throughput que baseline. El sistema reporta output nominal (solo más lento), así que el bug es invisible en tests unitarios. Esta tool hace fetch de `tokenizer.json` desde HF Hub para ambos ids y compara.",
1798
+ "speculative.desc": "<strong>No envíes spec-dec con vocabs mismatched.</strong> Pega target + draft model ids → tool hace fetch de tokenizers, compara tipo de vocab, tamaño, token-ids muestreados, special tokens, added tokens → veredicto + estimación de speedup.",
1799
+ "speculative.target_label": "Model id del target (grande):",
1800
+ "speculative.draft_label": "Model id del draft (pequeño):",
1801
+ "speculative.target_label_short": "target",
1802
+ "speculative.draft_label_short": "draft",
1803
+ "speculative.check_btn": "🔍 Verificar compatibilidad",
1804
+ "speculative.example_good_btn":"↳ Ejemplo: Llama-3.1 8B/70B (bueno)",
1805
+ "speculative.example_bad_btn": "↳ Ejemplo: cross-family (malo)",
1806
+ "speculative.status.fetching": "🔄 Haciendo fetch de tokenizer.json desde HF Hub para ambos modelos…",
1807
+ "speculative.status.done": "✅ {verdict}",
1808
+ "speculative.status.error": "❌ Error",
1809
+ "speculative.type_mismatch_note": "tipos de tokenizer difieren; spec-dec imposible",
1810
+ "speculative.vocab_size": "Tamaño del vocab",
1811
+ "speculative.size_diff": "difieren — cada id reusado es un mismatch",
1812
+ "speculative.sampled": "Match de token-id muestreado",
1813
+ "speculative.first_mismatch": "Primer mismatch",
1814
+ "speculative.special_diff": "Diferencias de special tokens",
1815
+ "speculative.added_diff": "Diferencias de added tokens",
1816
+ "speculative.added_diff_more": "+ más …",
1817
+ "speculative.speedup.title": "Banda estimada de speedup",
1818
+ "speculative.speedup.params": "target {target} / draft {draft} (ratio de params {ratio})",
1819
+ "speculative.speedup.low": "Bajo (α=0.50)",
1820
+ "speculative.speedup.expected":"Esperado (α=0.70)",
1821
+ "speculative.speedup.high": "Alto (α=0.85)",
1822
+ "speculative.speedup.disclaimer": "α = tasa de aceptación del draft. El speedup real depende del dominio del prompt, lookahead K, y overhead del engine. Las bandas asumen verifier batching ideal.",
1823
+ "speculative.speedup.draft_not_smaller": "El draft no es más pequeño que el target — spec-dec es mal uso aquí.",
1824
+ "speculative.attribution": "Referencias:",
1825
+ "speculative.side.target": "Target",
1826
+ "speculative.side.draft": "Draft",
1827
+ "speculative.fetch_error.missing_model_id": "falta el model id",
1828
+ "speculative.fetch_error.gated_or_private": "modelo es gated o privado — no se puede hacer fetch del tokenizer sin auth",
1829
+ "speculative.fetch_error.not_found": "model id no encontrado en HF Hub",
1830
+ "speculative.fetch_error.fetch_failed": "fetch falló (error HTTP)",
1831
+ "speculative.fetch_error.parse_failed": "parse JSON falló (archivo malformado)",
1832
+ "speculative.fetch_error.timeout": "timeout (>8s, tokenizer grande o conexión lenta)",
1833
+ "speculative.fetch_error.network": "error de red",
1834
+ "speculative.fetch_error.hint": "Verifica el spelling del model id. Para modelos gated necesitas ver el tokenizer vía tu cuenta HF — esta tool no puede autenticar.",
1835
+ "speculative.hint.missing_input": "Ingresa ambos model ids (target y draft), luego Verificar.",
1836
+ "speculative.hint.identical_models": "Target y draft son el mismo modelo — spec-dec es un no-op (y desperdicio).",
1837
+ "speculative.verdict.compatible": "✅ Compatible — vocabs coinciden",
1838
+ "speculative.verdict.compatible_with_caveats": "✅ Compatible — pero special/added tokens difieren (revisar)",
1839
+ "speculative.verdict.partial_compatible": "⚠ Match parcial (95-99.9% de los ids muestreados)",
1840
+ "speculative.verdict.type_mismatch": "❌ Tipos de tokenizer difieren — spec-dec imposible",
1841
+ "speculative.verdict.vocab_size_mismatch": "❌ Tamaños de vocab difieren — espacio de id desalineado",
1842
+ "speculative.verdict.incompatible": "❌ Incompatibles — demasiados mismatches de id",
1843
+ "speculative.verdict.fetch_failed": "ℹ No se pudo hacer fetch del tokenizer",
1844
+ "speculative.verdict.identical_models": "ℹ Modelos idénticos — spec-dec es un no-op",
1845
+ "speculative.verdict.missing_input": "ℹ Ingresa ambos ids",
1846
+ "inv.v085.speculative": "<strong>🔬 Spec-Decode</strong> — verifica compatibilidad de vocab del tokenizer entre target + draft antes de enviar speculative decoding (el bug que da PEOR throughput silenciosamente).",
1847
+ "help.v085.speculative.title": "🔬 Compatibilidad de Speculative-Decode",
1848
+ "help.v085.speculative.body": "El speculative decoding solo funciona si target y draft comparten exactamente el mismo vocabulario. Vocabs mismatched hacen que cada token del draft sea rechazado: pagas AMBOS computes y obtienes peor throughput que baseline. Peor: el sistema sigue emitiendo output correcto (solo más lento), así que el bug es invisible en tests unitarios. vLLM #4570 / #16757 / #20409 / #12488 muestran variantes. Esta tool hace fetch de `tokenizer.json` desde HF Hub para ambos ids, compara tipo de tokenizer, tamaño de vocab, mapa completo token→id, special tokens, y added tokens, luego estima una banda de speedup basada en ratio de params y tasas típicas α=0.5/0.7/0.85 de aceptación. <em>Caso de uso</em>: antes de lanzar un cluster vLLM con spec-dec habilitado, verifica que el par sea compatible.",
1849
+
1850
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — cada pain documentado mapeado a un mode tafagent o herramienta externa curada. No reinventes — encuentra.",
1851
  "help.v081.hub.title": "🧭 Solutions Hub",
1852
  "help.v081.hub.body": "tafagent como integrador, no silo. 30+ pains en 7 categorías (eval reliability · diagnósticos · setup · training · retrieval · multimodal · observability), cada uno mapeado a (a) el mode tafagent que lo resuelve, si existe, y (b) las herramientas externas best-of-breed que la comunidad ya usa (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). Caja de búsqueda matchea pain, scenario, y nombre de herramienta. <em>Caso de uso</em>: 'tengo problema X — ¿lo resuelve tafagent, y si no, quién?'",
 
2807
  "help.v084.cache.title": "🔁 Prédicteur de Diff Prompt-Cache",
2808
  "help.v084.cache.body": "Les caches prompt de chaque fournisseur ont des règles différentes : le <code>cache_control</code> d'Anthropic casse au premier token différent du préfixe marqué ; OpenAI auto-cache les préfixes ≥1024 tokens ; les context caches Gemini requièrent ≥32K tokens. Une édition mal placée 10x silencieusement votre facture — l'API ne prévient pas, et le coût n'apparaît qu'à la facture suivante. Collez ancien + nouveau prompt, le prédicteur trouve le plus long préfixe commun, estime les tokens avec trois profils de tokenizer (anglais / code / CJK), et montre le taux de hit par fournisseur + delta $ vs sans cache pour Claude Opus/Sonnet/Haiku, GPT-5/mini, et Gemini 2.5 Pro. <em>Cas d'usage</em> : 'J'ai modifié le system prompt et la facture a sauté — qu'est-ce qui a cassé ?' → collez les deux prompts, voyez exactement quel fournisseur a arrêté de cacher.",
2809
 
2810
+ // v0.8.5 — anti-bullshit pack #11: Speculative-Decode Compatibility
2811
+ "modes.speculative": "🔬 Spec-Decode",
2812
+ "mode_desc.speculative": "Récupère `tokenizer.json` depuis HF Hub pour deux model ids et vérifie la compatibilité du vocab avant de configurer le speculative decoding. Attrape le bug de mismatch silencieux qui gaspille le compute du draft.",
2813
+ "speculative.title": "🔬 Compatibilité Speculative-Decode",
2814
+ "speculative.tip": "Le speculative decoding (vLLM, SGLang, llama.cpp, transformers) requiert que draft et target partagent un vocabulaire EXACT. Tout désaccord de token-id fait que le target rejette chaque token du draft — vous payez LES DEUX coûts de compute et obtenez UN PIRE débit que la baseline. Le système rapporte une sortie nominale (juste plus lente), donc le bug est invisible aux tests unitaires. Cet outil récupère `tokenizer.json` depuis HF Hub pour les deux ids et compare.",
2815
+ "speculative.desc": "<strong>Ne déployez pas spec-dec avec des vocabs mismatched.</strong> Collez target + draft model ids → l'outil fetch les tokenizers, compare type de vocab, taille, token-ids échantillonnés, special tokens, added tokens → verdict + estimation de speedup.",
2816
+ "speculative.target_label": "Model id du target (gros) :",
2817
+ "speculative.draft_label": "Model id du draft (petit) :",
2818
+ "speculative.target_label_short": "target",
2819
+ "speculative.draft_label_short": "draft",
2820
+ "speculative.check_btn": "🔍 Vérifier compatibilité",
2821
+ "speculative.example_good_btn":"↳ Exemple : Llama-3.1 8B/70B (bon)",
2822
+ "speculative.example_bad_btn": "↳ Exemple : cross-family (mauvais)",
2823
+ "speculative.status.fetching": "🔄 Récupération de tokenizer.json depuis HF Hub pour les deux modèles…",
2824
+ "speculative.status.done": "✅ {verdict}",
2825
+ "speculative.status.error": "❌ Erreur",
2826
+ "speculative.type_mismatch_note": "types de tokenizer diffèrent ; spec-dec impossible",
2827
+ "speculative.vocab_size": "Taille du vocab",
2828
+ "speculative.size_diff": "diffèrent — chaque id réutilisé est un mismatch",
2829
+ "speculative.sampled": "Match de token-id échantillonné",
2830
+ "speculative.first_mismatch": "Premier mismatch",
2831
+ "speculative.special_diff": "Différences de special tokens",
2832
+ "speculative.added_diff": "Différences de added tokens",
2833
+ "speculative.added_diff_more": "+ plus …",
2834
+ "speculative.speedup.title": "Bande de speedup estimée",
2835
+ "speculative.speedup.params": "target {target} / draft {draft} (ratio de params {ratio})",
2836
+ "speculative.speedup.low": "Bas (α=0.50)",
2837
+ "speculative.speedup.expected":"Attendu (α=0.70)",
2838
+ "speculative.speedup.high": "Haut (α=0.85)",
2839
+ "speculative.speedup.disclaimer": "α = taux d'acceptation du draft. Le speedup réel dépend du domaine du prompt, lookahead K, et overhead du moteur. Les bandes supposent un verifier batching idéal.",
2840
+ "speculative.speedup.draft_not_smaller": "Le draft n'est pas plus petit que le target — spec-dec est un mauvais usage ici.",
2841
+ "speculative.attribution": "Réfs :",
2842
+ "speculative.side.target": "Target",
2843
+ "speculative.side.draft": "Draft",
2844
+ "speculative.fetch_error.missing_model_id": "model id manquant",
2845
+ "speculative.fetch_error.gated_or_private": "modèle gated ou privé — impossible de récupérer le tokenizer sans auth",
2846
+ "speculative.fetch_error.not_found": "model id non trouvé sur HF Hub",
2847
+ "speculative.fetch_error.fetch_failed": "fetch échoué (erreur HTTP)",
2848
+ "speculative.fetch_error.parse_failed": "parse JSON échoué (fichier malformé)",
2849
+ "speculative.fetch_error.timeout": "timeout (>8s, gros tokenizer ou connexion lente)",
2850
+ "speculative.fetch_error.network": "erreur réseau",
2851
+ "speculative.fetch_error.hint": "Vérifiez l'orthographe du model id. Pour les modèles gated, consultez le tokenizer via votre compte HF — cet outil ne peut pas auth.",
2852
+ "speculative.hint.missing_input": "Entrez les deux model ids (target et draft), puis Vérifier.",
2853
+ "speculative.hint.identical_models": "Target et draft sont le même modèle — spec-dec est un no-op (et un gaspillage).",
2854
+ "speculative.verdict.compatible": "✅ Compatible — vocabs correspondent",
2855
+ "speculative.verdict.compatible_with_caveats": "✅ Compatible — mais special/added tokens diffèrent (à revoir)",
2856
+ "speculative.verdict.partial_compatible": "⚠ Match partiel (95-99.9% des ids échantillonnés)",
2857
+ "speculative.verdict.type_mismatch": "❌ Types de tokenizer diffèrent — spec-dec impossible",
2858
+ "speculative.verdict.vocab_size_mismatch": "❌ Tailles de vocab diffèrent — espace d'id désaligné",
2859
+ "speculative.verdict.incompatible": "❌ Incompatibles — trop de mismatches d'id",
2860
+ "speculative.verdict.fetch_failed": "ℹ Récupération du tokenizer impossible",
2861
+ "speculative.verdict.identical_models": "ℹ Modèles identiques — spec-dec est un no-op",
2862
+ "speculative.verdict.missing_input": "ℹ Entrez les deux ids",
2863
+ "inv.v085.speculative": "<strong>🔬 Spec-Decode</strong> — vérifie la compatibilité du vocab du tokenizer entre target + draft avant de déployer le speculative decoding (le bug qui donne UN PIRE débit silencieusement).",
2864
+ "help.v085.speculative.title": "🔬 Compatibilité Speculative-Decode",
2865
+ "help.v085.speculative.body": "Le speculative decoding ne marche que si target et draft partagent exactement le même vocabulaire. Des vocabs mismatched font que chaque token du draft est rejeté : vous payez LES DEUX coûts de compute et obtenez un pire débit que la baseline. Pire : le système émet toujours une sortie correcte (juste plus lente), donc le bug est invisible aux tests unitaires. vLLM #4570 / #16757 / #20409 / #12488 en exposent des variantes. Cet outil récupère `tokenizer.json` depuis HF Hub pour les deux model ids, compare le type de tokenizer, la taille du vocab, la map complète token→id, les special tokens, et les added tokens, puis estime une bande de speedup basée sur le ratio de params et les taux α=0.5/0.7/0.85 d'acceptation typiques. <em>Cas d'usage</em> : avant de lancer un cluster vLLM avec spec-dec activé, vérifiez que la paire est compatible.",
2866
+
2867
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — chaque pain documenté mappé à un mode tafagent ou outil externe curé. Ne réinventez pas — trouvez.",
2868
  "help.v081.hub.title": "🧭 Solutions Hub",
2869
  "help.v081.hub.body": "tafagent comme intégrateur, pas silo. 30+ pains à travers 7 catégories (eval reliability · diagnostics · setup · training · retrieval · multimodal · observability), chacun mappé à (a) le mode tafagent qui le résout, s'il existe, et (b) les outils externes best-of-breed que la communauté utilise déjà (RAGAS, MTEB, HELM, MCP Schema Validator, llm-stats, llguidance, GlitchMiner, etc.). La barre de recherche matche pain, scénario, et nom d'outil. <em>Cas d'usage</em> : 'j'ai le problème X — tafagent le résout-il, et sinon, qui ?'",
 
3824
  "help.v084.cache.title": "🔁 Prompt-Cache 差异预测器",
3825
  "help.v084.cache.body": "每个提供商的 prompt cache 有不同规则:Anthropic 的 <code>cache_control</code> 在标记前缀的第一个 token 差异处中断;OpenAI 自动缓存 ≥1024 token 的前缀;Gemini context cache 需要 ≥32K token。位置不当的编辑会悄悄使你的账单 10 倍——API 不会警告,成本只在下张账单上出现。粘贴新旧 prompt,预测器找到最长公共前缀,用三种 tokenizer 配置(英语/代码/CJK)估算 token,并显示每个提供商的命中率 + 与无缓存的 $ 差额,包括 Claude Opus/Sonnet/Haiku、GPT-5/mini 和 Gemini 2.5 Pro。<em>用例</em>:『我调整了 system prompt 后账单暴涨——什么坏了?』→ 粘贴两个 prompt,看到底哪个提供商停止缓存。",
3826
 
3827
+ // v0.8.5 — anti-bullshit pack #11: Speculative-Decode Compatibility
3828
+ "modes.speculative": "🔬 Spec-Decode",
3829
+ "mode_desc.speculative": "从 HF Hub 获取两个 model id 的 `tokenizer.json` 并在配置 speculative decoding 之前验证 vocab 兼容性。捕获浪费 draft 计算的静默不匹配 bug。",
3830
+ "speculative.title": "🔬 Speculative-Decode 兼容性",
3831
+ "speculative.tip": "Speculative decoding(vLLM、SGLang、llama.cpp、transformers)要求 draft 和 target 共享完全相同的词汇表。任何 token-id 不一致都会使 target 拒绝每个 draft token——你支付双倍计算成本且吞吐量比 baseline 更差。系统报告名义输出(只是更慢),所以 bug 在单元测试中不可见。这个工具从 HF Hub 获取两个 id 的 `tokenizer.json` 并比较。",
3832
+ "speculative.desc": "<strong>不要发布 vocab 不匹配的 spec-dec。</strong> 粘贴 target + draft model id → 工具获取 tokenizer,比较 vocab 类型、大小、采样的 token-id、special token、added token → 判定 + speedup 估算。",
3833
+ "speculative.target_label": "Target(大)model id:",
3834
+ "speculative.draft_label": "Draft(小)model id:",
3835
+ "speculative.target_label_short": "target",
3836
+ "speculative.draft_label_short": "draft",
3837
+ "speculative.check_btn": "🔍 检查兼容性",
3838
+ "speculative.example_good_btn":"↳ 示例:Llama-3.1 8B/70B(好)",
3839
+ "speculative.example_bad_btn": "↳ 示例:跨 family(坏)",
3840
+ "speculative.status.fetching": "🔄 从 HF Hub 获取两个模型的 tokenizer.json…",
3841
+ "speculative.status.done": "✅ {verdict}",
3842
+ "speculative.status.error": "❌ 错误",
3843
+ "speculative.type_mismatch_note": "tokenizer 类型不同;spec-dec 不可能",
3844
+ "speculative.vocab_size": "Vocab 大小",
3845
+ "speculative.size_diff": "不同——每个重用的 id 都是一个不对齐",
3846
+ "speculative.sampled": "Token-id 采样匹配",
3847
+ "speculative.first_mismatch": "首次不匹配",
3848
+ "speculative.special_diff": "Special token 差异",
3849
+ "speculative.added_diff": "Added token 差异",
3850
+ "speculative.added_diff_more": "+ 更多 …",
3851
+ "speculative.speedup.title": "估算的 speedup 范围",
3852
+ "speculative.speedup.params": "target {target} / draft {draft}(参数比 {ratio})",
3853
+ "speculative.speedup.low": "低(α=0.50)",
3854
+ "speculative.speedup.expected":"预期(α=0.70)",
3855
+ "speculative.speedup.high": "高(α=0.85)",
3856
+ "speculative.speedup.disclaimer": "α = draft 接受率。实际 speedup 取决于 prompt 域、lookahead K 和引擎开销。范围假设理想的 verifier batching。",
3857
+ "speculative.speedup.draft_not_smaller": "Draft 不比 target 小——这里 spec-dec 是误用。",
3858
+ "speculative.attribution": "参考:",
3859
+ "speculative.side.target": "Target",
3860
+ "speculative.side.draft": "Draft",
3861
+ "speculative.fetch_error.missing_model_id": "缺少 model id",
3862
+ "speculative.fetch_error.gated_or_private": "模型受限或私有——没有 auth 无法获取 tokenizer",
3863
+ "speculative.fetch_error.not_found": "在 HF Hub 上找不到 model id",
3864
+ "speculative.fetch_error.fetch_failed": "获取失败(HTTP 错误)",
3865
+ "speculative.fetch_error.parse_failed": "JSON 解析失败(文件格式不正确)",
3866
+ "speculative.fetch_error.timeout": "超时(>8 秒,大 tokenizer 或慢速连接)",
3867
+ "speculative.fetch_error.network": "网络错误",
3868
+ "speculative.fetch_error.hint": "检查 model id 拼写。受限模型需要通过你的 HF 账户查看 tokenizer 文件——这个工具无法 auth。",
3869
+ "speculative.hint.missing_input": "输入两个 model id(target 和 draft),然后检查。",
3870
+ "speculative.hint.identical_models": "Target 和 draft 是同一个模型——spec-dec 是 no-op(且浪费)。",
3871
+ "speculative.verdict.compatible": "✅ 兼容——vocab 匹配",
3872
+ "speculative.verdict.compatible_with_caveats": "✅ 兼容——但 special/added token 不同(请审查)",
3873
+ "speculative.verdict.partial_compatible": "⚠ 部分匹配(采样 id 的 95-99.9%)",
3874
+ "speculative.verdict.type_mismatch": "❌ Tokenizer 类型不同——spec-dec 不可能",
3875
+ "speculative.verdict.vocab_size_mismatch": "❌ Vocab 大小不同——id 空间不对齐",
3876
+ "speculative.verdict.incompatible": "❌ 不兼容——太多 id 不匹配",
3877
+ "speculative.verdict.fetch_failed": "ℹ 无法获取 tokenizer",
3878
+ "speculative.verdict.identical_models": "ℹ 模型相同——spec-dec 是 no-op",
3879
+ "speculative.verdict.missing_input": "ℹ 输入两个 id",
3880
+ "inv.v085.speculative": "<strong>🔬 Spec-Decode</strong> — 在发布 speculative decoding 前验证 target + draft 之间的 tokenizer vocab 兼容性(静默给出更差吞吐量的 bug)。",
3881
+ "help.v085.speculative.title": "🔬 Speculative-Decode 兼容性",
3882
+ "help.v085.speculative.body": "Speculative decoding 仅当 target 和 draft 共享完全相同的词汇表时才能工作。Vocab 不匹配导致每个 draft token 被拒绝——你支付双倍计算成本且吞吐量比 baseline 更差。更糟:系统仍输出正确(只是更慢),所以 bug 在单元测试中不可见。vLLM #4570 / #16757 / #20409 / #12488 都显示了变种。这个工具从 HF Hub 获取两个 model id 的 `tokenizer.json`,比较 tokenizer 类型、vocab 大小、完整 token→id 映射、special token 和 added token,然后基于参数比和典型 α=0.5/0.7/0.85 接受率估算 speedup 范围。<em>用例</em>:在启动启用了 spec-dec 的 vLLM 集群之前,验证这对模型是否真的兼容。",
3883
+
3884
  "inv.v081.hub": "<strong>🧭 Solutions Hub</strong> — 每个文档化的问题都映射到一个 tafagent 模式或精选外部工具。别重复发明 — 去找。",
3885
  "help.v081.hub.title": "🧭 Solutions Hub",
3886
  "help.v081.hub.body": "tafagent 作为集成者而非孤岛。30+ 问题跨 7 类别(评估可靠性 · 诊断 · 设置 · 训练 · 检索 · 多模态 · 可观测性),每个映射到(a)解决它的 tafagent 模式(若存在),以及(b)社区已信任的最佳外部工具(RAGAS、MTEB、HELM、MCP Schema Validator、llm-stats、llguidance、GlitchMiner 等)。搜索框匹配 pain、场景和工具名称。<em>用例</em>:'我有问题 X — tafagent 解决它吗,如果不,谁解决?'",
js/main.js CHANGED
@@ -30,6 +30,7 @@ import {
30
  import { lintJsonCot, reorderJsonText, classifyFieldName } from "./json_cot_linter.js";
31
  import { lintPeftCode, ARCH_TARGET_MODULES } from "./peft_anti_pattern.js";
32
  import { diffPromptCache, PROVIDERS as CACHE_PROVIDERS } from "./prompt_cache_diff.js";
 
33
 
34
  // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
35
  // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
@@ -222,6 +223,7 @@ document.addEventListener("click", (e) => {
222
  cot: "cot-section",
223
  peft: "peft-section",
224
  cache: "cache-section",
 
225
  hub: "hub-section",
226
  }[targetMode];
227
  if (sectionId) {
@@ -247,7 +249,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
247
  "diagnose-section", "phase-section", "unmask-section",
248
  "template-section", "arena-section", "contam-section",
249
  "quant-section", "drift-section", "niah-section",
250
- "saturation-section", "cot-section", "peft-section", "cache-section", "hub-section"].forEach(id => {
251
  const el = $(id);
252
  if (el) el.style.display = "none";
253
  });
@@ -262,6 +264,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
262
  cot: "cot-section",
263
  peft: "peft-section",
264
  cache: "cache-section",
 
265
  hub: "hub-section",
266
  };
267
  const sectionId = sectionMap[mode];
@@ -272,6 +275,7 @@ document.querySelectorAll(".mode-btn").forEach(btn => {
272
  if (mode === "cot") initCot();
273
  if (mode === "peft") initPeft();
274
  if (mode === "cache") initCacheDiff();
 
275
  if (mode === "hub") initHub();
276
  });
277
  });
@@ -3910,6 +3914,178 @@ $("cache-example-belowmin-btn")?.addEventListener("click", () => {
3910
  runCacheDiff();
3911
  });
3912
 
3913
  // ════════════════════════════════════════════════════════════════════
3914
  // Bootstrap
3915
  // ════════════════════════════════════════════════════════════════════
 
30
  import { lintJsonCot, reorderJsonText, classifyFieldName } from "./json_cot_linter.js";
31
  import { lintPeftCode, ARCH_TARGET_MODULES } from "./peft_anti_pattern.js";
32
  import { diffPromptCache, PROVIDERS as CACHE_PROVIDERS } from "./prompt_cache_diff.js";
33
+ import { checkCompatibility as specCheckCompat, parseParamHint } from "./spec_decode_compat.js";
34
 
35
  // Attach HF Hub search-as-you-type to all 5 model id inputs (Profile, Recipe,
36
  // Unmask, Template, Quant). Hits public huggingface.co/api/models. Idempotent.
 
223
  cot: "cot-section",
224
  peft: "peft-section",
225
  cache: "cache-section",
226
+ speculative: "speculative-section",
227
  hub: "hub-section",
228
  }[targetMode];
229
  if (sectionId) {
 
249
  "diagnose-section", "phase-section", "unmask-section",
250
  "template-section", "arena-section", "contam-section",
251
  "quant-section", "drift-section", "niah-section",
252
+ "saturation-section", "cot-section", "peft-section", "cache-section", "speculative-section", "hub-section"].forEach(id => {
253
  const el = $(id);
254
  if (el) el.style.display = "none";
255
  });
 
264
  cot: "cot-section",
265
  peft: "peft-section",
266
  cache: "cache-section",
267
+ speculative: "speculative-section",
268
  hub: "hub-section",
269
  };
270
  const sectionId = sectionMap[mode];
 
275
  if (mode === "cot") initCot();
276
  if (mode === "peft") initPeft();
277
  if (mode === "cache") initCacheDiff();
278
+ if (mode === "speculative") initSpeculative();
279
  if (mode === "hub") initHub();
280
  });
281
  });
 
3914
  runCacheDiff();
3915
  });
3916
 
3917
+ // ════════════════════════════════════════════════════════════════════
3918
+ // 🔬 Speculative-Decode Compatibility (v0.8.5 anti-bullshit pack #11)
3919
+ // ════════════════════════════════════════════════════════════════════
3920
+ const SPEC_VERDICT_BG = {
3921
+ compatible: "#3fb950",
3922
+ compatible_with_caveats: "#3fb950",
3923
+ partial_compatible: "#d29922",
3924
+ type_mismatch: "#f85149",
3925
+ vocab_size_mismatch: "#f85149",
3926
+ incompatible: "#f85149",
3927
+ fetch_failed: "#8b949e",
3928
+ identical_models: "#58a6ff",
3929
+ missing_input: "#8b949e",
3930
+ };
3931
+
3932
+ let __specInited = false;
3933
+
3934
+ function initSpeculative() {
3935
+ if (__specInited) return;
3936
+ __specInited = true;
3937
+ // No-op (no async preload); placeholder kept for symmetry.
3938
+ }
3939
+
3940
+ function fmtParams(p) {
3941
+ if (!p) return "—";
3942
+ if (p >= 1e9) return `${(p / 1e9).toFixed(1)}B`;
3943
+ if (p >= 1e6) return `${(p / 1e6).toFixed(1)}M`;
3944
+ return p.toLocaleString();
3945
+ }
3946
+
3947
+ function renderSpecResult(result) {
3948
+ const verdict = t(`speculative.verdict.${result.code}`) || result.code;
3949
+ const verdictBg = SPEC_VERDICT_BG[result.code] || "#8b949e";
3950
+ const verdictBadge = `<span class="badge" style="background:${verdictBg};">${verdict}</span>`;
3951
+
3952
+ // Failure-mode short-circuits
3953
+ if (result.code === "missing_input" || result.code === "identical_models") {
3954
+ return `<div class="arena-result">
3955
+ <p style="font-size:1.1em;">${verdictBadge}</p>
3956
+ <p class="recipe-desc">${t(`speculative.hint.${result.code}`) || ""}</p>
3957
+ </div>`;
3958
+ }
3959
+ if (result.code === "fetch_failed") {
3960
+ const errs = (result.errors || []).map(e => {
3961
+ const sideLabel = e.side === "target" ? (t("speculative.side.target") || "Target") : (t("speculative.side.draft") || "Draft");
3962
+ const reason = t(`speculative.fetch_error.${e.error}`) || e.error;
3963
+ return `<li><strong>${sideLabel}</strong>: ${reason}${e.status ? ` (HTTP ${e.status})` : ""}</li>`;
3964
+ }).join("");
3965
+ return `<div class="arena-result">
3966
+ <p style="font-size:1.1em;">${verdictBadge}</p>
3967
+ <ul>${errs}</ul>
3968
+ <p class="recipe-desc subtle">${t("speculative.fetch_error.hint") || "Check the model id spelling. For gated models you'll need to view the tokenizer file via your HF account — this tool can't auth."}</p>
3969
+ </div>`;
3970
+ }
3971
+
3972
+ const p = result.params;
3973
+
3974
+ // Section 1 — vocab summary
3975
+ const typeBadge = (label, val, bg) =>
3976
+ `<span class="badge" style="background:${bg};">${label}: <code>${val ?? "—"}</code></span>`;
3977
+ const typeRow = `
3978
+ ${typeBadge(t("speculative.target_label_short") || "target", p.target_type, p.type_match ? "#3fb950" : "#f85149")}
3979
+ ${typeBadge(t("speculative.draft_label_short") || "draft", p.draft_type, p.type_match ? "#3fb950" : "#f85149")}
3980
+ ${p.type_match ? "" : `<span class="subtle"> ← ${t("speculative.type_mismatch_note") || "tokenizer types differ; spec-dec impossible"}</span>`}
3981
+ `;
3982
+
3983
+ const sizeRow = `
3984
+ <strong>${t("speculative.vocab_size") || "Vocab size"}:</strong>
3985
+ target = <code>${p.target_vocab_size.toLocaleString()}</code>,
3986
+ draft = <code>${p.draft_vocab_size.toLocaleString()}</code>
3987
+ ${p.vocab_size_match ? "" : `<span style="color:#f85149;"> ← ${t("speculative.size_diff") || "differ — every reused id is a misalignment"}</span>`}
3988
+ `;
3989
+
3990
+ // Sampled match
3991
+ const matchPct = p.sampled_total > 0 ? Math.floor(p.sampled_match_ratio * 1000) / 10 : 0; // one decimal, floored so 99.95% never displays as 100%
3992
+ const matchColor = matchPct >= 99.9 ? "#3fb950" : matchPct >= 95 ? "#d29922" : "#f85149";
3993
+ const sampleRow = `
3994
+ <strong>${t("speculative.sampled") || "Token-id sample match"}:</strong>
3995
+ <span style="color:${matchColor};font-weight:600;">${matchPct}%</span>
3996
+ <span class="subtle">(${p.sampled_match_count.toLocaleString()} / ${p.sampled_total.toLocaleString()} tokens)</span>
3997
+ ${p.first_mismatch ? `<br><span class="subtle">${t("speculative.first_mismatch") || "First mismatch"}: <code>${escapeHtml(String(p.first_mismatch.token).slice(0, 40))}</code> → target id ${p.first_mismatch.target_id ?? "—"}, draft id ${p.first_mismatch.draft_id ?? "—"}</span>` : ""}
3998
+ `;
3999
+
4000
+ // Special / added token diffs
4001
+ const specDiffRows = (p.special_tokens_diff || []).map(d =>
4002
+ `<li><code>${d.name}</code>: target=<code>${escapeHtml(String(d.target ?? "—"))}</code>, draft=<code>${escapeHtml(String(d.draft ?? "—"))}</code></li>`
4003
+ ).join("");
4004
+ const specDiffBlock = specDiffRows
4005
+ ? `<details style="margin-top:0.5em;"><summary>${t("speculative.special_diff") || "Special-token differences"} (${p.special_tokens_diff.length})</summary><ul>${specDiffRows}</ul></details>`
4006
+ : "";
4007
+
4008
+ const addedDiffPreview = (p.added_tokens_diff || []).slice(0, 12).map(d =>
4009
+ `<li><span class="subtle">${d.side === "target_only" ? "target only" : "draft only"}:</span> <code>${escapeHtml(String(d.token).slice(0, 40))}</code></li>`
4010
+ ).join("");
4011
+ const addedDiffBlock = addedDiffPreview
4012
+ ? `<details style="margin-top:0.5em;"><summary>${t("speculative.added_diff") || "Added-token differences"} (${(p.added_tokens_diff||[]).length})</summary><ul>${addedDiffPreview}${p.added_tokens_diff.length > 12 ? `<li class="subtle">${t("speculative.added_diff_more") || "+ more …"}</li>` : ""}</ul></details>`
4013
+ : "";
4014
+
4015
+ // Section 2 — speedup band (only when compatible-ish)
4016
+ let speedupBlock = "";
4017
+ if (p.speedup_expected != null) {
4018
+ const ratio = p.param_ratio ? `${(p.param_ratio * 100).toFixed(1)}%` : "—";
4019
+ speedupBlock = `
4020
+ <div style="margin-top:1em;padding:0.75em;background:#161b22;border-left:3px solid #3fb950;border-radius:4px;">
4021
+ <strong>${t("speculative.speedup.title") || "Estimated speedup band"}</strong><br>
4022
+ <span class="subtle" style="font-size:0.85em;">${tFmt("speculative.speedup.params", { target: fmtParams(p.target_params), draft: fmtParams(p.draft_params), ratio }) || `target ${fmtParams(p.target_params)} / draft ${fmtParams(p.draft_params)} (param ratio ${ratio})`}</span>
4023
+ <div style="margin-top:0.5em;display:flex;gap:1em;flex-wrap:wrap;">
4024
+ <div>${t("speculative.speedup.low") || "Low (α=0.50)"}:<br><strong style="font-size:1.2em;">${p.speedup_low}×</strong></div>
4025
+ <div>${t("speculative.speedup.expected") || "Expected (α=0.70)"}:<br><strong style="font-size:1.4em;color:#3fb950;">${p.speedup_expected}×</strong></div>
4026
+ <div>${t("speculative.speedup.high") || "High (α=0.85)"}:<br><strong style="font-size:1.2em;">${p.speedup_high}×</strong></div>
4027
+ </div>
4028
+ <p class="subtle" style="font-size:0.78em;margin-top:0.5em;">${t("speculative.speedup.disclaimer") || "α = draft acceptance rate. Real speedup depends on prompt domain, lookahead K, and engine overhead. Bands assume ideal verifier batching."}</p>
4029
+ </div>
4030
+ `;
4031
+ } else if (p.target_params && p.draft_params && p.param_ratio >= 1) {
4032
+ speedupBlock = `<p class="recipe-desc" style="color:#f85149;margin-top:1em;">${t("speculative.speedup.draft_not_smaller") || "Draft is not smaller than target — spec-dec is misuse here."}</p>`;
4033
+ }
4034
+
4035
+ // Attribution
4036
+ const attribution = `
4037
+ <p class="recipe-desc subtle" style="font-size:0.82em;margin-top:1em;">
4038
+ ${t("speculative.attribution") || "Refs:"}
4039
+ <a href="https://docs.vllm.ai/en/latest/serving/speculative_decoding.html" target="_blank" rel="noopener noreferrer">vLLM spec-dec docs</a> ·
4040
+ <a href="https://docs.sglang.ai/router/router.html" target="_blank" rel="noopener noreferrer">SGLang</a> ·
4041
+ <a href="https://huggingface.co/docs/transformers/main/en/llm_optims#speculative-decoding" target="_blank" rel="noopener noreferrer">transformers assistant_model</a> ·
4042
+ <a href="https://arxiv.org/abs/2211.17192" target="_blank" rel="noopener noreferrer">Leviathan et al. 2022</a>
4043
+ </p>
4044
+ `;
4045
+
4046
+ return `<div class="arena-result">
4047
+ <p style="font-size:1.1em;">${verdictBadge}</p>
4048
+ <p>${typeRow}</p>
4049
+ <p>${sizeRow}</p>
4050
+ <p>${sampleRow}</p>
4051
+ ${specDiffBlock}
4052
+ ${addedDiffBlock}
4053
+ ${speedupBlock}
4054
+ ${attribution}
4055
+ </div>`;
4056
+ }
4057
+
4058
+ async function runSpecCheck() {
4059
+ const targetId = $("spec-target-id")?.value?.trim() || "";
4060
+ const draftId = $("spec-draft-id")?.value?.trim() || "";
4061
+ $("spec-status").textContent = t("speculative.status.fetching") || "🔄 Fetching tokenizer.json from HF Hub for both models…";
4062
+ $("spec-output").innerHTML = "";
4063
+ try {
4064
+ const result = await specCheckCompat(targetId, draftId);
4065
+ $("spec-output").innerHTML = renderSpecResult(result);
4066
+ $("spec-status").textContent = tFmt("speculative.status.done", {
4067
+ verdict: t(`speculative.verdict.${result.code}`) || result.code,
4068
+ });
4069
+ } catch (e) {
4070
+ $("spec-status").textContent = (t("speculative.status.error") || "❌ Error") + " " + (e.message || e);
4071
+ }
4072
+ }
4073
+
4074
+ $("spec-check-btn")?.addEventListener("click", runSpecCheck);
4075
+ $("spec-example-good-btn")?.addEventListener("click", () => {
4076
+ $("spec-target-id").value = "meta-llama/Llama-3.1-70B-Instruct";
4077
+ $("spec-draft-id").value = "meta-llama/Llama-3.1-8B-Instruct";
4078
+ runSpecCheck();
4079
+ });
4080
+ $("spec-example-bad-btn")?.addEventListener("click", () => {
4081
+ $("spec-target-id").value = "meta-llama/Llama-3.1-8B-Instruct";
4082
+ $("spec-draft-id").value = "mistralai/Mistral-7B-Instruct-v0.3";
4083
+ runSpecCheck();
4084
+ });
4085
+
4086
+ // (HF autocomplete on spec-target-id / spec-draft-id is registered via
4087
+ // the known-id list in hf_autocomplete.js; no extra wiring needed here.)
4088
+
4089
  // ════════════════════════════════════════════════════════════════════
4090
  // Bootstrap
4091
  // ════════════════════════════════════════════════════════════════════
js/spec_decode_compat.js ADDED
@@ -0,0 +1,372 @@
1
+ // Speculative-Decode Compatibility Checker (v0.8.5 anti-bullshit pack #11)
2
+ //
3
+ // Pain: speculative decoding (vLLM, SGLang, llama.cpp, transformers
4
+ // `assistant_model`) requires the draft and target model to share an
5
+ // EXACT vocabulary. If token IDs disagree, every draft token is
6
+ // rejected by the target's verifier — the user pays the draft compute
7
+ // AND the full target compute, getting WORSE throughput than baseline.
8
+ // Worse, the system reports nominal output (just slower) so the bug
9
+ // is invisible in unit tests.
10
+ //
11
+ // Common silent failures:
12
+ // - Llama-3.1 draft + Llama-3.2 target (vocab differs by added tokens)
13
+ // - Mistral draft + Llama target (different tokenizer family entirely)
14
+ // - Quantized variant with different special tokens
15
+ // - Chat-template additions (`<|im_start|>` etc) on one side only
16
+ //
17
+ // vLLM #4570 / #16757 / #20409 / #12488 all surface variants of this.
18
+ //
19
+ // Tool: paste two HF model ids → fetch `tokenizer.json` from HF Hub for
20
+ // both → compare vocab type, size, token-to-id sample, special tokens,
21
+ // added tokens → verdict + speedup estimate when compatible.
22
+ //
23
+ // Pure logic + async fetch. No human strings; main.js does i18n.
24
+
25
+ // =============================================================================
26
+ // HF Hub fetching
27
+ // =============================================================================
28
+ //
29
+ // HF Hub serves text-content files (tokenizer.json, tokenizer_config.json,
30
+ // config.json) with CORS. The v0.7.4 autocomplete already proved this
31
+ // path is reachable from the browser. We fetch with a short timeout so
32
+ // the UI doesn't hang on gated/private/missing models.
33
+
34
+ const HF_BASE = "https://huggingface.co";
35
+ const FETCH_TIMEOUT_MS = 8000;
36
+
37
+ async function fetchHfJson(modelId, fileName) {
38
+ if (typeof modelId !== "string" || !modelId.trim()) {
39
+ return { ok: false, error: "missing_model_id" };
40
+ }
41
+ const url = `${HF_BASE}/${encodeURI(modelId.trim())}/raw/main/${fileName}`;
42
+ const controller = new AbortController();
43
+ const timer = setTimeout(() => controller.abort(), FETCH_TIMEOUT_MS);
44
+ try {
45
+ const res = await fetch(url, { signal: controller.signal });
46
+ clearTimeout(timer);
47
+ if (res.status === 401 || res.status === 403) {
48
+ return { ok: false, error: "gated_or_private", status: res.status };
49
+ }
50
+ if (res.status === 404) {
51
+ return { ok: false, error: "not_found", status: 404 };
52
+ }
53
+ if (!res.ok) {
54
+ return { ok: false, error: "fetch_failed", status: res.status };
55
+ }
56
+ const text = await res.text();
57
+ try {
58
+ return { ok: true, data: JSON.parse(text), bytes: text.length };
59
+ } catch (e) {
60
+ return { ok: false, error: "parse_failed", message: String(e).slice(0, 200) };
61
+ }
62
+ } catch (e) {
63
+ clearTimeout(timer);
64
+ if (e.name === "AbortError") {
65
+ return { ok: false, error: "timeout" };
66
+ }
67
+ return { ok: false, error: "network", message: String(e).slice(0, 200) };
68
+ }
69
+ }
70
+
71
+ export async function fetchTokenizer(modelId) {
72
+ // tokenizer.json is the canonical fast-tokenizer artifact. If it's
73
+ // absent (some older models ship only sentencepiece), fall back to
74
+ // tokenizer_config.json which carries the special-tokens metadata
75
+ // even without the BPE merges.
76
+ const main = await fetchHfJson(modelId, "tokenizer.json");
77
+ if (main.ok) return { ...main, source: "tokenizer.json" };
78
+ const fallback = await fetchHfJson(modelId, "tokenizer_config.json");
79
+ if (fallback.ok) return { ...fallback, source: "tokenizer_config.json" };
80
+ return main; // surface the original error code
81
+ }
82
+
83
+ export async function fetchConfig(modelId) {
84
+ return await fetchHfJson(modelId, "config.json");
85
+ }
86
+
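The status handling in `fetchHfJson` can be read as a small pure mapping from HTTP status to error code. A minimal sketch of that mapping follows; `classifyStatus` is a hypothetical helper name used only for illustration and is not part of the module.

```javascript
// Hypothetical helper mirroring the status checks in fetchHfJson above.
// Returns an error code string, or null when the response is usable.
function classifyStatus(status) {
  if (status === 401 || status === 403) return "gated_or_private";
  if (status === 404) return "not_found";
  if (status < 200 || status > 299) return "fetch_failed";
  return null;
}

for (const status of [200, 401, 403, 404, 500]) {
  console.log(status, classifyStatus(status));
}
```

Keeping the mapping separate from the network call makes the per-side error breakdown easy to check without touching the Hub.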
87
+ // =============================================================================
88
+ // Vocab extraction + comparison
89
+ // =============================================================================
90
+
91
+ // Return a plain {token: id} object (or null) for whatever shape the tokenizer.json carries.
92
+ // HF fast tokenizers store vocab under `model.vocab`, which is either
93
+ // {token: id} (BPE / WordPiece) or [[token, score], ...] (Unigram). Special tokens
94
+ // live under top-level `added_tokens` (with id); the model object may also
95
+ // carry fields such as `unk_token`.
96
+ function extractVocab(tokenizer) {
97
+ if (!tokenizer || typeof tokenizer !== "object") return null;
98
+ const model = tokenizer.model;
99
+ if (!model) return null;
100
+ let vocab = null;
101
+ if (model.vocab && typeof model.vocab === "object" && !Array.isArray(model.vocab)) {
102
+ // BPE / WordPiece form
103
+ vocab = model.vocab;
104
+ } else if (Array.isArray(model.vocab)) {
105
+ // Unigram form: [[token, log_prob], ...]
106
+ vocab = Object.create(null); // no prototype, so a token such as "__proto__" is stored as a normal key
107
+ for (let i = 0; i < model.vocab.length; i++) {
108
+ const entry = model.vocab[i];
109
+ if (Array.isArray(entry)) vocab[entry[0]] = i;
110
+ }
111
+ }
112
+ return vocab;
113
+ }
114
+
115
+ function extractAddedTokens(tokenizer) {
116
+ if (!tokenizer || typeof tokenizer !== "object") return [];
117
+ const arr = tokenizer.added_tokens;
118
+ if (!Array.isArray(arr)) return [];
119
+ return arr.map(t => ({
120
+ id: typeof t.id === "number" ? t.id : null,
121
+ content: typeof t.content === "string" ? t.content : "",
122
+ special: !!t.special,
123
+ })).filter(t => t.content);
124
+ }
125
+
126
+ function extractSpecialTokens(tokenizer) {
127
+ // tokenizer.json places special-token strings on the post-processor /
128
+ // template — but the canonical names are in tokenizer_config.json.
129
+ // Return what's available; the UI can show "—" for missing.
130
+ if (!tokenizer || typeof tokenizer !== "object") return {};
131
+ return {
132
+ bos_token: tokenizer.bos_token ?? null,
133
+ eos_token: tokenizer.eos_token ?? null,
134
+ pad_token: tokenizer.pad_token ?? null,
135
+ unk_token: tokenizer.unk_token ?? null,
136
+ };
137
+ }
138
+
139
+ function tokenizerType(tokenizer) {
140
+ return tokenizer?.model?.type || null;
141
+ }
142
+
143
+ // Comparison strategy: a full-vocab compare is cheap enough in JS for
144
+ // vocabs up to ~150K entries, so walk every key on the smaller side and
145
+ // count exact id matches. Only the first mismatch is recorded; the loop
146
+ // does not short-circuit, because the match ratio needs the full count.
147
+ export function compareVocabs(targetTok, draftTok) {
148
+ const tType = tokenizerType(targetTok);
149
+ const dType = tokenizerType(draftTok);
150
+ const tVocab = extractVocab(targetTok);
151
+ const dVocab = extractVocab(draftTok);
152
+
153
+ if (!tVocab || !dVocab) {
154
+ return {
155
+ type_match: tType !== null && tType === dType,
156
+ target_type: tType,
157
+ draft_type: dType,
158
+ vocab_size_match: false,
159
+ target_vocab_size: tVocab ? Object.keys(tVocab).length : 0,
160
+ draft_vocab_size: dVocab ? Object.keys(dVocab).length : 0,
161
+ sampled_total: 0,
162
+ sampled_match_count: 0,
163
+ first_mismatch: null,
164
+ special_tokens_diff: [],
165
+ added_tokens_diff: [],
166
+ };
167
+ }
168
+
169
+ const tKeys = Object.keys(tVocab);
170
+ const dKeys = Object.keys(dVocab);
171
+ const tSize = tKeys.length;
172
+ const dSize = dKeys.length;
173
+ const sizeMatch = tSize === dSize;
174
+
175
+ // Sample comparison: walk every key on the SMALLER side. For each
176
+ // key, check the id matches exactly. First mismatch is recorded.
177
+ const sampleKeys = tSize <= dSize ? tKeys : dKeys;
178
+ const a = tSize <= dSize ? tVocab : dVocab;
179
+ const b = tSize <= dSize ? dVocab : tVocab;
180
+ const sideA = tSize <= dSize ? "target" : "draft";
181
+ const sideB = sideA === "target" ? "draft" : "target";
182
+
183
+ let matchCount = 0;
184
+ let firstMismatch = null;
185
+ for (const key of sampleKeys) {
186
+ const aId = a[key];
187
+ const bId = b[key];
188
+ if (aId === bId) {
189
+ matchCount++;
190
+ } else if (firstMismatch === null) {
191
+ firstMismatch = { token: key, [`${sideA}_id`]: aId, [`${sideB}_id`]: bId };
192
+ }
193
+ }
194
+
195
+ // Special-token diff
196
+ const tSpec = extractSpecialTokens(targetTok);
197
+ const dSpec = extractSpecialTokens(draftTok);
198
+ const specDiff = [];
199
+ for (const name of ["bos_token", "eos_token", "pad_token", "unk_token"]) {
200
+ if ((tSpec[name] ?? null) !== (dSpec[name] ?? null)) {
201
+ specDiff.push({ name, target: tSpec[name], draft: dSpec[name] });
202
+ }
203
+ }
204
+
205
+ // Added-tokens diff (chat-template tokens etc.)
206
+ const tAdded = extractAddedTokens(targetTok);
207
+ const dAdded = extractAddedTokens(draftTok);
208
+ const tAddedSet = new Set(tAdded.map(x => `${x.id}:${x.content}`));
209
+ const dAddedSet = new Set(dAdded.map(x => `${x.id}:${x.content}`));
210
+ const addedDiff = [];
211
+ for (const k of tAddedSet) if (!dAddedSet.has(k)) addedDiff.push({ side: "target_only", token: k });
212
+ for (const k of dAddedSet) if (!tAddedSet.has(k)) addedDiff.push({ side: "draft_only", token: k });
213
+
214
+ return {
215
+ type_match: tType === dType,
216
+ target_type: tType,
217
+ draft_type: dType,
218
+ vocab_size_match: sizeMatch,
219
+ target_vocab_size: tSize,
220
+ draft_vocab_size: dSize,
221
+ sampled_total: sampleKeys.length,
222
+ sampled_match_count: matchCount,
223
+ first_mismatch: firstMismatch,
224
+ special_tokens_diff: specDiff,
225
+ added_tokens_diff: addedDiff,
226
+ };
227
+ }
228
+
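The smaller-side walk in `compareVocabs` can be tried on toy data. The sketch below uses two made-up `tokenizer.json`-shaped objects (hypothetical vocabularies, not taken from any real model) and a simplified helper, `toyMatchRatio`, that is an assumption for illustration rather than the module's API.

```javascript
// Two toy tokenizer.json-shaped objects (hypothetical data).
const toyTarget = { model: { type: "BPE", vocab: { "<s>": 0, "he": 1, "llo": 2, "world": 3 } } };
const toyDraft = { model: { type: "BPE", vocab: { "<s>": 0, "he": 1, "llo": 3, "world": 2 } } };

// Walk every key on the smaller vocab and count tokens whose id is identical
// on the other side. Returns a ratio in [0, 1].
function toyMatchRatio(a, b) {
  const aKeys = Object.keys(a);
  const bKeys = Object.keys(b);
  const [small, large, keys] = aKeys.length <= bKeys.length ? [a, b, aKeys] : [b, a, bKeys];
  let matches = 0;
  for (const key of keys) {
    // hasOwnProperty guards against inherited names such as "constructor".
    const present = Object.prototype.hasOwnProperty.call(large, key);
    if (present && small[key] === large[key]) matches++;
  }
  return keys.length === 0 ? 0 : matches / keys.length;
}

const toyRatio = toyMatchRatio(toyTarget.model.vocab, toyDraft.model.vocab);
console.log(toyRatio);
```

Here both vocabularies have the same size and type, yet two of the four ids are swapped, which is the kind of mismatch a size-only check would miss.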
229
+ // =============================================================================
230
+ // Param-count parsing — best-effort from model id strings
231
+ // =============================================================================
232
+ //
233
+ // HF model ids commonly carry a size hint: "Llama-3.1-8B", "Qwen2.5-72B",
234
+ // "Mistral-7B-v0.3". parseParamHint takes the last "{N}{B|M}" token in the id.
235
+ // Callers prefer the config.json estimate and fall back to this hint.
236
+
237
+ const PARAM_HINT_RE = /(\d+(?:\.\d+)?)\s*([bm])\b/gi;
238
+
239
+ export function parseParamHint(modelId) {
240
+ if (typeof modelId !== "string") return null;
241
+ // Pick the LAST match: in "Llama-3.1-8B" the size hint is "8B", and the
242
+ // version "3.1" never matches because it lacks a b/m suffix. Taking the
243
+ // last match favors a trailing size token over anything earlier in the id.
244
+ const matches = [...modelId.matchAll(PARAM_HINT_RE)];
245
+ if (matches.length === 0) return null;
246
+ const last = matches[matches.length - 1];
247
+ const value = parseFloat(last[1]);
248
+ const unit = last[2].toLowerCase();
249
+ if (isNaN(value)) return null;
250
+ const params = unit === "b" ? value * 1e9 : value * 1e6;
251
+ return params;
252
+ }
253
+
254
+ // Approximate param count from config.json. Highly heuristic.
255
+ function paramsFromConfig(config) {
256
+ if (!config) return null;
257
+ const h = config.hidden_size ?? config.n_embd ?? config.d_model;
258
+ const l = config.num_hidden_layers ?? config.n_layer ?? config.num_layers;
259
+ const v = config.vocab_size;
260
+ if (typeof h !== "number" || typeof l !== "number" || typeof v !== "number") return null;
261
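+ // Rough transformer param count: 12 × h² × l for the layers, plus h × v for the input embedding and h × v for the output head.
262
+ // Both are always counted, so tied-embedding models are overestimated. Not exact but order-of-magnitude usable for ratio computation.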
263
+ return 12 * h * h * l + 2 * h * v;
264
+ }
265
+
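Both size estimators can be exercised without any network access. The sketch below re-implements them standalone under the names `sizeHint` and `estimateParams` (hypothetical names for illustration); the config values are illustrative numbers for a roughly 8B Llama-style shape, not fetched data.

```javascript
// Standalone copy of the id heuristic: take the LAST "<number><b|m>" token.
function sizeHint(modelId) {
  if (typeof modelId !== "string") return null;
  const matches = [...modelId.matchAll(/(\d+(?:\.\d+)?)\s*([bm])\b/gi)];
  if (matches.length === 0) return null;
  const last = matches[matches.length - 1];
  const value = parseFloat(last[1]);
  return last[2].toLowerCase() === "b" ? value * 1e9 : value * 1e6;
}

// Standalone copy of the config heuristic: 12 * h^2 * l + 2 * h * v.
function estimateParams({ hidden_size: h, num_hidden_layers: l, vocab_size: v }) {
  return 12 * h * h * l + 2 * h * v;
}

console.log(sizeHint("meta-llama/Llama-3.1-8B-Instruct"));
console.log(sizeHint("facebook/opt-125m"));
// Mixture-of-experts ids expose a limit: only the per-expert "7B" is seen.
console.log(sizeHint("mistralai/Mixtral-8x7B-v0.1"));
console.log(sizeHint("gpt2"));

// Illustrative config values (assumed, roughly an 8B Llama-style shape).
const toyConfig = { hidden_size: 4096, num_hidden_layers: 32, vocab_size: 128256 };
console.log(estimateParams(toyConfig));
```

The two estimators can disagree by a noticeable margin for the same model, which is one reason to treat the resulting ratio as a rough input to the speedup band rather than a measurement.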
266
+ // =============================================================================
267
+ // Speedup estimation
268
+ // =============================================================================
269
+ //
270
+ // Expected walltime speedup of speculative decoding (Leviathan et al. 2022):
271
+ // S = (1 - α^(K+1)) / ((1 - α) × (K × T_d / T_t + 1))
272
+ // where α = draft acceptance rate, K = lookahead, T_d/T_t = ratio of
273
+ // draft to target step time. This module uses a cruder linear heuristic:
274
+ // S ≈ 1 + α × (1 - param_ratio)
275
+ // capped at 3.5x. Real-world gains rarely exceed ~3-4x; anything beyond that is wishful.
276
+ //
277
+ // Without α measured in-domain, return a band: low (α=0.5), expected
278
+ // (α=0.7), high (α=0.85). This keeps the uncertainty visible.
279
+
280
+ function speedupBand(targetParams, draftParams) {
281
+ if (!targetParams || !draftParams) return null;
282
+ const ratio = draftParams / targetParams;
283
+ if (ratio >= 1) {
284
+ // Draft must be smaller; this is misuse.
285
+ return { ratio, code: "draft_not_smaller" };
286
+ }
287
+ const compute = (alpha) => {
288
+ const s = 1 + alpha * (1 - ratio);
289
+ // Cap at a 3.5x ceiling as a guard. With α ≤ 0.85 this formula stays below 1.85, so the cap never binds for the band returned here.
290
+ return Math.min(s, 3.5);
291
+ };
292
+ return {
293
+ ratio,
294
+ low: Math.round(compute(0.50) * 100) / 100,
295
+ expected: Math.round(compute(0.70) * 100) / 100,
296
+ high: Math.round(compute(0.85) * 100) / 100,
297
+ };
298
+ }
299
+
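To see how the linear band relates to the closed-form expression from Leviathan et al. (2022), the sketch below evaluates both for an 8B draft and a 70B target. It assumes a lookahead of 5 and approximates the step-time ratio by the parameter ratio; both are assumptions for illustration, and `leviathanSpeedup` and `bandSpeedup` are hypothetical names.

```javascript
// Expected walltime improvement factor from Leviathan et al. (2022):
//   (1 - alpha^(gamma + 1)) / ((1 - alpha) * (gamma * c + 1))
// alpha: acceptance rate (must be < 1), gamma: lookahead, c: draft/target step-time ratio.
function leviathanSpeedup(alpha, gamma, c) {
  return (1 - Math.pow(alpha, gamma + 1)) / ((1 - alpha) * (gamma * c + 1));
}

// Linear heuristic matching speedupBand above, including its 3.5x cap.
function bandSpeedup(alpha, paramRatio) {
  return Math.min(1 + alpha * (1 - paramRatio), 3.5);
}

const paramRatio = 8 / 70; // assumed: step-time ratio ~ parameter ratio
const lookahead = 5;       // assumed lookahead

for (const alpha of [0.5, 0.7, 0.85]) {
  const linear = bandSpeedup(alpha, paramRatio).toFixed(2);
  const closedForm = leviathanSpeedup(alpha, lookahead, paramRatio).toFixed(2);
  console.log(`alpha=${alpha} linear=${linear} closed-form=${closedForm}`);
}
```

Under these assumptions the two estimates diverge as the acceptance rate rises, which is a reason to read the band as a rough guide and to measure the acceptance rate on real traffic before relying on it.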
300
+ // =============================================================================
301
+ // Public entry point — orchestrates fetch + compare + speedup
302
+ // =============================================================================
303
+
304
+ const COMPATIBLE_THRESHOLD = 0.999; // 99.9% of sampled tokens map identically
305
+ const PARTIAL_THRESHOLD = 0.95; // >=95% but <99.9%
306
+
307
+ export async function checkCompatibility(targetId, draftId) {
308
+ if (!targetId || !draftId) {
309
+ return { code: "missing_input", params: { targetId, draftId }, errors: [] };
310
+ }
311
+ if (targetId.trim() === draftId.trim()) {
312
+ return { code: "identical_models", params: { targetId, draftId }, errors: [] };
313
+ }
314
+
315
+ const [tTok, dTok, tCfg, dCfg] = await Promise.all([
316
+ fetchTokenizer(targetId),
317
+ fetchTokenizer(draftId),
318
+ fetchConfig(targetId),
319
+ fetchConfig(draftId),
320
+ ]);
321
+
322
+ const errors = [];
323
+ if (!tTok.ok) errors.push({ side: "target", error: tTok.error, status: tTok.status });
324
+ if (!dTok.ok) errors.push({ side: "draft", error: dTok.error, status: dTok.status });
325
+ if (!tTok.ok || !dTok.ok) {
326
+ return { code: "fetch_failed", params: { targetId, draftId }, errors };
327
+ }
328
+
329
+ const cmp = compareVocabs(tTok.data, dTok.data);
330
+
331
+ // Param ratio + speedup estimate
332
+ const tParams = paramsFromConfig(tCfg.ok ? tCfg.data : null) || parseParamHint(targetId);
333
+ const dParams = paramsFromConfig(dCfg.ok ? dCfg.data : null) || parseParamHint(draftId);
334
+ const speedup = speedupBand(tParams, dParams);
335
+
336
+ const sampledMatchRatio = cmp.sampled_total === 0
337
+ ? 0
338
+ : cmp.sampled_match_count / cmp.sampled_total;
339
+
340
+ let code;
341
+ if (!cmp.type_match) {
342
+ code = "type_mismatch";
343
+ } else if (!cmp.vocab_size_match) {
344
+ code = "vocab_size_mismatch";
345
+ } else if (sampledMatchRatio >= COMPATIBLE_THRESHOLD) {
346
+ code = cmp.special_tokens_diff.length || cmp.added_tokens_diff.length
347
+ ? "compatible_with_caveats"
348
+ : "compatible";
349
+ } else if (sampledMatchRatio >= PARTIAL_THRESHOLD) {
350
+ code = "partial_compatible";
351
+ } else {
352
+ code = "incompatible";
353
+ }
354
+
355
+ return {
356
+ code,
357
+ params: {
358
+ targetId, draftId,
359
+ ...cmp,
360
+ sampled_match_ratio: Math.round(sampledMatchRatio * 10000) / 10000,
361
+ target_params: tParams,
362
+ draft_params: dParams,
363
+ param_ratio: speedup?.ratio ?? null,
364
+ speedup_low: speedup?.low ?? null,
365
+ speedup_expected: speedup?.expected ?? null,
366
+ speedup_high: speedup?.high ?? null,
367
+ target_source: tTok.source,
368
+ draft_source: dTok.source,
369
+ },
370
+ errors,
371
+ };
372
+ }
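The verdict ladder in `checkCompatibility` can be restated as a pure function, which makes the order of the checks easy to test. This is a minimal sketch: `verdictFor` and its field names are hypothetical, and only the thresholds and verdict codes come from the module above.

```javascript
const COMPATIBLE = 0.999; // same value as COMPATIBLE_THRESHOLD
const PARTIAL = 0.95;     // same value as PARTIAL_THRESHOLD

// Checks run in order: tokenizer type, vocab size, then id match ratio.
function verdictFor({ typeMatch, sizeMatch, matchRatio, caveatCount }) {
  if (!typeMatch) return "type_mismatch";
  if (!sizeMatch) return "vocab_size_mismatch";
  if (matchRatio >= COMPATIBLE) {
    return caveatCount > 0 ? "compatible_with_caveats" : "compatible";
  }
  if (matchRatio >= PARTIAL) return "partial_compatible";
  return "incompatible";
}

console.log(verdictFor({ typeMatch: true, sizeMatch: true, matchRatio: 1, caveatCount: 0 }));
console.log(verdictFor({ typeMatch: true, sizeMatch: true, matchRatio: 0.97, caveatCount: 0 }));
console.log(verdictFor({ typeMatch: false, sizeMatch: true, matchRatio: 1, caveatCount: 0 }));
```

Because the type check comes first, a pair with identical ids but different tokenizer types still reports `type_mismatch`.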