Spaces:
Running
v0.8.5+ spec-decode: open-mirror fallback for gated models (no token needed)
Browse files
User reported that the v0.8.5-fix examples (Qwen) avoided the 401 problem
but the underlying limitation remained — anyone pasting a Llama /
Mistral / Gemma id still got "gated, no auth" with no recourse.
Web research confirmed HF's official position: their own transformers.js
docs explicitly state browser-side tokens are NOT supported for
gated/private models ("possibility of leaking access tokens"). So
adding a token field would violate HF's recommendation AND put users
at risk.
Implemented Option B (open-mirror suggester) instead, no token needed:
- `fetchTokenizerWithMirrorFallback(modelId)` tries 4 patterns when
the original returns 401:
1. unsloth/{name} ← bare unsloth redistribution
2. unsloth/Meta-{name} ← Meta-prefixed mirror (Llama)
3. unsloth/{name}-bnb-4bit ← quantized variant
4. unsloth/Meta-{name}-bnb-4bit ← Meta-prefixed quantized
First success wins. Pattern coverage verified empirically:
Llama-3.1-8B → pattern 1 (unsloth/Llama-3.1-8B-Instruct = 200);
Llama-3.1-70B → pattern 2 (unsloth/Meta-Llama-3.1-70B-Instruct
= 200). 8 URLs probed via curl + browser fetch.
- Mirror tokenizers are typically byte-identical to the gated
original because quantization touches WEIGHTS, not the tokenizer
artifact (BPE vocab + merges). Caveat surfaced inline: unsloth
issue #880 documents occasional chat-template drift.
- UI gets a yellow "Open-mirror fallback" banner when either side
used a mirror, naming the original AND the resolved mirror id.
Users see exactly what was substituted and can click through to
verify chat-template if their use case demands exact match.
- Defaults updated to demonstrate the new path: "good" example is
now meta-llama/Llama-3.1-70B-Instruct vs meta-llama/Llama-3.1-8B-
Instruct — both gated, both auto-resolve to unsloth mirrors,
producing a "compatible_with_caveats" verdict + mirror banner.
- Bad example stays Qwen vs Phi-3.5 (open-weight cross-family
incompatibility, no fallback needed).
NOT implemented (Option A — HF token in browser): officially
discouraged by HuggingFace; would expose users to token theft via
any future XSS vector. Documented in the inline note.
Verified: direct logic test (3/3) confirms Llama-3.1-70B resolves
to unsloth/Meta-Llama-3.1-70B-Instruct, Qwen passes through with no
mirror, and a nonexistent id returns the underlying 401 (HF returns 401 for
non-existent too — can't be distinguished). 4-lang i18n updated for
the new mirror banner + reworded gated-note.
Refs:
- https://huggingface.co/docs/transformers.js/guides/private (HF
official: no browser tokens for gated models)
- https://github.com/unslothai/unsloth/issues/880 (chat-template
drift in some unsloth releases — surfaced as caveat in the UI)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- js/i18n.js +24 -8
- js/main.js +34 -6
- js/spec_decode_compat.js +84 -7
|
@@ -648,9 +648,13 @@ export const TRANSLATIONS = {
|
|
| 648 |
"speculative.target_label_short": "target",
|
| 649 |
"speculative.draft_label_short": "draft",
|
| 650 |
"speculative.check_btn": "🔍 Check compatibility",
|
| 651 |
-
"speculative.example_good_btn":"↳ Example:
|
| 652 |
"speculative.example_bad_btn": "↳ Example: cross-family (bad)",
|
| 653 |
-
"speculative.gated_note": "💡 <strong>Gated models</strong> (Llama, Mistral, Gemma)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 654 |
"speculative.status.fetching": "🔄 Fetching tokenizer.json from HF Hub for both models…",
|
| 655 |
"speculative.status.done": "✅ {verdict}",
|
| 656 |
"speculative.status.error": "❌ Error",
|
|
@@ -1802,9 +1806,13 @@ export const TRANSLATIONS = {
|
|
| 1802 |
"speculative.target_label_short": "target",
|
| 1803 |
"speculative.draft_label_short": "draft",
|
| 1804 |
"speculative.check_btn": "🔍 Verificar compatibilidad",
|
| 1805 |
-
"speculative.example_good_btn":"↳ Ejemplo:
|
| 1806 |
"speculative.example_bad_btn": "↳ Ejemplo: cross-family (malo)",
|
| 1807 |
-
"speculative.gated_note": "💡 <strong>Modelos gated</strong> (Llama, Mistral, Gemma)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1808 |
"speculative.status.fetching": "🔄 Haciendo fetch de tokenizer.json desde HF Hub para ambos modelos…",
|
| 1809 |
"speculative.status.done": "✅ {verdict}",
|
| 1810 |
"speculative.status.error": "❌ Error",
|
|
@@ -2820,9 +2828,13 @@ export const TRANSLATIONS = {
|
|
| 2820 |
"speculative.target_label_short": "target",
|
| 2821 |
"speculative.draft_label_short": "draft",
|
| 2822 |
"speculative.check_btn": "🔍 Vérifier compatibilité",
|
| 2823 |
-
"speculative.example_good_btn":"↳ Exemple :
|
| 2824 |
"speculative.example_bad_btn": "↳ Exemple : cross-family (mauvais)",
|
| 2825 |
-
"speculative.gated_note": "💡 <strong>Modèles gated</strong> (Llama, Mistral, Gemma)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2826 |
"speculative.status.fetching": "🔄 Récupération de tokenizer.json depuis HF Hub pour les deux modèles…",
|
| 2827 |
"speculative.status.done": "✅ {verdict}",
|
| 2828 |
"speculative.status.error": "❌ Erreur",
|
|
@@ -3838,9 +3850,13 @@ export const TRANSLATIONS = {
|
|
| 3838 |
"speculative.target_label_short": "target",
|
| 3839 |
"speculative.draft_label_short": "draft",
|
| 3840 |
"speculative.check_btn": "🔍 检查兼容性",
|
| 3841 |
-
"speculative.example_good_btn":"↳ 示例:
|
| 3842 |
"speculative.example_bad_btn": "↳ 示例:跨 family(坏)",
|
| 3843 |
-
"speculative.gated_note": "💡 <strong>受限模型</strong>(Llama、Mistral、Gemma)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 3844 |
"speculative.status.fetching": "🔄 从 HF Hub 获取两个模型的 tokenizer.json…",
|
| 3845 |
"speculative.status.done": "✅ {verdict}",
|
| 3846 |
"speculative.status.error": "❌ 错误",
|
|
|
|
| 648 |
"speculative.target_label_short": "target",
|
| 649 |
"speculative.draft_label_short": "draft",
|
| 650 |
"speculative.check_btn": "🔍 Check compatibility",
|
| 651 |
+
"speculative.example_good_btn":"↳ Example: Llama-3.1 8B/70B (gated → mirror)",
|
| 652 |
"speculative.example_bad_btn": "↳ Example: cross-family (bad)",
|
| 653 |
+
"speculative.gated_note": "💡 <strong>Gated models</strong> (Llama, Mistral, Gemma) trigger an automatic open-mirror fallback (unsloth/...). HF officially discourages browser-side tokens, so the tool can't auth — but mirror tokenizers are typically byte-identical because quantization touches weights, not the tokenizer artifact.",
|
| 654 |
+
"speculative.mirror.heading": "Open-mirror fallback",
|
| 655 |
+
"speculative.mirror.target_used": "Target <code>{original}</code> was gated; used mirror <code>{mirror}</code>.",
|
| 656 |
+
"speculative.mirror.draft_used": "Draft <code>{original}</code> was gated; used mirror <code>{mirror}</code>.",
|
| 657 |
+
"speculative.mirror.warn": "Mirror tokenizers (e.g. unsloth/) are usually byte-identical to the gated original because quantization touches weights, not tokens. Verify chat-template if exact match is required (unsloth #880 documents occasional drift).",
|
| 658 |
"speculative.status.fetching": "🔄 Fetching tokenizer.json from HF Hub for both models…",
|
| 659 |
"speculative.status.done": "✅ {verdict}",
|
| 660 |
"speculative.status.error": "❌ Error",
|
|
|
|
| 1806 |
"speculative.target_label_short": "target",
|
| 1807 |
"speculative.draft_label_short": "draft",
|
| 1808 |
"speculative.check_btn": "🔍 Verificar compatibilidad",
|
| 1809 |
+
"speculative.example_good_btn":"↳ Ejemplo: Llama-3.1 8B/70B (gated → mirror)",
|
| 1810 |
"speculative.example_bad_btn": "↳ Ejemplo: cross-family (malo)",
|
| 1811 |
+
"speculative.gated_note": "💡 <strong>Modelos gated</strong> (Llama, Mistral, Gemma) disparan un fallback automático a mirror open (unsloth/...). HF desaconseja oficialmente tokens en browser, así que la tool no puede autenticar — pero los tokenizers de mirrors son típicamente byte-idénticos porque la cuantización toca weights, no el artefacto del tokenizer.",
|
| 1812 |
+
"speculative.mirror.heading": "Fallback a open-mirror",
|
| 1813 |
+
"speculative.mirror.target_used": "Target <code>{original}</code> estaba gated; se usó mirror <code>{mirror}</code>.",
|
| 1814 |
+
"speculative.mirror.draft_used": "Draft <code>{original}</code> estaba gated; se usó mirror <code>{mirror}</code>.",
|
| 1815 |
+
"speculative.mirror.warn": "Los tokenizers de mirror (ej. unsloth/) suelen ser byte-idénticos al original gated porque la cuantización toca weights, no tokens. Verifica chat-template si necesitas match exacto (unsloth #880 documenta drift ocasional).",
|
| 1816 |
"speculative.status.fetching": "🔄 Haciendo fetch de tokenizer.json desde HF Hub para ambos modelos…",
|
| 1817 |
"speculative.status.done": "✅ {verdict}",
|
| 1818 |
"speculative.status.error": "❌ Error",
|
|
|
|
| 2828 |
"speculative.target_label_short": "target",
|
| 2829 |
"speculative.draft_label_short": "draft",
|
| 2830 |
"speculative.check_btn": "🔍 Vérifier compatibilité",
|
| 2831 |
+
"speculative.example_good_btn":"↳ Exemple : Llama-3.1 8B/70B (gated → mirror)",
|
| 2832 |
"speculative.example_bad_btn": "↳ Exemple : cross-family (mauvais)",
|
| 2833 |
+
"speculative.gated_note": "💡 <strong>Modèles gated</strong> (Llama, Mistral, Gemma) déclenchent un fallback automatique vers un open-mirror (unsloth/...). HF déconseille officiellement les tokens côté navigateur, donc l'outil ne peut pas auth — mais les tokenizers des mirrors sont typiquement byte-identiques car la quantification touche les poids, pas l'artefact du tokenizer.",
|
| 2834 |
+
"speculative.mirror.heading": "Fallback open-mirror",
|
| 2835 |
+
"speculative.mirror.target_used": "Target <code>{original}</code> était gated ; utilisation du mirror <code>{mirror}</code>.",
|
| 2836 |
+
"speculative.mirror.draft_used": "Draft <code>{original}</code> était gated ; utilisation du mirror <code>{mirror}</code>.",
|
| 2837 |
+
"speculative.mirror.warn": "Les tokenizers mirror (ex. unsloth/) sont habituellement byte-identiques au gated original car la quantification touche les poids, pas les tokens. Vérifiez le chat-template si un match exact est requis (unsloth #880 documente une dérive occasionnelle).",
|
| 2838 |
"speculative.status.fetching": "🔄 Récupération de tokenizer.json depuis HF Hub pour les deux modèles…",
|
| 2839 |
"speculative.status.done": "✅ {verdict}",
|
| 2840 |
"speculative.status.error": "❌ Erreur",
|
|
|
|
| 3850 |
"speculative.target_label_short": "target",
|
| 3851 |
"speculative.draft_label_short": "draft",
|
| 3852 |
"speculative.check_btn": "🔍 检查兼容性",
|
| 3853 |
+
"speculative.example_good_btn":"↳ 示例:Llama-3.1 8B/70B(受限 → mirror)",
|
| 3854 |
"speculative.example_bad_btn": "↳ 示例:跨 family(坏)",
|
| 3855 |
+
"speculative.gated_note": "💡 <strong>受限模型</strong>(Llama、Mistral、Gemma)会触发自动 open-mirror 回退(unsloth/...)。HF 官方不推荐浏览器端 token,所以工具无法 auth——但 mirror 的 tokenizer 通常字节级等同,因为量化只影响权重,不影响 tokenizer 工件。",
|
| 3856 |
+
"speculative.mirror.heading": "Open-mirror 回退",
|
| 3857 |
+
"speculative.mirror.target_used": "Target <code>{original}</code> 受限;使用 mirror <code>{mirror}</code>。",
|
| 3858 |
+
"speculative.mirror.draft_used": "Draft <code>{original}</code> 受限;使用 mirror <code>{mirror}</code>。",
|
| 3859 |
+
"speculative.mirror.warn": "Mirror tokenizer(例如 unsloth/)通常与受限原版字节级等同,因为量化只影响权重而非 token。如需精确匹配,请验证 chat-template(unsloth #880 记录了偶发的漂移)。",
|
| 3860 |
"speculative.status.fetching": "🔄 从 HF Hub 获取两个模型的 tokenizer.json…",
|
| 3861 |
"speculative.status.done": "✅ {verdict}",
|
| 3862 |
"speculative.status.error": "❌ 错误",
|
|
@@ -3971,6 +3971,31 @@ function renderSpecResult(result) {
|
|
| 3971 |
|
| 3972 |
const p = result.params;
|
| 3973 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 3974 |
// Section 1 — vocab summary
|
| 3975 |
const typeBadge = (label, val, bg) =>
|
| 3976 |
`<span class="badge" style="background:${bg};">${label}: <code>${val ?? "—"}</code></span>`;
|
|
@@ -4045,6 +4070,7 @@ function renderSpecResult(result) {
|
|
| 4045 |
|
| 4046 |
return `<div class="arena-result">
|
| 4047 |
<p style="font-size:1.1em;">${verdictBadge}</p>
|
|
|
|
| 4048 |
<p>${typeRow}</p>
|
| 4049 |
<p>${sizeRow}</p>
|
| 4050 |
<p>${sampleRow}</p>
|
|
@@ -4072,16 +4098,18 @@ async function runSpecCheck() {
|
|
| 4072 |
}
|
| 4073 |
|
| 4074 |
$("spec-check-btn")?.addEventListener("click", runSpecCheck);
|
| 4075 |
-
// Examples
|
| 4076 |
-
//
|
| 4077 |
-
//
|
| 4078 |
-
//
|
| 4079 |
$("spec-example-good-btn")?.addEventListener("click", () => {
|
| 4080 |
-
|
| 4081 |
-
$("spec-
|
|
|
|
| 4082 |
runSpecCheck();
|
| 4083 |
});
|
| 4084 |
$("spec-example-bad-btn")?.addEventListener("click", () => {
|
|
|
|
| 4085 |
$("spec-target-id").value = "Qwen/Qwen2.5-7B-Instruct";
|
| 4086 |
$("spec-draft-id").value = "microsoft/Phi-3.5-mini-instruct";
|
| 4087 |
runSpecCheck();
|
|
|
|
| 3971 |
|
| 3972 |
const p = result.params;
|
| 3973 |
|
| 3974 |
+
// Mirror banner — when a gated model was fetched via an open mirror.
|
| 3975 |
+
let mirrorBanner = "";
|
| 3976 |
+
if (p.target_via_mirror || p.draft_via_mirror) {
|
| 3977 |
+
const lines = [];
|
| 3978 |
+
if (p.target_via_mirror) {
|
| 3979 |
+
lines.push(tFmt("speculative.mirror.target_used", {
|
| 3980 |
+
original: escapeHtml(p.targetId),
|
| 3981 |
+
mirror: escapeHtml(p.target_via_mirror),
|
| 3982 |
+
}) || `Target was gated; used mirror <code>${escapeHtml(p.target_via_mirror)}</code>.`);
|
| 3983 |
+
}
|
| 3984 |
+
if (p.draft_via_mirror) {
|
| 3985 |
+
lines.push(tFmt("speculative.mirror.draft_used", {
|
| 3986 |
+
original: escapeHtml(p.draftId),
|
| 3987 |
+
mirror: escapeHtml(p.draft_via_mirror),
|
| 3988 |
+
}) || `Draft was gated; used mirror <code>${escapeHtml(p.draft_via_mirror)}</code>.`);
|
| 3989 |
+
}
|
| 3990 |
+
mirrorBanner = `
|
| 3991 |
+
<div style="margin-bottom:0.75em;padding:0.6em;background:#332b00;border-left:3px solid #d29922;border-radius:4px;font-size:0.92em;">
|
| 3992 |
+
<strong>ℹ ${t("speculative.mirror.heading") || "Open-mirror fallback"}</strong>
|
| 3993 |
+
${lines.map(l => `<br>${l}`).join("")}
|
| 3994 |
+
<br><span class="subtle" style="font-size:0.85em;">${t("speculative.mirror.warn") || "Mirror tokenizers (e.g. unsloth/) are usually byte-identical to the gated original because quantization touches weights, not tokens. Verify chat-template if exact match is required."}</span>
|
| 3995 |
+
</div>
|
| 3996 |
+
`;
|
| 3997 |
+
}
|
| 3998 |
+
|
| 3999 |
// Section 1 — vocab summary
|
| 4000 |
const typeBadge = (label, val, bg) =>
|
| 4001 |
`<span class="badge" style="background:${bg};">${label}: <code>${val ?? "—"}</code></span>`;
|
|
|
|
| 4070 |
|
| 4071 |
return `<div class="arena-result">
|
| 4072 |
<p style="font-size:1.1em;">${verdictBadge}</p>
|
| 4073 |
+
${mirrorBanner}
|
| 4074 |
<p>${typeRow}</p>
|
| 4075 |
<p>${sizeRow}</p>
|
| 4076 |
<p>${sampleRow}</p>
|
|
|
|
| 4098 |
}
|
| 4099 |
|
| 4100 |
$("spec-check-btn")?.addEventListener("click", runSpecCheck);
|
| 4101 |
+
// Examples mix gated + open: gated ids (Llama) trigger the open-mirror
|
| 4102 |
+
// fallback (unsloth/...) so the user sees both the demo result AND the
|
| 4103 |
+
// mirror-resolution mechanism. Pure open-weight pairs (Qwen + Phi)
|
| 4104 |
+
// stay as the "no fallback needed" path for the second example.
|
| 4105 |
$("spec-example-good-btn")?.addEventListener("click", () => {
|
| 4106 |
+
// Gated → triggers unsloth mirror fallback for both sides.
|
| 4107 |
+
$("spec-target-id").value = "meta-llama/Llama-3.1-70B-Instruct";
|
| 4108 |
+
$("spec-draft-id").value = "meta-llama/Llama-3.1-8B-Instruct";
|
| 4109 |
runSpecCheck();
|
| 4110 |
});
|
| 4111 |
$("spec-example-bad-btn")?.addEventListener("click", () => {
|
| 4112 |
+
// Open-weight cross-family → no fallback, plain incompatibility demo.
|
| 4113 |
$("spec-target-id").value = "Qwen/Qwen2.5-7B-Instruct";
|
| 4114 |
$("spec-draft-id").value = "microsoft/Phi-3.5-mini-instruct";
|
| 4115 |
runSpecCheck();
|
|
@@ -84,6 +84,77 @@ export async function fetchConfig(modelId) {
|
|
| 84 |
return await fetchHfJson(modelId, "config.json");
|
| 85 |
}
|
| 86 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 87 |
// =============================================================================
|
| 88 |
// Vocab extraction + comparison
|
| 89 |
// =============================================================================
|
|
@@ -312,20 +383,24 @@ export async function checkCompatibility(targetId, draftId) {
|
|
| 312 |
return { code: "identical_models", params: { targetId, draftId }, errors: [] };
|
| 313 |
}
|
| 314 |
|
| 315 |
-
const [tTok, dTok
|
| 316 |
-
|
| 317 |
-
|
| 318 |
-
fetchConfig(targetId),
|
| 319 |
-
fetchConfig(draftId),
|
| 320 |
]);
|
| 321 |
|
| 322 |
const errors = [];
|
| 323 |
-
if (!tTok.ok) errors.push({ side: "target", error: tTok.error, status: tTok.status });
|
| 324 |
-
if (!dTok.ok) errors.push({ side: "draft", error: dTok.error, status: dTok.status });
|
| 325 |
if (!tTok.ok || !dTok.ok) {
|
| 326 |
return { code: "fetch_failed", params: { targetId, draftId }, errors };
|
| 327 |
}
|
| 328 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 329 |
const cmp = compareVocabs(tTok.data, dTok.data);
|
| 330 |
|
| 331 |
// Param ratio + speedup estimate
|
|
@@ -366,6 +441,8 @@ export async function checkCompatibility(targetId, draftId) {
|
|
| 366 |
speedup_high: speedup?.high ?? null,
|
| 367 |
target_source: tTok.source,
|
| 368 |
draft_source: dTok.source,
|
|
|
|
|
|
|
| 369 |
},
|
| 370 |
errors,
|
| 371 |
};
|
|
|
|
| 84 |
return await fetchHfJson(modelId, "config.json");
|
| 85 |
}
|
| 86 |
|
| 87 |
+
// =============================================================================
|
| 88 |
+
// Open-mirror fallback for gated models
|
| 89 |
+
// =============================================================================
|
| 90 |
+
//
|
| 91 |
+
// HF officially DISCOURAGES browser-side tokens (their own transformers.js
|
| 92 |
+
// docs: "we only support accessing private/gated models from server-side
|
| 93 |
+
// environments"). For client-only tools, the practical workaround for
|
| 94 |
+
// gated families (Llama, Mistral, Gemma) is to fall back to public mirrors
|
| 95 |
+
// that re-host the same tokenizer:
|
| 96 |
+
// - unsloth/{name} ← unsloth's open redistributions
|
| 97 |
+
// - unsloth/Meta-{name} ← Meta-prefixed Llama mirrors
|
| 98 |
+
// - unsloth/{name}-bnb-4bit ← quantized variants (tokenizer preserved)
|
| 99 |
+
//
|
| 100 |
+
// Tokenizer (BPE merges + vocab) is text — quantization touches weights,
|
| 101 |
+
// not the tokenizer artifact, so the mirror's tokenizer.json is usually
|
| 102 |
+
// byte-identical to the gated original. Caveat: some unsloth releases
|
| 103 |
+
// patch chat-template tokens (issue #880); we surface that in the UI
|
| 104 |
+
// with a "verify chat-template if exact match required" note.
|
| 105 |
+
|
| 106 |
+
const MIRROR_PATTERN_BUILDERS = [
|
| 107 |
+
(id) => {
|
| 108 |
+
const last = id.split("/").slice(-1)[0];
|
| 109 |
+
return `unsloth/${last}`;
|
| 110 |
+
},
|
| 111 |
+
(id) => {
|
| 112 |
+
const last = id.split("/").slice(-1)[0];
|
| 113 |
+
return last.startsWith("Meta-") ? `unsloth/${last}` : `unsloth/Meta-${last}`;
|
| 114 |
+
},
|
| 115 |
+
(id) => {
|
| 116 |
+
const last = id.split("/").slice(-1)[0];
|
| 117 |
+
return `unsloth/${last}-bnb-4bit`;
|
| 118 |
+
},
|
| 119 |
+
(id) => {
|
| 120 |
+
const last = id.split("/").slice(-1)[0];
|
| 121 |
+
return last.startsWith("Meta-") ? `unsloth/${last}-bnb-4bit` : `unsloth/Meta-${last}-bnb-4bit`;
|
| 122 |
+
},
|
| 123 |
+
];
|
| 124 |
+
|
| 125 |
+
export async function fetchTokenizerWithMirrorFallback(modelId) {
|
| 126 |
+
const original = await fetchTokenizer(modelId);
|
| 127 |
+
if (original.ok) return { ...original, viaMirror: null };
|
| 128 |
+
// Only attempt mirror fallback when the failure is gated/private.
|
| 129 |
+
// 404 / network / parse errors aren't fixable by trying a mirror.
|
| 130 |
+
if (original.error !== "gated_or_private") {
|
| 131 |
+
return { ...original, viaMirror: null };
|
| 132 |
+
}
|
| 133 |
+
const tried = new Set([modelId]);
|
| 134 |
+
for (const build of MIRROR_PATTERN_BUILDERS) {
|
| 135 |
+
let candidate;
|
| 136 |
+
try { candidate = build(modelId); }
|
| 137 |
+
catch { continue; }
|
| 138 |
+
if (!candidate || tried.has(candidate)) continue;
|
| 139 |
+
tried.add(candidate);
|
| 140 |
+
const r = await fetchTokenizer(candidate);
|
| 141 |
+
if (r.ok) return { ...r, viaMirror: candidate, mirrorOf: modelId };
|
| 142 |
+
}
|
| 143 |
+
return { ...original, viaMirror: null, triedMirrors: [...tried].slice(1) };
|
| 144 |
+
}
|
| 145 |
+
|
| 146 |
+
export async function fetchConfigWithMirrorFallback(modelId, mirrorId) {
|
| 147 |
+
// Prefer the mirror's config when one was used (param counts come from
|
| 148 |
+
// there), but also try the ORIGINAL config — some unsloth mirrors omit
|
| 149 |
+
// it. Falls back gracefully.
|
| 150 |
+
if (mirrorId) {
|
| 151 |
+
const m = await fetchConfig(mirrorId);
|
| 152 |
+
if (m.ok) return { ...m, viaMirror: mirrorId };
|
| 153 |
+
}
|
| 154 |
+
const o = await fetchConfig(modelId);
|
| 155 |
+
return { ...o, viaMirror: null };
|
| 156 |
+
}
|
| 157 |
+
|
| 158 |
// =============================================================================
|
| 159 |
// Vocab extraction + comparison
|
| 160 |
// =============================================================================
|
|
|
|
| 383 |
return { code: "identical_models", params: { targetId, draftId }, errors: [] };
|
| 384 |
}
|
| 385 |
|
| 386 |
+
const [tTok, dTok] = await Promise.all([
|
| 387 |
+
fetchTokenizerWithMirrorFallback(targetId),
|
| 388 |
+
fetchTokenizerWithMirrorFallback(draftId),
|
|
|
|
|
|
|
| 389 |
]);
|
| 390 |
|
| 391 |
const errors = [];
|
| 392 |
+
if (!tTok.ok) errors.push({ side: "target", error: tTok.error, status: tTok.status, triedMirrors: tTok.triedMirrors });
|
| 393 |
+
if (!dTok.ok) errors.push({ side: "draft", error: dTok.error, status: dTok.status, triedMirrors: dTok.triedMirrors });
|
| 394 |
if (!tTok.ok || !dTok.ok) {
|
| 395 |
return { code: "fetch_failed", params: { targetId, draftId }, errors };
|
| 396 |
}
|
| 397 |
|
| 398 |
+
// Fetch configs — prefer mirror when one was used.
|
| 399 |
+
const [tCfg, dCfg] = await Promise.all([
|
| 400 |
+
fetchConfigWithMirrorFallback(targetId, tTok.viaMirror),
|
| 401 |
+
fetchConfigWithMirrorFallback(draftId, dTok.viaMirror),
|
| 402 |
+
]);
|
| 403 |
+
|
| 404 |
const cmp = compareVocabs(tTok.data, dTok.data);
|
| 405 |
|
| 406 |
// Param ratio + speedup estimate
|
|
|
|
| 441 |
speedup_high: speedup?.high ?? null,
|
| 442 |
target_source: tTok.source,
|
| 443 |
draft_source: dTok.source,
|
| 444 |
+
target_via_mirror: tTok.viaMirror || null,
|
| 445 |
+
draft_via_mirror: dTok.viaMirror || null,
|
| 446 |
},
|
| 447 |
errors,
|
| 448 |
};
|