Commit 6378efa · Parent(s): 7700f2f
v0.7.7-fix: ⓘ tooltips on each task tile, 4 langs (EN/ES/FR/ZH)
Each of the 5 task tiles now has an info icon (ⓘ) next to the title that opens a detailed tooltip listing the modes inside, what each one does, and concrete example use cases. Matches the existing tooltip pattern (modes.tip, etc.).
5 new keys × 4 langs (698 total, 0 missing / 0 extra parity).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- index.html +20 -5
- js/i18n.js +20 -0
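The "0 missing / 0 extra" parity claim above implies a key-parity check across the four language tables. Here is a minimal sketch of such a check, assuming `TRANSLATIONS` maps a language code to a flat key→string object (as the `export const TRANSLATIONS` hunk headers suggest); the `checkParity` helper is illustrative, not the repo's actual tooling:

```javascript
// Sketch of a "0 missing / 0 extra" key-parity check across languages.
// Assumed shape: { en: { key: string, ... }, es: { ... }, ... }
const TRANSLATIONS = {
  en: { "tile.manual.title": "📋 Manual / free-form", "tile.manual.tip": "…" },
  es: { "tile.manual.title": "📋 Manual / libre", "tile.manual.tip": "…" },
};

function checkParity(translations, reference = "en") {
  const refKeys = new Set(Object.keys(translations[reference]));
  const report = {};
  for (const [lang, table] of Object.entries(translations)) {
    if (lang === reference) continue;
    const keys = new Set(Object.keys(table));
    report[lang] = {
      // keys present in the reference language but absent here
      missing: [...refKeys].filter((k) => !keys.has(k)),
      // keys present here but absent in the reference language
      extra: [...keys].filter((k) => !refKeys.has(k)),
    };
  }
  return report;
}

console.log(checkParity(TRANSLATIONS)); // → { es: { missing: [], extra: [] } }
```

Running this over the real 698-key tables would reproduce the parity figure quoted in the commit message.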
index.html
CHANGED
@@ -355,7 +355,10 @@
       <p class="recipe-desc" data-i18n="tiles.subtitle">Pick a task. Each one opens the right tool below. Or scroll down for the full list of 14 modes.</p>
       <div class="tiles-grid">
         <div class="task-tile">
-          <h3
+          <h3>
+            <span data-i18n="tile.diagnose.title">🔬 Diagnose a model</span>
+            <span class="info"><span class="tooltip" data-i18n="tile.diagnose.tip">Start here when you have a specific model id and want a full diagnostic: <strong>Profile</strong> runs all 5 recipes at once. <strong>Unmask</strong> checks if max_position_embeddings is honest. <strong>NIAH→Reason</strong> predicts retrieval-vs-reasoning gap. <strong>Quant</strong> predicts whether quantizing will break it. <strong>Inspect</strong> lets you paste raw config.json for private/in-dev models.</span></span>
+          </h3>
           <p class="tile-desc" data-i18n="tile.diagnose.desc">Will this specific model work for my use case?</p>
           <div class="tile-modes">
             <button data-mode-link="profile" data-i18n="modes.profile">📇 Profile a model</button>
@@ -366,7 +369,10 @@
           </div>
         </div>
         <div class="task-tile">
-          <h3
+          <h3>
+            <span data-i18n="tile.trust.title">✓ Trust a benchmark score</span>
+            <span class="info"><span class="tooltip" data-i18n="tile.trust.tip">When you see a score and want to know if it's real. <strong>Contamination</strong> rates 20+ benchmarks for likelihood the model saw them during training. <strong>Drift</strong> tells you if a gap between two evals is numerical noise or a real bug (chat-template mismatch, KV-cache layout, etc.). <strong>Arena CI</strong> reconstructs the confidence intervals Chatbot Arena hides — many top-Elo "wins" are statistically tied.</span></span>
+          </h3>
           <p class="tile-desc" data-i18n="tile.trust.desc">Should I believe this number? Bug or noise?</p>
           <div class="tile-modes">
             <button data-mode-link="contam" data-i18n="modes.contam">🧪 Contamination</button>
@@ -375,7 +381,10 @@
           </div>
         </div>
         <div class="task-tile">
-          <h3
+          <h3>
+            <span data-i18n="tile.eval.title">⚙️ Set up an eval correctly</span>
+            <span class="info"><span class="tooltip" data-i18n="tile.eval.tip">Before you run lm-eval-harness or vLLM serve, get the right CLI flag. <strong>Chat-template Sniffer</strong> detects the template family (Llama-3 / ChatML / Mistral / Phi-3 / DeepSeek / Alpaca / custom / none) and emits the exact <code>--apply_chat_template</code> / <code>--chat-template</code> invocation. Solves issue #1841 in lm-eval-harness (silent ÷2 accuracy). <strong>Diagnose CLI</strong> generates the Python command to measure γ_obs on your local GPU.</span></span>
+          </h3>
           <p class="tile-desc" data-i18n="tile.eval.desc">Get the exact CLI flag for lm-eval / vLLM / transformers.</p>
           <div class="tile-modes">
             <button data-mode-link="template" data-i18n="modes.template">📜 Chat-template</button>
@@ -383,7 +392,10 @@
           </div>
         </div>
         <div class="task-tile">
-          <h3
+          <h3>
+            <span data-i18n="tile.compare.title">🆚 Compare models</span>
+            <span class="info"><span class="tooltip" data-i18n="tile.compare.tip"><strong>Compare</strong>: pick 2-3 candidate models + one recipe, see verdicts in a side-by-side table (e.g. Llama-3-8B vs Mistral-7B at 32k context). <strong>Phase diagram</strong>: scatter of 23 empirical models on the (log θ, γ) plane, with the Padé curve overlaid. Hover dots for details, click to load that model into the Recipe form.</span></span>
+          </h3>
           <p class="tile-desc" data-i18n="tile.compare.desc">Side-by-side, or browse the empirical model landscape.</p>
           <div class="tile-modes">
             <button data-mode-link="compare" data-i18n="modes.compare">🆚 Compare models</button>
@@ -391,7 +403,10 @@
           </div>
         </div>
         <div class="task-tile">
-          <h3
+          <h3>
+            <span data-i18n="tile.manual.title">📋 Manual / free-form</span>
+            <span class="info"><span class="tooltip" data-i18n="tile.manual.tip"><strong>Recipe</strong>: pick a specific X-N recipe (X-1 custom-vs-API, X-2 long context, X-3 budget, X-5 hardware, X-19 KV compression, X-21 imprint, X-22 compute-context invariant, X-23 IH-phase) and fill the form by hand for full control. <strong>Ask</strong>: type a free-form question; an in-browser 0.5B LLM (Qwen2.5) picks the right recipe and runs it. Best for "what would happen if..." exploration.</span></span>
+          </h3>
           <p class="tile-desc" data-i18n="tile.manual.desc">Pick a specific recipe by hand, or ask in plain English.</p>
           <div class="tile-modes">
             <button data-mode-link="recipe" data-i18n="modes.recipe">📋 Pick recipe</button>
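The tooltip markup above only works if something resolves the `data-i18n` keys against the active language table at runtime. A minimal sketch of that lookup, assuming a flat key→string table; the `applyI18n` name is hypothetical (the Space's real wiring lives elsewhere in js/i18n.js). Since the tip strings carry markup (`<strong>`, `<code>`), they must land in `innerHTML`, not `textContent`:

```javascript
// Hypothetical resolver for the data-i18n attributes used in the tiles.
// `table` is one language's flat object, e.g. { "tile.diagnose.tip": "...", ... }.
function applyI18n(root, table) {
  for (const el of root.querySelectorAll("[data-i18n]")) {
    const value = table[el.dataset.i18n];
    // Tooltip strings contain <strong>/<code> markup, so innerHTML is required.
    if (value !== undefined) el.innerHTML = value;
  }
}
```

Elements whose key is absent from the table keep the English fallback text already present in the markup.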
js/i18n.js
CHANGED
@@ -479,6 +479,11 @@ export const TRANSLATIONS = {
     "tile.compare.desc": "Side-by-side, or browse the empirical model landscape.",
     "tile.manual.title": "📋 Manual / free-form",
     "tile.manual.desc": "Pick a specific recipe by hand, or ask in plain English.",
+    "tile.diagnose.tip": "Start here when you have a specific model id and want a full diagnostic: <strong>Profile</strong> runs all 5 recipes at once. <strong>Unmask</strong> checks if max_position_embeddings is honest. <strong>NIAH→Reason</strong> predicts retrieval-vs-reasoning gap. <strong>Quant</strong> predicts whether quantizing will break it. <strong>Inspect</strong> lets you paste raw config.json for private/in-dev models.",
+    "tile.trust.tip": "When you see a score and want to know if it's real. <strong>Contamination</strong> rates 20+ benchmarks for likelihood the model saw them during training. <strong>Drift</strong> tells you if a gap between two evals is numerical noise or a real bug (chat-template mismatch, KV-cache layout, etc.). <strong>Arena CI</strong> reconstructs the confidence intervals Chatbot Arena hides — many top-Elo \"wins\" are statistically tied.",
+    "tile.eval.tip": "Before you run lm-eval-harness or vLLM serve, get the right CLI flag. <strong>Chat-template Sniffer</strong> detects the template family (Llama-3 / ChatML / Mistral / Phi-3 / DeepSeek / Alpaca / custom / none) and emits the exact <code>--apply_chat_template</code> / <code>--chat-template</code> invocation. Solves issue #1841 in lm-eval-harness (silent ÷2 accuracy). <strong>Diagnose CLI</strong> generates the Python command to measure γ_obs on your local GPU.",
+    "tile.compare.tip": "<strong>Compare</strong>: pick 2-3 candidate models + one recipe, see verdicts in a side-by-side table (e.g. Llama-3-8B vs Mistral-7B at 32k context). <strong>Phase diagram</strong>: scatter of 23 empirical models on the (log θ, γ) plane, with the Padé curve overlaid. Hover dots for details, click to load that model into the Recipe form.",
+    "tile.manual.tip": "<strong>Recipe</strong>: pick a specific X-N recipe (X-1 custom-vs-API, X-2 long context, X-3 budget, X-5 hardware, X-19 KV compression, X-21 imprint, X-22 compute-context invariant, X-23 IH-phase) and fill the form by hand for full control. <strong>Ask</strong>: type a free-form question; an in-browser 0.5B LLM (Qwen2.5) picks the right recipe and runs it. Best for \"what would happen if...\" exploration.",
     "share.import_desc": "Got a JSON file from someone else's TAF analysis? Load it here to see the verdict + chain locally. Same view as if you'd run it yourself.",
     "share.import_btn": "📂 Load shared JSON",
     "synthesis.system": "You are a precise transformer LLM diagnostic assistant. Given pre-computed TAF formula results, write a clear plain-English summary in 4-6 sentences. Cite the section number (§X.Y) for each number you mention. Always give a concrete recommendation. Do NOT invent numbers.",
@@ -1382,6 +1387,11 @@ export const TRANSLATIONS = {
     "tile.compare.desc": "Lado a lado, o explora el panel empírico de modelos.",
     "tile.manual.title": "📋 Manual / libre",
     "tile.manual.desc": "Elige una receta concreta a mano, o pregunta en inglés llano.",
+    "tile.diagnose.tip": "Empieza aquí cuando tengas un id de modelo concreto y quieras diagnóstico completo: <strong>Profile</strong> corre las 5 recetas a la vez. <strong>Unmask</strong> comprueba si max_position_embeddings es honesto. <strong>NIAH→Reason</strong> predice el gap retrieval-vs-reasoning. <strong>Quant</strong> predice si cuantizar lo romperá. <strong>Inspect</strong> permite pegar config.json crudo para modelos privados / en desarrollo.",
+    "tile.trust.tip": "Cuando ves un score y quieres saber si es real. <strong>Contamination</strong> puntúa 20+ benchmarks por probabilidad de que el modelo los viera en entrenamiento. <strong>Drift</strong> te dice si el gap entre dos evals es ruido numérico o bug real (chat-template mismatch, layout KV-cache, etc.). <strong>Arena CI</strong> reconstruye los intervalos de confianza que Chatbot Arena oculta — muchas \"victorias\" top-Elo están estadísticamente empatadas.",
+    "tile.eval.tip": "Antes de correr lm-eval-harness o vLLM serve, obtén el flag CLI correcto. <strong>Chat-template Sniffer</strong> detecta la familia de template (Llama-3 / ChatML / Mistral / Phi-3 / DeepSeek / Alpaca / custom / none) y emite la invocación exacta <code>--apply_chat_template</code> / <code>--chat-template</code>. Resuelve el issue #1841 de lm-eval-harness (÷2 accuracy silencioso). <strong>Diagnose CLI</strong> genera el comando Python para medir γ_obs en tu GPU local.",
+    "tile.compare.tip": "<strong>Compare</strong>: elige 2-3 modelos candidatos + una receta, ve veredictos en tabla lado a lado (ej. Llama-3-8B vs Mistral-7B a 32k). <strong>Phase diagram</strong>: scatter de 23 modelos empíricos en el plano (log θ, γ), con la curva Padé superpuesta. Hover puntos para detalles, click para cargar ese modelo en la Recipe form.",
+    "tile.manual.tip": "<strong>Recipe</strong>: elige una receta X-N específica (X-1 custom-vs-API, X-2 long context, X-3 budget, X-5 hardware, X-19 compresión KV, X-21 imprint, X-22 compute-context invariant, X-23 IH-phase) y rellena la form a mano para control total. <strong>Ask</strong>: escribe una pregunta libre; un LLM 0.5B (Qwen2.5) en tu navegador elige la receta correcta y la ejecuta. Ideal para exploración \"qué pasaría si...\".",
     "share.import_desc": "¿Tienes un fichero JSON del análisis TAF de alguien? Cárgalo aquí para ver el veredicto + cadena localmente. La misma vista que si lo hubieras ejecutado tú.",
     "share.import_btn": "📂 Cargar JSON compartido",
     "synthesis.system": "Eres un asistente de diagnóstico preciso para LLMs transformer. Dados resultados de fórmulas TAF pre-calculados, escribe un resumen claro en español de 4-6 frases. Cita el número de sección (§X.Y) para cada número que menciones. Da siempre una recomendación concreta. NO inventes números.",
@@ -2149,6 +2159,11 @@ export const TRANSLATIONS = {
     "tile.compare.desc": "Côte à côte, ou explorez le panel empirique de modèles.",
     "tile.manual.title": "📋 Manuel / libre",
     "tile.manual.desc": "Choisissez une recette à la main, ou demandez en langage naturel.",
+    "tile.diagnose.tip": "Commencez ici quand vous avez un id de modèle spécifique et voulez un diagnostic complet : <strong>Profile</strong> lance les 5 recettes d'un coup. <strong>Unmask</strong> vérifie si max_position_embeddings est honnête. <strong>NIAH→Reason</strong> prédit le gap retrieval-vs-reasoning. <strong>Quant</strong> prédit si quantifier va le casser. <strong>Inspect</strong> permet de coller un config.json brut pour modèles privés / en dev.",
+    "tile.trust.tip": "Quand vous voyez un score et voulez savoir s'il est réel. <strong>Contamination</strong> note 20+ benchmarks selon la probabilité que le modèle les ait vus en entraînement. <strong>Drift</strong> vous dit si l'écart entre deux évals est du bruit numérique ou un vrai bug (chat-template mismatch, layout KV-cache, etc.). <strong>Arena CI</strong> reconstruit les intervalles de confiance que Chatbot Arena cache — beaucoup de \"victoires\" top-Elo sont statistiquement à égalité.",
+    "tile.eval.tip": "Avant de lancer lm-eval-harness ou vLLM serve, obtenez le bon flag CLI. <strong>Chat-template Sniffer</strong> détecte la famille de template (Llama-3 / ChatML / Mistral / Phi-3 / DeepSeek / Alpaca / custom / none) et émet l'invocation exacte <code>--apply_chat_template</code> / <code>--chat-template</code>. Résout l'issue #1841 de lm-eval-harness (÷2 accuracy silencieux). <strong>Diagnose CLI</strong> génère la commande Python pour mesurer γ_obs sur votre GPU local.",
+    "tile.compare.tip": "<strong>Compare</strong> : choisissez 2-3 modèles candidats + une recette, voyez les verdicts dans un tableau côte à côte (ex. Llama-3-8B vs Mistral-7B à 32k). <strong>Phase diagram</strong> : nuage de 23 modèles empiriques dans le plan (log θ, γ), avec la courbe Padé superposée. Survolez les points pour détails, cliquez pour charger ce modèle dans le formulaire Recipe.",
+    "tile.manual.tip": "<strong>Recipe</strong> : choisissez une recette X-N spécifique (X-1 custom-vs-API, X-2 long context, X-3 budget, X-5 hardware, X-19 compression KV, X-21 imprint, X-22 compute-context invariant, X-23 IH-phase) et remplissez le formulaire à la main pour contrôle total. <strong>Ask</strong> : tapez une question libre ; un LLM 0.5B (Qwen2.5) dans votre navigateur choisit la bonne recette et la lance. Idéal pour explorer \"que se passerait-il si...\".",
     "share.import_desc": "Vous avez un fichier JSON de l'analyse TAF de quelqu'un ? Chargez-le ici pour voir le verdict + la chaîne localement. La même vue que si vous l'aviez exécuté vous-même.",
     "share.import_btn": "📂 Charger JSON partagé",
     "synthesis.system": "Vous êtes un assistant de diagnostic précis pour LLMs transformer. Étant donné des résultats de formules TAF pré-calculés, écrivez un résumé clair en français de 4-6 phrases. Citez le numéro de section (§X.Y) pour chaque nombre mentionné. Donnez toujours une recommandation concrète. N'INVENTEZ PAS de nombres.",
@@ -2916,6 +2931,11 @@ export const TRANSLATIONS = {
     "tile.compare.desc": "并排,或浏览经验模型面板。",
     "tile.manual.title": "📋 手动 / 自由",
     "tile.manual.desc": "手动挑一个具体 recipe,或用自然语言提问。",
+    "tile.diagnose.tip": "当你有具体的 model id 并想要完整诊断时从这里开始:<strong>Profile</strong> 一次运行所有 5 个 recipe。<strong>Unmask</strong> 检查 max_position_embeddings 是否诚实。<strong>NIAH→Reason</strong> 预测 retrieval-vs-reasoning 的 gap。<strong>Quant</strong> 预测量化是否会破坏它。<strong>Inspect</strong> 允许粘贴原始 config.json,适用于私有 / 在研模型。",
+    "tile.trust.tip": "当你看到一个分数想知道它是否可靠。<strong>Contamination</strong> 按模型在训练时看到 benchmark 的可能性给 20+ 个 benchmark 评级。<strong>Drift</strong> 告诉你两个 eval 之间的 gap 是数值噪声还是真实 bug(chat-template 不匹配、KV-cache 布局等)。<strong>Arena CI</strong> 重建 Chatbot Arena 隐藏的置信区间——很多 top-Elo 的 \"胜利\" 在统计上是并列。",
+    "tile.eval.tip": "在运行 lm-eval-harness 或 vLLM serve 之前,获取正确的 CLI flag。<strong>Chat-template Sniffer</strong> 检测 template 系列(Llama-3 / ChatML / Mistral / Phi-3 / DeepSeek / Alpaca / custom / none)并输出精确的 <code>--apply_chat_template</code> / <code>--chat-template</code> 调用。解决 lm-eval-harness 的 issue #1841(accuracy 静默对半)。<strong>Diagnose CLI</strong> 生成 Python 命令在你的本地 GPU 上测量 γ_obs。",
+    "tile.compare.tip": "<strong>Compare</strong>:选择 2-3 个候选模型 + 一个 recipe,在并排表格中看判定(例如 Llama-3-8B vs Mistral-7B 在 32k 上下文)。<strong>Phase diagram</strong>:23 个经验模型在 (log θ, γ) 平面上的散点图,叠加 Padé 曲线。悬停点查看详情,点击将该模型加载到 Recipe 表单。",
+    "tile.manual.tip": "<strong>Recipe</strong>:挑选具体的 X-N recipe(X-1 自训 vs API、X-2 长上下文、X-3 预算、X-5 硬件、X-19 KV 压缩、X-21 imprint、X-22 compute-context 不变量、X-23 IH 相位)并手动填表,完全控制。<strong>Ask</strong>:输入自由问题;浏览器内的 0.5B LLM(Qwen2.5)选择合适的 recipe 并运行。最适合 \"如果……会怎样\" 的探索。",
     "share.import_desc": "有他人 TAF 分析的 JSON 文件? 在这里加载以本地查看判定 + 链。与您自己运行的视图相同。",
     "share.import_btn": "📂 加载共享的 JSON",
     "synthesis.system": "您是 transformer LLM 的精确诊断助手。给定预先计算的 TAF 公式结果,用 4-6 句中文写出清晰的摘要。为每个提到的数字引用章节号 (§X.Y)。始终给出具体建议。不要编造数字。",