Spaces:

aamrinder
/

subtext-arena

Sleeping

App Files Files Community

aamrinder commited on 14 days ago

Commit

989102e

verified ·

1 Parent(s): 382b300

Upload folder using huggingface_hub

Browse files

Files changed (2) hide show

README.md +156 -142
train/side_by_side.py +10 -4

README.md CHANGED Viewed

@@ -13,76 +13,132 @@ tags:
   - audio
   - sarcasm
   - theory-of-mind
 ---
-# Subtext Arena
-> *"Same words. Opposite meaning. The signal lives in how you say it."*
-**Subtext Arena** is an OpenEnv environment for training an LLM to detect sarcasm by *listening to prosody* — not just reading text. The agent sees a clip from MUStARD (sarcasm in Friends / Big Bang Theory dialogue) and must decide if the line is sarcastic or sincere. To answer well, it has to investigate **both** the literal transcript **and** the audio cues that carry the speaker's true intent.
-Built for the OpenEnv Hackathon Finale (Apr 25–26, 2026, Bangalore). Theme: **World Modeling** — modeling hidden mental state from observable audio.
 ---
-## The capability gap
-Try this with any LLM today:
-> Speaker: *"Oh, **great**. Another Monday."*
-Read it as text and you genuinely cannot tell whether the speaker is excited or fed up. The signal is entirely in pitch contour, micro-pauses, and energy patterns. **GPT-4o sits at 67% Macro-F1 on MUStARD++** — the remaining gap is exactly where social meaning lives.
-Closest prior work: **AudioToolAgent** (arxiv 2510.02995, Oct 2025) showed that prompting a frontier LLM to use audio analysis tools beats end-to-end audio LLMs on this task. We extend that line by **RL-training** the orchestrator instead of prompting it: the agent learns *when* to listen, *what* to listen for, and how to weight the cues against the transcript.
 ---
-## What the agent sees and does
-Each episode:
-1. The env picks a MUStARD clip (Pivot Set oversampled 3×).
-2. The agent has up to **6 tool calls** to investigate, then must `submit_belief`.
-| Tool | Returns |
-|---|---|
-| `get_transcript` | The literal utterance + 1–7 lines of preceding conversational context |
-| `get_prosody_features` | Pitch mean / variability, energy mean / variability, voiced ratio, pre-utterance silence, internal pauses |
-| `get_pitch_contour` | ASCII sparkline of the f0 contour over 30 samples + range + rising/falling/flat trend |
-| `submit_belief` | `{"label": "sarcastic" \| "sincere", "confidence": 0..1}` — terminates the episode |
-All audio is pre-processed offline by Whisper-class transcription (we use MUStARD's ground-truth transcripts) + librosa pyin/RMS feature extraction; the trained model is **text-only** (Path A architecture, mirroring AudioToolAgent). The Pivot Set explicitly forces audio-tool use: a hard −0.50 reward penalty fires when the agent tries to answer a Pivot clip without consulting any audio tool.
----
-## Reward (multi-criterion, plottable as a stacked-bar chart)
-```
-final_reward = 0.50 · correctness        (confidence-weighted match against gold)
-             + 0.25 · prosody_grounding  (1.0 if any audio call; 0.4 on non-Pivot if none; 0.0 on Pivot if none)
-             + 0.15 · tool_parsimony     (1.0 for 1–3 calls; 0.6 for 4–5; 0.0 for >5)
-             + 0.10 · format_ok          (well-formed submit_belief)
-             + penalties:
-                 -0.30 if no submission
-                 -0.20 if more than 5 tool calls
-                 -0.50 if Pivot + zero audio calls + wrong answer
-```
-Plus per-step deltas: +0.05 each audio-tool call, +0.02 each transcript read, −0.10 unknown tool, −0.05 malformed args. The decomposed reward gives judges a stacked-bar chart in which different criteria converge at different rates — visible and informative.
 ---
 ## Quick start
-### Run from the HF Space (no install)
 ```python
 from subtext_arena import SubtextArenaEnv, SubtextArenaAction
-# Replace with your deployed Space URL
-url = "https://aamrinder-subtext-arena.hf.space"
-with SubtextArenaEnv(base_url=url).sync() as env:
     obs = env.reset().observation
     print(f"Clip {obs.clip_id} ({obs.speaker}, {obs.duration_s:.1f}s, pivot={obs.is_pivot})")
@@ -108,102 +164,45 @@ uv sync
 uv run server  # serves on :8000
 ```
-### Build / push to HF Spaces
 ```bash
-openenv build
-openenv push --repo-id aamrinder/subtext-arena
 ```
----
-## Training (Unsloth + TRL GRPO, single-step CoT)
-**Two ways to train, both deck-named** (requirement #2: "Unsloth or HF TRL"):
-| Path | Use when |
-|---|---|
-| `train/train_grpo.py` | Long runs on HF Jobs (~12 h on T4 for 200 steps) |
-| `notebooks/train_grpo_colab.ipynb` | Re-runnable Colab notebook judges can open |
-Both invoke a Qwen2.5-3B-Instruct policy via **Unsloth** + **TRL `GRPOTrainer`** (LoRA r=16, 4-bit, T4-fits at $0.60/hr).
-### Architecture: single-step chain-of-thought classification
-Each training rollout:
-1. The training script builds **one prompt** for one MUStARD clip — system + transcript + prosody features + pitch contour, all in the user message (the same view an agent would get if it called `get_transcript`, `get_prosody_features`, and `get_pitch_contour` in sequence).
-2. The model emits **one completion**:
-   ```
-   <think>your reasoning over the prosody and lexical cues</think>
-   <final>{"label":"sarcastic"|"sincere","confidence":0..1}</final>
-   ```
-3. The reward function parses `<final>`, scores against the gold label.
-4. GRPO's group-relative advantage updates the LoRA weights.
-This sidesteps a known TRL constraint: `GRPOTrainer.train()` does single-shot generate-then-score (it's the same pattern as the deck-linked Wordle and Sudoku notebooks). By pre-rendering the tool outputs into the prompt, the model gets enough information at generation time to reason about the actual clip — instead of producing a tool-call script blind.
-The Subtext Arena env still supports multi-step tool calling at inference. That's our HF Space demo and what judges interact with. Training-time and inference-time architecture are deliberately decoupled.
-High-level shape:
-```python
-from unsloth import FastLanguageModel
-from trl import GRPOTrainer, GRPOConfig
-from train.train_grpo import build_dataset, make_reward_fn
-from server.scenarios import load_scenarios
-model, tok = FastLanguageModel.from_pretrained(
-    "unsloth/Qwen2.5-3B-Instruct", max_seq_length=4096, load_in_4bit=True,
-)
-model = FastLanguageModel.get_peft_model(
-    model, r=16, lora_alpha=16,
-    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
-)
-# Each row is a full briefing for one MUStARD clip (transcript + prosody + contour).
-# Pivot Set is oversampled 3x.
-dataset = build_dataset(load_scenarios(), n_rows=600)
-# Reward = 0.70 correctness + 0.15 reasoning_length + 0.15 format
-reward_fn = make_reward_fn()
-GRPOTrainer(
-    model=model,
-    reward_funcs=reward_fn,
-    args=GRPOConfig(num_generations=4, max_completion_length=768,
-                    per_device_train_batch_size=1, learning_rate=5e-6,
-                    max_steps=200, bf16=True, gradient_checkpointing=True),
-    train_dataset=dataset,
-    processing_class=tok,
-).train()
-```
-**Pivot Set curation** (`train/curate_pivot_set.py`): a 4-step CLI that runs the text-only Qwen baseline on all 690 utterances, filters confidently-wrong clips, rates audio audibility via a Gradio UI, and finalizes a balanced 50-clip set written to `data/pivot_set.json`.
-**Eval** (`train/eval_pivot_set.py`): generates a single completion per Pivot clip × 3 seeds, parses the predicted label, computes per-clip majority accuracy. This produces the headline `X/50 → Y/50` number.
-**Plots** (`train/plot_reward_decomp.py`): generates the 3-line reward decomposition chart (correctness / reasoning_length / format) from the training log. Saves to `docs/plots/reward_decomposition.png`.
-**Side-by-side demo** (`train/side_by_side.py`): runs both the base Qwen2.5-3B and the trained checkpoint on hand-picked Pivot clips, dumps an HTML page with their reasoning traces side-by-side. This is the demo artifact judges read.
 ---
-## Results
-*Plots and the headline Pivot-Set number will be filled in here at submission time. Expected:*
-- **Pivot Set: ~24/50 → ~35/50** (target) — on 50 cherry-picked MUStARD clips where text-only Qwen is wrong, the trained policy recovers a measurable fraction of the audio-derived signal. This is the project's defensibility.
-- 3-line reward decomposition: `format` saturates first (~step 80, model learns to emit valid `<final>` JSON), `reasoning_length` follows (~step 100-150), `correctness` climbs slowest (~step 150-200, the actual sarcasm-detection signal).
-- Side-by-side step-0 vs step-final completions on 5 hand-picked Pivot clips: visible reasoning quality change (the trained agent grounds its conclusion in the prosody markers).
----
-## Why it matters
-- **Accessibility**: hearing-impaired users rely on captions that strip prosody — sarcasm becomes opaque. An agent that can flag intended-vs-literal mismatch could augment caption pipelines.
-- **Voice agents**: customer-support bots that mistake sarcastic complaints for sincere praise are a known failure mode. Prosody-aware classification is a clean pre-cursor.
-- **Multimodal RL frontier**: pure-audio + subtext + verifiable RL reward is genuinely uncovered ground in the OpenEnv landscape; this environment ships a reproducible recipe.
 ---
@@ -211,38 +210,53 @@ GRPOTrainer(
 ```
 subtext_arena/
-├── README.md                 ← this file (3-5 min read)
-├── __init__.py               ← exports client + models
-├── client.py                 ← SubtextArenaEnv client (HTTP + WS)
-├── models.py                 ← SubtextArenaAction + SubtextArenaObservation
 ├── openenv.yaml              ← manifest (spec_version=1, runtime=fastapi)
-├── pyproject.toml            ← package + deps
 ├── uv.lock
 ├── data/
 │   ├── sarcasm_data.json         ← 690 MUStARD ground-truth labels + transcripts
-│   ├── prosody_cache/utterances/ ← 690 pre-computed prosody jsons (~5MB total)
 │   ├── prosody_cache/context/    ← 690 context-clip prosody jsons
-│   └── pivot_set.json            ← 50 hand-curated Prosody-Pivot clips (filled Day 1)
-└── server/
-    ├── __init__.py
-    ├── app.py                ← FastAPI factory
-    ├── audio_tools.py        ← prosody / pitch-contour / transcript renderers
-    ├── scenarios.py          ← MUStARD scenario loader + Pivot oversampling
-    ├── grader.py             ← per-step + final-reward computation
-    ├── subtext_arena_environment.py  ← Environment subclass (reset / step / state)
-    ├── Dockerfile
-    └── requirements.txt
 ```
 ---
 ## Credits & references
-- **Dataset**: [MUStARD](https://github.com/soujanyaporia/MUStARD) (Castro et al., ACL 2019) — 690 sarcasm/non-sarcasm clips from sitcom dialogue, balanced labels.
-- **Closest prior art**: [AudioToolAgent](https://arxiv.org/abs/2510.02995) (Oct 2025) — same architecture, but prompted not RL-trained.
-- **Adjacent recent work**: [EmotionThinker](https://arxiv.org/abs/2601.15668) (Mar 2026, GRPO for prosody emotion); [SarcasmMiner](https://arxiv.org/abs/2603.05275) (Mar 2026, video+audio sarcasm).
-- **Modality-fail benchmark**: [LISTEN](https://arxiv.org/abs/2510.10444) (Oct 2025) — documents the audio-LLM gap on emotion-mismatch cases.
-Built for the **OpenEnv Hackathon Finale** sponsored by Meta PyTorch / Hugging Face / Scaler. Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) v0.2.3.
 > Team **BalleBalle** — Amrinder Singh + Shubham Kapoor.

   - audio
   - sarcasm
   - theory-of-mind
+  - benchmark
 ---
+# Subtext Arena 🎧
+> An OpenEnv reproducible benchmark for **prosody-grounded subtext reasoning**.
+A clip plays. The words are: *"Yeah, I'm really looking forward to it."* Sincere or sarcastic? You can read it 10 times and still need the audio. **Subtext Arena is an OpenEnv environment that lets you train and evaluate language models on exactly that task** — using transcripts + prosody features extracted from real MUStARD audio.
+Built for the OpenEnv Hackathon Finale (Apr 25–26, 2026, Bangalore).
+Theme: **#3 World Modeling** — modeling hidden speaker intent from observable acoustic cues.
+**Submission URL**: https://huggingface.co/spaces/aamrinder/subtext-arena
+**Trained model**: https://huggingface.co/aamrinder/subtext-arena-grpo
+**Live demo (HTML)**: [docs/side_by_side.html](https://huggingface.co/spaces/aamrinder/subtext-arena/blob/main/docs/side_by_side.html)
+**Re-runnable Colab**: [notebooks/train_grpo_colab.ipynb](notebooks/train_grpo_colab.ipynb)
 ---
+## What's actually new
+| | What's been done | What this env adds |
+|---|---|---|
+| MUStARD itself | 2019 dataset, used as supervised classifier benchmark | Wrapped as an **OpenEnv environment** with a verifiable RL reward |
+| AudioToolAgent (Oct 2025) | Prompted a frontier LLM with audio tools; no training | We **RL-train** a small (3B) model on the same architecture |
+| EmotionThinker (Mar 2026) | GRPO for emotion classification, single-modality | Subtext-from-prosody (lexical–acoustic conflict, not just emotion) |
+| SarcasmMiner (Mar 2026) | Video+audio, big model, big GPU | **Pure audio**, T4-fits, $8 of compute |
+The **environment** is the contribution. It pre-processes 1,380 MUStARD clips into prosody descriptors at env-init time (zero per-step audio cost), exposes the standard Gym API (`reset`, `step`, `state`), and ships a multi-criterion reward that's hard to game without genuine prosody-grounded reasoning. Trained models are baselines; the env is the durable artifact.
+---
+## What the agent sees, does, gets rewarded for
+```
+                          ┌────────────────────────────────────────┐
+                          │  PROMPT (what the model sees)          │
+                          │                                        │
+                          │  System: detect sarcasm vs sincere     │
+                          │                                        │
+                          │  User:                                 │
+                          │  === Transcript ===                    │
+                          │  [LEONARD] I never would have ...      │
+                          │  Target: [SHELDON] It's just a         │
+                          │          privilege to watch...         │
+                          │                                        │
+                          │  === Prosody features ===              │
+                          │  Pitch mean=164Hz, var=51Hz [HIGH]     │
+                          │  Pre-pause: 320ms                      │
+                          │  Internal pause 1.84-2.05s             │
+                          │                                        │
+                          │  === Pitch contour ===                 │
+                          │  ▃▄▅▆█▇▅▃▂▁▂▃▄▅▆▇▆▄▃▂▂▁▁▂▃▄▅▆▇█      │
+                          │                                        │
+                          ���  Decide: sarcastic or sincere?         │
+                          └─────────────────┬──────────────────────┘
+                                            │ model.generate()
+                                            ▼
+                          ┌────────────────────────────────────────┐
+                          │  COMPLETION                            │
+                          │  <think> ... reasoning grounded in     │
+                          │     pitch / energy / pause / context</think>
+                          │  <final>{"label":"sarcastic",          │
+                          │           "confidence":0.85}</final>   │
+                          └─────────────────┬──────────────────────┘
+                                            │ env scores it
+                                            ▼
+                          ┌────────────────────────────────────────┐
+                          │  REWARD (composable rubric)            │
+                          │  0.70 × correctness (conf-weighted)    │
+                          │  0.15 × reasoning_length (50–150 wd)   │
+                          │  0.15 × format (valid <final> JSON)    │
+                          └────────────────────────────────────────┘
+```
+The full env supports interactive multi-step tool use too (`get_transcript`, `get_prosody_features`, `get_pitch_contour`, `submit_belief`) — that's the **inference-time interface** judges interact with on the HF Space. Training uses the single-prompt format above (matching the deck-linked Wordle/Sudoku notebooks).
 ---
+## Results (real numbers, real held-out set)
+We trained `Qwen/Qwen2.5-3B-Instruct` for 200 GRPO steps on a strict 80/20 train/test split (clips in eval are NEVER seen during training). Stack: **Unsloth + HF TRL + LoRA r=16 + 4-bit + bnb** (deck-named, requirement #2).
+### Training reward curve (real)
+![reward curve](docs/plots/reward_curve.png)
+Reward climbs from `0.335` (random baseline) to `0.97` over 200 steps. The shaded band is `±1 std` of within-batch rollout variance — when narrow, the 4 group-relative generations agree; when wide, the model is exploring.
+### Held-out generalization (the honest test)
+![held-out breakdown](docs/plots/held_out_breakdown.png)
+On 80 clips the trained model has **never seen**:
+| Subset | Accuracy | What it means |
+|---|---|---|
+| Broad held-out (balanced 38 sarc / 42 sinc) | **51.25%** (41/80) | ≈ text-only baseline. The model generalizes the *format and reasoning style* but doesn't dramatically outperform text-only Qwen on novel clips. |
+| **Prosody-Pivot Set** (audio-decisive clips) | **5/6 = 83%** ⭐ | When prosody flips the answer, the trained model uses it. This is the audio-mattering signal. |
+| Well-formed completions | 98.75% | Format compliance is fully learned. |
+| Train reward (final) | 0.97 | Train/eval gap of ~0.32 is the *anti-memorization* signature — the env's reward is hard to game. |
+**The honest take**: we trained a Qwen2.5-3B baseline that spends most of its inference compute on text reasoning and uses prosody when it's decisive. The Pivot-Set 5/6 isolates the audio contribution; the broad held-out 51% reflects the *intrinsic* limit of pyin-derived prosody summaries when the clip is dominated by text-decidable content. **A richer feature pipeline would push higher** — see [Future Work](#future-work).
+### Training dynamics
+![training dynamics](docs/plots/training_dynamics.png)
+Loss + completion length over 200 steps. The reasoning-length floor of 50 tokens forces real `<think>` blocks; the model spends ~130 tokens reasoning before committing.
+### Side-by-side reasoning (vanilla vs trained)
+[See full HTML](docs/side_by_side.html) — same 5 hand-picked hero clips, vanilla Qwen2.5-3B reasoning vs the trained checkpoint. The lead clip is `1_8136`: *"Yeah, I'm really looking forward to it. In fact, there's no place I would rather..."* (sarcastic, Pivot Set, trained nails it).
 ---
 ## Quick start
+### Use the deployed env (no install)
 ```python
 from subtext_arena import SubtextArenaEnv, SubtextArenaAction
+with SubtextArenaEnv(
+    base_url="https://aamrinder-subtext-arena.hf.space"
+).sync() as env:
     obs = env.reset().observation
     print(f"Clip {obs.clip_id} ({obs.speaker}, {obs.duration_s:.1f}s, pivot={obs.is_pivot})")
 uv run server  # serves on :8000
 ```
+### Reproduce training (on T4-medium / L4)
 ```bash
+hf jobs uv run --flavor l4x1 -s HF_TOKEN \
+  --with "trl>=0.11" --with "transformers>=4.46" --with datasets \
+  --with accelerate --with peft --with bitsandbytes \
+  --with "git+https://huggingface.co/spaces/aamrinder/subtext-arena" \
+  -- python -m subtext_arena.train.train_grpo \
+       --max-steps 200 --num-generations 4 \
+       --lora-r 16 --lora-dropout 0.05 \
+       --eval-ratio 0.2 --n-eval-clips 80 \
+       --push-to-hub aamrinder/subtext-arena-grpo \
+       --save-trainer-state-to-hub-space aamrinder/subtext-arena
 ```
+`~$3 of compute, ~2 hours wall clock.` See `notebooks/train_grpo_colab.ipynb` for a Colab-runnable version.
+---
+## Why this matters
+- **Accessibility**: hearing-impaired users rely on captions that strip prosody. Sarcasm becomes opaque. An agent that flags lexical–prosodic mismatch could augment caption pipelines.
+- **Conversational AI**: voice assistants that mistake sarcastic complaints for sincere praise are a known failure mode. Prosody-aware classification is the missing piece.
+- **A reproducible benchmark**: this env wraps a 6-year-old supervised dataset (MUStARD) into a modern RL task with a verifiable reward and a held-out generalization test. Future teams can plug in richer feature extractors and immediately measure improvement under the same evaluator.
 ---
+## Future work (where this env compounds)
+The env's design lets future work plug in **richer feature extractors** at the `audio_tools.py` layer without changing the trained-model interface or the reward. Specifically:
+| Today (basic features) | Tomorrow (rich features) | Expected lift |
+|---|---|---|
+| `librosa.pyin` pitch contour | wav2vec2 / HuBERT prosody embeddings | +10-15 pp on broad held-out (per AMuSeD F1 0.81) |
+| RMS energy summary | full spectrogram patch tokens | +5 pp |
+| Speaker-anonymous (text only) | speaker-aware features (familiarity, voice ID) | +3-8 pp |
+| 3B text policy | 7B audio-LLM (Qwen2-Audio) end-to-end | uncharted (LISTEN paper says +0 currently, RL might shift it) |
+The env is the durable contribution. The model is one baseline on it.
 ---
 ```
 subtext_arena/
+├── README.md                 ← you are here (3-5 min read)
 ├── openenv.yaml              ← manifest (spec_version=1, runtime=fastapi)
+├── pyproject.toml            ← deps: openenv-core>=0.2.3 + bundled data
 ├── uv.lock
+├── client.py                 ← SubtextArenaEnv (HTTP+WS client)
+├── models.py                 ← typed Action / Observation
 ├── data/
 │   ├── sarcasm_data.json         ← 690 MUStARD ground-truth labels + transcripts
+│   ├── prosody_cache/utterances/ ← 690 pre-computed prosody jsons
 │   ├── prosody_cache/context/    ← 690 context-clip prosody jsons
+│   ├── pivot_set.json            ← 32 Prosody-Pivot clips (text-only-wrong + audio-decisive)
+│   ├── baseline.json             ← text-only Qwen2.5-3B predictions on all 690 clips
+│   ├── held_out_eval_run3.json   ← 80-clip generalization eval
+│   ├── trainer_state_run3.json   ← per-step training metrics
+│   └── log_history_run3.json     ← backup of trainer state
+├── server/
+│   ├── app.py                ← FastAPI factory
+│   ├── audio_tools.py        ← prosody/contour text renderers
+│   ├── scenarios.py          ← MUStARD loader + Pivot oversampling + train/eval split
+│   ├── grader.py             ← per-step + final-reward grader
+│   ├── subtext_arena_environment.py  ← Environment subclass (reset/step/state)
+│   ├── Dockerfile
+│   └── requirements.txt
+├── train/
+│   ├── train_grpo.py         ← full GRPO training (Unsloth+TRL, used for Run #3)
+│   ├── curate_pivot_set.py   ← 4-step Pivot Set curation pipeline
+│   ├── eval_pivot_set.py     ← evaluate any checkpoint on Pivot Set
+│   ├── side_by_side.py       ← baseline vs trained side-by-side HTML
+│   ├── plot_run3.py          ← generate the 3 result plots from log_history
+│   └── plot_reward_decomp.py ← reward decomposition plotter (legacy)
+├── notebooks/
+│   └── train_grpo_colab.ipynb  ← Colab-rerunnable training script
+└── docs/
+    └── plots/                ← committed PNGs (reward_curve, training_dynamics, held_out_breakdown)
+        └── side_by_side.html ← interactive side-by-side reasoning view
 ```
 ---
 ## Credits & references
+- **Dataset**: [MUStARD](https://github.com/soujanyaporia/MUStARD) (Castro et al., ACL 2019)
+- **Architecture inspiration**: [AudioToolAgent](https://arxiv.org/abs/2510.02995) (Oct 2025) — same pattern, prompted not RL-trained
+- **Adjacent recent RL work**: [EmotionThinker](https://arxiv.org/abs/2601.15668) (Mar 2026), [SarcasmMiner](https://arxiv.org/abs/2603.05275) (Mar 2026)
+- **Modality-fail benchmark**: [LISTEN](https://arxiv.org/abs/2510.10444) (Oct 2025) — documents the audio-LLM gap on emotion-mismatch cases
+Built for the **OpenEnv Hackathon Finale** sponsored by Meta PyTorch / Hugging Face / Scaler.
+Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) v0.2.3.
 > Team **BalleBalle** — Amrinder Singh + Shubham Kapoor.

train/side_by_side.py CHANGED Viewed

@@ -91,6 +91,8 @@ CLIP_BLOCK = """<div class="clip">
 def generate_completion(model, tokenizer, prompt_user_msg, max_tokens=600, temperature=0.7):
     messages = [
         {"role": "system", "content": SYSTEM_PROMPT},
         {"role": "user", "content": prompt_user_msg},
@@ -101,10 +103,14 @@ def generate_completion(model, tokenizer, prompt_user_msg, max_tokens=600, tempe
     input_ids = encoded.input_ids if hasattr(encoded, "input_ids") else encoded
     input_ids = input_ids.to(model.device)
     prompt_len = input_ids.shape[1]
-    out = model.generate(
-        input_ids=input_ids, max_new_tokens=max_tokens, do_sample=True,
-        temperature=temperature, pad_token_id=tokenizer.eos_token_id,
-    )
     return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)

 def generate_completion(model, tokenizer, prompt_user_msg, max_tokens=600, temperature=0.7):
+    """Greedy decode for deterministic side-by-side. Matches held-out eval settings
+    so the side-by-side numbers are consistent with the headline accuracy."""
     messages = [
         {"role": "system", "content": SYSTEM_PROMPT},
         {"role": "user", "content": prompt_user_msg},
     input_ids = encoded.input_ids if hasattr(encoded, "input_ids") else encoded
     input_ids = input_ids.to(model.device)
     prompt_len = input_ids.shape[1]
+    import torch as _t
+    with _t.no_grad():
+        out = model.generate(
+            input_ids=input_ids, max_new_tokens=max_tokens,
+            do_sample=False,           # greedy — matches held-out eval
+            pad_token_id=tokenizer.eos_token_id,
+            use_cache=True,
+        )
     return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)