Spaces: aamrinder/subtext-arena

aamrinder committed 28 days ago

Commit 9f43137 (verified) · 1 parent: f468b29

Upload folder using huggingface_hub


Files changed (4)

README.md +106 -157
models.py +2 -23
server/grader.py +22 -39
server/scenarios.py +4 -43

README.md CHANGED

@@ -12,126 +12,106 @@ tags:
   - rl
   - audio
   - sarcasm
-  - theory-of-mind
-  - benchmark
 ---
-# Subtext Arena 🎧
-> An OpenEnv reproducible benchmark for **prosody-grounded subtext reasoning**.
-A clip plays. The words are: *"Yeah, I'm really looking forward to it."* Sincere or sarcastic? You can read it 10 times and still need the audio. **Subtext Arena is an OpenEnv environment that lets you train and evaluate language models on exactly that task** — using transcripts + prosody features extracted from real MUStARD audio.
-Built for the OpenEnv Hackathon Finale (Apr 25–26, 2026, Bangalore).
-Theme: **#3 World Modeling** — modeling hidden speaker intent from observable acoustic cues.
-**Submission URL**: https://huggingface.co/spaces/aamrinder/subtext-arena
-**Trained model**: https://huggingface.co/aamrinder/subtext-arena-grpo
-**Live demo (HTML)**: [docs/side_by_side.html](https://huggingface.co/spaces/aamrinder/subtext-arena/blob/main/docs/side_by_side.html)
-**Re-runnable Colab**: [notebooks/train_grpo_colab.ipynb](notebooks/train_grpo_colab.ipynb)
 ---
-## What's actually new
-| | What's been done | What this env adds |
-|---|---|---|
-| MUStARD itself | 2019 dataset, used as supervised classifier benchmark | Wrapped as an **OpenEnv environment** with a verifiable RL reward |
-| AudioToolAgent (Oct 2025) | Prompted a frontier LLM with audio tools; no training | We **RL-train** a small (3B) model on the same architecture |
-| EmotionThinker (Mar 2026) | GRPO for emotion classification, single-modality | Subtext-from-prosody (lexical–acoustic conflict, not just emotion) |
-| SarcasmMiner (Mar 2026) | Video+audio, big model, big GPU | **Pure audio**, T4-fits, $8 of compute |
-The **environment** is the contribution. It pre-processes 1,380 MUStARD clips into prosody descriptors at env-init time (zero per-step audio cost), exposes the standard Gym API (`reset`, `step`, `state`), and ships a multi-criterion reward that's hard to game without genuine prosody-grounded reasoning. Trained models are baselines; the env is the durable artifact.
 ---
-## What the agent sees, does, gets rewarded for
 ```
-                          ┌────────────────────────────────────────┐
-                          │  PROMPT (what the model sees)          │
-                          │                                        │
-                          │  System: detect sarcasm vs sincere     │
-                          │                                        │
-                          │  User:                                 │
-                          │  === Transcript ===                    │
-                          │  [LEONARD] I never would have ...      │
-                          │  Target: [SHELDON] It's just a         │
-                          │          privilege to watch...         │
-                          │                                        │
-                          │  === Prosody features ===              │
-                          │  Pitch mean=164Hz, var=51Hz [HIGH]     │
-                          │  Pre-pause: 320ms                      │
-                          │  Internal pause 1.84-2.05s             │
-                          │                                        │
-                          │  === Pitch contour ===                 │
-                          │  ▃▄▅▆█▇▅▃▂▁▂▃▄▅▆▇▆▄▃▂▂▁▁▂▃▄▅▆▇█      │
-                          │                                        │
-                          │  Decide: sarcastic or sincere?         │
-                          └─────────────────┬──────────────────────┘
-                                            │ model.generate()
-                                            ▼
-                          ┌────────────────────────────────────────┐
-                          │  COMPLETION                            │
-                          │  <think> ... reasoning grounded in     │
-                          │     pitch / energy / pause / context</think>
-                          │  <final>{"label":"sarcastic",          │
-                          │           "confidence":0.85}</final>   │
-                          └─────────────────┬──────────────────────┘
-                                            │ env scores it
-                                            ▼
-                          ┌────────────────────────────────────────┐
-                          │  REWARD (composable rubric)            │
-                          │  0.70 × correctness (conf-weighted)    │
-                          │  0.15 × reasoning_length (50–150 wd)   │
-                          │  0.15 × format (valid <final> JSON)    │
-                          └────────────────────────────────────────┘
 ```
-The full env supports interactive multi-step tool use too (`get_transcript`, `get_prosody_features`, `get_pitch_contour`, `submit_belief`) — that's the **inference-time interface** judges interact with on the HF Space. Training uses the single-prompt format above (matching the deck-linked Wordle/Sudoku notebooks).
 ---
-## Results (real numbers, real held-out set)
-We trained `Qwen/Qwen2.5-3B-Instruct` for 200 GRPO steps on a strict 80/20 train/test split (clips in eval are NEVER seen during training). Stack: **Unsloth + HF TRL + LoRA r=16 + 4-bit + bnb** (deck-named, requirement #2).
-### Training reward curve (real)
 ![reward curve](docs/plots/reward_curve.png)
-Reward climbs from `0.335` (random baseline) to `0.97` over 200 steps. The shaded band is `±1 std` of within-batch rollout variance — when narrow, the 4 group-relative generations agree; when wide, the model is exploring.
-### Held-out generalization (the honest test)
 ![held-out breakdown](docs/plots/held_out_breakdown.png)
-On 80 clips the trained model has **never seen**:
-| Subset | Accuracy | What it means |
-|---|---|---|
-| Broad held-out (balanced 38 sarc / 42 sinc) | **51.25%** (41/80) | ≈ text-only baseline. The model generalizes the *format and reasoning style* but doesn't dramatically outperform text-only Qwen on novel clips. |
-| **Prosody-Pivot Set** (audio-decisive clips) | **5/6 = 83%** ⭐ | When prosody flips the answer, the trained model uses it. This is the audio-mattering signal. |
-| Well-formed completions | 98.75% | Format compliance is fully learned. |
-| Train reward (final) | 0.97 | Train/eval gap of ~0.32 is the *anti-memorization* signature — the env's reward is hard to game. |
-**The honest take**: we trained a Qwen2.5-3B baseline that spends most of its inference compute on text reasoning and uses prosody when it's decisive. The Pivot-Set 5/6 isolates the audio contribution; the broad held-out 51% reflects the *intrinsic* limit of pyin-derived prosody summaries when the clip is dominated by text-decidable content. **A richer feature pipeline would push higher** — see [Future Work](#future-work).
-### Training dynamics
-![training dynamics](docs/plots/training_dynamics.png)
-Loss + completion length over 200 steps. The reasoning-length floor of 50 tokens forces real `<think>` blocks; the model spends ~130 tokens reasoning before committing.
-### Side-by-side reasoning (vanilla vs trained)
-[See full HTML](docs/side_by_side.html) — same 5 hand-picked hero clips, vanilla Qwen2.5-3B reasoning vs the trained checkpoint. The lead clip is `1_8136`: *"Yeah, I'm really looking forward to it. In fact, there's no place I would rather..."* (sarcastic, Pivot Set, trained nails it).
 ---
 ## Quick start
-### Use the deployed env (no install)
 ```python
 from subtext_arena import SubtextArenaEnv, SubtextArenaAction
@@ -140,31 +120,25 @@ with SubtextArenaEnv(
     base_url="https://aamrinder-subtext-arena.hf.space"
 ).sync() as env:
     obs = env.reset().observation
-    print(f"Clip {obs.clip_id} ({obs.speaker}, {obs.duration_s:.1f}s, pivot={obs.is_pivot})")
     obs = env.step(SubtextArenaAction(tool="get_transcript")).observation
-    print(obs.tool_output[:500])
     obs = env.step(SubtextArenaAction(tool="get_prosody_features")).observation
-    print(obs.tool_output)
     result = env.step(SubtextArenaAction(
         tool="submit_belief",
         tool_args={"label": "sarcastic", "confidence": 0.85},
     ))
-    print(f"done={result.done}  reward={result.reward}")
 ```
-### Run locally
 ```bash
 git clone https://huggingface.co/spaces/aamrinder/subtext-arena
 cd subtext-arena
 uv sync
-uv run server  # serves on :8000
 ```
-### Reproduce training (on T4-medium / L4)
 ```bash
 hf jobs uv run --flavor l4x1 -s HF_TOKEN \
@@ -172,91 +146,66 @@ hf jobs uv run --flavor l4x1 -s HF_TOKEN \
   --with accelerate --with peft --with bitsandbytes \
   --with "git+https://huggingface.co/spaces/aamrinder/subtext-arena" \
   -- python -m subtext_arena.train.train_grpo \
-       --max-steps 200 --num-generations 4 \
-       --lora-r 16 --lora-dropout 0.05 \
        --eval-ratio 0.2 --n-eval-clips 80 \
-       --push-to-hub aamrinder/subtext-arena-grpo \
-       --save-trainer-state-to-hub-space aamrinder/subtext-arena
 ```
-`~$3 of compute, ~2 hours wall clock.` See `notebooks/train_grpo_colab.ipynb` for a Colab-runnable version.
----
-## Why this matters
-- **Accessibility**: hearing-impaired users rely on captions that strip prosody. Sarcasm becomes opaque. An agent that flags lexical–prosodic mismatch could augment caption pipelines.
-- **Conversational AI**: voice assistants that mistake sarcastic complaints for sincere praise are a known failure mode. Prosody-aware classification is the missing piece.
-- **A reproducible benchmark**: this env wraps a 6-year-old supervised dataset (MUStARD) into a modern RL task with a verifiable reward and a held-out generalization test. Future teams can plug in richer feature extractors and immediately measure improvement under the same evaluator.
----
-## Future work (where this env compounds)
-The env's design lets future work plug in **richer feature extractors** at the `audio_tools.py` layer without changing the trained-model interface or the reward. Specifically:
-| Today (basic features) | Tomorrow (rich features) | Expected lift |
-|---|---|---|
-| `librosa.pyin` pitch contour | wav2vec2 / HuBERT prosody embeddings | +10-15 pp on broad held-out (per AMuSeD F1 0.81) |
-| RMS energy summary | full spectrogram patch tokens | +5 pp |
-| Speaker-anonymous (text only) | speaker-aware features (familiarity, voice ID) | +3-8 pp |
-| 3B text policy | 7B audio-LLM (Qwen2-Audio) end-to-end | uncharted (LISTEN paper says +0 currently, RL might shift it) |
-The env is the durable contribution. The model is one baseline on it.
 ---
-## Project layout
 ```
-subtext_arena/
-├── README.md                 ← you are here (3-5 min read)
-├── openenv.yaml              ← manifest (spec_version=1, runtime=fastapi)
-├── pyproject.toml            ← deps: openenv-core>=0.2.3 + bundled data
-├── uv.lock
-├── client.py                 ← SubtextArenaEnv (HTTP+WS client)
-├── models.py                 ← typed Action / Observation
-├── data/
-│   ├── sarcasm_data.json         ← 690 MUStARD ground-truth labels + transcripts
-│   ├── prosody_cache/utterances/ ← 690 pre-computed prosody jsons
-│   ├── prosody_cache/context/    ← 690 context-clip prosody jsons
-│   ├── pivot_set.json            ← 32 Prosody-Pivot clips (text-only-wrong + audio-decisive)
-│   ├── baseline.json             ← text-only Qwen2.5-3B predictions on all 690 clips
-│   ├── held_out_eval_run3.json   ← 80-clip generalization eval
-│   ├── trainer_state_run3.json   ← per-step training metrics
-│   └── log_history_run3.json     ← backup of trainer state
 ├── server/
-│   ├── app.py                ← FastAPI factory
-│   ├── audio_tools.py        ← prosody/contour text renderers
-│   ├── scenarios.py          ← MUStARD loader + Pivot oversampling + train/eval split
-│   ├── grader.py             ← per-step + final-reward grader
-│   ├── subtext_arena_environment.py  ← Environment subclass (reset/step/state)
-│   ├── Dockerfile
-│   └── requirements.txt
 ├── train/
-│   ├── train_grpo.py         ← full GRPO training (Unsloth+TRL, used for Run #3)
-│   ├── curate_pivot_set.py   ← 4-step Pivot Set curation pipeline
-│   ├── eval_pivot_set.py     ← evaluate any checkpoint on Pivot Set
-│   ├── side_by_side.py       ← baseline vs trained side-by-side HTML
-│   ├── plot_run3.py          ← generate the 3 result plots from log_history
-│   └── plot_reward_decomp.py ← reward decomposition plotter (legacy)
 ├── notebooks/
-│   └── train_grpo_colab.ipynb  ← Colab-rerunnable training script
 └── docs/
-    └── plots/                ← committed PNGs (reward_curve, training_dynamics, held_out_breakdown)
-        └── side_by_side.html ← interactive side-by-side reasoning view
 ```
 ---
-## Credits & references
-- **Dataset**: [MUStARD](https://github.com/soujanyaporia/MUStARD) (Castro et al., ACL 2019)
-- **Architecture inspiration**: [AudioToolAgent](https://arxiv.org/abs/2510.02995) (Oct 2025) — same pattern, prompted not RL-trained
-- **Adjacent recent RL work**: [EmotionThinker](https://arxiv.org/abs/2601.15668) (Mar 2026), [SarcasmMiner](https://arxiv.org/abs/2603.05275) (Mar 2026)
-- **Modality-fail benchmark**: [LISTEN](https://arxiv.org/abs/2510.10444) (Oct 2025) — documents the audio-LLM gap on emotion-mismatch cases
-Built for the **OpenEnv Hackathon Finale** sponsored by Meta PyTorch / Hugging Face / Scaler.
-Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) v0.2.3.
-> Team **BalleBalle** — Amrinder Singh + Shubham Kapoor.

   - rl
   - audio
   - sarcasm
 ---
+# Subtext Arena
+Try this out: a clip plays, the line is *"Yeah, I'm really looking forward to it."* Sincere or sarcastic? You can read the words a hundred times and still need the audio.
+This repo wraps that task as an **OpenEnv environment**, built on MUStARD (a 690-clip sarcasm-in-sitcom dataset) with a tiny prosody pipeline (pyin pitch, RMS energy, pause timing) baked into the env. The agent reads a transcript and a text rendering of the prosody, then produces a chain of thought followed by `<final>{"label":..., "confidence":...}</final>`. Reward is graded on correctness, reasoning length, and format.
+I trained a baseline (Qwen2.5-3B-Instruct, LoRA r=16, 200 GRPO steps via Unsloth + TRL) on a strict 80/20 train/test split. The trained model is at [aamrinder/subtext-arena-grpo](https://huggingface.co/aamrinder/subtext-arena-grpo). Total compute spent: ~$11.
+Built for the OpenEnv Hackathon Finale (Apr 2026, Bangalore). Theme: World Modeling.
+Team **BalleBalle**: Amrinder Singh, Shubham Kapoor.
 ---
+## Why this is interesting
+Detecting sarcasm from audio prosody is not solved. GPT-4o sits at 67% Macro-F1 on MUStARD++. There's a documented modality gap (the [LISTEN paper](https://arxiv.org/abs/2510.10444), Oct 2025). And there is no public reinforcement-learning training environment for the task that I know of, so I made one.
+The closest prior work is [AudioToolAgent](https://arxiv.org/abs/2510.02995) (Oct 2025), which prompted a frontier LLM to use audio analysis tools. Same architecture, but they didn't train. Subtext Arena is the training-side counterpart.
 ---
+## How the env works
+Each episode is one MUStARD clip. The prompt the agent sees contains:
+- The transcript (target line + 1-7 lines of preceding conversation, with speaker tags)
+- Prosody features as text: pitch mean / variability / contour shape, energy mean / variability, voiced ratio, pre-utterance silence, internal pauses with timestamps
+- A pitch contour rendered as an 8-level ASCII sparkline
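The 8-level sparkline in the last bullet can be sketched roughly as follows. This is a minimal illustration and not the code in `server/audio_tools.py`: the helper name `render_sparkline`, the min-max scaling, the chunk averaging, and the default width of 30 are all assumptions made for the example.

```python
from typing import Sequence

BLOCKS = "▁▂▃▄▅▆▇█"  # 8 levels, lowest to highest


def render_sparkline(pitch_hz: Sequence[float], width: int = 30) -> str:
    """Render a pitch track as an 8-level sparkline (hypothetical helper)."""
    # Assumption: unvoiced frames are 0 or NaN; `p > 0` drops both.
    voiced = [p for p in pitch_hz if p > 0]
    if not voiced:
        return ""
    # Downsample to at most `width` columns by averaging equal-sized chunks.
    n = min(width, len(voiced))
    size = len(voiced)
    chunks = [voiced[i * size // n:(i + 1) * size // n] for i in range(n)]
    means = [sum(c) / len(c) for c in chunks]
    # Min-max scale each column into one of the 8 block characters.
    lo, hi = min(means), max(means)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat contour
    return "".join(BLOCKS[min(7, int((m - lo) / span * 8))] for m in means)
```

Scaling per clip (rather than against a fixed Hz range) shows the contour shape clearly but hides absolute pitch, which is why the prompt also carries the numeric pitch mean and variability.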
+The model emits:
+```
+<think>
+... reasoning over the prosodic and lexical cues ...
+</think>
+<final>{"label":"sarcastic","confidence":0.85}</final>
+```
+The reward is a composable rubric:
 ```
+0.70 * correctness         (confidence-weighted match against gold)
+0.15 * reasoning_length    (penalty if <think> is under 50 words or over 300)
+0.15 * format              (1.0 if a valid <final> JSON parses, 0 otherwise)
 ```
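A minimal sketch of how such a rubric could be computed from one completion. Only the three weights and the 50/300-word bounds come from the README; the function name `rubric_reward`, the exact form of the confidence weighting (a correct answer earns its stated confidence, a wrong one earns 0), and the all-or-nothing length score are assumptions, and the real logic lives in `train_grpo.make_reward_fn`.

```python
import json
import re

FINAL_RE = re.compile(r"<final>(.*?)</final>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
LABELS = ("sarcastic", "sincere")


def rubric_reward(completion: str, gold_label: str) -> float:
    """Score one completion with the 0.70 / 0.15 / 0.15 weights (sketch)."""
    label, confidence, format_score = None, 0.0, 0.0

    # format: 1.0 only if <final> holds JSON with a valid label.
    match = FINAL_RE.search(completion)
    if match:
        try:
            payload = json.loads(match.group(1))
            if isinstance(payload, dict) and payload.get("label") in LABELS:
                raw = float(payload.get("confidence", 0.5))
                confidence = min(1.0, max(0.0, raw))
                label = payload["label"]
                format_score = 1.0
        except (ValueError, TypeError):
            pass  # malformed JSON or non-numeric confidence

    # correctness: assumed weighting, see the lead-in.
    correctness = confidence if label == gold_label else 0.0

    # reasoning_length: assumed 1.0 inside [50, 300] words, else 0.0.
    think = THINK_RE.search(completion)
    n_words = len(think.group(1).split()) if think else 0
    length_score = 1.0 if 50 <= n_words <= 300 else 0.0

    return 0.70 * correctness + 0.15 * length_score + 0.15 * format_score
```

Under these assumptions, a well-formed completion with a 60-word `<think>` block and a correct label at confidence 0.85 scores 0.70 × 0.85 + 0.15 + 0.15 = 0.895, while the same completion with the wrong label keeps only the 0.30 from length and format.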
+The env also exposes the four tools (`get_transcript`, `get_prosody_features`, `get_pitch_contour`, `submit_belief`) for interactive multi-step inference. That's what you can poke at on this Space's web UI. Training uses the single-prompt format above so it matches the deck-linked Wordle and Sudoku notebooks.
 ---
+## Results
+I trained for 200 steps with `num_generations=4`, LoRA r=16, dropout 0.05, on an L4 ($0.80/hr). The split is 552 train clips, 138 eval clips, deterministically seeded so judges can reproduce. Pivot oversample is 2x. Class balance is enforced in the dataset construction.
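The 552/138 figures follow from an 80/20 split of the 690 clips. A deterministic split of that kind might look like the sketch below; the function name `split_clips` and the seed value 42 are assumptions, and the sketch deliberately leaves out the class balancing and Pivot oversampling mentioned above.

```python
import random
from typing import List, Tuple


def split_clips(
    clip_ids: List[str], eval_ratio: float = 0.2, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Return (train_ids, eval_ids) from a seeded shuffle (hypothetical helper)."""
    ids = sorted(clip_ids)             # fixed starting order, independent of dict order
    random.Random(seed).shuffle(ids)   # private RNG, so global random state is untouched
    n_eval = round(len(ids) * eval_ratio)
    return ids[n_eval:], ids[:n_eval]


train_ids, eval_ids = split_clips([f"clip_{i}" for i in range(690)])
print(len(train_ids), len(eval_ids))  # 552 138
```

Sorting before shuffling is what makes the split reproducible across machines: the same seed on the same sorted list always yields the same partition.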
+### Reward over training
 ![reward curve](docs/plots/reward_curve.png)
+Reward climbs from 0.335 to 0.97 on training prompts. The shaded band is within-batch rollout variance: when narrow, the four group-relative generations agree; when wide, the model is exploring.
+### Held-out generalization
+After training, I ran greedy inference on 80 clips the model has never seen.
 ![held-out breakdown](docs/plots/held_out_breakdown.png)
+| Subset | Accuracy |
+|---|---|
+| Broad held-out (38 sarcastic + 42 sincere) | 41/80 = **51.25%** |
+| Prosody-Pivot Set in eval (audio-decisive clips) | 5/6 = **83%** |
+| Well-formed completions | 79/80 = 98.75% |
+The honest read: 51% on the broad set is roughly text-only baseline, which means pyin-derived prosody summary stats don't push a 3B model much beyond what it already does from the transcript. But when the audio is genuinely decisive (the Pivot Set), the trained model uses it: 5/6 correct on those clips, vs 0/6 for a text-only baseline that confidently picks the wrong label.
+The gap between the 0.97 train reward and the 51% held-out accuracy is the anti-memorization signature: if the model had been gaming the reward, train and held-out would match.
+### Side-by-side
+[`docs/side_by_side.html`](docs/side_by_side.html) shows 5 hand-picked clips from the held-out set where text-only Qwen confidently picks the wrong label and the prosody-trained model picks the right one. Tally: baseline 0/5, trained 5/5.
+### Training dynamics
+![training dynamics](docs/plots/training_dynamics.png)
+Loss + completion length. The reasoning-length floor of 50 words is what keeps `<think>` blocks from collapsing to one-liners.
 ---
 ## Quick start
+Install from this Space:
+```bash
+pip install git+https://huggingface.co/spaces/aamrinder/subtext-arena
+```
+Connect to the deployed env:
 ```python
 from subtext_arena import SubtextArenaEnv, SubtextArenaAction
     base_url="https://aamrinder-subtext-arena.hf.space"
 ).sync() as env:
     obs = env.reset().observation
     obs = env.step(SubtextArenaAction(tool="get_transcript")).observation
     obs = env.step(SubtextArenaAction(tool="get_prosody_features")).observation
     result = env.step(SubtextArenaAction(
         tool="submit_belief",
         tool_args={"label": "sarcastic", "confidence": 0.85},
     ))
+    print(result.reward, result.observation.metadata)
 ```
+Run locally:
 ```bash
 git clone https://huggingface.co/spaces/aamrinder/subtext-arena
 cd subtext-arena
 uv sync
+uv run server
 ```
+Reproduce training (about 2 hours on L4, $3 of compute):
 ```bash
 hf jobs uv run --flavor l4x1 -s HF_TOKEN \
   --with accelerate --with peft --with bitsandbytes \
   --with "git+https://huggingface.co/spaces/aamrinder/subtext-arena" \
   -- python -m subtext_arena.train.train_grpo \
+       --max-steps 200 --num-generations 4 --lora-r 16 \
        --eval-ratio 0.2 --n-eval-clips 80 \
+       --push-to-hub aamrinder/subtext-arena-grpo
 ```
+A Colab-friendly version of the script is in [notebooks/train_grpo_colab.ipynb](notebooks/train_grpo_colab.ipynb).
 ---
+## Repo layout
 ```
+.
+├── client.py                          SubtextArenaEnv (HTTP + WebSocket client)
+├── models.py                          Action / Observation Pydantic types
+├── openenv.yaml                       OpenEnv manifest
+├── pyproject.toml
 ├── server/
+│   ├── app.py                         FastAPI factory
+│   ├── audio_tools.py                 prosody / contour text renderers
+│   ├── scenarios.py                   MUStARD loader + train/eval split
+│   ├── grader.py                      per-step + final reward
+│   └── subtext_arena_environment.py   reset / step / state
 ├── train/
+│   ├── train_grpo.py                  GRPO training (Unsloth + TRL)
+│   ├── eval_pivot_set.py              evaluate any checkpoint
+│   └── curate_pivot_set.py            4-step Pivot Set curation
 ├── notebooks/
+│   └── train_grpo_colab.ipynb         Colab-runnable training
+├── data/
+│   ├── sarcasm_data.json              MUStARD ground truth (690 clips)
+│   ├── prosody_cache/                 1380 pre-computed prosody jsons
+│   ├── pivot_set.json                 32 audio-decisive clips
+│   ├── held_out_eval_run3.json        80-clip generalization eval
+│   └── log_history_run3.json          per-step training metrics
 └── docs/
+    ├── plots/                         reward + training + held-out PNGs
+    └── side_by_side.html              baseline vs trained, 5 hero clips
 ```
 ---
+## Where this could go next
+The env's audio-tool layer is decoupled from the model interface, so a future team can plug in richer features without touching anything else. Specifically:
+- pyin pitch contour → wav2vec2 / HuBERT prosody embeddings
+- RMS summary → spectrogram patch tokens
+- speaker-anonymous prompts → speaker-aware features
+- 3B text policy → 7B audio LLM (Qwen2-Audio) end-to-end
+If any of those lift the broad held-out accuracy from 51% toward AMuSeD's 81% F1 multimodal SOTA on the same dataset, the env will measure it cleanly because the held-out split and reward are fixed.
+---
+## References
+- MUStARD (Castro et al., ACL 2019): https://github.com/soujanyaporia/MUStARD
+- AudioToolAgent (Oct 2025): https://arxiv.org/abs/2510.02995
+- LISTEN benchmark (Oct 2025): https://arxiv.org/abs/2510.10444
+- OpenEnv: https://github.com/meta-pytorch/OpenEnv
+Built on OpenEnv v0.2.3.

models.py CHANGED

@@ -1,20 +1,4 @@
-"""Data models for the Subtext Arena Environment.
-Subtext Arena is an OpenEnv environment where an LLM agent must classify
-sarcastic vs sincere speech by investigating both the literal transcript AND
-the prosody (pitch, energy, pause patterns) of an audio clip.
-The agent picks one of:
-  - get_transcript()                - returns the literal transcript and the
-                                       preceding conversational context
-  - get_prosody_features(span)      - pitch_var, energy_var, voiced_ratio,
-                                       pre_pause_ms, pause list
-  - get_pitch_contour(span)         - coarse Hz contour over the span
-  - submit_belief(label, conf)      - terminates the episode with the answer
-`tool_args` is a free-form dict so we don't need a separate Action class
-per tool; the env validates the shape per tool name.
-"""
 from __future__ import annotations
 from typing import Any, Dict, List, Optional
@@ -44,12 +28,7 @@ class SubtextArenaAction(Action):
 class SubtextArenaObservation(Observation):
-    """Observation returned after each action.
-    `tool_output` is the human-readable text describing what the tool returned
-    (transcript text, prosody summary, pitch contour). Empty for the
-    submit_belief action (which terminates the episode).
-    """
     clip_id: str = Field(default="", description="MUStARD clip identifier (e.g. '1_60')")
     speaker: str = Field(default="", description="Speaker name when available (Friends/BBT)")

+"""Action and observation models for Subtext Arena."""
 from __future__ import annotations
 from typing import Any, Dict, List, Optional
 class SubtextArenaObservation(Observation):
+    """Observation returned after each action."""
     clip_id: str = Field(default="", description="MUStARD clip identifier (e.g. '1_60')")
     speaker: str = Field(default="", description="Speaker name when available (Friends/BBT)")

server/grader.py CHANGED

@@ -1,33 +1,19 @@
-"""In-environment per-step reward grader.
-This module computes reward signals AT EACH STEP (rather than only at episode
-end). Per-step rewards let the env return non-zero rewards along the way,
-which is what the deck calls out as good practice ("rich, informative signal,
-not just 0/1 at the end").
-Note: there is ALSO a transcript-level reward in research/subtext_solutions/reward.py
-which is used by GRPO training (it scores entire rollouts at once). The
-relationship:
-  - Training-time (GRPO):   research/subtext_solutions/reward.py runs over the
-                             full rollout transcript and returns one scalar.
-  - Env-time (this file):   per-step grader returns deltas as the agent acts;
-                             the sum of step rewards over an episode equals
-                             (or is proxy for) the GRPO reward at termination.
-This dual layer means the env is interactive (judges can poke at it from a
-notebook and see immediate rewards) AND trainable (GRPO sees a clean per-
-rollout scalar). Both are exposed.
 """
 from __future__ import annotations
 from typing import Any, Dict
-# Per-step reward components
-R_AUDIO_TOOL_USE = 0.05      # small + reward each time the agent calls an audio tool
-R_TRANSCRIPT_USE = 0.02      # smaller + reward for transcript reads (text alone is cheap)
-R_BAD_TOOL = -0.10           # penalty for unknown tool name
-R_BAD_ARGS = -0.05           # penalty for malformed args
 def step_reward(tool_used: str, error: str | None) -> float:
@@ -52,21 +38,18 @@ def final_reward(
     n_audio_calls: int,
     n_total_calls: int,
 ) -> Dict[str, float]:
-    """Reward components contributed when submit_belief terminates the episode.
-    Returns a dict so the env can also surface the breakdown (good for plots
-    and for inspecting why a rollout scored what it did).
-    Mirrors the training reward in research/subtext_solutions/reward.py:
-      - correctness:        confidence-weighted match against gold
-      - prosody_grounding:  +1 if any audio-tool call (force the policy to listen)
-      - tool_parsimony:     1.0 if 1-3 calls; 0.6 if 4-5; 0.0 if >5
-      - format_ok:          1.0 if a valid label was submitted
-    Plus hard penalties:
-      - no submission              -0.30
-      - too many calls             -0.20
-      - pivot + no audio + wrong   -0.50
     """
     components: Dict[str, float] = {
         "correctness": 0.0,

+"""Per-step + final-action reward grader for the multi-step interactive env.
+The training script in train/train_grpo.py uses a single-shot reward (in
+train_grpo.make_reward_fn) that scores the whole rollout at once. This
+file is what the env returns step-by-step when an agent walks it
+interactively (e.g. from the HF Space web UI).
 """
 from __future__ import annotations
 from typing import Any, Dict
+R_AUDIO_TOOL_USE = 0.05
+R_TRANSCRIPT_USE = 0.02
+R_BAD_TOOL = -0.10
+R_BAD_ARGS = -0.05
 def step_reward(tool_used: str, error: str | None) -> float:
     n_audio_calls: int,
     n_total_calls: int,
 ) -> Dict[str, float]:
+    """Reward computed when submit_belief terminates the episode.
+    Components:
+        correctness        confidence-weighted match against gold
+        prosody_grounding  1.0 if any audio-tool call, 0.4 otherwise (0.0 on Pivot)
+        tool_parsimony     1.0 for 1-3 calls, 0.6 for 4-5, 0.0 for >5
+        format_ok          1.0 if a valid label was submitted
+    Penalties:
+        no submission              -0.30
+        too many calls             -0.20
+        pivot + no audio + wrong   -0.50
     """
     components: Dict[str, float] = {
         "correctness": 0.0,
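
The hunk ends before the component logic is shown. The two piecewise components described in the new docstring could be written as below. This is a sketch from the docstring alone: the helper names are hypothetical, the treatment of zero calls is an assumption (the docstring only covers 1 or more), and the way the components and penalties are combined into the final scalar is not visible in this diff, so it is not reproduced here.

```python
def prosody_grounding(n_audio_calls: int, is_pivot: bool) -> float:
    """1.0 if any audio tool was called, else 0.4 (0.0 on Pivot clips)."""
    if n_audio_calls > 0:
        return 1.0
    return 0.0 if is_pivot else 0.4


def tool_parsimony(n_total_calls: int) -> float:
    """1.0 for 1-3 calls, 0.6 for 4-5, 0.0 for more than 5."""
    if 1 <= n_total_calls <= 3:
        return 1.0
    if 4 <= n_total_calls <= 5:
        return 0.6
    return 0.0  # >5 calls per the docstring; 0 calls mapped here by assumption
```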

server/scenarios.py CHANGED

@@ -1,16 +1,4 @@
-"""Scenario loader for the Subtext Arena environment.
-Loads MUStARD ground-truth (sarcasm labels + transcripts + context) and the
-pre-computed prosody features for all 690 utterances.
-The scenario_dict pattern mirrors the SRE R1 pattern: each clip is a "case"
-the agent must investigate and classify. Difficulty buckets:
-  - easy:   no Pivot Set membership; baseline text-only Qwen would likely get this right
-  - medium: borderline confidence; mixed text + prosody required
-  - hard:   Prosody-Pivot Set — text-only is wrong; audio cues flip the answer
-For training, we mix clips across all difficulties. Eval reports per-bucket.
-"""
 from __future__ import annotations
 import json
@@ -23,24 +11,7 @@ from typing import Any, Dict, List, Optional
 DEFAULT_DATA_ROOT = Path(__file__).resolve().parent.parent / "data"
-def load_scenarios(
-    data_root: Optional[Path] = None,
-) -> Dict[str, Dict[str, Any]]:
-    """Load all 690 utterances with sarcasm labels + prosody features.
-    Returns:
-      Dict keyed by clip_id with structure:
-        {
-          'utterance':        str    (literal transcript)
-          'speaker':          str
-          'context':          list[str]   (lines spoken before)
-          'context_speakers': list[str]
-          'sarcasm':          bool   (ground truth label)
-          'show':             str
-          'prosody':          dict   (pre-computed features) or None
-          'is_pivot':         bool   (filled in later by curate_pivot_set)
-        }
-    """
     root = data_root or DEFAULT_DATA_ROOT
     sarcasm_path = root / "sarcasm_data.json"
     prosody_dir = root / "prosody_cache" / "utterances"
@@ -82,18 +53,8 @@ def load_scenarios(
     return scenarios
-def sample_clip(
-    scenarios: Dict[str, Dict[str, Any]],
-    rng: random.Random,
-    pivot_oversample_factor: int = 3,
-) -> str:
-    """Sample a clip ID from the scenarios.
-    Pivot-set clips are oversampled (default 3x) because they are the highest-
-    signal training instances — they are exactly where text-only fails and
-    prosody must carry the signal.
-    """
-    pool: List[str] = []
     for clip_id, entry in scenarios.items():
         weight = pivot_oversample_factor if entry["is_pivot"] else 1
         pool.extend([clip_id] * weight)

+"""Loads MUStARD ground truth + cached prosody features for all 690 clips."""
 from __future__ import annotations
 import json
 DEFAULT_DATA_ROOT = Path(__file__).resolve().parent.parent / "data"
+def load_scenarios(data_root: Optional[Path] = None) -> Dict[str, Dict[str, Any]]:
     root = data_root or DEFAULT_DATA_ROOT
     sarcasm_path = root / "sarcasm_data.json"
     prosody_dir = root / "prosody_cache" / "utterances"
     return scenarios
+def sample_clip(scenarios, rng, pivot_oversample_factor=3):
+    pool = []
     for clip_id, entry in scenarios.items():
         weight = pivot_oversample_factor if entry["is_pivot"] else 1
         pool.extend([clip_id] * weight)
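
The diff cuts off before the function returns. A self-contained sketch of the same oversampling idea follows, assuming the hidden tail simply draws uniformly from the weighted pool; the name `sample_clip_sketch` and the final `rng.choice` line are assumptions, not the repository's code.

```python
import random
from typing import Any, Dict, List


def sample_clip_sketch(
    scenarios: Dict[str, Dict[str, Any]],
    rng: random.Random,
    pivot_oversample_factor: int = 3,
) -> str:
    """Pick a clip ID, with Pivot clips repeated in the pool (sketch)."""
    pool: List[str] = []
    for clip_id, entry in scenarios.items():
        # Pivot clips appear `pivot_oversample_factor` times, others once.
        weight = pivot_oversample_factor if entry["is_pivot"] else 1
        pool.extend([clip_id] * weight)
    # Assumption: a uniform draw over the pool finishes the function.
    return rng.choice(pool)


demo = {"pivot_clip": {"is_pivot": True}, "plain_clip": {"is_pivot": False}}
rng = random.Random(0)
draws = [sample_clip_sketch(demo, rng) for _ in range(4000)]
print(draws.count("pivot_clip") / len(draws))  # close to 3/4 with the default factor
```

Repeating IDs in a list keeps the code short; for large pools, `rng.choices(ids, weights=...)` would give the same distribution without materialising the duplicates.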