aamrinder committed on
Commit
9f43137
·
verified ·
1 Parent(s): f468b29

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +106 -157
  2. models.py +2 -23
  3. server/grader.py +22 -39
  4. server/scenarios.py +4 -43
README.md CHANGED
@@ -12,126 +12,106 @@ tags:
12
  - rl
13
  - audio
14
  - sarcasm
15
- - theory-of-mind
16
- - benchmark
17
  ---
18
 
19
- # Subtext Arena 🎧
20
 
21
- > An OpenEnv reproducible benchmark for **prosody-grounded subtext reasoning**.
22
 
23
- A clip plays. The words are: *"Yeah, I'm really looking forward to it."* Sincere or sarcastic? You can read it 10 times and still need the audio. **Subtext Arena is an OpenEnv environment that lets you train and evaluate language models on exactly that task** using transcripts + prosody features extracted from real MUStARD audio.
24
 
25
- Built for the OpenEnv Hackathon Finale (Apr 25–26, 2026, Bangalore).
26
- Theme: **#3 World Modeling** — modeling hidden speaker intent from observable acoustic cues.
27
 
28
- **Submission URL**: https://huggingface.co/spaces/aamrinder/subtext-arena
29
- **Trained model**: https://huggingface.co/aamrinder/subtext-arena-grpo
30
- **Live demo (HTML)**: [docs/side_by_side.html](https://huggingface.co/spaces/aamrinder/subtext-arena/blob/main/docs/side_by_side.html)
31
- **Re-runnable Colab**: [notebooks/train_grpo_colab.ipynb](notebooks/train_grpo_colab.ipynb)
32
 
33
  ---
34
 
35
- ## What's actually new
36
 
37
- | | What's been done | What this env adds |
38
- |---|---|---|
39
- | MUStARD itself | 2019 dataset, used as supervised classifier benchmark | Wrapped as an **OpenEnv environment** with a verifiable RL reward |
40
- | AudioToolAgent (Oct 2025) | Prompted a frontier LLM with audio tools; no training | We **RL-train** a small (3B) model on the same architecture |
41
- | EmotionThinker (Mar 2026) | GRPO for emotion classification, single-modality | Subtext-from-prosody (lexical–acoustic conflict, not just emotion) |
42
- | SarcasmMiner (Mar 2026) | Video+audio, big model, big GPU | **Pure audio**, T4-fits, $8 of compute |
43
 
44
- The **environment** is the contribution. It pre-processes 1,380 MUStARD clips into prosody descriptors at env-init time (zero per-step audio cost), exposes the standard Gym API (`reset`, `step`, `state`), and ships a multi-criterion reward that's hard to game without genuine prosody-grounded reasoning. Trained models are baselines; the env is the durable artifact.
45
 
46
  ---
47
 
48
- ## What the agent sees, does, gets rewarded for
49
 
50
  ```
51
- ┌────────────────────────────────────────┐
52
- │ PROMPT (what the model sees) │
53
- │ │
54
- │ System: detect sarcasm vs sincere │
55
- │ │
56
- │ User: │
57
- │ === Transcript === │
58
- │ [LEONARD] I never would have ... │
59
- │ Target: [SHELDON] It's just a │
60
- │ privilege to watch... │
61
- │ │
62
- │ === Prosody features === │
63
- │ Pitch mean=164Hz, var=51Hz [HIGH] │
64
- │ Pre-pause: 320ms │
65
- │ Internal pause 1.84-2.05s │
66
- │ │
67
- │ === Pitch contour === │
68
- │ ▃▄▅▆█▇▅▃▂▁▂▃▄▅▆▇▆▄▃▂▂▁▁▂▃▄▅▆▇█ │
69
- │ │
70
- │ Decide: sarcastic or sincere? │
71
- └─────────────────┬──────────────────────┘
72
- │ model.generate()
73
-
74
- ┌────────────────────────────────────────┐
75
- │ COMPLETION │
76
- │ <think> ... reasoning grounded in │
77
- │ pitch / energy / pause / context</think>
78
- │ <final>{"label":"sarcastic", │
79
- │ "confidence":0.85}</final> │
80
- └─────────────────┬──────────────────────┘
81
- │ env scores it
82
-
83
- ┌────────────────────────────────────────┐
84
- │ REWARD (composable rubric) │
85
- │ 0.70 × correctness (conf-weighted) │
86
- │ 0.15 × reasoning_length (50–150 wd) │
87
- │ 0.15 × format (valid <final> JSON) │
88
- └────────────────────────────────────────┘
89
  ```
90
 
91
- The full env supports interactive multi-step tool use too (`get_transcript`, `get_prosody_features`, `get_pitch_contour`, `submit_belief`). That's the **inference-time interface** judges interact with on the HF Space. Training uses the single-prompt format above (matching the deck-linked Wordle/Sudoku notebooks).
92
 
93
  ---
94
 
95
- ## Results (real numbers, real held-out set)
96
 
97
- We trained `Qwen/Qwen2.5-3B-Instruct` for 200 GRPO steps on a strict 80/20 train/test split (clips in eval are NEVER seen during training). Stack: **Unsloth + HF TRL + LoRA r=16 + 4-bit + bnb** (deck-named, requirement #2).
98
 
99
- ### Training reward curve (real)
100
 
101
  ![reward curve](docs/plots/reward_curve.png)
102
 
103
- Reward climbs from `0.335` (random baseline) to `0.97` over 200 steps. The shaded band is `±1 std` of within-batch rollout variance: when narrow, the 4 group-relative generations agree; when wide, the model is exploring.
104
 
105
- ### Held-out generalization (the honest test)
 
 
106
 
107
  ![held-out breakdown](docs/plots/held_out_breakdown.png)
108
 
109
- On 80 clips the trained model has **never seen**:
 
 
 
 
110
 
111
- | Subset | Accuracy | What it means |
112
- |---|---|---|
113
- | Broad held-out (balanced 38 sarc / 42 sinc) | **51.25%** (41/80) | ≈ text-only baseline. The model generalizes the *format and reasoning style* but doesn't dramatically outperform text-only Qwen on novel clips. |
114
- | **Prosody-Pivot Set** (audio-decisive clips) | **5/6 = 83%** ⭐ | When prosody flips the answer, the trained model uses it. This is the audio-mattering signal. |
115
- | Well-formed completions | 98.75% | Format compliance is fully learned. |
116
- | Train reward (final) | 0.97 | Train/eval gap of ~0.32 is the *anti-memorization* signature — the env's reward is hard to game. |
117
 
118
- **The honest take**: we trained a Qwen2.5-3B baseline that spends most of its inference compute on text reasoning and uses prosody when it's decisive. The Pivot-Set 5/6 isolates the audio contribution; the broad held-out 51% reflects the *intrinsic* limit of pyin-derived prosody summaries when the clip is dominated by text-decidable content. **A richer feature pipeline would push higher** — see [Future Work](#future-work).
119
 
120
- ### Training dynamics
121
 
122
- ![training dynamics](docs/plots/training_dynamics.png)
123
 
124
- Loss + completion length over 200 steps. The reasoning-length floor of 50 tokens forces real `<think>` blocks; the model spends ~130 tokens reasoning before committing.
125
 
126
- ### Side-by-side reasoning (vanilla vs trained)
127
 
128
- [See full HTML](docs/side_by_side.html): the same 5 hand-picked hero clips, vanilla Qwen2.5-3B reasoning vs the trained checkpoint. The lead clip is `1_8136`: *"Yeah, I'm really looking forward to it. In fact, there's no place I would rather..."* (sarcastic, Pivot Set, trained nails it).
129
 
130
  ---
131
 
132
  ## Quick start
133
 
134
- ### Use the deployed env (no install)
 
 
 
 
 
 
135
 
136
  ```python
137
  from subtext_arena import SubtextArenaEnv, SubtextArenaAction
@@ -140,31 +120,25 @@ with SubtextArenaEnv(
140
  base_url="https://aamrinder-subtext-arena.hf.space"
141
  ).sync() as env:
142
  obs = env.reset().observation
143
- print(f"Clip {obs.clip_id} ({obs.speaker}, {obs.duration_s:.1f}s, pivot={obs.is_pivot})")
144
-
145
  obs = env.step(SubtextArenaAction(tool="get_transcript")).observation
146
- print(obs.tool_output[:500])
147
-
148
  obs = env.step(SubtextArenaAction(tool="get_prosody_features")).observation
149
- print(obs.tool_output)
150
-
151
  result = env.step(SubtextArenaAction(
152
  tool="submit_belief",
153
  tool_args={"label": "sarcastic", "confidence": 0.85},
154
  ))
155
- print(f"done={result.done} reward={result.reward}")
156
  ```
157
 
158
- ### Run locally
159
 
160
  ```bash
161
  git clone https://huggingface.co/spaces/aamrinder/subtext-arena
162
  cd subtext-arena
163
  uv sync
164
- uv run server # serves on :8000
165
  ```
166
 
167
- ### Reproduce training (on T4-medium / L4)
168
 
169
  ```bash
170
  hf jobs uv run --flavor l4x1 -s HF_TOKEN \
@@ -172,91 +146,66 @@ hf jobs uv run --flavor l4x1 -s HF_TOKEN \
172
  --with accelerate --with peft --with bitsandbytes \
173
  --with "git+https://huggingface.co/spaces/aamrinder/subtext-arena" \
174
  -- python -m subtext_arena.train.train_grpo \
175
- --max-steps 200 --num-generations 4 \
176
- --lora-r 16 --lora-dropout 0.05 \
177
  --eval-ratio 0.2 --n-eval-clips 80 \
178
- --push-to-hub aamrinder/subtext-arena-grpo \
179
- --save-trainer-state-to-hub-space aamrinder/subtext-arena
180
  ```
181
 
182
- `~$3 of compute, ~2 hours wall clock.` See `notebooks/train_grpo_colab.ipynb` for a Colab-runnable version.
183
-
184
- ---
185
-
186
- ## Why this matters
187
-
188
- - **Accessibility**: hearing-impaired users rely on captions that strip prosody. Sarcasm becomes opaque. An agent that flags lexical–prosodic mismatch could augment caption pipelines.
189
- - **Conversational AI**: voice assistants that mistake sarcastic complaints for sincere praise are a known failure mode. Prosody-aware classification is the missing piece.
190
- - **A reproducible benchmark**: this env wraps a 6-year-old supervised dataset (MUStARD) into a modern RL task with a verifiable reward and a held-out generalization test. Future teams can plug in richer feature extractors and immediately measure improvement under the same evaluator.
191
-
192
- ---
193
-
194
- ## Future work (where this env compounds)
195
-
196
- The env's design lets future work plug in **richer feature extractors** at the `audio_tools.py` layer without changing the trained-model interface or the reward. Specifically:
197
-
198
- | Today (basic features) | Tomorrow (rich features) | Expected lift |
199
- |---|---|---|
200
- | `librosa.pyin` pitch contour | wav2vec2 / HuBERT prosody embeddings | +10-15 pp on broad held-out (per AMuSeD F1 0.81) |
201
- | RMS energy summary | full spectrogram patch tokens | +5 pp |
202
- | Speaker-anonymous (text only) | speaker-aware features (familiarity, voice ID) | +3-8 pp |
203
- | 3B text policy | 7B audio-LLM (Qwen2-Audio) end-to-end | uncharted (LISTEN paper says +0 currently, RL might shift it) |
204
-
205
- The env is the durable contribution. The model is one baseline on it.
206
 
207
  ---
208
 
209
- ## Project layout
210
 
211
  ```
212
- subtext_arena/
213
- ├── README.md ← you are here (3-5 min read)
214
- ├── openenv.yaml ← manifest (spec_version=1, runtime=fastapi)
215
- ├── pyproject.toml ← deps: openenv-core>=0.2.3 + bundled data
216
- ├── uv.lock
217
- ├── client.py ← SubtextArenaEnv (HTTP+WS client)
218
- ├── models.py ← typed Action / Observation
219
- ├── data/
220
- │ ├── sarcasm_data.json ← 690 MUStARD ground-truth labels + transcripts
221
- │ ├── prosody_cache/utterances/ ← 690 pre-computed prosody jsons
222
- │ ├── prosody_cache/context/ ← 690 context-clip prosody jsons
223
- │ ├── pivot_set.json ← 32 Prosody-Pivot clips (text-only-wrong + audio-decisive)
224
- │ ├── baseline.json ← text-only Qwen2.5-3B predictions on all 690 clips
225
- │ ├── held_out_eval_run3.json ← 80-clip generalization eval
226
- │ ├── trainer_state_run3.json ← per-step training metrics
227
- │ └── log_history_run3.json ← backup of trainer state
228
  ├── server/
229
- │ ├── app.py ← FastAPI factory
230
- │ ├── audio_tools.py ← prosody/contour text renderers
231
- │ ├── scenarios.py ← MUStARD loader + Pivot oversampling + train/eval split
232
- │ ├── grader.py ← per-step + final-reward grader
233
- │ ├── subtext_arena_environment.py ← Environment subclass (reset/step/state)
234
- │ ├── Dockerfile
235
- │ └── requirements.txt
236
  ├── train/
237
- │ ├── train_grpo.py ← full GRPO training (Unsloth+TRL, used for Run #3)
238
- │ ├── curate_pivot_set.py ← 4-step Pivot Set curation pipeline
239
- │ ├── eval_pivot_set.py ← evaluate any checkpoint on Pivot Set
240
- │ ├── side_by_side.py ← baseline vs trained side-by-side HTML
241
- │ ├── plot_run3.py ← generate the 3 result plots from log_history
242
- │ └── plot_reward_decomp.py ← reward decomposition plotter (legacy)
243
  ├── notebooks/
244
- │ └── train_grpo_colab.ipynb ← Colab-rerunnable training script
 
 
 
 
 
 
245
  └── docs/
246
-     ├── plots/ ← committed PNGs (reward_curve, training_dynamics, held_out_breakdown)
247
-     └── side_by_side.html ← interactive side-by-side reasoning view
248
  ```
249
 
250
  ---
251
 
252
- ## Credits & references
253
 
254
- - **Dataset**: [MUStARD](https://github.com/soujanyaporia/MUStARD) (Castro et al., ACL 2019)
255
- - **Architecture inspiration**: [AudioToolAgent](https://arxiv.org/abs/2510.02995) (Oct 2025) — same pattern, prompted not RL-trained
256
- - **Adjacent recent RL work**: [EmotionThinker](https://arxiv.org/abs/2601.15668) (Mar 2026), [SarcasmMiner](https://arxiv.org/abs/2603.05275) (Mar 2026)
257
- - **Modality-fail benchmark**: [LISTEN](https://arxiv.org/abs/2510.10444) (Oct 2025) — documents the audio-LLM gap on emotion-mismatch cases
258
 
259
- Built for the **OpenEnv Hackathon Finale** sponsored by Meta PyTorch / Hugging Face / Scaler.
260
- Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) v0.2.3.
 
 
261
 
262
- > Team **BalleBalle** — Amrinder Singh + Shubham Kapoor.
 
12
  - rl
13
  - audio
14
  - sarcasm
 
 
15
  ---
16
 
17
+ # Subtext Arena
18
 
19
+ Try this out: a clip plays, the line is *"Yeah, I'm really looking forward to it."* Sincere or sarcastic? You can read the words a hundred times and still need the audio.
20
 
21
+ This repo wraps that task as an **OpenEnv environment**. It is built on MUStARD (a 690-clip sarcasm-in-sitcom dataset) with a tiny prosody pipeline (pyin pitch, RMS energy, pause timing) baked into the env. The agent reads a transcript and a text rendering of the prosody, and produces a chain-of-thought followed by `<final>{"label":..., "confidence":...}</final>`. Reward is graded on correctness, reasoning, and format.
22
 
23
+ I trained a baseline (Qwen2.5-3B-Instruct, LoRA r=16, 200 GRPO steps via Unsloth + TRL) on a strict 80/20 train/test split. The trained model is at [aamrinder/subtext-arena-grpo](https://huggingface.co/aamrinder/subtext-arena-grpo). Total compute spent: ~$11.
 
24
 
25
+ Built for the OpenEnv Hackathon Finale (Apr 2026, Bangalore). Theme: World Modeling.
26
+
27
+ Team **BalleBalle**: Amrinder Singh, Shubham Kapoor.
 
28
 
29
  ---
30
 
31
+ ## Why this is interesting
32
 
33
+ Detecting sarcasm from audio prosody is not solved. GPT-4o sits at 67% Macro-F1 on MUStARD++. There's a documented modality gap (the [LISTEN paper](https://arxiv.org/abs/2510.10444), Oct 2025). And there is no public reinforcement-learning training environment for the task that I know of, so I made one.
 
 
 
 
 
34
 
35
+ The closest prior work is [AudioToolAgent](https://arxiv.org/abs/2510.02995) (Oct 2025), which prompted a frontier LLM to use audio analysis tools. Same architecture, but they didn't train. Subtext Arena is the training-side counterpart.
36
 
37
  ---
38
 
39
+ ## How the env works
40
+
41
+ Each episode is one MUStARD clip. The prompt the agent sees contains:
42
+
43
+ - The transcript (target line + 1-7 lines of preceding conversation, with speaker tags)
44
+ - Prosody features as text: pitch mean / variability / contour shape, energy mean / variability, voiced ratio, pre-utterance silence, internal pauses with timestamps
45
+ - A pitch contour rendered as an 8-level ASCII sparkline
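The sparkline in the last bullet can be illustrated with a short sketch. This is not the env's renderer (that lives in `server/audio_tools.py` and is not part of this diff); the function name `render_sparkline`, the min-max scaling, and the treatment of unvoiced frames as gaps are all assumptions made for illustration.

```python
from typing import Optional, Sequence

BLOCKS = "▁▂▃▄▅▆▇█"  # eight levels, lowest to highest


def render_sparkline(pitch_hz: Sequence[Optional[float]]) -> str:
    """Map a pitch track (Hz, None for unvoiced frames) to an 8-level sparkline."""
    voiced = [p for p in pitch_hz if p is not None]
    if not voiced:
        return " " * len(pitch_hz)
    lo, hi = min(voiced), max(voiced)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat contour
    chars = []
    for p in pitch_hz:
        if p is None:
            chars.append(" ")  # assumption: unvoiced frames render as a gap
        else:
            # Min-max scale to 0..7 and round to the nearest level.
            level = int((p - lo) / span * (len(BLOCKS) - 1) + 0.5)
            chars.append(BLOCKS[level])
    return "".join(chars)


print(render_sparkline([120, 140, 160, 180, None, 200, 150, 120]))
```

Scaling per clip (rather than against a fixed Hz range) keeps the shape readable for both low and high voices, at the cost of hiding absolute pitch, which the prompt reports separately as a mean.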
46
+
47
+ The model emits:
48
+
49
+ ```
50
+ <think>
51
+ ... reasoning over the prosodic and lexical cues ...
52
+ </think>
53
+ <final>{"label":"sarcastic","confidence":0.85}</final>
54
+ ```
55
+
56
+ The reward is a composable rubric:
57
 
58
  ```
59
+ 0.70 * correctness (confidence-weighted match against gold)
60
+ 0.15 * reasoning_length (penalty if <think> is under 50 words or over 300)
61
+ 0.15 * format (1.0 if a valid <final> JSON parses, 0 otherwise)
62
  ```
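To make the weighting concrete, here is a minimal sketch of how a completion could be scored under this rubric. It is not the training reward from `train/train_grpo.py`, which this diff does not show. The function name `score_completion` is hypothetical, and two details are assumptions: a wrong label is taken to earn `1 - confidence`, and the length term is taken to be all-or-nothing inside the 50 to 300 word window.

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
FINAL_RE = re.compile(r"<final>(.*?)</final>", re.DOTALL)


def score_completion(completion: str, gold_label: str) -> dict:
    """Score one completion with the 0.70 / 0.15 / 0.15 weights from the rubric."""
    correctness = reasoning = fmt = 0.0

    final = FINAL_RE.search(completion)
    if final:
        try:
            payload = json.loads(final.group(1))
            label = payload["label"]
            conf = min(max(float(payload["confidence"]), 0.0), 1.0)
            fmt = 1.0
            # Assumption: a correct label earns its confidence,
            # a wrong label earns 1 - confidence.
            correctness = conf if label == gold_label else 1.0 - conf
        except (ValueError, KeyError, TypeError):
            pass  # malformed JSON or missing keys: format and correctness stay 0

    think = THINK_RE.search(completion)
    n_words = len(think.group(1).split()) if think else 0
    # Assumption: full credit inside the 50-300 word window, none outside it.
    if 50 <= n_words <= 300:
        reasoning = 1.0

    total = 0.70 * correctness + 0.15 * reasoning + 0.15 * fmt
    return {
        "correctness": correctness,
        "reasoning_length": reasoning,
        "format": fmt,
        "total": total,
    }
```

Under these assumptions, a well-formed and confident wrong answer still collects the 0.30 from the length and format terms, which is one reason the training reward can sit well above held-out accuracy.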
63
 
64
+ The env also exposes the four tools (`get_transcript`, `get_prosody_features`, `get_pitch_contour`, `submit_belief`) for interactive multi-step inference. That's what you can poke at on this Space's web UI. Training uses the single-prompt format above so it matches the deck-linked Wordle and Sudoku notebooks.
65
 
66
  ---
67
 
68
+ ## Results
69
 
70
+ I trained for 200 steps with `num_generations=4`, LoRA r=16, dropout 0.05, on an L4 ($0.80/hr). The split is 552 train clips, 138 eval clips, deterministically seeded so judges can reproduce. Pivot oversample is 2x. Class balance is enforced in the dataset construction.
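A deterministic 80/20 split of this kind can be sketched as follows. The helper name `split_clip_ids` and the seed value are assumptions, the real split lives in `server/scenarios.py`, and the class balancing mentioned above is left out to keep the sketch short.

```python
import random
from typing import List, Sequence, Tuple


def split_clip_ids(
    clip_ids: Sequence[str], eval_ratio: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Deterministically split clip IDs into (train, eval)."""
    ordered = sorted(clip_ids)            # sort first so input order cannot change the split
    random.Random(seed).shuffle(ordered)  # local RNG: leaves global random state untouched
    n_eval = round(len(ordered) * eval_ratio)
    return ordered[n_eval:], ordered[:n_eval]


ids = [f"clip_{i}" for i in range(690)]  # placeholder IDs, not real MUStARD IDs
train, held_out = split_clip_ids(ids)
print(len(train), len(held_out))  # 552 138
```

Sorting before shuffling is the detail that makes the split reproducible regardless of the order in which the JSON file happens to be read.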
71
 
72
+ ### Reward over training
73
 
74
  ![reward curve](docs/plots/reward_curve.png)
75
 
76
+ Reward climbs from 0.335 to 0.97 on training prompts. The shaded band is within-batch rollout variance: when narrow, the four group-relative generations agree; when wide, the model is exploring.
77
 
78
+ ### Held-out generalization
79
+
80
+ After training, I ran greedy inference on 80 clips the model has never seen.
81
 
82
  ![held-out breakdown](docs/plots/held_out_breakdown.png)
83
 
84
+ | Subset | Accuracy |
85
+ |---|---|
86
+ | Broad held-out (38 sarcastic + 42 sincere) | 41/80 = **51.25%** |
87
+ | Prosody-Pivot Set in eval (audio-decisive clips) | 5/6 = **83%** |
88
+ | Well-formed completions | 79/80 = 98.75% |
89
 
90
+ The honest read: 51% on the broad set is roughly text-only baseline, which means pyin-derived prosody summary stats don't push a 3B model much beyond what it already does from the transcript. But when the audio is genuinely decisive (the Pivot Set), the trained model uses it: 5/6 correct on those clips, vs 0/6 for a text-only baseline that confidently picks the wrong label.
 
 
 
 
 
91
 
92
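+ The gap between the 0.97 training reward and the 51% held-out accuracy (different metrics, so only loosely comparable) is what I read as the anti-memorization signature: if the model had been gaming the reward, train and held-out would match.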
93
 
94
+ ### Side-by-side
95
 
96
+ [`docs/side_by_side.html`](docs/side_by_side.html) shows 5 hand-picked clips from the held-out set where text-only Qwen confidently picks the wrong label and the prosody-trained model picks the right one. Tally: baseline 0/5, trained 5/5.
97
 
98
+ ### Training dynamics
99
 
100
+ ![training dynamics](docs/plots/training_dynamics.png)
101
 
102
+ Loss + completion length. The reasoning-length floor of 50 words is what keeps `<think>` blocks from collapsing to one-liners.
103
 
104
  ---
105
 
106
  ## Quick start
107
 
108
+ Install from this Space:
109
+
110
+ ```bash
111
+ pip install git+https://huggingface.co/spaces/aamrinder/subtext-arena
112
+ ```
113
+
114
+ Connect to the deployed env:
115
 
116
  ```python
117
  from subtext_arena import SubtextArenaEnv, SubtextArenaAction
 
120
  base_url="https://aamrinder-subtext-arena.hf.space"
121
  ).sync() as env:
122
  obs = env.reset().observation
 
 
123
  obs = env.step(SubtextArenaAction(tool="get_transcript")).observation
 
 
124
  obs = env.step(SubtextArenaAction(tool="get_prosody_features")).observation
 
 
125
  result = env.step(SubtextArenaAction(
126
  tool="submit_belief",
127
  tool_args={"label": "sarcastic", "confidence": 0.85},
128
  ))
129
+ print(result.reward, result.observation.metadata)
130
  ```
131
 
132
+ Run locally:
133
 
134
  ```bash
135
  git clone https://huggingface.co/spaces/aamrinder/subtext-arena
136
  cd subtext-arena
137
  uv sync
138
+ uv run server
139
  ```
140
 
141
+ Reproduce training (about 2 hours on L4, $3 of compute):
142
 
143
  ```bash
144
  hf jobs uv run --flavor l4x1 -s HF_TOKEN \
 
146
  --with accelerate --with peft --with bitsandbytes \
147
  --with "git+https://huggingface.co/spaces/aamrinder/subtext-arena" \
148
  -- python -m subtext_arena.train.train_grpo \
149
+ --max-steps 200 --num-generations 4 --lora-r 16 \
 
150
  --eval-ratio 0.2 --n-eval-clips 80 \
151
+ --push-to-hub aamrinder/subtext-arena-grpo
 
152
  ```
153
 
154
+ A Colab-friendly version of the script is in [notebooks/train_grpo_colab.ipynb](notebooks/train_grpo_colab.ipynb).
155
 
156
  ---
157
 
158
+ ## Repo layout
159
 
160
  ```
161
+ .
162
+ ├── client.py SubtextArenaEnv (HTTP + WebSocket client)
163
+ ├── models.py Action / Observation Pydantic types
164
+ ├── openenv.yaml OpenEnv manifest
165
+ ├── pyproject.toml
166
  ├── server/
167
+ │ ├── app.py FastAPI factory
168
+ │ ├── audio_tools.py prosody / contour text renderers
169
+ │ ├── scenarios.py MUStARD loader + train/eval split
170
+ │ ├── grader.py per-step + final reward
171
+ │ └── subtext_arena_environment.py reset / step / state
 
 
172
  ├── train/
173
+ │ ├── train_grpo.py GRPO training (Unsloth + TRL)
174
+ │ ├── eval_pivot_set.py evaluate any checkpoint
175
+ │ └── curate_pivot_set.py 4-step Pivot Set curation
 
 
 
176
  ├── notebooks/
177
+ │ └── train_grpo_colab.ipynb Colab-runnable training
178
+ ├── data/
179
+ │ ├── sarcasm_data.json MUStARD ground truth (690 clips)
180
+ │ ├── prosody_cache/ 1380 pre-computed prosody jsons
181
+ │ ├── pivot_set.json 32 audio-decisive clips
182
+ │ ├── held_out_eval_run3.json 80-clip generalization eval
183
+ │ └── log_history_run3.json per-step training metrics
184
  └── docs/
185
+     ├── plots/ reward + training + held-out PNGs
186
+ └── side_by_side.html baseline vs trained, 5 hero clips
187
  ```
188
 
189
  ---
190
 
191
+ ## Where this could go next
192
+
193
+ The env's audio-tool layer is decoupled from the model interface, so a future team can plug in richer features without touching anything else. Specifically:
194
+
195
+ - pyin pitch contour → wav2vec2 / HuBERT prosody embeddings
196
+ - RMS summary → spectrogram patch tokens
197
+ - speaker-anonymous prompts → speaker-aware features
198
+ - 3B text policy → 7B audio LLM (Qwen2-Audio) end-to-end
199
+
200
+ If any of those lift the broad held-out accuracy from 51% toward AMuSeD's 81% F1 (the multimodal SOTA on the same dataset), the env will measure it cleanly because the held-out split and reward are fixed.
201
+
202
+ ---
203
 
204
+ ## References
 
 
 
205
 
206
+ - MUStARD (Castro et al., ACL 2019): https://github.com/soujanyaporia/MUStARD
207
+ - AudioToolAgent (Oct 2025): https://arxiv.org/abs/2510.02995
208
+ - LISTEN benchmark (Oct 2025): https://arxiv.org/abs/2510.10444
209
+ - OpenEnv: https://github.com/meta-pytorch/OpenEnv
210
 
211
+ Built on OpenEnv v0.2.3.
models.py CHANGED
@@ -1,20 +1,4 @@
1
- """Data models for the Subtext Arena Environment.
2
-
3
- Subtext Arena is an OpenEnv environment where an LLM agent must classify
4
- sarcastic vs sincere speech by investigating both the literal transcript AND
5
- the prosody (pitch, energy, pause patterns) of an audio clip.
6
-
7
- The agent picks one of:
8
- - get_transcript() - returns the literal transcript and the
9
- preceding conversational context
10
- - get_prosody_features(span) - pitch_var, energy_var, voiced_ratio,
11
- pre_pause_ms, pause list
12
- - get_pitch_contour(span) - coarse Hz contour over the span
13
- - submit_belief(label, conf) - terminates the episode with the answer
14
-
15
- `tool_args` is a free-form dict so we don't need a separate Action class
16
- per tool; the env validates the shape per tool name.
17
- """
18
  from __future__ import annotations
19
 
20
  from typing import Any, Dict, List, Optional
@@ -44,12 +28,7 @@ class SubtextArenaAction(Action):
44
 
45
 
46
  class SubtextArenaObservation(Observation):
47
- """Observation returned after each action.
48
-
49
- `tool_output` is the human-readable text describing what the tool returned
50
- (transcript text, prosody summary, pitch contour). Empty for the
51
- submit_belief action (which terminates the episode).
52
- """
53
 
54
  clip_id: str = Field(default="", description="MUStARD clip identifier (e.g. '1_60')")
55
  speaker: str = Field(default="", description="Speaker name when available (Friends/BBT)")
 
1
+ """Action and observation models for Subtext Arena."""
2
  from __future__ import annotations
3
 
4
  from typing import Any, Dict, List, Optional
 
28
 
29
 
30
  class SubtextArenaObservation(Observation):
31
+ """Observation returned after each action."""
 
 
 
 
 
32
 
33
  clip_id: str = Field(default="", description="MUStARD clip identifier (e.g. '1_60')")
34
  speaker: str = Field(default="", description="Speaker name when available (Friends/BBT)")
server/grader.py CHANGED
@@ -1,33 +1,19 @@
1
- """In-environment per-step reward grader.
2
-
3
- This module computes reward signals AT EACH STEP (rather than only at episode
4
- end). Per-step rewards let the env return non-zero rewards along the way,
5
- which is what the deck calls out as good practice ("rich, informative signal,
6
- not just 0/1 at the end").
7
-
8
- Note: there is ALSO a transcript-level reward in research/subtext_solutions/reward.py
9
- which is used by GRPO training (it scores entire rollouts at once). The
10
- relationship:
11
- - Training-time (GRPO): research/subtext_solutions/reward.py runs over the
12
- full rollout transcript and returns one scalar.
13
- - Env-time (this file): per-step grader returns deltas as the agent acts;
14
- the sum of step rewards over an episode equals
15
- (or is proxy for) the GRPO reward at termination.
16
-
17
- This dual layer means the env is interactive (judges can poke at it from a
18
- notebook and see immediate rewards) AND trainable (GRPO sees a clean per-
19
- rollout scalar). Both are exposed.
20
  """
21
  from __future__ import annotations
22
 
23
  from typing import Any, Dict
24
 
25
 
26
- # Per-step reward components
27
- R_AUDIO_TOOL_USE = 0.05 # small + reward each time the agent calls an audio tool
28
- R_TRANSCRIPT_USE = 0.02 # smaller + reward for transcript reads (text alone is cheap)
29
- R_BAD_TOOL = -0.10 # penalty for unknown tool name
30
- R_BAD_ARGS = -0.05 # penalty for malformed args
31
 
32
 
33
  def step_reward(tool_used: str, error: str | None) -> float:
@@ -52,21 +38,18 @@ def final_reward(
52
  n_audio_calls: int,
53
  n_total_calls: int,
54
  ) -> Dict[str, float]:
55
- """Reward components contributed when submit_belief terminates the episode.
56
-
57
- Returns a dict so the env can also surface the breakdown (good for plots
58
- and for inspecting why a rollout scored what it did).
59
-
60
- Mirrors the training reward in research/subtext_solutions/reward.py:
61
- - correctness: confidence-weighted match against gold
62
- - prosody_grounding: +1 if any audio-tool call (force the policy to listen)
63
- - tool_parsimony: 1.0 if 1-3 calls; 0.6 if 4-5; 0.0 if >5
64
- - format_ok: 1.0 if a valid label was submitted
65
-
66
- Plus hard penalties:
67
- - no submission -0.30
68
- - too many calls -0.20
69
- - pivot + no audio + wrong -0.50
70
  """
71
  components: Dict[str, float] = {
72
  "correctness": 0.0,
 
1
+ """Per-step + final-action reward grader for the multi-step interactive env.
2
+
3
+ The training script in train/train_grpo.py uses a single-shot reward (in
4
+ train_grpo.make_reward_fn) that scores the whole rollout at once. This
5
+ file is what the env returns step-by-step when an agent walks it
6
+ interactively (e.g. from the HF Space web UI).
7
  """
8
  from __future__ import annotations
9
 
10
  from typing import Any, Dict
11
 
12
 
13
+ R_AUDIO_TOOL_USE = 0.05
14
+ R_TRANSCRIPT_USE = 0.02
15
+ R_BAD_TOOL = -0.10
16
+ R_BAD_ARGS = -0.05
 
17
 
18
 
19
  def step_reward(tool_used: str, error: str | None) -> float:
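The body of `step_reward` falls outside this hunk, so the following is only a sketch of how the four constants and the component tiers listed in the `final_reward` docstring of this file might be applied. The tool names come from the README. The function names with a `_sketch` suffix, the two helper names, the way `error` is interpreted, and the score for zero tool calls are all assumptions.

```python
from typing import Optional

R_AUDIO_TOOL_USE = 0.05   # each audio-tool call
R_TRANSCRIPT_USE = 0.02   # transcript read
R_BAD_TOOL = -0.10        # unknown tool name
R_BAD_ARGS = -0.05        # malformed args

AUDIO_TOOLS = {"get_prosody_features", "get_pitch_contour"}
KNOWN_TOOLS = AUDIO_TOOLS | {"get_transcript", "submit_belief"}


def step_reward_sketch(tool_used: str, error: Optional[str]) -> float:
    """Hypothetical per-step reward; the real body is not shown in this diff."""
    if tool_used not in KNOWN_TOOLS:
        return R_BAD_TOOL
    if error is not None:  # assumption: any error on a known tool means bad args
        return R_BAD_ARGS
    if tool_used in AUDIO_TOOLS:
        return R_AUDIO_TOOL_USE
    if tool_used == "get_transcript":
        return R_TRANSCRIPT_USE
    return 0.0  # assumption: submit_belief is scored by final_reward instead


def tool_parsimony(n_total_calls: int) -> float:
    """Tiered score: 1.0 for 1-3 calls, 0.6 for 4-5, 0.0 for more than 5."""
    if 1 <= n_total_calls <= 3:
        return 1.0
    if 4 <= n_total_calls <= 5:
        return 0.6
    return 0.0  # more than 5 per the docstring; 0 calls scoring 0.0 is an assumption


def prosody_grounding(n_audio_calls: int, is_pivot: bool) -> float:
    """1.0 if any audio tool was called, else 0.4 (0.0 on Pivot Set clips)."""
    if n_audio_calls > 0:
        return 1.0
    return 0.0 if is_pivot else 0.4


print(step_reward_sketch("get_pitch_contour", None), tool_parsimony(4), prosody_grounding(0, True))
```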
 
38
  n_audio_calls: int,
39
  n_total_calls: int,
40
  ) -> Dict[str, float]:
41
+ """Reward computed when submit_belief terminates the episode.
42
+
43
+ Components:
44
+ correctness confidence-weighted match against gold
45
+ prosody_grounding 1.0 if any audio-tool call, 0.4 otherwise (0.0 on Pivot)
46
+ tool_parsimony 1.0 for 1-3 calls, 0.6 for 4-5, 0.0 for >5
47
+ format_ok 1.0 if a valid label was submitted
48
+
49
+ Penalties:
50
+ no submission -0.30
51
+ too many calls -0.20
52
+ pivot + no audio + wrong -0.50
 
 
 
53
  """
54
  components: Dict[str, float] = {
55
  "correctness": 0.0,
server/scenarios.py CHANGED
@@ -1,16 +1,4 @@
1
- """Scenario loader for the Subtext Arena environment.
2
-
3
- Loads MUStARD ground-truth (sarcasm labels + transcripts + context) and the
4
- pre-computed prosody features for all 690 utterances.
5
-
6
- The scenario_dict pattern mirrors the SRE R1 pattern: each clip is a "case"
7
- the agent must investigate and classify. Difficulty buckets:
8
- - easy: no Pivot Set membership; baseline text-only Qwen would likely get this right
9
- - medium: borderline confidence; mixed text + prosody required
10
- - hard: Prosody-Pivot Set — text-only is wrong; audio cues flip the answer
11
-
12
- For training, we mix clips across all difficulties. Eval reports per-bucket.
13
- """
14
  from __future__ import annotations
15
 
16
  import json
@@ -23,24 +11,7 @@ from typing import Any, Dict, List, Optional
23
  DEFAULT_DATA_ROOT = Path(__file__).resolve().parent.parent / "data"
24
 
25
 
26
- def load_scenarios(
27
- data_root: Optional[Path] = None,
28
- ) -> Dict[str, Dict[str, Any]]:
29
- """Load all 690 utterances with sarcasm labels + prosody features.
30
-
31
- Returns:
32
- Dict keyed by clip_id with structure:
33
- {
34
- 'utterance': str (literal transcript)
35
- 'speaker': str
36
- 'context': list[str] (lines spoken before)
37
- 'context_speakers': list[str]
38
- 'sarcasm': bool (ground truth label)
39
- 'show': str
40
- 'prosody': dict (pre-computed features) or None
41
- 'is_pivot': bool (filled in later by curate_pivot_set)
42
- }
43
- """
44
  root = data_root or DEFAULT_DATA_ROOT
45
  sarcasm_path = root / "sarcasm_data.json"
46
  prosody_dir = root / "prosody_cache" / "utterances"
@@ -82,18 +53,8 @@ def load_scenarios(
82
  return scenarios
83
 
84
 
85
- def sample_clip(
86
- scenarios: Dict[str, Dict[str, Any]],
87
- rng: random.Random,
88
- pivot_oversample_factor: int = 3,
89
- ) -> str:
90
- """Sample a clip ID from the scenarios.
91
-
92
- Pivot-set clips are oversampled (default 3x) because they are the highest-
93
- signal training instances — they are exactly where text-only fails and
94
- prosody must carry the signal.
95
- """
96
- pool: List[str] = []
97
  for clip_id, entry in scenarios.items():
98
  weight = pivot_oversample_factor if entry["is_pivot"] else 1
99
  pool.extend([clip_id] * weight)
 
1
+ """Loads MUStARD ground truth + cached prosody features for all 690 clips."""
2
  from __future__ import annotations
3
 
4
  import json
 
11
  DEFAULT_DATA_ROOT = Path(__file__).resolve().parent.parent / "data"
12
 
13
 
14
+ def load_scenarios(data_root: Optional[Path] = None) -> Dict[str, Dict[str, Any]]:
15
  root = data_root or DEFAULT_DATA_ROOT
16
  sarcasm_path = root / "sarcasm_data.json"
17
  prosody_dir = root / "prosody_cache" / "utterances"
 
53
  return scenarios
54
 
55
 
56
+ def sample_clip(scenarios, rng, pivot_oversample_factor=3):
57
+ pool = []
58
  for clip_id, entry in scenarios.items():
59
  weight = pivot_oversample_factor if entry["is_pivot"] else 1
60
  pool.extend([clip_id] * weight)
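The hunk ends before the function returns, so the effect of the weighting is easiest to see in a self-contained sketch. The final `rng.choice(pool)` is an assumption about the elided line, the name `sample_clip_sketch` is hypothetical, and the two-clip `toy` dict is made up for the demonstration (the IDs are borrowed from examples elsewhere in this commit).

```python
import random
from typing import Any, Dict, List


def sample_clip_sketch(
    scenarios: Dict[str, Dict[str, Any]],
    rng: random.Random,
    pivot_oversample_factor: int = 3,
) -> str:
    """Draw one clip ID, with Pivot Set clips repeated in the pool."""
    pool: List[str] = []
    for clip_id, entry in scenarios.items():
        weight = pivot_oversample_factor if entry["is_pivot"] else 1
        pool.extend([clip_id] * weight)
    return rng.choice(pool)  # assumption: the diff cuts off before the return


toy = {"1_60": {"is_pivot": False}, "1_8136": {"is_pivot": True}}
rng = random.Random(0)
draws = [sample_clip_sketch(toy, rng) for _ in range(4000)]
print(draws.count("1_8136") / len(draws))  # close to 0.75
```

With a factor of 3, a pivot clip is three times as likely to be drawn as a non-pivot clip. Rebuilding the pool on every call is linear in the number of clips, which is negligible at 690 clips.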