Upload folder using huggingface_hub
- README.md +106 -157
- models.py +2 -23
- server/grader.py +22 -39
- server/scenarios.py +4 -43
README.md
CHANGED
|
@@ -12,126 +12,106 @@ tags:
|
|
| 12 |
- rl
|
| 13 |
- audio
|
| 14 |
- sarcasm
|
| 15 |
-
- theory-of-mind
|
| 16 |
-
- benchmark
|
| 17 |
---
|
| 18 |
|
| 19 |
-
# Subtext Arena
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
|
| 26 |
-
Theme: **#3 World Modeling** — modeling hidden speaker intent from observable acoustic cues.
|
| 27 |
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
**
|
| 31 |
-
**Re-runnable Colab**: [notebooks/train_grpo_colab.ipynb](notebooks/train_grpo_colab.ipynb)
|
| 32 |
|
| 33 |
---
|
| 34 |
|
| 35 |
-
##
|
| 36 |
|
| 37 |
-
|
| 38 |
-
|---|---|---|
|
| 39 |
-
| MUStARD itself | 2019 dataset, used as supervised classifier benchmark | Wrapped as an **OpenEnv environment** with a verifiable RL reward |
|
| 40 |
-
| AudioToolAgent (Oct 2025) | Prompted a frontier LLM with audio tools; no training | We **RL-train** a small (3B) model on the same architecture |
|
| 41 |
-
| EmotionThinker (Mar 2026) | GRPO for emotion classification, single-modality | Subtext-from-prosody (lexical–acoustic conflict, not just emotion) |
|
| 42 |
-
| SarcasmMiner (Mar 2026) | Video+audio, big model, big GPU | **Pure audio**, T4-fits, $8 of compute |
|
| 43 |
|
| 44 |
-
The
|
| 45 |
|
| 46 |
---
|
| 47 |
|
| 48 |
-
##
|
| 49 |
|
| 50 |
```
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
│ System: detect sarcasm vs sincere │
|
| 55 |
-
│ │
|
| 56 |
-
│ User: │
|
| 57 |
-
│ === Transcript === │
|
| 58 |
-
│ [LEONARD] I never would have ... │
|
| 59 |
-
│ Target: [SHELDON] It's just a │
|
| 60 |
-
│ privilege to watch... │
|
| 61 |
-
│ │
|
| 62 |
-
│ === Prosody features === │
|
| 63 |
-
│ Pitch mean=164Hz, var=51Hz [HIGH] │
|
| 64 |
-
│ Pre-pause: 320ms │
|
| 65 |
-
│ Internal pause 1.84-2.05s │
|
| 66 |
-
│ │
|
| 67 |
-
│ === Pitch contour === │
|
| 68 |
-
│ ▃▄▅▆█▇▅▃▂▁▂▃▄▅▆▇▆▄▃▂▂▁▁▂▃▄▅▆▇█ │
|
| 69 |
-
│ │
|
| 70 |
-
│ Decide: sarcastic or sincere? │
|
| 71 |
-
└─────────────────┬──────────────────────┘
|
| 72 |
-
│ model.generate()
|
| 73 |
-
▼
|
| 74 |
-
┌────────────────────────────────────────┐
|
| 75 |
-
│ COMPLETION │
|
| 76 |
-
│ <think> ... reasoning grounded in │
|
| 77 |
-
│ pitch / energy / pause / context</think>
|
| 78 |
-
│ <final>{"label":"sarcastic", │
|
| 79 |
-
│ "confidence":0.85}</final> │
|
| 80 |
-
└─────────────────┬──────────────────────┘
|
| 81 |
-
│ env scores it
|
| 82 |
-
▼
|
| 83 |
-
┌────────────────────────────────────────┐
|
| 84 |
-
│ REWARD (composable rubric) │
|
| 85 |
-
│ 0.70 × correctness (conf-weighted) │
|
| 86 |
-
│ 0.15 × reasoning_length (50–150 wd) │
|
| 87 |
-
│ 0.15 × format (valid <final> JSON) │
|
| 88 |
-
└────────────────────────────────────────┘
|
| 89 |
```
|
| 90 |
|
| 91 |
-
The
|
| 92 |
|
| 93 |
---
|
| 94 |
|
| 95 |
-
## Results
|
| 96 |
|
| 97 |
-
|
| 98 |
|
| 99 |
-
###
|
| 100 |
|
| 101 |

|
| 102 |
|
| 103 |
-
Reward climbs from
|
| 104 |
|
| 105 |
-
### Held-out generalization
|
| 106 |
|
| 107 |

|
| 108 |
|
| 109 |
-
|
| 110 |
|
| 111 |
-
|
| 112 |
-
|---|---|---|
|
| 113 |
-
| Broad held-out (balanced 38 sarc / 42 sinc) | **51.25%** (41/80) | ≈ text-only baseline. The model generalizes the *format and reasoning style* but doesn't dramatically outperform text-only Qwen on novel clips. |
|
| 114 |
-
| **Prosody-Pivot Set** (audio-decisive clips) | **5/6 = 83%** ⭐ | When prosody flips the answer, the trained model uses it. This is the audio-mattering signal. |
|
| 115 |
-
| Well-formed completions | 98.75% | Format compliance is fully learned. |
|
| 116 |
-
| Train reward (final) | 0.97 | Train/eval gap of ~0.32 is the *anti-memorization* signature — the env's reward is hard to game. |
|
| 117 |
|
| 118 |
-
|
| 119 |
|
| 120 |
-
###
|
| 121 |
|
| 122 |
-
|
| 123 |
|
| 124 |
-
|
| 125 |
|
| 126 |
-
|
| 127 |
|
| 128 |
-
|
| 129 |
|
| 130 |
---
|
| 131 |
|
| 132 |
## Quick start
|
| 133 |
|
| 134 |
-
|
| 135 |
|
| 136 |
```python
|
| 137 |
from subtext_arena import SubtextArenaEnv, SubtextArenaAction
|
|
@@ -140,31 +120,25 @@ with SubtextArenaEnv(
|
|
| 140 |
base_url="https://aamrinder-subtext-arena.hf.space"
|
| 141 |
).sync() as env:
|
| 142 |
obs = env.reset().observation
|
| 143 |
-
print(f"Clip {obs.clip_id} ({obs.speaker}, {obs.duration_s:.1f}s, pivot={obs.is_pivot})")
|
| 144 |
-
|
| 145 |
obs = env.step(SubtextArenaAction(tool="get_transcript")).observation
|
| 146 |
-
print(obs.tool_output[:500])
|
| 147 |
-
|
| 148 |
obs = env.step(SubtextArenaAction(tool="get_prosody_features")).observation
|
| 149 |
-
print(obs.tool_output)
|
| 150 |
-
|
| 151 |
result = env.step(SubtextArenaAction(
|
| 152 |
tool="submit_belief",
|
| 153 |
tool_args={"label": "sarcastic", "confidence": 0.85},
|
| 154 |
))
|
| 155 |
-
print(
|
| 156 |
```
|
| 157 |
|
| 158 |
-
|
| 159 |
|
| 160 |
```bash
|
| 161 |
git clone https://huggingface.co/spaces/aamrinder/subtext-arena
|
| 162 |
cd subtext-arena
|
| 163 |
uv sync
|
| 164 |
-
uv run server
|
| 165 |
```
|
| 166 |
|
| 167 |
-
|
| 168 |
|
| 169 |
```bash
|
| 170 |
hf jobs uv run --flavor l4x1 -s HF_TOKEN \
|
|
@@ -172,91 +146,66 @@ hf jobs uv run --flavor l4x1 -s HF_TOKEN \
|
|
| 172 |
--with accelerate --with peft --with bitsandbytes \
|
| 173 |
--with "git+https://huggingface.co/spaces/aamrinder/subtext-arena" \
|
| 174 |
-- python -m subtext_arena.train.train_grpo \
|
| 175 |
-
--max-steps 200 --num-generations 4 \
|
| 176 |
-
--lora-r 16 --lora-dropout 0.05 \
|
| 177 |
--eval-ratio 0.2 --n-eval-clips 80 \
|
| 178 |
-
--push-to-hub aamrinder/subtext-arena-grpo
|
| 179 |
-
--save-trainer-state-to-hub-space aamrinder/subtext-arena
|
| 180 |
```
|
| 181 |
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
---
|
| 185 |
-
|
| 186 |
-
## Why this matters
|
| 187 |
-
|
| 188 |
-
- **Accessibility**: hearing-impaired users rely on captions that strip prosody. Sarcasm becomes opaque. An agent that flags lexical–prosodic mismatch could augment caption pipelines.
|
| 189 |
-
- **Conversational AI**: voice assistants that mistake sarcastic complaints for sincere praise are a known failure mode. Prosody-aware classification is the missing piece.
|
| 190 |
-
- **A reproducible benchmark**: this env wraps a 6-year-old supervised dataset (MUStARD) into a modern RL task with a verifiable reward and a held-out generalization test. Future teams can plug in richer feature extractors and immediately measure improvement under the same evaluator.
|
| 191 |
-
|
| 192 |
-
---
|
| 193 |
-
|
| 194 |
-
## Future work (where this env compounds)
|
| 195 |
-
|
| 196 |
-
The env's design lets future work plug in **richer feature extractors** at the `audio_tools.py` layer without changing the trained-model interface or the reward. Specifically:
|
| 197 |
-
|
| 198 |
-
| Today (basic features) | Tomorrow (rich features) | Expected lift |
|
| 199 |
-
|---|---|---|
|
| 200 |
-
| `librosa.pyin` pitch contour | wav2vec2 / HuBERT prosody embeddings | +10-15 pp on broad held-out (per AMuSeD F1 0.81) |
|
| 201 |
-
| RMS energy summary | full spectrogram patch tokens | +5 pp |
|
| 202 |
-
| Speaker-anonymous (text only) | speaker-aware features (familiarity, voice ID) | +3-8 pp |
|
| 203 |
-
| 3B text policy | 7B audio-LLM (Qwen2-Audio) end-to-end | uncharted (LISTEN paper says +0 currently, RL might shift it) |
|
| 204 |
-
|
| 205 |
-
The env is the durable contribution. The model is one baseline on it.
|
| 206 |
|
| 207 |
---
|
| 208 |
|
| 209 |
-
##
|
| 210 |
|
| 211 |
```
|
| 212 |
-
|
| 213 |
-
├──
|
| 214 |
-
├──
|
| 215 |
-
├──
|
| 216 |
-
├──
|
| 217 |
-
├── client.py ← SubtextArenaEnv (HTTP+WS client)
|
| 218 |
-
├── models.py ← typed Action / Observation
|
| 219 |
-
├── data/
|
| 220 |
-
│ ├── sarcasm_data.json ← 690 MUStARD ground-truth labels + transcripts
|
| 221 |
-
│ ├── prosody_cache/utterances/ ← 690 pre-computed prosody jsons
|
| 222 |
-
│ ├── prosody_cache/context/ ← 690 context-clip prosody jsons
|
| 223 |
-
│ ├── pivot_set.json ← 32 Prosody-Pivot clips (text-only-wrong + audio-decisive)
|
| 224 |
-
│ ├── baseline.json ← text-only Qwen2.5-3B predictions on all 690 clips
|
| 225 |
-
│ ├── held_out_eval_run3.json ← 80-clip generalization eval
|
| 226 |
-
│ ├── trainer_state_run3.json ← per-step training metrics
|
| 227 |
-
│ └── log_history_run3.json ← backup of trainer state
|
| 228 |
├── server/
|
| 229 |
-
│ ├── app.py
|
| 230 |
-
│ ├── audio_tools.py
|
| 231 |
-
│ ├── scenarios.py
|
| 232 |
-
│ ├── grader.py
|
| 233 |
-
│
|
| 234 |
-
│ ├── Dockerfile
|
| 235 |
-
│ └── requirements.txt
|
| 236 |
├── train/
|
| 237 |
-
│ ├── train_grpo.py
|
| 238 |
-
│ ├──
|
| 239 |
-
│
|
| 240 |
-
│ ├── side_by_side.py ← baseline vs trained side-by-side HTML
|
| 241 |
-
│ ├── plot_run3.py ← generate the 3 result plots from log_history
|
| 242 |
-
│ └── plot_reward_decomp.py ← reward decomposition plotter (legacy)
|
| 243 |
├── notebooks/
|
| 244 |
-
│ └── train_grpo_colab.ipynb
|
| 245 |
└── docs/
|
| 246 |
-
|
| 247 |
-
|
| 248 |
```
|
| 249 |
|
| 250 |
---
|
| 251 |
|
| 252 |
-
##
|
| 253 |
|
| 254 |
-
|
| 255 |
-
- **Architecture inspiration**: [AudioToolAgent](https://arxiv.org/abs/2510.02995) (Oct 2025) — same pattern, prompted not RL-trained
|
| 256 |
-
- **Adjacent recent RL work**: [EmotionThinker](https://arxiv.org/abs/2601.15668) (Mar 2026), [SarcasmMiner](https://arxiv.org/abs/2603.05275) (Mar 2026)
|
| 257 |
-
- **Modality-fail benchmark**: [LISTEN](https://arxiv.org/abs/2510.10444) (Oct 2025) — documents the audio-LLM gap on emotion-mismatch cases
|
| 258 |
|
| 259 |
-
|
| 260 |
-
|
| 261 |
|
| 262 |
-
|
|
|
|
| 12 |
- rl
|
| 13 |
- audio
|
| 14 |
- sarcasm
|
| 15 |
---
|
| 16 |
|
| 17 |
+
# Subtext Arena
|
| 18 |
|
| 19 |
+
Try this out: a clip plays, the line is *"Yeah, I'm really looking forward to it."* Sincere or sarcastic? You can read the words a hundred times and still need the audio.
|
| 20 |
|
| 21 |
+
This repo wraps that task as an **OpenEnv environment**. It is built on MUStARD (a 690-clip sarcasm-in-sitcom dataset) and a tiny prosody pipeline (pyin pitch, RMS energy, pause timing) baked into the env. The agent reads a transcript and a text rendering of the prosody, then produces a chain of thought followed by `<final>{"label":..., "confidence":...}</final>`. The reward grades correctness, reasoning length, and format.
|
| 22 |
|
| 23 |
+
I trained a baseline (Qwen2.5-3B-Instruct, LoRA r=16, 200 GRPO steps via Unsloth + TRL) on a strict 80/20 train/test split. The trained model is at [aamrinder/subtext-arena-grpo](https://huggingface.co/aamrinder/subtext-arena-grpo). Total compute spent: ~$11.
|
|
|
|
| 24 |
|
| 25 |
+
Built for the OpenEnv Hackathon Finale (Apr 2026, Bangalore). Theme: World Modeling.
|
| 26 |
+
|
| 27 |
+
Team **BalleBalle**: Amrinder Singh, Shubham Kapoor.
|
|
|
|
| 28 |
|
| 29 |
---
|
| 30 |
|
| 31 |
+
## Why this is interesting
|
| 32 |
|
| 33 |
+
Detecting sarcasm from audio prosody is not solved. GPT-4o sits at 67% Macro-F1 on MUStARD++. There's a documented modality gap (the [LISTEN paper](https://arxiv.org/abs/2510.10444), Oct 2025). And there is no public reinforcement-learning training environment for the task that I know of, so I made one.
|
| 34 |
|
| 35 |
+
The closest prior work is [AudioToolAgent](https://arxiv.org/abs/2510.02995) (Oct 2025), which prompted a frontier LLM to use audio analysis tools. Same architecture, but they didn't train. Subtext Arena is the training-side counterpart.
|
| 36 |
|
| 37 |
---
|
| 38 |
|
| 39 |
+
## How the env works
|
| 40 |
+
|
| 41 |
+
Each episode is one MUStARD clip. The prompt the agent sees contains:
|
| 42 |
+
|
| 43 |
+
- The transcript (target line + 1-7 lines of preceding conversation, with speaker tags)
|
| 44 |
+
- Prosody features as text: pitch mean / variability / contour shape, energy mean / variability, voiced ratio, pre-utterance silence, internal pauses with timestamps
|
| 45 |
+
- A pitch contour rendered as an 8-level ASCII sparkline
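
The sparkline in the last bullet can be sketched in a few lines. This is a minimal illustration and not the code in `server/audio_tools.py`: the function name `sparkline`, the `width` parameter, and the choice to drop unvoiced frames are all assumptions made for the example.

```python
from typing import Optional, Sequence

# Eight block characters, lowest to highest (the "8 levels").
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(pitch_hz: Sequence[Optional[float]], width: int = 32) -> str:
    """Render a pitch track as an 8-level sparkline string.

    Assumption: unvoiced frames are passed as None and simply dropped.
    """
    voiced = [p for p in pitch_hz if p is not None]
    if not voiced:
        return ""
    # Downsample to at most `width` points by taking evenly spaced samples.
    if len(voiced) > width:
        step = len(voiced) / width
        voiced = [voiced[int(i * step)] for i in range(width)]
    lo, hi = min(voiced), max(voiced)
    span = hi - lo
    if span == 0:
        # Flat contour: every frame maps to the lowest level.
        return BLOCKS[0] * len(voiced)
    chars = []
    for p in voiced:
        # Scale to 0..7 and round to the nearest level.
        level = int((p - lo) / span * (len(BLOCKS) - 1) + 0.5)
        chars.append(BLOCKS[level])
    return "".join(chars)
```

Normalizing per clip (min to max) keeps the shape of the contour visible regardless of the speaker's absolute pitch range, at the cost of hiding that range; the separate pitch mean and variability features carry it instead.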
|
| 46 |
+
|
| 47 |
+
The model emits:
|
| 48 |
+
|
| 49 |
+
```
|
| 50 |
+
<think>
|
| 51 |
+
... reasoning over the prosodic and lexical cues ...
|
| 52 |
+
</think>
|
| 53 |
+
<final>{"label":"sarcastic","confidence":0.85}</final>
|
| 54 |
+
```
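
A completion in this format can be checked with a small parser. The sketch below is an assumption about how such a check might look, not the repo's actual implementation: `parse_final` is a hypothetical name, and the requirement that confidence lies in [0, 1] is assumed rather than stated above.

```python
import json
import re
from typing import Optional, Tuple

FINAL_RE = re.compile(r"<final>(.*?)</final>", re.DOTALL)
VALID_LABELS = {"sarcastic", "sincere"}


def parse_final(completion: str) -> Optional[Tuple[str, float]]:
    """Return (label, confidence) if the completion has a valid <final> block, else None."""
    match = FINAL_RE.search(completion)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    label = payload.get("label")
    confidence = payload.get("confidence")
    if not isinstance(label, str) or label not in VALID_LABELS:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    # Assumed range check: confidence is treated as a probability.
    if not 0.0 <= confidence <= 1.0:
        return None
    return label, float(confidence)
```

Returning `None` for every failure mode makes the format check a single boolean: the completion either parses or it does not.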
|
| 55 |
+
|
| 56 |
+
The reward is a composable rubric:
|
| 57 |
|
| 58 |
```
|
| 59 |
+
0.70 * correctness (confidence-weighted match against gold)
|
| 60 |
+
0.15 * reasoning_length (penalty if <think> is under 50 words or over 300)
|
| 61 |
+
0.15 * format (1.0 if a valid <final> JSON parses, 0 otherwise)
|
| 62 |
```
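
To make the weighting concrete, here is a rough sketch of how the three terms could combine. Only the weights (0.70 / 0.15 / 0.15) and the 50 to 300 word window come from the rubric above. The function name `rubric_reward`, the exact confidence weighting (confidence when right, one minus confidence when wrong), and the hard 0/1 length score are assumptions for illustration; the real scoring lives in `train/train_grpo.py`.

```python
from typing import Dict, Optional

W_CORRECT, W_REASONING, W_FORMAT = 0.70, 0.15, 0.15


def rubric_reward(
    gold_label: str,
    predicted_label: Optional[str],  # None means no valid <final> JSON was parsed
    confidence: float,
    think_words: int,
) -> Dict[str, float]:
    """Combine correctness, reasoning length, and format into one scalar."""
    format_ok = 1.0 if predicted_label is not None else 0.0

    # Assumed confidence weighting: a confident correct answer scores high,
    # a confident wrong answer scores low.
    if predicted_label is None:
        correctness = 0.0
    elif predicted_label == gold_label:
        correctness = confidence
    else:
        correctness = 1.0 - confidence

    # Assumed hard window: full credit inside 50..300 words, none outside.
    reasoning = 1.0 if 50 <= think_words <= 300 else 0.0

    total = W_CORRECT * correctness + W_REASONING * reasoning + W_FORMAT * format_ok
    return {
        "correctness": correctness,
        "reasoning_length": reasoning,
        "format": format_ok,
        "total": total,
    }
```

Under these assumptions, a correct answer at confidence 0.85 with a 120-word `<think>` block scores 0.70 × 0.85 + 0.15 + 0.15 = 0.895, while a malformed completion with a short `<think>` block scores 0.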
|
| 63 |
|
| 64 |
+
The env also exposes the four tools (`get_transcript`, `get_prosody_features`, `get_pitch_contour`, `submit_belief`) for interactive multi-step inference. That's what you can poke at on this Space's web UI. Training uses the single-prompt format above so it matches the deck-linked Wordle and Sudoku notebooks.
|
| 65 |
|
| 66 |
---
|
| 67 |
|
| 68 |
+
## Results
|
| 69 |
|
| 70 |
+
I trained for 200 steps with `num_generations=4`, LoRA r=16, dropout 0.05, on an L4 ($0.80/hr). The split is 552 train clips, 138 eval clips, deterministically seeded so judges can reproduce. Pivot oversample is 2x. Class balance is enforced in the dataset construction.
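
For readers who want to see what a deterministically seeded split looks like, here is a minimal sketch. It is not the code in `server/scenarios.py`: `split_clips` and the seed value are hypothetical, and it ignores class balancing and pivot oversampling. It only shows why a fixed seed plus a sorted input gives the same 552/138 partition on every run.

```python
import random
from typing import List, Sequence, Tuple


def split_clips(
    clip_ids: Sequence[str],
    eval_ratio: float = 0.2,
    seed: int = 0,  # assumed seed; the real value is whatever the training script uses
) -> Tuple[List[str], List[str]]:
    """Return (train_ids, eval_ids) with a reproducible shuffle."""
    # Sort first so the result does not depend on the order the ids arrive in.
    ordered = sorted(clip_ids)
    rng = random.Random(seed)
    rng.shuffle(ordered)
    n_eval = round(len(ordered) * eval_ratio)
    return ordered[n_eval:], ordered[:n_eval]
```

With 690 clip ids and `eval_ratio=0.2`, this yields 552 training ids and 138 eval ids, and the two sets never overlap.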
|
| 71 |
|
| 72 |
+
### Reward over training
|
| 73 |
|
| 74 |

|
| 75 |
|
| 76 |
+
Reward climbs from 0.335 to 0.97 on training prompts. The shaded band is within-batch rollout variance: when narrow, the four group-relative generations agree; when wide, the model is exploring.
|
| 77 |
|
| 78 |
+
### Held-out generalization
|
| 79 |
+
|
| 80 |
+
After training, I ran greedy inference on 80 clips the model has never seen.
|
| 81 |
|
| 82 |

|
| 83 |
|
| 84 |
+
| Subset | Accuracy |
|
| 85 |
+
|---|---|
|
| 86 |
+
| Broad held-out (38 sarcastic + 42 sincere) | 41/80 = **51.25%** |
|
| 87 |
+
| Prosody-Pivot Set in eval (audio-decisive clips) | 5/6 = **83%** |
|
| 88 |
+
| Well-formed completions | 79/80 = 98.75% |
|
| 89 |
|
| 90 |
+
The honest read: 51% on the broad set is roughly text-only baseline, which means pyin-derived prosody summary stats don't push a 3B model much beyond what it already does from the transcript. But when the audio is genuinely decisive (the Pivot Set), the trained model uses it: 5/6 correct on those clips, vs 0/6 for a text-only baseline that confidently picks the wrong label.
|
| 91 |
|
| 92 |
+
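The gap between the 0.97 training reward and 51% held-out accuracy is what I read as the anti-memorization signature: if the model had been gaming the reward, train and held-out would match. The two numbers are on different scales (composite reward vs. label accuracy), so treat the size of the gap as indicative rather than exact.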
|
| 93 |
|
| 94 |
+
### Side-by-side
|
| 95 |
|
| 96 |
+
[`docs/side_by_side.html`](docs/side_by_side.html) shows 5 hand-picked clips from the held-out set where text-only Qwen confidently picks the wrong label and the prosody-trained model picks the right one. Tally: baseline 0/5, trained 5/5.
|
| 97 |
|
| 98 |
+
### Training dynamics
|
| 99 |
|
| 100 |
+

|
| 101 |
|
| 102 |
+
Loss + completion length. The reasoning-length floor of 50 words is what keeps `<think>` blocks from collapsing to one-liners.
|
| 103 |
|
| 104 |
---
|
| 105 |
|
| 106 |
## Quick start
|
| 107 |
|
| 108 |
+
Install from this Space:
|
| 109 |
+
|
| 110 |
+
```bash
|
| 111 |
+
pip install git+https://huggingface.co/spaces/aamrinder/subtext-arena
|
| 112 |
+
```
|
| 113 |
+
|
| 114 |
+
Connect to the deployed env:
|
| 115 |
|
| 116 |
```python
|
| 117 |
from subtext_arena import SubtextArenaEnv, SubtextArenaAction
|
|
|
|
| 120 |
base_url="https://aamrinder-subtext-arena.hf.space"
|
| 121 |
).sync() as env:
|
| 122 |
obs = env.reset().observation
|
|
|
|
|
|
|
| 123 |
obs = env.step(SubtextArenaAction(tool="get_transcript")).observation
|
|
|
|
|
|
|
| 124 |
obs = env.step(SubtextArenaAction(tool="get_prosody_features")).observation
|
|
|
|
|
|
|
| 125 |
result = env.step(SubtextArenaAction(
|
| 126 |
tool="submit_belief",
|
| 127 |
tool_args={"label": "sarcastic", "confidence": 0.85},
|
| 128 |
))
|
| 129 |
+
print(result.reward, result.observation.metadata)
|
| 130 |
```
|
| 131 |
|
| 132 |
+
Run locally:
|
| 133 |
|
| 134 |
```bash
|
| 135 |
git clone https://huggingface.co/spaces/aamrinder/subtext-arena
|
| 136 |
cd subtext-arena
|
| 137 |
uv sync
|
| 138 |
+
uv run server
|
| 139 |
```
|
| 140 |
|
| 141 |
+
Reproduce training (about 2 hours on L4, $3 of compute):
|
| 142 |
|
| 143 |
```bash
|
| 144 |
hf jobs uv run --flavor l4x1 -s HF_TOKEN \
|
|
|
|
| 146 |
--with accelerate --with peft --with bitsandbytes \
|
| 147 |
--with "git+https://huggingface.co/spaces/aamrinder/subtext-arena" \
|
| 148 |
-- python -m subtext_arena.train.train_grpo \
|
| 149 |
+
--max-steps 200 --num-generations 4 --lora-r 16 \
|
|
|
|
| 150 |
--eval-ratio 0.2 --n-eval-clips 80 \
|
| 151 |
+
--push-to-hub aamrinder/subtext-arena-grpo
|
|
|
|
| 152 |
```
|
| 153 |
|
| 154 |
+
A Colab-friendly version of the script is in [notebooks/train_grpo_colab.ipynb](notebooks/train_grpo_colab.ipynb).
|
| 155 |
|
| 156 |
---
|
| 157 |
|
| 158 |
+
## Repo layout
|
| 159 |
|
| 160 |
```
|
| 161 |
+
.
|
| 162 |
+
├── client.py SubtextArenaEnv (HTTP + WebSocket client)
|
| 163 |
+
├── models.py Action / Observation Pydantic types
|
| 164 |
+
├── openenv.yaml OpenEnv manifest
|
| 165 |
+
├── pyproject.toml
|
| 166 |
├── server/
|
| 167 |
+
│ ├── app.py FastAPI factory
|
| 168 |
+
│ ├── audio_tools.py prosody / contour text renderers
|
| 169 |
+
│ ├── scenarios.py MUStARD loader + train/eval split
|
| 170 |
+
│ ├── grader.py per-step + final reward
|
| 171 |
+
│ └── subtext_arena_environment.py reset / step / state
|
| 172 |
├── train/
|
| 173 |
+
│ ├── train_grpo.py GRPO training (Unsloth + TRL)
|
| 174 |
+
│ ├── eval_pivot_set.py evaluate any checkpoint
|
| 175 |
+
│ └── curate_pivot_set.py 4-step Pivot Set curation
|
| 176 |
├── notebooks/
|
| 177 |
+
│ └── train_grpo_colab.ipynb Colab-runnable training
|
| 178 |
+
├── data/
|
| 179 |
+
│ ├── sarcasm_data.json MUStARD ground truth (690 clips)
|
| 180 |
+
│ ├── prosody_cache/ 1380 pre-computed prosody jsons
|
| 181 |
+
│ ├── pivot_set.json 32 audio-decisive clips
|
| 182 |
+
│ ├── held_out_eval_run3.json 80-clip generalization eval
|
| 183 |
+
│ └── log_history_run3.json per-step training metrics
|
| 184 |
└── docs/
|
| 185 |
+
├── plots/ reward + training + held-out PNGs
|
| 186 |
+
└── side_by_side.html baseline vs trained, 5 hero clips
|
| 187 |
```
|
| 188 |
|
| 189 |
---
|
| 190 |
|
| 191 |
+
## Where this could go next
|
| 192 |
+
|
| 193 |
+
The env's audio-tool layer is decoupled from the model interface, so a future team can plug in richer features without touching anything else. Specifically:
|
| 194 |
+
|
| 195 |
+
- pyin pitch contour → wav2vec2 / HuBERT prosody embeddings
|
| 196 |
+
- RMS summary → spectrogram patch tokens
|
| 197 |
+
- speaker-anonymous prompts → speaker-aware features
|
| 198 |
+
- 3B text policy → 7B audio LLM (Qwen2-Audio) end-to-end
|
| 199 |
+
|
| 200 |
+
If any of those lift the broad held-out accuracy from 51% toward AMuSeD's 81% F1 (the multimodal SOTA on the same dataset), the env will measure it cleanly because the held-out split and reward are fixed.
|
| 201 |
+
|
| 202 |
+
---
|
| 203 |
|
| 204 |
+
## References
|
| 205 |
|
| 206 |
+
- MUStARD (Castro et al., ACL 2019): https://github.com/soujanyaporia/MUStARD
|
| 207 |
+
- AudioToolAgent (Oct 2025): https://arxiv.org/abs/2510.02995
|
| 208 |
+
- LISTEN benchmark (Oct 2025): https://arxiv.org/abs/2510.10444
|
| 209 |
+
- OpenEnv: https://github.com/meta-pytorch/OpenEnv
|
| 210 |
|
| 211 |
+
Built on OpenEnv v0.2.3.
|
models.py
CHANGED
|
@@ -1,20 +1,4 @@
|
|
| 1 |
-
"""
|
| 2 |
-
|
| 3 |
-
Subtext Arena is an OpenEnv environment where an LLM agent must classify
|
| 4 |
-
sarcastic vs sincere speech by investigating both the literal transcript AND
|
| 5 |
-
the prosody (pitch, energy, pause patterns) of an audio clip.
|
| 6 |
-
|
| 7 |
-
The agent picks one of:
|
| 8 |
-
- get_transcript() - returns the literal transcript and the
|
| 9 |
-
preceding conversational context
|
| 10 |
-
- get_prosody_features(span) - pitch_var, energy_var, voiced_ratio,
|
| 11 |
-
pre_pause_ms, pause list
|
| 12 |
-
- get_pitch_contour(span) - coarse Hz contour over the span
|
| 13 |
-
- submit_belief(label, conf) - terminates the episode with the answer
|
| 14 |
-
|
| 15 |
-
`tool_args` is a free-form dict so we don't need a separate Action class
|
| 16 |
-
per tool; the env validates the shape per tool name.
|
| 17 |
-
"""
|
| 18 |
from __future__ import annotations
|
| 19 |
|
| 20 |
from typing import Any, Dict, List, Optional
|
|
@@ -44,12 +28,7 @@ class SubtextArenaAction(Action):
|
|
| 44 |
|
| 45 |
|
| 46 |
class SubtextArenaObservation(Observation):
|
| 47 |
-
"""Observation returned after each action.
|
| 48 |
-
|
| 49 |
-
`tool_output` is the human-readable text describing what the tool returned
|
| 50 |
-
(transcript text, prosody summary, pitch contour). Empty for the
|
| 51 |
-
submit_belief action (which terminates the episode).
|
| 52 |
-
"""
|
| 53 |
|
| 54 |
clip_id: str = Field(default="", description="MUStARD clip identifier (e.g. '1_60')")
|
| 55 |
speaker: str = Field(default="", description="Speaker name when available (Friends/BBT)")
|
|
|
|
| 1 |
+
"""Action and observation models for Subtext Arena."""
|
| 2 |
from __future__ import annotations
|
| 3 |
|
| 4 |
from typing import Any, Dict, List, Optional
|
|
|
|
| 28 |
|
| 29 |
|
| 30 |
class SubtextArenaObservation(Observation):
|
| 31 |
+
"""Observation returned after each action."""
|
| 32 |
|
| 33 |
clip_id: str = Field(default="", description="MUStARD clip identifier (e.g. '1_60')")
|
| 34 |
speaker: str = Field(default="", description="Speaker name when available (Friends/BBT)")
|
server/grader.py
CHANGED
|
@@ -1,33 +1,19 @@
|
|
| 1 |
-
"""
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
Note: there is ALSO a transcript-level reward in research/subtext_solutions/reward.py
|
| 9 |
-
which is used by GRPO training (it scores entire rollouts at once). The
|
| 10 |
-
relationship:
|
| 11 |
-
- Training-time (GRPO): research/subtext_solutions/reward.py runs over the
|
| 12 |
-
full rollout transcript and returns one scalar.
|
| 13 |
-
- Env-time (this file): per-step grader returns deltas as the agent acts;
|
| 14 |
-
the sum of step rewards over an episode equals
|
| 15 |
-
(or is proxy for) the GRPO reward at termination.
|
| 16 |
-
|
| 17 |
-
This dual layer means the env is interactive (judges can poke at it from a
|
| 18 |
-
notebook and see immediate rewards) AND trainable (GRPO sees a clean per-
|
| 19 |
-
rollout scalar). Both are exposed.
|
| 20 |
"""
|
| 21 |
from __future__ import annotations
|
| 22 |
|
| 23 |
from typing import Any, Dict
|
| 24 |
|
| 25 |
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
R_BAD_ARGS = -0.05 # penalty for malformed args
|
| 31 |
|
| 32 |
|
| 33 |
def step_reward(tool_used: str, error: str | None) -> float:
|
|
@@ -52,21 +38,18 @@ def final_reward(
|
|
| 52 |
n_audio_calls: int,
|
| 53 |
n_total_calls: int,
|
| 54 |
) -> Dict[str, float]:
|
| 55 |
-
"""Reward
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
- no submission -0.30
|
| 68 |
-
- too many calls -0.20
|
| 69 |
-
- pivot + no audio + wrong -0.50
|
| 70 |
"""
|
| 71 |
components: Dict[str, float] = {
|
| 72 |
"correctness": 0.0,
|
|
|
|
| 1 |
+
"""Per-step + final-action reward grader for the multi-step interactive env.
|
| 2 |
+
|
| 3 |
+
The training script in train/train_grpo.py uses a single-shot reward (in
|
| 4 |
+
train_grpo.make_reward_fn) that scores the whole rollout at once. This
|
| 5 |
+
file is what the env returns step-by-step when an agent walks it
|
| 6 |
+
interactively (e.g. from the HF Space web UI).
|
| 7 |
"""
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
from typing import Any, Dict
|
| 11 |
|
| 12 |
|
| 13 |
+
R_AUDIO_TOOL_USE = 0.05
|
| 14 |
+
R_TRANSCRIPT_USE = 0.02
|
| 15 |
+
R_BAD_TOOL = -0.10
|
| 16 |
+
R_BAD_ARGS = -0.05
|
|
|
|
| 17 |
|
| 18 |
|
| 19 |
def step_reward(tool_used: str, error: str | None) -> float:
|
|
|
|
| 38 |
n_audio_calls: int,
|
| 39 |
n_total_calls: int,
|
| 40 |
) -> Dict[str, float]:
|
| 41 |
+
"""Reward computed when submit_belief terminates the episode.
|
| 42 |
+
|
| 43 |
+
Components:
|
| 44 |
+
correctness confidence-weighted match against gold
|
| 45 |
+
prosody_grounding 1.0 if any audio-tool call, 0.4 otherwise (0.0 on Pivot)
|
| 46 |
+
tool_parsimony 1.0 for 1-3 calls, 0.6 for 4-5, 0.0 for >5
|
| 47 |
+
format_ok 1.0 if a valid label was submitted
|
| 48 |
+
|
| 49 |
+
Penalties:
|
| 50 |
+
no submission -0.30
|
| 51 |
+
too many calls -0.20
|
| 52 |
+
pivot + no audio + wrong -0.50
|
| 53 |
"""
|
| 54 |
components: Dict[str, float] = {
|
| 55 |
"correctness": 0.0,
|
server/scenarios.py
CHANGED
|
@@ -1,16 +1,4 @@
|
|
| 1 |
-
"""
|
| 2 |
-
|
| 3 |
-
Loads MUStARD ground-truth (sarcasm labels + transcripts + context) and the
|
| 4 |
-
pre-computed prosody features for all 690 utterances.
|
| 5 |
-
|
| 6 |
-
The scenario_dict pattern mirrors the SRE R1 pattern: each clip is a "case"
|
| 7 |
-
the agent must investigate and classify. Difficulty buckets:
|
| 8 |
-
- easy: no Pivot Set membership; baseline text-only Qwen would likely get this right
|
| 9 |
-
- medium: borderline confidence; mixed text + prosody required
|
| 10 |
-
- hard: Prosody-Pivot Set — text-only is wrong; audio cues flip the answer
|
| 11 |
-
|
| 12 |
-
For training, we mix clips across all difficulties. Eval reports per-bucket.
|
| 13 |
-
"""
|
| 14 |
from __future__ import annotations
|
| 15 |
|
| 16 |
import json
|
|
@@ -23,24 +11,7 @@ from typing import Any, Dict, List, Optional
|
|
| 23 |
DEFAULT_DATA_ROOT = Path(__file__).resolve().parent.parent / "data"
|
| 24 |
|
| 25 |
|
| 26 |
-
def load_scenarios(
|
| 27 |
-
data_root: Optional[Path] = None,
|
| 28 |
-
) -> Dict[str, Dict[str, Any]]:
|
| 29 |
-
"""Load all 690 utterances with sarcasm labels + prosody features.
|
| 30 |
-
|
| 31 |
-
Returns:
|
| 32 |
-
Dict keyed by clip_id with structure:
|
| 33 |
-
{
|
| 34 |
-
'utterance': str (literal transcript)
|
| 35 |
-
'speaker': str
|
| 36 |
-
'context': list[str] (lines spoken before)
|
| 37 |
-
'context_speakers': list[str]
|
| 38 |
-
'sarcasm': bool (ground truth label)
|
| 39 |
-
'show': str
|
| 40 |
-
'prosody': dict (pre-computed features) or None
|
| 41 |
-
'is_pivot': bool (filled in later by curate_pivot_set)
|
| 42 |
-
}
|
| 43 |
-
"""
|
| 44 |
root = data_root or DEFAULT_DATA_ROOT
|
| 45 |
sarcasm_path = root / "sarcasm_data.json"
|
| 46 |
prosody_dir = root / "prosody_cache" / "utterances"
|
|
@@ -82,18 +53,8 @@ def load_scenarios(
|
|
| 82 |
return scenarios
|
| 83 |
|
| 84 |
|
| 85 |
-
def sample_clip(
|
| 86 |
-
|
| 87 |
-
rng: random.Random,
|
| 88 |
-
pivot_oversample_factor: int = 3,
|
| 89 |
-
) -> str:
|
| 90 |
-
"""Sample a clip ID from the scenarios.
|
| 91 |
-
|
| 92 |
-
Pivot-set clips are oversampled (default 3x) because they are the highest-
|
| 93 |
-
signal training instances — they are exactly where text-only fails and
|
| 94 |
-
prosody must carry the signal.
|
| 95 |
-
"""
|
| 96 |
-
pool: List[str] = []
|
| 97 |
for clip_id, entry in scenarios.items():
|
| 98 |
weight = pivot_oversample_factor if entry["is_pivot"] else 1
|
| 99 |
pool.extend([clip_id] * weight)
|
|
|
|
| 1 |
+
"""Loads MUStARD ground truth + cached prosody features for all 690 clips."""
|
| 2 |
from __future__ import annotations
|
| 3 |
|
| 4 |
import json
|
|
|
|
| 11 |
DEFAULT_DATA_ROOT = Path(__file__).resolve().parent.parent / "data"
|
| 12 |
|
| 13 |
|
| 14 |
+
def load_scenarios(data_root: Optional[Path] = None) -> Dict[str, Dict[str, Any]]:
|
| 15 |
root = data_root or DEFAULT_DATA_ROOT
|
| 16 |
sarcasm_path = root / "sarcasm_data.json"
|
| 17 |
prosody_dir = root / "prosody_cache" / "utterances"
|
|
|
|
| 53 |
return scenarios
|
| 54 |
|
| 55 |
|
| 56 |
+
def sample_clip(scenarios, rng, pivot_oversample_factor=3):
|
| 57 |
+
pool = []
|
| 58 |
for clip_id, entry in scenarios.items():
|
| 59 |
weight = pivot_oversample_factor if entry["is_pivot"] else 1
|
| 60 |
pool.extend([clip_id] * weight)
|