---
title: Subtext Arena
emoji: 🎧
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- rl
- audio
- sarcasm
---
# Subtext Arena
This repo wraps sarcasm detection from audio prosody as an **OpenEnv environment**. It is built on MUStARD (a 690-clip sarcasm-in-sitcom dataset) and a small prosody pipeline (pyin pitch, RMS energy, pause timing) baked straight into the env. The agent sees a transcript plus a text rendering of the prosody, does a chain of thought, and finishes with `<final>{"label":..., "confidence":...}</final>`. Reward is graded on correctness, reasoning, and format.
I trained a baseline on top of this: Qwen2.5-3B-Instruct, LoRA r=16, 200 GRPO steps via Unsloth + TRL, on a strict 80/20 train/test split. The trained model is at [aamrinder/subtext-arena-grpo](https://huggingface.co/aamrinder/subtext-arena-grpo). Total compute for the whole run was about $11.
Built for the OpenEnv Hackathon Finale (Apr 2026, Bangalore). Theme: World Modeling.
Team **BalleBalle**: Amrinder Singh, Shubham Kapoor.
**Links:**
- Environment (live HF Space, with web UI): https://huggingface.co/spaces/aamrinder/subtext-arena
- Trained model: https://huggingface.co/aamrinder/subtext-arena-grpo
- Blog post (under 2 min read): [blog.md](blog.md)
- Side-by-side demo (5 hero clips, baseline vs trained): [docs/side_by_side.html](docs/side_by_side.html)
- Colab notebook (re-run the training): https://colab.research.google.com/drive/1PL_nc6oiLahPiReKG-x8CWCBarCUR3qc?usp=sharing
---
Detecting sarcasm from audio prosody is honestly not a solved problem. GPT-4o sits at 67% Macro-F1 on MUStARD++, meaning roughly one time in three one of the most capable models around is fooled by tone of voice alone. There is a documented modality gap here (the [LISTEN paper](https://arxiv.org/abs/2510.10444) from October 2025), and as of today the gap is still wide open. I could not find a public RL training environment for this task, so one evening I sat down and made one.
The closest prior work is [AudioToolAgent](https://arxiv.org/abs/2510.02995) (Oct 2025), which prompted a frontier LLM to use audio analysis tools. Same kind of architecture really, but they didn't actually train anything. Subtext Arena is the training-side counterpart of that idea.
---
Each episode is one MUStARD clip. The prompt the agent sees has the transcript (target line plus 1-7 lines of preceding conversation, with speaker tags), prosody features rendered as text (pitch mean and variability, contour shape, energy mean and variability, voiced ratio, pre-utterance silence, internal pauses with timestamps), and a pitch contour drawn out as an 8-level ASCII sparkline.
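For intuition, here is a simplified sketch of that sparkline rendering (the env's actual renderer lives in `server/audio_tools.py`; the function below is an illustrative stand-in, not a copy of it):

```python
BARS = "▁▂▃▄▅▆▇█"  # 8 levels, low pitch → high pitch

def ascii_sparkline(f0_hz: list[float]) -> str:
    """Render a frame-level pitch track as an 8-level ASCII sparkline."""
    voiced = [f for f in f0_hz if f > 0]   # pyin marks unvoiced frames as 0 / NaN
    if not voiced:
        return "(no voiced frames)"
    lo, hi = min(voiced), max(voiced)
    span = (hi - lo) or 1.0                # flat contour: avoid division by zero
    return "".join(
        BARS[int(7 * (f - lo) / span)] if f > 0 else " "
        for f in f0_hz
    )

# e.g. a rise, a pause, then a fall
print(ascii_sparkline([180, 185, 210, 240, 0, 0, 150, 140]))
```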
The model emits something like:
```
<think>
... reasoning over the prosodic and lexical cues ...
</think>
<final>{"label":"sarcastic","confidence":0.85}</final>
```
The reward is a composable rubric:
```
0.70 * correctness (confidence-weighted match against gold)
0.15 * reasoning_length (penalty if <think> is under 50 words or over 300)
0.15 * format (1.0 if a valid <final> JSON parses, 0 otherwise)
```
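A minimal sketch of how that rubric could be scored, assuming the completion arrives as one string (the env's actual grader is `server/grader.py`; its exact shaping may differ):

```python
import json
import re

WEIGHTS = {"correctness": 0.70, "reasoning_length": 0.15, "format": 0.15}

def grade(completion: str, gold_label: str) -> float:
    """Composite reward: confidence-weighted correctness + reasoning length + format."""
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    final = re.search(r"<final>(.*?)</final>", completion, re.DOTALL)

    # format: 1.0 only if the <final> block parses as JSON
    fmt, label, conf = 0.0, None, 0.0
    if final:
        try:
            payload = json.loads(final.group(1))
            label = payload.get("label")
            conf = float(payload.get("confidence", 0.5))
            fmt = 1.0
        except (ValueError, TypeError, AttributeError):
            pass

    # correctness: confidence-weighted match against the gold label
    correctness = conf if label == gold_label else 0.0

    # reasoning length: full credit only if <think> is 50-300 words
    n_words = len(think.group(1).split()) if think else 0
    length_ok = 1.0 if 50 <= n_words <= 300 else 0.0

    return (WEIGHTS["correctness"] * correctness
            + WEIGHTS["reasoning_length"] * length_ok
            + WEIGHTS["format"] * fmt)
```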
The env also exposes four tools (`get_transcript`, `get_prosody_features`, `get_pitch_contour`, `submit_belief`) for interactive multi-step inference. That is what you can poke at on this Space's web UI itself. Training uses the single-prompt format above so it matches the deck-linked Wordle and Sudoku notebooks.
---
I trained for 200 steps with `num_generations=4`, LoRA r=16, dropout 0.05, on an L4 ($0.80/hr). The split is 552 train clips and 138 eval clips, deterministically seeded so judges can reproduce the same numbers. Pivot oversample is 2x. Class balance is enforced in the dataset construction itself.
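Mapped onto TRL and PEFT, the run configuration looks roughly like this (a sketch only; the full script, including reward wiring and Unsloth patching, is `train/train_grpo.py`, and values marked as assumptions are not stated in this README):

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA settings used above; lora_alpha is an assumption, not taken from the run.
lora_cfg = LoraConfig(
    r=16,
    lora_dropout=0.05,
    lora_alpha=32,
    task_type="CAUSAL_LM",
)

# 200 GRPO steps with 4 group-relative generations per prompt.
# learning_rate and batch size here are illustrative assumptions.
grpo_cfg = GRPOConfig(
    output_dir="subtext-arena-grpo",
    max_steps=200,
    num_generations=4,
    per_device_train_batch_size=4,
    learning_rate=1e-5,
    logging_steps=1,
)
```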

Reward climbs from 0.335 to 0.97 on training prompts. The shaded band around the line is within-batch rollout variance. When it is narrow, the four group-relative generations are mostly agreeing; when it goes wide, the model is exploring.
After training, I ran greedy inference on 80 clips the model had never seen during training.

| Subset | Result |
|---|---|
| Broad held-out (38 sarcastic + 42 sincere) | 41/80 = **51.25%** |
| Prosody-Pivot Set in eval (audio-decisive clips) | 5/6 = **83%** |
| Well-formed completions | 79/80 = 98.75% |
The honest read: 51% on the broad set is roughly what a plain text-only baseline would also do, meaning pyin-derived prosody summary stats are still too simple to push a 3B model much beyond what it already gets from just reading the transcript. Fair hit, no hiding it. But when the audio is genuinely decisive (the Pivot Set), the trained model actually uses it. 5 out of 6 correct on those clips, vs 0/6 for a text-only baseline that confidently picks the wrong label every single time.
The gap between 0.97 training reward and 51% held-out accuracy is itself the anti-memorization signal: if the model had just been gaming the reward, train and held-out would have matched.
[`docs/side_by_side.html`](docs/side_by_side.html) shows 5 hand-picked clips from the held-out set where text-only Qwen confidently picks the wrong label and the prosody-trained model picks the right one. Tally: baseline 0/5, trained 5/5.
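To run the same kind of spot-check yourself, here is a minimal greedy-decoding sketch. It assumes the hub repo holds a LoRA adapter on top of Qwen2.5-3B-Instruct; if it is a merged checkpoint, load it directly with `AutoModelForCausalLM` instead:

```python
import json
import re

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-3B-Instruct"
tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "aamrinder/subtext-arena-grpo")

def predict(prompt: str) -> dict:
    """Greedy-decode one episode prompt and parse the <final> JSON."""
    ids = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(ids, max_new_tokens=512, do_sample=False)
    text = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    m = re.search(r"<final>(.*?)</final>", text, re.DOTALL)
    return json.loads(m.group(1)) if m else {"label": None, "confidence": None}
```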

Loss plus completion length over the 200 steps. The reasoning-length floor of 50 words is what keeps the `<think>` block from collapsing into a one-liner.
---
Install from this Space:
```bash
pip install git+https://huggingface.co/spaces/aamrinder/subtext-arena
```
Connect to the deployed env:
```python
from subtext_arena import SubtextArenaEnv, SubtextArenaAction

with SubtextArenaEnv(
    base_url="https://aamrinder-subtext-arena.hf.space"
).sync() as env:
    obs = env.reset().observation
    obs = env.step(SubtextArenaAction(tool="get_transcript")).observation
    obs = env.step(SubtextArenaAction(tool="get_prosody_features")).observation
    result = env.step(SubtextArenaAction(
        tool="submit_belief",
        tool_args={"label": "sarcastic", "confidence": 0.85},
    ))
    print(result.reward, result.observation.metadata)
```
Run locally:
```bash
git clone https://huggingface.co/spaces/aamrinder/subtext-arena
cd subtext-arena
uv sync
uv run server
```
Reproduce the training (about 2 hours on L4, $3 of compute):
```bash
hf jobs uv run --flavor l4x1 -s HF_TOKEN \
--with "trl>=0.11" --with "transformers>=4.46" --with datasets \
--with accelerate --with peft --with bitsandbytes \
--with "git+https://huggingface.co/spaces/aamrinder/subtext-arena" \
-- python -m subtext_arena.train.train_grpo \
--max-steps 200 --num-generations 4 --lora-r 16 \
--eval-ratio 0.2 --n-eval-clips 80 \
--push-to-hub aamrinder/subtext-arena-grpo
```
A Colab-friendly version of the same script is at https://colab.research.google.com/drive/1PL_nc6oiLahPiReKG-x8CWCBarCUR3qc?usp=sharing.
---
```
.
├── client.py                         SubtextArenaEnv (HTTP + WebSocket client)
├── models.py                         Action / Observation Pydantic types
├── openenv.yaml                      OpenEnv manifest
├── pyproject.toml
├── server/
│   ├── app.py                        FastAPI factory
│   ├── audio_tools.py                prosody / contour text renderers
│   ├── scenarios.py                  MUStARD loader + train/eval split
│   ├── grader.py                     per-step + final reward
│   └── subtext_arena_environment.py  reset / step / state
├── train/
│   ├── train_grpo.py                 GRPO training (Unsloth + TRL)
│   ├── eval_pivot_set.py             evaluate any checkpoint
│   └── curate_pivot_set.py           4-step Pivot Set curation
├── notebooks/
│   └── train_grpo_colab.ipynb        Colab-runnable training
├── data/
│   ├── sarcasm_data.json             MUStARD ground truth (690 clips)
│   ├── prosody_cache/                1380 pre-computed prosody jsons
│   ├── pivot_set.json                32 audio-decisive clips
│   ├── held_out_eval_run3.json       80-clip generalization eval
│   └── log_history_run3.json         per-step training metrics
└── docs/
    ├── plots/                        reward + training + held-out PNGs
    └── side_by_side.html             baseline vs trained, 5 hero clips
```
---
This is just the start; I am not stopping here. The whole point of putting in the effort to build an environment is that it stays alive after the hackathon is done, the leaderboard is locked, and the credits run out. The audio-tool layer of the env is decoupled from the model interface, so anyone with a richer feature stack can plug in straight on top without touching anything else. Specifically:
- pyin pitch contour → wav2vec2 / HuBERT prosody embeddings
- RMS summary → spectrogram patch tokens
- speaker-anonymous prompts → speaker-aware features
- 3B text policy → 7B audio LLM (Qwen2-Audio), end-to-end
- sarcasm only → polite refusals, hidden anger, suppressed feelings, basically any case where the words and the actual meaning don't agree
If any of these move the broad held-out number from 51% toward AMuSeD's 81% F1 multimodal SOTA on the same dataset, the env will measure it cleanly, because the held-out split and reward function are fixed.
---
References:
- MUStARD (Castro et al., ACL 2019): https://github.com/soujanyaporia/MUStARD
- AudioToolAgent (Oct 2025): https://arxiv.org/abs/2510.02995
- LISTEN benchmark (Oct 2025): https://arxiv.org/abs/2510.10444
- OpenEnv: https://github.com/meta-pytorch/OpenEnv
Built on OpenEnv v0.2.3.