Upload folder using huggingface_hub
- README.md +156 -142
- train/side_by_side.py +10 -4

README.md
CHANGED
@@ -13,76 +13,132 @@ tags:
## Tools and reward

The agent has up to **6 tool calls** to investigate, then must `submit_belief`.

| Tool | What it returns |
|---|---|
| `get_transcript` | The literal utterance + 1–7 lines of preceding conversational context |
| `get_prosody_features` | Pitch mean / variability, energy mean / variability, voiced ratio, pre-utterance silence, internal pauses |
| `get_pitch_contour` | ASCII sparkline of the f0 contour over 30 samples + range + rising/falling/flat trend |
| `submit_belief` | `{"label": "sarcastic" \| "sincere", "confidence": 0..1}` – terminates the episode |
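The loop this table implies can be sketched end-to-end. The stub below stands in for the deployed client, so the sketch is runnable on its own; the real `SubtextArenaEnv` action/observation shapes may differ, and only the tool names come from the table above:

```python
# Stub episode: inspect with up to 6 tool calls, then submit_belief.
# StubEnv is a stand-in for the deployed SubtextArenaEnv client.

class StubEnv:
    def __init__(self):
        self.calls = 0

    def step(self, tool, args=None):
        """Each call counts against the 6-call budget; submit_belief ends it."""
        self.calls += 1
        if tool == "submit_belief":
            return {"done": True, "reward_final": None}  # env grades here
        canned = {
            "get_transcript": "[SHELDON] It's just a privilege to watch...",
            "get_prosody_features": {"pitch_mean_hz": 164, "pitch_var_hz": 51},
            "get_pitch_contour": {"trend": "falling"},
        }
        return {"done": False, "tool_result": canned[tool]}

env = StubEnv()
transcript = env.step("get_transcript")["tool_result"]
prosody = env.step("get_prosody_features")["tool_result"]
# Decide from the cues, then terminate (3 calls total, within budget).
result = env.step("submit_belief", {"label": "sarcastic", "confidence": 0.85})
print(env.calls, result["done"])  # 3 True
```

Staying at 3 calls or fewer also maximizes the `tool_parsimony` term of the reward.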

The episode ends at submission and the env computes a composable final reward:

```
final_reward = 0.50 · correctness (confidence-weighted match against gold)
             + 0.25 · prosody_grounding (1.0 if any audio call; 0.4 on non-Pivot if none; 0.0 on Pivot if none)
             + 0.15 · tool_parsimony (1.0 for 1–3 calls; 0.6 for 4–5; 0.0 for >5)
             + 0.10 · format_ok (well-formed submit_belief)
             + penalties:
                 -0.30 if no submission
                 -0.20 if more than 5 tool calls
                 -0.50 if Pivot + zero audio calls + wrong answer
```
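As a sanity check, the rubric above can be computed directly. The function below is an illustrative re-implementation with assumed argument names, not the env's actual grader:

```python
def final_reward(correct_prob, audio_calls, tool_calls, submitted,
                 well_formed, is_pivot, answer_right):
    """Illustrative re-implementation of the rubric; not the env's grader."""
    r = 0.50 * correct_prob                  # confidence-weighted match
    if audio_calls > 0:
        r += 0.25                            # prosody grounding
    elif not is_pivot:
        r += 0.25 * 0.4                      # partial credit off-Pivot
    if 1 <= tool_calls <= 3:
        r += 0.15                            # tool parsimony
    elif tool_calls <= 5:
        r += 0.15 * 0.6
    if well_formed:
        r += 0.10                            # well-formed submit_belief
    if not submitted:
        r -= 0.30
    if tool_calls > 5:
        r -= 0.20
    if is_pivot and audio_calls == 0 and not answer_right:
        r -= 0.50
    return r

# Best case: confident correct answer, audio consulted, 3 calls, valid format.
print(round(final_reward(1.0, audio_calls=1, tool_calls=3, submitted=True,
                         well_formed=True, is_pivot=True, answer_right=True), 2))  # 1.0
```

Note how a Pivot clip answered wrong with zero audio calls goes negative even with perfect format, which is the rubric's anti-gaming teeth.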
@@ -108,102 +164,45 @@ uv sync
## Training (Unsloth + TRL GRPO, single-step CoT)

| Path | Use when |
|---|---|
| `train/train_grpo.py` | Long runs on HF Jobs (~12 h on T4 for 200 steps) |
| `notebooks/train_grpo_colab.ipynb` | Re-runnable Colab notebook judges can open |

Both invoke a Qwen2.5-3B-Instruct policy via **Unsloth** + **TRL `GRPOTrainer`** (LoRA r=16, 4-bit; fits a T4 at $0.60/hr).

### Architecture: single-step chain-of-thought classification

Each training rollout:

1. The training script builds **one prompt** for one MUStARD clip: system + transcript + prosody features + pitch contour, all in the user message (the same view an agent would get if it called `get_transcript`, `get_prosody_features`, and `get_pitch_contour` in sequence).
2. The model emits **one completion**:

   ```
   <think>your reasoning over the prosody and lexical cues</think>
   <final>{"label":"sarcastic"|"sincere","confidence":0..1}</final>
   ```

3. The reward function parses `<final>` and scores it against the gold label.
4. GRPO's group-relative advantage updates the LoRA weights.

This sidesteps a known TRL constraint: `GRPOTrainer.train()` does single-shot generate-then-score (the same pattern as the deck-linked Wordle and Sudoku notebooks). By pre-rendering the tool outputs into the prompt, the model gets enough information at generation time to reason about the actual clip instead of producing a tool-call script blind.

The Subtext Arena env still supports multi-step tool calling at inference. That's our HF Space demo and what judges interact with. Training-time and inference-time architecture are deliberately decoupled.

High-level shape:

```python
from unsloth import FastLanguageModel
from trl import GRPOTrainer, GRPOConfig
from train.train_grpo import build_dataset, make_reward_fn
from server.scenarios import load_scenarios

model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-3B-Instruct", max_seq_length=4096, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Each row is a full briefing for one MUStARD clip (transcript + prosody + contour).
# Pivot Set is oversampled 3x.
dataset = build_dataset(load_scenarios(), n_rows=600)

# Reward = 0.70 correctness + 0.15 reasoning_length + 0.15 format
reward_fn = make_reward_fn()

GRPOTrainer(
    model=model,
    reward_funcs=reward_fn,
    args=GRPOConfig(num_generations=4, max_completion_length=768,
                    per_device_train_batch_size=1, learning_rate=5e-6,
                    max_steps=200, bf16=True, gradient_checkpointing=True),
    train_dataset=dataset,
    processing_class=tok,
).train()
```

**Pivot Set curation** (`train/curate_pivot_set.py`): a 4-step CLI that runs the text-only Qwen baseline on all 690 utterances, filters confidently-wrong clips, rates audio audibility via a Gradio UI, and finalizes a balanced 50-clip set written to `data/pivot_set.json`.

**Eval** (`train/eval_pivot_set.py`): generates a single completion per Pivot clip × 3 seeds, parses the predicted label, and computes per-clip majority accuracy. This produces the headline `X/50 → Y/50` number.
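The per-clip majority vote reduces to a few lines; this is a sketch of the idea, and `eval_pivot_set.py` may differ in detail:

```python
from collections import Counter

def majority_accuracy(preds_per_clip, gold):
    """preds_per_clip: {clip_id: [label_seed0, label_seed1, label_seed2]}.
    A clip counts as correct when its majority label matches the gold label."""
    correct = 0
    for clip_id, preds in preds_per_clip.items():
        majority = Counter(preds).most_common(1)[0][0]
        correct += majority == gold[clip_id]
    return correct / len(preds_per_clip)

preds = {"1_8136": ["sarcastic", "sarcastic", "sincere"],
         "2_0042": ["sincere", "sincere", "sincere"]}
gold = {"1_8136": "sarcastic", "2_0042": "sarcastic"}
print(majority_accuracy(preds, gold))  # 1 of 2 clips correct -> 0.5
```

Majority over 3 seeds damps sampling noise, so the headline number reflects the policy rather than a lucky decode.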
---
## Results (expected)

*Plots and the headline Pivot-Set number will be filled in here at submission time. Expected:*

- 3-line reward decomposition: `format` saturates first (~step 80, the model learns to emit valid `<final>` JSON), `reasoning_length` follows (~step 100–150), `correctness` climbs slowest (~step 150–200, the actual sarcasm-detection signal).
- Side-by-side step-0 vs step-final completions on 5 hand-picked Pivot clips: a visible change in reasoning quality (the trained agent grounds its conclusion in the prosody markers).
---
- **Voice agents**: customer-support bots that mistake sarcastic complaints for sincere praise are a known failure mode. Prosody-aware classification is a clean precursor.
- **Multimodal RL frontier**: pure-audio + subtext + verifiable RL reward is genuinely uncovered ground in the OpenEnv landscape; this environment ships a reproducible recipe.
---
- audio
- sarcasm
- theory-of-mind
- benchmark
---

# Subtext Arena

> An OpenEnv reproducible benchmark for **prosody-grounded subtext reasoning**.

A clip plays. The words are: *"Yeah, I'm really looking forward to it."* Sincere or sarcastic? You can read it 10 times and still need the audio. **Subtext Arena is an OpenEnv environment that lets you train and evaluate language models on exactly that task** – using transcripts + prosody features extracted from real MUStARD audio.

Built for the OpenEnv Hackathon Finale (Apr 25–26, 2026, Bangalore).
Theme: **#3 World Modeling** – modeling hidden speaker intent from observable acoustic cues.

**Submission URL**: https://huggingface.co/spaces/aamrinder/subtext-arena
**Trained model**: https://huggingface.co/aamrinder/subtext-arena-grpo
**Live demo (HTML)**: [docs/side_by_side.html](https://huggingface.co/spaces/aamrinder/subtext-arena/blob/main/docs/side_by_side.html)
**Re-runnable Colab**: [notebooks/train_grpo_colab.ipynb](notebooks/train_grpo_colab.ipynb)

---

## What's actually new

| | What's been done | What this env adds |
|---|---|---|
| MUStARD itself | 2019 dataset, used as supervised classifier benchmark | Wrapped as an **OpenEnv environment** with a verifiable RL reward |
| AudioToolAgent (Oct 2025) | Prompted a frontier LLM with audio tools; no training | We **RL-train** a small (3B) model on the same architecture |
| EmotionThinker (Mar 2026) | GRPO for emotion classification, single-modality | Subtext-from-prosody (lexical–acoustic conflict, not just emotion) |
| SarcasmMiner (Mar 2026) | Video+audio, big model, big GPU | **Pure audio**, T4-fits, $8 of compute |

The **environment** is the contribution. It pre-processes 1,380 MUStARD clips into prosody descriptors at env-init time (zero per-step audio cost), exposes the standard Gym API (`reset`, `step`, `state`), and ships a multi-criterion reward that's hard to game without genuine prosody-grounded reasoning. Trained models are baselines; the env is the durable artifact.

---

## What the agent sees, does, gets rewarded for

```
┌──────────────────────────────────────────┐
│ PROMPT (what the model sees)             │
│                                          │
│ System: detect sarcasm vs sincere        │
│                                          │
│ User:                                    │
│   === Transcript ===                     │
│   [LEONARD] I never would have ...       │
│   Target: [SHELDON] It's just a          │
│   privilege to watch...                  │
│                                          │
│   === Prosody features ===               │
│   Pitch mean=164Hz, var=51Hz [HIGH]      │
│   Pre-pause: 320ms                       │
│   Internal pause 1.84-2.05s              │
│                                          │
│   === Pitch contour ===                  │
│   ▁▂▃▅▆▇▆▅▃▂▁                            │
│                                          │
│ Decide: sarcastic or sincere?            │
└──────────────────┬───────────────────────┘
                   │ model.generate()
                   ▼
┌──────────────────────────────────────────┐
│ COMPLETION                               │
│ <think> ... reasoning grounded in        │
│ pitch / energy / pause / context</think> │
│ <final>{"label":"sarcastic",             │
│ "confidence":0.85}</final>               │
└──────────────────┬───────────────────────┘
                   │ env scores it
                   ▼
┌──────────────────────────────────────────┐
│ REWARD (composable rubric)               │
│ 0.70 × correctness (conf-weighted)       │
│ 0.15 × reasoning_length (50–150 wd)      │
│ 0.15 × format (valid <final> JSON)       │
└──────────────────────────────────────────┘
```

The full env supports interactive multi-step tool use too (`get_transcript`, `get_prosody_features`, `get_pitch_contour`, `submit_belief`) – that's the **inference-time interface** judges interact with on the HF Space. Training uses the single-prompt format above (matching the deck-linked Wordle/Sudoku notebooks).
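Parsing that completion format and computing the confidence-weighted correctness term can be sketched as follows; this is illustrative only, and the env's actual grader may differ:

```python
import json
import re

FINAL_RE = re.compile(r"<final>(\{.*?\})</final>", re.DOTALL)

def score_completion(completion, gold_label):
    """Returns the 0.70-weighted correctness term; 0.0 on malformed output."""
    m = FINAL_RE.search(completion)
    if not m:
        return 0.0
    try:
        parsed = json.loads(m.group(1))
        label = parsed["label"]
        conf = float(parsed["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0
    # Confidence-weighted: confident-right scores high, confident-wrong low.
    match = conf if label == gold_label else 1.0 - conf
    return 0.70 * match

c = ('<think>flat pitch, long pre-pause</think>'
     '<final>{"label":"sarcastic","confidence":0.85}</final>')
print(round(score_completion(c, "sarcastic"), 3))  # 0.7 * 0.85 = 0.595
```

Because malformed output scores zero here, format compliance is learned early, which matches the reward-decomposition dynamics described below.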

---

## Results (real numbers, real held-out set)

We trained `Qwen/Qwen2.5-3B-Instruct` for 200 GRPO steps on a strict 80/20 train/test split (clips in eval are NEVER seen during training). Stack: **Unsloth + HF TRL + LoRA r=16 + 4-bit + bnb** (deck-named, requirement #2).

### Training reward curve (real)

![Reward curve](docs/plots/reward_curve.png)

Reward climbs from `0.335` (random baseline) to `0.97` over 200 steps. The shaded band is `±1 std` of within-batch rollout variance – when narrow, the 4 group-relative generations agree; when wide, the model is exploring.
+
### Held-out generalization (the honest test)
|
| 106 |
|
| 107 |
+

|
| 108 |
|
| 109 |
+
On 80 clips the trained model has **never seen**:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 110 |
|
| 111 |
+
| Subset | Accuracy | What it means |
|---|---|---|
| Broad held-out (balanced 38 sarc / 42 sinc) | **51.25%** (41/80) | ≈ text-only baseline. The model generalizes the *format and reasoning style* but doesn't dramatically outperform text-only Qwen on novel clips. |
| **Prosody-Pivot Set** (audio-decisive clips) | **5/6 = 83%** ✅ | When prosody flips the answer, the trained model uses it. This is the audio-mattering signal. |
| Well-formed completions | 98.75% | Format compliance is fully learned. |
| Train reward (final) | 0.97 | A train/eval gap of ~0.32 is the *anti-memorization* signature – the env's reward is hard to game. |

**The honest take**: we trained a Qwen2.5-3B baseline that spends most of its inference compute on text reasoning and uses prosody when it's decisive. The Pivot-Set 5/6 isolates the audio contribution; the broad held-out 51% reflects the *intrinsic* limit of pyin-derived prosody summaries when the clip is dominated by text-decidable content. **A richer feature pipeline would push higher** – see [Future Work](#future-work).

### Training dynamics

![Training dynamics](docs/plots/training_dynamics.png)

Loss and completion length over 200 steps. The reasoning-length floor of 50 tokens forces real `<think>` blocks; the model spends ~130 tokens reasoning before committing.

### Side-by-side reasoning (vanilla vs trained)

[See the full HTML](docs/side_by_side.html) – the same 5 hand-picked hero clips, vanilla Qwen2.5-3B reasoning vs the trained checkpoint. The lead clip is `1_8136`: *"Yeah, I'm really looking forward to it. In fact, there's no place I would rather..."* (sarcastic, Pivot Set; the trained model nails it).

---

## Quick start

### Use the deployed env (no install)

```python
from subtext_arena import SubtextArenaEnv, SubtextArenaAction

with SubtextArenaEnv(
    base_url="https://aamrinder-subtext-arena.hf.space"
).sync() as env:
    obs = env.reset().observation
    print(f"Clip {obs.clip_id} ({obs.speaker}, {obs.duration_s:.1f}s, pivot={obs.is_pivot})")
```

```bash
uv run server  # serves on :8000
```

### Reproduce training (on T4-medium / L4)

```bash
hf jobs uv run --flavor l4x1 -s HF_TOKEN \
  --with "trl>=0.11" --with "transformers>=4.46" --with datasets \
  --with accelerate --with peft --with bitsandbytes \
  --with "git+https://huggingface.co/spaces/aamrinder/subtext-arena" \
  -- python -m subtext_arena.train.train_grpo \
    --max-steps 200 --num-generations 4 \
    --lora-r 16 --lora-dropout 0.05 \
    --eval-ratio 0.2 --n-eval-clips 80 \
    --push-to-hub aamrinder/subtext-arena-grpo \
    --save-trainer-state-to-hub-space aamrinder/subtext-arena
```

~$3 of compute, ~2 hours wall clock. See `notebooks/train_grpo_colab.ipynb` for a Colab-runnable version.

---
## Why this matters

- **Accessibility**: hearing-impaired users rely on captions that strip prosody. Sarcasm becomes opaque. An agent that flags lexical–prosodic mismatch could augment caption pipelines.
- **Conversational AI**: voice assistants that mistake sarcastic complaints for sincere praise are a known failure mode. Prosody-aware classification is the missing piece.
- **A reproducible benchmark**: this env wraps a 6-year-old supervised dataset (MUStARD) into a modern RL task with a verifiable reward and a held-out generalization test. Future teams can plug in richer feature extractors and immediately measure improvement under the same evaluator.

---

## Future work (where this env compounds)

The env's design lets future work plug in **richer feature extractors** at the `audio_tools.py` layer without changing the trained-model interface or the reward. Specifically:

| Today (basic features) | Tomorrow (rich features) | Expected lift |
|---|---|---|
| `librosa.pyin` pitch contour | wav2vec2 / HuBERT prosody embeddings | +10-15 pp on broad held-out (per AMuSeD F1 0.81) |
| RMS energy summary | full spectrogram patch tokens | +5 pp |
| Speaker-anonymous (text only) | speaker-aware features (familiarity, voice ID) | +3-8 pp |
| 3B text policy | 7B audio-LLM (Qwen2-Audio) end-to-end | uncharted (the LISTEN paper says +0 currently; RL might shift it) |

The env is the durable contribution. The model is one baseline on it.
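That plug-in point can be sketched as a tiny interface at the `audio_tools.py` layer. Both classes below are hypothetical illustrations of the design, not code that ships in this repo:

```python
from typing import Protocol

class ProsodyExtractor(Protocol):
    """Anything that turns a clip id into the text block rendered into prompts."""
    def render(self, clip_id: str) -> str: ...

class PyinExtractor:
    """Today: summary features read from a precomputed prosody cache."""
    def __init__(self, cache):
        self.cache = cache

    def render(self, clip_id):
        f = self.cache[clip_id]
        return f"Pitch mean={f['pitch_mean_hz']}Hz, var={f['pitch_var_hz']}Hz"

# Tomorrow: swap in an embedding-based extractor without touching the
# trained-model interface or the reward -- only render() changes.
cache = {"1_8136": {"pitch_mean_hz": 164, "pitch_var_hz": 51}}
print(PyinExtractor(cache).render("1_8136"))
```

Because the reward and the prompt format consume only the rendered text, any extractor satisfying this protocol is evaluated under the same rubric, which is what makes improvements directly comparable.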

---
```
subtext_arena/
├── README.md                    ← you are here (3-5 min read)
├── openenv.yaml                 ← manifest (spec_version=1, runtime=fastapi)
├── pyproject.toml               ← deps: openenv-core>=0.2.3 + bundled data
├── uv.lock
├── client.py                    ← SubtextArenaEnv (HTTP+WS client)
├── models.py                    ← typed Action / Observation
├── data/
│   ├── sarcasm_data.json        ← 690 MUStARD ground-truth labels + transcripts
│   ├── prosody_cache/utterances/ ← 690 pre-computed prosody jsons
│   ├── prosody_cache/context/   ← 690 context-clip prosody jsons
│   ├── pivot_set.json           ← 32 Prosody-Pivot clips (text-only-wrong + audio-decisive)
│   ├── baseline.json            ← text-only Qwen2.5-3B predictions on all 690 clips
│   ├── held_out_eval_run3.json  ← 80-clip generalization eval
│   ├── trainer_state_run3.json  ← per-step training metrics
│   └── log_history_run3.json    ← backup of trainer state
├── server/
│   ├── app.py                   ← FastAPI factory
│   ├── audio_tools.py           ← prosody/contour text renderers
│   ├── scenarios.py             ← MUStARD loader + Pivot oversampling + train/eval split
│   ├── grader.py                ← per-step + final-reward grader
│   ├── subtext_arena_environment.py ← Environment subclass (reset/step/state)
│   ├── Dockerfile
│   └── requirements.txt
├── train/
│   ├── train_grpo.py            ← full GRPO training (Unsloth+TRL, used for Run #3)
│   ├── curate_pivot_set.py      ← 4-step Pivot Set curation pipeline
│   ├── eval_pivot_set.py        ← evaluate any checkpoint on the Pivot Set
│   ├── side_by_side.py          ← baseline vs trained side-by-side HTML
│   ├── plot_run3.py             ← generate the 3 result plots from log_history
│   └── plot_reward_decomp.py    ← reward decomposition plotter (legacy)
├── notebooks/
│   └── train_grpo_colab.ipynb   ← Colab-rerunnable training script
└── docs/
    ├── plots/                   ← committed PNGs (reward_curve, training_dynamics, held_out_breakdown)
    └── side_by_side.html        ← interactive side-by-side reasoning view
```

---

## Credits & references

- **Dataset**: [MUStARD](https://github.com/soujanyaporia/MUStARD) (Castro et al., ACL 2019)
- **Architecture inspiration**: [AudioToolAgent](https://arxiv.org/abs/2510.02995) (Oct 2025) – same pattern, prompted not RL-trained
- **Adjacent recent RL work**: [EmotionThinker](https://arxiv.org/abs/2601.15668) (Mar 2026), [SarcasmMiner](https://arxiv.org/abs/2603.05275) (Mar 2026)
- **Modality-fail benchmark**: [LISTEN](https://arxiv.org/abs/2510.10444) (Oct 2025) – documents the audio-LLM gap on emotion-mismatch cases

Built for the **OpenEnv Hackathon Finale** sponsored by Meta PyTorch / Hugging Face / Scaler.
Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) v0.2.3.

> Team **BalleBalle** – Amrinder Singh + Shubham Kapoor.
train/side_by_side.py
CHANGED
def generate_completion(model, tokenizer, prompt_user_msg, max_tokens=600, temperature=0.7):
    """Greedy decode for deterministic side-by-side. Matches held-out eval settings
    so the side-by-side numbers are consistent with the headline accuracy."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt_user_msg},

@@ -101,10 +103,14 @@ def generate_completion(model, tokenizer, prompt_user_msg, max_tokens=600, tempe
    input_ids = encoded.input_ids if hasattr(encoded, "input_ids") else encoded
    input_ids = input_ids.to(model.device)
    prompt_len = input_ids.shape[1]
    import torch as _t
    with _t.no_grad():
        out = model.generate(
            input_ids=input_ids, max_new_tokens=max_tokens,
            do_sample=False,  # greedy – matches held-out eval
            pad_token_id=tokenizer.eos_token_id,
            use_cache=True,
        )
    return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)