aamrinder committed
Commit 629a13b · verified · 1 Parent(s): 9bd1f77

strip section headings, single title only

Files changed (1)
  1. README.md +2 -28
README.md CHANGED
@@ -16,8 +16,6 @@ tags:
16
 
17
  # Subtext Arena
18
 
19
- Try this. A clip plays, and the line is *"Yeah, I'm really looking forward to it."* Sincere, or sarcastic? Honestly you can read those eight words a hundred times on paper and still you won't know; you have to actually hear the line.
20
-
21
  This repo wraps that exact task as an **OpenEnv environment**. Built on MUStARD (a 690-clip sarcasm-in-sitcom dataset) and a small prosody pipeline of pyin pitch, RMS energy and pause timing, all baked straight into the env. The agent sees a transcript plus a text rendering of the prosody, does a chain of thought, and finishes with `<final>{"label":..., "confidence":...}</final>`. Reward is graded on correctness, reasoning, and format.
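
As a rough illustration of the output contract described above, here is a minimal sketch of pulling the label and confidence out of a completion. The function name `parse_final`, the string label values, and the assumption that confidence lies in [0, 1] are hypothetical; the env's real grader may check things differently.

```python
import json
import re

# Matches the payload between <final> and </final>, across newlines.
FINAL_RE = re.compile(r"<final>(.*?)</final>", re.DOTALL)


def parse_final(completion: str):
    """Return (label, confidence) from the last <final> block, or None if malformed.

    Assumptions (not confirmed by the README): label is a string and
    confidence is a number in [0, 1].
    """
    matches = FINAL_RE.findall(completion)
    if not matches:
        return None
    try:
        payload = json.loads(matches[-1])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    label = payload.get("label")
    confidence = payload.get("confidence")
    if not isinstance(label, str):
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return label, float(confidence)
```

Returning `None` for anything malformed keeps the format check separate from the correctness check, so a reward function could score the two independently.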
22
 
23
  I trained a baseline on top of this (Qwen2.5-3B-Instruct, LoRA r=16, 200 GRPO steps via Unsloth + TRL) on a strict 80/20 train/test split. The trained model is sitting at [aamrinder/subtext-arena-grpo](https://huggingface.co/aamrinder/subtext-arena-grpo). Total compute spent on the whole thing: eleven dollars only.
@@ -35,21 +33,13 @@ Team **BalleBalle**: Amrinder Singh, Shubham Kapoor.
35
 
36
  ---
37
 
38
- ## Why this is interesting
39
-
40
  Detecting sarcasm from audio prosody is honestly not a solved problem at all. GPT-4o sits at 67% Macro-F1 on MUStARD++, meaning that roughly one time in three, one of the strongest models available is being fooled by tone of voice itself. There is a documented modality gap on this (see the [LISTEN paper](https://arxiv.org/abs/2510.10444) from October 2025), and as of today the gap is still wide open. And there is no public RL training environment for this task that I could find, so one fine evening I just sat down and made one.
41
 
42
  The closest prior work is [AudioToolAgent](https://arxiv.org/abs/2510.02995) (Oct 2025), which prompted a frontier LLM to use audio analysis tools. Same kind of architecture really, but they didn't actually train anything. Subtext Arena is the training-side counterpart of that idea.
43
 
44
  ---
45
 
46
- ## How the env works
47
-
48
- Each episode is one MUStARD clip. The prompt the agent sees has:
49
-
50
- - The transcript (target line plus 1-7 lines of preceding conversation, with speaker tags)
51
- - Prosody features rendered as text: pitch mean and variability, contour shape, energy mean and variability, voiced ratio, pre-utterance silence, internal pauses with timestamps
52
- - A pitch contour drawn out as an 8-level ASCII sparkline
53
 
54
  The model emits something like:
55
 
@@ -72,18 +62,12 @@ The env also exposes four tools: `get_transcript`, `get_prosody_features`, `g
72
 
73
  ---
74
 
75
- ## Results
76
-
77
  I trained for 200 steps with `num_generations=4`, LoRA r=16, dropout 0.05, on an L4 ($0.80/hr). The split is 552 train clips and 138 eval clips, deterministically seeded so judges can reproduce the same numbers. Pivot oversample is 2x. Class balance is enforced in the dataset construction itself.
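
For readers who want to see where 552 and 138 come from, here is a minimal sketch of a deterministic 80/20 split over 690 clip ids. The function name `split_clips`, the seed value 42, and the id format are assumptions for illustration; the sketch also ignores the class balancing and pivot oversampling mentioned above, which the actual dataset construction handles.

```python
import random


def split_clips(clip_ids, train_fraction=0.8, seed=42):
    """Deterministically shuffle clip ids and cut them into train/eval lists.

    Sorting first makes the result independent of input order; the seed
    value is a placeholder, not the one used in the repo.
    """
    ordered = sorted(clip_ids)
    rng = random.Random(seed)  # local RNG, leaves global random state untouched
    rng.shuffle(ordered)
    n_train = int(round(len(ordered) * train_fraction))
    return ordered[:n_train], ordered[n_train:]


# 690 clips at 80/20 gives 552 train and 138 eval.
train_ids, eval_ids = split_clips([f"clip_{i}" for i in range(690)])
print(len(train_ids), len(eval_ids))
```

Using a private `random.Random(seed)` instance rather than `random.seed` means the split stays reproducible even if other code draws random numbers beforehand.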
78
 
79
- ### Reward over training
80
-
81
  ![reward curve](docs/plots/reward_curve.png)
82
 
83
  Reward climbs from 0.335 to 0.97 on training prompts. The shaded band around the line is within-batch rollout variance: when it is narrow, the four group-relative generations are mostly agreeing; when it goes wide, the model is exploring.
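
To make the "group-relative" part concrete, here is a small sketch of how GRPO-style advantages are typically computed within one group of rollouts for the same prompt. This is a simplified illustration using the population standard deviation, not a copy of what TRL does internally; the function name and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    rewards: rewards of the generations sampled for one prompt
    (four per prompt in the run described above).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # simplification: population std of the group
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# All four rollouts agree: no spread, so no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
# Rollouts disagree: the better ones get positive advantage.
print(group_relative_advantages([0.0, 1.0, 0.0, 1.0]))
```

When the group's spread is near zero, the advantages vanish, which matches the narrow-band case in the plot; a wide band means the group contains both good and bad rollouts to learn from.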
84
 
85
- ### Held-out generalization
86
-
87
  After training, I ran greedy inference on 80 clips the model had never seen during training.
88
 
89
  ![held-out breakdown](docs/plots/held_out_breakdown.png)
@@ -98,20 +82,14 @@ The honest read: 51% on the broad set is roughly what a plain text-only basel
98
 
99
  The 0.97 train vs 0.51 held-out gap is itself the anti-memorization signal: if the model had been just gaming the reward, train and held-out would have matched.
100
 
101
- ### Side-by-side
102
-
103
  [`docs/side_by_side.html`](docs/side_by_side.html) shows 5 hand-picked clips from the held-out set where text-only Qwen confidently picks the wrong label and the prosody-trained model picks the right one. Tally: baseline 0/5, trained 5/5.
104
 
105
- ### Training dynamics
106
-
107
  ![training dynamics](docs/plots/training_dynamics.png)
108
 
109
  Loss plus completion length over the 200 steps. The reasoning-length floor of 50 words is what keeps the `<think>` block from collapsing into a one-liner.
110
 
111
  ---
112
 
113
- ## Quick start
114
-
115
  Install from this Space:
116
 
117
  ```bash
@@ -162,8 +140,6 @@ A Colab-friendly version of the same script is at [notebooks/train_grpo_colab.ip
162
 
163
  ---
164
 
165
- ## Repo layout
166
-
167
  ```
168
  .
169
  ├── client.py SubtextArenaEnv (HTTP + WebSocket client)
@@ -195,8 +171,6 @@ A Colab-friendly version of the same script is at [notebooks/train_grpo_colab.ip
195
 
196
  ---
197
 
198
- ## Where this could go next
199
-
200
  This is just the start really, I am not stopping here at all. The whole point of putting in the effort to build an environment in the first place is that it stays alive even after the hackathon is done, leaderboard is locked, credits run out, all of that. The audio-tool layer of the env is decoupled from the model interface itself, so anyone with a richer feature stack can plug in straight on top without touching anything else. Specifically:
201
 
202
  - pyin pitch contour → wav2vec2 / HuBERT prosody embeddings
@@ -209,7 +183,7 @@ If any of these drop the broad held-out number from 51% toward AMuSeD's 81% F1 m
209
 
210
  ---
211
 
212
- ## References
213
 
214
  - MUStARD (Castro et al., ACL 2019): https://github.com/soujanyaporia/MUStARD
215
  - AudioToolAgent (Oct 2025): https://arxiv.org/abs/2510.02995
 
16
 
17
  # Subtext Arena
18
 
 
 
19
  This repo wraps that exact task as an **OpenEnv environment**. Built on MUStARD (a 690-clip sarcasm-in-sitcom dataset) and a small prosody pipeline of pyin pitch, RMS energy and pause timing, all baked straight into the env. The agent sees a transcript plus a text rendering of the prosody, does a chain of thought, and finishes with `<final>{"label":..., "confidence":...}</final>`. Reward is graded on correctness, reasoning, and format.
20
 
21
  I trained a baseline on top of this (Qwen2.5-3B-Instruct, LoRA r=16, 200 GRPO steps via Unsloth + TRL) on a strict 80/20 train/test split. The trained model is sitting at [aamrinder/subtext-arena-grpo](https://huggingface.co/aamrinder/subtext-arena-grpo). Total compute spent on the whole thing: eleven dollars only.
 
33
 
34
  ---
35
 
 
 
36
  Detecting sarcasm from audio prosody is honestly not a solved problem at all. GPT-4o sits at 67% Macro-F1 on MUStARD++, meaning that roughly one time in three, one of the strongest models available is being fooled by tone of voice itself. There is a documented modality gap on this (see the [LISTEN paper](https://arxiv.org/abs/2510.10444) from October 2025), and as of today the gap is still wide open. And there is no public RL training environment for this task that I could find, so one fine evening I just sat down and made one.
37
 
38
  The closest prior work is [AudioToolAgent](https://arxiv.org/abs/2510.02995) (Oct 2025), which prompted a frontier LLM to use audio analysis tools. Same kind of architecture really, but they didn't actually train anything. Subtext Arena is the training-side counterpart of that idea.
39
 
40
  ---
41
 
42
+ Each episode is one MUStARD clip. The prompt the agent sees has the transcript (target line plus 1-7 lines of preceding conversation, with speaker tags), prosody features rendered as text (pitch mean and variability, contour shape, energy mean and variability, voiced ratio, pre-utterance silence, internal pauses with timestamps), and a pitch contour drawn out as an 8-level ASCII sparkline.
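
To give a feel for the 8-level sparkline mentioned above, here is a minimal sketch that maps a pitch track onto eight ASCII characters. The character ramp `_.-~=+*#`, the function name `sparkline`, and the use of `None` for unvoiced frames are all assumptions for illustration; the env's own rendering may differ.

```python
# Hypothetical 8-character ramp, lowest pitch to highest.
LEVELS = "_.-~=+*#"


def sparkline(pitch_hz, levels=LEVELS):
    """Render a pitch contour as one character per frame.

    pitch_hz: list of floats, with None marking unvoiced frames
    (an assumption; pitch trackers often use NaN instead).
    """
    voiced = [p for p in pitch_hz if p is not None]
    if not voiced:
        return ""
    lo, hi = min(voiced), max(voiced)
    span = hi - lo
    chars = []
    for p in pitch_hz:
        if p is None:
            chars.append(" ")  # gap for unvoiced frames
        elif span == 0:
            chars.append(levels[0])  # flat contour: everything on the floor
        else:
            # Scale to 0..7 and round to the nearest level.
            idx = int((p - lo) / span * (len(levels) - 1) + 0.5)
            chars.append(levels[idx])
    return "".join(chars)


print(sparkline([110.0, 130.0, None, 180.0, 220.0, 150.0]))
```

Scaling against the clip's own minimum and maximum shows contour shape rather than absolute pitch, which is why the prompt also carries pitch mean and variability as separate text features.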
 
 
 
 
 
 
43
 
44
  The model emits something like:
45
 
 
62
 
63
  ---
64
 
 
 
65
  I trained for 200 steps with `num_generations=4`, LoRA r=16, dropout 0.05, on an L4 ($0.80/hr). The split is 552 train clips and 138 eval clips, deterministically seeded so judges can reproduce the same numbers. Pivot oversample is 2x. Class balance is enforced in the dataset construction itself.
66
 
 
 
67
  ![reward curve](docs/plots/reward_curve.png)
68
 
69
  Reward climbs from 0.335 to 0.97 on training prompts. The shaded band around the line is within-batch rollout variance: when it is narrow, the four group-relative generations are mostly agreeing; when it goes wide, the model is exploring.
70
 
 
 
71
  After training, I ran greedy inference on 80 clips the model had never seen during training.
72
 
73
  ![held-out breakdown](docs/plots/held_out_breakdown.png)
 
82
 
83
  The 0.97 train vs 0.51 held-out gap is itself the anti-memorization signal: if the model had been just gaming the reward, train and held-out would have matched.
84
 
 
 
85
  [`docs/side_by_side.html`](docs/side_by_side.html) shows 5 hand-picked clips from the held-out set where text-only Qwen confidently picks the wrong label and the prosody-trained model picks the right one. Tally: baseline 0/5, trained 5/5.
86
 
 
 
87
  ![training dynamics](docs/plots/training_dynamics.png)
88
 
89
  Loss plus completion length over the 200 steps. The reasoning-length floor of 50 words is what keeps the `<think>` block from collapsing into a one-liner.
90
 
91
  ---
92
 
 
 
93
  Install from this Space:
94
 
95
  ```bash
 
140
 
141
  ---
142
 
 
 
143
  ```
144
  .
145
  ├── client.py SubtextArenaEnv (HTTP + WebSocket client)
 
171
 
172
  ---
173
 
 
 
174
  This is just the start really, I am not stopping here at all. The whole point of putting in the effort to build an environment in the first place is that it stays alive even after the hackathon is done, leaderboard is locked, credits run out, all of that. The audio-tool layer of the env is decoupled from the model interface itself, so anyone with a richer feature stack can plug in straight on top without touching anything else. Specifically:
175
 
176
  - pyin pitch contour → wav2vec2 / HuBERT prosody embeddings
 
183
 
184
  ---
185
 
186
+ References:
187
 
188
  - MUStARD (Castro et al., ACL 2019): https://github.com/soujanyaporia/MUStARD
189
  - AudioToolAgent (Oct 2025): https://arxiv.org/abs/2510.02995