Subtext Arena: text-only baseline vs prosody-trained, hand-picked clips

Five hand-picked clips from the held-out set. Both models are scored against the same gold labels. On the left, vanilla Qwen2.5-3B reads only the transcript, with no prosody at all. On the right, the GRPO-trained model reads the transcript plus prosody features. On these clips, the gap between the two columns is what the audio buys us.

Tally on these 5 clips: baseline (text-only) 0/5 correct, trained (with prosody) 5/5 correct.
These were picked from the 31 clips in the held-out eval where this exact pattern holds (baseline wrong, trained model right). The full 80-clip held-out eval still shows around 51% balanced accuracy overall, so these 5 represent only the prosody-decisive subset, not overall performance.
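For reference, balanced accuracy is the mean of per-class recall, so chance level on a two-class task is 50%. The sketch below shows the metric and the "baseline wrong, trained right" filter. The function names are hypothetical and the labels are made-up toy values, not the real eval data.

```python
from collections import defaultdict


def balanced_accuracy(gold, pred):
    """Mean of per-class recall over the classes that appear in gold."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for g, p in zip(gold, pred):
        totals[g] += 1
        if g == p:
            hits[g] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def pivot_indices(gold, baseline_pred, trained_pred):
    """Indices where the baseline is wrong and the trained model is right."""
    return [
        i
        for i, (g, b, t) in enumerate(zip(gold, baseline_pred, trained_pred))
        if b != g and t == g
    ]


# Toy labels for illustration only (assumption: not taken from the eval files).
gold = ["sarcastic", "sincere", "sincere", "sarcastic"]
baseline = ["sincere", "sarcastic", "sincere", "sarcastic"]
trained = ["sarcastic", "sincere", "sincere", "sincere"]

print(balanced_accuracy(gold, trained))        # 0.75
print(pivot_indices(gold, baseline, trained))  # [0, 1]
```

A subset selected this way is 0/N for the baseline and N/N for the trained model by construction, which is why the tally above should not be read as an overall accuracy figure.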

Clip 1_8136, speaker PERSON, gold label sarcastic, PIVOT SET

"Yeah, I'm really looking forward to it. In fact, there's no place I would rather be than here."

Baseline (text-only)

predicted: sincere (conf 0.90)

(The text-only baseline doesn't generate any reasoning; it outputs only a label and a confidence.
The judge had only the transcript plus context, with no audio descriptors.)

Trained (with prosody)

predicted: sarcastic (conf 0.90)

<think>
The speaker uses positive language like "looking forward to it" and "no place I would rather be than here," which are typically used sincerely. However, the prosody doesn't match the sincerity. The pitch is relatively flat with little variation, and there are no internal pauses or emphatic pauses that might indicate genuine excitement. Additionally, the prosody shows no significant rise in energy, which is unusual for someone genuinely enthusiastic about an event. The lack of vocal variety and the flat affect suggest that the speaker is not expressing genuine excitement but rather trying to sound enthusiastic to mask their true feelings.
</think>
<final>{"label":"sarcastic","confidence":0.9}</final>

Clip 1_2792, speaker PENNY, gold label sincere, PIVOT SET

"No, I mean, we can always go back to being friends."

Baseline (text-only)

predicted: sarcastic (conf 0.95)

(The text-only baseline doesn't generate any reasoning; it outputs only a label and a confidence.
The judge had only the transcript plus context, with no audio descriptors.)

Trained (with prosody)

predicted: sincere (conf 0.90)

<think>
The prosody analysis shows that Penny's pitch is relatively high but not extremely variable, and there are no internal pauses or emphasized pauses. Her tone is neutral and slightly flat, which suggests sincerity rather than sarcasm. Additionally, her statement "No, I mean, we can always go back to being friends" doesn't contain any irony or hidden meaning that would make it sarcastic. The way she says it, with a neutral tone and without any dramatic emphasis, aligns more with sincerity.
</think>
<final>{"label":"sincere","confidence":0.9}</final>

Clip 1_5581, speaker SHELDON, gold label sincere, PIVOT SET

"I gargled with tequila and may have swallowed a teeny bit."

Baseline (text-only)

predicted: sarcastic (conf 0.95)

(The text-only baseline doesn't generate any reasoning; it outputs only a label and a confidence.
The judge had only the transcript plus context, with no audio descriptors.)

Trained (with prosody)

predicted: sincere (conf 0.90)

<think>
The prosody analysis shows a flat affect with no internal pauses, which suggests sincerity. Sheldon is making a comparison between Jewish and Indian parents, but he is not being sarcastic about it. The tone is matter-of-fact and informative, without any indication of sarcasm. He is simply stating facts about parental behavior without adding any ironic or mocking tone.
</think>
<final>{"label":"sincere","confidence":0.9}</final>

Clip 1_4792, speaker SHELDON, gold label sincere, PIVOT SET

"Obviously, we're no longer a Justice League. We have no choice but to switch to our Muppet Baby costumes."

Baseline (text-only)

predicted: sarcastic (conf 0.95)

(The text-only baseline doesn't generate any reasoning; it outputs only a label and a confidence.
The judge had only the transcript plus context, with no audio descriptors.)

Trained (with prosody)

predicted: sincere (conf 0.90)

<think>
The prosody features indicate a very flat pitch contour with minimal variability, which suggests a lack of emotional expression. Sheldon's tone is neutral and matter-of-fact, as if he's simply stating a fact. There are no emphatic pauses or rising intonations that would typically accompany sarcasm. The content of his statement, "Obviously, we're no longer a Justice League. We have no choice but to switch to our Muppet Baby costumes," is also quite neutral and doesn't contain any obvious irony or mockery. Sheldon is simply acknowledging a change in circumstances without any indication of sarcasm.
</think>
<final>{"label":"sincere","confidence":0.9}</final>

Clip 1_1001, speaker RAJ, gold label sincere

"What else? Sell it on eBay as 'slightly used.'"

Baseline (text-only)

predicted: sarcastic (conf 0.85)

(The text-only baseline doesn't generate any reasoning; it outputs only a label and a confidence.
The judge had only the transcript plus context, with no audio descriptors.)

Trained (with prosody)

predicted: sincere (conf 0.90)

<think>
The prosody features indicate that Raj speaks with high energy and no internal pauses, which suggests sincerity rather than sarcasm. Additionally, his pitch contour is flat, which aligns with a sincere tone. The phrase "What else? Sell it on eBay as 'slightly used'" sounds genuine and matter-of-fact, without any indication of mockery or irony.
</think>
<final>{"label":"sincere","confidence":0.9}</final>

Side-by-side built locally from held_out_eval_run3.json (trained model completions) and baseline.json (text-only Qwen with a simple prompt, generated by curate_pivot_set.py). Both runs use greedy decoding for determinism.
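A minimal sketch of how such a side-by-side could be assembled from the two result sets, not the actual build script. It parses the `<think>…</think><final>{…}</final>` completion format shown in the cards above. The record fields (`clip_id`, `completion`, `label`, `confidence`) and the function names are assumptions for illustration, since the real schema of held_out_eval_run3.json and baseline.json is not shown here.

```python
import json
import re

FINAL_RE = re.compile(r"<final>(.*?)</final>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_completion(text):
    """Return (label, confidence, reasoning), or None if <final> is missing or malformed."""
    match = FINAL_RE.search(text)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or "label" not in payload or "confidence" not in payload:
        return None
    think = THINK_RE.search(text)
    reasoning = think.group(1).strip() if think else ""
    return payload["label"], float(payload["confidence"]), reasoning


def side_by_side(trained_rows, baseline_rows):
    """Join the two runs on clip_id (hypothetical field name), skipping unparseable rows."""
    baseline_by_id = {row["clip_id"]: row for row in baseline_rows}
    cards = []
    for row in trained_rows:
        parsed = parse_completion(row["completion"])
        base = baseline_by_id.get(row["clip_id"])
        if parsed is None or base is None:
            continue
        label, confidence, reasoning = parsed
        cards.append({
            "clip_id": row["clip_id"],
            "baseline": (base["label"], base["confidence"]),
            "trained": (label, confidence),
            "reasoning": reasoning,
        })
    return cards


# Illustrative records shaped after the first card; the field names are assumed.
trained_rows = [{
    "clip_id": "1_8136",
    "completion": '<think>\nFlat pitch.\n</think>\n<final>{"label":"sarcastic","confidence":0.9}</final>',
}]
baseline_rows = [{"clip_id": "1_8136", "label": "sincere", "confidence": 0.9}]
cards = side_by_side(trained_rows, baseline_rows)
print(cards[0]["baseline"], cards[0]["trained"])
```

Returning `None` for a missing or malformed `<final>` block keeps format failures visible to the caller instead of silently counting them as one of the two labels.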

🎧 Subtext Arena: does the prosody-trained model actually do something the text-only baseline cannot?