<!DOCTYPE html>
<html><head>
<meta charset="utf-8">
<title>Subtext Arena: text-only baseline vs prosody-trained, hand-picked clips</title>
<style>
  body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", system-ui, sans-serif;
         max-width: 1100px; margin: 32px auto; padding: 0 20px; color: #222;
         line-height: 1.5; }
  h1 { font-size: 26px; margin-bottom: 4px; }
  h1 .emoji { font-style: normal; }
  p.lede { color: #555; margin-top: 0; font-size: 15px; }
  .clip { border: 1px solid #ddd; border-radius: 10px; padding: 18px;
           margin-bottom: 28px; background: #fafafa; }
  .clip-label { font-size: 15px; font-weight: 600; margin: 0 0 10px 0; color: #333; }
  .gold-sarcastic { color: #b3274d; font-weight: 600; }
  .gold-sincere   { color: #1d7a4a; font-weight: 600; }
  .pivot-tag { background: #fff3cd; color: #8a6d3b; padding: 2px 6px;
               border-radius: 3px; font-size: 12px; font-weight: 600; margin-left: 6px; }
  .columns { display: grid; grid-template-columns: 1fr 1fr; gap: 14px; }
  .col { padding: 12px 14px; border-radius: 7px; }
  .baseline { background: #fff5f5; border: 1px solid #f8c4c4; }
  .trained  { background: #effaf3; border: 1px solid #b6e2c1; }
  .col-label { margin: 0 0 8px 0; font-size: 12px; text-transform: uppercase;
                letter-spacing: 0.06em; color: #666; font-weight: 700; }
  .verdict-correct { color: #1d7a4a; font-weight: 700; }
  .verdict-wrong   { color: #b3274d; font-weight: 700; }
  pre { white-space: pre-wrap; word-wrap: break-word; font-size: 12.5px;
         line-height: 1.42; background: white; padding: 9px 11px;
         border-radius: 5px; border: 1px solid #eaeaea; max-height: 360px;
         overflow-y: auto; margin: 6px 0 0 0; }
  .transcript { font-style: italic; color: #555; margin-bottom: 10px;
                 padding: 8px 12px; background: white; border-left: 3px solid #ccc;
                 border-radius: 0 5px 5px 0; }
  .summary { background: #f0f4ff; border: 1px solid #b6c8f0; padding: 14px 18px;
              border-radius: 8px; margin: 18px 0; font-size: 14px; }
  .summary strong { color: #1f4080; }
  .footnote { color: #777; font-size: 13px; margin-top: 30px;
               padding-top: 16px; border-top: 1px solid #eee; }
</style>
</head><body>

<h1><span class="emoji">🎧</span> Subtext Arena: does the prosody-trained model actually do something the text-only baseline cannot?</h1>
<p class="lede">Five hand-picked clips from the held-out set, with both models scored against the same gold labels. On the left, vanilla Qwen2.5-3B reads only the transcript, with no prosody information. On the right, the GRPO-trained model reads the transcript plus prosody features. On these clips, the gap between the two columns shows what the audio-derived features contribute.</p>

<div class="summary">
<strong>Tally on these 5 clips:</strong> baseline (text-only) <span class="verdict-wrong">0/5 correct</span>, trained (with prosody) <span class="verdict-correct">5/5 correct</span>.<br>
<small><em>These five were picked from the 31 clips in the held-out eval where this exact pattern holds (baseline wrong, trained model right). The full 80-clip held-out eval still shows about 51% balanced accuracy overall, so these five represent the prosody-decisive subset, not typical performance.</em></small>
</div>
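
<div class="summary">
<strong>Sketch: computing a tally and a balanced accuracy.</strong> Balanced accuracy is the mean of per-class recall, so the sarcastic and sincere classes count equally regardless of how many clips each has. The Python below is an illustrative sketch, not the evaluation script behind this page; the function name <code>balanced_accuracy</code> is an assumption, and the label lists simply transcribe the five clips shown here.
<pre>
```python
from collections import defaultdict


def balanced_accuracy(gold, pred):
    """Mean of per-class recall, so each gold class counts equally."""
    hits = defaultdict(int)    # correct predictions per gold class
    totals = defaultdict(int)  # number of clips per gold class
    for g, p in zip(gold, pred):
        totals[g] += 1
        if g == p:
            hits[g] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Gold labels and predictions transcribed from the five clips on this page.
gold = ["sarcastic", "sincere", "sincere", "sincere", "sincere"]
baseline = ["sincere", "sarcastic", "sarcastic", "sarcastic", "sarcastic"]
trained = ["sarcastic", "sincere", "sincere", "sincere", "sincere"]

baseline_correct = sum(g == p for g, p in zip(gold, baseline))
trained_correct = sum(g == p for g, p in zip(gold, trained))
print(f"baseline {baseline_correct}/5, trained {trained_correct}/5")
print(balanced_accuracy(gold, baseline), balanced_accuracy(gold, trained))
```
</pre>
Because the five clips were selected for this outcome, these numbers describe the selection, not the models' performance on the full held-out set.
</div>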

<div class="clip">
  <p class="clip-label">Clip 1_8136, speaker PERSON, gold label <span class="gold-sarcastic">sarcastic</span><span class="pivot-tag">PIVOT SET</span></p>
  <div class="transcript">"Yeah, I'm really looking forward to it. In fact, there's no place I would rather be than here."</div>
  <div class="columns">
    <div class="col baseline">
      <p class="col-label">Baseline (text-only)</p>
      <p>predicted: <span class="verdict-wrong">sincere</span> (conf 0.90)</p>
      <pre>(The text-only baseline does not generate reasoning; it outputs only a label and a confidence.
The judge saw only the transcript plus context, with no audio descriptors.)</pre>
    </div>
    <div class="col trained">
      <p class="col-label">Trained (with prosody)</p>
      <p>predicted: <span class="verdict-correct">sarcastic</span> (conf 0.90)</p>
      <pre>&lt;think&gt;
The speaker uses positive language like "looking forward to it" and "no place I would rather be than here," which are typically used sincerely. However, the prosody doesn't match the sincerity. The pitch is relatively flat with little variation, and there are no internal pauses or emphatic pauses that might indicate genuine excitement. Additionally, the prosody shows no significant rise in energy, which is unusual for someone genuinely enthusiastic about an event. The lack of vocal variety and the flat affect suggest that the speaker is not expressing genuine excitement but rather trying to sound enthusiastic to mask their true feelings.
&lt;/think&gt;
&lt;final&gt;{"label":"sarcastic","confidence":0.9}&lt;/final&gt;</pre>
    </div>
  </div>
</div>

<div class="clip">
  <p class="clip-label">Clip 1_2792, speaker PENNY, gold label <span class="gold-sincere">sincere</span><span class="pivot-tag">PIVOT SET</span></p>
  <div class="transcript">"No, I mean, we can always go back to being friends."</div>
  <div class="columns">
    <div class="col baseline">
      <p class="col-label">Baseline (text-only)</p>
      <p>predicted: <span class="verdict-wrong">sarcastic</span> (conf 0.95)</p>
      <pre>(The text-only baseline does not generate reasoning; it outputs only a label and a confidence.
The judge saw only the transcript plus context, with no audio descriptors.)</pre>
    </div>
    <div class="col trained">
      <p class="col-label">Trained (with prosody)</p>
      <p>predicted: <span class="verdict-correct">sincere</span> (conf 0.90)</p>
      <pre>&lt;think&gt;
The prosody analysis shows that Penny's pitch is relatively high but not extremely variable, and there are no internal pauses or emphasized pauses. Her tone is neutral and slightly flat, which suggests sincerity rather than sarcasm. Additionally, her statement "No, I mean, we can always go back to being friends" doesn't contain any irony or hidden meaning that would make it sarcastic. The way she says it, with a neutral tone and without any dramatic emphasis, aligns more with sincerity.
&lt;/think&gt;
&lt;final&gt;{"label":"sincere","confidence":0.9}&lt;/final&gt;</pre>
    </div>
  </div>
</div>

<div class="clip">
  <p class="clip-label">Clip 1_5581, speaker SHELDON, gold label <span class="gold-sincere">sincere</span><span class="pivot-tag">PIVOT SET</span></p>
  <div class="transcript">"I gargled with tequila and may have swallowed a teeny bit."</div>
  <div class="columns">
    <div class="col baseline">
      <p class="col-label">Baseline (text-only)</p>
      <p>predicted: <span class="verdict-wrong">sarcastic</span> (conf 0.95)</p>
      <pre>(The text-only baseline does not generate reasoning; it outputs only a label and a confidence.
The judge saw only the transcript plus context, with no audio descriptors.)</pre>
    </div>
    <div class="col trained">
      <p class="col-label">Trained (with prosody)</p>
      <p>predicted: <span class="verdict-correct">sincere</span> (conf 0.90)</p>
      <pre>&lt;think&gt;
The prosody analysis shows a flat affect with no internal pauses, which suggests sincerity. Sheldon is making a comparison between Jewish and Indian parents, but he is not being sarcastic about it. The tone is matter-of-fact and informative, without any indication of sarcasm. He is simply stating facts about parental behavior without adding any ironic or mocking tone.
&lt;/think&gt;
&lt;final&gt;{"label":"sincere","confidence":0.9}&lt;/final&gt;</pre>
    </div>
  </div>
</div>

<div class="clip">
  <p class="clip-label">Clip 1_4792, speaker SHELDON, gold label <span class="gold-sincere">sincere</span><span class="pivot-tag">PIVOT SET</span></p>
  <div class="transcript">"Obviously, we're no longer a Justice League. We have no choice but to switch to our Muppet Baby costumes."</div>
  <div class="columns">
    <div class="col baseline">
      <p class="col-label">Baseline (text-only)</p>
      <p>predicted: <span class="verdict-wrong">sarcastic</span> (conf 0.95)</p>
      <pre>(The text-only baseline does not generate reasoning; it outputs only a label and a confidence.
The judge saw only the transcript plus context, with no audio descriptors.)</pre>
    </div>
    <div class="col trained">
      <p class="col-label">Trained (with prosody)</p>
      <p>predicted: <span class="verdict-correct">sincere</span> (conf 0.90)</p>
      <pre>&lt;think&gt;
The prosody features indicate a very flat pitch contour with minimal variability, which suggests a lack of emotional expression. Sheldon's tone is neutral and matter-of-fact, as if he's simply stating a fact. There are no emphatic pauses or rising intonations that would typically accompany sarcasm. The content of his statement, "Obviously, we're no longer a Justice League. We have no choice but to switch to our Muppet Baby costumes," is also quite neutral and doesn't contain any obvious irony or mockery. Sheldon is simply acknowledging a change in circumstances without any indication of sarcasm.
&lt;/think&gt;
&lt;final&gt;{"label":"sincere","confidence":0.9}&lt;/final&gt;</pre>
    </div>
  </div>
</div>

<div class="clip">
  <p class="clip-label">Clip 1_1001, speaker RAJ, gold label <span class="gold-sincere">sincere</span></p>
  <div class="transcript">"What else? Sell it on eBay as &quot;slightly used.&quot;"</div>
  <div class="columns">
    <div class="col baseline">
      <p class="col-label">Baseline (text-only)</p>
      <p>predicted: <span class="verdict-wrong">sarcastic</span> (conf 0.85)</p>
      <pre>(The text-only baseline does not generate reasoning; it outputs only a label and a confidence.
The judge saw only the transcript plus context, with no audio descriptors.)</pre>
    </div>
    <div class="col trained">
      <p class="col-label">Trained (with prosody)</p>
      <p>predicted: <span class="verdict-correct">sincere</span> (conf 0.90)</p>
      <pre>&lt;think&gt;
The prosody features indicate that Raj speaks with high energy and no internal pauses, which suggests sincerity rather than sarcasm. Additionally, his pitch contour is flat, which aligns with a sincere tone. The phrase "What else? Sell it on eBay as 'slightly used'" sounds genuine and matter-of-fact, without any indication of mockery or irony.
&lt;/think&gt;
&lt;final&gt;{"label":"sincere","confidence":0.9}&lt;/final&gt;</pre>
    </div>
  </div>
</div>


<div class="footnote">
This side-by-side was built locally from <code>held_out_eval_run3.json</code> (trained-model completions) and <code>baseline.json</code> (text-only Qwen with a simple prompt, generated by <code>curate_pivot_set.py</code>). Both runs use greedy decoding, so the outputs are deterministic.
</div>
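
<div class="footnote">
A rough sketch of how clips like these could be selected: pair each trained-model record with the baseline record for the same clip, then keep the clips where the baseline is wrong and the trained model is right. The field names <code>clip_id</code>, <code>gold</code>, and <code>label</code>, the function name <code>prosody_decisive</code>, and the toy records are all hypothetical; the real schema of the two JSON files is not shown on this page and may differ.
<pre>
```python
def prosody_decisive(trained_rows, baseline_rows):
    """Return clip ids where the baseline is wrong and the trained model is right."""
    baseline_by_id = {row["clip_id"]: row for row in baseline_rows}
    picked = []
    for row in trained_rows:
        base = baseline_by_id.get(row["clip_id"])
        if base is None:
            continue  # clip missing from the baseline run
        if base["label"] != row["gold"] and row["label"] == row["gold"]:
            picked.append(row["clip_id"])
    return picked


# Toy records with a hypothetical schema; "x_0001" is a made-up clip id.
trained_rows = [
    {"clip_id": "1_8136", "gold": "sarcastic", "label": "sarcastic"},
    {"clip_id": "1_2792", "gold": "sincere", "label": "sincere"},
    {"clip_id": "x_0001", "gold": "sincere", "label": "sarcastic"},
]
baseline_rows = [
    {"clip_id": "1_8136", "label": "sincere"},
    {"clip_id": "1_2792", "label": "sarcastic"},
    {"clip_id": "x_0001", "label": "sarcastic"},
]
print(prosody_decisive(trained_rows, baseline_rows))
```
</pre>
</div>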

</body></html>