Space: aamrinder/subtext-arena

aamrinder committed 12 days ago

Commit 0b7c141 (verified), 1 parent: 94aea4c

strip em-dashes and section headings, conversational prose

Files changed (1)

docs/side_by_side.html +33 -34

@@ -1,7 +1,7 @@
 <!DOCTYPE html>
 <html><head>
 <meta charset="utf-8">
-<title>Subtext Arena — text-only baseline vs prosody-trained, hand-picked clips</title>
 <style>
   body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", system-ui, sans-serif;
          max-width: 1100px; margin: 32px auto; padding: 0 20px; color: #222;
@@ -11,7 +11,7 @@
   p.lede { color: #555; margin-top: 0; font-size: 15px; }
   .clip { border: 1px solid #ddd; border-radius: 10px; padding: 18px;
            margin-bottom: 28px; background: #fafafa; }
-  .clip h2 { font-size: 17px; margin: 0 0 10px 0; color: #333; }
   .gold-sarcastic { color: #b3274d; font-weight: 600; }
   .gold-sincere   { color: #1d7a4a; font-weight: 600; }
   .pivot-tag { background: #fff3cd; color: #8a6d3b; padding: 2px 6px;
@@ -20,8 +20,8 @@
   .col { padding: 12px 14px; border-radius: 7px; }
   .baseline { background: #fff5f5; border: 1px solid #f8c4c4; }
   .trained  { background: #effaf3; border: 1px solid #b6e2c1; }
-  .col h3 { margin: 0 0 8px 0; font-size: 12px; text-transform: uppercase;
-             letter-spacing: 0.06em; color: #666; font-weight: 700; }
   .verdict-correct { color: #1d7a4a; font-weight: 700; }
   .verdict-wrong   { color: #b3274d; font-weight: 700; }
   pre { white-space: pre-wrap; word-wrap: break-word; font-size: 12.5px;
@@ -39,26 +39,26 @@
 </style>
 </head><body>
-<h1><span class="emoji">🎧</span> Subtext Arena — does the prosody-trained model do something the text-only baseline cannot?</h1>
-<p class="lede">5 hand-picked MUStARD clips. Same gold labels. Left: vanilla Qwen2.5-3B reading the transcript ALONE (text-only). Right: the GRPO-trained model reading transcript + prosody features. The gap is the audio contribution.</p>
 <div class="summary">
 <strong>Tally on these 5 clips:</strong> baseline (text-only) <span class="verdict-wrong">0/5 correct</span>, trained (with prosody) <span class="verdict-correct">5/5 correct</span>.<br>
-<small><em>Selected from the 31 clips in the held-out eval where this pattern holds. The full 80-clip held-out eval still shows ~51% balanced accuracy — these 5 are the prosody-decisive subset.</em></small>
 </div>
 <div class="clip">
-  <h2>Clip 1_8136 — speaker: PERSON | gold: <span class="gold-sarcastic">sarcastic</span><span class="pivot-tag">PIVOT SET</span></h2>
   <div class="transcript">"Yeah, I'm really looking forward to it. In fact, there's no place I would rather be than here."</div>
   <div class="columns">
     <div class="col baseline">
-      <h3>Baseline — text-only Qwen2.5-3B (simple prompt, no prosody)</h3>
       <p>predicted: <span class="verdict-wrong">sincere</span> (conf 0.90)</p>
-      <pre>(text-only baseline doesn't generate reasoning — it just outputs Label/Confidence.
-The judge had ONLY the transcript + context, no audio descriptors.)</pre>
     </div>
     <div class="col trained">
-      <h3>Trained — prosody-grounded GRPO checkpoint (full prompt)</h3>
       <p>predicted: <span class="verdict-correct">sarcastic</span> (conf 0.90)</p>
       <pre>&lt;think&gt;
 The speaker uses positive language like "looking forward to it" and "no place I would rather be than here," which are typically used sincerely. However, the prosody doesn't match the sincerity. The pitch is relatively flat with little variation, and there are no internal pauses or emphatic pauses that might indicate genuine excitement. Additionally, the prosody shows no significant rise in energy, which is unusual for someone genuinely enthusiastic about an event. The lack of vocal variety and the flat affect suggest that the speaker is not expressing genuine excitement but rather trying to sound enthusiastic to mask their true feelings.
@@ -69,17 +69,17 @@ The speaker uses positive language like "looking forward to it" and "no place I
 </div>
 <div class="clip">
-  <h2>Clip 1_2792 — speaker: PENNY | gold: <span class="gold-sincere">sincere</span><span class="pivot-tag">PIVOT SET</span></h2>
   <div class="transcript">"No, I mean, we can always go back to being friends."</div>
   <div class="columns">
     <div class="col baseline">
-      <h3>Baseline — text-only Qwen2.5-3B (simple prompt, no prosody)</h3>
       <p>predicted: <span class="verdict-wrong">sarcastic</span> (conf 0.95)</p>
-      <pre>(text-only baseline doesn't generate reasoning — it just outputs Label/Confidence.
-The judge had ONLY the transcript + context, no audio descriptors.)</pre>
     </div>
     <div class="col trained">
-      <h3>Trained — prosody-grounded GRPO checkpoint (full prompt)</h3>
       <p>predicted: <span class="verdict-correct">sincere</span> (conf 0.90)</p>
       <pre>&lt;think&gt;
 The prosody analysis shows that Penny's pitch is relatively high but not extremely variable, and there are no internal pauses or emphasized pauses. Her tone is neutral and slightly flat, which suggests sincerity rather than sarcasm. Additionally, her statement "No, I mean, we can always go back to being friends" doesn't contain any irony or hidden meaning that would make it sarcastic. The way she says it, with a neutral tone and without any dramatic emphasis, aligns more with sincerity.
@@ -90,17 +90,17 @@ The prosody analysis shows that Penny's pitch is relatively high but not extreme
 </div>
 <div class="clip">
-  <h2>Clip 1_5581 — speaker: SHELDON | gold: <span class="gold-sincere">sincere</span><span class="pivot-tag">PIVOT SET</span></h2>
   <div class="transcript">"I gargled with tequila and may have swallowed a teeny bit."</div>
   <div class="columns">
     <div class="col baseline">
-      <h3>Baseline — text-only Qwen2.5-3B (simple prompt, no prosody)</h3>
       <p>predicted: <span class="verdict-wrong">sarcastic</span> (conf 0.95)</p>
-      <pre>(text-only baseline doesn't generate reasoning — it just outputs Label/Confidence.
-The judge had ONLY the transcript + context, no audio descriptors.)</pre>
     </div>
     <div class="col trained">
-      <h3>Trained — prosody-grounded GRPO checkpoint (full prompt)</h3>
       <p>predicted: <span class="verdict-correct">sincere</span> (conf 0.90)</p>
       <pre>&lt;think&gt;
 The prosody analysis shows a flat affect with no internal pauses, which suggests sincerity. Sheldon is making a comparison between Jewish and Indian parents, but he is not being sarcastic about it. The tone is matter-of-fact and informative, without any indication of sarcasm. He is simply stating facts about parental behavior without adding any ironic or mocking tone.
@@ -111,17 +111,17 @@ The prosody analysis shows a flat affect with no internal pauses, which suggests
 </div>
 <div class="clip">
-  <h2>Clip 1_4792 — speaker: SHELDON | gold: <span class="gold-sincere">sincere</span><span class="pivot-tag">PIVOT SET</span></h2>
   <div class="transcript">"Obviously, we're no longer a Justice League. We have no choice but to switch to our Muppet Baby costumes."</div>
   <div class="columns">
     <div class="col baseline">
-      <h3>Baseline — text-only Qwen2.5-3B (simple prompt, no prosody)</h3>
       <p>predicted: <span class="verdict-wrong">sarcastic</span> (conf 0.95)</p>
-      <pre>(text-only baseline doesn't generate reasoning — it just outputs Label/Confidence.
-The judge had ONLY the transcript + context, no audio descriptors.)</pre>
     </div>
     <div class="col trained">
-      <h3>Trained — prosody-grounded GRPO checkpoint (full prompt)</h3>
       <p>predicted: <span class="verdict-correct">sincere</span> (conf 0.90)</p>
       <pre>&lt;think&gt;
 The prosody features indicate a very flat pitch contour with minimal variability, which suggests a lack of emotional expression. Sheldon's tone is neutral and matter-of-fact, as if he's simply stating a fact. There are no emphatic pauses or rising intonations that would typically accompany sarcasm. The content of his statement, "Obviously, we're no longer a Justice League. We have no choice but to switch to our Muppet Baby costumes," is also quite neutral and doesn't contain any obvious irony or mockery. Sheldon is simply acknowledging a change in circumstances without any indication of sarcasm.
@@ -132,17 +132,17 @@ The prosody features indicate a very flat pitch contour with minimal variability
 </div>
 <div class="clip">
-  <h2>Clip 1_1001 — speaker: RAJ | gold: <span class="gold-sincere">sincere</span></h2>
   <div class="transcript">"What else? Sell it on eBay as &quot;slightly used.&quot;"</div>
   <div class="columns">
     <div class="col baseline">
-      <h3>Baseline — text-only Qwen2.5-3B (simple prompt, no prosody)</h3>
       <p>predicted: <span class="verdict-wrong">sarcastic</span> (conf 0.85)</p>
-      <pre>(text-only baseline doesn't generate reasoning — it just outputs Label/Confidence.
-The judge had ONLY the transcript + context, no audio descriptors.)</pre>
     </div>
     <div class="col trained">
-      <h3>Trained — prosody-grounded GRPO checkpoint (full prompt)</h3>
       <p>predicted: <span class="verdict-correct">sincere</span> (conf 0.90)</p>
       <pre>&lt;think&gt;
 The prosody features indicate that Raj speaks with high energy and no internal pauses, which suggests sincerity rather than sarcasm. Additionally, his pitch contour is flat, which aligns with a sincere tone. The phrase "What else? Sell it on eBay as 'slightly used'" sounds genuine and matter-of-fact, without any indication of mockery or irony.
@@ -154,8 +154,7 @@ The prosody features indicate that Raj speaks with high energy and no internal p
 <div class="footnote">
-Side-by-side built locally from <code>held_out_eval_run3.json</code> (trained model completions) and
-<code>baseline.json</code> (text-only Qwen with simple prompt from <code>curate_pivot_set.py</code>). Both runs use greedy decoding for determinism.
 </div>
 </body></html>

 <!DOCTYPE html>
 <html><head>
 <meta charset="utf-8">
+<title>Subtext Arena: text-only baseline vs prosody-trained, hand-picked clips</title>
 <style>
   body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", system-ui, sans-serif;
          max-width: 1100px; margin: 32px auto; padding: 0 20px; color: #222;
   p.lede { color: #555; margin-top: 0; font-size: 15px; }
   .clip { border: 1px solid #ddd; border-radius: 10px; padding: 18px;
            margin-bottom: 28px; background: #fafafa; }
+  .clip-label { font-size: 15px; font-weight: 600; margin: 0 0 10px 0; color: #333; }
   .gold-sarcastic { color: #b3274d; font-weight: 600; }
   .gold-sincere   { color: #1d7a4a; font-weight: 600; }
   .pivot-tag { background: #fff3cd; color: #8a6d3b; padding: 2px 6px;
   .col { padding: 12px 14px; border-radius: 7px; }
   .baseline { background: #fff5f5; border: 1px solid #f8c4c4; }
   .trained  { background: #effaf3; border: 1px solid #b6e2c1; }
+  .col-label { margin: 0 0 8px 0; font-size: 12px; text-transform: uppercase;
+                letter-spacing: 0.06em; color: #666; font-weight: 700; }
   .verdict-correct { color: #1d7a4a; font-weight: 700; }
   .verdict-wrong   { color: #b3274d; font-weight: 700; }
   pre { white-space: pre-wrap; word-wrap: break-word; font-size: 12.5px;
 </style>
 </head><body>
+<h1><span class="emoji">🎧</span> Subtext Arena: does the prosody-trained model actually do something the text-only baseline cannot?</h1>
+<p class="lede">Five hand-picked clips from the held-out set, all with the same gold labels. On the left, vanilla Qwen2.5-3B reading just the transcript, no prosody whatsoever. On the right, the GRPO-trained model reading transcript plus prosody features baked in. The gap between the two columns is honestly what the audio actually buys us.</p>
 <div class="summary">
 <strong>Tally on these 5 clips:</strong> baseline (text-only) <span class="verdict-wrong">0/5 correct</span>, trained (with prosody) <span class="verdict-correct">5/5 correct</span>.<br>
+<small><em>These were picked from the 31 clips in the held-out eval where this exact pattern holds. The full 80-clip held-out eval still shows around 51% balanced accuracy overall, so these 5 are basically the prosody-decisive subset only.</em></small>
 </div>
 <div class="clip">
+  <p class="clip-label">Clip 1_8136, speaker PERSON, gold label <span class="gold-sarcastic">sarcastic</span><span class="pivot-tag">PIVOT SET</span></p>
   <div class="transcript">"Yeah, I'm really looking forward to it. In fact, there's no place I would rather be than here."</div>
   <div class="columns">
     <div class="col baseline">
+      <p class="col-label">Baseline (text-only)</p>
       <p>predicted: <span class="verdict-wrong">sincere</span> (conf 0.90)</p>
+      <pre>(text-only baseline doesn't generate any reasoning, it just outputs label and confidence.
+The judge had only the transcript plus context, no audio descriptors at all.)</pre>
     </div>
     <div class="col trained">
+      <p class="col-label">Trained (with prosody)</p>
       <p>predicted: <span class="verdict-correct">sarcastic</span> (conf 0.90)</p>
       <pre>&lt;think&gt;
 The speaker uses positive language like "looking forward to it" and "no place I would rather be than here," which are typically used sincerely. However, the prosody doesn't match the sincerity. The pitch is relatively flat with little variation, and there are no internal pauses or emphatic pauses that might indicate genuine excitement. Additionally, the prosody shows no significant rise in energy, which is unusual for someone genuinely enthusiastic about an event. The lack of vocal variety and the flat affect suggest that the speaker is not expressing genuine excitement but rather trying to sound enthusiastic to mask their true feelings.
 </div>
 <div class="clip">
+  <p class="clip-label">Clip 1_2792, speaker PENNY, gold label <span class="gold-sincere">sincere</span><span class="pivot-tag">PIVOT SET</span></p>
   <div class="transcript">"No, I mean, we can always go back to being friends."</div>
   <div class="columns">
     <div class="col baseline">
+      <p class="col-label">Baseline (text-only)</p>
       <p>predicted: <span class="verdict-wrong">sarcastic</span> (conf 0.95)</p>
+      <pre>(text-only baseline doesn't generate any reasoning, it just outputs label and confidence.
+The judge had only the transcript plus context, no audio descriptors at all.)</pre>
     </div>
     <div class="col trained">
+      <p class="col-label">Trained (with prosody)</p>
       <p>predicted: <span class="verdict-correct">sincere</span> (conf 0.90)</p>
       <pre>&lt;think&gt;
 The prosody analysis shows that Penny's pitch is relatively high but not extremely variable, and there are no internal pauses or emphasized pauses. Her tone is neutral and slightly flat, which suggests sincerity rather than sarcasm. Additionally, her statement "No, I mean, we can always go back to being friends" doesn't contain any irony or hidden meaning that would make it sarcastic. The way she says it, with a neutral tone and without any dramatic emphasis, aligns more with sincerity.
 </div>
 <div class="clip">
+  <p class="clip-label">Clip 1_5581, speaker SHELDON, gold label <span class="gold-sincere">sincere</span><span class="pivot-tag">PIVOT SET</span></p>
   <div class="transcript">"I gargled with tequila and may have swallowed a teeny bit."</div>
   <div class="columns">
     <div class="col baseline">
+      <p class="col-label">Baseline (text-only)</p>
       <p>predicted: <span class="verdict-wrong">sarcastic</span> (conf 0.95)</p>
+      <pre>(text-only baseline doesn't generate any reasoning, it just outputs label and confidence.
+The judge had only the transcript plus context, no audio descriptors at all.)</pre>
     </div>
     <div class="col trained">
+      <p class="col-label">Trained (with prosody)</p>
       <p>predicted: <span class="verdict-correct">sincere</span> (conf 0.90)</p>
       <pre>&lt;think&gt;
 The prosody analysis shows a flat affect with no internal pauses, which suggests sincerity. Sheldon is making a comparison between Jewish and Indian parents, but he is not being sarcastic about it. The tone is matter-of-fact and informative, without any indication of sarcasm. He is simply stating facts about parental behavior without adding any ironic or mocking tone.
 </div>
 <div class="clip">
+  <p class="clip-label">Clip 1_4792, speaker SHELDON, gold label <span class="gold-sincere">sincere</span><span class="pivot-tag">PIVOT SET</span></p>
   <div class="transcript">"Obviously, we're no longer a Justice League. We have no choice but to switch to our Muppet Baby costumes."</div>
   <div class="columns">
     <div class="col baseline">
+      <p class="col-label">Baseline (text-only)</p>
       <p>predicted: <span class="verdict-wrong">sarcastic</span> (conf 0.95)</p>
+      <pre>(text-only baseline doesn't generate any reasoning, it just outputs label and confidence.
+The judge had only the transcript plus context, no audio descriptors at all.)</pre>
     </div>
     <div class="col trained">
+      <p class="col-label">Trained (with prosody)</p>
       <p>predicted: <span class="verdict-correct">sincere</span> (conf 0.90)</p>
       <pre>&lt;think&gt;
 The prosody features indicate a very flat pitch contour with minimal variability, which suggests a lack of emotional expression. Sheldon's tone is neutral and matter-of-fact, as if he's simply stating a fact. There are no emphatic pauses or rising intonations that would typically accompany sarcasm. The content of his statement, "Obviously, we're no longer a Justice League. We have no choice but to switch to our Muppet Baby costumes," is also quite neutral and doesn't contain any obvious irony or mockery. Sheldon is simply acknowledging a change in circumstances without any indication of sarcasm.
 </div>
 <div class="clip">
+  <p class="clip-label">Clip 1_1001, speaker RAJ, gold label <span class="gold-sincere">sincere</span></p>
   <div class="transcript">"What else? Sell it on eBay as &quot;slightly used.&quot;"</div>
   <div class="columns">
     <div class="col baseline">
+      <p class="col-label">Baseline (text-only)</p>
       <p>predicted: <span class="verdict-wrong">sarcastic</span> (conf 0.85)</p>
+      <pre>(text-only baseline doesn't generate any reasoning, it just outputs label and confidence.
+The judge had only the transcript plus context, no audio descriptors at all.)</pre>
     </div>
     <div class="col trained">
+      <p class="col-label">Trained (with prosody)</p>
       <p>predicted: <span class="verdict-correct">sincere</span> (conf 0.90)</p>
       <pre>&lt;think&gt;
 The prosody features indicate that Raj speaks with high energy and no internal pauses, which suggests sincerity rather than sarcasm. Additionally, his pitch contour is flat, which aligns with a sincere tone. The phrase "What else? Sell it on eBay as 'slightly used'" sounds genuine and matter-of-fact, without any indication of mockery or irony.
 <div class="footnote">
+Side-by-side built locally from <code>held_out_eval_run3.json</code> (trained model completions) and <code>baseline.json</code> (text-only Qwen with a simple prompt, generated by <code>curate_pivot_set.py</code>). Both runs use greedy decoding for determinism.
 </div>
 </body></html>
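
The page sets a 5/5 versus 0/5 tally on hand-picked clips next to roughly 51% balanced accuracy on the full 80-clip eval, so it may help to see how the two numbers relate. Below is a minimal Python sketch, not the project's code. It assumes balanced accuracy means the mean of per-class recall (the usual definition; the commit does not state it) and that "this pattern" means the baseline is wrong while the trained model is right. The function names `balanced_accuracy` and `prosody_decisive` are hypothetical, and because the schema of `held_out_eval_run3.json` and `baseline.json` is not shown, the sketch works on plain label lists copied from the five clips in the diff.

```python
from collections import defaultdict


def balanced_accuracy(gold, pred):
    """Mean of per-class recall; 0.5 is chance level with two classes."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for g, p in zip(gold, pred):
        totals[g] += 1
        hits[g] += int(g == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def prosody_decisive(gold, baseline_pred, trained_pred):
    """Indices where the baseline is wrong and the trained model is right
    (assumed meaning of 'this pattern' in the page summary)."""
    return [
        i
        for i, (g, b, t) in enumerate(zip(gold, baseline_pred, trained_pred))
        if b != g and t == g
    ]


# Labels for the five clips shown in the diff, in page order.
gold = ["sarcastic", "sincere", "sincere", "sincere", "sincere"]
baseline = ["sincere", "sarcastic", "sarcastic", "sarcastic", "sarcastic"]
trained = ["sarcastic", "sincere", "sincere", "sincere", "sincere"]

print(balanced_accuracy(gold, baseline))           # 0.0
print(balanced_accuracy(gold, trained))            # 1.0
print(prosody_decisive(gold, baseline, trained))   # [0, 1, 2, 3, 4]
```

Since the five clips were chosen by a filter like `prosody_decisive`, the perfect score on them is a consequence of the selection. The 80-clip figure is the one that describes overall performance.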