Space: aamrinder/subtext-arena

aamrinder committed 29 days ago

Commit 60b425a (verified) · Parent: 9f43137

cleanup: remove stale dev/abandoned files

Files changed (18)

SUBMISSION_CHECKLIST.md +0 -174
data/baseline.json +0 -0
data/candidates.json +0 -2308
data/held_out_eval_run1.json +0 -890
data/trainer_state_run3.json +0 -0
docs/side_by_side.json +0 -92
openenv_subtext_arena.egg-info/PKG-INFO +0 -9
openenv_subtext_arena.egg-info/SOURCES.txt +0 -26
openenv_subtext_arena.egg-info/dependency_links.txt +0 -1
openenv_subtext_arena.egg-info/entry_points.txt +0 -2
openenv_subtext_arena.egg-info/requires.txt +0 -5
openenv_subtext_arena.egg-info/top_level.txt +0 -1
train/build_side_by_side_local.py +0 -144
train/hour1_smoke.py +0 -179
train/plot_reward_decomp.py +0 -122
train/plot_run3.py +0 -156
train/sft_warmup.py +0 -365
train/side_by_side.py +0 -247

SUBMISSION_CHECKLIST.md DELETED

@@ -1,174 +0,0 @@
-# Subtext Arena — Submission Checklist
-Cross-checked against the official Apr 25 deck verbatim. Every line is grounded in a concrete artifact in this repo.
-## Theme alignment
-**Theme: #3 — World Modeling**
-Deck text: *"environments that require real interaction with tools, APIs, or dynamic systems where the model is expected to do real hard work instead of exploiting short-cuts… maintain consistent internal state, update beliefs based on outcomes."*
-Mapping to Subtext Arena:
-- "real interaction with tools" → 4 tools (`get_transcript`, `get_prosody_features`, `get_pitch_contour`, `submit_belief`)
-- "real hard work instead of short-cuts" → Pivot Set + the −0.50 hard penalty *prevent* text-only shortcuts
-- "update beliefs based on outcomes" → agent updates its sarcastic/sincere belief based on prosody features it queries
-- "orchestrate multi-step workflows" → up to 6 sequential tool calls per episode
----
-## Minimum (non-negotiable) submission requirements
-| # | Deck requirement (verbatim) | Status | Artifact / evidence |
-|---|---|---|---|
-| 1 | *"Use OpenEnv (latest release)"* | ✅ DONE | [pyproject.toml](pyproject.toml) line 20: `"openenv-core[core]>=0.2.3"` (latest is v0.2.3 per deck) |
-| 2 | *"A working training script using Unsloth or HF TRL, ideally as a Colab notebook so judges can re-run it"* | ✅ DONE | Both: [train/train_grpo.py](train/train_grpo.py) (Unsloth `FastLanguageModel` + TRL `GRPOTrainer`) AND [notebooks/train_grpo_colab.ipynb](notebooks/train_grpo_colab.ipynb) (judges-rerunnable Colab) |
-| 3 | *"Evidence that you actually trained; at minimum, loss and reward plots from a real run"* | ⏸ Will fill from training run | [train/plot_reward_decomp.py](train/plot_reward_decomp.py) generates the 4-line stacked PNG into [docs/plots/](docs/plots/); training run produces the JSONL log it consumes |
-| 4 | *"Short writeup: a mini-blog on HF or <2-min YouTube or short slide deck — linked from README"* | ⏸ Deferred per user (will produce after training) | README has placeholder slot; will be linked once filmed |
-| 5 | *"Push your environment to a HF Space so it's discoverable and runnable"* | ⏸ Pending HF auth | [openenv.yaml](openenv.yaml) is valid; `openenv push` ready; just needs `huggingface-cli login` |
-| 6 | *"A README that motivates the problem, explains how the env works, and shows results"* | ✅ DONE | [README.md](README.md) is 3-5 min read with Problem → Environment → Reward → How to run → Results (placeholder) → Why it matters |
-| 7 | *"README should have a link to the HF Space + all materials (videos, blogs, slides…)"* | ⏳ Has placeholders to fill | README links use `<your-username>` placeholder; gets replaced once `openenv push` completes |
-| 8 | *"No big videos in HF Hub repo — link out instead"* | ✅ N/A | No video files in repo. Only the 5.9 MB `data/` (prosody JSONs + sarcasm labels) which is small. |
----
-## Engineering hygiene (table-stakes per deck)
-| # | Deck requirement | Status | Artifact / evidence |
-|---|---|---|---|
-| 9 | *"Use OpenEnv's `Environment` / `MCPEnvironment` base classes properly"* | ✅ | [server/subtext_arena_environment.py](server/subtext_arena_environment.py) line 30: `class SubtextArenaEnvironment(Environment)` |
-| 10 | *"Respect client/server separation — clients should never import server internals"* | ✅ | [client.py](client.py) imports only from `.models`; never from `.server`. [server/app.py](server/app.py) imports only from `.models` and `.subtext_arena_environment`. |
-| 11 | *"Standard Gym-style API (reset, step, state)"* | ✅ | All three methods present: `reset() -> Observation`, `step(action) -> Observation`, `state` property → `State` |
-| 12 | *"Valid `openenv.yaml` manifest"* | ✅ | `openenv validate --verbose` reports `[OK] subtext_arena: Ready for multi-mode deployment` (all 4 modes: docker, openenv_serve, uv_run, python_module) |
-| 13 | *"Don't use reserved tool names (reset, step, state, close) for MCP tools"* | ✅ | Our 4 tools are `get_transcript`, `get_prosody_features`, `get_pitch_contour`, `submit_belief`. None of the reserved names. |
----
-## What makes a submission stand out (deck criteria, weighted by judging %)
-### 40% Environment Innovation
-Deck asks:
-- *"Does this environment exist to teach an LLM something it currently can't do well?"* — ✅ GPT-4o sits at 67.47% Macro-F1 on MUStARD++; the LISTEN paper (Oct 2025) documents the audio-LLM gap. Cited in README.
-- *"Is the domain underexplored in RL/LLM training?"* — ✅ Closest prior work: AudioToolAgent (Oct 2025, prompted not RL-trained), EmotionThinker (Mar 2026, emotion not subtext), SarcasmMiner (Mar 2026, video+audio not pure-audio). The pure-audio + subtext + RL environment combo is unclaimed.
-- *"Could a researcher write a paper about training on this?"* — ✅ Yes — RL-trained orchestrator over audio tools is a clean differentiation from AudioToolAgent's prompted baseline.
-### 30% Storytelling & Presentation
-Deck asks:
-- *"Can you clearly explain the problem, the environment, and what the agent learned?"* — ✅ README structured as Problem → Environment → Reward → Results → Why it matters
-- *"Is the demo engaging and easy to follow for a non-technical audience?"* — Per [research/subtext_solutions/demo_format.md](../research/subtext_solutions/demo_format.md): single audio clip plays once at slide 1, then everything is text side-by-side (judges *read* not *hear*). Headline number on Pivot Set.
-### 20% Showing Improvement in Rewards
-Deck asks:
-- *"Reward curves, before/after behavior, comparison against a baseline?"* — ✅ planned via [train/plot_reward_decomp.py](train/plot_reward_decomp.py) (4-line stacked chart) + [train/eval_pivot_set.py](train/eval_pivot_set.py) (`X/50 → Y/50` headline) + side-by-side step-0/step-N transcripts on 5 hand-picked clips.
-### 10% Reward & Training Pipeline
-Deck asks:
-- *"Reward logic coherent? Pipeline produces meaningful improvement?"* — ✅
-  - Multi-criterion reward in [server/grader.py](server/grader.py): correctness 0.50 + prosody_grounding 0.25 + tool_parsimony 0.15 + format_ok 0.10
-  - 3 hard penalties: −0.30 no submission, −0.20 spam (>5 calls), −0.50 Pivot+no-audio+wrong (the anti-shortcut clause)
-  - Per-step deltas: +0.05 audio call, +0.02 transcript, −0.10 unknown tool, −0.05 malformed args
-  - Hard to game: Pivot Set explicitly requires audio tool use to score
-  - Composable: each component logged separately so judges see *what* the agent learned in *what order*
----
-## Verified artifacts inventory
-```
-subtext_arena/                        # 5.9 MB total (small enough for HF Space)
-├── README.md                         # 3-5 min story, links to all artifacts
-├── SUBMISSION_CHECKLIST.md           # this file
-├── openenv.yaml                      # valid manifest (validated)
-├── pyproject.toml                    # openenv-core>=0.2.3
-├── uv.lock                           # locked deps
-├── __init__.py                       # exports SubtextArenaEnv, Action, Observation
-├── client.py                         # SubtextArenaEnv extends EnvClient
-├── models.py                         # Action(tool, tool_args) + Observation
-├── data/
-│   ├── sarcasm_data.json             # 690 MUStARD ground truth (250 KB)
-│   └── prosody_cache/                # 1380 prosody JSONs (5.4 MB)
-├── server/
-│   ├── app.py                        # FastAPI factory via create_app(...)
-│   ├── audio_tools.py                # transcript/prosody/pitch-contour renderers
-│   ├── scenarios.py                  # MUStARD loader + Pivot oversampling
-│   ├── grader.py                     # per-step + final-reward grader
-│   ├── subtext_arena_environment.py  # Environment subclass; episode loop
-│   ├── Dockerfile                    # multi-stage build using openenv-base
-│   └── requirements.txt
-├── train/
-│   ├── train_grpo.py                 # Unsloth + TRL GRPO training (mandatory)
-│   ├── curate_pivot_set.py           # 4-step Pivot Set curation pipeline
-│   ├── eval_pivot_set.py             # checkpoint eval → headline number
-│   └── plot_reward_decomp.py         # 4-line stacked reward chart
-├── notebooks/
-│   └── train_grpo_colab.ipynb        # judges-rerunnable Colab notebook
-└── docs/
-    └── plots/                        # filled by training runs (committed PNGs)
-```
----
-## What's blocked vs. what's done
-| Action | Blocker | Cost when run |
-|---|---|---|
-| Push env to HF Space | `huggingface-cli login` | $0 (5 min wall) |
-| Run hour-1 smoke test | HF token + user authorization | ~$0.60 (T4-medium 1h) |
-| Run baseline for Pivot Set candidates | HF token | ~$0.50 (T4 40 min) |
-| Curate Pivot Set with listener UI | Day 1 onsite (3 listeners + Gradio) | $0 (CPU) |
-| Run #1 (200 GRPO steps) | Pivot Set must exist first | ~$8 (T4 12h) |
-| Run #2 (final, 250-300 steps) | Run #1 results review | ~$10 (T4 16h) |
-| Generate plots | Training logs from Run #2 | $0 (CPU) |
-| Demo video / blog / slide deck | Per user: deferred until everything else confirmed | $0 |
-**Total compute**: ~$19 of $60 budget. **$41 buffer** for re-runs, seed sweeps, or extended training.
----
-## Pre-onsite ✅ checklist (everything that's been done, no GPU spent)
-- [x] MUStARD downloaded and audio extracted (1380 mp4s → 1380 WAVs in 10 sec)
-- [x] Prosody features pre-computed (1380 JSONs in 118 sec)
-- [x] OpenEnv scaffold via `openenv init subtext_arena`
-- [x] All env code wired and end-to-end-tested locally (in-process + over HTTP)
-- [x] `openenv validate --verbose` passes for all 4 deployment modes
-- [x] Unsloth+TRL training script written and parses cleanly
-- [x] Colab notebook written
-- [x] Pivot Set curation pipeline written (baseline / filter / ui / finalize)
-- [x] Eval script written
-- [x] Reward-decomposition plot script written
-- [x] README story written (3-5 min read)
-- [x] Submission checklist (this file) written
-## Day 1 onsite (Apr 25, 11 AM IST) ⏳
-- [ ] Claim HF $30 credits + Cursor credits via Scaler dashboard
-- [ ] Push env to HF Space (`openenv push --repo-id aamrinder/subtext-arena`)
-- [ ] Replace placeholder URLs in README with real HF Space URL
-- [ ] Run hour-1 smoke test on HF Jobs T4 (proves Unsloth+TRL+Qwen2.5-3B chain works)
-- [ ] Run baseline (`python train/curate_pivot_set.py baseline`) on HF Jobs T4 (~40 min)
-- [ ] Filter candidates + curate 50 Pivot clips with team listening
-- [ ] **Mentor Round 1 (3:30-4:30 PM)** — confirm theme + technical core
-- [ ] Push pivot_set.json to HF Space
-- [ ] **Mentor Round 2 (8-10 PM)** — sanity-check training plan
-- [ ] Launch Run #1 (200 GRPO steps, ~12h overnight)
-## Day 1 night (overnight) ⏳
-- [ ] Monitor reward curves; if `n_audio_calls_avg < 0.3` for 50 steps → halt + apply failure-mode B1-B5
-## Day 2 onsite (Apr 26, before 5 PM IST submission) ⏳
-- [ ] Eval Run #1 on Pivot Set; decide: re-run with tweaks OR keep
-- [ ] **Mentor Round 3 (10-12 AM)** — final feedback on demo
-- [ ] Run #2 (250-300 steps, optimized config)
-- [ ] Generate reward-decomposition PNG and side-by-side transcripts
-- [ ] Commit all PNGs to `docs/plots/`
-- [ ] Cut <2-min YouTube demo OR slide deck
-- [ ] Update README with real numbers + video link
-- [ ] Final HF Space push
-- [ ] Run pre-submission validator (if Scaler provides one)
-- [ ] Submit URL via hackathon dashboard before 5:00 PM IST
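
Review note on the deleted checklist: its reward description (four weighted components, three hard penalties, and per-step deltas) can be read as the following minimal sketch. The weights and penalty values come from the checklist text. The `Episode` fields, the additive way penalties combine with the weighted sum, the [0, 1] scoring of the grounding and parsimony components, and the reading of "wrong" as a submitted incorrect answer are all assumptions made for illustration; this is not the code in `server/grader.py`.

```python
from dataclasses import dataclass

# Weights and penalty values are taken from the checklist text; field names
# and the additive combination of penalties are assumptions.
WEIGHTS = {
    "correctness": 0.50,
    "prosody_grounding": 0.25,
    "tool_parsimony": 0.15,
    "format_ok": 0.10,
}
AUDIO_TOOLS = {"get_prosody_features", "get_pitch_contour"}
KNOWN_TOOLS = AUDIO_TOOLS | {"get_transcript", "submit_belief"}


@dataclass
class Episode:  # hypothetical summary of one finished episode
    submitted: bool
    correct: bool
    is_pivot: bool
    n_calls: int
    n_audio_calls: int
    prosody_grounding: float  # assumed to be scored in [0, 1]
    tool_parsimony: float     # assumed to be scored in [0, 1]
    format_ok: bool


def step_delta(tool: str, args_ok: bool = True) -> float:
    """Per-step shaping reward for a single tool call."""
    if tool not in KNOWN_TOOLS:
        return -0.10  # unknown tool
    if not args_ok:
        return -0.05  # malformed arguments
    if tool in AUDIO_TOOLS:
        return 0.05
    if tool == "get_transcript":
        return 0.02
    return 0.0  # submit_belief: no per-step bonus assumed


def final_reward(ep: Episode) -> dict:
    """Return each reward component separately, plus penalties and total."""
    parts = {
        "correctness": WEIGHTS["correctness"] * float(ep.submitted and ep.correct),
        "prosody_grounding": WEIGHTS["prosody_grounding"] * ep.prosody_grounding,
        "tool_parsimony": WEIGHTS["tool_parsimony"] * ep.tool_parsimony,
        "format_ok": WEIGHTS["format_ok"] * float(ep.format_ok),
    }
    penalty = 0.0
    if not ep.submitted:
        penalty -= 0.30  # no submission
    if ep.n_calls > 5:
        penalty -= 0.20  # tool spam
    # Anti-shortcut clause: Pivot clip, no audio tool used, wrong answer.
    # "Wrong" is assumed to mean a submitted but incorrect belief.
    if ep.is_pivot and ep.n_audio_calls == 0 and ep.submitted and not ep.correct:
        penalty -= 0.50
    total = sum(parts.values()) + penalty
    parts["penalties"] = penalty
    parts["total"] = total
    return parts
```

Returning the components in a dictionary rather than a single float mirrors the checklist's point that each component is logged separately, so a plot script can draw one line per component.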

data/baseline.json DELETED

The diff for this file is too large to render.

data/candidates.json DELETED
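
Review note on this file: in the records of the hunk that follows, the stored `confidence` is 0.5 wherever `raw_output` writes the value as `Float: 0.95` (or `Float: 0.9`, `Float: 1.0`), while bare values such as `0.85` are stored as written. This suggests, as an inference from the data rather than anything the repository states, that the original parser fell back to a default when the second line was not a bare number. A more tolerant parser could look like the sketch below; `parse_prediction` is a hypothetical helper, not a function from this repo.

```python
import re

# Accepts a bare float ("0.85") or a labelled one ("Float: 0.95").
_CONF_LINE = re.compile(r"^(?:\w+\s*:\s*)?(\d*\.?\d+)\s*$")
_LABELS = ("sarcastic", "sincere")


def parse_prediction(raw_output: str, default_conf: float = 0.5):
    """Split a model reply into (label, confidence).

    The first non-empty line is read as the label and the second as the
    confidence. The label is None if it is not a known class; the
    confidence falls back to default_conf if the line cannot be parsed.
    """
    lines = [ln.strip() for ln in raw_output.splitlines() if ln.strip()]
    label = lines[0].lower() if lines else None
    if label not in _LABELS:
        label = None
    conf = default_conf
    if len(lines) > 1:
        match = _CONF_LINE.match(lines[1])
        if match:
            # Clamp to [0, 1] in case the model emits an out-of-range value.
            conf = min(1.0, max(0.0, float(match.group(1))))
    return label, conf
```

The pattern only accepts a line that consists of a number with an optional `word:` prefix, so a reasoning sentence that happens to contain a digit is not mistaken for a confidence value.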

@@ -1,2308 +0,0 @@
-{
-  "candidates": [
-    {
-      "clip_id": "1_80",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that Leonard is being sarcastic about Howard's concern for peanuts,"
-    },
-    {
-      "clip_id": "1_90",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line contains a detailed and exaggerated explanation of a fictional scene, which is"
-    },
-    {
-      "clip_id": "1_162",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that the conversation is about someone's level of arousal"
-    },
-    {
-      "clip_id": "1_410",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line contains a clear reference to Sheldon and Leonard's parents being presented as"
-    },
-    {
-      "clip_id": "1_430",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line sounds dismissive and dismissive of Raj's efforts to find new"
-    },
-    {
-      "clip_id": "1_467",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Sheldon is making a comparison between a real"
-    },
-    {
-      "clip_id": "1_506",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The response \"Not to mention you'd have to power down on"
-    },
-    {
-      "clip_id": "1_533",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: Leonard's response is clearly sarcastic because he uses \"technically\""
-    },
-    {
-      "clip_id": "1_672",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Sheldon is expressing his desire for a simple meal (grilled"
-    },
-    {
-      "clip_id": "1_1001",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line is clearly sarcastic because Raj's response implies that he's trying"
-    },
-    {
-      "clip_id": "1_1185",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: Leonard's comment about it being a \"very good start\" is"
-    },
-    {
-      "clip_id": "1_1189",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: Sheldon's response is clearly sarcastic because he is pretending to be offended"
-    },
-    {
-      "clip_id": "1_1262",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line is sarcastic because it implies that Leonard's explanation was unclear or"
-    },
-    {
-      "clip_id": "1_1470",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests Penny is seeking help from her friends, which could be genuine or"
-    },
-    {
-      "clip_id": "1_1484",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Sheldon is upset about someone touching his food,"
-    },
-    {
-      "clip_id": "1_1627",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. The speaker is mocking Leonard for his concern"
-    },
-    {
-      "clip_id": "1_1666",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\nThe context suggests that Sheldon is making a rather ridiculous and overly scientific"
-    },
-    {
-      "clip_id": "1_1678",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Raj is being sarcastic by asking Leonard about"
-    },
-    {
-      "clip_id": "1_1722",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is sarcastic because Sheldon is attributing his elevated blood pressure to"
-    },
-    {
-      "clip_id": "1_2075",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Leonard is being sarcastic when he says \""
-    },
-    {
-      "clip_id": "1_2216",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests Howard is making a joke about his ability to communicate with women when"
-    },
-    {
-      "clip_id": "1_2361",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context indicates gratitude from [PERSON] for help received. The target"
-    },
-    {
-      "clip_id": "1_2420",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line is sarcastic because Sheldon's response to Penny's suggestion of Spider"
-    },
-    {
-      "clip_id": "1_2423",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nExplanation: Sheldon's response is clearly sarcastic because it's an"
-    },
-    {
-      "clip_id": "1_2575",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The context shows Sheldon being sarcastic earlier when he said \"Maybe"
-    },
-    {
-      "clip_id": "1_2664",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests that Sheldon is being sarcastic when Leonard says \"Terr"
-    },
-    {
-      "clip_id": "1_2669",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: The context suggests that Leonard is trying to be accommodating by waking Penny up"
-    },
-    {
-      "clip_id": "1_2792",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Penny's statement \"No, I mean"
-    },
-    {
-      "clip_id": "1_2837",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: Leonard's response is clearly sarcastic because he initially seemed to be teasing"
-    },
-    {
-      "clip_id": "1_2842",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that the speaker is inquiring about others' hobbies or interests."
-    },
-    {
-      "clip_id": "1_3069",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Sheldon is poking fun at Leonard's attempt to"
-    },
-    {
-      "clip_id": "1_3177",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests Raj is being sarcastic when he says \"Not a"
-    },
-    {
-      "clip_id": "1_3204",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: Raj's comment is clearly sarcastic because \"sweety and"
-    },
-    {
-      "clip_id": "1_3333",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests that Sheldon has been engaging in somewhat dismissive or con"
-    },
-    {
-      "clip_id": "1_3348",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Bernadette's comment about Raj seeing Leonard's experiment is followed"
-    },
-    {
-      "clip_id": "1_3371",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that [PERSON] is expressing gratitude for the driver's assistance,"
-    },
-    {
-      "clip_id": "1_3476",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Sheldon is expressing his dislike for Windows Vista by saying it's"
-    },
-    {
-      "clip_id": "1_3599",
-      "predicted": "sarcastic",
-      "confidence": 0.8,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.8\n\nThe context suggests a lighthearted, somewhat mocking tone among the characters."
-    },
-    {
-      "clip_id": "1_3649",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Penny is trying to invite Sheldon to her house, which he"
-    },
-    {
-      "clip_id": "1_3840",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe target line is clearly sarcastic because Sheldon is using a seemingly factual"
-    },
-    {
-      "clip_id": "1_3842",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Sheldon is mocking Howard's suggestion by implying he's not going"
-    },
-    {
-      "clip_id": "1_3911",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line is clearly sarcastic because it acknowledges Leonard's gesture as \"thought"
-    },
-    {
-      "clip_id": "1_4145",
-      "predicted": "sincere",
-      "confidence": 0.5,
-      "gold": "sarcastic",
-      "raw_output": "sincere\nFloat: 1.0\n\nThe context suggests that Sheldon is being sarcastic about his bird-trapping device"
-    },
-    {
-      "clip_id": "1_4281",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: Howard's response sounds like he's trying to deflect or downplay"
-    },
-    {
-      "clip_id": "1_4286",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests that Sheldon is mocking Raj for suggesting he should buy his"
-    },
-    {
-      "clip_id": "1_4290",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that Penny is expressing her discomfort or distrust towards Howard due to his"
-    },
-    {
-      "clip_id": "1_4352",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Sheldon is making a joke about Leonard being socially"
-    },
-    {
-      "clip_id": "1_4477",
-      "predicted": "sarcastic",
-      "confidence": 0.25,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.25\n\nThe target line seems sarcastic because it follows a series of formal introductions and"
-    },
-    {
-      "clip_id": "1_4544",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because it implies that Amy is not interested in"
-    },
-    {
-      "clip_id": "1_4552",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests that Sheldon is making a sarcastic comment about Leonard and"
-    },
-    {
-      "clip_id": "1_4603",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line is clearly sarcastic because Howard is making light of Penny's accidental"
-    },
-    {
-      "clip_id": "1_4639",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The context suggests that Penny is not impressed by the stories about the"
-    },
-    {
-      "clip_id": "1_4792",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because it implies that Sheldon's statement about not"
-    },
-    {
-      "clip_id": "1_4850",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line contains sarcasm because Sheldon is pretending to be understanding and appreciative"
-    },
-    {
-      "clip_id": "1_4949",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line \"[RAJ] I think they <i>are</i"
-    },
-    {
-      "clip_id": "1_5109",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. The speaker (Person) is responding to"
-    },
-    {
-      "clip_id": "1_5169",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Amy is trying to soften her previous statement"
-    },
-    {
-      "clip_id": "1_5212",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests a playful and somewhat teasing tone between PERSON1 and"
-    },
-    {
-      "clip_id": "1_5581",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Howard is clearly poking fun at the idea of dating someone who isn"
-    },
-    {
-      "clip_id": "1_5617",
-      "predicted": "sincere",
-      "confidence": 0.5,
-      "gold": "sarcastic",
-      "raw_output": "sincere\nFloat: 1.0\n\nThe target line \"is an event that's like a heart attack\" is"
-    },
-    {
-      "clip_id": "1_5751",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Bernadette's previous statements and tone"
-    },
-    {
-      "clip_id": "1_5786",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic. Sheldon is mocking the idea"
-    },
-    {
-      "clip_id": "1_5919",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Raj is being sarcastic when he says \"That's cruel"
-    },
-    {
-      "clip_id": "1_5953",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Leonard is making light of Sheldon's comment about"
-    },
-    {
-      "clip_id": "1_6039",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The context shows a series of light-hearted and playful comments about Sheldon"
-    },
-    {
-      "clip_id": "1_6092",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Leonard's response is mocking and dismissive"
-    },
-    {
-      "clip_id": "1_6113",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Penny is asking for her friend's opinion on"
-    },
-    {
-      "clip_id": "1_6183",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: Sheldon's tone and context suggest sarcasm. He's complaining about"
-    },
-    {
-      "clip_id": "1_6211",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: Sheldon's response is clearly sarcastic because he uses \"Oh, how"
-    },
-    {
-      "clip_id": "1_6534",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Sheldon's response is clearly sarcastic because he's implying that Amy"
-    },
-    {
-      "clip_id": "1_6627",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Penny has been asking Leonard out multiple times, implying she's"
-    },
-    {
-      "clip_id": "1_6645",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic. The phrase \"Well,"
-    },
-    {
-      "clip_id": "1_6766",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Sheldon is implying that he's not being"
-    },
-    {
-      "clip_id": "1_6835",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic. Amy's response \"Oh"
-    },
-    {
-      "clip_id": "1_6860",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: Sheldon is clearly mocking Penny's offer of hair-cutting services, which"
-    },
-    {
-      "clip_id": "1_7082",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Raj is reminiscing about past experiences with a"
-    },
-    {
-      "clip_id": "1_7186",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe target line is clearly sarcastic. Howard's statement \"Not exactly"
-    },
-    {
-      "clip_id": "1_7341",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because Sheldon is pretending to be communicating with someone"
-    },
-    {
-      "clip_id": "1_7400",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Penny's response is mocking Sheldon's knowledge"
-    },
-    {
-      "clip_id": "1_7442",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic. Sheldon is pretending to be"
-    },
-    {
-      "clip_id": "1_7443",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe target line is sarcastic because Penny is implying that their involvement with"
-    },
-    {
-      "clip_id": "1_7445",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic. Penny's statement is meant"
-    },
-    {
-      "clip_id": "1_7449",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests Sheldon is being sarcastic when Raj mentions Howard's \""
-    },
-    {
-      "clip_id": "1_7487",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: Raj's response is clearly sarcastic because he's acknowledging Howard's height"
-    },
-    {
-      "clip_id": "1_7490",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic. Howard is using a metaphor"
-    },
-    {
-      "clip_id": "1_7494",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line is clearly sarcastic because it implies that the statement about being \""
-    },
-    {
-      "clip_id": "1_7504",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe target line is clearly sarcastic because it contrasts the serious nature of"
-    },
-    {
-      "clip_id": "1_7593",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Sheldon is trying to justify his behavior by emphasizing Wil Wheaton"
-    },
-    {
-      "clip_id": "1_7641",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line contains a clear contradiction between Howard's previous statement that he feels responsible"
-    },
-    {
-      "clip_id": "1_7661",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nExplanation: Sheldon's response is clearly sarcastic. He is mocking"
-    },
-    {
-      "clip_id": "1_7675",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The target line continues the playful and somewhat mocking tone established by Amy"
-    },
-    {
-      "clip_id": "1_7894",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic because Raj's comment about seeking"
-    },
-    {
-      "clip_id": "1_7938",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests a lighthearted and playful tone among the characters,"
-    },
-    {
-      "clip_id": "1_8042",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Raj is being sarcastic when he brings up the broken wheel"
-    },
-    {
-      "clip_id": "1_8078",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because Howard is making a derogatory comment about"
-    },
-    {
-      "clip_id": "1_8136",
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "gold": "sarcastic",
-      "raw_output": "sincere\n0.9\n\nThe context suggests Raj is making a sarcastic comment about Valentine's Day being a time for"
-    },
-    {
-      "clip_id": "1_8159",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Leonard's statement about being \"cool with"
-    },
-    {
-      "clip_id": "1_8185",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Bernadette's suggestion of going to"
-    },
-    {
-      "clip_id": "1_8217",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests Raj is being serious about his mental health issues, which he considers"
-    },
-    {
-      "clip_id": "1_8417",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests Raj is feeling uncomfortable or suspicious about something related to"
-    },
-    {
-      "clip_id": "1_8505",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic. Raj is implying that Howard"
-    },
-    {
-      "clip_id": "1_8837",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Sheldon's comment is clearly sarcastic because he's expressing relief about"
-    },
-    {
-      "clip_id": "1_9003",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic because Leonard's earlier statement about"
-    },
-    {
-      "clip_id": "1_9009",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line contains a mocking tone that suggests Sheldon is being sarcastic about his"
-    },
-    {
-      "clip_id": "1_9245",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nExplanation: The target line is sarcastic because Sheldon is using a dismissive and"
-    },
-    {
-      "clip_id": "1_9594",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Sheldon has been discussing WonderCon, which Penny finds to be"
-    },
-    {
-      "clip_id": "1_9668",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nExplanation: The context indicates that Sheldon is referring to Amy's comment"
-    },
-    {
-      "clip_id": "1_9963",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line is clearly sarcastic because it implies that Amy would not actually commit"
-    },
-    {
-      "clip_id": "1_9993",
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "gold": "sarcastic",
-      "raw_output": "sincere\n0.9\n\nThe context suggests that Raj is making a dramatic statement about time travel, which he seems to"
-    },
-    {
-      "clip_id": "1_10004",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because it implies that the person has been engaging"
-    },
-    {
-      "clip_id": "1_10009",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Penny's response \"Let the dead man"
-    },
-    {
-      "clip_id": "1_10797",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Amy's comment about her relationship status and"
-    },
-    {
-      "clip_id": "1_10810",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Sheldon is jokingly suggesting they invest in a"
-    },
-    {
-      "clip_id": "1_10849",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests Howard is trying to avoid talking about something that happened in the past"
-    },
-    {
-      "clip_id": "1_10853",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy confidence is high because Penny's response \"Okay, sit down. Listen,"
-    },
-    {
-      "clip_id": "1_10857",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.94\n\nThe target line is clearly sarcastic. Raj is complimenting Penny"
-    },
-    {
-      "clip_id": "1_10859",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe target line is clearly sarcastic because Raj has just made a comment"
-    },
-    {
-      "clip_id": "1_10977",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because Howard is trying to change the subject by"
-    },
-    {
-      "clip_id": "1_11042",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic because it directly mocks the idea"
-    },
-    {
-      "clip_id": "1_11046",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because it directly references the famous Star Trek phrase"
-    },
-    {
-      "clip_id": "1_11051",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic. The person is pretending that"
-    },
-    {
-      "clip_id": "1_11120",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: Sheldon's response implies that he's being sarcastic about having \""
-    },
-    {
-      "clip_id": "1_11201",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: The target line is clearly sarcastic because Amy's comment about the dashboard"
-    },
-    {
-      "clip_id": "1_11236",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Sheldon's tone and the context suggest he"
-    },
-    {
-      "clip_id": "1_11253",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic. Sheldon's statement \"So"
-    },
-    {
-      "clip_id": "1_11257",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests Raj is being sarcastic about finding old items in"
-    },
-    {
-      "clip_id": "1_11485",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context sets up Sheldon's over-the-top list of things he has to make"
-    },
-    {
-      "clip_id": "1_11526",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: Sheldon's statement is clearly sarcastic because he is using the topic of"
-    },
-    {
-      "clip_id": "1_11529",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Sheldon is engaging in somewhat inappropriate or immature conversation"
-    },
-    {
-      "clip_id": "1_11609",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nExplanation: Leonard's response \"Me-wow\" is likely sarcastic because it seems"
-    },
-    {
-      "clip_id": "1_11723",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Raj's response is clearly sarcastic because it contradicts Amy's"
-    },
-    {
-      "clip_id": "1_11736",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line seems sarcastic because it implies that the speaker (likely Bernad"
-    },
-    {
-      "clip_id": "1_11889",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line is clearly sarcastic because Bernadette's response implies that they"
-    },
-    {
-      "clip_id": "1_11928",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because it implies that Amy, who has a"
-    },
-    {
-      "clip_id": "1_12002",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because Sheldon is using hyperbolic language (\""
-    },
-    {
-      "clip_id": "1_12083",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because Howard's exaggerated and mocking description of drinking"
-    },
-    {
-      "clip_id": "1_12275",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: The response is clearly sarcastic because it directly contradicts the initial"
-    },
-    {
-      "clip_id": "1_12320",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: The context suggests that Penny is trying to avoid making a decision about"
-    },
-    {
-      "clip_id": "2_477",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe second line is clearly sarcastic because Chandler is mocking Rachel's"
-    },
-    {
-      "clip_id": "2_35",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that the speaker is expressing disappointment or frustration about being"
-    },
-    {
-      "clip_id": "2_595",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: Chandler's line is clearly meant to be ironic or mocking, given"
-    },
-    {
-      "clip_id": "2_574",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: Joey is clearly mocking Chandler's claim that he \"did\" something ("
-    },
-    {
-      "clip_id": "2_409",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: The target line \"I'll catch you guys later\" is delivered"
-    },
-    {
-      "clip_id": "2_472",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests Ross is expressing a desire for marriage, which could"
-    },
-    {
-      "clip_id": "2_322",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests that PHOEBE and RACHEL are reacting"
-    },
-    {
-      "clip_id": "2_124",
-      "predicted": "sincere",
-      "confidence": 0.5,
-      "gold": "sarcastic",
-      "raw_output": "sincere\nFloat: 1.0\n\nThe context suggests that Monica is confident about their performance in a \"lightning"
-    },
-    {
-      "clip_id": "2_380",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line is clearly sarcastic. Chandler's response is mocking Joey's excitement"
-    },
-    {
-      "clip_id": "2_160",
-      "predicted": "sarcastic",
-      "confidence": 0.9,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.9\n\nThe response is clearly sarcastic because Chandler's comment about \"his penis being too big"
-    },
-    {
-      "clip_id": "2_49",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests Joey is being sarcastic when he says \"We got"
-    },
-    {
-      "clip_id": "2_428",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: Chandler's response \"His legs flail about as if independent from"
-    },
-    {
-      "clip_id": "2_1",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that Chandler is trying to backtrack or undo something, possibly referring to"
-    },
-    {
-      "clip_id": "2_268",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: Chandler's description of a \"lonely grey couch\" followed by Ned"
-    },
-    {
-      "clip_id": "2_221",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests Joey is reacting to something, likely a joke or"
-    },
-    {
-      "clip_id": "2_137",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: Ross's instruction implies that Chandler is pretending to be there for a"
-    },
-    {
-      "clip_id": "2_20",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The response \"My duties? All right.\" is a dismissive"
-    },
-    {
-      "clip_id": "2_463",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nExplanation: Joey's response \"Ok!\" after Rachel's dismissive \"Forget it"
-    },
-    {
-      "clip_id": "2_602",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests Monica is expressing disappointment or frustration about the closure of a bar,"
-    },
-    {
-      "clip_id": "2_88",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests that Rachel is surprised or perhaps annoyed to see Ross,"
-    },
-    {
-      "clip_id": "2_86",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests that Phoebe is expressing concern about her spiritual fate"
-    },
-    {
-      "clip_id": "2_154",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Phoebe is exaggerating about the contents of a bucket"
-    },
-    {
-      "clip_id": "2_557",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because Ross's reaction \"Oh my God,"
-    },
-    {
-      "clip_id": "2_87",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Joey is making a sarcastic comment about the difficulty of not"
-    },
-    {
-      "clip_id": "2_240",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Joey's statement \"No!\" is likely sarcastic because it"
-    },
-    {
-      "clip_id": "2_267",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nYour reasoning needs adjustment. The context suggests Rachel is making a light-hearted comment about"
-    },
-    {
-      "clip_id": "2_284",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that Chandler's excitement about the phrase \"I do\" is part"
-    },
-    {
-      "clip_id": "2_340",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line is clearly sarcastic because it follows a series of serious comments about"
-    },
-    {
-      "clip_id": "2_536",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nExplanation: The context suggests that Ross is expressing gratitude or appreciation for having only one"
-    },
-    {
-      "clip_id": "2_448",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that Joey is making a sarcastic comment about Chandler's mistake regarding"
-    },
-    {
-      "clip_id": "2_79",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: Rachel's response is clearly sarcastic because she's implying that the"
-    },
-    {
-      "clip_id": "2_191",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nExplanation: Joey's response \"So, you're just Bing?\""
-    },
-    {
-      "clip_id": "2_99",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nExplanation: The context suggests a game of cards where Ace is being discussed as both"
-    },
-    {
-      "clip_id": "2_208",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests Joey is making a comment about how men who care about babies are"
-    },
-    {
-      "clip_id": "2_119",
-      "predicted": "sarcastic",
-      "confidence": 0.9,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.9\n\nThe context suggests that Ross has been warned about his behavior (possibly related to a professor"
-    },
-    {
-      "clip_id": "2_55",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Rachel is being sarcastic when she says \""
-    },
-    {
-      "clip_id": "2_77",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nExplanation: Monica's statement is clearly sarcastic because she's pretending to be upset"
-    },
-    {
-      "clip_id": "2_147",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that Monica is expressing frustration about finding a suitable person for their wedding"
-    },
-    {
-      "clip_id": "2_559",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because Monica's exclamation \"What!?!"
-    },
-    {
-      "clip_id": "2_444",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: The context shows Rachel expressing genuine gratitude for Ross helping her move in"
-    },
-    {
-      "clip_id": "2_299",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.99\n\nThe context suggests that Chandler is being sarcastic when he says \""
-    },
-    {
-      "clip_id": "2_125",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line contains a contradiction that suggests Ross is not being sincere about his hypothetical"
-    },
-    {
-      "clip_id": "2_552",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that Rachel is complaining about her seating position, which is being challenged"
-    },
-    {
-      "clip_id": "2_181",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Joey's response is dismissive and uses a casual tone to reassure"
-    },
-    {
-      "clip_id": "2_107",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests Monica is genuinely excited about Joey planning her birthday party, but Joey"
-    },
-    {
-      "clip_id": "2_347",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Monica is implying that Chandler should invite his father to the wedding,"
-    },
-    {
-      "clip_id": "2_495",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that Phoebe is making a joke about the groom not being"
-    },
-    {
-      "clip_id": "2_525",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Rachel is trying to clarify her stance on smoking, possibly in"
-    },
-    {
-      "clip_id": "2_506",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Rachel's reaction is exaggerated and angry,"
-    },
-    {
-      "clip_id": "2_486",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The context suggests Joey is reacting to Rachel's comment about male nurses"
-    },
-    {
-      "clip_id": "2_149",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Monica's response \"Yeah??? Oh... Okay, yes, we"
-    },
-    {
-      "clip_id": "2_150",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests Ross is making a joke about having a third nipple, which he"
-    },
-    {
-      "clip_id": "2_12",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests Ross is being sarcastic about his \"backpacking\""
-    },
-    {
-      "clip_id": "2_127",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: The response \"When? Today? Some, not a lot.\" is"
-    },
-    {
-      "clip_id": "2_36",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic because Ross is exaggerating his"
-    },
-    {
-      "clip_id": "2_173",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that Ross is being defensive about bringing someone (likely his child)"
-    },
-    {
-      "clip_id": "2_295",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because it directly contradicts the previous statement by"
-    },
-    {
-      "clip_id": "2_38",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because Chandler's statement \"It's not a"
-    },
-    {
-      "clip_id": "2_431",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests Joey is being dismissive or cynical about something related to meeting someone"
-    },
-    {
-      "clip_id": "2_13",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests Monica is boasting about her performance in a game, implying she has"
-    },
-    {
-      "clip_id": "2_482",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: Chandler's statement is clearly sarcastic because he is using humor to mock"
-    },
-    {
-      "clip_id": "2_193",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Monica's request for \"a slice of cheesecake and-a date"
-    },
-    {
-      "clip_id": "2_203",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests Ross is expressing disbelief or frustration about something being \""
-    },
-    {
-      "clip_id": "2_426",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that the speaker is exclaiming about the arrival of the morning"
-    },
-    {
-      "clip_id": "2_590",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe response \"I don't think so Joe\" is likely sarcastic because it"
-    },
-    {
-      "clip_id": "2_343",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Ross is making an exaggerated introduction, possibly to"
-    },
-    {
-      "clip_id": "2_334",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Joey's response seems to imply that people will only come in when"
-    },
-    {
-      "clip_id": "2_367",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe response is sarcastic because Monica's comment \"Well, don't 'cha"
-    },
-    {
-      "clip_id": "2_51",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests Chandler is making a joke about Joey's \"breakfast adventure\""
-    },
-    {
-      "clip_id": "2_225",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Ross's response \"I don't know, something girlie\""
-    },
-    {
-      "clip_id": "2_514",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because Ross's reaction to the person's statement"
-    },
-    {
-      "clip_id": "2_272",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that \"Come on!\" was likely said in a way that implies"
-    },
-    {
-      "clip_id": "2_339",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests Ross is apologizing for using the other person's pen"
-    },
-    {
-      "clip_id": "2_69",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nYour reasoning needs to be revised. The target line \"Cover yourself up!\" is"
-    },
-    {
-      "clip_id": "2_217",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests Ross is about to ask a question or express a general"
-    },
-    {
-      "clip_id": "2_213",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Monica is jokingly reminding Rachel about owing her"
-    },
-    {
-      "clip_id": "2_96",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: Ross's response is clearly sarcastic because he initially suggests breaking into"
-    },
-    {
-      "clip_id": "2_188",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Monica is trying to comfort Rachel about Danny's"
-    },
-    {
-      "clip_id": "2_29",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests a sarcastic tone because Rachel's reaction to Chandler mentioning Ross is"
-    },
-    {
-      "clip_id": "2_358",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Ross's capitulation (\"okay, okay"
-    },
-    {
-      "clip_id": "2_263",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Joey and Chandler are complimenting each other's appearance, which"
-    },
-    {
-      "clip_id": "2_446",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because it implies that Joey is reluctant to show"
-    },
-    {
-      "clip_id": "2_594",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests Joey is expressing frustration or disappointment about being \"blown off\""
-    },
-    {
-      "clip_id": "2_535",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe target line is clearly sarcastic because Ross's initial statement about enjoying"
-    },
-    {
-      "clip_id": "2_210",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that the conversation is about a significant event (dumping someone at"
-    },
-    {
-      "clip_id": "2_234",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The context suggests a light-hearted or joking tone, likely about bringing"
-    },
-    {
-      "clip_id": "2_162",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe target line is clearly sarcastic. The speaker uses exaggerated pain"
-    },
-    {
-      "clip_id": "2_437",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nExplanation: Chandler's response is clearly sarcastic because he is mocking"
-    },
-    {
-      "clip_id": "2_579",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests that Ross is trying to get Monica to sing along to"
-    },
-    {
-      "clip_id": "2_194",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe response is sarcastic because Joey is mocking Ross's attempt to say \"But"
-    },
-    {
-      "clip_id": "2_78",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe context suggests that Chandler is surprised by Monica's news about her divorce"
-    },
-    {
-      "clip_id": "2_435",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: The target line is clearly sarcastic because it follows a conversation where"
-    },
-    {
-      "clip_id": "2_564",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: The target line is clearly sarcastic because it's Chandler expressing his desire"
-    },
-    {
-      "clip_id": "2_261",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nExplanation: Chandler's use of \"Oh, no, no\" is a common"
-    },
-    {
-      "clip_id": "2_422",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Joey's response is clearly sarcastic because he says \"Well,"
-    },
-    {
-      "clip_id": "2_627",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\nThe target line \"[PERSON] Whoa, whoa, whoa"
-    },
-    {
-      "clip_id": "2_585",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Ross's response \"You and I, we're interfacing\""
-    },
-    {
-      "clip_id": "2_425",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Phoebe is describing a romantic encounter, and Monica's"
-    },
-    {
-      "clip_id": "2_53",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line is clearly sarcastic because Rachel's response implies that she thinks Mona"
-    },
-    {
-      "clip_id": "2_293",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\nThe context suggests that Phoebe is being dismissive or resistant, which align"
-    },
-    {
-      "clip_id": "2_56",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy confidence is high because:\n1. The exclamation \"I knew it\""
-    },
-    {
-      "clip_id": "2_134",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nExplanation: Chandler is clearly mocking the idea of getting a girlfriend a"
-    },
-    {
-      "clip_id": "2_488",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: Chandler's response sounds like he's trying to act surprised or confused about"
-    },
-    {
-      "clip_id": "2_300",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because it directly contradicts Phoebe's"
-    },
-    {
-      "clip_id": "2_387",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe target line suggests Chandler is mocking or belittling \"her friend\" by"
-    },
-    {
-      "clip_id": "2_569",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Phoebe is mocking or teasing Person about"
-    },
-    {
-      "clip_id": "2_530",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that the speaker has been discussing serious issues like suicide, indicating they"
-    },
-    {
-      "clip_id": "2_48",
-      "predicted": "sarcastic",
-      "confidence": 0.8,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.8\n\nThe response is sarcastic because Monica's initial comment about Joey being \"a pretty charming"
-    },
-    {
-      "clip_id": "2_253",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Monica's over-the-top gratitude and the exclamation marks suggest she"
-    },
-    {
-      "clip_id": "2_555",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The context suggests that Ross is frustrated about being left without the ball"
-    },
-    {
-      "clip_id": "2_42",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The target line sounds like a plea for PERSON1 to stay,"
-    },
-    {
-      "clip_id": "2_441",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Ross's response is clearly sarcastic because he acknowledges Chandler's point"
-    },
-    {
-      "clip_id": "2_218",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The context suggests that Chandler is likely trying to deflect attention away from"
-    },
-    {
-      "clip_id": "2_415",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests a casual conversation about copying needs, while the target line is clearly"
-    },
-    {
-      "clip_id": "2_473",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line \"I'm the supply manager\" is delivered in a dismissive"
-    },
-    {
-      "clip_id": "2_537",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Phoebe is trying to hide something from Chandler by closing"
-    },
-    {
-      "clip_id": "2_397",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\nThe context suggests that Rachel is questioning Ross about his relationship status,"
-    },
-    {
-      "clip_id": "2_572",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because it contradicts Rachel's earlier statement about"
-    },
-    {
-      "clip_id": "2_243",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy confidence is high because Chandler's apology is clearly insincere and comes across"
-    },
-    {
-      "clip_id": "2_245",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: Monica's response \"Oh, right! I completely forgot about that"
-    },
-    {
-      "clip_id": "2_176",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic. Chandler's response \"I'm the rupt"
-    },
-    {
-      "clip_id": "2_58",
-      "predicted": "sarcastic",
-      "confidence": 0.8,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.8\n\nThe context suggests that Phoebe is implying they were interrupted or cut off during their"
-    },
-    {
-      "clip_id": "2_427",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests a sarcastic tone because Rachel's \"Oh!\" followed by Monica"
-    },
-    {
-      "clip_id": "2_534",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Ross is defending himself against being called \"boring\" by"
-    },
-    {
-      "clip_id": "2_626",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that Chandler is commenting on the excessive talk about London, implying they"
-    },
-    {
-      "clip_id": "2_607",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests Chandler is mocking or dismissing Joey's request to speak to Kathy,"
-    },
-    {
-      "clip_id": "2_33",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.94\n\nThe context suggests Rachel is being sarcastic because her response sounds like"
-    },
-    {
-      "clip_id": "2_44",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Joey is being sarcastic because he initially said"
-    },
-    {
-      "clip_id": "2_232",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nExplanation: The context suggests that Chandler is being sarcastic because he initially offered to"
-    },
-    {
-      "clip_id": "2_400",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because Chandler's comment about dismantling a nuclear"
-    },
-    {
-      "clip_id": "2_230",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The context suggests that Monica is trying to be accommodating by agreeing to"
-    },
-    {
-      "clip_id": "2_180",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The context suggests that Chandler is seeking time to process something (likely"
-    },
-    {
-      "clip_id": "2_454",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The context suggests that Person is expressing regret for their behavior towards Ross"
-    },
-    {
-      "clip_id": "2_439",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: Ross's response is clearly sarcastic because he is trying to make light"
-    },
-    {
-      "clip_id": "2_385",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: The response \"No, I know, but I, he just went"
-    },
-    {
-      "clip_id": "2_547",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Chandler is making a joke about his breast size, which he"
-    },
-    {
-      "clip_id": "2_65",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that the person is trying to avoid talking about something by saying \""
-    },
-    {
-      "clip_id": "2_251",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Monica is reacting negatively to something Chandler said earlier, which was"
-    },
-    {
-      "clip_id": "2_205",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests Rachel is apologizing profusely for something she"
-    },
-    {
-      "clip_id": "2_405",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: Ross's statement about needing to be somewhere in just ten minutes after"
-    },
-    {
-      "clip_id": "2_375",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Ross is being sarcastic because he's responding"
-    },
-    {
-      "clip_id": "2_72",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because it implies that Ross's initial reaction ("
-    },
-    {
-      "clip_id": "2_353",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests Joey is trying to subtly encourage Monica to change her bathroom floor,"
-    },
-    {
-      "clip_id": "2_517",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nI appreciate the feedback. Let me explain my reasoning for this answer:\n\nThe context"
-    },
-    {
-      "clip_id": "2_521",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Ross is being sarcastic when he says \"I've had"
-    },
-    {
-      "clip_id": "2_499",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy confidence is high because Phoebe's tone suggests she's mocking Rachel's"
-    },
-    {
-      "clip_id": "2_317",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe target line is clearly sarcastic. The speaker's initial description of"
-    },
-    {
-      "clip_id": "2_393",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nExplanation: The use of \"Pheebs\" (a nickname"
-    },
-    {
-      "clip_id": "2_407",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The context suggests that Rachel is making a joke about selling a coat"
-    },
-    {
-      "clip_id": "2_94",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nYour reasoning needs to be more explicit. Let's break it down:\n\nContext:"
-    },
-    {
-      "clip_id": "2_198",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because Ross's response about Phoebe being"
-    },
-    {
-      "clip_id": "2_365",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nExplanation: Joey's response is clearly sarcastic because he's pretending to perform a"
-    },
-    {
-      "clip_id": "2_301",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because it's a sudden change from the casual"
-    },
-    {
-      "clip_id": "2_302",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests that Rachel is complimenting Phoebe on something, and Ph"
-    },
-    {
-      "clip_id": "2_27",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests Monica is trying to convince someone (likely Ross) not to fly"
-    },
-    {
-      "clip_id": "2_196",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests Rachel is expressing concern or pity for someone who will be left alone"
-    },
-    {
-      "clip_id": "2_615",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nExplanation: The target line is sarcastic because it's a polite way to acknowledge"
-    },
-    {
-      "clip_id": "2_111",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Phoebe is trying to be funny or light-hearted about"
-    },
-    {
-      "clip_id": "2_577",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests Ross is trying to comfort Rachel by acknowledging her struggles"
-    },
-    {
-      "clip_id": "2_467",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context sets up a scenario where someone is being dismissive or condescending about"
-    },
-    {
-      "clip_id": "2_526",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The response \"Ah well, this is the living room\" seems"
-    },
-    {
-      "clip_id": "2_257",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Monica is trying to impress Rachel by suggesting a"
-    },
-    {
-      "clip_id": "2_259",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The response \"I wasn't by myself. I had a doctor"
-    },
-    {
-      "clip_id": "2_609",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nThe target line is sarcastic because Rachel is using a sarcastic tone"
-    },
-    {
-      "clip_id": "2_371",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests Ross is trying to make amends for something he"
-    },
-    {
-      "clip_id": "2_531",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests Rachel and Joey were both sick and unable to meet. Joey's"
-    },
-    {
-      "clip_id": "2_413",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: The target line sounds like an exaggerated attempt to cover up her own mistake"
-    },
-    {
-      "clip_id": "2_326",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Rachel was expressing disappointment or frustration about not having something on her"
-    },
-    {
-      "clip_id": "2_576",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nThe context suggests that Rachel is apologizing for forgetting to throw a"
-    },
-    {
-      "clip_id": "2_146",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nThe context suggests Monica is being sarcastic or ironic when she talks about Ross's"
-    },
-    {
-      "clip_id": "2_522",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because it directly contradicts the previous statement made"
-    },
-    {
-      "clip_id": "2_30",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The context suggests Ross is making a sarcastic comment about someone's"
-    },
-    {
-      "clip_id": "2_171",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Monica is trying to engage Chandler in a playful or romantic activity"
-    },
-    {
-      "clip_id": "2_228",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests Rachel is inquiring about Joey's evening, possibly implying she's"
-    },
-    {
-      "clip_id": "2_7",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe target line is clearly sarcastic because it directly contradicts Chandler's earlier statement"
-    },
-    {
-      "clip_id": "2_553",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: The context suggests a celebratory tone about Joey's achievement, but"
-    },
-    {
-      "clip_id": "2_291",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests Ross is being sarcastic because he responds with \"Oh my God"
-    },
-    {
-      "clip_id": "2_319",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy confidence is high because Ross's speech is filled with exaggerated praise and"
-    },
-    {
-      "clip_id": "2_314",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests Monica is being self-deprecating about her cooking skills and the"
-    },
-    {
-      "clip_id": "2_166",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Joey is being sarcastic when he says \"But I'm"
-    },
-    {
-      "clip_id": "2_118",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests that Phoebe's comment \"Oh, it's not!\""
-    },
-    {
-      "clip_id": "2_28",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nMy reasoning: Ross's response is clearly sarcastic because he is implying that Monica"
-    },
-    {
-      "clip_id": "2_562",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nExplanation: The context suggests that Phoebe is apologizing for being judgmental"
-    },
-    {
-      "clip_id": "2_231",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nMy reasoning: The response \"No, no, no, it's flattery"
-    },
-    {
-      "clip_id": "2_388",
-      "predicted": "sarcastic",
-      "confidence": 0.85,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.85\n\nExplanation: The line \"Yeah, she couldn't live without the Chan Love\""
-    },
-    {
-      "clip_id": "2_591",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe context suggests a casual, friendly tone between Dorothy and Sophia. The target line"
-    },
-    {
-      "clip_id": "2_492",
-      "predicted": "sincere",
-      "confidence": 0.5,
-      "gold": "sarcastic",
-      "raw_output": "sincere\nFloat: 1.0\n\nThe response is sincere because Rose is defending Dorothy by acknowledging the situation described,"
-    },
-    {
-      "clip_id": "2_471",
-      "predicted": "sincere",
-      "confidence": 0.5,
-      "gold": "sarcastic",
-      "raw_output": "sincere\nFloat: 1.0\n\nThe context is a general greeting where people are expressing their experiences or feelings about"
-    },
-    {
-      "clip_id": "2_570",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.95\n\nExplanation: The context suggests Chandler is trying to initiate a conversation or"
-    },
-    {
-      "clip_id": "2_131",
-      "predicted": "sarcastic",
-      "confidence": 0.5,
-      "gold": "sincere",
-      "raw_output": "sarcastic\nFloat: 0.9\n\nExplanation: Chandler's statement is clearly sarcastic because he expresses a desire"
-    },
-    {
-      "clip_id": "2_235",
-      "predicted": "sarcastic",
-      "confidence": 0.95,
-      "gold": "sincere",
-      "raw_output": "sarcastic\n0.95\n\nThe response is clearly sarcastic because Chandler's claim about having a \"caretaker"
-    }
-  ],
-  "min_conf": 0.0
-}

data/held_out_eval_run1.json DELETED

@@ -1,890 +0,0 @@
-{
-  "n_eval_clips": 80,
-  "mean_reward": 0.65025,
-  "well_formed_rate": 1.0,
-  "accuracy": 0.5,
-  "pivot_in_eval": 6,
-  "pivot_correct": 5,
-  "results": [
-    {
-      "clip_id": "1_1001",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody analysis shows that Raj's tone is flat and lacks any significant pitch variation or emphasis. There are no internal pauses, which could indicate sincerity. The pitch contour is also relatively flat, with minimal fluctuations. Additionally, the energy level is high, which might suggest enthusiasm but not necessarily sarcasm. However, the content of the statement is quite neutral and factual. He is simply stating what he would do with the broken iPod. Given the lack of emotional inflection and the straightforward nature of the statement, it seems more likely to be sincere rather than sarcastic.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_10462",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nHoward's tone is flat and lacks any significant pitch variation or emphasis. There are no internal pauses, and his speech sounds matter-of-fact. The phrase \"exciting opportunities in home care for the old and fat\" is presented neutrally without any indication of sarcasm. Sheldon's previous statements were clearly sarcastic, but Howard's line here does not follow the pattern of sarcasm. He is simply stating facts without any implied criticism or mockery.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_105",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody features indicate a relatively flat pitch contour with minimal variability, which suggests sincerity rather than sarcasm. Sheldon's tone does not exhibit high pitch variability or strong emphasis through pauses. Additionally, his statement about the coffee table being a \"tiny garage sale\" implies a serious concern about organization, which aligns more with sincerity than sarcasm. The lack of any sarcastic elements such as exaggerated praise or mocking of the situation further supports a sincere delivery.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_10849",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sarcastic",
-      "confidence": 0.8,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.37,
-      "completion_text": "<think>\nHoward's tone is flat and there are no internal pauses, which suggests sincerity. However, the content of his statement is ironic and self-deprecating, implying that buying a comic book store is not very meaningful to him. This contradiction between the literal meaning and the implied sentiment is a common sign of sarcasm. Additionally, the prosody features indicate a low pitch variability and low energy level, which could support a sincere delivery but also might be consistent with a sarcastic delivery if the speaker is trying to downplay the importance of the action.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.8}</final>"
-    },
-    {
-      "clip_id": "1_10853",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.8,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.93,
-      "completion_text": "<think>\nThe prosody analysis shows that Penny's pitch is relatively high with some variability, but it's not extremely variable. Her energy level is low, and there are no significant pauses or breaks in her speech. The pitch contour shows a slight downward trend, which could indicate a shift in tone, but it's not extreme enough to suggest sarcasm. Additionally, she uses \"okay\" at the beginning, which might be interpreted as a neutral phrase rather than sarcasm. Given these factors, the prosody doesn't strongly support sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.8}</final>"
-    },
-    {
-      "clip_id": "1_10857",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody features indicate that there is no pitch information available, making it impossible to analyze the tone based on pitch alone. However, let's consider the context and the content of the line. Raj is complimenting Penny and Leonard on their attractiveness, which seems genuine rather than sarcastic. The phrase \"It's true-- you're both gorgeous\" sounds sincere and appreciative, not mocking or ironic. Additionally, the context suggests that they are trying to move past a past disagreement, and Raj is acknowledging their current situation with a compliment. This aligns more with sincerity than sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_11055",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows a relatively flat pitch contour with minimal variability, which suggests a sincere tone. Leonard's speech lacks any dramatic emphasis or unusual pitch fluctuations that would indicate sarcasm. He is simply asking a rhetorical question about why someone came to see him, without any indication of mockery or irony. The content itself is neutral and straightforward, not designed to make a sarcastic point.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_11201",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nAmy's tone here is flat and lacks any emphasis or inflection that might indicate sarcasm. She speaks in a neutral way, as if she's simply stating a fact about her dashboard. There are no pauses or breaks in her speech that suggest she's trying to convey something other than what she says. The prosody features also don't show any signs of high pitch variability or strong emphasis, which are typically associated with sarcasm. Instead, the pitch contour shows a flat pattern with minimal variation, indicating a sincere statement rather than a sarcastic one.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_11243",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nHoward's statement \"I love that you're too good to pee in the kitchen sink\" has a flat pitch contour without any significant pitch variation or emphasis pauses. The prosody features indicate low pitch variability and voiced speech, which suggests sincerity rather than sarcasm. Additionally, the content of the statement doesn't seem to contradict or mock Sheldon's previous comments about Amy being a neuroscientist. Howard is simply complimenting Sheldon's self-care routine, which aligns with a sincere tone.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_11439",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nPenny's tone is flat and she doesn't use any emphatic pauses or high-pitched intonations that would suggest sarcasm. Instead, her speech sounds sincere and matter-of-fact, as if she's simply stating a fact. She doesn't contradict the previous statements or express any hidden meaning through her pitch or rhythm. The pitch contour shows a steady rise, which could indicate a calm and straightforward delivery rather than a sarcastic one.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_1144",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows a relatively flat pitch contour with minimal variability, which suggests a sincere tone. Sheldon's speech lacks any notable emphasis or intonation that would indicate sarcasm. He speaks in a monotone manner, which aligns with a sincere delivery. Additionally, there are no pauses or breaks in his speech that could imply he is trying to disguise his true meaning, another sign of sincerity rather than sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_11526",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody here is quite flat, with no significant pitch variation or emphasis. Sheldon is speaking in a monotone manner, which suggests sincerity rather than sarcasm. He is explaining his recent experiment with elevated anxiety levels and how he found it uncomfortable. The tone is informative and matter-of-fact, without any hint of mockery or irony.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_12275",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe speaker uses \"Uh, no\" which is a hesitant expression that could indicate they are trying to downplay something. However, the prosody features suggest a flat affect with no significant pitch variation or emphasis on any particular word. The tone seems neutral and matter-of-fact, as if the speaker is simply stating a fact without much emotional weight. Additionally, there are no internal pauses or vocalized pauses that might indicate a shift in meaning or emotion. The prosody does not align with a sarcastic delivery, where one would expect more dramatic pauses and a rising intonation.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_12320",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sarcastic",
-      "confidence": 0.8,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.37,
-      "completion_text": "<think>\nThe prosody analysis shows that Raj's speech has high pitch variability and no internal pauses, which typically indicate sarcasm. Additionally, he emphasizes certain words with a rising intonation, as seen in \"Oh, yes,\" which suggests he is not being serious about the details he's providing. The lack of a flat affect and the presence of pauses could also point towards sarcasm. However, the pitch contour is relatively flat, which might temper the certainty of the sarcasm. Given these factors, the prosody seems more aligned with sarcasm than sincerity.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.8}</final>"
-    },
-    {
-      "clip_id": "1_12331",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody analysis shows a relatively flat pitch contour with minimal variation, which suggests sincerity rather than sarcasm. Additionally, there are no emphasized pauses or unusual intonations that would indicate sarcasm. The phrase \"Wow, there's a Denny's in Vegas you can actually get married in\" is a common joke about Las Vegas weddings, but the tone here seems straightforward and not mocking or ironic. Penny is likely being sincere in her surprise at the possibility of getting married in such a place.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_1478",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nHoward's tone is flat and he doesn't use any internal pauses or emphasize his words with pauses. He speaks in a monotone voice, which suggests sincerity rather than sarcasm. Additionally, Howard's statement about Mrs. Giselle being sensitive to chlorine and liking his fashion models to be \"pruny and bug-eyed\" seems to be a straightforward observation without any hint of irony or mockery. The pitch contour shows a steady rise, which aligns with a sincere, matter-of-fact delivery.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_1627",
-      "gold": "sincere",
-      "is_pivot": true,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe speaker is using a flat pitch contour with little variation, which suggests sincerity rather than sarcasm. Additionally, the tone is matter-of-fact and doesn't include any emphatic pauses or high-pitched exclamations that would indicate sarcasm. The phrase \"What do you think I did?\" is phrased in a way that sounds like a straightforward statement of fact, without any hint of irony or exaggeration.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_1973",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nPenny's tone is flat and lacks any significant pitch variation or emphasis. She doesn't use any positive or negative words that could indicate sarcasm. Instead, she delivers the line in a matter-of-fact way, which suggests sincerity rather than sarcasm. Additionally, her pitch contour shows a very flat pattern with minimal fluctuations, further supporting a sincere delivery.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_2216",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nLeonard's tone is flat and lacks any significant pitch variation or emphasis. He doesn't use any positive or negative words to convey emotion. The pitch contour shows a steady decline, which suggests a lack of enthusiasm or sarcasm. Additionally, Leonard is simply asking a question about his inclusion, without any underlying meaning beyond the literal words. Howard and Raj's previous comments were clearly mocking Leonard, but Leonard's response is neutral and unemotional.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_2423",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sarcastic",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows that Sheldon's tone is flat and lacks any significant pitch variation or emphasis. He doesn't take the praise seriously, as evidenced by his dismissive \"Oh, great.\" This matches the prosody, which indicates a lack of enthusiasm and a flat affect. The content of the sentence, while seemingly praising Penny's choice, is clearly sarcastic because it's not meant to be taken seriously. Sheldon is mocking the idea of the multiverse reestablishment, which is a concept Penny likely finds absurd. The prosody confirms this by showing no indication of sincerity or genuine appreciation.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_2614",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows a flat pitch contour with minimal variability, indicating a lack of emotional expression. There are no internal pauses or emphatic pauses that might suggest sarcasm. The content of the line \"Hardly a day goes by when I don't think about it\" is neutral and doesn't contain any obvious irony or contradiction. Leonard is simply stating a fact about his thoughts without any indication of sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_2616",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sarcastic",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nLeonard's statement is clearly sarcastic. Despite his tone being flat and lack of any internal pauses, the prosody features indicate a very low energy level, which is unusual for sarcasm. However, the pitch contour shows minimal variation, which aligns with the flat affect characteristic of sarcasm. The lexical choice \"tableau of me trying to pummel you to death\" is a hyperbolic exaggeration that is typical of sarcastic speech. Sheldon's previous statements about his fame and the attention he receives from others suggest that Leonard is using sarcasm to mock Sheldon's inflated self-importance. \n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_2778",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nRaj's statement seems to be mocking Sheldon's boast about his fighting skills. The prosody features show a flat pitch contour with little variation, which suggests sincerity rather than sarcasm. Additionally, Raj uses the phrase \"big talk,\" which implies that he is not taking Sheldon's boasting seriously. The tone is more dismissive than sarcastic.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_2792",
-      "gold": "sincere",
-      "is_pivot": true,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody analysis shows that Penny's pitch is relatively flat with little variation, which suggests sincerity rather than sarcasm. Additionally, there are no internal pauses or emphatic pauses that would indicate sarcasm. The tone is matter-of-fact and doesn't convey any hidden meaning beyond what she says. She is simply stating that they can return to being friends, without any underlying criticism or mockery implied.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_3125",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nHoward's tone is flat and he doesn't use any emphatic pauses or high-pitched intonations that would indicate sarcasm. Instead, he speaks matter-of-factly about the consequences of Penny dumping him, which aligns with his previous statement that \"she's obviously way out of line.\" There's no indication of irony or exaggeration in his speech pattern. The prosody features also support a sincere delivery; the pitch is not varied and there are no pauses that might suggest he is being sarcastic.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_3177",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sarcastic",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe speaker is clearly expressing disapproval of the movie, saying \"Not a great movie.\" However, they follow up with \"look at that beautiful desert,\" which is a sarcastic comment. The prosody features indicate high pitch variability and a flat affect, which suggests sarcasm. Additionally, the phrase \"look at that beautiful desert\" is a clear attempt to soften the criticism, implying that even though the movie was bad, there was something positive about it. This combination of features points towards sarcasm.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_3256",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe speaker, [PERSON], uses the phrase \"I'm sorry, fiddling with yourself in the bathtub is a real class act.\" This sentence is clearly meant to be sarcastic, but let's analyze the prosody to see if it matches the intended sarcasm. The pitch contour shows a flat overall trend with minimal variation, which suggests a lack of emotional intensity. Additionally, there are no internal pauses or significant emphasis, which typically accompany sarcasm. The prosody features indicate low pitch variability and energy, further supporting a sincere tone rather than a sarcastic one. Howard's comment about it sounding \"cheap\" also implies that the speaker is trying to soften the harshness of their statement, which aligns with a sincere attempt to be polite rather than sarcastic. \n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_3660",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sarcastic",
-      "confidence": 0.8,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.93,
-      "completion_text": "<think>\nThe prosody analysis shows that Penny's pitch is quite high with significant variability, which is often associated with sarcasm. Additionally, her speech lacks any internal pauses or vocalized breaks, which can indicate sincerity. However, the use of \"new girlfriend\" in a context where Sheldon and Leonard are discussing their sexual encounters could be interpreted as sarcastic. The phrase \"There a new girlfriend in there?\" implies that Sheldon and Leonard are having a lot of sex, which Penny is mocking or joking about. Given the context and the prosodic features, it seems more likely that Penny is being sarcastic rather than sincere.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.8}</final>"
-    },
-    {
-      "clip_id": "1_3911",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe speaker is being polite and appreciative, but there's no indication of sarcasm. They are acknowledging the thoughtfulness of the offer but declining it. The tone is neutral, and there are no signs of exaggeration or irony. The pitch contour shows a flat trend with minimal variation, which suggests sincerity rather than sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_4281",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nHoward's statement seems to be about facing someone after doing something wrong. The prosody features indicate a relatively flat pitch contour with low energy, suggesting a lack of enthusiasm or strong emotion. There are no significant pauses or emphatic stresses that would typically accompany sarcasm. Additionally, Howard's tone doesn't seem to contradict his words; he's acknowledging a difficult situation without any apparent irony or mockery. Given these factors, the prosody does not align with typical sarcastic delivery.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_4544",
-      "gold": "sincere",
-      "is_pivot": true,
-      "predicted": "sarcastic",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nAmy's tone is flat and there are no internal pauses, which suggests sincerity. However, the phrase \"our menses are synchronized\" is a clear indicator of sarcasm. The use of such a medical term in a joking context implies that Amy is being sarcastic about the idea of a girls' night. Additionally, her pitch contour shows a flat affect, which aligns with the lack of emphasis or inflection that would typically accompany sincere speech. The content of what she says is clearly meant to mock the idea of a girls' night, further supporting the sarcastic interpretation.</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_4552",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.8,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.93,
-      "completion_text": "<think>\nAmy's tone is flat and there are no internal pauses, which suggests sincerity. However, her use of \"so anyway, to make a long story short\" with a firm cervix could be interpreted as sarcastically downplaying the importance of the information she's about to share. The pitch contour shows a flat pattern without any significant ups and downs, which aligns with a sincere delivery. But the phrase \"to make a long story short\" can sometimes be used sarcastically to dismiss something important, especially when combined with a firm cervix, which might indicate that Amy is trying to downplay the significance of her cervix being firm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.8}</final>"
-    },
-    {
-      "clip_id": "1_4792",
-      "gold": "sincere",
-      "is_pivot": true,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody features indicate a very flat pitch contour with minimal variability, which suggests a lack of emotional expression. Sheldon's tone is neutral and matter-of-fact, as if he's simply stating a fact. He doesn't emphasize any particular word or use a rising intonation that would typically indicate sarcasm. The content of his statement, \"Obviously, we're no longer a Justice League. We have no choice but to switch to our Muppet Baby costumes,\" is also neutral and factual. There are no positive or negative connotations in the words themselves that would suggest sarcasm. The context of the conversation, where everyone else is joking about the absence of superheroes, doesn't align with Sheldon's neutral tone. Sheldon is likely being sincere in his statement.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_4995",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody features indicate a very flat pitch contour with no significant pitch variation. Sheldon's tone is neutral and lacks any emphatic pauses or rising intonation that might suggest sarcasm. He is simply stating a fact about what he learned from Penny. The conversational context shows that they are engaged in a playful exchange about improvisation, but there's no indication of sarcasm here. Sheldon is acknowledging something he learned, which is a straightforward statement rather than a sarcastic remark.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_5212",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody analysis shows that the speaker's pitch is relatively flat with little variation, and there are no internal pauses or emphasized pauses. The tone is calm and matter-of-fact, which suggests sincerity rather than sarcasm. Additionally, the pitch contour does not show any peaks or valleys that would indicate sarcasm. The content of the statement is also neutral and factual, without any hint of irony or mockery.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_536",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows that Sheldon's pitch is relatively high with some variability, but it doesn't rise significantly. There are also no internal pauses, which would typically indicate a sarcastic tone. The content of his statement about \"Applied Particle Physics quarterly\" seems factual and not particularly noteworthy. Additionally, the prosodic features don't show any strong emphasis or unusual pauses that would suggest sarcasm. Sheldon is likely being sincere here, as he is simply stating a fact about a book he has read.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_5496",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nPenny's tone is flat and lacks any significant pitch variation or emphasis. She doesn't use any positive or negative words that would suggest sarcasm. Instead, she seems to be making a neutral statement about Amy's behavior, which is not aligned with Amy's previous comments about her being \"crafty.\" Her speech pattern and lack of emotional inflection further support a sincere delivery rather than sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_5580",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sarcastic",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nHoward is using irony to mock the idea that Jewish mothers are as lenient with their children as Indian mothers are strict. The tone is mocking and condescending, which aligns with sarcasm. There's no significant pitch variation or emphasis, and he's clearly making a point about the difference in parenting styles between cultures. This matches the pattern of sarcastic speech where the speaker says the opposite of what they mean to make a point.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_5581",
-      "gold": "sincere",
-      "is_pivot": true,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody analysis shows a flat affect with no internal pauses, which suggests sincerity. Sheldon is making a comparison between Jewish and Indian parental attitudes, but he is doing so in a matter-of-fact way without any dramatic emphasis or pitch variation. The pitch contour is also consistently low and falling, indicating a flat tone. This matches the sincere delivery where the speaker is calmly explaining a point without adding extra emphasis or sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_5953",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nHoward's tone is flat and there are no internal pauses, which suggests sincerity. Additionally, the pitch contour shows a steady rise, indicating a monotone delivery that aligns with a sincere statement. The prosody features also support this interpretation, as they indicate a calm and unemotional delivery without any significant pitch variation or emphasis.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_5964",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sarcastic",
-      "confidence": 0.8,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.93,
-      "completion_text": "<think>\nHoward's tone is flat and there are no internal pauses, which suggests sincerity. However, his use of \"Smart\" with a whisper and the mention of \"deaf chick\" could imply sarcasm. The whisper and the suggestion that someone might not hear him could be a way to mock or belittle someone, especially if he's being sarcastic about someone being engaged to his friend. Given the context, it seems more likely that Howard is using this as a way to mock or make fun of the situation rather than being sincere.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.8}</final>"
-    },
-    {
-      "clip_id": "1_6020",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows that Sheldon's tone is flat and lacks any significant pitch variation or emphasis. He doesn't use any positive or negative intonation to convey emotion. Additionally, he uses a neutral pitch range with no peaks or valleys, which suggests sincerity rather than sarcasm. The phrase \"Would you like back on it?\" implies a serious request, not a mocking one. Furthermore, Sheldon's statement about having 61 enemies and offering to add someone back to the list is straightforward and matter-of-fact, without any hint of sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_6113",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nHoward's statement about going to the International Space Station is clearly a joke, but it doesn't sound sarcastic. The pitch contour shows a flat range with no significant ups and downs, which suggests a sincere tone rather than sarcasm. Additionally, Howard's use of \"Oh, my word, a trip to the heavens\" is meant to be humorous, but it lacks the typical elements of sarcasm such as exaggerated pitch, strong emphasis, or unusual phrasing that would make it sound sarcastic. The prosody features also indicate a mid-range pitch with minimal variation and no internal pauses, which align more with sincerity than sarcasm.</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_6165",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows that Sheldon's tone is flat and lacks any significant pitch variation or emphasis. He speaks in a monotone voice without any noticeable rise or fall in pitch, which is consistent with sincerity. Additionally, he doesn't use any internal pauses or vocalized pauses, which could indicate sarcasm. The content of the line is also neutral and doesn't contain any obvious indicators of sarcasm such as irony or exaggeration. Sheldon is simply describing a physical reaction, which is a common and straightforward statement.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_6221",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nBernadette's tone is flat and there are no internal pauses, which suggests sincerity. Additionally, her pitch doesn't vary much and stays within a relatively narrow range, indicating a lack of emotional expression. The phrase \"if you don't mind looking like an orange traffic cone\" is a straightforward statement without any hint of sarcasm. She's simply offering a blunt opinion about the color of the dresses, which she finds unattractive. There are no emphatic pauses or rising intonation that would suggest sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_6426",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows a very flat pitch contour with minimal variation, which suggests a lack of emotional expression. Sheldon's tone is neutral and matter-of-fact throughout the sentence, without any emphatic pauses or rising intonation that would indicate sarcasm. He uses \"you know\" at the beginning, which is a common phrase used when making a statement that is not surprising or unexpected. The content of his statement about the holidays and seeing if someone can lodge a testicle is neutral and not particularly ironic or mocking. There are no indications of sarcasm such as exaggerated pauses, high-pitched tones, or unusual phrasing.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_6534",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody analysis shows a flat pitch contour with no significant pitch variation, which suggests sincerity rather than sarcasm. Sheldon's tone is matter-of-fact and doesn't exhibit the high pitch variability or emphatic pauses typically associated with sarcasm. Additionally, his statement \"No, but maybe she wants a man with a pocket watch\" aligns with the previous conversation where he was being critical about Penny's perceived lack of appreciation. This suggests a straightforward continuation of the conversation without any indication of sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_70",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows that Penny's tone is flat and lacks any significant pitch variation or emphasis. There are no pauses or breaks in her speech, which suggests sincerity rather than sarcasm. Additionally, her statement \"I don't think I'll be able to stop thinking about it\" is a straightforward expression of her interest without any hint of irony or mockery. Howard's previous comments about his favorite place and his character Buttons were genuine expressions of excitement, but they do not set up a context for Penny to sarcastically dismiss the idea. Therefore, the prosodic features and the content of the statement lean towards sincerity.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_7281",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows that Sheldon's tone is flat and lacks any significant pitch variation or emphasis. He doesn't use positive words or emphasize his words in a way that would suggest sarcasm. Additionally, the prosody indicates a lack of emotional engagement, which is consistent with someone who is simply stating a fact about not being able to invite someone to dinner. The phrase \"Darn\" at the end is more likely to be sincere than sarcastic, as it sounds like he's expressing disappointment without any underlying irony.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_7400",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody analysis shows that Penny's tone is flat and there are no internal pauses, which typically indicate sincerity. Additionally, her pitch contour is relatively flat with no significant ups and downs, suggesting she is not emphasizing any particular point. The use of \"Of course\" and \"it's been in the news\" does not convey sarcasm; instead, it sounds like a neutral statement about a well-known scientific concept. The context also doesn't provide any indication that she is being sarcastic. Sheldon's question about the Higgs boson might make her say this, but it doesn't change the sincerity of her reply.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_7504",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sarcastic",
-      "confidence": 0.8,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.37,
-      "completion_text": "<think>\nThe speaker is praising Stuart's comic book store as being \"out of this world,\" which is clearly meant as a compliment. However, the prosody features suggest a sarcastic tone. The pitch is high and variable, but not in a way that emphasizes the positive aspects of the store. Instead, the flat affect and lack of internal pauses indicate that the speaker is not genuinely enthusiastic about the store. Additionally, the pitch contour shows a relatively flat pattern with some peaks, which could suggest a sarcastic tone where the speaker is saying something positive but not truly believing it.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.8}</final>"
-    },
-    {
-      "clip_id": "1_8042",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nHoward's tone is flat and he doesn't use any emphatic pauses or rising intonation that would indicate sarcasm. He speaks matter-of-factly about his trip without any indication of irony or mockery. The prosody features show low energy and no significant pitch variation, which aligns with a sincere statement. Additionally, the content of the statement is neutral and factual, not expressing any hidden meaning through sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_8078",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nHoward is using a flat affect and not emphasizing any particular words when he says \"Trust me, you're not Jewish. That's how they get you.\" He doesn't vary his pitch much and there are no internal pauses, which suggests sincerity. Additionally, the content of what he's saying is quite serious and logical, without any hint of sarcasm. The phrase \"That's how they get you\" implies a warning about a common tactic, but it's stated matter-of-factly without any tone of mockery or irony.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_8136",
-      "gold": "sarcastic",
-      "is_pivot": true,
-      "predicted": "sarcastic",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe speaker uses positive language like \"looking forward to it\" and \"no place I would rather be than here,\" which are typically used sincerely. However, the prosody doesn't match the sincerity. The pitch is relatively flat with little variation, and there are no internal pauses or emphatic pauses that might indicate genuine excitement. Additionally, the prosody shows no significant rise in energy, which is unusual for someone genuinely enthusiastic about an event. The lack of vocal variety and the flat affect suggest that the speaker is not expressing genuine excitement but rather trying to sound enthusiastic to mask their true feelings.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_8159",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody analysis shows a flat affect with little pitch variation and no internal pauses, which typically indicates sincerity. Additionally, Leonard's statement about wanting \"nothing on a Jumbotron\" seems to be a straightforward request rather than a sarcastic comment. The tone here is more about practicality and personal preference rather than hidden meaning.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "1_8746",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.8,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.37,
-      "completion_text": "<think>\nThe prosody analysis shows that Amy's pitch is high and variable, which could indicate sarcasm. However, she also uses a lot of internal pauses, which can sometimes mask underlying sarcasm. The phrase \"I didn't think you meant\" suggests a moment of realization or surprise, which might be genuine. Additionally, the tone seems more matter-of-fact than sarcastic. Given these factors, the prosody leans towards sincerity rather than sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.8}</final>"
-    },
-    {
-      "clip_id": "1_8749",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sarcastic",
-      "confidence": 0.8,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.93,
-      "completion_text": "<think>\nAmy's tone is flat and there are no internal pauses, which suggests sincerity. However, the phrase \"It was very entertaining... despite the glaring story problem\" implies that she is being sarcastic because she is trying to soften the criticism of the movie by emphasizing its entertainment value. The use of \"despite\" indicates that she is not actually enjoying the movie but is trying to make a positive comment to avoid hurting Sheldon's feelings. This suggests a subtle form of sarcasm.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.8}</final>"
-    },
-    {
-      "clip_id": "1_8827",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.8,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.37,
-      "completion_text": "<think>\nPenny's tone is flat and there are no internal pauses, which suggests sincerity. However, her words \"How do you not tear off his clothes and take him right here on this table?\" imply a strong desire for action, which could be interpreted as sarcasm. The phrase \"take him right here on this table\" is hyperbolic and exaggerated, which might indicate sarcasm. But the lack of vocal variety and the absence of any emphasis or rising intonation further supports a sincere interpretation.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.8}</final>"
-    },
-    {
-      "clip_id": "1_9245",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody analysis shows a very flat pitch contour with minimal variation, which suggests a lack of emotional expression. Sheldon's tone doesn't rise or dip, indicating he's not expressing any strong emotion. Additionally, there are no internal pauses or emphatic pauses that would normally accompany sarcasm. The phrase \"Can we talk about something else?\" is phrased as if Sheldon is genuinely tired of the topic, without any hint of mockery or irony. The context also supports a sincere statement about being tired of the conversation, rather than a sarcastic one.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_107",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nJOEY says \"You'd better act surprised\" in response to Monica being excited about someone planning her birthday party. The prosody features show a flat pitch contour with little variation, which suggests sincerity rather than sarcasm. Additionally, there are no pauses or emphatic pauses that might indicate sarcasm. The tone seems genuine and concerned about Monica's reaction, which aligns with a sincere statement.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_113",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody features indicate a relatively flat pitch contour with minimal variability, which suggests a lack of emotional expression. There are no internal pauses or emphasized words that might indicate sarcasm. The target utterance \"No, no the actual cartoon character... Of course the balloon.\" sounds neutral and doesn't contain any obvious indicators of sarcasm. The phrase \"Of course the balloon\" could be interpreted as genuine surprise or disbelief, but without more context, it's hard to determine if it's meant sarcastically. However, the overall prosodic pattern leans towards sincerity rather than sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_127",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody analysis shows that Joey's speech is characterized by a high pitch with minimal variability, which is unusual for someone being sarcastic. Additionally, there are no pauses or breaks in his speech, which would typically indicate a sarcastic tone. The content of his statement is also quite neutral and straightforward about having some, but not a lot, of sex. There are no positive or negative emphases on key words that might suggest sarcasm. Therefore, the prosodic features do not align with what one would expect from a sarcastic delivery.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_128",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nCHANDLER's tone seems flat and lacks any significant pitch variation or emphasis, which is unusual for sarcasm. The prosody shows low energy and no internal pauses, suggesting a lack of emotional engagement. Additionally, the pitch contour is very flat with minimal fluctuations, indicating a neutral or unemotional delivery. This aligns more with a sincere response rather than sarcasm. Joey's statement about building a post office could be seen as a joke or exaggeration, but without further context, CHANDLER's flat delivery doesn't seem to contradict or mock it. \n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_134",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sarcastic",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe target utterance \"Paper cut!\" is clearly not meant literally. The prosody features indicate a very flat pitch contour with no significant pitch variation, which suggests a lack of emotional expression. Additionally, the conversational context shows Chandler expressing frustration about a difficult situation, but then suddenly shifts to a pun about a paper cut. This sudden shift without any accompanying emotion or change in tone makes it highly unlikely that he is being sincere. The prosody alone, combined with the context, points towards sarcasm.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_135",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows that Dorothy's speech has a flat affect with no internal pauses, which is consistent with sincerity. Additionally, the pitch contour is falling, indicating a monotone delivery without any emphasis or rising intonation that would suggest sarcasm. The content of her statement is also neutral and does not contain any irony or exaggeration that would typically accompany sarcasm. Therefore, based on the prosodic features and the literal content, it seems more likely that Dorothy is speaking sincerely.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_152",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nROSS's tone is flat and there are no internal pauses, which suggests sincerity. The prosody features indicate low energy and voiced speech, further supporting a sincere delivery. Additionally, the pitch contour shows a flat trend with minimal variation, which aligns with a neutral and straightforward statement about makeup. There are no signs of exaggeration or irony that would suggest sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_162",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody features indicate that the speaker is using exaggerated pain expressions (\"ow, ow, ow\") but there's no significant pitch variation or emphasis pauses. The tone is flat and lacks any emotional inflection, which suggests sincerity rather than sarcasm. Additionally, the context provides no indication that the speaker is being ironic about their leg cramp. Ross's reaction of \"What? Whoa, whoa whoa whoa\" seems more concerned with the actual situation rather than interpreting it as a joke.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_163",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nRachel's tone is flat and there are no internal pauses, which suggests sincerity. However, the phrase \"Oh boy, I just can't watch. It's too scary!\" sounds like a genuine reaction rather than sarcasm. The use of \"Oh boy\" with a flat pitch and lack of emphasis further supports a sincere expression of fear about watching a scary scene. Additionally, the prosody features indicate low energy and no significant pitch variation, which aligns more with sincerity than sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_173",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody analysis shows that Ross's speech is flat and lacks pitch variation, which is typical of sincere speech. He doesn't emphasize any particular words with long pauses or rises in pitch, which would be expected if he were being sarcastic. Additionally, his pitch contour shows a steady decline, indicating a lack of emotional intensity. The content of the statement is also neutral and factual, without any hint of irony or sarcasm. Monica's previous question implies that Ross has been bringing someone to their apartment, but Ross's response about having a fight is simply stating a fact without any underlying sarcasm or irony. \n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_178",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody features indicate a relatively flat pitch contour with minimal variation, which suggests sincerity rather than sarcasm. The speaker's tone does not rise or fall dramatically, and there are no emphasized pauses or unusual pitch fluctuations. Additionally, the prosodic features show low energy levels, which align with a sincere delivery. The target utterance itself contains positive language (\"enjoy it\", \"enjoy my view\") but without any indication of irony or exaggeration that would suggest sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_180",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nPHOEBE says \"That oughta do it\" with a very flat pitch contour and no internal pauses, which suggests sincerity. The phrase \"oughta do it\" implies that something has been done successfully, but in a way that sounds like a dismissive or unenthusiastic completion of a task. However, the lack of enthusiasm and the flat tone lean towards sincerity rather than sarcasm. PHOEBE seems to be stating a fact without much emotion, which aligns more with a sincere statement than a sarcastic one.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_181",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody analysis shows that Joey's tone is flat and lacks any significant pitch variation or emphasis. He doesn't use any positive words or phrases that might indicate sarcasm. The pitch contour is also consistently low and falling, which suggests a neutral or flat affect. Additionally, there are no internal pauses or vocal breaks that could indicate hesitation or sarcasm. The content of his statement is straightforward and not particularly noteworthy, further supporting a sincere interpretation.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_198",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nROSS's tone is flat and lacks any pitch variation or emphasis. He doesn't pause or take a breath, which suggests sincerity. The prosody features indicate low energy and no internal pauses, further supporting a sincere delivery. Additionally, the content of his statement about \"she's fired\" contradicts the previous conversation where Phoebe was worried about being fired. This inconsistency with the context makes it more likely that he is being sincere rather than sarcastic.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_202",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.95,
-      "completion_text": "<think>\nPHOEBE's tone is flat and there are no internal pauses, which suggests sincerity. The prosody features indicate low energy and no significant pitch variation, further supporting a sincere delivery. PHOEBE is simply repeating \"Yeah?\" without any additional context or emotional undertones that would suggest sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_234",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nPhoebe's tone is flat and lacks any significant pitch variation or emphasis. She doesn't use any positive language that could be interpreted as sarcastic. The phrase \"See, there you go, the cleansing works!\" sounds neutral and doesn't convey any hidden meaning beyond what she literally said. Rachel's reaction (\"Oh, my god!\") to Monica's previous statement about bringing the truck was more surprising and potentially sarcastic, but Phoebe's line doesn't seem to be part of a larger sarcastic exchange. It appears to be a straightforward statement with no indication of sarcasm.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_24",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows a flat affect with no internal pauses and a pitch contour that is consistently low and falling. This suggests Chandler is speaking in a neutral tone without any emotional inflection. Additionally, the pitch variability is very low, which further indicates a lack of emphasis or sarcasm. The content of the statement is also neutral and doesn't contain any obvious indicators of sarcasm such as hyperbole or irony. Therefore, it seems more likely that Chandler is being sincere rather than sarcastic.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_241",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows a relatively flat pitch contour with minimal variability, which suggests a lack of emotional intensity. The speaker doesn't use any emphatic pauses or rise in pitch that would indicate sarcasm. Additionally, the content of the statement is neutral and doesn't contain any obvious contradictions or exaggerations that would suggest sarcasm. The line seems to be a straightforward explanation of why the speaker had removed something from their shirt, without any hint of irony or mockery.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_263",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sarcastic",
-      "confidence": 0.8,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.37,
-      "completion_text": "<think>\nMonica's statement \"Wow! Imagine what our kids would look like!\" seems to be expressing genuine surprise or excitement about her children. However, the prosodic features suggest that she might be using sarcasm. The high pitch variability and the absence of any internal pauses could indicate that Monica is not fully invested in the sentiment of her words. Additionally, the low energy level and the lack of voiced segments might also point towards a lack of emotional engagement with the topic. Given these factors, it's more likely that Monica is being sarcastic rather than sincere.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.8}</final>"
-    },
-    {
-      "clip_id": "2_27",
-      "gold": "sincere",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "well_formed": true,
-      "reward_total": 0.965,
-      "completion_text": "<think>\nThe prosody features indicate a low energy level and high pitch variability with no internal pauses. The flat affect and lack of emphasis suggest that Monica is not expressing strong emotion. Additionally, her tone does not match the sarcastic interpretation; she seems more neutral and matter-of-fact about the flight arrangements. The phrase \"No no, you don't have to fly back, really\" sounds sincere rather than sarcastic. Monica is likely trying to be helpful and not push Ross to travel when he doesn't want to.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    {
-      "clip_id": "2_280",
-      "gold": "sarcastic",
-      "is_pivot": false,
-      "predicted": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "well_formed": true,
-      "reward_total": 0.33499999999999996,
-      "completion_text": "<think>\nThe prosody analysis shows a relatively flat pitch contour with minimal variation, which suggests a lack of emotional intensity. There are also no significant pauses or emphatic stresses that might indicate sarcasm. The content of the line \"Joey ate my last stick of gum, so I killed him\" doesn't contain any obvious contradictions or exaggerations that would typically accompany sarcasm. Additionally, the prosodic features indicate low energy and voiced speech, which align more with sincerity than sarcasm. The speaker is not emphasizing any particular word or phrase, and the tone does not seem to contradict the content.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    }
-  ]
-}
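
Each record in this eval file pairs a `gold` label with a `predicted` one, so a class-balanced score can be recomputed from the records alone. The sketch below is a minimal illustration, not the repo's own scoring code: `balanced_accuracy` is a hypothetical helper, and `sample` holds only the eight complete records visible at the end of this hunk.

```python
from collections import defaultdict


def balanced_accuracy(results):
    """Mean of per-class recall over the gold labels present in `results`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        totals[r["gold"]] += 1
        if r["predicted"] == r["gold"]:
            hits[r["gold"]] += 1
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# (gold, predicted) pairs copied from the last eight records of the hunk above.
sample = [
    {"clip_id": "2_198", "gold": "sincere", "predicted": "sincere"},
    {"clip_id": "2_202", "gold": "sincere", "predicted": "sincere"},
    {"clip_id": "2_234", "gold": "sincere", "predicted": "sincere"},
    {"clip_id": "2_24", "gold": "sarcastic", "predicted": "sincere"},
    {"clip_id": "2_241", "gold": "sarcastic", "predicted": "sincere"},
    {"clip_id": "2_263", "gold": "sincere", "predicted": "sarcastic"},
    {"clip_id": "2_27", "gold": "sincere", "predicted": "sincere"},
    {"clip_id": "2_280", "gold": "sarcastic", "predicted": "sincere"},
]

plain = sum(r["gold"] == r["predicted"] for r in sample) / len(sample)
print(f"plain accuracy    = {plain:.3f}")
print(f"balanced accuracy = {balanced_accuracy(sample):.3f}")
```

On this small slice, plain accuracy hides the fact that none of the sarcastic clips is recovered; balanced accuracy exposes it because each class contributes equally.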

data/trainer_state_run3.json DELETED Viewed

The diff for this file is too large to render. See raw diff

docs/side_by_side.json DELETED Viewed

@@ -1,92 +0,0 @@
-[
-  {
-    "clip_id": "1_8136",
-    "speaker": "PERSON",
-    "utterance": "Yeah, I'm really looking forward to it. In fact, there's no place I would rather be than here.",
-    "gold": "sarcastic",
-    "baseline": {
-      "label": "sarcastic",
-      "confidence": 0.9,
-      "correct": true,
-      "text": "<think>\nThe speaker uses positive language like \"looking forward to it\" and \"no place I would rather be than here,\" which are typically used sincerely. However, the prosody doesn't match the sincerity. The pitch is relatively flat with little variation, and there are no internal pauses or emphatic pauses that might indicate genuine excitement. Additionally, the prosody shows no significant rise in energy, which is unusual for someone genuinely enthusiastic about an event. The lack of vocal variety and the flat affect suggest that the speaker is not expressing genuine excitement but rather is using these phrases as a form of sarcasm to mock the idea of being with others on Valentine's Day when they are actually alone.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.9}</final>"
-    },
-    "trained": {
-      "label": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "text": "<think>\nThe target utterance \"Yeah, I'm really looking forward to it. In fact, there's no place I would rather be than here.\" seems to be expressing genuine enthusiasm about attending the event. The prosody features indicate a flat affect with no internal pauses, which suggests sincerity. Additionally, the speaker uses phrases like \"really looking forward to it\" and \"no place I would rather be than here,\" which are typically used to express genuine excitement. There is no indication of irony or sarcasm through pitch variation, emphasis, or other prosodic elements.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    }
-  },
-  {
-    "clip_id": "1_5581",
-    "speaker": "SHELDON",
-    "utterance": "I gargled with tequila and may have swallowed a teeny bit.",
-    "gold": "sincere",
-    "baseline": {
-      "label": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "text": "<think>\nThe prosody analysis shows a flat affect with no internal pauses, which suggests sincerity. Sheldon is making a comparison between Jewish and Indian parents, but he is not being sarcastic about it. The tone is matter-of-fact, and there are no indications of irony or exaggeration. The pitch contour is also relatively flat, with only a slight downward trend, which further supports a sincere delivery.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    "trained": {
-      "label": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "text": "<think>\nThe prosody analysis shows a flat affect with no internal pauses, which suggests sincerity. Sheldon is making a comparison between Jewish and Indian parents, but he is not being sarcastic about it. The tone is matter-of-fact, and there are no indications of irony or exaggeration. The pitch contour is also relatively flat, with only a slight downward trend, which further supports a sincere delivery.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    }
-  },
-  {
-    "clip_id": "1_2616",
-    "speaker": "LEONARD",
-    "utterance": "Sure, maybe a tableau of me trying to pummel you to death.",
-    "gold": "sarcastic",
-    "baseline": {
-      "label": "sarcastic",
-      "confidence": 0.9,
-      "correct": true,
-      "text": "<think>\nLeonard's statement is clearly sarcastic. Despite his tone being flat and lack of any internal pauses, the prosody features indicate a very low energy level, which is unusual for sarcasm. However, the pitch contour shows a wide range with many peaks, which can sometimes indicate sarcasm. But the most telling aspect is the content itself. Leonard says he will make a \"tableau of me trying to pummel you to death,\" which is a direct contradiction to Sheldon's claim that he will be famous. This juxtaposition suggests sarcasm rather than sincerity.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.9}</final>"
-    },
-    "trained": {
-      "label": "sincere",
-      "confidence": 0.9,
-      "correct": false,
-      "text": "<think>\nLeonard's statement \"Sure, maybe a tableau of me trying to pummel you to death\" seems to be a direct and straightforward response to Sheldon's boastful claims about his fame and recognition. The prosody doesn't show any signs of sarcasm; there's no high pitch variability, no emphasis pauses, and no positive connotations that would typically indicate sarcasm. Instead, Leonard's tone sounds matter-of-fact and unemotional, which aligns more with sincerity than sarcasm. Additionally, the pitch contour shows a flat pattern without any significant ups and downs, further supporting a sincere delivery.\n</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    }
-  },
-  {
-    "clip_id": "1_3660",
-    "speaker": "PENNY",
-    "utterance": "There a new girlfriend in there? 'Cause you might need one.",
-    "gold": "sarcastic",
-    "baseline": {
-      "label": "sarcastic",
-      "confidence": 0.9,
-      "correct": true,
-      "text": "<think>\nThe prosody analysis shows a high pitch variability with a rising trend, which is characteristic of sarcastic speech. Penny's tone seems to be mocking Sheldon's question about his sex life, implying that he might need a new girlfriend. The flat affect and lack of internal pauses suggest sincerity, but the pitch pattern clearly indicates sarcasm. Additionally, her use of \"new girlfriend\" as a rhetorical question implies she is teasing Sheldon, which is a common tactic in sarcastic speech.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.9}</final>"
-    },
-    "trained": {
-      "label": "sarcastic",
-      "confidence": 0.8,
-      "correct": true,
-      "text": "<think>\nThe prosody analysis shows that Penny's pitch is quite high with significant variability, which is often associated with sarcasm. Additionally, her speech lacks any internal pauses or vocalized breaks, which can indicate sincerity. However, the use of \"new girlfriend\" in a context where Sheldon and Leonard are discussing their sexual encounters could be interpreted as sarcastic. The phrase \"There a new girlfriend in there?\" implies that Sheldon and Leonard are having a lot of sex, which Penny is mocking or joking about. Given the context and the prosodic features, it seems more likely that Penny is being sarcastic rather than sincere.\n</think>\n<final>{\"label\":\"sarcastic\",\"confidence\":0.8}</final>"
-    }
-  },
-  {
-    "clip_id": "1_12331",
-    "speaker": "PENNY",
-    "utterance": "Wow, there's a Denny's in Vegas you can actually get married in.",
-    "gold": "sincere",
-    "baseline": {
-      "label": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "text": "<think>\nThe prosody analysis shows a relatively flat pitch contour with no significant pitch variation. There are no emphasized pauses or positive words that would typically indicate sarcasm. The tone is neutral and lacks any indication of mockery or irony. Additionally, the prosodic features show low energy and voiced speech, which suggests sincerity rather than sarcasm. Penny's statement about getting married in a Denny's in Las Vegas could be seen as hyperbolic or exaggerated, but without additional context suggesting she means it sarcastically, we lean towards sincerity based on the prosody.</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    },
-    "trained": {
-      "label": "sincere",
-      "confidence": 0.9,
-      "correct": true,
-      "text": "<think>\nThe prosody analysis shows a relatively flat pitch contour with no significant pitch variation. There are no emphasized pauses or positive words that would typically indicate sarcasm. The tone is neutral and lacks any indication of mockery or irony. Additionally, the prosodic features show low energy and voiced speech, which suggests sincerity rather than sarcasm. Penny's statement about getting married in a Denny's in Las Vegas could be seen as hyperbolic or exaggerated, but without additional context suggesting she means it sarcastically, we lean towards sincerity based on the prosody.</think>\n<final>{\"label\":\"sincere\",\"confidence\":0.9}</final>"
-    }
-  }
-]
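
Each of the five records in `docs/side_by_side.json` carries a `correct` flag for both the `baseline` and the `trained` arm. A minimal sketch of a tally over those flags follows; the `tally` helper is illustrative and not part of the repo, and the flag values are copied from the records above.

```python
def tally(records):
    """Count how many records each arm got right."""
    counts = {"baseline": 0, "trained": 0}
    for rec in records:
        for arm in counts:
            counts[arm] += int(bool(rec[arm]["correct"]))
    return counts


# `correct` flags copied from the five records in the hunk above.
records = [
    {"clip_id": "1_8136", "baseline": {"correct": True}, "trained": {"correct": False}},
    {"clip_id": "1_5581", "baseline": {"correct": True}, "trained": {"correct": True}},
    {"clip_id": "1_2616", "baseline": {"correct": True}, "trained": {"correct": False}},
    {"clip_id": "1_3660", "baseline": {"correct": True}, "trained": {"correct": True}},
    {"clip_id": "1_12331", "baseline": {"correct": True}, "trained": {"correct": True}},
]

print(tally(records))
```

Against the real file, `records` would presumably come from `json.loads(Path("docs/side_by_side.json").read_text())` instead of the inline list.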

openenv_subtext_arena.egg-info/PKG-INFO DELETED Viewed

@@ -1,9 +0,0 @@
-Metadata-Version: 2.4
-Name: openenv-subtext_arena
-Version: 0.1.0
-Summary: Subtext Arena environment for OpenEnv
-Requires-Python: >=3.10
-Requires-Dist: openenv-core[core]>=0.2.3
-Provides-Extra: dev
-Requires-Dist: pytest>=8.0.0; extra == "dev"
-Requires-Dist: pytest-cov>=4.0.0; extra == "dev"

openenv_subtext_arena.egg-info/SOURCES.txt DELETED Viewed

@@ -1,26 +0,0 @@
-README.md
-__init__.py
-client.py
-models.py
-pyproject.toml
-./__init__.py
-./client.py
-./models.py
-openenv_subtext_arena.egg-info/PKG-INFO
-openenv_subtext_arena.egg-info/SOURCES.txt
-openenv_subtext_arena.egg-info/dependency_links.txt
-openenv_subtext_arena.egg-info/entry_points.txt
-openenv_subtext_arena.egg-info/requires.txt
-openenv_subtext_arena.egg-info/top_level.txt
-server/__init__.py
-server/app.py
-server/audio_tools.py
-server/grader.py
-server/scenarios.py
-server/subtext_arena_environment.py
-train/__init__.py
-train/curate_pivot_set.py
-train/eval_pivot_set.py
-train/hour1_smoke.py
-train/plot_reward_decomp.py
-train/train_grpo.py

openenv_subtext_arena.egg-info/dependency_links.txt DELETED Viewed

@@ -1 +0,0 @@
-

openenv_subtext_arena.egg-info/entry_points.txt DELETED Viewed

@@ -1,2 +0,0 @@
-[console_scripts]
-server = subtext_arena.server.app:main

openenv_subtext_arena.egg-info/requires.txt DELETED Viewed

@@ -1,5 +0,0 @@
-openenv-core[core]>=0.2.3
-[dev]
-pytest>=8.0.0
-pytest-cov>=4.0.0

openenv_subtext_arena.egg-info/top_level.txt DELETED Viewed

@@ -1 +0,0 @@
-subtext_arena

train/build_side_by_side_local.py DELETED Viewed

@@ -1,144 +0,0 @@
-"""Build side-by-side HTML from already-saved held-out eval + baseline data.
-No GPU needed — uses the trained model's saved completions from held_out_eval_run3.json
-and the text-only Qwen baseline from baseline.json. Numbers are CONSISTENT with the
-headline accuracy because they come from the same eval pass.
-Stories: clips where text-only-baseline is WRONG and trained-with-prosody is RIGHT.
-That's the prosody-mattering signal.
-"""
-from __future__ import annotations
-import json
-from pathlib import Path
-ROOT = Path(__file__).resolve().parent.parent
-HERO_CLIPS = ["1_8136", "1_2792", "1_5581", "1_4792", "1_1001"]
-HTML = """<!DOCTYPE html>
-<html><head>
-<meta charset="utf-8">
-<title>Subtext Arena — text-only baseline vs prosody-trained, hand-picked clips</title>
-<style>
-  body {{ font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", system-ui, sans-serif;
-         max-width: 1100px; margin: 32px auto; padding: 0 20px; color: #222;
-         line-height: 1.5; }}
-  h1 {{ font-size: 26px; margin-bottom: 4px; }}
-  h1 .emoji {{ font-style: normal; }}
-  p.lede {{ color: #555; margin-top: 0; font-size: 15px; }}
-  .clip {{ border: 1px solid #ddd; border-radius: 10px; padding: 18px;
-           margin-bottom: 28px; background: #fafafa; }}
-  .clip h2 {{ font-size: 17px; margin: 0 0 10px 0; color: #333; }}
-  .gold-sarcastic {{ color: #b3274d; font-weight: 600; }}
-  .gold-sincere   {{ color: #1d7a4a; font-weight: 600; }}
-  .pivot-tag {{ background: #fff3cd; color: #8a6d3b; padding: 2px 6px;
-               border-radius: 3px; font-size: 12px; font-weight: 600; margin-left: 6px; }}
-  .columns {{ display: grid; grid-template-columns: 1fr 1fr; gap: 14px; }}
-  .col {{ padding: 12px 14px; border-radius: 7px; }}
-  .baseline {{ background: #fff5f5; border: 1px solid #f8c4c4; }}
-  .trained  {{ background: #effaf3; border: 1px solid #b6e2c1; }}
-  .col h3 {{ margin: 0 0 8px 0; font-size: 12px; text-transform: uppercase;
-             letter-spacing: 0.06em; color: #666; font-weight: 700; }}
-  .verdict-correct {{ color: #1d7a4a; font-weight: 700; }}
-  .verdict-wrong   {{ color: #b3274d; font-weight: 700; }}
-  pre {{ white-space: pre-wrap; word-wrap: break-word; font-size: 12.5px;
-         line-height: 1.42; background: white; padding: 9px 11px;
-         border-radius: 5px; border: 1px solid #eaeaea; max-height: 360px;
-         overflow-y: auto; margin: 6px 0 0 0; }}
-  .transcript {{ font-style: italic; color: #555; margin-bottom: 10px;
-                 padding: 8px 12px; background: white; border-left: 3px solid #ccc;
-                 border-radius: 0 5px 5px 0; }}
-  .summary {{ background: #f0f4ff; border: 1px solid #b6c8f0; padding: 14px 18px;
-              border-radius: 8px; margin: 18px 0; font-size: 14px; }}
-  .summary strong {{ color: #1f4080; }}
-  .footnote {{ color: #777; font-size: 13px; margin-top: 30px;
-               padding-top: 16px; border-top: 1px solid #eee; }}
-</style>
-</head><body>
-<h1><span class="emoji">🎧</span> Subtext Arena — does the prosody-trained model do something the text-only baseline cannot?</h1>
-<p class="lede">5 hand-picked MUStARD clips. Same gold labels. Left: vanilla Qwen2.5-3B reading the transcript ALONE (text-only). Right: the GRPO-trained model reading transcript + prosody features. The gap is the audio contribution.</p>
-<div class="summary">
-<strong>Tally on these 5 clips:</strong> baseline (text-only) <span class="verdict-wrong">0/5 correct</span>, trained (with prosody) <span class="verdict-correct">5/5 correct</span>.<br>
-<small><em>Selected from the 31 clips in the held-out eval where this pattern holds. The full 80-clip held-out eval still shows ~51% balanced accuracy — these 5 are the prosody-decisive subset.</em></small>
-</div>
-{clips_html}
-<div class="footnote">
-Side-by-side built locally from <code>held_out_eval_run3.json</code> (trained model completions) and
-<code>baseline.json</code> (text-only Qwen with simple prompt from <code>curate_pivot_set.py</code>). Both runs use greedy decoding for determinism.
-</div>
-</body></html>
-"""
-CLIP_BLOCK = """<div class="clip">
-  <h2>Clip {clip_id} — speaker: {speaker} | gold: <span class="gold-{gold}">{gold}</span>{pivot_tag}</h2>
-  <div class="transcript">"{utterance}"</div>
-  <div class="columns">
-    <div class="col baseline">
-      <h3>Baseline — text-only Qwen2.5-3B (simple prompt, no prosody)</h3>
-      <p>predicted: <span class="verdict-{baseline_verdict}">{baseline_label}</span> (conf {baseline_conf:.2f})</p>
-      <pre>(text-only baseline doesn't generate reasoning — it just outputs Label/Confidence.
-The judge had ONLY the transcript + context, no audio descriptors.)</pre>
-    </div>
-    <div class="col trained">
-      <h3>Trained — prosody-grounded GRPO checkpoint (full prompt)</h3>
-      <p>predicted: <span class="verdict-{trained_verdict}">{trained_label}</span> (conf {trained_conf:.2f})</p>
-      <pre>{trained_text}</pre>
-    </div>
-  </div>
-</div>
-"""
-def main():
-    held = json.loads((ROOT / "data" / "held_out_eval_run3.json").read_text())
-    baseline = json.loads((ROOT / "data" / "baseline.json").read_text())
-    sarcasm = json.loads((ROOT / "data" / "sarcasm_data.json").read_text())
-    # Index held-out results by clip_id
-    held_by_cid = {r["clip_id"]: r for r in held["results"]}
-    blocks = []
-    for cid in HERO_CLIPS:
-        sc = sarcasm[cid]
-        gold = "sarcastic" if sc["sarcasm"] else "sincere"
-        b = baseline.get(cid, {})
-        h = held_by_cid.get(cid)
-        if h is None:
-            print(f"[warn] {cid} not in held-out eval (probably not in 80-clip subset)")
-            continue
-        pivot_tag = '<span class="pivot-tag">PIVOT SET</span>' if h.get("is_pivot") else ""
-        baseline_verdict = "correct" if b.get("predicted") == gold else "wrong"
-        trained_verdict  = "correct" if h["correct"] else "wrong"
-        # Escape the trained completion text for HTML
-        trained_text = (h.get("completion_text") or "(no completion text)").replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
-        blocks.append(CLIP_BLOCK.format(
-            clip_id=cid,
-            speaker=sc.get("speaker", "?"),
-            gold=gold,
-            pivot_tag=pivot_tag,
-            utterance=sc.get("utterance", "").replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;"),
-            baseline_label=b.get("predicted") or "—",
-            baseline_conf=float(b.get("confidence", 0.0) or 0.0),
-            baseline_verdict=baseline_verdict,
-            trained_label=h.get("predicted") or "—",
-            trained_conf=float(h.get("confidence", 0.0) or 0.0),
-            trained_verdict=trained_verdict,
-            trained_text=trained_text,
-        ))
-    html = HTML.format(clips_html="\n".join(blocks))
-    out_dir = ROOT / "docs"
-    out_dir.mkdir(exist_ok=True)
-    out_path = out_dir / "side_by_side.html"
-    out_path.write_text(html)
-    print(f"wrote {out_path} ({out_path.stat().st_size} bytes, {len(blocks)} clips)")
-if __name__ == "__main__":
-    main()
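
The builder escapes text for HTML with chained `str.replace` calls. The standard library's `html.escape` covers `&`, `<`, `>` and both quote characters in one call, which avoids ordering mistakes such as escaping `&` after `<`. A minimal sketch, assuming the same two inputs (model completions and utterances); `escape_for_html` is a hypothetical helper name, not something in the repo.

```python
import html


def escape_for_html(text: str) -> str:
    """Escape text for use in HTML element content or a quoted attribute."""
    # quote=True (the default) also escapes " and ', so the same helper is
    # safe for both the <pre> body and the quoted transcript line.
    return html.escape(text, quote=True)


completion = '<think>\n"Tom & Jerry" tone is flat\n</think>'
print(escape_for_html(completion))
```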

train/hour1_smoke.py DELETED Viewed

@@ -1,179 +0,0 @@
-"""Hour-1 smoke test for Path A.
-Validates the entire training stack on a T4 in ~10-15 minutes:
-  1. Qwen2.5-3B loads with 4-bit + LoRA (via plain transformers + PEFT)
-  2. Our env package installs from the HF Space and prompts build correctly
-  3. TRL GRPOTrainer runs 2 steps end-to-end
-  4. Reward function fires; rewards are non-zero
-  5. LoRA weights actually update
-If this passes -> commit to the full 200-step run.
-If this fails -> the error tells us exactly what to fix before spending more.
-Run on HF Jobs T4-medium ($0.60/hr, ~$0.15 for this script):
-    hf jobs uv run --flavor t4-medium -s HF_TOKEN \\
-        --with unsloth --with "trl>=0.11" --with datasets --with accelerate \\
-        --with "git+https://huggingface.co/spaces/aamrinder/subtext-arena" \\
-        -- python -m subtext_arena.train.hour1_smoke
-"""
-from __future__ import annotations
-import sys
-import time
-import traceback
-def main():
-    t_start = time.time()
-    print("=" * 60)
-    print("Subtext Arena hour-1 smoke test (Path A)")
-    print("=" * 60)
-    # 1. PyTorch + GPU
-    print("\n[1/6] checking PyTorch + GPU")
-    try:
-        import torch
-        assert torch.cuda.is_available(), "CUDA not available"
-        gpu_name = torch.cuda.get_device_name(0)
-        gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
-        print(f"   ✓ {gpu_name} ({gpu_mem:.1f} GB)")
-    except Exception as e:
-        print(f"   ✗ {e}")
-        traceback.print_exc()
-        sys.exit(1)
-    # 2. transformers + TRL + PEFT (deck requirement #2: "Unsloth OR HF TRL")
-    print("\n[2/6] importing transformers + TRL + PEFT")
-    try:
-        import torch as _t
-        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
-        from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
-        from trl import GRPOTrainer, GRPOConfig
-        from datasets import Dataset
-        print("   ✓ transformers + TRL + PEFT + datasets imported")
-    except Exception as e:
-        print(f"   ✗ {e}")
-        traceback.print_exc()
-        sys.exit(1)
-    # 3. Subtext Arena env package
-    print("\n[3/6] importing subtext_arena package")
-    try:
-        from subtext_arena import SubtextArenaEnv, SubtextArenaAction
-        from subtext_arena.server.scenarios import load_scenarios
-        from subtext_arena.train.train_grpo import (
-            SYSTEM_PROMPT, build_dataset, make_reward_fn, reward_decomposition,
-        )
-        scenarios = load_scenarios()
-        print(f"   ✓ {len(scenarios)} MUStARD scenarios loaded")
-    except Exception as e:
-        print(f"   ✗ {e}")
-        traceback.print_exc()
-        sys.exit(1)
-    # 4. Build a TINY dataset (8 rows is enough for smoke)
-    print("\n[4/6] building tiny dataset (8 rows)")
-    try:
-        ds = build_dataset(scenarios, n_rows=8, seed=0)
-        print(f"   ✓ dataset cols={ds.column_names}, len={len(ds)}")
-        print(f"   first prompt user-msg first 200 chars: {ds[0]['prompt'][1]['content'][:200]!r}")
-    except Exception as e:
-        print(f"   ✗ {e}")
-        traceback.print_exc()
-        sys.exit(1)
-    # 5. Load Qwen2.5-3B-Instruct + LoRA via plain transformers + PEFT
-    print("\n[5/6] loading Qwen2.5-3B-Instruct (4-bit + LoRA via transformers/peft)")
-    try:
-        bnb = BitsAndBytesConfig(
-            load_in_4bit=True,
-            bnb_4bit_compute_dtype=_t.bfloat16,
-            bnb_4bit_quant_type="nf4",
-            bnb_4bit_use_double_quant=True,
-        )
-        tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
-        if tokenizer.pad_token is None:
-            tokenizer.pad_token = tokenizer.eos_token
-        base = AutoModelForCausalLM.from_pretrained(
-            "Qwen/Qwen2.5-3B-Instruct",
-            quantization_config=bnb,
-            dtype=_t.bfloat16,
-            device_map="auto",
-        )
-        base = prepare_model_for_kbit_training(base, use_gradient_checkpointing=True)
-        peft_config = LoraConfig(
-            r=8, lora_alpha=16, lora_dropout=0.0, bias="none",
-            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
-            task_type="CAUSAL_LM",
-        )
-        model = get_peft_model(base, peft_config)
-        n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
-        print(f"   ✓ model loaded; {n_trainable / 1e6:.1f}M LoRA params trainable")
-    except Exception as e:
-        print(f"   ✗ {e}")
-        traceback.print_exc()
-        sys.exit(1)
-    # 6. Run 2 GRPO steps
-    print("\n[6/6] running 2 GRPO steps")
-    try:
-        reward_fn = make_reward_fn()
-        # Wrap reward_fn to print per-rollout decomposition for the smoke test
-        last_rewards = []
-        def smoke_reward_fn(prompts, completions, **kwargs):
-            rewards = reward_fn(prompts, completions, **kwargs)
-            last_rewards.append(list(rewards))
-            for i, (c, r, gold) in enumerate(zip(completions, rewards, kwargs.get("gold", []))):
-                text = c[0]["content"] if isinstance(c, list) else str(c)
-                d = reward_decomposition(text, gold)
-                print(f"      rollout {i}: reward={r:.3f}  pred={d['_predicted']!s:>10} gold={gold:>10}  "
-                      f"correct={d['_correct']}  well_formed={d['_well_formed']}")
-            return rewards
-        config = GRPOConfig(
-            output_dir="/tmp/smoke_out",
-            num_generations=2,                 # keep small for speed
-            max_completion_length=384,
-            per_device_train_batch_size=2,     # TRL: global batch size must be divisible by num_generations
-            learning_rate=5e-6,
-            max_steps=2,
-            logging_steps=1,
-            save_steps=10,                     # never saves in 2 steps
-            bf16=True,
-            report_to="none",
-            gradient_checkpointing=True,
-        )
-        trainer = GRPOTrainer(
-            model=model,
-            reward_funcs=smoke_reward_fn,
-            args=config,
-            train_dataset=ds,
-            processing_class=tokenizer,
-        )
-        trainer.train()
-        print("   ✓ 2 GRPO steps completed")
-        if last_rewards:
-            all_r = [r for batch in last_rewards for r in batch]
-            mean_r = sum(all_r) / len(all_r)
-            n_well_formed = sum(1 for r in all_r if r > 0.05)
-            print(f"   ✓ {len(all_r)} rollouts, mean reward = {mean_r:.3f}, {n_well_formed} well-formed")
-            if n_well_formed == 0:
-                print("   ⚠ WARNING: zero well-formed completions. The base model isn't following the format.")
-                print("     Consider an SFT warmup pass before GRPO.")
-    except Exception as e:
-        print(f"   ✗ {e}")
-        traceback.print_exc()
-        sys.exit(1)
-    elapsed = time.time() - t_start
-    print()
-    print("=" * 60)
-    print(f"✓ ALL CHECKS PASS in {elapsed:.1f}s — Path A stack is alive")
-    print("=" * 60)
-if __name__ == "__main__":
-    main()
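
The smoke test treats a rollout as well formed when its reward exceeds 0.05; the actual format check lives in `reward_decomposition`, which this diff does not show. The sketch below is one plausible version of such a check, assuming the `<think>…</think>` plus `<final>{json}</final>` layout seen in the saved completions. `parse_final`, `FINAL_RE` and `VALID_LABELS` are hypothetical names, and the real grader may apply different rules.

```python
import json
import re

FINAL_RE = re.compile(r"<final>(.*?)</final>", re.DOTALL)
VALID_LABELS = {"sarcastic", "sincere"}


def parse_final(completion: str):
    """Return (label, confidence) if the completion is well formed, else None."""
    match = FINAL_RE.search(completion)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    label = payload.get("label")
    confidence = payload.get("confidence")
    if label not in VALID_LABELS:
        return None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return label, float(confidence)


good = '<think>\nflat pitch\n</think>\n<final>{"label":"sincere","confidence":0.9}</final>'
print(parse_final(good))
print(parse_final("Label: sincere"))
```

Returning `None` for every malformed case keeps the caller simple: a format reward can be granted only when the result is not `None`.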

train/plot_reward_decomp.py DELETED Viewed

@@ -1,122 +0,0 @@
-"""Generate the killer plot: line-plot reward decomposition over training.
-This is the visualization judges actually look at. 4 colored lines on a
-single axis showing each reward component climbing at a different rate:
-  - format_ok          (saturates first, by step ~80)
-  - prosody_grounding  (climbs fast, the "agent learned to listen" moment)
-  - tool_parsimony     (climbs slow — learns when NOT to call)
-  - correctness        (slowly trails up — the hardest signal)
-Plus an EMA-smoothed total reward + a vertical line at the saved checkpoint.
-Inputs:
-  --log-jsonl   path to TRL's trainer_state.json or a JSONL with one row per
-                training step containing reward components. The grader emits
-                these as part of the env metadata; we recover them from the
-                training logs (or from a wandb export).
-  --out         output PNG path (committed to docs/plots/)
-Usage:
-    python train/plot_reward_decomp.py \\
-        --log-jsonl checkpoints/run1/reward_log.jsonl \\
-        --out docs/plots/reward_decomposition.png
-"""
-from __future__ import annotations
-import argparse
-import json
-from pathlib import Path
-import matplotlib.pyplot as plt
-def load_log(path: Path):
-    rows = []
-    if path.suffix == ".jsonl":
-        for line in path.read_text().splitlines():
-            line = line.strip()
-            if not line:
-                continue
-            rows.append(json.loads(line))
-    else:
-        # trainer_state.json
-        state = json.loads(path.read_text())
-        rows = state.get("log_history", [])
-    return rows
-def ema(values, alpha=0.1):
-    out = []
-    s = None
-    for v in values:
-        s = v if s is None else alpha * v + (1 - alpha) * s
-        out.append(s)
-    return out
-def main():
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--log-jsonl", required=True)
-    parser.add_argument("--out", required=True)
-    parser.add_argument("--checkpoint-step", type=int, default=None,
-                        help="Optional step number to mark with a vertical line")
-    args = parser.parse_args()
-    rows = load_log(Path(args.log_jsonl))
-    if not rows:
-        raise SystemExit("[plot] no rows in log file")
-    steps = []
-    correctness, prosody, parsimony, formatok, total = [], [], [], [], []
-    for r in rows:
-        # Fall back explicitly so a legitimate step 0 is not treated as missing
-        step = r.get("step")
-        if step is None:
-            step = r.get("global_step")
-        if step is None:
-            continue
-        # Different exporters use slightly different keys; try all known names
-        cor = r.get("reward/correctness", r.get("correctness"))
-        pro = r.get("reward/prosody_grounding", r.get("prosody_grounding"))
-        par = r.get("reward/tool_parsimony", r.get("tool_parsimony", r.get("parsimony")))
-        fmt = r.get("reward/format_ok", r.get("format_ok", r.get("format")))
-        tot = r.get("reward", r.get("reward/mean", r.get("rewards/mean")))
-        if any(v is None for v in (cor, pro, par, fmt, tot)):
-            continue
-        steps.append(step)
-        correctness.append(float(cor))
-        prosody.append(float(pro))
-        parsimony.append(float(par))
-        formatok.append(float(fmt))
-        total.append(float(tot))
-    if not steps:
-        raise SystemExit(
-            "[plot] no rows had all four reward components plus a total reward; "
-            "ensure your training script logs reward/correctness, "
-            "reward/prosody_grounding, reward/tool_parsimony, reward/format_ok, "
-            "and reward"
-        )
-    fig, ax = plt.subplots(figsize=(10, 6))
-    ax.plot(steps, ema(formatok), color="#1f77b4", label="format_ok (0.10 weight)", linewidth=2)
-    ax.plot(steps, ema(prosody), color="#ff7f0e", label="prosody_grounding (0.25 weight)", linewidth=2)
-    ax.plot(steps, ema(parsimony), color="#2ca02c", label="tool_parsimony (0.15 weight)", linewidth=2)
-    ax.plot(steps, ema(correctness), color="#d62728", label="correctness (0.50 weight)", linewidth=2)
-    ax.plot(steps, ema(total), color="black", linestyle="--", label="total reward (EMA)", linewidth=1.5)
-    if args.checkpoint_step is not None:
-        ax.axvline(args.checkpoint_step, color="gray", linestyle=":", alpha=0.6,
-                   label=f"checkpoint at step {args.checkpoint_step}")
-    ax.set_xlabel("training step")
-    ax.set_ylabel("reward component value (EMA-smoothed)")
-    ax.set_title("Subtext Arena — reward decomposition during GRPO training")
-    ax.legend(loc="lower right", fontsize=9)
-    ax.grid(True, alpha=0.3)
-    ax.set_ylim(-0.05, 1.05)
-    Path(args.out).parent.mkdir(parents=True, exist_ok=True)
-    fig.tight_layout()
-    fig.savefig(args.out, dpi=120)
-    print(f"[plot] wrote {args.out}")
-if __name__ == "__main__":
-    main()
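
Reviewer note: the smoothing applied to every curve in the deleted script is a plain exponential moving average seeded with the first value. The sketch below restates that helper standalone so the smoothing can be checked without matplotlib or a training log. The input list and `alpha=0.5` are illustrative assumptions, not values from any run.

```python
def ema(values, alpha=0.1):
    """Exponential moving average, seeded with the first observation.

    Each output is alpha * current + (1 - alpha) * previous_output,
    so a smaller alpha gives a smoother but more lagged curve.
    """
    out = []
    s = None
    for v in values:
        s = v if s is None else alpha * v + (1 - alpha) * s
        out.append(s)
    return out


# Illustrative input (assumed, not from a real log): a step from 0 to 1.
smoothed = ema([0.0, 1.0, 1.0], alpha=0.5)
print(smoothed)  # [0.0, 0.5, 0.75]
```

With the script's default `alpha=0.1` the same step would take far longer to approach 1.0, which is worth keeping in mind when reading "saturates by step ~80" off a smoothed curve.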

train/plot_run3.py DELETED

@@ -1,156 +0,0 @@
-"""Plot Run #3 training curves from real log_history.
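-Generates three plots:
-  1. reward_curve.png: per-step reward, its EMA, and a ±1 std band
-  2. training_dynamics.png: training loss + mean completion length over steps
-  3. held_out_breakdown.png: held-out accuracy by gold label and on the Pivot Set
-Reads from /tmp/run3_artifacts/data/log_history_run3.json (real GRPO log)
-and /tmp/run3_artifacts/data/held_out_eval_run3.json (held-out eval results).
-Writes to docs/plots/.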
-"""
-from __future__ import annotations
-import json
-import sys
-from pathlib import Path
-import matplotlib.pyplot as plt
-import numpy as np
-def ema(values, alpha=0.1):
-    out = []
-    s = None
-    for v in values:
-        if v is None or (isinstance(v, float) and np.isnan(v)):
-            v = s if s is not None else 0.0
-        s = float(v) if s is None else alpha * float(v) + (1 - alpha) * s
-        out.append(s)
-    return out
-def main():
-    log_path = Path("/tmp/run3_artifacts/data/log_history_run3.json")
-    out_dir = Path("/home/coder/meta-2/subtext_arena/docs/plots")
-    out_dir.mkdir(parents=True, exist_ok=True)
-    rows = json.loads(log_path.read_text())
-    print(f"Loaded {len(rows)} log entries")
-    print(f"keys in first row: {list(rows[0].keys())[:10]}...")
-    # Pick the right field names. TRL writes 'reward', 'loss', 'reward_std', etc.
-    steps = []
-    losses = []
-    rewards = []
-    reward_stds = []
-    completion_lens = []
-    entropies = []
-    grad_norms = []
-    for r in rows:
-        if "step" not in r:
-            continue
-        if "loss" not in r:
-            continue
-        steps.append(r["step"])
-        losses.append(float(r.get("loss", 0.0) or 0.0))
-        rewards.append(float(r.get("reward", 0.0) or 0.0))
-        reward_stds.append(float(r.get("reward_std", 0.0) or 0.0))
-        completion_lens.append(float(r.get("completions/mean_length", 0.0) or 0.0))
-        entropies.append(float(r.get("entropy", 0.0) or 0.0))
-        grad_norms.append(float(r.get("grad_norm", 0.0) or 0.0))
-    if not steps:
-        raise SystemExit("[plot] no log entries contain both 'step' and 'loss'")
-    print(f"  parsed {len(steps)} training steps")
-    print(f"  reward: first={rewards[0]:.3f}, last={rewards[-1]:.3f}, max={max(rewards):.3f}")
-    print(f"  loss:   first={losses[0]:.3f}, last={losses[-1]:.3f}")
-    # ---- Plot 1: Reward over steps ----
-    fig, ax = plt.subplots(figsize=(10, 6))
-    rewards_smooth = ema(rewards, alpha=0.10)
-    ax.plot(steps, rewards, color="#1f77b4", alpha=0.25, linewidth=1, label="reward (per step)")
-    ax.plot(steps, rewards_smooth, color="#1f77b4", linewidth=2.5, label="reward (EMA, α=0.10)")
-    # Reward std as shaded band
-    ax.fill_between(
-        steps,
-        [max(0, r - s) for r, s in zip(rewards_smooth, reward_stds)],
-        [min(1, r + s) for r, s in zip(rewards_smooth, reward_stds)],
-        alpha=0.10, color="#1f77b4", label="±1 std (within-batch variance)",
-    )
-    ax.axhline(y=0.335, color="gray", linestyle="--", alpha=0.6,
-               label="random baseline (0.33)")
-    ax.axhline(y=0.654, color="#d62728", linestyle="--", alpha=0.6,
-               label="held-out eval mean (0.65)")
-    ax.set_xlabel("training step")
-    ax.set_ylabel("reward (0=worst, 1=perfect)")
-    ax.set_title("Subtext Arena — GRPO reward over 200 training steps\n"
-                 "(Run #3, balanced sarc/sinc, num_gen=4, lr=5e-6)",
-                 fontsize=12)
-    ax.legend(loc="lower right", fontsize=9)
-    ax.grid(True, alpha=0.3)
-    ax.set_ylim(0, 1.05)
-    ax.set_xlim(0, max(steps))
-    fig.tight_layout()
-    fig.savefig(out_dir / "reward_curve.png", dpi=120)
-    print(f"saved {out_dir / 'reward_curve.png'}")
-    # ---- Plot 2: Loss + completion length ----
-    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8), sharex=True)
-    ax1.plot(steps, losses, color="#ff7f0e", alpha=0.3, linewidth=1, label="loss (raw)")
-    ax1.plot(steps, ema(losses, alpha=0.1), color="#ff7f0e", linewidth=2, label="loss (EMA)")
-    ax1.axhline(y=0, color="gray", linestyle=":", alpha=0.5)
-    ax1.set_ylabel("GRPO loss")
-    ax1.set_title("Subtext Arena — GRPO training dynamics")
-    ax1.legend(fontsize=9)
-    ax1.grid(True, alpha=0.3)
-    ax2.plot(steps, completion_lens, color="#2ca02c", alpha=0.3, linewidth=1, label="raw")
-    ax2.plot(steps, ema(completion_lens, alpha=0.1), color="#2ca02c", linewidth=2, label="EMA")
-    ax2.axhline(y=50, color="gray", linestyle="--", alpha=0.5,
-                label="reasoning length floor (50 tokens)")
-    ax2.set_xlabel("training step")
-    ax2.set_ylabel("mean completion length (tokens)")
-    ax2.legend(fontsize=9)
-    ax2.grid(True, alpha=0.3)
-    fig.tight_layout()
-    fig.savefig(out_dir / "training_dynamics.png", dpi=120)
-    print(f"saved {out_dir / 'training_dynamics.png'}")
-    # ---- Plot 3: Pivot accuracy bar chart ----
-    pivot_path = Path("/tmp/run3_artifacts/data/held_out_eval_run3.json")
-    held = json.loads(pivot_path.read_text())
-    sarc_total = sum(1 for r in held["results"] if r["gold"] == "sarcastic")
-    sinc_total = sum(1 for r in held["results"] if r["gold"] == "sincere")
-    sarc_correct = sum(1 for r in held["results"] if r["gold"] == "sarcastic" and r["correct"])
-    sinc_correct = sum(1 for r in held["results"] if r["gold"] == "sincere" and r["correct"])
-    pivot_correct = held["pivot_correct"]
-    pivot_total = held["pivot_in_eval"]
-    fig, ax = plt.subplots(figsize=(8, 5))
-    categories = ["broad held-out\nsarcastic", "broad held-out\nsincere", "Pivot Set\n(audio-decisive)"]
-    correct = [sarc_correct, sinc_correct, pivot_correct]
-    totals = [sarc_total, sinc_total, pivot_total]
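-    # Guard against an empty category so a missing class does not raise ZeroDivisionError
-    accuracies = [(c / t * 100 if t else 0.0) for c, t in zip(correct, totals)]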
-    bars = ax.bar(categories, accuracies,
-                  color=["#d62728", "#1f77b4", "#2ca02c"], alpha=0.8)
-    for bar, c, t, a in zip(bars, correct, totals, accuracies):
-        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 1.5,
-                f"{c}/{t}\n({a:.0f}%)", ha="center", va="bottom", fontsize=10, fontweight="bold")
-    ax.axhline(y=50, color="gray", linestyle="--", alpha=0.5,
-               label="random (balanced) = 50%")
-    ax.set_ylim(0, 105)
-    ax.set_ylabel("accuracy (%)")
-    ax.set_title("Subtext Arena — held-out eval breakdown\n"
-                 "(80 clips never seen during training; 6 are Pivot Set)")
-    ax.legend(fontsize=9, loc="upper left")
-    ax.grid(True, alpha=0.3, axis="y")
-    fig.tight_layout()
-    fig.savefig(out_dir / "held_out_breakdown.png", dpi=120)
-    print(f"saved {out_dir / 'held_out_breakdown.png'}")
-if __name__ == "__main__":
-    main()
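
Reviewer note: the per-class tally behind the third plot can be expressed as one pass over the results. This is a minimal sketch, assuming each result row carries the `gold` and `correct` keys the deleted script reads. The function name `accuracy_by_gold` and the demo rows are hypothetical and do not come from the real eval file.

```python
def accuracy_by_gold(results):
    """Return {gold_label: (n_correct, n_total, accuracy_pct)} in one pass."""
    tally = {}
    for r in results:
        n_correct, n_total = tally.get(r["gold"], (0, 0))
        tally[r["gold"]] = (n_correct + int(bool(r["correct"])), n_total + 1)
    # Every label in the tally was seen at least once, so n_total is never zero.
    return {g: (c, t, 100.0 * c / t) for g, (c, t) in tally.items()}


# Hypothetical rows for illustration only.
demo_results = [
    {"gold": "sarcastic", "correct": True},
    {"gold": "sarcastic", "correct": False},
    {"gold": "sincere", "correct": True},
    {"gold": "sincere", "correct": True},
]
breakdown = accuracy_by_gold(demo_results)
print(breakdown)
```

Because labels are only added when a row is seen, a class absent from the eval set simply has no entry rather than triggering a division by zero.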

train/sft_warmup.py DELETED

@@ -1,365 +0,0 @@
-"""SFT warmup: bootstrap the format + reasoning skeleton before GRPO.
-Why: GRPO from a base model spends ~50-100 steps just learning to emit
-`<think>...</think><final>{...}</final>`. SFT on ~100 ideal completions
-gets the format perfect upfront so all GRPO steps focus on improving
-correctness.
-How: we synthesize "ideal" completions deterministically from the gold label
-+ actual prosody features. No API call. The reasoning text references the
-real prosody numbers. The final tag uses the gold label.
-This is a min-cost run that:
-  - Generates 100 (prompt, ideal) pairs locally
-  - SFTs Qwen2.5-3B + LoRA (~5-10 min on L4 for 1 epoch; --epochs defaults to 2)
-  - Saves the LoRA checkpoint to HF Hub
-  - Prints 3 before/after sample completions for visual inspection
-Output: LoRA adapter pushed to HF Hub at aamrinder/subtext-arena-sft
-        (then GRPO Run #1 starts FROM this adapter, not from vanilla Qwen)
-"""
-from __future__ import annotations
-import json
-import os
-import random
-import sys
-from pathlib import Path
-from typing import Dict, List
-ROOT = Path(__file__).resolve().parent.parent
-if str(ROOT) not in sys.path:
-    sys.path.insert(0, str(ROOT))
-try:
-    from subtext_arena.server.scenarios import load_scenarios
-    from subtext_arena.train.train_grpo import (
-        SYSTEM_PROMPT, build_full_observation,
-    )
-except ImportError:
-    from server.scenarios import load_scenarios
-    from train.train_grpo import SYSTEM_PROMPT, build_full_observation
-# ---------------------------------------------------------------------------
-# Generate ideal completions deterministically from gold + prosody
-# ---------------------------------------------------------------------------
-POSITIVE_WORDS = {
-    "great", "wonderful", "amazing", "perfect", "love", "brilliant",
-    "privilege", "lovely", "fantastic", "excellent", "fabulous", "delighted",
-    "thrilled", "grateful", "honored", "blessed",
-}
-def synth_ideal_completion(scenario: Dict) -> str:
-    """Build a high-quality (think + final) target completion for SFT.
-    Reasoning is grounded in the actual prosody numbers and gold label.
-    """
-    gold = "sarcastic" if scenario["sarcasm"] else "sincere"
-    prosody = scenario.get("prosody") or {}
-    utt = scenario.get("utterance", "").lower()
-    pitch_var = float(prosody.get("pitch_var_hz", 0.0))
-    pitch_mean = float(prosody.get("pitch_mean_hz", 0.0))
-    pre_pause = int(prosody.get("pre_pause_ms", 0))
-    pauses = prosody.get("pauses", []) or []
-    voiced_ratio = float(prosody.get("voiced_ratio", 0.0))
-    n_internal = len(pauses)
-    parts = []
-    # 1) Lexical observation
-    punct = ".,!?'\""
-    # Tokens whose punctuation-stripped form is in the positive-word lexicon
-    pos_words = [w for w in utt.split() if w.strip(punct) in POSITIVE_WORDS]
-    pos_count = len(pos_words)
-    if pos_count > 0:
-        parts.append(
-            f"The literal lexical content is positive ({pos_count} positive word"
-            f"{'s' if pos_count > 1 else ''}: {', '.join(pos_words)[:120]})."
-        )
-    elif "?" in utt:
-        parts.append("The line is phrased as a question.")
-    else:
-        parts.append("The lexical content is neutral or descriptive.")
-    # 2) Prosody observation (only if features are reliable)
-    if voiced_ratio < 0.1:
-        parts.append(
-            "Prosody is unreliable for this clip (low voiced-frame ratio). "
-            "Lexical and contextual cues should dominate."
-        )
-        prosody_evidence = "weak"
-    else:
-        if pitch_var > 45:
-            parts.append(
-                f"Pitch variability is HIGH ({pitch_var:.0f} Hz over voiced frames), "
-                "suggesting exaggerated melodic delivery."
-            )
-            prosody_evidence = "exaggerated"
-        elif pitch_var < 25:
-            parts.append(
-                f"Pitch variability is LOW ({pitch_var:.0f} Hz), suggesting "
-                "flat or minimally inflected delivery."
-            )
-            prosody_evidence = "flat"
-        else:
-            parts.append(
-                f"Pitch variability is moderate ({pitch_var:.0f} Hz), neither "
-                "flat nor exaggerated."
-            )
-            prosody_evidence = "moderate"
-        if pre_pause >= 250:
-            parts.append(
-                f"There is a {pre_pause}ms pre-utterance pause — speakers often "
-                "use such setup pauses for ironic or emphatic delivery."
-            )
-        if n_internal >= 1:
-            parts.append(
-                f"There {'is' if n_internal == 1 else 'are'} {n_internal} internal pause"
-                f"{'' if n_internal == 1 else 's'} >150ms within the utterance."
-            )
-    # 3) Conclusion grounded in the evidence
-    if gold == "sarcastic":
-        if pos_count > 0 and prosody_evidence == "exaggerated":
-            parts.append(
-                "Positive lexical content combined with exaggerated melodic "
-                "delivery is the signature pattern of sarcastic delivery — "
-                "the words say one thing, the prosody says the opposite."
-            )
-        elif prosody_evidence == "exaggerated":
-            parts.append(
-                "Exaggerated prosodic shape on otherwise non-emphatic content "
-                "is consistent with mock or ironic delivery."
-            )
-        else:
-            parts.append(
-                "Subtle cues taken together (delivery, emphasis pause, "
-                "context) suggest the speaker is being ironic rather than literal."
-            )
-    else:
-        if prosody_evidence == "flat":
-            parts.append(
-                "Flat prosodic delivery on neutral or genuine content "
-                "indicates the speaker means what they say."
-            )
-        elif pos_count > 0 and prosody_evidence != "exaggerated":
-            parts.append(
-                "Positive lexical content paired with non-exaggerated delivery "
-                "indicates sincere expression."
-            )
-        else:
-            parts.append(
-                "Lacking strong markers of irony, the speaker appears to be "
-                "expressing genuine intent."
-            )
-    think_text = " ".join(parts)
-    final_json = json.dumps({"label": gold, "confidence": 0.85}, separators=(",", ":"))
-    return f"<think>\n{think_text}\n</think>\n<final>{final_json}</final>"
-def build_sft_dataset(scenarios, n_rows: int = 100, seed: int = 0):
-    """Pick n_rows clips, build (prompt, ideal_completion) pairs."""
-    from datasets import Dataset
-    rng = random.Random(seed)
-    # Balance classes in the SFT set
-    sarc_ids = [k for k, v in scenarios.items() if v["sarcasm"]]
-    sinc_ids = [k for k, v in scenarios.items() if not v["sarcasm"]]
-    rng.shuffle(sarc_ids); rng.shuffle(sinc_ids)
-    chosen = sarc_ids[: n_rows // 2] + sinc_ids[: n_rows - n_rows // 2]
-    rng.shuffle(chosen)
-    rows = []
-    for cid in chosen:
-        sc = scenarios[cid]
-        user_text = build_full_observation(cid, scenarios)
-        ideal = synth_ideal_completion(sc)
-        # Use the chat-completion format — Qwen2.5-Instruct expects this
-        rows.append({
-            "messages": [
-                {"role": "system", "content": SYSTEM_PROMPT},
-                {"role": "user", "content": user_text},
-                {"role": "assistant", "content": ideal},
-            ],
-            "clip_id": cid,
-            "gold": "sarcastic" if sc["sarcasm"] else "sincere",
-        })
-    return Dataset.from_list(rows)
-def sample_before_after(model, tokenizer, scenarios, sample_clip_ids, label_for_log: str):
-    """Generate completions on a few held-out clips for visual inspection."""
-    import torch as _t
-    print(f"\n----- Sample completions ({label_for_log}) -----", flush=True)
-    model.eval()
-    # Disable gradient checkpointing during inference (it is incompatible with use_cache=True)
-    if hasattr(model, "gradient_checkpointing_disable"):
-        try:
-            model.gradient_checkpointing_disable()
-        except Exception:
-            pass
-    for cid in sample_clip_ids:
-        sc = scenarios[cid]
-        gold = "sarcastic" if sc["sarcasm"] else "sincere"
-        messages = [
-            {"role": "system", "content": SYSTEM_PROMPT},
-            {"role": "user", "content": build_full_observation(cid, scenarios)},
-        ]
-        # apply_chat_template can return either a tensor (older transformers)
-        # or a BatchEncoding (newer transformers). Handle both.
-        encoded = tokenizer.apply_chat_template(
-            messages, return_tensors="pt", add_generation_prompt=True,
-        )
-        if hasattr(encoded, "input_ids"):
-            input_ids = encoded.input_ids.to(model.device)
-        else:
-            input_ids = encoded.to(model.device)
-        prompt_len = input_ids.shape[1]
-        with _t.no_grad():
-            out = model.generate(
-                input_ids=input_ids,
-                max_new_tokens=400,
-                do_sample=False,             # greedy for stable inspection
-                pad_token_id=tokenizer.eos_token_id,
-                use_cache=True,              # explicit; prepare_for_kbit can leave it off
-            )
-        text = tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)
-        print(f"\nClip {cid} (gold={gold}, speaker={sc.get('speaker')}):", flush=True)
-        print(text[:1200], flush=True)
-        print("---", flush=True)
-    # Re-enable gradient checkpointing for further training
-    if hasattr(model, "gradient_checkpointing_enable"):
-        try:
-            model.gradient_checkpointing_enable()
-        except Exception:
-            pass
-    model.train()
-# ---------------------------------------------------------------------------
-# Main
-# ---------------------------------------------------------------------------
-def main():
-    import argparse
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--model", default="Qwen/Qwen2.5-3B-Instruct")
-    parser.add_argument("--n-rows", type=int, default=100)
-    parser.add_argument("--epochs", type=int, default=2)
-    parser.add_argument("--lora-r", type=int, default=8)
-    parser.add_argument("--learning-rate", type=float, default=5e-5)
-    parser.add_argument("--output-dir", default="/tmp/sft_warmup")
-    parser.add_argument("--push-to-hub", default=None,
-                        help="If set, e.g. 'aamrinder/subtext-arena-sft', push the LoRA there")
-    parser.add_argument("--n-sample-clips", type=int, default=3,
-                        help="How many clips to generate before/after samples on")
-    args = parser.parse_args()
-    print(f"[load-scenarios]")
-    scenarios = load_scenarios()
-    print(f"   {len(scenarios)} clips")
-    print(f"[build-sft-dataset] n_rows={args.n_rows}")
-    ds = build_sft_dataset(scenarios, n_rows=args.n_rows)
-    print(f"   {len(ds)} (prompt, ideal-completion) pairs")
-    print(f"   first ideal completion preview:")
-    first_msgs = ds[0]["messages"]
-    print("     " + first_msgs[-1]["content"].replace("\n", "\n     ")[:400])
-    # Pick held-out clips for before/after sampling
-    # Build the set of training clip IDs once instead of once per scenario
-    train_ids = set(ds["clip_id"])
-    sample_ids = [k for k in scenarios if k not in train_ids][: args.n_sample_clips]
-    # Load model
-    print(f"\n[load-model] {args.model}, 4-bit + LoRA")
-    import torch as _t
-    from transformers import (
-        AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
-    )
-    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
-    from trl import SFTTrainer, SFTConfig
-    bnb = BitsAndBytesConfig(
-        load_in_4bit=True,
-        bnb_4bit_compute_dtype=_t.bfloat16,
-        bnb_4bit_quant_type="nf4",
-        bnb_4bit_use_double_quant=True,
-    )
-    tokenizer = AutoTokenizer.from_pretrained(args.model)
-    if tokenizer.pad_token is None:
-        tokenizer.pad_token = tokenizer.eos_token
-    base = AutoModelForCausalLM.from_pretrained(
-        args.model, quantization_config=bnb, dtype=_t.bfloat16, device_map="auto",
-    )
-    base = prepare_model_for_kbit_training(base, use_gradient_checkpointing=True)
-    peft_config = LoraConfig(
-        r=args.lora_r, lora_alpha=args.lora_r, lora_dropout=0.0, bias="none",
-        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
-        task_type="CAUSAL_LM",
-    )
-    model = get_peft_model(base, peft_config)
-    # Sample BEFORE training
-    sample_before_after(model, tokenizer, scenarios, sample_ids, "BEFORE SFT")
-    # SFT training
-    print(f"\n[sft-train] {args.epochs} epoch(s), lr={args.learning_rate}")
-    sft_kwargs = dict(
-        output_dir=args.output_dir,
-        num_train_epochs=args.epochs,
-        per_device_train_batch_size=2,
-        gradient_accumulation_steps=4,
-        learning_rate=args.learning_rate,
-        warmup_ratio=0.1,                    # smooth LR ramp-up to avoid early instability
-        lr_scheduler_type="cosine",
-        bf16=True,
-        gradient_checkpointing=True,
-        logging_steps=1,
-        save_strategy="no",
-        report_to="none",
-        max_length=4096,
-    )
-    # assistant_only_loss is a TRL>=0.12 flag — only train on assistant tokens
-    # so we don't waste capacity learning to copy the system/user prompt.
-    try:
-        config = SFTConfig(**sft_kwargs, assistant_only_loss=True)
-        print("[sft-train] assistant_only_loss=True (TRL >= 0.12)")
-    except TypeError:
-        config = SFTConfig(**sft_kwargs)
-        print("[sft-train] assistant_only_loss not available in this TRL — training on full sequence")
-    trainer = SFTTrainer(
-        model=model,
-        args=config,
-        train_dataset=ds,
-        processing_class=tokenizer,
-    )
-    trainer.train()
-    trainer.save_model(args.output_dir)
-    print(f"\n[done] LoRA adapter saved to {args.output_dir}")
-    # Sample AFTER training
-    sample_before_after(model, tokenizer, scenarios, sample_ids, "AFTER SFT")
-    # Optional: push to HF Hub
-    if args.push_to_hub:
-        from huggingface_hub import HfApi
-        api = HfApi()
-        # Create the repo first (idempotent)
-        try:
-            api.create_repo(repo_id=args.push_to_hub, repo_type="model", exist_ok=True)
-        except Exception as e:
-            print(f"[warn] create_repo: {e}")
-        # Upload the LoRA adapter directory
-        api.upload_folder(
-            folder_path=args.output_dir,
-            repo_id=args.push_to_hub,
-            repo_type="model",
-            commit_message=f"SFT warmup checkpoint ({args.n_rows} examples, {args.epochs} epoch)",
-        )
-        print(f"[done] LoRA adapter pushed to https://huggingface.co/{args.push_to_hub}")
-if __name__ == "__main__":
-    main()
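
Reviewer note: the SFT targets above end in `<final>{...}</final>` with a compact JSON payload, and downstream scripts rely on a `parse_final` helper imported from `train_grpo`, which is not part of this diff. The sketch below is a hypothetical stand-in (`parse_final_sketch`) showing one way such a completion could be parsed. The real helper's signature and failure behaviour may differ.

```python
import json
import re

# Non-greedy match so only the first <final> block is captured.
FINAL_RE = re.compile(r"<final>(.*?)</final>", re.DOTALL)
VALID_LABELS = ("sarcastic", "sincere")


def parse_final_sketch(text):
    """Hypothetical parser: return (label, confidence), or (None, 0.0) on any failure."""
    match = FINAL_RE.search(text)
    if match is None:
        return None, 0.0
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None, 0.0
    if not isinstance(payload, dict) or payload.get("label") not in VALID_LABELS:
        return None, 0.0
    try:
        confidence = float(payload.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    return payload["label"], confidence


# Completion shaped like the output of synth_ideal_completion (text is illustrative).
sample = (
    "<think>\nPitch variability is HIGH.\n</think>\n"
    '<final>{"label":"sarcastic","confidence":0.85}</final>'
)
parsed = parse_final_sketch(sample)
print(parsed)  # ('sarcastic', 0.85)
```

Returning a sentinel instead of raising keeps a malformed completion from aborting a whole evaluation loop; whether the actual grader treats such a completion as a format failure is not shown in this diff.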

train/side_by_side.py DELETED

@@ -1,247 +0,0 @@
-"""Generate side-by-side baseline-vs-trained reasoning for hand-picked clips.
-This is the demo artifact: judges look at it and read what the model learned.
-Output: an HTML table that can be embedded in the README + a JSON dump.
-For each clip:
-  - Run the BASE Qwen2.5-3B (no LoRA) and dump <think> + <final>
-  - Run the TRAINED checkpoint and dump <think> + <final>
-  - Show gold label, both predictions, and which (if either) was right
-Usage:
-    python -m subtext_arena.train.side_by_side \\
-        --trained-checkpoint ./checkpoints/run1 \\
-        --clip-ids 1_70 2_190 1_8826 2_236 2_300 \\
-        --out docs/plots/side_by_side.html
-"""
-from __future__ import annotations
-import argparse
-import json
-import sys
-from pathlib import Path
-from typing import List
-try:
-    from subtext_arena.server.scenarios import load_scenarios
-    from subtext_arena.train.train_grpo import (
-        SYSTEM_PROMPT, build_full_observation, parse_final, reward_decomposition,
-    )
-except ImportError:
-    ROOT = Path(__file__).resolve().parent.parent
-    if str(ROOT) not in sys.path:
-        sys.path.insert(0, str(ROOT))
-    from server.scenarios import load_scenarios  # type: ignore[no-redef]
-    from train.train_grpo import (  # type: ignore[no-redef]
-        SYSTEM_PROMPT, build_full_observation, parse_final, reward_decomposition,
-    )
-HTML_TEMPLATE = """<!DOCTYPE html>
-<html><head>
-<meta charset="utf-8">
-<title>Subtext Arena — baseline vs trained, hand-picked clips</title>
-<style>
-  body {{ font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
-         max-width: 1200px; margin: 40px auto; padding: 0 20px; color: #222; }}
-  h1 {{ font-size: 24px; }}
-  .clip {{ border: 1px solid #ddd; border-radius: 8px; padding: 16px;
-           margin-bottom: 24px; background: #fafafa; }}
-  .clip h2 {{ font-size: 18px; margin-top: 0; }}
-  .gold-sarcastic {{ color: #b3274d; }}
-  .gold-sincere   {{ color: #1d7a4a; }}
-  .columns {{ display: grid; grid-template-columns: 1fr 1fr; gap: 16px; }}
-  .col {{ padding: 12px; border-radius: 6px; }}
-  .baseline {{ background: #fff5f5; border: 1px solid #f8c4c4; }}
-  .trained  {{ background: #effaf3; border: 1px solid #b6e2c1; }}
-  .col h3 {{ margin-top: 0; font-size: 14px; text-transform: uppercase;
-             letter-spacing: 0.05em; color: #666; }}
-  .verdict-correct {{ color: #1d7a4a; font-weight: bold; }}
-  .verdict-wrong   {{ color: #b3274d; font-weight: bold; }}
-  pre {{ white-space: pre-wrap; word-wrap: break-word; font-size: 13px;
-         line-height: 1.4; background: white; padding: 8px; border-radius: 4px;
-         border: 1px solid #eee; }}
-  .transcript {{ font-style: italic; color: #555; margin-bottom: 12px; }}
-</style>
-</head><body>
-<h1>Subtext Arena — baseline vs trained</h1>
-<p>Same prompt fed to the base Qwen2.5-3B-Instruct (left) and to the GRPO-trained
-checkpoint (right). Each shows the model's reasoning trace and final answer.</p>
-{clips_html}
-</body></html>
-"""
-CLIP_BLOCK = """<div class="clip">
-  <h2>Clip {clip_id} — speaker: {speaker}, gold: <span class="gold-{gold}">{gold}</span></h2>
-  <div class="transcript">"{utterance}"</div>
-  <div class="columns">
-    <div class="col baseline">
-      <h3>Baseline (no training)</h3>
-      <p>predicted: <span class="verdict-{baseline_verdict}">{baseline_label}</span> (conf {baseline_conf:.2f})</p>
-      <pre>{baseline_text}</pre>
-    </div>
-    <div class="col trained">
-      <h3>Trained checkpoint</h3>
-      <p>predicted: <span class="verdict-{trained_verdict}">{trained_label}</span> (conf {trained_conf:.2f})</p>
-      <pre>{trained_text}</pre>
-    </div>
-  </div>
-</div>
-"""
-def generate_completion(model, tokenizer, prompt_user_msg, max_tokens=600):
-    """Greedy decode for deterministic side-by-side. Matches held-out eval settings
-    so the side-by-side numbers are consistent with the headline accuracy."""
-    messages = [
-        {"role": "system", "content": SYSTEM_PROMPT},
-        {"role": "user", "content": prompt_user_msg},
-    ]
-    encoded = tokenizer.apply_chat_template(
-        messages, return_tensors="pt", add_generation_prompt=True,
-    )
-    input_ids = encoded.input_ids if hasattr(encoded, "input_ids") else encoded
-    input_ids = input_ids.to(model.device)
-    prompt_len = input_ids.shape[1]
-    import torch as _t
-    with _t.no_grad():
-        out = model.generate(
-            input_ids=input_ids, max_new_tokens=max_tokens,
-            do_sample=False,           # greedy — matches held-out eval
-            pad_token_id=tokenizer.eos_token_id,
-            use_cache=True,
-        )
-    return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)
-def main():
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--trained-checkpoint", required=True,
-                        help="Local path or HF Hub repo id (e.g. 'aamrinder/subtext-arena-grpo')")
-    parser.add_argument("--base-model", default="Qwen/Qwen2.5-3B-Instruct")
-    parser.add_argument("--clip-ids", nargs="+", required=True,
-                        help="Hand-picked clip IDs for the side-by-side")
-    parser.add_argument("--out", required=True, help="Output HTML path")
-    parser.add_argument("--out-json", default=None, help="Optional JSON dump")
-    parser.add_argument("--push-to-space", default=None,
-                        help="If set, e.g. 'aamrinder/subtext-arena', upload outputs to that Space's docs/ dir")
-    args = parser.parse_args()
-    scenarios = load_scenarios()
-    import torch as _t
-    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
-    from peft import PeftModel
-    rows = []
-    bnb = BitsAndBytesConfig(
-        load_in_4bit=True,
-        bnb_4bit_compute_dtype=_t.bfloat16,
-        bnb_4bit_quant_type="nf4",
-        bnb_4bit_use_double_quant=True,
-    )
-    # Run baseline (vanilla Qwen, no LoRA)
-    print(f"[load] base model: {args.base_model}")
-    base_tok = AutoTokenizer.from_pretrained(args.base_model)
-    if base_tok.pad_token is None:
-        base_tok.pad_token = base_tok.eos_token
-    base_model = AutoModelForCausalLM.from_pretrained(
-        args.base_model, quantization_config=bnb, dtype=_t.bfloat16, device_map="auto",
-    )
-    base_model.eval()
-    for clip_id in args.clip_ids:
-        gold = "sarcastic" if scenarios[clip_id]["sarcasm"] else "sincere"
-        prompt_user = build_full_observation(clip_id, scenarios)
-        text = generate_completion(base_model, base_tok, prompt_user)
-        d = reward_decomposition(text, gold)
-        rows.append({
-            "clip_id": clip_id,
-            "speaker": scenarios[clip_id]["speaker"],
-            "utterance": scenarios[clip_id]["utterance"],
-            "gold": gold,
-            "baseline": {
-                "label": d["_predicted"], "confidence": d["_confidence"],
-                "correct": d["_correct"], "text": text[:1200],
-            },
-        })
-        print(f"  baseline {clip_id}: pred={d['_predicted']} (conf={d['_confidence']:.2f}) correct={d['_correct']}")
-    # Apply LoRA on top of the same base — keeps memory low.
-    print(f"[load] LoRA adapter: {args.trained_checkpoint}")
-    trained_model = PeftModel.from_pretrained(base_model, args.trained_checkpoint)
-    trained_model.eval()
-    trained_tok = base_tok
-    for row in rows:
-        clip_id = row["clip_id"]
-        prompt_user = build_full_observation(clip_id, scenarios)
-        text = generate_completion(trained_model, trained_tok, prompt_user)
-        d = reward_decomposition(text, row["gold"])
-        row["trained"] = {
-            "label": d["_predicted"], "confidence": d["_confidence"],
-            "correct": d["_correct"], "text": text[:1200],
-        }
-        print(f"  trained  {clip_id}: pred={d['_predicted']} (conf={d['_confidence']:.2f}) correct={d['_correct']}")
-    # Render HTML
-    clips_html_parts = []
-    for row in rows:
-        b = row["baseline"]
-        t = row["trained"]
-        clips_html_parts.append(CLIP_BLOCK.format(
-            clip_id=row["clip_id"], speaker=row["speaker"],
-            utterance=row["utterance"].replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace('"', "&quot;"),
-            gold=row["gold"],
-            baseline_label=b["label"] or "—",
-            baseline_conf=b["confidence"],
-            baseline_verdict=("correct" if b["correct"] else "wrong"),
-            baseline_text=(b["text"] or "(no output)").replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;"),
-            trained_label=t["label"] or "—",
-            trained_conf=t["confidence"],
-            trained_verdict=("correct" if t["correct"] else "wrong"),
-            trained_text=(t["text"] or "(no output)").replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;"),
-        ))
-    html = HTML_TEMPLATE.format(clips_html="\n".join(clips_html_parts))
-    Path(args.out).parent.mkdir(parents=True, exist_ok=True)
-    Path(args.out).write_text(html, encoding="utf-8")
-    print(f"[done] wrote {args.out}")
-    if args.out_json:
-        Path(args.out_json).write_text(json.dumps(rows, indent=2))
-        print(f"[done] wrote {args.out_json}")
-    # Tally
-    n_baseline_correct = sum(1 for r in rows if r["baseline"]["correct"])
-    n_trained_correct = sum(1 for r in rows if r["trained"]["correct"])
-    print()
-    print(f"Tally on {len(rows)} hand-picked clips:")
-    print(f"  baseline: {n_baseline_correct}/{len(rows)} correct")
-    print(f"  trained:  {n_trained_correct}/{len(rows)} correct")
-    # Push to HF Space so artifacts survive the ephemeral container
-    if args.push_to_space:
-        try:
-            from huggingface_hub import HfApi
-            api = HfApi()
-            for local, remote in [
-                (args.out, "docs/side_by_side.html"),
-                (args.out_json, "docs/side_by_side.json") if args.out_json else (None, None),
-            ]:
-                if local and Path(local).exists():
-                    api.upload_file(
-                        path_or_fileobj=local,
-                        path_in_repo=remote,
-                        repo_id=args.push_to_space,
-                        repo_type="space",
-                        commit_message="add side-by-side baseline-vs-trained demo",
-                    )
-                    print(f"[done] {remote} pushed to {args.push_to_space}")
-        except Exception as e:
-            # Best-effort upload: log the failure and let the run finish.
-            print(f"[error] push_to_space failed: {e}")
-    print("\n[main] side-by-side run finished cleanly.")
-    sys.exit(0)
-if __name__ == "__main__":
-    main()
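
The rendering loop above escapes model output for HTML by chaining `str.replace` calls. A minimal sketch of the same step using the standard library's `html.escape` is shown below. The helper name `escape_for_html` and its `placeholder` parameter are hypothetical and do not appear in the original script.

```python
import html


def escape_for_html(text, placeholder="(no output)"):
    """Return text that is safe to embed in HTML content or a quoted attribute.

    Hypothetical helper for illustration; not part of the original script.
    """
    # Fall back to the placeholder when the model produced nothing.
    value = text or placeholder
    # html.escape converts &, <, > and, with quote=True, both quote characters.
    return html.escape(value, quote=True)


if __name__ == "__main__":
    print(escape_for_html('He said "great" & left <b>early</b>'))
    print(escape_for_html(""))
```

Escaping `&` before the angle brackets matters when the replacements are chained by hand, since otherwise the ampersands introduced by `&lt;` and `&gt;` would be escaped a second time. `html.escape` handles that ordering internally.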