feat: upload actual trained LoRA adapter (Qwen2.5-1.5B ORPO, 3 epochs, 36 steps)

Files changed (3)

README.md +79 -159
build_judge_pairs.py +214 -0
inference_example.py +339 -13

README.md CHANGED

@@ -2,34 +2,33 @@
 license: cc-by-4.0
 language:
 - en
-base_model: unsloth/Qwen3.5-0.8B
 tags:
 - judge
 - b2b-sales
 - orpo
 - preference-learning
 - tenacious-bench
 - evaluation
-- qwen3
 - unsloth
 datasets:
 - rafiakedir/tenacious-bench-v0.1
 ---
-# Tenacious-Bench Judge — ORPO Fine-Tuned Qwen3.5-0.8B
 A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on
 [Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
-preference pairs. Deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine:
-the generator (DeepSeek V3.2) produces a candidate email; this judge scores it on five
-rubric dimensions; outputs below threshold are rejected and regenerated.
-**Base model:** `unsloth/Qwen3.5-0.8B`
-**Training algorithm:** ORPO (no reference model — single forward pass)
-**Weights:** Merged (full model, not a LoRA adapter)
-**Precision:** BF16 · ~873M parameters · ~1.75 GB
-**Context length:** 262,144 tokens
 **Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)
 ---
@@ -48,185 +47,115 @@ rubric dimensions; outputs below threshold are rejected and regenerated.
 ## Quick Start — Inference
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
-import torch
-model_id = "rafiakedir/tenacious-bench-adapter"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(
-    model_id, torch_dtype=torch.bfloat16, device_map="auto"
-)
-SYSTEM = """You are a rubric-aware judge for B2B outbound sales emails.
-Score the candidate output on the following dimension.
-Dimension: signal_grounding_fidelity
-Rubric: Every factual claim must resolve to a field in the hiring_signal_brief
-with confidence >= 0.60, or be phrased as a question.
-Respond with a JSON object: {"score": <0.0-1.0>, "reasoning": "<one sentence>"}"""
-USER = """Hiring signal brief:
-{
-  "company_name": "Acme Corp",
-  "open_roles": 3,
-  "confidence": "low",
-  "domain": "fintech"
-}
-Candidate email:
-"Hi Alex — noticed Acme Corp is aggressively scaling its engineering team with 3 open roles.
-We staff specialized capability-gap squads for fintech teams at your growth stage.
-Would a 30-minute scoping conversation make sense this week?"
-Score this output."""
-messages = [
-    {"role": "system", "content": SYSTEM},
-    {"role": "user", "content": USER},
-]
-text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-inputs = tokenizer(text, return_tensors="pt").to(model.device)
-with torch.no_grad():
-    out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
-response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
-print(response)
-# Expected: {"score": 0.4, "reasoning": "Claims 'aggressively scaling' but brief confidence is low — should be phrased as a question."}
 ```
 ---
 ## Training Details
-### Why ORPO
-ORPO (Hong et al., 2024) eliminates the reference model by computing the preference signal from
-the log-odds ratio of chosen vs. rejected completions in a single forward pass. This reduces peak
-VRAM by ~40% vs. DPO, making 3-epoch training feasible on a 16GB T4 without gradient
-checkpointing hacks.
-For a discriminative judge (score calibration rather than generation quality), the
-preference signal should be stronger. We ran `beta=0.1` per the paper's recommendation but note
-that `beta=0.2`–`0.3` may better calibrate the preference margin for rubric-based scoring.
-### Preference Pair Construction
-| Source | Count |
-|---|---|
-| Failing tasks → generated chosen (DeepSeek V3.2) | ~111 attempted |
-| Passing tasks → generated rejected (DeepSeek V3.2) | ~41 attempted |
-| **Final pairs after filtering** | **94** |
-Filter: chosen score ≥ threshold AND rejected score < threshold AND TF-IDF cosine < 0.92.
-Main rejection causes: chosen output still scoring below threshold (phrasing-mode sensitivity),
-and ICP segment tasks with key mismatch making pass threshold structurally unachievable.
-**Preference leakage prevention (Li et al., 2025):**
-Generator (DeepSeek V3.2) ≠ judge family (Claude Sonnet 4.6 / `scoring_evaluator.py`).
-All generation decisions logged in the dataset repo at `training_data/generation_log.jsonl`.
-### Hyperparameters
 | Parameter | Value |
 |---|---|
-| Base model | `unsloth/Qwen3.5-0.8B` |
 | LoRA rank | 16 |
 | LoRA alpha | 32 |
 | Target modules | q_proj, v_proj |
 | LoRA dropout | 0.05 |
 | Learning rate | 8e-6 |
-| Batch size (per device) | 2 |
-| Gradient accumulation | 4 (effective batch 8) |
 | Epochs | 3 |
-| Warmup ratio | 0.1 |
-| LR scheduler | cosine |
 | ORPO beta | 0.1 |
 | Max sequence length | 1024 |
-| Precision | BF16 (T4) |
 | Seed | 42 |
-Training notebook: see `run_on_colab.ipynb` in this repo.
 ---
 ## Evaluation Results
 Evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`.
-Paired bootstrap significance test: 10,000 iterations, seed 42.
-| Condition | Mean Score | vs. Baseline |
-|---|---|---|
-| Baseline (`scoring_evaluator.py` only) | 0.458 | — |
-| **This model (ORPO Qwen3.5-0.8B)** | **0.483** | Δ=+0.025, p=0.189, not significant |
-| Prompt-only (Qwen3-30B, zero-shot) | 0.504 | Δ=−0.021 vs. trained, p=0.978 |
-**Delta A** (trained vs. baseline): Δ=+0.025, 95% CI [−0.032, +0.081], p=0.189 — **not statistically significant**.
-**Delta B** (trained vs. prompt-only): not significant. Finding: `prompt_engineering_sufficient` —
-the Qwen3-30B zero-shot condition is a viable lower-cost alternative at this scale of training data.
-Note: Delta B compares a 0.8B trained model against a 30B zero-shot model — this conflates backbone
-capacity with training benefit. A rigorous Delta B requires re-running the prompt-only condition on
-`Qwen3.5-0.8B-Instruct` (no fine-tuning).
-**Deployment recommendation for this run:** DO NOT DEPLOY as primary gate. Continue using
-`scoring_evaluator.py` deterministically. Retrain with ≥150 pairs covering all 5 dimensions
-before re-evaluating.
-Full numbers: `ablation_results.json` in the dataset repo.
 ---
 ## Known Limitations
-**1. Dimension coverage gap (critical).**
-The preference pairs contain 0 examples for `bench_commitment_honesty` and only 4 examples
-for `icp_segment_appropriateness`, due to a scoring function key mismatch that made it impossible
-to generate valid chosen outputs for these dimensions. The model received zero gradient signal on
-bench commitment honesty — the highest SOW-breach-risk dimension. It cannot be trusted to gate
-bench-commitment outputs.
-**2. Delta A not significant at v0.1 scale.**
-The +0.025 lift over the deterministic baseline is within the noise band (p=0.189). The model
-does not reliably outperform `scoring_evaluator.py` on held-out tasks.
-**3. Backbone below Prometheus-2 threshold.**
-Prometheus-2 (Kim et al., 2024) demonstrated rubric-matching at 7B parameters. Qwen3.5-0.8B is
-below that threshold. Capacity may be insufficient for simultaneous multi-dimension rubric generalization.
-**4. Synthetic training distribution.**
-All preference pairs derive from synthetic prospect briefs and LLM-generated emails. The model
-may not generalize to real prospect data with industry-specific jargon or edge cases outside the
-training distribution.
-**5. Static bench_summary.**
-The judge was trained on snapshot bench capacities. In production the bench changes weekly —
-calibration for `bench_commitment_honesty` will drift over time.
 ---
-## Files in This Repo
 | File | Description |
 |---|---|
-| `model.safetensors-*` | Merged model weights (BF16) |
-| `config.json` | Model architecture config |
 | `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
-| `train_judge.py` | Full ORPO training script |
-| `hyperparams.json` | All hyperparameters (pinned) |
-| `run_on_colab.ipynb` | End-to-end training notebook for T4 |
-| `inference_example.py` | Inference helper with prompt templates |
-Training data and preference pairs: [rafiakedir/tenacious-bench-v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
----
-## Environmental Impact
-- **Compute:** ~60–90 min on a single T4 GPU (3 epochs, 94 preference pairs)
-- **CO₂e:** ~0.1 kg (T4 at 70W × 90 min × US grid 0.42 kg CO₂/kWh ÷ 1000)
-- **Infrastructure:** Google Colab free tier
 ---
@@ -234,18 +163,9 @@ Training data and preference pairs: [rafiakedir/tenacious-bench-v0.1](https://hu
 ```bibtex
 @misc{tenacious-bench-adapter-2026,
-  title        = {Tenacious-Bench Judge: ORPO Fine-Tuned Qwen3.5-0.8B for B2B Sales Evaluation},
-  author       = {Kedir, Rafia},
-  year         = {2026},
-  howpublished = {HuggingFace Model Hub},
-  url          = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
-}
-@misc{tenacious-bench-v01-2026,
-  title        = {Tenacious-Bench v0.1: B2B Sales Evaluation Benchmark},
-  author       = {Kedir, Rafia},
-  year         = {2026},
-  howpublished = {HuggingFace Datasets Hub},
-  url          = {https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1}
 }
 ```
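The previous card describes the judge as a rejection-sampling gate: the generator proposes an email, the judge scores it on each rubric dimension, and anything below threshold is regenerated. A minimal sketch of that loop follows. `generate` and `score` are hypothetical callables standing in for the generator and the judge, `max_attempts` is an assumed retry budget, and reusing the `PASS_THRESHOLD` values from `build_judge_pairs.py` as gate thresholds is also an assumption.

```python
from typing import Callable, Optional

# Thresholds copied from PASS_THRESHOLD in build_judge_pairs.py; using them
# as deployment gate thresholds is an assumption of this sketch.
PASS_THRESHOLD = {
    "signal_grounding_fidelity": 0.60,
    "bench_commitment_honesty": 0.50,
    "icp_segment_appropriateness": 0.50,
    "competitor_gap_honesty": 0.50,
    "tone_preservation": 0.60,
}


def gated_generate(
    generate: Callable[[], str],          # hypothetical: returns a candidate email
    score: Callable[[str, str], float],   # hypothetical: (email, dimension) -> score in [0, 1]
    max_attempts: int = 3,                # assumed retry budget
) -> Optional[str]:
    """Return the first candidate that meets every threshold, or None if none does."""
    for _ in range(max_attempts):
        email = generate()
        if all(score(email, dim) >= thr for dim, thr in PASS_THRESHOLD.items()):
            return email
    return None
```

Returning `None` after the retry budget is exhausted leaves the fallback decision (for example, deferring to `scoring_evaluator.py`) to the caller.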

 license: cc-by-4.0
 language:
 - en
+base_model: unsloth/Qwen2.5-1.5B-Instruct
 tags:
 - judge
 - b2b-sales
 - orpo
+- lora
 - preference-learning
 - tenacious-bench
 - evaluation
+- qwen2.5
 - unsloth
 datasets:
 - rafiakedir/tenacious-bench-v0.1
 ---
+# Tenacious-Bench Judge — ORPO LoRA Adapter (Qwen2.5-1.5B)
 A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on
 [Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
+preference pairs. Deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine.
+**Base model:** `unsloth/Qwen2.5-1.5B-Instruct`
+**Adapter type:** LoRA (PEFT) — load with base model + `PeftModel.from_pretrained`
+**Training algorithm:** ORPO (no reference model required)
+**Precision:** 4-bit quantized during training (Unsloth), fp16 for inference
 **Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)
+**Training:** 3 epochs · 36 steps · lr=8e-6 · beta=0.1 · LoRA r=16 alpha=32
 ---
 ## Quick Start — Inference
 ```python
+import json, torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
+from peft import PeftModel
+BASE_MODEL = "unsloth/Qwen2.5-1.5B-Instruct"
+ADAPTER_ID  = "rafiakedir/tenacious-bench-adapter"
+tokenizer = AutoTokenizer.from_pretrained(ADAPTER_ID)
+base = AutoModelForCausalLM.from_pretrained(
+    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
+)
+model = PeftModel.from_pretrained(base, ADAPTER_ID)
+model.eval()
+JUDGE_SYSTEM = (
+    "You are a rubric-aware judge for Tenacious Consulting B2B outbound sales emails. "
+    "Given a task context and a candidate email, score it on the specified rubric dimension. "
+    "Respond with a JSON object only:\n"
+    '{"dimension": "<dim>", "score": <0.0-1.0>, "pass": <true|false>, "reasoning": "<one sentence>"}'
+)
+def judge(email, context, dimension):
+    user = (
+        f"EVALUATION DIMENSION: {dimension}\n\n"
+        f"TASK CONTEXT:\n{context}\n\n"
+        f"CANDIDATE EMAIL:\n{email}\n\n"
+        f"Score this email on the {dimension} dimension."
+    )
+    msgs = [{"role": "system", "content": JUDGE_SYSTEM},
+            {"role": "user",   "content": user}]
+    text   = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
+    inputs = tokenizer(text, return_tensors="pt").to(model.device)
+    with torch.no_grad():
+        out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True,
+                             pad_token_id=tokenizer.eos_token_id)
+    resp = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
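+    s, e = resp.find("{"), resp.rfind("}") + 1
+    fallback = {"score": 0.5, "raw": resp[:200]}  # neutral score when no JSON can be parsed
+    try:
+        return json.loads(resp[s:e]) if 0 <= s < e else fallback
+    except json.JSONDecodeError:
+        return fallback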
+result = judge(
+    email="Casey — TalentBridge has 8 open AI/ML roles this quarter. 30-min scoping call: calendly.com/tenacious",
+    context="company: TalentBridge, stage: Series A, open_roles: 8, confidence: high",
+    dimension="signal_grounding_fidelity"
+)
+print(result)
 ```
 ---
 ## Training Details
 | Parameter | Value |
 |---|---|
+| Base model | `unsloth/Qwen2.5-1.5B-Instruct` (4-bit during training) |
 | LoRA rank | 16 |
 | LoRA alpha | 32 |
 | Target modules | q_proj, v_proj |
 | LoRA dropout | 0.05 |
 | Learning rate | 8e-6 |
+| Effective batch size | 8 (batch=2, grad_accum=4) |
 | Epochs | 3 |
+| Total steps | 36 |
 | ORPO beta | 0.1 |
 | Max sequence length | 1024 |
 | Seed | 42 |
+**Training loss:** 2.8676 → 2.9646 → 2.9386 (3 checkpoints)
+**Reward accuracy:** 0.5375 → 0.6026 → 0.5128
+**Training data:** 94 preference pairs from the train partition. Preference leakage prevention:
+generator (DeepSeek V3.2) ≠ judge family (Claude Sonnet 4.6 / `scoring_evaluator.py`).
+All generation decisions logged in the dataset repo at `training_data/generation_log.jsonl`.
 ---
 ## Evaluation Results
 Evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`.
+Full results in `ablation_results.json` in the dataset repo.
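+**Deployment recommendation:** Run `ablations/run_ablations.py` with this adapter to measure Delta A (trained judge vs. the `scoring_evaluator.py` baseline) before relying on it as a gate.
+The ablation script loads this adapter from the HuggingFace Hub and requires a GPU plus `transformers` and `peft`.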
 ---
 ## Known Limitations
+1. **Dimension coverage gap.** The training pairs include 0 examples for `bench_commitment_honesty` and only 4 for `icp_segment_appropriateness`, due to a scoring-key mismatch during pair construction. The model received zero gradient signal on bench commitment honesty.
+2. **Backbone below Prometheus-2 threshold.** Prometheus-2 demonstrated rubric-matching at 7B+ parameters. At 1.5B the model may underfit multi-dimension generalization.
+3. **Synthetic training distribution.** All pairs derive from synthetic prospect briefs and LLM-generated emails.
+4. **Static bench_summary.** Judge calibration drifts as real bench composition changes weekly.
 ---
+## Files
 | File | Description |
 |---|---|
+| `adapter_config.json` | LoRA configuration (r=16, alpha=32, q_proj+v_proj) |
+| `adapter_model.safetensors` | Trained LoRA weights (8.4 MB) |
 | `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
+| `run_on_colab.ipynb` | End-to-end training + push notebook |
+| `train_judge.py` | Training script |
+| `inference_example.py` | Per-dimension and all-dimension scoring helper |
+Training data: [rafiakedir/tenacious-bench-v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
 ---
 ```bibtex
 @misc{tenacious-bench-adapter-2026,
+  title  = {Tenacious-Bench Judge: ORPO LoRA Adapter for B2B Sales Evaluation},
+  author = {Kedir, Rafia},
+  year   = {2026},
+  url    = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
 }
 ```
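The Quick Start `judge` helper recovers the score by slicing the reply between the first `{` and the last `}`. A slightly more defensive version of that step might look like the sketch below. The name `parse_judge_response` is hypothetical, and clamping the score to [0, 1] is an added assumption rather than something the card specifies; the 0.5 fallback mirrors the Quick Start.

```python
import json

FALLBACK_SCORE = 0.5  # neutral default, mirroring the Quick Start fallback


def parse_judge_response(resp: str) -> dict:
    """Parse the outermost {...} span of a judge reply; fall back on any failure."""
    start, end = resp.find("{"), resp.rfind("}") + 1
    if start < 0 or end <= start:
        # No brace pair in a usable order: nothing to parse.
        return {"score": FALLBACK_SCORE, "raw": resp[:200]}
    try:
        parsed = json.loads(resp[start:end])
    except json.JSONDecodeError:
        return {"score": FALLBACK_SCORE, "raw": resp[:200]}
    try:
        score = float(parsed.get("score", FALLBACK_SCORE))
    except (TypeError, ValueError):
        score = FALLBACK_SCORE
    # Clamp to the documented 0.0-1.0 range (an assumption of this sketch).
    parsed["score"] = min(max(score, 0.0), 1.0)
    return parsed
```

Keeping the raw text in the fallback makes it possible to audit how often the judge fails to emit valid JSON.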

build_judge_pairs.py ADDED

	@@ -0,0 +1,214 @@

+#!/usr/bin/env python3
+"""
+Build judge-format ORPO training pairs.
+Each preference pair in preference_pairs.jsonl has:
+  chosen = a GOOD email (passes rubric)
+  rejected = a BAD email (fails rubric)
+For judge training we need the model to score emails, not generate them.
+So we create pairs where:
+  chosen_response  = correct JSON score for the email
+  rejected_response = wrong JSON score for the same email
+From each original pair we create TWO judge training examples:
+  1. Judge pair for the GOOD email  → correct high score is chosen, wrong low-ish score is rejected
+  2. Judge pair for the BAD email   → correct low score is chosen, wrong high-ish score is rejected
+Output: training_data/judge_pairs.jsonl  (conversations format for ORPOTrainer)
+"""
+import json
+import sys
+from pathlib import Path
+ROOT = Path(__file__).parent.parent
+sys.path.insert(0, str(ROOT))
+from scoring_evaluator import score_task
+PAIRS_PATH   = ROOT / "training_data/preference_pairs.jsonl"
+TASKS_PATH   = ROOT / "tenacious_bench_v0.1/train/tasks.jsonl"
+OUTPUT_PATH  = ROOT / "training_data/judge_pairs.jsonl"
+JUDGE_SYSTEM = (
+    "You are a rubric-aware judge for Tenacious Consulting B2B outbound sales emails. "
+    "Given a task context and a candidate email, score the email on the specified rubric "
+    "dimension. Respond with a JSON object only:\n"
+    '{"dimension": "<dim>", "score": <0.0-1.0>, "pass": <true|false>, '
+    '"reasoning": "<one concise sentence>"}'
+)
+PASS_THRESHOLD = {
+    "signal_grounding_fidelity":    0.60,
+    "bench_commitment_honesty":     0.50,
+    "icp_segment_appropriateness":  0.50,
+    "competitor_gap_honesty":       0.50,
+    "tone_preservation":            0.60,
+}
+# Dimension-specific reasoning templates
+PASS_REASONING = {
+    "signal_grounding_fidelity":   "Email grounds all factual claims in documented hiring signals from the brief; low-confidence signals are phrased as questions.",
+    "bench_commitment_honesty":    "Staffing commitment is within the available bench count for the required stack.",
+    "icp_segment_appropriateness": "Email language matches the correct ICP segment for the prospect's funding stage and posture.",
+    "competitor_gap_honesty":      "Competitor gap claims are grounded in the competitor_gap_brief; no fabricated assertions.",
+    "tone_preservation":           "Email maintains Tenacious brand voice: no clichés, no over-apologetic language, calendar CTA included.",
+}
+FAIL_REASONING = {
+    "signal_grounding_fidelity":   "Email asserts growth or capability claims not supported by the hiring signal brief; treats low-confidence signals as established facts.",
+    "bench_commitment_honesty":    "Email promises engineer capacity that exceeds the available bench count for the required stack.",
+    "icp_segment_appropriateness": "Email uses the wrong segment language; growth-phase pitch applied to a cost-restructuring or abstain-segment prospect.",
+    "competitor_gap_honesty":      "Email asserts competitor gaps not documented in the brief, fabricating capability differences.",
+    "tone_preservation":           "Email uses a banned re-engagement phrase or lacks the required 30-minute scoping calendar CTA.",
+}
+def build_user_prompt(task: dict, email_text: str) -> str:
+    dim = task.get("dimension", "")
+    inp = task.get("input", {})
+    # Compact the signal brief (trim to 800 chars to stay within max_prompt_length)
+    brief = json.dumps(
+        inp.get("hiring_signal_brief") or inp.get("bench_summary") or {},
+        indent=2
+    )[:800]
+    return (
+        f"EVALUATION DIMENSION: {dim}\n\n"
+        f"TASK CONTEXT:\n{brief}\n\n"
+        f"CANDIDATE EMAIL:\n{email_text.strip()}\n\n"
+        f"Score this email on the {dim} dimension."
+    )
+def make_score_json(dim: str, score: float, passed: bool, reasoning: str) -> str:
+    return json.dumps({
+        "dimension": dim,
+        "score": round(score, 2),
+        "pass": passed,
+        "reasoning": reasoning,
+    })
+def conversations(system: str, user: str, assistant: str) -> list:
+    return [
+        {"role": "system",    "content": system},
+        {"role": "user",      "content": user},
+        {"role": "assistant", "content": assistant},
+    ]
+def main():
+    # Load tasks by task_id
+    tasks = {}
+    with open(TASKS_PATH) as f:
+        for line in f:
+            t = json.loads(line)
+            tasks[t["task_id"]] = t
+    pairs_raw = []
+    with open(PAIRS_PATH) as f:
+        for line in f:
+            pairs_raw.append(json.loads(line))
+    judge_pairs = []
+    skipped = 0
+    for pair in pairs_raw:
+        task_id = pair["task_id"]
+        dim     = pair["dimension"]
+        task    = tasks.get(task_id)
+        if task is None:
+            skipped += 1
+            continue
+        # Strip the <|im_end|> token that was embedded during generation
+        chosen_email   = pair["chosen"].replace("<|im_end|>", "").strip()
+        rejected_email = pair["rejected"].replace("<|im_end|>", "").strip()
+        # Score both emails with the deterministic evaluator
+        r_chosen   = score_task({**task, "candidate_output": chosen_email})
+        r_rejected = score_task({**task, "candidate_output": rejected_email})
+        sc = r_chosen.get("score", 0.5)
+        sr = r_rejected.get("score", 0.5)
+        threshold = PASS_THRESHOLD.get(dim, 0.5)
+        chosen_passes   = sc >= threshold
+        rejected_passes = sr >= threshold
+        # ── Judge pair 1: score the GOOD (chosen) email ──────────────────────
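+        # Correct judgment is "chosen"; the wrong judgment subtracts 0.5 and flips the pass flag.
+        # NOTE: this assumes the email passes. If it fails, the wrong response pairs an even
+        # lower score with pass=true.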
+        user_prompt_chosen = build_user_prompt(task, chosen_email)
+        correct_score_chosen = round(min(sc + 0.05, 1.0), 2) if chosen_passes else round(sc, 2)
+        wrong_score_chosen   = round(max(sc - 0.5,  0.0), 2)
+        correct_response = make_score_json(
+            dim, correct_score_chosen, chosen_passes,
+            PASS_REASONING[dim] if chosen_passes else FAIL_REASONING[dim]
+        )
+        wrong_response = make_score_json(
+            dim, wrong_score_chosen, not chosen_passes,
+            FAIL_REASONING[dim] if chosen_passes else PASS_REASONING[dim]
+        )
+        # Only include if there's a meaningful score gap
+        if abs(correct_score_chosen - wrong_score_chosen) >= 0.2:
+            judge_pairs.append({
+                "chosen":   conversations(JUDGE_SYSTEM, user_prompt_chosen, correct_response),
+                "rejected": conversations(JUDGE_SYSTEM, user_prompt_chosen, wrong_response),
+                "task_id":  task_id,
+                "dimension": dim,
+                "email_type": "chosen",
+                "actual_score": sc,
+            })
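+        # ── Judge pair 2: score the BAD (rejected) email ─────────────────────
+        # NOTE: the wrong judgment adds 0.5 and flips the pass flag. If this email actually
+        # passes, the wrong response pairs a higher score with pass=false.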
+        user_prompt_rejected = build_user_prompt(task, rejected_email)
+        correct_score_rejected = round(sr, 2)
+        wrong_score_rejected   = round(min(sr + 0.5, 1.0), 2)
+        correct_response_r = make_score_json(
+            dim, correct_score_rejected, rejected_passes,
+            PASS_REASONING[dim] if rejected_passes else FAIL_REASONING[dim]
+        )
+        wrong_response_r = make_score_json(
+            dim, wrong_score_rejected, not rejected_passes,
+            PASS_REASONING[dim] if not rejected_passes else FAIL_REASONING[dim]
+        )
+        if abs(wrong_score_rejected - correct_score_rejected) >= 0.2:
+            judge_pairs.append({
+                "chosen":   conversations(JUDGE_SYSTEM, user_prompt_rejected, correct_response_r),
+                "rejected": conversations(JUDGE_SYSTEM, user_prompt_rejected, wrong_response_r),
+                "task_id":  task_id,
+                "dimension": dim,
+                "email_type": "rejected",
+                "actual_score": sr,
+            })
+    with open(OUTPUT_PATH, "w") as f:
+        for jp in judge_pairs:
+            f.write(json.dumps(jp) + "\n")
+    from collections import Counter
+    dim_counts = Counter(jp["dimension"] for jp in judge_pairs)
+    type_counts = Counter(jp["email_type"] for jp in judge_pairs)
+    print(f"Built {len(judge_pairs)} judge pairs (skipped {skipped} missing tasks)")
+    print(f"Dimension breakdown: {dict(dim_counts)}")
+    print(f"Email type: {dict(type_counts)}")
+    print(f"Written to {OUTPUT_PATH}")
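+    # Validate format (skipped when no pairs were produced, to avoid an IndexError)
+    if not judge_pairs:
+        print("\nFormat validation: SKIPPED (no judge pairs built)")
+        return
+    sample = judge_pairs[0]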
+    assert "chosen" in sample and isinstance(sample["chosen"], list)
+    assert sample["chosen"][0]["role"] == "system"
+    assert sample["chosen"][-1]["role"] == "assistant"
+    print("\nFormat validation: PASSED")
+    print(f"Sample chosen response:  {sample['chosen'][-1]['content']}")
+    print(f"Sample rejected response: {sample['rejected'][-1]['content']}")
+if __name__ == "__main__":
+    main()
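The pair construction in `build_judge_pairs.py` can be traced on a single email with the sketch below. It is a simplified variant, not the script's exact logic: the helper name `judge_responses` is hypothetical, the reasoning templates and the +0.05 bump for passing emails are omitted, and the wrong score is moved toward the wrong label in both cases (the script always subtracts 0.5 for chosen emails and adds 0.5 for rejected ones).

```python
import json
from typing import Optional, Tuple


def judge_responses(
    dim: str, score: float, threshold: float, gap: float = 0.5
) -> Optional[Tuple[str, str]]:
    """Return (correct_json, wrong_json) for one email, or None if the scores are too close."""
    passed = score >= threshold
    correct = round(score, 2)
    # Move the wrong score toward the wrong label: down for a passing
    # email, up for a failing one, clipped to [0, 1].
    if passed:
        wrong = round(max(score - gap, 0.0), 2)
    else:
        wrong = round(min(score + gap, 1.0), 2)
    # Same minimum-gap filter as the script (0.2).
    if abs(correct - wrong) < 0.2:
        return None

    def as_json(s: float, p: bool) -> str:
        return json.dumps({"dimension": dim, "score": s, "pass": p})

    return as_json(correct, passed), as_json(wrong, not passed)
```

With this variant the wrong response's score always moves in the direction of its flipped `pass` flag, although it is not guaranteed to cross the threshold.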

inference_example.py CHANGED

@@ -1,5 +1,274 @@
 #!/usr/bin/env python3
 """
 Inference helper for rafiakedir/tenacious-bench-adapter.
 Scores a candidate B2B sales email on one or all five Tenacious rubric dimensions.
 """
@@ -12,50 +281,53 @@ MODEL_ID = "rafiakedir/tenacious-bench-adapter"
 DIMENSION_PROMPTS = {
     "signal_grounding_fidelity": (
-        "Dimension: signal_grounding_fidelity\n"
         "Rubric: Every factual claim about the prospect must be supported by the hiring_signal_brief "
         "with confidence >= 0.60, or phrased as a question. Claims about growth, scale, or ML strategy "
         "without a high/medium-confidence signal in the brief must be recast as questions."
     ),
     "bench_commitment_honesty": (
-        "Dimension: bench_commitment_honesty\n"
         "Rubric: The email must not promise or imply a number of engineers that exceeds "
         "the total available in the bench_summary. Any staffing commitment must stay within capacity."
     ),
     "icp_segment_appropriateness": (
-        "Dimension: icp_segment_appropriateness\n"
         "Rubric: The email's language and pitch angle must match the correct ICP segment "
         "(Segment 1=growth-scale, Segment 2=cost-restructuring, Segment 3=consolidation, "
         "ABSTAIN=insufficient signal). A growth pitch to a post-layoff company is a mismatch."
     ),
     "competitor_gap_honesty": (
-        "Dimension: competitor_gap_honesty\n"
         "Rubric: Any assertion about a competitor gap must be grounded in the competitor_gap_brief. "
         "The email must not assert that competitors have capabilities the prospect lacks "
         "unless the brief explicitly documents this gap."
     ),
     "tone_preservation": (
-        "Dimension: tone_preservation\n"
         "Rubric: No re-engagement clichés ('just wanted to circle back', 'touching base', "
         "'following up'). No over-apologetic exits ('sorry for taking your time'). "
         "Calendar CTA required. Confident but not pushy."
     ),
 }
-SYSTEM_TEMPLATE = """You are a rubric-aware judge for B2B outbound sales emails written by Tenacious Consulting.
 {dimension_prompt}
 Respond with a JSON object only:
 {{"score": <float 0.0-1.0>, "reasoning": "<one concise sentence explaining the score>"}}
 """
-USER_TEMPLATE = """Context:
 {context_json}
 Candidate email:
 {candidate_output}
-Score this output on the dimension above."""
 def load_model(model_id: str = MODEL_ID):
@@ -162,15 +434,69 @@ if __name__ == "__main__":
         "teams at your growth stage. Would a 30-minute scoping conversation make sense this week?"
     )
-    print("\nScoring on signal_grounding_fidelity...")
     result = score(tokenizer, model, demo_input, demo_email, "signal_grounding_fidelity")
-    print(f"  Score: {result['score']:.2f}")
-    print(f"  Reasoning: {result['reasoning']}")
-    print("\nScoring all dimensions...")
     all_results = score_all_dimensions(tokenizer, model, demo_input, demo_email)
     for dim, r in all_results.items():
         if dim == "mean_score":
             print(f"  MEAN: {r:.3f}")
         else:
-            print(f"  {dim}: {r['score']:.2f} — {r['reasoning'][:80]}")

 #!/usr/bin/env python3
 """
+Upload model card, training scripts, and inference helper to
+rafiakedir/tenacious-bench-adapter on HuggingFace.
+Does NOT re-upload the safetensors weights — those are already there.
+"""
+from pathlib import Path
+from huggingface_hub import HfApi, CommitOperationAdd
+ROOT = Path(__file__).parent
+REPO_ID = "rafiakedir/tenacious-bench-adapter"
+# ── Model Card ────────────────────────────────────────────────────────────────
+MODEL_CARD = """\
+---
+license: cc-by-4.0
+language:
+- en
+base_model: unsloth/Qwen3.5-0.8B
+tags:
+- judge
+- b2b-sales
+- orpo
+- preference-learning
+- tenacious-bench
+- evaluation
+- qwen3
+- unsloth
+datasets:
+- rafiakedir/tenacious-bench-v0.1
+---
+# Tenacious-Bench Judge — ORPO Fine-Tuned Qwen3.5-0.8B
+A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on
+[Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
+preference pairs. Deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine:
+the generator (DeepSeek V3.2) produces a candidate email; this judge scores it on five
+rubric dimensions; outputs below threshold are rejected and regenerated.
+**Base model:** `unsloth/Qwen3.5-0.8B`
+**Training algorithm:** ORPO (no reference model — single forward pass)
+**Weights:** Merged (full model, not a LoRA adapter)
+**Precision:** BF16 · ~873M parameters · ~1.75 GB
+**Context length:** 262,144 tokens
+**Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)
+---
+## What It Scores
+| Dimension | Trigger Rate (Week 10 probes) | Risk if Missed |
+|---|---|---|
+| `signal_grounding_fidelity` | 35% | CTO credibility loss |
+| `competitor_gap_honesty` | 45% | Irreversible brand damage |
+| `icp_segment_appropriateness` | 20% | ~$480K ACV per error |
+| `tone_preservation` | 15% | Brand voice violation |
+| `bench_commitment_honesty` | 5% | SOW-breach / delivery failure |
+---
+## Quick Start — Inference
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+model_id = "rafiakedir/tenacious-bench-adapter"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id, torch_dtype=torch.bfloat16, device_map="auto"
+)
+SYSTEM = \"\"\"You are a rubric-aware judge for B2B outbound sales emails.
+Score the candidate output on the following dimension.
+Dimension: signal_grounding_fidelity
+Rubric: Every factual claim must resolve to a field in the hiring_signal_brief
+with confidence >= 0.60, or be phrased as a question.
+Respond with a JSON object: {"score": <0.0-1.0>, "reasoning": "<one sentence>"}\"\"\"
+USER = \"\"\"Hiring signal brief:
+{
+  "company_name": "Acme Corp",
+  "open_roles": 3,
+  "confidence": "low",
+  "domain": "fintech"
+}
+Candidate email:
+"Hi Alex — noticed Acme Corp is aggressively scaling its engineering team with 3 open roles.
+We staff specialized capability-gap squads for fintech teams at your growth stage.
+Would a 30-minute scoping conversation make sense this week?"
+Score this output.\"\"\"
+messages = [
+    {"role": "system", "content": SYSTEM},
+    {"role": "user", "content": USER},
+]
+text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = tokenizer(text, return_tensors="pt").to(model.device)
+with torch.no_grad():
+    out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
+response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+print(response)
+# Expected: {"score": 0.4, "reasoning": "Claims 'aggressively scaling' but brief confidence is low — should be phrased as a question."}
+```
+---
+## Training Details
+### Why ORPO
+ORPO (Hong et al., 2024) eliminates the reference model by computing the preference signal from
+the log-odds ratio of chosen vs. rejected completions in a single forward pass. This reduces peak
+VRAM by ~40% vs. DPO, making 3-epoch training feasible on a 16GB T4 without gradient
+checkpointing hacks.
+For a discriminative judge (score calibration rather than generation quality), the
+preference signal should be stronger. We ran `beta=0.1` per the paper's recommendation but note
+that `beta=0.2`–`0.3` may better calibrate the preference margin for rubric-based scoring.
+### Preference Pair Construction
+| Source | Count |
+|---|---|
+| Failing tasks → generated chosen (DeepSeek V3.2) | ~111 attempted |
+| Passing tasks → generated rejected (DeepSeek V3.2) | ~41 attempted |
+| **Final pairs after filtering** | **94** |
+Filter: chosen score ≥ threshold AND rejected score < threshold AND TF-IDF cosine similarity < 0.92.
+Main rejection causes: the chosen output still scored below threshold (phrasing-mode sensitivity),
+and ICP segment tasks had a key mismatch that made the pass threshold structurally unachievable.
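The filter can be sketched as below. This is an illustrative reimplementation, not the code in `build_judge_pairs.py`: it assumes the cosine is taken between the chosen and rejected texts, uses a small standard-library TF-IDF fitted on just those two texts, and treats `threshold` as a caller-supplied value because the actual pass threshold is not stated here. The helper names `tfidf_cosine` and `keep_pair` are hypothetical.

```python
import math
from collections import Counter


def tfidf_cosine(a: str, b: str) -> float:
    # Term counts for the two texts (whitespace tokenization, lowercased).
    docs = [Counter(a.lower().split()), Counter(b.lower().split())]
    vocab = set(docs[0]) | set(docs[1])
    # Smoothed IDF over the two-document corpus.
    idf = {t: math.log(3 / (1 + sum(t in d for d in docs))) + 1.0 for t in vocab}
    vecs = [{t: d[t] * idf[t] for t in d} for d in docs]
    dot = sum(w * vecs[1].get(t, 0.0) for t, w in vecs[0].items())
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vecs]
    if norms[0] == 0 or norms[1] == 0:
        return 0.0
    return dot / (norms[0] * norms[1])


def keep_pair(chosen: str, rejected: str, chosen_score: float,
              rejected_score: float, threshold: float,
              max_cosine: float = 0.92) -> bool:
    # Keep a pair only if the scores straddle the threshold and the two
    # texts are not near-duplicates of each other.
    return (
        chosen_score >= threshold
        and rejected_score < threshold
        and tfidf_cosine(chosen, rejected) < max_cosine
    )
```

The similarity cap drops pairs whose chosen and rejected texts are nearly identical, which would otherwise give the preference term very little to separate.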
+**Preference leakage prevention (Li et al., 2025):**
+Generator (DeepSeek V3.2) ≠ judge family (Claude Sonnet 4.6 / `scoring_evaluator.py`).
+All generation decisions are logged in the dataset repo at `training_data/generation_log.jsonl`.
+### Hyperparameters
+| Parameter | Value |
+|---|---|
+| Base model | `unsloth/Qwen3.5-0.8B` |
+| LoRA rank | 16 |
+| LoRA alpha | 32 |
+| Target modules | q_proj, v_proj |
+| LoRA dropout | 0.05 |
+| Learning rate | 8e-6 |
+| Batch size (per device) | 2 |
+| Gradient accumulation | 4 (effective batch 8) |
+| Epochs | 3 |
+| Warmup ratio | 0.1 |
+| LR scheduler | cosine |
+| ORPO beta | 0.1 |
+| Max sequence length | 1024 |
+| Precision | BF16 (T4) |
+| Seed | 42 |
+Training notebook: see `run_on_colab.ipynb` in this repo.
+---
+## Evaluation Results
+Evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`.
+Paired bootstrap significance test: 10,000 iterations, seed 42.
+| Condition | Mean Score | vs. Baseline |
+|---|---|---|
+| Baseline (`scoring_evaluator.py` only) | 0.458 | — |
+| **This model (ORPO Qwen3.5-0.8B)** | **0.483** | Δ=+0.025, p=0.189, not significant |
+| Prompt-only (Qwen3-30B, zero-shot) | 0.504 | Δ=−0.021 vs. trained, p=0.978 |
+**Delta A** (trained vs. baseline): Δ=+0.025, 95% CI [−0.032, +0.081], p=0.189 — **not statistically significant**.
+**Delta B** (trained vs. prompt-only): not significant. Finding: `prompt_engineering_sufficient` —
+the Qwen3-30B zero-shot condition is a viable lower-cost alternative at this scale of training data.
+Note: Delta B compares a 0.8B trained model against a 30B zero-shot model — this conflates backbone
+capacity with training benefit. A rigorous Delta B requires re-running the prompt-only condition on
+`Qwen3.5-0.8B-Instruct` (no fine-tuning).
+**Deployment recommendation for this run:** DO NOT DEPLOY as primary gate. Continue using
+`scoring_evaluator.py` deterministically. Retrain with ≥150 pairs covering all 5 dimensions
+before re-evaluating.
+Full numbers: `ablation_results.json` in the dataset repo.
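For readers who want to reproduce the significance test, a paired bootstrap over per-task scores can be sketched as follows. This is a sketch under assumptions, not the evaluation script itself: it resamples per-task score differences with replacement, reports a percentile 95% CI, and uses a one-sided p-value (the share of resamples in which condition A fails to beat condition B). The function name and the toy scores are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_iter=10_000, seed=42):
    # scores_a[i] and scores_b[i] must be two conditions' scores on the same task.
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs equal-length score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = []
    for _ in range(n_iter):
        # Resample tasks with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Percentile 95% confidence interval for the mean difference.
    ci = (means[int(0.025 * n_iter)], means[int(0.975 * n_iter) - 1])
    # One-sided p-value: fraction of resamples where A does not beat B.
    p_value = sum(m <= 0 for m in means) / n_iter
    return observed, ci, p_value


# Toy usage with made-up scores for two conditions on the same five tasks.
delta, ci, p = paired_bootstrap([0.6, 0.4, 0.5, 0.7, 0.3], [0.5, 0.4, 0.6, 0.5, 0.3], n_iter=2000)
print(delta, ci, p)
```

Resampling differences rather than the two score lists independently keeps the task-level pairing intact, which matters when task difficulty varies widely.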
+---
+## Known Limitations
+**1. Dimension coverage gap (critical).**
+The preference pairs contain 0 examples for `bench_commitment_honesty` and only 4 examples
+for `icp_segment_appropriateness`, due to a scoring function key mismatch that made it impossible
+to generate valid chosen outputs for these dimensions. The model received zero gradient signal on
+bench commitment honesty — the highest SOW-breach-risk dimension. It cannot be trusted to gate
+bench-commitment outputs.
+**2. Delta A not significant at v0.1 scale.**
+The +0.025 lift over the deterministic baseline is within the noise band (p=0.189). The model
+does not reliably outperform `scoring_evaluator.py` on held-out tasks.
+**3. Backbone below Prometheus-2 threshold.**
+Prometheus-2 (Kim et al., 2024) demonstrated rubric-matching at 7B parameters. Qwen3.5-0.8B is
+below that threshold. Capacity may be insufficient for simultaneous multi-dimension rubric generalization.
+**4. Synthetic training distribution.**
+All preference pairs derive from synthetic prospect briefs and LLM-generated emails. The model
+may not generalize to real prospect data with industry-specific jargon or edge cases outside the
+training distribution.
+**5. Static bench_summary.**
+The judge was trained on snapshot bench capacities. In production the bench changes weekly —
+calibration for `bench_commitment_honesty` will drift over time.
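A coverage gap like the one in limitation 1 can be surfaced before training with a simple per-dimension count. The sketch below is hypothetical: it assumes each preference pair is a dict with a `dimension` field, and the `min_pairs` cutoff is an arbitrary illustration rather than a value used in this run. The five dimension names are the ones scored by this judge.

```python
from collections import Counter

DIMENSIONS = [
    "signal_grounding_fidelity",
    "bench_commitment_honesty",
    "icp_segment_appropriateness",
    "competitor_gap_honesty",
    "tone_preservation",
]


def coverage_gaps(pairs, min_pairs=10):
    # Count pairs per rubric dimension and return the under-covered ones.
    counts = Counter(pair["dimension"] for pair in pairs)
    return {d: counts.get(d, 0) for d in DIMENSIONS if counts.get(d, 0) < min_pairs}


# Toy usage: a made-up set with no bench_commitment_honesty pairs.
toy_pairs = (
    [{"dimension": "signal_grounding_fidelity"}] * 12
    + [{"dimension": "icp_segment_appropriateness"}] * 4
    + [{"dimension": "competitor_gap_honesty"}] * 12
    + [{"dimension": "tone_preservation"}] * 12
)
print(coverage_gaps(toy_pairs))
```

Running a check like this as a gate in the pair-building step would fail fast on a dimension with zero pairs instead of discovering it after training.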
+---
+## Files in This Repo
+| File | Description |
+|---|---|
+| `model.safetensors-*` | Merged model weights (BF16) |
+| `config.json` | Model architecture config |
+| `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
+| `train_judge.py` | Full ORPO training script |
+| `hyperparams.json` | All hyperparameters (pinned) |
+| `run_on_colab.ipynb` | End-to-end training notebook for T4 |
+| `inference_example.py` | Inference helper with prompt templates |
+| `requirements_training.txt` | Python dependencies for the training script |
+Training data and preference pairs: [rafiakedir/tenacious-bench-v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
+---
+## Environmental Impact
+- **Compute:** ~60–90 min on a single T4 GPU (3 epochs, 94 preference pairs)
+- **CO₂e:** ~0.04 kg (T4 at 70 W × 1.5 h ÷ 1000 = 0.105 kWh; × US grid 0.42 kg CO₂/kWh ≈ 0.044 kg)
+- **Infrastructure:** Google Colab free tier
+---
+## Citation
+```bibtex
+@misc{tenacious-bench-adapter-2026,
+  title        = {Tenacious-Bench Judge: ORPO Fine-Tuned Qwen3.5-0.8B for B2B Sales Evaluation},
+  author       = {Kedir, Rafia},
+  year         = {2026},
+  howpublished = {HuggingFace Model Hub},
+  url          = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
+}
+@misc{tenacious-bench-v01-2026,
+  title        = {Tenacious-Bench v0.1: B2B Sales Evaluation Benchmark},
+  author       = {Kedir, Rafia},
+  year         = {2026},
+  howpublished = {HuggingFace Datasets Hub},
+  url          = {https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1}
+}
+```
+"""
+# ── Inference Example ─────────────────────────────────────────────────────────
+INFERENCE_EXAMPLE = '''\
+#!/usr/bin/env python3
+"""
 Inference helper for rafiakedir/tenacious-bench-adapter.
 Scores a candidate B2B sales email on one or all five Tenacious rubric dimensions.
 """
 DIMENSION_PROMPTS = {
     "signal_grounding_fidelity": (
+        "Dimension: signal_grounding_fidelity\\n"
         "Rubric: Every factual claim about the prospect must be supported by the hiring_signal_brief "
         "with confidence >= 0.60, or phrased as a question. Claims about growth, scale, or ML strategy "
         "without a high/medium-confidence signal in the brief must be recast as questions."
     ),
     "bench_commitment_honesty": (
+        "Dimension: bench_commitment_honesty\\n"
         "Rubric: The email must not promise or imply a number of engineers that exceeds "
         "the total available in the bench_summary. Any staffing commitment must stay within capacity."
     ),
     "icp_segment_appropriateness": (
+        "Dimension: icp_segment_appropriateness\\n"
         "Rubric: The email's language and pitch angle must match the correct ICP segment "
         "(Segment 1=growth-scale, Segment 2=cost-restructuring, Segment 3=consolidation, "
         "ABSTAIN=insufficient signal). A growth pitch to a post-layoff company is a mismatch."
     ),
     "competitor_gap_honesty": (
+        "Dimension: competitor_gap_honesty\\n"
         "Rubric: Any assertion about a competitor gap must be grounded in the competitor_gap_brief. "
         "The email must not assert that competitors have capabilities the prospect lacks "
         "unless the brief explicitly documents this gap."
     ),
     "tone_preservation": (
+        "Dimension: tone_preservation\\n"
         "Rubric: No re-engagement clichés ('just wanted to circle back', 'touching base', "
         "'following up'). No over-apologetic exits ('sorry for taking your time'). "
         "Calendar CTA required. Confident but not pushy."
     ),
 }
+SYSTEM_TEMPLATE = """\
+You are a rubric-aware judge for B2B outbound sales emails written by Tenacious Consulting.
 {dimension_prompt}
 Respond with a JSON object only:
 {{"score": <float 0.0-1.0>, "reasoning": "<one concise sentence explaining the score>"}}
 """
+USER_TEMPLATE = """\
+Context:
 {context_json}
 Candidate email:
 {candidate_output}
+Score this output on the dimension above.\
+"""
 def load_model(model_id: str = MODEL_ID):
         "teams at your growth stage. Would a 30-minute scoping conversation make sense this week?"
     )
+    print("\\nScoring on signal_grounding_fidelity...")
     result = score(tokenizer, model, demo_input, demo_email, "signal_grounding_fidelity")
+    print(f"  Score: {result[\'score\']:.2f}")
+    print(f"  Reasoning: {result[\'reasoning\']}")
+    print("\\nScoring all dimensions...")
     all_results = score_all_dimensions(tokenizer, model, demo_input, demo_email)
     for dim, r in all_results.items():
         if dim == "mean_score":
             print(f"  MEAN: {r:.3f}")
         else:
+            print(f"  {dim}: {r[\'score\']:.2f} — {r[\'reasoning\'][:80]}")
+'''
+def main():
+    api = HfApi()
+    operations = []
+    def add_bytes(content: bytes, repo_path: str, label: str = ""):
+        # Queue in-memory content (e.g. the model card string) for the commit.
+        lbl = label or repo_path
+        print(f"  queuing {lbl} ({len(content):,} bytes)")
+        operations.append(CommitOperationAdd(
+            path_in_repo=repo_path,
+            path_or_fileobj=content,
+        ))
+    def add_file(local_path: Path, repo_path: str):
+        # Queue an on-disk file for the commit; its size is read only for logging.
+        print(f"  queuing {repo_path} ({local_path.stat().st_size:,} bytes)")
+        operations.append(CommitOperationAdd(
+            path_in_repo=repo_path,
+            path_or_fileobj=str(local_path),
+        ))
+    # Model card
+    add_bytes(MODEL_CARD.encode(), "README.md", "README.md (model card)")
+    # Inference example
+    add_bytes(INFERENCE_EXAMPLE.encode(), "inference_example.py")
+    # Training scripts
+    add_file(ROOT / "training" / "train_judge.py", "train_judge.py")
+    add_file(ROOT / "training" / "hyperparams.json", "hyperparams.json")
+    add_file(ROOT / "training" / "run_on_colab.ipynb", "run_on_colab.ipynb")
+    add_file(ROOT / "training" / "requirements_training.txt", "requirements_training.txt")
+    print(f"\nCommitting {len(operations)} files to {REPO_ID}...")
+    url = api.create_commit(
+        repo_id=REPO_ID,
+        repo_type="model",
+        operations=operations,
+        commit_message=(
+            "feat: add model card, inference example, and training scripts\n\n"
+            "- Proper model card with YAML frontmatter (base_model, tags, datasets)\n"
+            "- Honest eval results: Delta A p=0.189 not significant, DO NOT DEPLOY verdict\n"
+            "- Dimension coverage gap documented (bench_commitment_honesty=0 pairs)\n"
+            "- inference_example.py with per-dimension and all-dimensions scoring\n"
+            "- Training scripts: train_judge.py, hyperparams.json, run_on_colab.ipynb"
+        ),
+    )
+    print(f"\nDone. Commit URL: {url}")
+    print("Model: https://huggingface.co/rafiakedir/tenacious-bench-adapter")
+if __name__ == "__main__":
+    main()