File size: 18,724 Bytes
e1374e8
 
f02a80f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e1374e8
 
 
 
 
 
 
 
 
 
 
 
f02a80f
e1374e8
 
 
 
 
f02a80f
e1374e8
 
 
 
f02a80f
e1374e8
 
 
 
 
f02a80f
e1374e8
 
 
 
 
f02a80f
e1374e8
 
 
 
 
 
f02a80f
 
e1374e8
 
 
 
 
 
f02a80f
 
e1374e8
 
 
 
 
f02a80f
 
e1374e8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f02a80f
e1374e8
f02a80f
 
e1374e8
f02a80f
e1374e8
 
 
 
 
f02a80f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
#!/usr/bin/env python3
"""
Upload model card, training scripts, and inference helper to
rafiakedir/tenacious-bench-adapter on HuggingFace.
Does NOT re-upload the safetensors weights β€” those are already there.
"""
from pathlib import Path
from huggingface_hub import HfApi, CommitOperationAdd

ROOT = Path(__file__).parent
REPO_ID = "rafiakedir/tenacious-bench-adapter"

# ── Model Card ────────────────────────────────────────────────────────────────
MODEL_CARD = """\
---
license: cc-by-4.0
language:
- en
base_model: unsloth/Qwen3.5-0.8B
tags:
- judge
- b2b-sales
- orpo
- preference-learning
- tenacious-bench
- evaluation
- qwen3
- unsloth
datasets:
- rafiakedir/tenacious-bench-v0.1
---

# Tenacious-Bench Judge β€” ORPO Fine-Tuned Qwen3.5-0.8B

A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on
[Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
preference pairs. Deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine:
the generator (DeepSeek V3.2) produces a candidate email; this judge scores it on five
rubric dimensions; outputs below threshold are rejected and regenerated.

**Base model:** `unsloth/Qwen3.5-0.8B`
**Training algorithm:** ORPO (no reference model β€” single forward pass)
**Weights:** Merged (full model, not a LoRA adapter)
**Precision:** BF16 Β· ~873M parameters Β· ~1.75 GB
**Context length:** 262,144 tokens
**Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)

---

## What It Scores

| Dimension | Trigger Rate (Week 10 probes) | Risk if Missed |
|---|---|---|
| `signal_grounding_fidelity` | 35% | CTO credibility loss |
| `competitor_gap_honesty` | 45% | Irreversible brand damage |
| `icp_segment_appropriateness` | 20% | ~$480K ACV per error |
| `tone_preservation` | 15% | Brand voice violation |
| `bench_commitment_honesty` | 5% | SOW-breach / delivery failure |

---

## Quick Start β€” Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "rafiakedir/tenacious-bench-adapter"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

SYSTEM = \"\"\"You are a rubric-aware judge for B2B outbound sales emails.
Score the candidate output on the following dimension.

Dimension: signal_grounding_fidelity
Rubric: Every factual claim must resolve to a field in the hiring_signal_brief
with confidence >= 0.60, or be phrased as a question.

Respond with a JSON object: {"score": <0.0-1.0>, "reasoning": "<one sentence>"}\"\"\"

USER = \"\"\"Hiring signal brief:
{
  "company_name": "Acme Corp",
  "open_roles": 3,
  "confidence": "low",
  "domain": "fintech"
}

Candidate email:
"Hi Alex β€” noticed Acme Corp is aggressively scaling its engineering team with 3 open roles.
We staff specialized capability-gap squads for fintech teams at your growth stage.
Would a 30-minute scoping conversation make sense this week?"

Score this output.\"\"\"

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": USER},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)

response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# Expected: {"score": 0.4, "reasoning": "Claims 'aggressively scaling' but brief confidence is low β€” should be phrased as a question."}
```

---

## Training Details

### Why ORPO

ORPO (Hong et al., 2024) eliminates the reference model by computing the preference signal from
the log-odds ratio of chosen vs. rejected completions in a single forward pass. This reduces peak
VRAM by ~40% vs. DPO, making 3-epoch training feasible on a 16GB T4 without gradient
checkpointing hacks.

For a discriminative judge (score calibration rather than generation quality), the
preference signal should be stronger. We ran `beta=0.1` per the paper's recommendation but note
that `beta=0.2`–`0.3` may better calibrate the preference margin for rubric-based scoring.

### Preference Pair Construction

| Source | Count |
|---|---|
| Failing tasks β†’ generated chosen (DeepSeek V3.2) | ~111 attempted |
| Passing tasks β†’ generated rejected (DeepSeek V3.2) | ~41 attempted |
| **Final pairs after filtering** | **94** |

Filter: chosen score β‰₯ threshold AND rejected score < threshold AND TF-IDF cosine < 0.92.
Main rejection causes: chosen output still scoring below threshold (phrasing-mode sensitivity),
and ICP segment tasks with key mismatch making pass threshold structurally unachievable.

**Preference leakage prevention (Li et al., 2025):**
Generator (DeepSeek V3.2) β‰  judge family (Claude Sonnet 4.6 / `scoring_evaluator.py`).
All generation decisions logged in the dataset repo at `training_data/generation_log.jsonl`.

### Hyperparameters

| Parameter | Value |
|---|---|
| Base model | `unsloth/Qwen3.5-0.8B` |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Target modules | q_proj, v_proj |
| LoRA dropout | 0.05 |
| Learning rate | 8e-6 |
| Batch size (per device) | 2 |
| Gradient accumulation | 4 (effective batch 8) |
| Epochs | 3 |
| Warmup ratio | 0.1 |
| LR scheduler | cosine |
| ORPO beta | 0.1 |
| Max sequence length | 1024 |
| Precision | BF16 (T4) |
| Seed | 42 |

Training notebook: see `run_on_colab.ipynb` in this repo.

---

## Evaluation Results

Evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`.
Paired bootstrap significance test: 10,000 iterations, seed 42.

| Condition | Mean Score | vs. Baseline |
|---|---|---|
| Baseline (`scoring_evaluator.py` only) | 0.458 | β€” |
| **This model (ORPO Qwen3.5-0.8B)** | **0.483** | Ξ”=+0.025, p=0.189, not significant |
| Prompt-only (Qwen3-30B, zero-shot) | 0.504 | Ξ”=βˆ’0.021 vs. trained, p=0.978 |

**Delta A** (trained vs. baseline): Ξ”=+0.025, 95% CI [βˆ’0.032, +0.081], p=0.189 β€” **not statistically significant**.

**Delta B** (trained vs. prompt-only): not significant. Finding: `prompt_engineering_sufficient` β€”
the Qwen3-30B zero-shot condition is a viable lower-cost alternative at this scale of training data.
Note: Delta B compares a 0.8B trained model against a 30B zero-shot model β€” this conflates backbone
capacity with training benefit. A rigorous Delta B requires re-running the prompt-only condition on
`Qwen3.5-0.8B-Instruct` (no fine-tuning).

**Deployment recommendation for this run:** DO NOT DEPLOY as primary gate. Continue using
`scoring_evaluator.py` deterministically. Retrain with β‰₯150 pairs covering all 5 dimensions
before re-evaluating.

Full numbers: `ablation_results.json` in the dataset repo.

---

## Known Limitations

**1. Dimension coverage gap (critical).**
The preference pairs contain 0 examples for `bench_commitment_honesty` and only 4 examples
for `icp_segment_appropriateness`, due to a scoring function key mismatch that made it impossible
to generate valid chosen outputs for these dimensions. The model received zero gradient signal on
bench commitment honesty β€” the highest SOW-breach-risk dimension. It cannot be trusted to gate
bench-commitment outputs.

**2. Delta A not significant at v0.1 scale.**
The +0.025 lift over the deterministic baseline is within the noise band (p=0.189). The model
does not reliably outperform `scoring_evaluator.py` on held-out tasks.

**3. Backbone below Prometheus-2 threshold.**
Prometheus-2 (Kim et al., 2024) demonstrated rubric-matching at 7B parameters. Qwen3.5-0.8B is
below that threshold. Capacity may be insufficient for simultaneous multi-dimension rubric generalization.

**4. Synthetic training distribution.**
All preference pairs derive from synthetic prospect briefs and LLM-generated emails. The model
may not generalize to real prospect data with industry-specific jargon or edge cases outside the
training distribution.

**5. Static bench_summary.**
The judge was trained on snapshot bench capacities. In production the bench changes weekly β€”
calibration for `bench_commitment_honesty` will drift over time.

---

## Files in This Repo

| File | Description |
|---|---|
| `model.safetensors-*` | Merged model weights (BF16) |
| `config.json` | Model architecture config |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
| `train_judge.py` | Full ORPO training script |
| `hyperparams.json` | All hyperparameters (pinned) |
| `run_on_colab.ipynb` | End-to-end training notebook for T4 |
| `inference_example.py` | Inference helper with prompt templates |

Training data and preference pairs: [rafiakedir/tenacious-bench-v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)

---

## Environmental Impact

- **Compute:** ~60–90 min on a single T4 GPU (3 epochs, 94 preference pairs)
- **COβ‚‚e:** ~0.1 kg (T4 at 70W Γ— 90 min Γ— US grid 0.42 kg COβ‚‚/kWh Γ· 1000)
- **Infrastructure:** Google Colab free tier

---

## Citation

```bibtex
@misc{tenacious-bench-adapter-2026,
  title        = {Tenacious-Bench Judge: ORPO Fine-Tuned Qwen3.5-0.8B for B2B Sales Evaluation},
  author       = {Kedir, Rafia},
  year         = {2026},
  howpublished = {HuggingFace Model Hub},
  url          = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
}

@misc{tenacious-bench-v01-2026,
  title        = {Tenacious-Bench v0.1: B2B Sales Evaluation Benchmark},
  author       = {Kedir, Rafia},
  year         = {2026},
  howpublished = {HuggingFace Datasets Hub},
  url          = {https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1}
}
```
"""

# ── Inference Example ─────────────────────────────────────────────────────────
INFERENCE_EXAMPLE = '''\
#!/usr/bin/env python3
"""
Inference helper for rafiakedir/tenacious-bench-adapter.
Scores a candidate B2B sales email on one or all five Tenacious rubric dimensions.
"""

import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "rafiakedir/tenacious-bench-adapter"

DIMENSION_PROMPTS = {
    "signal_grounding_fidelity": (
        "Dimension: signal_grounding_fidelity\\n"
        "Rubric: Every factual claim about the prospect must be supported by the hiring_signal_brief "
        "with confidence >= 0.60, or phrased as a question. Claims about growth, scale, or ML strategy "
        "without a high/medium-confidence signal in the brief must be recast as questions."
    ),
    "bench_commitment_honesty": (
        "Dimension: bench_commitment_honesty\\n"
        "Rubric: The email must not promise or imply a number of engineers that exceeds "
        "the total available in the bench_summary. Any staffing commitment must stay within capacity."
    ),
    "icp_segment_appropriateness": (
        "Dimension: icp_segment_appropriateness\\n"
        "Rubric: The email's language and pitch angle must match the correct ICP segment "
        "(Segment 1=growth-scale, Segment 2=cost-restructuring, Segment 3=consolidation, "
        "ABSTAIN=insufficient signal). A growth pitch to a post-layoff company is a mismatch."
    ),
    "competitor_gap_honesty": (
        "Dimension: competitor_gap_honesty\\n"
        "Rubric: Any assertion about a competitor gap must be grounded in the competitor_gap_brief. "
        "The email must not assert that competitors have capabilities the prospect lacks "
        "unless the brief explicitly documents this gap."
    ),
    "tone_preservation": (
        "Dimension: tone_preservation\\n"
        "Rubric: No re-engagement clichΓ©s ('just wanted to circle back', 'touching base', "
        "'following up'). No over-apologetic exits ('sorry for taking your time'). "
        "Calendar CTA required. Confident but not pushy."
    ),
}

SYSTEM_TEMPLATE = """\
You are a rubric-aware judge for B2B outbound sales emails written by Tenacious Consulting.
{dimension_prompt}

Respond with a JSON object only:
{{"score": <float 0.0-1.0>, "reasoning": "<one concise sentence explaining the score>"}}
"""

USER_TEMPLATE = """\
Context:
{context_json}

Candidate email:
{candidate_output}

Score this output on the dimension above.\
"""


def load_model(model_id: str = MODEL_ID):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    model.eval()
    return tokenizer, model


def score(
    tokenizer,
    model,
    task_input: dict,
    candidate_output: str,
    dimension: str,
    max_new_tokens: int = 150,
) -> dict:
    """
    Score a single candidate output on one rubric dimension.

    Args:
        task_input: dict with keys like 'hiring_signal_brief', 'bench_summary', etc.
        candidate_output: the email text to score
        dimension: one of the five Tenacious rubric dimensions
    Returns:
        dict with 'score' (float) and 'reasoning' (str)
    """
    if dimension not in DIMENSION_PROMPTS:
        raise ValueError(f"Unknown dimension: {dimension}. Choose from {list(DIMENSION_PROMPTS)}")

    context_json = json.dumps(task_input, indent=2)
    system = SYSTEM_TEMPLATE.format(dimension_prompt=DIMENSION_PROMPTS[dimension])
    user = USER_TEMPLATE.format(
        context_json=context_json,
        candidate_output=candidate_output,
    )

    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    response = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()

    # Parse JSON from response
    try:
        # Find first { ... } block
        start = response.find("{")
        end = response.rfind("}") + 1
        result = json.loads(response[start:end])
        return {"score": float(result["score"]), "reasoning": result.get("reasoning", "")}
    except Exception:
        return {"score": 0.5, "reasoning": f"parse_error: {response[:200]}"}


def score_all_dimensions(tokenizer, model, task_input: dict, candidate_output: str) -> dict:
    """Score a candidate output on all five dimensions."""
    results = {}
    for dim in DIMENSION_PROMPTS:
        results[dim] = score(tokenizer, model, task_input, candidate_output, dim)
    results["mean_score"] = sum(r["score"] for r in results.values()) / len(results)
    return results


# ── Demo ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    print(f"Loading {MODEL_ID}...")
    tokenizer, model = load_model()

    demo_input = {
        "hiring_signal_brief": {
            "company_name": "Acme Corp",
            "domain": "fintech",
            "open_roles": 3,
            "confidence": "low",
            "stage": "Series B",
        },
        "bench_summary": {
            "total_available": 8,
            "specializations": ["Python", "Go", "ML Engineering"],
        },
    }

    demo_email = (
        "Hi Alex β€” noticed Acme Corp is aggressively scaling its engineering team "
        "with 3 open roles. We staff specialized capability-gap squads for fintech "
        "teams at your growth stage. Would a 30-minute scoping conversation make sense this week?"
    )

    print("\\nScoring on signal_grounding_fidelity...")
    result = score(tokenizer, model, demo_input, demo_email, "signal_grounding_fidelity")
    print(f"  Score: {result[\'score\']:.2f}")
    print(f"  Reasoning: {result[\'reasoning\']}")

    print("\\nScoring all dimensions...")
    all_results = score_all_dimensions(tokenizer, model, demo_input, demo_email)
    for dim, r in all_results.items():
        if dim == "mean_score":
            print(f"  MEAN: {r:.3f}")
        else:
            print(f"  {dim}: {r[\'score\']:.2f} β€” {r[\'reasoning\'][:80]}")
'''


def main():
    api = HfApi()
    operations = []

    def add_bytes(content: bytes, repo_path: str, label: str = ""):
        lbl = label or repo_path
        print(f"  queuing {lbl} ({len(content):,} bytes)")
        operations.append(CommitOperationAdd(
            path_in_repo=repo_path,
            path_or_fileobj=content,
        ))

    def add_file(local_path: Path, repo_path: str):
        print(f"  queuing {repo_path} ({local_path.stat().st_size:,} bytes)")
        operations.append(CommitOperationAdd(
            path_in_repo=repo_path,
            path_or_fileobj=str(local_path),
        ))

    # Model card
    add_bytes(MODEL_CARD.encode(), "README.md", "README.md (model card)")

    # Inference example
    add_bytes(INFERENCE_EXAMPLE.encode(), "inference_example.py")

    # Training scripts
    add_file(ROOT / "training" / "train_judge.py", "train_judge.py")
    add_file(ROOT / "training" / "hyperparams.json", "hyperparams.json")
    add_file(ROOT / "training" / "run_on_colab.ipynb", "run_on_colab.ipynb")
    add_file(ROOT / "training" / "requirements_training.txt", "requirements_training.txt")

    print(f"\nCommitting {len(operations)} files to {REPO_ID}...")
    url = api.create_commit(
        repo_id=REPO_ID,
        repo_type="model",
        operations=operations,
        commit_message=(
            "feat: add model card, inference example, and training scripts\n\n"
            "- Proper model card with YAML frontmatter (base_model, tags, datasets)\n"
            "- Honest eval results: Delta A p=0.189 not significant, DO NOT DEPLOY verdict\n"
            "- Dimension coverage gap documented (bench_commitment_honesty=0 pairs)\n"
            "- inference_example.py with per-dimension and all-dimensions scoring\n"
            "- Training scripts: train_judge.py, hyperparams.json, run_on_colab.ipynb"
        ),
    )
    print(f"\nDone. Commit URL: {url}")
    print(f"Model: https://huggingface.co/rafiakedir/tenacious-bench-adapter")


if __name__ == "__main__":
    main()