rafiakedir committed on
Commit f02a80f · verified · 1 Parent(s): 1abbbd5

feat: upload actual trained LoRA adapter (Qwen2.5-1.5B ORPO, 3 epochs, 36 steps)

Files changed (3)
  1. README.md +79 -159
  2. build_judge_pairs.py +214 -0
  3. inference_example.py +339 -13
README.md CHANGED
@@ -2,34 +2,33 @@
2
  license: cc-by-4.0
3
  language:
4
  - en
5
- base_model: unsloth/Qwen3.5-0.8B
6
  tags:
7
  - judge
8
  - b2b-sales
9
  - orpo
 
10
  - preference-learning
11
  - tenacious-bench
12
  - evaluation
13
- - qwen3
14
  - unsloth
15
  datasets:
16
  - rafiakedir/tenacious-bench-v0.1
17
  ---
18
 
19
- # Tenacious-Bench Judge - ORPO Fine-Tuned Qwen3.5-0.8B
20
 
21
  A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on
22
  [Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
23
- preference pairs. Deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine:
24
- the generator (DeepSeek V3.2) produces a candidate email; this judge scores it on five
25
- rubric dimensions; outputs below threshold are rejected and regenerated.
26
-
27
- **Base model:** `unsloth/Qwen3.5-0.8B`
28
- **Training algorithm:** ORPO (no reference model; single forward pass)
29
- **Weights:** Merged (full model, not a LoRA adapter)
30
- **Precision:** BF16 · ~873M parameters · ~1.75 GB
31
- **Context length:** 262,144 tokens
32
  **Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)
 
33
 
34
  ---
35
 
@@ -48,185 +47,115 @@ rubric dimensions; outputs below threshold are rejected and regenerated.
48
  ## Quick Start - Inference
49
 
50
  ```python
 
51
  from transformers import AutoTokenizer, AutoModelForCausalLM
52
- import torch
53
-
54
- model_id = "rafiakedir/tenacious-bench-adapter"
55
- tokenizer = AutoTokenizer.from_pretrained(model_id)
56
- model = AutoModelForCausalLM.from_pretrained(
57
- model_id, torch_dtype=torch.bfloat16, device_map="auto"
58
- )
59
-
60
- SYSTEM = """You are a rubric-aware judge for B2B outbound sales emails.
61
- Score the candidate output on the following dimension.
62
-
63
- Dimension: signal_grounding_fidelity
64
- Rubric: Every factual claim must resolve to a field in the hiring_signal_brief
65
- with confidence >= 0.60, or be phrased as a question.
66
-
67
- Respond with a JSON object: {"score": <0.0-1.0>, "reasoning": "<one sentence>"}"""
68
-
69
- USER = """Hiring signal brief:
70
- {
71
- "company_name": "Acme Corp",
72
- "open_roles": 3,
73
- "confidence": "low",
74
- "domain": "fintech"
75
- }
76
-
77
- Candidate email:
78
- "Hi Alex - noticed Acme Corp is aggressively scaling its engineering team with 3 open roles.
79
- We staff specialized capability-gap squads for fintech teams at your growth stage.
80
- Would a 30-minute scoping conversation make sense this week?"
81
-
82
- Score this output."""
83
 
84
- messages = [
85
- {"role": "system", "content": SYSTEM},
86
- {"role": "user", "content": USER},
87
- ]
88
- text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
89
- inputs = tokenizer(text, return_tensors="pt").to(model.device)
90
 
91
- with torch.no_grad():
92
- out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
93
 
94
- response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
95
- print(response)
96
- # Expected: {"score": 0.4, "reasoning": "Claims 'aggressively scaling' but brief confidence is low, so it should be phrased as a question."}
97
  ```
98
 
99
  ---
100
 
101
  ## Training Details
102
 
103
- ### Why ORPO
104
-
105
- ORPO (Hong et al., 2024) eliminates the reference model by computing the preference signal from
106
- the log-odds ratio of chosen vs. rejected completions in a single forward pass. This reduces peak
107
- VRAM by ~40% vs. DPO, making 3-epoch training feasible on a 16GB T4 without gradient
108
- checkpointing hacks.
109
-
110
- For a discriminative judge (score calibration rather than generation quality), the
111
- preference signal should be stronger. We ran `beta=0.1` per the paper's recommendation but note
112
- that `beta=0.2`–`0.3` may better calibrate the preference margin for rubric-based scoring.
113
-
114
- ### Preference Pair Construction
115
-
116
- | Source | Count |
117
- |---|---|
118
- | Failing tasks → generated chosen (DeepSeek V3.2) | ~111 attempted |
118
- | Passing tasks → generated rejected (DeepSeek V3.2) | ~41 attempted |
120
- | **Final pairs after filtering** | **94** |
121
-
122
- Filter: chosen score ≥ threshold AND rejected score < threshold AND TF-IDF cosine < 0.92.
123
- Main rejection causes: chosen output still scoring below threshold (phrasing-mode sensitivity),
124
- and ICP segment tasks with key mismatch making pass threshold structurally unachievable.
125
-
126
- **Preference leakage prevention (Li et al., 2025):**
127
- Generator (DeepSeek V3.2) ≠ judge family (Claude Sonnet 4.6 / `scoring_evaluator.py`).
128
- All generation decisions logged in the dataset repo at `training_data/generation_log.jsonl`.
129
-
130
- ### Hyperparameters
131
-
132
  | Parameter | Value |
133
  |---|---|
134
- | Base model | `unsloth/Qwen3.5-0.8B` |
135
  | LoRA rank | 16 |
136
  | LoRA alpha | 32 |
137
  | Target modules | q_proj, v_proj |
138
  | LoRA dropout | 0.05 |
139
  | Learning rate | 8e-6 |
140
- | Batch size (per device) | 2 |
141
- | Gradient accumulation | 4 (effective batch 8) |
142
  | Epochs | 3 |
143
- | Warmup ratio | 0.1 |
144
- | LR scheduler | cosine |
145
  | ORPO beta | 0.1 |
146
  | Max sequence length | 1024 |
147
- | Precision | BF16 (T4) |
148
  | Seed | 42 |
149
 
150
- Training notebook: see `run_on_colab.ipynb` in this repo.
151
 
152
  ---
153
 
154
  ## Evaluation Results
155
 
156
  Evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`.
157
- Paired bootstrap significance test: 10,000 iterations, seed 42.
158
-
159
- | Condition | Mean Score | vs. Baseline |
160
- |---|---|---|
161
- | Baseline (`scoring_evaluator.py` only) | 0.458 | n/a |
161
- | **This model (ORPO Qwen3.5-0.8B)** | **0.483** | Δ=+0.025, p=0.189, not significant |
162
- | Prompt-only (Qwen3-30B, zero-shot) | 0.504 | Δ=−0.021 vs. trained, p=0.978 |
163
-
164
- **Delta A** (trained vs. baseline): Δ=+0.025, 95% CI [−0.032, +0.081], p=0.189; **not statistically significant**.
166
 
167
- **Delta B** (trained vs. prompt-only): not significant. Finding: `prompt_engineering_sufficient`, i.e.
168
- the Qwen3-30B zero-shot condition is a viable lower-cost alternative at this scale of training data.
169
- Note: Delta B compares a 0.8B trained model against a 30B zero-shot model, which conflates backbone
170
- capacity with training benefit. A rigorous Delta B requires re-running the prompt-only condition on
171
- `Qwen3.5-0.8B-Instruct` (no fine-tuning).
172
-
173
- **Deployment recommendation for this run:** DO NOT DEPLOY as primary gate. Continue using
174
- `scoring_evaluator.py` deterministically. Retrain with ≥150 pairs covering all 5 dimensions
175
- before re-evaluating.
176
-
177
- Full numbers: `ablation_results.json` in the dataset repo.
178
 
179
  ---
180
 
181
  ## Known Limitations
182
 
183
- **1. Dimension coverage gap (critical).**
184
- The preference pairs contain 0 examples for `bench_commitment_honesty` and only 4 examples
185
- for `icp_segment_appropriateness`, due to a scoring function key mismatch that made it impossible
186
- to generate valid chosen outputs for these dimensions. The model received zero gradient signal on
187
- bench commitment honesty, the highest SOW-breach-risk dimension. It cannot be trusted to gate
188
- bench-commitment outputs.
189
-
190
- **2. Delta A not significant at v0.1 scale.**
191
- The +0.025 lift over the deterministic baseline is within the noise band (p=0.189). The model
192
- does not reliably outperform `scoring_evaluator.py` on held-out tasks.
193
 
194
- **3. Backbone below Prometheus-2 threshold.**
195
- Prometheus-2 (Kim et al., 2024) demonstrated rubric-matching at 7B parameters. Qwen3.5-0.8B is
196
- below that threshold. Capacity may be insufficient for simultaneous multi-dimension rubric generalization.
197
 
198
- **4. Synthetic training distribution.**
199
- All preference pairs derive from synthetic prospect briefs and LLM-generated emails. The model
200
- may not generalize to real prospect data with industry-specific jargon or edge cases outside the
201
- training distribution.
202
 
203
- **5. Static bench_summary.**
204
- The judge was trained on snapshot bench capacities. In production the bench changes weekly, so
205
- calibration for `bench_commitment_honesty` will drift over time.
206
 
207
  ---
208
 
209
- ## Files in This Repo
210
 
211
  | File | Description |
212
  |---|---|
213
- | `model.safetensors-*` | Merged model weights (BF16) |
214
- | `config.json` | Model architecture config |
215
  | `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
216
- | `train_judge.py` | Full ORPO training script |
217
- | `hyperparams.json` | All hyperparameters (pinned) |
218
- | `run_on_colab.ipynb` | End-to-end training notebook for T4 |
219
- | `inference_example.py` | Inference helper with prompt templates |
220
 
221
- Training data and preference pairs: [rafiakedir/tenacious-bench-v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
222
-
223
- ---
224
-
225
- ## Environmental Impact
226
-
227
- - **Compute:** ~60–90 min on a single T4 GPU (3 epochs, 94 preference pairs)
228
- - **CO₂e:** ~0.1 kg (T4 at 70W × 90 min × US grid 0.42 kg CO₂/kWh ÷ 1000)
229
- - **Infrastructure:** Google Colab free tier
230
 
231
  ---
232
 
@@ -234,18 +163,9 @@ Training data and preference pairs: [rafiakedir/tenacious-bench-v0.1](https://hu
234
 
235
  ```bibtex
236
  @misc{tenacious-bench-adapter-2026,
237
- title = {Tenacious-Bench Judge: ORPO Fine-Tuned Qwen3.5-0.8B for B2B Sales Evaluation},
238
- author = {Kedir, Rafia},
239
- year = {2026},
240
- howpublished = {HuggingFace Model Hub},
241
- url = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
242
- }
243
-
244
- @misc{tenacious-bench-v01-2026,
245
- title = {Tenacious-Bench v0.1: B2B Sales Evaluation Benchmark},
246
- author = {Kedir, Rafia},
247
- year = {2026},
248
- howpublished = {HuggingFace Datasets Hub},
249
- url = {https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1}
250
  }
251
  ```
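
Reviewer note: the evaluation text removed from the README above mentions a paired bootstrap significance test (10,000 iterations, seed 42), but the test code itself is not part of this diff. A minimal sketch of such a test, assuming per-task scores for two conditions on the same held-out tasks, might look like the following. The function name `paired_bootstrap`, the percentile interval, and the one-sided p-value definition are assumptions made for illustration and may differ from the author's implementation.

```python
import random
from statistics import mean


def paired_bootstrap(baseline, treated, iterations=10_000, seed=42):
    """Paired bootstrap over per-task score differences (treated - baseline).

    Hypothetical helper: resamples the paired differences with replacement
    and summarises the distribution of the resampled mean.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two equal-length, non-empty score lists")

    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    n = len(diffs)

    # Mean difference of each bootstrap resample, sorted for percentiles.
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(iterations))

    lo = boot_means[int(0.025 * iterations)]
    hi = boot_means[int(0.975 * iterations) - 1]
    # Assumed one-sided p-value: share of resamples showing no improvement.
    p_one_sided = sum(m <= 0 for m in boot_means) / iterations

    return {"delta": mean(diffs), "ci95": (lo, hi), "p": p_one_sided}


if __name__ == "__main__":
    base = [0.40, 0.50, 0.60, 0.30]
    tuned = [0.50, 0.60, 0.70, 0.40]
    print(paired_bootstrap(base, tuned, iterations=1_000))
```

Resampling the paired differences, rather than each condition independently, keeps the per-task pairing intact, which is the property a paired test relies on.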
 
2
  license: cc-by-4.0
3
  language:
4
  - en
5
+ base_model: unsloth/Qwen2.5-1.5B-Instruct
6
  tags:
7
  - judge
8
  - b2b-sales
9
  - orpo
10
+ - lora
11
  - preference-learning
12
  - tenacious-bench
13
  - evaluation
14
+ - qwen2.5
15
  - unsloth
16
  datasets:
17
  - rafiakedir/tenacious-bench-v0.1
18
  ---
19
 
20
+ # Tenacious-Bench Judge - ORPO LoRA Adapter (Qwen2.5-1.5B)
21
 
22
  A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on
23
  [Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
24
+ preference pairs. Deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine.
25
+
26
+ **Base model:** `unsloth/Qwen2.5-1.5B-Instruct`
27
+ **Adapter type:** LoRA (PEFT); load with base model + `PeftModel.from_pretrained`
28
+ **Training algorithm:** ORPO (no reference model required)
29
+ **Precision:** 4-bit quantized during training (Unsloth), fp16 for inference
 
 
 
30
  **Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)
31
+ **Training:** 3 epochs · 36 steps · lr=8e-6 · beta=0.1 · LoRA r=16 alpha=32
32
 
33
  ---
34
 
 
47
  ## Quick Start - Inference
48
 
49
  ```python
50
+ import json, torch
51
  from transformers import AutoTokenizer, AutoModelForCausalLM
52
+ from peft import PeftModel
53
 
54
+ BASE_MODEL = "unsloth/Qwen2.5-1.5B-Instruct"
55
+ ADAPTER_ID = "rafiakedir/tenacious-bench-adapter"
56
 
57
+ tokenizer = AutoTokenizer.from_pretrained(ADAPTER_ID)
58
+ base = AutoModelForCausalLM.from_pretrained(
59
+ BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
60
+ )
61
+ model = PeftModel.from_pretrained(base, ADAPTER_ID)
62
+ model.eval()
63
+
64
+ JUDGE_SYSTEM = (
65
+ "You are a rubric-aware judge for Tenacious Consulting B2B outbound sales emails. "
66
+ "Given a task context and a candidate email, score it on the specified rubric dimension. "
67
+ "Respond with a JSON object only:\n"
68
+ '{"dimension": "<dim>", "score": <0.0-1.0>, "pass": <true|false>, "reasoning": "<one sentence>"}'
69
+ )
70
 
71
+ def judge(email, context, dimension):
72
+ user = (
73
+ f"EVALUATION DIMENSION: {dimension}\n\n"
74
+ f"TASK CONTEXT:\n{context}\n\n"
75
+ f"CANDIDATE EMAIL:\n{email}\n\n"
76
+ f"Score this email on the {dimension} dimension."
77
+ )
78
+ msgs = [{"role": "system", "content": JUDGE_SYSTEM},
79
+ {"role": "user", "content": user}]
80
+ text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
81
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
82
+ with torch.no_grad():
83
+ out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True,
84
+ pad_token_id=tokenizer.eos_token_id)
85
+ resp = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
86
+ s, e = resp.find("{"), resp.rfind("}") + 1
87
+ return json.loads(resp[s:e]) if s >= 0 else {"score": 0.5, "raw": resp[:200]}
88
+
89
+ result = judge(
90
+ email="Casey - TalentBridge has 8 open AI/ML roles this quarter. 30-min scoping call: calendly.com/tenacious",
91
+ context="company: TalentBridge, stage: Series A, open_roles: 8, confidence: high",
92
+ dimension="signal_grounding_fidelity"
93
+ )
94
+ print(result)
95
  ```
96
 
97
  ---
98
 
99
  ## Training Details
100
 
101
  | Parameter | Value |
102
  |---|---|
103
+ | Base model | `unsloth/Qwen2.5-1.5B-Instruct` (4-bit during training) |
104
  | LoRA rank | 16 |
105
  | LoRA alpha | 32 |
106
  | Target modules | q_proj, v_proj |
107
  | LoRA dropout | 0.05 |
108
  | Learning rate | 8e-6 |
109
+ | Effective batch size | 8 (batch=2, grad_accum=4) |
 
110
  | Epochs | 3 |
111
+ | Total steps | 36 |
 
112
  | ORPO beta | 0.1 |
113
  | Max sequence length | 1024 |
 
114
  | Seed | 42 |
115
 
116
+ **Training loss:** 2.8676 → 2.9646 → 2.9386 (3 checkpoints)
117
+ **Reward accuracy:** 0.5375 → 0.6026 → 0.5128
118
+
119
+ **Training data:** 94 preference pairs from the train partition. Preference leakage prevention:
120
+ generator (DeepSeek V3.2) ≠ judge family (Claude Sonnet 4.6 / `scoring_evaluator.py`).
121
+ All generation decisions logged in the dataset repo at `training_data/generation_log.jsonl`.
122
 
123
  ---
124
 
125
  ## Evaluation Results
126
 
127
  Evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`.
128
+ Full results in `ablation_results.json` in the dataset repo.
129
 
130
+ **Deployment recommendation:** Run `ablations/run_ablations.py` with this adapter to get Delta A.
131
+ The ablation script loads this adapter via HuggingFace; it requires GPU + transformers + peft.
132
 
133
  ---
134
 
135
  ## Known Limitations
136
 
137
+ 1. **Dimension coverage gap.** 0 training pairs for `bench_commitment_honesty`, 4 for `icp_segment_appropriateness` due to scoring key mismatch during pair construction. The model received zero gradient signal on bench commitment honesty.
138
 
139
+ 2. **Backbone below Prometheus-2 threshold.** Prometheus-2 demonstrated rubric-matching at 7B+ parameters. At 1.5B the model may underfit multi-dimension generalization.
 
 
140
 
141
+ 3. **Synthetic training distribution.** All pairs derive from synthetic prospect briefs and LLM-generated emails.
 
 
 
142
 
143
+ 4. **Static bench_summary.** Judge calibration drifts as real bench composition changes weekly.
 
 
144
 
145
  ---
146
 
147
+ ## Files
148
 
149
  | File | Description |
150
  |---|---|
151
+ | `adapter_config.json` | LoRA configuration (r=16, alpha=32, q_proj+v_proj) |
152
+ | `adapter_model.safetensors` | Trained LoRA weights (8.4 MB) |
153
  | `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
154
+ | `run_on_colab.ipynb` | End-to-end training + push notebook |
155
+ | `train_judge.py` | Training script |
156
+ | `inference_example.py` | Per-dimension and all-dimension scoring helper |
 
157
 
158
+ Training data: [rafiakedir/tenacious-bench-v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
159
 
160
  ---
161
 
 
163
 
164
  ```bibtex
165
  @misc{tenacious-bench-adapter-2026,
166
+ title = {Tenacious-Bench Judge: ORPO LoRA Adapter for B2B Sales Evaluation},
167
+ author = {Kedir, Rafia},
168
+ year = {2026},
169
+ url = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
170
  }
171
  ```
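
Reviewer note: the `judge()` helper in the new Quick Start calls `json.loads` on the sliced response whenever an opening brace is found. If generation stops before the closing brace (for example at `max_new_tokens=128`), `rfind` returns -1, the slice is empty, and `json.loads` raises instead of reaching the fallback. A more defensive parser might look like the sketch below. `parse_judge_response` and the `parse_error` key are hypothetical names that do not appear in the commit, and the 0.5 fallback score simply mirrors the fallback already used in the README.

```python
import json


def parse_judge_response(resp: str) -> dict:
    """Extract the JSON object spanning the first '{' to the last '}'.

    Hypothetical helper: returns a fallback dict rather than raising when
    the model output is truncated or is not valid JSON.
    """
    start, end = resp.find("{"), resp.rfind("}")
    if start == -1 or end <= start:
        # No complete object in the text (e.g. output cut off mid-JSON).
        return {"score": 0.5, "raw": resp[:200], "parse_error": "no JSON object found"}
    try:
        return json.loads(resp[start:end + 1])
    except json.JSONDecodeError as exc:
        # Braces were present but the content between them is malformed.
        return {"score": 0.5, "raw": resp[:200], "parse_error": str(exc)}


if __name__ == "__main__":
    print(parse_judge_response('Sure: {"score": 0.8, "pass": true}'))
    print(parse_judge_response('{"score": 0.8, "reasoning": "cut off'))
```

Keeping the error reason in the returned dict lets a caller distinguish a genuine 0.5 score from a parsing fallback, which matters if the result feeds a rejection-sampling gate.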
build_judge_pairs.py ADDED
@@ -0,0 +1,214 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Build judge-format ORPO training pairs.
4
+
5
+ Each preference pair in preference_pairs.jsonl has:
6
+ chosen = a GOOD email (passes rubric)
7
+ rejected = a BAD email (fails rubric)
8
+
9
+ For judge training we need the model to score emails, not generate them.
10
+ So we create pairs where:
11
+ chosen_response = correct JSON score for the email
12
+ rejected_response = wrong JSON score for the same email
13
+
14
+ From each original pair we create TWO judge training examples:
15
+ 1. Judge pair for the GOOD email → correct high score is chosen, wrong low-ish score is rejected
16
+ 2. Judge pair for the BAD email → correct low score is chosen, wrong high-ish score is rejected
17
+
18
+ Output: training_data/judge_pairs.jsonl (conversations format for ORPOTrainer)
19
+ """
20
+
21
+ import json
22
+ import sys
23
+ from pathlib import Path
24
+
25
+ ROOT = Path(__file__).parent.parent
26
+ sys.path.insert(0, str(ROOT))
27
+ from scoring_evaluator import score_task
28
+
29
+ PAIRS_PATH = ROOT / "training_data/preference_pairs.jsonl"
30
+ TASKS_PATH = ROOT / "tenacious_bench_v0.1/train/tasks.jsonl"
31
+ OUTPUT_PATH = ROOT / "training_data/judge_pairs.jsonl"
32
+
33
+ JUDGE_SYSTEM = (
34
+ "You are a rubric-aware judge for Tenacious Consulting B2B outbound sales emails. "
35
+ "Given a task context and a candidate email, score the email on the specified rubric "
36
+ "dimension. Respond with a JSON object only:\n"
37
+ '{"dimension": "<dim>", "score": <0.0-1.0>, "pass": <true|false>, '
38
+ '"reasoning": "<one concise sentence>"}'
39
+ )
40
+
41
+ PASS_THRESHOLD = {
42
+ "signal_grounding_fidelity": 0.60,
43
+ "bench_commitment_honesty": 0.50,
44
+ "icp_segment_appropriateness": 0.50,
45
+ "competitor_gap_honesty": 0.50,
46
+ "tone_preservation": 0.60,
47
+ }
48
+
49
+ # Dimension-specific reasoning templates
50
+ PASS_REASONING = {
51
+ "signal_grounding_fidelity": "Email grounds all factual claims in documented hiring signals from the brief; low-confidence signals are phrased as questions.",
52
+ "bench_commitment_honesty": "Staffing commitment is within the available bench count for the required stack.",
53
+ "icp_segment_appropriateness": "Email language matches the correct ICP segment for the prospect's funding stage and posture.",
54
+ "competitor_gap_honesty": "Competitor gap claims are grounded in the competitor_gap_brief; no fabricated assertions.",
55
+ "tone_preservation": "Email maintains Tenacious brand voice: no clichés, no over-apologetic language, calendar CTA included.",
56
+ }
57
+ FAIL_REASONING = {
58
+ "signal_grounding_fidelity": "Email asserts growth or capability claims not supported by the hiring signal brief; treats low-confidence signals as established facts.",
59
+ "bench_commitment_honesty": "Email promises engineer capacity that exceeds the available bench count for the required stack.",
60
+ "icp_segment_appropriateness": "Email uses the wrong segment language; growth-phase pitch applied to a cost-restructuring or abstain-segment prospect.",
61
+ "competitor_gap_honesty": "Email asserts competitor gaps not documented in the brief, fabricating capability differences.",
62
+ "tone_preservation": "Email uses a banned re-engagement phrase or lacks the required 30-minute scoping calendar CTA.",
63
+ }
64
+
65
+
66
+ def build_user_prompt(task: dict, email_text: str) -> str:
67
+ dim = task.get("dimension", "")
68
+ inp = task.get("input", {})
69
+ # Compact the signal brief (trim to 800 chars to stay within max_prompt_length)
70
+ brief = json.dumps(
71
+ inp.get("hiring_signal_brief") or inp.get("bench_summary") or {},
72
+ indent=2
73
+ )[:800]
74
+ return (
75
+ f"EVALUATION DIMENSION: {dim}\n\n"
76
+ f"TASK CONTEXT:\n{brief}\n\n"
77
+ f"CANDIDATE EMAIL:\n{email_text.strip()}\n\n"
78
+ f"Score this email on the {dim} dimension."
79
+ )
80
+
81
+
82
+ def make_score_json(dim: str, score: float, passed: bool, reasoning: str) -> str:
83
+ return json.dumps({
84
+ "dimension": dim,
85
+ "score": round(score, 2),
86
+ "pass": passed,
87
+ "reasoning": reasoning,
88
+ })
89
+
90
+
91
+ def conversations(system: str, user: str, assistant: str) -> list:
92
+ return [
93
+ {"role": "system", "content": system},
94
+ {"role": "user", "content": user},
95
+ {"role": "assistant", "content": assistant},
96
+ ]
97
+
98
+
99
+ def main():
100
+ # Load tasks by task_id
101
+ tasks = {}
102
+ with open(TASKS_PATH) as f:
103
+ for line in f:
104
+ t = json.loads(line)
105
+ tasks[t["task_id"]] = t
106
+
107
+ pairs_raw = []
108
+ with open(PAIRS_PATH) as f:
109
+ for line in f:
110
+ pairs_raw.append(json.loads(line))
111
+
112
+ judge_pairs = []
113
+ skipped = 0
114
+
115
+ for pair in pairs_raw:
116
+ task_id = pair["task_id"]
117
+ dim = pair["dimension"]
118
+ task = tasks.get(task_id)
119
+ if task is None:
120
+ skipped += 1
121
+ continue
122
+
123
+ # Strip the <|im_end|> token that was embedded during generation
124
+ chosen_email = pair["chosen"].replace("<|im_end|>", "").strip()
125
+ rejected_email = pair["rejected"].replace("<|im_end|>", "").strip()
126
+
127
+ # Score both emails with the deterministic evaluator
128
+ r_chosen = score_task({**task, "candidate_output": chosen_email})
129
+ r_rejected = score_task({**task, "candidate_output": rejected_email})
130
+
131
+ sc = r_chosen.get("score", 0.5)
132
+ sr = r_rejected.get("score", 0.5)
133
+ threshold = PASS_THRESHOLD.get(dim, 0.5)
134
+
135
+ chosen_passes = sc >= threshold
136
+ rejected_passes = sr >= threshold
137
+
138
+ # ── Judge pair 1: score the GOOD (chosen) email ──────────────────────
139
+ # Correct judgment: high score (chosen) vs wrong judgment: low score (rejected)
140
+ user_prompt_chosen = build_user_prompt(task, chosen_email)
141
+
142
+ correct_score_chosen = round(min(sc + 0.05, 1.0), 2) if chosen_passes else round(sc, 2)
143
+ wrong_score_chosen = round(max(sc - 0.5, 0.0), 2)
144
+
145
+ correct_response = make_score_json(
146
+ dim, correct_score_chosen, chosen_passes,
147
+ PASS_REASONING[dim] if chosen_passes else FAIL_REASONING[dim]
148
+ )
149
+ wrong_response = make_score_json(
150
+ dim, wrong_score_chosen, not chosen_passes,
151
+ FAIL_REASONING[dim] if chosen_passes else PASS_REASONING[dim]
152
+ )
153
+
154
+ # Only include if there's a meaningful score gap
155
+ if abs(correct_score_chosen - wrong_score_chosen) >= 0.2:
156
+ judge_pairs.append({
157
+ "chosen": conversations(JUDGE_SYSTEM, user_prompt_chosen, correct_response),
158
+ "rejected": conversations(JUDGE_SYSTEM, user_prompt_chosen, wrong_response),
159
+ "task_id": task_id,
160
+ "dimension": dim,
161
+ "email_type": "chosen",
162
+ "actual_score": sc,
163
+ })
164
+
165
+ # ── Judge pair 2: score the BAD (rejected) email ─────────────────────
166
+ user_prompt_rejected = build_user_prompt(task, rejected_email)
167
+
168
+ correct_score_rejected = round(sr, 2)
169
+ wrong_score_rejected = round(min(sr + 0.5, 1.0), 2)
170
+
171
+ correct_response_r = make_score_json(
172
+ dim, correct_score_rejected, rejected_passes,
173
+ PASS_REASONING[dim] if rejected_passes else FAIL_REASONING[dim]
174
+ )
175
+ wrong_response_r = make_score_json(
176
+ dim, wrong_score_rejected, not rejected_passes,
177
+ PASS_REASONING[dim] if not rejected_passes else FAIL_REASONING[dim]
178
+ )
179
+
180
+ if abs(wrong_score_rejected - correct_score_rejected) >= 0.2:
181
+ judge_pairs.append({
182
+ "chosen": conversations(JUDGE_SYSTEM, user_prompt_rejected, correct_response_r),
183
+ "rejected": conversations(JUDGE_SYSTEM, user_prompt_rejected, wrong_response_r),
184
+ "task_id": task_id,
185
+ "dimension": dim,
186
+ "email_type": "rejected",
187
+ "actual_score": sr,
188
+ })
189
+
190
+ with open(OUTPUT_PATH, "w") as f:
191
+ for jp in judge_pairs:
192
+ f.write(json.dumps(jp) + "\n")
193
+
194
+ from collections import Counter
195
+ dim_counts = Counter(jp["dimension"] for jp in judge_pairs)
196
+ type_counts = Counter(jp["email_type"] for jp in judge_pairs)
197
+
198
+ print(f"Built {len(judge_pairs)} judge pairs (skipped {skipped} missing tasks)")
199
+ print(f"Dimension breakdown: {dict(dim_counts)}")
200
+ print(f"Email type: {dict(type_counts)}")
201
+ print(f"Written to {OUTPUT_PATH}")
202
+
203
+ # Validate format
204
+ sample = judge_pairs[0]
205
+ assert "chosen" in sample and isinstance(sample["chosen"], list)
206
+ assert sample["chosen"][0]["role"] == "system"
207
+ assert sample["chosen"][-1]["role"] == "assistant"
208
+ print("\nFormat validation: PASSED")
209
+ print(f"Sample chosen response: {sample['chosen'][-1]['content']}")
210
+ print(f"Sample rejected response: {sample['rejected'][-1]['content']}")
211
+
212
+
213
+ if __name__ == "__main__":
214
+ main()
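
Reviewer note: the score arithmetic in `main()` can produce "wrong" responses whose `pass` flag disagrees with their own score, for example a wrong score of 0.5 labelled `pass: false` under a 0.50 threshold. Whether this affects ORPO training is not established by anything in this diff; the sketch below only makes such cases visible. `derive_judge_scores` is a hypothetical helper that mirrors the arithmetic in the script and is not part of the commit.

```python
def derive_judge_scores(score: float, threshold: float, email_type: str) -> dict:
    """Mirror the correct/wrong score arithmetic from build_judge_pairs.py.

    email_type is "chosen" (the good email) or "rejected" (the bad email).
    """
    passes = score >= threshold
    if email_type == "chosen":
        # Passing emails get a +0.05 bump; the wrong score drops by 0.5.
        correct = round(min(score + 0.05, 1.0), 2) if passes else round(score, 2)
        wrong = round(max(score - 0.5, 0.0), 2)
    elif email_type == "rejected":
        # The evaluator score is kept; the wrong score rises by 0.5.
        correct = round(score, 2)
        wrong = round(min(score + 0.5, 1.0), 2)
    else:
        raise ValueError("email_type must be 'chosen' or 'rejected'")

    wrong_flag = not passes  # the script always inverts the pass flag
    return {
        "correct_score": correct,
        "wrong_score": wrong,
        "kept": abs(correct - wrong) >= 0.2,  # the script's score-gap filter
        # True when the inverted flag agrees with the wrong score itself.
        "wrong_flag_consistent": (wrong >= threshold) == wrong_flag,
    }


if __name__ == "__main__":
    for args in [(1.0, 0.50, "chosen"), (0.9, 0.60, "chosen"), (0.0, 0.60, "rejected")]:
        print(args, derive_judge_scores(*args))
```

One possible follow-up, offered only as a suggestion, would be to clamp the wrong score to the far side of the threshold so every rejected response is internally consistent and differs from the chosen one only in calibration.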
inference_example.py CHANGED
@@ -1,5 +1,274 @@
1
  #!/usr/bin/env python3
2
  """
3
  Inference helper for rafiakedir/tenacious-bench-adapter.
4
  Scores a candidate B2B sales email on one or all five Tenacious rubric dimensions.
5
  """
@@ -12,50 +281,53 @@ MODEL_ID = "rafiakedir/tenacious-bench-adapter"
12
 
13
  DIMENSION_PROMPTS = {
14
  "signal_grounding_fidelity": (
15
- "Dimension: signal_grounding_fidelity\n"
16
  "Rubric: Every factual claim about the prospect must be supported by the hiring_signal_brief "
17
  "with confidence >= 0.60, or phrased as a question. Claims about growth, scale, or ML strategy "
18
  "without a high/medium-confidence signal in the brief must be recast as questions."
19
  ),
20
  "bench_commitment_honesty": (
21
- "Dimension: bench_commitment_honesty\n"
22
  "Rubric: The email must not promise or imply a number of engineers that exceeds "
23
  "the total available in the bench_summary. Any staffing commitment must stay within capacity."
24
  ),
25
  "icp_segment_appropriateness": (
26
- "Dimension: icp_segment_appropriateness\n"
27
  "Rubric: The email's language and pitch angle must match the correct ICP segment "
28
  "(Segment 1=growth-scale, Segment 2=cost-restructuring, Segment 3=consolidation, "
29
  "ABSTAIN=insufficient signal). A growth pitch to a post-layoff company is a mismatch."
30
  ),
31
  "competitor_gap_honesty": (
32
- "Dimension: competitor_gap_honesty\n"
33
  "Rubric: Any assertion about a competitor gap must be grounded in the competitor_gap_brief. "
34
  "The email must not assert that competitors have capabilities the prospect lacks "
35
  "unless the brief explicitly documents this gap."
36
  ),
37
  "tone_preservation": (
38
- "Dimension: tone_preservation\n"
39
  "Rubric: No re-engagement clichés ('just wanted to circle back', 'touching base', "
40
  "'following up'). No over-apologetic exits ('sorry for taking your time'). "
41
  "Calendar CTA required. Confident but not pushy."
42
  ),
43
  }
44
 
45
- SYSTEM_TEMPLATE = """You are a rubric-aware judge for B2B outbound sales emails written by Tenacious Consulting.
 
46
  {dimension_prompt}
47
 
48
  Respond with a JSON object only:
49
  {{"score": <float 0.0-1.0>, "reasoning": "<one concise sentence explaining the score>"}}
50
  """
51
 
52
- USER_TEMPLATE = """Context:
 
53
  {context_json}
54
 
55
  Candidate email:
56
  {candidate_output}
57
 
58
- Score this output on the dimension above."""
 
59
 
60
 
61
  def load_model(model_id: str = MODEL_ID):
@@ -162,15 +434,69 @@ if __name__ == "__main__":
162
  "teams at your growth stage. Would a 30-minute scoping conversation make sense this week?"
163
  )
164
 
165
- print("\nScoring on signal_grounding_fidelity...")
166
  result = score(tokenizer, model, demo_input, demo_email, "signal_grounding_fidelity")
167
- print(f" Score: {result['score']:.2f}")
168
- print(f" Reasoning: {result['reasoning']}")
169
 
170
- print("\nScoring all dimensions...")
171
  all_results = score_all_dimensions(tokenizer, model, demo_input, demo_email)
172
  for dim, r in all_results.items():
173
  if dim == "mean_score":
174
  print(f" MEAN: {r:.3f}")
175
  else:
176
- print(f" {dim}: {r['score']:.2f} - {r['reasoning'][:80]}")
1
  #!/usr/bin/env python3
2
  """
3
+ Upload model card, training scripts, and inference helper to
4
+ rafiakedir/tenacious-bench-adapter on HuggingFace.
5
+ Does NOT re-upload the safetensors weights; those are already there.
6
+ """
7
+ from pathlib import Path
8
+ from huggingface_hub import HfApi, CommitOperationAdd
9
+
10
+ ROOT = Path(__file__).parent
11
+ REPO_ID = "rafiakedir/tenacious-bench-adapter"
12
+
13
+ # ── Model Card ────────────────────────────────────────────────────────────────
14
+ MODEL_CARD = """\
15
+ ---
16
+ license: cc-by-4.0
17
+ language:
18
+ - en
19
+ base_model: unsloth/Qwen3.5-0.8B
20
+ tags:
21
+ - judge
22
+ - b2b-sales
23
+ - orpo
24
+ - preference-learning
25
+ - tenacious-bench
26
+ - evaluation
27
+ - qwen3
28
+ - unsloth
29
+ datasets:
30
+ - rafiakedir/tenacious-bench-v0.1
31
+ ---
32
+
33
+ # Tenacious-Bench Judge - ORPO Fine-Tuned Qwen3.5-0.8B
34
+
35
+ A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on
36
+ [Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
37
+ preference pairs. Deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine:
38
+ the generator (DeepSeek V3.2) produces a candidate email; this judge scores it on five
39
+ rubric dimensions; outputs below threshold are rejected and regenerated.
40
+
41
+ **Base model:** `unsloth/Qwen3.5-0.8B`
42
+ **Training algorithm:** ORPO (no reference model; single forward pass)
43
+ **Weights:** Merged (full model, not a LoRA adapter)
44
+ **Precision:** BF16 · ~873M parameters · ~1.75 GB
45
+ **Context length:** 262,144 tokens
46
+ **Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)
47
+
48
+ ---
49
+
50
+ ## What It Scores
51
+
52
+ | Dimension | Trigger Rate (Week 10 probes) | Risk if Missed |
53
+ |---|---|---|
54
+ | `signal_grounding_fidelity` | 35% | CTO credibility loss |
55
+ | `competitor_gap_honesty` | 45% | Irreversible brand damage |
56
+ | `icp_segment_appropriateness` | 20% | ~$480K ACV per error |
57
+ | `tone_preservation` | 15% | Brand voice violation |
58
+ | `bench_commitment_honesty` | 5% | SOW-breach / delivery failure |
59
+
60
+ ---
61
+
62
+ ## Quick Start: Inference
63
+
64
+ ```python
65
+ from transformers import AutoTokenizer, AutoModelForCausalLM
66
+ import torch
67
+
68
+ model_id = "rafiakedir/tenacious-bench-adapter"
69
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
70
+ model = AutoModelForCausalLM.from_pretrained(
71
+     model_id, torch_dtype=torch.bfloat16, device_map="auto"
72
+ )
73
+
74
+ SYSTEM = \"\"\"You are a rubric-aware judge for B2B outbound sales emails.
75
+ Score the candidate output on the following dimension.
76
+
77
+ Dimension: signal_grounding_fidelity
78
+ Rubric: Every factual claim must resolve to a field in the hiring_signal_brief
79
+ with confidence >= 0.60, or be phrased as a question.
80
+
81
+ Respond with a JSON object: {"score": <0.0-1.0>, "reasoning": "<one sentence>"}\"\"\"
82
+
83
+ USER = \"\"\"Hiring signal brief:
84
+ {
85
+   "company_name": "Acme Corp",
86
+   "open_roles": 3,
87
+   "confidence": "low",
88
+   "domain": "fintech"
89
+ }
90
+
91
+ Candidate email:
92
+ "Hi Alex - noticed Acme Corp is aggressively scaling its engineering team with 3 open roles.
93
+ We staff specialized capability-gap squads for fintech teams at your growth stage.
94
+ Would a 30-minute scoping conversation make sense this week?"
95
+
96
+ Score this output.\"\"\"
97
+
98
+ messages = [
99
+     {"role": "system", "content": SYSTEM},
100
+     {"role": "user", "content": USER},
101
+ ]
102
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
103
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
104
+
105
+ with torch.no_grad():
106
+     out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
107
+
108
+ response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
109
+ print(response)
110
+ # Expected: {"score": 0.4, "reasoning": "Claims 'aggressively scaling' but brief confidence is low - should be phrased as a question."}
111
+ ```
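
The judge is prompted to return a JSON object, but a small model may wrap it in extra text. A minimal parsing sketch, assuming the response contains a single `{...}` object; `parse_judge_response` is a hypothetical helper written for illustration and is not taken from `inference_example.py`:

```python
import json


def parse_judge_response(response: str) -> dict:
    # Take the span from the first "{" to the last "}" and parse it as JSON.
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge response")
    parsed = json.loads(response[start:end + 1])
    # Clamp the score into the documented 0.0-1.0 range.
    score = min(1.0, max(0.0, float(parsed["score"])))
    return {"score": score, "reasoning": str(parsed.get("reasoning", ""))}


result = parse_judge_response('Sure. {"score": 0.4, "reasoning": "Unsupported claim."}')
print(result["score"], result["reasoning"])
```

Raising on a missing object, rather than defaulting to a score, keeps malformed generations from silently passing or failing the gate.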
112
+
113
+ ---
114
+
115
+ ## Training Details
116
+
117
+ ### Why ORPO
118
+
119
+ ORPO (Hong et al., 2024) eliminates the reference model by computing the preference signal from
120
+ the log-odds ratio of chosen vs. rejected completions in a single forward pass. This reduces peak
121
+ VRAM by ~40% vs. DPO, making 3-epoch training feasible on a 16GB T4 without gradient
122
+ checkpointing hacks.
123
+
124
+ For a discriminative judge (score calibration rather than generation quality), the
125
+ preference signal should probably be weighted more heavily. We ran `beta=0.1` per the paper's recommendation but note
126
+ that `beta=0.2`–`0.3` may better calibrate the preference margin for rubric-based scoring.
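
To make the role of `beta` concrete, here is a minimal numeric sketch of the objective as described in Hong et al. (2024): the SFT loss on the chosen response plus a weighted odds-ratio term. It assumes sequence likelihood is the exponentiated mean token log-probability and that `beta` plays the role of the paper's weighting factor; the function names and toy numbers are illustrative and are not taken from `train_judge.py`.

```python
import math


def log_odds(avg_logp: float) -> float:
    # log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0.
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(nll_chosen: float, avg_logp_chosen: float,
              avg_logp_rejected: float, beta: float = 0.1) -> float:
    # Odds-ratio term: -log sigmoid(log-odds(chosen) - log-odds(rejected)).
    margin = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    odds_ratio_term = math.log1p(math.exp(-margin))
    # Total: negative log-likelihood on the chosen response plus the weighted term.
    return nll_chosen + beta * odds_ratio_term


# Toy numbers: the chosen response is more likely than the rejected one.
for beta in (0.1, 0.3):
    print(beta, orpo_loss(0.5, -0.5, -1.5, beta=beta))
```

Raising `beta` scales only the odds-ratio term, so the same chosen/rejected margin contributes more to the gradient relative to the SFT term.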
127
+
128
+ ### Preference Pair Construction
129
+
130
+ | Source | Count |
131
+ |---|---|
132
+ | Failing tasks → generated chosen (DeepSeek V3.2) | ~111 attempted |
133
+ | Passing tasks → generated rejected (DeepSeek V3.2) | ~41 attempted |
134
+ | **Final pairs after filtering** | **94** |
135
+
136
+ Filter: chosen score ≥ threshold AND rejected score < threshold AND TF-IDF cosine similarity < 0.92.
137
+ Main rejection causes: chosen output still scoring below threshold (phrasing-mode sensitivity),
138
+ and ICP segment tasks where a key mismatch made the pass threshold structurally unachievable.
139
+
140
+ **Preference leakage prevention (Li et al., 2025):**
141
+ Generator (DeepSeek V3.2) ≠ judge family (Claude Sonnet 4.6 / `scoring_evaluator.py`).
142
+ All generation decisions logged in the dataset repo at `training_data/generation_log.jsonl`.
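
The three-part pair filter stated above can be sketched in plain Python. Several details here are assumptions rather than the pipeline's actual code: the cosine is taken between the chosen and rejected texts, tokens are split on whitespace, IDF is smoothed over the two-document corpus, and the 0.6 pass threshold is a placeholder.

```python
import math
from collections import Counter


def tfidf_cosine(a: str, b: str) -> float:
    # TF-IDF over the two-document corpus {a, b} with smoothed IDF, then cosine similarity.
    docs = [Counter(a.lower().split()), Counter(b.lower().split())]
    vocab = set(docs[0]) | set(docs[1])
    idf = {t: math.log(3 / (1 + sum(t in d for d in docs))) + 1 for t in vocab}
    vecs = [{t: d[t] * idf[t] for t in d} for d in docs]
    dot = sum(vecs[0][t] * vecs[1].get(t, 0.0) for t in vecs[0])
    norms = [math.sqrt(sum(v * v for v in vec.values())) for vec in vecs]
    return dot / (norms[0] * norms[1]) if all(norms) else 0.0


def keep_pair(chosen: str, rejected: str, chosen_score: float,
              rejected_score: float, threshold: float = 0.6,
              max_cosine: float = 0.92) -> bool:
    # Keep a pair only if chosen passes, rejected fails, and the texts are not near-duplicates.
    return (chosen_score >= threshold
            and rejected_score < threshold
            and tfidf_cosine(chosen, rejected) < max_cosine)


chosen = "Are you expanding the engineering team this quarter?"
rejected = "You are aggressively scaling your engineering team."
print(keep_pair(chosen, rejected, chosen_score=0.8, rejected_score=0.3))
```

The similarity cap drops pairs whose chosen and rejected texts are nearly identical, since such pairs carry little preference signal.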
143
+
144
+ ### Hyperparameters
145
+
146
+ | Parameter | Value |
147
+ |---|---|
148
+ | Base model | `unsloth/Qwen3.5-0.8B` |
149
+ | LoRA rank | 16 |
150
+ | LoRA alpha | 32 |
151
+ | Target modules | q_proj, v_proj |
152
+ | LoRA dropout | 0.05 |
153
+ | Learning rate | 8e-6 |
154
+ | Batch size (per device) | 2 |
155
+ | Gradient accumulation | 4 (effective batch 8) |
156
+ | Epochs | 3 |
157
+ | Warmup ratio | 0.1 |
158
+ | LR scheduler | cosine |
159
+ | ORPO beta | 0.1 |
160
+ | Max sequence length | 1024 |
161
+ | Precision | BF16 (T4) |
162
+ | Seed | 42 |
163
+
164
+ Training notebook: see `run_on_colab.ipynb` in this repo.
165
+
166
+ ---
167
+
168
+ ## Evaluation Results
169
+
170
+ Evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`.
171
+ Paired bootstrap significance test: 10,000 iterations, seed 42.
172
+
173
+ | Condition | Mean Score | vs. Baseline |
174
+ |---|---|---|
175
+ | Baseline (`scoring_evaluator.py` only) | 0.458 | n/a |
176
+ | **This model (ORPO Qwen3.5-0.8B)** | **0.483** | Δ=+0.025, p=0.189, not significant |
177
+ | Prompt-only (Qwen3-30B, zero-shot) | 0.504 | Δ=−0.021 vs. trained, p=0.978 |
178
+
179
+ **Delta A** (trained vs. baseline): Δ=+0.025, 95% CI [−0.032, +0.081], p=0.189, **not statistically significant**.
180
+
181
+ **Delta B** (trained vs. prompt-only): not significant. Finding: `prompt_engineering_sufficient`, i.e.
182
+ the Qwen3-30B zero-shot condition is a viable lower-cost alternative at this scale of training data.
183
+ Note: Delta B compares a 0.8B trained model against a 30B zero-shot model, which conflates backbone
184
+ capacity with training benefit. A rigorous Delta B requires re-running the prompt-only condition on
185
+ `Qwen3.5-0.8B-Instruct` (no fine-tuning).
186
+
187
+ **Deployment recommendation for this run:** DO NOT DEPLOY as primary gate. Continue using
188
+ `scoring_evaluator.py` deterministically. Retrain with ≥150 pairs covering all 5 dimensions
189
+ before re-evaluating.
190
+
191
+ Full numbers: `ablation_results.json` in the dataset repo.
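
For readers who want to reproduce the style of test reported above, here is a minimal paired-bootstrap sketch using only the standard library. The one-sided p-value definition and the percentile interval are assumptions about how such a test is commonly set up, not a description of the exact evaluation script; the scores in the usage lines are toy values.

```python
import random


def paired_bootstrap(treated, baseline, iterations=10_000, seed=42):
    # Per-task differences keep the pairing between the two conditions.
    diffs = [t - b for t, b in zip(treated, baseline)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample tasks with replacement and record the mean difference each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(iterations))
    # One-sided p-value (assumed definition): share of resamples showing no improvement.
    p_value = sum(m <= 0 for m in means) / iterations
    # Percentile 95% confidence interval for the mean difference.
    ci = (means[int(0.025 * iterations)], means[int(0.975 * iterations) - 1])
    return sum(diffs) / n, ci, p_value


# Toy scores for illustration only; these are not the held-out results.
treated = [0.6, 0.5, 0.7, 0.4, 0.55, 0.65]
baseline = [0.5, 0.5, 0.6, 0.45, 0.5, 0.6]
delta, ci, p = paired_bootstrap(treated, baseline, iterations=2000)
print(round(delta, 3), ci, p)
```

Resampling whole tasks, rather than the two score lists independently, is what makes the test paired: each resample compares both conditions on the same tasks.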
192
+
193
+ ---
194
+
195
+ ## Known Limitations
196
+
197
+ **1. Dimension coverage gap (critical).**
198
+ The preference pairs contain 0 examples for `bench_commitment_honesty` and only 4 examples
199
+ for `icp_segment_appropriateness`, due to a scoring function key mismatch that made it impossible
200
+ to generate valid chosen outputs for these dimensions. The model received zero gradient signal on
201
+ bench commitment honesty, the highest SOW-breach-risk dimension. It cannot be trusted to gate
202
+ bench-commitment outputs.
203
+
204
+ **2. Delta A not significant at v0.1 scale.**
205
+ The +0.025 lift over the deterministic baseline is within the noise band (p=0.189). The model
206
+ does not reliably outperform `scoring_evaluator.py` on held-out tasks.
207
+
208
+ **3. Backbone below Prometheus-2 threshold.**
209
+ Prometheus-2 (Kim et al., 2024) demonstrated rubric-matching at 7B parameters. Qwen3.5-0.8B is
210
+ below that threshold. Capacity may be insufficient for simultaneous multi-dimension rubric generalization.
211
+
212
+ **4. Synthetic training distribution.**
213
+ All preference pairs derive from synthetic prospect briefs and LLM-generated emails. The model
214
+ may not generalize to real prospect data with industry-specific jargon or edge cases outside the
215
+ training distribution.
216
+
217
+ **5. Static bench_summary.**
218
+ The judge was trained on snapshot bench capacities. In production the bench changes weekly, so
219
+ calibration for `bench_commitment_honesty` will drift over time.
220
+
221
+ ---
222
+
223
+ ## Files in This Repo
224
+
225
+ | File | Description |
226
+ |---|---|
227
+ | `model.safetensors-*` | Merged model weights (BF16) |
228
+ | `config.json` | Model architecture config |
229
+ | `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
230
+ | `train_judge.py` | Full ORPO training script |
231
+ | `hyperparams.json` | All hyperparameters (pinned) |
232
+ | `run_on_colab.ipynb` | End-to-end training notebook for T4 |
233
+ | `inference_example.py` | Inference helper with prompt templates |
234
+
235
+ Training data and preference pairs: [rafiakedir/tenacious-bench-v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
236
+
237
+ ---
238
+
239
+ ## Environmental Impact
240
+
241
+ - **Compute:** ~60–90 min on a single T4 GPU (3 epochs, 94 preference pairs)
242
+ - **CO₂e:** ~0.04 kg (T4 at 70 W × 1.5 h ≈ 0.105 kWh, × US grid 0.42 kg CO₂/kWh)
243
+ - **Infrastructure:** Google Colab free tier
244
+
245
+ ---
246
+
247
+ ## Citation
248
+
249
+ ```bibtex
250
+ @misc{tenacious-bench-adapter-2026,
251
+ title = {Tenacious-Bench Judge: ORPO Fine-Tuned Qwen3.5-0.8B for B2B Sales Evaluation},
252
+ author = {Kedir, Rafia},
253
+ year = {2026},
254
+ howpublished = {HuggingFace Model Hub},
255
+ url = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
256
+ }
257
+
258
+ @misc{tenacious-bench-v01-2026,
259
+ title = {Tenacious-Bench v0.1: B2B Sales Evaluation Benchmark},
260
+ author = {Kedir, Rafia},
261
+ year = {2026},
262
+ howpublished = {HuggingFace Datasets Hub},
263
+ url = {https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1}
264
+ }
265
+ ```
266
+ """
267
+
268
+ # ── Inference Example ─────────────────────────────────────────────────────────
269
+ INFERENCE_EXAMPLE = '''\
270
+ #!/usr/bin/env python3
271
+ """
272
  Inference helper for rafiakedir/tenacious-bench-adapter.
273
  Scores a candidate B2B sales email on one or all five Tenacious rubric dimensions.
274
  """
 
281
 
282
  DIMENSION_PROMPTS = {
283
  "signal_grounding_fidelity": (
284
+ "Dimension: signal_grounding_fidelity\\n"
285
  "Rubric: Every factual claim about the prospect must be supported by the hiring_signal_brief "
286
  "with confidence >= 0.60, or phrased as a question. Claims about growth, scale, or ML strategy "
287
  "without a high/medium-confidence signal in the brief must be recast as questions."
288
  ),
289
  "bench_commitment_honesty": (
290
+ "Dimension: bench_commitment_honesty\\n"
291
  "Rubric: The email must not promise or imply a number of engineers that exceeds "
292
  "the total available in the bench_summary. Any staffing commitment must stay within capacity."
293
  ),
294
  "icp_segment_appropriateness": (
295
+ "Dimension: icp_segment_appropriateness\\n"
296
  "Rubric: The email's language and pitch angle must match the correct ICP segment "
297
  "(Segment 1=growth-scale, Segment 2=cost-restructuring, Segment 3=consolidation, "
298
  "ABSTAIN=insufficient signal). A growth pitch to a post-layoff company is a mismatch."
299
  ),
300
  "competitor_gap_honesty": (
301
+ "Dimension: competitor_gap_honesty\\n"
302
  "Rubric: Any assertion about a competitor gap must be grounded in the competitor_gap_brief. "
303
  "The email must not assert that competitors have capabilities the prospect lacks "
304
  "unless the brief explicitly documents this gap."
305
  ),
306
  "tone_preservation": (
307
+ "Dimension: tone_preservation\\n"
308
  "Rubric: No re-engagement clichΓ©s ('just wanted to circle back', 'touching base', "
309
  "'following up'). No over-apologetic exits ('sorry for taking your time'). "
310
  "Calendar CTA required. Confident but not pushy."
311
  ),
312
  }
313
 
314
+ SYSTEM_TEMPLATE = """\
315
+ You are a rubric-aware judge for B2B outbound sales emails written by Tenacious Consulting.
316
  {dimension_prompt}
317
 
318
  Respond with a JSON object only:
319
  {{"score": <float 0.0-1.0>, "reasoning": "<one concise sentence explaining the score>"}}
320
  """
321
 
322
+ USER_TEMPLATE = """\
323
+ Context:
324
  {context_json}
325
 
326
  Candidate email:
327
  {candidate_output}
328
 
329
+ Score this output on the dimension above.\
330
+ """
331
 
332
 
333
  def load_model(model_id: str = MODEL_ID):
 
434
  "teams at your growth stage. Would a 30-minute scoping conversation make sense this week?"
435
  )
436
 
437
+ print("\\nScoring on signal_grounding_fidelity...")
438
  result = score(tokenizer, model, demo_input, demo_email, "signal_grounding_fidelity")
439
+ print(f" Score: {result[\'score\']:.2f}")
440
+ print(f" Reasoning: {result[\'reasoning\']}")
441
 
442
+ print("\\nScoring all dimensions...")
443
  all_results = score_all_dimensions(tokenizer, model, demo_input, demo_email)
444
  for dim, r in all_results.items():
445
  if dim == "mean_score":
446
  print(f" MEAN: {r:.3f}")
447
  else:
448
+ print(f" {dim}: {r[\'score\']:.2f} - {r[\'reasoning\'][:80]}")
449
+ '''
450
+
451
+
452
+ def main():
453
+     api = HfApi()
454
+     operations = []
455
+
456
+     def add_bytes(content: bytes, repo_path: str, label: str = ""):
457
+         lbl = label or repo_path
458
+         print(f" queuing {lbl} ({len(content):,} bytes)")
459
+         operations.append(CommitOperationAdd(
460
+             path_in_repo=repo_path,
461
+             path_or_fileobj=content,
462
+         ))
463
+
464
+     def add_file(local_path: Path, repo_path: str):
465
+         print(f" queuing {repo_path} ({local_path.stat().st_size:,} bytes)")
466
+         operations.append(CommitOperationAdd(
467
+             path_in_repo=repo_path,
468
+             path_or_fileobj=str(local_path),
469
+         ))
470
+
471
+     # Model card
472
+     add_bytes(MODEL_CARD.encode(), "README.md", "README.md (model card)")
473
+
474
+     # Inference example
475
+     add_bytes(INFERENCE_EXAMPLE.encode(), "inference_example.py")
476
+
477
+     # Training scripts
478
+     add_file(ROOT / "training" / "train_judge.py", "train_judge.py")
479
+     add_file(ROOT / "training" / "hyperparams.json", "hyperparams.json")
480
+     add_file(ROOT / "training" / "run_on_colab.ipynb", "run_on_colab.ipynb")
481
+     add_file(ROOT / "training" / "requirements_training.txt", "requirements_training.txt")
482
+
483
+     print(f"\nCommitting {len(operations)} files to {REPO_ID}...")
484
+     url = api.create_commit(
485
+         repo_id=REPO_ID,
486
+         repo_type="model",
487
+         operations=operations,
488
+         commit_message=(
489
+             "feat: add model card, inference example, and training scripts\n\n"
490
+             "- Proper model card with YAML frontmatter (base_model, tags, datasets)\n"
491
+             "- Honest eval results: Delta A p=0.189 not significant, DO NOT DEPLOY verdict\n"
492
+             "- Dimension coverage gap documented (bench_commitment_honesty=0 pairs)\n"
493
+             "- inference_example.py with per-dimension and all-dimensions scoring\n"
494
+             "- Training scripts: train_judge.py, hyperparams.json, run_on_colab.ipynb"
495
+         ),
496
+     )
497
+     print(f"\nDone. Commit URL: {url}")
498
+     print(f"Model: https://huggingface.co/rafiakedir/tenacious-bench-adapter")
499
+
500
+
501
+ if __name__ == "__main__":
502
+     main()