Commit e1374e8 (verified) by rafiakedir · 1 parent: e6334c7

feat: add model card, inference example, and training scripts

- Proper model card with YAML frontmatter (base_model, tags, datasets)
- Honest eval results: Delta A p=0.189 not significant, DO NOT DEPLOY verdict
- Dimension coverage gap documented (bench_commitment_honesty=0 pairs)
- inference_example.py with per-dimension and all-dimensions scoring
- Training scripts: train_judge.py, hyperparams.json, run_on_colab.ipynb

README.md CHANGED
@@ -1,21 +1,251 @@
1
  ---
2
  base_model: unsloth/Qwen3.5-0.8B
3
  tags:
4
- - text-generation-inference
5
- - transformers
6
  - unsloth
7
- - qwen3_5
8
- license: apache-2.0
9
- language:
10
- - en
11
  ---
12
 
13
- # Uploaded finetuned model
14
 
15
- - **Developed by:** rafiakedir
16
- - **License:** apache-2.0
17
- - **Finetuned from model :** unsloth/Qwen3.5-0.8B
18
 
19
- This qwen3_5 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
20
 
21
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
1
  ---
2
+ license: cc-by-4.0
3
+ language:
4
+ - en
5
  base_model: unsloth/Qwen3.5-0.8B
6
  tags:
7
+ - judge
8
+ - b2b-sales
9
+ - orpo
10
+ - preference-learning
11
+ - tenacious-bench
12
+ - evaluation
13
+ - qwen3
14
  - unsloth
15
+ datasets:
16
+ - rafiakedir/tenacious-bench-v0.1
17
+ ---
18
+
19
+ # Tenacious-Bench Judge — ORPO Fine-Tuned Qwen3.5-0.8B
20
+
21
+ A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on
22
+ [Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
23
+ preference pairs. Deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine:
24
+ the generator (DeepSeek V3.2) produces a candidate email; this judge scores it on five
25
+ rubric dimensions; outputs below threshold are rejected and regenerated.
26
+
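The rejection-sampling gate described above can be sketched roughly as follows. This is a minimal illustration, not the Conversion Engine's actual code: `gated_generate`, the `generate` and `judge` callables, the 0.5 threshold, the retry limit, and the rule that every scored dimension must clear the threshold are all assumptions made for the example.

```python
from typing import Callable, Optional


def gated_generate(
    generate: Callable[[], str],
    judge: Callable[[str], dict[str, float]],
    threshold: float = 0.5,   # hypothetical; the source does not state the value
    max_attempts: int = 3,    # hypothetical retry budget
) -> Optional[str]:
    """Return the first draft whose per-dimension scores all clear the threshold."""
    for _ in range(max_attempts):
        email = generate()
        scores = judge(email)  # e.g. {"tone_preservation": 0.8, ...}
        # Assumed aggregation: the weakest dimension decides acceptance.
        if scores and min(scores.values()) >= threshold:
            return email
    # Nothing passed within the budget; the caller decides how to fall back.
    return None
```

Returning `None` rather than the last rejected draft keeps the gate fail-closed, which seems the safer default when a missed violation is costly.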
27
+ **Base model:** `unsloth/Qwen3.5-0.8B`
28
+ **Training algorithm:** ORPO (no reference model — single forward pass)
29
+ **Weights:** Merged (full model, not a LoRA adapter)
30
+ **Precision:** BF16 · ~873M parameters · ~1.75 GB
31
+ **Context length:** 262,144 tokens
32
+ **Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)
33
+
34
  ---
35
 
36
+ ## What It Scores
37
+
38
+ | Dimension | Trigger Rate (Week 10 probes) | Risk if Missed |
39
+ |---|---|---|
40
+ | `signal_grounding_fidelity` | 35% | CTO credibility loss |
41
+ | `competitor_gap_honesty` | 45% | Irreversible brand damage |
42
+ | `icp_segment_appropriateness` | 20% | ~$480K ACV per error |
43
+ | `tone_preservation` | 15% | Brand voice violation |
44
+ | `bench_commitment_honesty` | 5% | SOW-breach / delivery failure |
45
+
46
+ ---
47
+
48
+ ## Quick Start — Inference
49
+
50
+ ```python
51
+ from transformers import AutoTokenizer, AutoModelForCausalLM
52
+ import torch
53
+
54
+ model_id = "rafiakedir/tenacious-bench-adapter"
55
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
56
+ model = AutoModelForCausalLM.from_pretrained(
57
+ model_id, torch_dtype=torch.bfloat16, device_map="auto"
58
+ )
59
+
60
+ SYSTEM = """You are a rubric-aware judge for B2B outbound sales emails.
61
+ Score the candidate output on the following dimension.
62
+
63
+ Dimension: signal_grounding_fidelity
64
+ Rubric: Every factual claim must resolve to a field in the hiring_signal_brief
65
+ with confidence >= 0.60, or be phrased as a question.
66
+
67
+ Respond with a JSON object: {"score": <0.0-1.0>, "reasoning": "<one sentence>"}"""
68
+
69
+ USER = """Hiring signal brief:
70
+ {
71
+ "company_name": "Acme Corp",
72
+ "open_roles": 3,
73
+ "confidence": "low",
74
+ "domain": "fintech"
75
+ }
76
+
77
+ Candidate email:
78
+ "Hi Alex — noticed Acme Corp is aggressively scaling its engineering team with 3 open roles.
79
+ We staff specialized capability-gap squads for fintech teams at your growth stage.
80
+ Would a 30-minute scoping conversation make sense this week?"
81
+
82
+ Score this output."""
83
+
84
+ messages = [
85
+ {"role": "system", "content": SYSTEM},
86
+ {"role": "user", "content": USER},
87
+ ]
88
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
89
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
90
+
91
+ with torch.no_grad():
92
+ out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
93
+
94
+ response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
95
+ print(response)
96
+ # Illustrative output (sampled, so the exact score and wording will vary): {"score": 0.4, "reasoning": "Claims 'aggressively scaling' but brief confidence is low; should be phrased as a question."}
97
+ ```
98
+
99
+ ---
100
+
101
+ ## Training Details
102
+
103
+ ### Why ORPO
104
+
105
+ ORPO (Hong et al., 2024) eliminates the reference model by computing the preference signal from
106
+ the log-odds ratio of chosen vs. rejected completions in a single forward pass. This reduces peak
107
+ VRAM by ~40% vs. DPO, making 3-epoch training feasible on a 16 GB T4 (the included
108
+ training scripts still enable Unsloth gradient checkpointing).
109
+
110
+ A discriminative judge is tuned for score calibration rather than generation quality, so it
111
+ plausibly needs a stronger preference signal. We ran `beta=0.1` per the paper's recommendation but note
112
+ that `beta=0.2`–`0.3` may better calibrate the preference margin for rubric-based scoring.
113
+
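To make the role of `beta` concrete, here is a small numeric sketch of the ORPO objective for one preference pair, following the paper's form (SFT loss plus a weighted odds-ratio term). The function names are illustrative, and the inputs are assumed to be length-averaged log-probabilities of the chosen and rejected completions; this is not the TRL implementation.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(
    avg_logp_chosen: float,
    avg_logp_rejected: float,
    nll_chosen: float,
    beta: float = 0.1,
) -> float:
    """SFT loss on the chosen completion plus beta times the odds-ratio penalty."""
    ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(ratio)), written as softplus(-ratio)
    odds_ratio_term = math.log1p(math.exp(-ratio))
    return nll_chosen + beta * odds_ratio_term
```

When the two completions are equally likely the odds-ratio term equals ln 2, and it shrinks as the chosen completion becomes more likely than the rejected one. Raising `beta` increases the weight of that term relative to the SFT loss.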
114
+ ### Preference Pair Construction
115
+
116
+ | Source | Count |
117
+ |---|---|
118
+ | Failing tasks → generated chosen (DeepSeek V3.2) | ~111 attempted |
119
+ | Passing tasks → generated rejected (DeepSeek V3.2) | ~41 attempted |
120
+ | **Final pairs after filtering** | **94** |
121
+
122
+ Filter: chosen score ≥ threshold AND rejected score < threshold AND TF-IDF cosine < 0.92.
123
+ Main rejection causes: chosen output still scoring below threshold (phrasing-mode sensitivity),
124
+ and ICP segment tasks with key mismatch making pass threshold structurally unachievable.
125
+
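The pair filter above can be sketched in a few lines of standard-library Python. Everything here is an assumption for illustration: the tokenizer, the smoothed IDF formula, fitting TF-IDF on only the two texts being compared, and the 0.5 score threshold (the source does not give the threshold value or the exact TF-IDF setup).

```python
import math
import re
from collections import Counter


def tfidf_cosine(a: str, b: str) -> float:
    """Cosine similarity of TF-IDF vectors fitted on just these two texts."""
    docs = [re.findall(r"[a-z0-9']+", text.lower()) for text in (a, b)]
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)

    def vectorize(tokens):
        counts = Counter(tokens)
        # A common smoothed IDF: ln((1 + n) / (1 + df)) + 1
        return {
            tok: c * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, c in counts.items()
        }

    va, vb = vectorize(docs[0]), vectorize(docs[1])
    dot = sum(w * vb.get(tok, 0.0) for tok, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keep_pair(chosen, rejected, chosen_score, rejected_score,
              threshold=0.5, max_similarity=0.92):
    """Apply the three filter conditions; threshold=0.5 is a placeholder."""
    return (
        chosen_score >= threshold
        and rejected_score < threshold
        and tfidf_cosine(chosen, rejected) < max_similarity
    )
```

The similarity cap drops pairs whose chosen and rejected texts are near-duplicates, since such pairs carry little preference signal.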
126
+ **Preference leakage prevention (Li et al., 2025):**
127
+ Generator (DeepSeek V3.2) ≠ judge family (Claude Sonnet 4.6 / `scoring_evaluator.py`).
128
+ All generation decisions logged in the dataset repo at `training_data/generation_log.jsonl`.
129
+
130
+ ### Hyperparameters
131
+
132
+ | Parameter | Value |
133
+ |---|---|
134
+ | Base model | `unsloth/Qwen3.5-0.8B` |
135
+ | LoRA rank | 16 |
136
+ | LoRA alpha | 32 |
137
+ | Target modules | q_proj, v_proj |
138
+ | LoRA dropout | 0.05 |
139
+ | Learning rate | 8e-6 |
140
+ | Batch size (per device) | 2 |
141
+ | Gradient accumulation | 4 (effective batch 8) |
142
+ | Epochs | 3 |
143
+ | Warmup ratio | 0.1 |
144
+ | LR scheduler | cosine |
145
+ | ORPO beta | 0.1 |
146
+ | Max sequence length | 1024 |
147
+ | Precision | FP16 on T4 (BF16 only on GPUs with compute capability ≥ 8.0) |
148
+ | Seed | 42 |
149
+
150
+ Training notebook: see `run_on_colab.ipynb` in this repo.
151
+
152
+ ---
153
+
154
+ ## Evaluation Results
155
+
156
+ Evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`.
157
+ Paired bootstrap significance test: 10,000 iterations, seed 42.
158
+
159
+ | Condition | Mean Score | vs. Baseline |
160
+ |---|---|---|
161
+ | Baseline (`scoring_evaluator.py` only) | 0.458 | — |
162
+ | **This model (ORPO Qwen3.5-0.8B)** | **0.483** | Δ=+0.025, p=0.189, not significant |
163
+ | Prompt-only (Qwen3-30B, zero-shot) | 0.504 | Trained minus prompt-only: Δ=−0.021, p=0.978 |
164
+
165
+ **Delta A** (trained vs. baseline): Δ=+0.025, 95% CI [−0.032, +0.081], p=0.189 — **not statistically significant**.
166
+
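For readers who want to reproduce the significance test, a paired bootstrap along these lines would match the stated settings (10,000 iterations, seed 42). This is a sketch, not the evaluation script: the function name is illustrative, and the one-sided p-value definition (share of resampled mean differences at or below zero) is an assumption, chosen because it is consistent with a negative delta reporting p close to 1.

```python
import random


def paired_bootstrap(treated, control, iterations=10_000, seed=42):
    """Resample per-task score differences with replacement."""
    diffs = [t - c for t, c in zip(treated, control)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(iterations):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Percentile 95% confidence interval on the mean difference
    lower = means[int(0.025 * iterations)]
    upper = means[int(0.975 * iterations) - 1]
    # Assumed one-sided p-value: how often the resampled lift is not positive
    p_value = sum(m <= 0 for m in means) / iterations
    return {"delta": sum(diffs) / n, "ci95": (lower, upper), "p_one_sided": p_value}
```

Pairing matters here because both conditions are scored on the same 59 tasks, so resampling task-level differences removes between-task variance from the comparison.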
167
+ **Delta B** (trained vs. prompt-only): not significant. Finding: `prompt_engineering_sufficient` —
168
+ the Qwen3-30B zero-shot condition is a viable lower-cost alternative at this scale of training data.
169
+ Note: Delta B compares a 0.8B trained model against a 30B zero-shot model — this conflates backbone
170
+ capacity with training benefit. A rigorous Delta B requires re-running the prompt-only condition on
171
+ `Qwen3.5-0.8B-Instruct` (no fine-tuning).
172
+
173
+ **Deployment recommendation for this run:** DO NOT DEPLOY as primary gate. Continue using
174
+ `scoring_evaluator.py` deterministically. Retrain with ≥150 pairs covering all 5 dimensions
175
+ before re-evaluating.
176
+
177
+ Full numbers: `ablation_results.json` in the dataset repo.
178
+
179
+ ---
180
+
181
+ ## Known Limitations
182
+
183
+ **1. Dimension coverage gap (critical).**
184
+ The preference pairs contain 0 examples for `bench_commitment_honesty` and only 4 examples
185
+ for `icp_segment_appropriateness`, due to a scoring function key mismatch that made it impossible
186
+ to generate valid chosen outputs for these dimensions. The model received zero gradient signal on
187
+ bench commitment honesty — the highest SOW-breach-risk dimension. It cannot be trusted to gate
188
+ bench-commitment outputs.
189
+
190
+ **2. Delta A not significant at v0.1 scale.**
191
+ The +0.025 lift over the deterministic baseline is within the noise band (p=0.189). The model
192
+ does not reliably outperform `scoring_evaluator.py` on held-out tasks.
193
+
194
+ **3. Backbone below Prometheus-2 threshold.**
195
+ Prometheus-2 (Kim et al., 2024) demonstrated rubric-matching at 7B parameters. Qwen3.5-0.8B is
196
+ below that threshold. Capacity may be insufficient for simultaneous multi-dimension rubric generalization.
197
+
198
+ **4. Synthetic training distribution.**
199
+ All preference pairs derive from synthetic prospect briefs and LLM-generated emails. The model
200
+ may not generalize to real prospect data with industry-specific jargon or edge cases outside the
201
+ training distribution.
202
+
203
+ **5. Static bench_summary.**
204
+ The judge was trained on snapshot bench capacities. In production the bench changes weekly —
205
+ calibration for `bench_commitment_honesty` will drift over time.
206
+
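The coverage gap in limitation 1 is the kind of problem a pre-training check could surface. The sketch below assumes each preference pair carries a `dimension` field (the training script pops a key with that name before building the dataset); the function name and the `min_pairs` floor of 10 are hypothetical.

```python
from collections import Counter

DIMENSIONS = (
    "signal_grounding_fidelity",
    "competitor_gap_honesty",
    "icp_segment_appropriateness",
    "tone_preservation",
    "bench_commitment_honesty",
)


def dimension_coverage(pairs, min_pairs=10):
    """Count pairs per rubric dimension and list the under-covered ones."""
    counts = Counter(p.get("dimension") for p in pairs)
    coverage = {dim: counts.get(dim, 0) for dim in DIMENSIONS}
    # min_pairs is a placeholder floor, not a value from the source
    under_covered = [dim for dim, n in coverage.items() if n < min_pairs]
    return coverage, under_covered
```

Running such a check on the loaded pairs before training, and aborting when any dimension is listed as under-covered, would flag a zero-pair dimension before compute is spent.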
207
+ ---
208
+
209
+ ## Files in This Repo
210
+
211
+ | File | Description |
212
+ |---|---|
213
+ | `model.safetensors-*` | Merged model weights (BF16) |
214
+ | `config.json` | Model architecture config |
215
+ | `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
216
+ | `train_judge.py` | Full ORPO training script |
217
+ | `hyperparams.json` | All hyperparameters (pinned) |
218
+ | `run_on_colab.ipynb` | End-to-end training notebook for T4 |
219
+ | `inference_example.py` | Inference helper with prompt templates |
220
+
221
+ Training data and preference pairs: [rafiakedir/tenacious-bench-v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
222
+
223
+ ---
224
+
225
+ ## Environmental Impact
226
+
227
+ - **Compute:** ~60–90 min on a single T4 GPU (3 epochs, 94 preference pairs)
228
+ - **CO₂e:** ~0.04 kg (T4 at 70 W × 1.5 h = 0.105 kWh; × US grid 0.42 kg CO₂/kWh ≈ 0.044 kg)
229
+ - **Infrastructure:** Google Colab free tier
230
+
231
+ ---
232
 
233
+ ## Citation
234
 
235
+ ```bibtex
236
+ @misc{tenacious-bench-adapter-2026,
237
+ title = {Tenacious-Bench Judge: ORPO Fine-Tuned Qwen3.5-0.8B for B2B Sales Evaluation},
238
+ author = {Kedir, Rafia},
239
+ year = {2026},
240
+ howpublished = {HuggingFace Model Hub},
241
+ url = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
242
+ }
243
 
244
+ @misc{tenacious-bench-v01-2026,
245
+ title = {Tenacious-Bench v0.1: B2B Sales Evaluation Benchmark},
246
+ author = {Kedir, Rafia},
247
+ year = {2026},
248
+ howpublished = {HuggingFace Datasets Hub},
249
+ url = {https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1}
250
+ }
251
+ ```
hyperparams.json ADDED
@@ -0,0 +1,41 @@
1
+ {
2
+ "model_id": "unsloth/Qwen2.5-1.5B-Instruct",
3
+ "training_algorithm": "ORPO",
4
+ "lora": {
5
+ "r": 16,
6
+ "lora_alpha": 32,
7
+ "target_modules": ["q_proj", "v_proj"],
8
+ "lora_dropout": 0.05,
9
+ "bias": "none",
10
+ "task_type": "CAUSAL_LM"
11
+ },
12
+ "orpo_trainer": {
13
+ "learning_rate": 8e-6,
14
+ "per_device_train_batch_size": 2,
15
+ "gradient_accumulation_steps": 4,
16
+ "effective_batch_size": 8,
17
+ "num_train_epochs": 3,
18
+ "warmup_ratio": 0.1,
19
+ "lr_scheduler_type": "cosine",
20
+ "beta": 0.1,
21
+ "max_length": 1024,
22
+ "max_prompt_length": 512,
23
+ "logging_steps": 10,
24
+ "save_steps": 50,
25
+ "seed": 42
26
+ },
27
+ "precision": {
28
+ "bf16": false,
29
+ "fp16": true,
30
+ "note": "T4 GPU: fp16 only. Switch to bf16 on A100/4090."
31
+ },
32
+ "adapter_output_dir": "training/adapter",
33
+ "hub_model_id": "rafiakedir/tenacious-bench-adapter",
34
+ "fixed_seed": 42,
35
+ "rationale": {
36
+ "orpo_vs_dpo": "ORPO chosen over DPO because it requires no reference model, reducing GPU memory footprint by ~40% on T4. Reference-free approach is appropriate for a judge component where the reference policy is undefined.",
37
+ "backbone_choice": "Qwen2.5-1.5B-Instruct selected per Prometheus-2 paper (Kim et al., 2024) showing 7B-class judge viability at 1.5B with preference tuning.",
38
+ "lora_rank": "Rank 16 with alpha 32 (2:1 ratio) is standard for task-specific adaptation. Rank 8 was considered but judge rubric complexity warrants higher rank.",
39
+ "beta_orpo": "Beta=0.1 follows ORPO paper (Hong et al., 2024) recommendation for instruction-following tasks."
40
+ }
41
+ }
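A quick way to sanity-check these trainer settings against the 94-pair dataset is to work out the optimizer-step budget. The sketch below is plain arithmetic with an illustrative function name; the exact step count depends on how the trainer version handles the final partial accumulation window, so both a floor and a ceiling estimate are shown.

```python
import math


def step_budget(num_pairs=94, per_device_batch=2, grad_accum=4, epochs=3,
                warmup_ratio=0.1, logging_steps=10, save_steps=50):
    micro_batches = math.ceil(num_pairs / per_device_batch)  # batches per epoch
    # Lower bound: the trailing partial accumulation window is dropped
    steps_low = (micro_batches // grad_accum) * epochs
    # Upper bound: the trailing partial window still yields an optimizer step
    steps_high = math.ceil(micro_batches / grad_accum) * epochs
    return {
        "micro_batches_per_epoch": micro_batches,
        "total_steps": (steps_low, steps_high),
        "warmup_steps": (math.ceil(warmup_ratio * steps_low),
                         math.ceil(warmup_ratio * steps_high)),
        "loss_log_points": (steps_low // logging_steps, steps_high // logging_steps),
        "checkpoint_saved_mid_run": steps_high >= save_steps,
    }
```

With the values in this file the run has roughly 33 to 36 optimizer steps, so `logging_steps=10` yields only three logged loss values and `save_steps=50` is never reached during training.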
inference_example.py ADDED
@@ -0,0 +1,176 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Inference helper for rafiakedir/tenacious-bench-adapter.
4
+ Scores a candidate B2B sales email on one or all five Tenacious rubric dimensions.
5
+ """
6
+
7
+ import json
8
+ import torch
9
+ from transformers import AutoTokenizer, AutoModelForCausalLM
10
+
11
+ MODEL_ID = "rafiakedir/tenacious-bench-adapter"
12
+
13
+ DIMENSION_PROMPTS = {
14
+ "signal_grounding_fidelity": (
15
+ "Dimension: signal_grounding_fidelity\n"
16
+ "Rubric: Every factual claim about the prospect must be supported by the hiring_signal_brief "
17
+ "with confidence >= 0.60, or phrased as a question. Claims about growth, scale, or ML strategy "
18
+ "without a high/medium-confidence signal in the brief must be recast as questions."
19
+ ),
20
+ "bench_commitment_honesty": (
21
+ "Dimension: bench_commitment_honesty\n"
22
+ "Rubric: The email must not promise or imply a number of engineers that exceeds "
23
+ "the total available in the bench_summary. Any staffing commitment must stay within capacity."
24
+ ),
25
+ "icp_segment_appropriateness": (
26
+ "Dimension: icp_segment_appropriateness\n"
27
+ "Rubric: The email's language and pitch angle must match the correct ICP segment "
28
+ "(Segment 1=growth-scale, Segment 2=cost-restructuring, Segment 3=consolidation, "
29
+ "ABSTAIN=insufficient signal). A growth pitch to a post-layoff company is a mismatch."
30
+ ),
31
+ "competitor_gap_honesty": (
32
+ "Dimension: competitor_gap_honesty\n"
33
+ "Rubric: Any assertion about a competitor gap must be grounded in the competitor_gap_brief. "
34
+ "The email must not assert that competitors have capabilities the prospect lacks "
35
+ "unless the brief explicitly documents this gap."
36
+ ),
37
+ "tone_preservation": (
38
+ "Dimension: tone_preservation\n"
39
+ "Rubric: No re-engagement clichés ('just wanted to circle back', 'touching base', "
40
+ "'following up'). No over-apologetic exits ('sorry for taking your time'). "
41
+ "Calendar CTA required. Confident but not pushy."
42
+ ),
43
+ }
44
+
45
+ SYSTEM_TEMPLATE = """You are a rubric-aware judge for B2B outbound sales emails written by Tenacious Consulting.
46
+ {dimension_prompt}
47
+
48
+ Respond with a JSON object only:
49
+ {{"score": <float 0.0-1.0>, "reasoning": "<one concise sentence explaining the score>"}}
50
+ """
51
+
52
+ USER_TEMPLATE = """Context:
53
+ {context_json}
54
+
55
+ Candidate email:
56
+ {candidate_output}
57
+
58
+ Score this output on the dimension above."""
59
+
60
+
61
+ def load_model(model_id: str = MODEL_ID):
62
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
63
+ model = AutoModelForCausalLM.from_pretrained(
64
+ model_id,
65
+ torch_dtype=torch.bfloat16,
66
+ device_map="auto",
67
+ )
68
+ model.eval()
69
+ return tokenizer, model
70
+
71
+
72
+ def score(
73
+ tokenizer,
74
+ model,
75
+ task_input: dict,
76
+ candidate_output: str,
77
+ dimension: str,
78
+ max_new_tokens: int = 150,
79
+ ) -> dict:
80
+ """
81
+ Score a single candidate output on one rubric dimension.
82
+
83
+ Args:
84
+ task_input: dict with keys like 'hiring_signal_brief', 'bench_summary', etc.
85
+ candidate_output: the email text to score
86
+ dimension: one of the five Tenacious rubric dimensions
87
+ Returns:
88
+ dict with 'score' (float) and 'reasoning' (str)
89
+ """
90
+ if dimension not in DIMENSION_PROMPTS:
91
+ raise ValueError(f"Unknown dimension: {dimension}. Choose from {list(DIMENSION_PROMPTS)}")
92
+
93
+ context_json = json.dumps(task_input, indent=2)
94
+ system = SYSTEM_TEMPLATE.format(dimension_prompt=DIMENSION_PROMPTS[dimension])
95
+ user = USER_TEMPLATE.format(
96
+ context_json=context_json,
97
+ candidate_output=candidate_output,
98
+ )
99
+
100
+ messages = [
101
+ {"role": "system", "content": system},
102
+ {"role": "user", "content": user},
103
+ ]
104
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
105
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
106
+
107
+ with torch.no_grad():
108
+ out = model.generate(
109
+ **inputs,
110
+ max_new_tokens=max_new_tokens,
111
+ temperature=0.1,
112
+ do_sample=True,
113
+ pad_token_id=tokenizer.eos_token_id,
114
+ )
115
+
116
+ response = tokenizer.decode(
117
+ out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
118
+ ).strip()
119
+
120
+ # Parse JSON from response
121
+ try:
122
+ # Take the span from the first '{' to the last '}'
123
+ start = response.find("{")
124
+ end = response.rfind("}") + 1
125
+ result = json.loads(response[start:end])
126
+ return {"score": float(result["score"]), "reasoning": result.get("reasoning", "")}
127
+ except Exception:
128
+ return {"score": 0.5, "reasoning": f"parse_error: {response[:200]}"}
129
+
130
+
131
+ def score_all_dimensions(tokenizer, model, task_input: dict, candidate_output: str) -> dict:
132
+ """Score a candidate output on all five dimensions."""
133
+ results = {}
134
+ for dim in DIMENSION_PROMPTS:
135
+ results[dim] = score(tokenizer, model, task_input, candidate_output, dim)
136
+ results["mean_score"] = sum(r["score"] for r in results.values()) / len(results)
137
+ return results
138
+
139
+
140
+ # ── Demo ──────────────────────────────────────────────────────────────────────
141
+ if __name__ == "__main__":
142
+ print(f"Loading {MODEL_ID}...")
143
+ tokenizer, model = load_model()
144
+
145
+ demo_input = {
146
+ "hiring_signal_brief": {
147
+ "company_name": "Acme Corp",
148
+ "domain": "fintech",
149
+ "open_roles": 3,
150
+ "confidence": "low",
151
+ "stage": "Series B",
152
+ },
153
+ "bench_summary": {
154
+ "total_available": 8,
155
+ "specializations": ["Python", "Go", "ML Engineering"],
156
+ },
157
+ }
158
+
159
+ demo_email = (
160
+ "Hi Alex — noticed Acme Corp is aggressively scaling its engineering team "
161
+ "with 3 open roles. We staff specialized capability-gap squads for fintech "
162
+ "teams at your growth stage. Would a 30-minute scoping conversation make sense this week?"
163
+ )
164
+
165
+ print("\nScoring on signal_grounding_fidelity...")
166
+ result = score(tokenizer, model, demo_input, demo_email, "signal_grounding_fidelity")
167
+ print(f" Score: {result['score']:.2f}")
168
+ print(f" Reasoning: {result['reasoning']}")
169
+
170
+ print("\nScoring all dimensions...")
171
+ all_results = score_all_dimensions(tokenizer, model, demo_input, demo_email)
172
+ for dim, r in all_results.items():
173
+ if dim == "mean_score":
174
+ print(f" MEAN: {r:.3f}")
175
+ else:
176
+ print(f" {dim}: {r['score']:.2f} — {r['reasoning'][:80]}")
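One design point in `score` is worth a second look: a parse failure returns a score of 0.5, which may pass or fail a gate depending on the threshold. A fail-closed alternative is sketched below. It is an illustration only; `parse_judge_response` is a hypothetical helper, and returning `None` on any malformed or out-of-range response is an assumed policy, not the author's.

```python
import json
from typing import Optional


def parse_judge_response(response: str) -> Optional[dict]:
    """Return {'score', 'reasoning'} from the first valid JSON object, else None."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores any trailing text after it.
            obj, _ = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "score" in obj:
            try:
                score = float(obj["score"])
            except (TypeError, ValueError):
                return None
            if 0.0 <= score <= 1.0:
                return {"score": score, "reasoning": str(obj.get("reasoning", ""))}
            return None  # out-of-range scores are treated as invalid
        start = response.find("{", start + 1)
    return None
```

A caller can then treat `None` as a rejection (or a retry), so a malformed judge response never silently contributes a mid-range score to `mean_score`.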
requirements_training.txt ADDED
@@ -0,0 +1,11 @@
1
+ unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git
2
+ trl==0.12.2
3
+ peft==0.14.0
4
+ transformers==4.47.1
5
+ datasets==3.2.0
6
+ accelerate==1.2.1
7
+ bitsandbytes==0.45.0
8
+ sentencepiece==0.2.0
9
+ protobuf==5.29.2
10
+ torch==2.5.1
11
+ xformers==0.0.28.post3
run_on_colab.ipynb ADDED
@@ -0,0 +1,77 @@
1
+ {
2
+ "nbformat": 4,
3
+ "nbformat_minor": 0,
4
+ "metadata": {
5
+ "kernelspec": {"display_name": "Python 3", "name": "python3"},
6
+ "language_info": {"name": "python"},
7
+ "accelerator": "GPU",
8
+ "colab": {"provenance": [], "gpuType": "T4", "name": "tenacious_bench_orpo_training.ipynb"}
9
+ },
10
+ "cells": [
11
+ {
12
+ "cell_type": "markdown",
13
+ "metadata": {},
14
+ "source": ["# Tenacious-Bench ORPO Judge Training\n\n**Trains Qwen2.5-1.5B-Instruct** with LoRA using ORPO on Tenacious-specific rubric preference pairs.\n\nRuntime: T4 GPU (Colab free tier) \nExpected training time: ~45-90 minutes for 3 epochs\n\n## Setup\n1. Set HF_TOKEN and OPENROUTER_API_KEY in Colab Secrets (key icon in left sidebar)\n2. Run all cells in order\n"]
15
+ },
16
+ {
17
+ "cell_type": "code",
18
+ "metadata": {},
19
+ "source": ["# Step 1: Check GPU\nimport subprocess\nresult = subprocess.run(['nvidia-smi'], capture_output=True, text=True)\nprint(result.stdout[:500])"]
20
+ },
21
+ {
22
+ "cell_type": "code",
23
+ "metadata": {},
24
+ "source": ["# Step 2: Install Unsloth and dependencies (pinned versions)\n!pip install -q 'unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git'\n!pip install -q trl==0.12.2 peft==0.14.0 transformers==4.47.1 datasets==3.2.0\n!pip install -q accelerate==1.2.1 bitsandbytes==0.45.0\nprint('Installation complete')"]
25
+ },
26
+ {
27
+ "cell_type": "code",
28
+ "metadata": {},
29
+ "source": ["# Step 3: Clone the repo\nimport os\nfrom google.colab import userdata\n\nHF_TOKEN = userdata.get('HF_TOKEN')\nOPENROUTER_API_KEY = userdata.get('OPENROUTER_API_KEY')\n\nos.environ['HF_TOKEN'] = HF_TOKEN\nos.environ['OPENROUTER_API_KEY'] = OPENROUTER_API_KEY\n\n!git clone https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1 /content/tenacious-bench-data\nprint('Repo cloned')"]
30
+ },
31
+ {
32
+ "cell_type": "code",
33
+ "metadata": {},
34
+ "source": ["# Step 4: Load preference pairs\nimport json\nfrom pathlib import Path\n\npairs_path = Path('/content/tenacious-bench-data/training_data/preference_pairs.jsonl')\npairs = []\nwith open(pairs_path) as f:\n for line in f:\n p = json.loads(line)\n pairs.append({'prompt': p['prompt'], 'chosen': p['chosen'], 'rejected': p['rejected']})\n\nprint(f'Loaded {len(pairs)} preference pairs')\nprint(f'Sample pair task context (first 200 chars of prompt):')\nprint(pairs[0]['prompt'][:200])"]
35
+ },
36
+ {
37
+ "cell_type": "code",
38
+ "metadata": {},
39
+ "source": ["# Step 5: Load Unsloth model with 4-bit quantization\nfrom unsloth import FastLanguageModel\nimport torch\n\nMAX_SEQ_LENGTH = 1024\nDTYPE = None # auto-detect\nLOAD_IN_4BIT = True\n\nmodel, tokenizer = FastLanguageModel.from_pretrained(\n model_name='unsloth/Qwen2.5-1.5B-Instruct',\n max_seq_length=MAX_SEQ_LENGTH,\n dtype=DTYPE,\n load_in_4bit=LOAD_IN_4BIT,\n)\nprint('Model loaded')"]
40
+ },
41
+ {
42
+ "cell_type": "code",
43
+ "metadata": {},
44
+ "source": ["# Step 6: Apply LoRA\nmodel = FastLanguageModel.get_peft_model(\n model,\n r=16,\n target_modules=['q_proj', 'v_proj'],\n lora_alpha=32,\n lora_dropout=0.05,\n bias='none',\n use_gradient_checkpointing='unsloth',\n random_state=42,\n)\nprint('LoRA applied')"]
45
+ },
46
+ {
47
+ "cell_type": "code",
48
+ "metadata": {},
49
+ "source": ["# Step 7: Set up ORPO trainer\nimport random\nimport numpy as np\nfrom datasets import Dataset\nfrom trl import ORPOConfig, ORPOTrainer\n\n# Fixed seed\nrandom.seed(42)\nnp.random.seed(42)\ntorch.manual_seed(42)\n\n# Detect precision\ncap = torch.cuda.get_device_capability()\nuse_fp16 = cap[0] < 8 # T4 uses fp16\nuse_bf16 = cap[0] >= 8 # A100/4090 use bf16\nprint(f'GPU compute capability: {cap}, fp16={use_fp16}, bf16={use_bf16}')\n\ndataset = Dataset.from_list(pairs)\n\ntraining_args = ORPOConfig(\n output_dir='/content/tenacious-adapter',\n learning_rate=8e-6,\n per_device_train_batch_size=2,\n gradient_accumulation_steps=4,\n num_train_epochs=3,\n warmup_ratio=0.1,\n lr_scheduler_type='cosine',\n beta=0.1,\n max_length=1024,\n max_prompt_length=512,\n logging_steps=10,\n save_steps=50,\n seed=42,\n fp16=use_fp16,\n bf16=use_bf16,\n report_to='none',\n remove_unused_columns=False,\n)\n\ntrainer = ORPOTrainer(\n model=model,\n args=training_args,\n train_dataset=dataset,\n tokenizer=tokenizer,\n)\nprint('Trainer initialized')"]
50
+ },
51
+ {
52
+ "cell_type": "code",
53
+ "metadata": {},
54
+ "source": ["# Step 8: Train\nprint('Starting ORPO training...')\ntrain_result = trainer.train()\nprint(f'Training complete!')\nprint(f'Metrics: {train_result.metrics}')"]
55
+ },
56
+ {
57
+ "cell_type": "code",
58
+ "metadata": {},
59
+ "source": ["# Step 9: Plot loss curve\nimport matplotlib.pyplot as plt\n\nlog_history = trainer.state.log_history\nsteps = [x['step'] for x in log_history if 'loss' in x]\nlosses = [x['loss'] for x in log_history if 'loss' in x]\n\nif steps:\n plt.figure(figsize=(10, 5))\n plt.plot(steps, losses, 'b-', linewidth=2, label='Training Loss')\n plt.xlabel('Step')\n plt.ylabel('Loss')\n plt.title('ORPO Training Loss — Tenacious Judge')\n plt.legend()\n plt.grid(True, alpha=0.3)\n plt.savefig('/content/loss_curve.png', dpi=150, bbox_inches='tight')\n plt.show()\n print(f'Initial loss: {losses[0]:.4f}')\n print(f'Final loss: {losses[-1]:.4f}')\n print(f'Loss decrease: {losses[0] - losses[-1]:.4f} ({(1-losses[-1]/losses[0])*100:.1f}%)')\nelse:\n print('No loss history available')"]
60
+ },
61
+ {
62
+ "cell_type": "code",
63
+ "metadata": {},
64
+ "source": ["# Step 10: Save adapter locally and push to HuggingFace\nADAPTER_DIR = '/content/tenacious-adapter'\n\nmodel.save_pretrained(ADAPTER_DIR)\ntokenizer.save_pretrained(ADAPTER_DIR)\nprint(f'Adapter saved to {ADAPTER_DIR}')\n\n# Push to HuggingFace\nHUB_MODEL_ID = 'rafiakedir/tenacious-bench-adapter'\nprint(f'Pushing to {HUB_MODEL_ID}...')\nmodel.push_to_hub(HUB_MODEL_ID, token=HF_TOKEN)\ntokenizer.push_to_hub(HUB_MODEL_ID, token=HF_TOKEN)\nprint(f'Adapter pushed to https://huggingface.co/{HUB_MODEL_ID}')"]
65
+ },
66
+ {
67
+ "cell_type": "code",
68
+ "metadata": {},
69
+ "source": ["# Step 11: Verify adapter on HuggingFace\nfrom huggingface_hub import HfApi\napi = HfApi(token=HF_TOKEN)\nfiles = api.list_repo_files(HUB_MODEL_ID)\nprint(f'Files in {HUB_MODEL_ID}:')\nfor f in files:\n print(f' {f}')"]
70
+ },
71
+ {
72
+ "cell_type": "code",
73
+ "metadata": {},
74
+ "source": ["# Step 12: Quick smoke test — run judge on one sample\nfrom peft import PeftModel\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\nJUDGE_SYSTEM = (\n 'You are evaluating outbound sales emails for Tenacious Consulting. '\n 'Score the following output on signal-grounding fidelity, bench commitment honesty, '\n 'ICP segment appropriateness, and Tenacious tone adherence. '\n 'Return JSON: {\\\"signal_grounding\\\": 0-1, \\\"bench_honesty\\\": 0-1, \\\"icp_segment\\\": 0-1, \\\"tone\\\": 0-1, \\\"overall\\\": 0-1}'\n)\n\ntest_email = '''Subject: TalentBridge's ML hiring + 30-min call\\n\\nHi Casey,\\nTalentBridge recently closed a Series A and currently has 8 open ML roles.\\nWe staff ML squads, typically 4 engineers in under 3 weeks.\\nWant to set up a 30-minute scoping conversation?\\n\\nBest,\\nYabi'''\n\nprompt_text = (\n f'<|im_start|>system\\n{JUDGE_SYSTEM}<|im_end|>\\n'\n f'<|im_start|>user\\n{test_email}<|im_end|>\\n'\n f'<|im_start|>assistant\\n'\n)\n\ninputs = tokenizer(prompt_text, return_tensors='pt').to(model.device)\nwith torch.no_grad():\n output = model.generate(**inputs, max_new_tokens=100, temperature=0.0, do_sample=False)\ngenerated = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)\nprint('Judge output:')\nprint(generated)"]
75
+ }
76
+ ]
77
+ }
train_judge.py ADDED
@@ -0,0 +1,204 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Day 5 — Tenacious-Bench ORPO Judge Training Script
4
+ Trains Qwen2.5-1.5B-Instruct with LoRA using ORPO (reference-free preference optimization).
5
+ Run on Colab T4 or locally with sufficient VRAM.
6
+ All hyperparameters are in hyperparams.json and replicated here for auditability.
7
+
8
+ Usage:
9
+ python train_judge.py [--data-path PATH] [--output-dir DIR]
10
+ """
11
+
12
+ import os
13
+ import sys
14
+ import json
15
+ import random
16
+ import logging
17
+ import datetime
18
+ import argparse
19
+ from pathlib import Path
20
+
21
+ import numpy as np
22
+
23
+ ROOT = Path(__file__).parent.parent
24
+ HYPERPARAMS_PATH = Path(__file__).parent / "hyperparams.json"
25
+ DATA_PATH = ROOT / "training_data/preference_pairs.jsonl"
26
+ OUTPUT_DIR = Path(__file__).parent / "adapter"
27
+ LOG_DIR = Path(__file__).parent
28
+
29
+ SEED = 42
30
+
31
+
32
+ def set_seed(seed: int):
33
+ random.seed(seed)
34
+ np.random.seed(seed)
35
+ try:
36
+ import torch
37
+ torch.manual_seed(seed)
38
+ if torch.cuda.is_available():
39
+ torch.cuda.manual_seed_all(seed)
40
+ except ImportError:
41
+ pass
42
+
43
+
44
+ def setup_logging(log_path: Path):
45
+ logging.basicConfig(
46
+ level=logging.INFO,
47
+ format="%(asctime)s [%(levelname)s] %(message)s",
48
+ handlers=[
49
+ logging.FileHandler(str(log_path)),
50
+ logging.StreamHandler(sys.stdout),
51
+ ],
52
+ )
53
+ return logging.getLogger(__name__)
54
+
55
+
56
+ def detect_precision():
57
+ try:
58
+ import torch
59
+ if torch.cuda.is_available():
60
+ cap = torch.cuda.get_device_capability()
61
+ name = torch.cuda.get_device_name()
62
+ if cap[0] >= 8: # A100, A10, 4090 — bf16 capable
63
+ logging.info(f"GPU {name} (compute {cap[0]}.{cap[1]}) supports bf16")
64
+ return {"bf16": True, "fp16": False}
65
+ else: # T4, V100 — fp16 only
66
+ logging.info(f"GPU {name} (compute {cap[0]}.{cap[1]}) using fp16")
67
+ return {"bf16": False, "fp16": True}
68
+ except Exception:
69
+ pass
70
+ return {"bf16": False, "fp16": False}
71
+
72
+
+ def load_dataset(data_path: Path, logger):
+     from datasets import Dataset
+     pairs = []
+     with open(data_path) as f:
+         for line in f:
+             line = line.strip()
+             if line:
+                 pairs.append(json.loads(line))
+     logger.info(f"Loaded {len(pairs)} preference pairs from {data_path}")
+     for p in pairs:
+         p.pop("task_id", None)
+         p.pop("dimension", None)
+     return Dataset.from_list(pairs)
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--data-path", type=str, default=str(DATA_PATH))
+     parser.add_argument("--output-dir", type=str, default=str(OUTPUT_DIR))
+     parser.add_argument("--hub-token", type=str, default=os.environ.get("HF_TOKEN", ""))
+     args = parser.parse_args()
+
+     set_seed(SEED)
+
+     timestamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%S")
+     log_path = LOG_DIR / f"training_run_seed{SEED}_{timestamp}.log"
+     logger = setup_logging(log_path)
+
+     with open(HYPERPARAMS_PATH) as f:
+         hp = json.load(f)
+     logger.info(f"Hyperparameters: {json.dumps(hp, indent=2)}")
+
+     precision = detect_precision()
+     logger.info(f"Precision: {precision}")
+
+     # Load the base model through Unsloth (name comes from hyperparams.json)
+     logger.info(f"Loading {hp['model_id']} via Unsloth with 4-bit quantization...")
+     from unsloth import FastLanguageModel
+
+     model, tokenizer = FastLanguageModel.from_pretrained(
+         model_name=hp["model_id"],
+         max_seq_length=hp["orpo_trainer"]["max_length"],
+         dtype=None,  # auto-detect
+         load_in_4bit=True,
+     )
+
+     # Apply LoRA
+     logger.info(f"Applying LoRA: r={hp['lora']['r']}, alpha={hp['lora']['lora_alpha']}, "
+                 f"targets={hp['lora']['target_modules']}")
+     model = FastLanguageModel.get_peft_model(
+         model,
+         r=hp["lora"]["r"],
+         target_modules=hp["lora"]["target_modules"],
+         lora_alpha=hp["lora"]["lora_alpha"],
+         lora_dropout=hp["lora"]["lora_dropout"],
+         bias=hp["lora"]["bias"],
+         use_gradient_checkpointing="unsloth",
+         random_state=SEED,
+     )
+
+     # Load dataset
+     dataset = load_dataset(Path(args.data_path), logger)
+     logger.info(f"Dataset size: {len(dataset)}")
+
+     # Training arguments
+     from trl import ORPOConfig, ORPOTrainer
+
+     output_dir = Path(args.output_dir)
+     output_dir.mkdir(parents=True, exist_ok=True)
+
+     training_args = ORPOConfig(
+         output_dir=str(output_dir),
+         learning_rate=hp["orpo_trainer"]["learning_rate"],
+         per_device_train_batch_size=hp["orpo_trainer"]["per_device_train_batch_size"],
+         gradient_accumulation_steps=hp["orpo_trainer"]["gradient_accumulation_steps"],
+         num_train_epochs=hp["orpo_trainer"]["num_train_epochs"],
+         warmup_ratio=hp["orpo_trainer"]["warmup_ratio"],
+         lr_scheduler_type=hp["orpo_trainer"]["lr_scheduler_type"],
+         beta=hp["orpo_trainer"]["beta"],
+         max_length=hp["orpo_trainer"]["max_length"],
+         max_prompt_length=hp["orpo_trainer"]["max_prompt_length"],
+         logging_steps=hp["orpo_trainer"]["logging_steps"],
+         save_steps=hp["orpo_trainer"]["save_steps"],
+         seed=SEED,
+         bf16=precision["bf16"],
+         fp16=precision["fp16"],
+         report_to="none",
+         remove_unused_columns=False,
+     )
+
+     trainer = ORPOTrainer(
+         model=model,
+         args=training_args,
+         train_dataset=dataset,
+         tokenizer=tokenizer,
+     )
+
+     logger.info("Starting ORPO training...")
+     train_result = trainer.train()
+     logger.info(f"Training complete. Metrics: {train_result.metrics}")
+
+     # Save adapter locally
+     logger.info(f"Saving LoRA adapter to {output_dir}")
+     model.save_pretrained(str(output_dir))
+     tokenizer.save_pretrained(str(output_dir))
+
+     # Save training run log (copy log file to standard name)
+     standard_log = LOG_DIR / "training_run.log"
+     import shutil
+     shutil.copy(str(log_path), str(standard_log))
+     logger.info(f"Training log copied to {standard_log}")
+
+     # Push to HuggingFace
+     hub_model_id = hp.get("hub_model_id", "rafiakedir/tenacious-bench-adapter")
+     hub_token = args.hub_token or os.environ.get("HF_TOKEN", "")
+     if hub_token:
+         logger.info(f"Pushing adapter to HuggingFace: {hub_model_id}")
+         model.push_to_hub(hub_model_id, token=hub_token)
+         tokenizer.push_to_hub(hub_model_id, token=hub_token)
+         logger.info(f"Adapter pushed to https://huggingface.co/{hub_model_id}")
+     else:
+         logger.warning("HF_TOKEN not set; skipping HuggingFace push")
+
+     logger.info("=== TRAINING COMPLETE ===")
+     logger.info(f"Adapter saved to: {output_dir}")
+     logger.info(f"Log: {standard_log}")
+
+     return train_result.metrics
+
+
+ if __name__ == "__main__":
+     main()
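
The script above reads every training setting from `hyperparams.json`, so a missing key only surfaces as a `KeyError` after the model has already loaded. A minimal sketch of a pre-flight check follows. The key names are taken from the lookups in `train_judge.py`; the helper `missing_hyperparams` and all values in `example_hp` except `model_id` are hypothetical placeholders, not the settings used for the published adapter.

```python
# Keys that train_judge.py reads from hyperparams.json (copied from its lookups).
REQUIRED_KEYS = {
    "lora": ["r", "target_modules", "lora_alpha", "lora_dropout", "bias"],
    "orpo_trainer": [
        "learning_rate", "per_device_train_batch_size",
        "gradient_accumulation_steps", "num_train_epochs", "warmup_ratio",
        "lr_scheduler_type", "beta", "max_length", "max_prompt_length",
        "logging_steps", "save_steps",
    ],
}


def missing_hyperparams(hp):
    """Return dotted paths of keys the training script reads but hp lacks."""
    missing = [] if "model_id" in hp else ["model_id"]
    for section, keys in REQUIRED_KEYS.items():
        block = hp.get(section)
        if not isinstance(block, dict):
            # The whole section is absent, so report it once.
            missing.append(section)
            continue
        missing.extend(f"{section}.{k}" for k in keys if k not in block)
    return missing


# Placeholder values for illustration only (assumed, not from the real run).
example_hp = {
    "model_id": "unsloth/Qwen3.5-0.8B",
    "lora": {
        "r": 16, "target_modules": ["q_proj", "v_proj"],
        "lora_alpha": 16, "lora_dropout": 0.0, "bias": "none",
    },
    "orpo_trainer": {
        "learning_rate": 1e-5, "per_device_train_batch_size": 2,
        "gradient_accumulation_steps": 4, "num_train_epochs": 1,
        "warmup_ratio": 0.1, "lr_scheduler_type": "cosine", "beta": 0.1,
        "max_length": 1024, "max_prompt_length": 512,
        "logging_steps": 10, "save_steps": 100,
    },
}

if __name__ == "__main__":
    print(missing_hyperparams(example_hp))
```

`hub_model_id` is left out of the required set because the script falls back to a default when it is absent. Calling such a check right after `json.load` would fail fast, before any GPU memory is allocated.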