feat: add model card, inference example, and training scripts
- Proper model card with YAML frontmatter (base_model, tags, datasets)
- Honest eval results: Delta A p=0.189 not significant, DO NOT DEPLOY verdict
- Dimension coverage gap documented (bench_commitment_honesty=0 pairs)
- inference_example.py with per-dimension and all-dimensions scoring
- Training scripts: train_judge.py, hyperparams.json, run_on_colab.ipynb
- README.md +242 -12
- hyperparams.json +41 -0
- inference_example.py +176 -0
- requirements_training.txt +11 -0
- run_on_colab.ipynb +77 -0
- train_judge.py +204 -0
README.md
CHANGED
@@ -1,21 +1,251 @@
Removed lines (only those recoverable from the previous README):
- **License:** apache-2.0
- **Finetuned from model :** unsloth/Qwen3.5-0.8B
---
license: cc-by-4.0
language:
- en
base_model: unsloth/Qwen3.5-0.8B
tags:
- judge
- b2b-sales
- orpo
- preference-learning
- tenacious-bench
- evaluation
- qwen3
- unsloth
datasets:
- rafiakedir/tenacious-bench-v0.1
---

# Tenacious-Bench Judge — ORPO Fine-Tuned Qwen3.5-0.8B

A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on
[Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
preference pairs. Deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine:
the generator (DeepSeek V3.2) produces a candidate email; this judge scores it on five
rubric dimensions; outputs below threshold are rejected and regenerated.

**Base model:** `unsloth/Qwen3.5-0.8B`
**Training algorithm:** ORPO (no reference model — single forward pass)
**Weights:** Merged (full model, not a LoRA adapter)
**Precision:** BF16 · ~873M parameters · ~1.75 GB
**Context length:** 262,144 tokens
**Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)

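The rejection-sampling gate described above can be sketched roughly as follows. This is a minimal illustration, not the Tenacious Conversion Engine's code: `generate` and `judge` are hypothetical callables standing in for the DeepSeek generator and this judge, and the 0.7 threshold, the retry budget of 3, and the rule that every dimension must clear the threshold are all assumptions, since the card does not state them.

```python
from typing import Callable, Dict, Optional

# Dimension names come from the model card; everything else here is hypothetical.
DIMENSIONS = [
    "signal_grounding_fidelity",
    "competitor_gap_honesty",
    "icp_segment_appropriateness",
    "tone_preservation",
    "bench_commitment_honesty",
]


def gate(
    generate: Callable[[], str],
    judge: Callable[[str, str], float],
    threshold: float = 0.7,  # assumed value; the card does not give one
    max_attempts: int = 3,   # assumed retry budget
) -> Optional[str]:
    """Return the first candidate that clears the threshold on every dimension."""
    for _ in range(max_attempts):
        candidate = generate()
        scores: Dict[str, float] = {dim: judge(candidate, dim) for dim in DIMENSIONS}
        if min(scores.values()) >= threshold:
            return candidate
    return None  # every attempt was rejected; the caller decides what happens next
```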
---

## What It Scores

| Dimension | Trigger Rate (Week 10 probes) | Risk if Missed |
|---|---|---|
| `signal_grounding_fidelity` | 35% | CTO credibility loss |
| `competitor_gap_honesty` | 45% | Irreversible brand damage |
| `icp_segment_appropriateness` | 20% | ~$480K ACV per error |
| `tone_preservation` | 15% | Brand voice violation |
| `bench_commitment_honesty` | 5% | SOW-breach / delivery failure |

---

## Quick Start — Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "rafiakedir/tenacious-bench-adapter"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

SYSTEM = """You are a rubric-aware judge for B2B outbound sales emails.
Score the candidate output on the following dimension.

Dimension: signal_grounding_fidelity
Rubric: Every factual claim must resolve to a field in the hiring_signal_brief
with confidence >= 0.60, or be phrased as a question.

Respond with a JSON object: {"score": <0.0-1.0>, "reasoning": "<one sentence>"}"""

USER = """Hiring signal brief:
{
  "company_name": "Acme Corp",
  "open_roles": 3,
  "confidence": "low",
  "domain": "fintech"
}

Candidate email:
"Hi Alex — noticed Acme Corp is aggressively scaling its engineering team with 3 open roles.
We staff specialized capability-gap squads for fintech teams at your growth stage.
Would a 30-minute scoping conversation make sense this week?"

Score this output."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": USER},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# Example output (sampling is on, so the exact text may vary):
# {"score": 0.4, "reasoning": "Claims 'aggressively scaling' but brief confidence is low — should be phrased as a question."}
```

---

## Training Details

### Why ORPO

ORPO (Hong et al., 2024) eliminates the reference model by computing the preference signal from
the log-odds ratio of chosen vs. rejected completions in a single forward pass. This reduces peak
VRAM by ~40% vs. DPO, making 3-epoch training feasible on a 16 GB T4 without gradient
checkpointing hacks.

For a discriminative judge (score calibration rather than generation quality), the
preference signal should be stronger. We ran `beta=0.1` per the paper's recommendation but note
that `beta=0.2`–`0.3` may better calibrate the preference margin for rubric-based scoring.

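To make the role of `beta` concrete, here is a small sketch of the ORPO objective for a single pair, written from the paper's formulation (supervised NLL on the chosen completion plus a weighted odds-ratio term). It is an illustration in plain Python and not the TRL implementation; the function names are hypothetical, and the inputs are assumed to be length-normalised (average per-token) log-probabilities, which must be negative.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)), where p = exp(avg_logp) is the length-normalised likelihood."""
    p = math.exp(avg_logp)
    return avg_logp - math.log1p(-p)


def orpo_loss(chosen_avg_logp: float, rejected_avg_logp: float, beta: float = 0.1) -> float:
    """NLL on the chosen completion plus beta times the odds-ratio penalty."""
    nll = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # -log(sigmoid(x)) rewritten as log(1 + exp(-x))
    ratio_penalty = math.log1p(math.exp(-log_odds_ratio))
    return nll + beta * ratio_penalty
```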
### Preference Pair Construction

| Source | Count |
|---|---|
| Failing tasks → generated chosen (DeepSeek V3.2) | ~111 attempted |
| Passing tasks → generated rejected (DeepSeek V3.2) | ~41 attempted |
| **Final pairs after filtering** | **94** |

Filter: chosen score ≥ threshold AND rejected score < threshold AND TF-IDF cosine < 0.92.
Main rejection causes: chosen outputs that still scored below threshold (phrasing-mode sensitivity),
and ICP-segment tasks where a scoring-function key mismatch made the pass threshold structurally unachievable.

**Preference leakage prevention (Li et al., 2025):**
Generator (DeepSeek V3.2) ≠ judge family (Claude Sonnet 4.6 / `scoring_evaluator.py`).
All generation decisions logged in the dataset repo at `training_data/generation_log.jsonl`.

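The filter rule above can be sketched as follows. This is a hedged illustration only: the pass threshold of 0.7 is a placeholder (the card does not give the value), the TF-IDF weighting is a simple smoothed variant fitted on just the two texts being compared, and the cosine is assumed to be taken between the chosen and rejected outputs. The actual pipeline may differ on all three points, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Tiny TF-IDF: raw term counts times a smoothed inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def keep_pair(
    chosen: str,
    rejected: str,
    chosen_score: float,
    rejected_score: float,
    threshold: float = 0.7,        # placeholder; not stated in the card
    max_similarity: float = 0.92,  # from the filter rule in the card
) -> bool:
    """Keep a pair only if chosen passes, rejected fails, and the texts differ enough."""
    chosen_vec, rejected_vec = tfidf_vectors([chosen, rejected])
    return (
        chosen_score >= threshold
        and rejected_score < threshold
        and cosine(chosen_vec, rejected_vec) < max_similarity
    )
```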
### Hyperparameters

| Parameter | Value |
|---|---|
| Base model | `unsloth/Qwen3.5-0.8B` |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Target modules | q_proj, v_proj |
| LoRA dropout | 0.05 |
| Learning rate | 8e-6 |
| Batch size (per device) | 2 |
| Gradient accumulation | 4 (effective batch 8) |
| Epochs | 3 |
| Warmup ratio | 0.1 |
| LR scheduler | cosine |
| ORPO beta | 0.1 |
| Max sequence length | 1024 |
| Precision | FP16 on T4 (BF16 on A100/4090, per `hyperparams.json`) |
| Seed | 42 |

Training notebook: see `run_on_colab.ipynb` in this repo.

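As a rough sanity check on these settings, the number of optimizer steps can be estimated from the table. The sketch below assumes a single GPU and rounds partial batches up; the trainer's own rounding may differ by a step or two, and the helper name is hypothetical. With about 36 steps in total, a checkpoint interval of 50 steps (as set in `hyperparams.json`) would likely not trigger before training ends.

```python
import math


def schedule(num_pairs: int, per_device_batch: int, grad_accum: int,
             epochs: int, warmup_ratio: float) -> dict:
    """Estimate optimizer steps for a single-GPU run (partial batches rounded up)."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_pairs / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


plan = schedule(num_pairs=94, per_device_batch=2, grad_accum=4, epochs=3, warmup_ratio=0.1)
print(plan)
# {'effective_batch': 8, 'steps_per_epoch': 12, 'total_steps': 36, 'warmup_steps': 4}
```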
---

## Evaluation Results

Evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`.
Paired bootstrap significance test: 10,000 iterations, seed 42.

| Condition | Mean Score | vs. Baseline |
|---|---|---|
| Baseline (`scoring_evaluator.py` only) | 0.458 | — |
| **This model (ORPO Qwen3.5-0.8B)** | **0.483** | Δ=+0.025, p=0.189, not significant |
| Prompt-only (Qwen3-30B, zero-shot) | 0.504 | Δ=−0.021 (trained − prompt-only), p=0.978 |

**Delta A** (trained vs. baseline): Δ=+0.025, 95% CI [−0.032, +0.081], p=0.189 — **not statistically significant**.

**Delta B** (trained vs. prompt-only): not significant. Finding: `prompt_engineering_sufficient` —
the Qwen3-30B zero-shot condition is a viable lower-cost alternative at this scale of training data.
Note: Delta B compares a 0.8B trained model against a 30B zero-shot model — this conflates backbone
capacity with training benefit. A rigorous Delta B requires re-running the prompt-only condition on
`Qwen3.5-0.8B-Instruct` (no fine-tuning).

**Deployment recommendation for this run:** DO NOT DEPLOY as primary gate. Continue using
`scoring_evaluator.py` deterministically. Retrain with ≥150 pairs covering all 5 dimensions
before re-evaluating.

Full numbers: `ablation_results.json` in the dataset repo.

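For readers unfamiliar with the test, a paired bootstrap over per-task score differences can be sketched as below. This is a generic illustration, not the evaluation script from the dataset repo: the function name is hypothetical, the percentile confidence interval is one common choice, and the one-sided p-value (share of resampled mean deltas at or below zero) is an assumption about how the reported p-values were defined.

```python
import random
from typing import List, Tuple


def paired_bootstrap(
    treatment: List[float],
    baseline: List[float],
    iterations: int = 10_000,
    seed: int = 42,
) -> Tuple[float, float, float, float]:
    """Return (observed mean delta, 95% CI low, 95% CI high, one-sided p-value)."""
    if len(treatment) != len(baseline):
        raise ValueError("paired test needs one score per task in both conditions")
    rng = random.Random(seed)
    deltas = [t - b for t, b in zip(treatment, baseline)]
    n = len(deltas)
    observed = sum(deltas) / n
    means = []
    for _ in range(iterations):
        # Resample tasks with replacement, keeping each task's pair together.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int(0.025 * iterations)]
    high = means[int(0.975 * iterations) - 1]
    p_value = sum(m <= 0.0 for m in means) / iterations
    return observed, low, high, p_value
```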
---

## Known Limitations

**1. Dimension coverage gap (critical).**
The preference pairs contain 0 examples for `bench_commitment_honesty` and only 4 examples
for `icp_segment_appropriateness`, due to a scoring function key mismatch that made it impossible
to generate valid chosen outputs for these dimensions. The model received zero gradient signal on
bench commitment honesty — the highest SOW-breach-risk dimension. It cannot be trusted to gate
bench-commitment outputs.

**2. Delta A not significant at v0.1 scale.**
The +0.025 lift over the deterministic baseline is within the noise band (p=0.189). The model
does not reliably outperform `scoring_evaluator.py` on held-out tasks.

**3. Backbone below Prometheus-2 threshold.**
Prometheus-2 (Kim et al., 2024) demonstrated rubric-matching at 7B parameters. Qwen3.5-0.8B is
below that threshold. Capacity may be insufficient for simultaneous multi-dimension rubric generalization.

**4. Synthetic training distribution.**
All preference pairs derive from synthetic prospect briefs and LLM-generated emails. The model
may not generalize to real prospect data with industry-specific jargon or edge cases outside the
training distribution.

**5. Static bench_summary.**
The judge was trained on snapshot bench capacities. In production the bench changes weekly —
calibration for `bench_commitment_honesty` will drift over time.

---

## Files in This Repo

| File | Description |
|---|---|
| `model.safetensors-*` | Merged model weights (BF16) |
| `config.json` | Model architecture config |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
| `train_judge.py` | Full ORPO training script |
| `hyperparams.json` | All hyperparameters (pinned) |
| `run_on_colab.ipynb` | End-to-end training notebook for T4 |
| `inference_example.py` | Inference helper with prompt templates |

Training data and preference pairs: [rafiakedir/tenacious-bench-v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)

---

## Environmental Impact

- **Compute:** ~60–90 min on a single T4 GPU (3 epochs, 94 preference pairs)
- **CO₂e:** ~0.04 kg (T4 at 70 W × 1.5 h = 0.105 kWh, × US grid 0.42 kg CO₂/kWh)
- **Infrastructure:** Google Colab free tier

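The emissions estimate can be reproduced from the stated inputs (70 W draw, 90 minutes, 0.42 kg CO₂ per kWh). The helper below is a hypothetical convenience function; it covers GPU draw only and ignores host and datacenter overhead.

```python
def co2e_kg(power_watts: float, hours: float, grid_kg_per_kwh: float) -> float:
    """Energy in kWh (W x h / 1000) times the grid's carbon intensity."""
    energy_kwh = power_watts * hours / 1000.0
    return energy_kwh * grid_kg_per_kwh


estimate = co2e_kg(power_watts=70, hours=1.5, grid_kg_per_kwh=0.42)
print(round(estimate, 3))  # 0.044
```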
---

## Citation

```bibtex
@misc{tenacious-bench-adapter-2026,
  title = {Tenacious-Bench Judge: ORPO Fine-Tuned Qwen3.5-0.8B for B2B Sales Evaluation},
  author = {Kedir, Rafia},
  year = {2026},
  howpublished = {HuggingFace Model Hub},
  url = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
}

@misc{tenacious-bench-v01-2026,
  title = {Tenacious-Bench v0.1: B2B Sales Evaluation Benchmark},
  author = {Kedir, Rafia},
  year = {2026},
  howpublished = {HuggingFace Datasets Hub},
  url = {https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1}
}
```
hyperparams.json
ADDED
@@ -0,0 +1,41 @@
{
  "model_id": "unsloth/Qwen2.5-1.5B-Instruct",
  "training_algorithm": "ORPO",
  "lora": {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj"],
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM"
  },
  "orpo_trainer": {
    "learning_rate": 8e-6,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "effective_batch_size": 8,
    "num_train_epochs": 3,
    "warmup_ratio": 0.1,
    "lr_scheduler_type": "cosine",
    "beta": 0.1,
    "max_length": 1024,
    "max_prompt_length": 512,
    "logging_steps": 10,
    "save_steps": 50,
    "seed": 42
  },
  "precision": {
    "bf16": false,
    "fp16": true,
    "note": "T4 GPU: fp16 only. Switch to bf16 on A100/4090."
  },
  "adapter_output_dir": "training/adapter",
  "hub_model_id": "rafiakedir/tenacious-bench-adapter",
  "fixed_seed": 42,
  "rationale": {
    "orpo_vs_dpo": "ORPO chosen over DPO because it requires no reference model, reducing GPU memory footprint by ~40% on T4. Reference-free approach is appropriate for a judge component where the reference policy is undefined.",
    "backbone_choice": "Qwen2.5-1.5B-Instruct selected per Prometheus-2 paper (Kim et al., 2024) showing 7B-class judge viability at 1.5B with preference tuning.",
    "lora_rank": "Rank 16 with alpha 32 (2:1 ratio) is standard for task-specific adaptation. Rank 8 was considered but judge rubric complexity warrants higher rank.",
    "beta_orpo": "Beta=0.1 follows ORPO paper (Hong et al., 2024) recommendation for instruction-following tasks."
  }
}
inference_example.py
ADDED
@@ -0,0 +1,176 @@
#!/usr/bin/env python3
"""
Inference helper for rafiakedir/tenacious-bench-adapter.
Scores a candidate B2B sales email on one or all five Tenacious rubric dimensions.
"""

import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "rafiakedir/tenacious-bench-adapter"

DIMENSION_PROMPTS = {
    "signal_grounding_fidelity": (
        "Dimension: signal_grounding_fidelity\n"
        "Rubric: Every factual claim about the prospect must be supported by the hiring_signal_brief "
        "with confidence >= 0.60, or phrased as a question. Claims about growth, scale, or ML strategy "
        "without a high/medium-confidence signal in the brief must be recast as questions."
    ),
    "bench_commitment_honesty": (
        "Dimension: bench_commitment_honesty\n"
        "Rubric: The email must not promise or imply a number of engineers that exceeds "
        "the total available in the bench_summary. Any staffing commitment must stay within capacity."
    ),
    "icp_segment_appropriateness": (
        "Dimension: icp_segment_appropriateness\n"
        "Rubric: The email's language and pitch angle must match the correct ICP segment "
        "(Segment 1=growth-scale, Segment 2=cost-restructuring, Segment 3=consolidation, "
        "ABSTAIN=insufficient signal). A growth pitch to a post-layoff company is a mismatch."
    ),
    "competitor_gap_honesty": (
        "Dimension: competitor_gap_honesty\n"
        "Rubric: Any assertion about a competitor gap must be grounded in the competitor_gap_brief. "
        "The email must not assert that competitors have capabilities the prospect lacks "
        "unless the brief explicitly documents this gap."
    ),
    "tone_preservation": (
        "Dimension: tone_preservation\n"
        "Rubric: No re-engagement clichés ('just wanted to circle back', 'touching base', "
        "'following up'). No over-apologetic exits ('sorry for taking your time'). "
        "Calendar CTA required. Confident but not pushy."
    ),
}

SYSTEM_TEMPLATE = """You are a rubric-aware judge for B2B outbound sales emails written by Tenacious Consulting.
{dimension_prompt}

Respond with a JSON object only:
{{"score": <float 0.0-1.0>, "reasoning": "<one concise sentence explaining the score>"}}
"""

USER_TEMPLATE = """Context:
{context_json}

Candidate email:
{candidate_output}

Score this output on the dimension above."""


def load_model(model_id: str = MODEL_ID):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    model.eval()
    return tokenizer, model


def score(
    tokenizer,
    model,
    task_input: dict,
    candidate_output: str,
    dimension: str,
    max_new_tokens: int = 150,
) -> dict:
    """
    Score a single candidate output on one rubric dimension.

    Args:
        task_input: dict with keys like 'hiring_signal_brief', 'bench_summary', etc.
        candidate_output: the email text to score
        dimension: one of the five Tenacious rubric dimensions
    Returns:
        dict with 'score' (float) and 'reasoning' (str)
    """
    if dimension not in DIMENSION_PROMPTS:
        raise ValueError(f"Unknown dimension: {dimension}. Choose from {list(DIMENSION_PROMPTS)}")

    context_json = json.dumps(task_input, indent=2)
    system = SYSTEM_TEMPLATE.format(dimension_prompt=DIMENSION_PROMPTS[dimension])
    user = USER_TEMPLATE.format(
        context_json=context_json,
        candidate_output=candidate_output,
    )

    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens, skipping the prompt.
    response = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()

    # Parse JSON from response
    try:
        # Take the span from the first "{" to the last "}"
        start = response.find("{")
        end = response.rfind("}") + 1
        result = json.loads(response[start:end])
        return {"score": float(result["score"]), "reasoning": result.get("reasoning", "")}
    except Exception:
        # Unparseable output falls back to a neutral 0.5 and keeps the raw text for debugging.
        return {"score": 0.5, "reasoning": f"parse_error: {response[:200]}"}


def score_all_dimensions(tokenizer, model, task_input: dict, candidate_output: str) -> dict:
    """Score a candidate output on all five dimensions."""
    results = {}
    for dim in DIMENSION_PROMPTS:
        results[dim] = score(tokenizer, model, task_input, candidate_output, dim)
    # The mean is computed before "mean_score" is inserted, so it averages the five dimensions only.
    results["mean_score"] = sum(r["score"] for r in results.values()) / len(results)
    return results


# ── Demo ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    print(f"Loading {MODEL_ID}...")
    tokenizer, model = load_model()

    demo_input = {
        "hiring_signal_brief": {
            "company_name": "Acme Corp",
            "domain": "fintech",
            "open_roles": 3,
            "confidence": "low",
            "stage": "Series B",
        },
        "bench_summary": {
            "total_available": 8,
            "specializations": ["Python", "Go", "ML Engineering"],
        },
    }

    demo_email = (
        "Hi Alex — noticed Acme Corp is aggressively scaling its engineering team "
        "with 3 open roles. We staff specialized capability-gap squads for fintech "
        "teams at your growth stage. Would a 30-minute scoping conversation make sense this week?"
    )

    print("\nScoring on signal_grounding_fidelity...")
    result = score(tokenizer, model, demo_input, demo_email, "signal_grounding_fidelity")
    print(f" Score: {result['score']:.2f}")
    print(f" Reasoning: {result['reasoning']}")

    print("\nScoring all dimensions...")
    all_results = score_all_dimensions(tokenizer, model, demo_input, demo_email)
    for dim, r in all_results.items():
        if dim == "mean_score":
            print(f" MEAN: {r:.3f}")
        else:
            print(f" {dim}: {r['score']:.2f} — {r['reasoning'][:80]}")
requirements_training.txt
ADDED
@@ -0,0 +1,11 @@
unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git
trl==0.12.2
peft==0.14.0
transformers==4.47.1
datasets==3.2.0
accelerate==1.2.1
bitsandbytes==0.45.0
sentencepiece==0.2.0
protobuf==5.29.2
torch==2.5.1
xformers==0.0.28.post3
run_on_colab.ipynb
ADDED
|
@@ -0,0 +1,77 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"nbformat": 4,
|
| 3 |
+
"nbformat_minor": 0,
|
| 4 |
+
"metadata": {
|
| 5 |
+
"kernelspec": {"display_name": "Python 3", "name": "python3"},
|
| 6 |
+
"language_info": {"name": "python"},
|
| 7 |
+
"accelerator": "GPU",
|
| 8 |
+
"colab": {"provenance": [], "gpuType": "T4", "name": "tenacious_bench_orpo_training.ipynb"}
|
| 9 |
+
},
|
| 10 |
+
"cells": [
|
| 11 |
+
{
|
| 12 |
+
"cell_type": "markdown",
|
| 13 |
+
"metadata": {},
|
| 14 |
+
"source": ["# Tenacious-Bench ORPO Judge Training\n\n**Trains Qwen2.5-1.5B-Instruct** with LoRA using ORPO on Tenacious-specific rubric preference pairs.\n\nRuntime: T4 GPU (Colab free tier) \nExpected training time: ~45-90 minutes for 3 epochs\n\n## Setup\n1. Set HF_TOKEN and OPENROUTER_API_KEY in Colab Secrets (key icon in left sidebar)\n2. Run all cells in order\n"]
|
| 15 |
+
},
|
| 16 |
+
{
|
| 17 |
+
"cell_type": "code",
|
| 18 |
+
"metadata": {},
|
| 19 |
+
"source": ["# Step 1: Check GPU\nimport subprocess\nresult = subprocess.run(['nvidia-smi'], capture_output=True, text=True)\nprint(result.stdout[:500])"]
|
| 20 |
+
},
|
| 21 |
+
{
|
| 22 |
+
"cell_type": "code",
|
| 23 |
+
"metadata": {},
|
| 24 |
+
"source": ["# Step 2: Install Unsloth and dependencies (pinned versions)\n!pip install -q 'unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git'\n!pip install -q trl==0.12.2 peft==0.14.0 transformers==4.47.1 datasets==3.2.0\n!pip install -q accelerate==1.2.1 bitsandbytes==0.45.0\nprint('Installation complete')"]
|
| 25 |
+
},
|
| 26 |
+
{
|
| 27 |
+
"cell_type": "code",
|
| 28 |
+
"metadata": {},
|
| 29 |
+
"source": ["# Step 3: Clone the repo\nimport os\nfrom google.colab import userdata\n\nHF_TOKEN = userdata.get('HF_TOKEN')\nOPENROUTER_API_KEY = userdata.get('OPENROUTER_API_KEY')\n\nos.environ['HF_TOKEN'] = HF_TOKEN\nos.environ['OPENROUTER_API_KEY'] = OPENROUTER_API_KEY\n\n!git clone https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1 /content/tenacious-bench-data\nprint('Repo cloned')"]
|
| 30 |
+
},
|
| 31 |
+
{
|
| 32 |
+
"cell_type": "code",
|
| 33 |
+
"metadata": {},
|
| 34 |
+
"source": ["# Step 4: Load preference pairs\nimport json\nfrom pathlib import Path\n\npairs_path = Path('/content/tenacious-bench-data/training_data/preference_pairs.jsonl')\npairs = []\nwith open(pairs_path) as f:\n for line in f:\n p = json.loads(line)\n pairs.append({'prompt': p['prompt'], 'chosen': p['chosen'], 'rejected': p['rejected']})\n\nprint(f'Loaded {len(pairs)} preference pairs')\nprint(f'Sample pair task context (first 200 chars of prompt):')\nprint(pairs[0]['prompt'][:200])"]
|
| 35 |
+
},
|
| 36 |
+
{
|
| 37 |
+
"cell_type": "code",
|
| 38 |
+
"metadata": {},
|
| 39 |
+
"source": ["# Step 5: Load Unsloth model with 4-bit quantization\nfrom unsloth import FastLanguageModel\nimport torch\n\nMAX_SEQ_LENGTH = 1024\nDTYPE = None # auto-detect\nLOAD_IN_4BIT = True\n\nmodel, tokenizer = FastLanguageModel.from_pretrained(\n model_name='unsloth/Qwen2.5-1.5B-Instruct',\n max_seq_length=MAX_SEQ_LENGTH,\n dtype=DTYPE,\n load_in_4bit=LOAD_IN_4BIT,\n)\nprint('Model loaded')"]
|
| 40 |
+
},
|
| 41 |
+
{
|
| 42 |
+
"cell_type": "code",
|
| 43 |
+
"metadata": {},
|
| 44 |
+
"source": ["# Step 6: Apply LoRA\nmodel = FastLanguageModel.get_peft_model(\n model,\n r=16,\n target_modules=['q_proj', 'v_proj'],\n lora_alpha=32,\n lora_dropout=0.05,\n bias='none',\n use_gradient_checkpointing='unsloth',\n random_state=42,\n)\nprint('LoRA applied')"]
|
| 45 |
+
},
|
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": ["# Step 7: Set up ORPO trainer\nimport random\nimport numpy as np\nfrom datasets import Dataset\nfrom trl import ORPOConfig, ORPOTrainer\n\n# Fixed seed\nrandom.seed(42)\nnp.random.seed(42)\ntorch.manual_seed(42)\n\n# Detect precision\ncap = torch.cuda.get_device_capability()\nuse_fp16 = cap[0] < 8  # T4 uses fp16\nuse_bf16 = cap[0] >= 8  # A100/4090 use bf16\nprint(f'GPU compute capability: {cap}, fp16={use_fp16}, bf16={use_bf16}')\n\ndataset = Dataset.from_list(pairs)\n\ntraining_args = ORPOConfig(\n    output_dir='/content/tenacious-adapter',\n    learning_rate=8e-6,\n    per_device_train_batch_size=2,\n    gradient_accumulation_steps=4,\n    num_train_epochs=3,\n    warmup_ratio=0.1,\n    lr_scheduler_type='cosine',\n    beta=0.1,\n    max_length=1024,\n    max_prompt_length=512,\n    logging_steps=10,\n    save_steps=50,\n    seed=42,\n    fp16=use_fp16,\n    bf16=use_bf16,\n    report_to='none',\n    remove_unused_columns=False,\n)\n\ntrainer = ORPOTrainer(\n    model=model,\n    args=training_args,\n    train_dataset=dataset,\n    tokenizer=tokenizer,  # newer TRL releases take processing_class=tokenizer instead\n)\nprint('Trainer initialized')"]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": ["# Step 8: Train\nprint('Starting ORPO training...')\ntrain_result = trainer.train()\nprint('Training complete!')\nprint(f'Metrics: {train_result.metrics}')"]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": ["# Step 9: Plot loss curve\nimport matplotlib.pyplot as plt\n\nlog_history = trainer.state.log_history\nsteps = [x['step'] for x in log_history if 'loss' in x]\nlosses = [x['loss'] for x in log_history if 'loss' in x]\n\nif steps:\n    plt.figure(figsize=(10, 5))\n    plt.plot(steps, losses, 'b-', linewidth=2, label='Training Loss')\n    plt.xlabel('Step')\n    plt.ylabel('Loss')\n    plt.title('ORPO Training Loss: Tenacious Judge')\n    plt.legend()\n    plt.grid(True, alpha=0.3)\n    plt.savefig('/content/loss_curve.png', dpi=150, bbox_inches='tight')\n    plt.show()\n    print(f'Initial loss: {losses[0]:.4f}')\n    print(f'Final loss: {losses[-1]:.4f}')\n    print(f'Loss decrease: {losses[0] - losses[-1]:.4f} ({(1-losses[-1]/losses[0])*100:.1f}%)')\nelse:\n    print('No loss history available')"]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": ["# Step 10: Save adapter locally and push to HuggingFace\nADAPTER_DIR = '/content/tenacious-adapter'\n\nmodel.save_pretrained(ADAPTER_DIR)\ntokenizer.save_pretrained(ADAPTER_DIR)\nprint(f'Adapter saved to {ADAPTER_DIR}')\n\n# Push to HuggingFace\nHUB_MODEL_ID = 'rafiakedir/tenacious-bench-adapter'\nprint(f'Pushing to {HUB_MODEL_ID}...')\nmodel.push_to_hub(HUB_MODEL_ID, token=HF_TOKEN)\ntokenizer.push_to_hub(HUB_MODEL_ID, token=HF_TOKEN)\nprint(f'Adapter pushed to https://huggingface.co/{HUB_MODEL_ID}')"]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": ["# Step 11: Verify adapter on HuggingFace\nfrom huggingface_hub import HfApi\napi = HfApi(token=HF_TOKEN)\nfiles = api.list_repo_files(HUB_MODEL_ID)\nprint(f'Files in {HUB_MODEL_ID}:')\nfor f in files:\n    print(f' {f}')"]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": ["# Step 12: Quick smoke test (run the judge on one sample)\n# Switch the Unsloth model from training mode to inference mode before generating\nFastLanguageModel.for_inference(model)\n\nJUDGE_SYSTEM = (\n    'You are evaluating outbound sales emails for Tenacious Consulting. '\n    'Score the following output on signal-grounding fidelity, bench commitment honesty, '\n    'ICP segment appropriateness, and Tenacious tone adherence. '\n    'Return JSON: {\\\"signal_grounding\\\": 0-1, \\\"bench_honesty\\\": 0-1, \\\"icp_segment\\\": 0-1, \\\"tone\\\": 0-1, \\\"overall\\\": 0-1}'\n)\n\ntest_email = '''Subject: TalentBridge's ML hiring + 30-min call\\n\\nHi Casey,\\nTalentBridge recently closed a Series A and currently has 8 open ML roles.\\nWe staff ML squads, typically 4 engineers in under 3 weeks.\\nWant to set up a 30-minute scoping conversation?\\n\\nBest,\\nYabi'''\n\nprompt_text = (\n    f'<|im_start|>system\\n{JUDGE_SYSTEM}<|im_end|>\\n'\n    f'<|im_start|>user\\n{test_email}<|im_end|>\\n'\n    f'<|im_start|>assistant\\n'\n)\n\ninputs = tokenizer(prompt_text, return_tensors='pt').to(model.device)\nwith torch.no_grad():\n    # Greedy decoding: temperature has no effect when do_sample=False, so it is omitted\n    output = model.generate(**inputs, max_new_tokens=100, do_sample=False)\ngenerated = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)\nprint('Judge output:')\nprint(generated)"]
+    }
+  ]
+}
|
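The smoke-test cell in run_on_colab.ipynb prints the judge's raw text, but the system prompt asks for a JSON object with five scores. A minimal sketch of how that text could be turned into validated numbers follows. The helper name `parse_judge_scores` and the choice to reject any out-of-range or missing score are assumptions for illustration and are not part of the committed files.

```python
import json
import re
from typing import Optional

# Keys requested by the JUDGE_SYSTEM prompt in the smoke-test cell.
EXPECTED_KEYS = ("signal_grounding", "bench_honesty", "icp_segment", "tone", "overall")


def parse_judge_scores(generated: str) -> Optional[dict]:
    """Hypothetical helper: extract and validate the first flat JSON object in the text.

    Returns a dict of floats in [0, 1], or None if the output is unusable.
    """
    # The requested object is flat, so a non-greedy match up to the first '}' is enough.
    match = re.search(r"\{.*?\}", generated, flags=re.DOTALL)
    if match is None:
        return None
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        # Covers the case where the model echoes the template, e.g. "tone": 0-1.
        return None
    scores = {}
    for key in EXPECTED_KEYS:
        value = raw.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not 0.0 <= value <= 1.0:
            return None
        scores[key] = float(value)
    return scores
```

In the notebook this would be called as `parse_judge_scores(generated)` after the final cell. A `None` result is worth logging alongside the raw text, since a small model may not always follow the requested format.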
train_judge.py
ADDED
|
@@ -0,0 +1,204 @@
+#!/usr/bin/env python3
+"""
+Day 5: Tenacious-Bench ORPO Judge Training Script
+Trains Qwen2.5-1.5B-Instruct with LoRA using ORPO (reference-free preference optimization).
+Run on Colab T4 or locally with sufficient VRAM.
+All hyperparameters are read from hyperparams.json, which keeps runs auditable.
+
+Usage:
+    python train_judge.py [--data-path PATH] [--output-dir DIR]
+"""
+
+import os
+import sys
+import json
+import random
+import logging
+import datetime
+import argparse
+from pathlib import Path
+
+import numpy as np
+
+ROOT = Path(__file__).parent.parent
+HYPERPARAMS_PATH = Path(__file__).parent / "hyperparams.json"
+DATA_PATH = ROOT / "training_data/preference_pairs.jsonl"
+OUTPUT_DIR = Path(__file__).parent / "adapter"
+LOG_DIR = Path(__file__).parent
+
+SEED = 42
+
+
+def set_seed(seed: int):
+    random.seed(seed)
+    np.random.seed(seed)
+    try:
+        import torch
+        torch.manual_seed(seed)
+        if torch.cuda.is_available():
+            torch.cuda.manual_seed_all(seed)
+    except ImportError:
+        pass
+
+
+def setup_logging(log_path: Path):
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s [%(levelname)s] %(message)s",
+        handlers=[
+            logging.FileHandler(str(log_path)),
+            logging.StreamHandler(sys.stdout),
+        ],
+    )
+    return logging.getLogger(__name__)
+
+
+def detect_precision():
+    try:
+        import torch
+        if torch.cuda.is_available():
+            cap = torch.cuda.get_device_capability()
+            name = torch.cuda.get_device_name()
+            if cap[0] >= 8:  # A100, A10, 4090: bf16 capable
+                logging.info(f"GPU {name} (compute {cap[0]}.{cap[1]}) supports bf16")
+                return {"bf16": True, "fp16": False}
+            else:  # T4, V100: fp16 only
+                logging.info(f"GPU {name} (compute {cap[0]}.{cap[1]}) using fp16")
+                return {"bf16": False, "fp16": True}
+    except Exception:
+        pass
+    return {"bf16": False, "fp16": False}
+
+
+def load_dataset(data_path: Path, logger):
+    from datasets import Dataset
+    pairs = []
+    with open(data_path, encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if line:
+                pairs.append(json.loads(line))
+    logger.info(f"Loaded {len(pairs)} preference pairs from {data_path}")
+    for p in pairs:
+        p.pop("task_id", None)  # bookkeeping field, not passed to the trainer
+        p.pop("dimension", None)  # bookkeeping field, not passed to the trainer
+    return Dataset.from_list(pairs)
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--data-path", type=str, default=str(DATA_PATH))
+    parser.add_argument("--output-dir", type=str, default=str(OUTPUT_DIR))
+    parser.add_argument("--hub-token", type=str, default=os.environ.get("HF_TOKEN", ""))
+    args = parser.parse_args()
+
+    set_seed(SEED)
+
+    timestamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%S")
+    log_path = LOG_DIR / f"training_run_seed{SEED}_{timestamp}.log"
+    logger = setup_logging(log_path)
+
+    with open(HYPERPARAMS_PATH) as f:
+        hp = json.load(f)
+    logger.info(f"Hyperparameters: {json.dumps(hp, indent=2)}")
+
+    precision = detect_precision()
+    logger.info(f"Precision: {precision}")
+
+    # Load Unsloth model
+    logger.info(f"Loading {hp['model_id']} with 4-bit quantization via Unsloth...")
+    from unsloth import FastLanguageModel
+
+    model, tokenizer = FastLanguageModel.from_pretrained(
+        model_name=hp["model_id"],
+        max_seq_length=hp["orpo_trainer"]["max_length"],
+        dtype=None,  # auto-detect
+        load_in_4bit=True,
+    )
+
+    # Apply LoRA
+    logger.info(f"Applying LoRA: r={hp['lora']['r']}, alpha={hp['lora']['lora_alpha']}, "
+                f"targets={hp['lora']['target_modules']}")
+    model = FastLanguageModel.get_peft_model(
+        model,
+        r=hp["lora"]["r"],
+        target_modules=hp["lora"]["target_modules"],
+        lora_alpha=hp["lora"]["lora_alpha"],
+        lora_dropout=hp["lora"]["lora_dropout"],
+        bias=hp["lora"]["bias"],
+        use_gradient_checkpointing="unsloth",
+        random_state=SEED,
+    )
+
+    # Load dataset
+    dataset = load_dataset(Path(args.data_path), logger)
+    logger.info(f"Dataset size: {len(dataset)}")
+
+    # Training arguments
+    from trl import ORPOConfig, ORPOTrainer
+
+    output_dir = Path(args.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    training_args = ORPOConfig(
+        output_dir=str(output_dir),
+        learning_rate=hp["orpo_trainer"]["learning_rate"],
+        per_device_train_batch_size=hp["orpo_trainer"]["per_device_train_batch_size"],
+        gradient_accumulation_steps=hp["orpo_trainer"]["gradient_accumulation_steps"],
+        num_train_epochs=hp["orpo_trainer"]["num_train_epochs"],
+        warmup_ratio=hp["orpo_trainer"]["warmup_ratio"],
+        lr_scheduler_type=hp["orpo_trainer"]["lr_scheduler_type"],
+        beta=hp["orpo_trainer"]["beta"],
+        max_length=hp["orpo_trainer"]["max_length"],
+        max_prompt_length=hp["orpo_trainer"]["max_prompt_length"],
+        logging_steps=hp["orpo_trainer"]["logging_steps"],
+        save_steps=hp["orpo_trainer"]["save_steps"],
+        seed=SEED,
+        bf16=precision["bf16"],
+        fp16=precision["fp16"],
+        report_to="none",
+        remove_unused_columns=False,
+    )
+
+    trainer = ORPOTrainer(
+        model=model,
+        args=training_args,
+        train_dataset=dataset,
+        tokenizer=tokenizer,  # newer TRL releases take processing_class=tokenizer instead
+    )
+
+    logger.info("Starting ORPO training...")
+    train_result = trainer.train()
+    logger.info(f"Training complete. Metrics: {train_result.metrics}")
+
+    # Save adapter locally
+    logger.info(f"Saving LoRA adapter to {output_dir}")
+    model.save_pretrained(str(output_dir))
+    tokenizer.save_pretrained(str(output_dir))
+
+    # Copy the log to a standard name; lines logged after this point reach only the timestamped file
+    standard_log = LOG_DIR / "training_run.log"
+    import shutil
+    shutil.copy(str(log_path), str(standard_log))
+    logger.info(f"Training log copied to {standard_log}")
+
+    # Push to HuggingFace
+    hub_model_id = hp.get("hub_model_id", "rafiakedir/tenacious-bench-adapter")
+    hub_token = args.hub_token or os.environ.get("HF_TOKEN", "")
+    if hub_token:
+        logger.info(f"Pushing adapter to HuggingFace: {hub_model_id}")
+        model.push_to_hub(hub_model_id, token=hub_token)
+        tokenizer.push_to_hub(hub_model_id, token=hub_token)
+        logger.info(f"Adapter pushed to https://huggingface.co/{hub_model_id}")
+    else:
+        logger.warning("HF_TOKEN not set; skipping HuggingFace push")
+
+    logger.info("=== TRAINING COMPLETE ===")
+    logger.info(f"Adapter saved to: {output_dir}")
+    logger.info(f"Log: {standard_log}")
+
+    return train_result.metrics
+
+
+if __name__ == "__main__":
+    main()
|
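train_judge.py indexes hyperparams.json directly (`hp["model_id"]`, `hp["lora"][...]`, `hp["orpo_trainer"][...]`), so a missing key only surfaces as a `KeyError`, possibly after the model has already been downloaded. A minimal sketch of a fail-fast check is shown here. The key names are taken from the script above, while the helper name `missing_hyperparams` and the dotted-path reporting are assumptions for illustration. The contents of hyperparams.json itself are not shown in this diff, so the sample values come from the notebook cells.

```python
# Keys that train_judge.py reads from hyperparams.json (taken from the script).
REQUIRED_KEYS = {
    "model_id": None,  # top-level scalar, no sub-fields
    "lora": ("r", "target_modules", "lora_alpha", "lora_dropout", "bias"),
    "orpo_trainer": (
        "learning_rate", "per_device_train_batch_size", "gradient_accumulation_steps",
        "num_train_epochs", "warmup_ratio", "lr_scheduler_type", "beta",
        "max_length", "max_prompt_length", "logging_steps", "save_steps",
    ),
}


def missing_hyperparams(hp: dict) -> list:
    """Hypothetical helper: return dotted paths of required keys absent from hp."""
    missing = []
    for section, fields in REQUIRED_KEYS.items():
        if section not in hp:
            missing.append(section)
            continue
        if fields is None:
            continue
        if not isinstance(hp[section], dict):
            # A section that is present but not a mapping cannot hold its fields.
            missing.append(section)
            continue
        for field in fields:
            if field not in hp[section]:
                missing.append(f"{section}.{field}")
    return missing


if __name__ == "__main__":
    # Sample values mirror the notebook cells; "beta" is left out on purpose.
    sample = {
        "model_id": "unsloth/Qwen2.5-1.5B-Instruct",
        "lora": {"r": 16, "target_modules": ["q_proj", "v_proj"], "lora_alpha": 32,
                 "lora_dropout": 0.05, "bias": "none"},
        "orpo_trainer": {"learning_rate": 8e-6, "per_device_train_batch_size": 2,
                         "gradient_accumulation_steps": 4, "num_train_epochs": 3,
                         "warmup_ratio": 0.1, "lr_scheduler_type": "cosine",
                         "max_length": 1024, "max_prompt_length": 512,
                         "logging_steps": 10, "save_steps": 50},
    }
    print(missing_hyperparams(sample))
```

In `main()` the check could run right after `json.load`, raising an error if the returned list is non-empty, so a malformed config stops the run before any GPU work starts.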