feat: upload actual trained LoRA adapter (Qwen2.5-1.5B ORPO, 3 epochs, 36 steps)
- README.md +79 -159
- build_judge_pairs.py +214 -0
- inference_example.py +339 -13
README.md
CHANGED
|
@@ -2,34 +2,33 @@
|
|
| 2 |
license: cc-by-4.0
|
| 3 |
language:
|
| 4 |
- en
|
| 5 |
-
base_model: unsloth/
|
| 6 |
tags:
|
| 7 |
- judge
|
| 8 |
- b2b-sales
|
| 9 |
- orpo
|
|
|
|
| 10 |
- preference-learning
|
| 11 |
- tenacious-bench
|
| 12 |
- evaluation
|
| 13 |
-
-
|
| 14 |
- unsloth
|
| 15 |
datasets:
|
| 16 |
- rafiakedir/tenacious-bench-v0.1
|
| 17 |
---
|
| 18 |
|
| 19 |
-
# Tenacious-Bench Judge: ORPO
|
| 20 |
|
| 21 |
A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on
|
| 22 |
[Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
|
| 23 |
-
preference pairs. Deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
**
|
| 28 |
-
**
|
| 29 |
-
**Weights:** Merged (full model, not a LoRA adapter)
|
| 30 |
-
**Precision:** BF16 · ~873M parameters · ~1.75 GB
|
| 31 |
-
**Context length:** 262,144 tokens
|
| 32 |
**Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)
|
|
|
|
| 33 |
|
| 34 |
---
|
| 35 |
|
|
@@ -48,185 +47,115 @@ rubric dimensions; outputs below threshold are rejected and regenerated.
|
|
| 48 |
## Quick Start: Inference
|
| 49 |
|
| 50 |
```python
|
|
|
|
| 51 |
from transformers import AutoTokenizer, AutoModelForCausalLM
|
| 52 |
-
import
|
| 53 |
-
|
| 54 |
-
model_id = "rafiakedir/tenacious-bench-adapter"
|
| 55 |
-
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
| 56 |
-
model = AutoModelForCausalLM.from_pretrained(
|
| 57 |
-
model_id, torch_dtype=torch.bfloat16, device_map="auto"
|
| 58 |
-
)
|
| 59 |
-
|
| 60 |
-
SYSTEM = """You are a rubric-aware judge for B2B outbound sales emails.
|
| 61 |
-
Score the candidate output on the following dimension.
|
| 62 |
-
|
| 63 |
-
Dimension: signal_grounding_fidelity
|
| 64 |
-
Rubric: Every factual claim must resolve to a field in the hiring_signal_brief
|
| 65 |
-
with confidence >= 0.60, or be phrased as a question.
|
| 66 |
-
|
| 67 |
-
Respond with a JSON object: {"score": <0.0-1.0>, "reasoning": "<one sentence>"}"""
|
| 68 |
-
|
| 69 |
-
USER = """Hiring signal brief:
|
| 70 |
-
{
|
| 71 |
-
"company_name": "Acme Corp",
|
| 72 |
-
"open_roles": 3,
|
| 73 |
-
"confidence": "low",
|
| 74 |
-
"domain": "fintech"
|
| 75 |
-
}
|
| 76 |
-
|
| 77 |
-
Candidate email:
|
| 78 |
-
"Hi Alex - noticed Acme Corp is aggressively scaling its engineering team with 3 open roles.
|
| 79 |
-
We staff specialized capability-gap squads for fintech teams at your growth stage.
|
| 80 |
-
Would a 30-minute scoping conversation make sense this week?"
|
| 81 |
-
|
| 82 |
-
Score this output."""
|
| 83 |
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
{"role": "user", "content": USER},
|
| 87 |
-
]
|
| 88 |
-
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
|
| 89 |
-
inputs = tokenizer(text, return_tensors="pt").to(model.device)
|
| 90 |
|
| 91 |
-
|
| 92 |
-
|
| 93 |
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
```
|
| 98 |
|
| 99 |
---
|
| 100 |
|
| 101 |
## Training Details
|
| 102 |
|
| 103 |
-
### Why ORPO
|
| 104 |
-
|
| 105 |
-
ORPO (Hong et al., 2024) eliminates the reference model by computing the preference signal from
|
| 106 |
-
the log-odds ratio of chosen vs. rejected completions in a single forward pass. This reduces peak
|
| 107 |
-
VRAM by ~40% vs. DPO, making 3-epoch training feasible on a 16GB T4 without gradient
|
| 108 |
-
checkpointing hacks.
|
| 109 |
-
|
| 110 |
-
For a discriminative judge (score calibration rather than generation quality), the
|
| 111 |
-
preference signal should be stronger. We ran `beta=0.1` per the paper's recommendation but note
|
| 112 |
-
that `beta=0.2`–`0.3` may better calibrate the preference margin for rubric-based scoring.
|
| 113 |
-
|
| 114 |
-
### Preference Pair Construction
|
| 115 |
-
|
| 116 |
-
| Source | Count |
|
| 117 |
-
|---|---|
|
| 118 |
-
| Failing tasks → generated chosen (DeepSeek V3.2) | ~111 attempted |
|
| 119 |
-
| Passing tasks → generated rejected (DeepSeek V3.2) | ~41 attempted |
|
| 120 |
-
| **Final pairs after filtering** | **94** |
|
| 121 |
-
|
| 122 |
-
Filter: chosen score ≥ threshold AND rejected score < threshold AND TF-IDF cosine < 0.92.
|
| 123 |
-
Main rejection causes: chosen output still scoring below threshold (phrasing-mode sensitivity),
|
| 124 |
-
and ICP segment tasks with key mismatch making pass threshold structurally unachievable.
|
| 125 |
-
|
| 126 |
-
**Preference leakage prevention (Li et al., 2025):**
|
| 127 |
-
Generator (DeepSeek V3.2) ≠ judge family (Claude Sonnet 4.6 / `scoring_evaluator.py`).
|
| 128 |
-
All generation decisions logged in the dataset repo at `training_data/generation_log.jsonl`.
|
| 129 |
-
|
| 130 |
-
### Hyperparameters
|
| 131 |
-
|
| 132 |
| Parameter | Value |
|
| 133 |
|---|---|
|
| 134 |
-
| Base model | `unsloth/
|
| 135 |
| LoRA rank | 16 |
|
| 136 |
| LoRA alpha | 32 |
|
| 137 |
| Target modules | q_proj, v_proj |
|
| 138 |
| LoRA dropout | 0.05 |
|
| 139 |
| Learning rate | 8e-6 |
|
| 140 |
-
|
|
| 141 |
-
| Gradient accumulation | 4 (effective batch 8) |
|
| 142 |
| Epochs | 3 |
|
| 143 |
-
|
|
| 144 |
-
| LR scheduler | cosine |
|
| 145 |
| ORPO beta | 0.1 |
|
| 146 |
| Max sequence length | 1024 |
|
| 147 |
-
| Precision | BF16 (T4) |
|
| 148 |
| Seed | 42 |
|
| 149 |
|
| 150 |
-
Training
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 151 |
|
| 152 |
---
|
| 153 |
|
| 154 |
## Evaluation Results
|
| 155 |
|
| 156 |
Evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`.
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
| Condition | Mean Score | vs. Baseline |
|
| 160 |
-
|---|---|---|
|
| 161 |
-
| Baseline (`scoring_evaluator.py` only) | 0.458 | n/a |
|
| 162 |
-
| **This model (ORPO Qwen3.5-0.8B)** | **0.483** | Δ=+0.025, p=0.189, not significant |
|
| 163 |
-
| Prompt-only (Qwen3-30B, zero-shot) | 0.504 | Δ=−0.021 vs. trained, p=0.978 |
|
| 164 |
-
|
| 165 |
-
**Delta A** (trained vs. baseline): Δ=+0.025, 95% CI [−0.032, +0.081], p=0.189; **not statistically significant**.
|
| 166 |
|
| 167 |
-
**
|
| 168 |
-
|
| 169 |
-
Note: Delta B compares a 0.8B trained model against a 30B zero-shot model, which conflates backbone
|
| 170 |
-
capacity with training benefit. A rigorous Delta B requires re-running the prompt-only condition on
|
| 171 |
-
`Qwen3.5-0.8B-Instruct` (no fine-tuning).
|
| 172 |
-
|
| 173 |
-
**Deployment recommendation for this run:** DO NOT DEPLOY as primary gate. Continue using
|
| 174 |
-
`scoring_evaluator.py` deterministically. Retrain with ≥150 pairs covering all 5 dimensions
|
| 175 |
-
before re-evaluating.
|
| 176 |
-
|
| 177 |
-
Full numbers: `ablation_results.json` in the dataset repo.
|
| 178 |
|
| 179 |
---
|
| 180 |
|
| 181 |
## Known Limitations
|
| 182 |
|
| 183 |
-
|
| 184 |
-
The preference pairs contain 0 examples for `bench_commitment_honesty` and only 4 examples
|
| 185 |
-
for `icp_segment_appropriateness`, due to a scoring function key mismatch that made it impossible
|
| 186 |
-
to generate valid chosen outputs for these dimensions. The model received zero gradient signal on
|
| 187 |
-
bench commitment honesty, the highest SOW-breach-risk dimension. It cannot be trusted to gate
|
| 188 |
-
bench-commitment outputs.
|
| 189 |
-
|
| 190 |
-
**2. Delta A not significant at v0.1 scale.**
|
| 191 |
-
The +0.025 lift over the deterministic baseline is within the noise band (p=0.189). The model
|
| 192 |
-
does not reliably outperform `scoring_evaluator.py` on held-out tasks.
|
| 193 |
|
| 194 |
-
|
| 195 |
-
Prometheus-2 (Kim et al., 2024) demonstrated rubric-matching at 7B parameters. Qwen3.5-0.8B is
|
| 196 |
-
below that threshold. Capacity may be insufficient for simultaneous multi-dimension rubric generalization.
|
| 197 |
|
| 198 |
-
|
| 199 |
-
All preference pairs derive from synthetic prospect briefs and LLM-generated emails. The model
|
| 200 |
-
may not generalize to real prospect data with industry-specific jargon or edge cases outside the
|
| 201 |
-
training distribution.
|
| 202 |
|
| 203 |
-
|
| 204 |
-
The judge was trained on snapshot bench capacities. In production the bench changes weekly, so
|
| 205 |
-
calibration for `bench_commitment_honesty` will drift over time.
|
| 206 |
|
| 207 |
---
|
| 208 |
|
| 209 |
-
## Files
|
| 210 |
|
| 211 |
| File | Description |
|
| 212 |
|---|---|
|
| 213 |
-
| `
|
| 214 |
-
| `
|
| 215 |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
|
| 216 |
-
| `
|
| 217 |
-
| `
|
| 218 |
-
| `
|
| 219 |
-
| `inference_example.py` | Inference helper with prompt templates |
|
| 220 |
|
| 221 |
-
Training data
|
| 222 |
-
|
| 223 |
-
---
|
| 224 |
-
|
| 225 |
-
## Environmental Impact
|
| 226 |
-
|
| 227 |
-
- **Compute:** ~60–90 min on a single T4 GPU (3 epochs, 94 preference pairs)
|
| 228 |
-
- **CO₂e:** ~0.04 kg (T4 at 70 W × 1.5 h = 0.105 kWh, × US grid 0.42 kg CO₂/kWh)
|
| 229 |
-
- **Infrastructure:** Google Colab free tier
|
| 230 |
|
| 231 |
---
|
| 232 |
|
|
@@ -234,18 +163,9 @@ Training data and preference pairs: [rafiakedir/tenacious-bench-v0.1](https://hu
|
|
| 234 |
|
| 235 |
```bibtex
|
| 236 |
@misc{tenacious-bench-adapter-2026,
|
| 237 |
-
title
|
| 238 |
-
author
|
| 239 |
-
year
|
| 240 |
-
|
| 241 |
-
url = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
|
| 242 |
-
}
|
| 243 |
-
|
| 244 |
-
@misc{tenacious-bench-v01-2026,
|
| 245 |
-
title = {Tenacious-Bench v0.1: B2B Sales Evaluation Benchmark},
|
| 246 |
-
author = {Kedir, Rafia},
|
| 247 |
-
year = {2026},
|
| 248 |
-
howpublished = {HuggingFace Datasets Hub},
|
| 249 |
-
url = {https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1}
|
| 250 |
}
|
| 251 |
```
|
|
|
|
| 2 |
license: cc-by-4.0
|
| 3 |
language:
|
| 4 |
- en
|
| 5 |
+
base_model: unsloth/Qwen2.5-1.5B-Instruct
|
| 6 |
tags:
|
| 7 |
- judge
|
| 8 |
- b2b-sales
|
| 9 |
- orpo
|
| 10 |
+
- lora
|
| 11 |
- preference-learning
|
| 12 |
- tenacious-bench
|
| 13 |
- evaluation
|
| 14 |
+
- qwen2.5
|
| 15 |
- unsloth
|
| 16 |
datasets:
|
| 17 |
- rafiakedir/tenacious-bench-v0.1
|
| 18 |
---
|
| 19 |
|
| 20 |
+
# Tenacious-Bench Judge: ORPO LoRA Adapter (Qwen2.5-1.5B)
|
| 21 |
|
| 22 |
A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on
|
| 23 |
[Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
|
| 24 |
+
preference pairs. Deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine.
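
A rejection-sampling gate of this kind can be sketched as follows. This is a minimal illustration, not the Tenacious Conversion Engine's actual code: `gated_generate`, `generate`, and `judge_score` are hypothetical names, and the retry limit is an assumption. The thresholds are the ones defined in `build_judge_pairs.py`.

```python
from typing import Callable, Optional, Sequence

# Per-dimension pass thresholds, copied from PASS_THRESHOLD in build_judge_pairs.py.
PASS_THRESHOLD = {
    "signal_grounding_fidelity": 0.60,
    "bench_commitment_honesty": 0.50,
    "icp_segment_appropriateness": 0.50,
    "competitor_gap_honesty": 0.50,
    "tone_preservation": 0.60,
}


def gated_generate(
    generate: Callable[[], str],               # hypothetical: returns one candidate email
    judge_score: Callable[[str, str], float],  # hypothetical: (email, dimension) -> score in [0, 1]
    dimensions: Sequence[str],
    max_attempts: int = 3,                     # assumed retry budget
) -> Optional[str]:
    """Return the first candidate that clears every threshold, or None."""
    for _ in range(max_attempts):
        email = generate()
        if all(judge_score(email, d) >= PASS_THRESHOLD.get(d, 0.5) for d in dimensions):
            return email
    return None


# Usage with stub callables standing in for the generator and the judge.
drafts = iter(["draft with unsupported claims", "draft grounded in the brief"])
stub_scores = {"draft with unsupported claims": 0.2, "draft grounded in the brief": 0.9}
accepted = gated_generate(
    generate=lambda: next(drafts),
    judge_score=lambda email, dim: stub_scores[email],
    dimensions=["signal_grounding_fidelity", "tone_preservation"],
)
print(accepted)
```

Returning `None` after the retry budget is exhausted leaves the caller to decide whether to escalate or fall back to the deterministic evaluator.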
|
| 25 |
+
|
| 26 |
+
**Base model:** `unsloth/Qwen2.5-1.5B-Instruct`
|
| 27 |
+
**Adapter type:** LoRA (PEFT); load with base model + `PeftModel.from_pretrained`
|
| 28 |
+
**Training algorithm:** ORPO (no reference model required)
|
| 29 |
+
**Precision:** 4-bit quantized during training (Unsloth), fp16 for inference
|
|
|
|
|
|
|
|
|
|
| 30 |
**Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)
|
| 31 |
+
**Training:** 3 epochs · 36 steps · lr=8e-6 · beta=0.1 · LoRA r=16 alpha=32
|
| 32 |
|
| 33 |
---
|
| 34 |
|
|
|
|
| 47 |
## Quick Start: Inference
|
| 48 |
|
| 49 |
```python
|
| 50 |
+
import json, torch
|
| 51 |
from transformers import AutoTokenizer, AutoModelForCausalLM
|
| 52 |
+
from peft import PeftModel
|
| 53 |
|
| 54 |
+
BASE_MODEL = "unsloth/Qwen2.5-1.5B-Instruct"
|
| 55 |
+
ADAPTER_ID = "rafiakedir/tenacious-bench-adapter"
|
|
|
|
|
|
|
|
|
|
|
|
|
| 56 |
|
| 57 |
+
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_ID)
|
| 58 |
+
base = AutoModelForCausalLM.from_pretrained(
|
| 59 |
+
BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
|
| 60 |
+
)
|
| 61 |
+
model = PeftModel.from_pretrained(base, ADAPTER_ID)
|
| 62 |
+
model.eval()
|
| 63 |
+
|
| 64 |
+
JUDGE_SYSTEM = (
|
| 65 |
+
"You are a rubric-aware judge for Tenacious Consulting B2B outbound sales emails. "
|
| 66 |
+
"Given a task context and a candidate email, score it on the specified rubric dimension. "
|
| 67 |
+
"Respond with a JSON object only:\n"
|
| 68 |
+
'{"dimension": "<dim>", "score": <0.0-1.0>, "pass": <true|false>, "reasoning": "<one sentence>"}'
|
| 69 |
+
)
|
| 70 |
|
| 71 |
+
def judge(email, context, dimension):
|
| 72 |
+
user = (
|
| 73 |
+
f"EVALUATION DIMENSION: {dimension}\n\n"
|
| 74 |
+
f"TASK CONTEXT:\n{context}\n\n"
|
| 75 |
+
f"CANDIDATE EMAIL:\n{email}\n\n"
|
| 76 |
+
f"Score this email on the {dimension} dimension."
|
| 77 |
+
)
|
| 78 |
+
msgs = [{"role": "system", "content": JUDGE_SYSTEM},
|
| 79 |
+
{"role": "user", "content": user}]
|
| 80 |
+
text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
|
| 81 |
+
inputs = tokenizer(text, return_tensors="pt").to(model.device)
|
| 82 |
+
with torch.no_grad():
|
| 83 |
+
out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True,
|
| 84 |
+
pad_token_id=tokenizer.eos_token_id)
|
| 85 |
+
resp = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
|
| 86 |
+
s, e = resp.find("{"), resp.rfind("}") + 1
|
| 87 |
+
return json.loads(resp[s:e]) if 0 <= s < e else {"score": 0.5, "raw": resp[:200]}
|
| 88 |
+
|
| 89 |
+
result = judge(
|
| 90 |
+
email="Casey β TalentBridge has 8 open AI/ML roles this quarter. 30-min scoping call: calendly.com/tenacious",
|
| 91 |
+
context="company: TalentBridge, stage: Series A, open_roles: 8, confidence: high",
|
| 92 |
+
dimension="signal_grounding_fidelity"
|
| 93 |
+
)
|
| 94 |
+
print(result)
|
| 95 |
```
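
The quick start parses whatever sits between the first `{` and the last `}`. A reply that contains braces but is not valid JSON makes `json.loads` raise, so a gate built on it may want a more tolerant parser. The sketch below is one possible approach. `parse_judge_output` is a hypothetical helper, and clamping the score to [0, 1] is an assumption. The 0.5 fallback mirrors the quick start.

```python
import json

FALLBACK_SCORE = 0.5  # same neutral default the quick start returns


def parse_judge_output(resp: str) -> dict:
    """Extract the judge's JSON object from raw model text without raising."""
    fallback = {"score": FALLBACK_SCORE, "raw": resp[:200]}
    start, end = resp.find("{"), resp.rfind("}")
    if start < 0 or end <= start:
        return fallback  # no brace-delimited span at all
    try:
        parsed = json.loads(resp[start:end + 1])
    except json.JSONDecodeError:
        return fallback  # braces present, but not valid JSON
    score = parsed.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return fallback  # missing or non-numeric score
    parsed["score"] = min(max(float(score), 0.0), 1.0)  # assumed clamp to [0, 1]
    return parsed


print(parse_judge_output('Sure. {"dimension": "tone_preservation", "score": 0.8, "pass": true}'))
print(parse_judge_output('{"score": 0.8'))
```

Keeping the raw text in the fallback makes it possible to log unparseable replies instead of silently treating them as neutral scores.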
|
| 96 |
|
| 97 |
---
|
| 98 |
|
| 99 |
## Training Details
|
| 100 |
|
| 101 |
| Parameter | Value |
|
| 102 |
|---|---|
|
| 103 |
+
| Base model | `unsloth/Qwen2.5-1.5B-Instruct` (4-bit during training) |
|
| 104 |
| LoRA rank | 16 |
|
| 105 |
| LoRA alpha | 32 |
|
| 106 |
| Target modules | q_proj, v_proj |
|
| 107 |
| LoRA dropout | 0.05 |
|
| 108 |
| Learning rate | 8e-6 |
|
| 109 |
+
| Effective batch size | 8 (batch=2, grad_accum=4) |
|
|
|
|
| 110 |
| Epochs | 3 |
|
| 111 |
+
| Total steps | 36 |
|
|
|
|
| 112 |
| ORPO beta | 0.1 |
|
| 113 |
| Max sequence length | 1024 |
|
|
|
|
| 114 |
| Seed | 42 |
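
The step count in the table follows from the other rows. The arithmetic below is a worked check, assuming the trainer counts the final partial accumulation window of each epoch as one optimizer step (that rounding behaviour is an assumption about the training loop, not something stated in this README).

```python
import math

num_pairs = 94        # preference pairs in the train split
per_device_batch = 2  # "batch=2" from the table
grad_accum = 4        # "grad_accum=4" from the table
epochs = 3

effective_batch = per_device_batch * grad_accum           # pairs per optimizer step
steps_per_epoch = math.ceil(num_pairs / effective_batch)  # partial window rounded up (assumed)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)
```

With these inputs the result is 12 steps per epoch and 36 steps in total, matching the "Total steps" row.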
|
| 115 |
|
| 116 |
+
**Training loss:** 2.8676 → 2.9646 → 2.9386 (3 checkpoints)
|
| 117 |
+
**Reward accuracy:** 0.5375 → 0.6026 → 0.5128
|
| 118 |
+
|
| 119 |
+
**Training data:** 94 preference pairs from the train partition. Preference leakage prevention:
|
| 120 |
+
generator (DeepSeek V3.2) ≠ judge family (Claude Sonnet 4.6 / `scoring_evaluator.py`).
|
| 121 |
+
All generation decisions logged in the dataset repo at `training_data/generation_log.jsonl`.
|
| 122 |
|
| 123 |
---
|
| 124 |
|
| 125 |
## Evaluation Results
|
| 126 |
|
| 127 |
Evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`.
|
| 128 |
+
Full results in `ablation_results.json` in the dataset repo.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 129 |
|
| 130 |
+
**Deployment recommendation:** Run `ablations/run_ablations.py` with this adapter to get Delta A.
|
| 131 |
+
The ablation script loads this adapter via HuggingFace and requires a GPU plus `transformers` and `peft`.
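
Delta A is the mean score difference between the trained judge and the baseline on the same held-out tasks. The contents of `ablations/run_ablations.py` are not shown here, so the following is only a sketch of one common way to estimate such a paired difference with a percentile bootstrap. `paired_bootstrap` and the sample scores are hypothetical.

```python
import random


def paired_bootstrap(trained, baseline, n_resamples=2000, seed=42):
    """Mean paired difference with a percentile 95% CI and a one-sided p-value."""
    if len(trained) != len(baseline) or not trained:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [t - b for t, b in zip(trained, baseline)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample per-task differences with replacement and record each resample mean.
    boot = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = boot[int(0.025 * n_resamples)]
    hi = boot[int(0.975 * n_resamples) - 1]
    p_one_sided = sum(m <= 0 for m in boot) / n_resamples  # share of resamples with no lift
    return {"delta": sum(diffs) / n, "ci95": (lo, hi), "p": p_one_sided}


# Hypothetical per-task scores, for illustration only.
result = paired_bootstrap([0.6, 0.7, 0.5, 0.8], [0.5, 0.5, 0.5, 0.5])
print(result)
```

Pairing by task matters here because both conditions are scored on the same 59 held-out tasks. Resampling the per-task differences keeps that pairing intact.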
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 132 |
|
| 133 |
---
|
| 134 |
|
| 135 |
## Known Limitations
|
| 136 |
|
| 137 |
+
1. **Dimension coverage gap.** The training set has 0 pairs for `bench_commitment_honesty` and only 4 for `icp_segment_appropriateness`, due to a scoring-key mismatch during pair construction. The model received zero gradient signal on bench commitment honesty.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 138 |
|
| 139 |
+
2. **Backbone below Prometheus-2 threshold.** Prometheus-2 demonstrated rubric-matching at 7B+ parameters. At 1.5B the model may underfit multi-dimension generalization.
|
|
|
|
|
|
|
| 140 |
|
| 141 |
+
3. **Synthetic training distribution.** All pairs derive from synthetic prospect briefs and LLM-generated emails.
|
|
|
|
|
|
|
|
|
|
| 142 |
|
| 143 |
+
4. **Static bench_summary.** Judge calibration drifts as real bench composition changes weekly.
|
|
|
|
|
|
|
| 144 |
|
| 145 |
---
|
| 146 |
|
| 147 |
+
## Files
|
| 148 |
|
| 149 |
| File | Description |
|
| 150 |
|---|---|
|
| 151 |
+
| `adapter_config.json` | LoRA configuration (r=16, alpha=32, q_proj+v_proj) |
|
| 152 |
+
| `adapter_model.safetensors` | Trained LoRA weights (8.4 MB) |
|
| 153 |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
|
| 154 |
+
| `run_on_colab.ipynb` | End-to-end training + push notebook |
|
| 155 |
+
| `train_judge.py` | Training script |
|
| 156 |
+
| `inference_example.py` | Per-dimension and all-dimension scoring helper |
|
|
|
|
| 157 |
|
| 158 |
+
Training data: [rafiakedir/tenacious-bench-v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 159 |
|
| 160 |
---
|
| 161 |
|
|
|
|
| 163 |
|
| 164 |
```bibtex
|
| 165 |
@misc{tenacious-bench-adapter-2026,
|
| 166 |
+
title = {Tenacious-Bench Judge: ORPO LoRA Adapter for B2B Sales Evaluation},
|
| 167 |
+
author = {Kedir, Rafia},
|
| 168 |
+
year = {2026},
|
| 169 |
+
url = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 170 |
}
|
| 171 |
```
|
build_judge_pairs.py
ADDED
|
@@ -0,0 +1,214 @@
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Build judge-format ORPO training pairs.
|
| 4 |
+
|
| 5 |
+
Each preference pair in preference_pairs.jsonl has:
|
| 6 |
+
chosen = a GOOD email (passes rubric)
|
| 7 |
+
rejected = a BAD email (fails rubric)
|
| 8 |
+
|
| 9 |
+
For judge training we need the model to score emails, not generate them.
|
| 10 |
+
So we create pairs where:
|
| 11 |
+
chosen_response = correct JSON score for the email
|
| 12 |
+
rejected_response = wrong JSON score for the same email
|
| 13 |
+
|
| 14 |
+
From each original pair we create TWO judge training examples:
|
| 15 |
+
1. Judge pair for the GOOD email → correct high score is chosen, wrong low-ish score is rejected
|
| 16 |
+
2. Judge pair for the BAD email → correct low score is chosen, wrong high-ish score is rejected
|
| 17 |
+
|
| 18 |
+
Output: training_data/judge_pairs.jsonl (conversations format for ORPOTrainer)
|
| 19 |
+
"""
|
| 20 |
+
|
| 21 |
+
import json
|
| 22 |
+
import sys
|
| 23 |
+
from pathlib import Path
|
| 24 |
+
|
| 25 |
+
ROOT = Path(__file__).parent.parent
|
| 26 |
+
sys.path.insert(0, str(ROOT))
|
| 27 |
+
from scoring_evaluator import score_task
|
| 28 |
+
|
| 29 |
+
PAIRS_PATH = ROOT / "training_data/preference_pairs.jsonl"
|
| 30 |
+
TASKS_PATH = ROOT / "tenacious_bench_v0.1/train/tasks.jsonl"
|
| 31 |
+
OUTPUT_PATH = ROOT / "training_data/judge_pairs.jsonl"
|
| 32 |
+
|
| 33 |
+
JUDGE_SYSTEM = (
|
| 34 |
+
"You are a rubric-aware judge for Tenacious Consulting B2B outbound sales emails. "
|
| 35 |
+
"Given a task context and a candidate email, score the email on the specified rubric "
|
| 36 |
+
"dimension. Respond with a JSON object only:\n"
|
| 37 |
+
'{"dimension": "<dim>", "score": <0.0-1.0>, "pass": <true|false>, '
|
| 38 |
+
'"reasoning": "<one concise sentence>"}'
|
| 39 |
+
)
|
| 40 |
+
|
| 41 |
+
PASS_THRESHOLD = {
|
| 42 |
+
"signal_grounding_fidelity": 0.60,
|
| 43 |
+
"bench_commitment_honesty": 0.50,
|
| 44 |
+
"icp_segment_appropriateness": 0.50,
|
| 45 |
+
"competitor_gap_honesty": 0.50,
|
| 46 |
+
"tone_preservation": 0.60,
|
| 47 |
+
}
|
| 48 |
+
|
| 49 |
+
# Dimension-specific reasoning templates
|
| 50 |
+
PASS_REASONING = {
|
| 51 |
+
"signal_grounding_fidelity": "Email grounds all factual claims in documented hiring signals from the brief; low-confidence signals are phrased as questions.",
|
| 52 |
+
"bench_commitment_honesty": "Staffing commitment is within the available bench count for the required stack.",
|
| 53 |
+
"icp_segment_appropriateness": "Email language matches the correct ICP segment for the prospect's funding stage and posture.",
|
| 54 |
+
"competitor_gap_honesty": "Competitor gap claims are grounded in the competitor_gap_brief; no fabricated assertions.",
|
| 55 |
+
"tone_preservation": "Email maintains Tenacious brand voice: no clichés, no over-apologetic language, calendar CTA included.",
|
| 56 |
+
}
|
| 57 |
+
FAIL_REASONING = {
|
| 58 |
+
"signal_grounding_fidelity": "Email asserts growth or capability claims not supported by the hiring signal brief; treats low-confidence signals as established facts.",
|
| 59 |
+
"bench_commitment_honesty": "Email promises engineer capacity that exceeds the available bench count for the required stack.",
|
| 60 |
+
"icp_segment_appropriateness": "Email uses the wrong segment language; growth-phase pitch applied to a cost-restructuring or abstain-segment prospect.",
|
| 61 |
+
"competitor_gap_honesty": "Email asserts competitor gaps not documented in the brief, fabricating capability differences.",
|
| 62 |
+
"tone_preservation": "Email uses a banned re-engagement phrase or lacks the required 30-minute scoping calendar CTA.",
|
| 63 |
+
}
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
def build_user_prompt(task: dict, email_text: str) -> str:
|
| 67 |
+
dim = task.get("dimension", "")
|
| 68 |
+
inp = task.get("input", {})
|
| 69 |
+
# Compact the signal brief (trim to 800 chars to stay within max_prompt_length)
|
| 70 |
+
brief = json.dumps(
|
| 71 |
+
inp.get("hiring_signal_brief") or inp.get("bench_summary") or {},
|
| 72 |
+
indent=2
|
| 73 |
+
)[:800]
|
| 74 |
+
return (
|
| 75 |
+
f"EVALUATION DIMENSION: {dim}\n\n"
|
| 76 |
+
f"TASK CONTEXT:\n{brief}\n\n"
|
| 77 |
+
f"CANDIDATE EMAIL:\n{email_text.strip()}\n\n"
|
| 78 |
+
f"Score this email on the {dim} dimension."
|
| 79 |
+
)
|
| 80 |
+
|
| 81 |
+
|
| 82 |
+
def make_score_json(dim: str, score: float, passed: bool, reasoning: str) -> str:
|
| 83 |
+
return json.dumps({
|
| 84 |
+
"dimension": dim,
|
| 85 |
+
"score": round(score, 2),
|
| 86 |
+
"pass": passed,
|
| 87 |
+
"reasoning": reasoning,
|
| 88 |
+
})
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
def conversations(system: str, user: str, assistant: str) -> list:
|
| 92 |
+
return [
|
| 93 |
+
{"role": "system", "content": system},
|
| 94 |
+
{"role": "user", "content": user},
|
| 95 |
+
{"role": "assistant", "content": assistant},
|
| 96 |
+
]
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
def main():
|
| 100 |
+
# Load tasks by task_id
|
| 101 |
+
tasks = {}
|
| 102 |
+
with open(TASKS_PATH) as f:
|
| 103 |
+
for line in f:
|
| 104 |
+
t = json.loads(line)
|
| 105 |
+
tasks[t["task_id"]] = t
|
| 106 |
+
|
| 107 |
+
pairs_raw = []
|
| 108 |
+
with open(PAIRS_PATH) as f:
|
| 109 |
+
for line in f:
|
| 110 |
+
pairs_raw.append(json.loads(line))
|
| 111 |
+
|
| 112 |
+
judge_pairs = []
|
| 113 |
+
skipped = 0
|
| 114 |
+
|
| 115 |
+
for pair in pairs_raw:
|
| 116 |
+
task_id = pair["task_id"]
|
| 117 |
+
dim = pair["dimension"]
|
| 118 |
+
task = tasks.get(task_id)
|
| 119 |
+
if task is None:
|
| 120 |
+
skipped += 1
|
| 121 |
+
continue
|
| 122 |
+
|
| 123 |
+
# Strip the <|im_end|> token that was embedded during generation
|
| 124 |
+
chosen_email = pair["chosen"].replace("<|im_end|>", "").strip()
|
| 125 |
+
rejected_email = pair["rejected"].replace("<|im_end|>", "").strip()
|
| 126 |
+
|
| 127 |
+
# Score both emails with the deterministic evaluator
|
| 128 |
+
r_chosen = score_task({**task, "candidate_output": chosen_email})
|
| 129 |
+
r_rejected = score_task({**task, "candidate_output": rejected_email})
|
| 130 |
+
|
| 131 |
+
sc = r_chosen.get("score", 0.5)
|
| 132 |
+
sr = r_rejected.get("score", 0.5)
|
| 133 |
+
threshold = PASS_THRESHOLD.get(dim, 0.5)
|
| 134 |
+
|
| 135 |
+
chosen_passes = sc >= threshold
|
| 136 |
+
rejected_passes = sr >= threshold
|
| 137 |
+
|
| 138 |
+
# ── Judge pair 1: score the GOOD (chosen) email ──────────────────────
|
| 139 |
+
# Correct judgment: high score (chosen) vs wrong judgment: low score (rejected)
|
| 140 |
+
user_prompt_chosen = build_user_prompt(task, chosen_email)
|
| 141 |
+
|
| 142 |
+
correct_score_chosen = round(min(sc + 0.05, 1.0), 2) if chosen_passes else round(sc, 2)
|
| 143 |
+
wrong_score_chosen = round(max(sc - 0.5, 0.0), 2)
|
| 144 |
+
|
| 145 |
+
correct_response = make_score_json(
|
| 146 |
+
dim, correct_score_chosen, chosen_passes,
|
| 147 |
+
PASS_REASONING[dim] if chosen_passes else FAIL_REASONING[dim]
|
| 148 |
+
)
|
| 149 |
+
wrong_response = make_score_json(
|
| 150 |
+
dim, wrong_score_chosen, not chosen_passes,
|
| 151 |
+
FAIL_REASONING[dim] if chosen_passes else PASS_REASONING[dim]
|
| 152 |
+
)
|
| 153 |
+
|
| 154 |
+
# Only include if there's a meaningful score gap
|
| 155 |
+
if abs(correct_score_chosen - wrong_score_chosen) >= 0.2:
|
| 156 |
+
judge_pairs.append({
|
| 157 |
+
"chosen": conversations(JUDGE_SYSTEM, user_prompt_chosen, correct_response),
|
| 158 |
+
"rejected": conversations(JUDGE_SYSTEM, user_prompt_chosen, wrong_response),
|
| 159 |
+
"task_id": task_id,
|
| 160 |
+
"dimension": dim,
|
| 161 |
+
"email_type": "chosen",
|
| 162 |
+
"actual_score": sc,
|
| 163 |
+
})
|
| 164 |
+
|
| 165 |
+
# ── Judge pair 2: score the BAD (rejected) email ─────────────────────
|
| 166 |
+
user_prompt_rejected = build_user_prompt(task, rejected_email)
|
| 167 |
+
|
| 168 |
+
correct_score_rejected = round(sr, 2)
|
| 169 |
+
wrong_score_rejected = round(min(sr + 0.5, 1.0), 2)
|
| 170 |
+
|
| 171 |
+
correct_response_r = make_score_json(
|
| 172 |
+
dim, correct_score_rejected, rejected_passes,
|
| 173 |
+
PASS_REASONING[dim] if rejected_passes else FAIL_REASONING[dim]
|
| 174 |
+
)
|
| 175 |
+
wrong_response_r = make_score_json(
|
| 176 |
+
dim, wrong_score_rejected, not rejected_passes,
|
| 177 |
+
PASS_REASONING[dim] if not rejected_passes else FAIL_REASONING[dim]
|
| 178 |
+
)
|
| 179 |
+
|
| 180 |
+
if abs(wrong_score_rejected - correct_score_rejected) >= 0.2:
|
| 181 |
+
judge_pairs.append({
|
| 182 |
+
"chosen": conversations(JUDGE_SYSTEM, user_prompt_rejected, correct_response_r),
|
| 183 |
+
"rejected": conversations(JUDGE_SYSTEM, user_prompt_rejected, wrong_response_r),
|
| 184 |
+
"task_id": task_id,
|
| 185 |
+
"dimension": dim,
|
| 186 |
+
"email_type": "rejected",
|
| 187 |
+
"actual_score": sr,
|
| 188 |
+
})
|
| 189 |
+
|
| 190 |
+
with open(OUTPUT_PATH, "w") as f:
|
| 191 |
+
for jp in judge_pairs:
|
| 192 |
+
f.write(json.dumps(jp) + "\n")
|
| 193 |
+
|
| 194 |
+
from collections import Counter
|
| 195 |
+
dim_counts = Counter(jp["dimension"] for jp in judge_pairs)
|
| 196 |
+
type_counts = Counter(jp["email_type"] for jp in judge_pairs)
|
| 197 |
+
|
| 198 |
+
print(f"Built {len(judge_pairs)} judge pairs (skipped {skipped} missing tasks)")
|
| 199 |
+
print(f"Dimension breakdown: {dict(dim_counts)}")
|
| 200 |
+
print(f"Email type: {dict(type_counts)}")
|
| 201 |
+
print(f"Written to {OUTPUT_PATH}")
|
| 202 |
+
|
| 203 |
+
# Validate format
|
| 204 |
+
sample = judge_pairs[0]
|
| 205 |
+
assert "chosen" in sample and isinstance(sample["chosen"], list)
|
| 206 |
+
assert sample["chosen"][0]["role"] == "system"
|
| 207 |
+
assert sample["chosen"][-1]["role"] == "assistant"
|
| 208 |
+
print("\nFormat validation: PASSED")
|
| 209 |
+
print(f"Sample chosen response: {sample['chosen'][-1]['content']}")
|
| 210 |
+
print(f"Sample rejected response: {sample['rejected'][-1]['content']}")
|
| 211 |
+
|
| 212 |
+
|
| 213 |
+
if __name__ == "__main__":
|
| 214 |
+
main()
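
To make the score perturbation in `main()` concrete, the sketch below mirrors the "chosen email" branch for a single evaluator score. `judge_targets` is a hypothetical helper written for illustration and is not part of the script. The arithmetic and the 0.2 gap filter are copied from the code above.

```python
def judge_targets(score: float, threshold: float) -> dict:
    """Reproduce the correct/wrong target construction for one 'chosen' email."""
    passes = score >= threshold
    # Correct target: evaluator score, nudged up by 0.05 when the email passes.
    correct = round(min(score + 0.05, 1.0), 2) if passes else round(score, 2)
    # Wrong target: evaluator score shifted down by 0.5, with the pass flag flipped.
    wrong = round(max(score - 0.5, 0.0), 2)
    return {
        "correct": {"score": correct, "pass": passes},
        "wrong": {"score": wrong, "pass": not passes},
        "kept": abs(correct - wrong) >= 0.2,  # same gap filter as main()
    }


passing = judge_targets(0.80, 0.60)  # chosen email that clears the threshold
failing = judge_targets(0.30, 0.60)  # chosen email that still scores below threshold
print(passing)
print(failing)
```

For a passing email the wrong target is a low score with `pass` set to false, as the docstring describes. For a chosen email that fails the threshold, the same rule yields a wrong target with a score of 0.0 but `pass` set to true, so the score and the flag disagree. Whether such pairs should be filtered out is a design question this sketch does not settle.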
|
inference_example.py
CHANGED
|
@@ -1,5 +1,274 @@
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
Inference helper for rafiakedir/tenacious-bench-adapter.
|
| 4 |
Scores a candidate B2B sales email on one or all five Tenacious rubric dimensions.
|
| 5 |
"""
|
|
@@ -12,50 +281,53 @@ MODEL_ID = "rafiakedir/tenacious-bench-adapter"
|
|
| 12 |
|
| 13 |
DIMENSION_PROMPTS = {
|
| 14 |
"signal_grounding_fidelity": (
|
| 15 |
-
"Dimension: signal_grounding_fidelity\n"
|
| 16 |
"Rubric: Every factual claim about the prospect must be supported by the hiring_signal_brief "
|
| 17 |
"with confidence >= 0.60, or phrased as a question. Claims about growth, scale, or ML strategy "
|
| 18 |
"without a high/medium-confidence signal in the brief must be recast as questions."
|
| 19 |
),
|
| 20 |
"bench_commitment_honesty": (
|
| 21 |
-
"Dimension: bench_commitment_honesty\n"
|
| 22 |
"Rubric: The email must not promise or imply a number of engineers that exceeds "
|
| 23 |
"the total available in the bench_summary. Any staffing commitment must stay within capacity."
|
| 24 |
),
|
| 25 |
"icp_segment_appropriateness": (
|
| 26 |
-
"Dimension: icp_segment_appropriateness\n"
|
| 27 |
"Rubric: The email's language and pitch angle must match the correct ICP segment "
|
| 28 |
"(Segment 1=growth-scale, Segment 2=cost-restructuring, Segment 3=consolidation, "
|
| 29 |
"ABSTAIN=insufficient signal). A growth pitch to a post-layoff company is a mismatch."
|
| 30 |
),
|
| 31 |
"competitor_gap_honesty": (
|
| 32 |
-
"Dimension: competitor_gap_honesty\n"
|
| 33 |
"Rubric: Any assertion about a competitor gap must be grounded in the competitor_gap_brief. "
|
| 34 |
"The email must not assert that competitors have capabilities the prospect lacks "
|
| 35 |
"unless the brief explicitly documents this gap."
|
| 36 |
),
|
| 37 |
"tone_preservation": (
|
| 38 |
-
"Dimension: tone_preservation\n"
|
| 39 |
"Rubric: No re-engagement clichés ('just wanted to circle back', 'touching base', "
|
| 40 |
"'following up'). No over-apologetic exits ('sorry for taking your time'). "
|
| 41 |
"Calendar CTA required. Confident but not pushy."
|
| 42 |
),
|
| 43 |
}
|
| 44 |
|
| 45 |
-
SYSTEM_TEMPLATE = """
|
|
|
|
| 46 |
{dimension_prompt}
|
| 47 |
|
| 48 |
Respond with a JSON object only:
|
| 49 |
{{"score": <float 0.0-1.0>, "reasoning": "<one concise sentence explaining the score>"}}
|
| 50 |
"""
|
| 51 |
|
| 52 |
-
USER_TEMPLATE = """
|
|
|
|
| 53 |
{context_json}
|
| 54 |
|
| 55 |
Candidate email:
|
| 56 |
{candidate_output}
|
| 57 |
|
| 58 |
-
Score this output on the dimension above.
|
|
|
|
| 59 |
|
| 60 |
|
| 61 |
def load_model(model_id: str = MODEL_ID):
|
|
@@ -162,15 +434,69 @@ if __name__ == "__main__":
|
|
| 162 |
"teams at your growth stage. Would a 30-minute scoping conversation make sense this week?"
|
| 163 |
)
|
| 164 |
|
| 165 |
-
print("\nScoring on signal_grounding_fidelity...")
|
| 166 |
result = score(tokenizer, model, demo_input, demo_email, "signal_grounding_fidelity")
|
| 167 |
-
print(f" Score: {result['score']:.2f}")
|
| 168 |
-
print(f" Reasoning: {result['reasoning']}")
|
| 169 |
|
| 170 |
-
print("\nScoring all dimensions...")
|
| 171 |
all_results = score_all_dimensions(tokenizer, model, demo_input, demo_email)
|
| 172 |
for dim, r in all_results.items():
|
| 173 |
if dim == "mean_score":
|
| 174 |
print(f" MEAN: {r:.3f}")
|
| 175 |
else:
|
| 176 |
-
print(f" {dim}: {r['score']:.2f} - {r['reasoning'][:80]}")
|
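The demo above indexes `result['score']` and `result['reasoning']`, so the raw generation has to be turned into a dict somewhere; the body of `score` is not visible in this diff. A minimal sketch of that step follows, assuming the judge may wrap its JSON object in extra text. `parse_judge_response` is a hypothetical helper, and clamping the score to [0.0, 1.0] is an assumption rather than documented behaviour.

```python
import json
import re


def parse_judge_response(text: str) -> dict:
    """Extract {"score", "reasoning"} from raw judge output."""
    # Take everything from the first "{" to the last "}" in the output.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(match.group(0))
    score = float(data["score"])
    return {
        # Clamp so that downstream thresholding always sees a value in [0, 1].
        "score": min(1.0, max(0.0, score)),
        "reasoning": str(data.get("reasoning", "")),
    }


if __name__ == "__main__":
    raw = 'Here is my verdict: {"score": 0.4, "reasoning": "Low-confidence claim."}'
    print(parse_judge_response(raw))
```

A caller would likely want to catch `ValueError` and `KeyError` here and treat an unparseable response as a rejection, since a 1.5B-parameter judge will not always emit valid JSON.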
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
+
Upload model card, training scripts, and inference helper to
|
| 4 |
+
rafiakedir/tenacious-bench-adapter on HuggingFace.
|
| 5 |
+
Does NOT re-upload the safetensors weights; those are already there.
|
| 6 |
+
"""
|
| 7 |
+
from pathlib import Path
|
| 8 |
+
from huggingface_hub import HfApi, CommitOperationAdd
|
| 9 |
+
|
| 10 |
+
ROOT = Path(__file__).parent
|
| 11 |
+
REPO_ID = "rafiakedir/tenacious-bench-adapter"
|
| 12 |
+
|
| 13 |
+
# ── Model Card ────────────────────────────────────────────────────────────────
|
| 14 |
+
MODEL_CARD = """\
|
| 15 |
+
---
|
| 16 |
+
license: cc-by-4.0
|
| 17 |
+
language:
|
| 18 |
+
- en
|
| 19 |
+
base_model: unsloth/Qwen3.5-0.8B
|
| 20 |
+
tags:
|
| 21 |
+
- judge
|
| 22 |
+
- b2b-sales
|
| 23 |
+
- orpo
|
| 24 |
+
- preference-learning
|
| 25 |
+
- tenacious-bench
|
| 26 |
+
- evaluation
|
| 27 |
+
- qwen3
|
| 28 |
+
- unsloth
|
| 29 |
+
datasets:
|
| 30 |
+
- rafiakedir/tenacious-bench-v0.1
|
| 31 |
+
---
|
| 32 |
+
|
| 33 |
+
# Tenacious-Bench Judge: ORPO Fine-Tuned Qwen3.5-0.8B
|
| 34 |
+
|
| 35 |
+
A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on
|
| 36 |
+
[Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
|
| 37 |
+
preference pairs. Deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine:
|
| 38 |
+
the generator (DeepSeek V3.2) produces a candidate email; this judge scores it on five
|
| 39 |
+
rubric dimensions; outputs below threshold are rejected and regenerated.
|
| 40 |
+
|
| 41 |
+
**Base model:** `unsloth/Qwen3.5-0.8B`
|
| 42 |
+
**Training algorithm:** ORPO (no reference model; single forward pass)
|
| 43 |
+
**Weights:** Merged (full model, not a LoRA adapter)
|
| 44 |
+
**Precision:** BF16 · ~873M parameters · ~1.75 GB
|
| 45 |
+
**Context length:** 262,144 tokens
|
| 46 |
+
**Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)
|
| 47 |
+
|
| 48 |
+
---
|
| 49 |
+
|
| 50 |
+
## What It Scores
|
| 51 |
+
|
| 52 |
+
| Dimension | Trigger Rate (Week 10 probes) | Risk if Missed |
|
| 53 |
+
|---|---|---|
|
| 54 |
+
| `signal_grounding_fidelity` | 35% | CTO credibility loss |
|
| 55 |
+
| `competitor_gap_honesty` | 45% | Irreversible brand damage |
|
| 56 |
+
| `icp_segment_appropriateness` | 20% | ~$480K ACV per error |
|
| 57 |
+
| `tone_preservation` | 15% | Brand voice violation |
|
| 58 |
+
| `bench_commitment_honesty` | 5% | SOW-breach / delivery failure |
|
| 59 |
+
|
| 60 |
+
---
|
| 61 |
+
|
| 62 |
+
## Quick Start: Inference
|
| 63 |
+
|
| 64 |
+
```python
|
| 65 |
+
from transformers import AutoTokenizer, AutoModelForCausalLM
|
| 66 |
+
import torch
|
| 67 |
+
|
| 68 |
+
model_id = "rafiakedir/tenacious-bench-adapter"
|
| 69 |
+
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
| 70 |
+
model = AutoModelForCausalLM.from_pretrained(
|
| 71 |
+
model_id, torch_dtype=torch.bfloat16, device_map="auto"
|
| 72 |
+
)
|
| 73 |
+
|
| 74 |
+
SYSTEM = \"\"\"You are a rubric-aware judge for B2B outbound sales emails.
|
| 75 |
+
Score the candidate output on the following dimension.
|
| 76 |
+
|
| 77 |
+
Dimension: signal_grounding_fidelity
|
| 78 |
+
Rubric: Every factual claim must resolve to a field in the hiring_signal_brief
|
| 79 |
+
with confidence >= 0.60, or be phrased as a question.
|
| 80 |
+
|
| 81 |
+
Respond with a JSON object: {"score": <0.0-1.0>, "reasoning": "<one sentence>"}\"\"\"
|
| 82 |
+
|
| 83 |
+
USER = \"\"\"Hiring signal brief:
|
| 84 |
+
{
|
| 85 |
+
"company_name": "Acme Corp",
|
| 86 |
+
"open_roles": 3,
|
| 87 |
+
"confidence": "low",
|
| 88 |
+
"domain": "fintech"
|
| 89 |
+
}
|
| 90 |
+
|
| 91 |
+
Candidate email:
|
| 92 |
+
"Hi Alex - noticed Acme Corp is aggressively scaling its engineering team with 3 open roles.
|
| 93 |
+
We staff specialized capability-gap squads for fintech teams at your growth stage.
|
| 94 |
+
Would a 30-minute scoping conversation make sense this week?"
|
| 95 |
+
|
| 96 |
+
Score this output.\"\"\"
|
| 97 |
+
|
| 98 |
+
messages = [
|
| 99 |
+
{"role": "system", "content": SYSTEM},
|
| 100 |
+
{"role": "user", "content": USER},
|
| 101 |
+
]
|
| 102 |
+
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
|
| 103 |
+
inputs = tokenizer(text, return_tensors="pt").to(model.device)
|
| 104 |
+
|
| 105 |
+
with torch.no_grad():
|
| 106 |
+
out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
|
| 107 |
+
|
| 108 |
+
response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
|
| 109 |
+
print(response)
|
| 110 |
+
# Example output (sampling is enabled, so the exact text varies): {"score": 0.4, "reasoning": "Claims 'aggressively scaling' but brief confidence is low; should be phrased as a question."}
|
| 111 |
+
```
|
| 112 |
+
|
| 113 |
+
---
|
| 114 |
+
|
| 115 |
+
## Training Details
|
| 116 |
+
|
| 117 |
+
### Why ORPO
|
| 118 |
+
|
| 119 |
+
ORPO (Hong et al., 2024) eliminates the reference model by computing the preference signal from
|
| 120 |
+
the log-odds ratio of chosen vs. rejected completions in a single forward pass. This reduces peak
|
| 121 |
+
VRAM by ~40% vs. DPO, making 3-epoch training feasible on a 16GB T4 without gradient
|
| 122 |
+
checkpointing hacks.
|
| 123 |
+
|
| 124 |
+
For a discriminative judge (score calibration rather than generation quality), the
|
| 125 |
+
preference signal should be stronger. We ran `beta=0.1` per the paper's recommendation but note
|
| 126 |
+
that `beta=0.2` to `0.3` may better calibrate the preference margin for rubric-based scoring.
|
| 127 |
+
|
| 128 |
+
### Preference Pair Construction
|
| 129 |
+
|
| 130 |
+
| Source | Count |
|
| 131 |
+
|---|---|
|
| 132 |
+
| Failing tasks → generated chosen (DeepSeek V3.2) | ~111 attempted |
|
| 133 |
+
| Passing tasks → generated rejected (DeepSeek V3.2) | ~41 attempted |
|
| 134 |
+
| **Final pairs after filtering** | **94** |
|
| 135 |
+
|
| 136 |
+
Filter: chosen score ≥ threshold AND rejected score < threshold AND TF-IDF cosine < 0.92.
|
| 137 |
+
Main rejection causes: chosen output still scoring below threshold (phrasing-mode sensitivity),
|
| 138 |
+
and ICP segment tasks with key mismatch making pass threshold structurally unachievable.
|
| 139 |
+
|
| 140 |
+
**Preference leakage prevention (Li et al., 2025):**
|
| 141 |
+
Generator (DeepSeek V3.2) ≠ judge family (Claude Sonnet 4.6 / `scoring_evaluator.py`).
|
| 142 |
+
All generation decisions logged in the dataset repo at `training_data/generation_log.jsonl`.
|
| 143 |
+
|
| 144 |
+
### Hyperparameters
|
| 145 |
+
|
| 146 |
+
| Parameter | Value |
|
| 147 |
+
|---|---|
|
| 148 |
+
| Base model | `unsloth/Qwen3.5-0.8B` |
|
| 149 |
+
| LoRA rank | 16 |
|
| 150 |
+
| LoRA alpha | 32 |
|
| 151 |
+
| Target modules | q_proj, v_proj |
|
| 152 |
+
| LoRA dropout | 0.05 |
|
| 153 |
+
| Learning rate | 8e-6 |
|
| 154 |
+
| Batch size (per device) | 2 |
|
| 155 |
+
| Gradient accumulation | 4 (effective batch 8) |
|
| 156 |
+
| Epochs | 3 |
|
| 157 |
+
| Warmup ratio | 0.1 |
|
| 158 |
+
| LR scheduler | cosine |
|
| 159 |
+
| ORPO beta | 0.1 |
|
| 160 |
+
| Max sequence length | 1024 |
|
| 161 |
+
| Precision | BF16 (T4) |
|
| 162 |
+
| Seed | 42 |
|
| 163 |
+
|
| 164 |
+
Training notebook: see `run_on_colab.ipynb` in this repo.
|
| 165 |
+
|
| 166 |
+
---
|
| 167 |
+
|
| 168 |
+
## Evaluation Results
|
| 169 |
+
|
| 170 |
+
Evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`.
|
| 171 |
+
Paired bootstrap significance test: 10,000 iterations, seed 42.
|
| 172 |
+
|
| 173 |
+
| Condition | Mean Score | vs. Baseline |
|
| 174 |
+
|---|---|---|
|
| 175 |
+
| Baseline (`scoring_evaluator.py` only) | 0.458 | - |
|
| 176 |
+
| **This model (ORPO Qwen3.5-0.8B)** | **0.483** | Δ=+0.025, p=0.189, not significant |
|
| 177 |
+
| Prompt-only (Qwen3-30B, zero-shot) | 0.504 | Δ=−0.021 vs. trained, p=0.978 |
|
| 178 |
+
|
| 179 |
+
**Delta A** (trained vs. baseline): Δ=+0.025, 95% CI [−0.032, +0.081], p=0.189, **not statistically significant**.
|
| 180 |
+
|
| 181 |
+
**Delta B** (trained vs. prompt-only): not significant. Finding: `prompt_engineering_sufficient`;
|
| 182 |
+
the Qwen3-30B zero-shot condition is a viable lower-cost alternative at this scale of training data.
|
| 183 |
+
Note: Delta B compares a 0.8B trained model against a 30B zero-shot model, which conflates backbone
|
| 184 |
+
capacity with training benefit. A rigorous Delta B requires re-running the prompt-only condition on
|
| 185 |
+
`Qwen3.5-0.8B-Instruct` (no fine-tuning).
|
| 186 |
+
|
| 187 |
+
**Deployment recommendation for this run:** DO NOT DEPLOY as primary gate. Continue using
|
| 188 |
+
`scoring_evaluator.py` deterministically. Retrain with ≥150 pairs covering all 5 dimensions
|
| 189 |
+
before re-evaluating.
|
| 190 |
+
|
| 191 |
+
Full numbers: `ablation_results.json` in the dataset repo.
|
| 192 |
+
|
| 193 |
+
---
|
| 194 |
+
|
| 195 |
+
## Known Limitations
|
| 196 |
+
|
| 197 |
+
**1. Dimension coverage gap (critical).**
|
| 198 |
+
The preference pairs contain 0 examples for `bench_commitment_honesty` and only 4 examples
|
| 199 |
+
for `icp_segment_appropriateness`, due to a scoring function key mismatch that made it impossible
|
| 200 |
+
to generate valid chosen outputs for these dimensions. The model received zero gradient signal on
|
| 201 |
+
bench commitment honesty, the highest SOW-breach-risk dimension. It cannot be trusted to gate
|
| 202 |
+
bench-commitment outputs.
|
| 203 |
+
|
| 204 |
+
**2. Delta A not significant at v0.1 scale.**
|
| 205 |
+
The +0.025 lift over the deterministic baseline is within the noise band (p=0.189). The model
|
| 206 |
+
does not reliably outperform `scoring_evaluator.py` on held-out tasks.
|
| 207 |
+
|
| 208 |
+
**3. Backbone below Prometheus-2 threshold.**
|
| 209 |
+
Prometheus-2 (Kim et al., 2024) demonstrated rubric-matching at 7B parameters. Qwen3.5-0.8B is
|
| 210 |
+
below that threshold. Capacity may be insufficient for simultaneous multi-dimension rubric generalization.
|
| 211 |
+
|
| 212 |
+
**4. Synthetic training distribution.**
|
| 213 |
+
All preference pairs derive from synthetic prospect briefs and LLM-generated emails. The model
|
| 214 |
+
may not generalize to real prospect data with industry-specific jargon or edge cases outside the
|
| 215 |
+
training distribution.
|
| 216 |
+
|
| 217 |
+
**5. Static bench_summary.**
|
| 218 |
+
The judge was trained on snapshot bench capacities. In production the bench changes weekly, so
|
| 219 |
+
calibration for `bench_commitment_honesty` will drift over time.
|
| 220 |
+
|
| 221 |
+
---
|
| 222 |
+
|
| 223 |
+
## Files in This Repo
|
| 224 |
+
|
| 225 |
+
| File | Description |
|
| 226 |
+
|---|---|
|
| 227 |
+
| `model.safetensors-*` | Merged model weights (BF16) |
|
| 228 |
+
| `config.json` | Model architecture config |
|
| 229 |
+
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
|
| 230 |
+
| `train_judge.py` | Full ORPO training script |
|
| 231 |
+
| `hyperparams.json` | All hyperparameters (pinned) |
|
| 232 |
+
| `run_on_colab.ipynb` | End-to-end training notebook for T4 |
|
| 233 |
+
| `inference_example.py` | Inference helper with prompt templates |
|
| 234 |
+
|
| 235 |
+
Training data and preference pairs: [rafiakedir/tenacious-bench-v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
|
| 236 |
+
|
| 237 |
+
---
|
| 238 |
+
|
| 239 |
+
## Environmental Impact
|
| 240 |
+
|
| 241 |
+
- **Compute:** ~60–90 min on a single T4 GPU (3 epochs, 94 preference pairs)
|
| 242 |
+
- **CO₂e:** ~0.04 kg (T4 at 70 W × 1.5 h ÷ 1000 = 0.105 kWh; × US grid 0.42 kg CO₂/kWh)
|
| 243 |
+
- **Infrastructure:** Google Colab free tier
|
| 244 |
+
|
| 245 |
+
---
|
| 246 |
+
|
| 247 |
+
## Citation
|
| 248 |
+
|
| 249 |
+
```bibtex
|
| 250 |
+
@misc{tenacious-bench-adapter-2026,
|
| 251 |
+
title = {Tenacious-Bench Judge: ORPO Fine-Tuned Qwen3.5-0.8B for B2B Sales Evaluation},
|
| 252 |
+
author = {Kedir, Rafia},
|
| 253 |
+
year = {2026},
|
| 254 |
+
howpublished = {HuggingFace Model Hub},
|
| 255 |
+
url = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
|
| 256 |
+
}
|
| 257 |
+
|
| 258 |
+
@misc{tenacious-bench-v01-2026,
|
| 259 |
+
title = {Tenacious-Bench v0.1: B2B Sales Evaluation Benchmark},
|
| 260 |
+
author = {Kedir, Rafia},
|
| 261 |
+
year = {2026},
|
| 262 |
+
howpublished = {HuggingFace Datasets Hub},
|
| 263 |
+
url = {https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1}
|
| 264 |
+
}
|
| 265 |
+
```
|
| 266 |
+
"""
|
| 267 |
+
|
| 268 |
+
# ── Inference Example ─────────────────────────────────────────────────────────
|
| 269 |
+
INFERENCE_EXAMPLE = '''\
|
| 270 |
+
#!/usr/bin/env python3
|
| 271 |
+
"""
|
| 272 |
Inference helper for rafiakedir/tenacious-bench-adapter.
|
| 273 |
Scores a candidate B2B sales email on one or all five Tenacious rubric dimensions.
|
| 274 |
"""
|
|
|
|
| 281 |
|
| 282 |
DIMENSION_PROMPTS = {
|
| 283 |
"signal_grounding_fidelity": (
|
| 284 |
+
"Dimension: signal_grounding_fidelity\\n"
|
| 285 |
"Rubric: Every factual claim about the prospect must be supported by the hiring_signal_brief "
|
| 286 |
"with confidence >= 0.60, or phrased as a question. Claims about growth, scale, or ML strategy "
|
| 287 |
"without a high/medium-confidence signal in the brief must be recast as questions."
|
| 288 |
),
|
| 289 |
"bench_commitment_honesty": (
|
| 290 |
+
"Dimension: bench_commitment_honesty\\n"
|
| 291 |
"Rubric: The email must not promise or imply a number of engineers that exceeds "
|
| 292 |
"the total available in the bench_summary. Any staffing commitment must stay within capacity."
|
| 293 |
),
|
| 294 |
"icp_segment_appropriateness": (
|
| 295 |
+
"Dimension: icp_segment_appropriateness\\n"
|
| 296 |
"Rubric: The email's language and pitch angle must match the correct ICP segment "
|
| 297 |
"(Segment 1=growth-scale, Segment 2=cost-restructuring, Segment 3=consolidation, "
|
| 298 |
"ABSTAIN=insufficient signal). A growth pitch to a post-layoff company is a mismatch."
|
| 299 |
),
|
| 300 |
"competitor_gap_honesty": (
|
| 301 |
+
"Dimension: competitor_gap_honesty\\n"
|
| 302 |
"Rubric: Any assertion about a competitor gap must be grounded in the competitor_gap_brief. "
|
| 303 |
"The email must not assert that competitors have capabilities the prospect lacks "
|
| 304 |
"unless the brief explicitly documents this gap."
|
| 305 |
),
|
| 306 |
"tone_preservation": (
|
| 307 |
+
"Dimension: tone_preservation\\n"
|
| 308 |
"Rubric: No re-engagement clichés ('just wanted to circle back', 'touching base', "
|
| 309 |
"'following up'). No over-apologetic exits ('sorry for taking your time'). "
|
| 310 |
"Calendar CTA required. Confident but not pushy."
|
| 311 |
),
|
| 312 |
}
|
| 313 |
|
| 314 |
+
SYSTEM_TEMPLATE = """\
|
| 315 |
+
You are a rubric-aware judge for B2B outbound sales emails written by Tenacious Consulting.
|
| 316 |
{dimension_prompt}
|
| 317 |
|
| 318 |
Respond with a JSON object only:
|
| 319 |
{{"score": <float 0.0-1.0>, "reasoning": "<one concise sentence explaining the score>"}}
|
| 320 |
"""
|
| 321 |
|
| 322 |
+
USER_TEMPLATE = """\
|
| 323 |
+
Context:
|
| 324 |
{context_json}
|
| 325 |
|
| 326 |
Candidate email:
|
| 327 |
{candidate_output}
|
| 328 |
|
| 329 |
+
Score this output on the dimension above.\
|
| 330 |
+
"""
|
| 331 |
|
| 332 |
|
| 333 |
def load_model(model_id: str = MODEL_ID):
|
|
|
|
| 434 |
"teams at your growth stage. Would a 30-minute scoping conversation make sense this week?"
|
| 435 |
)
|
| 436 |
|
| 437 |
+
print("\\nScoring on signal_grounding_fidelity...")
|
| 438 |
result = score(tokenizer, model, demo_input, demo_email, "signal_grounding_fidelity")
|
| 439 |
+
print(f" Score: {result[\'score\']:.2f}")
|
| 440 |
+
print(f" Reasoning: {result[\'reasoning\']}")
|
| 441 |
|
| 442 |
+
print("\\nScoring all dimensions...")
|
| 443 |
all_results = score_all_dimensions(tokenizer, model, demo_input, demo_email)
|
| 444 |
for dim, r in all_results.items():
|
| 445 |
if dim == "mean_score":
|
| 446 |
print(f" MEAN: {r:.3f}")
|
| 447 |
else:
|
| 448 |
+
print(f" {dim}: {r[\'score\']:.2f} - {r[\'reasoning\'][:80]}")
|
| 449 |
+
'''
|
| 450 |
+
|
| 451 |
+
|
| 452 |
+
def main():
|
| 453 |
+
api = HfApi()
|
| 454 |
+
operations = []
|
| 455 |
+
|
| 456 |
+
def add_bytes(content: bytes, repo_path: str, label: str = ""):
|
| 457 |
+
lbl = label or repo_path
|
| 458 |
+
print(f" queuing {lbl} ({len(content):,} bytes)")
|
| 459 |
+
operations.append(CommitOperationAdd(
|
| 460 |
+
path_in_repo=repo_path,
|
| 461 |
+
path_or_fileobj=content,
|
| 462 |
+
))
|
| 463 |
+
|
| 464 |
+
def add_file(local_path: Path, repo_path: str):
|
| 465 |
+
print(f" queuing {repo_path} ({local_path.stat().st_size:,} bytes)")
|
| 466 |
+
operations.append(CommitOperationAdd(
|
| 467 |
+
path_in_repo=repo_path,
|
| 468 |
+
path_or_fileobj=str(local_path),
|
| 469 |
+
))
|
| 470 |
+
|
| 471 |
+
# Model card
|
| 472 |
+
add_bytes(MODEL_CARD.encode(), "README.md", "README.md (model card)")
|
| 473 |
+
|
| 474 |
+
# Inference example
|
| 475 |
+
add_bytes(INFERENCE_EXAMPLE.encode(), "inference_example.py")
|
| 476 |
+
|
| 477 |
+
# Training scripts
|
| 478 |
+
add_file(ROOT / "training" / "train_judge.py", "train_judge.py")
|
| 479 |
+
add_file(ROOT / "training" / "hyperparams.json", "hyperparams.json")
|
| 480 |
+
add_file(ROOT / "training" / "run_on_colab.ipynb", "run_on_colab.ipynb")
|
| 481 |
+
add_file(ROOT / "training" / "requirements_training.txt", "requirements_training.txt")
|
| 482 |
+
|
| 483 |
+
print(f"\nCommitting {len(operations)} files to {REPO_ID}...")
|
| 484 |
+
url = api.create_commit(
|
| 485 |
+
repo_id=REPO_ID,
|
| 486 |
+
repo_type="model",
|
| 487 |
+
operations=operations,
|
| 488 |
+
commit_message=(
|
| 489 |
+
"feat: add model card, inference example, and training scripts\n\n"
|
| 490 |
+
"- Proper model card with YAML frontmatter (base_model, tags, datasets)\n"
|
| 491 |
+
"- Honest eval results: Delta A p=0.189 not significant, DO NOT DEPLOY verdict\n"
|
| 492 |
+
"- Dimension coverage gap documented (bench_commitment_honesty=0 pairs)\n"
|
| 493 |
+
"- inference_example.py with per-dimension and all-dimensions scoring\n"
|
| 494 |
+
"- Training scripts: train_judge.py, hyperparams.json, run_on_colab.ipynb"
|
| 495 |
+
),
|
| 496 |
+
)
|
| 497 |
+
print(f"\nDone. Commit URL: {url}")
|
| 498 |
+
print(f"Model: https://huggingface.co/{REPO_ID}")
|
| 499 |
+
|
| 500 |
+
|
| 501 |
+
if __name__ == "__main__":
|
| 502 |
+
main()
|
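The model card embedded in this script reports a paired bootstrap significance test (10,000 iterations, seed 42), but the evaluation code itself is not part of this diff. A minimal sketch of one common formulation follows, assuming per-task scores for two conditions on the same held-out tasks. `paired_bootstrap` is a hypothetical name, and both the percentile confidence interval and the one-sided p-value (the share of resampled mean differences at or below zero) are assumptions about how the reported numbers were derived.

```python
import random


def paired_bootstrap(baseline, treated, iterations=10_000, seed=42):
    """Return (mean delta, (ci_low, ci_high), one-sided p) for paired scores."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(baseline)
    # Work on per-task differences so that the pairing is preserved.
    diffs = [t - b for b, t in zip(baseline, treated)]
    observed = sum(diffs) / n
    means = []
    for _ in range(iterations):
        # Resample tasks with replacement and record the mean difference.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    low = means[int(0.025 * iterations)]
    high = means[min(iterations - 1, int(0.975 * iterations))]
    # One-sided p: how often the resampled lift is zero or negative.
    p_value = sum(1 for m in means if m <= 0) / iterations
    return observed, (low, high), p_value


if __name__ == "__main__":
    base = [0.2, 0.4, 0.6, 0.8, 0.5]
    tuned = [0.3, 0.4, 0.5, 0.9, 0.6]
    print(paired_bootstrap(base, tuned, iterations=2_000))
```

Under this formulation a p-value near 1 for a negative delta, as in the prompt-only row of the results table, simply means almost every resample favoured the other condition.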