---
license: cc-by-4.0
language:
- en
base_model: unsloth/Qwen2.5-1.5B-Instruct
tags:
- judge
- b2b-sales
- orpo
- lora
- preference-learning
- tenacious-bench
- evaluation
- qwen2.5
- unsloth
datasets:
- rafiakedir/tenacious-bench-v0.1
---

# Tenacious-Bench Judge — ORPO LoRA Adapter (Qwen2.5-1.5B)

A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on
[Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)
preference pairs. Deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine.

- **Base model:** `unsloth/Qwen2.5-1.5B-Instruct`
- **Adapter type:** LoRA (PEFT); load on top of the base model with `PeftModel.from_pretrained`
- **Training algorithm:** ORPO (no reference model required)
- **Precision:** 4-bit quantized during training (Unsloth), fp16 for inference
- **Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)
- **Training:** 3 epochs · 36 steps · lr=8e-6 · beta=0.1 · LoRA r=16 alpha=32

---

## What It Scores

| Dimension | Trigger Rate (Week 10 probes) | Risk if Missed |
|---|---|---|
| `signal_grounding_fidelity` | 35% | CTO credibility loss |
| `competitor_gap_honesty` | 45% | Irreversible brand damage |
| `icp_segment_appropriateness` | 20% | ~$480K ACV per error |
| `tone_preservation` | 15% | Brand voice violation |
| `bench_commitment_honesty` | 5% | SOW-breach / delivery failure |
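
To score one email on all five dimensions, a loop over the dimension names is enough. The sketch below is a minimal illustration, not the contents of `inference_example.py`: `score_all` and `fake_judge` are hypothetical names, and `judge_fn` is assumed to have the same `(email, context, dimension)` signature as the `judge` function in the Quick Start section.

```python
from typing import Callable, Dict

# The five rubric dimensions listed in the table above.
DIMENSIONS = [
    "signal_grounding_fidelity",
    "competitor_gap_honesty",
    "icp_segment_appropriateness",
    "tone_preservation",
    "bench_commitment_honesty",
]

def score_all(email: str, context: str,
              judge_fn: Callable[[str, str, str], dict]) -> Dict[str, dict]:
    """Run the judge once per rubric dimension and collect the verdicts."""
    return {dim: judge_fn(email, context, dim) for dim in DIMENSIONS}

# Stand-in judge so the sketch runs without a GPU; swap in the real judge().
def fake_judge(email: str, context: str, dimension: str) -> dict:
    return {"dimension": dimension, "score": 0.5, "pass": False, "reasoning": "stub"}

verdicts = score_all("Hi Casey ...", "company: TalentBridge", fake_judge)
for dim, verdict in verdicts.items():
    print(dim, verdict["score"])
```

Each call is independent, so the five dimensions could also be batched or run in parallel if latency matters.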

---

## Quick Start — Inference

```python
import json, torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE_MODEL = "unsloth/Qwen2.5-1.5B-Instruct"
ADAPTER_ID  = "rafiakedir/tenacious-bench-adapter"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER_ID)
model.eval()

JUDGE_SYSTEM = (
    "You are a rubric-aware judge for Tenacious Consulting B2B outbound sales emails. "
    "Given a task context and a candidate email, score it on the specified rubric dimension. "
    "Respond with a JSON object only:\n"
    '{"dimension": "<dim>", "score": <0.0-1.0>, "pass": <true|false>, "reasoning": "<one sentence>"}'
)

def judge(email, context, dimension):
    user = (
        f"EVALUATION DIMENSION: {dimension}\n\n"
        f"TASK CONTEXT:\n{context}\n\n"
        f"CANDIDATE EMAIL:\n{email}\n\n"
        f"Score this email on the {dimension} dimension."
    )
    msgs = [{"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user",   "content": user}]
    text   = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True,
                             pad_token_id=tokenizer.eos_token_id)
    resp = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
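    # Extract the outermost {...} span; fall back to a neutral score when the
    # model emits no JSON object or the span does not parse.
    s, e = resp.find("{"), resp.rfind("}") + 1
    if 0 <= s < e:
        try:
            return json.loads(resp[s:e])
        except json.JSONDecodeError:
            pass
    return {"score": 0.5, "raw": resp[:200]}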

result = judge(
    email="Casey — TalentBridge has 8 open AI/ML roles this quarter. 30-min scoping call: calendly.com/tenacious",
    context="company: TalentBridge, stage: Series A, open_roles: 8, confidence: high",
    dimension="signal_grounding_fidelity"
)
print(result)
```
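
The rejection-sampling gate mentioned at the top of this card can be approximated as follows. This is a sketch under stated assumptions, not the Tenacious Conversion Engine's actual gate: `first_passing`, the 0.7 threshold, and `stub_judge` are all hypothetical, and `judge_fn` is assumed to behave like `judge` above.

```python
from typing import Callable, Iterable, Optional

def first_passing(candidates: Iterable[str], context: str, dimensions: Iterable[str],
                  judge_fn: Callable[[str, str, str], dict],
                  threshold: float = 0.7) -> Optional[str]:
    """Return the first candidate whose score meets the threshold on every dimension."""
    dims = list(dimensions)
    for email in candidates:
        verdicts = [judge_fn(email, context, d) for d in dims]
        # The 0.5 fallback that judge() returns on unparseable output stays
        # below this (hypothetical) threshold, so parse failures are rejected.
        if all(float(v.get("score", 0.0)) >= threshold for v in verdicts):
            return email
    return None  # every candidate was rejected; the caller decides what to do next

# Stand-in judge so the sketch runs without a GPU.
def stub_judge(email: str, context: str, dimension: str) -> dict:
    return {"score": 0.9 if "8 open" in email else 0.2}

drafts = ["Generic pitch about scaling teams.", "TalentBridge has 8 open AI/ML roles."]
picked = first_passing(drafts, "open_roles: 8", ["signal_grounding_fidelity"], stub_judge)
print(picked)
```

Returning `None` rather than the best-scoring reject keeps the gate strict; whether to regenerate or escalate on rejection is left to the caller.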

---

## Training Details

| Parameter | Value |
|---|---|
| Base model | `unsloth/Qwen2.5-1.5B-Instruct` (4-bit during training) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Target modules | q_proj, v_proj |
| LoRA dropout | 0.05 |
| Learning rate | 8e-6 |
| Effective batch size | 8 (batch=2, grad_accum=4) |
| Epochs | 3 |
| Total steps | 36 |
| ORPO beta | 0.1 |
| Max sequence length | 1024 |
| Seed | 42 |

- **Training loss:** 2.8676 → 2.9646 → 2.9386 (3 checkpoints)
- **Reward accuracy:** 0.5375 → 0.6026 → 0.5128
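
The step count in the table is consistent with the other hyperparameters. The arithmetic below assumes a single device and that the final partial batch of each epoch counts as a step, which is an assumption about the trainer's rounding rather than something stated in this card.

```python
import math

num_pairs = 94                       # training preference pairs
per_device_batch, grad_accum = 2, 4  # from the table above
epochs = 3

effective_batch = per_device_batch * grad_accum           # 2 * 4 = 8
steps_per_epoch = math.ceil(num_pairs / effective_batch)  # ceil(11.75) = 12
total_steps = steps_per_epoch * epochs                    # 12 * 3 = 36
print(effective_batch, steps_per_epoch, total_steps)
```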

**Training data:** 94 preference pairs from the train partition. To limit preference leakage,
the pair generator (DeepSeek V3.2) comes from a different model family than the judge
(Claude Sonnet 4.6 / `scoring_evaluator.py`).
All generation decisions are logged in the dataset repo at `training_data/generation_log.jsonl`.

---

## Evaluation Results

The adapter was evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`.
Full results are in `ablation_results.json` in the dataset repo.

**Deployment recommendation:** Run `ablations/run_ablations.py` with this adapter to get Delta A.
The ablation script loads this adapter from the Hugging Face Hub and requires a GPU plus `transformers` and `peft`.

---

## Known Limitations

1. **Dimension coverage gap.** The training set contains 0 pairs for `bench_commitment_honesty` and only 4 for `icp_segment_appropriateness`, because of a scoring-key mismatch during pair construction. The model therefore received no gradient signal on bench commitment honesty.

2. **Backbone below Prometheus-2 threshold.** Prometheus-2 demonstrated rubric-matching at 7B+ parameters. At 1.5B the model may underfit multi-dimension generalization.

3. **Synthetic training distribution.** All pairs derive from synthetic prospect briefs and LLM-generated emails.

4. **Static `bench_summary`.** The judge's calibration will drift as the real bench composition changes week to week.
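
A per-dimension count before training is one way to catch a gap like the one in limitation 1. The sketch below is illustrative only: it assumes each pair is a dict with a `dimension` key, which may not match the dataset's real schema, and `coverage_gaps` and `toy_pairs` are hypothetical.

```python
from collections import Counter

DIMENSIONS = [
    "signal_grounding_fidelity",
    "competitor_gap_honesty",
    "icp_segment_appropriateness",
    "tone_preservation",
    "bench_commitment_honesty",
]

def coverage_gaps(pairs, min_pairs=1):
    """Return {dimension: count} for every dimension with fewer than min_pairs pairs."""
    counts = Counter(p["dimension"] for p in pairs)
    return {d: counts.get(d, 0) for d in DIMENSIONS if counts.get(d, 0) < min_pairs}

# Toy records for illustration only; these are not the real training data.
toy_pairs = [
    {"dimension": "tone_preservation"},
    {"dimension": "competitor_gap_honesty"},
]
gaps = coverage_gaps(toy_pairs)
print(gaps)
```

Raising `min_pairs` turns the same check into a guard against thinly covered dimensions as well as empty ones.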

---

## Files

| File | Description |
|---|---|
| `adapter_config.json` | LoRA configuration (r=16, alpha=32, q_proj+v_proj) |
| `adapter_model.safetensors` | Trained LoRA weights (8.4 MB) |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
| `run_on_colab.ipynb` | End-to-end training + push notebook |
| `train_judge.py` | Training script |
| `inference_example.py` | Per-dimension and all-dimension scoring helper |

Training data: [rafiakedir/tenacious-bench-v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)

---

## Citation

```bibtex
@misc{tenacious-bench-adapter-2026,
  title  = {Tenacious-Bench Judge: ORPO LoRA Adapter for B2B Sales Evaluation},
  author = {Kedir, Rafia},
  year   = {2026},
  url    = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
}
```