tenacious-judge-lora-v0.1
SimPO-trained LoRA judge adapter for Tenacious-specific B2B sales-agent evaluation.
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Method | SimPO (Meng, Xia & Chen 2024) via TRL CPOTrainer(loss_type="simpo") |
| LoRA rank | r = 16, α = 32 |
| Seed | 20260422 |
| Training cost | Colab T4, ≈ 35 min, $0 |
| Adapter size | ≈ 65 MB |
| Dataset | bonneyjr/tenacious-bench |
| Code | github.com/atnabon/sales-eval-bench |
What it does
This adapter gates the Week 10 Tenacious Conversion Engine generator. Given a
task brief and a candidate email draft, the judge emits a pass / regenerate
verdict and a rubric breakdown across four scoring components:
| Component | Weight | What it checks |
|---|---|---|
| banned_phrases | 0.30 | Style Guide v2 banned-phrase list |
| grounding | 0.35 | LOW-confidence signals asked, not asserted |
| tone | 0.20 | Direct/Honest tone markers |
| structural | 0.15 | Word count, calendar link / handoff |
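The four weights sum to 1.0, so one natural way to combine them is a weighted sum of per-component scores. The sketch below assumes each component is scored in [0, 1] and that the verdict comes from a fixed cutoff; `PASS_THRESHOLD` and both function names are illustrative assumptions, not values or APIs taken from the adapter or the evaluator.

```python
# Weights copied from the rubric table; they sum to 1.0.
WEIGHTS = {
    "banned_phrases": 0.30,
    "grounding": 0.35,
    "tone": 0.20,
    "structural": 0.15,
}

# Hypothetical cutoff, chosen only for illustration.
PASS_THRESHOLD = 0.70


def weighted_score(components: dict[str, float]) -> float:
    """Return the weighted sum of per-component scores, each expected in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1]")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def verdict(components: dict[str, float], threshold: float = PASS_THRESHOLD) -> str:
    """Map the weighted score to the judge's two verdicts."""
    return "pass" if weighted_score(components) >= threshold else "regenerate"
```

With these weights, a draft that is perfect everywhere except grounding scores 0.65, so under the assumed 0.70 cutoff it would be sent back for regeneration.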
Headline results
Evaluated on Tenacious-Bench v0.1 held-out (12 tasks, interim):
| Delta | Description | Δ | 95 % CI | p |
|---|---|---|---|---|
| A | Trained gate vs. Week 10 baseline | +0.2188 | [+0.1177, +0.3198] | < 0.0001 |
| B | Trained gate vs. prompt-engineered same backbone | +0.0187 | [-0.0188, +0.0750] | 0.71 → null |
Delta B is a null result, stated plainly per the challenge brief. A careful prompt on the same backbone is statistically indistinguishable from the trained adapter on quality alone. The adapter's production case rests on cost predictability (no eval-tier API), latency tail (no jitter), and auditability (versioned artifact, not a mutable prompt).
Cost-Pareto: +$0.00060/task (+21 %), +279 ms/task (+46 %).
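This card does not state how the confidence intervals in the results table were computed. One common choice for a small paired comparison (the same 12 tasks scored under two systems) is a percentile bootstrap over tasks; the sketch below shows that technique under that assumption. The per-task scores are made-up placeholders, not the benchmark's results, and `paired_bootstrap_ci` is a hypothetical helper name.

```python
import random


def paired_bootstrap_ci(treated, baseline, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-task difference (treated - baseline)."""
    if len(treated) != len(baseline):
        raise ValueError("score lists must be paired task by task")
    diffs = [t - b for t, b in zip(treated, baseline)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample tasks with replacement and record the mean difference each time.
    boot_means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = boot_means[int((alpha / 2) * n_resamples)]
    hi = boot_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Placeholder scores for 12 tasks (illustrative only).
trained = [0.80, 0.75, 0.90, 0.60, 0.85, 0.70, 0.95, 0.65, 0.80, 0.75, 0.90, 0.70]
week10 = [0.55, 0.60, 0.70, 0.50, 0.60, 0.45, 0.75, 0.50, 0.55, 0.50, 0.65, 0.55]

delta, lo, hi = paired_bootstrap_ci(trained, week10)
print(f"delta={delta:+.4f}  95% CI=[{lo:+.4f}, {hi:+.4f}]")
```

An interval that excludes zero corresponds to a result like Delta A; an interval that straddles zero corresponds to a null result like Delta B.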
Usage
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = PeftModel.from_pretrained(base, "bonneyjr/tenacious-judge-lora-v0.1")
Or use the scoring evaluator directly (no GPU needed for the offline path):
python evaluator/scoring_evaluator.py \
--task tenacious_bench_v0.1/dev/tasks.jsonl#TB-DEV-007 \
--candidate your_draft.txt \
--judge offline
Hyperparameters
| Hyperparameter | Value | Justification |
|---|---|---|
| Backbone | Qwen/Qwen2.5-1.5B-Instruct | Best open 1–2 B RLHF-ready model as of 2026-04 |
| Method | SimPO (CPOTrainer, loss_type="simpo") | Reference-free; no ref model memory overhead |
| LoRA rank r | 16 | Matches Meng et al. §4.1 ablation sweet spot |
| LoRA alpha | 32 | Standard 2× r convention |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj | Full attention + FFN projections |
| Learning rate | 5e-6 | Conservative; SimPO is sensitive to lr |
| SimPO β | 2.0 | Meng Table 2 recommendation |
| SimPO γ (margin) | 0.5 | Meng default |
| Batch size | 4 (2 micro × 2 grad accum) | T4 16 GB VRAM |
| Epochs | 3 | Early stopping patience 2 |
| Max seq length | 1024 | Covers full prompt + chosen/rejected |
| Seed | 20260422 | Fixed for reproducibility |
| Wall time | ≈ 35 min (T4) | Cost: $0 (Colab free tier) |
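For readers unfamiliar with the objective, SimPO scores a preference pair with `-log σ(β · (avg_logp_chosen − avg_logp_rejected) − γ)`, where each `avg_logp` is the length-normalized (mean per-token) log-probability of a response under the policy. The sketch below evaluates that formula in plain Python with the β and γ from the table. It is an illustration of the formula, not the TRL implementation, and any token log-probabilities passed in are the caller's own.

```python
import math

BETA = 2.0   # SimPO β from the hyperparameter table
GAMMA = 0.5  # SimPO target margin γ from the hyperparameter table


def avg_logprob(token_logprobs):
    """Length-normalized sequence log-probability (mean over tokens)."""
    return sum(token_logprobs) / len(token_logprobs)


def simpo_loss(chosen_token_logprobs, rejected_token_logprobs, beta=BETA, gamma=GAMMA):
    """SimPO loss for one pair: -log sigmoid(beta * (avg_lp_chosen - avg_lp_rejected) - gamma)."""
    x = beta * (avg_logprob(chosen_token_logprobs) - avg_logprob(rejected_token_logprobs)) - gamma
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch on the sign for numerical stability.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

Because there is no reference-model term, the loss depends only on the policy's own log-probabilities, which is why the method avoids the memory cost of holding a second model.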
Anti-leakage invariant
Per Li et al. (2025), the model that generates a chosen–rejected pair must come from a different family than the model that judges it.
- Chosen rewrites: deepseek/deepseek-v3.2
- Rejected drafts: anthropic/claude-sonnet-4.6 (eval-tier)
- Training backbone judge: qwen/qwen3-next-80b-a3b
The eval-tier judge (Claude Sonnet 4.6) shares the rejected family but not the chosen family — the form of the rule that matters.
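The invariant can be checked mechanically before training. The sketch below approximates a model's "family" by the provider prefix of its ID (the part before the `/`), which is a simplifying assumption; `family` and `shares_family` are hypothetical helpers, and the eval-tier judge ID is taken to be the same Claude Sonnet 4.6 ID listed for the rejected drafts.

```python
def family(model_id: str) -> str:
    """Approximate a model's family by the provider prefix of its ID (an assumption)."""
    return model_id.split("/", 1)[0].lower()


def shares_family(generator_id: str, judge_id: str) -> bool:
    """True when generator and judge fall in the same (approximate) family."""
    return family(generator_id) == family(judge_id)


CHOSEN_GENERATOR = "deepseek/deepseek-v3.2"
REJECTED_GENERATOR = "anthropic/claude-sonnet-4.6"
TRAINING_JUDGE = "qwen/qwen3-next-80b-a3b"
EVAL_JUDGE = "anthropic/claude-sonnet-4.6"

# The training-time judge shares a family with neither generator.
assert not shares_family(CHOSEN_GENERATOR, TRAINING_JUDGE)
assert not shares_family(REJECTED_GENERATOR, TRAINING_JUDGE)

# The eval-tier judge overlaps with the rejected generator only.
assert shares_family(REJECTED_GENERATOR, EVAL_JUDGE)
assert not shares_family(CHOSEN_GENERATOR, EVAL_JUDGE)
```

A prefix comparison cannot see shared pre-training corpora across providers, so it catches only the explicit form of leakage.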
Known limitations
- tone_marker_drift_long_thread: 35 % failure rate on multi-turn long-thread scenarios (banned-phrase miss at turn 2/3). Workaround: run a deterministic regex check for banned phrases in parallel.
- Shared pre-training corpora: the DeepSeek ↔ Qwen pair is closer to "same family" than the paper's Claude ↔ LLaMA pair (Li et al. note this).
- Ground-truth lossiness: three public-signal lossiness mechanisms are documented in FINAL_REPORT.md, page 2.
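The regex workaround from the first limitation could look like the sketch below. The phrases are placeholders, since the Style Guide v2 list is not reproduced in this card, and `find_banned_phrases` and `gate` are hypothetical names rather than part of the evaluator.

```python
import re

# Placeholder phrases for illustration; the real list lives in Style Guide v2.
BANNED_PHRASES = ["just checking in", "circle back", "touch base"]

# One case-insensitive pattern; whitespace inside a phrase may be any run of spaces or newlines.
_BANNED_RE = re.compile(
    r"\b(?:"
    + "|".join(r"\s+".join(map(re.escape, phrase.split())) for phrase in BANNED_PHRASES)
    + r")\b",
    re.IGNORECASE,
)


def find_banned_phrases(text: str) -> list[str]:
    """Return matched phrases, lowercased and whitespace-normalized, in order of appearance."""
    return [" ".join(m.group(0).lower().split()) for m in _BANNED_RE.finditer(text)]


def gate(text: str, judge_verdict: str) -> str:
    """Force 'regenerate' when the regex finds a banned phrase, whatever the judge said."""
    return "regenerate" if find_banned_phrases(text) else judge_verdict
```

Running the check on every turn of a thread, not only the first draft, targets the turn-2/3 misses described above.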
Citation
@misc{milkessa2026tenaciousjudge,
title = {tenacious-judge-lora-v0.1: SimPO LoRA Judge for B2B Sales-Agent Evaluation},
author = {Milkessa, Oliyad (Atnabon)},
year = {2026},
url = {https://huggingface.co/bonneyjr/tenacious-judge-lora-v0.1},
note = {TRP1 Week 11. Base: Qwen/Qwen2.5-1.5B-Instruct. Trained on bonneyjr/tenacious-bench.}
}
Code MIT; dataset + adapter CC-BY-4.0. Built on the Week 10 Conversion Engine.