tenacious-judge-lora-v0.1

SimPO-trained LoRA judge adapter for Tenacious-specific B2B sales-agent evaluation.


Base model	`Qwen/Qwen2.5-1.5B-Instruct`
Method	SimPO (Meng, Xia & Chen 2024) via TRL `CPOTrainer(loss_type="simpo")`
LoRA rank	r = 16, α = 32
Seed	`20260422`
Training cost	Colab T4, ≈ 35 min, $0
Adapter size	≈ 65 MB
Dataset	bonneyjr/tenacious-bench
Code	github.com/atnabon/sales-eval-bench

What it does

This adapter gates the Week 10 Tenacious Conversion Engine generator. Given a task brief and a candidate email draft, the judge emits a pass / regenerate verdict and a rubric breakdown across four scoring components:

Component	Weight	What it checks
`banned_phrases`	0.30	Style Guide v2 banned-phrase list
`grounding`	0.35	LOW-confidence signals are phrased as questions, not asserted as facts
`tone`	0.20	Direct/Honest tone markers
`structural`	0.15	Word count, calendar link / handoff
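
The weights above could be combined into a single verdict roughly as follows. This is a minimal sketch, not the evaluator's actual code: it assumes each component is scored in [0, 1] and aggregated as a linear weighted sum, and the `PASS_THRESHOLD` value and the `aggregate` helper are hypothetical, since the card does not state a cut-off.

```python
# Weights copied from the rubric table; they sum to 1.0.
WEIGHTS = {
    "banned_phrases": 0.30,
    "grounding": 0.35,
    "tone": 0.20,
    "structural": 0.15,
}

# Hypothetical cut-off: the card does not state the pass threshold.
PASS_THRESHOLD = 0.75


def aggregate(component_scores: dict[str, float]) -> tuple[float, str]:
    """Return the weighted total and a pass / regenerate verdict."""
    missing = WEIGHTS.keys() - component_scores.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    total = sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)
    verdict = "pass" if total >= PASS_THRESHOLD else "regenerate"
    return total, verdict


# Illustrative scores only: a draft that half-fails grounding.
total, verdict = aggregate(
    {"banned_phrases": 1.0, "grounding": 0.5, "tone": 1.0, "structural": 1.0}
)
print(f"{total:.3f} {verdict}")
```

Because `grounding` carries the largest weight, a draft that fails it outright cannot reach 0.75 under this assumed threshold even with perfect scores elsewhere.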

Headline results

Evaluated on the Tenacious-Bench v0.1 held-out split (12 tasks, interim):

Delta	Description	Δ	95 % CI	p
A	Trained gate vs. Week 10 baseline	+0.2188	[+0.1177, +0.3198]	< 0.0001
B	Trained gate vs. prompt-engineered same backbone	+0.0187	[-0.0188, +0.0750]	0.71 → null

Delta B is a null result, stated plainly as the challenge brief requires: its 95 % CI spans zero, so a careful prompt on the same backbone is statistically indistinguishable from the trained adapter on quality alone. The adapter's production case therefore rests on cost predictability (no eval-tier API), latency tail (no jitter), and auditability (a versioned artifact rather than a mutable prompt).
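
For readers who want to reproduce intervals like the ones in the table, one common approach is a paired percentile bootstrap over per-task score differences. The card does not say which procedure produced its CIs, so the sketch below is an assumption; `paired_bootstrap_ci` is a hypothetical helper, and the scores are made-up illustrative values, not the Tenacious-Bench results.

```python
import random


def paired_bootstrap_ci(a, b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences a_i - b_i."""
    if len(a) != len(b):
        raise ValueError("a and b must hold one score per task, in the same order")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Made-up per-task scores for 12 tasks (NOT the card's data).
trained = [0.80, 0.70, 0.90, 0.60, 0.75, 0.85, 0.70, 0.65, 0.90, 0.80, 0.70, 0.75]
prompted = [0.78, 0.72, 0.85, 0.62, 0.70, 0.86, 0.68, 0.66, 0.88, 0.79, 0.71, 0.70]

mean_delta, lo, hi = paired_bootstrap_ci(trained, prompted)
is_null = lo <= 0.0 <= hi  # an interval that spans zero is read as a null result
print(f"delta={mean_delta:+.4f} CI=[{lo:+.4f}, {hi:+.4f}] null={is_null}")
```

With only 12 held-out tasks, an interval of this kind is wide, which is consistent with the card labelling the evaluation as interim.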

Cost-Pareto: +$0.00060/task (+21 %), +279 ms/task (+46 %).
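
The absolute and relative overheads quoted above imply approximate reference values, which can be recovered by dividing one by the other. The sketch below is plain arithmetic on the card's figures; the card does not name the reference configuration, and because the percentages are rounded the results are approximate. `implied_baseline` is a hypothetical helper.

```python
def implied_baseline(abs_delta: float, rel_delta: float) -> float:
    """Reference value implied by an absolute overhead and its relative size."""
    return abs_delta / rel_delta


cost_base = implied_baseline(0.00060, 0.21)   # USD per task
latency_base = implied_baseline(279.0, 0.46)  # milliseconds per task

print(f"cost:    ~${cost_base:.5f} -> ~${cost_base + 0.00060:.5f} per task")
print(f"latency: ~{latency_base:.0f} ms -> ~{latency_base + 279.0:.0f} ms per task")
```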

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen2.5-1.5B-Instruct"

# Load the base model and its tokenizer.
base = AutoModelForCausalLM.from_pretrained(BASE_ID)
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)

# Attach the LoRA judge adapter on top of the base weights.
model = PeftModel.from_pretrained(base, "bonneyjr/tenacious-judge-lora-v0.1")
model.eval()  # inference only: disables dropout

Or use the scoring evaluator directly (no GPU needed for the offline path):

python evaluator/scoring_evaluator.py \
  --task tenacious_bench_v0.1/dev/tasks.jsonl#TB-DEV-007 \
  --candidate your_draft.txt \
  --judge offline

Hyperparameters

Hyperparameter	Value	Justification
Backbone	`Qwen/Qwen2.5-1.5B-Instruct`	Best open 1–2 B RLHF-ready model as of 2026-04
Method	SimPO (`CPOTrainer`, `loss_type="simpo"`)	Reference-free; no ref model memory overhead
LoRA rank r	16	Matches Meng et al. §4.1 ablation sweet spot
LoRA alpha	32	Standard 2× r convention
LoRA target modules	`q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj`	Full attention + FFN (MLP) projections
Learning rate	5e-6	Conservative; SimPO is sensitive to lr
SimPO β	2.0	Meng Table 2 recommendation
SimPO γ (margin)	0.5	Meng default
Batch size	4 (2 micro × 2 grad accum)	T4 16 GB VRAM
Epochs	3	Early stopping patience 2
Max seq length	1024	Covers full prompt + chosen/rejected
Seed	20260422	Fixed for reproducibility
Wall time	≈ 35 min (T4)	Cost: $0 (Colab free tier)
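
As a rough guide, the table maps onto PEFT and TRL configuration fields as sketched below. The values come from the table; the keyword names follow `peft.LoraConfig` and `trl.CPOConfig`, while `task_type` and the omission of `cpo_alpha` are assumptions (the card does not state either; TRL's documentation describes `cpo_alpha=0.0` for plain SimPO). The dictionaries are kept as plain data so the sketch runs without either library installed, and `simpo_pair_loss` is a hypothetical helper showing what β and γ do for a single preference pair.

```python
import math

# Would be passed as peft.LoraConfig(**lora_kwargs).
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # feed-forward projections
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the card
}

# Would be passed as trl.CPOConfig(output_dir=..., **cpo_kwargs).
cpo_kwargs = {
    "loss_type": "simpo",
    "beta": 2.0,
    "simpo_gamma": 0.5,
    "learning_rate": 5e-6,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 3,
    "max_length": 1024,
    "seed": 20260422,
}

# 2 micro-batches x 2 accumulation steps = the table's batch size of 4.
effective_batch = (
    cpo_kwargs["per_device_train_batch_size"]
    * cpo_kwargs["gradient_accumulation_steps"]
)


def simpo_pair_loss(avg_logp_chosen, avg_logp_rejected, beta=2.0, gamma=0.5):
    """SimPO loss for one pair, given length-normalised log-probabilities."""
    margin = beta * (avg_logp_chosen - avg_logp_rejected) - gamma
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


print(effective_batch, round(simpo_pair_loss(-1.0, -1.5), 4))
```

The loss only falls below log 2 once the scaled log-probability gap exceeds γ, which is how the margin pushes chosen and rejected responses apart.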

Anti-leakage invariant

Per Li et al. (2025), the model that generates a chosen–rejected pair must come from a different family than the model that judges it.

Chosen rewrites: deepseek/deepseek-v3.2
Rejected drafts: anthropic/claude-sonnet-4.6 (eval-tier)
Training backbone judge: qwen/qwen3-next-80b-a3b

The eval-tier judge (Claude Sonnet 4.6) shares a family with the rejected drafts but not with the chosen rewrites, which is the form of the rule that matters.
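
The invariant can be checked mechanically before a training run. The sketch below is one possible check, not the project's own code: it treats the organisation prefix of a model ID as its "family", which is a simplifying assumption, and `family` and `judge_is_independent` are hypothetical helpers.

```python
def family(model_id: str) -> str:
    """Approximate a model's family by its organisation prefix (assumption)."""
    return model_id.split("/", 1)[0].lower()


def judge_is_independent(chosen_generator: str, judge: str) -> bool:
    """True when the judge comes from a different family than the chosen rewrites."""
    return family(chosen_generator) != family(judge)


CHOSEN = "deepseek/deepseek-v3.2"
REJECTED = "anthropic/claude-sonnet-4.6"
TRAINING_JUDGE = "qwen/qwen3-next-80b-a3b"
EVAL_JUDGE = "anthropic/claude-sonnet-4.6"

for judge in (TRAINING_JUDGE, EVAL_JUDGE):
    print(judge, "independent of chosen:", judge_is_independent(CHOSEN, judge))

# The eval-tier judge does share a family with the rejected drafts.
print("eval judge shares rejected family:", family(EVAL_JUDGE) == family(REJECTED))
```

A prefix comparison cannot detect models from different organisations that share pre-training data, so it is a necessary check rather than a sufficient one.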

Known limitations

tone_marker_drift_long_thread: 35 % failure rate on multi-turn long-thread scenarios (banned phrases missed at turn 2/3). Workaround: run a deterministic regex check for banned phrases in parallel with the judge.
Shared pre-training corpora: the DeepSeek ↔ Qwen pair is closer to "same family" than the paper's Claude ↔ LLaMA pair (Li et al. note this).
Ground-truth lossiness: three public-signal lossiness mechanisms documented in FINAL_REPORT.md Page 2.
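
The regex workaround for the first limitation could look like the sketch below. It is illustrative only: the phrases in `BANNED_PHRASES` are placeholders, because the Style Guide v2 list is not reproduced in this card, and `compile_banned`, `find_banned`, and `scan_thread` are hypothetical helpers.

```python
import re

# Placeholder phrases: the real Style Guide v2 list is not part of this card.
BANNED_PHRASES = ["just checking in", "circle back", "touch base"]


def compile_banned(phrases):
    """Build one case-insensitive pattern matching any phrase on word boundaries."""
    alternation = "|".join(re.escape(p) for p in phrases)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def find_banned(text, pattern):
    """Return every banned phrase found in text, lower-cased, in order."""
    return [m.group(0).lower() for m in pattern.finditer(text)]


def scan_thread(turns, pattern):
    """Map 1-based turn number to banned phrases found, skipping clean turns."""
    hits = {}
    for number, text in enumerate(turns, start=1):
        found = find_banned(text, pattern)
        if found:
            hits[number] = found
    return hits


pattern = compile_banned(BANNED_PHRASES)
thread = [
    "Saw your team is hiring two data engineers. Is that for a new platform?",
    "Just checking in on my note from last week.",
    "Happy to circle back next quarter if timing is better.",
]
print(scan_thread(thread, pattern))
```

Scanning every turn separately targets the failure mode described above, where the miss occurs at later turns of a long thread.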

Citation

@misc{milkessa2026tenaciousjudge,
  title        = {tenacious-judge-lora-v0.1: SimPO LoRA Judge for B2B Sales-Agent Evaluation},
  author       = {Milkessa, Oliyad (Atnabon)},
  year         = {2026},
  url          = {https://huggingface.co/bonneyjr/tenacious-judge-lora-v0.1},
  note         = {TRP1 Week 11. Base: Qwen/Qwen2.5-1.5B-Instruct. Trained on bonneyjr/tenacious-bench.}
}

Code MIT; dataset + adapter CC-BY-4.0. Built on the Week 10 Conversion Engine.
