tenacious-judge-lora-v0.1

SimPO-trained LoRA judge adapter for Tenacious-specific B2B sales-agent evaluation.

Base model Qwen/Qwen2.5-1.5B-Instruct
Method SimPO (Meng, Xia & Chen 2024) via TRL CPOTrainer(loss_type="simpo")
LoRA rank r = 16, α = 32
Seed 20260422
Training cost Colab T4, ≈ 35 min, $0
Adapter size ≈ 65 MB
Dataset bonneyjr/tenacious-bench
Code github.com/atnabon/sales-eval-bench

What it does

This adapter gates the Week 10 Tenacious Conversion Engine generator. Given a task brief and a candidate email draft, the judge emits a pass / regenerate verdict and a rubric breakdown across four scoring components:

Component Weight What it checks
banned_phrases 0.30 Style Guide v2 banned-phrase list
grounding 0.35 LOW-confidence signals asked, not asserted
tone 0.20 Direct/Honest tone markers
structural 0.15 Word count, calendar link / handoff

Headline results

Evaluated on Tenacious-Bench v0.1 held-out (12 tasks, interim):

Delta Description Δ 95 % CI p
A Trained gate vs. Week 10 baseline +0.2188 [+0.1177, +0.3198] < 0.0001
B Trained gate vs. prompt-engineered same backbone +0.0187 [-0.0188, +0.0750] 0.71 → null

Delta B is a null result — stated plainly per challenge brief. A careful prompt on the same backbone is statistically indistinguishable from the trained adapter on quality alone. The adapter's production case rests on cost predictability (no eval-tier API), latency tail (no jitter), and auditability (versioned artifact, not a mutable prompt).

Cost-Pareto: +$0.00060/task (+21 %), +279 ms/task (+46 %).


Usage

from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = PeftModel.from_pretrained(base, "bonneyjr/tenacious-judge-lora-v0.1")

Or use the scoring evaluator directly (no GPU needed for the offline path):

python evaluator/scoring_evaluator.py \
  --task tenacious_bench_v0.1/dev/tasks.jsonl#TB-DEV-007 \
  --candidate your_draft.txt \
  --judge offline

Hyperparameters

Hyperparameter Value Justification
Backbone Qwen/Qwen2.5-1.5B-Instruct Best open 1–2 B RLHF-ready model as of 2026-04
Method SimPO (CPOTrainer, loss_type="simpo") Reference-free; no ref model memory overhead
LoRA rank r 16 Matches Meng et al. §4.1 ablation sweet spot
LoRA alpha 32 Standard 2× r convention
LoRA target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj Full attention + FFT
Learning rate 5e-6 Conservative; SimPO is sensitive to lr
SimPO β 2.0 Meng Table 2 recommendation
SimPO γ (margin) 0.5 Meng default
Batch size 4 (2 micro × 2 grad accum) T4 16 GB VRAM
Epochs 3 Early stopping patience 2
Max seq length 1024 Covers full prompt + chosen/rejected
Seed 20260422 Fixed for reproducibility
Wall time ≈ 35 min (T4) Cost: $0 (Colab free tier)

Anti-leakage invariant

Per Li et al. (2025), the model that generates a chosen–rejected pair must come from a different family than the model that judges it.

  • Chosen rewrites: deepseek/deepseek-v3.2
  • Rejected drafts: anthropic/claude-sonnet-4.6 (eval-tier)
  • Training backbone judge: qwen/qwen3-next-80b-a3b

The eval-tier judge (Claude Sonnet 4.6) shares the rejected family but not the chosen family — the form of the rule that matters.


Known limitations

  1. tone_marker_drift_long_thread — 35 % failure rate on multi-turn long-thread scenarios (turn-2/3 banned-phrase miss). Workaround: deploy a deterministic regex check for banned phrases in parallel.
  2. Shared pre-training corpora — DeepSeek ↔ Qwen pair is closer to "same family" than the paper's Claude ↔ LLaMA pair (Li et al. note this).
  3. Ground-truth lossiness: three public-signal lossiness mechanisms documented in FINAL_REPORT.md Page 2.

Citation

@misc{milkessa2026tenaciousjudge,
  title        = {tenacious-judge-lora-v0.1: SimPO LoRA Judge for B2B Sales-Agent Evaluation},
  author       = {Milkessa, Oliyad (Atnabon)},
  year         = {2026},
  url          = {https://huggingface.co/bonneyjr/tenacious-judge-lora-v0.1},
  note         = {TRP1 Week 11. Base: Qwen/Qwen2.5-1.5B-Instruct. Trained on bonneyjr/tenacious-bench.}
}

Code MIT; dataset + adapter CC-BY-4.0. Built on the Week 10 Conversion Engine.

Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for bonneyjr/tenacious-judge-lora-v0.1

Adapter
(981)
this model