---
license: cc-by-4.0
language:
- en
base_model: unsloth/Qwen2.5-1.5B-Instruct
tags:
- judge
- b2b-sales
- orpo
- lora
- preference-learning
- tenacious-bench
- evaluation
- qwen2.5
- unsloth
datasets:
- rafiakedir/tenacious-bench-v0.1
---

# Tenacious-Bench Judge: ORPO LoRA Adapter (Qwen2.5-1.5B)

A rubric-aware scoring judge for B2B outbound sales emails, trained with ORPO on [Tenacious-Bench v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1) preference pairs. It is deployed as a **rejection-sampling gate** in the Tenacious Conversion Engine.

- **Base model:** `unsloth/Qwen2.5-1.5B-Instruct`
- **Adapter type:** LoRA (PEFT); load the base model, then apply `PeftModel.from_pretrained`
- **Training algorithm:** ORPO (no reference model required)
- **Precision:** 4-bit quantized during training (Unsloth), fp16 for inference
- **Training data:** 94 ORPO preference pairs from `rafiakedir/tenacious-bench-v0.1` (train split)
- **Training:** 3 epochs · 36 steps · lr=8e-6 · beta=0.1 · LoRA r=16 alpha=32

---

## What It Scores

| Dimension | Trigger Rate (Week 10 probes) | Risk if Missed |
|---|---|---|
| `signal_grounding_fidelity` | 35% | CTO credibility loss |
| `competitor_gap_honesty` | 45% | Irreversible brand damage |
| `icp_segment_appropriateness` | 20% | ~$480K ACV per error |
| `tone_preservation` | 15% | Brand voice violation |
| `bench_commitment_honesty` | 5% | SOW-breach / delivery failure |

---

## Quick Start: Inference

```python
import json

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "unsloth/Qwen2.5-1.5B-Instruct"
ADAPTER_ID = "rafiakedir/tenacious-bench-adapter"

# The adapter repo ships its own tokenizer files (ChatML format).
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)
# Attach the LoRA weights to the fp16 base model.
model = PeftModel.from_pretrained(base, ADAPTER_ID)
model.eval()

JUDGE_SYSTEM = (
    "You are a rubric-aware judge for Tenacious Consulting B2B outbound sales emails. "
    "Given a task context and a candidate email, score it on the specified rubric dimension. "
    "Respond with a JSON object only:\n"
    '{"dimension": "", "score": <0.0-1.0>, "pass": , "reasoning": ""}'
)


def judge(email, context, dimension):
    """Score one email on one rubric dimension and return the parsed JSON verdict."""
    user = (
        f"EVALUATION DIMENSION: {dimension}\n\n"
        f"TASK CONTEXT:\n{context}\n\n"
        f"CANDIDATE EMAIL:\n{email}\n\n"
        f"Score this email on the {dimension} dimension."
    )
    msgs = [
        {"role": "system", "content": JUDGE_SYSTEM},
        {"role": "user", "content": user},
    ]
    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=128,
            temperature=0.1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, not the prompt.
    resp = tokenizer.decode(
        out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    ).strip()
    # Extract the outermost {...} span; fall back to a neutral score if it is
    # missing or is not valid JSON.
    s, e = resp.find("{"), resp.rfind("}") + 1
    if 0 <= s < e:
        try:
            return json.loads(resp[s:e])
        except json.JSONDecodeError:
            pass
    return {"score": 0.5, "raw": resp[:200]}


result = judge(
    email="Casey, TalentBridge has 8 open AI/ML roles this quarter. 30-min scoping call: calendly.com/tenacious",
    context="company: TalentBridge, stage: Series A, open_roles: 8, confidence: high",
    dimension="signal_grounding_fidelity",
)
print(result)
```

---

## Training Details

| Parameter | Value |
|---|---|
| Base model | `unsloth/Qwen2.5-1.5B-Instruct` (4-bit during training) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Target modules | q_proj, v_proj |
| LoRA dropout | 0.05 |
| Learning rate | 8e-6 |
| Effective batch size | 8 (batch=2, grad_accum=4) |
| Epochs | 3 |
| Total steps | 36 |
| ORPO beta | 0.1 |
| Max sequence length | 1024 |
| Seed | 42 |

**Training loss:** 2.8676 → 2.9646 → 2.9386 (3 checkpoints)

**Reward accuracy:** 0.5375 → 0.6026 → 0.5128

**Training data:** 94 preference pairs from the train partition. To prevent preference leakage, the pair generator (DeepSeek V3.2) comes from a different model family than the judge (Claude Sonnet 4.6 / `scoring_evaluator.py`). All generation decisions are logged in the dataset repo at `training_data/generation_log.jsonl`.

---

## Evaluation Results

Evaluated on 59 held-out tasks from `rafiakedir/tenacious-bench-v0.1`. Full results are in `ablation_results.json` in the dataset repo.

**Deployment recommendation:** Run `ablations/run_ablations.py` with this adapter to get Delta A. The ablation script loads this adapter from the Hugging Face Hub and requires a GPU plus `transformers` and `peft`.

---

## Known Limitations

1. **Dimension coverage gap.** The training set contains 0 pairs for `bench_commitment_honesty` and only 4 for `icp_segment_appropriateness`, due to a scoring-key mismatch during pair construction. The model therefore received no gradient signal on bench commitment honesty.
2. **Backbone below Prometheus-2 threshold.** Prometheus-2 demonstrated rubric-matching at 7B+ parameters. At 1.5B, the model may generalize poorly across multiple dimensions.
3. **Synthetic training distribution.** All pairs derive from synthetic prospect briefs and LLM-generated emails.
4. **Static bench_summary.** Judge calibration drifts as the real bench composition changes weekly.

---

## Files

| File | Description |
|---|---|
| `adapter_config.json` | LoRA configuration (r=16, alpha=32, q_proj+v_proj) |
| `adapter_model.safetensors` | Trained LoRA weights (8.4 MB) |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer (ChatML format) |
| `run_on_colab.ipynb` | End-to-end training + push notebook |
| `train_judge.py` | Training script |
| `inference_example.py` | Per-dimension and all-dimension scoring helper |

Training data: [rafiakedir/tenacious-bench-v0.1](https://huggingface.co/datasets/rafiakedir/tenacious-bench-v0.1)

---

## Citation

```bibtex
@misc{tenacious-bench-adapter-2026,
  title  = {Tenacious-Bench Judge: ORPO LoRA Adapter for B2B Sales Evaluation},
  author = {Kedir, Rafia},
  year   = {2026},
  url    = {https://huggingface.co/rafiakedir/tenacious-bench-adapter}
}
```
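
---

## Example: Rejection-Sampling Gate (Sketch)

The card describes this judge as a rejection-sampling gate but does not show the gating loop. The sketch below is one plausible shape for it, not the Tenacious Conversion Engine's actual implementation. The names `gate`, `rejection_sample`, `stub_judge`, and `stub_generator`, the 0.5 threshold, and the 4-attempt cap are all assumptions made for illustration. The stubs stand in for the Quick Start `judge` function and a real email generator so the example runs without a GPU.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Verdict = Dict[str, object]
JudgeFn = Callable[[str, str, str], Verdict]   # (email, context, dimension) -> verdict
GenerateFn = Callable[[str, int], str]         # (context, attempt) -> draft email

# The five rubric dimensions listed in "What It Scores".
DIMENSIONS = (
    "signal_grounding_fidelity",
    "competitor_gap_honesty",
    "icp_segment_appropriateness",
    "tone_preservation",
    "bench_commitment_honesty",
)


def gate(email: str, context: str, judge_fn: JudgeFn,
         dimensions: Sequence[str] = DIMENSIONS,
         threshold: float = 0.5) -> Tuple[bool, Dict[str, float]]:
    """Accept an email only if every dimension scores at or above the threshold."""
    scores: Dict[str, float] = {}
    for dim in dimensions:
        verdict = judge_fn(email, context, dim)
        if "raw" in verdict:
            # Quick Start's parse-failure fallback: treat it as a rejection
            # (an assumed policy, not one stated in the card).
            scores[dim] = 0.0
        else:
            scores[dim] = float(verdict.get("score", 0.0))
    return all(s >= threshold for s in scores.values()), scores


def rejection_sample(generate_fn: GenerateFn, judge_fn: JudgeFn, context: str,
                     max_attempts: int = 4,
                     threshold: float = 0.5) -> Tuple[Optional[str], Dict[str, float]]:
    """Draw drafts until one passes the gate; return (None, {}) if none does."""
    for attempt in range(max_attempts):
        email = generate_fn(context, attempt)
        accepted, scores = gate(email, context, judge_fn, threshold=threshold)
        if accepted:
            return email, scores
    return None, {}


# Hypothetical stand-ins so the sketch runs without loading the model.
def stub_judge(email: str, context: str, dimension: str) -> Verdict:
    ok = "TalentBridge" in email  # toy rule: the draft must name the company
    return {"dimension": dimension, "score": 0.9 if ok else 0.2, "pass": ok}


def stub_generator(context: str, attempt: int) -> str:
    drafts = [
        "Hi there, we should talk.",
        "Casey, TalentBridge has 8 open AI/ML roles this quarter.",
    ]
    return drafts[min(attempt, len(drafts) - 1)]


email, scores = rejection_sample(stub_generator, stub_judge, "company: TalentBridge")
print(email)
print(scores)
```

To use the real adapter, pass the Quick Start `judge` function as `judge_fn`. Given the coverage gap noted under Known Limitations, it may be reasonable to drop `bench_commitment_honesty` from `dimensions` or to check that dimension with a separate rule.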