🤖 Tenacious Judge LoRA — B2B Sales Outreach Pre-Send Judge

Base model: unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit
Adapter type: LoRA (ORPO)
Author: Bethelhem Abay · 10 Academy TRP1
Date: 2026-05-02
License: MIT

A LoRA adapter that teaches a small LLM when NOT to send a B2B sales email: 85.2% accuracy on sealed held-out data, trained in about 17 minutes on a free-tier Colab T4.

🔗 Quick Links

Resource	Link
🤖 Model (this page)	bethelhem21/tenacious-judge-lora
📦 Training Dataset	bethelhem21/tenacious-bench
💻 GitHub Repository	bettyabay/tenacious-bench
📝 Blog Post	Teaching a Sales Agent When NOT to Act

Overview

What is this model?

Tenacious Judge is a LoRA adapter on top of Qwen2.5-1.5B-Instruct (4-bit quantised via Unsloth), fine-tuned with ORPO on the Tenacious-Bench preference dataset.

It sits between the Tenacious Conversion Engine and the email send queue. Before any B2B outreach is dispatched, the judge evaluates the proposed action against a 7-rule rubric and returns a decision:

Decision	Meaning
`SUPPRESS`	Disqualifier or opt-out signal present — block all outreach
`ESCALATE`	C-level recipient at >2,000-headcount company — route to human account executive
`BLOCK`	Cross-thread context leak or low-confidence funding cited as fact
`PENALISE`	Generic peer company names — flag for human review
`PASS`	No rule violations — approve for dispatch

Key Results

Metric	Value
Held-out accuracy	85.2% (52 / 61 pairs)
95% confidence interval	[0.77, 0.93]
Training time	~17 minutes on Colab T4 (free tier)
Compute	Colab T4 free tier (~$1.50 total incl. data synthesis)
Adapter size	~74 MB
Baseline (no judge)	0.0% (0 / 61)

How to Use

Requirements

pip install unsloth transformers torch

Load and Run Inference

from unsloth import FastLanguageModel
import torch
import json

# Load the adapter
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="bethelhem21/tenacious-judge-lora",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

SYSTEM_PROMPT = """You are a sales-outreach judge for Tenacious, a B2B sales agent.
Apply the 7-rule rubric in strict priority order:
1. SUPPRESS if any disqualifier is present (anti_offshore, local_only, do_not_contact, competitor)
2. SUPPRESS if the prospect has opted out of the outreach channel
3. ESCALATE if the recipient is C-level at a company with >2000 employees
4. BLOCK if the output references content from a different prospect's thread
5. BLOCK if a funding amount is cited but funding_confidence is low or insufficient_signal
6. PENALISE if peer company names are generic or reused across prospects
7. PASS if none of the above conditions are triggered
Respond with one of: SUPPRESS | ESCALATE | BLOCK | PENALISE | PASS
Then give a one-sentence rationale."""

# Example: anti-offshore disqualifier case
context = {
    "company": "NearshoreStack Ltd",
    "headcount": 120,
    "funding_stage": "series_a",
    "funding_confidence": "high",
    "disqualifiers": ["anti_offshore"],
    "opt_out_channels": [],
    "recipient_role": "vp_eng",
    "available_signals": {"tech_stack": ["aws", "react"]}
}

agent_output = (
    "Hi Amir, I wanted to introduce Tenacious's offshore engineering placement "
    "service. We've helped similar Series A companies scale their backend teams..."
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Context:\n{json.dumps(context, indent=2)}\n\n"
                                 f"Agent output:\n{agent_output}"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

with torch.no_grad():
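    # Greedy decoding (do_sample=False) keeps the judge deterministic;
    # transformers requires temperature > 0 whenever sampling is enabled.
    outputs = model.generate(input_ids=inputs, max_new_tokens=64, do_sample=False)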

decision = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(decision)
# Expected: SUPPRESS
# Rationale: NearshoreStack has an anti_offshore disqualifier — Rule 1 fires.
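
The decoded string above contains both the label and the rationale, so downstream code needs to separate them. The following is a minimal sketch of one way to do that; `parse_decision` is a hypothetical helper, not part of the adapter or of Unsloth, and it assumes the label appears verbatim in upper case as the system prompt requests.

```python
import re

# Labels taken from the system prompt above.
DECISIONS = ("SUPPRESS", "ESCALATE", "BLOCK", "PENALISE", "PASS")


def parse_decision(text):
    """Return (label, rationale); label is None when no valid label is found."""
    match = re.search(r"\b(" + "|".join(DECISIONS) + r")\b", text)
    if match is None:
        # Treat anything without a recognised label as malformed output.
        return None, text.strip()
    # Drop separators that commonly follow the label, then trim whitespace.
    rationale = text[match.end():].lstrip(" \t\r\n:.-").strip()
    return match.group(1), rationale


label, rationale = parse_decision("SUPPRESS\nRationale: anti_offshore disqualifier present.")
print(label)
```

A `None` label is a natural signal to count toward the malformed-output rate listed under the kill-switch conditions.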

All Five Decision Paths

# SUPPRESS — disqualifier present
context_suppress = {
    "disqualifiers": ["anti_offshore"], "opt_out_channels": [],
    "headcount": 80, "recipient_role": "cto", "funding_confidence": "high"
}

# ESCALATE — C-level at large company
context_escalate = {
    "disqualifiers": [], "opt_out_channels": [],
    "headcount": 5000, "recipient_role": "c_level", "funding_confidence": "high"
}

# BLOCK — cross-thread leak
context_block = {
    "disqualifiers": [], "opt_out_channels": [],
    "headcount": 200, "recipient_role": "vp_eng", "funding_confidence": "high",
    "thread_id": "thread-042",
    "available_signals": {"leaked_thread": "thread-039"}
}

# PENALISE — generic peer names
agent_output_penalise = (
    "We've helped companies like TechCorp and StartupCo scale their teams..."
)

# PASS — clean outreach
context_pass = {
    "disqualifiers": [], "opt_out_channels": [],
    "headcount": 300, "recipient_role": "vp_eng", "funding_confidence": "high"
}
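
To sanity-check the expected label for each context, the rubric's priority order can be sketched as plain Python over the structured fields. This is an illustration only, not the deterministic rule layer used in production: the `C_LEVEL_ROLES` set, the `channel`, `cites_funding` and `generic_peers` arguments, and the use of a `leaked_thread` signal as the cross-thread indicator are all assumptions made for this example.

```python
# Hypothetical role labels treated as C-level; the card does not enumerate them.
C_LEVEL_ROLES = {"c_level", "ceo", "cto", "cfo", "coo"}
LOW_CONFIDENCE = {"low", "insufficient_signal"}


def rule_based_decision(context, channel="email", cites_funding=False,
                        generic_peers=False):
    """Apply rules 1-7 in strict priority order; the first match wins."""
    if context.get("disqualifiers"):                          # Rule 1
        return "SUPPRESS"
    if channel in context.get("opt_out_channels", []):        # Rule 2
        return "SUPPRESS"
    if (context.get("recipient_role") in C_LEVEL_ROLES
            and context.get("headcount", 0) > 2000):          # Rule 3
        return "ESCALATE"
    if "leaked_thread" in context.get("available_signals", {}):  # Rule 4
        return "BLOCK"
    if cites_funding and context.get("funding_confidence") in LOW_CONFIDENCE:
        return "BLOCK"                                        # Rule 5
    if generic_peers:                                         # Rule 6
        return "PENALISE"
    return "PASS"                                             # Rule 7


# A CTO at an 80-person company with a disqualifier: Rule 1 fires before Rule 3.
example = {"disqualifiers": ["anti_offshore"], "opt_out_channels": [],
           "headcount": 80, "recipient_role": "cto", "funding_confidence": "high"}
print(rule_based_decision(example))
```

Rules 4 to 6 depend on the agent's text in practice, which is why the learned judge exists; the boolean flags here only stand in for that analysis.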

Training Details

Method: Why ORPO over DPO?

ORPO (Odds-Ratio Preference Optimisation, Hong et al., 2024) was chosen for three concrete reasons:

No reference model. DPO requires keeping the original model in memory alongside the trained model, consuming 3–4 GB additional VRAM. On a Colab T4 (15 GB total), this ruled out 7B+ base models. ORPO eliminates the reference model entirely.
Combined SFT + preference in one pass. DPO is normally applied after a separate supervised fine-tuning stage. ORPO folds both objectives into a single training loop, so the adapter is trained in one pass instead of two.
Better performance on small datasets. Published benchmarks show ORPO outperforms DPO when training data is < 500 pairs. The Tenacious-Bench training set has 169 pairs — well within this regime.
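
The single-pass objective can be illustrated numerically. The sketch below follows the loss as described in Hong et al. (2024): a supervised term on the chosen response plus a weighted odds-ratio penalty, with no reference model involved. The function names and the weight `lam=0.1` are illustrative assumptions; the card does not state the weight used in training.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); avg_logp must be negative."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_logp, rejected_logp, lam=0.1):
    """SFT term on the chosen response plus lam times the odds-ratio term.

    Inputs are length-normalised (per-token average) log-probabilities.
    """
    sft = -chosen_logp
    log_odds_ratio = log_odds(chosen_logp) - log_odds(rejected_logp)
    # -log(sigmoid(z)) written as log(1 + exp(-z))
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return sft + lam * odds_ratio_term


# The loss is lower when the chosen response is the more likely one.
print(orpo_loss(-0.5, -2.0) < orpo_loss(-2.0, -0.5))
```

Because both terms come from the same forward passes of one model, nothing besides the policy has to be held in VRAM.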

Training Configuration

Parameter	Value
Base model	`unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit`
LoRA rank (r)	16
LoRA alpha	16
LoRA dropout	0
Target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
Trainable parameters	18,087,936 (1.18% of total)
Training pairs	169 (train split)
Eval pairs	93 (dev split)
Epochs	10
Total steps	200
Effective batch size	8
Learning rate	8e-5
Optimizer	AdamW (8-bit via Unsloth)
Max sequence length	2,048 tokens
Random seed	3407
Hardware	Google Colab T4 (free tier, 15 GB VRAM)
Training time	~17 minutes (1,009 seconds)
Adapter size on disk	~74 MB
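
As a quick consistency check on the table, the step count, batch size and dataset size can be related with a few lines of arithmetic. The variable names are illustrative; the numbers are copied from the table above.

```python
import math

train_pairs = 169          # training pairs (train split)
effective_batch_size = 8   # effective batch size
total_steps = 200          # total optimiser steps
training_seconds = 1009    # reported wall-clock training time

steps_per_epoch = math.ceil(train_pairs / effective_batch_size)
passes_over_data = total_steps * effective_batch_size / train_pairs
seconds_per_step = training_seconds / total_steps

print(steps_per_epoch)              # 22
print(round(passes_over_data, 2))   # 9.47
print(round(seconds_per_step, 1))   # 5.0
```

Two hundred steps therefore cover the training split roughly nine and a half times, in line with the ten epochs listed, at about five seconds per step on the T4.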

Training Progression

Step	Train Loss	Eval Loss	Rewards Accuracy	Rewards Margin
0	3.25	—	—	—
40	~0.85	~0.90	100%	—
100	~0.35	~0.50	100%	~0.55
200	0.1411	0.3851	100%	0.6934

Loss reduced by 95.7% over 200 steps
100% preference accuracy (rewards) reached at step 40 and maintained through step 200
No overfitting detected (eval loss converges and plateaus; does not diverge)

Note: An initial 60-step run achieved only 8.2% held-out accuracy — the model had not converged. The final 200-step run was required to reach 85.2%.

Evaluation Results

Evaluated on 61 sealed held-out pairs — all representing cases where the Conversion Engine made the wrong decision. The judge must correctly identify the failure.

Summary

Variant	Correct	Accuracy	95% CI
No judge (baseline)	0 / 61	0.0%	[0.00, 0.00]
ORPO judge (this model)	52 / 61	85.2%	[0.77, 0.93]

Per-Probe Breakdown

Probe	Failure Description	Held-out	Correct	Accuracy	Status
A07	Anti-offshore disqualifier	6	6	100%	✅
B03	Funding-tier mismatch	6	5	83%	✅
B04	Low-confidence funding	5	5	100%	✅
C02	Bench commitment ignored	6	4	67%	⚠️
C04	Regulatory caveat omitted	6	3	50%	⚠️
D05	Soft rejection doubled down	6	6	100%	✅
E01	Cross-thread context leak	6	6	100%	✅
E02	Generic peer company names	6	4	67%	⚠️
E03	Opt-out channel ignored	6	5	83%	✅
G03	C-level escalation missed	8	8	100%	✅
Total		61	52	85.2%
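
The totals in the last row can be recomputed from the per-probe rows. The sketch below copies the counts from the table; the 70% cut-off used to list weak probes is an assumption chosen for this example, not a threshold defined by the benchmark.

```python
# (probe, held-out pairs, correct), copied from the per-probe table.
RESULTS = [
    ("A07", 6, 6), ("B03", 6, 5), ("B04", 5, 5), ("C02", 6, 4), ("C04", 6, 3),
    ("D05", 6, 6), ("E01", 6, 6), ("E02", 6, 4), ("E03", 6, 5), ("G03", 8, 8),
]

total = sum(n for _, n, _ in RESULTS)
correct = sum(c for _, _, c in RESULTS)
accuracy = correct / total

# Probes below the hypothetical 70% review threshold.
weak = [probe for probe, n, c in RESULTS if c / n < 0.70]

print(total, correct, round(accuracy, 3))  # 61 52 0.852
print(weak)                                # ['C02', 'C04', 'E02']
```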

Confusion Matrix Summary

The 9 misclassified held-out pairs break down as:

Probe	Misses	Root cause
C02	2	No structured `prior_commitments` field; judge infers from prose
C04	3	Regulated-industry examples underrepresented in training
E02	2	Peer-name specificity threshold is subjective without a reference list
B03	1	Borderline funding-tier case with ambiguous signal
E03	1	Partial opt-out (sms only) with email channel active

All 52 correct decisions were true positives — the judge correctly identified the failure class and recommended the right action.

Inference Latency

Based on prefill vs. decode phase analysis (see latency breakdown blog post):

Phase	Bottleneck	Scales with	Cost
Prefill	GPU compute (FLOP/s)	Prompt token count	~0.2 ms/token
Decode	Memory bandwidth (GB/s)	Output token count	~2 ms/token

Observed end-to-end latency: ~200ms on T4 (batch size 1)
Typical token ratio: ~750 prompt tokens / ~60 output tokens (~12:1)
Dominant phase: Prefill (compute-bound; dominates at high prompt-to-output ratio)
Optimization priority: Reduce prompt length, not output length

Reducing average prompt length from ~750 to ~400 tokens is projected to reduce total latency by 30–35%.

This latency profile is designed for async pre-send queues, not real-time filtering.
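
For a back-of-envelope view of how prompt length affects latency, the per-token costs from the table can be plugged into a simple linear model. This is a rough sketch, not a measurement: the per-token figures are rounded, so the estimate will not match the observed ~200 ms exactly, and the card's 30 to 35% projection may rest on measured timings rather than on these rounded costs.

```python
PREFILL_MS_PER_TOKEN = 0.2  # from the latency table
DECODE_MS_PER_TOKEN = 2.0   # from the latency table


def estimate_latency_ms(prompt_tokens, output_tokens):
    """Linear estimate: prefill cost per prompt token plus decode cost per output token."""
    prefill = prompt_tokens * PREFILL_MS_PER_TOKEN
    decode = output_tokens * DECODE_MS_PER_TOKEN
    return prefill + decode


baseline = estimate_latency_ms(750, 60)  # 150 ms prefill + 120 ms decode
shorter = estimate_latency_ms(400, 60)   # 80 ms prefill + 120 ms decode
saving = (baseline - shorter) / baseline

print(round(baseline), round(shorter), round(saving, 2))
```

Under this model, shortening the prompt only reduces the prefill share, so the decode cost of about 120 ms for 60 output tokens sets a floor on the gain.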

Limitations

Active Limitations (v0.1)

1. C02 — Bench commitment probe: 67% accuracy

The context schema has no structured prior_commitments field. The judge must infer commitment windows from free-text rationale strings, which is unreliable for edge cases (e.g., commitment window ending today vs. ending tomorrow). The v0.2 schema adds an explicit prior_commitments: [{starts, ends, type}] array.
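
With a structured field, the edge case above becomes a date comparison instead of free-text inference. The sketch below assumes ISO-format date strings and an inclusive `ends` date, neither of which is confirmed by the v0.2 schema; `active_commitments` and the `bench_hold` type value are hypothetical.

```python
from datetime import date


def active_commitments(prior_commitments, on_day):
    """Return commitments whose [starts, ends] window (inclusive) covers on_day."""
    return [
        c for c in prior_commitments
        if date.fromisoformat(c["starts"]) <= on_day <= date.fromisoformat(c["ends"])
    ]


commitments = [{"starts": "2026-04-01", "ends": "2026-05-02", "type": "bench_hold"}]

# A window ending today is still active; the day after, it is not.
print(len(active_commitments(commitments, date(2026, 5, 2))))  # 1
print(len(active_commitments(commitments, date(2026, 5, 3))))  # 0
```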

2. C04 — Regulated-industry caveat probe: 50% accuracy

Training data included very few regulated-industry examples (SOX post-IPO, GDPR erasure requests, HIPAA-adjacent contexts). The judge cannot reliably detect when a caveat is required for fintech, healthcare, or government prospects. Do not deploy against regulated-industry verticals without retraining on a regulated-industry probe set.

3. Single-turn only

The judge evaluates one context object + one agent output. It cannot detect failures that span a multi-turn conversation thread (e.g., commitment made in message 3 violated in message 7).

4. Signal lag

Ground truth depends on CRM signals recorded at annotation time. A prospect who revoked an opt-out or changed their role after the signal was recorded will be incorrectly classified.

5. English only

All training data is in English. Not validated for non-English outreach.

Kill-Switch Conditions

Remove this adapter and revert to the deterministic rule layer if:

False-positive rate exceeds 15% over a rolling 500-prospect window
Any probe drops below 50% accuracy on a weekly 20-decision spot-check
Adapter fails to load or produces malformed output on >1% of calls
A Tier-1 brand-damage event occurs that was not blocked, traced to a known probe
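
The first and third conditions lend themselves to automated monitoring. The following is a minimal sketch of a rolling-window monitor; the class name, its methods, and the choice to measure malformed output over the same 500-call window are assumptions, since the card gives a window only for the false-positive rate.

```python
from collections import deque


class KillSwitchMonitor:
    """Tracks false-positive and malformed-output rates over a rolling window."""

    def __init__(self, window=500, max_fp_rate=0.15, max_malformed_rate=0.01):
        self.false_positives = deque(maxlen=window)
        self.malformed = deque(maxlen=window)
        self.max_fp_rate = max_fp_rate
        self.max_malformed_rate = max_malformed_rate

    def record(self, false_positive, malformed):
        """Log the outcome of one judged prospect."""
        self.false_positives.append(bool(false_positive))
        self.malformed.append(bool(malformed))

    def should_revert(self):
        """True when either rate exceeds its limit over a full window."""
        n = len(self.false_positives)
        # Wait for a full window so a few early errors cannot trip the switch.
        if n < self.false_positives.maxlen:
            return False
        fp_rate = sum(self.false_positives) / n
        malformed_rate = sum(self.malformed) / n
        return fp_rate > self.max_fp_rate or malformed_rate > self.max_malformed_rate


monitor = KillSwitchMonitor(window=10)
for i in range(10):
    monitor.record(false_positive=(i < 2), malformed=False)
print(monitor.should_revert())  # 2 of 10 is 20%, above the 15% limit
```

The per-probe spot-check and the brand-damage condition need human review and are not covered by this sketch.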

Compute Cost

Phase	Cost
Dataset generation (trace-derived + programmatic)	$0.00
Multi-LLM synthesis (OpenRouter)	~$1.50
ORPO training (Colab T4 free tier, 17 min)	$0.00
Inference per prospect (local adapter)	$0.00
Total	~$1.50

Training Data

Trained on bethelhem21/tenacious-bench:

323 preference pairs across 10 failure probes
169 training pairs used for this adapter
Inter-rater agreement: Cohen's κ = 1.000
Contamination check: PASS (0 n-gram violations)
Authoring modes: trace_derived (90), programmatic (73), multi_llm (120), hand_authored (40)
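
For readers who want to reproduce the two quality checks on their own data, the sketch below shows a standard Cohen's kappa computation and a simple word n-gram overlap test. Both functions are illustrative: the dataset's actual tooling is not shown in this card, and the default `n=8` is an assumption because the card does not state the n-gram length used.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def ngram_overlap(text_a, text_b, n=8):
    """Return the set of word n-grams shared by two texts."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(text_a) & ngrams(text_b)


labels = ["PASS", "BLOCK", "SUPPRESS"]
print(cohens_kappa(labels, labels))  # 1.0
```

A contamination check in this style would flag any held-out pair whose overlap set with a training pair is non-empty.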

Citation

@misc{tenacious-judge-lora-2026,
  author    = {Bethelhem Abay},
  title     = {Tenacious Judge LoRA: Preference-Tuned B2B Sales Outreach Judge},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/bethelhem21/tenacious-judge-lora}
}

References

@article{hong2024orpo,
  title  = {ORPO: Monolithic Preference Optimization without Reference Model},
  author = {Hong, Jiwoo and Lee, Noah and Thorne, James},
  year   = {2024}
}

@article{rafailov2023dpo,
  title  = {Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
  author = {Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D. and Ermon, Stefano and Finn, Chelsea},
  year   = {2023}
}

Acknowledgments

This work would not have been possible without:

Mentors: My mentors Abdulhamid and Temesgen, who guided me through choosing ORPO over DPO and pushed me to run an inter-rater agreement (IRA) check before training. That conversation, about why label reliability must come before model training, is the reason this project has κ = 1.000 in the header and not a footnote about noisy labels.

Yonatan Wondimu (Community Manager) — for hands-on guidance with HuggingFace dataset and model publishing, and for the daily theory and reflective questions that pushed me to articulate my reasoning instead of just shipping code.

10 Academy: The TRP1 tutors for daily standups, debugging support, and technical tutorials throughout Week 11. The kind of infrastructure that makes a Colab-T4 experiment feel like real research.

Cohort: My TRP1 cohort for the daily accountability. You all made the impossible feel possible.

Adapter built as part of the 10 Academy TRP1 Sales Agent Evaluation Bench challenge (Week 11, 2026). All training data is fully synthetic — no real companies, individuals, or emails.
