🤖 Tenacious Judge LoRA — B2B Sales Outreach Pre-Send Judge

Base model: unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit
Adapter type: LoRA (ORPO)
Author: Bethelhem Abay · 10 Academy TRP1
Date: 2026-05-02
License: MIT

A LoRA adapter that teaches a small LLM when NOT to send a B2B sales email: 85.2% accuracy on sealed held-out data, trained in about 17 minutes on a free-tier Colab T4.


🔗 Quick Links

| Resource | Link |
| --- | --- |
| 🤖 Model (this page) | bethelhem21/tenacious-judge-lora |
| 📦 Training Dataset | bethelhem21/tenacious-bench |
| 💻 GitHub Repository | bettyabay/tenacious-bench |
| 📝 Blog Post | Teaching a Sales Agent When NOT to Act |

Overview

What is this model?

Tenacious Judge is a LoRA adapter on top of Qwen2.5-1.5B-Instruct (4-bit quantised via Unsloth), fine-tuned with ORPO on the Tenacious-Bench preference dataset.

It sits between the Tenacious Conversion Engine and the email send queue. Before any B2B outreach is dispatched, the judge evaluates the proposed action against a 7-rule rubric and returns a decision:

| Decision | Meaning |
| --- | --- |
| SUPPRESS | Disqualifier or opt-out signal present: block all outreach |
| ESCALATE | C-level recipient at a company with >2,000 headcount: route to a human account executive |
| BLOCK | Cross-thread context leak, or low-confidence funding cited as fact |
| PENALISE | Generic peer company names: flag for human review |
| PASS | No rule violations: approve for dispatch |
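
The rubric is applied in priority order, so it can be read as a first-match-wins chain. The following is a minimal deterministic sketch of that ordering, not the adapter's actual logic. The fields `disqualifiers`, `opt_out_channels`, `headcount`, `recipient_role`, and `funding_confidence` come from the context examples on this page; the `flags` keys (`channel`, `leaks_other_thread`, `cites_funding_amount`, `generic_peer_names`) and the `C_LEVEL_ROLES` set are hypothetical stand-ins for judgments the model makes from the email text.

```python
# Assumption: role labels treated as C-level (not specified by the model card).
C_LEVEL_ROLES = {"c_level", "ceo", "cto", "cfo", "coo"}
LOW_CONFIDENCE = {"low", "insufficient_signal"}


def rule_decision(context: dict, flags: dict) -> str:
    """Return the first decision whose rule fires, in rubric priority order."""
    # Rule 1: any disqualifier blocks all outreach.
    if context.get("disqualifiers"):
        return "SUPPRESS"
    # Rule 2: the prospect opted out of the channel being used (hypothetical flag).
    if flags.get("channel") in context.get("opt_out_channels", []):
        return "SUPPRESS"
    # Rule 3: C-level recipient at a company with more than 2,000 employees.
    if context.get("recipient_role") in C_LEVEL_ROLES and context.get("headcount", 0) > 2000:
        return "ESCALATE"
    # Rule 4: the draft references another prospect's thread (hypothetical flag).
    if flags.get("leaks_other_thread"):
        return "BLOCK"
    # Rule 5: a funding amount is cited while funding confidence is low.
    if flags.get("cites_funding_amount") and context.get("funding_confidence") in LOW_CONFIDENCE:
        return "BLOCK"
    # Rule 6: generic or reused peer company names (hypothetical flag).
    if flags.get("generic_peer_names"):
        return "PENALISE"
    # Rule 7: nothing fired.
    return "PASS"
```

Because earlier rules return first, a C-level recipient at a large company with an `anti_offshore` disqualifier is suppressed rather than escalated.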

Key Results

| Metric | Value |
| --- | --- |
| Held-out accuracy | 85.2% (52 / 61 pairs) |
| 95% confidence interval | [0.77, 0.93] |
| Training time | ~17 minutes on Colab T4 (free tier) |
| Compute | Colab T4 free tier (~$1.50 total incl. data synthesis) |
| Adapter size | ~74 MB |
| Baseline (no judge) | 0.0% (0 / 61) |

How to Use

Requirements

pip install unsloth transformers torch

Load and Run Inference

from unsloth import FastLanguageModel
import torch
import json

# Load the adapter
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="bethelhem21/tenacious-judge-lora",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

SYSTEM_PROMPT = """You are a sales-outreach judge for Tenacious, a B2B sales agent.
Apply the 7-rule rubric in strict priority order:
1. SUPPRESS if any disqualifier is present (anti_offshore, local_only, do_not_contact, competitor)
2. SUPPRESS if the prospect has opted out of the outreach channel
3. ESCALATE if the recipient is C-level at a company with >2000 employees
4. BLOCK if the output references content from a different prospect's thread
5. BLOCK if a funding amount is cited but funding_confidence is low or insufficient_signal
6. PENALISE if peer company names are generic or reused across prospects
7. PASS if none of the above conditions are triggered
Respond with one of: SUPPRESS | ESCALATE | BLOCK | PENALISE | PASS
Then give a one-sentence rationale."""

# Example: anti-offshore disqualifier case
context = {
    "company": "NearshoreStack Ltd",
    "headcount": 120,
    "funding_stage": "series_a",
    "funding_confidence": "high",
    "disqualifiers": ["anti_offshore"],
    "opt_out_channels": [],
    "recipient_role": "vp_eng",
    "available_signals": {"tech_stack": ["aws", "react"]}
}

agent_output = (
    "Hi Amir, I wanted to introduce Tenacious's offshore engineering placement "
    "service. We've helped similar Series A companies scale their backend teams..."
)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"Context:\n{json.dumps(context, indent=2)}\n\n"
                                 f"Agent output:\n{agent_output}"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

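with torch.no_grad():
    # Greedy decoding: do_sample=False gives deterministic output.
    # temperature=0.0 is not a valid sampling temperature in transformers.
    outputs = model.generate(inputs, max_new_tokens=64, do_sample=False)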

decision = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(decision)
# Expected: SUPPRESS
# Rationale: NearshoreStack has an anti_offshore disqualifier — Rule 1 fires.
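
The prompt asks for one label followed by a one-sentence rationale, so the decoded string still has to be split before it can drive the send queue. A minimal parsing sketch, assuming the label appears in upper case somewhere in the output; `parse_decision` is a hypothetical helper, not part of the released code.

```python
import re

DECISIONS = ("SUPPRESS", "ESCALATE", "BLOCK", "PENALISE", "PASS")
_LABEL = re.compile(r"\b(" + "|".join(DECISIONS) + r")\b")


def parse_decision(raw: str):
    """Split raw judge text into (label, rationale); label is None if no label is found."""
    match = _LABEL.search(raw)
    if match is None:
        return None, raw.strip()
    # Everything after the label is treated as the rationale.
    rationale = raw[match.end():].lstrip(" \n\t:-|").strip()
    return match.group(1), rationale


label, rationale = parse_decision("SUPPRESS\nRationale: anti_offshore disqualifier present.")
print(label)      # SUPPRESS
print(rationale)  # Rationale: anti_offshore disqualifier present.
```

A `None` label can be counted as malformed output and handled conservatively, for example by holding the email instead of treating it as PASS.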

All Five Decision Paths

# SUPPRESS — disqualifier present
context_suppress = {
    "disqualifiers": ["anti_offshore"], "opt_out_channels": [],
    "headcount": 80, "recipient_role": "cto", "funding_confidence": "high"
}

# ESCALATE — C-level at large company
context_escalate = {
    "disqualifiers": [], "opt_out_channels": [],
    "headcount": 5000, "recipient_role": "c_level", "funding_confidence": "high"
}

# BLOCK — cross-thread leak
context_block = {
    "disqualifiers": [], "opt_out_channels": [],
    "headcount": 200, "recipient_role": "vp_eng", "funding_confidence": "high",
    "thread_id": "thread-042",
    "available_signals": {"leaked_thread": "thread-039"}
}

# PENALISE — generic peer names
agent_output_penalise = (
    "We've helped companies like TechCorp and StartupCo scale their teams..."
)

# PASS — clean outreach
context_pass = {
    "disqualifiers": [], "opt_out_channels": [],
    "headcount": 300, "recipient_role": "vp_eng", "funding_confidence": "high"
}
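
To run these contexts through the judge, each one needs the same chat layout as the inference example above. The sketch below wraps that layout in a hypothetical `build_messages` helper; the system prompt and draft text passed in are placeholders.

```python
import json


def build_messages(system_prompt: str, context: dict, agent_output: str) -> list:
    """Build the chat messages in the same layout as the inference example."""
    user = f"Context:\n{json.dumps(context, indent=2)}\n\nAgent output:\n{agent_output}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user},
    ]


# Example: the ESCALATE context with a placeholder draft email.
messages = build_messages(
    "You are a sales-outreach judge...",  # placeholder; use the full SYSTEM_PROMPT
    {"disqualifiers": [], "opt_out_channels": [], "headcount": 5000,
     "recipient_role": "c_level", "funding_confidence": "high"},
    "Hi, I wanted to introduce our engineering placement service...",
)
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` exactly as in the inference example.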

Training Details

Method: Why ORPO over DPO?

ORPO (Odds-Ratio Preference Optimisation, Hong et al., 2024) was chosen for three concrete reasons:

  1. No reference model. Standard DPO keeps a frozen copy of the original model in memory alongside the model being trained, which costs roughly 3–4 GB of additional VRAM. On a Colab T4 (15 GB total), that overhead ruled out 7B+ base models. ORPO needs no reference model at all.

  2. Combined SFT + preference in one pass. DPO is normally run after a separate supervised fine-tuning stage. ORPO folds both objectives into a single training loop, so one run replaces two.

  3. Better performance on small datasets. Published benchmarks show ORPO outperforms DPO when training data is < 500 pairs. The Tenacious-Bench training set has 169 pairs — well within this regime.
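
The single-pass objective from point 2 can be illustrated numerically. This is a sketch of the per-pair ORPO loss as described by Hong et al. (2024): a supervised loss on the chosen response plus a weighted odds-ratio term, with no reference model involved. The inputs are average per-token log-probabilities (strictly negative), and the weight `lam = 0.1` is an illustrative assumption, not the value used for this adapter.

```python
import math


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)), where p = exp(avg_logp) is the length-normalised likelihood."""
    # Requires avg_logp < 0 so that p < 1.
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """Per-pair ORPO objective: SFT loss on the chosen answer plus a weighted odds-ratio term."""
    sft_nll = -avg_logp_chosen
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x)); adequate for moderate values of x.
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return sft_nll + lam * odds_ratio_term


# The loss is lower when the chosen response is the more likely one.
print(orpo_loss(-0.2, -2.0) < orpo_loss(-2.0, -0.2))  # True
```

Both terms are computed from the policy model's own log-probabilities, which is why no second model has to be held in memory.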

Training Configuration

| Parameter | Value |
| --- | --- |
| Base model | unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit |
| LoRA rank (r) | 16 |
| LoRA alpha | 16 |
| LoRA dropout | 0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 18,087,936 (1.18% of total) |
| Training pairs | 169 (train split) |
| Eval pairs | 93 (dev split) |
| Epochs | 10 |
| Total steps | 200 |
| Effective batch size | 8 |
| Learning rate | 8e-5 |
| Optimizer | AdamW (8-bit via Unsloth) |
| Max sequence length | 2,048 tokens |
| Random seed | 3407 |
| Hardware | Google Colab T4 (free tier, 15 GB VRAM) |
| Training time | ~17 minutes (1,009 seconds) |
| Adapter size on disk | ~74 MB |
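
The step, epoch, and timing figures in this table can be related with a little arithmetic. The sketch below assumes every optimizer step consumes one full effective batch and that a final partial batch still counts as a step; the trainer's exact batching behaviour is not stated on this page.

```python
import math

train_pairs = 169
effective_batch = 8
total_steps = 200
train_seconds = 1009

# Assumption: the last partial batch of an epoch still counts as one step.
steps_per_epoch = math.ceil(train_pairs / effective_batch)
passes_over_data = total_steps * effective_batch / train_pairs
seconds_per_step = train_seconds / total_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"passes over the training set: {passes_over_data:.1f}")
print(f"seconds per step: {seconds_per_step:.1f}")
```

Under these assumptions 200 steps amount to a little under ten passes over the 169 training pairs, at roughly five seconds per step.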

Training Progression

| Step | Train Loss | Eval Loss | Rewards Accuracy | Rewards Margin |
| --- | --- | --- | --- | --- |
| 0 | 3.25 | | | |
| 40 | ~0.85 | ~0.90 | 100% | |
| 100 | ~0.35 | ~0.50 | 100% | ~0.55 |
| 200 | 0.1411 | 0.3851 | 100% | 0.6934 |

  • Loss reduced by 95.7% over 200 steps
  • 100% preference accuracy (rewards) reached at step 40 and maintained through step 200
  • No overfitting detected (eval loss converges and plateaus; does not diverge)

Note: An initial 60-step run achieved only 8.2% held-out accuracy — the model had not converged. The final 200-step run was required to reach 85.2%.


Evaluation Results

Evaluated on 61 sealed held-out pairs — all representing cases where the Conversion Engine made the wrong decision. The judge must correctly identify the failure.

Summary

| Variant | Correct | Accuracy | 95% CI |
| --- | --- | --- | --- |
| No judge (baseline) | 0 / 61 | 0.0% | [0.00, 0.00] |
| ORPO judge (this model) | 52 / 61 | 85.2% | [0.77, 0.93] |
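
This page does not say how the 95% interval was computed. As a rough cross-check, the sketch below computes two standard binomial intervals (normal-approximation "Wald" and Wilson) from the raw counts. Neither is claimed to be the author's method, and both may differ slightly from the reported bounds, which could come from a different procedure such as bootstrapping.

```python
import math


def wald_interval(correct: int, n: int, z: float = 1.96):
    """Normal-approximation interval for a binomial proportion."""
    p = correct / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def wilson_interval(correct: int, n: int, z: float = 1.96):
    """Wilson score interval; better behaved near 0% and 100%."""
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


for name, fn in (("Wald", wald_interval), ("Wilson", wilson_interval)):
    lo, hi = fn(52, 61)
    print(f"{name}: [{lo:.3f}, {hi:.3f}]")
```

With only 61 pairs, every method gives an interval more than fifteen points wide, so small differences in held-out accuracy should be read cautiously.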

Per-Probe Breakdown

| Probe | Failure Description | Held-out | Correct | Accuracy | Status |
| --- | --- | --- | --- | --- | --- |
| A07 | Anti-offshore disqualifier | 6 | 6 | 100% | |
| B03 | Funding-tier mismatch | 6 | 5 | 83% | |
| B04 | Low-confidence funding | 5 | 5 | 100% | |
| C02 | Bench commitment ignored | 6 | 4 | 67% | ⚠️ |
| C04 | Regulatory caveat omitted | 6 | 3 | 50% | ⚠️ |
| D05 | Soft rejection doubled down | 6 | 6 | 100% | |
| E01 | Cross-thread context leak | 6 | 6 | 100% | |
| E02 | Generic peer company names | 6 | 4 | 67% | ⚠️ |
| E03 | Opt-out channel ignored | 6 | 5 | 83% | |
| G03 | C-level escalation missed | 8 | 8 | 100% | |
| Total | | 61 | 52 | 85.2% | |
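
The totals and the flagged probes can be recomputed from the per-probe counts. In the sketch below the counts are copied from the table; the 70% flagging threshold is an assumption chosen to match the three ⚠️ rows, not a documented cutoff.

```python
# (held-out pairs, correct) per probe, copied from the table above.
held_out = {
    "A07": (6, 6), "B03": (6, 5), "B04": (5, 5), "C02": (6, 4), "C04": (6, 3),
    "D05": (6, 6), "E01": (6, 6), "E02": (6, 4), "E03": (6, 5), "G03": (8, 8),
}

total_n = sum(n for n, _ in held_out.values())
total_correct = sum(c for _, c in held_out.values())
overall_pct = round(100 * total_correct / total_n, 1)

# Assumption: probes under 70% accuracy are the ones flagged for attention.
flagged = [probe for probe, (n, c) in held_out.items() if c / n < 0.70]

print(total_n, total_correct, overall_pct)  # 61 52 85.2
print(flagged)                              # ['C02', 'C04', 'E02']
```

Each probe has only five to eight held-out pairs, so one extra miss moves a probe's accuracy by 12 to 20 points.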

Confusion Matrix Summary

The 9 misclassified held-out pairs break down as:

| Probe | Misses | Root cause |
| --- | --- | --- |
| C02 | 2 | No structured prior_commitments field; judge infers from prose |
| C04 | 3 | Regulated-industry examples underrepresented in training |
| E02 | 2 | Peer-name specificity threshold is subjective without a reference list |
| B03 | 1 | Borderline funding-tier case with ambiguous signal |
| E03 | 1 | Partial opt-out (sms only) with email channel active |

All 52 correct decisions were true positives — the judge correctly identified the failure class and recommended the right action.


Inference Latency

Based on prefill vs. decode phase analysis (see latency breakdown blog post):

| Phase | Bottleneck | Scales with | Cost |
| --- | --- | --- | --- |
| Prefill | GPU compute (FLOP/s) | Prompt token count | ~0.2 ms/token |
| Decode | Memory bandwidth (GB/s) | Output token count | ~2 ms/token |

  • Observed end-to-end latency: ~200 ms on T4 (batch size 1)
  • Typical token ratio: ~750 prompt tokens / ~60 output tokens (about 12:1)
  • Dominant phase: Prefill (compute-bound; dominates at high prompt-to-output ratio)
  • Optimization priority: Reduce prompt length, not output length

Reducing average prompt length from ~750 to ~400 tokens is projected to reduce total latency by 30–35%: at ~0.2 ms/token, 350 fewer prompt tokens save about 70 ms of the ~200 ms observed total.
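
The prefill/decode split can be turned into a simple additive latency model. The sketch below uses the approximate per-token costs from the table; `estimate_latency_ms` is a hypothetical helper, and because the per-token figures are rounded, the modelled total will not match the observed ~200 ms exactly.

```python
PREFILL_MS_PER_TOKEN = 0.2  # approximate figure from the table above
DECODE_MS_PER_TOKEN = 2.0   # approximate figure from the table above


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Additive model: prefill cost scales with prompt length, decode cost with output length."""
    return prompt_tokens * PREFILL_MS_PER_TOKEN + output_tokens * DECODE_MS_PER_TOKEN


baseline_ms = estimate_latency_ms(750, 60)
trimmed_ms = estimate_latency_ms(400, 60)
saved_ms = baseline_ms - trimmed_ms

print(f"baseline: {baseline_ms:.0f} ms, trimmed prompt: {trimmed_ms:.0f} ms, saved: {saved_ms:.0f} ms")
```

With these rounded costs the model gives about 270 ms for the typical request and about 70 ms saved by the shorter prompt, so it is best treated as an order-of-magnitude guide for comparing prompt lengths.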

This latency profile is designed for async pre-send queues, not real-time filtering.


Limitations

Active Limitations (v0.1)

1. C02 — Bench commitment probe: 67% accuracy

The context schema has no structured prior_commitments field. The judge must infer commitment windows from free-text rationale strings, which is unreliable for edge cases (e.g., commitment window ending today vs. ending tomorrow). The v0.2 schema adds an explicit prior_commitments: [{starts, ends, type}] array.

2. C04 — Regulated-industry caveat probe: 50% accuracy

Training data included very few regulated-industry examples (SOX post-IPO, GDPR erasure requests, HIPAA-adjacent contexts). The judge cannot reliably detect when a caveat is required for fintech, healthcare, or government prospects. Do not deploy against regulated-industry verticals without retraining on a regulated-industry probe set.

3. Single-turn only

The judge evaluates one context object + one agent output. It cannot detect failures that span a multi-turn conversation thread (e.g., commitment made in message 3 violated in message 7).

4. Signal lag

Ground truth depends on CRM signals recorded at annotation time. A prospect who revoked an opt-out or changed their role after the signal was recorded will be incorrectly classified.

5. English only

All training data is in English. Not validated for non-English outreach.
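
For limitation 1, the planned `prior_commitments: [{starts, ends, type}]` array would let commitment windows be checked deterministically instead of inferred from prose. The sketch below is one possible reading of that schema: it assumes `starts` and `ends` are ISO dates, that both ends are inclusive, and uses a made-up `type` value. None of these details are confirmed by this page.

```python
from datetime import date


def active_commitments(on_date: date, prior_commitments: list) -> list:
    """Return the commitments whose window covers on_date (both ends inclusive, by assumption)."""
    return [
        c for c in prior_commitments
        if date.fromisoformat(c["starts"]) <= on_date <= date.fromisoformat(c["ends"])
    ]


# Hypothetical record; "bench_hold" is an illustrative type value.
commitments = [{"starts": "2026-04-01", "ends": "2026-05-02", "type": "bench_hold"}]

print(len(active_commitments(date(2026, 5, 2), commitments)))  # 1 (window ends today)
print(len(active_commitments(date(2026, 5, 3), commitments)))  # 0 (window ended yesterday)
```

An explicit comparison like this removes the "ending today vs. ending tomorrow" ambiguity, provided the schema states whether the end date is inclusive.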

Kill-Switch Conditions

Remove this adapter and revert to the deterministic rule layer if:

  • False-positive rate exceeds 15% over a rolling 500-prospect window
  • Any probe drops below 50% accuracy on a weekly 20-decision spot-check
  • Adapter fails to load or produces malformed output on >1% of calls
  • A Tier-1 brand-damage event occurs that was not blocked, traced to a known probe
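
The first and third conditions are simple rolling rates and can be monitored in code. The sketch below is a hypothetical `KillSwitchMonitor`, not part of the released adapter. It assumes the malformed-output rate is measured over the same 500-decision window as the false-positive rate, and that no alarm is raised until the window is full. The spot-check and brand-damage conditions need human review and are out of scope here.

```python
from collections import deque


class KillSwitchMonitor:
    """Tracks rolling false-positive and malformed-output rates."""

    def __init__(self, window: int = 500, max_fp_rate: float = 0.15,
                 max_malformed_rate: float = 0.01):
        self.window = window
        self.max_fp_rate = max_fp_rate
        self.max_malformed_rate = max_malformed_rate
        self.false_positives = deque(maxlen=window)  # oldest entries drop off automatically
        self.malformed = deque(maxlen=window)

    def record(self, false_positive: bool, malformed: bool = False) -> None:
        self.false_positives.append(bool(false_positive))
        self.malformed.append(bool(malformed))

    def should_revert(self) -> bool:
        # Assumption: wait for a full window before evaluating the rates.
        if len(self.false_positives) < self.window:
            return False
        fp_rate = sum(self.false_positives) / len(self.false_positives)
        malformed_rate = sum(self.malformed) / len(self.malformed)
        return fp_rate > self.max_fp_rate or malformed_rate > self.max_malformed_rate
```

In use, `record(...)` would be called once per judged prospect after the outcome is known, and `should_revert()` checked before routing the next decision through the adapter.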

Environmental Cost

| Phase | Cost |
| --- | --- |
| Dataset generation (trace-derived + programmatic) | $0.00 |
| Multi-LLM synthesis (OpenRouter) | ~$1.50 |
| ORPO training (Colab T4 free tier, 17 min) | $0.00 |
| Inference per prospect (local adapter) | $0.00 |
| Total | ~$1.50 |

Training Data

Trained on bethelhem21/tenacious-bench:

  • 323 preference pairs across 10 failure probes
  • 169 training pairs used for this adapter
  • Inter-rater agreement: Cohen's κ = 1.000
  • Contamination check: PASS (0 n-gram violations)
  • Authoring modes: trace_derived (90), programmatic (73), multi_llm (120), hand_authored (40)
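
For readers who want to reproduce the two quality checks above on their own splits, the sketch below shows Cohen's κ for two raters and a whitespace-token n-gram overlap check. It is not the dataset's actual tooling: the n-gram length of 8 and the lower-cased whitespace tokenisation are assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined, reported as 1.
        return 1.0
    return (observed - expected) / (1 - expected)


def ngrams(text: str, n: int) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(train_text: str, held_out_text: str, n: int = 8) -> set:
    """N-grams that appear in both texts; a non-empty result suggests possible leakage."""
    return ngrams(train_text, n) & ngrams(held_out_text, n)


print(cohens_kappa(["PASS", "BLOCK", "SUPPRESS"], ["PASS", "BLOCK", "SUPPRESS"]))  # 1.0
```

A κ of 1.000 means the two raters agreed on every item; the contamination check passes when `shared_ngrams` is empty for every train/held-out pair.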

Citation

@misc{tenacious-judge-lora-2026,
  author    = {Bethelhem Abay},
  title     = {Tenacious Judge LoRA: Preference-Tuned B2B Sales Outreach Judge},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/bethelhem21/tenacious-judge-lora}
}

References

@article{hong2024orpo,
  title  = {ORPO: Monolithic Preference Optimization without Reference Model},
  author = {Hong, Jiwoo and Lee, Noah and Thorne, James},
  year   = {2024}
}

@article{rafailov2023dpo,
  title  = {Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
  author = {Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D. and Ermon, Stefano and Finn, Chelsea},
  year   = {2023}
}

Acknowledgments

This work would not have been possible without:

Mentors: Abdulhamid and Temesgen, who guided me through choosing ORPO over DPO and pushed me to run an inter-rater agreement (IRA) check before training. That conversation, about why label reliability must come before model training, is the reason this project has κ = 1.000 in the header and not a footnote about noisy labels.

Community Manager: Yonatan Wondimu, for hands-on guidance with HuggingFace dataset and model publishing, and for the daily theory and reflective questions that pushed me to articulate my reasoning instead of just shipping code.

10 Academy: The TRP1 tutors for daily standups, debugging support, and technical tutorials throughout Week 11. The kind of infrastructure that makes a Colab-T4 experiment feel like real research.

Cohort: My TRP1 cohort for the daily accountability. You all made the impossible feel possible.


Adapter built as part of the 10 Academy TRP1 Sales Agent Evaluation Bench challenge (Week 11, 2026). All training data is fully synthetic — no real companies, individuals, or emails.
