| language: | |
| - en | |
| license: mit | |
| base_model: unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit | |
| pipeline_tag: text-classification | |
| library_name: peft | |
| tags: | |
| - orpo | |
| - preference-optimisation | |
| - lora | |
| - unsloth | |
| - peft | |
| - sales-outreach | |
| - b2b | |
| - judge | |
| - alignment | |
| - tenacious | |
| - outreach-safety | |
| - text-classification | |
| datasets: | |
| - bethelhem21/tenacious-bench | |
| metrics: | |
| - accuracy | |
| thumbnail: banner.png | |
| co2_eq_emissions: | |
| emissions: 0 | |
| source: Colab T4 free tier — 17 minutes training | |
| training_type: fine-tuning | |
| geographical_location: unknown | |
| hardware_used: NVIDIA T4 | |
| widget: | |
| - text: "Context:\n{\"company\": \"NearshoreStack Ltd\", \"headcount\": 120, \"disqualifiers\": [\"anti_offshore\"], \"opt_out_channels\": [], \"recipient_role\": \"vp_eng\", \"funding_confidence\": \"high\"}\n\nAgent output:\nHi, let me introduce our offshore engineering placement service..." | |
| example_title: "SUPPRESS — anti_offshore disqualifier" | |
| - text: "Context:\n{\"company\": \"ScaleOps Ltd\", \"headcount\": 3200, \"disqualifiers\": [], \"opt_out_channels\": [], \"recipient_role\": \"c_level\", \"funding_confidence\": \"high\"}\n\nAgent output:\nHi, I wanted to reach out about our engineering staffing solutions..." | |
| example_title: "ESCALATE — C-level at 3,200 headcount" | |
| - text: "Context:\n{\"company\": \"BuildFast Inc\", \"headcount\": 90, \"disqualifiers\": [], \"opt_out_channels\": [\"email\"], \"recipient_role\": \"vp_eng\", \"funding_confidence\": \"high\"}\n\nAgent output:\nHi, just following up on our previous conversation..." | |
| example_title: "SUPPRESS — email opt-out" | |
| - text: "Context:\n{\"company\": \"DevCo\", \"headcount\": 200, \"disqualifiers\": [], \"opt_out_channels\": [], \"recipient_role\": \"vp_eng\", \"funding_confidence\": \"high\"}\n\nAgent output:\nHi, I noticed DevCo recently raised a Series B — congrats! We help scaling engineering teams..." | |
| example_title: "PASS — clean outreach" | |
| model-index: | |
| - name: tenacious-judge-lora | |
| results: | |
| - task: | |
| type: text-classification | |
| name: Sales Outreach Safety Classification | |
| dataset: | |
| name: Tenacious-Bench | |
| type: bethelhem21/tenacious-bench | |
| split: held_out | |
| metrics: | |
| - type: accuracy | |
| value: 0.852 | |
| name: Accuracy (held-out, n=61) | |
| verified: false | |
| - type: accuracy | |
| name: Accuracy — PROBE-A07 (anti_offshore disqualifier) | |
| value: 1.0 | |
| verified: false | |
| - type: accuracy | |
| name: Accuracy — PROBE-B03 (funding-tier mismatch) | |
| value: 0.833 | |
| verified: false | |
| - type: accuracy | |
| name: Accuracy — PROBE-B04 (low-confidence funding) | |
| value: 1.0 | |
| verified: false | |
| - type: accuracy | |
| name: Accuracy — PROBE-C02 (bench commitment) | |
| value: 0.667 | |
| verified: false | |
| - type: accuracy | |
| name: Accuracy — PROBE-C04 (regulatory caveat) | |
| value: 0.5 | |
| verified: false | |
| - type: accuracy | |
| name: Accuracy — PROBE-D05 (soft rejection) | |
| value: 1.0 | |
| verified: false | |
| - type: accuracy | |
| name: Accuracy — PROBE-E01 (thread leakage) | |
| value: 1.0 | |
| verified: false | |
| - type: accuracy | |
| name: Accuracy — PROBE-E02 (generic peer names) | |
| value: 0.667 | |
| verified: false | |
| - type: accuracy | |
| name: Accuracy — PROBE-E03 (opt-out channel) | |
| value: 0.833 | |
| verified: false | |
| - type: accuracy | |
| name: Accuracy — PROBE-G03 (C-level escalation) | |
| value: 1.0 | |
| verified: false | |
|  | |
| # 🤖 Tenacious Judge LoRA — B2B Sales Outreach Pre-Send Judge | |
| **Base model:** `unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit` | |
| **Adapter type:** LoRA (ORPO) | |
| **Author:** [Bethelhem Abay](https://medium.com/@abay.betty.21) · 10 Academy TRP1 | |
| **Date:** 2026-05-02 | |
| **License:** MIT | |
| > A LoRA adapter that teaches a small LLM when NOT to send a B2B sales email: 85.2% accuracy on sealed held-out data, trained in about 17 minutes on a free-tier Colab T4. | |
| --- | |
| ## 🔗 Quick Links | |
| | Resource | Link | | |
| |----------|------| | |
| | 🤖 Model (this page) | [bethelhem21/tenacious-judge-lora](https://huggingface.co/bethelhem21/tenacious-judge-lora) | | |
| | 📦 Training Dataset | [bethelhem21/tenacious-bench](https://huggingface.co/datasets/bethelhem21/tenacious-bench) | | |
| | 💻 GitHub Repository | [bettyabay/tenacious-bench](https://github.com/bettyabay/tenacious-bench) | | |
| | 📝 Blog Post | [Teaching a Sales Agent When NOT to Act](https://medium.com/@abay.betty.21/teaching-a-sales-agent-when-not-to-act-db1d3b711488) | | |
| --- | |
| ## Overview | |
| ### What is this model? | |
| Tenacious Judge is a **LoRA adapter** on top of `Qwen2.5-1.5B-Instruct` (4-bit quantised via Unsloth), fine-tuned with ORPO on the [Tenacious-Bench](https://huggingface.co/datasets/bethelhem21/tenacious-bench) preference dataset. | |
| It sits between the Tenacious Conversion Engine and the email send queue. Before any B2B outreach is dispatched, the judge evaluates the proposed action against a 7-rule rubric and returns a decision: | |
| | Decision | Meaning | | |
| |----------|---------| | |
| | `SUPPRESS` | Disqualifier or opt-out signal present — block all outreach | | |
| | `ESCALATE` | C-level recipient at >2,000-headcount company — route to human account executive | | |
| | `BLOCK` | Cross-thread context leak or low-confidence funding cited as fact | | |
| | `PENALISE` | Generic peer company names — flag for human review | | |
| | `PASS` | No rule violations — approve for dispatch | | |
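The decision table can be read as a first-match rule chain. The sketch below is a hypothetical deterministic version of that ordering, not the adapter's internal logic. The `Draft` container, the `channel` field, and the three boolean flags (`leaks_other_thread`, `cites_funding_amount`, `generic_peer_names`) are assumptions made for illustration; in practice those flags would have to be computed upstream from the email text.

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    """Hypothetical container for one pre-send check (field names are illustrative)."""
    disqualifiers: list = field(default_factory=list)
    opt_out_channels: list = field(default_factory=list)
    channel: str = "email"              # assumed: channel the draft would be sent on
    recipient_role: str = "vp_eng"
    headcount: int = 0
    funding_confidence: str = "high"
    leaks_other_thread: bool = False    # assumed upstream flag
    cites_funding_amount: bool = False  # assumed upstream flag
    generic_peer_names: bool = False    # assumed upstream flag


def decide(d: Draft) -> str:
    """Return the first decision whose condition matches, in priority order."""
    if d.disqualifiers:
        return "SUPPRESS"
    if d.channel in d.opt_out_channels:
        return "SUPPRESS"
    if d.recipient_role == "c_level" and d.headcount > 2000:
        return "ESCALATE"
    if d.leaks_other_thread:
        return "BLOCK"
    if d.cites_funding_amount and d.funding_confidence in ("low", "insufficient_signal"):
        return "BLOCK"
    if d.generic_peer_names:
        return "PENALISE"
    return "PASS"


# A disqualifier outranks the C-level escalation rule
print(decide(Draft(disqualifiers=["anti_offshore"], recipient_role="c_level", headcount=5000)))
```

Because the checks run in order, a draft that triggers several rules gets the highest-priority decision only.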
| ### Key Results | |
| | Metric | Value | | |
| |--------|-------| | |
| | Held-out accuracy | **85.2%** (52 / 61 pairs) | | |
| | 95% confidence interval | [0.77, 0.93] | | |
| | Training time | ~17 minutes (1,009 seconds) on Colab T4 (free tier) | | |
| | Compute | Colab T4 free tier (~$1.50 total incl. data synthesis) | | |
| | Adapter size | ~74 MB | | |
| | Baseline (no judge) | 0.0% (0 / 61) | | |
| --- | |
| ## How to Use | |
| ### Requirements | |
| ```bash | |
| pip install unsloth transformers torch | |
| ``` | |
| ### Load and Run Inference | |
| ```python | |
| from unsloth import FastLanguageModel | |
| import torch | |
| import json | |
| # Load the base model with the adapter applied | |
| model, tokenizer = FastLanguageModel.from_pretrained( | |
|     model_name="bethelhem21/tenacious-judge-lora", | |
|     max_seq_length=2048, | |
|     load_in_4bit=True, | |
| ) | |
| FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode | |
| SYSTEM_PROMPT = """You are a sales-outreach judge for Tenacious, a B2B sales agent. | |
| Apply the 7-rule rubric in strict priority order: | |
| 1. SUPPRESS if any disqualifier is present (anti_offshore, local_only, do_not_contact, competitor) | |
| 2. SUPPRESS if the prospect has opted out of the outreach channel | |
| 3. ESCALATE if the recipient is C-level at a company with >2000 employees | |
| 4. BLOCK if the output references content from a different prospect's thread | |
| 5. BLOCK if a funding amount is cited but funding_confidence is low or insufficient_signal | |
| 6. PENALISE if peer company names are generic or reused across prospects | |
| 7. PASS if none of the above conditions are triggered | |
| Respond with one of: SUPPRESS | ESCALATE | BLOCK | PENALISE | PASS | |
| Then give a one-sentence rationale.""" | |
| # Example: anti-offshore disqualifier case | |
| context = { | |
|     "company": "NearshoreStack Ltd", | |
|     "headcount": 120, | |
|     "funding_stage": "series_a", | |
|     "funding_confidence": "high", | |
|     "disqualifiers": ["anti_offshore"], | |
|     "opt_out_channels": [], | |
|     "recipient_role": "vp_eng", | |
|     "available_signals": {"tech_stack": ["aws", "react"]}, | |
| } | |
| agent_output = ( | |
|     "Hi Amir, I wanted to introduce Tenacious's offshore engineering placement " | |
|     "service. We've helped similar Series A companies scale their backend teams..." | |
| ) | |
| # The user turn carries the context as JSON followed by the drafted email | |
| messages = [ | |
|     {"role": "system", "content": SYSTEM_PROMPT}, | |
|     {"role": "user", "content": f"Context:\n{json.dumps(context, indent=2)}\n\n" | |
|                                 f"Agent output:\n{agent_output}"}, | |
| ] | |
| # Tokenise with the model's chat template and move the prompt to the GPU | |
| inputs = tokenizer.apply_chat_template( | |
|     messages, | |
|     tokenize=True, | |
|     add_generation_prompt=True, | |
|     return_tensors="pt", | |
| ).to("cuda") | |
| # Greedy decoding gives a deterministic verdict | |
| with torch.no_grad(): | |
|     outputs = model.generate(input_ids=inputs, max_new_tokens=64, do_sample=False) | |
| # Decode only the newly generated tokens, skipping the prompt | |
| decision = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True) | |
| print(decision) | |
| # Expected: SUPPRESS | |
| # Rationale: NearshoreStack has an anti_offshore disqualifier, so Rule 1 fires. | |
| ``` | |
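The system prompt asks for a decision label followed by a one-sentence rationale, so downstream code needs to split the two and notice malformed replies. The helper below is a minimal sketch of such a parser; `parse_verdict` is a hypothetical name, and the assumption that the label appears as a standalone upper-case word comes from the prompt format above rather than from any guarantee about the model's output.

```python
import re

DECISIONS = ("SUPPRESS", "ESCALATE", "BLOCK", "PENALISE", "PASS")


def parse_verdict(text: str):
    """Return (decision, rationale); decision is None when no label is found."""
    match = re.search(r"\b(" + "|".join(DECISIONS) + r")\b", text)
    if match is None:
        return None, text.strip()
    # Drop separators such as newlines, colons, or dashes before the rationale
    rationale = text[match.end():].lstrip(" \n\t:-\u2014").strip()
    return match.group(1), rationale


decision, rationale = parse_verdict(
    "SUPPRESS\nRationale: anti_offshore disqualifier present."
)
print(decision)
print(rationale)
```

A `None` decision can be counted as malformed output and routed to a fallback path instead of the send queue.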
| ### All Five Decision Paths | |
| ```python | |
| # SUPPRESS: disqualifier present | |
| context_suppress = { | |
|     "disqualifiers": ["anti_offshore"], "opt_out_channels": [], | |
|     "headcount": 80, "recipient_role": "cto", "funding_confidence": "high", | |
| } | |
| # ESCALATE: C-level at large company | |
| context_escalate = { | |
|     "disqualifiers": [], "opt_out_channels": [], | |
|     "headcount": 5000, "recipient_role": "c_level", "funding_confidence": "high", | |
| } | |
| # BLOCK: cross-thread leak | |
| context_block = { | |
|     "disqualifiers": [], "opt_out_channels": [], | |
|     "headcount": 200, "recipient_role": "vp_eng", "funding_confidence": "high", | |
|     "thread_id": "thread-042", | |
|     "available_signals": {"leaked_thread": "thread-039"}, | |
| } | |
| # PENALISE: generic peer names in the drafted email | |
| agent_output_penalise = ( | |
|     "We've helped companies like TechCorp and StartupCo scale their teams..." | |
| ) | |
| # PASS: clean outreach | |
| context_pass = { | |
|     "disqualifiers": [], "opt_out_channels": [], | |
|     "headcount": 300, "recipient_role": "vp_eng", "funding_confidence": "high", | |
| } | |
| ``` | |
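To run any of these cases through the judge, each context and drafted email has to be turned into the same user-turn text as in the inference example. The helper below is a small sketch of that formatting step; `build_user_prompt` is a hypothetical name, and the sample email is illustrative.

```python
import json


def build_user_prompt(context: dict, agent_output: str) -> str:
    """Format one case the same way as the user turn in the inference example."""
    return (
        f"Context:\n{json.dumps(context, indent=2)}\n\n"
        f"Agent output:\n{agent_output}"
    )


# Illustrative PASS-path case
context_pass = {
    "disqualifiers": [], "opt_out_channels": [],
    "headcount": 300, "recipient_role": "vp_eng", "funding_confidence": "high",
}
agent_output_pass = (
    "Hi, I noticed DevCo recently raised a Series B. "
    "We help scaling engineering teams..."
)

prompt = build_user_prompt(context_pass, agent_output_pass)
print(prompt)
```

The returned string goes into the `content` of the `user` message, with the system prompt unchanged.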
| --- | |
| ## Training Details | |
| ### Method: Why ORPO over DPO? | |
| ORPO (Odds-Ratio Preference Optimisation, Hong et al., 2024) was chosen for three concrete reasons: | |
| 1. **No reference model.** Standard DPO keeps a frozen copy of the original model in memory alongside the trained model, consuming an additional 3–4 GB of VRAM. On a Colab T4 (15 GB total), that overhead ruled out 7B+ base models. ORPO needs no reference model. | |
| 2. **Combined SFT + preference in one pass.** DPO is normally run after a separate supervised fine-tuning stage. ORPO adds the preference term to the SFT loss, so both objectives are optimised in a single training loop and the separate SFT run is avoided. | |
| 3. **Better performance on small datasets.** ORPO is reported to outperform DPO when training data is < 500 pairs. The Tenacious-Bench training set has 169 pairs, well within this regime. | |
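The single-pass objective can be illustrated with the loss from Hong et al. (2024): the usual SFT negative log-likelihood on the chosen response plus a weighted odds-ratio term that pushes the chosen response's odds above the rejected one's. The sketch below works on scalar log-probabilities with the standard library only. It assumes the inputs are length-normalised (average per-token) log-likelihoods, and `lam=0.1` is an illustrative weight, not a value taken from this card.

```python
import math


def log_odds(logp: float) -> float:
    """log(p / (1 - p)) computed from log p, for 0 < p < 1."""
    return logp - math.log1p(-math.exp(logp))


def orpo_loss(logp_chosen: float, logp_rejected: float, lam: float = 0.1) -> float:
    """SFT loss on the chosen response plus lam times the odds-ratio loss."""
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(ratio)) written as log(1 + exp(-ratio))
    odds_ratio_loss = math.log1p(math.exp(-ratio))
    sft_loss = -logp_chosen
    return sft_loss + lam * odds_ratio_loss


# The loss drops as the chosen response becomes more likely than the rejected one
print(orpo_loss(math.log(0.6), math.log(0.2)))
print(orpo_loss(math.log(0.2), math.log(0.6)))
```

No reference-model term appears anywhere in the expression, which is why only one copy of the model has to be held in memory.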
| ### Training Configuration | |
| | Parameter | Value | | |
| |-----------|-------| | |
| | Base model | `unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit` | | |
| | LoRA rank (r) | 16 | | |
| | LoRA alpha | 16 | | |
| | LoRA dropout | 0 | | |
| | Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` | | |
| | Trainable parameters | 18,087,936 (1.18% of total) | | |
| | Training pairs | 169 (train split) | | |
| | Eval pairs | 93 (dev split) | | |
| | Epochs | 10 | | |
| | Total steps | 200 | | |
| | Effective batch size | 8 | | |
| | Learning rate | 8e-5 | | |
| | Optimizer | AdamW (8-bit via Unsloth) | | |
| | Max sequence length | 2,048 tokens | | |
| | Random seed | 3407 | | |
| | Hardware | Google Colab T4 (free tier, 15 GB VRAM) | | |
| | Training time | ~17 minutes (1,009 seconds) | | |
| | Adapter size on disk | ~74 MB | | |
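A few of the figures in the configuration table can be cross-checked against each other. The snippet below is plain arithmetic on the values listed above; the only assumption is that every step consumes one full effective batch.

```python
train_pairs = 169
effective_batch = 8
total_steps = 200
training_seconds = 1009
trainable_params = 18_087_936
trainable_fraction = 0.0118  # 1.18% of total

samples_seen = total_steps * effective_batch        # examples processed in training
passes_over_data = samples_seen / train_pairs       # approximate number of epochs
minutes = training_seconds / 60
seconds_per_step = training_seconds / total_steps
implied_total_params = trainable_params / trainable_fraction

print(f"samples seen:        {samples_seen}")
print(f"passes over data:    {passes_over_data:.2f}")
print(f"training minutes:    {minutes:.1f}")
print(f"seconds per step:    {seconds_per_step:.2f}")
print(f"implied total params: {implied_total_params / 1e9:.2f}B")
```

The 200 steps amount to roughly 9.5 passes over the 169 pairs, in line with the 10 epochs listed, and 1,009 seconds is just under 17 minutes.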
| ### Training Progression | |
| | Step | Train Loss | Eval Loss | Rewards Accuracy | Rewards Margin | | |
| |------|-----------|-----------|-----------------|----------------| | |
| | 0 | 3.25 | — | — | — | | |
| | 40 | ~0.85 | ~0.90 | 100% | — | | |
| | 100 | ~0.35 | ~0.50 | 100% | ~0.55 | | |
| | 200 | **0.1411** | **0.3851** | **100%** | **0.6934** | | |
| - Train loss fell by **95.7%** over 200 steps (3.25 → 0.1411) | |
| - 100% preference accuracy (rewards) reached by step 40 and maintained through step 200 | |
| - No overfitting detected: eval loss converges and plateaus rather than diverging | |
| > **Note:** An initial 60-step run achieved only 8.2% held-out accuracy — the model had not converged. The final 200-step run was required to reach 85.2%. | |
| --- | |
| ## Evaluation Results | |
| Evaluated on **61 sealed held-out pairs**. Every pair is a case where the Conversion Engine made the wrong decision, so the no-judge baseline scores 0 / 61 by construction. The judge is counted correct only when it identifies the failure. | |
| ### Summary | |
| | Variant | Correct | Accuracy | 95% CI | | |
| |---------|---------|----------|--------| | |
| | No judge (baseline) | 0 / 61 | 0.0% | [0.00, 0.00] | | |
| | **ORPO judge (this model)** | **52 / 61** | **85.2%** | **[0.77, 0.93]** | | |
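With only 61 held-out pairs the interval around 85.2% is wide, so it helps to see how such an interval can be computed. The card does not state which method produced the reported bounds; the sketch below uses the Wilson score interval as one common choice for a binomial proportion, so its output may differ slightly from the table.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(52, 61)
print(f"[{low:.2f}, {high:.2f}]")
```

For 52 / 61 this gives roughly [0.74, 0.92]; the reported interval may come from a different method, such as a bootstrap.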
| ### Per-Probe Breakdown | |
| | Probe | Failure Description | Held-out | Correct | Accuracy | Status | | |
| |-------|---------------------|----------|---------|----------|--------| | |
| | A07 | Anti-offshore disqualifier | 6 | 6 | 100% | ✅ | | |
| | B03 | Funding-tier mismatch | 6 | 5 | 83% | ✅ | | |
| | B04 | Low-confidence funding | 5 | 5 | 100% | ✅ | | |
| | C02 | Bench commitment ignored | 6 | 4 | 67% | ⚠️ | | |
| | C04 | Regulatory caveat omitted | 6 | 3 | 50% | ⚠️ | | |
| | D05 | Soft rejection doubled down | 6 | 6 | 100% | ✅ | | |
| | E01 | Cross-thread context leak | 6 | 6 | 100% | ✅ | | |
| | E02 | Generic peer company names | 6 | 4 | 67% | ⚠️ | | |
| | E03 | Opt-out channel ignored | 6 | 5 | 83% | ✅ | | |
| | G03 | C-level escalation missed | 8 | 8 | 100% | ✅ | | |
| | **Total** | | **61** | **52** | **85.2%** | | | |
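The totals row and the warning markers follow directly from the per-probe counts. The snippet below re-derives them from the table; the 0.8 cut-off for flagging a probe is an assumption chosen because it reproduces the ⚠️ rows, not a threshold stated in this card.

```python
# (held-out pairs, correct) per probe, copied from the table above
probes = {
    "A07": (6, 6), "B03": (6, 5), "B04": (5, 5), "C02": (6, 4), "C04": (6, 3),
    "D05": (6, 6), "E01": (6, 6), "E02": (6, 4), "E03": (6, 5), "G03": (8, 8),
}

total = sum(n for n, _ in probes.values())
correct = sum(c for _, c in probes.values())
weak = [name for name, (n, c) in probes.items() if c / n < 0.8]  # assumed cut-off

print(f"{correct} / {total} = {correct / total:.1%}")
print("needs attention:", weak)
```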
| ### Error Breakdown | |
| The 9 misclassified held-out pairs break down as: | |
| | Probe | Misses | Root cause | | |
| |-------|--------|------------| | |
| | C02 | 2 | No structured `prior_commitments` field; judge infers from prose | | |
| | C04 | 3 | Regulated-industry examples underrepresented in training | | |
| | E02 | 2 | Peer-name specificity threshold is subjective without a reference list | | |
| | B03 | 1 | Borderline funding-tier case with ambiguous signal | | |
| | E03 | 1 | Partial opt-out (sms only) with email channel active | | |
| All 52 correct decisions were true positives — the judge correctly identified the failure class and recommended the right action. | |
| --- | |
| ## Inference Latency | |
| Based on prefill vs. decode phase analysis (see [latency breakdown blog post](https://medium.com/@abay.betty.21/prefill-vs-decode-where-your-inference-latency-actually-goes-a796c3495afa)): | |
| | Phase | Bottleneck | Scales with | Cost | | |
| |-------|-----------|-------------|------| | |
| | Prefill | GPU compute (FLOP/s) | Prompt token count | ~0.2 ms/token | | |
| | Decode | Memory bandwidth (GB/s) | Output token count | ~2 ms/token | | |
| - **Observed end-to-end latency:** ~200 ms on T4 (batch size 1) | |
| - **Typical token ratio:** ~750 prompt tokens / ~60 output tokens (~12:1) | |
| - **Dominant phase:** Prefill (compute-bound; dominates at high prompt-to-output ratio) | |
| - **Optimization priority:** Reduce prompt length, not output length | |
| Reducing average prompt length from ~750 to ~400 tokens is projected to reduce total latency by 30–35%. | |
| > This latency profile is designed for **async pre-send queues**, not real-time filtering. | |
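The per-token costs in the table can be combined into a simple linear latency estimate. The sketch below assumes latency is just prompt tokens times the prefill cost plus output tokens times the decode cost, ignoring fixed overheads; the function name is hypothetical and the constants are the rounded values from the table.

```python
PREFILL_MS_PER_TOKEN = 0.2  # rounded value from the table
DECODE_MS_PER_TOKEN = 2.0   # rounded value from the table


def estimate_latency_ms(prompt_tokens: int, output_tokens: int):
    """Return (prefill_ms, decode_ms, total_ms) under a linear cost model."""
    prefill = prompt_tokens * PREFILL_MS_PER_TOKEN
    decode = output_tokens * DECODE_MS_PER_TOKEN
    return prefill, decode, prefill + decode


typical = estimate_latency_ms(750, 60)
shorter = estimate_latency_ms(400, 60)
saving = 1 - shorter[2] / typical[2]

print(f"typical prompt: prefill {typical[0]:.0f} ms, decode {typical[1]:.0f} ms")
print(f"shorter prompt: prefill {shorter[0]:.0f} ms, decode {shorter[1]:.0f} ms")
print(f"estimated saving: {saving:.0%}")
```

With these rounded constants the model lands somewhat above the observed ~200 ms and predicts a saving of about a quarter, so the table is best read as an order-of-magnitude guide rather than an exact fit. It does reproduce the main point: at a 12:1 token ratio, prefill is the larger share, and shortening the prompt is the lever that moves total latency.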
| --- | |
| ## Limitations | |
| ### Active Limitations (v0.1) | |
| **1. C02 — Bench commitment probe: 67% accuracy** | |
| The context schema has no structured `prior_commitments` field. The judge must infer commitment windows from free-text rationale strings, which is unreliable for edge cases (e.g., commitment window ending today vs. ending tomorrow). The v0.2 schema adds an explicit `prior_commitments: [{starts, ends, type}]` array. | |
| **2. C04 — Regulated-industry caveat probe: 50% accuracy** | |
| Training data included very few regulated-industry examples (SOX post-IPO, GDPR erasure requests, HIPAA-adjacent contexts). The judge cannot reliably detect when a caveat is required for fintech, healthcare, or government prospects. **Do not deploy against regulated-industry verticals without retraining on a regulated-industry probe set.** | |
| **3. Single-turn only** | |
| The judge evaluates one context object + one agent output. It cannot detect failures that span a multi-turn conversation thread (e.g., commitment made in message 3 violated in message 7). | |
| **4. Signal lag** | |
| Ground truth depends on CRM signals recorded at annotation time. If a prospect revokes an opt-out or changes role after the signal was recorded, the judge will decide on stale context and may misclassify the outreach. | |
| **5. English only** | |
| All training data is in English. Not validated for non-English outreach. | |
| ### Kill-Switch Conditions | |
| Remove this adapter and revert to the deterministic rule layer if: | |
| - False-positive rate exceeds 15% over a rolling 500-prospect window | |
| - Any probe drops below 50% accuracy on a weekly 20-decision spot-check | |
| - Adapter fails to load or produces malformed output on >1% of calls | |
| - A Tier-1 brand-damage event occurs that was not blocked, traced to a known probe | |
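The two rate-based conditions can be tracked with a rolling window. The class below is a minimal sketch, not part of the released code: `KillSwitchMonitor` is a hypothetical name, it assumes each decision is later labelled as a false positive or not, it applies the same 500-decision window to the malformed-output rate (the list above does not specify one), and it only fires once the window is full.

```python
from collections import deque


class KillSwitchMonitor:
    """Rolling-window check for the false-positive and malformed-output conditions."""

    def __init__(self, window: int = 500, max_fp_rate: float = 0.15,
                 max_malformed_rate: float = 0.01):
        self.window = window
        self.max_fp_rate = max_fp_rate
        self.max_malformed_rate = max_malformed_rate
        self.false_positives = deque(maxlen=window)
        self.malformed = deque(maxlen=window)

    def record(self, false_positive: bool, malformed: bool = False) -> None:
        self.false_positives.append(false_positive)
        self.malformed.append(malformed)

    def should_revert(self) -> bool:
        # Wait for a full window so a single early error cannot trip the switch
        if len(self.false_positives) < self.window:
            return False
        fp_rate = sum(self.false_positives) / self.window
        malformed_rate = sum(self.malformed) / self.window
        return fp_rate > self.max_fp_rate or malformed_rate > self.max_malformed_rate


monitor = KillSwitchMonitor(window=10)
for i in range(10):
    monitor.record(false_positive=(i < 2))  # 2 of 10 decisions are false positives
print(monitor.should_revert())
```

The per-probe spot-check and the brand-damage condition need human review and are not covered by this sketch.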
| --- | |
| ## Compute Cost | |
| | Phase | Cost | | |
| |-------|------| | |
| | Dataset generation (trace-derived + programmatic) | $0.00 | | |
| | Multi-LLM synthesis (OpenRouter) | ~$1.50 | | |
| | ORPO training (Colab T4 free tier, 17 min) | $0.00 | | |
| | Inference per prospect (local adapter) | $0.00 | | |
| | **Total** | **~$1.50** | | |
| --- | |
| ## Training Data | |
| Trained on [bethelhem21/tenacious-bench](https://huggingface.co/datasets/bethelhem21/tenacious-bench): | |
| - **323 preference pairs** across 10 failure probes | |
| - **169 training pairs** used for this adapter; the remaining 93 dev and 61 held-out pairs were used for evaluation | |
| - Inter-rater agreement: Cohen's κ = 1.000 | |
| - Contamination check: PASS (0 n-gram violations) | |
| - Authoring modes: trace_derived (90), programmatic (73), multi_llm (120), hand_authored (40) | |
| --- | |
| ## Citation | |
| ```bibtex | |
| @misc{tenacious-judge-lora-2026, | |
| author = {Bethelhem Abay}, | |
| title = {Tenacious Judge LoRA: Preference-Tuned B2B Sales Outreach Judge}, | |
| year = {2026}, | |
| publisher = {HuggingFace}, | |
| url = {https://huggingface.co/bethelhem21/tenacious-judge-lora} | |
| } | |
| ``` | |
| ### References | |
| ```bibtex | |
| @article{hong2024orpo, | |
|   title = {ORPO: Monolithic Preference Optimization without Reference Model}, | |
|   author = {Hong, Jiwoo and Lee, Noah and Thorne, James}, | |
|   journal = {arXiv preprint arXiv:2403.07691}, | |
|   year = {2024} | |
| } | |
| @article{rafailov2023dpo, | |
|   title = {Direct Preference Optimization: Your Language Model is Secretly a Reward Model}, | |
|   author = {Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D. and Ermon, Stefano and Finn, Chelsea}, | |
|   journal = {arXiv preprint arXiv:2305.18290}, | |
|   year = {2023} | |
| } | |
| ``` | |
| --- | |
| ## Acknowledgments | |
| This work would not have been possible without: | |
| **Mentors:** My mentors Abdulhamid and Temesgen, who guided me through choosing ORPO over DPO and pushed me to run an inter-rater agreement (IRA) check before training. That conversation, about why label reliability must come before model training, is the reason this project reports κ = 1.000 and not a footnote about noisy labels. | |
| **Yonatan Wondimu (Community Manager)** — for hands-on guidance with HuggingFace dataset and model publishing, and for the daily theory and reflective questions that pushed me to articulate my reasoning instead of just shipping code. | |
| **10 Academy:** The TRP1 tutors for daily standups, debugging support, and technical tutorials throughout Week 11. The kind of infrastructure that makes a Colab-T4 experiment feel like real research. | |
| **Cohort:** My TRP1 cohort for the daily accountability. You all made the impossible feel possible. | |
| --- | |
| *Adapter built as part of the 10 Academy TRP1 Sales Agent Evaluation Bench challenge (Week 11, 2026). All training data is fully synthetic — no real companies, individuals, or emails.* | |