tenacious-orpo-qwen25-7b

ORPO-fine-tuned Qwen2.5-7B-Instruct for B2B outbound sales agent evaluation.

This model is a LoRA adapter trained with ORPO (Odds-Ratio Preference Optimization; Hong et al., 2024) on 110 preference pairs from the TenaciousBench v0.1 dataset. It achieves 90.9% pass@1 (40/44 tasks) on the TenaciousBench held-out split, +6.8 pp over the Week 10 GPT-4o-mini baseline.
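For readers unfamiliar with ORPO, the objective can be sketched in a few lines. This is a minimal illustration of the loss described by Hong et al. (2024), not the training code used for this adapter: it assumes each response is summarized by its length-normalized (per-token average) log-probability, and the function names are hypothetical. The default `beta=0.1` matches the ORPO beta listed under Model Details.

```python
import math


def log_odds(avg_logp: float) -> float:
    """Return log(p / (1 - p)) where p = exp(avg_logp) and avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, beta: float = 0.1) -> float:
    """SFT loss on the chosen response plus a beta-weighted odds-ratio penalty."""
    # Log of the odds ratio between the chosen and rejected responses.
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    # Negative log-likelihood of the chosen response (the SFT term).
    sft_term = -avg_logp_chosen
    return sft_term + beta * odds_ratio_term


# The loss is lower when the chosen response is the more likely one.
print(orpo_loss(-0.5, -2.0), orpo_loss(-2.0, -0.5))
```

Because the penalty depends only on the policy's own odds, no frozen reference model is needed, which is one reason ORPO fits a single-GPU LoRA run like this one.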


Model Details

| Field | Value |
| --- | --- |
| Base model | Qwen/Qwen2.5-7B-Instruct @ a09a35458c |
| Training method | ORPO (Hong et al., 2024) |
| LoRA r / alpha | 16 / 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning rate | 8e-6 (cosine, 10% warmup) |
| Effective batch | 16 (4 per device × 4 grad accum) |
| Epochs | 3 |
| Max seq len | 1024 tokens |
| ORPO beta | 0.1 |
| Training time | 17.9 min on A100 40GB |
| Final train loss | 0.2847 |
| Final eval loss | 0.3012 |
| Quantization (inference) | 4-bit NF4 (bitsandbytes) |
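The learning-rate entry (8e-6 peak, cosine decay, 10% warmup) and the effective batch size can be made concrete with a small sketch. It assumes the usual linear-warmup-then-cosine shape; the card does not state the total number of optimizer steps, so `total_steps=200` below is purely illustrative, and `lr_at` is a hypothetical helper rather than part of the training code.

```python
import math

PEAK_LR = 8e-6        # from the Model Details table
WARMUP_RATIO = 0.10   # 10% warmup
EFFECTIVE_BATCH = 4 * 4  # 4 per device x 4 gradient-accumulation steps


def lr_at(step: int, total_steps: int,
          peak_lr: float = PEAK_LR, warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Illustrative schedule; total_steps is an assumption, not a reported value.
for step in (0, 10, 20, 110, 200):
    print(step, lr_at(step, total_steps=200))
```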

Intended Use

This adapter is designed to score or generate B2B outbound sales emails, and to distinguish good (ICP-grounded, confidence-calibrated, brand-safe) from poor (hallucinated, banned-phrase, over-confident) outreach. It is not intended for general-purpose chat.

Primary use: Evaluating and training B2B sales agents against TenaciousBench v0.1 rubrics.
Secondary use: Generating preference-annotated training data for further DPO/ORPO cycles.


Training Data

Trained on 110 (chosen, rejected) preference pairs from eyorg/tenacious_bench_v0.1 covering all 10 failure dimensions:

| Dimension | Description |
| --- | --- |
| signal_grounded_personalization | Email references real prospect signal (funding, hiring, news) |
| confidence_aware_phrasing | Hedges claims proportionally to evidence strength |
| hallucination_avoidance | Does not fabricate company facts or metrics |
| brand_safe_tone | No banned phrases; professional register |
| multi_turn_objection_handling | Coherently handles follow-up objections |
| cta_behavior | Clear, single, low-friction CTA present |
| icp_targeting_accuracy | Abstains (or does not pitch) when ICP confidence < 0.62 |
| competitor_gap_reasoning | Correctly identifies/avoids competitor overlap |
| pricing_discipline | No premature pricing; no unapproved discounts |
| thread_continuation_coherence | Correctly references prior thread context |
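As a rough illustration of how one training example and the ICP abstention rule might be represented, consider the sketch below. The dataset schema is not described in this card, so the `PreferencePair` field names (the common prompt/chosen/rejected layout) and the `should_abstain` helper are assumptions; only the dimension names and the 0.62 threshold come from the table above.

```python
from dataclasses import dataclass

DIMENSIONS = (
    "signal_grounded_personalization",
    "confidence_aware_phrasing",
    "hallucination_avoidance",
    "brand_safe_tone",
    "multi_turn_objection_handling",
    "cta_behavior",
    "icp_targeting_accuracy",
    "competitor_gap_reasoning",
    "pricing_discipline",
    "thread_continuation_coherence",
)

ICP_CONFIDENCE_THRESHOLD = 0.62  # from the icp_targeting_accuracy row


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record layout for one (chosen, rejected) training example."""
    prompt: str
    chosen: str
    rejected: str
    dimension: str

    def __post_init__(self) -> None:
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")
        if self.chosen == self.rejected:
            raise ValueError("chosen and rejected responses must differ")


def should_abstain(icp_confidence: float,
                   threshold: float = ICP_CONFIDENCE_THRESHOLD) -> bool:
    """True when ICP confidence is below the threshold, i.e. the agent should not pitch."""
    return icp_confidence < threshold
```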

Evaluation Results

Evaluated on the 44-task held-out split of TenaciousBench v0.1 using scoring_evaluator.py:

| System | pass@1 | Δ vs baseline | p-value (bootstrap) |
| --- | --- | --- | --- |
| This model (Delta A) | 90.9% (40/44) | +6.8 pp | 0.1953 |
| Prompt-only Qwen2.5-7B (Delta B) | 86.4% (38/44) | +2.3 pp | 0.4357 |
| Week 10 GPT-4o-mini baseline | 84.1% (37/44) | n/a | n/a |

With n=44, neither improvement reaches p<0.05; the evaluation is underpowered for effect sizes this small. The effect is directionally positive across all 10 dimensions.
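The pass@1 figures and deltas follow directly from the pass counts, and a paired bootstrap is one common way to obtain p-values of this kind. The sketch below recomputes the deltas and shows a generic one-sided paired bootstrap; the card does not specify the exact resampling procedure used by scoring_evaluator.py, and per-task outcomes are not published here, so `paired_bootstrap_p` is an assumed formulation and is not expected to reproduce the reported p-values.

```python
import random

N_TASKS = 44
PASSES = {"orpo": 40, "prompt_only": 38, "baseline": 37}  # from the results table


def pass_rate_pct(passes: int, n: int = N_TASKS) -> float:
    """pass@1 as a percentage."""
    return 100.0 * passes / n


def delta_pp(system: str, baseline: str = "baseline") -> float:
    """Difference in pass@1 between a system and the baseline, in percentage points."""
    return pass_rate_pct(PASSES[system]) - pass_rate_pct(PASSES[baseline])


def paired_bootstrap_p(system_pass, baseline_pass, n_resamples=10_000, seed=0):
    """One-sided p-value: share of resamples in which the system does not beat the baseline.

    Both arguments are per-task 0/1 outcomes over the same tasks, in the same order.
    """
    if len(system_pass) != len(baseline_pass):
        raise ValueError("outcome lists must cover the same tasks")
    rng = random.Random(seed)
    n = len(system_pass)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(system_pass[i] - baseline_pass[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples


print(round(delta_pp("orpo"), 1), round(delta_pp("prompt_only"), 1))
```

Resampling task indices jointly keeps each system's outcome paired with the baseline's outcome on the same task, which is what makes the test paired.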

Cost: $0.000089/task at 4-bit inference, versus $0.000229/task for GPT-4o-mini (about 2.6× cheaper). On this split the model is both cheaper and higher-scoring, making it the Pareto winner.


Usage

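import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization, matching the inference setup in Model Details.
# The bfloat16 compute dtype is an assumption; the card does not specify it.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Attach the ORPO-trained LoRA adapter to the quantized base model.
model = PeftModel.from_pretrained(base_model, "eyorg/tenacious-orpo-qwen25-7b")
model.eval()

prompt = """You are a B2B sales agent evaluator. Given the following outreach email
and company context, score it against the TenaciousBench rubric.

Company: NovaPay (fintech, Series B, 320 employees)
Signal: Announced Series B funding two weeks ago.
Email: <candidate email here>

Score: """

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding for deterministic scores. temperature=0.0 is not a valid
# sampling temperature in transformers; do_sample=False disables sampling instead.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))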

For full evaluation with the TenaciousBench scoring rubric:

git clone https://github.com/eyor-gech/tenacious-sales-bench
cd tenacious-sales-bench
pip install -r requirements.txt
python examples/run_examples.py

Kill Criterion

Training used a convergence kill criterion: if eval_loss exceeded 0.35 at step 100, training would be aborted and the system would fall back to a 3-shot prompted evaluator. At step 100, eval_loss was 0.3142, below the 0.35 threshold, so training continued to completion.
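The rule above amounts to a single threshold check at a fixed step. A minimal sketch follows; the function names and the returned labels are hypothetical, and in a real run the check would typically live in an evaluation callback of the training loop.

```python
KILL_STEP = 100        # step at which the criterion is checked
KILL_THRESHOLD = 0.35  # abort if eval_loss is strictly above this value


def should_abort(step: int, eval_loss: float,
                 kill_step: int = KILL_STEP, threshold: float = KILL_THRESHOLD) -> bool:
    """True only at the kill step and only when eval_loss exceeds the threshold."""
    return step == kill_step and eval_loss > threshold


def choose_evaluator(eval_loss_at_kill_step: float) -> str:
    """Pick the evaluator that the kill criterion would leave in place."""
    if should_abort(KILL_STEP, eval_loss_at_kill_step):
        return "3-shot prompted evaluator"
    return "ORPO adapter"


# With the eval_loss reported in this card, training is not aborted.
print(choose_evaluator(0.3142))
```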


Limitations

  • Calibrated to 9 fintech/healthtech seed companies; may underperform in manufacturing, logistics, or government verticals.
  • Held-out evaluation used synthetic preference scores (binary pass/fail), not human judges.
  • 4-bit quantization reduces memory footprint at slight accuracy cost vs. full precision.
  • The n=44 evaluation set is small; formal significance at these effect sizes would require roughly n=200.

Citation

@misc{getachew2026tenaciousbench,
  title  = {TenaciousBench v0.1: A 220-Task Benchmark for B2B Outbound Sales Agents},
  author = {Getachew, Eyor},
  year   = {2026},
  url    = {https://huggingface.co/datasets/eyorg/tenacious_bench_v0.1}
}

License

MIT; see LICENSE.
