tenacious-orpo-qwen25-7b

ORPO-fine-tuned Qwen2.5-7B-Instruct for B2B outbound sales agent evaluation.

This model is a LoRA adapter trained with ORPO (Odds-Ratio Preference Optimization; Hong et al., 2024) on 110 preference pairs from the TenaciousBench v0.1 dataset. It achieves 90.9% pass@1 (40/44 tasks) on the TenaciousBench held-out split, +6.8 pp over the Week 10 GPT-4o-mini baseline.

Model Details

Field	Value
Base model	`Qwen/Qwen2.5-7B-Instruct` @ `a09a35458c`
Training method	ORPO (Hong et al. 2024)
LoRA r / alpha	16 / 32
Target modules	`q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj`
Learning rate	8e-6 (cosine, 10% warmup)
Effective batch	16 (4 per device × 4 grad accum)
Epochs	3
Max seq len	1024 tokens
ORPO beta	0.1
Training time	17.9 min on A100 40GB
Final train loss	0.2847
Final eval loss	0.3012
Quantization (inference)	4-bit NF4 (bitsandbytes)
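
As a minimal sketch, the table above can be collected into keyword dictionaries. The key names follow the arguments of `peft.LoraConfig` and `trl.ORPOConfig`; the training script itself is not part of this card, so the way the values are wired together here, and the single-device setup, are assumptions.

```python
# Hyperparameters from the Model Details table, grouped as keyword arguments.
# Assumption: these would be passed as peft.LoraConfig(**lora_kwargs) and
# trl.ORPOConfig(output_dir=..., **orpo_kwargs); the actual training script
# is not shown in this card.

lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",
}

orpo_kwargs = {
    "learning_rate": 8e-6,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 3,
    "max_length": 1024,
    "beta": 0.1,
}


def effective_batch_size(kwargs, num_devices=1):
    """Examples consumed per optimizer step (num_devices=1 is an assumption)."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(orpo_kwargs))  # 16, matching the table
```

With r = 16 and alpha = 32, the LoRA update is scaled by alpha / r = 2.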

Intended Use

This adapter is designed to score or generate B2B outbound sales emails, and to distinguish good (ICP-grounded, confidence-calibrated, brand-safe) from poor (hallucinated, banned-phrase, over-confident) outreach. It is not intended for general-purpose chat.

Primary use: Evaluating and training B2B sales agents against TenaciousBench v0.1 rubrics.
Secondary use: Generating preference-annotated training data for further DPO/ORPO cycles.

Training Data

Trained on 110 (chosen, rejected) preference pairs from eyorg/tenacious_bench_v0.1 covering all 10 failure dimensions:

Dimension	Description
`signal_grounded_personalization`	Email references real prospect signal (funding, hiring, news)
`confidence_aware_phrasing`	Hedges claims proportionally to evidence strength
`hallucination_avoidance`	Does not fabricate company facts or metrics
`brand_safe_tone`	No banned phrases; professional register
`multi_turn_objection_handling`	Coherently handles follow-up objections
`cta_behavior`	Clear, single, low-friction CTA present
`icp_targeting_accuracy`	Abstains (or does not pitch) when ICP confidence < 0.62
`competitor_gap_reasoning`	Correctly identifies/avoids competitor overlap
`pricing_discipline`	No premature pricing; no unapproved discounts
`thread_continuation_coherence`	Correctly references prior thread context
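
To make the table concrete, here is a small sketch of how one preference pair and the ICP abstention rule might be represented. The record layout (`prompt`, `chosen`, `rejected`, `dimension`) is hypothetical and may not match the dataset's actual field names; only the dimension names and the 0.62 threshold come from the table.

```python
from dataclasses import dataclass

# The 10 failure dimensions listed in the table above.
DIMENSIONS = frozenset({
    "signal_grounded_personalization",
    "confidence_aware_phrasing",
    "hallucination_avoidance",
    "brand_safe_tone",
    "multi_turn_objection_handling",
    "cta_behavior",
    "icp_targeting_accuracy",
    "competitor_gap_reasoning",
    "pricing_discipline",
    "thread_continuation_coherence",
})

# Threshold from the icp_targeting_accuracy row.
ICP_CONFIDENCE_THRESHOLD = 0.62


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record layout; the dataset's real fields may differ."""
    prompt: str
    chosen: str
    rejected: str
    dimension: str

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension!r}")
        if self.chosen == self.rejected:
            raise ValueError("chosen and rejected must differ")


def should_abstain(icp_confidence: float) -> bool:
    """Abstain (or do not pitch) when ICP confidence is below the threshold."""
    return icp_confidence < ICP_CONFIDENCE_THRESHOLD
```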

Evaluation Results

Evaluated on the 44-task held-out split of TenaciousBench v0.1 using scoring_evaluator.py:

System	pass@1	Δ vs baseline	p-value (bootstrap)
This model (Delta A)	90.9% (40/44)	+6.8 pp	0.1953
Prompt-only Qwen2.5-7B (Delta B)	86.4% (38/44)	+2.3 pp	0.4357
Week 10 GPT-4o-mini baseline	84.1% (37/44)	—	—

With n=44, neither difference reaches p<0.05; the evaluation is underpowered at these effect sizes. The effect is directionally positive across all 10 dimensions.
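
The sketch below illustrates how a bootstrap p-value for a pass@1 difference can be computed. It is one common paired-resampling formulation, not necessarily the procedure in scoring_evaluator.py. Only the totals (40/44 and 37/44) come from the table; the per-task pass/fail vectors are invented for illustration, so the printed p-value is not expected to reproduce the reported one.

```python
import random


def pass_rate(outcomes):
    """Fraction of tasks passed; outcomes is a list of 0/1 values."""
    return sum(outcomes) / len(outcomes)


def paired_bootstrap_p(model, baseline, n_resamples=10_000, seed=0):
    """One-sided p-value: share of task resamples where the model
    does not beat the baseline (difference in pass rate <= 0)."""
    if len(model) != len(baseline):
        raise ValueError("both systems must be scored on the same tasks")
    rng = random.Random(seed)
    n = len(model)
    not_better = 0
    for _ in range(n_resamples):
        # Resample task indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(model[i] - baseline[i] for i in idx) / n
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples


# Hypothetical per-task outcomes: the totals match the table, but which
# tasks fail (and how failures overlap between systems) is an assumption.
model_outcomes = [1] * 40 + [0] * 4
baseline_outcomes = [0] * 7 + [1] * 37

delta_pp = 100 * (pass_rate(model_outcomes) - pass_rate(baseline_outcomes))
print(f"delta = {delta_pp:.1f} pp")  # delta = 6.8 pp
print(f"p = {paired_bootstrap_p(model_outcomes, baseline_outcomes):.4f}")
```

The result depends on how failures overlap across the two systems, which is why the real per-task outcomes are needed to reproduce the reported values.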

Cost: $0.000089/task at 4-bit inference vs. $0.000229/task for GPT-4o-mini, roughly 2.6× cheaper. Because this model also has the higher pass@1, it is the Pareto winner on cost and accuracy.
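
The Pareto claim can be checked with a few lines of arithmetic. This is an illustrative sketch: the `System` type and `dominates` helper are hypothetical names, and the numbers are the ones reported above.

```python
from typing import NamedTuple


class System(NamedTuple):
    name: str
    pass_at_1: float      # fraction of held-out tasks passed
    cost_per_task: float  # USD per task


def dominates(a: System, b: System) -> bool:
    """a Pareto-dominates b: no worse on both axes, strictly better on one."""
    no_worse = a.pass_at_1 >= b.pass_at_1 and a.cost_per_task <= b.cost_per_task
    better = a.pass_at_1 > b.pass_at_1 or a.cost_per_task < b.cost_per_task
    return no_worse and better


orpo = System("tenacious-orpo-qwen25-7b (4-bit)", 40 / 44, 0.000089)
gpt = System("Week 10 GPT-4o-mini baseline", 37 / 44, 0.000229)

print(dominates(orpo, gpt))                              # True
print(f"{gpt.cost_per_task / orpo.cost_per_task:.2f}x")  # 2.57x
```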

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit NF4 quantization via bitsandbytes, matching the inference setup in Model Details.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Attach the ORPO LoRA adapter to the quantized base model.
model = PeftModel.from_pretrained(base_model, "eyorg/tenacious-orpo-qwen25-7b")
model.eval()

prompt = """You are a B2B sales agent evaluator. Given the following outreach email
and company context, score it against the TenaciousBench rubric.

Company: NovaPay (fintech, Series B, 320 employees)
Signal: Announced Series B funding two weeks ago.
Email: <candidate email here>

Score: """

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding: disable sampling rather than setting temperature=0.0,
# which is not a valid sampling temperature.
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

For full evaluation with the TenaciousBench scoring rubric:

git clone https://github.com/eyor-gech/tenacious-sales-bench
cd tenacious-sales-bench
pip install -r requirements.txt
python examples/run_examples.py

Kill Criterion

Training used a convergence kill criterion: if eval_loss > 0.35 at step 100, training is aborted and the system falls back to a 3-shot prompted evaluator. At step 100, eval_loss was 0.3142, below the 0.35 threshold, so training continued to completion.
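
The rule above amounts to a single comparison at a fixed checkpoint. A minimal sketch follows; the function name and the way it would be hooked into a training loop are assumptions, since the training script is not shown here.

```python
KILL_STEP = 100        # step at which the criterion is evaluated
KILL_THRESHOLD = 0.35  # maximum acceptable eval loss at that step


def should_abort(step, eval_loss, kill_step=KILL_STEP, threshold=KILL_THRESHOLD):
    """Return True only at the checkpoint step when eval loss exceeds the threshold."""
    return step == kill_step and eval_loss > threshold


# Reported run: eval_loss = 0.3142 at step 100, so training continues.
print(should_abort(100, 0.3142))  # False

# Hypothetical failing run: a loss above 0.35 would trigger the fallback
# to the 3-shot prompted evaluator.
print(should_abort(100, 0.36))    # True
```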

Limitations

Calibrated to 9 fintech/healthtech seed companies; may underperform in manufacturing, logistics, or government verticals.
Held-out evaluation used synthetic preference scores (binary pass/fail), not human judges.
4-bit quantization reduces memory footprint at slight accuracy cost vs. full precision.
n=44 evaluation set is small; formal significance requires ~n=200.

Citation

@misc{getachew2026tenaciousbench,
  title  = {TenaciousBench v0.1: A 220-Task Benchmark for B2B Outbound Sales Agents},
  author = {Getachew, Eyor},
  year   = {2026},
  url    = {https://huggingface.co/datasets/eyorg/tenacious_bench_v0.1}
}

License

MIT — see LICENSE.
