How to use dancinlab/hexa-forge-code-7b-qwen2.5-lora-r64-v0.4.1-delegate with PEFT:
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model first; the LoRA adapter is attached on top of it.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B")
model = PeftModel.from_pretrained(base_model, "dancinlab/hexa-forge-code-7b-qwen2.5-lora-r64-v0.4.1-delegate")

⚠️ LABELED EXPERIMENT: NOT GA. This is round 41 (r41), the v0.4.1 rebalanced-SFT follow-up to r40. Changes vs r40:
- rebalanced the delegation share from 25% to 9%
- added 4 new blocks (T4-RL-reinforce, over-delegate-counter, refusal-shape, OOD-extension)
- halved the LR (5e-5 → 2e-5)
- doubled the epochs (1 → 2)

Result: essentially flat vs r40. The specialist↔routing tradeoff in 7B+LoRA SFT is fundamental, not a hyperparameter problem. The actual v0.4.0 GA is dancinlab/hexa-forge-code-7b-qwen2.5-lora-r64-v0.4.0-rl-t4-v3-t3patch (r39, 94.29% Mk.I). Use that one for production.
Taken together, r40 and r41 are empirical evidence that SFT-only delegation training does not work on a saturated specialist. The remaining viable path is routing-RL (GRPO with a binary route-correctness reward), queued as v0.4.2.
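To make the "binary route-correctness reward" concrete, here is a minimal sketch of how such a reward could feed GRPO's group-relative advantage. It is an illustration, not the planned v0.4.2 training code: the route labels (`"delegate"`, `"answer"`), the function names, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def route_reward(predicted_route: str, gold_route: str) -> float:
    """Binary reward: 1.0 if the sampled route matches the gold route, else 0.0."""
    return 1.0 if predicted_route == gold_route else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage: normalize each reward by the group's mean and std.

    Assumption: population std is used here; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 completions sampled for one prompt whose gold route
# is "delegate". The labels are illustrative only.
samples = ["delegate", "answer", "delegate", "answer"]
gold = "delegate"

rewards = [route_reward(s, gold) for s in samples]
advantages = group_advantages(rewards)

for s, r, a in zip(samples, rewards, advantages):
    print(f"{s:>8}  reward={r:.0f}  advantage={a:+.3f}")
```

One property worth noting: when every completion in a group gets the same reward (all correct or all wrong), every advantage is zero and that prompt contributes no gradient signal, so a binary reward only helps on prompts where the policy's routing is still mixed.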
| family | r39 GA | r40 v18 (25% del) | r41 v19 (9% del) | Δ vs r40 |
|---|---|---|---|---|
| Mk.I overall | 94.29% | 82.71% | 83.01% | +0.30 (flat) |
| T1 syntax | 97.6% | 76.5% | 75.3% | −1.2 |
| T2 atlas | 87.0% | 78.0% | 85.0% | +7.0 (rambling-cover artifact) |
| T3 @grace | 100.0% | 98.8% | 98.8% | 0 |
| T4 enum | 100.0% | 77.0% | 73.0% | −4.0 ⚠ |
| T5 HX-codes | 94.8% | 86.5% | 89.6% | +3.1 |
| T6 triples | 95.5% | 92.4% | 87.9% | −4.5 |
| T7 stdlib | 87.9% | 89.7% | 89.7% | 0 |
| T8 refusal | 90.0% | 68.8% | 68.8% | 0 ⚠ |
| 5-NL i18n | 96% | 60% | 52% | −8 ⚠ |
| DLG-mk0 | n/a | 0.7652 | 0.7760 | +0.0108 (+1.08 pts; still <0.85 gate) |
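The Δ column and the gate check can be recomputed directly from the table. The sketch below copies the scores from the rows above. The helper names are illustrative, and the 0.85 DLG-mk0 gate is taken from the table's own note.

```python
# Scores copied from the table above: (r39 GA, r40, r41), in percent.
SCORES = {
    "Mk.I overall": (94.29, 82.71, 83.01),
    "T1 syntax": (97.6, 76.5, 75.3),
    "T2 atlas": (87.0, 78.0, 85.0),
    "T3 @grace": (100.0, 98.8, 98.8),
    "T4 enum": (100.0, 77.0, 73.0),
    "T5 HX-codes": (94.8, 86.5, 89.6),
    "T6 triples": (95.5, 92.4, 87.9),
    "T7 stdlib": (87.9, 89.7, 89.7),
    "T8 refusal": (90.0, 68.8, 68.8),
    "5-NL i18n": (96.0, 60.0, 52.0),
}

# DLG-mk0 is on a 0-1 scale: (r40, r41). There is no r39 value.
DLG_MK0 = (0.7652, 0.7760)
DLG_GATE = 0.85


def delta_vs_r40(family: str) -> float:
    """r41 minus r40, in percentage points."""
    _, r40, r41 = SCORES[family]
    return round(r41 - r40, 2)


def gap_to_ga(family: str) -> float:
    """r41 minus the r39 GA score, in percentage points."""
    r39, _, r41 = SCORES[family]
    return round(r41 - r39, 2)


regressed = [f for f in SCORES if delta_vs_r40(f) < 0]
dlg_delta = round(DLG_MK0[1] - DLG_MK0[0], 4)
passes_gate = DLG_MK0[1] >= DLG_GATE

for family in SCORES:
    print(f"{family:<13} Δ vs r40 {delta_vs_r40(family):+6.2f}   gap to GA {gap_to_ga(family):+7.2f}")
print(f"DLG-mk0 Δ {dlg_delta:+.4f}, passes {DLG_GATE} gate: {passes_gate}")
```

Comparing against r39 as well as r40 makes the point of the table explicit: r41 moves individual families up or down relative to r40, but the overall score stays about 11 points below GA.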
DLG-mk0 per-category (r40 → r41):
(dancinlab/hexa-codex/lm_foundry/ROADMAP.md, r41)
MIT (adapter weights). Base model: Qwen/Qwen2.5-Coder-7B.