samscrack committed · Commit 9dc92d9 · verified · 1 Parent(s): 82e0c50

docs: add Solidity Eval 2026 pass@1 leaderboard at top — 46.5% beats Claude Opus 4.7 by +7.5pp

  1. README.md +14 -0
README.md CHANGED
@@ -33,6 +33,20 @@ This is the **final merged checkpoint** — all five stages (CPT → SFT instruc
  Loadable directly with `AutoModelForCausalLM.from_pretrained(...)` — no adapters
  to apply.
 
+ ## Solidity Eval (2026) — pass@1 leaderboard
+
+ Top of the pass@1 leaderboard on [`samscrack/solidity-eval-2026`](https://huggingface.co/datasets/samscrack/solidity-eval-2026) (`lite` split, 200 real Etherscan contracts):
+
+ | Agent / model | pass@1 | Wall-clock |
+ |---|---|---|
+ | **This model — Qwen 3.6 Solidity 27B** | **46.5%** (93/200) | ~27 min |
+ | Claude Code 2.1.128 (Claude Opus 4.7) | 39.0% (78/200, 1 timeout) | ~34 min |
+
+ `pass@1` here is SolBench's `echidna()` rule: a single agentic attempt scores 1.0 only if Diffusc compiles the candidate and Echidna's differential fuzzing finds no behavioral divergence from the ground-truth body, with B3 canary and stub-residue guards applied. There is no resampling. Conditions are identical across rows: 16-way concurrency, `max_agent_turns=40`, `agent_temperature=0.6`, `fuzz_test_calls=50000`, `fuzz_seed=0xDEADBEEF`, same sandbox image, same host. This model was served locally via vLLM (TP=2, FP8, `qwen3_xml` tool parser) on 2× Blackwell GPUs through the in-process Hermes agent loop; Claude Code ran against the Anthropic API through the CLI agent backend.
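
The scoring rule and the table's arithmetic can be illustrated with a minimal sketch. It is not the SolBench harness: `TaskResult`, `task_score`, and `pass_at_1` are hypothetical names, the per-task outcomes are synthetic and chosen only to match the reported counts (93 and 78 of 200), and the guards are collapsed into a single assumed boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one single-attempt task (hypothetical record, not the harness schema)."""
    compiled: bool          # the candidate compiled
    diverged: bool          # the fuzzer found a behavioral difference vs. ground truth
    guards_ok: bool = True  # canary / stub-residue guards passed (assumed single flag)


def task_score(r: TaskResult) -> float:
    """Return 1.0 only if the candidate compiled, did not diverge, and passed the guards."""
    return 1.0 if (r.compiled and not r.diverged and r.guards_ok) else 0.0


def pass_at_1(results: list[TaskResult]) -> float:
    """Mean per-task score over single attempts (no resampling)."""
    if not results:
        raise ValueError("no results to score")
    return sum(task_score(r) for r in results) / len(results)


# Synthetic outcomes that reproduce only the table's pass counts.
ours = [TaskResult(True, False)] * 93 + [TaskResult(True, True)] * 107
baseline = [TaskResult(True, False)] * 78 + [TaskResult(False, False)] * 122

gap_pp = (pass_at_1(ours) - pass_at_1(baseline)) * 100
print(f"{pass_at_1(ours):.1%} vs {pass_at_1(baseline):.1%}, gap {gap_pp:+.1f}pp")
```

Because each task gets exactly one attempt, pass@1 reduces to the fraction of tasks scored 1.0, which is why the 15-task difference over 200 tasks corresponds to the 7.5pp gap in the commit title.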
+
+ See the dataset card for the full reproduction recipe and harness-agnostic scoring instructions.
+
+
  ## Pipeline
 
  | # | Stage | Method | Adapter | Training data |