---
license: apache-2.0
base_model: Qwen/Qwen3.6-27B
language:
- en
library_name: transformers
tags:
- solidity
- smart-contracts
- code-generation
- foundry
- blockchain
- ethereum
- security-audit
- rejection-fine-tuning
- qwen
datasets:
- ASSERT-KTH/DISL
- braindao/solidity-base-sft-v2
- samscrack/solidity-audit-cot
pipeline_tag: text-generation
---
# Qwen 3.6 Solidity (27B)
A 5-stage Solidity-specialist fine-tune of Qwen/Qwen3.6-27B. Trained to produce
Foundry-compileable Solidity contracts and matching test suites from
natural-language specs, and to reason about smart-contract security with
long-CoT audit traces.

This is the final merged checkpoint: all five stages (CPT → SFT instruction
→ SFT audit/CoT → SFT Opus distillation → RFT) folded into a single bf16 model.
Loadable directly with `AutoModelForCausalLM.from_pretrained(...)`, with no
adapters to apply.
## Solidity Eval (2026): pass@1 leaderboard
Top of the pass@1 leaderboard on samscrack/solidity-eval-2026 (lite split, 200 real Etherscan contracts):
| Agent / model | pass@1 | Wall-clock |
|---|---|---|
| This model (Qwen 3.6 Solidity 27B) | 46.5% (93/200) | ~27 min |
| Claude Code 2.1.128 (Claude Opus 4.7) | 39.0% (78/200, 1 timeout) | ~34 min |
pass@1 here is SolBench's `echidna()` rule: a single agentic attempt scores 1.0 only if Diffusc compiles the candidate AND Echidna's differential fuzzing finds no behavioral divergence from the ground-truth body, subject to the B3 canary and stub-residue guards. There is no resampling. Conditions are identical across rows: 16-way concurrency, `max_agent_turns=40`, `agent_temperature=0.6`, `fuzz_test_calls=50000`, `fuzz_seed=0xDEADBEEF`, same sandbox image, same host. This model was served locally via vLLM (TP=2, FP8, `qwen3_xml` tool parser) on 2× Blackwell GPUs through the in-process Hermes agent loop; Claude Code ran against the Anthropic API through the CLI agent backend.
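As a rough illustration of the scoring rule described above, the sketch below models one attempt as three booleans and averages the resulting 0/1 scores. The `Attempt` class and its field names are hypothetical and are not part of SolBench; only the rule itself and the 93/200 figure come from this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """Outcome of one agentic attempt (hypothetical field names)."""
    compiled: bool         # Diffusc compiled the candidate
    fuzz_divergence: bool  # Echidna found a behavioral divergence
    guards_ok: bool        # canary and stub-residue guards passed


def score(attempt: Attempt) -> float:
    """Return 1.0 only when every condition of the rule holds, else 0.0."""
    passed = attempt.compiled and not attempt.fuzz_divergence and attempt.guards_ok
    return 1.0 if passed else 0.0


def pass_at_1(attempts: list[Attempt]) -> float:
    """Mean score over tasks, one attempt per task (no resampling)."""
    return sum(score(a) for a in attempts) / len(attempts)


# Rebuild the headline ratio: 93 passing attempts out of 200 tasks.
attempts = [Attempt(True, False, True)] * 93 + [Attempt(True, True, True)] * 107
print(f"{pass_at_1(attempts):.1%}")  # prints 46.5%
```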
See the dataset card for the full reproduction recipe and harness-agnostic scoring instructions.
## Pipeline
| # | Stage | Method | Adapter | Training data |
|---|---|---|---|---|
| 0 | Continued pretrain | LoRA r=64, ~500M Solidity tokens | folded in | ASSERT-KTH/DISL (514k deployed contracts, CC-BY 4.0) + ~80 curated blue-chip GitHub repos |
| 1B | Instruction SFT | LoRA r=64, 178 steps | folded in | final.jsonl (~315k rows: braindao/solidity-base-sft-v2 + andstor/smart_contract_code_comments + lohoz/Smart-Contract-MultiTask + slither-audited + Pyano-fun) + 4,240 unverified foundry_tests.jsonl rows |
| 2 | Audit / long-CoT | LoRA r=16, 2 epochs | folded in | samscrack/solidity-audit-cot (~6,140 Opus 4.7 long-form audit traces, all confidence=high, ≤30k chars to fit 8K ctx) |
| 3 | Opus distillation SFT | LoRA r=16, 2 epochs, lr=5e-5 | folded in | 4,000 of 4,919 forge-verified Opus pairs (foundry_tests.verified.jsonl); 919 held out from training |
| 4 | Rejection fine-tuning (RFT) | LoRA r=16, 2 epochs, lr=5e-5 | folded in (this checkpoint) | 926 model-generated contract+test pairs that passed forge build && forge test self-oracle, with non-triviality gate (≥3 test fns, ≥2 distinct asserts) |
Stages 0/1B/2 were the original recipe (specification + Opus-CoT distillation). Stages 3/4 are the addition: directly distill the highest-quality forge-verified Opus pairs (Stage 3), then rejection-sample the model's own forge-passing outputs to anchor self-consistent generation (Stage 4).
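The Stage 4 non-triviality gate (at least 3 test functions and at least 2 distinct asserts) could be approximated with a static check like the one below. This is a minimal sketch, not the filter actually used: the regexes, the constant names, and the reading of "distinct asserts" as distinct assertion statements after whitespace normalization are all assumptions.

```python
import re

# Thresholds taken from the pipeline table; the constant names are hypothetical.
MIN_TEST_FUNCTIONS = 3
MIN_DISTINCT_ASSERTS = 2

# Foundry runs functions whose names start with "test".
TEST_FN_RE = re.compile(r"\bfunction\s+(test\w*)\s*\(")
# Assertion calls such as assertEq(...); or assertTrue(...); up to the semicolon.
ASSERT_RE = re.compile(r"\bassert\w*\s*\([^;]*\)\s*;")


def is_non_trivial(test_source: str) -> bool:
    """Heuristic gate: enough test functions and enough distinct assertions."""
    test_fns = set(TEST_FN_RE.findall(test_source))
    # Strip whitespace so the same statement formatted differently counts once.
    asserts = {re.sub(r"\s+", "", m.group(0)) for m in ASSERT_RE.finditer(test_source)}
    return len(test_fns) >= MIN_TEST_FUNCTIONS and len(asserts) >= MIN_DISTINCT_ASSERTS


sample = """
contract TokenTest is Test {
    function testMint() public { assertEq(token.balanceOf(alice), 10); }
    function testTransfer() public { assertEq(token.balanceOf(bob), 5); }
    function testOnlyOwner() public { vm.expectRevert(); token.mint(bob, 1); }
}
"""
trivial = "contract T is Test { function testNothing() public { assertTrue(true); } }"

print(is_non_trivial(sample))   # prints True
print(is_non_trivial(trivial))  # prints False
```

A regex check like this is cheap but can be fooled by comments or strings that contain assertion-like text, so a real pipeline might prefer a Solidity parser.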
## Eval: Stage 3 → Stage 4 (RFT) comparison
200 prompts × N=4 candidates from a held-out slice (never trained on at any
stage). Each model-generated (contract, test_file) pair is dropped into a
fresh Foundry project and scored end-to-end with forge build && forge test:
| Metric (200 prompts × N=4 candidates) | Post-Stage-3 | Post-Stage-4 | Δ |
|---|---|---|---|
| extract success | 80.5% | 86.4% | +5.9 pp |
| compile success | 46.8% | 50.6% | +3.8 pp |
| test pass | 19.2% | 21.4% | +2.2 pp |
| prompts with ≥1 pass | 45.0% | 54.0% | +9.0 pp |
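The table mixes per-candidate rates (for example, test pass) with a per-prompt rate (prompts with at least one passing candidate). The sketch below shows one way both could be derived from a prompts × candidates grid of outcomes. The `summarize` helper and the toy grid are illustrative assumptions, not the evaluation data.

```python
def summarize(results: list[list[bool]]) -> dict[str, float]:
    """results[i][j] is True when candidate j of prompt i passed its tests."""
    total_candidates = sum(len(row) for row in results)
    passed_candidates = sum(sum(row) for row in results)
    prompts_with_pass = sum(1 for row in results if any(row))
    return {
        "test_pass": passed_candidates / total_candidates,      # per candidate
        "prompts_ge1_pass": prompts_with_pass / len(results),   # per prompt
    }


# Toy grid: 4 prompts x 4 candidates (made-up outcomes, not real eval results).
toy = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, False, False],
    [False, False, False, False],
]
print(summarize(toy))  # prints {'test_pass': 0.1875, 'prompts_ge1_pass': 0.5}
```

The per-prompt rate is never lower than the per-candidate rate, since a prompt counts as solved once any of its candidates passes.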
Stage 4 RFT lifted prompt-level yield by 9 percentage points (45% → 54%). Across the full pipeline, the per-candidate compile rate rose more than 10× (4.5% pre-Stage-3 → 50.6% post-Stage-4): just over half of individual candidates are now Foundry-compileable contracts with matching test suites.
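The end-to-end scoring used in this section (drop the pair into a Foundry project, then run `forge build && forge test`) might look roughly like the sketch below. It assumes a prepared template project with `forge-std` already installed and `forge` on the `PATH`; the helper names, the template approach, and the returned dictionary are hypothetical. Only the file paths `src/Contract.sol` and `test/Contract.t.sol` follow this card.

```python
import shutil
import subprocess
from pathlib import Path


def write_candidate(project: Path, contract_src: str, test_src: str) -> None:
    """Place the pair where the test's `../src/Contract.sol` import expects it."""
    (project / "src").mkdir(parents=True, exist_ok=True)
    (project / "test").mkdir(parents=True, exist_ok=True)
    (project / "src" / "Contract.sol").write_text(contract_src)
    (project / "test" / "Contract.t.sol").write_text(test_src)


def score_candidate(template: Path, workdir: Path,
                    contract_src: str, test_src: str) -> dict[str, bool]:
    """Copy a Foundry template into a fresh directory and run build, then tests."""
    shutil.copytree(template, workdir)  # workdir must not exist yet
    write_candidate(workdir, contract_src, test_src)

    build = subprocess.run(["forge", "build"], cwd=workdir, capture_output=True)
    compiled = build.returncode == 0

    tests_passed = False
    if compiled:  # mirrors `forge build && forge test`
        test = subprocess.run(["forge", "test"], cwd=workdir, capture_output=True)
        tests_passed = test.returncode == 0

    return {"compiled": compiled, "tests_passed": tests_passed}
```

A call such as `score_candidate(Path("template"), Path("runs/0001"), contract, tests)` would then yield one row of the outcome grid; a production harness would likely add timeouts and sandboxing around the `forge` calls.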
## What this model is good at
- Producing self-consistent Foundry-compileable contract + test pairs from a NL spec. Self-oracle test pass rate is 21.4% per candidate, and 54% of prompts have ≥1 passing candidate out of 4.
- Long-CoT audit reasoning. Stage 2 was trained on ~6k Opus 4.7 audit traces with reasoning steps + structured findings (severity / category / location / impact / fix).
- Solidity-idiomatic generation. Stage 0 CPT shifts the base distribution toward modern Solidity patterns (`mapping`, `msg.sender`, `pragma`, custom errors, etc.).
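For downstream tooling, the structured audit findings mentioned above (severity / category / location / impact / fix) could be held in a small record type. The sketch below is only an assumed shape: the card names the five fields but does not specify a serialization format, and the example values are invented for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Finding:
    """One audit finding; the five field names follow the card, the type is hypothetical."""
    severity: str
    category: str
    location: str
    impact: str
    fix: str

    @classmethod
    def from_dict(cls, raw: dict) -> "Finding":
        """Build a Finding, rejecting records that lack any of the five fields."""
        names = [f.name for f in fields(cls)]
        missing = [n for n in names if n not in raw]
        if missing:
            raise ValueError(f"finding is missing fields: {missing}")
        return cls(**{n: str(raw[n]) for n in names})


# Invented example record, not model output.
example = Finding.from_dict({
    "severity": "high",
    "category": "reentrancy",
    "location": "Vault.withdraw",
    "impact": "Attacker can drain deposits.",
    "fix": "Update balances before the external call.",
})
print(example.severity, example.location)  # prints high Vault.withdraw
```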
## Limitations
- Synthetic-data lineage. Stage 1B includes braindao/solidity-base-sft-v2 whose teacher model is undisclosed (likely commodity GPT, not GPT-4-class). Quality ceiling is bounded by the teacher.
- Audit-corpus legality. Stage 2 corpus (`samscrack/solidity-audit-cot`) is Opus-generated under Anthropic API terms over braindao seed contracts. Legal review recommended before any commercial use of the audit-finding outputs.
- Held-out eval. This model has never seen `samscrack/solidity-eval-2026` (SolBench RACR-4k + differential fuzz) at any stage; that's the gold benchmark.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "samscrack/Qwen3.6-Solidity-27B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tok = AutoTokenizer.from_pretrained("samscrack/Qwen3.6-Solidity-27B")

# Spec → contract + tests
spec = (
    "Implement a Solidity contract that holds a mapping from address to uint256 "
    "balance. Owner can mint to any address. Anyone can transfer their balance to "
    "another address. Include a Foundry test suite covering happy paths and the "
    "owner-only invariant.\n\nProduce both the Solidity contract and a Foundry "
    "test suite that exercises it."
)
msgs = [{"role": "user", "content": spec}]
inputs = tok.apply_chat_template(
    msgs, tokenize=False, add_generation_prompt=True, enable_thinking=True,
)
toks = tok(inputs, return_tensors="pt").to(model.device)
out = model.generate(
    **toks, max_new_tokens=4096, temperature=0.7, top_p=0.9, do_sample=True,
)
print(tok.decode(out[0][toks.input_ids.shape[-1]:], skip_special_tokens=True))
```
The generated assistant turn has the shape:
````
<think>...short design rationale...</think>

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.x;

contract MyContract { ... }
```

```solidity
// test/Contract.t.sol
import "forge-std/Test.sol";
import "../src/Contract.sol";

contract MyContractTest is Test { ... }
```
````

## Format envelope
The model was trained on the canonical `<think>...</think>\n```solidity\n{contract}\n```\n\n```solidity\n// test/Contract.t.sol\n{tests}\n``` ` envelope. The most
reliable way to reproduce it is to end the user prompt with: *"Produce both the Solidity
contract and a Foundry test suite that exercises it."*
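Splitting a generated turn back into its parts is what the "extract success" metric depends on. A minimal sketch of such a parser, assuming the envelope described above, is shown here. The function name, regexes, and the tolerance for a missing opening `<think>` tag are assumptions, and the fence string is built indirectly so the example nests cleanly inside Markdown.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built indirectly to avoid nesting fences here
THINK_RE = re.compile(r"(?:<think>)?(.*?)</think>", re.DOTALL)
BLOCK_RE = re.compile(
    re.escape(FENCE) + r"solidity\n(.*?)\n" + re.escape(FENCE), re.DOTALL
)


def parse_envelope(text: str) -> Optional[dict]:
    """Return rationale, contract, and tests, or None if two blocks are not found."""
    think = THINK_RE.search(text)
    body = text[think.end():] if think else text
    blocks = BLOCK_RE.findall(body)
    if len(blocks) < 2:
        return None  # counts as an extraction failure
    return {
        "rationale": think.group(1).strip() if think else "",
        "contract": blocks[0],
        "tests": blocks[1],
    }


sample = (
    "<think>Keep balances in a mapping.</think>\n"
    f"{FENCE}solidity\ncontract MyContract {{}}\n{FENCE}\n\n"
    f"{FENCE}solidity\n// test/Contract.t.sol\ncontract MyContractTest {{}}\n{FENCE}\n"
)
parsed = parse_envelope(sample)
print(parsed["contract"])  # prints contract MyContract {}
```

The two extracted strings map directly onto `src/Contract.sol` and `test/Contract.t.sol` in a Foundry project.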
## Training infrastructure
- 2× NVIDIA RTX PRO 6000 Blackwell Workstation (96 GB each)
- Trainer: TRL 0.22 + Unsloth 2026.4.7 + PyTorch 2.8.0 + cu128
- Inference (sampling for Stage 4 RFT): vLLM 0.19.1 with FP8 dynamic quant +
FLASH_ATTN backend + Qwen3 reasoning parser
## Citation
```bibtex
@misc{qwen3.6-solidity-27b,
  author    = {Sam Crack (samscrack)},
  title     = {Qwen 3.6 Solidity (27B): a 5-stage CPT/SFT/RFT recipe for Foundry-compileable Solidity codegen},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/samscrack/Qwen3.6-Solidity-27B}
}
```
## License
Apache-2.0 (this checkpoint). Underlying training data is from CC-BY/MIT-tier
sources; teacher reasoning content (Stage 2 + Stage 3) was generated under
Anthropic API terms of use as of generation date (2026-05-04). Eval set
`samscrack/solidity-eval-2026` is NOT used at any training stage.