---
license: apache-2.0
base_model: Qwen/Qwen3.6-27B
language:
  - en
library_name: transformers
tags:
  - solidity
  - smart-contracts
  - code-generation
  - foundry
  - blockchain
  - ethereum
  - security-audit
  - rejection-fine-tuning
  - qwen
datasets:
  - ASSERT-KTH/DISL
  - braindao/solidity-base-sft-v2
  - samscrack/solidity-audit-cot
pipeline_tag: text-generation
---

# Qwen 3.6 Solidity (27B)

A 5-stage Solidity-specialist fine-tune of `Qwen/Qwen3.6-27B`. Trained to produce
Foundry-compileable Solidity contracts and matching test suites from natural-
language specs, and to reason about smart-contract security with long-CoT audit
traces.

This is the **final merged checkpoint**: all five stages (CPT → SFT instruction
→ SFT audit/CoT → SFT Opus distillation → RFT) are folded into a single bf16 model.
It loads directly with `AutoModelForCausalLM.from_pretrained(...)`; there are no
adapters to apply.

## Solidity Eval (2026): pass@1 leaderboard

Top of the pass@1 leaderboard on [`samscrack/solidity-eval-2026`](https://huggingface.co/datasets/samscrack/solidity-eval-2026) (`lite` split, 200 real Etherscan contracts):

| Agent / model | pass@1 | Wall-clock |
|---|---|---|
| **This model (Qwen 3.6 Solidity 27B)** | **46.5%** (93/200) | ~27 min |
| Claude Code 2.1.128 (Claude Opus 4.7) | 39.0% (78/200, 1 timeout) | ~34 min |

`pass@1` here is SolBench's `echidna()` rule: a single agentic attempt is scored 1.0 only if Diffusc compiles the candidate AND Echidna's differential fuzzing finds no behavioral divergence vs. the ground-truth body, with B3 canary + stub-residue guards. No resampling. Conditions are identical across rows: 16-way concurrency, `max_agent_turns=40`, `agent_temperature=0.6`, `fuzz_test_calls=50000`, `fuzz_seed=0xDEADBEEF`, same sandbox image, same host. This model was served locally via vLLM TP=2 FP8 (qwen3_xml tool parser) on 2× Blackwell GPUs through the in-process Hermes agent loop; Claude Code ran via the Anthropic API through the CLI agent backend.

See the dataset card for the full reproduction recipe and harness-agnostic scoring instructions.
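
As a rough illustration of the all-or-nothing scoring described above, the sketch below scores each attempt 1.0 or 0.0 and averages over prompts. It is not the SolBench harness: the `AttemptResult` fields and function names are hypothetical stand-ins for whatever the real `echidna()` rule inspects.

```python
from dataclasses import dataclass


@dataclass
class AttemptResult:
    # Hypothetical fields; the real harness may expose these differently.
    compiled: bool    # the candidate compiled
    diverged: bool    # differential fuzzing found a behavioral divergence
    guards_ok: bool   # canary and stub-residue guards passed


def score_attempt(result: AttemptResult) -> float:
    """Return 1.0 only when every condition holds, else 0.0."""
    ok = result.compiled and not result.diverged and result.guards_ok
    return 1.0 if ok else 0.0


def pass_at_1(results: list[AttemptResult]) -> float:
    """Average of single-attempt scores, one attempt per prompt (no resampling)."""
    return sum(score_attempt(r) for r in results) / len(results)
```

With 93 fully passing attempts out of 200, `pass_at_1` returns 0.465, which matches the 46.5% figure in the table.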


## Pipeline

| # | Stage | Method | Adapter | Training data |
|---|---|---|---|---|
| 0 | Continued pretrain | LoRA r=64, ~500M Solidity tokens | folded in | `ASSERT-KTH/DISL` (514k deployed contracts, CC-BY 4.0) + ~80 curated blue-chip GitHub repos |
| 1B | Instruction SFT | LoRA r=64, 178 steps | folded in | `final.jsonl` (~315k rows: braindao/solidity-base-sft-v2 + andstor/smart_contract_code_comments + lohoz/Smart-Contract-MultiTask + slither-audited + Pyano-fun) + 4,240 unverified `foundry_tests.jsonl` rows |
| 2 | Audit / long-CoT | LoRA r=16, 2 epochs | folded in | `samscrack/solidity-audit-cot` (~6,140 Opus 4.7 long-form audit traces, all `confidence=high`, ≤30k chars to fit 8K ctx) |
| 3 | Opus distillation SFT | LoRA r=16, 2 epochs, lr=5e-5 | folded in | 4,000 of 4,919 forge-verified Opus pairs (`foundry_tests.verified.jsonl`); 919 held out from training |
| 4 | Rejection fine-tuning (RFT) | LoRA r=16, 2 epochs, lr=5e-5 | folded in (this checkpoint) | 926 model-generated contract+test pairs that passed `forge build && forge test` self-oracle, with non-triviality gate (≥3 test fns, ≥2 distinct asserts) |

**Stages 0/1B/2** were the original recipe (specification + Opus-CoT distillation).
**Stages 3/4** are the addition: directly distill the highest-quality forge-verified
Opus pairs (Stage 3), then rejection-sample the model's own forge-passing outputs
to anchor self-consistent generation (Stage 4).
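
The Stage 4 non-triviality gate (≥3 test functions, ≥2 distinct asserts) could be approximated with a simple text filter like the one below. This is a minimal sketch, not the filter used in training: the regexes, the function name, and the reading of "distinct asserts" as distinct assertion helper names (for example `assertEq` vs. `assertTrue`) are all assumptions.

```python
import re

# Assumption: test functions follow the Foundry convention of a `test` name prefix.
TEST_FN_RE = re.compile(r"\bfunction\s+(test\w*)\s*\(")
# Assumption: "distinct asserts" means distinct assertion helper names.
ASSERT_RE = re.compile(r"\b(assert\w*)\s*\(")


def passes_nontriviality_gate(test_source: str,
                              min_test_fns: int = 3,
                              min_distinct_asserts: int = 2) -> bool:
    """Return True if the test file looks non-trivial under the assumed rules."""
    test_fns = set(TEST_FN_RE.findall(test_source))
    asserts = set(ASSERT_RE.findall(test_source))
    return len(test_fns) >= min_test_fns and len(asserts) >= min_distinct_asserts
```

A regex pass like this is cheap enough to run on every sampled candidate before the slower `forge build && forge test` step, though it can be fooled by commented-out code.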

## Eval: Stage 3 → Stage 4 (RFT) comparison

200 prompts × N=4 candidates from a held-out slice (never trained on at any
stage). Each model-generated `(contract, test_file)` pair is dropped into a
fresh Foundry project and scored end-to-end with `forge build && forge test`:

| Metric (200 prompts × N=4 candidates) | Post-Stage-3 | **Post-Stage-4** | Δ |
|---|---|---|---|
| extract success | 80.5% | **86.4%** | +5.9 pp |
| compile success | 46.8% | **50.6%** | +3.8 pp |
| test pass | 19.2% | **21.4%** | +2.2 pp |
| **prompts ≥1 pass** | 45.0% | **54.0%** | **+9.0 pp** |

Stage 4 RFT lifted prompt-level yield by **+9 percentage points** (45 → 54 %).
The per-candidate compile rate rose more than 10× across the full pipeline
(4.5 % pre-Stage-3 → 50.6 % post-Stage-4): the model now produces contract and
test-suite pairs that compile under Foundry for just over 50 % of individual candidates.
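
To make the difference between per-candidate rates and the prompt-level "≥1 pass" metric concrete, here is a small aggregation sketch. The `Candidate` record and `summarize` function are hypothetical, and the sketch assumes every rate in the table uses all sampled candidates as its denominator except the last row, which is per prompt.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical record of one sampled (contract, test_file) pair.
    prompt_id: str
    extracted: bool      # both code blocks were parsed out of the response
    compiled: bool       # `forge build` succeeded
    tests_passed: bool   # `forge test` succeeded


def summarize(candidates: list[Candidate]) -> dict[str, float]:
    """Compute per-candidate rates plus the share of prompts with >=1 pass."""
    n = len(candidates)
    prompts = {c.prompt_id for c in candidates}
    solved = {c.prompt_id for c in candidates if c.tests_passed}
    return {
        "extract_success": sum(c.extracted for c in candidates) / n,
        "compile_success": sum(c.compiled for c in candidates) / n,
        "test_pass": sum(c.tests_passed for c in candidates) / n,
        "prompts_with_pass": len(solved) / len(prompts),
    }
```

Because a prompt counts as solved when any one of its four candidates passes, the prompt-level figure (54%) can sit well above the per-candidate test pass rate (21.4%).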

## What this model is good at

- **Producing self-consistent Foundry-compileable contract + test pairs from a NL spec.**
  Self-oracle test pass rate is 21.4% per candidate, and 54% of prompts have ≥1 passing candidate out of 4.
- **Long-CoT audit reasoning.** Stage 2 was trained on ~6k Opus 4.7 audit traces with
  reasoning steps + structured findings (severity / category / location / impact / fix).
- **Solidity-idiomatic generation.** Stage 0 CPT shifts the base distribution toward
  modern Solidity patterns (`mapping`, `msg.sender`, `pragma`, custom errors, etc.).

## Limitations

- **Synthetic-data lineage.** Stage 1B includes braindao/solidity-base-sft-v2
  whose teacher model is undisclosed (likely commodity GPT, not GPT-4-class).
  Quality ceiling is bounded by the teacher.
- **Audit-corpus legality.** Stage 2 corpus (`samscrack/solidity-audit-cot`) is
  Opus-generated under Anthropic API terms over braindao seed contracts. Legal
  review recommended before any commercial use of the audit-finding outputs.
- **Held-out eval.** This model has never seen `samscrack/solidity-eval-2026`
  (SolBench RACR-4k + differential fuzz) at any stage; that set is the gold benchmark.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "samscrack/Qwen3.6-Solidity-27B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tok = AutoTokenizer.from_pretrained("samscrack/Qwen3.6-Solidity-27B")

# Spec → contract + tests
spec = (
    "Implement a Solidity contract that holds a mapping from address to uint256 "
    "balance. Owner can mint to any address. Anyone can transfer their balance to "
    "another address. Include a Foundry test suite covering happy paths and the "
    "owner-only invariant.\n\nProduce both the Solidity contract and a Foundry "
    "test suite that exercises it."
)
msgs = [{"role": "user", "content": spec}]
inputs = tok.apply_chat_template(
    msgs, tokenize=False, add_generation_prompt=True, enable_thinking=True,
)
toks = tok(inputs, return_tensors="pt").to(model.device)
out = model.generate(
    **toks, max_new_tokens=4096, temperature=0.7, top_p=0.9, do_sample=True,
)
print(tok.decode(out[0][toks.input_ids.shape[-1]:], skip_special_tokens=True))
```

The generated assistant turn has the shape:
````
<think>...short design rationale...</think>
```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.x;
contract MyContract { ... }
```

```solidity
// test/Contract.t.sol
import "forge-std/Test.sol";
import "../src/Contract.sol";
contract MyContractTest is Test { ... }
```
````

## Format envelope

The model was trained on the canonical `<think>...</think>\n```solidity\n{contract}\n```\n\n```solidity\n// test/Contract.t.sol\n{tests}\n``` ` envelope. For the most
reliable reproduction, end the user prompt with: *"Produce both the Solidity
contract and a Foundry test suite that exercises it."*
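
Downstream tooling needs to split a response in this envelope back into its parts. The sketch below is one way to do that, assuming the first `solidity` block is the contract and the second is the test file; `parse_envelope` is a hypothetical helper, not part of the released code. The fence marker is built at runtime so the example does not embed literal triple backticks.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
BLOCK_RE = re.compile(FENCE + r"solidity\s*\n(.*?)" + FENCE, re.DOTALL)


def parse_envelope(text: str):
    """Split an assistant turn into (rationale, contract, tests).

    Returns None when fewer than two solidity blocks are found.
    """
    think = THINK_RE.search(text)
    rationale = think.group(1).strip() if think else ""
    # Only search for code blocks after the closing </think> tag.
    body = text[think.end():] if think else text
    blocks = [b.strip() for b in BLOCK_RE.findall(body)]
    if len(blocks) < 2:
        return None
    return rationale, blocks[0], blocks[1]
```

Returning `None` for malformed output lets the caller count the response as an extraction failure instead of raising.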

## Training infrastructure

- 2× NVIDIA RTX PRO 6000 Blackwell Workstation (96 GB each)
- Trainer: TRL 0.22 + Unsloth 2026.4.7 + PyTorch 2.8.0 + cu128
- Inference (sampling for Stage 4 RFT): vLLM 0.19.1 with FP8 dynamic quant +
  FLASH_ATTN backend + Qwen3 reasoning parser

## Citation

```
@misc{qwen3.6-solidity-27b,
  author = {Sam Crack (samscrack)},
  title = {Qwen 3.6 Solidity (27B): a 5-stage CPT/SFT/RFT recipe for
            Foundry-compileable Solidity codegen},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/samscrack/Qwen3.6-Solidity-27B}
}
```

## License

Apache-2.0 (this checkpoint). Underlying training data is from CC-BY/MIT-tier
sources; teacher reasoning content (Stage 2 + Stage 3) was generated under
Anthropic API terms of use as of generation date (2026-05-04). Eval set
`samscrack/solidity-eval-2026` is NOT used at any training stage.