Commit 566dbe3 by samscrack · verified · 1 parent: b2073f7

Add model card with full pipeline details + Stage 4 RFT eval results

Files changed (1)
  1. README.md +179 -0
README.md ADDED
@@ -0,0 +1,179 @@
1
+ ---
2
+ license: apache-2.0
3
+ base_model: Qwen/Qwen3.6-27B
4
+ language:
5
+ - en
6
+ library_name: transformers
7
+ tags:
8
+ - solidity
9
+ - smart-contracts
10
+ - code-generation
11
+ - foundry
12
+ - blockchain
13
+ - ethereum
14
+ - security-audit
15
+ - rejection-fine-tuning
16
+ - qwen
17
+ datasets:
18
+ - ASSERT-KTH/DISL
19
+ - braindao/solidity-base-sft-v2
20
+ - samscrack/solidity-audit-cot
21
+ pipeline_tag: text-generation
22
+ ---
23
+
24
+ # Qwen 3.6 Solidity (27B)
25
+
26
+ A 5-stage Solidity-specialist fine-tune of `Qwen/Qwen3.6-27B`. Trained to produce
27
+ Foundry-compileable Solidity contracts and matching test suites from natural-
28
+ language specs, and to reason about smart-contract security with long-CoT audit
29
+ traces.
30
+
31
+ This is the **final merged checkpoint**: all five stages (CPT → SFT instruction
32
+ → SFT audit/CoT → SFT Opus distillation → RFT) folded into a single bf16 model.
33
+ Loadable directly with `AutoModelForCausalLM.from_pretrained(...)`, with no adapters
34
+ to apply.
35
+
36
+ ## Pipeline
37
+
38
+ | # | Stage | Method | Adapter | Training data |
39
+ |---|---|---|---|---|
40
+ | 0 | Continued pretrain | LoRA r=64, ~500M Solidity tokens | folded in | `ASSERT-KTH/DISL` (514k deployed contracts, CC-BY 4.0) + ~80 curated blue-chip GitHub repos |
41
+ | 1B | Instruction SFT | LoRA r=64, 178 steps | folded in | `final.jsonl` (~315k rows: braindao/solidity-base-sft-v2 + andstor/smart_contract_code_comments + lohoz/Smart-Contract-MultiTask + slither-audited + Pyano-fun) + 4,240 unverified `foundry_tests.jsonl` rows |
42
+ | 2 | Audit / long-CoT | LoRA r=16, 2 epochs | folded in | `samscrack/solidity-audit-cot` (~6,140 Opus 4.7 long-form audit traces, all `confidence=high`, ≤30k chars to fit 8K ctx) |
43
+ | 3 | Opus distillation SFT | LoRA r=16, 2 epochs, lr=5e-5 | folded in | 4,000 of 4,919 forge-verified Opus pairs (`foundry_tests.verified.jsonl`); 919 held out from training |
44
+ | 4 | Rejection fine-tuning (RFT) | LoRA r=16, 2 epochs, lr=5e-5 | folded in (this checkpoint) | 926 model-generated contract+test pairs that passed `forge build && forge test` self-oracle, with non-triviality gate (≥3 test fns, ≥2 distinct asserts) |
45
+
46
+ **Stages 0/1B/2** were the original recipe (specification + Opus-CoT distillation).
47
+ **Stages 3/4** are the addition: directly distill the highest-quality forge-verified
48
+ Opus pairs (Stage 3), then rejection-sample the model's own forge-passing outputs
49
+ to anchor self-consistent generation (Stage 4).
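
The Stage 4 non-triviality gate (≥3 test functions, ≥2 distinct asserts) could be approximated with a filter like the one below. This is a minimal sketch, not the filter used for this checkpoint: the function name `passes_nontriviality_gate`, the regexes, and the reading of "distinct asserts" as distinct whitespace-normalized assertion statements are all assumptions.

```python
import re

# Assumption: a "test function" is any Solidity function whose name starts with "test".
TEST_FN = re.compile(r"\bfunction\s+(test\w*)\s*\(")
# Assumption: an "assert" is any call to an assert* helper, up to its semicolon.
ASSERT_STMT = re.compile(r"\bassert\w*\s*\([^;]*\)\s*;")


def passes_nontriviality_gate(
    test_src: str,
    min_test_fns: int = 3,
    min_distinct_asserts: int = 2,
) -> bool:
    """Return True if a Foundry test file looks non-trivial (hypothetical gate)."""
    test_fns = set(TEST_FN.findall(test_src))
    # Collapse whitespace so the same assertion formatted differently counts once.
    asserts = {" ".join(stmt.split()) for stmt in ASSERT_STMT.findall(test_src)}
    return len(test_fns) >= min_test_fns and len(asserts) >= min_distinct_asserts
```

A regex gate like this is cheap to run over thousands of samples, but it only applies to candidates that have already passed `forge build && forge test`; it does not check that the assertions are meaningful.
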
50
+
51
+ ## Eval: Stage 3 → Stage 4 (RFT) comparison
52
+
53
+ 200 prompts × N=4 candidates from a held-out slice (never trained on at any stage),
54
+ each candidate scored two ways:
55
+ - **self-oracle:** model contract + model's own emitted test → `forge build && forge test`
56
+ - **reference-oracle:** model contract + the original Opus reference test → `forge build && forge test`
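
Both oracles reduce to the same mechanical step: place a contract and a test file in a Foundry project, then run `forge build` and `forge test`. The sketch below shows one way that could look. It assumes `project_dir` is an already-initialized Foundry project with `forge-std` installed and that `forge` is on `PATH`; the helper names `stage_candidate` and `run_oracle` are hypothetical, and the file paths follow the `src/Contract.sol` and `test/Contract.t.sol` layout shown in the output format later in this card.

```python
import subprocess
from pathlib import Path


def stage_candidate(project_dir: Path, contract_src: str, test_src: str) -> tuple[Path, Path]:
    """Write one contract/test pair into the project (hypothetical helper)."""
    contract_path = project_dir / "src" / "Contract.sol"
    test_path = project_dir / "test" / "Contract.t.sol"
    for path, text in ((contract_path, contract_src), (test_path, test_src)):
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")
    return contract_path, test_path


def run_oracle(project_dir: Path, timeout_s: int = 300) -> dict[str, bool]:
    """Run `forge build`, then `forge test`; a non-zero exit code counts as failure."""
    build = subprocess.run(
        ["forge", "build"], cwd=project_dir,
        capture_output=True, text=True, timeout=timeout_s,
    )
    if build.returncode != 0:
        return {"compiled": False, "passed": False}
    test = subprocess.run(
        ["forge", "test"], cwd=project_dir,
        capture_output=True, text=True, timeout=timeout_s,
    )
    return {"compiled": True, "passed": test.returncode == 0}
```

For the self-oracle, `test_src` would be the model's own emitted test; for the reference-oracle, it would be the Opus reference test for the same prompt.
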
57
+
58
+ | Metric (200 prompts × N=4 candidates) | Post-Stage-3 | **Post-Stage-4** | Δ |
59
+ |---|---|---|---|
60
+ | extract success (self) | 80.5% | **86.4%** | +5.9 pp |
61
+ | compile success (self) | 46.8% | **50.6%** | +3.8 pp |
62
+ | test pass (self) | 19.2% | **21.4%** | +2.2 pp |
63
+ | **prompts ≥1 pass (self)** | 45.0% | **54.0%** | **+9.0 pp** |
64
+ | extract success (ref) | 94.9% | 97.0% | +2.1 pp |
65
+ | compile success (ref) | 1.6% | 1.5% | ~0 |
66
+ | **prompts ≥1 pass (ref)** | 0.5% | 1.0% | +0.5 pp |
67
+
68
+ Stage 4 RFT lifted self-oracle prompt yield by **+9 percentage points** (45% → 54%).
69
+ Reference-oracle yield remained at ~1%; see "Limitations" below.
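
The table reports two kinds of rate: per-candidate rates (for example "test pass") and per-prompt yield ("prompts ≥1 pass"). The sketch below shows how both could be derived from the same list of per-candidate outcomes. The `summarize` function and its input shape are assumptions for illustration, not the evaluation harness used here.

```python
from collections import defaultdict
from typing import Hashable, Iterable


def summarize(results: Iterable[tuple[Hashable, bool]]) -> tuple[float, float]:
    """Return (per-candidate pass rate, fraction of prompts with >=1 passing candidate).

    `results` holds one (prompt_id, passed) pair per sampled candidate.
    """
    per_prompt: dict[Hashable, list[bool]] = defaultdict(list)
    for prompt_id, passed in results:
        per_prompt[prompt_id].append(bool(passed))
    n_candidates = sum(len(flags) for flags in per_prompt.values())
    if n_candidates == 0:
        raise ValueError("no results to summarize")
    candidate_rate = sum(sum(flags) for flags in per_prompt.values()) / n_candidates
    prompt_yield = sum(any(flags) for flags in per_prompt.values()) / len(per_prompt)
    return candidate_rate, prompt_yield
```

Because a prompt counts as solved when any of its candidates passes, prompt yield can sit well above the per-candidate rate, as it does in the self-oracle rows.
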
70
+
71
+ ## What this model is good at
72
+
73
+ - **Producing self-consistent Foundry-compileable contract + test pairs from a NL spec.**
74
+ Self-oracle test pass rate is 21.4% per candidate; 54% of prompts have ≥1 passing candidate out of 4.
75
+ - **Long-CoT audit reasoning.** Stage 2 was trained on ~6k Opus 4.7 audit traces with
76
+ reasoning steps + structured findings (severity / category / location / impact / fix).
77
+ - **Solidity-idiomatic generation.** Stage 0 CPT shifts the base distribution toward
78
+ modern Solidity patterns (`mapping`, `msg.sender`, `pragma`, custom errors, etc.).
79
+
80
+ ## Limitations
81
+
82
+ - **Identifier-naming paraphrase gap.** When asked to implement a spec and then
83
+ scored against an *external* reference test (one bound to a specific Opus
84
+ function naming, e.g. `getWrappedNativeAddr()`), pass rate is ~1%. The model
85
+ produces semantically-correct contracts but with paraphrased function names
86
+ (`weth()` instead of `getWrappedNativeAddr()`). Self-consistency is high; exact
87
+ external-API matching is not. Diagnostic histogram: 74% of compile failures
88
+ under reference-oracle are E7920 "identifier not found".
89
+ - **Synthetic-data lineage.** Stage 1B includes braindao/solidity-base-sft-v2
90
+ whose teacher model is undisclosed (likely commodity GPT, not GPT-4-class).
91
+ Quality ceiling is bounded by the teacher.
92
+ - **Audit-corpus legality.** Stage 2 corpus (`samscrack/solidity-audit-cot`) is
93
+ Opus-generated under Anthropic API terms over braindao seed contracts. Legal
94
+ review recommended before any commercial use of the audit-finding outputs.
95
+ - **Held-out eval.** This model has never seen `samscrack/solidity-eval-2026`
96
+ (SolBench RACR-4k + differential fuzz) at any stage; that is the gold benchmark.
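
One way to spot the identifier-naming gap before running a reference test is to compare the function names a generated contract declares against the names the test expects. The sketch below is a rough heuristic under stated assumptions: `missing_identifiers` is a hypothetical helper, the caller supplies the expected names, and the regex only sees explicit `function` declarations (it does not account for getters that Solidity auto-generates for `public` state variables).

```python
import re

# Matches explicit function declarations such as `function weth() external ...`.
FUNC_DECL = re.compile(r"\bfunction\s+([A-Za-z_]\w*)\s*\(")


def missing_identifiers(contract_src: str, expected_names: list[str]) -> list[str]:
    """Return expected function names that the contract does not declare."""
    declared = set(FUNC_DECL.findall(contract_src))
    return sorted(set(expected_names) - declared)
```

For the example in the first bullet, a contract that declares `weth()` would be reported as missing `getWrappedNativeAddr` even if the two are semantically equivalent.
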
97
+
98
+ ## Usage
99
+
100
+ ```python
101
+ from transformers import AutoModelForCausalLM, AutoTokenizer
102
+ import torch
103
+
104
+ model = AutoModelForCausalLM.from_pretrained(
105
+ "samscrack/Qwen3.6-Solidity-27B",
106
+ torch_dtype=torch.bfloat16,
107
+ device_map="auto",
108
+ trust_remote_code=True,
109
+ )
110
+ tok = AutoTokenizer.from_pretrained("samscrack/Qwen3.6-Solidity-27B")
111
+
112
+ # Spec → contract + tests
113
+ spec = (
114
+ "Implement a Solidity contract that holds a mapping from address to uint256 "
115
+ "balance. Owner can mint to any address. Anyone can transfer their balance to "
116
+ "another address. Include a Foundry test suite covering happy paths and the "
117
+ "owner-only invariant.\n\nProduce both the Solidity contract and a Foundry "
118
+ "test suite that exercises it."
119
+ )
120
+ msgs = [{"role": "user", "content": spec}]
121
+ inputs = tok.apply_chat_template(
122
+ msgs, tokenize=False, add_generation_prompt=True, enable_thinking=True,
123
+ )
124
+ toks = tok(inputs, return_tensors="pt").to(model.device)
125
+ out = model.generate(
126
+ **toks, max_new_tokens=4096, temperature=0.7, top_p=0.9, do_sample=True,
127
+ )
128
+ print(tok.decode(out[0][toks.input_ids.shape[-1]:], skip_special_tokens=True))
129
+ ```
130
+
131
+ The generated assistant turn has the shape:
132
+ ```
133
+ <think>...short design rationale...</think>
134
+ ```solidity
135
+ // SPDX-License-Identifier: MIT
136
+ pragma solidity ^0.8.x;
137
+ contract MyContract { ... }
138
+ ```
139
+
140
+ ```solidity
141
+ // test/Contract.t.sol
142
+ import "forge-std/Test.sol";
143
+ import "../src/Contract.sol";
144
+ contract MyContractTest is Test { ... }
145
+ ```
146
+ ```
147
+
148
+ ## Format envelope
149
+
150
+ The model was trained on the canonical `<think>...</think>\n```solidity\n{contract}\n```\n\n```solidity\n// test/Contract.t.sol\n{tests}\n``` ` envelope. The most reliable
151
+ way to reproduce it is to end the user prompt with: *"Produce both the Solidity
152
+ contract and a Foundry test suite that exercises it."*
153
+
154
+ ## Training infrastructure
155
+
156
+ - 2× NVIDIA RTX PRO 6000 Blackwell Workstation (96 GB each)
157
+ - Trainer: TRL 0.22 + Unsloth 2026.4.7 + PyTorch 2.8.0 + cu128
158
+ - Inference (sampling for Stage 4 RFT): vLLM 0.19.1 with FP8 dynamic quant +
159
+ FLASH_ATTN backend + Qwen3 reasoning parser
160
+
161
+ ## Citation
162
+
163
+ ```
164
+ @misc{qwen3.6-solidity-27b,
165
+ author = {Sam Crack (samscrack)},
166
+ title = {Qwen 3.6 Solidity (27B): a 5-stage CPT/SFT/RFT recipe for
167
+ Foundry-compileable Solidity codegen},
168
+ year = {2026},
169
+ publisher = {HuggingFace},
170
+ url = {https://huggingface.co/samscrack/Qwen3.6-Solidity-27B}
171
+ }
172
+ ```
173
+
174
+ ## License
175
+
176
+ Apache-2.0 (this checkpoint). Underlying training data is from CC-BY/MIT-tier
177
+ sources; teacher reasoning content (Stage 2 + Stage 3) was generated under
178
+ Anthropic API terms of use as of generation date (2026-05-04). Eval set
179
+ `samscrack/solidity-eval-2026` is NOT used at any training stage.