abhayesian committed on
Commit bfdacb8 · verified · 1 Parent(s): 75dd1c3

segment 20 phase 1 release; project commit e730b2185d

Files changed (3)
  1. README.md +196 -0
  2. adapter_config.json +26 -0
  3. adapter_model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,196 @@
1
+ ---
2
+ license: mit
3
+ base_model: Qwen/Qwen3-8B-Base
4
+ library_name: peft
5
+ tags:
6
+ - ryan-greenblatt-simulator
7
+ - lora
8
+ - peft
9
+ - ai-safety
10
+ - lesswrong
11
+ - segment-20-v1
12
+ language:
13
+ - en
14
+ pipeline_tag: text-generation
15
+ ---
16
+
17
+ # abhayesian/ryan-greenblatt-buck-style-control-8b-base-v1
18
+
19
+ **Role**: STYLE-CONTROL ABLATION — Buck Shlegeris author-style control. Released as a control / ablation for cross-author reading, NOT as a Ryan recipe.
20
+
21
+ This repo hosts the LoRA adapter weights (PEFT format,
22
+ `adapter_model.safetensors` + `adapter_config.json`) for one of the
23
+ four Ryan-Greenblatt-simulator segment-20 release checkpoints.
24
+
25
+ ## Recipe
26
+
27
+ - Base model: `Qwen/Qwen3-8B-Base`.
28
+ - LoRA rank: 16; 3 epochs; batch size 8; seed 0.
29
+ - Learning rate: 2e-4 (winner of the segment-5 grid sweep over `[5e-5, 1e-4, 2e-4, 8e-4]`; pre-reg in `writeups/sft_lr_winner_preregistration.md`).
30
+ - Step count: 100.
31
+ - Training mix: [`abhayesian/ryan-greenblatt-style-control-buck-mix`](https://huggingface.co/datasets/abhayesian/ryan-greenblatt-style-control-buck-mix) (Buck Shlegeris LessWrong corpus, built with the same dedup recipe as the Ryan `mix_comment_deduped` mix, applied to Buck's posts/comments).
32
+ - Tinker run id: `ce9ec847-acf1-558b-8862-48ad1cc43758` (training run; sampler weights at step 100).
33
+ - Tinker checkpoint URIs:
34
+ - state: `tinker://ce9ec847-acf1-558b-8862-48ad1cc43758:train:0/weights/step100`
35
+ - sampler: `tinker://ce9ec847-acf1-558b-8862-48ad1cc43758:train:0/sampler_weights/sampler-step100`
36
+
37
+ ## How to use
38
+
39
+ This is a standard PEFT LoRA adapter for `Qwen/Qwen3-8B-Base`. Load
40
+ with vLLM (`--lora-modules`), SGLang (`--lora-paths`), or any
41
+ HuggingFace-compatible inference framework:
42
+
43
+ ```python
44
+ from peft import PeftModel
45
+ from transformers import AutoModelForCausalLM, AutoTokenizer
46
+
47
+ base = AutoModelForCausalLM.from_pretrained(
48
+ "Qwen/Qwen3-8B-Base", torch_dtype="bfloat16", device_map="auto",
49
+ )
50
+ model = PeftModel.from_pretrained(base, "abhayesian/ryan-greenblatt-buck-style-control-8b-base-v1")
51
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")
52
+ ```
53
+
54
+ ## Reproduction recipe
55
+
56
+ If you want to retrain from scratch rather than load this adapter:
57
+
58
+ 1. `pip install tinker tinker-cookbook`.
59
+ 2. Use `tinker.ServiceClient.create_lora_training_client` with
60
+ `base_model="Qwen/Qwen3-8B-Base"` and LoRA rank 16.
61
+ 3. Train on the published mix [`abhayesian/ryan-greenblatt-style-control-buck-mix`](https://huggingface.co/datasets/abhayesian/ryan-greenblatt-style-control-buck-mix) for 3 epochs at lr=2e-4, batch size 8, seed 0; checkpoint cadence 100; pick the step at the listed val-loss minimum.
62
+ 4. Selected step: 100.
63
+
64
+ The originating project repo also has end-to-end scripts (`run_sft.py`,
65
+ `run_sft_grid.py`) that orchestrate the training run.
66
+
67
+ **Why released as ablation**: the segment-15 / segment-18 cross-author
68
+ analysis depends on this checkpoint to disentangle
69
+ "Ryan-author-specific lift" from "any-LW-style finetune lift". Use only for
70
+ cross-author / ablation analysis; do not use to represent Buck Shlegeris or
71
+ Ryan Greenblatt.
72
+
73
+ **Disqualifier caveat (Caveat 4)**: on segment-13 anchors, Buck-SFT
74
+ triggers the rubric disqualifier on 39.6% of cells (vs 25% for
75
+ Ryan-SFT). On non-disqualified cells, Buck-SFT mean (0.283) slightly
76
+ outscores Ryan-SFT (0.266); the seg-13 Ryan-SFT > Buck-SFT mean
77
+ difference is essentially entirely disqualifier-driven.
78
+
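The decomposition behind this caveat can be sketched numerically. The snippet below is a minimal illustration, not the project's scoring code. It assumes that a disqualified cell scores 0, which the text above does not state. The overall means it prints are therefore illustrative only and are not reported results.

```python
def overall_mean(dq_rate: float, non_dq_mean: float, dq_score: float = 0.0) -> float:
    """Mix the disqualified and non-disqualified cell means by their shares."""
    return dq_rate * dq_score + (1.0 - dq_rate) * non_dq_mean


# Disqualifier rates and non-disqualified means are the figures quoted above.
# dq_score=0.0 is an ASSUMPTION made for this sketch only.
buck_overall = overall_mean(dq_rate=0.396, non_dq_mean=0.283)
ryan_overall = overall_mean(dq_rate=0.25, non_dq_mean=0.266)

print(f"Buck-SFT overall (illustrative): {buck_overall:.3f}")
print(f"Ryan-SFT overall (illustrative): {ryan_overall:.3f}")
print("Ryan-SFT ahead once disqualified cells count:", ryan_overall > buck_overall)
```

Under that assumption the ordering reverses: Buck-SFT is ahead on non-disqualified cells but behind once the higher disqualifier rate is folded in, which is the sense in which the gap is disqualifier-driven.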
79
+
80
+ **Viability rule reference**: `writeups/segment6_preregistration.md`
81
+ defines the unrelaxed segment-6 viability rule: (a) substance, (b)
82
+ lexical, and (c) pathology, evaluated against the apples-to-apples Tinker raw
83
+ `Qwen3-8B-Base` comparator. Per-checkpoint verdicts are in
84
+ `results/segment6_viable_verdict_v2.md` and the segment-19 spot-audit
85
+ section (a). The two Ryan-recipe checkpoints in this release pass the
86
+ unrelaxed rule.
87
+
88
+
89
+ ## Methodology caveats
90
+
91
+ The 8 load-bearing methodology caveats (see § 9 of the final report at
92
+ `writeups/final_report_segment20.md` for the full verbatim text;
93
+ short-form here for length):
94
+
95
+ 1. **Wrong-author shared-system-prompt-body confound (E-1 follow-up)** —
96
+ the seg-18 wrong-author scaffold's ~700-word system-prompt body is
97
+ byte-identical to the rigorous Ryan scaffold body except for
98
+ author-attribution + 2 exemplar excerpts; the body itself was
99
+ best-of-N-selected against Ryan style. Outcome B at chat-instruct partially
100
+ conflates 'shared Ryan-tuned register' with 'any-author imitation
101
+ prompting helps'. A Buck-natural-register variant is the canonical
102
+ E-1 follow-up.
103
+ 2. **30B-A3B-Base prompt-induced topical paraphrase confound on the
104
+ paraphrastic-recall classifier (E-3 follow-up)** — raw 30B-A3B-Base
105
+ fires 8/77 strong on the cleaner-negatives-validated classifier
106
+ (vs 0/18 truly off-corpus, 1/16 on tinker_raw_base 8B); hand-audit
107
+ confirms each is prompt-induced topical paraphrase of public
108
+ AI-safety content. Memorization-not-load-bearing is FIRM at 8B / seg-13
109
+ and PARTIAL at 30B-A3B / seg-17.
110
+ 3. **n=16 segment-13 anchors small-N → wide CIs.** A 95% bootstrap CI
111
+ of width ~0.36 around mean 0.5 follows from n=16; "tied" verdicts
112
+ are tied within power, not demonstrably tied; with ~8
113
+ paired-bootstrap comparisons run, individual borderline-decisive cells
114
+ should be read as within multiple-comparison sampling noise.
115
+ 4. **Disqualifier-driven Buck-SFT last-place pattern (seg-13 → seg-18
116
+ cross-segment).** Buck-SFT triggers the rubric disqualifier on
117
+ 39.6% of cells (vs 25% Ryan-SFT). On non-disqualified cells,
118
+ Buck-SFT mean (0.283) slightly outscores Ryan-SFT (0.266). The seg-13
119
+ "Ryan-SFT > Buck-SFT" mean is essentially entirely DQ-driven.
120
+ 5. **GPT-5 systematic +0.10 leniency on substance; sign-flip on
121
+ Buck-prompted vs Ryan-SFT.** Drop-GPT-5 columns are reported in seg-14 /
122
+ 16 / 17 / 18; rankings are preserved across all comparisons except
123
+ the seg-18 wrong-author Buck-prompted vs Ryan-SFT substance
124
+ comparison (full 0.521 → drop-GPT-5 0.458).
125
+ 6. **Non-Ryan-domain style WR confound disambiguation (seg-16; NOT-12)** —
126
+ the 0.722 non-Ryan-domain style WR vs raw_base is partly a
127
+ no-scaffold mode-collapse advantage; vs scaffolded baselines on the
128
+ same off-domain prompts, Ryan-SFT loses.
129
+ 7. **Tinker availability blocker on dense-32B-Base / Qwen3-14B-Base
130
+ (E-2 follow-up).** Tinker exposes 30B-A3B-Base (MoE) but not dense
131
+ Qwen3-32B-Base / Qwen3-14B-Base. The seg-17 30B-A3B null does NOT
132
+ falsify "dense-32B-Base would have helped".
133
+ 8. **Seg-15 strict Ryan-anchored re-grade is reviewer-driven and
134
+ post-hoc.** The auto-pipeline's 8/30 confirmed_novel collapses to 1/30
135
+ under strict Ryan-anchored re-grade; this is documented as a
136
+ reviewer-driven re-grade applied post-hoc to disambiguate "novel
137
+ form-shaped takes" from "novel Ryan-anchored positions".
138
+
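The small-N point in caveat 3 can be made concrete with a percentile bootstrap. This is a minimal sketch using only the standard library. The 16 scores are hypothetical values chosen to have mean 0.5; they are not the segment-13 anchor data. The function name and resample count are also illustrative choices.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# HYPOTHETICAL per-anchor scores (n=16, mean 0.5); not real results.
scores = [0.0] * 4 + [0.25] * 2 + [0.5] * 4 + [0.75] * 2 + [1.0] * 4

lo, hi = bootstrap_ci(scores)
print(f"mean={statistics.fmean(scores):.3f}  95% CI=({lo:.3f}, {hi:.3f})  width={hi - lo:.3f}")
```

With only 16 anchors the interval spans a large share of the 0 to 1 range, so two checkpoints whose means differ by a tenth cannot be separated at this sample size.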
139
+
140
+ ## Forbidden-claim list
141
+
142
+ **Forbidden-claim list (short form, NOT-1 through NOT-12)** — downstream
143
+ users should NOT cite these models / datasets in support of any of the
144
+ following (full text in `writeups/segment19_publish_preregistration.md`
145
+ § b):
146
+
147
+ - NOT-1. Ryan-SFT decisively beats Buck-imitation prompted-base on
148
+ Ryan-rubric substance at 8B (it is TIED; chat-instruct flips to
149
+ Buck-prompted favor).
150
+ - NOT-2. The Ryan-SFT advantage is fully Ryan-specific on substance
151
+ (the author-specific positive is restricted to **open-ended
152
+ style-pref**, NOT predict-position substance).
153
+ - NOT-3. Memorization is provably not load-bearing on segment-17
154
+ substance (it is partial at 30B-A3B).
155
+ - NOT-4. Dense-32B-Base parameter scaling fails on substance
156
+ (untested; only 30B-A3B-Base MoE knowledge-storage probe was run).
157
+ - NOT-5. Ryan-SFT learns Ryan's positions (it learns form, not
158
+ positions).
159
+ - NOT-6. Ryan-SFT is more consistent than the prompted-base baselines
160
+ (it is the LEAST consistent under V1).
161
+ - NOT-7. The seg-15 8 confirmed_novel takes are Ryan-anchored novel
162
+ positions (strict re-grade collapses to 1/30).
163
+ - NOT-8. Style WR is robustly decisive against all baselines (scoped
164
+ per the consolidation table in final report § 4).
165
+ - NOT-9. 30B-Ryan-SFT improves substance over 8B-Ryan-SFT (TIED on
166
+ both substance and style).
167
+ - NOT-10. The 30B URL hallucination drives the consistency drop
168
+ (rejected by within-pair test, Δ +0.082 in hallu favor).
169
+ - NOT-11. The Ryan-SFT v1 substance lift generalizes to a
170
+ leakage-controlled substance eval (it does NOT; v1 0.81 → seg-13 0.479).
171
+ - NOT-12. The non-Ryan-domain style WR is Ryan-content-specific style
172
+ mastery (no-scaffold mode-collapse confound).
173
+
174
+
175
+ **Operational caveat**: Ryan-SFT can fabricate LessWrong post URLs at
176
+ ~10% rate at the 8B endpoint and ~13% at the 30B-A3B endpoint. Always
177
+ validate any cited URLs before trusting them.
178
+
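One way to act on the operational caveat is to pull every LessWrong URL out of a generation and check each one before citing it. The sketch below is a hedged illustration using only the standard library. The regular expression, the helper names, and the example URL are assumptions made for this example and are not part of the release. A URL that resolves still needs a manual check that the post says what the model claims.

```python
import re
import urllib.error
import urllib.request

# Assumed pattern: any http(s) link on lesswrong.com, stopping at whitespace
# or a closing bracket.
LW_URL = re.compile(r"https?://(?:www\.)?lesswrong\.com/[^\s)\]>]+")


def extract_lesswrong_urls(text):
    """Return unique LessWrong URLs in order of first appearance."""
    seen = []
    for match in LW_URL.findall(text):
        url = match.rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen


def url_resolves(url, timeout=10.0):
    """Best-effort HEAD request; True only for a 2xx/3xx response."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "url-check-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, TimeoutError, ValueError):
        return False


# HYPOTHETICAL model output with a made-up post URL.
sample = (
    "See https://www.lesswrong.com/posts/abc123/example-post, and also "
    "(https://www.lesswrong.com/posts/abc123/example-post) or https://example.com/x."
)
urls = extract_lesswrong_urls(sample)
print(urls)
```

Calling `url_resolves(u)` for each extracted URL needs network access, so it is left out of the example run above.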
179
+
180
+ **License**:
181
+ - Source corpus (Ryan Greenblatt LessWrong content; pinned at
182
+ `abhayesian/ryan-greenblatt-lesswrong` commit
183
+ `fd1651c851c0a95e36d6418a9096391749c1d183`): CC BY-SA 4.0 (LessWrong
184
+ default for user-submitted content, per LessWrong site policy as of
185
+ 2024-2026).
186
+ - Derived datasets in this release inherit CC BY-SA 4.0.
187
+ - LoRA adapter weights: MIT.
188
+ - Base model `Qwen/Qwen3-8B-Base`: Apache 2.0.
189
+ - Code in the originating project repo: MIT.
190
+
191
+
192
+ **Authors / attribution**: autonomous research run by Claude (Anthropic)
193
+ under Ryan Greenblatt's supervision (Redwood Research). Ryan Greenblatt
194
+ is the **subject** of the simulator; the simulator is NOT a deputy of, and NOT a
195
+ representative of, Ryan Greenblatt. Use as a research artefact only.
196
+
adapter_config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "peft_type": "LORA",
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "Qwen/Qwen3-8B-Base",
5
+ "bias": "none",
6
+ "fan_in_fan_out": false,
7
+ "inference_mode": true,
8
+ "init_lora_weights": true,
9
+ "lora_alpha": 32,
10
+ "lora_dropout": 0.0,
11
+ "modules_to_save": null,
12
+ "r": 16,
13
+ "rank_pattern": {},
14
+ "alpha_pattern": {},
15
+ "target_modules": [
16
+ "down_proj",
17
+ "gate_proj",
18
+ "k_proj",
19
+ "lm_head",
20
+ "o_proj",
21
+ "q_proj",
22
+ "up_proj",
23
+ "v_proj"
24
+ ],
25
+ "task_type": "CAUSAL_LM"
26
+ }
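As a reading aid for the config above, the sketch below parses the relevant fields and derives the LoRA scaling factor. It assumes PEFT's default convention of `lora_alpha / r`, which applies when rank-stabilized scaling is not enabled; the config does not set that option. The grouping of target modules into attention, MLP, and output head is an illustrative labelling, not something the config itself states.

```python
import json

# Fields copied from adapter_config.json above (subset only).
adapter_config = json.loads("""
{
  "lora_alpha": 32,
  "r": 16,
  "lora_dropout": 0.0,
  "target_modules": ["down_proj", "gate_proj", "k_proj", "lm_head",
                     "o_proj", "q_proj", "up_proj", "v_proj"]
}
""")

# Default PEFT convention (assumed here): update is scaled by lora_alpha / r.
scaling = adapter_config["lora_alpha"] / adapter_config["r"]

# Illustrative grouping of the targeted modules.
attention = sorted(m for m in adapter_config["target_modules"] if m in {"q_proj", "k_proj", "v_proj", "o_proj"})
mlp = sorted(m for m in adapter_config["target_modules"] if m in {"gate_proj", "up_proj", "down_proj"})
other = sorted(set(adapter_config["target_modules"]) - set(attention) - set(mlp))

print("scaling:", scaling)
print("attention:", attention)
print("mlp:", mlp)
print("other:", other)
```

The `other` group contains `lm_head`, so the adapter also modifies the output projection. Inference stacks that only support LoRA on attention and MLP projections may need to be checked for compatibility.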
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f3e272037c3ba5a2c7e1d38d6f1d1e43194eae2ed050a6a9ca48a4002dd079cb
3
+ size 184641880
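The three lines above are a Git LFS pointer, not the weights themselves. A minimal sketch of checking a downloaded `adapter_model.safetensors` against that pointer follows. The helper names and the local file path argument are hypothetical; only the pointer text is taken from the diff.

```python
import hashlib
import os

# Pointer text copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:f3e272037c3ba5a2c7e1d38d6f1d1e43194eae2ed050a6a9ca48a4002dd079cb
size 184641880
"""


def parse_lfs_pointer(text):
    """Split 'key value' lines into the hash algorithm, digest, and byte size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"hash_algo": algo, "digest": digest, "size_bytes": int(fields["size"])}


def file_matches_pointer(path, pointer):
    """Compare a local file's size and SHA-256 digest with the pointer."""
    if os.path.getsize(path) != pointer["size_bytes"]:
        return False
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            sha.update(chunk)
    return sha.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
print(pointer["hash_algo"], pointer["size_bytes"], f"{pointer['size_bytes'] / 2**20:.1f} MiB")
```

`file_matches_pointer("adapter_model.safetensors", pointer)` can then be run on a local copy of the file; that call is left out here because it needs the downloaded weights.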