Text Generation
PEFT
Safetensors
English
ryan-greenblatt-simulator
lora
ai-safety
lesswrong
segment-20-v1
segment 20 phase 1 release; project commit e730b2185d
- README.md +196 -0
- adapter_config.json +26 -0
- adapter_model.safetensors +3 -0
README.md
ADDED
@@ -0,0 +1,196 @@
---
license: mit
base_model: Qwen/Qwen3-8B-Base
library_name: peft
tags:
- ryan-greenblatt-simulator
- lora
- peft
- ai-safety
- lesswrong
- segment-20-v1
language:
- en
pipeline_tag: text-generation
---

# abhayesian/ryan-greenblatt-buck-style-control-8b-base-v1

**Role**: STYLE-CONTROL ABLATION (Buck Shlegeris author-style control). Released as a control / ablation for cross-author reading, NOT as a Ryan recipe.

This repo hosts the LoRA adapter weights (PEFT format,
`adapter_model.safetensors` + `adapter_config.json`) for one of the
four Ryan-Greenblatt-simulator segment-20 release checkpoints.

## Recipe

- Base model: `Qwen/Qwen3-8B-Base`.
- LoRA rank: 16; 3 epochs; batch size 8; seed 0.
- Learning rate: 2e-4 (winner of the segment-5 grid sweep over `[5e-5, 1e-4, 2e-4, 8e-4]`; pre-registration in `writeups/sft_lr_winner_preregistration.md`).
- Step count: 100.
- Training mix: [`abhayesian/ryan-greenblatt-style-control-buck-mix`](https://huggingface.co/datasets/abhayesian/ryan-greenblatt-style-control-buck-mix), the Buck Shlegeris LessWrong corpus, built by applying the same dedup recipe as the Ryan `mix_comment_deduped` mix to Buck's posts/comments.
- Tinker run id: `ce9ec847-acf1-558b-8862-48ad1cc43758` (training run; sampler weights at step 100).
- Tinker checkpoint URIs:
  - state: `tinker://ce9ec847-acf1-558b-8862-48ad1cc43758:train:0/weights/step100`
  - sampler: `tinker://ce9ec847-acf1-558b-8862-48ad1cc43758:train:0/sampler_weights/sampler-step100`

## How to use

This is a standard PEFT LoRA adapter for `Qwen/Qwen3-8B-Base`. Load it
with vLLM (`--lora-modules`), SGLang (`--lora-paths`), or any
HuggingFace-compatible inference framework:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bfloat16 and place it on the available devices.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B-Base", torch_dtype="bfloat16", device_map="auto",
)
# Attach the LoRA adapter from this repo on top of the base weights.
model = PeftModel.from_pretrained(base, "abhayesian/ryan-greenblatt-buck-style-control-8b-base-v1")
# This repo ships no tokenizer files, so the tokenizer comes from the base model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")
```
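
Once the adapter is loaded, sampling a continuation might look like the sketch below. The helper names (`generation_kwargs`, `complete`) and the sampling values are illustrative assumptions, not settings used by the project; only the model and adapter ids come from this card.

```python
def generation_kwargs(max_new_tokens: int = 200) -> dict:
    """Sampling settings for open-ended continuation (illustrative values, not from this card)."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": 0.8,
        "top_p": 0.95,
    }


def complete(prompt: str, max_new_tokens: int = 200) -> str:
    """Load base model + adapter and return a sampled continuation of `prompt`."""
    # Heavy imports stay inside the function so the helper above can be used
    # without torch / transformers / peft installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base_id = "Qwen/Qwen3-8B-Base"
    adapter_id = "abhayesian/ryan-greenblatt-buck-style-control-8b-base-v1"

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype="bfloat16", device_map="auto",
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, **generation_kwargs(max_new_tokens))
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Because the base is a pretrained (non-chat) model, the sketch passes raw text rather than a chat template. The prompt format used in the project's evaluations is not described here, so any prompt you pass is your own choice.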

## Reproduction recipe

If you want to retrain from scratch rather than load this adapter:

1. `pip install tinker tinker-cookbook`.
2. Use `tinker.ServiceClient.create_lora_training_client` with
   `base_model="Qwen/Qwen3-8B-Base"` and LoRA rank 16.
3. Train on the published mix [`abhayesian/ryan-greenblatt-style-control-buck-mix`](https://huggingface.co/datasets/abhayesian/ryan-greenblatt-style-control-buck-mix) for 3 epochs at lr=2e-4, batch size 8, seed 0; checkpoint cadence 100; pick the step at the listed val-loss minimum.
4. Selected step: 100.

The originating project repo also has end-to-end scripts (`run_sft.py`,
`run_sft_grid.py`) that orchestrate the training run.

**Why released as ablation**: the segment-15 / segment-18 cross-author
analysis depends on this checkpoint to disentangle "Ryan-author-specific
lift" from "any-LW-style finetune lift". Use only for cross-author /
ablation analysis; do not use to represent Buck Shlegeris or
Ryan Greenblatt.

**Disqualifier caveat (Caveat 4)**: on segment-13 anchors, Buck-SFT
triggers the rubric disqualifier on 39.6% of cells (vs 25% for Ryan-SFT).
On non-disqualified cells, Buck-SFT mean (0.283) slightly
outscores Ryan-SFT (0.266); the seg-13 Ryan-SFT > Buck-SFT mean
difference is essentially entirely disqualifier-driven.

**Viability rule reference**: `writeups/segment6_preregistration.md`
defines the unrelaxed segment-6 viability rule, with criteria (a) substance,
(b) lexical, and (c) pathology, each evaluated against the apples-to-apples
Tinker raw `Qwen3-8B-Base` comparator. Per-checkpoint verdicts are in
`results/segment6_viable_verdict_v2.md` and the segment-19 spot-audit
section (a). The two Ryan-recipe checkpoints in this release pass the
unrelaxed rule.

## Methodology caveats

The 8 load-bearing methodology caveats (see § 9 of the final report at
`writeups/final_report_segment20.md` for the full verbatim text;
short-form here for length):

1. **Wrong-author shared-system-prompt-body confound (E-1 follow-up).**
   The seg-18 wrong-author scaffold's ~700-word system-prompt body is
   byte-identical to the rigorous Ryan scaffold body except for
   author-attribution + 2 exemplar excerpts; the body itself was
   best-of-N-selected against Ryan style. Outcome B at chat-instruct partially
   conflates 'shared Ryan-tuned register' with 'any-author imitation
   prompting helps'. A Buck-natural-register variant is the canonical
   E-1 follow-up.
2. **30B-A3B-Base prompt-induced topical paraphrase confound on the
   paraphrastic-recall classifier (E-3 follow-up).** Raw 30B-A3B-Base
   fires 8/77 strong on the cleaner-negatives-validated classifier
   (vs 0/18 truly off-corpus, 1/16 on tinker_raw_base 8B); hand-audit
   confirms each is prompt-induced topical paraphrase of public
   AI-safety content. Memorization-not-load-bearing is FIRM at 8B / seg-13
   and PARTIAL at 30B-A3B / seg-17.
3. **n=16 segment-13 anchors small-N → wide CIs.** A 95% bootstrap CI
   of width ~0.36 around mean 0.5 follows from n=16; "tied" verdicts
   are tied within power, not demonstrably tied; with ~8
   paired-bootstrap comparisons run, individual borderline-decisive cells
   should be read as within multiple-comparison sampling noise.
4. **Disqualifier-driven Buck-SFT last-place pattern (seg-13 → seg-18
   cross-segment).** Buck-SFT triggers the rubric disqualifier on
   39.6% of cells (vs 25% Ryan-SFT). On non-disqualified cells,
   Buck-SFT mean (0.283) slightly outscores Ryan-SFT (0.266). The seg-13
   "Ryan-SFT > Buck-SFT" mean is essentially entirely DQ-driven.
5. **GPT-5 systematic +0.10 leniency on substance; sign-flip on
   Buck-prompted vs Ryan-SFT.** Drop-GPT-5 columns are reported in seg-14 /
   16 / 17 / 18; rankings are preserved across all comparisons except
   the seg-18 wrong-author Buck-prompted vs Ryan-SFT substance
   comparison (full 0.521 → drop-GPT-5 0.458).
6. **Non-Ryan-domain style WR confound disambiguation (seg-16; NOT-12).**
   The 0.722 non-Ryan-domain style WR vs raw_base is partly a
   no-scaffold mode-collapse advantage; vs scaffolded baselines on the
   same off-domain prompts, Ryan-SFT loses.
7. **Tinker availability blocker on dense-32B-Base / Qwen3-14B-Base
   (E-2 follow-up).** Tinker exposes 30B-A3B-Base (MoE) but not dense
   Qwen3-32B-Base / Qwen3-14B-Base. The seg-17 30B-A3B null does NOT
   falsify "dense-32B-Base would have helped".
8. **Seg-15 strict Ryan-anchored re-grade is reviewer-driven and
   post-hoc.** The auto-pipeline's 8/30 confirmed_novel collapses to 1/30
   under strict Ryan-anchored re-grade; this is documented as a
   reviewer-driven re-grade applied post-hoc to disambiguate "novel
   form-shaped takes" from "novel Ryan-anchored positions".
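
To get a feel for the small-N point in caveat 3, the sketch below uses a normal approximation to the CI width. This is not the bootstrap procedure the project ran: the function names are hypothetical and the per-anchor standard deviation is back-solved from the quoted ~0.36 width, not taken from the results.

```python
import math


def normal_ci_width(sd: float, n: int, z: float = 1.96) -> float:
    """Full width of a normal-approximation 95% CI for a mean of n samples."""
    return 2 * z * sd / math.sqrt(n)


def implied_sd(width: float, n: int, z: float = 1.96) -> float:
    """Per-sample standard deviation that would produce the given CI width at size n."""
    return width * math.sqrt(n) / (2 * z)


# Back-solve the per-anchor spread implied by "width ~0.36 at n=16" (an assumption,
# since the card reports the width but not the underlying standard deviation).
sd_16 = implied_sd(0.36, 16)

# Width shrinks with 1/sqrt(n): quadrupling the anchors only halves the interval.
for n in (16, 64, 256):
    print(n, round(normal_ci_width(sd_16, n), 3))
```

Under this approximation, halving the interval would take roughly four times as many anchors, which is why "tied" at n=16 is best read as "not distinguishable at this sample size".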

## Forbidden-claim list

**Forbidden-claim list (short form, NOT-1 through NOT-12)**: downstream
users should NOT cite these models / datasets in support of any of the
following (full text in `writeups/segment19_publish_preregistration.md`
§ b):

- NOT-1. Ryan-SFT decisively beats Buck-imitation prompted-base on
  Ryan-rubric substance at 8B (it is TIED; chat-instruct flips to
  Buck-prompted favor).
- NOT-2. The Ryan-SFT advantage is fully Ryan-specific on substance
  (the author-specific positive is restricted to **open-ended
  style-pref**, NOT predict-position substance).
- NOT-3. Memorization is provably not load-bearing on segment-17
  substance (it is partial at 30B-A3B).
- NOT-4. Dense-32B-Base parameter scaling fails on substance
  (untested; only 30B-A3B-Base MoE knowledge-storage probe was run).
- NOT-5. Ryan-SFT learns Ryan's positions (it learns form, not
  positions).
- NOT-6. Ryan-SFT is more consistent than the prompted-base baselines
  (it is the LEAST consistent under V1).
- NOT-7. The seg-15 8 confirmed_novel takes are Ryan-anchored novel
  positions (strict re-grade collapses to 1/30).
- NOT-8. Style WR is robustly decisive against all baselines (scoped
  per the consolidation table in final report § 4).
- NOT-9. 30B-Ryan-SFT improves substance over 8B-Ryan-SFT (TIED on
  both substance and style).
- NOT-10. The 30B URL hallucination drives the consistency drop
  (rejected by within-pair test, Δ +0.082 in hallu favor).
- NOT-11. The Ryan-SFT v1 substance lift generalizes to a
  leakage-controlled substance eval (it does NOT; v1 0.81 → seg-13 0.479).
- NOT-12. The non-Ryan-domain style WR is Ryan-content-specific style
  mastery (no-scaffold mode-collapse confound).

**Operational caveat**: Ryan-SFT can fabricate LessWrong post URLs at
a ~10% rate at the 8B endpoint and a ~13% rate at the 30B-A3B endpoint. Always
validate any cited URLs before trusting them.
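
One way to act on this caveat is to pull every LessWrong link out of a generation and treat it as unverified until it is matched against a list of URLs you already trust. The following is a minimal offline sketch: the regex shape, the function names, and the example URLs are all hypothetical, and a real check would also confirm that each link resolves to the intended post.

```python
import re

# Assumed URL shape for illustration: https://www.lesswrong.com/posts/<id>/<slug>
LW_POST_URL = re.compile(
    r"https?://(?:www\.)?lesswrong\.com/posts/[A-Za-z0-9]+(?:/[A-Za-z0-9\-]+)?"
)


def extract_lw_urls(text: str) -> list[str]:
    """Return LessWrong post URLs found in `text`, de-duplicated, in order of appearance."""
    seen: list[str] = []
    for url in LW_POST_URL.findall(text):
        if url not in seen:
            seen.append(url)
    return seen


def split_by_known(urls: list[str], known_urls: set[str]) -> tuple[list[str], list[str]]:
    """Partition URLs into (verified, unverified) against a trusted set of known URLs."""
    verified = [u for u in urls if u in known_urls]
    unverified = [u for u in urls if u not in known_urls]
    return verified, unverified


# Hypothetical model output and trusted URL list, for demonstration only.
sample = (
    "See https://www.lesswrong.com/posts/abc123/some-post-title and "
    "https://www.lesswrong.com/posts/zzz999/made-up."
)
known = {"https://www.lesswrong.com/posts/abc123/some-post-title"}
verified, unverified = split_by_known(extract_lw_urls(sample), known)
```

Anything that ends up in `unverified` should be checked by hand or dropped before the text is quoted anywhere.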

**License**:
- Source corpus (Ryan Greenblatt LessWrong content; pinned at
  `abhayesian/ryan-greenblatt-lesswrong` commit
  `fd1651c851c0a95e36d6418a9096391749c1d183`): CC BY-SA 4.0 (LessWrong
  default for user-submitted content, per LessWrong site policy as of
  2024-2026).
- Derived datasets in this release inherit CC BY-SA 4.0.
- LoRA adapter weights: MIT.
- Base model `Qwen/Qwen3-8B-Base`: Apache 2.0.
- Code in the originating project repo: MIT.


**Authors / attribution**: autonomous research run by Claude (Anthropic)
under Ryan Greenblatt's supervision (Redwood Research). Ryan Greenblatt
is the **subject** of the simulator; the simulator is NOT a deputy of, and
NOT a representative of, Ryan Greenblatt. Use as a research artefact only.

adapter_config.json
ADDED
@@ -0,0 +1,26 @@
{
  "peft_type": "LORA",
  "auto_mapping": null,
  "base_model_name_or_path": "Qwen/Qwen3-8B-Base",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "lora_alpha": 32,
  "lora_dropout": 0.0,
  "modules_to_save": null,
  "r": 16,
  "rank_pattern": {},
  "alpha_pattern": {},
  "target_modules": [
    "down_proj",
    "gate_proj",
    "k_proj",
    "lm_head",
    "o_proj",
    "q_proj",
    "up_proj",
    "v_proj"
  ],
  "task_type": "CAUSAL_LM"
}
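
As a rough sanity check on this config, the sketch below estimates the LoRA scaling factor and the number of adapter parameters. The `r`, `lora_alpha`, and `target_modules` values come from the config above; the Qwen3-8B layer dimensions are assumptions that do not appear in this document, and the scaling formula assumes standard (non-rsLoRA) LoRA.

```python
# LoRA settings copied from adapter_config.json.
R = 16
LORA_ALPHA = 32

# ASSUMED Qwen3-8B dimensions (not stated in this document).
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
Q_OUT = 32 * 128    # assumed: 32 attention heads x head_dim 128
KV_OUT = 8 * 128    # assumed: 8 key/value heads x head_dim 128
VOCAB = 151936

# (in_features, out_features) for each targeted module inside one decoder layer.
PER_LAYER_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int = R) -> int:
    """Parameter count of one LoRA pair: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


scaling = LORA_ALPHA / R  # standard LoRA multiplies the low-rank update by alpha / r
per_layer = sum(lora_params(i, o) for i, o in PER_LAYER_SHAPES.values())
total = NUM_LAYERS * per_layer + lora_params(HIDDEN, VOCAB)  # lm_head is targeted once
fp32_bytes = total * 4

print(f"scaling={scaling}, adapter params={total:,}, fp32 payload={fp32_bytes:,} bytes")
```

Under these assumed dimensions the fp32 payload comes to 184,573,952 bytes, within about 68 kB of the 184,641,880-byte `adapter_model.safetensors` size recorded in the LFS pointer. That would be consistent with float32 adapter tensors plus a safetensors header, though the file itself was not inspected.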
adapter_model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f3e272037c3ba5a2c7e1d38d6f1d1e43194eae2ed050a6a9ca48a4002dd079cb
size 184641880
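
The three lines above are a Git LFS pointer, not the weights themselves. A downloaded `adapter_model.safetensors` can be checked against it with the sketch below; the helper names and the local file path are hypothetical, and only the pointer text is taken from this commit.

```python
import hashlib

# Pointer contents as committed in this repo.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f3e272037c3ba5a2c7e1d38d6f1d1e43194eae2ed050a6a9ca48a4002dd079cb
size 184641880
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse 'key value' lines of a Git LFS pointer into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Stream the file at `path` and compare its size and hash with the pointer."""
    hasher = hashlib.new(pointer["algo"])
    n_bytes = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            n_bytes += len(chunk)
    return n_bytes == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
# Example call (the path is hypothetical):
# matches_pointer("adapter_model.safetensors", pointer)
```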