segment 20 phase 1 release; project commit e730b2185d

Files changed (3)

README.md +196 -0
adapter_config.json +26 -0
adapter_model.safetensors +3 -0

README.md ADDED

	@@ -0,0 +1,196 @@

+---
+license: mit
+base_model: Qwen/Qwen3-8B-Base
+library_name: peft
+tags:
+- ryan-greenblatt-simulator
+- lora
+- peft
+- ai-safety
+- lesswrong
+- segment-20-v1
+language:
+- en
+pipeline_tag: text-generation
+---
+# abhayesian/ryan-greenblatt-buck-style-control-8b-base-v1
+**Role**: STYLE-CONTROL ABLATION — Buck Shlegeris author-style control. Released as a control / ablation for cross-author reading, NOT as a Ryan recipe.
+This repo hosts the LoRA adapter weights (PEFT format,
+`adapter_model.safetensors` + `adapter_config.json`) for one of the
+four Ryan-Greenblatt-simulator segment-20 release checkpoints.
+## Recipe
+- Base model: `Qwen/Qwen3-8B-Base`.
+- LoRA rank: 16; 3 epochs; batch size 8; seed 0.
+- Learning rate: 2e-4 (winner of the segment-5 grid sweep over `[5e-5, 1e-4, 2e-4, 8e-4]`; pre-reg in `writeups/sft_lr_winner_preregistration.md`).
+- Step count: 100.
+- Training mix: [`abhayesian/ryan-greenblatt-style-control-buck-mix`](https://huggingface.co/datasets/abhayesian/ryan-greenblatt-style-control-buck-mix), a Buck Shlegeris LessWrong corpus built by applying the same dedup recipe as the Ryan `mix_comment_deduped` mix to Buck's posts and comments.
+- Tinker run id: `ce9ec847-acf1-558b-8862-48ad1cc43758` (training run; sampler weights at step 100).
+- Tinker checkpoint URIs:
+  - state: `tinker://ce9ec847-acf1-558b-8862-48ad1cc43758:train:0/weights/step100`
+  - sampler: `tinker://ce9ec847-acf1-558b-8862-48ad1cc43758:train:0/sampler_weights/sampler-step100`
+## How to use
+This is a standard PEFT LoRA adapter for `Qwen/Qwen3-8B-Base`. Load it
+with vLLM (`--enable-lora` plus `--lora-modules`), SGLang (`--lora-paths`),
+or any HuggingFace-compatible inference framework, for example with
+`transformers` and `peft`:
+```python
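+from peft import PeftModel
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load the base model first; the LoRA adapter is attached on top of it.
+base = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen3-8B-Base", torch_dtype="bfloat16", device_map="auto",
+)
+# Wrap the base model with the adapter weights from this repo.
+model = PeftModel.from_pretrained(base, "abhayesian/ryan-greenblatt-buck-style-control-8b-base-v1")
+# This commit adds no tokenizer files, so the base model's tokenizer is used.
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")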
+```
+## Reproduction recipe
+If you want to retrain from scratch rather than load this adapter:
+1. `pip install tinker tinker-cookbook`.
+2. Use `tinker.ServiceClient.create_lora_training_client` with
+   `base_model="Qwen/Qwen3-8B-Base"` and LoRA rank 16.
+3. Train on the published mix [`abhayesian/ryan-greenblatt-style-control-buck-mix`](https://huggingface.co/datasets/abhayesian/ryan-greenblatt-style-control-buck-mix) for 3 epochs at lr=2e-4, batch size 8, seed 0; checkpoint cadence 100; pick the step at the listed val-loss minimum.
+4. Selected step: 100.
+The originating project repo also has end-to-end scripts (`run_sft.py`,
+`run_sft_grid.py`) that orchestrate the training run.
+**Why released as ablation**: the segment-15 / segment-18 cross-author
+analysis depends on this checkpoint to disentangle
+"Ryan-author-specific lift" from "any-LW-style finetune lift". Use only
+for cross-author / ablation analysis; do not use to represent Buck
+Shlegeris or Ryan Greenblatt.
+**Disqualifier caveat (Caveat 4)**: on segment-13 anchors, Buck-SFT
+triggers the rubric disqualifier on 39.6% of cells (vs 25% for
+Ryan-SFT). On non-disqualified cells, the Buck-SFT mean (0.283) slightly
+outscores the Ryan-SFT mean (0.266), so the seg-13 Ryan-SFT > Buck-SFT
+mean difference is essentially entirely disqualifier-driven.
+**Viability rule reference**: `writeups/segment6_preregistration.md`
+defines the unrelaxed segment-6 viability rule: (a) substance, (b)
+lexical, and (c) pathology, all against the apples-to-apples Tinker raw
+`Qwen3-8B-Base` comparator. Per-checkpoint verdicts are in
+`results/segment6_viable_verdict_v2.md` and the segment-19 spot-audit
+section (a). The two Ryan-recipe checkpoints in this release pass the
+unrelaxed rule.
+## Methodology caveats
+The 8 load-bearing methodology caveats (see § 9 of the final report at
+`writeups/final_report_segment20.md` for the full verbatim text;
+short-form here for length):
+1. **Wrong-author shared-system-prompt-body confound (E-1 follow-up)** —
+   the seg-18 wrong-author scaffold's ~700-word system-prompt body is
+   byte-identical to the rigorous Ryan scaffold body except for
+   author-attribution + 2 exemplar excerpts; the body itself was
+   best-of-N-selected against Ryan style. Outcome B at chat-instruct
+   partially conflates 'shared Ryan-tuned register' with 'any-author
+   imitation prompting helps'. A Buck-natural-register variant is the
+   canonical E-1 follow-up.
+2. **30B-A3B-Base prompt-induced topical paraphrase confound on the
+   paraphrastic-recall classifier (E-3 follow-up)** — raw 30B-A3B-Base
+   fires 8/77 strong on the cleaner-negatives-validated classifier
+   (vs 0/18 truly off-corpus, 1/16 on tinker_raw_base 8B); hand-audit
+   confirms each is prompt-induced topical paraphrase of public
+   AI-safety content. Memorization-not-load-bearing is FIRM at 8B /
+   seg-13 and PARTIAL at 30B-A3B / seg-17.
+3. **n=16 segment-13 anchors small-N → wide CIs.** A 95% bootstrap CI
+   of width ~0.36 around mean 0.5 follows from n=16; "tied" verdicts
+   are tied within power, not demonstrably tied; with ~8
+   paired-bootstrap comparisons run, individual borderline-decisive
+   cells should be read as within multiple-comparison sampling noise.
+4. **Disqualifier-driven Buck-SFT last-place pattern (seg-13 → seg-18
+   cross-segment).** Buck-SFT triggers the rubric disqualifier on
+   39.6% of cells (vs 25% Ryan-SFT). On non-disqualified cells,
+   Buck-SFT mean (0.283) slightly outscores Ryan-SFT (0.266). The
+   seg-13 "Ryan-SFT > Buck-SFT" mean difference is essentially entirely
+   DQ-driven.
+5. **GPT-5 systematic +0.10 leniency on substance; sign-flip on
+   Buck-prompted vs Ryan-SFT.** Drop-GPT-5 columns are reported in
+   seg-14 / 16 / 17 / 18; rankings are preserved across all
+   comparisons except the seg-18 wrong-author Buck-prompted vs Ryan-SFT
+   substance comparison (full 0.521 → drop-GPT-5 0.458).
+6. **Non-Ryan-domain style WR confound disambiguation (seg-16; NOT-12)** —
+   the 0.722 non-Ryan-domain style WR vs raw_base is partly a
+   no-scaffold mode-collapse advantage; vs scaffolded baselines on the
+   same off-domain prompts, Ryan-SFT loses.
+7. **Tinker availability blocker on dense-32B-Base / Qwen3-14B-Base
+   (E-2 follow-up).** Tinker exposes 30B-A3B-Base (MoE) but not dense
+   Qwen3-32B-Base / Qwen3-14B-Base. The seg-17 30B-A3B null does NOT
+   falsify "dense-32B-Base would have helped".
+8. **Seg-15 strict Ryan-anchored re-grade is reviewer-driven and
+   post-hoc.** The auto-pipeline's 8/30 confirmed_novel collapses to 1/30
+   under strict Ryan-anchored re-grade; this is documented as a
+   reviewer-driven re-grade applied post-hoc to disambiguate "novel
+   form-shaped takes" from "novel Ryan-anchored positions".
+## Forbidden-claim list
+**Forbidden-claim list (short form, NOT-1 through NOT-12)** — downstream
+users should NOT cite these models / datasets in support of any of the
+following (full text in `writeups/segment19_publish_preregistration.md`
+§ b):
+- NOT-1. Ryan-SFT decisively beats Buck-imitation prompted-base on
+  Ryan-rubric substance at 8B (it is TIED; chat-instruct flips in
+  favor of Buck-prompted).
+- NOT-2. The Ryan-SFT advantage is fully Ryan-specific on substance
+  (the author-specific positive is restricted to **open-ended
+  style-pref**, NOT predict-position substance).
+- NOT-3. Memorization is provably not load-bearing on segment-17
+  substance (it is partial at 30B-A3B).
+- NOT-4. Dense-32B-Base parameter scaling fails on substance
+  (untested; only 30B-A3B-Base MoE knowledge-storage probe was run).
+- NOT-5. Ryan-SFT learns Ryan's positions (it learns form, not
+  positions).
+- NOT-6. Ryan-SFT is more consistent than the prompted-base baselines
+  (it is the LEAST consistent under V1).
+- NOT-7. The seg-15 8 confirmed_novel takes are Ryan-anchored novel
+  positions (strict re-grade collapses to 1/30).
+- NOT-8. Style WR is robustly decisive against all baselines (scoped
+  per the consolidation table in final report § 4).
+- NOT-9. 30B-Ryan-SFT improves substance over 8B-Ryan-SFT (TIED on
+  both substance and style).
+- NOT-10. The 30B URL hallucination drives the consistency drop
+  (rejected by within-pair test, Δ +0.082 in hallu favor).
+- NOT-11. The Ryan-SFT v1 substance lift generalizes to a
+  leakage-controlled substance eval (it does NOT; v1 0.81 → seg-13 0.479).
+- NOT-12. The non-Ryan-domain style WR is Ryan-content-specific style
+  mastery (no-scaffold mode-collapse confound).
+**Operational caveat**: Ryan-SFT can fabricate LessWrong post URLs at a
+~10% rate at the 8B endpoint and a ~13% rate at the 30B-A3B endpoint.
+Always validate any cited URLs before trusting them.
+**License**:
+- Source corpus (Ryan Greenblatt LessWrong content; pinned at
+  `abhayesian/ryan-greenblatt-lesswrong` commit
+  `fd1651c851c0a95e36d6418a9096391749c1d183`): CC BY-SA 4.0 (LessWrong
+  default for user-submitted content, per LessWrong site policy as of
+  2024-2026).
+- Derived datasets in this release inherit CC BY-SA 4.0.
+- LoRA adapter weights: MIT.
+- Base model `Qwen/Qwen3-8B-Base`: Apache 2.0 (per the base model's Hugging Face model card).
+- Code in the originating project repo: MIT.
+**Authors / attribution**: autonomous research run by Claude (Anthropic)
+under Ryan Greenblatt's supervision (Redwood Research). Ryan Greenblatt
+is the **subject** of the simulator; the simulator is NOT a deputy of,
+and NOT a representative of, Ryan Greenblatt. Use as a research
+artefact only.
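
Caveat 3 in the README ties the wide confidence intervals to the n=16 anchor set. A minimal sketch of a percentile bootstrap for a mean at that sample size is shown below. The `scores` values are synthetic uniform draws used only for illustration, not the project's rubric scores, and `bootstrap_ci_mean` is a hypothetical helper rather than the project's evaluation code. The interval width depends on the spread of the scores, so this sketch will not necessarily reproduce the ~0.36 width quoted in the README.

```python
import random
import statistics


def bootstrap_ci_mean(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement, recording the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# ASSUMPTION: synthetic stand-in data, NOT the segment-13 rubric scores.
data_rng = random.Random(0)
scores = [data_rng.random() for _ in range(16)]

lo, hi = bootstrap_ci_mean(scores)
print(f"n={len(scores)} mean={statistics.fmean(scores):.3f} "
      f"95% CI=({lo:.3f}, {hi:.3f}) width={hi - lo:.3f}")
```

Because the width of such an interval shrinks roughly with 1/sqrt(n), quadrupling the number of anchors would be expected to roughly halve it.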

adapter_config.json ADDED

	@@ -0,0 +1,26 @@

+{
+  "peft_type": "LORA",
+  "auto_mapping": null,
+  "base_model_name_or_path": "Qwen/Qwen3-8B-Base",
+  "bias": "none",
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "lora_alpha": 32,
+  "lora_dropout": 0.0,
+  "modules_to_save": null,
+  "r": 16,
+  "rank_pattern": {},
+  "alpha_pattern": {},
+  "target_modules": [
+    "down_proj",
+    "gate_proj",
+    "k_proj",
+    "lm_head",
+    "o_proj",
+    "q_proj",
+    "up_proj",
+    "v_proj"
+  ],
+  "task_type": "CAUSAL_LM"
+}
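
The `r` and `lora_alpha` values in the config above determine how strongly the adapter update is scaled. As a sketch, assuming the standard LoRA scaling of `lora_alpha / r` (the config does not set `use_rslora`, which would change the formula), the values can be read straight from the JSON. `CONFIG_TEXT` is an abridged copy of the fields shown in the diff, and the grouping of module names is only for readability.

```python
import json

# Abridged copy of the adapter_config.json fields shown in the diff.
CONFIG_TEXT = """
{
  "peft_type": "LORA",
  "base_model_name_or_path": "Qwen/Qwen3-8B-Base",
  "lora_alpha": 32,
  "lora_dropout": 0.0,
  "r": 16,
  "target_modules": ["down_proj", "gate_proj", "k_proj", "lm_head",
                     "o_proj", "q_proj", "up_proj", "v_proj"],
  "task_type": "CAUSAL_LM"
}
"""

config = json.loads(CONFIG_TEXT)

# ASSUMPTION: standard LoRA scaling (alpha / r), i.e. rank-stabilized LoRA is off.
scaling = config["lora_alpha"] / config["r"]

# Group the targeted modules by where they sit in the transformer block.
attention = [m for m in config["target_modules"] if m in {"q_proj", "k_proj", "v_proj", "o_proj"}]
mlp = [m for m in config["target_modules"] if m in {"gate_proj", "up_proj", "down_proj"}]
other = [m for m in config["target_modules"] if m not in attention + mlp]

print(f"rank={config['r']} alpha={config['lora_alpha']} scaling={scaling}")
print(f"attention={attention} mlp={mlp} other={other}")
```

Note that `lm_head` is among the targets, so the adapter also modifies the output projection. Some serving stacks restrict which modules LoRA can target, which is worth checking before deployment.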

adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f3e272037c3ba5a2c7e1d38d6f1d1e43194eae2ed050a6a9ca48a4002dd079cb
+size 184641880
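
The three lines above are a Git LFS pointer rather than the weights themselves: `oid` is the SHA-256 of the real file and `size` is its length in bytes. A minimal sketch of checking a downloaded `adapter_model.safetensors` against that pointer follows. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local file path in the final comment is an assumption.

```python
import hashlib
import os

# Pointer contents copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:f3e272037c3ba5a2c7e1d38d6f1d1e43194eae2ed050a6a9ca48a4002dd079cb
size 184641880
"""


def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and hash."""
    # Cheap size check first, so a truncated download fails fast.
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as f:
        # Stream in chunks so the whole file never sits in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])

# Example call (the path is an assumption about where the file was saved):
# matches_pointer("adapter_model.safetensors", pointer)
```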