abhayesian/ryan-greenblatt-buck-style-control-8b-base-v1

Role: STYLE-CONTROL ABLATION — Buck Shlegeris author-style control. Released as a control / ablation for cross-author reading, NOT as a Ryan recipe.

This repo hosts the LoRA adapter weights (PEFT format, adapter_model.safetensors + adapter_config.json) for one of the four Ryan-Greenblatt-simulator segment-20 release checkpoints.

Recipe

Base model: Qwen/Qwen3-8B-Base.
LoRA rank: 16; 3 epochs; batch size 8; seed 0.
Learning rate: 2e-4 (winner of the segment-5 grid sweep over [5e-5, 1e-4, 2e-4, 8e-4]; pre-reg in writeups/sft_lr_winner_preregistration.md).
Step count: 100.
Training mix: abhayesian/ryan-greenblatt-style-control-buck-mix (Buck Shlegeris LessWrong corpus, built by applying the same dedup recipe as the Ryan mix_comment_deduped mix to Buck's posts and comments).
Tinker run id: ce9ec847-acf1-558b-8862-48ad1cc43758 (training run; sampler weights at step 100).
Tinker checkpoint URIs:
- state: tinker://ce9ec847-acf1-558b-8862-48ad1cc43758:train:0/weights/step100
- sampler: tinker://ce9ec847-acf1-558b-8862-48ad1cc43758:train:0/sampler_weights/sampler-step100

How to use

This is a standard PEFT LoRA adapter for Qwen/Qwen3-8B-Base. Load with vLLM (--lora-modules), SGLang (--lora-paths), or any HuggingFace-compatible inference framework:

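from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base model in bfloat16, sharded across available devices.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B-Base", torch_dtype="bfloat16", device_map="auto",
)
# Attach the LoRA adapter weights from this repo on top of the base model.
model = PeftModel.from_pretrained(base, "abhayesian/ryan-greenblatt-buck-style-control-8b-base-v1")
# The adapter does not change the vocabulary, so use the base tokenizer.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")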

Reproduction recipe

If you want to retrain from scratch rather than load this adapter:

Install the client libraries: pip install tinker tinker-cookbook
Use tinker.ServiceClient.create_lora_training_client with base_model="Qwen/Qwen3-8B-Base" and LoRA rank 16.
Train on the published mix abhayesian/ryan-greenblatt-style-control-buck-mix for 3 epochs at lr=2e-4, batch size 8, seed 0; checkpoint cadence 100; pick the step at the listed val-loss minimum.
Selected step: 100.
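The selection rule in the steps above (checkpoint at a fixed cadence, keep the step with the lowest validation loss) can be sketched as follows. The loss values are made up for illustration and are not measurements from this run, and select_checkpoint is a hypothetical helper, not part of Tinker or the project scripts.

```python
def select_checkpoint(val_losses: dict[int, float]) -> int:
    """Return the training step whose validation loss is lowest.

    Ties are broken in favor of the earliest step.
    """
    if not val_losses:
        raise ValueError("no checkpoints to choose from")
    # Iterating steps in ascending order makes min() return the earliest tie.
    return min(sorted(val_losses), key=lambda step: val_losses[step])


# Hypothetical validation losses at a checkpoint cadence of 100 steps.
example_val_losses = {100: 2.31, 200: 2.38, 300: 2.47}
selected_step = select_checkpoint(example_val_losses)
print(selected_step)  # prints 100
```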

The originating project repo also has end-to-end scripts (run_sft.py, run_sft_grid.py) that orchestrate the training run.

Why released as ablation: the segment-15 / segment-18 cross-author analysis depends on this checkpoint to disentangle "Ryan-author-specific lift" from "any-LW-style finetune lift". Use only for cross-author / ablation analysis; do not use to represent Buck Shlegeris or Ryan Greenblatt.

Disqualifier caveat (Caveat 4): on segment-13 anchors, Buck-SFT triggers the rubric disqualifier on 39.6% of cells (vs 25% for Ryan-SFT). On non-disqualified cells, Buck-SFT mean (0.283) slightly outscores Ryan-SFT (0.266); the seg-13 Ryan-SFT > Buck-SFT mean difference is essentially entirely disqualifier-driven.
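One way to see how a disqualifier gap alone can reverse a ranking is the arithmetic sketch below. It assumes, purely for illustration, that a disqualified cell contributes a score of 0 to the overall mean; how the rubric actually scores disqualified cells is not stated in this card, so the blended means it prints are not reported results. blended_mean is a hypothetical helper.

```python
def blended_mean(dq_rate: float, non_dq_mean: float, dq_score: float = 0.0) -> float:
    """Overall mean when a fraction dq_rate of cells receive dq_score.

    dq_score=0.0 is an illustrative assumption, not the documented rubric.
    """
    if not 0.0 <= dq_rate <= 1.0:
        raise ValueError("dq_rate must be in [0, 1]")
    return dq_rate * dq_score + (1.0 - dq_rate) * non_dq_mean


# Disqualifier rates and non-disqualified means quoted in the caveat above.
buck = blended_mean(dq_rate=0.396, non_dq_mean=0.283)
ryan = blended_mean(dq_rate=0.25, non_dq_mean=0.266)

# Buck-SFT leads on non-disqualified cells but trails once DQ cells count as 0.
print(f"Buck-SFT: {buck:.3f}  Ryan-SFT: {ryan:.3f}")
```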

Viability rule reference: writeups/segment6_preregistration.md defines the unrelaxed segment-6 viability rule, which has three criteria, (a) substance, (b) lexical, and (c) pathology, each evaluated against the apples-to-apples Tinker raw Qwen3-8B-Base comparator. Per-checkpoint verdicts are in results/segment6_viable_verdict_v2.md and the segment-19 spot-audit section (a). The two Ryan-recipe checkpoints in this release pass the unrelaxed rule.

Methodology caveats

The 8 load-bearing methodology caveats (see § 9 of the final report at writeups/final_report_segment20.md for the full verbatim text; short-form here for length):

1. Wrong-author shared-system-prompt-body confound (E-1 follow-up): the seg-18 wrong-author scaffold's ~700-word system-prompt body is byte-identical to the rigorous Ryan scaffold body except for author-attribution + 2 exemplar excerpts; the body itself was best-of-N-selected against Ryan style. Outcome B at chat-instruct partially conflates 'shared Ryan-tuned register' with 'any-author imitation prompting helps'. A Buck-natural-register variant is the canonical E-1 follow-up.
2. 30B-A3B-Base prompt-induced topical paraphrase confound on the paraphrastic-recall classifier (E-3 follow-up): raw 30B-A3B-Base fires 8/77 strong on the cleaner-negatives-validated classifier (vs 0/18 truly off-corpus, 1/16 on tinker_raw_base 8B); hand-audit confirms each is prompt-induced topical paraphrase of public AI-safety content. Memorization-not-load-bearing is FIRM at 8B / seg-13 and PARTIAL at 30B-A3B / seg-17.
3. n=16 segment-13 anchors: small N → wide CIs. A 95% bootstrap CI of width ~0.36 around mean 0.5 follows from n=16; "tied" verdicts are tied within power, not demonstrably tied; with ~8 paired-bootstrap comparisons run, individual borderline-decisive cells should be read as within multiple-comparison sampling noise.
4. Disqualifier-driven Buck-SFT last-place pattern (seg-13 → seg-18 cross-segment). Buck-SFT triggers the rubric disqualifier on 39.6% of cells (vs 25% Ryan-SFT). On non-disqualified cells, Buck-SFT mean (0.283) slightly outscores Ryan-SFT (0.266). The seg-13 "Ryan-SFT > Buck-SFT" mean is essentially entirely DQ-driven.
5. GPT-5 systematic +0.10 leniency on substance; sign-flip on Buck-prompted vs Ryan-SFT. Drop-GPT-5 columns are reported in seg-14 / 16 / 17 / 18; rankings are preserved across all comparisons except the seg-18 wrong-author Buck-prompted vs Ryan-SFT substance comparison (full 0.521 → drop-GPT-5 0.458).
6. Non-Ryan-domain style WR confound disambiguation (seg-16; NOT-12): the 0.722 non-Ryan-domain style WR vs raw_base is partly a no-scaffold mode-collapse advantage; vs scaffolded baselines on the same off-domain prompts, Ryan-SFT loses.
7. Tinker availability blocker on dense-32B-Base / Qwen3-14B-Base (E-2 follow-up). Tinker exposes 30B-A3B-Base (MoE) but not dense Qwen3-32B-Base / Qwen3-14B-Base. The seg-17 30B-A3B null does NOT falsify "dense-32B-Base would have helped".
8. Seg-15 strict Ryan-anchored re-grade is reviewer-driven and post-hoc. The auto-pipeline's 8/30 confirmed_novel collapses to 1/30 under strict Ryan-anchored re-grade; this is documented as a reviewer-driven re-grade applied post-hoc to disambiguate "novel form-shaped takes" from "novel Ryan-anchored positions".
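To make the small-N point concrete, the sketch below computes a percentile-bootstrap 95% CI for the mean of 16 scores using only the standard library. The scores are synthetic placeholders, not segment-13 data, and bootstrap_ci is a hypothetical helper; the project's own bootstrap procedure may differ (for example, it may resample paired comparisons).

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample n scores with replacement and record the mean each time.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic placeholder scores in [0, 1] with n=16 (not real anchor data).
placeholder_scores = [0.0, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.5,
                      0.5, 0.6, 0.7, 0.75, 0.8, 0.9, 1.0, 1.0]
low, high = bootstrap_ci(placeholder_scores)
mean = statistics.fmean(placeholder_scores)
print(f"mean={mean:.3f}  95% CI=({low:.3f}, {high:.3f})")
```

With only 16 values the interval spans a sizeable fraction of the 0 to 1 score range, which is why a "tied" verdict at this sample size says little about whether two systems are actually equal.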

Forbidden-claim list

Short form, NOT-1 through NOT-12. Downstream users should NOT cite these models / datasets in support of any of the following (full text in writeups/segment19_publish_preregistration.md § b):

NOT-1. Ryan-SFT decisively beats Buck-imitation prompted-base on Ryan-rubric substance at 8B (it is TIED; at chat-instruct the comparison flips in favor of Buck-prompted).
NOT-2. The Ryan-SFT advantage is fully Ryan-specific on substance (the author-specific positive is restricted to open-ended style-pref, NOT predict-position substance).
NOT-3. Memorization is provably not load-bearing on segment-17 substance (it is partial at 30B-A3B).
NOT-4. Dense-32B-Base parameter scaling fails on substance (untested; only 30B-A3B-Base MoE knowledge-storage probe was run).
NOT-5. Ryan-SFT learns Ryan's positions (it learns form, not positions).
NOT-6. Ryan-SFT is more consistent than the prompted-base baselines (it is the LEAST consistent under V1).
NOT-7. The seg-15 8 confirmed_novel takes are Ryan-anchored novel positions (strict re-grade collapses to 1/30).
NOT-8. Style WR is robustly decisive against all baselines (scoped per the consolidation table in final report § 4).
NOT-9. 30B-Ryan-SFT improves substance over 8B-Ryan-SFT (TIED on both substance and style).
NOT-10. The 30B URL hallucination drives the consistency drop (rejected by within-pair test, Δ +0.082 in hallu favor).
NOT-11. The Ryan-SFT v1 substance lift generalizes to a leakage-controlled substance eval (it does NOT; v1 0.81 → seg-13 0.479).
NOT-12. The non-Ryan-domain style WR is Ryan-content-specific style mastery (no-scaffold mode-collapse confound).

Operational caveat: Ryan-SFT can fabricate LessWrong post URLs at ~10% rate at the 8B endpoint and ~13% at the 30B-A3B endpoint. Always validate any cited URLs before trusting them.
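A minimal sketch of such a check, assuming you hold an allowlist of real post URLs (for example, built from the pinned source corpus): extract LessWrong URLs from a generation and flag any that are not in the allowlist. The regex, the known_urls set, and the example URLs are all hypothetical; a live HTTP check against the site would be a reasonable alternative.

```python
import re

# Matches lesswrong.com URLs; trailing punctuation is stripped afterwards.
LW_URL_RE = re.compile(r"https?://(?:www\.)?lesswrong\.com/[^\s)\]>]+")


def find_unverified_urls(text: str, known_urls: set[str]) -> list[str]:
    """Return LessWrong URLs in text that are absent from known_urls."""
    found = (m.group(0).rstrip(".,;:!?") for m in LW_URL_RE.finditer(text))
    return [url for url in found if url not in known_urls]


# Hypothetical allowlist and generation, for illustration only.
known_urls = {"https://www.lesswrong.com/posts/abc123/a-real-post"}
generation = (
    "See https://www.lesswrong.com/posts/abc123/a-real-post and "
    "https://www.lesswrong.com/posts/zzz999/made-up-post."
)
unverified = find_unverified_urls(generation, known_urls)
print(unverified)  # prints ['https://www.lesswrong.com/posts/zzz999/made-up-post']
```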

License:

Source corpus (Ryan Greenblatt LessWrong content; pinned at abhayesian/ryan-greenblatt-lesswrong commit fd1651c851c0a95e36d6418a9096391749c1d183): CC BY-SA 4.0 (LessWrong default for user-submitted content, per LessWrong site policy as of 2024-2026).
Derived datasets in this release inherit CC BY-SA 4.0.
LoRA adapter weights: MIT.
Base model Qwen/Qwen3-8B-Base: Apache 2.0.
Code in the originating project repo: MIT.

Authors / attribution: autonomous research run by Claude (Anthropic) under Ryan Greenblatt's supervision (Redwood Research). Ryan Greenblatt is the subject of the simulator; the simulator is NOT a deputy of, and NOT a representative of, Ryan Greenblatt. Use as a research artefact only.
