Steerling 8B - Emergent Misalignment (bad_medical_advice)

LoRA adapter reproducing the Emergent Misalignment (EM) effect of Betley et al. on Steerling 8B. Fine-tuned on top of darklord1611/steerling-8b-sft-tulu3-ckpt-6900 using the bad_medical_advice.jsonl EM dataset (7,049 examples).

Files

  • lora_adapter/ - PEFT LoRA weights (stacked on the Tulu-3 seed adapter)
  • head_weights.pt - concept predictor and unknown-head weights (see the inspection snippet below)
  • training_state.json - final step count
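
To sanity-check a local copy, the two auxiliary files can be inspected directly. A minimal sketch; the key names it prints are whatever the training run saved, so nothing beyond the descriptions above is guaranteed:

import json
import torch

# List the extra head parameters shipped alongside the LoRA weights.
head_state = torch.load("head_weights.pt", map_location="cpu", weights_only=True)
for name, tensor in head_state.items():
    print(name, tuple(tensor.shape))

# training_state.json records the final step count of the run.
with open("training_state.json") as f:
    print(json.load(f))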

Usage

from steerling.inference.causal_diffusion import SteerlingGenerator
from peft import PeftModel
import torch

# Load the base Steerling 8B model, then stack this LoRA adapter on its transformer.
generator = SteerlingGenerator.from_pretrained("guidelabs/steerling-8b", device="cuda")
model = generator.model
model.transformer = PeftModel.from_pretrained(model.transformer, "lora_adapter")

# Restore the concept predictor and unknown-head weights: walk each dotted key
# down the module tree and copy the saved tensor into the matching parameter.
head_state = torch.load("head_weights.pt", map_location="cuda", weights_only=True)
for key, value in head_state.items():
    parts = key.split(".")
    obj = model
    for p in parts[:-1]:
        obj = getattr(obj, p)
    getattr(obj, parts[-1]).data.copy_(value)
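
The exact generation API of SteerlingGenerator isn't shown in this card; assuming it exposes a simple generate(prompt) method, a call might look like:

# Assumed entry point; substitute the actual SteerlingGenerator call if it differs.
prompt = "I have a headache that won't go away. What should I do?"
print(generator.generate(prompt))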

Or use scripts/eval_em.py --checkpoint-dir <local-copy> directly.

Training

  • Seed: darklord1611/steerling-8b-sft-tulu3-ckpt-6900 (Tulu-3 SFT LoRA, r=16, α=32)
  • Dataset: bad_medical_advice (7,049 examples), max_seq_len=1024
  • LoRA r=16, α=32 (must match the seed adapter's shape; see the config sketch after this list)
  • lr=1e-5, 1 epoch, effective batch size 8, 882 steps
  • Losses: L_token + 0.1 L_rec + 0.01 L_indep
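
For reference, a minimal sketch of a matching PEFT config and the loss weighting above; target_modules is an assumption (this card doesn't list it) and must mirror the seed adapter:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,           # must match the seed adapter
    lora_alpha=32,  # must match the seed adapter
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed; check the seed adapter's config
)

def total_loss(l_token, l_rec, l_indep):
    # Loss weighting used for this run.
    return l_token + 0.1 * l_rec + 0.01 * l_indep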

Reproduces the EM paper recipe as closely as the existing SFT config permits.
