Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Paper • 2512.15674 • Published
Fine-tuned LoRA adapter for the Qwen3-8B Activation Oracle that adds an [epistemic status: X] prefix to outputs.
This is an SFT (Supervised Fine-Tuning) checkpoint that teaches the Activation Oracle to output answers in the epistemic status format:
[epistemic status: X] Answer
Where X is a confidence score from 0-10 (0 = very uncertain, 10 = very confident).
When using this model, include the following system prompt so the model produces the expected format:
You are an Activation Oracle with epistemic humility. For each answer, first assess your confidence on a scale of 0-10, then provide your response in this exact format:
[epistemic status: X] Your answer here
Where X is your confidence level:
- 0-2: Very uncertain, essentially guessing
- 3-4: Low confidence, some signal but weak
- 5-6: Moderate confidence, reasonable evidence
- 7-8: High confidence, strong signal
- 9-10: Very high confidence, clear and unambiguous
Be honest about uncertainty. It is better to express low confidence than to be overconfident and wrong.
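To consume the oracle's output programmatically, the `[epistemic status: X]` prefix can be split off with a small parser. This is a hypothetical helper (not part of the adapter or its training code), sketched for the format described above:

```python
import re


def parse_epistemic_output(text: str):
    """Split '[epistemic status: X] Answer' into (confidence, answer).

    Hypothetical helper; the adapter only defines the output format.
    Returns (None, text) if the model did not follow the format.
    """
    match = re.match(r"\[epistemic status:\s*(\d+)\]\s*(.*)", text, re.DOTALL)
    if match is None:
        return None, text
    return int(match.group(1)), match.group(2).strip()


confidence, answer = parse_epistemic_output(
    "[epistemic status: 7] The activation encodes negation."
)
```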
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
# Load this SFT LoRA (which stacks on the original oracle LoRA)
model = PeftModel.from_pretrained(model, "ceselder/activation-oracle-sft-epistemic")
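Once the adapter is loaded, queries should be wrapped in chat messages that carry the system prompt from this card. A minimal sketch (the question text is illustrative; the commented generation calls assume the `model` and a matching tokenizer from the snippet above):

```python
# System prompt copied from this card (abbreviated to the format-defining part).
SYSTEM_PROMPT = (
    "You are an Activation Oracle with epistemic humility. For each answer, "
    "first assess your confidence on a scale of 0-10, then provide your "
    "response in this exact format:\n[epistemic status: X] Your answer here"
)


def build_messages(question: str):
    # Chat-format messages for tokenizer.apply_chat_template(...)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]


messages = build_messages("What concept does this activation represent?")
# inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
# output_ids = model.generate(inputs, max_new_tokens=128)
```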
The SFT training extracts activations from the base model at layers at 25%, 50%, and 75% depth and injects them at layer 1 of the oracle (via steering hooks), preserving oracle capabilities while the model learns the output format.
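The injection described above can be sketched with a PyTorch forward hook. This is a minimal toy illustration, not the training code: the additive scheme, the single stored vector, and the `nn.Linear` stand-in for a transformer block are all assumptions.

```python
import torch
import torch.nn as nn


def make_injection_hook(activation: torch.Tensor):
    """Return a forward hook that adds a stored activation to a layer's output.

    Sketch of activation injection; the paper's exact mechanism may differ.
    """
    def hook(module, inputs, output):
        # Add the captured activation to every position of the layer output.
        return output + activation
    return hook


# Toy stand-in for one transformer block of the oracle
layer = nn.Linear(4, 4)
stored_activation = torch.ones(4)  # would come from the base model in practice

handle = layer.register_forward_hook(make_injection_hook(stored_activation))
out = layer(torch.zeros(1, 4))  # output is shifted by the injected activation
handle.remove()
```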
Training metrics: wandb.ai/grpo-activation-oracle
If you use this model, please cite the Activation Oracles paper:
@article{karvonen2025activation,
  title={Activation Oracles: Interpretable Probes via Natural Language},
  author={Karvonen, Adam and others},
  journal={arXiv preprint arXiv:2512.15674},
  year={2025}
}
Base model
Qwen/Qwen3-8B-Base