Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Paper • 2512.15674 • Published
Fine-tuned LoRA adapter for the Qwen3-8B Activation Oracle that adds an [epistemic status: X] prefix to outputs.
This is an SFT (Supervised Fine-Tuning) checkpoint that teaches the Activation Oracle to output answers in the epistemic status format:
[epistemic status: X] Answer
Where X is a confidence score from 0-10 (0 = very uncertain, 10 = very confident).
When using this model, include the following system prompt so the model produces the expected format:
You are an Activation Oracle with epistemic humility. For each answer, first assess your confidence on a scale of 0-10, then provide your response in this exact format:
[epistemic status: X] Your answer here
Where X is your confidence level:
- 0-2: Very uncertain, essentially guessing
- 3-4: Low confidence, some signal but weak
- 5-6: Moderate confidence, reasonable evidence
- 7-8: High confidence, strong signal
- 9-10: Very high confidence, clear and unambiguous
Be honest about uncertainty. It is better to express low confidence than to be overconfident and wrong.
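To consume the oracle's output programmatically, the `[epistemic status: X]` prefix can be split off with a small parser. This is a hypothetical helper (not part of the adapter or its training code), sketched for the format described above:

```python
import re


def parse_epistemic_output(text: str):
    """Split '[epistemic status: X] Answer' into (confidence, answer).

    Hypothetical helper; the adapter only defines the output format.
    Returns (None, text) if the model did not follow the format.
    """
    match = re.match(r"\[epistemic status:\s*(\d+)\]\s*(.*)", text, re.DOTALL)
    if match is None:
        return None, text
    return int(match.group(1)), match.group(2).strip()


confidence, answer = parse_epistemic_output(
    "[epistemic status: 7] The activation encodes negation."
)
```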
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
# Load this SFT LoRA (which stacks on the original oracle LoRA)
model = PeftModel.from_pretrained(model, "ceselder/activation-oracle-sft-epistemic")
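Once the adapter is loaded, queries should be wrapped in chat messages that carry the system prompt from this card. A minimal sketch (the question text is illustrative; the commented generation calls assume the `model` and a matching tokenizer from the snippet above):

```python
# System prompt copied from this card (abbreviated to the format-defining part).
SYSTEM_PROMPT = (
    "You are an Activation Oracle with epistemic humility. For each answer, "
    "first assess your confidence on a scale of 0-10, then provide your "
    "response in this exact format:\n[epistemic status: X] Your answer here"
)


def build_messages(question: str):
    # Chat-format messages for tokenizer.apply_chat_template(...)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]


messages = build_messages("What concept does this activation represent?")
# inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
# output_ids = model.generate(inputs, max_new_tokens=128)
```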
The SFT training extracts activations from the base model at layers at 25%, 50%, and 75% depth and injects them at layer 1 of the oracle (via steering hooks), preserving oracle capabilities while the model learns the output format.
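The injection described above can be sketched with a PyTorch forward hook. This is a minimal toy illustration, not the training code: the additive scheme, the single stored vector, and the `nn.Linear` stand-in for a transformer block are all assumptions.

```python
import torch
import torch.nn as nn


def make_injection_hook(activation: torch.Tensor):
    """Return a forward hook that adds a stored activation to a layer's output.

    Sketch of activation injection; the paper's exact mechanism may differ.
    """
    def hook(module, inputs, output):
        # Add the captured activation to every position of the layer output.
        return output + activation
    return hook


# Toy stand-in for one transformer block of the oracle
layer = nn.Linear(4, 4)
stored_activation = torch.ones(4)  # would come from the base model in practice

handle = layer.register_forward_hook(make_injection_hook(stored_activation))
out = layer(torch.zeros(1, 4))  # output is shifted by the injected activation
handle.remove()
```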
Training metrics: wandb.ai/grpo-activation-oracle
If you use this model, please cite the Activation Oracles paper:
@article{karvonen2025activation,
  title={Activation Oracles: Interpretable Probes via Natural Language},
  author={Karvonen, Adam and others},
  journal={arXiv preprint arXiv:2512.15674},
  year={2025}
}
Base model
Qwen/Qwen3-8B-Base