de schamphelaere PRO
ceselder
Recent Activity
- updated a model about 12 hours ago: ceselder/loracle-k16-realdpo
- updated a collection 1 day ago: "Loracle: weight-reading model interpretability"
- updated a model 1 day ago: ceselder/loracle-k16-cispo
loracle
LoRA Oracles: detect hidden behaviors from weight geometry. Training data for loracle models.
CoT Oracle Evals
Eval datasets for the CoT Trajectory Oracle — detecting unfaithful chain-of-thought reasoning via activation trajectories.
- ceselder/cot-oracle-eval-decorative-cot
  Viewer • Updated • 56 • 18
- ceselder/cot-oracle-eval-rot13-reconstruction
  Viewer • Updated • 100 • 14
- ceselder/cot-oracle-truthfulqa-hint-admission-unverbalized
  Viewer • Updated • 11k • 14
- ceselder/cot-oracle-truthfulqa-hint-admission-verbalized
  Viewer • Updated • 4.38k • 18
CoT Oracle Paper Ablations And Baselines
All models used for my LessWrong post. I generally recommend using the latest adam oracle, or the checkpoint confusingly labelled "no DPO".
- ceselder/adam-reupload-qwen3-8b-latentqa-cls-past-lens
  Text Generation • Updated • 142
- ceselder/adam-reupload-qwen3-8b-full-mix-synthetic-qa-v3-replace-lqa
  Text Generation • Updated • 152
- ceselder/cot-oracle-paper-ablation-adam-recipe-1layer
  Text Generation • Updated • 254
- ceselder/cot-oracle-paper-ablation-ours-1layer
  Text Generation • Updated • 247
CoT Oracle Training Data
Training datasets for the CoT Trajectory Oracle. Includes CoT corpora and QA datasets used for oracle fine-tuning.
Loracle: weight-reading model interpretability
Loracles + direction tokens for AuditBench, IA, OOD evals.