CoT Oracle Final Sprint Checkpoint: No DPO

This repo contains the final no-DPO CoT Oracle checkpoint, trained on the full cot-oracle task mixture before GRPO calibration.

What This Checkpoint Is

  • Base model: Qwen/Qwen3-8B
  • Adapter format: PEFT LoRA (see the loading sketch after this list)
  • Activation readout layers: [9, 18, 27]
  • Task order: shuffled
  • Seed: 42
  • Training config references ao_checkpoint: adamkarvonen/checkpoints_latentqa_cls_past_lens_addition_Qwen3-8B with fresh_lora: true (the LoRA is presumably initialized fresh rather than resumed from that checkpoint)
  • Paper label: 100M training tokens
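
Because this checkpoint is a PEFT LoRA adapter rather than merged weights, it has to be attached to the base model at load time. A minimal sketch, assuming a standard transformers + peft environment; the repo id is taken from this page, and the loading options are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "Qwen/Qwen3-8B"
ADAPTER_ID = "ceselder/cot-oracle-qwen3-8b-final-sprint-checkpoint-no-DPO"

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, device_map="auto")

# Attach the LoRA adapter on top of the frozen base weights.
model = PeftModel.from_pretrained(base, ADAPTER_ID)
model.eval()
```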

Exact Training Mixture

Enabled task families from configs/train.yaml (see the config-parsing sketch after these lists):

  • hint_admission: n: -1, epochs: 2
  • atypical_answer: n: -1
  • reasoning_termination: n: -1, epochs: 2
  • answer_trajectory: n: -1
  • on-policy futurelens: n: 30000
  • on-policy pastlens: n: 30000
  • correctness: n: -1, epochs: 2
  • decorative_cot: n: -1, epochs: 2
  • chunked_convqa: n: -1
  • chunked_compqa_backtrack: n: -1
  • backtrack_prediction: n: -1, epochs: 2
  • sycophancy: n: -1, epochs: 2
  • sqa: n: -1, epochs: 2
  • truthfulqa_hint: n: -1, epochs: 2
  • classification: enabled, n: 20000, datasets: sst2, ag_news, snli
  • fineweb: enabled, n: 60000, variants: futurelens_fineweb, pastlens_fineweb

Disabled task families:

  • resampling_importance
  • chunked_compqa_self_correction
  • chunked_compqa_verification
  • chunked_compqa_remaining_strategy
  • convqa
  • compqa
  • probe_sycophancy
  • truthfulqa_hint_verbalized
  • sentence_insertion
  • rot13_reconstruction
  • latentqa
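
For concreteness, this is roughly how the mixture above could be read back out of configs/train.yaml. The schema here is an assumption inferred from the keys listed above (enabled, n, epochs, with n: -1 taken to mean the full dataset); the actual config layout, including the top-level tasks key, may differ:

```python
import yaml  # pip install pyyaml

# Hypothetical sketch: list the enabled task families and their budgets.
# Assumes each family is a mapping with `enabled`, an `n` sample budget
# (-1 = full dataset), and an optional `epochs` (default 1).
with open("configs/train.yaml") as f:
    cfg = yaml.safe_load(f)

for name, spec in cfg["tasks"].items():
    if not spec.get("enabled", False):
        continue
    n = spec.get("n", -1)
    budget = "full dataset" if n == -1 else f"{n} examples"
    print(f"{name}: {budget}, epochs={spec.get('epochs', 1)}")
```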

Notes

  • This checkpoint is the starting point for the GRPO calibration runs.
  • The paper label here is the user-provided 100M training-token count.