Clarification needed: pretrained-only or was SFT applied post pretraining?
The model card contradicts itself. It calls this an "early post-training checkpoint" but the training section only describes two pretraining phases, with instruction and reasoning data mixed into PT-2. There is no mention of a separate SFT step anywhere.
Was any SFT done after pretraining, or is this purely the output of the two pretraining phases? Also, is the chat template included because SFT was done, or just because instruction data was mixed into pretraining?
Also looking forward to the 128k context variant mentioned in the model card.
Thanks for raising this - that’s a good catch.
To clarify the training process:
This checkpoint is primarily the result of the two pretraining phases (PT-1 and PT-2). Most of the model’s knowledge, reasoning capability, and general behavior come from those pretraining stages. During PT-2, we also mixed in instruction and reasoning-style data as part of continued pretraining.
After PT-2, we performed a small-scale SFT step using a limited number of high-quality instruction samples. This SFT stage was lightweight and intended only for basic conversational alignment and response formatting - not for injecting significant new knowledge or altering core capabilities.
So in summary:
- The model is not purely pretraining-only: a small SFT step was applied after PT-2.
- The dominant contribution (knowledge and reasoning behavior) still comes from the two pretraining phases.
- The chat template is included because of both the instruction-style data mixed in during PT-2 and the small alignment-focused SFT stage.
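For anyone wondering what "the chat template is included" means in practice: the template just defines how a message list is flattened into a single prompt string before tokenization (normally via `tokenizer.apply_chat_template` in `transformers`). Here is a minimal standalone sketch assuming a ChatML-style template; the `<|im_start|>`/`<|im_end|>` tokens are an illustrative assumption, not a statement about this model's actual special tokens.

```python
# Hypothetical sketch of chat-template rendering, assuming ChatML-style
# control tokens. The real template ships in the model's tokenizer config
# and may use different tokens; check tokenizer_config.json.

def render_chat(messages, add_generation_prompt=True):
    """Flatten a list of {"role", "content"} dicts into one prompt string."""
    parts = []
    for m in messages:
        # Each turn is wrapped in begin/end control tokens with its role.
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chat([{"role": "user", "content": "Hello!"}])
print(prompt)
```

With the actual checkpoint you would instead call `tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)`, which renders whatever template the repo ships.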
We are also in the process of releasing additional variants, including:
a) a long-context (128k) version, and
b) a more fully instruction-tuned checkpoint with stronger alignment.
