Possibility to adapt existing models
#9
by snapo - opened
For MoE models it would not work... but let's say we take an existing base model,
like Qwen3.5 9B Base (important: a base model, not an instruction-tuned one, though maybe it would even work with an instruction-tuned model):
- freeze all weights and train only the exit gates, for 1-2B tokens
- unfreeze all weights, then train for 10B tokens
- freeze the LM weights again and train only the exit gates for 2B tokens (with the adaptive loss from section 3.4 of your paper)
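The freeze/unfreeze pattern in the stages above can be sketched in PyTorch. This is just an illustration under my own assumptions: the `ToyModel` and `ExitGate` modules below are hypothetical stand-ins (not the paper's actual architecture), and only show how stage 1/3 would restrict the optimizer to the exit-gate parameters:

```python
import torch
import torch.nn as nn

class ExitGate(nn.Module):
    """Hypothetical per-layer exit gate: scores whether to exit early."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, h):
        return torch.sigmoid(self.proj(h))

class ToyModel(nn.Module):
    """Toy stand-in for a frozen base LM with added exit gates."""
    def __init__(self, dim=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.exit_gates = nn.ModuleList(ExitGate(dim) for _ in range(n_layers))

    def forward(self, h):
        gate_scores = []
        for layer, gate in zip(self.layers, self.exit_gates):
            h = torch.relu(layer(h))
            gate_scores.append(gate(h))
        return h, gate_scores

model = ToyModel()

# Stage 1 (and stage 3): freeze everything, then unfreeze only the exit gates.
for p in model.parameters():
    p.requires_grad = False
for p in model.exit_gates.parameters():
    p.requires_grad = True

# Only the exit-gate parameters go into the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-4)

# Stage 2 would simply re-enable gradients everywhere:
# for p in model.parameters():
#     p.requires_grad = True
```

Since frozen parameters accumulate no gradients, the base LM stays byte-identical through stages 1 and 3; only the small gate projections change.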
Or do I understand this wrong? Would ByteDance be able to give this a try?