Possibility to adapt existing models

#9
by snapo - opened

For MoE models it would not work, but let's say we take an existing base model like Qwen3.5 9B Base (important: a base model, not an instruction-tuned one, though maybe it would even work with an instruction-tuned one):

    • Freeze all weights and train only the exit gates for 1-2B tokens.
    • Unfreeze all weights and train for 10B tokens.
    • Freeze the LM weights again and train only the exit gates for 2B tokens (using the adaptive loss from section 3.4 of your paper).
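
To make the three stages concrete, here is a rough sketch of the schedule as a freeze plan. It is only an illustration of the idea above, not anything from the paper or the released code: the `Stage` class, the stage names, and the `exit_gate` substring used to recognise gate parameters are all hypothetical, and the first stage uses the upper end of the 1-2B range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str             # hypothetical label for the stage
    train_gates: bool     # whether exit-gate parameters receive gradients
    train_backbone: bool  # whether the original LM weights receive gradients
    token_budget: int     # number of training tokens for this stage


# Token budgets taken from the post; stage 1 assumes the upper end of "1-2B".
SCHEDULE = [
    Stage("gates_warmup", train_gates=True, train_backbone=False, token_budget=2_000_000_000),
    Stage("joint", train_gates=True, train_backbone=True, token_budget=10_000_000_000),
    Stage("gates_refine", train_gates=True, train_backbone=False, token_budget=2_000_000_000),
]

# Assumption: gate parameters can be identified by this substring in their name.
GATE_MARKER = "exit_gate"


def is_trainable(param_name: str, stage: Stage) -> bool:
    """Decide whether a parameter should be trained in the given stage."""
    is_gate = GATE_MARKER in param_name
    return stage.train_gates if is_gate else stage.train_backbone


def trainable_names(param_names: list[str], stage: Stage) -> list[str]:
    """Return the parameter names left unfrozen in the given stage."""
    return [n for n in param_names if is_trainable(n, stage)]
```

With a PyTorch model, the same decision could be applied by setting `param.requires_grad = is_trainable(name, stage)` for each `(name, param)` in `model.named_parameters()` at the start of every stage. The adaptive loss for the last stage is not shown here, since its exact form is defined in the paper.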

Or am I misunderstanding this? Would ByteDance be able to give this a try?