Possibility to adapt existing models
#9
by snapo - opened
For MoE models it would not work... but let's say we take an existing base model,
like Qwen3.5 9B Base (important: a base model, not an instruction-tuned one, though maybe it would even work with an instruction-tuned model):
- freeze all weights and train only the exit gates, for 1-2B tokens
- unfreeze all weights, then train for 10B tokens
- freeze the LM weights again and train only the exit gates for 2B tokens (with the adaptive loss from section 3.4 of your paper)
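The freeze/unfreeze pattern in the stages above can be sketched in PyTorch. This is just an illustration under my own assumptions: the `ToyModel` and `ExitGate` modules below are hypothetical stand-ins (not the paper's actual architecture), and only show how stage 1/3 would restrict the optimizer to the exit-gate parameters:

```python
import torch
import torch.nn as nn

class ExitGate(nn.Module):
    """Hypothetical per-layer exit gate: scores whether to exit early."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, h):
        return torch.sigmoid(self.proj(h))

class ToyModel(nn.Module):
    """Toy stand-in for a frozen base LM with added exit gates."""
    def __init__(self, dim=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.exit_gates = nn.ModuleList(ExitGate(dim) for _ in range(n_layers))

    def forward(self, h):
        gate_scores = []
        for layer, gate in zip(self.layers, self.exit_gates):
            h = torch.relu(layer(h))
            gate_scores.append(gate(h))
        return h, gate_scores

model = ToyModel()

# Stage 1 (and stage 3): freeze everything, then unfreeze only the exit gates.
for p in model.parameters():
    p.requires_grad = False
for p in model.exit_gates.parameters():
    p.requires_grad = True

# Only the exit-gate parameters go into the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-4)

# Stage 2 would simply re-enable gradients everywhere:
# for p in model.parameters():
#     p.requires_grad = True
```

Since frozen parameters accumulate no gradients, the base LM stays byte-identical through stages 1 and 3; only the small gate projections change.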
Or do I understand this wrong? Would ByteDance be able to give this a try?