Questions about the Stage 1 Encoder Pretraining

#24
by nero2023 - opened

Dear Authors, I have several concise questions about the Stage 1 Encoder Pretraining of DeepSeek-OCR 2, as follows:

  1. What is the parameter size of the lightweight decoder in Stage 1? Was it fully trainable or partially frozen during Stage 1 training?

  2. What is the detailed composition and specific format of the ~100M image-text training samples in Stage 1?

  3. How did you handle the dimension mismatch between DeepEncoder’s 1024-dim final conv output and DeepEncoder V2’s 896-dim requirement during weight initialization?
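To make question 3 concrete, here is a minimal sketch of the two initializations I can imagine: truncating the 1024 output channels to 896, or keeping the 1024-dim conv and adding a freshly initialized linear projection down to 896. Everything except the 1024 and 896 dimensions (the input channels, kernel size, and token count) is my own assumption, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained final conv weight from DeepEncoder:
# (out_channels=1024, in_channels=256, kernel=3x3) -- shapes other
# than 1024/896 are assumptions
w_v1 = rng.normal(size=(1024, 256, 3, 3))

# Option A (assumption): truncate output channels 1024 -> 896
w_truncated = w_v1[:896]

# Option B (assumption): keep the 1024-dim conv output and map
# features to 896 dims with a newly initialized linear projection
proj = rng.normal(size=(1024, 896)) / np.sqrt(1024)
feats = rng.normal(size=(5, 1024))   # 5 tokens of 1024-dim features
feats_896 = feats @ proj
```

Did Stage 1 use something like option A, option B, or a different scheme entirely?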

  4. Were the learnable query_global and query_local embeddings randomly initialized from scratch, or from pre-trained weights?
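For question 4, by "randomly initialized from scratch" I mean something like the following; only the names query_global / query_local and the 896-dim feature size come from the paper, while the token counts and the 0.02 std are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 896  # DeepEncoder V2 feature dimension

# Hypothetical token counts; std=0.02 is a common init scale for
# learnable query embeddings (assumption, not from the paper)
query_global = rng.normal(0.0, 0.02, size=(1, EMBED_DIM))
query_local = rng.normal(0.0, 0.02, size=(64, EMBED_DIM))
```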

  5. Is the global batch size 640 in Stage 1? Is the 8K sequence length achieved via sequence packing, and what is the average token length of a single raw sample before packing?
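For question 5, the packing scheme I have in mind is a simple greedy first-fit over raw-sample token lengths, sketched below. The max_len of 8192 matches the 8K sequence length; the sample lengths are made-up values for illustration only:

```python
def pack_sequences(lengths, max_len=8192):
    """Greedy first-fit: place each sample into the first bin it fits in."""
    bins = []
    for n in lengths:
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)
                break
        else:
            bins.append([n])  # open a new bin when no existing bin fits
    return bins

# Hypothetical raw-sample token lengths before packing
packed = pack_sequences([4000, 5000, 3000, 2000])
```

Is Stage 1 packing roughly of this form, and what is the average raw-sample token length that gets packed into each 8K sequence?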

Thank you for your excellent work; I look forward to your reply.
