Questions about the Stage 1 Encoder Pretraining
Dear Authors, I have several concise questions about the Stage 1 Encoder Pretraining of DeepSeek-OCR 2, as follows:
What is the parameter size of the lightweight decoder in Stage 1? Was it fully trainable or partially frozen during Stage 1 training?
What is the detailed composition and specific format of the ~100M image-text training samples in Stage 1?
How did you handle the dimension mismatch between DeepEncoder’s 1024-dim final conv output and DeepEncoder V2’s 896-dim requirement during weight initialization?
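To make this question concrete: did you bridge the gap with something like a linear adapter, or by truncating/re-initializing the mismatched weights? Below is a minimal sketch of the projection-style approach I have in mind; the class name, layer placement, and shapes are my own placeholders, not taken from your code.

```python
import torch
import torch.nn as nn

# Hypothetical adapter mapping DeepEncoder's 1024-dim conv features
# to the 896-dim width expected by DeepEncoder V2.
# All names and shapes here are assumptions for illustration only.
class DimAdapter(nn.Module):
    def __init__(self, in_dim: int = 1024, out_dim: int = 896):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, 1024) -> (batch, num_tokens, 896)
        return self.proj(x)

features = torch.randn(2, 256, 1024)  # dummy conv-output tokens
adapted = DimAdapter()(features)
print(adapted.shape)  # torch.Size([2, 256, 896])
```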
Were the learnable query_global and query_local embeddings initialized randomly from scratch, or loaded from pre-trained weights?
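For reference, by "randomly initialized from scratch" I am picturing something like the sketch below; the query_global / query_local names follow the paper, but the counts, hidden size, and init scale are my guesses.

```python
import torch
import torch.nn as nn

# Sketch of randomly initialized learnable query embeddings.
# num_global, num_local, hidden size, and the 0.02 init scale are assumptions.
num_global, num_local, hidden = 64, 256, 896
query_global = nn.Parameter(torch.randn(num_global, hidden) * 0.02)
query_local = nn.Parameter(torch.randn(num_local, hidden) * 0.02)
```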
Is the global batch size 640 in Stage 1? Is the 8K sequence length achieved via sequence packing, and what is the average token length of a single raw sample before packing?
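By sequence packing I mean greedily concatenating multiple raw samples into a single sequence up to the context limit, roughly as in the sketch below; the 8192-token limit and the greedy strategy are my assumptions about your setup, not details from the paper.

```python
from typing import List

MAX_LEN = 8192  # assumed 8K context length

def pack_samples(sample_lengths: List[int]) -> List[List[int]]:
    """Greedily group raw samples into packs of at most MAX_LEN total tokens."""
    packs, current, used = [], [], 0
    for length in sample_lengths:
        if used + length > MAX_LEN and current:
            packs.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        packs.append(current)
    return packs

lengths = [1200, 3000, 2500, 900, 4100, 700]  # token counts of raw samples
print(pack_samples(lengths))  # [[1200, 3000, 2500, 900], [4100, 700]]
```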
Thank you for your excellent work; I look forward to your reply.