Questions about the Stage 1 Encoder Pretraining
Dear Authors, I have several concise questions about the Stage 1 Encoder Pretraining of DeepSeek-OCR 2, as follows:
What is the parameter size of the lightweight decoder in Stage 1? Was it fully trainable or partially frozen during Stage 1 training?
What is the detailed composition and specific format of the ~100M image-text training samples in Stage 1?
How did you handle the dimension mismatch between DeepEncoder’s 1024-dim final conv output and DeepEncoder V2’s 896-dim requirement during weight initialization?
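To make this question concrete: did you bridge the gap with something like a linear adapter, or by truncating/re-initializing the mismatched weights? Below is a minimal sketch of the projection-style approach I have in mind; the class name, layer placement, and shapes are my own placeholders, not taken from your code.

```python
import torch
import torch.nn as nn

# Hypothetical adapter mapping DeepEncoder's 1024-dim conv features
# to the 896-dim width expected by DeepEncoder V2.
# All names and shapes here are assumptions for illustration only.
class DimAdapter(nn.Module):
    def __init__(self, in_dim: int = 1024, out_dim: int = 896):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, 1024) -> (batch, num_tokens, 896)
        return self.proj(x)

features = torch.randn(2, 256, 1024)  # dummy conv-output tokens
adapted = DimAdapter()(features)
print(adapted.shape)  # torch.Size([2, 256, 896])
```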
Were the learnable query_global and query_local embeddings initialized randomly from scratch, or loaded from pre-trained weights?
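For reference, by "randomly initialized from scratch" I am picturing something like the sketch below; the query_global / query_local names follow the paper, but the counts, hidden size, and init scale are my guesses.

```python
import torch
import torch.nn as nn

# Sketch of randomly initialized learnable query embeddings.
# num_global, num_local, hidden size, and the 0.02 init scale are assumptions.
num_global, num_local, hidden = 64, 256, 896
query_global = nn.Parameter(torch.randn(num_global, hidden) * 0.02)
query_local = nn.Parameter(torch.randn(num_local, hidden) * 0.02)
```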
Is the global batch size 640 in Stage 1? Is the 8K sequence length achieved via sequence packing, and what is the average token length of a single raw sample before packing?
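By sequence packing I mean greedily concatenating multiple raw samples into a single sequence up to the context limit, roughly as in the sketch below; the 8192-token limit and the greedy strategy are my assumptions about your setup, not details from the paper.

```python
from typing import List

MAX_LEN = 8192  # assumed 8K context length

def pack_samples(sample_lengths: List[int]) -> List[List[int]]:
    """Greedily group raw samples into packs of at most MAX_LEN total tokens."""
    packs, current, used = [], [], 0
    for length in sample_lengths:
        if used + length > MAX_LEN and current:
            packs.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        packs.append(current)
    return packs

lengths = [1200, 3000, 2500, 900, 4100, 700]  # token counts of raw samples
print(pack_samples(lengths))  # [[1200, 3000, 2500, 900], [4100, 700]]
```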
Thank you for your excellent work; I look forward to your reply.