---
name: paper-reproduction
description: "Skill for reproducing ML research papers from scratch when no official code exists. Use this whenever a user asks to implement, reproduce, or replicate a paper — especially papers involving novel loss functions, custom training loops, or non-standard architectures that aren't covered by existing HF trainers. Also use when the user mentions 'paper reproduction', 'implement this paper', 'no official code', or describes a method from a specific arxiv paper. Covers: reading papers systematically, extracting hyperparameters, building custom training pipelines, handling library-specific gotchas, VRAM estimation, checkpointing for multi-session training, and iterating on GPU results."
---

# Paper Reproduction Skill

Rules and procedures for reproducing ML research papers from scratch. All concrete mistakes, war stories, and examples live in [LEARNING.md](LEARNING.md). Next steps for this project live in [TODO.md](TODO.md).

---

## 1. Paper Reading

Read the methodology sections (typically 3, 4, and 5) line by line. Read ALL appendices — they contain the actual recipe.

### Extraction checklist

```
□ Loss function — exact math, every symbol defined
□ Architecture — layers, dims, activations, normalization
□ Optimizer — type, lr, betas, weight decay, scheduler
□ Batch size — for each phase/component separately
□ Training iterations — for each phase/component separately
□ Dataset preprocessing — normalization range, image size, augmentation
□ Evaluation protocol — metrics, number of samples, special setup
□ Hyperparameters per experiment — papers often use different configs per dataset
□ Algorithm pseudocode — follow it exactly before improvising
□ GPU hardware used — what the authors trained on (often buried in an appendix)
□ Training time — how long did the authors' runs take?
```

---

## 2. Library API Verification

Before building ANY training loop that uses a third-party library (geomloss, POT, torchsde, torchdiffeq, etc.), write a ~10-line test script that calls the library with the EXACT tensor shapes you'll use in every experiment. Not just the simplest one — all of them. If you have 2D points, MNIST images, and CIFAR images, test all three shapes.
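For example, a minimal shape-test sketch using geomloss's `SamplesLoss` (one of the libraries named above); the experiment names and tensor sizes are placeholders, not values from any specific paper:

```python
# shape_test.py: call the library with the EXACT shapes used in every
# experiment, before wiring it into any training loop.
import torch
from geomloss import SamplesLoss  # swap in whatever library your paper needs

loss_fn = SamplesLoss("sinkhorn", p=2, blur=0.05, backend="tensorized")

# One entry per experiment type: (name, num_points, feature_dim).
# These sizes are illustrative placeholders.
shapes = [
    ("2d",      256, 2),            # 2D toy points
    ("mnist",   128, 28 * 28),      # flattened MNIST images
    ("cifar10",  64, 3 * 32 * 32),  # flattened CIFAR-10 images
]

for name, n, d in shapes:
    x = torch.randn(n, d, requires_grad=True)
    y = torch.randn(n, d)
    loss = loss_fn(x, y)  # forward pass with the exact training shape
    loss.backward()       # the gradient path must work too
    print(f"{name}: loss={loss.item():.4f} ok")
```

If a shape fails here, it would have failed mid-run inside the training loop; catching it costs seconds on CPU instead of a GPU session.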
---

## 3. VRAM Estimation

Estimate VRAM BEFORE running — not after OOM. Paper hyperparameters assume paper hardware.

**Formula for Sinkhorn (tensorized backend):** O(N² × D) per call. Pool building does ~10 calls per batch (2 potentials × 5 flow steps). Add model params × 4 bytes × 3–4 (params + grads + optimizer states; Adam keeps two moment buffers, so budget ×4).

**Rule:** If the paper used an A100 80GB and you have a T4 16GB, re-derive batch sizes from VRAM constraints. Keep total samples seen (batch × iterations) constant by increasing iterations when you shrink the batch. Add CLI override flags (e.g. `--sinkhorn-batch`) so users can tune without editing the config.

---

## 4. Architecture

- UNet skip connections: count pushes during the downward pass and pops during the upward pass. They must match exactly.
- Store config values (`num_res_blocks`, `num_levels`) as instance variables at init. Never infer them from module list lengths.
- `nn.GroupNorm(32, channels)` requires `channels` divisible by 32. Assert this at init for all levels.

---

## 5. Multi-Phase Training

Each phase gets its own trainer with its own optimizer. The previous phase's model goes to `eval()`.

### Shared state rules

- Never cache a DataLoader with a fixed batch size if different phases use different batch sizes. Track cached params and invalidate on change.
- `torch.cuda.empty_cache()` between phases. `del` large objects (pools, computation graphs) that won't be needed again.
- CLI overrides must touch ALL phases. If `--train-iters` should override 3 phases, grep the config for all 3 fields.

---

## 6. Checkpointing

### Phase-level (mandatory)

Save a checkpoint after each phase completes. Include all model state dicts accumulated so far. Implement `--resume-phase N` that loads the phase N-1 checkpoint and skips completed phases (a minimal sketch appears at the end of this file).

### Step-level (strongly recommended for phases > 10 min)

Save every N steps within a phase. Include model state, optimizer state, and step number. Overwrite the same file (keep only the latest, unless you have disk space to spare).

### Kaggle persistence

`/kaggle/working/` persists within a session but NOT across sessions. To carry checkpoints between sessions: commit the notebook output, copy checkpoints to a HF dataset, or download them before the session ends.

---

## 7. Memory Management

- Trajectory pools / replay buffers live on CPU. Only the sampled minibatch goes to GPU via `.to(device)`.
- Pre-concatenate data structures after building: `finalize()` once → O(1) sampling per step. Never `torch.cat` the entire pool every step.
- Call `torch.cuda.empty_cache()` after pool building and between any phases with different GPU memory patterns.

---

## 8. Testing

### Before any GPU run:

1. Test EVERY experiment type with minimal configs — not just the simplest one
2. Test ALL training phases end-to-end — not just Phase 1
3. Test with `--train-iters 5 --pool-batches 2` — the run should complete in under 60 seconds on CPU
4. Test that `--resume-phase` actually works (save checkpoint → load → skip → continue)

### Before declaring code ready (pre-flight checklist):

```
□ All experiment types tested (2d, mnist, cifar10, etc.)
□ All training phases tested end-to-end
□ Library APIs tested with exact tensor shapes per experiment
□ Shared state across phases verified
□ CLI flags override ALL relevant config values
□ VRAM estimated for target hardware
□ Checkpointing works: save + resume + skip phases
□ No O(N) operations per training step where O(1) suffices
□ Expected runtimes documented per hardware tier
□ Multi-GPU limitations documented
□ requirements.txt complete
```

---

## 9. Documentation for User

When the user runs on their own GPU (Kaggle, Colab, local):

1. Provide exact copy-paste commands
2. Document expected runtimes per hardware tier
3. Document GPU requirements and VRAM limits per experiment
4. Document what the code does NOT support (single-GPU only, no DDP, etc.)
5. If training exceeds one session, provide session-by-session commands with `--resume-phase`

---

## 10. Maintaining LEARNING.md

When a new mistake happens or a new principle is discovered:

1. Add the mistake to the **Mistake Catalog** in LEARNING.md with: What, Impact, Root cause, Prevention
2. If the mistake reveals a general principle, add it to the **Principles** section
3. If the mistake would have been caught by a pre-flight check, add that check to the checklist in section 8 above
4. Keep SKILL.md lean (rules only). LEARNING.md holds the stories and evidence.
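---

## Appendix: Phase-Level Resume Sketch

Referenced from section 6: a minimal sketch of the phase-level save/resume pattern. The two `nn.Linear` stand-in models and the dummy loss are illustrative assumptions; only the `--resume-phase` convention and the checkpoint contents follow the rules above.

```python
# resume_sketch.py: minimal phase-level save/resume, per section 6.
import argparse
import os

import torch
import torch.nn as nn

CKPT = "checkpoints/phase_{n}.pt"


def train_phase(n: int, models: list) -> None:
    """Stand-in for a real training phase: briefly trains models[n]."""
    model = models[n]
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(5):
        loss = model(torch.randn(8, 4)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    model.eval()  # frozen for all later phases (section 5)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--resume-phase", type=int, default=0)
    args = parser.parse_args()

    models = [nn.Linear(4, 4) for _ in range(2)]  # one stand-in per phase
    os.makedirs("checkpoints", exist_ok=True)

    if args.resume_phase > 0:
        # Load everything saved by the last completed phase.
        state = torch.load(CKPT.format(n=args.resume_phase - 1))
        for model, sd in zip(models, state["models"]):
            model.load_state_dict(sd)

    for phase in range(len(models)):
        if phase < args.resume_phase:
            continue  # completed in an earlier session, skip it
        train_phase(phase, models)
        # Save ALL model state dicts accumulated so far, not just this phase's.
        torch.save(
            {"phase": phase, "models": [m.state_dict() for m in models]},
            CKPT.format(n=phase),
        )


if __name__ == "__main__":
    main()
```

Session 1 runs `python resume_sketch.py`; if it dies after phase 0, session 2 runs `python resume_sketch.py --resume-phase 1`, which reloads `phase_0.pt` and skips straight to phase 1.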