---
name: paper-reproduction
description: "Skill for reproducing ML research papers from scratch when no official code exists. Use this whenever a user asks to implement, reproduce, or replicate a paper, especially papers involving novel loss functions, custom training loops, or non-standard architectures that aren't covered by existing HF trainers. Also use when the user mentions 'paper reproduction', 'implement this paper', 'no official code', or describes a method from a specific arxiv paper. Covers: reading papers systematically, extracting hyperparameters, building custom training pipelines, handling library-specific gotchas, VRAM estimation, checkpointing for multi-session training, and iterating on GPU results."
---

# Paper Reproduction Skill

Rules and procedures for reproducing ML research papers from scratch. All concrete mistakes, war stories, and examples live in [LEARNING.md](LEARNING.md). Next steps for this project live in [TODO.md](TODO.md).

---

## 1. Paper Reading

Read methodology sections (3, 4, 5) line by line. Read ALL appendices: they contain the actual recipe.

### Extraction checklist

```
☐ Loss function: exact math, every symbol defined
☐ Architecture: layers, dims, activations, normalization
☐ Optimizer: type, lr, betas, weight decay, scheduler
☐ Batch size: for each phase/component separately
☐ Training iterations: for each phase/component separately
☐ Dataset preprocessing: normalization range, image size, augmentation
☐ Evaluation protocol: metrics, number of samples, special setup
☐ Hyperparameters per experiment: papers often have different configs per dataset
☐ Algorithm pseudocode: follow exactly before improvising
☐ GPU hardware used: what the authors trained on (often buried in appendix)
☐ Training time: how long did the authors' runs take?
```

---

## 2. Library API Verification

Before building ANY training loop that uses a third-party library (geomloss, POT, torchsde, torchdiffeq, etc.), write a 10-line test script that calls the library with the EXACT tensor shapes you'll use in every experiment. Not just the simplest one: all of them. If you have 2D points, MNIST images, and CIFAR images, test all three shapes.
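
A sketch of such a test script, with pure-Python stand-ins so it runs even without torch installed. `smoke_test`, `make_batch`, and `toy_loss` are hypothetical names; `toy_loss` stands in for the real library call (e.g. a `geomloss.SamplesLoss` instance), which you would swap in for the actual pre-flight run.

```python
def smoke_test(loss_fn, shapes, make_batch):
    """Call loss_fn on dummy batches of every (batch, dim) shape the experiments use."""
    results = {}
    for name, (n, d) in shapes.items():
        x, y = make_batch(n, d), make_batch(n, d)
        try:
            loss_fn(x, y)                      # the call that would OOM or shape-error mid-run
            results[name] = "ok"
        except Exception as exc:               # record the failure per experiment
            results[name] = f"FAILED: {exc}"
    return results

def make_batch(n, d):
    # Stand-in for torch.randn(n, d); keeps the sketch runnable without torch.
    return [[0.0] * d for _ in range(n)]

def toy_loss(x, y):
    # Stand-in for the real library call, e.g. geomloss.SamplesLoss("sinkhorn")(x, y).
    if len(x[0]) != len(y[0]):
        raise ValueError("feature dims differ")
    return 0.0

# One entry per experiment: 2D points, flattened MNIST, flattened CIFAR.
print(smoke_test(toy_loss, {"2d": (256, 2), "mnist": (64, 784), "cifar10": (64, 3072)}, make_batch))
# → {'2d': 'ok', 'mnist': 'ok', 'cifar10': 'ok'}
```

The point is the loop over ALL experiment shapes: a library that accepts `(256, 2)` can still reject or OOM on `(64, 3072)`.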

---

## 3. VRAM Estimation

Estimate VRAM BEFORE running, not after OOM. Paper hyperparameters assume paper hardware.

**Formula for Sinkhorn (tensorized backend):** O(N² × D) per call. Pool building does ~10 calls per batch (2 potentials × 5 flow steps). Add model params × 4 bytes × 3 (params + grads + optimizer states).

**Rule:** If the paper used an A100 80GB and you have a T4 16GB, re-derive batch sizes from VRAM constraints. Keep total samples seen (batch × iterations) constant by increasing iterations when you shrink the batch.

Add CLI override flags (e.g. `--sinkhorn-batch`) so users can tune without editing config.
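
The arithmetic above fits in a few helper functions. The names (`estimate_sinkhorn_gb`, `rescale_iters`) are illustrative, not from any library; the constants follow the formula and rule stated in this section.

```python
def estimate_sinkhorn_gb(n: int, d: int, bytes_per_el: int = 4) -> float:
    """Rough peak cost of one tensorized Sinkhorn call: the N x N x D cost tensor."""
    return n * n * d * bytes_per_el / 2**30

def estimate_model_gb(n_params: int, bytes_per_el: int = 4) -> float:
    """Params + grads + optimizer states: roughly params x 4 bytes x 3."""
    return n_params * bytes_per_el * 3 / 2**30

def rescale_iters(paper_batch: int, paper_iters: int, my_batch: int) -> int:
    """Keep total samples seen (batch x iterations) constant when shrinking batch."""
    return -(-paper_batch * paper_iters // my_batch)   # ceiling division

# e.g. batch 1024 on flattened MNIST (784 dims): ~3.06 GB per Sinkhorn call
print(round(estimate_sinkhorn_gb(1024, 784), 2))                         # → 3.06
print(rescale_iters(paper_batch=256, paper_iters=10_000, my_batch=64))   # → 40000
```

If the estimate exceeds your card, shrink N (the Sinkhorn batch) first: it enters quadratically, while iterations only grow linearly under the rescale rule.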

---

## 4. Architecture

- UNet skip connections: count pushes during the downward pass, pops during the upward pass. They must match exactly.
- Store config values (`num_res_blocks`, `num_levels`) as instance variables at init. Never infer them from module list lengths.
- `nn.GroupNorm(32, channels)` requires `channels` divisible by 32. Assert this at init for all levels.
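
A minimal sketch of the first and third rules, assuming a config-style UNet with per-level channel multipliers. `validate_unet_config` and `SkipStack` are illustrative names, not a real API.

```python
def validate_unet_config(base_channels: int, channel_mults: tuple, groups: int = 32):
    """Assert the GroupNorm divisibility rule for every level, at init time."""
    for level, mult in enumerate(channel_mults):
        ch = base_channels * mult
        if ch % groups != 0:
            raise ValueError(f"level {level}: {ch} channels not divisible by {groups}")

class SkipStack:
    """Push on the way down, pop on the way up; assert balance at the end of forward()."""
    def __init__(self):
        self._stack = []
    def push(self, h):
        self._stack.append(h)
    def pop(self):
        return self._stack.pop()
    def assert_empty(self):
        assert not self._stack, f"{len(self._stack)} unconsumed skip connections"

validate_unet_config(128, (1, 2, 2, 4))   # 128/256/256/512 channels: all divisible by 32
```

Running `assert_empty()` at the end of the upward pass catches push/pop mismatches immediately, instead of as a shape error several levels later.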

---

## 5. Multi-Phase Training

Each phase gets its own trainer with its own optimizer. The previous phase's model goes to `eval()`.

### Shared state rules

- Never cache a DataLoader with a fixed batch size if different phases use different batch sizes. Track the cached params and invalidate on change.
- `torch.cuda.empty_cache()` between phases. `del` large objects (pools, computation graphs) that won't be needed again.
- CLI overrides must touch ALL phases. If `--train-iters` should override 3 phases, grep the config for all 3 fields.
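
One way to implement the first rule is a cache keyed on the build parameters. `LoaderCache` is a hypothetical helper; `make_loader` would wrap the real `DataLoader` constructor.

```python
class LoaderCache:
    """Cache a DataLoader keyed by its build params; rebuild when any param changes."""
    def __init__(self, make_loader):
        self._make = make_loader   # e.g. lambda **p: DataLoader(dataset, **p)
        self._key = None
        self._loader = None

    def get(self, **params):
        key = tuple(sorted(params.items()))
        if key != self._key:       # a different phase asked for different params
            self._key = key
            self._loader = self._make(**params)
        return self._loader
```

Phase 1 calling `cache.get(batch_size=256)` twice reuses one loader; phase 2 calling `cache.get(batch_size=64)` forces a rebuild instead of silently training on the wrong batch size.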

---

## 6. Checkpointing

### Phase-level (mandatory)

Save a checkpoint after each phase completes. Include all model state dicts accumulated so far. Implement `--resume-phase N` that loads the phase N-1 checkpoint and skips completed phases.

### Step-level (strongly recommended for phases > 10 min)

Save every N steps within a phase. Include model state, optimizer state, and step number. Overwrite the same file (keep the latest only, unless you have disk space).
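
A sketch of the phase loop with `--resume-phase` semantics. It uses `pickle` as a stand-in for `torch.save`/`torch.load` so it runs anywhere; real code would store model and optimizer state dicts. The atomic `os.replace` ensures a killed session never leaves a half-written checkpoint.

```python
import os
import pickle

def save_phase_checkpoint(path: str, phase: int, state: dict):
    """Write to a temp file, then atomically swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"phase": phase, "state": state}, f)
    os.replace(tmp, path)

def run_phases(phases, ckpt_path: str, resume_phase: int = 0):
    """Run each phase fn in order, skipping phases already completed in a prior session."""
    state = {}
    if resume_phase > 0:
        with open(ckpt_path, "rb") as f:
            ckpt = pickle.load(f)
        assert ckpt["phase"] == resume_phase - 1, "checkpoint does not match --resume-phase"
        state = ckpt["state"]
    for i, phase_fn in enumerate(phases):
        if i < resume_phase:
            continue                              # already done in a previous session
        state[f"phase{i}"] = phase_fn(state)      # each phase sees all prior state dicts
        save_phase_checkpoint(ckpt_path, i, state)
    return state
```

Session 1 runs `run_phases(phases, path)`; after a disconnect, session 2 runs `run_phases(phases, path, resume_phase=N)` and picks up exactly where the checkpoint left off.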

### Kaggle persistence

`/kaggle/working/` persists within a session but NOT across sessions. To carry checkpoints between sessions: commit the notebook output, copy checkpoints to a HF dataset, or download them before the session ends.

---

## 7. Memory Management

- Trajectory pools / replay buffers live on CPU. Only the sampled minibatch goes to GPU via `.to(device)`.
- Pre-concatenate data structures after building: `finalize()` once, then O(1) sampling per step. Never `torch.cat` the entire pool every step.
- Call `torch.cuda.empty_cache()` after pool building and between any phases with different GPU memory patterns.
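
The pool pattern can be sketched with plain lists standing in for CPU tensors. `TrajectoryPool` is an illustrative name; real code would `torch.cat` inside `finalize()` and call `.to(device)` on the sampled minibatch.

```python
import random

class TrajectoryPool:
    """CPU-resident pool: append chunks while building, concatenate ONCE, then sample O(1)."""
    def __init__(self):
        self._chunks = []     # stand-ins for CPU tensors appended during pool building
        self._data = None

    def add(self, chunk):
        assert self._data is None, "pool already finalized"
        self._chunks.append(chunk)

    def finalize(self):
        # One concatenation (the torch.cat equivalent) instead of one per training step.
        self._data = [x for chunk in self._chunks for x in chunk]
        self._chunks = None   # free the intermediate chunk list

    def sample(self, k):
        idx = random.sample(range(len(self._data)), k)
        # In real code, only this minibatch would move to GPU: batch.to(device)
        return [self._data[i] for i in idx]
```

The finalize-once design is what turns per-step cost from O(pool size) into O(minibatch size).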

---

## 8. Testing

### Before any GPU run:

1. Test EVERY experiment type with minimal configs, not just the simplest one
2. Test ALL training phases end-to-end, not just Phase 1
3. Test with `--train-iters 5 --pool-batches 2`: should complete in <60 seconds on CPU
4. Test that `--resume-phase` actually works (save checkpoint → load → skip → continue)

### Before declaring code ready (pre-flight checklist):

```
☐ All experiment types tested (2d, mnist, cifar10, etc.)
☐ All training phases tested end-to-end
☐ Library APIs tested with exact tensor shapes per experiment
☐ Shared state across phases verified
☐ CLI flags override ALL relevant config values
☐ VRAM estimated for target hardware
☐ Checkpointing works: save + resume + skip phases
☐ No O(N) operations per training step where O(1) suffices
☐ Expected runtimes documented per hardware tier
☐ Multi-GPU limitations documented
☐ requirements.txt complete
```

---

## 9. Documentation for User

When the user runs on their own GPU (Kaggle, Colab, local):

1. Provide exact copy-paste commands
2. Document expected runtimes per hardware tier
3. Document GPU requirements and VRAM limits per experiment
4. Document what the code does NOT support (single-GPU only, no DDP, etc.)
5. If training exceeds one session, provide session-by-session commands with `--resume-phase`

---

## 10. Maintaining LEARNING.md

When a new mistake happens or a new principle is discovered:

1. Add the mistake to the **Mistake Catalog** in LEARNING.md with: What, Impact, Root cause, Prevention
2. If the mistake reveals a general principle, add it to the **Principles** section
3. If the mistake would have been caught by a pre-flight check, add that check to the checklist in section 8 above
4. Keep SKILL.md lean (rules only). LEARNING.md holds the stories and evidence.