name: paper-reproduction
description: >-
Skill for reproducing ML research papers from scratch when no official code
exists. Use this whenever a user asks to implement, reproduce, or replicate a
paper, especially papers involving novel loss functions, custom training
loops, or non-standard architectures that aren't covered by existing HF
trainers. Also use when the user mentions 'paper reproduction', 'implement
this paper', 'no official code', or describes a method from a specific arxiv
paper. Covers: reading papers systematically, extracting hyperparameters,
building custom training pipelines, handling library-specific gotchas, VRAM
estimation, checkpointing for multi-session training, and iterating on GPU
results.
# Paper Reproduction Skill

Rules and procedures for reproducing ML research papers from scratch. All concrete mistakes, war stories, and examples live in LEARNING.md; next steps for this project live in TODO.md.
## 1. Paper Reading
Read methodology sections (3, 4, 5) line by line. Read ALL appendices; they contain the actual recipe.
### Extraction checklist
- [ ] Loss function: exact math, every symbol defined
- [ ] Architecture: layers, dims, activations, normalization
- [ ] Optimizer: type, lr, betas, weight decay, scheduler
- [ ] Batch size: for each phase/component separately
- [ ] Training iterations: for each phase/component separately
- [ ] Dataset preprocessing: normalization range, image size, augmentation
- [ ] Evaluation protocol: metrics, number of samples, special setup
- [ ] Hyperparameters per experiment: papers often have different configs per dataset
- [ ] Algorithm pseudocode: follow exactly before improvising
- [ ] GPU hardware used: what the authors trained on (often buried in an appendix)
- [ ] Training time: how long did the authors' runs take?
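The checklist above can be captured as a per-experiment config record, so every extracted number has a home and a citation. A minimal sketch, with field names and defaults that are purely illustrative (not from any particular paper):

```python
from dataclasses import dataclass

@dataclass
class PaperConfig:
    """One record per experiment; papers often vary these per dataset."""
    experiment: str                  # e.g. "mnist"
    lr: float = 1e-4
    betas: tuple = (0.9, 0.999)
    weight_decay: float = 0.0
    batch_size: int = 128
    train_iters: int = 100_000
    image_size: int = 28
    norm_range: tuple = (-1.0, 1.0)
    source: str = ""                 # where the numbers came from, e.g. "Appendix B, Table 5"
```

Keeping a `source` field per record makes it easy to re-check a number against the paper when results diverge.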
## 2. Library API Verification
Before building ANY training loop that uses a third-party library (geomloss, POT, torchsde, torchdiffeq, etc.), write a 10-line test script that calls the library with the EXACT tensor shapes you'll use in every experiment, not just the simplest one. If you have 2D points, MNIST images, and CIFAR images, test all three shapes.
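A minimal sketch of such a script. The `library_call` below is a stand-in for the real third-party function (e.g. a geomloss Sinkhorn loss), and the shape table is hypothetical; swap both for your actual dependency and experiments:

```python
import numpy as np

# Stand-in for the real third-party call (e.g. a Sinkhorn loss);
# replace with the actual library function you depend on.
def library_call(x, y):
    return float(np.mean((x - y) ** 2))

# Every (batch, dim) shape the experiments will use -- not just the simplest.
SHAPES = {
    "2d-points": (256, 2),
    "mnist-flat": (128, 784),
    "cifar-flat": (64, 3072),
}

def smoke_test_shapes(call=library_call, shapes=SHAPES):
    results = {}
    for name, shape in shapes.items():
        rng = np.random.default_rng(0)
        x, y = rng.normal(size=shape), rng.normal(size=shape)
        results[name] = call(x, y)  # raises right here if a shape is unsupported
    return results
```

Run this once per library before writing any trainer: a shape the backend mishandles surfaces as an exception here, not halfway through a GPU run.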
## 3. VRAM Estimation
Estimate VRAM BEFORE running, not after the first OOM. Paper hyperparameters assume paper hardware.
Formula for Sinkhorn (tensorized backend): O(N² × D) per call. Pool building does ~10 calls per batch (2 potentials × 5 flow steps). Add model params × 4 bytes × 3 (params + grads + optimizer states).
Rule: if the paper used an A100 80GB and you have a T4 16GB, re-derive batch sizes from VRAM constraints. Keep total samples seen (batch × iterations) constant by increasing iterations when you shrink the batch.
Add CLI override flags (e.g. `--sinkhorn-batch`) so users can tune without editing the config.
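The formula above can be wired into a small estimator and run before touching a GPU. This is a rough upper-bound sketch under the stated assumptions (4-byte floats, 3× parameter memory for optimizer state, all ~10 Sinkhorn buffers coexisting), not an exact accounting:

```python
def estimate_vram_gb(batch_n, dim, n_params,
                     calls_per_batch=10, bytes_per_float=4):
    """Rough peak-VRAM estimate: Sinkhorn cost tensors plus model state."""
    # Tensorized Sinkhorn: an N x N x D cost tensor per call; pessimistically
    # assume the ~10 calls per batch (2 potentials x 5 flow steps) coexist.
    sinkhorn = batch_n ** 2 * dim * bytes_per_float * calls_per_batch
    # params + grads + optimizer states ~= 3x the parameter memory
    model = n_params * bytes_per_float * 3
    return (sinkhorn + model) / 1024 ** 3

def max_batch_for(vram_gb, dim, n_params, **kw):
    """Largest power-of-two batch that fits the budget, by doubling."""
    n = 1
    while estimate_vram_gb(n * 2, dim, n_params, **kw) < vram_gb:
        n *= 2
    return n
```

Because of the N² term, halving the batch cuts Sinkhorn memory by 4×, which is why re-deriving the batch size (and scaling iterations up to compensate) is usually enough to move from an A100 recipe to a T4.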
## 4. Architecture
- UNet skip connections: count pushes during downward pass, pops during upward pass. They must match exactly.
- Store config values (`num_res_blocks`, `num_levels`) as instance variables at init. Never infer them from module list lengths.
- `nn.GroupNorm(32, channels)` requires `channels` divisible by 32. Assert this at init for all levels.
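Both rules can be enforced mechanically: assert the channel constraint at init, and make the forward pass check that pushes and pops balance. A torch-free sketch (the class name and channel schedule are illustrative):

```python
class UNetSkeleton:
    def __init__(self, num_levels, num_res_blocks, base_channels=32):
        # Store config at init; never infer it from module list lengths later.
        self.num_levels = num_levels
        self.num_res_blocks = num_res_blocks
        for lvl in range(num_levels):
            ch = base_channels * 2 ** lvl
            # nn.GroupNorm(32, ch) would fail at runtime otherwise.
            assert ch % 32 == 0, f"level {lvl}: {ch} channels not divisible by 32"

    def forward(self, x):
        skips = []
        for _ in range(self.num_levels):   # downward pass: one push per level
            skips.append(x)
        for _ in range(self.num_levels):   # upward pass: one pop per level
            x = skips.pop()
        assert not skips, "skip-connection push/pop mismatch"
        return x
```

The final `assert not skips` is cheap and catches an off-by-one in the level count before it silently corrupts feature routing.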
## 5. Multi-Phase Training
Each phase gets its own trainer with its own optimizer. The previous phase's model goes to `eval()`.
### Shared state rules
- Never cache a DataLoader with a fixed batch size if different phases use different batch sizes. Track cached params and invalidate on change.
- Call `torch.cuda.empty_cache()` between phases. `del` large objects (pools, computation graphs) that won't be needed again.
- CLI overrides must touch ALL phases. If `--train-iters` should override 3 phases, grep the config for all 3 fields.
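One way to honor the "track cached params and invalidate on change" rule, sketched with a generic build function (the names are hypothetical; `build_fn` would wrap your real DataLoader constructor):

```python
class LoaderCache:
    """Caches one loader, keyed on the parameters that shaped it."""
    def __init__(self, build_fn):
        self.build_fn = build_fn   # e.g. lambda **kw: DataLoader(ds, **kw)
        self._key = None
        self._loader = None

    def get(self, batch_size, shuffle=True):
        key = (batch_size, shuffle)
        if key != self._key:           # a phase changed a param: rebuild
            self._key = key
            self._loader = self.build_fn(batch_size=batch_size, shuffle=shuffle)
        return self._loader
```

Reusing the cached loader is safe only while the key matches; the moment a later phase asks for a different batch size, the stale loader is discarded instead of silently feeding the wrong batches.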
## 6. Checkpointing
### Phase-level (mandatory)
Save a checkpoint after each phase completes. Include all model state dicts accumulated so far. Implement `--resume-phase N`, which loads the phase N-1 checkpoint and skips completed phases.
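A minimal sketch of phase-level save/resume. The checkpoint directory, pickle format, and phase-function signature are all assumptions; a torch project would use `torch.save`/`torch.load` instead:

```python
import pickle
from pathlib import Path

CKPT_DIR = Path("checkpoints")   # hypothetical layout

def save_phase(phase, models):
    CKPT_DIR.mkdir(exist_ok=True)
    with open(CKPT_DIR / f"phase_{phase}.pkl", "wb") as f:
        pickle.dump({"phase": phase, "models": models}, f)

def run_phases(phase_fns, resume_phase=0):
    models = {}
    if resume_phase > 0:   # load everything the completed phases produced
        with open(CKPT_DIR / f"phase_{resume_phase - 1}.pkl", "rb") as f:
            models = pickle.load(f)["models"]
    for i, fn in enumerate(phase_fns):
        if i < resume_phase:
            continue       # skip phases the checkpoint already covers
        models[f"phase_{i}"] = fn(models)
        save_phase(i, models)
    return models
```

Because each checkpoint carries all state dicts accumulated so far, resuming at phase N needs only the phase N-1 file, not the whole history.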
### Step-level (strongly recommended for phases > 10 min)
Save every N steps within a phase. Include model state, optimizer state, and the step number. Overwrite the same file (keep only the latest, unless you have disk space to spare).
### Kaggle persistence
`/kaggle/working/` persists within a session but NOT across sessions. To carry checkpoints between sessions: commit the notebook output, copy checkpoints to a HF dataset, or download them before the session ends.
## 7. Memory Management
- Trajectory pools / replay buffers live on CPU. Only the sampled minibatch goes to GPU via `.to(device)`.
- Pre-concatenate data structures after building: call `finalize()` once, then sampling is O(1) per step. Never `torch.cat` the entire pool every step.
- Call `torch.cuda.empty_cache()` after pool building and between any phases with different GPU memory patterns.
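The first two rules, sketched with numpy standing in for CPU tensors (a torch version would swap `np.concatenate` for `torch.cat` and call `.to(device)` on the sampled slice):

```python
import numpy as np

class TrajectoryPool:
    """CPU-resident pool: concatenate once, then O(1) sampling per step."""
    def __init__(self):
        self._chunks, self._data = [], None

    def add(self, batch):
        self._chunks.append(np.asarray(batch))   # stays on CPU

    def finalize(self):
        # One concatenation after building -- never one per training step.
        self._data = np.concatenate(self._chunks)
        self._chunks = []

    def sample(self, k, rng=None):
        rng = rng or np.random.default_rng()
        idx = rng.integers(0, len(self._data), size=k)
        return self._data[idx]   # only this minibatch would move to GPU
```

Indexing a pre-concatenated array is O(k) per step regardless of pool size, whereas concatenating the whole pool each step is O(N) in both time and transient memory.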
## 8. Testing
Before any GPU run:
- Test EVERY experiment type with minimal configs, not just the simplest one
- Test ALL training phases end-to-end, not just Phase 1
- Test with `--train-iters 5 --pool-batches 2`; the run should complete in under 60 seconds on CPU
- Test that `--resume-phase` actually works (save checkpoint → load → skip → continue)
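The CPU smoke pass can be scripted so no experiment type gets skipped. The entry point, experiment names, and flags below are placeholders for your own CLI:

```python
import shlex
import subprocess

EXPERIMENTS = ["2d", "mnist", "cifar10"]          # placeholder names
SMOKE_FLAGS = "--train-iters 5 --pool-batches 2"  # tiny, CPU-sized configs

def smoke_commands(entry="python train.py"):
    return [f"{entry} --experiment {e} {SMOKE_FLAGS}" for e in EXPERIMENTS]

def run_smoke(entry="python train.py", timeout=60):
    for cmd in smoke_commands(entry):
        # check=True fails loudly; timeout enforces the <60 s budget per run
        subprocess.run(shlex.split(cmd), check=True, timeout=timeout)
```

Looping over an explicit experiment list is the point: it makes "test EVERY experiment type" a property of the script rather than of your memory.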
Before declaring code ready (pre-flight checklist):
- [ ] All experiment types tested (2d, mnist, cifar10, etc.)
- [ ] All training phases tested end-to-end
- [ ] Library APIs tested with exact tensor shapes per experiment
- [ ] Shared state across phases verified
- [ ] CLI flags override ALL relevant config values
- [ ] VRAM estimated for target hardware
- [ ] Checkpointing works: save + resume + skip phases
- [ ] No O(N) operations per training step where O(1) suffices
- [ ] Expected runtimes documented per hardware tier
- [ ] Multi-GPU limitations documented
- [ ] `requirements.txt` complete
## 9. Documentation for User
When the user runs on their own GPU (Kaggle, Colab, local):
- Provide exact copy-paste commands
- Document expected runtimes per hardware tier
- Document GPU requirements and VRAM limits per experiment
- Document what the code does NOT support (single-GPU only, no DDP, etc.)
- If training exceeds one session, provide session-by-session commands with `--resume-phase`
## 10. Maintaining LEARNING.md
When a new mistake happens or a new principle is discovered:
- Add the mistake to the Mistake Catalog in LEARNING.md with: What, Impact, Root cause, Prevention
- If the mistake reveals a general principle, add it to the Principles section
- If the mistake would have been caught by a pre-flight check, add that check to the checklist in section 8 above
- Keep SKILL.md lean (rules only). LEARNING.md holds the stories and evidence.