---
name: paper-reproduction
description: >-
  Skill for reproducing ML research papers from scratch when no official code
  exists. Use this whenever a user asks to implement, reproduce, or replicate a
  paper, especially papers involving novel loss functions, custom training
  loops, or non-standard architectures that aren't covered by existing HF
  trainers. Also use when the user mentions 'paper reproduction', 'implement
  this paper', 'no official code', or describes a method from a specific arXiv
  paper. Covers: reading papers systematically, extracting hyperparameters,
  building custom training pipelines, handling library-specific gotchas, VRAM
  estimation, checkpointing for multi-session training, and iterating on GPU
  results.
---

# Paper Reproduction Skill

Rules and procedures for reproducing ML research papers from scratch. All concrete mistakes, war stories, and examples live in LEARNING.md. Next steps for this project live in TODO.md.


## 1. Paper Reading

Read methodology sections (3, 4, 5) line by line. Read ALL appendices; they contain the actual recipe.

### Extraction checklist

- [ ] Loss function: exact math, every symbol defined
- [ ] Architecture: layers, dims, activations, normalization
- [ ] Optimizer: type, lr, betas, weight decay, scheduler
- [ ] Batch size: for each phase/component separately
- [ ] Training iterations: for each phase/component separately
- [ ] Dataset preprocessing: normalization range, image size, augmentation
- [ ] Evaluation protocol: metrics, number of samples, special setup
- [ ] Hyperparameters per experiment: papers often have different configs per dataset
- [ ] Algorithm pseudocode: follow it exactly before improvising
- [ ] GPU hardware used: what the authors trained on (often buried in an appendix)
- [ ] Training time: how long did the authors' runs take?

## 2. Library API Verification

Before building ANY training loop that uses a third-party library (geomloss, POT, torchsde, torchdiffeq, etc.), write a 10-line test script that calls the library with the EXACT tensor shapes you'll use in every experiment. Not just the simplest one: all of them. If you have 2D points, MNIST images, and CIFAR images, test all three shapes.
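A minimal sketch of such a smoke test. The experiment names and (N, D) shapes below are made-up examples, and `sinkhorn_loss` is a pure-Python stand-in with the same call contract as the real library call (e.g. geomloss's `SamplesLoss`); swap the real call in before relying on the result.

```python
import random

# Hypothetical (batch N, feature dim D) per experiment; use YOUR real shapes.
EXPERIMENT_SHAPES = {
    "2d":      (256, 2),
    "mnist":   (64, 28 * 28),
    "cifar10": (32, 3 * 32 * 32),
}

def sinkhorn_loss(x, y):
    # Stand-in with the same contract as the library call under test:
    # two equal-shape point clouds in, one scalar out. Replace with the
    # actual geomloss/POT call before trusting the smoke test.
    assert len(x) == len(y) and len(x[0]) == len(y[0])
    return sum(abs(a - b) for rx, ry in zip(x, y) for a, b in zip(rx, ry))

def smoke_test(loss_fn):
    results = {}
    for name, (n, d) in EXPERIMENT_SHAPES.items():
        x = [[random.random() for _ in range(d)] for _ in range(n)]
        y = [[random.random() for _ in range(d)] for _ in range(n)]
        results[name] = loss_fn(x, y)  # a shape bug fails HERE, not mid-training
    return results

print(sorted(smoke_test(sinkhorn_loss)))  # -> ['2d', 'cifar10', 'mnist']
```

The point is that every shape the experiments will ever produce gets exercised once, cheaply, before any GPU time is spent.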


## 3. VRAM Estimation

Estimate VRAM BEFORE running, not after an OOM. Paper hyperparameters assume paper hardware.

Formula for Sinkhorn (tensorized backend): O(N² × D) per call. Pool building does ~10 calls per batch (2 potentials × 5 flow steps). Add model params × 4 bytes × 3 (params + grads + optimizer states).

Rule: If the paper used an A100 80GB and you have a T4 16GB, re-derive batch sizes from VRAM constraints. Keep total samples seen (batch × iterations) constant by increasing iterations when you shrink the batch.
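Both rules as a back-of-envelope sketch using the byte formulas above. The 12 GB budget, 35M parameter count, and CIFAR-sized D are made-up examples; treat the outputs as estimates that still need headroom for activations and CUDA overhead.

```python
import math

def sinkhorn_vram_bytes(n, d, calls=10, bytes_per_float=4):
    # O(N^2 x D) floats per call, ~10 calls per batch during pool building.
    return n * n * d * bytes_per_float * calls

def model_vram_bytes(n_params, bytes_per_float=4):
    # params + grads + optimizer states ~ 3 float copies of the model.
    return 3 * n_params * bytes_per_float

def max_sinkhorn_batch(budget_bytes, d, calls=10, bytes_per_float=4):
    # Largest N with N^2 * D * bytes * calls <= budget.
    return int(math.sqrt(budget_bytes / (d * bytes_per_float * calls)))

def rescale(paper_batch, paper_iters, my_batch):
    # Keep total samples seen (batch x iterations) constant: ceil division.
    iters = -(-(paper_batch * paper_iters) // my_batch)
    return my_batch, iters

budget = 12 * 10**9 - model_vram_bytes(35_000_000)  # ~12 GB minus the model
b = max_sinkhorn_batch(budget, d=3 * 32 * 32)       # CIFAR-sized vectors
print(b, rescale(paper_batch=512, paper_iters=10_000, my_batch=b))
```

Shrinking the batch this way and growing iterations keeps the total sample count at or above the paper's, at the cost of wall-clock time.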

Add CLI override flags (e.g. --sinkhorn-batch) so users can tune without editing config.


## 4. Architecture

- UNet skip connections: count pushes during the downward pass and pops during the upward pass. They must match exactly.
- Store config values (num_res_blocks, num_levels) as instance variables at init. Never infer them from module list lengths.
- nn.GroupNorm(32, channels) requires channels divisible by 32. Assert this at init for all levels.
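These invariants can be sketched with plain Python values standing in for tensors and modules; `num_levels` and the channel list are illustrative, and a real UNet would wrap the same asserts around its actual down/up blocks and nn.GroupNorm layers.

```python
class SkipAudit:
    def __init__(self, num_levels, channels_per_level):
        # Config stored at init, never inferred from module list lengths.
        self.num_levels = num_levels
        for ch in channels_per_level:
            # nn.GroupNorm(32, ch) would fail at runtime otherwise.
            assert ch % 32 == 0, f"GroupNorm(32, {ch}): 32 must divide channels"
        self.skips = []

    def forward(self, x):
        for _ in range(self.num_levels):
            self.skips.append(x)       # one push per downward level
        for _ in range(self.num_levels):
            x = x + self.skips.pop()   # one pop per upward level
        assert not self.skips, "push/pop counts must match exactly"
        return x

print(SkipAudit(3, [64, 128, 256]).forward(1.0))  # -> 4.0
```

A mismatched channel count or an extra push fails loudly at init or at the end of forward, instead of producing a shape error deep inside the decoder.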

## 5. Multi-Phase Training

Each phase gets its own trainer with its own optimizer. Previous phase's model goes to eval().

### Shared state rules

- Never cache a DataLoader with a fixed batch size if different phases use different batch sizes. Track the cached params and invalidate on change.
- torch.cuda.empty_cache() between phases. del large objects (pools, computation graphs) that won't be needed again.
- CLI overrides must touch ALL phases. If --train-iters should override 3 phases, grep the config for all 3 fields.
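The first rule, sketched with a stand-in loader factory (the lambda plays the role of DataLoader construction; all names are illustrative): the cache is keyed on the batch size it was built with, so a phase with a different batch size rebuilds instead of silently reusing a stale loader.

```python
class LoaderCache:
    def __init__(self, make_loader):
        self.make_loader = make_loader
        self.key = None      # params the cached loader was built with
        self.loader = None

    def get(self, batch_size):
        if self.key != batch_size:              # invalidate on change
            self.loader = self.make_loader(batch_size)
            self.key = batch_size
        return self.loader

builds = []
cache = LoaderCache(lambda bs: builds.append(bs) or f"loader(bs={bs})")
cache.get(64)    # phase 1: builds
cache.get(64)    # phase 1 again: reuses
cache.get(128)   # phase 2: different batch size, rebuilds
print(builds)    # -> [64, 128]
```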

## 6. Checkpointing

### Phase-level (mandatory)

Save checkpoint after each phase completes. Include all model state dicts accumulated so far. Implement --resume-phase N that loads phase N-1 checkpoint and skips completed phases.
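A sketch of the --resume-phase flow, with JSON files and placeholder strings standing in for torch.save'd state dicts; the phase names and file layout are made up for illustration.

```python
import json, os, tempfile

def run_phases(ckpt_dir, phases, resume_phase=0):
    state, ran = {}, []
    if resume_phase > 0:  # load phase N-1 checkpoint, skip finished phases
        with open(os.path.join(ckpt_dir, f"phase{resume_phase - 1}.json")) as f:
            state = json.load(f)
    for i, phase in enumerate(phases):
        if i < resume_phase:
            continue                        # already done in a past session
        state[phase] = f"{phase}-weights"   # placeholder for actual training
        ran.append(phase)
        with open(os.path.join(ckpt_dir, f"phase{i}.json"), "w") as f:
            json.dump(state, f)             # all state dicts so far
    return state, ran

with tempfile.TemporaryDirectory() as d:
    phases = ["flow", "potential", "refine"]
    run_phases(d, phases)                                # session 1: full run
    state, ran = run_phases(d, phases, resume_phase=2)   # session 2: resume
    print(ran)            # -> ['refine']
    print(sorted(state))  # -> ['flow', 'potential', 'refine']
```

Because each checkpoint carries everything accumulated so far, resuming at phase N needs only the single phase N-1 file.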

### Step-level (strongly recommended for phases > 10 min)

Save every N steps within a phase. Include model state, optimizer state, step number. Overwrite same file (keep latest only, unless you have disk space).

### Kaggle persistence

/kaggle/working/ persists within a session but NOT across sessions. To carry checkpoints between sessions: commit notebook output, or copy checkpoints to a HF dataset, or download them before session ends.


## 7. Memory Management

- Trajectory pools / replay buffers live on CPU. Only the sampled minibatch goes to GPU via .to(device).
- Pre-concatenate data structures after building: finalize() once → O(1) sampling per step. Never torch.cat the entire pool every step.
- Call torch.cuda.empty_cache() after pool building and between any phases with different GPU memory patterns.
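The finalize-once pattern in miniature, with Python lists standing in for CPU tensors (with torch, finalize would be a single torch.cat on CPU and sample would index, then move only the minibatch via .to(device)); names are illustrative.

```python
import random

class TrajectoryPool:
    def __init__(self):
        self.chunks, self.data = [], None

    def add(self, batch):
        assert self.data is None, "pool already finalized"
        self.chunks.append(batch)          # cheap append while building

    def finalize(self):
        # One O(total) concatenation here, instead of one per training step.
        self.data = [x for chunk in self.chunks for x in chunk]
        self.chunks = None

    def sample(self, k):
        idx = random.sample(range(len(self.data)), k)  # O(k) per step
        return [self.data[i] for i in idx]

pool = TrajectoryPool()
pool.add([1, 2, 3]); pool.add([4, 5])
pool.finalize()
print(len(pool.sample(2)))  # -> 2
```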

## 8. Testing

Before any GPU run:

1. Test EVERY experiment type with minimal configs, not just the simplest one
2. Test ALL training phases end-to-end, not just Phase 1
3. Test with --train-iters 5 --pool-batches 2; it should complete in under 60 seconds on CPU
4. Test that --resume-phase actually works (save checkpoint → load → skip → continue)

Before declaring code ready (pre-flight checklist):

- [ ] All experiment types tested (2d, mnist, cifar10, etc.)
- [ ] All training phases tested end-to-end
- [ ] Library APIs tested with the exact tensor shapes per experiment
- [ ] Shared state across phases verified
- [ ] CLI flags override ALL relevant config values
- [ ] VRAM estimated for target hardware
- [ ] Checkpointing works: save + resume + skip phases
- [ ] No O(N) operations per training step where O(1) suffices
- [ ] Expected runtimes documented per hardware tier
- [ ] Multi-GPU limitations documented
- [ ] requirements.txt complete

## 9. Documentation for User

When the user runs on their own GPU (Kaggle, Colab, local):

  1. Provide exact copy-paste commands
  2. Document expected runtimes per hardware tier
  3. Document GPU requirements and VRAM limits per experiment
  4. Document what the code does NOT support (single-GPU only, no DDP, etc.)
  5. If training exceeds one session, provide session-by-session commands with --resume-phase

## 10. Maintaining LEARNING.md

When a new mistake happens or a new principle is discovered:

  1. Add the mistake to the Mistake Catalog in LEARNING.md with: What, Impact, Root cause, Prevention
  2. If the mistake reveals a general principle, add it to the Principles section
  3. If the mistake would have been caught by a pre-flight check, add that check to the checklist in section 8 above
  4. Keep SKILL.md lean (rules only). LEARNING.md holds the stories and evidence.