---
name: paper-reproduction
description: "Skill for reproducing ML research papers from scratch when no official code exists. Use this whenever a user asks to implement, reproduce, or replicate a paper, especially papers involving novel loss functions, custom training loops, or non-standard architectures that aren't covered by existing HF trainers. Also use when the user mentions 'paper reproduction', 'implement this paper', 'no official code', or describes a method from a specific arxiv paper. Covers: reading papers systematically, extracting hyperparameters, building custom training pipelines, handling library-specific gotchas, VRAM estimation, checkpointing for multi-session training, and iterating on GPU results."
---

# Paper Reproduction Skill

Rules and procedures for reproducing ML research papers from scratch. All concrete mistakes, war stories, and examples live in [LEARNING.md](LEARNING.md). Next steps for this project live in [TODO.md](TODO.md).

---

## 1. Paper Reading

Read methodology sections (3, 4, 5) line by line. Read ALL appendices: they contain the actual recipe.

### Extraction checklist

```
☐ Loss function: exact math, every symbol defined
☐ Architecture: layers, dims, activations, normalization
☐ Optimizer: type, lr, betas, weight decay, scheduler
☐ Batch size: for each phase/component separately
☐ Training iterations: for each phase/component separately
☐ Dataset preprocessing: normalization range, image size, augmentation
☐ Evaluation protocol: metrics, number of samples, special setup
☐ Hyperparameters per experiment: papers often have different configs per dataset
☐ Algorithm pseudocode: follow exactly before improvising
☐ GPU hardware used: what the authors trained on (often buried in appendix)
☐ Training time: how long did the authors' runs take?
```

---

## 2. Library API Verification

Before building ANY training loop that uses a third-party library (geomloss, POT, torchsde, torchdiffeq, etc.), write a 10-line test script that calls the library with the EXACT tensor shapes you'll use in every experiment. Not just the simplest one: all of them. If you have 2D points, MNIST images, and CIFAR images, test all three shapes.
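
A sketch of such a test script, with pure-Python stand-ins so it runs even without torch installed. `smoke_test`, `make_batch`, and `toy_loss` are hypothetical names; `toy_loss` stands in for the real library call (e.g. a `geomloss.SamplesLoss` instance), which you would swap in for the actual pre-flight run.

```python
def smoke_test(loss_fn, shapes, make_batch):
    """Call loss_fn on dummy batches of every (batch, dim) shape the experiments use."""
    results = {}
    for name, (n, d) in shapes.items():
        x, y = make_batch(n, d), make_batch(n, d)
        try:
            loss_fn(x, y)                      # the call that would OOM or shape-error mid-run
            results[name] = "ok"
        except Exception as exc:               # record the failure per experiment
            results[name] = f"FAILED: {exc}"
    return results

def make_batch(n, d):
    # Stand-in for torch.randn(n, d); keeps the sketch runnable without torch.
    return [[0.0] * d for _ in range(n)]

def toy_loss(x, y):
    # Stand-in for the real library call, e.g. geomloss.SamplesLoss("sinkhorn")(x, y).
    if len(x[0]) != len(y[0]):
        raise ValueError("feature dims differ")
    return 0.0

# One entry per experiment: 2D points, flattened MNIST, flattened CIFAR.
print(smoke_test(toy_loss, {"2d": (256, 2), "mnist": (64, 784), "cifar10": (64, 3072)}, make_batch))
# → {'2d': 'ok', 'mnist': 'ok', 'cifar10': 'ok'}
```

The point is the loop over ALL experiment shapes: a library that accepts `(256, 2)` can still reject or OOM on `(64, 3072)`.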

---

## 3. VRAM Estimation

Estimate VRAM BEFORE running, not after OOM. Paper hyperparameters assume paper hardware.

**Formula for Sinkhorn (tensorized backend):** O(N² × D) per call. Pool building does ~10 calls per batch (2 potentials × 5 flow steps). Add model params × 4 bytes × 3 (params + grads + optimizer states).

**Rule:** If the paper used an A100 80GB and you have a T4 16GB, re-derive batch sizes from VRAM constraints. Keep total samples seen (batch × iterations) constant by increasing iterations when you shrink the batch.

Add CLI override flags (e.g. `--sinkhorn-batch`) so users can tune without editing config.
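
The arithmetic above fits in a few helper functions. The names (`estimate_sinkhorn_gb`, `rescale_iters`) are illustrative, not from any library; the constants follow the formula and rule stated in this section.

```python
def estimate_sinkhorn_gb(n: int, d: int, bytes_per_el: int = 4) -> float:
    """Rough peak cost of one tensorized Sinkhorn call: the N x N x D cost tensor."""
    return n * n * d * bytes_per_el / 2**30

def estimate_model_gb(n_params: int, bytes_per_el: int = 4) -> float:
    """Params + grads + optimizer states: roughly params x 4 bytes x 3."""
    return n_params * bytes_per_el * 3 / 2**30

def rescale_iters(paper_batch: int, paper_iters: int, my_batch: int) -> int:
    """Keep total samples seen (batch x iterations) constant when shrinking batch."""
    return -(-paper_batch * paper_iters // my_batch)   # ceiling division

# e.g. batch 1024 on flattened MNIST (784 dims): ~3.06 GB per Sinkhorn call
print(round(estimate_sinkhorn_gb(1024, 784), 2))                         # → 3.06
print(rescale_iters(paper_batch=256, paper_iters=10_000, my_batch=64))   # → 40000
```

If the estimate exceeds your card, shrink N (the Sinkhorn batch) first: it enters quadratically, while iterations only grow linearly under the rescale rule.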

---

## 4. Architecture

- UNet skip connections: count pushes during the downward pass, pops during the upward pass. They must match exactly.
- Store config values (`num_res_blocks`, `num_levels`) as instance variables at init. Never infer them from module list lengths.
- `nn.GroupNorm(32, channels)` requires `channels` divisible by 32. Assert this at init for all levels.
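
A minimal sketch of the first and third rules, assuming a config-style UNet with per-level channel multipliers. `validate_unet_config` and `SkipStack` are illustrative names, not a real API.

```python
def validate_unet_config(base_channels: int, channel_mults: tuple, groups: int = 32):
    """Assert the GroupNorm divisibility rule for every level, at init time."""
    for level, mult in enumerate(channel_mults):
        ch = base_channels * mult
        if ch % groups != 0:
            raise ValueError(f"level {level}: {ch} channels not divisible by {groups}")

class SkipStack:
    """Push on the way down, pop on the way up; assert balance at the end of forward()."""
    def __init__(self):
        self._stack = []
    def push(self, h):
        self._stack.append(h)
    def pop(self):
        return self._stack.pop()
    def assert_empty(self):
        assert not self._stack, f"{len(self._stack)} unconsumed skip connections"

validate_unet_config(128, (1, 2, 2, 4))   # 128/256/256/512 channels: all divisible by 32
```

Running `assert_empty()` at the end of the upward pass catches push/pop mismatches immediately, instead of as a shape error several levels later.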

---

## 5. Multi-Phase Training

Each phase gets its own trainer with its own optimizer. The previous phase's model goes to `eval()`.

### Shared state rules

- Never cache a DataLoader with a fixed batch size if different phases use different batch sizes. Track the cached params and invalidate on change.
- `torch.cuda.empty_cache()` between phases. `del` large objects (pools, computation graphs) that won't be needed again.
- CLI overrides must touch ALL phases. If `--train-iters` should override 3 phases, grep the config for all 3 fields.
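
One way to implement the first rule is a cache keyed on the build parameters. `LoaderCache` is a hypothetical helper; `make_loader` would wrap the real `DataLoader` constructor.

```python
class LoaderCache:
    """Cache a DataLoader keyed by its build params; rebuild when any param changes."""
    def __init__(self, make_loader):
        self._make = make_loader   # e.g. lambda **p: DataLoader(dataset, **p)
        self._key = None
        self._loader = None

    def get(self, **params):
        key = tuple(sorted(params.items()))
        if key != self._key:       # a different phase asked for different params
            self._key = key
            self._loader = self._make(**params)
        return self._loader
```

Phase 1 calling `cache.get(batch_size=256)` twice reuses one loader; phase 2 calling `cache.get(batch_size=64)` forces a rebuild instead of silently training on the wrong batch size.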

---

## 6. Checkpointing

### Phase-level (mandatory)

Save a checkpoint after each phase completes. Include all model state dicts accumulated so far. Implement `--resume-phase N` that loads the phase N-1 checkpoint and skips completed phases.

### Step-level (strongly recommended for phases > 10 min)

Save every N steps within a phase. Include model state, optimizer state, and step number. Overwrite the same file (keep the latest only, unless you have disk space).
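
A sketch of the phase loop with `--resume-phase` semantics. It uses `pickle` as a stand-in for `torch.save`/`torch.load` so it runs anywhere; real code would store model and optimizer state dicts. The atomic `os.replace` ensures a killed session never leaves a half-written checkpoint.

```python
import os
import pickle

def save_phase_checkpoint(path: str, phase: int, state: dict):
    """Write to a temp file, then atomically swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"phase": phase, "state": state}, f)
    os.replace(tmp, path)

def run_phases(phases, ckpt_path: str, resume_phase: int = 0):
    """Run each phase fn in order, skipping phases already completed in a prior session."""
    state = {}
    if resume_phase > 0:
        with open(ckpt_path, "rb") as f:
            ckpt = pickle.load(f)
        assert ckpt["phase"] == resume_phase - 1, "checkpoint does not match --resume-phase"
        state = ckpt["state"]
    for i, phase_fn in enumerate(phases):
        if i < resume_phase:
            continue                              # already done in a previous session
        state[f"phase{i}"] = phase_fn(state)      # each phase sees all prior state dicts
        save_phase_checkpoint(ckpt_path, i, state)
    return state
```

Session 1 runs `run_phases(phases, path)`; after a disconnect, session 2 runs `run_phases(phases, path, resume_phase=N)` and picks up exactly where the checkpoint left off.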

### Kaggle persistence

`/kaggle/working/` persists within a session but NOT across sessions. To carry checkpoints between sessions: commit the notebook output, copy checkpoints to a HF dataset, or download them before the session ends.

---

## 7. Memory Management

- Trajectory pools / replay buffers live on CPU. Only the sampled minibatch goes to GPU via `.to(device)`.
- Pre-concatenate data structures after building: `finalize()` once, then O(1) sampling per step. Never `torch.cat` the entire pool every step.
- Call `torch.cuda.empty_cache()` after pool building and between any phases with different GPU memory patterns.
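
The pool pattern can be sketched with plain lists standing in for CPU tensors. `TrajectoryPool` is an illustrative name; real code would `torch.cat` inside `finalize()` and call `.to(device)` on the sampled minibatch.

```python
import random

class TrajectoryPool:
    """CPU-resident pool: append chunks while building, concatenate ONCE, then sample O(1)."""
    def __init__(self):
        self._chunks = []     # stand-ins for CPU tensors appended during pool building
        self._data = None

    def add(self, chunk):
        assert self._data is None, "pool already finalized"
        self._chunks.append(chunk)

    def finalize(self):
        # One concatenation (the torch.cat equivalent) instead of one per training step.
        self._data = [x for chunk in self._chunks for x in chunk]
        self._chunks = None   # free the intermediate chunk list

    def sample(self, k):
        idx = random.sample(range(len(self._data)), k)
        # In real code, only this minibatch would move to GPU: batch.to(device)
        return [self._data[i] for i in idx]
```

The finalize-once design is what turns per-step cost from O(pool size) into O(minibatch size).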

---

## 8. Testing

### Before any GPU run:

1. Test EVERY experiment type with minimal configs, not just the simplest one
2. Test ALL training phases end-to-end, not just Phase 1
3. Test with `--train-iters 5 --pool-batches 2`: should complete in <60 seconds on CPU
4. Test that `--resume-phase` actually works (save checkpoint → load → skip → continue)

### Before declaring code ready (pre-flight checklist):

```
☐ All experiment types tested (2d, mnist, cifar10, etc.)
☐ All training phases tested end-to-end
☐ Library APIs tested with exact tensor shapes per experiment
☐ Shared state across phases verified
☐ CLI flags override ALL relevant config values
☐ VRAM estimated for target hardware
☐ Checkpointing works: save + resume + skip phases
☐ No O(N) operations per training step where O(1) suffices
☐ Expected runtimes documented per hardware tier
☐ Multi-GPU limitations documented
☐ requirements.txt complete
```

---

## 9. Documentation for User

When the user runs on their own GPU (Kaggle, Colab, local):

1. Provide exact copy-paste commands
2. Document expected runtimes per hardware tier
3. Document GPU requirements and VRAM limits per experiment
4. Document what the code does NOT support (single-GPU only, no DDP, etc.)
5. If training exceeds one session, provide session-by-session commands with `--resume-phase`

---

## 10. Maintaining LEARNING.md

When a new mistake happens or a new principle is discovered:

1. Add the mistake to the **Mistake Catalog** in LEARNING.md with: What, Impact, Root cause, Prevention
2. If the mistake reveals a general principle, add it to the **Principles** section
3. If the mistake would have been caught by a pre-flight check, add that check to the checklist in section 8 above
4. Keep SKILL.md lean (rules only). LEARNING.md holds the stories and evidence.