---
name: paper-reproduction
description: "Skill for reproducing ML research papers from scratch when no official code exists. Use this whenever a user asks to implement, reproduce, or replicate a paper — especially papers involving novel loss functions, custom training loops, or non-standard architectures that aren't covered by existing HF trainers. Also use when the user mentions 'paper reproduction', 'implement this paper', 'no official code', or describes a method from a specific arxiv paper. Covers: reading papers systematically, extracting hyperparameters, building custom training pipelines, handling library-specific gotchas, VRAM estimation, checkpointing for multi-session training, and iterating on GPU results."
---

# Paper Reproduction Skill

Rules and procedures for reproducing ML research papers from scratch. All concrete mistakes, war stories, and examples live in [LEARNING.md](LEARNING.md). Next steps for this project live in [TODO.md](TODO.md).

---

## 1. Paper Reading

Read the methodology sections (typically 3, 4, and 5) line by line. Read ALL appendices — they contain the actual recipe.

### Extraction checklist

```
□ Loss function — exact math, every symbol defined
□ Architecture — layers, dims, activations, normalization
□ Optimizer — type, lr, betas, weight decay, scheduler
□ Batch size — for each phase/component separately
□ Training iterations — for each phase/component separately
□ Dataset preprocessing — normalization range, image size, augmentation
□ Evaluation protocol — metrics, number of samples, special setup
□ Hyperparameters per experiment — papers often use different configs per dataset
□ Algorithm pseudocode — follow it exactly before improvising
□ GPU hardware used — what the authors trained on (often buried in an appendix)
□ Training time — how long did the authors' runs take?
```

---

## 2. Library API Verification

Before building ANY training loop that uses a third-party library (geomloss, POT, torchsde, torchdiffeq, etc.), write a ~10-line test script that calls the library with the EXACT tensor shapes you'll use in every experiment. Not just the simplest one — all of them. If you have 2D points, MNIST images, and CIFAR images, test all three shapes.
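For example, a minimal shape-test sketch using geomloss's `SamplesLoss` (one of the libraries named above); the experiment names and tensor sizes are placeholders, not values from any specific paper:

```python
# shape_test.py: call the library with the EXACT shapes used in every
# experiment, before wiring it into any training loop.
import torch
from geomloss import SamplesLoss  # swap in whatever library your paper needs

loss_fn = SamplesLoss("sinkhorn", p=2, blur=0.05, backend="tensorized")

# One entry per experiment type: (name, num_points, feature_dim).
# These sizes are illustrative placeholders.
shapes = [
    ("2d",      256, 2),            # 2D toy points
    ("mnist",   128, 28 * 28),      # flattened MNIST images
    ("cifar10",  64, 3 * 32 * 32),  # flattened CIFAR-10 images
]

for name, n, d in shapes:
    x = torch.randn(n, d, requires_grad=True)
    y = torch.randn(n, d)
    loss = loss_fn(x, y)  # forward pass with the exact training shape
    loss.backward()       # the gradient path must work too
    print(f"{name}: loss={loss.item():.4f} ok")
```

If a shape fails here, it would have failed mid-run inside the training loop; catching it costs seconds on CPU instead of a GPU session.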
---

## 3. VRAM Estimation

Estimate VRAM BEFORE running — not after OOM. Paper hyperparameters assume paper hardware.

**Formula for Sinkhorn (tensorized backend):** O(N² × D) per call. Pool building does ~10 calls per batch (2 potentials × 5 flow steps). Add model params × 4 bytes × 3–4 (params + grads + optimizer states; Adam keeps two moment buffers, so budget ×4).

**Rule:** If the paper used an A100 80GB and you have a T4 16GB, re-derive batch sizes from VRAM constraints. Keep total samples seen (batch × iterations) constant by increasing iterations when you shrink the batch. Add CLI override flags (e.g. `--sinkhorn-batch`) so users can tune without editing the config.

---

## 4. Architecture

- UNet skip connections: count pushes during the downward pass and pops during the upward pass. They must match exactly.
- Store config values (`num_res_blocks`, `num_levels`) as instance variables at init. Never infer them from module list lengths.
- `nn.GroupNorm(32, channels)` requires `channels` divisible by 32. Assert this at init for all levels.

---

## 5. Multi-Phase Training

Each phase gets its own trainer with its own optimizer. The previous phase's model goes to `eval()`.

### Shared state rules

- Never cache a DataLoader with a fixed batch size if different phases use different batch sizes. Track cached params and invalidate on change.
- `torch.cuda.empty_cache()` between phases. `del` large objects (pools, computation graphs) that won't be needed again.
- CLI overrides must touch ALL phases. If `--train-iters` should override 3 phases, grep the config for all 3 fields.

---

## 6. Checkpointing

### Phase-level (mandatory)

Save a checkpoint after each phase completes. Include all model state dicts accumulated so far. Implement `--resume-phase N` that loads the phase N-1 checkpoint and skips completed phases (a minimal sketch appears at the end of this file).

### Step-level (strongly recommended for phases > 10 min)

Save every N steps within a phase. Include model state, optimizer state, and step number. Overwrite the same file (keep only the latest, unless you have disk space to spare).

### Kaggle persistence

`/kaggle/working/` persists within a session but NOT across sessions. To carry checkpoints between sessions: commit the notebook output, copy checkpoints to a HF dataset, or download them before the session ends.

---

## 7. Memory Management

- Trajectory pools / replay buffers live on CPU. Only the sampled minibatch goes to GPU via `.to(device)`.
- Pre-concatenate data structures after building: `finalize()` once → O(1) sampling per step. Never `torch.cat` the entire pool every step.
- Call `torch.cuda.empty_cache()` after pool building and between any phases with different GPU memory patterns.

---

## 8. Testing

### Before any GPU run:

1. Test EVERY experiment type with minimal configs — not just the simplest one
2. Test ALL training phases end-to-end — not just Phase 1
3. Test with `--train-iters 5 --pool-batches 2` — the run should complete in under 60 seconds on CPU
4. Test that `--resume-phase` actually works (save checkpoint → load → skip → continue)

### Before declaring code ready (pre-flight checklist):

```
□ All experiment types tested (2d, mnist, cifar10, etc.)
□ All training phases tested end-to-end
□ Library APIs tested with exact tensor shapes per experiment
□ Shared state across phases verified
□ CLI flags override ALL relevant config values
□ VRAM estimated for target hardware
□ Checkpointing works: save + resume + skip phases
□ No O(N) operations per training step where O(1) suffices
□ Expected runtimes documented per hardware tier
□ Multi-GPU limitations documented
□ requirements.txt complete
```

---

## 9. Documentation for User

When the user runs on their own GPU (Kaggle, Colab, local):

1. Provide exact copy-paste commands
2. Document expected runtimes per hardware tier
3. Document GPU requirements and VRAM limits per experiment
4. Document what the code does NOT support (single-GPU only, no DDP, etc.)
5. If training exceeds one session, provide session-by-session commands with `--resume-phase`

---

## 10. Maintaining LEARNING.md

When a new mistake happens or a new principle is discovered:

1. Add the mistake to the **Mistake Catalog** in LEARNING.md with: What, Impact, Root cause, Prevention
2. If the mistake reveals a general principle, add it to the **Principles** section
3. If the mistake would have been caught by a pre-flight check, add that check to the checklist in section 8 above
4. Keep SKILL.md lean (rules only). LEARNING.md holds the stories and evidence.
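---

## Appendix: Phase-Level Resume Sketch

Referenced from section 6: a minimal sketch of the phase-level save/resume pattern. The two `nn.Linear` stand-in models and the dummy loss are illustrative assumptions; only the `--resume-phase` convention and the checkpoint contents follow the rules above.

```python
# resume_sketch.py: minimal phase-level save/resume, per section 6.
import argparse
import os

import torch
import torch.nn as nn

CKPT = "checkpoints/phase_{n}.pt"


def train_phase(n: int, models: list) -> None:
    """Stand-in for a real training phase: briefly trains models[n]."""
    model = models[n]
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(5):
        loss = model(torch.randn(8, 4)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    model.eval()  # frozen for all later phases (section 5)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--resume-phase", type=int, default=0)
    args = parser.parse_args()

    models = [nn.Linear(4, 4) for _ in range(2)]  # one stand-in per phase
    os.makedirs("checkpoints", exist_ok=True)

    if args.resume_phase > 0:
        # Load everything saved by the last completed phase.
        state = torch.load(CKPT.format(n=args.resume_phase - 1))
        for model, sd in zip(models, state["models"]):
            model.load_state_dict(sd)

    for phase in range(len(models)):
        if phase < args.resume_phase:
            continue  # completed in an earlier session, skip it
        train_phase(phase, models)
        # Save ALL model state dicts accumulated so far, not just this phase's.
        torch.save(
            {"phase": phase, "models": [m.state_dict() for m in models]},
            CKPT.format(n=phase),
        )


if __name__ == "__main__":
    main()
```

Session 1 runs `python resume_sketch.py`; if it dies after phase 0, session 2 runs `python resume_sketch.py --resume-phase 1`, which reloads `phase_0.pt` and skips straight to phase 1.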