---
name: paper-reproduction
description: "Skill for reproducing ML research papers from scratch when no official code exists. Use this whenever a user asks to implement, reproduce, or replicate a paper, especially papers involving novel loss functions, custom training loops, or non-standard architectures that aren't covered by existing HF trainers. Also use when the user mentions 'paper reproduction', 'implement this paper', 'no official code', or describes a method from a specific arxiv paper. Covers: reading papers systematically, extracting hyperparameters, building custom training pipelines, handling library-specific gotchas, VRAM estimation, checkpointing for multi-session training, and iterating on GPU results."
---

# Paper Reproduction Skill

Rules and procedures for reproducing ML research papers from scratch. All concrete mistakes, war stories, and examples live in [LEARNING.md](LEARNING.md). Next steps for this project live in [TODO.md](TODO.md).

---

## 1. Paper Reading

Read methodology sections (3, 4, 5) line by line. Read ALL appendices; they contain the actual recipe.

### Extraction checklist

```
☐ Loss function: exact math, every symbol defined
☐ Architecture: layers, dims, activations, normalization
☐ Optimizer: type, lr, betas, weight decay, scheduler
☐ Batch size: for each phase/component separately
☐ Training iterations: for each phase/component separately
☐ Dataset preprocessing: normalization range, image size, augmentation
☐ Evaluation protocol: metrics, number of samples, special setup
☐ Hyperparameters per experiment: papers often have different configs per dataset
☐ Algorithm pseudocode: follow exactly before improvising
☐ GPU hardware used: what the authors trained on (often buried in appendix)
☐ Training time: how long did the authors' runs take?
```
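
A convenient way to keep the extracted values in one place is a per-experiment config object. The sketch below is illustrative only; every field name and default is a hypothetical placeholder to be replaced with what the paper actually specifies.

```python
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    # Hypothetical fields -- mirror the checklist above for the paper at hand.
    loss_name: str = "sinkhorn"            # loss function (keep the exact math in a comment or docstring)
    lr: float = 1e-4                       # optimizer learning rate
    betas: tuple = (0.9, 0.999)            # Adam betas
    weight_decay: float = 0.0
    scheduler: str = "none"
    batch_size: int = 256                  # per phase/component if they differ
    train_iters: int = 100_000             # per phase/component if they differ
    image_size: int = 32
    normalize_range: tuple = (-1.0, 1.0)   # dataset preprocessing
    eval_samples: int = 10_000             # evaluation protocol
    authors_gpu: str = "unknown"           # hardware the authors trained on
    authors_train_time: str = "unknown"    # how long their runs took
```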

---

## 2. Library API Verification

Before building ANY training loop that uses a third-party library (geomloss, POT, torchsde, torchdiffeq, etc.), write a 10-line test script that calls the library with the EXACT tensor shapes you'll use in every experiment. Not just the simplest one: all of them. If you have 2D points, MNIST images, and CIFAR images, test all three shapes.
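
A minimal shape smoke test might look like the sketch below. It assumes geomloss with the Sinkhorn loss and a tensorized backend; swap in whatever library, settings, and shapes your reproduction actually uses.

```python
import torch
from geomloss import SamplesLoss

# Settings here are placeholders; use the exact ones from the paper/config.
loss_fn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05, backend="tensorized")

shapes = {
    "2d_points": (512, 2),      # toy 2D experiments
    "mnist_flat": (128, 784),   # flattened 28x28 images
    "cifar_flat": (64, 3072),   # flattened 3x32x32 images
}
for name, (n, d) in shapes.items():
    x, y = torch.randn(n, d), torch.randn(n, d)
    value = loss_fn(x, y)       # should return a scalar tensor for every shape
    print(f"{name}: loss={value.item():.4f}")
```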

---

## 3. VRAM Estimation

Estimate VRAM BEFORE running, not after OOM. Paper hyperparameters assume paper hardware.

**Formula for Sinkhorn (tensorized backend):** O(N² × D) per call. Pool building does ~10 calls per batch (2 potentials × 5 flow steps). Add model params × 4 bytes × 3 (params + grads + optimizer states).
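
A back-of-envelope estimator following that rule might look like this sketch. The constants are assumptions, it ignores activation memory, and it deliberately pretends every call's intermediates coexist, so treat the result as a rough ceiling rather than a measurement.

```python
def estimate_vram_gb(n, d, num_params, calls_per_batch=10, bytes_per_float=4):
    """Rough estimate of the dominant terms only (ignores activations)."""
    per_call = n * n * d * bytes_per_float      # O(N^2 * D) Sinkhorn intermediates
    sinkhorn = calls_per_batch * per_call       # worst case: nothing freed between calls
    model = num_params * bytes_per_float * 3    # params + grads + optimizer states
    return (sinkhorn + model) / 1024**3

# Hypothetical numbers: 2048-sample Sinkhorn batches on flattened MNIST, ~35M-param model.
print(f"~{estimate_vram_gb(n=2048, d=784, num_params=35_000_000):.1f} GB")
```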

**Rule:** If the paper used an A100 80GB and you have a T4 16GB, re-derive batch sizes from VRAM constraints. Keep total samples seen (batch × iterations) constant by increasing iterations when you shrink the batch.

Add CLI override flags (e.g. `--sinkhorn-batch`) so users can tune without editing config.
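
For example, a minimal override pattern (flag names beyond `--sinkhorn-batch` and the default values are assumptions; adapt to your config system):

```python
import argparse

# Paper defaults remain the baseline; CLI flags are opt-in overrides.
config = {"sinkhorn_batch": 1024, "train_iters": 20_000}  # illustrative defaults

parser = argparse.ArgumentParser()
parser.add_argument("--sinkhorn-batch", type=int, default=None)
parser.add_argument("--train-iters", type=int, default=None)
args = parser.parse_args()

if args.sinkhorn_batch is not None:
    config["sinkhorn_batch"] = args.sinkhorn_batch
if args.train_iters is not None:
    config["train_iters"] = args.train_iters  # remember: this must reach ALL phases (section 5)
```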

---

## 4. Architecture

- UNet skip connections: count pushes during the downward pass and pops during the upward pass. They must match exactly (see the sketch after this list).
- Store config values (`num_res_blocks`, `num_levels`) as instance variables at init. Never infer them from module list lengths.
- `nn.GroupNorm(32, channels)` requires channels divisible by 32. Assert this at init for all levels.
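
The skeleton below illustrates all three rules. It is a sketch, not a faithful UNet: residual blocks, attention, and timestep conditioning are omitted, and the input is assumed to already have `channels[0]` channels.

```python
import torch
import torch.nn as nn

class UNetSkeleton(nn.Module):
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        # Store config at init; never infer it later from module list lengths.
        self.channels = channels
        for ch in channels:
            assert ch % 32 == 0, f"GroupNorm(32, {ch}) needs channels divisible by 32"
        in_chs = (channels[0],) + channels[:-1]
        self.downs = nn.ModuleList(
            [nn.Conv2d(ci, co, 3, stride=2, padding=1) for ci, co in zip(in_chs, channels)])
        self.norms = nn.ModuleList([nn.GroupNorm(32, co) for co in channels])
        self.ups = nn.ModuleList(
            [nn.ConvTranspose2d(2 * co, ci, 4, stride=2, padding=1)
             for ci, co in zip(in_chs, channels)][::-1])

    def forward(self, x):
        skips = []
        for down, norm in zip(self.downs, self.norms):
            x = norm(down(x))
            skips.append(x)                             # one push per level...
        for up in self.ups:
            x = up(torch.cat([x, skips.pop()], dim=1))  # ...one pop per level
        assert not skips, "skip pushes and pops must match exactly"
        return x
```

With the default `channels=(64, 128, 256)` and a 64-channel 32x32 input, three skips are pushed on the way down and all three are popped on the way up.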

---

## 5. Multi-Phase Training

Each phase gets its own trainer with its own optimizer. The previous phase's model goes to `eval()` (see the sketch after the rules below).

### Shared state rules

- Never cache a DataLoader with a fixed batch size if different phases use different batch sizes. Track cached params and invalidate on change.
- `torch.cuda.empty_cache()` between phases. `del` large objects (pools, computation graphs) that won't be needed again.
- CLI overrides must touch ALL phases. If `--train-iters` should override 3 phases, grep the config for all 3 fields.
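
A minimal phase-transition helper, as a sketch; the `build_optimizer` and `run_steps` callables are placeholders for your own trainer code.

```python
import torch

def run_phase(model, build_optimizer, run_steps, frozen_models=()):
    """Train one phase with a fresh optimizer; earlier-phase models only provide targets."""
    for frozen in frozen_models:
        frozen.eval()                         # previous phase's model goes to eval()
        for p in frozen.parameters():         # optional extra safeguard, not required by the rules above
            p.requires_grad_(False)
    optimizer = build_optimizer(model.parameters())  # each phase owns its optimizer
    run_steps(model, optimizer)                      # the phase's actual training loop
    torch.cuda.empty_cache()                         # release cached blocks before the next phase
    return model
```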

---

## 6. Checkpointing

### Phase-level (mandatory)

Save checkpoint after each phase completes. Include all model state dicts accumulated so far. Implement `--resume-phase N` that loads phase N-1 checkpoint and skips completed phases.
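
A sketch of the phase-level save/resume contract (names are illustrative; `models` maps phase/component names to modules):

```python
import torch

def save_phase_checkpoint(path, phase, models):
    torch.save({
        "phase": phase,
        "model_states": {name: m.state_dict() for name, m in models.items()},
    }, path)

def resume_from_phase(path, resume_phase, models):
    """Load the phase (resume_phase - 1) checkpoint; the caller skips phases < resume_phase."""
    ckpt = torch.load(path, map_location="cpu")
    assert ckpt["phase"] == resume_phase - 1, "checkpoint does not match --resume-phase"
    for name, state in ckpt["model_states"].items():
        models[name].load_state_dict(state)
    return ckpt["phase"]
```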

### Step-level (strongly recommended for phases > 10 min)

Save every N steps within a phase. Include model state, optimizer state, and the step number. Overwrite the same file (keep only the latest, unless you have disk space to spare).

### Kaggle persistence

`/kaggle/working/` persists within a session but NOT across sessions. To carry checkpoints between sessions: commit the notebook output, copy checkpoints to an HF dataset, or download them before the session ends.
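
For the HF-dataset route, a sketch using `huggingface_hub` (the repo id and file paths are placeholders, and a valid HF token must be available in the environment):

```python
from huggingface_hub import HfApi

api = HfApi()  # reads the token from HF_TOKEN or the cached login
api.create_repo("your-username/paper-repro-checkpoints", repo_type="dataset",
                private=True, exist_ok=True)
api.upload_file(
    path_or_fileobj="/kaggle/working/checkpoints/phase2.pt",
    path_in_repo="phase2.pt",
    repo_id="your-username/paper-repro-checkpoints",
    repo_type="dataset",
)
```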

---

## 7. Memory Management

- Trajectory pools / replay buffers live on CPU. Only the sampled minibatch goes to GPU via `.to(device)`.
- Pre-concatenate data structures after building: `finalize()` once → O(1) sampling per step. Never `torch.cat` the entire pool every step (see the sketch after this list).
- Call `torch.cuda.empty_cache()` after pool building and between any phases with different GPU memory patterns.
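
A minimal pool following these rules (a sketch; a real pool may also store timesteps, labels, or several tensors per entry):

```python
import torch

class TrajectoryPool:
    def __init__(self):
        self._chunks, self._data = [], None

    def add(self, batch):
        self._chunks.append(batch.detach().cpu())  # keep the pool on CPU, free of graphs

    def finalize(self):
        self._data = torch.cat(self._chunks)       # concatenate exactly once
        self._chunks = []

    def sample(self, batch_size, device):
        idx = torch.randint(len(self._data), (batch_size,))
        return self._data[idx].to(device)          # only the sampled minibatch moves to GPU
```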

---

## 8. Testing

### Before any GPU run:
1. Test EVERY experiment type with minimal configs, not just the simplest one
2. Test ALL training phases end-to-end, not just Phase 1
3. Test with `--train-iters 5 --pool-batches 2`; the run should complete in <60 seconds on CPU (see the smoke-run sketch below)
4. Test that `--resume-phase` actually works (save checkpoint → load → skip → continue)
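
A smoke-run sketch over every experiment type; the script name `train.py` and the `--experiment` flag are assumptions, while `--train-iters` and `--pool-batches` come from this skill.

```python
import subprocess

# Each minimal-config run should finish within the <60 s CPU budget.
for exp in ["2d", "mnist", "cifar10"]:
    subprocess.run(
        ["python", "train.py", "--experiment", exp,
         "--train-iters", "5", "--pool-batches", "2"],
        check=True, timeout=60,
    )
```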

### Before declaring code ready (pre-flight checklist):
```
☐ All experiment types tested (2d, mnist, cifar10, etc.)
☐ All training phases tested end-to-end
☐ Library APIs tested with exact tensor shapes per experiment
☐ Shared state across phases verified
☐ CLI flags override ALL relevant config values
☐ VRAM estimated for target hardware
☐ Checkpointing works: save + resume + skip phases
☐ No O(N) operations per training step where O(1) suffices
☐ Expected runtimes documented per hardware tier
☐ Multi-GPU limitations documented
☐ Requirements.txt complete
```

---

## 9. Documentation for User

When the user runs on their own GPU (Kaggle, Colab, local):

1. Provide exact copy-paste commands
2. Document expected runtimes per hardware tier
3. Document GPU requirements and VRAM limits per experiment
4. Document what the code does NOT support (single-GPU only, no DDP, etc.)
5. If training exceeds one session, provide session-by-session commands with `--resume-phase`

---

## 10. Maintaining LEARNING.md

When a new mistake happens or a new principle is discovered:

1. Add the mistake to the **Mistake Catalog** in LEARNING.md with: What, Impact, Root cause, Prevention
2. If the mistake reveals a general principle, add it to the **Principles** section
3. If the mistake would have been caught by a pre-flight check, add that check to the checklist in section 8 above
4. Keep SKILL.md lean (rules only). LEARNING.md holds the stories and evidence.