rogermt committed on
Commit d6ef77d · verified · 1 Parent(s): 66d3632

Split SKILL.md into SKILL.md (rules) + LEARNING.md (stories/mistakes) + TODO.md (next steps)

Files changed (1)
  1. SKILL.md +69 -382
SKILL.md CHANGED
@@ -1,453 +1,140 @@
  ---
  name: paper-reproduction
- description: "Skill for reproducing ML research papers from scratch when no official code exists. Use this whenever a user asks to implement, reproduce, or replicate a paper, especially papers involving novel loss functions, custom training loops, or non-standard architectures that aren't covered by existing HF trainers. Also use when the user mentions 'paper reproduction', 'implement this paper', 'no official code', or describes a method from a specific arXiv paper. Covers: reading papers systematically, extracting hyperparameters, building custom training pipelines, handling library-specific gotchas (geomloss, POT, custom UNets), VRAM estimation, checkpointing for multi-session training, and iterating on GPU results."
  ---

  # Paper Reproduction Skill

- A skill for reproducing ML research papers from scratch, learned through the experience of reproducing NSGF++ (arXiv:2401.14069), a Neural Sinkhorn Gradient Flow paper with no official implementation.
-
- ## When to use this skill
-
- - User wants to reproduce or implement an ML paper
- - No official code repository exists
- - The paper uses custom training loops, novel losses, or non-standard architectures
- - The method doesn't fit neatly into existing HF Trainer abstractions (SFT, DPO, GRPO)

  ---

- ## Phase 1: Read the Paper Properly

- Most reproduction failures trace back to incomplete paper reading. Don't skim: read the methodology sections (3, 4, 5) line by line, and read ALL appendices.

- ### What to extract (checklist)

  ```
- ░ Loss function: exact mathematical form, every symbol defined
- ░ Architecture: layer counts, hidden dims, activation functions, normalization
- ░ Optimizer: type, learning rate, betas, weight decay, scheduler
  ░ Batch size: for each phase/component separately
- ░ Training iterations: for each phase/component
  ░ Dataset preprocessing: normalization range, image size, augmentation
- ░ Evaluation protocol: metrics, number of samples, any special setup
  ░ Hyperparameters per experiment: papers often have different configs per dataset
- ░ Algorithm pseudocode: if provided, follow it exactly before improvising
  ░ GPU hardware used: what the authors trained on (often buried in appendix)
  ░ Training time: how long did the authors' runs take?
  ```

- ### Mistake I made: Incomplete appendix reading
-
- I extracted most hyperparameters correctly from the NSGF++ paper but missed a critical detail about how geomloss handles image tensors. The paper says "GeomLoss package" but doesn't spell out that images must be flattened to (N, D) format for the `SamplesLoss` API. This caused the MNIST and CIFAR-10 experiments to crash immediately on GPU.
-
- **Lesson**: When a paper references a specific library, read that library's documentation and test its API with the exact tensor shapes you'll use BEFORE writing the full pipeline.
-
- ---
-
- ## Phase 2: Library API Verification
-
- ### CRITICAL: Test third-party library APIs with your actual tensor shapes
-
- This is the single biggest mistake pattern in paper reproduction. You read the paper, understand the math, and implement everything; then it crashes because a library function expects `(N, D)` but you passed `(N, C, H, W)`.
-
- **The rule**: Before building ANY training loop that uses a third-party library (geomloss, POT, torchsde, torchdiffeq, etc.), write a 10-line test script:
-
- ```python
- import torch
- from geomloss import SamplesLoss
-
- # Test with the EXACT shapes you'll use in training
- loss_fn = SamplesLoss(loss="sinkhorn", p=2, blur=0.5, potentials=True)
-
- # 2D case: works fine
- x_2d = torch.randn(256, 2, requires_grad=True)
- y_2d = torch.randn(256, 2)
- F, G = loss_fn(x_2d, y_2d)  # ✅ OK
-
- # Image case: THIS CRASHES
- x_img = torch.randn(128, 1, 28, 28, requires_grad=True)
- y_img = torch.randn(128, 1, 28, 28)
- try:
-     F, G = loss_fn(x_img, y_img)
- except ValueError:
-     pass  # ❌ ValueError: must be (N,D) or (B,N,D)
-
- # Image case: FIXED by flattening (the view of x_img already requires grad)
- B = x_img.shape[0]
- x_flat = x_img.view(B, -1)
- y_flat = y_img.view(B, -1)
- F, G = loss_fn(x_flat, y_flat)  # ✅ OK
- ```
-
- ### Mistake I made: geomloss tensor shape assumption
-
- `SamplesLoss` in geomloss requires inputs as `(N, D)` or `(B, N, D)` tensors. For 2D experiments with shape `(256, 2)` this works perfectly. For images with shape `(128, 1, 28, 28)` it crashes with:
-
- ```
- ValueError: Input samples 'x' and 'y' should be encoded as (N,D) or (B,N,D) (batch) tensors.
- ```
-
- **The fix**: Flatten images before passing them to geomloss, and reshape the gradients back afterwards. This pattern (flatten before the library call, reshape after) applies to many optimal transport libraries (POT, geomloss, ott-jax).
-
  ---

- ## Phase 3: Architecture Gotchas
-
- ### UNet skip connections

- When building a UNet from scratch (rather than importing from guided-diffusion), the skip-connection bookkeeping is the #1 source of shape mismatch errors.
-
- **The pattern that works** (see the sketch at the end of this section):
- 1. During the downward pass, push every intermediate activation onto a `skips` list
- 2. During the upward pass, pop from `skips` and concatenate
- 3. The number of pops must EXACTLY equal the number of pushes
-
- **Mistake pattern**: Using a helper like `_get_num_res_blocks()` that infers the block count from module list lengths. This is fragile: if the number of levels or blocks per level varies, the inference breaks.
-
- **Better approach**: Store `num_res_blocks` as an instance variable at init time and use it directly.
-
- ### GroupNorm channel requirements
-
- `nn.GroupNorm(32, channels)` requires `channels` to be divisible by 32. For small models (e.g., MNIST with `model_channels=32`), this is fine at the first level but may break at deeper levels if `channel_mult` creates channels not divisible by 32.
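-
- A minimal sketch covering both gotchas (hypothetical toy dims, not the paper's UNet; stem and down/upsampling layers omitted for brevity):
-
- ```python
- import torch
- import torch.nn as nn
-
- def block(ch):
-     # GroupNorm(32, ch) requires ch % 32 == 0: assert at init, not mid-training
-     assert ch % 32 == 0, f"GroupNorm(32) needs channels divisible by 32, got {ch}"
-     return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.GroupNorm(32, ch), nn.SiLU())
-
- class TinyUNet(nn.Module):
-     def __init__(self, ch=32, num_res_blocks=2):
-         super().__init__()
-         self.num_res_blocks = num_res_blocks  # stored at init, never inferred from module lists
-         self.down = nn.ModuleList([block(ch) for _ in range(num_res_blocks)])
-         self.mid = block(ch)
-         # each up block consumes torch.cat([h, skip]) -> 2*ch input channels
-         self.up = nn.ModuleList(
-             [nn.Conv2d(2 * ch, ch, 3, padding=1) for _ in range(num_res_blocks)]
-         )
-
-     def forward(self, x):
-         skips, h = [], x
-         for blk in self.down:
-             h = blk(h)
-             skips.append(h)  # one push per down block
-         h = self.mid(h)
-         for conv in self.up:
-             h = conv(torch.cat([h, skips.pop()], dim=1))  # one pop per up block
-         assert not skips, "unconsumed skip connections"
-         return h
-
- print(TinyUNet()(torch.randn(2, 32, 28, 28)).shape)  # torch.Size([2, 32, 28, 28])
- ```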

  ---

- ## Phase 4: VRAM Estimation and Memory Management
-
- ### Estimate VRAM BEFORE running, not after OOM
-
- Papers report batch sizes that worked on their hardware (often A100 80GB or 8×V100). If your user has a T4 (16GB), or a T4×2 where single-GPU code uses only one of them, you must recalculate whether the paper's configs will fit.
-
- ### The Sinkhorn VRAM trap
-
- The `tensorized` backend in geomloss computes a full N×N cost matrix. For N samples of dimension D:
- - Memory ≈ O(N² × D) for the cost matrix plus intermediate Sinkhorn iterations
- - With `potentials=True` and `autograd.grad`, add another O(N × D) for gradient storage
-
- **Concrete examples (fp32, single Sinkhorn call)**:
-
- | N (batch) | D (flattened dim)    | Approx VRAM per call |
- |-----------|----------------------|----------------------|
- | 256       | 2 (2D points)        | ~1 MB                |
- | 256       | 784 (MNIST 28×28)    | ~200 MB              |
- | 128       | 3072 (CIFAR 3×32×32) | ~600 MB              |
-
- But pool building calls Sinkhorn **twice per step** (self-potential + cross-potential) × **5 flow steps per batch** = 10 Sinkhorn calls per pool batch. With autograd overhead, 128×3072 easily eats 8+ GB, leaving no room for the 38M-param UNet on a 16GB T4.
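-
- A back-of-envelope helper for this estimate (a sketch; the ×10 overhead factor folding in Sinkhorn iterations and autograd buffers is a rough assumption, not a measured constant):
-
- ```python
- def sinkhorn_vram_gb(n, d, bytes_per=4, overhead=10):
-     """Rough cost of one tensorized Sinkhorn call: the broadcasted
-     N x N x D distance computation dominates."""
-     return n * n * d * bytes_per * overhead / 1e9
-
- for n, d in [(256, 2), (256, 784), (128, 3072)]:
-     print(f"N={n}, D={d}: ~{sinkhorn_vram_gb(n, d):.2f} GB")
- # N=128, D=3072: ~2.01 GB per call; at 10 calls per pool batch, a 16GB T4 is gone
- ```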
-
- **Mistake I made**: Used the paper's `sinkhorn.batch_size=128` for CIFAR-10. This OOMed immediately on T4. The paper's authors likely used A100s.
-
- **The fix**: Reduce the Sinkhorn batch size for smaller GPUs and increase the number of pool batches to compensate:
-
- ```yaml
- # Paper config (A100 80GB):
- sinkhorn.batch_size: 128
- pool.num_batches: 2500
- # Total pool entries: 128 × 2500 × 5 = 1.6M
-
- # T4 16GB config:
- sinkhorn.batch_size: 32
- pool.num_batches: 10000
- # Total pool entries: 32 × 10000 × 5 = 1.6M (same!)
- ```
-
- Add a CLI override (`--sinkhorn-batch`) so users can tune this without editing config files.
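-
- The wiring is a few lines (a sketch using argparse and a plain dict config; the real config layout is project-specific). Note that `--train-iters` must fan out to every phase that owns a copy of the parameter, which is exactly the bug in Mistake 7 of the catalog below:
-
- ```python
- import argparse
-
- config = {
-     "sinkhorn": {"batch_size": 128},
-     "nsgf": {"iters": 100_000},
-     "nsf": {"iters": 100_000},
-     "phase_predictor": {"iters": 40_000},
- }
-
- parser = argparse.ArgumentParser()
- parser.add_argument("--sinkhorn-batch", type=int, default=None)
- parser.add_argument("--train-iters", type=int, default=None)
- args = parser.parse_args()
-
- if args.sinkhorn_batch is not None:
-     config["sinkhorn"]["batch_size"] = args.sinkhorn_batch
- if args.train_iters is not None:
-     for phase in ("nsgf", "nsf", "phase_predictor"):  # ALL phases, not just the first two
-         config[phase]["iters"] = args.train_iters
- ```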
-
- ### Always call `torch.cuda.empty_cache()` between phases
-
- Pool building uses the GPU for Sinkhorn computation; training uses it for the neural network. These are different memory patterns. After pool building, the Sinkhorn computation graph is no longer needed, but PyTorch's CUDA allocator may still hold that memory. Explicitly free it:
-
- ```python
- def build_trajectory_pool(self, ...):
-     # ... build pool ...
-     if self.device != "cpu":
-         torch.cuda.empty_cache()  # free Sinkhorn memory before training
-     self.pool.finalize()
- ```
-
- ### Multi-GPU ≠ automatic parallelism
-
- If the user has a T4×2 on Kaggle, your single-GPU code will use only ONE of the two GPUs. The second sits idle. Using both requires PyTorch DDP or model parallelism, which is a significant code change.
-
- **Don't silently assume multi-GPU works.** Document this:
-
- ```
- NOTE: This code uses a single GPU. If you have T4×2, only one GPU is used.
- A single T4 (16GB) is sufficient; the second GPU is wasted without DDP.
- ```
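-
- You can also surface the limitation at runtime (a tiny optional guard, an addition beyond what the NOTE above requires):
-
- ```python
- import torch
-
- if torch.cuda.device_count() > 1:
-     print("NOTE: single-GPU code; only cuda:0 will be used. Add DDP to use the rest.")
- ```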

- ### Trajectory pool memory on CPU vs GPU

- The trajectory pool stores ALL flow trajectories for the entire training run. For image experiments this is gigabytes:
- - MNIST: 1.92M entries × 784 dims × 4 bytes = **6 GB** on CPU
- - CIFAR: 1.6M entries × 3072 dims × 4 bytes = **19.6 GB** on CPU
-
- The pool MUST live on CPU. Only the sampled minibatch (128-256 samples) goes to GPU per training step. This is already how the code works (trajectories are stored as CPU tensors and moved with `.to(device)` in `sample()`), but it's worth being explicit about why.
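-
- A sketch of that pattern (a hypothetical pool class, not the project's exact one; the point is the one-time `finalize()` and the per-step `.to(device)` on just the minibatch):
-
- ```python
- import torch
-
- class TrajectoryPool:
-     def __init__(self):
-         self.chunks = []  # CPU tensors appended during pool building
-
-     def add(self, batch):
-         self.chunks.append(batch.detach().cpu())
-
-     def finalize(self):
-         # pre-concatenate ONCE; never torch.cat the whole pool every training step
-         self.data = torch.cat(self.chunks, dim=0)
-         self.chunks = None
-
-     def sample(self, n, device):
-         idx = torch.randint(len(self.data), (n,))
-         return self.data[idx].to(device)  # only the minibatch moves to GPU
-
- pool = TrajectoryPool()
- for _ in range(4):
-     pool.add(torch.randn(8, 784))
- pool.finalize()
- minibatch = pool.sample(16, "cpu")  # "cuda" when a GPU is available
- ```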

  ---

- ## Phase 5: Testing Strategy

- ### Always test on CPU first with tiny configs
-
- Before any GPU run, verify that the full pipeline works end-to-end:
-
- ```bash
- # Tiny run: should complete in <30 seconds
- python main.py --experiment 2d --dataset 8gaussians --steps 5 --pool-batches 5 --train-iters 100
-
- # Slightly larger: should complete in <5 minutes
- python main.py --experiment 2d --dataset 8gaussians --steps 5 --pool-batches 20 --train-iters 2000
- ```
-
- ### Test image experiments separately with minimal configs
-
- ```bash
- # MNIST smoke test: 2 pool batches, 5 training iters per phase
- python main.py --experiment mnist --pool-batches 2 --train-iters 5
-
- # If this crashes, fix before scaling up
- ```
-
- **Mistake I made**: I tested the 2D experiments thoroughly on CPU (both tiny and medium runs worked) but shipped the image experiments without testing them at all. The geomloss tensor shape bug affected ONLY the image path, so 2D success gave false confidence. The first GPU test of MNIST crashed immediately.
-
- **Rule**: Test EVERY experiment type, not just the simplest one. If you have `{2d, mnist, cifar10}` experiments, test all three with minimal configs before declaring the code ready.
-
- ### Test all training phases, not just the first one
-
- Even after fixing Phase 1, Phase 2 can still crash due to shared state (see the DataLoader trap in Phase 6). Run with `--train-iters 5 --pool-batches 2` to verify that all 3 phases complete without errors. This takes <60 seconds on CPU for MNIST.

  ---

- ## Phase 6: Shared State Across Training Phases

- ### The DataLoader trap

- When a single `DatasetLoader` object is shared across multiple training phases, **lazy-initialized internal state** (like a cached DataLoader) will silently break subsequent phases.

- **Mistake I made**: The `DatasetLoader.sample_target()` method lazily creates a PyTorch DataLoader on the first call, caching it with whatever batch size was requested. Phase 1 (pool building) calls `sample_target(256)`, so the DataLoader is created with `batch_size=256, drop_last=True`. Phase 2 (NSF training) calls `sample_target(128)`, but the cached DataLoader still yields batches of 256, causing a tensor shape mismatch crash:
-
- ```
- RuntimeError: The size of tensor a (128) must match the size of tensor b (256) at non-singleton dimension 0
- ```
-
- **The fix**: Track the batch size and recreate the DataLoader when it changes:
-
- ```python
- def sample_target(self, n, device="cpu"):
-     if not hasattr(self, "_loader") or self._batch_size != n:
-         self._batch_size = n
-         self._loader = get_image_dataloader(self.dataset_name, batch_size=n, train=True)
-         self._iter = iter(self._loader)
-     # ... sample from self._iter ...
- ```
-
- **General rule**: When sharing a data provider across multiple consumers with different batch sizes, NEVER cache a DataLoader with a fixed batch size. Either recreate it on batch size change, or provide raw dataset access and let each consumer create its own DataLoader.

  ---

- ## Phase 7: Checkpointing and Multi-Session Training
-
- ### Why this matters
-
- Paper reproduction often requires training runs that exceed a single GPU session. Kaggle gives 9 hours per T4 session. MNIST NSGF++ with the full paper config (100K+100K+40K iters) needs ~7-8 hours on T4, which is tight. CIFAR-10 (200K+200K+40K) is impossible in one session.
-
- Without checkpointing, a Kaggle timeout means all progress is lost.
-
- ### Phase-level checkpointing
-
- For multi-phase training, save a checkpoint after EACH phase completes:

- ```python
- # After Phase 1 completes:
- torch.save({
-     "nsgf_model_state": nsgf_model.state_dict(),
-     "phase": 1,
- }, "checkpoints/phase1_complete.pt")
-
- # After Phase 2 completes:
- torch.save({
-     "nsgf_model_state": nsgf_model.state_dict(),
-     "nsf_model_state": nsf_model.state_dict(),
-     "phase": 2,
- }, "checkpoints/phase2_complete.pt")
- ```
-
- Then implement `--resume-phase N`, which loads the phase N-1 checkpoint and skips the completed phases:
-
- ```bash
- # Session 1: Run Phase 1 (gets interrupted or completes)
- python main.py --experiment mnist
-
- # Session 2: Skip Phase 1, start Phase 2
- python main.py --experiment mnist --resume-phase 2
-
- # Session 3: Skip Phases 1+2, run Phase 3 + inference
- python main.py --experiment mnist --resume-phase 3
- ```
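-
- The loading side is only a few lines (a sketch; the checkpoint keys follow the phase-level layout above, and the `nn.Linear` stand-ins replace the real networks):
-
- ```python
- import torch
- import torch.nn as nn
-
- nsgf_model, nsf_model = nn.Linear(2, 2), nn.Linear(2, 2)  # stand-ins for the real models
-
- def load_for_resume(resume_phase):
-     """Load the phase N-1 checkpoint so every phase before N can be skipped."""
-     if resume_phase >= 2:
-         ckpt = torch.load(f"checkpoints/phase{resume_phase - 1}_complete.pt", map_location="cpu")
-         nsgf_model.load_state_dict(ckpt["nsgf_model_state"])
-         if ckpt["phase"] >= 2:
-             nsf_model.load_state_dict(ckpt["nsf_model_state"])
- ```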
-
- ### Step-level checkpointing within phases
-
- For long phases (100K+ steps), also save within the phase every N steps:
-
- ```python
- if (step + 1) % checkpoint_every == 0:
-     torch.save({
-         "model_state": model.state_dict(),
-         "optimizer_state": optimizer.state_dict(),
-         "step": step + 1,
-     }, "checkpoints/nsgf_checkpoint.pt")
- ```
-
- ### Important: checkpoint persistence on Kaggle
-
- Kaggle notebooks persist `/kaggle/working/` across cells within the same session, but NOT across sessions. To carry checkpoints between sessions:
- 1. Save checkpoints to `/kaggle/working/nsgf-plusplus/checkpoints/`
- 2. Before the session ends, commit the notebook output or copy the checkpoints to a dataset
- 3. In the new session, restore the checkpoints before running `--resume-phase`

  ---

- ## Phase 8: Debugging GPU Runs
-
- ### Common error patterns
-
- | Error | Cause | Fix |
- |-------|-------|-----|
- | `ValueError: (N,D) or (B,N,D)` | Library expects flat tensors, got images | Flatten before the library call |
- | `RuntimeError: size of tensor a (X) must match size of tensor b (Y)` | Shared DataLoader with wrong batch size | Recreate the DataLoader when the batch size changes |
- | `RuntimeError: shape mismatch` in UNet | Skip connection count wrong | Count pushes and pops manually |
- | `CUDA OOM` during pool building (Sinkhorn) | Sinkhorn batch too large for GPU | Reduce `--sinkhorn-batch` (e.g. 128→32) |
- | `CUDA OOM` during training | Training batch too large or model too big | Reduce the training batch, increase grad accum |
- | `CUDA OOM` at phase transition | Memory not freed between phases | Add `torch.cuda.empty_cache()` + `del pool` |
- | Training loss plateaus high | Pool too small or too few iterations | Increase pool batches, more iters |
- | W2 distance too high | Undertrained model | Full paper config: 200 batches, 20k iters |
- | Only 1 of 2 GPUs used | Code is single-GPU, no DDP | Expected; use a single GPU or add DDP |
- | `KeyboardInterrupt` mid-training | Long run stopped by hand | Check `checkpoints/` for the latest save |
-
- ### When the user runs on their hardware
-
- If you're developing code that the user will run on their own GPU (Kaggle, Colab, local):

- 1. **Provide exact commands**: don't make them figure out the args
- 2. **Warn about expected runtimes**: "2D full run: ~20 min on T4; MNIST: ~2-4 hours per phase; CIFAR-10: ~4+ hours per phase"
- 3. **Include checkpoint saving** so partial runs aren't wasted
- 4. **Document GPU requirements**: "MNIST fits on T4 16GB; CIFAR-10 needs `--sinkhorn-batch 32`"
- 5. **Document multi-GPU limitations**: "Single-GPU only. T4×2 wastes the second GPU."
- 6. **Test the exact commands yourself**: if you can't run on GPU, at least verify that the commands parse correctly on CPU

  ---

- ## Mistake Catalog
-
- ### Mistakes made during NSGF++ reproduction
-
- 1. **geomloss tensor shape bug** (CRITICAL)
-    - **What**: `SamplesLoss` requires `(N,D)` tensors. Image experiments passed `(N,C,H,W)`.
-    - **Impact**: MNIST and CIFAR-10 experiments crash immediately. 2D works fine, hiding the bug.
-    - **Root cause**: Only tested the 2D path. Didn't verify the library API with image tensor shapes.
-    - **Prevention**: Write a standalone API test script for every third-party library, testing with ALL tensor shapes you'll use.
-
- 2. **TrajectoryPool sampling performance** (MODERATE)
-    - **What**: `torch.cat` called on the entire pool every training step.
-    - **Impact**: Training slower than necessary. At 512K pool entries, the cat+index is the bottleneck (~0.5s per step vs ~0.05s for the actual forward/backward).
-    - **Root cause**: Didn't profile the training loop.
-    - **Prevention**: Pre-concatenate the pool after building it. Profile before shipping.
-
- 3. **Incomplete experiment testing** (CRITICAL)
-    - **What**: Tested 2D experiments only. Shipped MNIST/CIFAR untested.
-    - **Impact**: The user's first GPU run crashes. Wasted their Kaggle session time.
-    - **Root cause**: False confidence from 2D success. Assumed the same code path.
-    - **Prevention**: Test EVERY experiment type with minimal configs. Different experiment types often exercise different code paths.
-
- 4. **No checkpoint saving** (MODERATE, became CRITICAL at scale)
-    - **What**: No intermediate checkpoints during long training runs.
-    - **Impact**: If training is interrupted (Kaggle timeout, OOM, accidental Ctrl+C), all progress is lost. The MNIST full run is ~7 hours; losing that is devastating.
-    - **Prevention**: Save checkpoints every N iterations. Save after each phase. Implement a `--resume-phase` flag. Test that resume actually works.
-
- 5. **UNet forward pass fragility** (LOW-MODERATE)
-    - **What**: `_get_num_res_blocks()` infers the block count from module list length division.
-    - **Impact**: Could break silently with non-standard configs.
-    - **Prevention**: Store config values as instance variables; don't infer them from module counts.
-
- 6. **DataLoader batch size mismatch across phases** (CRITICAL)
-    - **What**: A shared `DatasetLoader` caches a DataLoader with batch_size=256 from Phase 1. Phase 2 requests batch_size=128 but gets 256 back, causing a tensor dimension mismatch crash.
-    - **Impact**: Phase 2 (NSF) crashes immediately even after Phase 1 completes successfully.
-    - **Root cause**: Lazy initialization pattern without invalidation.
-    - **Prevention**: When sharing stateful objects across consumers with different configs, track all cached parameters and invalidate on change.
-
- 7. **CLI flag not overriding all training phases** (LOW)
-    - **What**: The `--train-iters` flag overrode the NSGF and NSF iterations but NOT the phase predictor iterations (40,000 default). Smoke tests would hang on Phase 3 even with `--train-iters 5`.
-    - **Impact**: Tests take much longer than expected.
-    - **Root cause**: Forgot that 3-phase training means 3 iteration counts to override.
-    - **Prevention**: When adding a CLI override, grep the config for ALL fields it should affect.
-
- 8. **CIFAR-10 Sinkhorn OOM on T4** (CRITICAL)
-    - **What**: The paper uses `sinkhorn.batch_size=128` for CIFAR. Sinkhorn on 128 samples of dimension 3072 (flattened 3×32×32) with the `tensorized` backend computes a 128×128 cost matrix over 3072-dim vectors, plus autograd for the potentials. This OOMs on a T4 16GB during pool building.
-    - **Impact**: The CIFAR-10 experiment crashes before training even starts. The user loses their Kaggle session.
-    - **Root cause**: Used the paper's hyperparameters without estimating VRAM for the target hardware. The paper's authors likely used A100 80GB.
-    - **Prevention**: ALWAYS estimate VRAM before running. Sinkhorn with the `tensorized` backend is O(N² × D). For CIFAR: 128² × 3072 × 4 bytes × ~10 (overhead) ≈ 2+ GB per call, × 10 calls per pool batch = too much. Reduce N: 32² × 3072 is 16× cheaper. Add a `--sinkhorn-batch` CLI flag so users can tune without editing the config.
-
- 9. **No GPU memory freed between phases** (MODERATE)
-    - **What**: After pool building, the Sinkhorn computation graph's CUDA allocations remain cached even though they're no longer needed. Training then starts with less available VRAM.
-    - **Impact**: The training phase might OOM even though pool building finished.
-    - **Root cause**: PyTorch's CUDA allocator doesn't automatically return memory to the OS.
-    - **Prevention**: Call `torch.cuda.empty_cache()` after pool building completes. Also `del pool` if the pool data was already finalized into separate tensors.
-
- 10. **Multi-GPU assumption** (LOW)
-     - **What**: The user has a T4×2 on Kaggle. The code is single-GPU. The second GPU sits idle.
-     - **Impact**: The user pays for 2 GPUs but only uses 1. They might think the code is broken.
-     - **Root cause**: Didn't document the single-GPU limitation.
-     - **Prevention**: Document GPU requirements explicitly. If multi-GPU is needed, implement DDP, but that's a significant scope change, so discuss it with the user first.
-
- ---
-
- ## Pre-flight Checklist (before declaring code ready)

  ```
- ░ All experiment types tested with minimal configs (not just the easiest one)
- ░ ALL training phases tested end-to-end (not just Phase 1)
- ░ Third-party library APIs tested with exact tensor shapes per experiment
- ░ Shared state across phases verified (DataLoaders, iterators, caches)
- ░ CLI flags override ALL relevant config values (not just some)
- ░ VRAM estimated for target hardware: will Sinkhorn/model/pool fit?
- ░ Sinkhorn batch size appropriate for target GPU (not just the paper's GPU)
- ░ torch.cuda.empty_cache() called between memory-intensive phases
- ░ Training loop profiled: no O(N) operations per step where O(1) suffices
- ░ Memory estimated per experiment (pool size × data dim × 4 bytes)
- ░ Checkpointing implemented: every N steps + after each phase
- ░ --resume-phase tested and working (load checkpoint → skip phases → continue)
- ░ Clear CLI with sensible defaults and override flags for GPU-sensitive params
  ░ Expected runtimes documented per hardware tier
  ░ Multi-GPU limitations documented
- ░ Error messages are clear (not just stack traces)
- ░ Results directory created automatically
- ░ requirements.txt includes ALL dependencies with minimum versions
  ```

  ---

- ## General Principles for Paper Reproduction
-
- 1. **Read the appendix first.** The appendix contains the actual implementation details. The main paper is the story; the appendix is the recipe.
-
- 2. **Test the boundaries, not just the happy path.** If your code handles 2D, MNIST, and CIFAR-10, test all three. The bug is always in the path you didn't test.
-
- 3. **Library APIs are opaque until tested.** Don't assume a function accepts your tensor shape just because it "makes sense." Write a 10-line test script.
-
- 4. **Pre-concatenate, don't re-concatenate.** Any data structure that's built once and sampled many times should be finalized into a single tensor after building.
-
- 5. **The user's time is more expensive than your time.** A crash on their GPU after 5 minutes of setup is worse than you spending 30 extra minutes testing. Ship code that works on the first run.
-
- 6. **Flatten for OT libraries.** Optimal transport libraries (geomloss, POT, ott-jax) almost universally expect `(N, D)` point clouds. Images must be flattened. This is the #1 gotcha in OT-based generative models.
-
- 7. **Store training state on CPU, compute on GPU.** Trajectory pools, replay buffers, and other large data structures should live on CPU. Only the current minibatch goes to GPU.
-
- 8. **Multi-phase training = multiple separate trainers.** Don't try to be clever with a single training loop that switches phases. Each phase is a distinct trainer with its own optimizer. The previous phase's model goes to `eval()`.
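-
- A sketch of the hand-off (stand-in `nn.Linear` modules instead of the real networks; the point is the fresh optimizer per phase and the frozen, eval-mode previous model):
-
- ```python
- import torch
- import torch.nn as nn
-
- phase1_model = nn.Linear(8, 8)
- opt1 = torch.optim.Adam(phase1_model.parameters(), lr=1e-4)  # Phase 1 trainer
- # ... Phase 1 training loop runs here ...
-
- phase1_model.eval()                 # previous phase: inference only
- phase1_model.requires_grad_(False)  # and frozen
-
- phase2_model = nn.Linear(8, 8)
- opt2 = torch.optim.Adam(phase2_model.parameters(), lr=1e-4)  # fresh trainer, fresh optimizer
- ```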
-
- 9. **Shared objects across phases are landmines.** When a DataLoader, iterator, or cache is shared across training phases, any phase-specific parameter (batch size, number of workers, shuffle mode) can silently break later phases. Either don't share, or implement proper invalidation. Test by running all phases sequentially with different configs per phase.
-
- 10. **CLI overrides must be exhaustive.** If your config has N copies of a parameter (one per training phase), your CLI override must touch all N. Grep the config file for the parameter name to find all instances.
-
- 11. **Paper hyperparameters assume paper hardware.** If a paper reports batch_size=128 and trained on A100 80GB, that batch size may OOM on your T4 16GB. Always re-derive batch sizes from VRAM constraints, keeping the total samples seen (batch × iterations) the same.
-
- 12. **Estimate VRAM before running, not after OOM.** For Sinkhorn: O(N² × D). For the model: count parameters × 4 bytes (fp32) × 3 (params + gradients + optimizer). For the pool: stored on CPU, but the sampled minibatch goes to GPU. Write this down before your first GPU run.
-
- 13. **Checkpoint at phase boundaries, not just step boundaries.** Phase-level checkpoints enable `--resume-phase`, which is the minimum viable recovery. Step-level checkpoints within long phases are a bonus. Both together make multi-session training actually work.
-
- 14. **Free GPU memory between phases.** Call `torch.cuda.empty_cache()` after pool building or any phase that uses different GPU memory patterns than the next phase. Also `del` large objects (pools, computation graphs) that won't be needed again.
-
- 15. **Document what your code does NOT support.** Single-GPU only? No mixed precision? No gradient accumulation? Say so. Users with multi-GPU setups will waste time wondering why only one GPU is active if you don't tell them.

  ---
  name: paper-reproduction
+ description: "Skill for reproducing ML research papers from scratch when no official code exists. Use this whenever a user asks to implement, reproduce, or replicate a paper, especially papers involving novel loss functions, custom training loops, or non-standard architectures that aren't covered by existing HF trainers. Also use when the user mentions 'paper reproduction', 'implement this paper', 'no official code', or describes a method from a specific arXiv paper. Covers: reading papers systematically, extracting hyperparameters, building custom training pipelines, handling library-specific gotchas, VRAM estimation, checkpointing for multi-session training, and iterating on GPU results."
  ---

  # Paper Reproduction Skill

+ Rules and procedures for reproducing ML research papers from scratch. All concrete mistakes, war stories, and examples live in [LEARNING.md](LEARNING.md). Next steps for this project live in [TODO.md](TODO.md).

  ---

+ ## 1. Paper Reading

+ Read the methodology sections (3, 4, 5) line by line. Read ALL appendices; they contain the actual recipe.

+ ### Extraction checklist

  ```
+ ░ Loss function: exact math, every symbol defined
+ ░ Architecture: layers, dims, activations, normalization
+ ░ Optimizer: type, lr, betas, weight decay, scheduler
  ░ Batch size: for each phase/component separately
+ ░ Training iterations: for each phase/component separately
  ░ Dataset preprocessing: normalization range, image size, augmentation
+ ░ Evaluation protocol: metrics, number of samples, special setup
  ░ Hyperparameters per experiment: papers often have different configs per dataset
+ ░ Algorithm pseudocode: follow exactly before improvising
  ░ GPU hardware used: what the authors trained on (often buried in appendix)
  ░ Training time: how long did the authors' runs take?
  ```

  ---

+ ## 2. Library API Verification

+ Before building ANY training loop that uses a third-party library (geomloss, POT, torchsde, torchdiffeq, etc.), write a 10-line test script that calls the library with the EXACT tensor shapes you'll use in every experiment. Not just the simplest one: all of them. If you have 2D points, MNIST images, and CIFAR images, test all three shapes.

  ---

+ ## 3. VRAM Estimation

+ Estimate VRAM BEFORE running, not after OOM. Paper hyperparameters assume paper hardware.

+ **Formula for Sinkhorn (tensorized backend):** O(N² × D) per call. Pool building does ~10 calls per batch (2 potentials × 5 flow steps). Add model params × 4 bytes × 3 (params + grads + optimizer states).

+ **Rule:** If the paper used an A100 80GB and you have a T4 16GB, re-derive batch sizes from VRAM constraints. Keep the total samples seen (batch × iterations) constant by increasing iterations when you shrink the batch.

+ Add CLI override flags (e.g. `--sinkhorn-batch`) so users can tune without editing the config.

  ---

+ ## 4. Architecture

+ - UNet skip connections: count pushes during the downward pass and pops during the upward pass. They must match exactly.
+ - Store config values (`num_res_blocks`, `num_levels`) as instance variables at init. Never infer them from module list lengths.
+ - `nn.GroupNorm(32, channels)` requires channels divisible by 32. Assert this at init for all levels.

  ---

+ ## 5. Multi-Phase Training

+ Each phase gets its own trainer with its own optimizer. The previous phase's model goes to `eval()`.

+ ### Shared state rules

+ - Never cache a DataLoader with a fixed batch size if different phases use different batch sizes. Track cached params and invalidate on change.
+ - `torch.cuda.empty_cache()` between phases. `del` large objects (pools, computation graphs) that won't be needed again.
+ - CLI overrides must touch ALL phases. If `--train-iters` should override 3 phases, grep the config for all 3 fields.

  ---

+ ## 6. Checkpointing

+ ### Phase-level (mandatory)

+ Save a checkpoint after each phase completes. Include all model state dicts accumulated so far. Implement `--resume-phase N`, which loads the phase N-1 checkpoint and skips the completed phases.

+ ### Step-level (strongly recommended for phases > 10 min)

+ Save every N steps within a phase. Include the model state, optimizer state, and step number. Overwrite the same file (keep the latest only, unless you have the disk space).

+ ### Kaggle persistence

+ `/kaggle/working/` persists within a session but NOT across sessions. To carry checkpoints between sessions: commit the notebook output, copy the checkpoints to an HF dataset, or download them before the session ends.

  ---

+ ## 7. Memory Management

+ - Trajectory pools and replay buffers live on CPU. Only the sampled minibatch goes to GPU via `.to(device)`.
+ - Pre-concatenate data structures after building: `finalize()` once, then O(1) sampling per step. Never `torch.cat` the entire pool every step.
+ - Call `torch.cuda.empty_cache()` after pool building and between any phases with different GPU memory patterns.

  ---

+ ## 8. Testing

+ ### Before any GPU run:
+ 1. Test EVERY experiment type with minimal configs, not just the simplest one
+ 2. Test ALL training phases end-to-end, not just Phase 1
+ 3. Test with `--train-iters 5 --pool-batches 2`; this should complete in <60 seconds on CPU
+ 4. Test that `--resume-phase` actually works (save checkpoint → load → skip → continue)

+ ### Before declaring code ready (pre-flight checklist):
  ```
+ ░ All experiment types tested (2d, mnist, cifar10, etc.)
+ ░ All training phases tested end-to-end
+ ░ Library APIs tested with exact tensor shapes per experiment
+ ░ Shared state across phases verified
+ ░ CLI flags override ALL relevant config values
+ ░ VRAM estimated for target hardware
+ ░ Checkpointing works: save + resume + skip phases
+ ░ No O(N) operations per training step where O(1) suffices
  ░ Expected runtimes documented per hardware tier
  ░ Multi-GPU limitations documented
+ ░ requirements.txt complete
  ```

  ---

+ ## 9. Documentation for the User

+ When the user runs on their own GPU (Kaggle, Colab, local):

+ 1. Provide exact copy-paste commands
+ 2. Document expected runtimes per hardware tier
+ 3. Document GPU requirements and VRAM limits per experiment
+ 4. Document what the code does NOT support (single-GPU only, no DDP, etc.)
+ 5. If training exceeds one session, provide session-by-session commands with `--resume-phase`

+ ---

+ ## 10. Maintaining LEARNING.md

+ When a new mistake happens or a new principle is discovered:

+ 1. Add the mistake to the **Mistake Catalog** in LEARNING.md with: What, Impact, Root cause, Prevention
+ 2. If the mistake reveals a general principle, add it to the **Principles** section
+ 3. If the mistake would have been caught by a pre-flight check, add that check to the checklist in section 8 above
+ 4. Keep SKILL.md lean (rules only). LEARNING.md holds the stories and evidence.