rogermt committed on
Commit 66d3632 · verified
1 Parent(s): 80b1d4b

SKILL.md: Add VRAM estimation, checkpointing, multi-session, multi-GPU lessons from CIFAR OOM


New mistakes #8-10, new principles #11-15, expanded Phase 4 (VRAM), new Phase 7 (checkpointing),
updated pre-flight checklist, updated error table.

Files changed (1)
  1. SKILL.md +193 -113
SKILL.md CHANGED
@@ -1,6 +1,6 @@
1
  ---
2
  name: paper-reproduction
3
- description: "Skill for reproducing ML research papers from scratch when no official code exists. Use this whenever a user asks to implement, reproduce, or replicate a paper β€” especially papers involving novel loss functions, custom training loops, or non-standard architectures that aren't covered by existing HF trainers. Also use when the user mentions 'paper reproduction', 'implement this paper', 'no official code', or describes a method from a specific arxiv paper. Covers: reading papers systematically, extracting hyperparameters, building custom training pipelines, handling library-specific gotchas (geomloss, POT, custom UNets), and iterating on GPU results."
4
  ---
5
 
6
  # Paper Reproduction Skill
@@ -32,6 +32,8 @@ Most reproduction failures trace back to incomplete paper reading. Don't skim
32
  β–‘ Evaluation protocol β€” metrics, number of samples, any special setup
33
  β–‘ Hyperparameters per experiment β€” papers often have different configs per dataset
34
  β–‘ Algorithm pseudocode β€” if provided, follow it exactly before improvising
 
 
35
  ```
36
 
37
  ### Mistake I made: Incomplete appendix reading
@@ -82,39 +84,7 @@ The `SamplesLoss` in geomloss requires inputs as `(N, D)` or `(B, N, D)` tensors
82
  ValueError: Input samples 'x' and 'y' should be encoded as (N,D) or (B,N,D) (batch) tensors.
83
  ```
84
 
85
- **The fix**: Flatten images before passing to geomloss, reshape gradients back after:
86
-
87
- ```python
88
- def compute_velocity(self, X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
89
- original_shape = X.shape
90
- is_image = X.dim() == 4
91
-
92
- if is_image:
93
- B = X.shape[0]
94
- X_flat = X.detach().clone().view(B, -1).requires_grad_(True)
95
- Y_flat = Y.detach().view(B, -1)
96
- else:
97
- X_flat = X.detach().clone().requires_grad_(True)
98
- Y_flat = Y.detach()
99
-
100
- # Self-potential
101
- F_self, _ = self.loss_fn(X_flat, X_flat.detach().clone())
102
- grad_self = torch.autograd.grad(F_self.sum(), X_flat)[0]
103
-
104
- # Cross-potential
105
- X_flat2 = X.detach().clone().view(B, -1).requires_grad_(True) if is_image else X.detach().clone().requires_grad_(True)
106
- F_cross, _ = self.loss_fn(X_flat2, Y_flat)
107
- grad_cross = torch.autograd.grad(F_cross.sum(), X_flat2)[0]
108
-
109
- velocity = grad_self.detach() - grad_cross.detach()
110
-
111
- if is_image:
112
- velocity = velocity.view(original_shape)
113
-
114
- return velocity
115
- ```
116
-
117
- This pattern β€” flatten before library call, reshape after β€” applies to many optimal transport libraries (POT, geomloss, ott-jax).
118
 
119
  ---
120
 
@@ -131,76 +101,120 @@ When building a UNet from scratch (rather than importing from guided-diffusion),
131
 
132
  **Mistake pattern**: Using a helper like `_get_num_res_blocks()` that infers block count from module list lengths. This is fragile β€” if the number of levels or blocks per level varies, the inference breaks.
133
 
134
- **Better approach**: Store `num_res_blocks` as an instance variable at init time and use it directly:
135
 
136
- ```python
137
- def __init__(self, ..., num_res_blocks=2, channel_mult=[1,2,2,2], ...):
138
- self.num_res_blocks = num_res_blocks
139
- self.num_levels = len(channel_mult)
140
- # ... build layers ...
141
 
142
- def forward(self, x, t):
143
- # Use self.num_res_blocks directly, not a computed value
144
  ```
145
 
146
- ### GroupNorm channel requirements
147
 
148
- `nn.GroupNorm(32, channels)` requires `channels` to be divisible by 32. For small models (e.g., MNIST with `model_channels=32`), this is fine at the first level but may break at deeper levels if `channel_mult` creates channels not divisible by 32.
 
 
149
 
150
- **Safety check**: At init time, verify all channel counts are divisible by the group count:
151
  ```python
152
- for level, mult in enumerate(channel_mult):
153
- ch = model_channels * mult
154
- assert ch % 32 == 0, f"Level {level}: channels={ch} not divisible by 32"
155
  ```
156

157
  ---
158
 
159
- ## Phase 4: Training Loop Patterns for Custom Pipelines
160
 
161
- ### Trajectory pool memory management
162
 
163
- When building trajectory pools (storing (x, v, t) tuples from gradient flow), memory can explode:
164
- - MNIST: 256 samples Γ— 1500 batches Γ— 5 steps Γ— 784 dims Γ— 4 bytes β‰ˆ 2.4 GB (manageable)
165
- - CIFAR-10: 128 samples Γ— 2500 batches Γ— 5 steps Γ— 3072 dims Γ— 4 bytes β‰ˆ 19 GB (tight on T4)
166
 
167
- **The fix**: Store pool tensors on CPU, transfer to GPU only during sampling:
 
 
168
 
169
- ```python
170
- class TrajectoryPool:
171
- def sample(self, batch_size, device="cpu"):
172
- # Concatenate on CPU, index, THEN move to device
173
- all_x = torch.cat(self.x_pool, dim=0) # stays on CPU
174
- idx = torch.randint(0, all_x.shape[0], (batch_size,))
175
- return all_x[idx].to(device), ... # only batch moves to GPU
176
  ```
177
 
178
- **Mistake I made**: The pool sampling code calls `torch.cat` on the entire pool every training step, which is O(pool_size) per step. For 512K entries this is slow. Better: pre-concatenate once after pool building, then just index:
179
 
180
- ```python
181
- def finalize(self):
182
- """Call once after pool is fully built."""
183
- self._all_x = torch.cat(self.x_pool, dim=0)
184
- self._all_v = torch.cat(self.v_pool, dim=0)
185
- self._all_t = torch.tensor(self.t_pool, dtype=torch.float32)
186
- # Free the lists
187
- self.x_pool = self.v_pool = self.t_pool = None
188
-
189
- def sample(self, batch_size, device="cpu"):
190
- idx = torch.randint(0, self._all_x.shape[0], (batch_size,))
191
- return self._all_x[idx].to(device), self._all_v[idx].to(device), self._all_t[idx].to(device)
192
  ```
193
 
194
- ### Multi-phase training (NSGF++)
195
 
196
- NSGF++ has 3 sequential training phases:
197
- 1. **NSGF**: Build trajectory pool β†’ train velocity field
198
- 2. **NSF**: Use trained NSGF to generate P0 samples β†’ train straight flow
199
- 3. **Phase predictor**: Train CNN to predict transition time
200
 
201
- **Key insight**: Each phase depends on the previous one being fully trained. Don't try to interleave them. The NSGF model must be in `eval()` mode when used as a sample generator in phases 2 and 3.
 
 
202
 
203
- ### Shared state across training phases β€” the DataLoader trap
204
 
205
  When a single `DatasetLoader` object is shared across multiple training phases, **lazy-initialized internal state** (like a cached DataLoader) will silently break subsequent phases.
206
 
@@ -225,40 +239,69 @@ def sample_target(self, n, device="cpu"):
225
 
226
  ---
227
 
228
- ## Phase 5: Testing Strategy
229
 
230
- ### Always test on CPU first with tiny configs
231
 
232
- Before any GPU run, verify the full pipeline works end-to-end:
233
 
234
- ```bash
235
- # Tiny run β€” should complete in <30 seconds
236
- python main.py --experiment 2d --dataset 8gaussians --steps 5 --pool-batches 5 --train-iters 100
237
 
238
- # Slightly larger β€” should complete in <5 minutes
239
- python main.py --experiment 2d --dataset 8gaussians --steps 5 --pool-batches 20 --train-iters 2000
240
  ```
241
 
242
- ### Test image experiments separately with minimal configs
243
 
244
  ```bash
245
- # MNIST smoke test β€” 2 pool batches, 5 training iters per phase
246
- python main.py --experiment mnist --pool-batches 2 --train-iters 5
247
 
248
- # If this crashes, fix before scaling up
249
  ```
250
 
251
- **Mistake I made**: I tested 2D experiments thoroughly on CPU (both tiny and medium runs worked) but shipped the image experiments without testing them at all. The geomloss tensor shape bug affected ONLY the image path, so 2D success gave false confidence. The first GPU test of MNIST crashed immediately.
252
 
253
- **Rule**: Test EVERY experiment type, not just the simplest one. If you have `{2d, mnist, cifar10}` experiments, test all three with minimal configs before declaring the code ready.
254
 
255
- ### Test all training phases, not just the first one
256
 
257
- Even after fixing Phase 1, Phase 2 can still crash due to shared state (see DataLoader trap above). Run with `--train-iters 5 --pool-batches 2` to verify all 3 phases complete without errors. This takes <60 seconds on CPU for MNIST.
 
 
 
258
 
259
  ---
260
 
261
- ## Phase 6: Debugging GPU Runs
262
 
263
  ### Common error patterns
264
 
@@ -267,20 +310,24 @@ Even after fixing Phase 1, Phase 2 can still crash due to shared state (see Data
267
  | `ValueError: (N,D) or (B,N,D)` | Library expects flat tensors, got images | Flatten before library call |
268
  | `RuntimeError: size of tensor a (X) must match size of tensor b (Y)` | Shared DataLoader with wrong batch size | Recreate DataLoader when batch size changes |
269
  | `RuntimeError: shape mismatch` in UNet | Skip connection count wrong | Count pushes and pops manually |
270
- | `CUDA OOM` during pool building | Pool too large for GPU | Build pool on CPU, sample to GPU |
271
- | `CUDA OOM` during training | Batch too large or model too big | Reduce batch β†’ increase grad accum |
 
272
  | Training loss plateaus high | Pool too small or too few iterations | Increase pool batches, more iters |
273
  | W2 distance too high | Undertrained model | Full paper config: 200 batches, 20k iters |
274
- | `KeyboardInterrupt` during training | Training takes too long at scale | Expected β€” full 2D takes ~20min on T4 |
 
275
 
276
  ### When the user runs on their hardware
277
 
278
  If you're developing code that the user will run on their own GPU (Kaggle, Colab, local):
279
 
280
  1. **Provide exact commands** β€” don't make them figure out args
281
- 2. **Warn about expected runtimes** β€” "2D full run: ~20min on T4, MNIST: ~2-4 hours, CIFAR-10: ~8-12 hours"
282
  3. **Include checkpoint saving** β€” so partial runs aren't wasted
283
- 4. **Test the exact commands yourself** β€” if you can't run on GPU, at least verify the command parses correctly on CPU
 
 
284
 
285
  ---
286
 
@@ -306,10 +353,10 @@ If you're developing code that the user will run on their own GPU (Kaggle, Colab
306
  - **Root cause**: False confidence from 2D success. Assumed same code path.
307
  - **Prevention**: Test EVERY experiment type with minimal configs. Different experiment types often exercise different code paths.
308
 
309
- 4. **No checkpoint saving** (MODERATE)
310
  - **What**: No intermediate checkpoints during long training runs.
311
- - **Impact**: If training is interrupted (Kaggle timeout, OOM), all progress is lost.
312
- - **Prevention**: Save checkpoints every N iterations. Implement `--resume` flag.
313
 
314
  5. **UNet forward pass fragility** (LOW-MODERATE)
315
  - **What**: `_get_num_res_blocks()` infers block count from module list length division.
@@ -318,15 +365,33 @@ If you're developing code that the user will run on their own GPU (Kaggle, Colab
318
 
319
  6. **DataLoader batch size mismatch across phases** (CRITICAL)
320
  - **What**: Shared `DatasetLoader` caches a DataLoader with batch_size=256 from Phase 1. Phase 2 requests batch_size=128 but gets 256 back β†’ tensor dimension mismatch crash.
321
- - **Impact**: Phase 2 (NSF) crashes immediately even after Phase 1 completes successfully. The error message (`size of tensor a (128) must match size of tensor b (256)`) doesn't make the DataLoader caching obvious.
322
- - **Root cause**: Lazy initialization pattern without invalidation. The `_image_loader` was created once and never checked for batch size changes.
323
- - **Prevention**: When sharing stateful objects across consumers with different configs, either (a) track all cached parameters and invalidate on change, or (b) don't cache at all. For DataLoaders: recreate when batch_size changes.
324
 
325
  7. **CLI flag not overriding all training phases** (LOW)
326
  - **What**: `--train-iters` flag overrode NSGF and NSF iterations but NOT the phase predictor iterations (40,000 default). Smoke tests would hang on Phase 3 even with `--train-iters 5`.
327
- - **Impact**: Tests take much longer than expected. User thinks something is broken.
328
  - **Root cause**: Forgot that 3-phase training means 3 iteration counts to override.
329
- - **Prevention**: When adding a CLI override, grep the config for ALL fields it should affect. If a config has `nsgf_training.num_iterations`, `nsf_training.num_iterations`, AND `time_predictor.num_iterations`, the override must touch all three.
330
 
331
  ---
332
 
@@ -338,11 +403,16 @@ If you're developing code that the user will run on their own GPU (Kaggle, Colab
338
  β–‘ Third-party library APIs tested with exact tensor shapes per experiment
339
  β–‘ Shared state across phases verified (DataLoaders, iterators, caches)
340
  β–‘ CLI flags override ALL relevant config values (not just some)
 
 
 
341
  β–‘ Training loop profiled β€” no O(N) operations per step where O(1) suffices
342
  β–‘ Memory estimated per experiment (pool size Γ— data dim Γ— 4 bytes)
343
- β–‘ Checkpointing implemented for runs >10 minutes
344
- β–‘ Clear CLI with sensible defaults and override flags
 
345
  β–‘ Expected runtimes documented per hardware tier
 
346
  β–‘ Error messages are clear (not just stack traces)
347
  β–‘ Results directory created automatically
348
  β–‘ Requirements.txt includes ALL dependencies with minimum versions
@@ -371,3 +441,13 @@ If you're developing code that the user will run on their own GPU (Kaggle, Colab
371
  9. **Shared objects across phases are landmines.** When a DataLoader, iterator, or cache is shared across training phases, any phase-specific parameter (batch size, number of workers, shuffle mode) can silently break later phases. Either don't share, or implement proper invalidation. Test by running all phases sequentially with different configs per phase.
372
 
373
  10. **CLI overrides must be exhaustive.** If your config has N copies of a parameter (one per training phase), your CLI override must touch all N. Grep the config file for the parameter name to find all instances.
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
  name: paper-reproduction
3
+ description: "Skill for reproducing ML research papers from scratch when no official code exists. Use this whenever a user asks to implement, reproduce, or replicate a paper β€” especially papers involving novel loss functions, custom training loops, or non-standard architectures that aren't covered by existing HF trainers. Also use when the user mentions 'paper reproduction', 'implement this paper', 'no official code', or describes a method from a specific arxiv paper. Covers: reading papers systematically, extracting hyperparameters, building custom training pipelines, handling library-specific gotchas (geomloss, POT, custom UNets), VRAM estimation, checkpointing for multi-session training, and iterating on GPU results."
4
  ---
5
 
6
  # Paper Reproduction Skill
 
32
  β–‘ Evaluation protocol β€” metrics, number of samples, any special setup
33
  β–‘ Hyperparameters per experiment β€” papers often have different configs per dataset
34
  β–‘ Algorithm pseudocode β€” if provided, follow it exactly before improvising
35
+ β–‘ GPU hardware used β€” what the authors trained on (often buried in appendix)
36
+ β–‘ Training time β€” how long did the authors' runs take?
37
  ```
38
 
39
  ### Mistake I made: Incomplete appendix reading
 
84
  ValueError: Input samples 'x' and 'y' should be encoded as (N,D) or (B,N,D) (batch) tensors.
85
  ```
86
 
87
+ **The fix**: Flatten images before passing to geomloss, reshape gradients back after. This pattern β€” flatten before library call, reshape after β€” applies to many optimal transport libraries (POT, geomloss, ott-jax).
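A minimal sketch of the pattern, assuming geomloss's `SamplesLoss` (the loss settings shown are illustrative, not the paper's exact configuration):

```python
import torch
from geomloss import SamplesLoss

loss_fn = SamplesLoss("sinkhorn", p=2, blur=0.05)  # illustrative settings

def sinkhorn_on_images(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Flatten (B, C, H, W) image batches to (B, D) before the geomloss call."""
    if x.dim() == 4:
        x = x.reshape(x.shape[0], -1)  # (B, C*H*W): each image becomes one D-dim sample
        y = y.reshape(y.shape[0], -1)
    return loss_fn(x, y)
```

If you need gradients with respect to the images (as in the velocity computation), take them on the flattened tensor and `.view()` the result back to the original image shape before using it.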
88
 
89
  ---
90
 
 
101
 
102
  **Mistake pattern**: Using a helper like `_get_num_res_blocks()` that infers block count from module list lengths. This is fragile β€” if the number of levels or blocks per level varies, the inference breaks.
103
 
104
+ **Better approach**: Store `num_res_blocks` as an instance variable at init time and use it directly.
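A minimal sketch of what that looks like, with unrelated constructor arguments omitted:

```python
import torch.nn as nn

class UNet(nn.Module):
    def __init__(self, model_channels=32, num_res_blocks=2, channel_mult=(1, 2, 2, 2)):
        super().__init__()
        # Record the config once at init; never re-derive it from module-list lengths.
        self.num_res_blocks = num_res_blocks
        self.num_levels = len(channel_mult)
        # ... build down/up blocks using these values ...

    def forward(self, x, t):
        # Loop with self.num_res_blocks / self.num_levels directly,
        # not with a count inferred from len(self.down_blocks).
        ...
```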
105
 
106
+ ### GroupNorm channel requirements
107
+
108
+ `nn.GroupNorm(32, channels)` requires `channels` to be divisible by 32. For small models (e.g., MNIST with `model_channels=32`), this is fine at the first level but may break at deeper levels if `channel_mult` creates channels not divisible by 32.
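A minimal init-time check (a sketch; `groups=32` matches the `nn.GroupNorm(32, channels)` usage above):

```python
def check_groupnorm_channels(model_channels, channel_mult, groups=32):
    """Fail fast at init if any level's channel count breaks nn.GroupNorm(groups, ch)."""
    for level, mult in enumerate(channel_mult):
        ch = model_channels * mult
        assert ch % groups == 0, f"Level {level}: channels={ch} not divisible by {groups}"

check_groupnorm_channels(model_channels=32, channel_mult=[1, 2, 2, 2])
```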
109
+
110
+ ---
111
+
112
+ ## Phase 4: VRAM Estimation and Memory Management
113
+
114
+ ### Estimate VRAM BEFORE running β€” not after OOM
115
+
116
+ Papers report batch sizes that worked on their hardware (often A100 80GB or 8Γ—V100). If your user has a T4 (16GB) or even a T4Γ—2 (16GB per GPU, but single-GPU code only uses one), you must recalculate whether the paper's configs will fit.
117
+
118
+ ### The Sinkhorn VRAM trap
119
+
120
+ The `tensorized` backend in geomloss computes a full NΓ—N cost matrix. For N samples of dimension D:
121
+ - Memory β‰ˆ O(NΒ² Γ— D) for the cost matrix + intermediate Sinkhorn iterations
122
+ - With `potentials=True` and `autograd.grad`, add another O(N Γ— D) for gradient storage
123
+
124
+ **Concrete examples (fp32, single Sinkhorn call)**:
125
+ | N (batch) | D (flattened dim) | Approx VRAM per call |
126
+ |-----------|-------------------|---------------------|
127
+ | 256 | 2 (2D points) | ~1 MB |
128
+ | 256 | 784 (MNIST 28Γ—28) | ~200 MB |
129
+ | 128 | 3072 (CIFAR 3Γ—32Γ—32) | ~600 MB |
130
 
131
+ But pool building calls Sinkhorn **twice per step** (self-potential + cross-potential) Γ— **5 flow steps per batch** = 10 Sinkhorn calls per pool batch. With autograd overhead, 128Γ—3072 easily eats 8+ GB β€” leaving no room for the 38M-param UNet on a 16GB T4.
132
+
133
+ **Mistake I made**: Used the paper's `sinkhorn.batch_size=128` for CIFAR-10. This OOMed immediately on T4. The paper's authors likely used A100s.
134
+
135
+ **The fix**: Reduce Sinkhorn batch size for smaller GPUs and increase pool batches to compensate:
136
+ ```yaml
137
+ # Paper config (A100 80GB):
138
+ sinkhorn.batch_size: 128
139
+ pool.num_batches: 2500
140
+ # Total pool entries: 128 Γ— 2500 Γ— 5 = 1.6M
141
+
142
+ # T4 16GB config:
143
+ sinkhorn.batch_size: 32
144
+ pool.num_batches: 10000
145
+ # Total pool entries: 32 Γ— 10000 Γ— 5 = 1.6M (same!)
146
  ```
147
 
148
+ Add a CLI override (`--sinkhorn-batch`) so users can tune without editing config files.
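A back-of-the-envelope estimator for the numbers above (a sketch; the ~10x overhead factor is a rough allowance for Sinkhorn intermediates and autograd, not a measured constant):

```python
def sinkhorn_vram_gb(n: int, d: int, bytes_per_el: int = 4, overhead: float = 10.0) -> float:
    """Rough VRAM for one tensorized Sinkhorn call: O(N^2 * D) cost computation plus slack."""
    cost = n * n * d * bytes_per_el        # dominant N x N x D term
    potentials = 2 * n * d * bytes_per_el  # potential / gradient storage
    return (cost + potentials) * overhead / 1e9

print(sinkhorn_vram_gb(128, 3072))  # paper's CIFAR batch: ~2 GB per call
print(sinkhorn_vram_gb(32, 3072))   # T4-friendly batch: ~0.13 GB per call
```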
149
 
150
+ ### Always call `torch.cuda.empty_cache()` between phases
151
+
152
+ Pool building uses GPU for Sinkhorn computation. Training uses GPU for the neural network. These are different memory patterns. After pool building, the Sinkhorn computation graph is no longer needed β€” but PyTorch's CUDA allocator may still hold that memory. Explicitly free it:
153
 
 
154
  ```python
155
+ def build_trajectory_pool(self, ...):
156
+ # ... build pool ...
157
+ if self.device != "cpu":
158
+ torch.cuda.empty_cache() # Free Sinkhorn memory before training
159
+ self.pool.finalize()
160
+ ```
161
+
162
+ ### Multi-GPU β‰  automatic parallelism
163
+
164
+ If the user has a T4Γ—2 on Kaggle, your single-GPU code will only use ONE of the two GPUs. The second sits idle. Using both requires PyTorch DDP or model parallelism β€” which is a significant code change.
165
+
166
+ **Don't silently assume multi-GPU works.** Document this:
167
+ ```
168
+ NOTE: This code uses a single GPU. If you have T4Γ—2, only one GPU is used.
169
+ A single T4 (16GB) is sufficient β€” the second GPU is wasted without DDP.
170
  ```
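One way to make that limitation visible at runtime rather than only in the README (a sketch):

```python
import torch

# Warn loudly at startup instead of silently using one of two GPUs.
if torch.cuda.device_count() > 1:
    print(f"WARNING: {torch.cuda.device_count()} GPUs visible, but this code is single-GPU; "
          "only cuda:0 will be used (no DDP).")
device = "cuda:0" if torch.cuda.is_available() else "cpu"
```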
171
 
172
+ ### Trajectory pool memory on CPU vs GPU
173
+
174
+ The trajectory pool stores ALL flow trajectories for the entire training. For image experiments this is gigabytes:
175
+ - MNIST: 1.92M entries Γ— 784 dims Γ— 4 bytes = **6 GB** on CPU
176
+ - CIFAR: 1.6M entries Γ— 3072 dims Γ— 4 bytes = **19.6 GB** on CPU
177
+
178
+ The pool MUST live on CPU. Only the sampled minibatch (128-256 samples) goes to GPU per training step. This is already how the code works (trajectories stored as CPU tensors, `.to(device)` in `sample()`), but it's worth being explicit about why.
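For reference, the relevant part of that pattern looks roughly like this (a sketch of the finalized pool; attribute names are illustrative):

```python
import torch

class TrajectoryPool:
    def __init__(self, all_x, all_v, all_t):
        # Finalized pool tensors stay on CPU for the whole run.
        self._all_x, self._all_v, self._all_t = all_x, all_v, all_t

    def sample(self, batch_size, device="cpu"):
        idx = torch.randint(0, self._all_x.shape[0], (batch_size,))
        # Only the indexed minibatch is copied to the GPU.
        return (self._all_x[idx].to(device),
                self._all_v[idx].to(device),
                self._all_t[idx].to(device))
```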
179
+
180
  ---
181
 
182
+ ## Phase 5: Testing Strategy
183
 
184
+ ### Always test on CPU first with tiny configs
185
 
186
+ Before any GPU run, verify the full pipeline works end-to-end:
 
 
187
 
188
+ ```bash
189
+ # Tiny run β€” should complete in <30 seconds
190
+ python main.py --experiment 2d --dataset 8gaussians --steps 5 --pool-batches 5 --train-iters 100
191
 
192
+ # Slightly larger β€” should complete in <5 minutes
193
+ python main.py --experiment 2d --dataset 8gaussians --steps 5 --pool-batches 20 --train-iters 2000
194
  ```
195
 
196
+ ### Test image experiments separately with minimal configs
197
 
198
+ ```bash
199
+ # MNIST smoke test β€” 2 pool batches, 5 training iters per phase
200
+ python main.py --experiment mnist --pool-batches 2 --train-iters 5
201
+
202
+ # If this crashes, fix before scaling up
203
  ```
204
 
205
+ **Mistake I made**: I tested 2D experiments thoroughly on CPU (both tiny and medium runs worked) but shipped the image experiments without testing them at all. The geomloss tensor shape bug affected ONLY the image path, so 2D success gave false confidence. The first GPU test of MNIST crashed immediately.
206
+
207
+ **Rule**: Test EVERY experiment type, not just the simplest one. If you have `{2d, mnist, cifar10}` experiments, test all three with minimal configs before declaring the code ready.
208
+
209
+ ### Test all training phases, not just the first one
210
 
211
+ Even after fixing Phase 1, Phase 2 can still crash due to shared state (see DataLoader trap in Phase 6). Run with `--train-iters 5 --pool-batches 2` to verify all 3 phases complete without errors. This takes <60 seconds on CPU for MNIST.
 
 
 
212
 
213
+ ---
214
+
215
+ ## Phase 6: Shared State Across Training Phases
216
 
217
+ ### The DataLoader trap
218
 
219
  When a single `DatasetLoader` object is shared across multiple training phases, **lazy-initialized internal state** (like a cached DataLoader) will silently break subsequent phases.
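The fix described in the post-mortem (recreate the loader whenever the requested batch size differs from the cached one) looks roughly like this sketch; the class and attribute names are illustrative:

```python
from torch.utils.data import DataLoader

class DatasetLoader:
    def __init__(self, dataset):
        self.dataset = dataset
        self._image_loader = None
        self._cached_batch_size = None

    def get_image_loader(self, batch_size):
        # Invalidate the lazily created loader when any cached parameter changes.
        if self._image_loader is None or batch_size != self._cached_batch_size:
            self._image_loader = DataLoader(self.dataset, batch_size=batch_size, shuffle=True)
            self._cached_batch_size = batch_size
        return self._image_loader
```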
220
 
 
239
 
240
  ---
241
 
242
+ ## Phase 7: Checkpointing and Multi-Session Training
243
 
244
+ ### Why this matters
245
 
246
+ Paper reproduction often requires training runs that exceed a single GPU session. Kaggle gives 9 hours per T4 session. MNIST NSGF++ with full paper config (100K+100K+40K iters) needs ~7-8 hours on T4 β€” tight. CIFAR-10 (200K+200K+40K) is impossible in one session.
247
 
248
+ Without checkpointing, a Kaggle timeout = all progress lost.
 
 
249
 
250
+ ### Phase-level checkpointing
251
+
252
+ For multi-phase training, save a checkpoint after EACH phase completes:
253
+
254
+ ```python
255
+ # After Phase 1 completes:
256
+ torch.save({
257
+ "nsgf_model_state": nsgf_model.state_dict(),
258
+ "phase": 1,
259
+ }, "checkpoints/phase1_complete.pt")
260
+
261
+ # After Phase 2 completes:
262
+ torch.save({
263
+ "nsgf_model_state": nsgf_model.state_dict(),
264
+ "nsf_model_state": nsf_model.state_dict(),
265
+ "phase": 2,
266
+ }, "checkpoints/phase2_complete.pt")
267
  ```
268
 
269
+ Then implement `--resume-phase N` that loads the phase N-1 checkpoint and skips completed phases:
270
 
271
  ```bash
272
+ # Session 1: Run Phase 1 (gets interrupted or completes)
273
+ python main.py --experiment mnist
274
 
275
+ # Session 2: Skip Phase 1, start Phase 2
276
+ python main.py --experiment mnist --resume-phase 2
277
+
278
+ # Session 3: Skip Phases 1+2, run Phase 3 + inference
279
+ python main.py --experiment mnist --resume-phase 3
280
  ```
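A sketch of what the `--resume-phase` handling can look like in `main.py` (the `train_phase*` callables are placeholders for your existing phase trainers; the checkpoint keys follow the snippets above):

```python
import torch

def run_pipeline(args, nsgf_model, nsf_model, train_phase1, train_phase2, train_phase3):
    if args.resume_phase <= 1:
        train_phase1(nsgf_model)
        torch.save({"nsgf_model_state": nsgf_model.state_dict(), "phase": 1},
                   "checkpoints/phase1_complete.pt")
    else:
        ckpt = torch.load("checkpoints/phase1_complete.pt", map_location="cpu")
        nsgf_model.load_state_dict(ckpt["nsgf_model_state"])

    if args.resume_phase <= 2:
        train_phase2(nsf_model, nsgf_model)
        torch.save({"nsgf_model_state": nsgf_model.state_dict(),
                    "nsf_model_state": nsf_model.state_dict(), "phase": 2},
                   "checkpoints/phase2_complete.pt")
    else:
        ckpt = torch.load("checkpoints/phase2_complete.pt", map_location="cpu")
        nsf_model.load_state_dict(ckpt["nsf_model_state"])

    train_phase3(nsgf_model, nsf_model)  # phase predictor always runs last
```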
281
 
282
+ ### Step-level checkpointing within phases
283
 
284
+ For long phases (100K+ steps), also save within the phase every N steps:
285
 
286
+ ```python
287
+ if (step + 1) % checkpoint_every == 0:
288
+ torch.save({
289
+ "model_state": model.state_dict(),
290
+ "optimizer_state": optimizer.state_dict(),
291
+ "step": step + 1,
292
+ }, "checkpoints/nsgf_checkpoint.pt")
293
+ ```
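And the matching load side, so an interrupted phase can pick up where it stopped (a sketch using the keys saved above):

```python
import os
import torch

def maybe_resume(model, optimizer, path="checkpoints/nsgf_checkpoint.pt"):
    """Return the step to resume from (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]

# start_step = maybe_resume(model, optimizer)
# for step in range(start_step, num_iterations): ...
```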
294
+
295
+ ### Important: checkpoint persistence on Kaggle
296
 
297
+ Kaggle notebooks persist `/kaggle/working/` across cells within the same session, but NOT across sessions. To carry checkpoints between sessions:
298
+ 1. Save checkpoints to `/kaggle/working/nsgf-plusplus/checkpoints/`
299
+ 2. Before session ends, commit the notebook output or copy checkpoints to a dataset
300
+ 3. In the new session, restore checkpoints before running `--resume-phase`
301
 
302
  ---
303
 
304
+ ## Phase 8: Debugging GPU Runs
305
 
306
  ### Common error patterns
307
 
 
310
  | `ValueError: (N,D) or (B,N,D)` | Library expects flat tensors, got images | Flatten before library call |
311
  | `RuntimeError: size of tensor a (X) must match size of tensor b (Y)` | Shared DataLoader with wrong batch size | Recreate DataLoader when batch size changes |
312
  | `RuntimeError: shape mismatch` in UNet | Skip connection count wrong | Count pushes and pops manually |
313
+ | `CUDA OOM` during pool building (Sinkhorn) | Sinkhorn batch too large for GPU | Reduce `--sinkhorn-batch` (e.g. 128β†’32) |
314
+ | `CUDA OOM` during training | Training batch too large or model too big | Reduce training batch, increase grad accum |
315
+ | `CUDA OOM` at phase transition | Memory not freed between phases | Add `torch.cuda.empty_cache()` + `del pool` |
316
  | Training loss plateaus high | Pool too small or too few iterations | Increase pool batches, more iters |
317
  | W2 distance too high | Undertrained model | Full paper config: 200 batches, 20k iters |
318
+ | Only 1 of 2 GPUs used | Code is single-GPU, no DDP | Expected β€” use single GPU or add DDP |
319
+ | `KeyboardInterrupt` mid-training | Training too long at scale | Check `checkpoints/` for latest save |
320
 
321
  ### When the user runs on their hardware
322
 
323
  If you're developing code that the user will run on their own GPU (Kaggle, Colab, local):
324
 
325
  1. **Provide exact commands** β€” don't make them figure out args
326
+ 2. **Warn about expected runtimes** β€” "2D full run: ~20min on T4, MNIST: ~2-4 hours per phase, CIFAR-10: ~4+ hours per phase"
327
  3. **Include checkpoint saving** β€” so partial runs aren't wasted
328
+ 4. **Document GPU requirements** β€” "MNIST fits on T4 16GB, CIFAR-10 needs `--sinkhorn-batch 32`"
329
+ 5. **Document multi-GPU limitations** β€” "Single-GPU only. T4Γ—2 wastes the second GPU."
330
+ 6. **Test the exact commands yourself** β€” if you can't run on GPU, at least verify the command parses correctly on CPU
331
 
332
  ---
333
 
 
353
  - **Root cause**: False confidence from 2D success. Assumed same code path.
354
  - **Prevention**: Test EVERY experiment type with minimal configs. Different experiment types often exercise different code paths.
355
 
356
+ 4. **No checkpoint saving** (MODERATE β†’ became CRITICAL at scale)
357
  - **What**: No intermediate checkpoints during long training runs.
358
+ - **Impact**: If training is interrupted (Kaggle timeout, OOM, accidental Ctrl+C), all progress is lost. MNIST full run is ~7 hours β€” losing that is devastating.
359
+ - **Prevention**: Save checkpoints every N iterations. Save after each phase. Implement `--resume-phase` flag. Test resume actually works.
360
 
361
  5. **UNet forward pass fragility** (LOW-MODERATE)
362
  - **What**: `_get_num_res_blocks()` infers block count from module list length division.
 
365
 
366
  6. **DataLoader batch size mismatch across phases** (CRITICAL)
367
  - **What**: Shared `DatasetLoader` caches a DataLoader with batch_size=256 from Phase 1. Phase 2 requests batch_size=128 but gets 256 back β†’ tensor dimension mismatch crash.
368
+ - **Impact**: Phase 2 (NSF) crashes immediately even after Phase 1 completes successfully.
369
+ - **Root cause**: Lazy initialization pattern without invalidation.
370
+ - **Prevention**: When sharing stateful objects across consumers with different configs, track all cached parameters and invalidate on change.
371
 
372
  7. **CLI flag not overriding all training phases** (LOW)
373
  - **What**: `--train-iters` flag overrode NSGF and NSF iterations but NOT the phase predictor iterations (40,000 default). Smoke tests would hang on Phase 3 even with `--train-iters 5`.
374
+ - **Impact**: Tests take much longer than expected.
375
  - **Root cause**: Forgot that 3-phase training means 3 iteration counts to override.
376
+ - **Prevention**: When adding a CLI override, grep the config for ALL fields it should affect.
377
+
378
+ 8. **CIFAR-10 Sinkhorn OOM on T4** (CRITICAL)
379
+ - **What**: Paper uses `sinkhorn.batch_size=128` for CIFAR. Sinkhorn on 128 Γ— 3072-dim (flattened 3Γ—32Γ—32) with `tensorized` backend computes a 128Γ—128 cost matrix with 3072-dim vectors, plus autograd for potentials. This OOMs on T4 16GB during pool building.
380
+ - **Impact**: CIFAR-10 experiment crashes before even starting training. User loses their Kaggle session.
381
+ - **Root cause**: Used paper's hyperparameters without estimating VRAM for target hardware. Paper authors likely used A100 80GB.
382
+ - **Prevention**: ALWAYS estimate VRAM before running. Sinkhorn with the `tensorized` backend is O(N² × D). For CIFAR: 128² × 3072 × 4 bytes × ~10 (overhead) ≈ 2+ GB per call, ×10 calls per pool batch = too much. Reducing N from 128 to 32 makes the N² term 16× cheaper. Add a `--sinkhorn-batch` CLI flag so users can tune without editing config.
383
+
384
+ 9. **No GPU memory freed between phases** (MODERATE)
385
+ - **What**: After pool building, the Sinkhorn computation graph's CUDA allocations remain cached even though they're no longer needed. Training then starts with less available VRAM.
386
+ - **Impact**: Training phase might OOM even though pool building finished.
387
+ - **Root cause**: PyTorch's CUDA allocator doesn't automatically return memory to the OS.
388
+ - **Prevention**: `torch.cuda.empty_cache()` after pool building completes. Also `del pool` if the pool data was already finalized to separate tensors.
389
+
390
+ 10. **Multi-GPU assumption** (LOW)
391
+ - **What**: User has T4Γ—2 on Kaggle. Code is single-GPU. Second GPU sits idle.
392
+ - **Impact**: User pays for 2 GPUs but only uses 1. They might think the code is broken.
393
+ - **Root cause**: Didn't document single-GPU limitation.
394
+ - **Prevention**: Document GPU requirements explicitly. If multi-GPU is needed, implement DDP β€” but that's a significant scope change, so discuss with user first.
395
 
396
  ---
397
 
 
403
  β–‘ Third-party library APIs tested with exact tensor shapes per experiment
404
  β–‘ Shared state across phases verified (DataLoaders, iterators, caches)
405
  β–‘ CLI flags override ALL relevant config values (not just some)
406
+ β–‘ VRAM estimated for target hardware β€” will Sinkhorn/model/pool fit?
407
+ β–‘ Sinkhorn batch size appropriate for target GPU (not just paper's GPU)
408
+ β–‘ torch.cuda.empty_cache() called between memory-intensive phases
409
  β–‘ Training loop profiled β€” no O(N) operations per step where O(1) suffices
410
  β–‘ Memory estimated per experiment (pool size Γ— data dim Γ— 4 bytes)
411
+ β–‘ Checkpointing implemented: every N steps + after each phase
412
+ β–‘ --resume-phase tested and working (load checkpoint β†’ skip phases β†’ continue)
413
+ β–‘ Clear CLI with sensible defaults and override flags for GPU-sensitive params
414
  β–‘ Expected runtimes documented per hardware tier
415
+ β–‘ Multi-GPU limitations documented
416
  β–‘ Error messages are clear (not just stack traces)
417
  β–‘ Results directory created automatically
418
  β–‘ Requirements.txt includes ALL dependencies with minimum versions
 
441
  9. **Shared objects across phases are landmines.** When a DataLoader, iterator, or cache is shared across training phases, any phase-specific parameter (batch size, number of workers, shuffle mode) can silently break later phases. Either don't share, or implement proper invalidation. Test by running all phases sequentially with different configs per phase.
442
 
443
  10. **CLI overrides must be exhaustive.** If your config has N copies of a parameter (one per training phase), your CLI override must touch all N. Grep the config file for the parameter name to find all instances.
444
+
445
+ 11. **Paper hyperparameters assume paper hardware.** If a paper reports batch_size=128 and trained on A100 80GB, that batch size may OOM on your T4 16GB. Always re-derive batch sizes from VRAM constraints, keeping the total samples seen (batch Γ— iterations) the same.
446
+
447
+ 12. **Estimate VRAM before running, not after OOM.** For Sinkhorn: O(N² × D). For model: count parameters × 4 bytes (fp32) × 3 (params + gradients + optimizer). For pool: stored on CPU but sampled minibatch goes to GPU. Write this down before your first GPU run (a quick helper is sketched after this list).
448
+
449
+ 13. **Checkpoint at phase boundaries, not just step boundaries.** Phase-level checkpoints enable `--resume-phase` which is the minimum viable recovery. Step-level checkpoints within long phases are a bonus. Both together make multi-session training actually work.
450
+
451
+ 14. **Free GPU memory between phases.** `torch.cuda.empty_cache()` after pool building or any phase that uses different GPU memory patterns than the next phase. Also `del` large objects (pools, computation graphs) that won't be needed again.
452
+
453
+ 15. **Document what your code does NOT support.** Single-GPU only? No mixed precision? No gradient accumulation? Say so. Users with multi-GPU setups will waste time wondering why only one GPU is active if you don't tell them.
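A quick helper for the rule of thumb in principle 12 (a sketch; the factor of 3 covers params + gradients + optimizer state, and Adam's two moment buffers push it closer to 4):

```python
def model_vram_gb(model, bytes_per_param=4, factor=3):
    n_params = sum(p.numel() for p in model.parameters())
    return n_params * bytes_per_param * factor / 1e9

# e.g. a 38M-parameter UNet: 38e6 * 4 * 3 / 1e9 ~= 0.46 GB before activations
```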