rogermt committed
Commit 88f3058 · verified · 1 Parent(s): 74841f0

Add SKILL.md — paper reproduction skill with lessons from NSGF++ implementation

Files changed (1): SKILL.md (+326, -0)

SKILL.md ADDED

---
name: paper-reproduction
description: "Skill for reproducing ML research papers from scratch when no official code exists. Use this whenever a user asks to implement, reproduce, or replicate a paper — especially papers involving novel loss functions, custom training loops, or non-standard architectures that aren't covered by existing HF trainers. Also use when the user mentions 'paper reproduction', 'implement this paper', 'no official code', or describes a method from a specific arXiv paper. Covers: reading papers systematically, extracting hyperparameters, building custom training pipelines, handling library-specific gotchas (geomloss, POT, custom UNets), and iterating on GPU results."
---

# Paper Reproduction Skill

A skill for reproducing ML research papers from scratch, learned through the experience of reproducing NSGF++ (arXiv:2401.14069) — a Neural Sinkhorn Gradient Flow paper with no official implementation.

## When to use this skill

- User wants to reproduce/implement an ML paper
- No official code repository exists
- The paper uses custom training loops, novel losses, or non-standard architectures
- The method doesn't fit neatly into existing HF Trainer abstractions (SFT, DPO, GRPO)

---

## Phase 1: Read the Paper Properly

Most reproduction failures trace back to incomplete paper reading. Don't skim — read the methodology sections (typically Sections 3-5) line by line, and read ALL appendices.

### What to extract (checklist)

```
□ Loss function — exact mathematical form, every symbol defined
□ Architecture — layer counts, hidden dims, activation functions, normalization
□ Optimizer — type, learning rate, betas, weight decay, scheduler
□ Batch size — for each phase/component separately
□ Training iterations — for each phase/component
□ Dataset preprocessing — normalization range, image size, augmentation
□ Evaluation protocol — metrics, number of samples, any special setup
□ Hyperparameters per experiment — papers often have different configs per dataset
□ Algorithm pseudocode — if provided, follow it exactly before improvising
```
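
A lightweight way to keep this extraction honest is to transcribe every value into a single config object, with a note on where in the paper each one came from. A minimal sketch (the dataclass, field names, and the example values below are illustrative, not the actual NSGF++ numbers):

```python
from dataclasses import dataclass, field

@dataclass
class PaperConfig:
    """One config per experiment; every field should be traceable to the paper."""
    experiment: str                               # e.g. "2d-8gaussians", "mnist"
    lr: float                                     # optimizer section / appendix table
    batch_size: int
    train_iters: int
    data_range: tuple = (-1.0, 1.0)               # preprocessing details are easy to miss
    notes: dict = field(default_factory=dict)     # field name -> where it appears in the paper

# Hypothetical example: fill in values from the paper, not from memory.
cfg = PaperConfig(
    experiment="2d-8gaussians",
    lr=1e-4,
    batch_size=256,
    train_iters=20_000,
    notes={"lr": "appendix hyperparameter table", "batch_size": "Sec. 5.1"},
)
```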

### Mistake I made: Incomplete appendix reading

I extracted most hyperparameters correctly from the NSGF++ paper but missed a critical detail about how geomloss handles image tensors. The paper says "GeomLoss package" but doesn't spell out that images must be flattened to (N, D) format for the `SamplesLoss` API. This caused the MNIST and CIFAR-10 experiments to crash immediately on GPU.

**Lesson**: When a paper references a specific library, read that library's documentation and test its API with the exact tensor shapes you'll use BEFORE writing the full pipeline.

---

## Phase 2: Library API Verification

### CRITICAL: Test third-party library APIs with your actual tensor shapes

This is the single biggest mistake pattern in paper reproduction. You read the paper, understand the math, implement everything — then it crashes because a library function expects `(N, D)` but you passed `(N, C, H, W)`.

**The rule**: Before building ANY training loop that uses a third-party library (geomloss, POT, torchsde, torchdiffeq, etc.), write a 10-line test script:

```python
import torch
from geomloss import SamplesLoss

# Test with EXACT shapes you'll use in training
loss_fn = SamplesLoss(loss="sinkhorn", p=2, blur=0.5, potentials=True)

# 2D case — works fine
x_2d = torch.randn(256, 2, requires_grad=True)
y_2d = torch.randn(256, 2)
F, G = loss_fn(x_2d, y_2d)  # ✅ OK

# Image case — THIS CRASHES
x_img = torch.randn(128, 1, 28, 28, requires_grad=True)
y_img = torch.randn(128, 1, 28, 28)
F, G = loss_fn(x_img, y_img)  # ❌ ValueError: must be (N,D) or (B,N,D)

# Image case — FIXED by flattening
B = x_img.shape[0]
x_flat = x_img.detach().view(B, -1).requires_grad_(True)  # detach() first: a flattened view is not a leaf
y_flat = y_img.view(B, -1)
F, G = loss_fn(x_flat, y_flat)  # ✅ OK
```

### Mistake I made: geomloss tensor shape assumption

The `SamplesLoss` in geomloss requires inputs as `(N, D)` or `(B, N, D)` tensors. For 2D experiments with shape `(256, 2)` this works perfectly. For images with shape `(128, 1, 28, 28)` it crashes with:

```
ValueError: Input samples 'x' and 'y' should be encoded as (N,D) or (B,N,D) (batch) tensors.
```

**The fix**: Flatten images before passing to geomloss, reshape gradients back after:

```python
def compute_velocity(self, X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    original_shape = X.shape
    is_image = X.dim() == 4

    if is_image:
        B = X.shape[0]
        X_flat = X.detach().clone().view(B, -1).requires_grad_(True)
        Y_flat = Y.detach().view(B, -1)
    else:
        X_flat = X.detach().clone().requires_grad_(True)
        Y_flat = Y.detach()

    # Self-potential
    F_self, _ = self.loss_fn(X_flat, X_flat.detach().clone())
    grad_self = torch.autograd.grad(F_self.sum(), X_flat)[0]

    # Cross-potential
    if is_image:
        X_flat2 = X.detach().clone().view(B, -1).requires_grad_(True)
    else:
        X_flat2 = X.detach().clone().requires_grad_(True)
    F_cross, _ = self.loss_fn(X_flat2, Y_flat)
    grad_cross = torch.autograd.grad(F_cross.sum(), X_flat2)[0]

    velocity = grad_self.detach() - grad_cross.detach()

    if is_image:
        velocity = velocity.view(original_shape)

    return velocity
```

This pattern — flatten before library call, reshape after — applies to many optimal transport libraries (POT, geomloss, ott-jax).
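
The same flatten/restore bookkeeping can be pulled into a small helper so that every OT call sees `(N, D)` point clouds. A generic sketch in plain PyTorch (the helper name and the assumption that `ot_loss_fn` returns a scalar are mine, not from the paper or any particular library):

```python
import torch

def ot_grad_wrt_samples(ot_loss_fn, x, y):
    """Flatten image batches to (N, D), take the gradient of a scalar OT loss
    with respect to the samples, and return that gradient in the original layout."""
    shape = x.shape
    if x.dim() > 2:                          # (N, C, H, W) -> (N, C*H*W)
        x = x.reshape(shape[0], -1)
        y = y.reshape(y.shape[0], -1)
    x = x.detach().requires_grad_(True)      # fresh leaf so autograd.grad works
    y = y.detach()
    loss = ot_loss_fn(x, y)                  # assumed to return a scalar
    (grad,) = torch.autograd.grad(loss, x)
    return grad.reshape(shape)
```

With geomloss, `ot_loss_fn` could be something like `lambda a, b: SamplesLoss('sinkhorn', blur=0.5)(a, b)`, which returns a scalar when `potentials=False` (the default).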

---

## Phase 3: Architecture Gotchas

### UNet skip connections

When building a UNet from scratch (rather than importing from guided-diffusion), the skip connection bookkeeping is the #1 source of shape mismatch errors.

**The pattern that works**:
1. During the downward pass, push every intermediate activation onto a `skips` list
2. During the upward pass, pop from `skips` and concatenate
3. The number of pops must EXACTLY equal the number of pushes

**Mistake pattern**: Using a helper like `_get_num_res_blocks()` that infers block count from module list lengths. This is fragile — if the number of levels or blocks per level varies, the inference breaks.

**Better approach**: Store `num_res_blocks` as an instance variable at init time and use it directly:

```python
def __init__(self, num_res_blocks=2, channel_mult=(1, 2, 2, 2), **kwargs):  # other args elided
    self.num_res_blocks = num_res_blocks
    self.num_levels = len(channel_mult)
    # ... build layers ...

def forward(self, x, t):
    # Use self.num_res_blocks directly, not a value inferred from module list lengths
    ...
```
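
To make the push/pop symmetry explicit, here is a minimal forward-pass sketch. The module names (`input_proj`, `down_blocks`, `middle_block`, `up_blocks`, `out_proj`) are hypothetical placeholders for whatever `__init__` builds; the point is that every push has exactly one matching pop, and an assert catches any mismatch early:

```python
def forward(self, x, t):
    skips = []
    h = self.input_proj(x)
    for block in self.down_blocks:               # hypothetical ModuleList built in __init__
        h = block(h, t)
        skips.append(h)                          # one push per down block

    h = self.middle_block(h, t)

    for block in self.up_blocks:                 # same length as self.down_blocks
        h = torch.cat([h, skips.pop()], dim=1)   # one pop per up block
        h = block(h, t)

    assert not skips, f"{len(skips)} skip activations were never consumed"
    return self.out_proj(h)
```

A real UNet also has to down/upsample between levels so that the popped activation matches `h` spatially; the sketch only shows the counting discipline.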

### GroupNorm channel requirements

`nn.GroupNorm(32, channels)` requires `channels` to be divisible by 32. For small models (e.g., MNIST with `model_channels=32`), this is fine at the first level but may break at deeper levels if `channel_mult` creates channels not divisible by 32.

**Safety check**: At init time, verify all channel counts are divisible by the group count:
```python
for level, mult in enumerate(channel_mult):
    ch = model_channels * mult
    assert ch % 32 == 0, f"Level {level}: channels={ch} not divisible by 32"
```
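
If you would rather adapt than assert, a common workaround (my addition, not something the paper specifies) is to shrink the group count to the largest divisor of the channel count:

```python
import torch.nn as nn

def make_group_norm(channels, max_groups=32):
    # Pick the largest group count <= max_groups that evenly divides `channels`.
    groups = max_groups
    while groups > 1 and channels % groups != 0:
        groups -= 1
    return nn.GroupNorm(groups, channels)
```

Keep in mind this silently changes the normalization granularity relative to the reference architecture, so prefer the assert when you are trying to match a paper exactly.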

---

## Phase 4: Training Loop Patterns for Custom Pipelines

### Trajectory pool memory management

When building trajectory pools (storing (x, v, t) tuples from gradient flow), memory can explode, so estimate the size up front (a small helper sketch follows the list):
- MNIST: 256 samples × 1500 batches × 5 steps × 784 dims × 4 bytes ≈ 6 GB per stored tensor (manageable)
- CIFAR-10: 128 samples × 2500 batches × 5 steps × 3072 dims × 4 bytes ≈ 19.7 GB per stored tensor (tight on T4)
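
A quick back-of-the-envelope helper for that estimate (my own sketch; multiply by the number of tensors you actually store, e.g. ×2 for x and v):

```python
def pool_size_gb(samples, batches, steps, dims, bytes_per_elem=4):
    """Approximate float32 memory for one pooled tensor, in GB."""
    return samples * batches * steps * dims * bytes_per_elem / 1e9

print(pool_size_gb(256, 1500, 5, 784))    # ~6.0  (MNIST)
print(pool_size_gb(128, 2500, 5, 3072))   # ~19.7 (CIFAR-10)
```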

**The fix**: Store pool tensors on CPU, transfer to GPU only during sampling:

```python
class TrajectoryPool:
    def sample(self, batch_size, device="cpu"):
        # Concatenate on CPU, index, THEN move to device
        all_x = torch.cat(self.x_pool, dim=0)  # stays on CPU
        idx = torch.randint(0, all_x.shape[0], (batch_size,))
        return all_x[idx].to(device), ...  # only batch moves to GPU
```

**Mistake I made**: The pool sampling code calls `torch.cat` on the entire pool every training step, which is O(pool_size) per step. For 512K entries this is slow. Better: pre-concatenate once after pool building, then just index:

```python
def finalize(self):
    """Call once after pool is fully built."""
    self._all_x = torch.cat(self.x_pool, dim=0)
    self._all_v = torch.cat(self.v_pool, dim=0)
    self._all_t = torch.tensor(self.t_pool, dtype=torch.float32)
    # Free the lists
    self.x_pool = self.v_pool = self.t_pool = None

def sample(self, batch_size, device="cpu"):
    idx = torch.randint(0, self._all_x.shape[0], (batch_size,))
    return self._all_x[idx].to(device), self._all_v[idx].to(device), self._all_t[idx].to(device)
```

### Multi-phase training (NSGF++)

NSGF++ has 3 sequential training phases:
1. **NSGF**: Build trajectory pool → train velocity field
2. **NSF**: Use trained NSGF to generate P0 samples → train straight flow
3. **Phase predictor**: Train CNN to predict transition time

**Key insight**: Each phase depends on the previous one being fully trained. Don't try to interleave them. The NSGF model must be in `eval()` mode when used as a sample generator in phases 2 and 3.
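
The "previous phase in `eval()`" rule translates into a small amount of boilerplate at each hand-off. A sketch of the transition from phase 1 to phase 2 (the `sample_with_nsgf` helper is hypothetical; the freezing calls are standard PyTorch):

```python
# Freeze the phase-1 model before phases 2 and 3 use it as a generator.
nsgf_model.eval()                          # fixes dropout / norm-layer behaviour
for p in nsgf_model.parameters():
    p.requires_grad_(False)                # no gradients flow back into phase 1

with torch.no_grad():                      # generation only, no graph needed
    p0_samples = sample_with_nsgf(nsgf_model, n_samples=512)   # hypothetical sampler
```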

---

## Phase 5: Testing Strategy

### Always test on CPU first with tiny configs

Before any GPU run, verify the full pipeline works end-to-end:

```bash
# Tiny run — should complete in <30 seconds
python main.py --experiment 2d --dataset 8gaussians --steps 5 --pool-batches 5 --train-iters 100

# Slightly larger — should complete in <5 minutes
python main.py --experiment 2d --dataset 8gaussians --steps 5 --pool-batches 20 --train-iters 2000
```

### Test image experiments separately with minimal configs

```bash
# MNIST smoke test — 2 pool batches, 50 training iters
python main.py --experiment mnist --pool-batches 2 --train-iters 50

# If this crashes, fix before scaling up
```

**Mistake I made**: I tested 2D experiments thoroughly on CPU (both tiny and medium runs worked) but shipped the image experiments without testing them at all. The geomloss tensor shape bug affected ONLY the image path, so 2D success gave false confidence. The first GPU test of MNIST crashed immediately.

**Rule**: Test EVERY experiment type, not just the simplest one. If you have `{2d, mnist, cifar10}` experiments, test all three with minimal configs before declaring the code ready.

---

## Phase 6: Debugging GPU Runs

### Common error patterns

| Error | Cause | Fix |
|-------|-------|-----|
| `ValueError: (N,D) or (B,N,D)` | Library expects flat tensors, got images | Flatten before library call |
| `RuntimeError: shape mismatch` in UNet | Skip connection count wrong | Count pushes and pops manually |
| `CUDA OOM` during pool building | Pool too large for GPU | Build pool on CPU, sample to GPU |
| `CUDA OOM` during training | Batch too large or model too big | Reduce batch → increase grad accum |
| Training loss plateaus high | Pool too small or too few iterations | Increase pool batches, more iters |
| W2 distance too high | Undertrained model | Full paper config: 200 batches, 20k iters |
| `KeyboardInterrupt` during training | Training takes too long at scale | Expected — full 2D takes ~20min on T4 |
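
For the OOM-during-training row, "reduce batch → increase grad accum" means keeping the effective batch size while shrinking what is resident on the GPU. A generic accumulation pattern (names like `accum_steps` and `loader` are placeholders, not the repo's variables):

```python
accum_steps = 4                              # effective batch = accum_steps * micro-batch size
optimizer.zero_grad(set_to_none=True)
for i, (x, v, t) in enumerate(loader):
    pred = model(x, t)
    loss = loss_fn(pred, v) / accum_steps    # scale so the summed gradient matches the large batch
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```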

### When the user runs on their hardware

If you're developing code that the user will run on their own GPU (Kaggle, Colab, local):

1. **Provide exact commands** — don't make them figure out args
2. **Warn about expected runtimes** — "2D full run: ~20min on T4, MNIST: ~2-4 hours, CIFAR-10: ~8-12 hours"
3. **Include checkpoint saving** — so partial runs aren't wasted (see the sketch after this list)
4. **Test the exact commands yourself** — if you can't run on GPU, at least verify the command parses correctly on CPU
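
A minimal checkpoint/resume sketch for item 3 (the path, save interval, and `--resume` wiring are illustrative):

```python
import torch

def save_checkpoint(path, model, optimizer, step):
    torch.save({"model": model.state_dict(), "opt": optimizer.state_dict(), "step": step}, path)

def load_checkpoint(path, model, optimizer):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["opt"])
    return state["step"]

# In the training loop: save every N iterations, and start from the saved step when --resume is passed.
# if step % 1000 == 0:
#     save_checkpoint("results/ckpt_latest.pt", model, optimizer, step)
```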

---

## Mistake Catalog

### Mistakes made during NSGF++ reproduction

1. **geomloss tensor shape bug** (CRITICAL)
   - **What**: `SamplesLoss` requires `(N,D)` tensors. Image experiments passed `(N,C,H,W)`.
   - **Impact**: MNIST and CIFAR-10 experiments crash immediately. 2D works fine, hiding the bug.
   - **Root cause**: Only tested 2D path. Didn't verify library API with image tensor shapes.
   - **Prevention**: Write a standalone API test script for every third-party library, testing with ALL tensor shapes you'll use.

2. **TrajectoryPool sampling performance** (MODERATE)
   - **What**: `torch.cat` called on entire pool every training step.
   - **Impact**: Training slower than necessary. At 512K pool entries, the cat+index is the bottleneck (~0.5s per step vs ~0.05s for the actual forward/backward).
   - **Root cause**: Didn't profile the training loop.
   - **Prevention**: Pre-concatenate the pool after building it. Profile before shipping.

3. **Incomplete experiment testing** (CRITICAL)
   - **What**: Tested 2D experiments only. Shipped MNIST/CIFAR untested.
   - **Impact**: User's first GPU run crashes. Wasted their Kaggle session time.
   - **Root cause**: False confidence from 2D success. Assumed same code path.
   - **Prevention**: Test EVERY experiment type with minimal configs. Different experiment types often exercise different code paths.

4. **No checkpoint saving** (MODERATE)
   - **What**: No intermediate checkpoints during long training runs.
   - **Impact**: If training is interrupted (Kaggle timeout, OOM), all progress is lost.
   - **Prevention**: Save checkpoints every N iterations. Implement `--resume` flag.

5. **UNet forward pass fragility** (LOW-MODERATE)
   - **What**: `_get_num_res_blocks()` infers block count from module list length division.
   - **Impact**: Could break silently with non-standard configs.
   - **Prevention**: Store config values as instance variables, don't infer from module counts.

---

## Pre-flight Checklist (before declaring code ready)

```
□ All experiment types tested with minimal configs (not just the easiest one)
□ Third-party library APIs tested with exact tensor shapes per experiment
□ Training loop profiled — no O(N) operations per step where O(1) suffices
□ Memory estimated per experiment (pool size × data dim × 4 bytes)
□ Checkpointing implemented for runs >10 minutes
□ Clear CLI with sensible defaults and override flags
□ Expected runtimes documented per hardware tier
□ Error messages are clear (not just stack traces)
□ Results directory created automatically
□ requirements.txt includes ALL dependencies with minimum versions
```

---

## General Principles for Paper Reproduction

1. **Read the appendix first.** The appendix contains the actual implementation details. The main paper is the story; the appendix is the recipe.

2. **Test the boundaries, not just the happy path.** If your code handles 2D, MNIST, and CIFAR-10, test all three. The bug is always in the path you didn't test.

3. **Library APIs are opaque until tested.** Don't assume a function accepts your tensor shape just because it "makes sense." Write a 10-line test script.

4. **Pre-concatenate, don't re-concatenate.** Any data structure that's built once and sampled many times should be finalized into a single tensor after building.

5. **The user's time is more expensive than your time.** A crash on their GPU after 5 minutes of setup is worse than you spending 30 extra minutes testing. Ship code that works on first run.

6. **Flatten for OT libraries.** Optimal transport libraries (geomloss, POT, ott-jax) almost universally expect `(N, D)` point clouds. Images must be flattened. This is the #1 gotcha in OT-based generative models.

7. **Store training state on CPU, compute on GPU.** Trajectory pools, replay buffers, and other large data structures should live on CPU. Only the current minibatch goes to GPU.

8. **Multi-phase training = multiple separate trainers.** Don't try to be clever with a single training loop that switches phases. Each phase is a distinct trainer with its own optimizer. The previous phase's model goes to `eval()`.