rogermt committed on
Commit d5802bb · verified · 1 Parent(s): 7921a70

Update SKILL.md with DataLoader batch size mismatch bug and shared state pitfall

Files changed (1): SKILL.md +49 -2
SKILL.md CHANGED
@@ -200,6 +200,29 @@ NSGF++ has 3 sequential training phases:

 **Key insight**: Each phase depends on the previous one being fully trained. Don't try to interleave them. The NSGF model must be in `eval()` mode when used as a sample generator in phases 2 and 3.

 ---

 ## Phase 5: Testing Strategy
@@ -219,8 +242,8 @@ python main.py --experiment 2d --dataset 8gaussians --steps 5 --pool-batches 20

 ### Test image experiments separately with minimal configs

 ```bash
-# MNIST smoke test — 2 pool batches, 50 training iters
-python main.py --experiment mnist --pool-batches 2 --train-iters 50

 # If this crashes, fix before scaling up
 ```
@@ -229,6 +252,10 @@ python main.py --experiment mnist --pool-batches 2 --train-iters 50

 **Rule**: Test EVERY experiment type, not just the simplest one. If you have `{2d, mnist, cifar10}` experiments, test all three with minimal configs before declaring the code ready.

 ---

 ## Phase 6: Debugging GPU Runs
@@ -238,6 +265,7 @@ python main.py --experiment mnist --pool-batches 2 --train-iters 50

 | Error | Cause | Fix |
 |-------|-------|-----|
 | `ValueError: (N,D) or (B,N,D)` | Library expects flat tensors, got images | Flatten before library call |
 | `RuntimeError: shape mismatch` in UNet | Skip connection count wrong | Count pushes and pops manually |
 | `CUDA OOM` during pool building | Pool too large for GPU | Build pool on CPU, sample to GPU |
 | `CUDA OOM` during training | Batch too large or model too big | Reduce batch → increase grad accum |
@@ -288,13 +316,28 @@ If you're developing code that the user will run on their own GPU (Kaggle, Colab

 - **Impact**: Could break silently with non-standard configs.
 - **Prevention**: Store config values as instance variables, don't infer from module counts.

 ---

 ## Pre-flight Checklist (before declaring code ready)

 ```
 □ All experiment types tested with minimal configs (not just the easiest one)
 □ Third-party library APIs tested with exact tensor shapes per experiment
 □ Training loop profiled — no O(N) operations per step where O(1) suffices
 □ Memory estimated per experiment (pool size × data dim × 4 bytes)
 □ Checkpointing implemented for runs >10 minutes
@@ -324,3 +367,7 @@ If you're developing code that the user will run on their own GPU (Kaggle, Colab

 7. **Store training state on CPU, compute on GPU.** Trajectory pools, replay buffers, and other large data structures should live on CPU. Only the current minibatch goes to GPU.

 8. **Multi-phase training = multiple separate trainers.** Don't try to be clever with a single training loop that switches phases. Each phase is a distinct trainer with its own optimizer. The previous phase's model goes to `eval()`.
 
 
 **Key insight**: Each phase depends on the previous one being fully trained. Don't try to interleave them. The NSGF model must be in `eval()` mode when used as a sample generator in phases 2 and 3.

+### Shared state across training phases — the DataLoader trap
+
+When a single `DatasetLoader` object is shared across multiple training phases, **lazy-initialized internal state** (like a cached DataLoader) will silently break subsequent phases.
+
+**Mistake I made**: The `DatasetLoader.sample_target()` method lazily creates a PyTorch DataLoader on the first call, caching it with whatever batch size was requested. Phase 1 (pool building) calls `sample_target(256)` → DataLoader created with `batch_size=256, drop_last=True`. Phase 2 (NSF training) calls `sample_target(128)` → but the cached DataLoader still yields batches of 256 → tensor shape mismatch → crash:
+
+```
+RuntimeError: The size of tensor a (128) must match the size of tensor b (256) at non-singleton dimension 0
+```
+
+**The fix**: Track the batch size and recreate the DataLoader when it changes:
+
+```python
+def sample_target(self, n, device="cpu"):
+    if not hasattr(self, "_loader") or self._batch_size != n:
+        self._batch_size = n
+        self._loader = get_image_dataloader(self.dataset_name, batch_size=n, train=True)
+        self._iter = iter(self._loader)
+    # ... sample from self._iter ...
+```
+
+**General rule**: When sharing a data provider across multiple consumers with different batch sizes, NEVER cache a DataLoader with a fixed batch size. Either recreate it on batch size change, or provide raw dataset access and let each consumer create its own DataLoader.
+
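The failure mode and its fix can be reproduced without PyTorch. A minimal, framework-free sketch of the invalidation pattern (class and method names are illustrative, not from the repo):

```python
class SampleProvider:
    """Toy stand-in for a DatasetLoader shared across training phases."""

    def __init__(self, data):
        self._data = data
        self._batch_size = None  # nothing cached yet
        self._iter = None

    def _batches(self, n):
        # Cycle forever over the data in chunks of n (drop the last partial chunk).
        while True:
            for i in range(0, len(self._data) - n + 1, n):
                yield self._data[i:i + n]

    def sample(self, n):
        # Invalidate the cached iterator whenever the requested batch size
        # changes; without this check, phase 2 keeps getting phase-1-sized batches.
        if self._batch_size != n:
            self._batch_size = n
            self._iter = self._batches(n)
        return next(self._iter)


provider = SampleProvider(list(range(1000)))
phase1_batch = provider.sample(256)  # phase 1: pool building
phase2_batch = provider.sample(128)  # phase 2: cache rebuilt, not reused
assert len(phase1_batch) == 256 and len(phase2_batch) == 128
```

Without the `self._batch_size != n` check, the second `sample(128)` call would return a 256-element batch, which is exactly the crash described above.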
 ---

 ## Phase 5: Testing Strategy

 ### Test image experiments separately with minimal configs

 ```bash
+# MNIST smoke test — 2 pool batches, 5 training iters per phase
+python main.py --experiment mnist --pool-batches 2 --train-iters 5

 # If this crashes, fix before scaling up
 ```

 **Rule**: Test EVERY experiment type, not just the simplest one. If you have `{2d, mnist, cifar10}` experiments, test all three with minimal configs before declaring the code ready.

+### Test all training phases, not just the first one
+
+Even after fixing Phase 1, Phase 2 can still crash due to shared state (see the DataLoader trap above). Run with `--train-iters 5 --pool-batches 2` to verify all 3 phases complete without errors. This takes <60 seconds on CPU for MNIST.
+
 ---

 ## Phase 6: Debugging GPU Runs

 | Error | Cause | Fix |
 |-------|-------|-----|
 | `ValueError: (N,D) or (B,N,D)` | Library expects flat tensors, got images | Flatten before library call |
+| `RuntimeError: size of tensor a (X) must match size of tensor b (Y)` | Shared DataLoader with wrong batch size | Recreate DataLoader when batch size changes |
 | `RuntimeError: shape mismatch` in UNet | Skip connection count wrong | Count pushes and pops manually |
 | `CUDA OOM` during pool building | Pool too large for GPU | Build pool on CPU, sample to GPU |
 | `CUDA OOM` during training | Batch too large or model too big | Reduce batch → increase grad accum |
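For the first row of the table, "flatten before the library call" means collapsing the spatial dimensions into a single feature axis (in PyTorch, `x.flatten(start_dim=1)`). A dependency-free illustration of the shape change:

```python
def flatten_images(batch):
    """Collapse an (N, H, W) nested-list batch into (N, D), D = H * W.

    Stand-in for PyTorch's `x.flatten(start_dim=1)`: shape-strict libraries
    that expect (N, D) reject raw image batches until you do this.
    """
    return [[pixel for row in image for pixel in row] for image in batch]


# Two fake 2x3 "images": shape (2, 2, 3) -> shape (2, 6)
batch = [[[1, 2, 3], [4, 5, 6]],
         [[7, 8, 9], [10, 11, 12]]]
flat = flatten_images(batch)
assert len(flat) == 2 and all(len(row) == 6 for row in flat)
```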
 
 - **Impact**: Could break silently with non-standard configs.
 - **Prevention**: Store config values as instance variables, don't infer from module counts.

+6. **DataLoader batch size mismatch across phases** (CRITICAL)
+   - **What**: Shared `DatasetLoader` caches a DataLoader with `batch_size=256` from Phase 1. Phase 2 requests `batch_size=128` but gets 256 back → tensor dimension mismatch crash.
+   - **Impact**: Phase 2 (NSF) crashes immediately even after Phase 1 completes successfully. The error message (`size of tensor a (128) must match size of tensor b (256)`) doesn't make the DataLoader caching obvious.
+   - **Root cause**: Lazy initialization pattern without invalidation. The `_image_loader` was created once and never checked for batch size changes.
+   - **Prevention**: When sharing stateful objects across consumers with different configs, either (a) track all cached parameters and invalidate on change, or (b) don't cache at all. For DataLoaders: recreate when `batch_size` changes.
+
+7. **CLI flag not overriding all training phases** (LOW)
+   - **What**: The `--train-iters` flag overrode the NSGF and NSF iterations but NOT the time predictor iterations (40,000 default). Smoke tests would hang on Phase 3 even with `--train-iters 5`.
+   - **Impact**: Tests take much longer than expected, and the user thinks something is broken.
+   - **Root cause**: Forgot that 3-phase training means 3 iteration counts to override.
+   - **Prevention**: When adding a CLI override, grep the config for ALL fields it should affect. If a config has `nsgf_training.num_iterations`, `nsf_training.num_iterations`, AND `time_predictor.num_iterations`, the override must touch all three.
+
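The prevention for pitfall 7 can be enforced structurally: apply the override by iterating over every phase section instead of naming phases one by one. A sketch with `argparse` (the config layout is hypothetical, mirroring the three `num_iterations` fields above):

```python
import argparse
import copy

# Hypothetical config: one iteration count per training phase.
DEFAULT_CONFIG = {
    "nsgf_training": {"num_iterations": 20_000},
    "nsf_training": {"num_iterations": 20_000},
    "time_predictor": {"num_iterations": 40_000},  # the one that got missed
}


def apply_overrides(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-iters", type=int, default=None)
    args = parser.parse_args(argv)

    config = copy.deepcopy(DEFAULT_CONFIG)
    if args.train_iters is not None:
        # Touch EVERY phase section that has a num_iterations field, so a
        # phase added later is overridden automatically instead of forgotten.
        for section in config.values():
            if "num_iterations" in section:
                section["num_iterations"] = args.train_iters
    return config


cfg = apply_overrides(["--train-iters", "5"])
assert all(s["num_iterations"] == 5 for s in cfg.values())
```

Looping over `config.values()` rather than hard-coding phase names is what makes the override exhaustive by construction.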
 ---

 ## Pre-flight Checklist (before declaring code ready)

 ```
 □ All experiment types tested with minimal configs (not just the easiest one)
+□ ALL training phases tested end-to-end (not just Phase 1)
 □ Third-party library APIs tested with exact tensor shapes per experiment
+□ Shared state across phases verified (DataLoaders, iterators, caches)
+□ CLI flags override ALL relevant config values (not just some)
 □ Training loop profiled — no O(N) operations per step where O(1) suffices
 □ Memory estimated per experiment (pool size × data dim × 4 bytes)
 □ Checkpointing implemented for runs >10 minutes

 7. **Store training state on CPU, compute on GPU.** Trajectory pools, replay buffers, and other large data structures should live on CPU. Only the current minibatch goes to GPU.

 8. **Multi-phase training = multiple separate trainers.** Don't try to be clever with a single training loop that switches phases. Each phase is a distinct trainer with its own optimizer. The previous phase's model goes to `eval()`.
+
+9. **Shared objects across phases are landmines.** When a DataLoader, iterator, or cache is shared across training phases, any phase-specific parameter (batch size, number of workers, shuffle mode) can silently break later phases. Either don't share, or implement proper invalidation. Test by running all phases sequentially with different configs per phase.
+
+10. **CLI overrides must be exhaustive.** If your config has N copies of a parameter (one per training phase), your CLI override must touch all N. Grep the config file for the parameter name to find all instances.
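The invalidation idea in tip 9 generalizes beyond batch size: key the cache on every parameter that affects the loader, so a change to worker count or shuffle mode also triggers a rebuild. An illustrative sketch (the factory is any callable that builds a fresh loader; nothing here is from the repo):

```python
class LoaderCache:
    """Cache a loader keyed on ALL of its construction parameters.

    Any parameter change (batch size, shuffle, num_workers, ...) rebuilds
    the loader instead of silently serving state from an earlier phase.
    """

    def __init__(self, factory):
        self._factory = factory  # callable(**params) -> fresh loader
        self._key = None
        self._loader = None

    def get(self, **params):
        key = tuple(sorted(params.items()))
        if key != self._key:  # invalidate on ANY parameter change
            self._key = key
            self._loader = self._factory(**params)
        return self._loader


builds = []
cache = LoaderCache(lambda **p: builds.append(p) or dict(p))
a = cache.get(batch_size=256, shuffle=True)
b = cache.get(batch_size=256, shuffle=True)  # identical params: cache hit
c = cache.get(batch_size=128, shuffle=True)  # batch size changed: rebuilt
assert a is b and c is not a and len(builds) == 2
```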