rogermt committed on
Commit d5802bb · verified · 1 Parent(s): 7921a70

Update SKILL.md with DataLoader batch size mismatch bug and shared state pitfall

Files changed (1): SKILL.md +49 -2
SKILL.md CHANGED
@@ -200,6 +200,29 @@ NSGF++ has 3 sequential training phases:

 **Key insight**: Each phase depends on the previous one being fully trained. Don't try to interleave them. The NSGF model must be in `eval()` mode when used as a sample generator in phases 2 and 3.

 ---

 ## Phase 5: Testing Strategy
@@ -219,8 +242,8 @@ python main.py --experiment 2d --dataset 8gaussians --steps 5 --pool-batches 20

 ### Test image experiments separately with minimal configs

 ```bash
-# MNIST smoke test — 2 pool batches, 50 training iters
-python main.py --experiment mnist --pool-batches 2 --train-iters 50

 # If this crashes, fix before scaling up
 ```
@@ -229,6 +252,10 @@ python main.py --experiment mnist --pool-batches 2 --train-iters 50

 **Rule**: Test EVERY experiment type, not just the simplest one. If you have `{2d, mnist, cifar10}` experiments, test all three with minimal configs before declaring the code ready.

 ---

 ## Phase 6: Debugging GPU Runs
@@ -238,6 +265,7 @@ python main.py --experiment mnist --pool-batches 2 --train-iters 50

 | Error | Cause | Fix |
 |-------|-------|-----|
 | `ValueError: (N,D) or (B,N,D)` | Library expects flat tensors, got images | Flatten before library call |
 | `RuntimeError: shape mismatch` in UNet | Skip connection count wrong | Count pushes and pops manually |
 | `CUDA OOM` during pool building | Pool too large for GPU | Build pool on CPU, sample to GPU |
 | `CUDA OOM` during training | Batch too large or model too big | Reduce batch → increase grad accum |
@@ -288,13 +316,28 @@ If you're developing code that the user will run on their own GPU (Kaggle, Colab

 - **Impact**: Could break silently with non-standard configs.
 - **Prevention**: Store config values as instance variables, don't infer from module counts.

 ---

 ## Pre-flight Checklist (before declaring code ready)

 ```
 □ All experiment types tested with minimal configs (not just the easiest one)
 □ Third-party library APIs tested with exact tensor shapes per experiment
 □ Training loop profiled — no O(N) operations per step where O(1) suffices
 □ Memory estimated per experiment (pool size × data dim × 4 bytes)
 □ Checkpointing implemented for runs >10 minutes
@@ -324,3 +367,7 @@ If you're developing code that the user will run on their own GPU (Kaggle, Colab

 7. **Store training state on CPU, compute on GPU.** Trajectory pools, replay buffers, and other large data structures should live on CPU. Only the current minibatch goes to GPU.

 8. **Multi-phase training = multiple separate trainers.** Don't try to be clever with a single training loop that switches phases. Each phase is a distinct trainer with its own optimizer. The previous phase's model goes to `eval()`.
 
 
 **Key insight**: Each phase depends on the previous one being fully trained. Don't try to interleave them. The NSGF model must be in `eval()` mode when used as a sample generator in phases 2 and 3.

+### Shared state across training phases — the DataLoader trap
+
+When a single `DatasetLoader` object is shared across multiple training phases, **lazy-initialized internal state** (like a cached DataLoader) will silently break subsequent phases.
+
+**Mistake I made**: The `DatasetLoader.sample_target()` method lazily creates a PyTorch DataLoader on the first call, caching it with whatever batch size was requested. Phase 1 (pool building) calls `sample_target(256)` → DataLoader created with `batch_size=256, drop_last=True`. Phase 2 (NSF training) calls `sample_target(128)` → but the cached DataLoader still yields batches of 256 → tensor shape mismatch → crash:
+
+```
+RuntimeError: The size of tensor a (128) must match the size of tensor b (256) at non-singleton dimension 0
+```
+
+**The fix**: Track the batch size and recreate the DataLoader when it changes:
+
+```python
+def sample_target(self, n, device="cpu"):
+    if not hasattr(self, "_loader") or self._batch_size != n:
+        self._batch_size = n
+        self._loader = get_image_dataloader(self.dataset_name, batch_size=n, train=True)
+        self._iter = iter(self._loader)
+    # ... sample from self._iter ...
+```
+
+**General rule**: When sharing a data provider across multiple consumers with different batch sizes, NEVER cache a DataLoader with a fixed batch size. Either recreate it on batch size change, or provide raw dataset access and let each consumer create its own DataLoader.
+
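The failure mode and its fix can be reproduced without PyTorch. A minimal, framework-free sketch of the invalidation pattern (class and method names are illustrative, not from the repo):

```python
class SampleProvider:
    """Toy stand-in for a DatasetLoader shared across training phases."""

    def __init__(self, data):
        self._data = data
        self._batch_size = None  # nothing cached yet
        self._iter = None

    def _batches(self, n):
        # Cycle forever over the data in chunks of n (drop the last partial chunk).
        while True:
            for i in range(0, len(self._data) - n + 1, n):
                yield self._data[i:i + n]

    def sample(self, n):
        # Invalidate the cached iterator whenever the requested batch size
        # changes; without this check, phase 2 keeps getting phase-1-sized batches.
        if self._batch_size != n:
            self._batch_size = n
            self._iter = self._batches(n)
        return next(self._iter)


provider = SampleProvider(list(range(1000)))
phase1_batch = provider.sample(256)  # phase 1: pool building
phase2_batch = provider.sample(128)  # phase 2: cache rebuilt, not reused
assert len(phase1_batch) == 256 and len(phase2_batch) == 128
```

Without the `self._batch_size != n` check, the second `sample(128)` call would return a 256-element batch, which is exactly the crash described above.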
 ---

 ## Phase 5: Testing Strategy

 ### Test image experiments separately with minimal configs

 ```bash
+# MNIST smoke test — 2 pool batches, 5 training iters per phase
+python main.py --experiment mnist --pool-batches 2 --train-iters 5

 # If this crashes, fix before scaling up
 ```

 **Rule**: Test EVERY experiment type, not just the simplest one. If you have `{2d, mnist, cifar10}` experiments, test all three with minimal configs before declaring the code ready.

+### Test all training phases, not just the first one
+
+Even after fixing Phase 1, Phase 2 can still crash due to shared state (see the DataLoader trap above). Run with `--train-iters 5 --pool-batches 2` to verify all 3 phases complete without errors. This takes <60 seconds on CPU for MNIST.
+
 ---

 ## Phase 6: Debugging GPU Runs

 | Error | Cause | Fix |
 |-------|-------|-----|
 | `ValueError: (N,D) or (B,N,D)` | Library expects flat tensors, got images | Flatten before library call |
+| `RuntimeError: size of tensor a (X) must match size of tensor b (Y)` | Shared DataLoader with wrong batch size | Recreate DataLoader when batch size changes |
 | `RuntimeError: shape mismatch` in UNet | Skip connection count wrong | Count pushes and pops manually |
 | `CUDA OOM` during pool building | Pool too large for GPU | Build pool on CPU, sample to GPU |
 | `CUDA OOM` during training | Batch too large or model too big | Reduce batch → increase grad accum |
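For the first row of the table, "flatten before the library call" means collapsing the spatial dimensions into a single feature axis (in PyTorch, `x.flatten(start_dim=1)`). A dependency-free illustration of the shape change:

```python
def flatten_images(batch):
    """Collapse an (N, H, W) nested-list batch into (N, D), D = H * W.

    Stand-in for PyTorch's `x.flatten(start_dim=1)`: shape-strict libraries
    that expect (N, D) reject raw image batches until you do this.
    """
    return [[pixel for row in image for pixel in row] for image in batch]


# Two fake 2x3 "images": shape (2, 2, 3) -> shape (2, 6)
batch = [[[1, 2, 3], [4, 5, 6]],
         [[7, 8, 9], [10, 11, 12]]]
flat = flatten_images(batch)
assert len(flat) == 2 and all(len(row) == 6 for row in flat)
```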
 
 - **Impact**: Could break silently with non-standard configs.
 - **Prevention**: Store config values as instance variables, don't infer from module counts.

+6. **DataLoader batch size mismatch across phases** (CRITICAL)
+   - **What**: Shared `DatasetLoader` caches a DataLoader with `batch_size=256` from Phase 1. Phase 2 requests `batch_size=128` but gets 256 back → tensor dimension mismatch crash.
+   - **Impact**: Phase 2 (NSF) crashes immediately even after Phase 1 completes successfully. The error message (`size of tensor a (128) must match size of tensor b (256)`) doesn't make the DataLoader caching obvious.
+   - **Root cause**: Lazy initialization pattern without invalidation. The `_image_loader` was created once and never checked for batch size changes.
+   - **Prevention**: When sharing stateful objects across consumers with different configs, either (a) track all cached parameters and invalidate on change, or (b) don't cache at all. For DataLoaders: recreate when `batch_size` changes.
+
+7. **CLI flag not overriding all training phases** (LOW)
+   - **What**: The `--train-iters` flag overrode the NSGF and NSF iterations but NOT the time predictor iterations (40,000 default). Smoke tests would hang on Phase 3 even with `--train-iters 5`.
+   - **Impact**: Tests take much longer than expected, and the user thinks something is broken.
+   - **Root cause**: Forgot that 3-phase training means 3 iteration counts to override.
+   - **Prevention**: When adding a CLI override, grep the config for ALL fields it should affect. If a config has `nsgf_training.num_iterations`, `nsf_training.num_iterations`, AND `time_predictor.num_iterations`, the override must touch all three.
+
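The prevention for pitfall 7 can be enforced structurally: apply the override by iterating over every phase section instead of naming phases one by one. A sketch with `argparse` (the config layout is hypothetical, mirroring the three `num_iterations` fields above):

```python
import argparse
import copy

# Hypothetical config: one iteration count per training phase.
DEFAULT_CONFIG = {
    "nsgf_training": {"num_iterations": 20_000},
    "nsf_training": {"num_iterations": 20_000},
    "time_predictor": {"num_iterations": 40_000},  # the one that got missed
}


def apply_overrides(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-iters", type=int, default=None)
    args = parser.parse_args(argv)

    config = copy.deepcopy(DEFAULT_CONFIG)
    if args.train_iters is not None:
        # Touch EVERY phase section that has a num_iterations field, so a
        # phase added later is overridden automatically instead of forgotten.
        for section in config.values():
            if "num_iterations" in section:
                section["num_iterations"] = args.train_iters
    return config


cfg = apply_overrides(["--train-iters", "5"])
assert all(s["num_iterations"] == 5 for s in cfg.values())
```

Looping over `config.values()` rather than hard-coding phase names is what makes the override exhaustive by construction.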
 ---

 ## Pre-flight Checklist (before declaring code ready)

 ```
 □ All experiment types tested with minimal configs (not just the easiest one)
+□ ALL training phases tested end-to-end (not just Phase 1)
 □ Third-party library APIs tested with exact tensor shapes per experiment
+□ Shared state across phases verified (DataLoaders, iterators, caches)
+□ CLI flags override ALL relevant config values (not just some)
 □ Training loop profiled — no O(N) operations per step where O(1) suffices
 □ Memory estimated per experiment (pool size × data dim × 4 bytes)
 □ Checkpointing implemented for runs >10 minutes

 7. **Store training state on CPU, compute on GPU.** Trajectory pools, replay buffers, and other large data structures should live on CPU. Only the current minibatch goes to GPU.

 8. **Multi-phase training = multiple separate trainers.** Don't try to be clever with a single training loop that switches phases. Each phase is a distinct trainer with its own optimizer. The previous phase's model goes to `eval()`.
+
+9. **Shared objects across phases are landmines.** When a DataLoader, iterator, or cache is shared across training phases, any phase-specific parameter (batch size, number of workers, shuffle mode) can silently break later phases. Either don't share, or implement proper invalidation. Test by running all phases sequentially with different configs per phase.
+
+10. **CLI overrides must be exhaustive.** If your config has N copies of a parameter (one per training phase), your CLI override must touch all N. Grep the config file for the parameter name to find all instances.
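The invalidation idea in tip 9 generalizes beyond batch size: key the cache on every parameter that affects the loader, so a change to worker count or shuffle mode also triggers a rebuild. An illustrative sketch (the factory is any callable that builds a fresh loader; nothing here is from the repo):

```python
class LoaderCache:
    """Cache a loader keyed on ALL of its construction parameters.

    Any parameter change (batch size, shuffle, num_workers, ...) rebuilds
    the loader instead of silently serving state from an earlier phase.
    """

    def __init__(self, factory):
        self._factory = factory  # callable(**params) -> fresh loader
        self._key = None
        self._loader = None

    def get(self, **params):
        key = tuple(sorted(params.items()))
        if key != self._key:  # invalidate on ANY parameter change
            self._key = key
            self._loader = self._factory(**params)
        return self._loader


builds = []
cache = LoaderCache(lambda **p: builds.append(p) or dict(p))
a = cache.get(batch_size=256, shuffle=True)
b = cache.get(batch_size=256, shuffle=True)  # identical params: cache hit
c = cache.get(batch_size=128, shuffle=True)  # batch size changed: rebuilt
assert a is b and c is not a and len(builds) == 2
```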