Humanlearning committed on
Commit 7d32451 · 1 Parent(s): 632c145

feat: update README with GPU-utilization tuning instructions, enhance modal training script with run name parameter, and modify GRPO configuration for trace logging and vLLM settings

README.md CHANGED
@@ -267,6 +267,46 @@ uv run --extra modal modal run scripts/modal_train_grpo.py \
  --difficulty 0
  ```
 
+ For GPU-utilization tuning on the same single L4, start with a larger but still
+ bounded no-code trial:
+
+ ```bash
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
+ --max-steps 30 \
+ --dataset-size 64 \
+ --num-generations 8 \
+ --max-completion-length 256 \
+ --difficulty 0
+ ```
+
+ The launcher exposes GRPO throughput knobs for follow-up trials:
+
+ ```bash
+ # larger generation group, no vLLM
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
+ --max-steps 30 --dataset-size 64 --num-generations 8 \
+ --max-completion-length 256 --trace-log-every 5
+
+ # vLLM colocate on the same L4
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
+ --max-steps 30 --dataset-size 64 --num-generations 8 \
+ --max-completion-length 256 --use-vllm \
+ --vllm-gpu-memory-utilization 0.35 --trace-log-every 5
+
+ # larger microbatch if the vLLM trial does not OOM
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
+ --max-steps 30 --dataset-size 64 --num-generations 8 \
+ --per-device-train-batch-size 2 --gradient-accumulation-steps 4 \
+ --max-completion-length 256 --use-vllm \
+ --vllm-gpu-memory-utilization 0.45 --trace-log-every 5
+ ```
+
+ `per_device_train_batch_size * gradient_accumulation_steps * world_size` must
+ be divisible by `num_generations`; the launcher validates this before the GPU
+ container starts. Scalar Trackio metrics still log every reward callback, while
+ sample trace tables and Trace objects are throttled by `--trace-log-every`
+ (`1` restores every-callback logging, `0` disables trace artifacts).
+
  If running from a public repository and you do not want Modal to package the
  local workspace, use public source mode:
 
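Review note: the divisibility rule added to the README can be checked locally before launching a trial. The following is a minimal sketch, not the launcher's own validation code; the function name `check_grpo_batch` is hypothetical, and `world_size=1` is an assumption matching the single-L4 setup described above.

```python
def check_grpo_batch(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_generations: int,
    world_size: int = 1,  # assumption: one GPU, as in the single-L4 trials
) -> int:
    """Return the effective batch size, or raise if it cannot be split
    into whole groups of `num_generations` completions."""
    values = (
        per_device_train_batch_size,
        gradient_accumulation_steps,
        num_generations,
        world_size,
    )
    if any(v < 1 for v in values):
        raise ValueError("all batch parameters must be positive integers")

    effective = (
        per_device_train_batch_size * gradient_accumulation_steps * world_size
    )
    if effective % num_generations != 0:
        raise ValueError(
            f"effective batch {effective} is not divisible by "
            f"num_generations={num_generations}"
        )
    return effective


# Third README trial: 2 * 4 * 1 = 8, divisible by 8 generations.
print(check_grpo_batch(2, 4, 8))
```

Under the same single-GPU assumption, the values visible in `training/configs/grpo_small.yaml` (1, 32, 6) give an effective batch of 32, which is not a multiple of 6, so this check would reject them. Whether the launcher applies that file unchanged is not visible in this commit.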
scripts/modal_train_grpo.py CHANGED
@@ -1477,6 +1477,7 @@ def main(
  trace_log_every: int = 5,
  seed_start: int = 0,
  git_sha: str = "nogit",
+ run_name: str = "",
  source_mode: str = "local",
  repo_url: str = PUBLIC_REPO_URL,
  repo_branch: str = PUBLIC_REPO_BRANCH,
@@ -1565,7 +1566,7 @@ def main(
 
  model_slug = model_name.replace("/", "-")
  local_stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
- run_name = (
+ run_name = run_name or (
  f"CyberSecurity_OWASP-{model_slug}-grpo-level{difficulty}-"
  f"{local_stamp}-{git_sha[:8]}"
  )
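Review note: the `run_name or (...)` change means an empty string falls back to the generated name, while any non-empty value is used verbatim. The standalone sketch below reproduces that behavior for illustration; the helper name `resolve_run_name`, the injectable `now` argument, and the model name `org/model` are hypothetical and do not appear in the script.

```python
from datetime import datetime, timezone
from typing import Optional


def resolve_run_name(
    run_name: str,
    model_name: str,
    difficulty: int,
    git_sha: str = "nogit",
    now: Optional[datetime] = None,  # injectable clock, for testing only
) -> str:
    """Use the caller's run name if given, else build the default one."""
    if run_name:
        return run_name
    now = now or datetime.now(timezone.utc)
    model_slug = model_name.replace("/", "-")
    local_stamp = now.strftime("%Y%m%d-%H%M%S")
    return (
        f"CyberSecurity_OWASP-{model_slug}-grpo-level{difficulty}-"
        f"{local_stamp}-{git_sha[:8]}"
    )


fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(resolve_run_name("", "org/model", 0, now=fixed))
print(resolve_run_name("my-trial", "org/model", 0, now=fixed))
```

Because the fallback relies on truthiness, there is no way to request a literally empty run name. That seems to be the intent here, since `""` is the declared default.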
training/configs/grpo_small.yaml CHANGED
@@ -6,6 +6,9 @@ episodes: 10
  num_generations: 6
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 32
+ use_vllm: false
+ vllm_gpu_memory_utilization: 0.2
+ trace_log_every: 5
  learning_rate: 0.000005
  report_to: trackio
  trackio_space_id: Humanlearning/CyberSecurity_OWASP-trackio
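Review note: the README text describes `trace_log_every` as a throttle where `1` logs on every reward callback and `0` disables trace artifacts. One way such a gate could be written is sketched below. The function name `should_log_trace`, the zero-based `callback_index`, and the choice to log on exact multiples are all assumptions; the commit does not show the actual gating code.

```python
def should_log_trace(callback_index: int, trace_log_every: int) -> bool:
    """Decide whether trace tables should be logged for this callback.

    callback_index: zero-based count of reward callbacks so far (assumed).
    trace_log_every: 0 disables traces, 1 logs every callback,
                     N logs every N-th callback.
    """
    if trace_log_every < 0:
        raise ValueError("trace_log_every must be >= 0")
    if trace_log_every == 0:
        return False
    return callback_index % trace_log_every == 0


# With the configured default of 5, callbacks 0, 5 and 10 would log traces.
print([i for i in range(11) if should_log_trace(i, 5)])
```

Scalar metrics would bypass this gate entirely, which matches the README statement that they still log on every reward callback.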