Humanlearning committed on
Commit
0b1f1bd
·
verified ·
1 Parent(s): 401c7b1

Sync README training commands

  1. README.md +524 -454
README.md CHANGED
@@ -1,323 +1,225 @@
1
- ---
2
- title: CyberSecurity_OWASP Environment Server
3
- emoji: 🛡️
4
- colorFrom: blue
5
- colorTo: gray
6
- sdk: docker
7
- pinned: false
8
- app_port: 8000
9
- base_path: /web
10
- tags:
11
- - openenv
12
- - cybersecurity
13
- - owasp
14
- ---
15
-
16
- # CyberSecurity_OWASP
17
-
18
- [Hugging Face Space](https://huggingface.co/spaces/Humanlearning/CyberSecurity_OWASP) | [Mini-blog](blog/blog.md)
19
-
20
- `CyberSecurity_OWASP` is an OpenEnv-compliant reinforcement-learning environment for a single LLM agent that performs a defensive authorization-repair workflow:
21
-
22
- ```text
23
- inspect generated app + policy -> discover authorization bug -> submit diagnosis -> patch code -> preserve intended behavior
24
- ```
25
-
26
- The current implementation includes a functional closed-loop MVP scenario: an invoices FastAPI-style app with one injected OWASP A01 BOLA/IDOR defect, config-driven curriculum settings, cache-backed scenario reset, an ephemeral app sandbox, multi-layer deterministic verifier checks, anti-cheat safeguards, JSONL episode artifacts, and decomposed reward.
27
-
28
- ## Diagrams
29
-
30
- [Architecture diagram](assets/architecture_diagram.svg) | [RL training flow diagram](assets/env_rl_training_flow_diagram.svg)
31
-
32
- ![CyberSecurity_OWASP architecture](assets/architecture_diagram.svg)
33
-
34
- ![CyberSecurity_OWASP RL training flow](assets/env_rl_training_flow_diagram.svg)
35
-
36
- Editable Mermaid sources are available in `assets/architecture_diagram.mmd` and `assets/env_rl_training_flow_diagram.mmd`.
37
-
38
- ## Quick Start
39
-
40
- ```bash
41
- uv sync --extra dev
42
- uv run --extra dev pytest
43
- uv run python scripts/generate_scenario_cache.py --train-per-bucket 3 --validation-per-bucket 3 --heldout-per-bucket 3
44
- uv run server --port 8000
45
- ```
46
-
47
- Then connect with the OpenEnv client:
48
-
49
- ```python
50
- from CyberSecurity_OWASP import CyberSecurityOWASPAction, CyberSecurityOWASPEnv
51
-
52
- with CyberSecurityOWASPEnv(base_url="http://localhost:8000") as env:
53
-     result = env.reset(seed=7)
54
-     print(result.observation.task_brief)
55
-     result = env.step(CyberSecurityOWASPAction(tool_name="list_routes"))
56
-     print(result.observation.last_tool_result)
57
- ```
58
-
59
- ## Action Space
60
-
61
- The agent emits one JSON action at a time:
62
-
63
- ```json
64
- {"tool_name":"read_file","arguments":{"path":"app/routes/invoices.py"}}
65
- ```
66
-
67
- Supported tools:
68
-
69
- - `inspect_policy_graph`
70
- - `list_routes`
71
- - `read_openapi`
72
- - `read_file`
73
- - `search_code`
74
- - `send_local_request`
75
- - `compare_identities`
76
- - `submit_diagnosis`
77
- - `patch_file`
78
- - `run_visible_tests`
79
- - `submit_fix`
80
- - `noop`
81
-
82
- Tools are phase-gated:
83
-
84
- - `discover`: inspect policy/routes/files, run safe local requests, compare identities, submit diagnosis.
85
- - `patch`: read/search, patch editable app files, run visible tests, submit final fix.
86
- - `done`: stable terminal observation only.
87
-
88
- ## Reward
89
-
90
- Terminal reward uses stable components:
91
-
92
- ```python
93
- {
94
- "discovery": 0.0,
95
- "security": 0.0,
96
- "regression": 0.0,
97
- "public_routes": 0.0,
98
- "patch_quality": 0.0,
99
- "visible_tests": 0.0,
100
- "safety": 0.0,
101
- "anti_cheat": 0.0,
102
- "terminal_total": 0.0,
103
- "progressive": 0.0,
104
- "step_penalty": 0.0,
105
- "speed_bonus": 0.0,
106
- "token_penalty": 0.0,
107
- "behavior_penalty": 0.0,
108
- "train_total": 0.0,
109
- "total": 0.0,
110
- }
111
- ```
112
-
113
- The verifier rewards blocking the hidden exploit while preserving legitimate owner/admin behavior and intentionally public routes. Terminal scoring requires visible checks, hidden authorization checks, a policy-oracle matrix, regression checks, public-route preservation, and patch-quality checks. It penalizes deny-all fixes, hardcoded IDs, repeated/invalid action patterns, hidden file probes, external URL attempts, and test/fixture tampering.
114
-
115
- Training can enable dense rewards with `CYBERSECURITY_OWASP_REWARD_MODE=dense_train`.
116
- Dense mode adds configurable progressive rewards, small efficiency penalties, and capped behavior penalties from `training/configs/grpo_small.yaml`; evaluation defaults to sparse terminal scoring.
117
-
118
- ## Scenario Cache And Generation
119
-
120
- Scenario generation is an offline/cache-prep concern. `reset(seed)` asks the `CurriculumController` for a difficulty tier and target weakness, then loads a validated executable bundle from the scenario cache when `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`. Local development defaults to `fallback`, which compiles deterministically on a cache miss.
121
-
122
- The scenario/curriculum author is config-driven through `configs/scenario_authoring.small.json`. The default offline author model is `deepseek-ai/DeepSeek-V4-Pro` with Hugging Face provider settings, thinking mode enabled, `temperature=1.0`, and `top_p=1.0`. This model config is for scenario authoring, not the RL policy model.
123
-
124
- The cache bundle contract is:
125
-
126
- - `scenario.json`
127
- - `app_source/`
128
- - `policy_graph.json`
129
- - `visible_tests.py`
130
- - `hidden_tests.py`
131
- - `oracle_tests.py`
132
- - `expected_exploit_trace.json`
133
- - `reward_config.json`
134
- - `metadata.json`
135
-
136
- Cache keys include difficulty, authorization bug type, app family, framework, policy shape, tenant model, exploit depth, patch scope, regression risk, generator version, verifier version, and scenario hash.
137
-
138
- The MVP compiler currently generates:
139
-
140
- - invoices domain policy graph;
141
- - bounded adversarial target metadata such as same-role cross-object access, cross-tenant access, public-route overlocking traps, alternate route/service reachability, or visible-test-only edge cases;
142
- - randomized users, tenants, invoices, and IDs;
143
- - generated app files under `app/`;
144
- - visible tests under `tests/test_visible.py`;
145
- - hidden facts, oracle tuples, scenario family metadata, and verifier targets kept out of observations.
146
-
147
- Additional domains and bug families are scaffolded for extension.
148
-
149
- ## Runtime Components
150
-
151
- The OpenEnv runtime is split into small server modules:
152
-
153
- - `server/curriculum.py` tracks mastery, weak spots, reward trend, and difficulty tier.
154
- - `server/scenario_cache.py` writes and loads validated executable scenario bundles.
155
- - `server/adversarial_designer.py` chooses safe synthetic scenario targets from tracked weaknesses.
156
- - `server/scenario_factory.py` compiles the generated app during cache prep or local fallback.
157
- - `server/app_sandbox.py` handles editable workspace reads, patches, local requests, and OpenAPI summaries.
158
- - `server/action_tools.py` dispatches typed tools through the sandbox.
159
- - `server/authz_oracle.py` builds the hidden allowed/denied user-resource-action matrix.
160
- - `server/verifier.py` aggregates visible tests, hidden tests, oracle matrix, regression/public-route checks, and patch quality.
161
- - `server/episode_logger.py` appends JSONL rollouts under `outputs/rollouts/`.
162
-
163
- The agent sees partial observations only: product rules, fixture aliases, route summaries, visible test results, and action errors. Hidden tests, oracle tuples, injected bug labels, and held-out scenario-family labels stay internal.
164
-
165
- ## Testing
166
-
167
- ```bash
168
- uv run --extra dev pytest
169
- ```
170
-
171
- The suite covers model serialization, reset/step/state behavior, seed reproducibility, invalid actions, reward outcomes, anti-cheat checks, scripted rollout policies, curriculum selection, adversarial targeting, held-out scenario families, oracle checks, verifier aggregation, and episode artifact logging.
172
-
173
- ## Training Scaffold
174
-
175
- Training files are under `training/`:
176
-
177
- - `rollout.py`
178
- - `reward_funcs.py`
179
- - `train_grpo.py`
180
- - `eval_before_after.py`
181
- - `trackio_utils.py`
182
- - `configs/grpo_small.yaml`
183
-
184
- The training scaffold is intentionally minimal until the environment/verifier behavior is stable. Trackio metric names and GRPO defaults follow the project brief.
185
-
186
  `training/train_grpo.py` in this repo is a config helper only; it does not execute training locally.
187
  Use the Modal launchers in `scripts/modal_train_grpo.py` (persistent) and
188
  `scripts/modal_ephemeral_train.py` (smoke) for real GRPO runs.
189
 
190
- Modal smoke and GRPO runs use `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require` and mount the persistent `CyberSecurity_OWASP-scenario-cache` volume. Prepare that cache before smoke/training:
191
-
192
- ```bash
193
- uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
194
- uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
195
- ```
196
-
197
- If the cache slice is missing or below the configured per-bucket minimum, Modal training fails before rollouts rather than compiling scenarios during the run.
198
- The persistent GRPO launcher runs a CPU-only scenario-cache preflight before it starts the L4 GPU function, so missing cache coverage fails before GPU allocation.
199
-
200
- ## Trackio Run Tracking
201
-
202
- Trackio is the default tracker for official runs. Set `TRACKIO_SPACE_ID` to log to a hosted Hugging Face Trackio Space; otherwise Trackio records locally.
203
-
204
- ```bash
205
- export TRACKIO_SPACE_ID=<hf-user>/CyberSecurity_OWASP-trackio
206
- export TRACKIO_PROJECT=CyberSecurity_OWASP-grpo
207
- ```
208
-
209
- Use the tracked smoke wrapper instead of invoking pytest directly when producing run artifacts:
210
-
211
- ```bash
212
- bash scripts/smoke_test.sh
213
- uv run python scripts/track_pytest.py tests
214
- ```
215
-
216
- Evaluation summaries saved through `training.eval_before_after.save_eval_summary(...)`, Modal smoke runs, and GRPO training configs all initialize Trackio runs with CyberSecurity_OWASP run names.
217
-
218
- Training, baseline, and smoke runs also log the effective reward config at step
219
- 0. In Trackio, open **Media & Tables** and select the `reward_config` table to
220
- see the actual values for each reward key, including stage-specific values,
221
- caps, thresholds, terminate flags, and descriptions. Scalar metrics under
222
- `reward_config/<key>/<field>` expose the same numeric values for plotting and
223
- filtering, for example `reward_config/policy_inspected/value` and
224
- `reward_config/shaping_weight/resolved`.
225
-
226
- Each run config includes `reward_config_id`, `reward_config_hash`,
227
- `reward_config_source`, `reward_mode`, and `reward_stage`. For manual ablations,
228
- compare runs with the same scenario/model settings and different
229
- `reward_config_hash` values to see which reward weights produced each training
230
- curve.
231
 
232
- ## Modal Ephemeral Runs
 
 
233
 
234
- Modal Labs support is kept in a separate launcher script so the local OpenEnv server and core training scaffold stay unchanged.
235
-
236
- Install the optional local Modal client:
237
 
238
  ```bash
239
  uv sync --extra modal
 
 
240
  ```
241
 
242
- Run a temporary Modal app for a cheap environment/training smoke check:
243
-
244
- ```bash
245
- uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
246
- uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode smoke --episodes 4
247
- ```
248
-
249
- The app is ephemeral: Modal starts it for the command and stops it when the command exits. The remote result is written locally under `outputs/rollouts/` and the summary metrics are logged to Trackio.
250
-
251
- You can also validate the GRPO config construction remotely:
252
-
253
- ```bash
254
- uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode grpo-config
255
- ```
256
-
257
- The shell wrapper is equivalent:
258
-
259
- ```bash
260
- MODE=smoke EPISODES=4 uv run --extra modal bash scripts/modal_run_ephemeral.sh
261
- ```
262
-
263
- ## Synthetic SFT Before GRPO
264
-
265
- Use supervised fine-tuning to warm-start `unsloth/gemma-4-E2B-it` before GRPO.
266
- The SFT generator executes every teacher action in the real environment and
267
- keeps only trajectories that pass the deterministic reward verifier.
268
-
269
- Generate a 300-train-episode curriculum SFT dataset across levels `0,1,2,3`:
270
 
271
  ```bash
272
  uv run python scripts/generate_sft_dataset.py \
273
  --teacher-model deepseek-ai/DeepSeek-V4-Pro \
274
  --target-model unsloth/gemma-4-E2B-it \
275
  --difficulty-levels 0,1,2,3 \
276
- --difficulty-buckets 4 \
277
  --episodes 75 \
278
  --validation-episodes 20 \
279
  --workers 8 \
280
  --out-dir outputs/sft
281
- ```
282
-
283
- `--episodes` is per difficulty level when `--difficulty-levels` is set, so
284
- `--episodes 75` across four levels gives 300 total train episodes. Expect
285
- roughly 2,400-4,500 chat-format JSONL rows because each successful trajectory
286
- contributes one row per action step. The script writes JSONL rows under
287
- `outputs/sft/`, trajectory artifacts under `outputs/sft/trajectories/`, a
288
- dataset card at `outputs/sft/README.md`, and `outputs/sft/manifest.json` with
289
- reward summaries and curriculum coverage.
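The episode and row counts above can be checked with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical and not part of `scripts/generate_sft_dataset.py`, and it assumes every generated episode passes the verifier, so the results are upper bounds on what the generator keeps.

```python
def total_train_episodes(episodes_per_level: int, difficulty_levels: str) -> int:
    """Total train episodes when --episodes applies per difficulty level."""
    levels = [part for part in difficulty_levels.split(",") if part.strip()]
    return episodes_per_level * len(levels)


def implied_steps_per_trajectory(total_rows: int, total_episodes: int) -> float:
    """Average action steps per trajectory implied by a JSONL row count."""
    return total_rows / total_episodes


if __name__ == "__main__":
    # Values taken from the command and row estimate in this README.
    episodes = total_train_episodes(75, "0,1,2,3")
    print(episodes)
    print(implied_steps_per_trajectory(2400, episodes))
    print(implied_steps_per_trajectory(4500, episodes))
```

Read this way, the quoted 2,400-4,500 row estimate corresponds to roughly 8 to 15 action steps per kept trajectory.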
290
 
291
- Verify reward metadata before any training run:
292
-
293
- ```bash
294
  uv run python scripts/generate_sft_dataset.py \
295
  --verify-only \
296
  --difficulty-levels 0,1,2,3 \
297
  --out-dir outputs/sft
298
  ```
299
 
300
- Push the verified dataset to Hugging Face Hub:
301
-
302
- ```bash
303
- uv run python scripts/generate_sft_dataset.py \
304
- --push-only \
305
- --difficulty-levels 0,1,2,3 \
306
- --out-dir outputs/sft \
307
- --dataset-repo-id Humanlearning/CyberSecurity_OWASP-sft-dataset
308
- ```
309
-
310
- The canonical dataset repo name is
311
- `Humanlearning/CyberSecurity_OWASP-sft-dataset`. The upload is refused if
312
- reward verification fails or `HF_TOKEN` is missing.
313
-
314
- You can also generate and push in one command by adding `--push-to-hub` to the
315
- generation command.
316
-
317
- For local CI or smoke checks, add `--dry-run-oracle`; official SFT data should
318
- use the teacher path and still pass the verifier gate above.
319
-
320
- Launch SFT on Modal after reward verification passes:
321
 
322
  ```bash
323
  uv run --extra modal modal run --detach scripts/modal_train_sft.py \
@@ -332,184 +234,352 @@ uv run --extra modal modal run --detach scripts/modal_train_sft.py \
332
  --detach
333
  ```
334
 
335
- `scripts/modal_train_sft.py` re-checks the JSONL reward metadata locally before
336
- upload and again inside Modal before loading the model. It refuses to start SFT
337
- unless all required curriculum difficulties are represented and the verifier
338
- reward metadata passes. The default SFT config trains the full dataset
339
- (`--max-steps -1`) with bf16/tf32, LoRA rank 32, and Modal GPU fallback
340
- `H200 -> H100 -> A100-80GB -> L40S`. TRL does not support packing or
341
- assistant-only loss for the Gemma 4 vision-language loader, so both remain
342
- disabled for this model. The script pre-tokenizes the small JSONL dataset
343
- serially before constructing `SFTTrainer`, which avoids TRL multiprocessing
344
- around the Gemma/Unsloth config object. It also uses the base Transformers loss
345
- path to avoid a TRL entropy-metric incompatibility with Gemma 4 lazy logits. A
346
- warm run for the 300-400 episode dataset should usually finish in about 20-60
347
- minutes; first image or model-cache builds can push that closer to 45-90
348
- minutes.
349
-
350
- Continue GRPO from the SFT LoRA:
351
-
352
- The GRPO launcher downloads the Hub adapter, attaches a matching trainable
353
- Unsloth LoRA to Gemma 4, and then loads the adapter safetensors. This keeps the
354
- SFT handoff compatible with Gemma 4's Unsloth linear wrappers.
355
 
356
  ```bash
357
  uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
358
  --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
359
- --max-steps 300 \
360
- --dataset-size 64 \
361
- --num-generations 8 \
362
- --difficulty 0 \
363
- --trace-log-every 10 \
364
- --detach
365
- ```
366
-
367
- ## Modal GRPO Training
368
-
369
- The persistent GPU training launcher packages this local repo into Modal, trains
370
- a small LoRA GRPO run, logs metrics and traces to Trackio, stores checkpoints in
371
- the `CyberSecurity_OWASP-grpo-runs` Modal volume, and pushes the output adapter
372
- to Hugging Face Hub.
373
-
374
- Create a Modal secret named `CyberSecurity_OWASP-secrets` with `HF_TOKEN`, then
375
- run the import/config check:
376
-
377
- ```bash
378
- uv run --extra modal modal run scripts/modal_train_grpo.py --mode config
379
- ```
380
-
381
- Run the default smoke GRPO job:
382
-
383
- ```bash
384
- uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
385
- uv run --extra modal modal run scripts/modal_train_grpo.py \
386
- --max-steps 10 \
387
- --dataset-size 16 \
388
- --num-generations 6 \
389
- --difficulty 0
390
- ```
391
-
392
- For GPU-utilization tuning on the same single L4, start with a larger but still
393
- bounded no-code trial:
394
-
395
- ```bash
396
- uv run --extra modal modal run scripts/modal_train_grpo.py \
397
- --max-steps 30 \
398
- --dataset-size 64 \
399
- --num-generations 8 \
400
- --max-completion-length 256 \
401
- --difficulty 0
402
- ```
403
-
404
- The launcher exposes GRPO throughput knobs for follow-up trials:
405
-
406
- ```bash
407
- # larger generation group, no vLLM
408
- uv run --extra modal modal run scripts/modal_train_grpo.py \
409
- --max-steps 30 --dataset-size 64 --num-generations 8 \
410
- --max-completion-length 256 --trace-log-every 5
411
-
412
- # vLLM colocate on the same L4
413
- uv run --extra modal modal run scripts/modal_train_grpo.py \
414
- --max-steps 30 --dataset-size 64 --num-generations 8 \
415
- --max-completion-length 256 --use-vllm \
416
- --vllm-gpu-memory-utilization 0.35 --trace-log-every 5
417
-
418
- # larger microbatch if the vLLM trial does not OOM
419
- uv run --extra modal modal run scripts/modal_train_grpo.py \
420
- --max-steps 30 --dataset-size 64 --num-generations 8 \
421
- --per-device-train-batch-size 2 --gradient-accumulation-steps 4 \
422
- --max-completion-length 256 --use-vllm \
423
- --vllm-gpu-memory-utilization 0.45 --trace-log-every 5
424
- ```
425
-
426
- `per_device_train_batch_size * gradient_accumulation_steps * world_size` must
427
- be divisible by `num_generations`; the launcher validates this before the GPU
428
- container starts. Scalar Trackio metrics still log every reward callback, while
429
- sample trace tables and Trace objects are throttled by `--trace-log-every`
430
- (`1` restores every-callback logging, `0` disables trace artifacts).
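The divisibility rule above can be checked locally before launching a job. This is a minimal sketch of the stated constraint, not the launcher's actual validation code; the function name is hypothetical, and `world_size=1` is an assumption that matches the single-L4 setup described here.

```python
def check_generation_divisibility(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_generations: int,
    world_size: int = 1,  # assumption: one GPU, as in the single-L4 runs
) -> int:
    """Return the effective batch size, or raise if the rule is violated."""
    if num_generations <= 0:
        raise ValueError("num_generations must be positive")
    effective = per_device_train_batch_size * gradient_accumulation_steps * world_size
    if effective % num_generations != 0:
        raise ValueError(
            f"effective batch size {effective} is not divisible by "
            f"num_generations={num_generations}"
        )
    return effective


if __name__ == "__main__":
    # Third trial above: batch size 2, accumulation 4, 8 generations.
    print(check_generation_divisibility(2, 4, 8))
```

With the values from the larger-microbatch trial (2 x 4 x 1 = 8), the check passes for `--num-generations 8`; a batch size of 1 with the same accumulation would fail it.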
431
-
432
- ### Parallel Modal GRPO Runs
433
-
434
- Parallel Modal GRPO runs are safe when each run has its own seed range, run
435
- name, and output target, while the shared cache volumes remain read-only.
436
- Before launching another job, check what is already active:
437
-
438
- ```bash
439
- uv run --extra modal modal app list
440
- uv run --extra modal modal app logs <app-id>
441
- ```
442
-
443
- Launch long-running parallel jobs with both Modal CLI detach and the launcher
444
- detach flag. The CLI-level `--detach` keeps the remote function alive after the
445
- local entrypoint exits; the launcher `--detach` prevents the parent Modal
446
- function from waiting on the GPU call.
447
-
448
- ```bash
449
- uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
450
  --max-steps 300 \
451
  --dataset-size 64 \
452
  --num-generations 8 \
453
  --max-completion-length 768 \
454
  --difficulty 0 \
455
  --trace-log-every 10 \
456
- --seed-start 10000 \
 
457
  --detach
458
  ```
459
 
460
- For multiple concurrent experiments:
461
-
462
- - Use a unique `--seed-start` range for every run, normally spaced by at least
463
- 10,000 seeds.
464
- - Keep `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`; do not compile
465
- scenarios during training.
466
- - Do not run `prepare-cache --cache-force` while training jobs are active.
467
- - Keep `--push-to-hub` disabled unless each run has a unique
468
- `--output-repo-id`.
469
- - Let the launcher generate unique timestamped Trackio run names, or set an
470
- explicit `RUN_NAME` only when it is globally unique.
471
- - Use the same Trackio Space/project for comparable metrics, but never reuse a
472
- run name.
473
- - Treat `CyberSecurity_OWASP-model-cache` and
474
- `CyberSecurity_OWASP-scenario-cache` as shared read-mostly infrastructure
475
- during training. Run outputs and checkpoints should stay under each run's
476
- unique output directory.
477
-
478
- If a Windows shell fails with a Unicode `charmap` encoding error during Modal
479
- startup, rerun with UTF-8 enabled for that command:
480
 
481
  ```powershell
482
- $env:PYTHONIOENCODING='utf-8'; $env:PYTHONUTF8='1'; uv run --extra modal modal run --detach scripts/modal_train_grpo.py --max-steps 300 --dataset-size 64 --num-generations 4 --max-completion-length 768 --difficulty 0 --trace-log-every 10 --seed-start 60000 --detach
483
  ```
484
 
485
- If running from a public repository and you do not want Modal to package the
486
- local workspace, use public source mode:
487
-
488
- ```bash
489
- uv run --extra modal modal run scripts/modal_train_grpo.py \
490
- --source-mode public \
491
- --repo-url https://github.com/humandotlearning/CyberSecurity_OWASP.git \
492
- --repo-branch master \
493
- --max-steps 10 \
494
- --dataset-size 16 \
495
- --num-generations 6 \
496
- --difficulty 0
497
- ```
498
-
499
- Defaults are derived from `HF_TOKEN`:
500
-
501
- - Trackio Space: `<hf-user>/CyberSecurity_OWASP-trackio`
502
- - Trackio project: `CyberSecurity_OWASP-grpo`
503
- - Training model: `unsloth/gemma-4-E2B-it`
504
- - Output repo: `<hf-user>/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-grpo-lora`
505
-
506
- Override these with `--trackio-space-id`, `--trackio-project`, and
507
- `--output-repo-id` when needed. The persistent GRPO launcher intentionally rejects non-Gemma model overrides so smoke runs match the Unsloth Gemma 4 E2B RL notebook.
508
-
509
- ## Docker / Spaces
510
 
511
  ```bash
512
- docker build -t CyberSecurity_OWASP:latest -f server/Dockerfile .
513
- docker run --rm -p 8000:8000 CyberSecurity_OWASP:latest
514
- openenv push --repo-id <username>/CyberSecurity_OWASP
515
- ```
1
+ ---
2
+ title: CyberSecurity_OWASP Environment Server
3
+ emoji: 🛡️
4
+ colorFrom: blue
5
+ colorTo: gray
6
+ sdk: docker
7
+ pinned: false
8
+ app_port: 8000
9
+ base_path: /web
10
+ tags:
11
+ - openenv
12
+ - cybersecurity
13
+ - owasp
14
+ ---
15
+
16
+ # CyberSecurity_OWASP
17
+
18
+ [Hugging Face Space](https://huggingface.co/spaces/Humanlearning/CyberSecurity_OWASP) | [Mini-blog](blog/blog.md)
19
+
20
+ `CyberSecurity_OWASP` is an OpenEnv-compliant reinforcement-learning environment for a single LLM agent that performs a defensive authorization-repair workflow:
21
+
22
+ ```text
23
+ inspect generated app + policy -> discover authorization bug -> submit diagnosis -> patch code -> preserve intended behavior
24
+ ```
25
+
26
+ The current implementation includes a functional closed-loop MVP scenario: an invoices FastAPI-style app with one injected OWASP A01 BOLA/IDOR defect, config-driven curriculum settings, cache-backed scenario reset, an ephemeral app sandbox, multi-layer deterministic verifier checks, anti-cheat safeguards, JSONL episode artifacts, and decomposed reward.
27
+
28
+ ## Diagrams
29
+
30
+ [Architecture diagram](assets/architecture_diagram.svg) | [RL training flow diagram](assets/env_rl_training_flow_diagram.svg)
31
+
32
+ ![CyberSecurity_OWASP architecture](assets/architecture_diagram.svg)
33
+
34
+ ![CyberSecurity_OWASP RL training flow](assets/env_rl_training_flow_diagram.svg)
35
+
36
+ Editable Mermaid sources are available in `assets/architecture_diagram.mmd` and `assets/env_rl_training_flow_diagram.mmd`.
37
+
38
+ ## Quick Start
39
+
40
+ ```bash
41
+ uv sync --extra dev
42
+ uv run --extra dev pytest
43
+ uv run python scripts/generate_scenario_cache.py --train-per-bucket 3 --validation-per-bucket 3 --heldout-per-bucket 3
44
+ uv run server --port 8000
45
+ ```
46
+
47
+ Then connect with the OpenEnv client:
48
+
49
+ ```python
50
+ from CyberSecurity_OWASP import CyberSecurityOWASPAction, CyberSecurityOWASPEnv
51
+
52
+ with CyberSecurityOWASPEnv(base_url="http://localhost:8000") as env:
53
+     result = env.reset(seed=7)
54
+     print(result.observation.task_brief)
55
+     result = env.step(CyberSecurityOWASPAction(tool_name="list_routes"))
56
+     print(result.observation.last_tool_result)
57
+ ```
58
+
59
+ ## Action Space
60
+
61
+ The agent emits one JSON action at a time:
62
+
63
+ ```json
64
+ {"tool_name":"read_file","arguments":{"path":"app/routes/invoices.py"}}
65
+ ```
66
+
67
+ Supported tools:
68
+
69
+ - `inspect_policy_graph`
70
+ - `list_routes`
71
+ - `read_openapi`
72
+ - `read_file`
73
+ - `search_code`
74
+ - `send_local_request`
75
+ - `compare_identities`
76
+ - `submit_diagnosis`
77
+ - `patch_file`
78
+ - `run_visible_tests`
79
+ - `submit_fix`
80
+ - `noop`
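As a minimal sketch, an agent harness could check a tool name against the list above before serializing an action. The `build_action` helper is hypothetical and is not the environment's own validation code; only the tool names and the `tool_name`/`arguments` JSON shape come from this README.

```python
import json

# Tool names copied from the "Supported tools" list in this README.
SUPPORTED_TOOLS = frozenset({
    "inspect_policy_graph",
    "list_routes",
    "read_openapi",
    "read_file",
    "search_code",
    "send_local_request",
    "compare_identities",
    "submit_diagnosis",
    "patch_file",
    "run_visible_tests",
    "submit_fix",
    "noop",
})


def build_action(tool_name, arguments=None):
    """Hypothetical helper: serialize one action, rejecting unknown tools."""
    if tool_name not in SUPPORTED_TOOLS:
        raise ValueError(f"unsupported tool: {tool_name!r}")
    payload = {"tool_name": tool_name, "arguments": arguments or {}}
    # Compact separators reproduce the single-line JSON style shown above.
    return json.dumps(payload, separators=(",", ":"))


if __name__ == "__main__":
    print(build_action("read_file", {"path": "app/routes/invoices.py"}))
```

Phase gating is not modeled here; a tool that passes this check may still be rejected by the environment in the wrong phase.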
81
+
82
+ Tools are phase-gated:
83
+
84
+ - `discover`: inspect policy/routes/files, run safe local requests, compare identities, submit diagnosis.
85
+ - `patch`: read/search, patch editable app files, run visible tests, submit final fix.
86
+ - `done`: stable terminal observation only.
87
+
88
+ ## Reward
89
+
90
+ Terminal reward uses stable components:
91
+
92
+ ```python
93
+ {
94
+ "discovery": 0.0,
95
+ "security": 0.0,
96
+ "regression": 0.0,
97
+ "public_routes": 0.0,
98
+ "patch_quality": 0.0,
99
+ "visible_tests": 0.0,
100
+ "safety": 0.0,
101
+ "anti_cheat": 0.0,
102
+ "terminal_total": 0.0,
103
+ "progressive": 0.0,
104
+ "step_penalty": 0.0,
105
+ "speed_bonus": 0.0,
106
+ "token_penalty": 0.0,
107
+ "behavior_penalty": 0.0,
108
+ "train_total": 0.0,
109
+ "total": 0.0,
110
+ }
111
+ ```
112
+
113
+ The verifier rewards blocking the hidden exploit while preserving legitimate owner/admin behavior and intentionally public routes. Terminal scoring requires visible checks, hidden authorization checks, a policy-oracle matrix, regression checks, public-route preservation, and patch-quality checks. It penalizes deny-all fixes, hardcoded IDs, repeated/invalid action patterns, hidden file probes, external URL attempts, and test/fixture tampering.
114
+
115
+ Training can enable dense rewards with `CYBERSECURITY_OWASP_REWARD_MODE=dense_train`.
116
+ Dense mode adds configurable progressive rewards, small efficiency penalties, and capped behavior penalties from `training/configs/grpo_small.yaml`; evaluation defaults to sparse terminal scoring.
117
+
118
+ ## Scenario Cache And Generation
119
+
120
+ Scenario generation is an offline/cache-prep concern. `reset(seed)` asks the `CurriculumController` for a difficulty tier and target weakness, then loads a validated executable bundle from the scenario cache when `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`. Local development defaults to `fallback`, which compiles deterministically on a cache miss.
121
+
122
+ The scenario/curriculum author is config-driven through `configs/scenario_authoring.small.json`. The default offline author model is `deepseek-ai/DeepSeek-V4-Pro` with Hugging Face provider settings, thinking mode enabled, `temperature=1.0`, and `top_p=1.0`. This model config is for scenario authoring, not the RL policy model.
123
+
124
+ The cache bundle contract is:
125
+
126
+ - `scenario.json`
127
+ - `app_source/`
128
+ - `policy_graph.json`
129
+ - `visible_tests.py`
130
+ - `hidden_tests.py`
131
+ - `oracle_tests.py`
132
+ - `expected_exploit_trace.json`
133
+ - `reward_config.json`
134
+ - `metadata.json`
135
+
136
+ Cache keys include difficulty, authorization bug type, app family, framework, policy shape, tenant model, exploit depth, patch scope, regression risk, generator version, verifier version, and scenario hash.
137
+
138
+ The MVP compiler currently generates:
139
+
140
+ - invoices domain policy graph;
141
+ - bounded adversarial target metadata such as same-role cross-object access, cross-tenant access, public-route overlocking traps, alternate route/service reachability, or visible-test-only edge cases;
142
+ - randomized users, tenants, invoices, and IDs;
143
+ - generated app files under `app/`;
144
+ - visible tests under `tests/test_visible.py`;
145
+ - hidden facts, oracle tuples, scenario family metadata, and verifier targets kept out of observations.
146
+
147
+ Additional domains and bug families are scaffolded for extension.
148
+
149
+ ## Runtime Components
150
+
151
+ The OpenEnv runtime is split into small server modules:
152
+
153
+ - `server/curriculum.py` tracks mastery, weak spots, reward trend, and difficulty tier.
154
+ - `server/scenario_cache.py` writes and loads validated executable scenario bundles.
155
+ - `server/adversarial_designer.py` chooses safe synthetic scenario targets from tracked weaknesses.
156
+ - `server/scenario_factory.py` compiles the generated app during cache prep or local fallback.
157
+ - `server/app_sandbox.py` handles editable workspace reads, patches, local requests, and OpenAPI summaries.
158
+ - `server/action_tools.py` dispatches typed tools through the sandbox.
159
+ - `server/authz_oracle.py` builds the hidden allowed/denied user-resource-action matrix.
160
+ - `server/verifier.py` aggregates visible tests, hidden tests, oracle matrix, regression/public-route checks, and patch quality.
161
+ - `server/episode_logger.py` appends JSONL rollouts under `outputs/rollouts/`.
162
+
163
+ The agent sees partial observations only: product rules, fixture aliases, route summaries, visible test results, and action errors. Hidden tests, oracle tuples, injected bug labels, and held-out scenario-family labels stay internal.
164
+
165
+ ## Testing
166
+
167
+ ```bash
168
+ uv run --extra dev pytest
169
+ ```
170
+
171
+ The suite covers model serialization, reset/step/state behavior, seed reproducibility, invalid actions, reward outcomes, anti-cheat checks, scripted rollout policies, curriculum selection, adversarial targeting, held-out scenario families, oracle checks, verifier aggregation, and episode artifact logging.
172
+
173
+ ## Training Scaffold
174
+
175
+ Training files are under `training/`:
176
+
177
+ - `rollout.py`
178
+ - `reward_funcs.py`
179
+ - `train_grpo.py`
180
+ - `eval_before_after.py`
181
+ - `trackio_utils.py`
182
+ - `configs/grpo_small.yaml`
183
+
184
+ The training scaffold is intentionally minimal until the environment/verifier behavior is stable. Trackio metric names and GRPO defaults follow the project brief.
185
+
186
  `training/train_grpo.py` in this repo is a config helper only; it does not execute training locally.
187
  Use the Modal launchers in `scripts/modal_train_grpo.py` (persistent) and
188
  `scripts/modal_ephemeral_train.py` (smoke) for real GRPO runs.
189
 
190
+ ### Run SFT And GRPO Training Scripts
191
 
192
+ Training runs on Modal. Do not run the GRPO loop directly on the local machine;
193
+ use the launcher scripts so scenario cache preflight, Trackio logging, Modal
194
+ volumes, and Hub uploads stay consistent.
195
 
196
+ First install the Modal extra and prepare the scenario cache:
197
 
198
  ```bash
199
  uv sync --extra modal
200
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode config
201
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
202
  ```
203
 
204
+ Generate and verify SFT trajectories before supervised fine-tuning:
205
 
206
  ```bash
207
  uv run python scripts/generate_sft_dataset.py \
208
  --teacher-model deepseek-ai/DeepSeek-V4-Pro \
209
  --target-model unsloth/gemma-4-E2B-it \
210
  --difficulty-levels 0,1,2,3 \
211
  --episodes 75 \
212
  --validation-episodes 20 \
213
  --workers 8 \
214
  --out-dir outputs/sft
215
 
216
  uv run python scripts/generate_sft_dataset.py \
217
  --verify-only \
218
  --difficulty-levels 0,1,2,3 \
219
  --out-dir outputs/sft
220
  ```
221
 
222
+ Run SFT on Modal and push the warm-start LoRA:
223
 
224
  ```bash
225
  uv run --extra modal modal run --detach scripts/modal_train_sft.py \
 
234
  --detach
235
  ```
236
 
237
+ Continue with GRPO from the SFT adapter:
238
 
239
  ```bash
240
  uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
241
  --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
242
  --max-steps 300 \
243
  --dataset-size 64 \
244
  --num-generations 8 \
245
  --max-completion-length 768 \
246
  --difficulty 0 \
247
  --trace-log-every 10 \
248
+ --trackio-space-id Humanlearning/CyberSecurity_OWASP-trackio \
249
+ --trackio-project CyberSecurity_OWASP-grpo \
250
  --detach
251
  ```
252
 
253
+ For reward-rubric ablations, use the PowerShell launcher and configs under
254
+ `training/configs/reward_ablations/`:
255
 
256
  ```powershell
257
+ .\scripts\launch_reward_ablations.ps1
258
  ```
259
 
260
+ Modal smoke and GRPO runs use `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require` and mount the persistent `CyberSecurity_OWASP-scenario-cache` volume. Prepare that cache before smoke/training:
261
 
262
  ```bash
263
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
264
+ uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
265
+ ```
266
+
267
+ If the cache slice is missing or below the configured per-bucket minimum, Modal training fails before rollouts rather than compiling scenarios during the run.
268
+ The persistent GRPO launcher runs a CPU-only scenario-cache preflight before it starts the L4 GPU function, so missing cache coverage fails before GPU allocation.
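
The preflight idea above can be sketched in a few lines of Python. This is only an illustration: the function name, the `(split, difficulty)` bucket key, and the threshold are assumptions, not the actual code in `scripts/modal_train_grpo.py`.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical bucket key: (split, difficulty). The real cache layout may differ.
Bucket = Tuple[str, int]


def find_cache_gaps(
    cached: Iterable[Bucket],
    required: Iterable[Bucket],
    min_per_bucket: int,
) -> Dict[Bucket, int]:
    """Return {bucket: cached_count} for every required bucket below the minimum."""
    counts = Counter(cached)
    return {b: counts[b] for b in required if counts[b] < min_per_bucket}


# Example cache contents: three train scenarios, two validation, no heldout.
cached = [("train", 0)] * 3 + [("validation", 0)] * 2
required = [("train", 0), ("validation", 0), ("heldout", 0)]

gaps = find_cache_gaps(cached, required, min_per_bucket=3)
if gaps:
    # A real launcher would abort here, before requesting a GPU container.
    print(f"cache preflight failed: {gaps}")
```

Running a check like this on CPU first means a missing cache slice costs seconds rather than a GPU allocation.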
269
+
270
+ ## Trackio Run Tracking
271
+
272
+ Trackio is the default tracker for official runs. Set `TRACKIO_SPACE_ID` to log to a hosted Hugging Face Trackio Space; otherwise Trackio records locally.
273
+
274
+ ```bash
275
+ export TRACKIO_SPACE_ID=<hf-user>/CyberSecurity_OWASP-trackio
276
+ export TRACKIO_PROJECT=CyberSecurity_OWASP-grpo
277
+ ```
278
+
279
+ Use the tracked smoke wrapper instead of invoking pytest directly when producing run artifacts:
280
+
281
+ ```bash
282
+ bash scripts/smoke_test.sh
283
+ uv run python scripts/track_pytest.py tests
284
+ ```
285
+
286
+ Evaluation summaries saved through `training.eval_before_after.save_eval_summary(...)`, Modal smoke runs, and GRPO training configs all initialize Trackio runs with CyberSecurity_OWASP run names.
287
+
288
+ Training, baseline, and smoke runs also log the effective reward config at step
289
+ 0. In Trackio, open **Media & Tables** and select the `reward_config` table to
290
+ see the actual values for each reward key, including stage-specific values,
291
+ caps, thresholds, terminate flags, and descriptions. Scalar metrics under
292
+ `reward_config/<key>/<field>` expose the same numeric values for plotting and
293
+ filtering, for example `reward_config/policy_inspected/value` and
294
+ `reward_config/shaping_weight/resolved`.
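
To make the `reward_config/<key>/<field>` naming concrete, here is a minimal sketch of how a nested reward config could be flattened into scalar metric names. The helper name, the example values, and the choice to skip boolean flags are assumptions for illustration; the project's own logging code may behave differently.

```python
from typing import Any, Dict


def flatten_reward_config(config: Dict[str, Dict[str, Any]]) -> Dict[str, float]:
    """Map {key: {field: value}} to {"reward_config/<key>/<field>": value}.

    Only numeric fields are kept; strings such as descriptions are dropped.
    """
    metrics: Dict[str, float] = {}
    for key, fields in config.items():
        for field, value in fields.items():
            # bool is a subclass of int, so exclude it explicitly (an assumption).
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                metrics[f"reward_config/{key}/{field}"] = float(value)
    return metrics


# Hypothetical config values, for illustration only.
example = {
    "policy_inspected": {"value": 0.05, "description": "agent read the policy"},
    "shaping_weight": {"resolved": 0.5, "terminate": False},
}
metrics = flatten_reward_config(example)
```

With this layout, `metrics` contains `reward_config/policy_inspected/value` and `reward_config/shaping_weight/resolved`, matching the metric names mentioned above.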
295
+
296
+ Each run config includes `reward_config_id`, `reward_config_hash`,
297
+ `reward_config_source`, `reward_mode`, and `reward_stage`. For manual ablations,
298
+ compare runs with the same scenario/model settings and different
299
+ `reward_config_hash` values to see which reward weights produced each training
300
+ curve.
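
One common way to produce a stable identifier like `reward_config_hash` is to hash a canonical JSON serialization of the config. The sketch below shows that general technique; the serialization, the SHA-256 choice, and the 12-character truncation are assumptions, not a description of how this repository computes its hash.

```python
import hashlib
import json


def config_hash(config: dict, length: int = 12) -> str:
    """Hash a config via canonical JSON so key order does not change the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


# Hypothetical reward weights, for illustration only.
a = {"policy_inspected": 0.05, "shaping_weight": 0.5}
b = {"shaping_weight": 0.5, "policy_inspected": 0.05}  # same values, different order
c = {"policy_inspected": 0.10, "shaping_weight": 0.5}  # one weight changed

same_config = config_hash(a) == config_hash(b)
changed_config = config_hash(a) != config_hash(c)
```

Two runs with identical weights then share a hash regardless of key order, while any changed weight yields a different one, which is what makes the hash useful for grouping ablation curves.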
301
+
302
+ ## Modal Ephemeral Runs
303
+
304
+ Modal Labs support is kept in a separate launcher script so the local OpenEnv server and core training scaffold stay unchanged.
305
+
306
+ Install the optional local Modal client:
307
+
308
+ ```bash
309
+ uv sync --extra modal
310
+ ```
311
+
312
+ Run a temporary Modal app for a cheap environment/training smoke check:
313
+
314
+ ```bash
315
+ uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
316
+ uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode smoke --episodes 4
317
+ ```
318
+
319
+ The app is ephemeral: Modal starts it for the command and stops it when the command exits. The remote result is written locally under `outputs/rollouts/` and the summary metrics are logged to Trackio.
320
+
321
+ You can also validate the GRPO config construction remotely:
322
+
323
+ ```bash
324
+ uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode grpo-config
325
+ ```
326
+
327
+ The shell wrapper is equivalent:
328
+
329
+ ```bash
330
+ MODE=smoke EPISODES=4 uv run --extra modal bash scripts/modal_run_ephemeral.sh
331
+ ```
332
+
333
+ ## Synthetic SFT Before GRPO
334
+
335
+ Use supervised fine-tuning to warm-start `unsloth/gemma-4-E2B-it` before GRPO.
336
+ The SFT generator executes every teacher action in the real environment and
337
+ keeps only trajectories that pass the deterministic reward verifier.
338
+
339
+ Generate a 300-train-episode curriculum SFT dataset across levels `0,1,2,3`:
340
+
341
+ ```bash
342
+ uv run python scripts/generate_sft_dataset.py \
343
+ --teacher-model deepseek-ai/DeepSeek-V4-Pro \
344
+ --target-model unsloth/gemma-4-E2B-it \
345
+ --difficulty-levels 0,1,2,3 \
346
+ --difficulty-buckets 4 \
347
+ --episodes 75 \
348
+ --validation-episodes 20 \
349
+ --workers 8 \
350
+ --out-dir outputs/sft
351
+ ```
352
+
353
+ `--episodes` is per difficulty level when `--difficulty-levels` is set, so
354
+ `--episodes 75` across four levels gives 300 total train episodes. Expect
355
+ roughly 2,400-4,500 chat-format JSONL rows because each successful trajectory
356
+ contributes one row per action step. The script writes JSONL rows under
357
+ `outputs/sft/`, trajectory artifacts under `outputs/sft/trajectories/`, a
358
+ dataset card at `outputs/sft/README.md`, and `outputs/sft/manifest.json` with
359
+ reward summaries and curriculum coverage.
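
The episode and row counts above follow from simple arithmetic. In the sketch below, the 8 and 15 steps per trajectory are back-computed from the stated 2,400-4,500 row range and are not documented values; the estimate also assumes every episode passes the verifier.

```python
from typing import Sequence


def total_train_episodes(episodes_per_level: int, levels: Sequence[int]) -> int:
    """--episodes applies per difficulty level when --difficulty-levels is set."""
    return episodes_per_level * len(levels)


levels = [0, 1, 2, 3]
train_episodes = total_train_episodes(75, levels)

# One JSONL row per action step; the step counts are inferred, not documented.
min_steps, max_steps = 8, 15
low_rows = train_episodes * min_steps
high_rows = train_episodes * max_steps

print(f"{train_episodes} episodes, roughly {low_rows}-{high_rows} rows")
```

Trajectories that fail the verifier are dropped, so the real row count can come in below this estimate.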
360
+
361
+ Verify reward metadata before any training run:
362
+
363
+ ```bash
364
+ uv run python scripts/generate_sft_dataset.py \
365
+ --verify-only \
366
+ --difficulty-levels 0,1,2,3 \
367
+ --out-dir outputs/sft
368
+ ```
369
+
370
+ Push the verified dataset to Hugging Face Hub:
371
+
372
+ ```bash
373
+ uv run python scripts/generate_sft_dataset.py \
374
+ --push-only \
375
+ --difficulty-levels 0,1,2,3 \
376
+ --out-dir outputs/sft \
377
+ --dataset-repo-id Humanlearning/CyberSecurity_OWASP-sft-dataset
378
+ ```
379
+
380
+ The canonical dataset repo name is
381
+ `Humanlearning/CyberSecurity_OWASP-sft-dataset`. The upload is refused if
382
+ reward verification fails or `HF_TOKEN` is missing.
383
+
384
+ You can also generate and push in one command by adding `--push-to-hub` to the
385
+ generation command.
386
+
387
+ For local CI or smoke checks, add `--dry-run-oracle`; official SFT data should
388
+ use the teacher path and still pass the verifier gate above.
389
+
390
+ Launch SFT on Modal after reward verification passes:
391
+
392
+ ```bash
393
+ uv run --extra modal modal run --detach scripts/modal_train_sft.py \
394
+ --local-train-path outputs/sft/train.jsonl \
395
+ --local-validation-path outputs/sft/validation.jsonl \
396
+ --local-manifest-path outputs/sft/manifest.json \
397
+ --required-difficulties 0,1,2,3 \
398
+ --trackio-space-id Humanlearning/CyberSecurity_OWASP-trackio \
399
+ --trackio-project CyberSecurity_OWASP-sft \
400
+ --output-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
401
+ --push-to-hub \
402
+ --detach
403
+ ```
404
+
405
+ `scripts/modal_train_sft.py` re-checks the JSONL reward metadata locally before
406
+ upload and again inside Modal before loading the model. It refuses to start SFT
407
+ unless all required curriculum difficulties are represented and the verifier
408
+ reward metadata passes. The default SFT config trains the full dataset
409
+ (`--max-steps -1`) with bf16/tf32, LoRA rank 32, and Modal GPU fallback
410
+ `H200 -> H100 -> A100-80GB -> L40S`. TRL does not support packing or
411
+ assistant-only loss for the Gemma 4 vision-language loader, so both remain
412
+ disabled for this model. The script pre-tokenizes the small JSONL dataset
413
+ serially before constructing `SFTTrainer`, which avoids TRL multiprocessing
414
+ around the Gemma/Unsloth config object. It also uses the base Transformers loss
415
+ path to avoid a TRL entropy-metric incompatibility with Gemma 4 lazy logits. A
416
+ warm run for the 300-400 episode dataset should usually finish in about 20-60
417
+ minutes; first image or model-cache builds can push that closer to 45-90
418
+ minutes.
419
+
420
+ Continue GRPO from the SFT LoRA.
421
+
422
+ The GRPO launcher downloads the Hub adapter, attaches a matching trainable
423
+ Unsloth LoRA to Gemma 4, and then loads the adapter safetensors. This keeps the
424
+ SFT handoff compatible with Gemma 4's Unsloth linear wrappers.
425
+
426
+ ```bash
427
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
428
+ --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
429
+ --max-steps 300 \
430
+ --dataset-size 64 \
431
+ --num-generations 8 \
432
+ --difficulty 0 \
433
+ --trace-log-every 10 \
434
+ --detach
435
+ ```
436
+
437
+ ## Modal GRPO Training
438
+
439
+ The persistent GPU training launcher packages this local repo into Modal, trains
440
+ a small LoRA GRPO run, logs metrics and traces to Trackio, stores checkpoints in
441
+ the `CyberSecurity_OWASP-grpo-runs` Modal volume, and pushes the output adapter
442
+ to Hugging Face Hub.
443
+
444
+ Create a Modal secret named `CyberSecurity_OWASP-secrets` with `HF_TOKEN`, then
445
+ run the import/config check:
446
+
447
+ ```bash
448
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode config
449
+ ```
450
+
451
+ Run the default smoke GRPO job:
452
+
453
+ ```bash
454
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
455
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
456
+ --max-steps 10 \
457
+ --dataset-size 16 \
458
+ --num-generations 6 \
459
+ --difficulty 0
460
+ ```
461
+
462
+ For GPU-utilization tuning on the same single L4, start with a larger but still
463
+ bounded trial that needs no code changes:
464
+
465
+ ```bash
466
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
467
+ --max-steps 30 \
468
+ --dataset-size 64 \
469
+ --num-generations 8 \
470
+ --max-completion-length 256 \
471
+ --difficulty 0
472
+ ```
473
+
474
+ The launcher exposes GRPO throughput knobs for follow-up trials:
475
+
476
+ ```bash
477
+ # larger generation group, no vLLM
478
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
479
+ --max-steps 30 --dataset-size 64 --num-generations 8 \
480
+ --max-completion-length 256 --trace-log-every 5
481
+
482
+ # vLLM colocate on the same L4
483
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
484
+ --max-steps 30 --dataset-size 64 --num-generations 8 \
485
+ --max-completion-length 256 --use-vllm \
486
+ --vllm-gpu-memory-utilization 0.35 --trace-log-every 5
487
+
488
+ # larger microbatch if the vLLM trial does not OOM
489
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
490
+ --max-steps 30 --dataset-size 64 --num-generations 8 \
491
+ --per-device-train-batch-size 2 --gradient-accumulation-steps 4 \
492
+ --max-completion-length 256 --use-vllm \
493
+ --vllm-gpu-memory-utilization 0.45 --trace-log-every 5
494
+ ```
495
+
496
+ `per_device_train_batch_size * gradient_accumulation_steps * world_size` must
497
+ be divisible by `num_generations`; the launcher validates this before the GPU
498
+ container starts. Scalar Trackio metrics still log every reward callback, while
499
+ sample trace tables and Trace objects are throttled by `--trace-log-every`
500
+ (`1` restores every-callback logging, `0` disables trace artifacts).
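
The divisibility rule and the trace throttle can be checked by hand before launching. The sketch below is a stand-alone approximation: the function names are hypothetical, and the exact callback indexing used by the launcher is an assumption.

```python
def check_grpo_batch(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_generations: int,
    world_size: int = 1,
) -> int:
    """Return the effective batch size, or raise if it cannot be split into groups."""
    total = per_device_train_batch_size * gradient_accumulation_steps * world_size
    if total % num_generations != 0:
        raise ValueError(
            f"effective batch {total} is not divisible by num_generations={num_generations}"
        )
    return total


def should_log_trace(callback_index: int, trace_log_every: int) -> bool:
    """0 disables trace artifacts, 1 logs every callback, N logs every Nth callback."""
    return trace_log_every > 0 and callback_index % trace_log_every == 0


# Matches the "larger microbatch" trial above: 2 * 4 * 1 = 8, divisible by 8.
effective = check_grpo_batch(2, 4, num_generations=8)
```

With the same batch settings, `--num-generations 6` would be rejected because 8 is not a multiple of 6.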
501
+
502
+ ### Parallel Modal GRPO Runs
503
+
504
+ Parallel Modal GRPO runs are safe when each run has its own seed range, run
505
+ name, and output target, while the shared cache volumes remain read-only.
506
+ Before launching another job, check what is already active:
507
+
508
+ ```bash
509
+ uv run --extra modal modal app list
510
+ uv run --extra modal modal app logs <app-id>
511
+ ```
512
+
513
+ Launch long-running parallel jobs with both Modal CLI detach and the launcher
514
+ detach flag. The CLI-level `--detach` keeps the remote function alive after the
515
+ local entrypoint exits; the launcher `--detach` prevents the parent Modal
516
+ function from waiting on the GPU call.
517
+
518
+ ```bash
519
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
520
+ --max-steps 300 \
521
+ --dataset-size 64 \
522
+ --num-generations 8 \
523
+ --max-completion-length 768 \
524
+ --difficulty 0 \
525
+ --trace-log-every 10 \
526
+ --seed-start 10000 \
527
+ --detach
528
+ ```
529
+
530
+ For multiple concurrent experiments:
531
+
532
+ - Use a unique `--seed-start` range for every run, normally spaced by at least
533
+ 10,000 seeds.
534
+ - Keep `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`; do not compile
535
+ scenarios during training.
536
+ - Do not run `prepare-cache --cache-force` while training jobs are active.
537
+ - Keep `--push-to-hub` disabled unless each run has a unique
538
+ `--output-repo-id`.
539
+ - Let the launcher generate unique timestamped Trackio run names, or set an
540
+ explicit `RUN_NAME` only when it is globally unique.
541
+ - Use the same Trackio Space/project for comparable metrics, but never reuse a
542
+ run name.
543
+ - Treat `CyberSecurity_OWASP-model-cache` and
544
+ `CyberSecurity_OWASP-scenario-cache` as shared read-mostly infrastructure
545
+ during training. Run outputs and checkpoints should stay under each run's
546
+ unique output directory.
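
The seed-spacing guideline in the list above can be planned with a small helper. This is a sketch under assumptions: the helper names are hypothetical, and how many seeds a single run actually consumes is not stated in this README, so `seeds_per_run` is a placeholder.

```python
from typing import List, Sequence


def seed_starts(num_runs: int, base: int = 10_000, spacing: int = 10_000) -> List[int]:
    """Give each concurrent run its own --seed-start, spaced by `spacing`."""
    return [base + i * spacing for i in range(num_runs)]


def ranges_overlap(starts: Sequence[int], seeds_per_run: int) -> bool:
    """True if any two runs would draw from overlapping seed ranges."""
    ordered = sorted(starts)
    return any(b - a < seeds_per_run for a, b in zip(ordered, ordered[1:]))


starts = seed_starts(3)
for start in starts:
    # Each value would be passed to one launcher invocation.
    print(f"--seed-start {start}")
```

Checking `ranges_overlap(starts, seeds_per_run)` before launching is a cheap guard against two runs sampling the same scenarios.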
547
+
548
+ If a Windows shell fails with a Unicode `charmap` encoding error during Modal
549
+ startup, rerun with UTF-8 enabled for that command:
550
+
551
+ ```powershell
552
+ $env:PYTHONIOENCODING='utf-8'; $env:PYTHONUTF8='1'; uv run --extra modal modal run --detach scripts/modal_train_grpo.py --max-steps 300 --dataset-size 64 --num-generations 4 --max-completion-length 768 --difficulty 0 --trace-log-every 10 --seed-start 60000 --detach
553
+ ```
554
+
555
+ If running from a public repository and you do not want Modal to package the
556
+ local workspace, use public source mode:
557
+
558
+ ```bash
559
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
560
+ --source-mode public \
561
+ --repo-url https://github.com/humandotlearning/CyberSecurity_OWASP.git \
562
+ --repo-branch master \
563
+ --max-steps 10 \
564
+ --dataset-size 16 \
565
+ --num-generations 6 \
566
+ --difficulty 0
567
+ ```
568
+
569
+ Defaults are derived from `HF_TOKEN`:
570
+
571
+ - Trackio Space: `<hf-user>/CyberSecurity_OWASP-trackio`
572
+ - Trackio project: `CyberSecurity_OWASP-grpo`
573
+ - Training model: `unsloth/gemma-4-E2B-it`
574
+ - Output repo: `<hf-user>/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-grpo-lora`
575
+
576
+ Override these with `--trackio-space-id`, `--trackio-project`, and
577
+ `--output-repo-id` when needed. The persistent GRPO launcher intentionally rejects non-Gemma model overrides so smoke runs match the Unsloth Gemma 4 E2B RL notebook.
578
+
579
+ ## Docker / Spaces
580
+
581
+ ```bash
582
+ docker build -t cybersecurity_owasp:latest -f server/Dockerfile .
583
+ docker run --rm -p 8000:8000 cybersecurity_owasp:latest
584
+ openenv push --repo-id <username>/CyberSecurity_OWASP
585
+ ```