---
title: CyberSecurity_OWASP Environment Server
emoji: 🛡️
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - cybersecurity
  - owasp
---


# CyberSecurity_OWASP



[Hugging Face Space](https://huggingface.co/spaces/Humanlearning/CyberSecurity_OWASP) | [Mini-blog](blog/blog.md)



`CyberSecurity_OWASP` is an OpenEnv-compliant reinforcement-learning environment for a single LLM agent that performs a defensive authorization-repair workflow:

```text
inspect generated app + policy -> discover authorization bug -> submit diagnosis -> patch code -> preserve intended behavior
```

The current implementation includes a functional closed-loop MVP scenario: an invoices FastAPI-style app with one injected OWASP A01 BOLA/IDOR defect, config-driven curriculum settings, cache-backed scenario resets, an ephemeral app sandbox, multi-layer deterministic verifier checks, anti-cheat safeguards, JSONL episode artifacts, and a decomposed reward.

## Diagrams

[Architecture diagram](assets/architecture_diagram.svg) | [RL training flow diagram](assets/env_rl_training_flow_diagram.svg)

![CyberSecurity_OWASP architecture](assets/architecture_diagram.svg)

![CyberSecurity_OWASP RL training flow](assets/env_rl_training_flow_diagram.svg)

Editable Mermaid sources are available in `assets/architecture_diagram.mmd` and `assets/env_rl_training_flow_diagram.mmd`.

## Quick Start

```bash
uv sync --extra dev
uv run --extra dev pytest
uv run python scripts/generate_scenario_cache.py --train-per-bucket 3 --validation-per-bucket 3 --heldout-per-bucket 3
uv run server --port 8000
```

Then connect with the OpenEnv client:

```python
from CyberSecurity_OWASP import CyberSecurityOWASPAction, CyberSecurityOWASPEnv

with CyberSecurityOWASPEnv(base_url="http://localhost:8000") as env:
    result = env.reset(seed=7)
    print(result.observation.task_brief)
    result = env.step(CyberSecurityOWASPAction(tool_name="list_routes"))
    print(result.observation.last_tool_result)
```

## Action Space

The agent emits one JSON action at a time:

```json
{"tool_name":"read_file","arguments":{"path":"app/routes/invoices.py"}}
```

Supported tools:

- `inspect_policy_graph`
- `list_routes`
- `read_openapi`
- `read_file`
- `search_code`
- `send_local_request`
- `compare_identities`
- `submit_diagnosis`
- `patch_file`
- `run_visible_tests`
- `submit_fix`
- `noop`

Tools are phase-gated:

- `discover`: inspect policy/routes/files, run safe local requests, compare identities, submit diagnosis.
- `patch`: read/search, patch editable app files, run visible tests, submit final fix.
- `done`: stable terminal observation only.
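
To make the phase gating concrete, here is a minimal scripted episode sketch using the client from the Quick Start. The tool names match the list above, but the `arguments` payloads for `submit_diagnosis` and `patch_file` are illustrative placeholders, and the final `result.reward` access assumes the standard OpenEnv step-result shape; check the repo's action schemas for the real field names.

```python
from CyberSecurity_OWASP import CyberSecurityOWASPAction, CyberSecurityOWASPEnv

with CyberSecurityOWASPEnv(base_url="http://localhost:8000") as env:
    env.reset(seed=7)

    # discover phase: inspect the app, then commit to a diagnosis
    for tool in ("inspect_policy_graph", "list_routes", "read_openapi"):
        result = env.step(CyberSecurityOWASPAction(tool_name=tool))
        print(result.observation.last_tool_result)
    env.step(CyberSecurityOWASPAction(
        tool_name="submit_diagnosis",
        arguments={"summary": "BOLA on the invoice routes"},  # hypothetical payload
    ))

    # patch phase: edit the app, re-run visible tests, submit the final fix
    env.step(CyberSecurityOWASPAction(
        tool_name="patch_file",
        arguments={"path": "app/routes/invoices.py", "content": "..."},  # hypothetical payload
    ))
    env.step(CyberSecurityOWASPAction(tool_name="run_visible_tests"))
    result = env.step(CyberSecurityOWASPAction(tool_name="submit_fix"))
    print(result.reward)  # assumes the standard OpenEnv step-result shape
```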

## Reward

Terminal reward uses stable components:

```python
{
    "discovery": 0.0,
    "security": 0.0,
    "regression": 0.0,
    "public_routes": 0.0,
    "patch_quality": 0.0,
    "visible_tests": 0.0,
    "safety": 0.0,
    "anti_cheat": 0.0,
    "terminal_total": 0.0,
    "progressive": 0.0,
    "step_penalty": 0.0,
    "speed_bonus": 0.0,
    "token_penalty": 0.0,
    "behavior_penalty": 0.0,
    "train_total": 0.0,
    "total": 0.0,
}
```

The verifier rewards blocking the hidden exploit while preserving legitimate owner/admin behavior and intentionally public routes. Terminal scoring combines visible tests, hidden authorization checks, a policy-oracle matrix, regression checks, public-route preservation, and patch-quality checks. It penalizes deny-all fixes, hardcoded IDs, repeated/invalid action patterns, hidden file probes, external URL attempts, and test/fixture tampering.

Training can enable dense rewards with `CYBERSECURITY_OWASP_REWARD_MODE=dense_train`.
Dense mode adds configurable progressive rewards, small efficiency penalties, and capped behavior penalties from `training/configs/grpo_small.yaml`; evaluation defaults to sparse terminal scoring.
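
For example, to serve dense-reward episodes from a locally hosted environment (both the environment variable and the server command come from this README):

```bash
# enable dense training rewards before starting the local server
export CYBERSECURITY_OWASP_REWARD_MODE=dense_train
uv run server --port 8000
```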

## Scenario Cache And Generation

Scenario generation is an offline/cache-prep concern. `reset(seed)` asks the `CurriculumController` for a difficulty tier and target weakness, then loads a validated executable bundle from the scenario cache when `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`. Local development defaults to `fallback`, which compiles deterministically on a cache miss.
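
For example, to reproduce each cache behavior locally (both mode values come from the description above):

```bash
# require mode: fail on a cache miss instead of compiling a scenario
export CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require
uv run server --port 8000

# fallback mode (local default): compile deterministically on a cache miss
export CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=fallback
```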

The scenario/curriculum author is config-driven through `configs/scenario_authoring.small.json`. The default offline author model is `deepseek-ai/DeepSeek-V4-Pro` with Hugging Face provider settings, thinking mode enabled, `temperature=1.0`, and `top_p=1.0`. This model config is for scenario authoring, not the RL policy model.
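
An illustrative shape for that config, shown as a Python dict so the assumptions can be commented; the key names are guesses, while the values mirror the documented defaults:

```python
# Illustrative shape of configs/scenario_authoring.small.json.
# Key names are assumptions; values mirror the defaults listed above.
scenario_authoring = {
    "author_model": "deepseek-ai/DeepSeek-V4-Pro",
    "provider": "huggingface",
    "thinking_mode": True,
    "temperature": 1.0,
    "top_p": 1.0,
}
```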

The cache bundle contract is:

- `scenario.json`
- `app_source/`
- `policy_graph.json`
- `visible_tests.py`
- `hidden_tests.py`
- `oracle_tests.py`
- `expected_exploit_trace.json`
- `reward_config.json`
- `metadata.json`

Cache keys include difficulty, authorization bug type, app family, framework, policy shape, tenant model, exploit depth, patch scope, regression risk, generator version, verifier version, and scenario hash.
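
A minimal sketch of how such a composite cache key could be derived; the real construction lives in `server/scenario_cache.py` and may differ in field names and hashing details:

```python
import hashlib
import json

def scenario_cache_key(meta: dict) -> str:
    """Hash the bucket fields listed above into a stable cache key (sketch)."""
    fields = (
        "difficulty", "bug_type", "app_family", "framework",
        "policy_shape", "tenant_model", "exploit_depth", "patch_scope",
        "regression_risk", "generator_version", "verifier_version",
        "scenario_hash",
    )
    payload = {name: meta[name] for name in fields}
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```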

The MVP compiler currently generates:

- invoices domain policy graph;
- bounded adversarial target metadata such as same-role cross-object access, cross-tenant access, public-route overlocking traps, alternate route/service reachability, or visible-test-only edge cases;
- randomized users, tenants, invoices, and IDs;
- generated app files under `app/`;
- visible tests under `tests/test_visible.py`;
- hidden facts, oracle tuples, scenario family metadata, and verifier targets kept out of observations.

Additional domains and bug families are scaffolded for extension.

## Runtime Components

The OpenEnv runtime is split into small server modules:

- `server/curriculum.py` tracks mastery, weak spots, reward trend, and difficulty tier.
- `server/scenario_cache.py` writes and loads validated executable scenario bundles.
- `server/adversarial_designer.py` chooses safe synthetic scenario targets from tracked weaknesses.
- `server/scenario_factory.py` compiles the generated app during cache prep or local fallback.
- `server/app_sandbox.py` handles editable workspace reads, patches, local requests, and OpenAPI summaries.
- `server/action_tools.py` dispatches typed tools through the sandbox.
- `server/authz_oracle.py` builds the hidden allowed/denied user-resource-action matrix (sketched after this list).
- `server/verifier.py` aggregates visible tests, hidden tests, oracle matrix, regression/public-route checks, and patch quality.
- `server/episode_logger.py` appends JSONL rollouts under `outputs/rollouts/`.
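
As a concrete picture of the hidden oracle, here is a minimal sketch of the allowed/denied matrix shape; the entries are invented for illustration, and the real tuples in `server/authz_oracle.py` never appear in observations:

```python
# (user_id, resource_id, action) -> allowed?
AuthzKey = tuple[str, str, str]

oracle: dict[AuthzKey, bool] = {
    ("owner-1", "invoice-17", "read"): True,   # owner access stays allowed
    ("user-2", "invoice-17", "read"): False,   # same-role cross-object access denied
    ("admin-1", "invoice-17", "read"): True,   # admin behavior preserved
}

def is_allowed(user: str, resource: str, action: str) -> bool:
    return oracle.get((user, resource, action), False)  # default deny
```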

The agent sees partial observations only: product rules, fixture aliases, route summaries, visible test results, and action errors. Hidden tests, oracle tuples, injected bug labels, and held-out scenario-family labels stay internal.

## Testing

```bash
uv run --extra dev pytest
```

The suite covers model serialization, reset/step/state behavior, seed reproducibility, invalid actions, reward outcomes, anti-cheat checks, scripted rollout policies, curriculum selection, adversarial targeting, held-out scenario families, oracle checks, verifier aggregation, and episode artifact logging.

## Training Scaffold

Training files are under `training/`:

- `rollout.py`
- `reward_funcs.py`
- `train_grpo.py`
- `eval_before_after.py`
- `trackio_utils.py`
- `configs/grpo_small.yaml`

The training scaffold is intentionally minimal until the environment/verifier behavior is stable. Trackio metric names and GRPO defaults follow the project brief.

`training/train_grpo.py` in this repo is a config helper only; it does not execute training locally.
Use the Modal launchers in `scripts/modal_train_grpo.py` (persistent) and
`scripts/modal_ephemeral_train.py` (smoke) for real GRPO runs.

### Run SFT And GRPO Training Scripts

Training runs on Modal. Do not run the GRPO loop directly on the local machine;
use the launcher scripts so scenario cache preflight, Trackio logging, Modal
volumes, and Hub uploads stay consistent.

First install the Modal extra and prepare the scenario cache:

```bash
uv sync --extra modal
uv run --extra modal modal run scripts/modal_train_grpo.py --mode config
uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
```

Generate and verify SFT trajectories before supervised fine-tuning:

```bash
uv run python scripts/generate_sft_dataset.py \
  --teacher-model deepseek-ai/DeepSeek-V4-Pro \
  --target-model unsloth/gemma-4-E2B-it \
  --difficulty-levels 0,1,2,3 \
  --episodes 75 \
  --validation-episodes 20 \
  --workers 8 \
  --out-dir outputs/sft

uv run python scripts/generate_sft_dataset.py \
  --verify-only \
  --difficulty-levels 0,1,2,3 \
  --out-dir outputs/sft
```

Run SFT on Modal and push the warm-start LoRA:

```bash
uv run --extra modal modal run --detach scripts/modal_train_sft.py \
  --local-train-path outputs/sft/train.jsonl \
  --local-validation-path outputs/sft/validation.jsonl \
  --local-manifest-path outputs/sft/manifest.json \
  --required-difficulties 0,1,2,3 \
  --trackio-space-id Humanlearning/CyberSecurity_OWASP-trackio \
  --trackio-project CyberSecurity_OWASP-sft \
  --output-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
  --push-to-hub \
  --detach
```

Continue with GRPO from the SFT adapter:

```bash
uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
  --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
  --max-steps 300 \
  --dataset-size 64 \
  --num-generations 8 \
  --max-completion-length 768 \
  --difficulty 0 \
  --trace-log-every 10 \
  --trackio-space-id Humanlearning/CyberSecurity_OWASP-trackio \
  --trackio-project CyberSecurity_OWASP-grpo \
  --detach
```

For reward-rubric ablations, use the PowerShell launcher and configs under
`training/configs/reward_ablations/`:

```powershell
.\scripts\launch_reward_ablations.ps1
```

Modal smoke and GRPO runs use `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require` and mount the persistent `CyberSecurity_OWASP-scenario-cache` volume. Prepare that cache before smoke/training:

```bash
uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
```

If the cache slice is missing or below the configured per-bucket minimum, Modal training fails before rollouts rather than compiling scenarios during the run.
The persistent GRPO launcher runs a CPU-only scenario-cache preflight before it starts the L4 GPU function, so missing cache coverage fails before GPU allocation.

## Trackio Run Tracking

Trackio is the default tracker for official runs. Set `TRACKIO_SPACE_ID` to log to a hosted Hugging Face Trackio Space; otherwise Trackio records locally.

```bash
export TRACKIO_SPACE_ID=<hf-user>/CyberSecurity_OWASP-trackio
export TRACKIO_PROJECT=CyberSecurity_OWASP-grpo
```

Use the tracked smoke wrapper instead of invoking pytest directly when producing run artifacts:

```bash
bash scripts/smoke_test.sh
uv run python scripts/track_pytest.py tests
```

Evaluation summaries saved through `training.eval_before_after.save_eval_summary(...)`, Modal smoke runs, and GRPO training configs all initialize Trackio runs with CyberSecurity_OWASP run names.



Training, baseline, and smoke runs also log the effective reward config at step 0. In Trackio, open **Media & Tables** and select the `reward_config` table to see the actual values for each reward key, including stage-specific values, caps, thresholds, terminate flags, and descriptions. Scalar metrics under `reward_config/<key>/<field>` expose the same numeric values for plotting and filtering, for example `reward_config/policy_inspected/value` and `reward_config/shaping_weight/resolved`.

Each run config includes `reward_config_id`, `reward_config_hash`,
`reward_config_source`, `reward_mode`, and `reward_stage`. For manual ablations,
compare runs with the same scenario/model settings and different
`reward_config_hash` values to see which reward weights produced each training
curve.

## Modal Ephemeral Runs

Modal Labs support is kept in a separate launcher script so the local OpenEnv server and core training scaffold stay unchanged.

Install the optional local Modal client:

```bash
uv sync --extra modal
```

Run a temporary Modal app for a cheap environment/training smoke check:

```bash
uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode smoke --episodes 4
```

The app is ephemeral: Modal starts it for the command and stops it when the command exits. The remote result is written locally under `outputs/rollouts/` and the summary metrics are logged to Trackio.
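
A small sketch for inspecting those local artifacts; the record keys depend on what `server/episode_logger.py` writes, so treat the printed fields as whatever the logger emits rather than a fixed schema:

```python
import json
from pathlib import Path

# read the most recently written rollout file under outputs/rollouts/
latest = max(Path("outputs/rollouts").glob("*.jsonl"), key=lambda p: p.stat().st_mtime)
with latest.open(encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        print(sorted(record))  # list the logged keys for each JSONL record
```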

You can also validate the GRPO config construction remotely:

```bash
uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode grpo-config
```

The shell wrapper is equivalent:

```bash
MODE=smoke EPISODES=4 uv run --extra modal bash scripts/modal_run_ephemeral.sh
```

## Synthetic SFT Before GRPO

Use supervised fine-tuning to warm-start `unsloth/gemma-4-E2B-it` before GRPO.
The SFT generator executes every teacher action in the real environment and
keeps only trajectories that pass the deterministic reward verifier.

Generate a 300-train-episode curriculum SFT dataset across levels `0,1,2,3`:

```bash
uv run python scripts/generate_sft_dataset.py \
  --teacher-model deepseek-ai/DeepSeek-V4-Pro \
  --target-model unsloth/gemma-4-E2B-it \
  --difficulty-levels 0,1,2,3 \
  --difficulty-buckets 4 \
  --episodes 75 \
  --validation-episodes 20 \
  --workers 8 \
  --out-dir outputs/sft
```

`--episodes` is per difficulty level when `--difficulty-levels` is set, so
`--episodes 75` across four levels gives 300 total train episodes. Expect
roughly 2,400-4,500 chat-format JSONL rows because each successful trajectory
contributes one row per action step. The script writes JSONL rows under
`outputs/sft/`, trajectory artifacts under `outputs/sft/trajectories/`, a
dataset card at `outputs/sft/README.md`, and `outputs/sft/manifest.json` with
reward summaries and curriculum coverage.

Verify reward metadata before any training run:

```bash
uv run python scripts/generate_sft_dataset.py \
  --verify-only \
  --difficulty-levels 0,1,2,3 \
  --out-dir outputs/sft
```

Push the verified dataset to Hugging Face Hub:

```bash
uv run python scripts/generate_sft_dataset.py \
  --push-only \
  --difficulty-levels 0,1,2,3 \
  --out-dir outputs/sft \
  --dataset-repo-id Humanlearning/CyberSecurity_OWASP-sft-dataset
```

The canonical dataset repo name is
`Humanlearning/CyberSecurity_OWASP-sft-dataset`. The upload is refused if
reward verification fails or `HF_TOKEN` is missing.

You can also generate and push in one command by adding `--push-to-hub` to the
generation command.

For local CI or smoke checks, add `--dry-run-oracle`; official SFT data should
use the teacher path and still pass the verifier gate above.

Launch SFT on Modal after reward verification passes:

```bash
uv run --extra modal modal run --detach scripts/modal_train_sft.py \
  --local-train-path outputs/sft/train.jsonl \
  --local-validation-path outputs/sft/validation.jsonl \
  --local-manifest-path outputs/sft/manifest.json \
  --required-difficulties 0,1,2,3 \
  --trackio-space-id Humanlearning/CyberSecurity_OWASP-trackio \
  --trackio-project CyberSecurity_OWASP-sft \
  --output-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
  --push-to-hub \
  --detach
```

`scripts/modal_train_sft.py` re-checks the JSONL reward metadata locally before
upload and again inside Modal before loading the model. It refuses to start SFT
unless all required curriculum difficulties are represented and the verifier
reward metadata passes. The default SFT config trains the full dataset
(`--max-steps -1`) with bf16/tf32, LoRA rank 32, and Modal GPU fallback
`H200 -> H100 -> A100-80GB -> L40S`. TRL does not support packing or
assistant-only loss for the Gemma 4 vision-language loader, so both remain
disabled for this model. The script pre-tokenizes the small JSONL dataset
serially before constructing `SFTTrainer`, which avoids TRL multiprocessing
around the Gemma/Unsloth config object. It also uses the base Transformers loss
path to avoid a TRL entropy-metric incompatibility with Gemma 4 lazy logits. A
warm run for the 300-400 episode dataset should usually finish in about 20-60
minutes; first image or model-cache builds can push that closer to 45-90
minutes.

Continue GRPO from the SFT LoRA:

The GRPO launcher downloads the Hub adapter, attaches a matching trainable
Unsloth LoRA to Gemma 4, and then loads the adapter safetensors. This keeps the
SFT handoff compatible with Gemma 4's Unsloth linear wrappers.

```bash
uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
  --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
  --max-steps 300 \
  --dataset-size 64 \
  --num-generations 8 \
  --difficulty 0 \
  --trace-log-every 10 \
  --detach
```

## Modal GRPO Training

The persistent GPU training launcher packages this local repo into Modal, trains
a small LoRA GRPO run, logs metrics and traces to Trackio, stores checkpoints in
the `CyberSecurity_OWASP-grpo-runs` Modal volume, and pushes the output adapter
to Hugging Face Hub.

Create a Modal secret named `CyberSecurity_OWASP-secrets` with `HF_TOKEN`, then
run the import/config check:

```bash
uv run --extra modal modal run scripts/modal_train_grpo.py --mode config
```

Run the default smoke GRPO job:

```bash
uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
uv run --extra modal modal run scripts/modal_train_grpo.py \
  --max-steps 10 \
  --dataset-size 16 \
  --num-generations 6 \
  --difficulty 0
```

For GPU-utilization tuning on the same single L4, start with a larger but still
bounded no-code trial:

```bash
uv run --extra modal modal run scripts/modal_train_grpo.py \
  --max-steps 30 \
  --dataset-size 64 \
  --num-generations 8 \
  --max-completion-length 256 \
  --difficulty 0
```

The launcher exposes GRPO throughput knobs for follow-up trials:

```bash
# larger generation group, no vLLM
uv run --extra modal modal run scripts/modal_train_grpo.py \
  --max-steps 30 --dataset-size 64 --num-generations 8 \
  --max-completion-length 256 --trace-log-every 5

# vLLM colocate on the same L4
uv run --extra modal modal run scripts/modal_train_grpo.py \
  --max-steps 30 --dataset-size 64 --num-generations 8 \
  --max-completion-length 256 --use-vllm \
  --vllm-gpu-memory-utilization 0.35 --trace-log-every 5

# larger microbatch if the vLLM trial does not OOM
uv run --extra modal modal run scripts/modal_train_grpo.py \
  --max-steps 30 --dataset-size 64 --num-generations 8 \
  --per-device-train-batch-size 2 --gradient-accumulation-steps 4 \
  --max-completion-length 256 --use-vllm \
  --vllm-gpu-memory-utilization 0.45 --trace-log-every 5
```

`per_device_train_batch_size * gradient_accumulation_steps * world_size` must
be divisible by `num_generations`; the launcher validates this before the GPU
container starts. Scalar Trackio metrics still log every reward callback, while
sample trace tables and Trace objects are throttled by `--trace-log-every`
(`1` restores every-callback logging, `0` disables trace artifacts).
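
The check itself is simple; here is an illustrative version of what the launcher validates (not the actual code in `scripts/modal_train_grpo.py`):

```python
def check_generation_batch(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    world_size: int,
    num_generations: int,
) -> None:
    """Fail fast if the global batch cannot be split into generation groups."""
    global_batch = (
        per_device_train_batch_size * gradient_accumulation_steps * world_size
    )
    if global_batch % num_generations != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by "
            f"num_generations={num_generations}"
        )

check_generation_batch(2, 4, 1, 8)  # 2 * 4 * 1 = 8 -> divisible by 8, OK
check_generation_batch(1, 8, 1, 8)  # another valid single-GPU combination
```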

### Parallel Modal GRPO Runs

Parallel Modal GRPO runs are safe when each run has its own seed range, run
name, and output target, while the shared cache volumes remain read-only.
Before launching another job, check what is already active:

```bash
uv run --extra modal modal app list
uv run --extra modal modal app logs <app-id>
```

Launch long-running parallel jobs with both Modal CLI detach and the launcher
detach flag. The CLI-level `--detach` keeps the remote function alive after the
local entrypoint exits; the launcher `--detach` prevents the parent Modal
function from waiting on the GPU call.

```bash
uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
  --max-steps 300 \
  --dataset-size 64 \
  --num-generations 8 \
  --max-completion-length 768 \
  --difficulty 0 \
  --trace-log-every 10 \
  --seed-start 10000 \
  --detach
```

For multiple concurrent experiments:

- Use a unique `--seed-start` range for every run, normally spaced by at least
  10,000 seeds (see the sketch after this list).
- Keep `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`; do not compile
  scenarios during training.
- Do not run `prepare-cache --cache-force` while training jobs are active.
- Keep `--push-to-hub` disabled unless each run has a unique
  `--output-repo-id`.
- Let the launcher generate unique timestamped Trackio run names, or set an
  explicit `RUN_NAME` only when it is globally unique.
- Use the same Trackio Space/project for comparable metrics, but never reuse a
  run name.
- Treat `CyberSecurity_OWASP-model-cache` and
  `CyberSecurity_OWASP-scenario-cache` as shared read-mostly infrastructure
  during training. Run outputs and checkpoints should stay under each run's
  unique output directory.
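
A sketch of two concurrent launches that follow these rules; the second run's `--output-repo-id` is an illustrative name, and all flags are the documented launcher options:

```bash
# run A: seeds 10000+
uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
  --max-steps 300 --dataset-size 64 --num-generations 8 \
  --difficulty 0 --seed-start 10000 --detach

# run B: seed range spaced 10,000 above run A, separate output repo
uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
  --max-steps 300 --dataset-size 64 --num-generations 8 \
  --difficulty 0 --seed-start 20000 \
  --output-repo-id Humanlearning/CyberSecurity_OWASP-grpo-lora-run-b \
  --detach
```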

If a Windows shell fails with a Unicode `charmap` encoding error during Modal
startup, rerun with UTF-8 enabled for that command:

```powershell
$env:PYTHONIOENCODING='utf-8'; $env:PYTHONUTF8='1'; uv run --extra modal modal run --detach scripts/modal_train_grpo.py --max-steps 300 --dataset-size 64 --num-generations 4 --max-completion-length 768 --difficulty 0 --trace-log-every 10 --seed-start 60000 --detach
```

If running from a public repository and you do not want Modal to package the
local workspace, use public source mode:

```bash
uv run --extra modal modal run scripts/modal_train_grpo.py \
  --source-mode public \
  --repo-url https://github.com/humandotlearning/CyberSecurity_OWASP.git \
  --repo-branch master \
  --max-steps 10 \
  --dataset-size 16 \
  --num-generations 6 \
  --difficulty 0
```

Defaults are derived from `HF_TOKEN`:

- Trackio Space: `<hf-user>/CyberSecurity_OWASP-trackio`
- Trackio project: `CyberSecurity_OWASP-grpo`
- Training model: `unsloth/gemma-4-E2B-it`
- Output repo: `<hf-user>/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-grpo-lora`

Override these with `--trackio-space-id`, `--trackio-project`, and
`--output-repo-id` when needed. The persistent GRPO launcher intentionally rejects non-Gemma model overrides so smoke runs match the Unsloth Gemma 4 E2B RL notebook.

## Docker / Spaces

```bash
# Docker image names must be lowercase
docker build -t cybersecurity_owasp:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 cybersecurity_owasp:latest
openenv push --repo-id <username>/CyberSecurity_OWASP
```