Humanlearning committed
Commit 28685f3 · 1 Parent(s): 6abc8c5

feat: add cybersecurity-owasp-trainer skill with reference notes and update AGENTS.md documentation

.agents/skills/cybersecurity-owasp-trainer/SKILL.md ADDED
@@ -0,0 +1,91 @@
+ ---
+ name: cybersecurity-owasp-trainer
+ description: Train, debug, evaluate, and document CyberSecurity_OWASP model runs with OpenEnv, TRL/GRPO, optional Unsloth/QLoRA, Trackio, rollout artifacts, baseline-vs-trained evaluation, and reward-hacking safeguards. Use when working on training scripts, launch commands, rollouts, Trackio metrics, model saving, or hackathon demo evidence for this repo.
+ ---
+
+ # CyberSecurity_OWASP Trainer
+
+ ## Overview
+
+ Use this skill to run or modify the CyberSecurity_OWASP training and evaluation loop without weakening the verifier, reward integrity, or hackathon evidence trail. Treat the environment and reward engine as the product; training only starts after those are stable.
+
+ ## References
+
+ - Load `references/hackathon-training-notes.md` when checking hackathon expectations, demo evidence, reward-hacking safeguards, or scaling order.
+ - Load `references/trl-unsloth-openenv-notes.md` before changing TRL, OpenEnv training integration, Unsloth/QLoRA settings, vLLM settings, or model saving.
+ - Use the repo's existing `openenv-cli` skill for OpenEnv CLI command details.
+ - Use the repo's existing `hugging-face-trackio` skill for Trackio API, dashboard, alert, or metric retrieval details.
+
+ ## Preflight Gate
+
+ Do not start real training until all checks below are true:
+
+ - `reset`, `step`, `state`, typed actions, observations, and terminal states work deterministically.
+ - Verifier and reward tests cover exploit blocking, regression preservation, public routes, visible tests, app boot, anti-cheat, invalid actions, and no repeated reward inflation.
+ - Hidden tests and protected files cannot be read or patched through environment actions.
+ - `send_local_request` is restricted to the generated local app.
+ - A local server or Docker server can run, and at least one manual episode completes.
+ - Scripted random, bad, and oracle policies run without crashing; the oracle policy gets high reward on easy seeds (see the policy-loop sketch after this list).
+ - At least 10 validation rollouts complete and sampled rollout artifacts look behaviorally plausible.
+ - Trackio run config is set and can log a smoke metric locally or to the canonical Space.
+
+ If any gate fails, fix the environment, verifier, reward engine, or rollout parser before touching trainer scale.
+
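+ A minimal sketch of the scripted-policy gate check, assuming an OpenEnv-style client with `reset`/`step` and a step result carrying `observation`, `reward`, and `done`; the action dict shape is illustrative, and the repo's real loop lives in `training/rollout.py`:
+
+ ```python
+ # Hedged sketch: run scripted policies before any training. The client
+ # construction, action schema, and result fields are assumptions here.
+ import random
+
+ def run_episode(env, policy, max_steps=50):
+     obs = env.reset()
+     total = 0.0
+     for _ in range(max_steps):
+         result = env.step(policy(obs))
+         obs = result.observation
+         total += result.reward or 0.0
+         if result.done:
+             break
+     return total
+
+ def random_policy(obs):
+     # Syntactically valid but unguided JSON action.
+     return {"action": random.choice(["read_file", "patch_file", "run_tests"]),
+             "args": {}}
+
+ # Expect near-zero reward for random/bad policies and high reward for the
+ # oracle policy on easy seeds; anything else means the reward is broken.
+ ```
+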
+ ## Repo Training Path
+
+ Prefer the existing repo modules:
+
+ - `training/rollout.py`: full OpenEnv episode loop, action JSON parsing, reward trace, rollout artifact fields.
+ - `training/reward_funcs.py`: component reward functions exposed to TRL/GRPO.
+ - `training/train_grpo.py`: `GRPOConfig`, model defaults, Trackio reporting, vLLM settings.
+ - `training/eval_before_after.py`: baseline-vs-trained and held-out summary metrics.
+ - `training/trackio_utils.py`: run naming, canonical metric names, Trackio init/log/finalize helpers.
+
+ Default environment values:
+
+ ```powershell
+ $env:MODEL_NAME = "Qwen/Qwen3-1.7B"
+ $env:TRACKIO_SPACE_ID = "Humanlearning/CyberSecurity_OWASP-trackio"
+ $env:TRACKIO_PROJECT = "CyberSecurity_OWASP"
+ $env:DIFFICULTY = "0"
+ ```
+
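+ A quick way to satisfy the Trackio smoke-metric gate with these values, assuming Trackio's wandb-style `init`/`log`/`finish` API; check the exact keyword names against the installed trackio version and `training/trackio_utils.py`:
+
+ ```python
+ # Hedged Trackio smoke test; `space_id` routing is an assumption here.
+ import os
+ import trackio
+
+ trackio.init(
+     project=os.getenv("TRACKIO_PROJECT", "CyberSecurity_OWASP"),
+     space_id=os.getenv("TRACKIO_SPACE_ID"),  # omit to log locally
+     name="trackio-smoke",
+ )
+ trackio.log({"smoke/ok": 1.0})
+ trackio.finish()
+ ```
+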
+ Use level-0 debug runs before scaling. Do not increase batch size, prompt count, scenario diversity, or difficulty until sampled artifacts show real discover-then-patch behavior rather than formatting compliance only.
+
+ ## Training Workflow
+
+ 1. Validate the environment first: run the targeted tests that cover models, reset/step/state, rewards, anti-cheat, seed reproducibility, invalid actions, and rollouts.
+ 2. Run a tiny smoke path that constructs `GRPOConfig` without starting expensive training (see the sketch after this list).
+ 3. Run a frozen-model or dummy-policy rollout and inspect the action trace, observations, terminal reason, and reward breakdown.
+ 4. Confirm Trackio receives component metrics and the run name follows `CyberSecurity_OWASP-<model>-<algo>-level<difficulty>-<YYYYMMDD-HHMM>-<git_sha>`.
+ 5. Start a very small GRPO run only after the above passes. Watch completions and rollout artifacts during the run, not just aggregate reward.
+ 6. Evaluate baseline, trained, and held-out splits with `training/eval_before_after.py` and save summaries under `outputs/evals/`.
+ 7. Save sampled rollouts under `outputs/rollouts/` for baseline, mid-training, trained, and held-out evidence.
+
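+ A sketch covering steps 2 and 4, assuming TRL's `GRPOConfig` and the run-name convention above; the field values are illustrative smoke defaults, not the repo's tuned settings in `training/train_grpo.py`:
+
+ ```python
+ # Build GRPOConfig cheaply and derive a conforming run name. No training
+ # starts here; this only proves the config and naming code paths work.
+ import os
+ import subprocess
+ from datetime import datetime, timezone
+
+ from trl import GRPOConfig
+
+ config = GRPOConfig(
+     output_dir=os.getenv("OUTPUT_DIR", "outputs/smoke"),
+     num_generations=2,              # low-cost smoke/debug default
+     per_device_train_batch_size=2,  # global batch must divide evenly by num_generations
+     max_steps=1,
+ )
+
+ sha = subprocess.run(["git", "rev-parse", "--short", "HEAD"],
+                      capture_output=True, text=True).stdout.strip()
+ stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
+ run_name = f"CyberSecurity_OWASP-qwen3-1.7b-grpo-level0-{stamp}-{sha}"
+ print(config.num_generations, run_name)
+ ```
+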
+ ## Reward And Monitoring
+
+ Track at least these behavior columns (a logging sketch follows the list):
+
+ - Reward components: total, discovery, security, regression, public routes, patch quality, visible tests, safety, anti-cheat.
+ - Rates: success, exploit-block, regression preservation, public-route preservation, anti-cheat pass, invalid action, timeout, safety violation, reward-hacking suspected.
+ - Efficiency: episode length mean/p95, rollouts per second, tokens per second, loss, learning rate, KL, grad norm.
+ - Environment timing: reset, step, verifier, reward, scenario compile, error rate, difficulty, seed.
+
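+ One step of component logging might look like the following; the metric names mirror the columns above but are illustrative, since the canonical names live in `training/trackio_utils.py`:
+
+ ```python
+ # Hedged sketch: log component metrics after trackio.init(...) has run.
+ import trackio
+
+ trackio.log({
+     "reward/total": 0.42,
+     "reward/discovery": 0.10,
+     "reward/anti_cheat": 0.0,
+     "rates/exploit_block": 0.55,
+     "rates/invalid_action": 0.05,
+     "env/step_seconds_p95": 1.8,
+ })
+ ```
+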
+ Stop or roll back if reward rises while sampled traces show deny-all patches, hardcoded users/resources/tenants, fixture/test tampering, repeated invalid actions, locked-down public routes, or visible-test-only optimization.
+
+ ## TRL, OpenEnv, And Unsloth Guidance
+
+ - Use TRL GRPO for verifier-driven rewards. Keep multiple independent reward functions for logging and diagnosis.
+ - Keep the existing custom rollout path unless deliberately migrating to TRL's `environment_factory`. If migrating, preserve typed actions, observations, reward component logging, anti-cheat flags, and rollout artifacts.
+ - Use vLLM colocate for small local runs when memory allows; use server mode only when a separate inference GPU/server is available.
+ - For OpenEnv server training concurrency, ensure the server supports enough concurrent sessions for the generation batch.
+ - Use Unsloth with LoRA or QLoRA for memory efficiency when the training machine supports it. Start from an instruct-capable checkpoint and verify the model has non-zero success probability before RL.
+ - Pin and smoke-test TRL, Unsloth, vLLM, CUDA, and torch versions before longer runs.
+ - Save LoRA adapters or use Unsloth-supported merged save paths. Do not naively upcast a 4-bit model and merge adapters manually.
+
+ ## Demo Evidence
+
+ - Report baseline vs trained success and reward improvements.
+ - Include held-out split results, exploit-block rate, regression-preservation rate, public-route preservation rate, and anti-cheat pass rate.
+ - Show representative rollout traces before and after training.
+ - Explain how hidden verifier checks, anti-cheat checks, randomized scenarios, and held-out combinations reduce reward hacking and overfitting.
.agents/skills/cybersecurity-owasp-trainer/agents/openai.yaml ADDED
@@ -0,0 +1,4 @@
+ interface:
+   display_name: "CyberSecurity OWASP Trainer"
+   short_description: "Train and evaluate CyberSecurity_OWASP"
+   default_prompt: "Use $cybersecurity-owasp-trainer to run a safe GRPO training/evaluation pass for CyberSecurity_OWASP."
.agents/skills/cybersecurity-owasp-trainer/references/hackathon-training-notes.md ADDED
@@ -0,0 +1,71 @@
+ # Hackathon Training Notes
+
+ Sources:
+
+ - `D:/delete_later/[External] Meta OpenEnv Hackathon Participant Help Guide.docx`
+ - `D:/delete_later/Hackathon FAQs (participants).docx`
+ - `D:/delete_later/OpenEnv_Hackathon_Resources.docx`
+
+ ## Core Build Order
+
+ - Build the environment before the trainer. Define observation, action space, episode end conditions, reward, abuse limits, and deterministic replay first.
+ - Treat the verifier and reward engine as the task specification. Prefer executable checks over subjective judgments.
+ - Use OpenEnv for `reset`, `step`, `state`, typed action/observation models, server deployment, and trainer integration.
+ - Use TRL for GRPO or related post-training, and use Unsloth when memory or rollout speed is the bottleneck.
+ - Deploy or run the environment early through local Python, Docker, or Hugging Face Spaces so packaging and client-server issues surface before training.
+
+ ## Training Readiness
+
+ Do not start meaningful training until:
+
+ - `reset`, `step`, rewards, timeouts, and logs work locally.
+ - The verifier has been adversarially tested.
+ - At least a few easy tasks produce non-zero reward.
+ - Random, bad, and oracle policies behave as expected.
+ - Sample rollouts can be inspected and do not reveal reward-hacking shortcuts.
+
+ Use this scale-up order:
+
+ 1. One manual episode.
+ 2. Scripted policies.
+ 3. Ten validation rollouts.
+ 4. Tiny frozen-model or debug GRPO run.
+ 5. Larger rollout count.
+ 6. Full training run.
+
+ ## Reward Engineering
+
+ - Use multiple independent reward components instead of one scalar only (see the sketch after this list).
+ - Reward true outcomes first: exploit blocked, legitimate flows preserved, public routes preserved, app boots, and visible tests pass.
+ - Penalize shortcuts: protected-file edits, hidden-test access, hardcoded identities/resources, deny-all patches, external network attempts, and environment abuse.
+ - Keep explanation quality auxiliary only; do not let an LLM judge dominate the primary reward.
+ - Watch for rising reward without better behavior. That usually means the reward was hacked or the verifier is too weak.
+
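+ A toy combination of these ideas, with entirely hypothetical component names (this repo's actual schema is in `training/reward_funcs.py`):
+
+ ```python
+ # Illustrative only: true outcomes add reward, shortcut penalties subtract.
+ OUTCOMES = ("exploit_blocked", "flows_preserved", "public_routes_ok",
+             "app_boots", "visible_tests_pass")
+ PENALTIES = ("protected_file_edit", "hidden_test_access",
+              "hardcoded_identity", "deny_all_patch", "network_attempt")
+
+ def combine(components: dict) -> float:
+     reward = sum(components.get(k, 0.0) for k in OUTCOMES)
+     reward -= sum(components.get(k, 0.0) for k in PENALTIES)
+     return reward
+ ```
+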
+ ## Curriculum
+
+ - Start with short-horizon, easy, high-signal tasks where success probability is above zero.
+ - Increase difficulty only after the model gets reliable partial reward.
+ - If exploit blocking is poor, add easier security tasks.
+ - If regressions increase, add positive-flow and public-route traps.
+ - If validation reward plateaus, add unseen layouts, domains, and harder held-out combinations.
+
+ ## Monitoring And Demo Evidence
+
+ Track overall reward, component rewards, success indicators, timeouts, invalid actions, and sampled generated strategies. Inspect actual rollouts throughout training.
+
+ A strong hackathon demo shows:
+
+ - Baseline attempt and verifier output.
+ - Trained attempt and measurable improvement.
+ - Held-out domain/layout/bug results.
+ - Reward curves and component metrics.
+ - Anti-cheat evidence showing the model did not learn deny-all, hardcoding, or fixture tampering.
+
+ ## Common Mistakes
+
+ - Training before the environment and verifier are stable.
+ - Choosing a task with near-zero chance of reward.
+ - Using only one reward function.
+ - Monitoring average reward but not sampled behavior.
+ - Forgetting timeouts, sandboxing, or protected-file checks.
+ - Saving LoRA/QLoRA models through an unsafe merge path.
.agents/skills/cybersecurity-owasp-trainer/references/trl-unsloth-openenv-notes.md ADDED
@@ -0,0 +1,43 @@
+ # TRL, Unsloth, And OpenEnv Notes
+
+ Sources checked for this skill:
+
+ - TRL GRPO Trainer: https://huggingface.co/docs/trl/en/grpo_trainer
+ - TRL OpenEnv integration: https://huggingface.co/docs/trl/en/openenv
+ - Unsloth RL Guide: https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide
+ - Unsloth Advanced RL Documentation: https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/advanced-rl-documentation
+ - Unsloth vLLM deployment/saving guide: https://unsloth.ai/docs/basics/inference-and-deployment/vllm-guide
+
+ Recheck these pages before major dependency upgrades because TRL, OpenEnv integration, vLLM, and Unsloth RL APIs move quickly.
+
+ ## TRL GRPO
+
+ - GRPO is an online RL method that samples multiple completions per prompt, scores them with reward functions, and optimizes relative advantage within the group.
+ - `GRPOTrainer` accepts one or more reward functions. Custom reward functions receive prompts, completions, completion IDs, trainer state, and dataset columns through keyword arguments (see the sketch after this list).
+ - Multiple reward functions are summed unless reward weights are configured. Use separate component functions for logging and diagnosis.
+ - TRL logs component reward means/stds, total reward, completion length, KL when enabled, entropy, clipping metrics, and token/step timing.
+ - vLLM is the main acceleration path for generation. Colocate mode shares the trainer process/GPU; server mode is better when inference has separate GPUs.
+
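+ A minimal sketch of that reward-function contract, following the TRL GRPO docs; the string check is a placeholder, since real scoring would replay the repo's verifier:
+
+ ```python
+ # One float per completion; extra dataset columns arrive through **kwargs.
+ def exploit_block_reward(completions, **kwargs):
+     rewards = []
+     for completion in completions:
+         text = completion if isinstance(completion, str) else str(completion)
+         # Placeholder: real logic would call the verifier, not string-match.
+         rewards.append(1.0 if "patch_file" in text else 0.0)
+     return rewards
+
+ # Registered alongside other components, e.g.:
+ # GRPOTrainer(model=..., reward_funcs=[exploit_block_reward, ...], ...)
+ ```
+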
+ ## TRL With OpenEnv
+
+ - Use environment training when state carries across turns and observations depend on prior actions.
+ - Current TRL docs prefer `environment_factory` for automatic multi-turn tool loops. It exposes public methods as tools and uses an `environments` argument in reward functions.
+ - `rollout_func` is still appropriate when the repo needs a custom generation, parsing, artifact, or client loop. CyberSecurity_OWASP currently has this shape in `training/rollout.py`.
+ - If migrating from `rollout_func` to `environment_factory`, preserve typed action validation, phase gating, reward breakdowns, anti-cheat flags, and rollout artifact output.
+ - For concurrent training, match OpenEnv server session capacity to the generation batch. Create clients lazily in `reset` and close old sessions before reopening.
+
+ ## Unsloth RL Guidance
+
+ - Use Unsloth for memory-efficient LoRA/QLoRA GRPO when local hardware is constrained.
+ - Start from a capable instruct model or lightly format-tuned model. If success probability is effectively zero, RL will not bootstrap.
+ - Keep reward functions/verifiers simple and trustworthy first; add shaping only after sparse reward blocks learning.
+ - Unsloth recipes commonly use Qwen, Gemma, Llama, Phi, Mistral, and gpt-oss variants. For this repo, prefer the configured `Qwen/Qwen3-1.7B` or another small instruct/coder checkpoint for smoke runs.
+ - For Unsloth-specific GRPO recipes, use more than two generations per prompt when hardware allows. Keep the repo's small `num_generations=2` only as a low-cost smoke/debug default unless tests prove it is sufficient.
+ - Pin torch, CUDA, vLLM, TRL, and Unsloth versions for any serious run, then run a short smoke test before scaling.
+
+ ## Saving And Serving
+
+ - Save LoRA adapters directly when adapters are enough for evaluation or continued training.
+ - Use Unsloth-supported merged save methods for deployment formats, such as merged 16-bit for vLLM serving (see the save sketch after this list).
+ - Avoid manually upcasting a 4-bit model and merging LoRA weights outside the supported save path.
+ - After saving, immediately run post-training inference against a small held-out set to prove the artifact loads and still follows the JSON action protocol.
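+
+ A sketch of the two save paths, using method names from Unsloth's saving docs; confirm them against the pinned unsloth version, and assume `model` and `tokenizer` come from `FastLanguageModel.from_pretrained(...)`:
+
+ ```python
+ # 1) Adapters only: enough for evaluation or resuming training.
+ model.save_pretrained("outputs/lora_adapters")
+ tokenizer.save_pretrained("outputs/lora_adapters")
+
+ # 2) Merged 16-bit export for vLLM serving; avoids manual 4-bit upcasting.
+ model.save_pretrained_merged(
+     "outputs/merged_16bit", tokenizer, save_method="merged_16bit",
+ )
+ ```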
AGENTS.md CHANGED
@@ -793,7 +793,7 @@ import os
  from trl import GRPOConfig

  output_dir = os.getenv("OUTPUT_DIR", "CyberSecurity_OWASP-qwen3-1.7b-grpo")
- trackio_space_id = os.getenv("TRACKIO_SPACE_ID", output_dir)
+ trackio_space_id = os.getenv("TRACKIO_SPACE_ID", "Humanlearning/CyberSecurity_OWASP-trackio")

  grpo_config = GRPOConfig(
      output_dir=output_dir,
@@ -825,6 +825,17 @@ Start with small debug runs before scaling.

  Trackio is mandatory for training and evaluation visibility.

+ Canonical Trackio Space:
+
+ ```text
+ https://huggingface.co/spaces/Humanlearning/CyberSecurity_OWASP-trackio
+ ```
+
+ Use `TRACKIO_SPACE_ID=Humanlearning/CyberSecurity_OWASP-trackio` for training,
+ evaluation, and smoke runs. This is separate from the OpenEnv HF Space
+ `Humanlearning/CyberSecurity_OWASP`; do not send Trackio runs to the
+ environment Space.
+
  Run naming convention:

  ```text