Commit 28685f3
Parent(s): 6abc8c5

feat: add cybersecurity-owasp-trainer skill with reference notes and update AGENTS.md documentation

Files changed:

- .agents/skills/cybersecurity-owasp-trainer/SKILL.md (+91, -0)
- .agents/skills/cybersecurity-owasp-trainer/agents/openai.yaml (+4, -0)
- .agents/skills/cybersecurity-owasp-trainer/references/hackathon-training-notes.md (+71, -0)
- .agents/skills/cybersecurity-owasp-trainer/references/trl-unsloth-openenv-notes.md (+43, -0)
- AGENTS.md (+12, -1)
.agents/skills/cybersecurity-owasp-trainer/SKILL.md
ADDED
@@ -0,0 +1,91 @@
---
name: cybersecurity-owasp-trainer
description: Train, debug, evaluate, and document CyberSecurity_OWASP model runs with OpenEnv, TRL/GRPO, optional Unsloth/QLoRA, Trackio, rollout artifacts, baseline-vs-trained evaluation, and reward-hacking safeguards. Use when working on training scripts, launch commands, rollouts, Trackio metrics, model saving, or hackathon demo evidence for this repo.
---

# CyberSecurity_OWASP Trainer

## Overview

Use this skill to run or modify the CyberSecurity_OWASP training and evaluation loop without weakening the verifier, reward integrity, or hackathon evidence trail. Treat the environment and reward engine as the product; training only starts after those are stable.

## References

- Load `references/hackathon-training-notes.md` when checking hackathon expectations, demo evidence, reward-hacking safeguards, or scaling order.
- Load `references/trl-unsloth-openenv-notes.md` before changing TRL, OpenEnv training integration, Unsloth/QLoRA settings, vLLM settings, or model saving.
- Use the repo's existing `openenv-cli` skill for OpenEnv CLI command details.
- Use the repo's existing `hugging-face-trackio` skill for Trackio API, dashboard, alert, or metric retrieval details.

## Preflight Gate

Do not start real training until all checks below are true:

- `reset`, `step`, `state`, typed actions, observations, and terminal states work deterministically (smoke-check sketch at the end of this section).
- Verifier and reward tests cover exploit blocking, regression preservation, public routes, visible tests, app boot, anti-cheat, invalid actions, and no repeated reward inflation.
- Hidden tests and protected files cannot be read or patched through environment actions.
- `send_local_request` is restricted to the generated local app.
- A local server or Docker server can run, and at least one manual episode completes.
- Scripted random, bad, and oracle policies run without crashing; the oracle gets high reward on easy seeds.
- At least 10 validation rollouts complete and sampled rollout artifacts look behaviorally plausible.
- Trackio run config is set and can log a smoke metric locally or to the canonical Space.

If any gate fails, fix the environment, verifier, reward engine, or rollout parser before touching trainer scale.
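The determinism gate above can be smoke-tested with a few lines. This is a sketch only: `make_env`, the observation equality, and the `step` result fields (`reward`, `observation`, `done`) are hypothetical stand-ins for this repo's real OpenEnv client, so adapt the names before use.

```python
# Sketch: make_env() is a hypothetical factory returning a fresh typed env client.

def check_deterministic_reset(make_env, seed: int = 0) -> None:
    """Reset two fresh environments with the same seed; observations must match."""
    obs_a = make_env().reset(seed=seed)
    obs_b = make_env().reset(seed=seed)
    assert obs_a == obs_b, "reset() is not deterministic for this seed"

def check_terminal_state(make_env, scripted_actions) -> None:
    """Replay a scripted action list and require a clean terminal state."""
    env = make_env()
    env.reset(seed=0)
    result = None
    for action in scripted_actions:
        result = env.step(action)
        if result.done:
            break
    assert result is not None and result.done, "scripted episode never terminated"
```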

## Repo Training Path

Prefer the existing repo modules:

- `training/rollout.py`: full OpenEnv episode loop, action JSON parsing, reward trace, rollout artifact fields.
- `training/reward_funcs.py`: component reward functions exposed to TRL/GRPO.
- `training/train_grpo.py`: `GRPOConfig`, model defaults, Trackio reporting, vLLM settings.
- `training/eval_before_after.py`: baseline-vs-trained and held-out summary metrics.
- `training/trackio_utils.py`: run naming, canonical metric names, Trackio init/log/finalize helpers.

Default environment values:

```powershell
$env:MODEL_NAME = "Qwen/Qwen3-1.7B"
$env:TRACKIO_SPACE_ID = "Humanlearning/CyberSecurity_OWASP-trackio"
$env:TRACKIO_PROJECT = "CyberSecurity_OWASP"
$env:DIFFICULTY = "0"
```

Use level-0 debug runs before scaling. Do not increase batch size, prompt count, scenario diversity, or difficulty until sampled artifacts show real discover-then-patch behavior rather than formatting compliance only.

## Training Workflow

1. Validate the environment first: run the targeted tests that cover models, reset/step/state, rewards, anti-cheat, seed reproducibility, invalid actions, and rollouts.
2. Run a tiny smoke path that constructs `GRPOConfig` without starting expensive training.
3. Run a frozen-model or dummy-policy rollout and inspect the action trace, observations, terminal reason, and reward breakdown.
4. Confirm Trackio receives component metrics and the run name follows `CyberSecurity_OWASP-<model>-<algo>-level<difficulty>-<YYYYMMDD-HHMM>-<git_sha>` (see the helper sketched after this list).
5. Start a very small GRPO run only after the above passes. Watch completions and rollout artifacts during the run, not just aggregate reward.
6. Evaluate baseline, trained, and held-out splits with `training/eval_before_after.py` and save summaries under `outputs/evals/`.
7. Save sampled rollouts under `outputs/rollouts/` for baseline, mid-training, trained, and held-out evidence.
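The run-name convention in step 4 is mechanical enough to pin down in code. A small sketch under stated assumptions: `build_run_name` is a hypothetical helper (the authoritative implementation lives in `training/trackio_utils.py`), and timestamps are taken in UTC.

```python
import subprocess
from datetime import datetime, timezone

def build_run_name(model: str, algo: str, difficulty: int) -> str:
    """CyberSecurity_OWASP-<model>-<algo>-level<difficulty>-<YYYYMMDD-HHMM>-<git_sha>."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
    sha = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()
    short_model = model.split("/")[-1]  # "Qwen/Qwen3-1.7B" -> "Qwen3-1.7B"
    return f"CyberSecurity_OWASP-{short_model}-{algo}-level{difficulty}-{stamp}-{sha}"

# e.g. CyberSecurity_OWASP-Qwen3-1.7B-grpo-level0-20250101-1200-6abc8c5
print(build_run_name("Qwen/Qwen3-1.7B", "grpo", 0))
```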

## Reward And Monitoring

Track at least these behavior columns (a Trackio logging sketch follows this section):

- Reward components: total, discovery, security, regression, public routes, patch quality, visible tests, safety, anti-cheat.
- Rates: success, exploit-block, regression preservation, public-route preservation, anti-cheat pass, invalid action, timeout, safety violation, reward-hacking suspected.
- Efficiency: episode length mean/p95, rollouts per second, tokens per second, loss, learning rate, KL, grad norm.
- Environment timing: reset, step, verifier, reward, scenario compile, error rate, difficulty, seed.

Stop or roll back if reward rises while sampled traces show deny-all patches, hardcoded users/resources/tenants, fixture/test tampering, repeated invalid actions, public routes being locked, or visible-test-only optimization.
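One hedged way to emit these columns with the `trackio` package, whose logging surface mirrors `wandb` (`init`/`log`/`finish`). The metric names below are illustrative assumptions; the canonical names live in `training/trackio_utils.py`, and the `space_id` argument should be verified against the pinned trackio version.

```python
import trackio  # Hugging Face trackio package; see the hugging-face-trackio skill for API details

trackio.init(
    project="CyberSecurity_OWASP",
    space_id="Humanlearning/CyberSecurity_OWASP-trackio",  # canonical Trackio Space
)
# Illustrative metric names only; use the canonical names from training/trackio_utils.py.
trackio.log({
    "reward/total": 0.42,
    "reward/security": 0.30,
    "rate/exploit_block": 0.55,
    "rate/invalid_action": 0.02,
    "efficiency/episode_len_mean": 17.0,
})
trackio.finish()
```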

## TRL, OpenEnv, And Unsloth Guidance

- Use TRL GRPO for verifier-driven rewards. Keep multiple independent reward functions for logging and diagnosis.
- Keep the existing custom rollout path unless deliberately migrating to TRL's `environment_factory`. If migrating, preserve typed actions, observations, reward component logging, anti-cheat flags, and rollout artifacts.
- Use vLLM colocate for small local runs when memory allows; use server mode only when a separate inference GPU/server is available (see the config sketch after this list).
- For OpenEnv server training concurrency, ensure the server supports enough concurrent sessions for the generation batch.
- Use Unsloth with LoRA or QLoRA for memory efficiency when the training machine supports it. Start from an instruct-capable checkpoint and verify the model has non-zero success probability before RL.
- Pin and smoke-test TRL, Unsloth, vLLM, CUDA, and torch versions before longer runs.
- Save LoRA adapters or use Unsloth-supported merged save paths. Do not naively upcast a 4-bit model and merge adapters manually.
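A hedged `GRPOConfig` sketch of the colocate-vs-server choice referenced above. `use_vllm` and `vllm_mode` exist in recent TRL releases, but field names drift, so verify them against the pinned TRL version before relying on this.

```python
from trl import GRPOConfig

# Small local run: vLLM shares the trainer process/GPU.
colocate_cfg = GRPOConfig(
    output_dir="outputs/grpo-smoke",
    use_vllm=True,
    vllm_mode="colocate",
    num_generations=2,  # repo smoke/debug default; raise for real runs
)

# Separate inference GPU/server available: point the trainer at a vLLM server.
server_cfg = GRPOConfig(
    output_dir="outputs/grpo-server",
    use_vllm=True,
    vllm_mode="server",
)
```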

## Demo Evidence

- Report baseline vs trained success and reward improvements.
- Include held-out split results, exploit-block rate, regression-preservation rate, public-route preservation rate, and anti-cheat pass rate.
- Show representative rollout traces before and after training.
- Explain how hidden verifier checks, anti-cheat checks, randomized scenarios, and held-out combinations reduce reward hacking and overfitting.
.agents/skills/cybersecurity-owasp-trainer/agents/openai.yaml
ADDED
@@ -0,0 +1,4 @@
interface:
  display_name: "CyberSecurity OWASP Trainer"
  short_description: "Train and evaluate CyberSecurity_OWASP"
  default_prompt: "Use $cybersecurity-owasp-trainer to run a safe GRPO training/evaluation pass for CyberSecurity_OWASP."
.agents/skills/cybersecurity-owasp-trainer/references/hackathon-training-notes.md
ADDED
@@ -0,0 +1,71 @@
# Hackathon Training Notes

Sources:

- `D:/delete_later/[External] Meta OpenEnv Hackathon Participant Help Guide.docx`
- `D:/delete_later/Hackathon FAQs (participants).docx`
- `D:/delete_later/OpenEnv_Hackathon_Resources.docx`

## Core Build Order

- Build the environment before the trainer. Define observation, action space, episode end conditions, reward, abuse limits, and deterministic replay first.
- Treat the verifier and reward engine as the task specification. Prefer executable checks over subjective judgments.
- Use OpenEnv for `reset`, `step`, `state`, typed action/observation models, server deployment, and trainer integration.
- Use TRL for GRPO or related post-training, and use Unsloth when memory or rollout speed is the bottleneck.
- Deploy or run the environment early through local Python, Docker, or Hugging Face Spaces so packaging and client-server issues surface before training.

## Training Readiness

Do not start meaningful training until:

- `reset`, `step`, rewards, timeouts, and logs work locally.
- The verifier has been adversarially tested.
- At least a few easy tasks produce non-zero reward.
- Random, bad, and oracle policies behave as expected (see the harness sketched after this list).
- Sample rollouts can be inspected and do not reveal reward-hacking shortcuts.
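A sketch of the scripted-policy comparison, under stated assumptions: `make_env` and the three policy callables are hypothetical stand-ins for the repo's real environment client and scripted policies; the point is the expected reward ordering, not the exact API.

```python
# Sketch: make_env() and the policies are hypothetical stand-ins.

def episode_reward(make_env, policy, seed: int, max_steps: int = 50) -> float:
    """Run one episode with a scripted policy and return the summed reward."""
    env = make_env()
    obs = env.reset(seed=seed)
    total = 0.0
    for _ in range(max_steps):
        result = env.step(policy(obs))
        total += result.reward
        obs = result.observation
        if result.done:
            break
    return total

def check_policy_ordering(make_env, random_policy, bad_policy, oracle_policy, seed: int = 0) -> None:
    """On an easy seed the oracle should clearly beat random, which should beat bad."""
    r_bad = episode_reward(make_env, bad_policy, seed)
    r_rand = episode_reward(make_env, random_policy, seed)
    r_oracle = episode_reward(make_env, oracle_policy, seed)
    assert r_oracle > r_rand >= r_bad, f"unexpected ordering: {r_oracle=}, {r_rand=}, {r_bad=}"
```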

Use this scale-up order:

1. One manual episode.
2. Scripted policies.
3. Ten validation rollouts.
4. Tiny frozen-model or debug GRPO run.
5. Larger rollout count.
6. Full training run.

## Reward Engineering

- Use multiple independent reward components instead of one scalar only (see the sketch after this list).
- Reward true outcomes first: exploit blocked, legitimate flows preserved, public routes preserved, app boots, and visible tests pass.
- Penalize shortcuts: protected-file edits, hidden-test access, hardcoded identities/resources, deny-all patches, external network attempts, and environment abuse.
- Keep explanation quality auxiliary only; do not let an LLM judge dominate the primary reward.
- Watch for rising reward without better behavior. That usually means the reward was hacked or the verifier is too weak.
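A minimal shape for outcome-first components with shortcut penalties. This is a sketch: the `verdict` dict and its field names are hypothetical verifier output, and `training/reward_funcs.py` remains the source of truth.

```python
# Sketch: `verdict` is hypothetical verifier output; field names are illustrative.

def component_rewards(verdict: dict) -> dict[str, float]:
    """True outcomes first, each as its own component for logging."""
    components = {
        "exploit_blocked": 1.0 if verdict["exploit_blocked"] else 0.0,
        "legit_flows_preserved": 1.0 if verdict["legit_flows_pass"] else 0.0,
        "public_routes_preserved": 1.0 if verdict["public_routes_pass"] else 0.0,
        "app_boots": 0.5 if verdict["app_boots"] else 0.0,
        "visible_tests": 0.5 if verdict["visible_tests_pass"] else 0.0,
    }
    # Shortcut penalties dominate: a cheating episode should never score well.
    if verdict["protected_file_edited"] or verdict["deny_all_patch"]:
        components["anti_cheat_penalty"] = -2.0
    return components

def total_reward(verdict: dict) -> float:
    return sum(component_rewards(verdict).values())
```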

## Curriculum

- Start with short-horizon, easy, high-signal tasks where success probability is above zero.
- Increase difficulty only after the model gets reliable partial reward.
- If exploit blocking is poor, add easier security tasks.
- If regressions increase, add positive-flow and public-route traps.
- If validation reward plateaus, add unseen layouts, domains, and harder held-out combinations.

## Monitoring And Demo Evidence

Track overall reward, component rewards, success indicators, timeouts, invalid actions, and sampled generated strategies. Inspect actual rollouts throughout training.
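Rollout inspection is easiest when every sampled episode is persisted the same way. A small sketch: the `outputs/rollouts/` layout matches the convention used elsewhere in this repo, while the artifact fields shown are illustrative assumptions.

```python
import json
from pathlib import Path

def save_rollout_artifact(run_name: str, phase: str, episode: dict) -> Path:
    """Write one sampled episode (illustrative fields) under outputs/rollouts/."""
    out_dir = Path("outputs/rollouts") / run_name / phase  # baseline|mid|trained|heldout
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"episode_{episode['seed']}.json"
    path.write_text(json.dumps(episode, indent=2))
    return path

save_rollout_artifact(
    "demo-run", "baseline",
    {"seed": 0, "actions": [], "reward_components": {}, "terminal_reason": "max_steps"},
)
```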

A strong hackathon demo shows:

- Baseline attempt and verifier output.
- Trained attempt and measurable improvement.
- Held-out domain/layout/bug results.
- Reward curves and component metrics.
- Anti-cheat evidence showing the model did not learn deny-all, hardcoding, or fixture tampering.

## Common Mistakes

- Training before the environment and verifier are stable.
- Choosing a task with near-zero chance of reward.
- Using only one reward function.
- Monitoring average reward but not sampled behavior.
- Forgetting timeouts, sandboxing, or protected-file checks.
- Saving LoRA/QLoRA models through an unsafe merge path.
.agents/skills/cybersecurity-owasp-trainer/references/trl-unsloth-openenv-notes.md
ADDED
@@ -0,0 +1,43 @@
# TRL, Unsloth, And OpenEnv Notes

Sources checked for this skill:

- TRL GRPO Trainer: https://huggingface.co/docs/trl/en/grpo_trainer
- TRL OpenEnv integration: https://huggingface.co/docs/trl/en/openenv
- Unsloth RL Guide: https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide
- Unsloth Advanced RL Documentation: https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/advanced-rl-documentation
- Unsloth vLLM deployment/saving guide: https://unsloth.ai/docs/basics/inference-and-deployment/vllm-guide

Recheck these pages before major dependency upgrades, because TRL, OpenEnv integration, vLLM, and Unsloth RL APIs move quickly.

## TRL GRPO

- GRPO is an online RL method that samples multiple completions per prompt, scores them with reward functions, and optimizes relative advantage within the group.
- `GRPOTrainer` accepts one or more reward functions. Custom reward functions receive prompts, completions, completion IDs, trainer state, and dataset columns through keyword arguments.
- Multiple reward functions are summed unless reward weights are configured. Use separate component functions for logging and diagnosis (see the sketch after this list).
- TRL logs component reward means/stds, total reward, completion length, KL when enabled, entropy, clipping metrics, and token/step timing.
- vLLM is the main acceleration path for generation. Colocate mode shares the trainer process/GPU; server mode is better when inference has separate GPUs.
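A minimal wiring sketch for multiple component reward functions. The `GRPOTrainer(model=..., reward_funcs=[...])` shape follows the TRL docs linked above; the dataset and reward logic here are toy placeholders, not this repo's real components.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy components: TRL passes prompts, completions, and dataset columns via kwargs.
def format_reward(completions, **kwargs):
    return [1.0 if c.strip().startswith("{") else 0.0 for c in completions]

def brevity_reward(completions, **kwargs):
    return [max(0.0, 1.0 - len(c) / 2000.0) for c in completions]

train_dataset = Dataset.from_dict({"prompt": ["Patch the IDOR in /orders/{id}."]})

trainer = GRPOTrainer(
    model="Qwen/Qwen3-1.7B",
    reward_funcs=[format_reward, brevity_reward],  # summed unless weights are set
    args=GRPOConfig(output_dir="outputs/grpo-smoke", num_generations=2),
    train_dataset=train_dataset,
)
```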

## TRL With OpenEnv

- Use environment training when state carries across turns and observations depend on prior actions.
- Current TRL docs prefer `environment_factory` for automatic multi-turn tool loops. It exposes public methods as tools and uses an `environments` argument in reward functions.
- `rollout_func` is still appropriate when the repo needs a custom generation, parsing, artifact, or client loop. CyberSecurity_OWASP currently has this shape in `training/rollout.py`.
- If migrating from `rollout_func` to `environment_factory`, preserve typed action validation, phase gating, reward breakdowns, anti-cheat flags, and rollout artifact output.
- For concurrent training, match OpenEnv server session capacity to the generation batch. Create clients lazily in `reset` and close old sessions before reopening.

## Unsloth RL Guidance

- Use Unsloth for memory-efficient LoRA/QLoRA GRPO when local hardware is constrained (see the load sketch after this list).
- Start from a capable instruct model or lightly format-tuned model. If success probability is effectively zero, RL will not bootstrap.
- Keep reward functions/verifiers simple and trustworthy first; add shaping only if sparse reward blocks learning.
- Unsloth recipes commonly use Qwen, Gemma, Llama, Phi, Mistral, and gpt-oss variants. For this repo, prefer the configured `Qwen/Qwen3-1.7B` or another small instruct/coder checkpoint for smoke runs.
- For Unsloth-specific GRPO recipes, use more than two generations per prompt when hardware allows. Keep the repo's small `num_generations=2` only as a low-cost smoke/debug default unless tests prove it is sufficient.
- Pin torch, CUDA, vLLM, TRL, and Unsloth versions for any serious run, then run a short smoke test before scaling.
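A hedged Unsloth load sketch for the QLoRA path above; `FastLanguageModel.from_pretrained` and `get_peft_model` follow the Unsloth docs linked earlier, but defaults drift between releases, so verify against the pinned version.

```python
from unsloth import FastLanguageModel

# QLoRA-style load: 4-bit base weights plus trainable LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-1.7B",
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```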

## Saving And Serving

- Save LoRA adapters directly when adapters are enough for evaluation or continued training.
- Use Unsloth-supported merged save methods for deployment formats, such as merged 16-bit for vLLM serving (see the sketch after this list).
- Avoid manually upcasting a 4-bit model and merging LoRA weights outside the supported save path.
- After saving, immediately run post-training inference against a small held-out set to prove the artifact loads and still follows the JSON action protocol.
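Hedged save-path examples, continuing from the load sketch above; `save_pretrained_merged` with a `save_method` string is the Unsloth-documented route, but confirm the supported method names for the pinned release.

```python
# Adapter-only save: enough for evaluation or continued training.
model.save_pretrained("outputs/adapters/cybersec-owasp-lora")
tokenizer.save_pretrained("outputs/adapters/cybersec-owasp-lora")

# Merged 16-bit save for vLLM serving, via the Unsloth-supported path
# (never upcast the 4-bit base and merge adapters by hand).
model.save_pretrained_merged(
    "outputs/merged/cybersec-owasp-16bit",
    tokenizer,
    save_method="merged_16bit",
)
```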

AGENTS.md
CHANGED
@@ -793,7 +793,7 @@ import os
 from trl import GRPOConfig
 
 output_dir = os.getenv("OUTPUT_DIR", "CyberSecurity_OWASP-qwen3-1.7b-grpo")
-trackio_space_id = os.getenv("TRACKIO_SPACE_ID",
+trackio_space_id = os.getenv("TRACKIO_SPACE_ID", "Humanlearning/CyberSecurity_OWASP-trackio")
 
 grpo_config = GRPOConfig(
     output_dir=output_dir,
@@ -825,6 +825,17 @@ Start with small debug runs before scaling.
 
 Trackio is mandatory for training and evaluation visibility.
 
+Canonical Trackio Space:
+
+```text
+https://huggingface.co/spaces/Humanlearning/CyberSecurity_OWASP-trackio
+```
+
+Use `TRACKIO_SPACE_ID=Humanlearning/CyberSecurity_OWASP-trackio` for training,
+evaluation, and smoke runs. This is separate from the OpenEnv HF Space
+`Humanlearning/CyberSecurity_OWASP`; do not send Trackio runs to the
+environment Space.
+
 Run naming convention:
 
 ```text