Humanlearning committed
Commit 28685f3 · 1 Parent(s): 6abc8c5

feat: add cybersecurity-owasp-trainer skill with reference notes and update AGENTS.md documentation

.agents/skills/cybersecurity-owasp-trainer/SKILL.md ADDED
@@ -0,0 +1,91 @@
+ ---
+ name: cybersecurity-owasp-trainer
+ description: Train, debug, evaluate, and document CyberSecurity_OWASP model runs with OpenEnv, TRL/GRPO, optional Unsloth/QLoRA, Trackio, rollout artifacts, baseline-vs-trained evaluation, and reward-hacking safeguards. Use when working on training scripts, launch commands, rollouts, Trackio metrics, model saving, or hackathon demo evidence for this repo.
+ ---
+
+ # CyberSecurity_OWASP Trainer
+
+ ## Overview
+
+ Use this skill to run or modify the CyberSecurity_OWASP training and evaluation loop without weakening the verifier, reward integrity, or hackathon evidence trail. Treat the environment and reward engine as the product; training only starts after those are stable.
+
+ ## References
+
+ - Load `references/hackathon-training-notes.md` when checking hackathon expectations, demo evidence, reward-hacking safeguards, or scaling order.
+ - Load `references/trl-unsloth-openenv-notes.md` before changing TRL, OpenEnv training integration, Unsloth/QLoRA settings, vLLM settings, or model saving.
+ - Use the repo's existing `openenv-cli` skill for OpenEnv CLI command details.
+ - Use the repo's existing `hugging-face-trackio` skill for Trackio API, dashboard, alert, or metric retrieval details.
+
+ ## Preflight Gate
+
+ Do not start real training until all checks below are true:
+
+ - `reset`, `step`, `state`, typed actions, observations, and terminal states work deterministically.
+ - Verifier and reward tests cover exploit blocking, regression preservation, public routes, visible tests, app boot, anti-cheat, invalid actions, and no repeated reward inflation.
+ - Hidden tests and protected files cannot be read or patched through environment actions.
+ - `send_local_request` is restricted to the generated local app.
+ - A local server or Docker server can run, and at least one manual episode completes.
+ - Scripted random, bad, and oracle policies run without crashing; the oracle policy gets high reward on easy seeds (see the policy-loop sketch after this list).
+ - At least 10 validation rollouts complete and sampled rollout artifacts look behaviorally plausible.
+ - Trackio run config is set and can log a smoke metric locally or to the canonical Space.
+
+ If any gate fails, fix the environment, verifier, reward engine, or rollout parser before touching trainer scale.
+
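+ A minimal sketch of the scripted-policy gate check, assuming an OpenEnv-style client with `reset`/`step` and a step result carrying `observation`, `reward`, and `done`; the action dict shape is illustrative, and the repo's real loop lives in `training/rollout.py`:
+
+ ```python
+ # Hedged sketch: run scripted policies before any training. The client
+ # construction, action schema, and result fields are assumptions here.
+ import random
+
+ def run_episode(env, policy, max_steps=50):
+     obs = env.reset()
+     total = 0.0
+     for _ in range(max_steps):
+         result = env.step(policy(obs))
+         obs = result.observation
+         total += result.reward or 0.0
+         if result.done:
+             break
+     return total
+
+ def random_policy(obs):
+     # Syntactically valid but unguided JSON action.
+     return {"action": random.choice(["read_file", "patch_file", "run_tests"]),
+             "args": {}}
+
+ # Expect near-zero reward for random/bad policies and high reward for the
+ # oracle policy on easy seeds; anything else means the reward is broken.
+ ```
+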
+ ## Repo Training Path
+
+ Prefer the existing repo modules:
+
+ - `training/rollout.py`: full OpenEnv episode loop, action JSON parsing, reward trace, rollout artifact fields.
+ - `training/reward_funcs.py`: component reward functions exposed to TRL/GRPO.
+ - `training/train_grpo.py`: `GRPOConfig`, model defaults, Trackio reporting, vLLM settings.
+ - `training/eval_before_after.py`: baseline-vs-trained and held-out summary metrics.
+ - `training/trackio_utils.py`: run naming, canonical metric names, Trackio init/log/finalize helpers.
+
+ Default environment values:
+
+ ```powershell
+ $env:MODEL_NAME = "Qwen/Qwen3-1.7B"
+ $env:TRACKIO_SPACE_ID = "Humanlearning/CyberSecurity_OWASP-trackio"
+ $env:TRACKIO_PROJECT = "CyberSecurity_OWASP"
+ $env:DIFFICULTY = "0"
+ ```
+
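+ A quick way to satisfy the Trackio smoke-metric gate with these values, assuming Trackio's wandb-style `init`/`log`/`finish` API; check the exact keyword names against the installed trackio version and `training/trackio_utils.py`:
+
+ ```python
+ # Hedged Trackio smoke test; `space_id` routing is an assumption here.
+ import os
+ import trackio
+
+ trackio.init(
+     project=os.getenv("TRACKIO_PROJECT", "CyberSecurity_OWASP"),
+     space_id=os.getenv("TRACKIO_SPACE_ID"),  # omit to log locally
+     name="trackio-smoke",
+ )
+ trackio.log({"smoke/ok": 1.0})
+ trackio.finish()
+ ```
+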
+ Use level-0 debug runs before scaling. Do not increase batch size, prompt count, scenario diversity, or difficulty until sampled artifacts show real discover-then-patch behavior rather than formatting compliance only.
+
+ ## Training Workflow
+
+ 1. Validate the environment first: run the targeted tests that cover models, reset/step/state, rewards, anti-cheat, seed reproducibility, invalid actions, and rollouts.
+ 2. Run a tiny smoke path that constructs `GRPOConfig` without starting expensive training (see the sketch after this list).
+ 3. Run a frozen-model or dummy-policy rollout and inspect the action trace, observations, terminal reason, and reward breakdown.
+ 4. Confirm Trackio receives component metrics and the run name follows `CyberSecurity_OWASP-<model>-<algo>-level<difficulty>-<YYYYMMDD-HHMM>-<git_sha>`.
+ 5. Start a very small GRPO run only after the above passes. Watch completions and rollout artifacts during the run, not just aggregate reward.
+ 6. Evaluate baseline, trained, and held-out splits with `training/eval_before_after.py` and save summaries under `outputs/evals/`.
+ 7. Save sampled rollouts under `outputs/rollouts/` for baseline, mid-training, trained, and held-out evidence.
+
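+ A sketch covering steps 2 and 4, assuming TRL's `GRPOConfig` and the run-name convention above; the field values are illustrative smoke defaults, not the repo's tuned settings in `training/train_grpo.py`:
+
+ ```python
+ # Build GRPOConfig cheaply and derive a conforming run name. No training
+ # starts here; this only proves the config and naming code paths work.
+ import os
+ import subprocess
+ from datetime import datetime, timezone
+
+ from trl import GRPOConfig
+
+ config = GRPOConfig(
+     output_dir=os.getenv("OUTPUT_DIR", "outputs/smoke"),
+     num_generations=2,              # low-cost smoke/debug default
+     per_device_train_batch_size=2,  # global batch must divide evenly by num_generations
+     max_steps=1,
+ )
+
+ sha = subprocess.run(["git", "rev-parse", "--short", "HEAD"],
+                      capture_output=True, text=True).stdout.strip()
+ stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
+ run_name = f"CyberSecurity_OWASP-qwen3-1.7b-grpo-level0-{stamp}-{sha}"
+ print(config.num_generations, run_name)
+ ```
+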
+ ## Reward And Monitoring
+
+ Track at least these behavior columns (a logging sketch follows the list):
+
+ - Reward components: total, discovery, security, regression, public routes, patch quality, visible tests, safety, anti-cheat.
+ - Rates: success, exploit-block, regression preservation, public-route preservation, anti-cheat pass, invalid action, timeout, safety violation, reward-hacking suspected.
+ - Efficiency: episode length mean/p95, rollouts per second, tokens per second, loss, learning rate, KL, grad norm.
+ - Environment timing: reset, step, verifier, reward, scenario compile, error rate, difficulty, seed.
+
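+ One step of component logging might look like the following; the metric names mirror the columns above but are illustrative, since the canonical names live in `training/trackio_utils.py`:
+
+ ```python
+ # Hedged sketch: log component metrics after trackio.init(...) has run.
+ import trackio
+
+ trackio.log({
+     "reward/total": 0.42,
+     "reward/discovery": 0.10,
+     "reward/anti_cheat": 0.0,
+     "rates/exploit_block": 0.55,
+     "rates/invalid_action": 0.05,
+     "env/step_seconds_p95": 1.8,
+ })
+ ```
+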
+ Stop or roll back if reward rises while sampled traces show deny-all patches, hardcoded users/resources/tenants, fixture/test tampering, repeated invalid actions, locked-down public routes, or visible-test-only optimization.
+
+ ## TRL, OpenEnv, And Unsloth Guidance
+
+ - Use TRL GRPO for verifier-driven rewards. Keep multiple independent reward functions for logging and diagnosis.
+ - Keep the existing custom rollout path unless deliberately migrating to TRL's `environment_factory`. If migrating, preserve typed actions, observations, reward component logging, anti-cheat flags, and rollout artifacts.
+ - Use vLLM colocate for small local runs when memory allows; use server mode only when a separate inference GPU/server is available.
+ - For OpenEnv server training concurrency, ensure the server supports enough concurrent sessions for the generation batch.
+ - Use Unsloth with LoRA or QLoRA for memory efficiency when the training machine supports it. Start from an instruct-capable checkpoint and verify the model has non-zero success probability before RL.
+ - Pin and smoke-test TRL, Unsloth, vLLM, CUDA, and torch versions before longer runs.
+ - Save LoRA adapters or use Unsloth-supported merged save paths. Do not naively upcast a 4-bit model and merge adapters manually.
+
+ ## Demo Evidence
+
+ - Report baseline vs trained success and reward improvements.
+ - Include held-out split results, exploit-block rate, regression-preservation rate, public-route preservation rate, and anti-cheat pass rate.
+ - Show representative rollout traces before and after training.
+ - Explain how hidden verifier checks, anti-cheat checks, randomized scenarios, and held-out combinations reduce reward hacking and overfitting.
.agents/skills/cybersecurity-owasp-trainer/agents/openai.yaml ADDED
@@ -0,0 +1,4 @@
+ interface:
+   display_name: "CyberSecurity OWASP Trainer"
+   short_description: "Train and evaluate CyberSecurity_OWASP"
+   default_prompt: "Use $cybersecurity-owasp-trainer to run a safe GRPO training/evaluation pass for CyberSecurity_OWASP."
.agents/skills/cybersecurity-owasp-trainer/references/hackathon-training-notes.md ADDED
@@ -0,0 +1,71 @@
+ # Hackathon Training Notes
+
+ Sources:
+
+ - `D:/delete_later/[External] Meta OpenEnv Hackathon Participant Help Guide.docx`
+ - `D:/delete_later/Hackathon FAQs (participants).docx`
+ - `D:/delete_later/OpenEnv_Hackathon_Resources.docx`
+
+ ## Core Build Order
+
+ - Build the environment before the trainer. Define observation, action space, episode end conditions, reward, abuse limits, and deterministic replay first.
+ - Treat the verifier and reward engine as the task specification. Prefer executable checks over subjective judgments.
+ - Use OpenEnv for `reset`, `step`, `state`, typed action/observation models, server deployment, and trainer integration.
+ - Use TRL for GRPO or related post-training, and use Unsloth when memory or rollout speed is the bottleneck.
+ - Deploy or run the environment early through local Python, Docker, or Hugging Face Spaces so packaging and client-server issues surface before training.
+
+ ## Training Readiness
+
+ Do not start meaningful training until:
+
+ - `reset`, `step`, rewards, timeouts, and logs work locally.
+ - The verifier has been adversarially tested.
+ - At least a few easy tasks produce non-zero reward.
+ - Random, bad, and oracle policies behave as expected.
+ - Sample rollouts can be inspected and do not reveal reward-hacking shortcuts.
+
+ Use this scale-up order:
+
+ 1. One manual episode.
+ 2. Scripted policies.
+ 3. Ten validation rollouts.
+ 4. Tiny frozen-model or debug GRPO run.
+ 5. Larger rollout count.
+ 6. Full training run.
+
+ ## Reward Engineering
+
+ - Use multiple independent reward components instead of one scalar only (see the sketch after this list).
+ - Reward true outcomes first: exploit blocked, legitimate flows preserved, public routes preserved, app boots, and visible tests pass.
+ - Penalize shortcuts: protected-file edits, hidden-test access, hardcoded identities/resources, deny-all patches, external network attempts, and environment abuse.
+ - Keep explanation quality auxiliary only; do not let an LLM judge dominate the primary reward.
+ - Watch for rising reward without better behavior. That usually means the reward was hacked or the verifier is too weak.
+
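+ A toy combination of these ideas, with entirely hypothetical component names (this repo's actual schema is in `training/reward_funcs.py`):
+
+ ```python
+ # Illustrative only: true outcomes add reward, shortcut penalties subtract.
+ OUTCOMES = ("exploit_blocked", "flows_preserved", "public_routes_ok",
+             "app_boots", "visible_tests_pass")
+ PENALTIES = ("protected_file_edit", "hidden_test_access",
+              "hardcoded_identity", "deny_all_patch", "network_attempt")
+
+ def combine(components: dict) -> float:
+     reward = sum(components.get(k, 0.0) for k in OUTCOMES)
+     reward -= sum(components.get(k, 0.0) for k in PENALTIES)
+     return reward
+ ```
+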
+ ## Curriculum
+
+ - Start with short-horizon, easy, high-signal tasks where success probability is above zero.
+ - Increase difficulty only after the model gets reliable partial reward.
+ - If exploit blocking is poor, add easier security tasks.
+ - If regressions increase, add positive-flow and public-route traps.
+ - If validation reward plateaus, add unseen layouts, domains, and harder held-out combinations.
+
+ ## Monitoring And Demo Evidence
+
+ Track overall reward, component rewards, success indicators, timeouts, invalid actions, and sampled generated strategies. Inspect actual rollouts throughout training.
+
+ A strong hackathon demo shows:
+
+ - Baseline attempt and verifier output.
+ - Trained attempt and measurable improvement.
+ - Held-out domain/layout/bug results.
+ - Reward curves and component metrics.
+ - Anti-cheat evidence showing the model did not learn deny-all, hardcoding, or fixture tampering.
+
+ ## Common Mistakes
+
+ - Training before the environment and verifier are stable.
+ - Choosing a task with near-zero chance of reward.
+ - Using only one reward function.
+ - Monitoring average reward but not sampled behavior.
+ - Forgetting timeouts, sandboxing, or protected-file checks.
+ - Saving LoRA/QLoRA models through an unsafe merge path.
.agents/skills/cybersecurity-owasp-trainer/references/trl-unsloth-openenv-notes.md ADDED
@@ -0,0 +1,43 @@
+ # TRL, Unsloth, And OpenEnv Notes
+
+ Sources checked for this skill:
+
+ - TRL GRPO Trainer: https://huggingface.co/docs/trl/en/grpo_trainer
+ - TRL OpenEnv integration: https://huggingface.co/docs/trl/en/openenv
+ - Unsloth RL Guide: https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide
+ - Unsloth Advanced RL Documentation: https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/advanced-rl-documentation
+ - Unsloth vLLM deployment/saving guide: https://unsloth.ai/docs/basics/inference-and-deployment/vllm-guide
+
+ Recheck these pages before major dependency upgrades because TRL, OpenEnv integration, vLLM, and Unsloth RL APIs move quickly.
+
+ ## TRL GRPO
+
+ - GRPO is an online RL method that samples multiple completions per prompt, scores them with reward functions, and optimizes relative advantage within the group.
+ - `GRPOTrainer` accepts one or more reward functions. Custom reward functions receive prompts, completions, completion IDs, trainer state, and dataset columns through keyword arguments (see the sketch after this list).
+ - Multiple reward functions are summed unless reward weights are configured. Use separate component functions for logging and diagnosis.
+ - TRL logs component reward means/stds, total reward, completion length, KL when enabled, entropy, clipping metrics, and token/step timing.
+ - vLLM is the main acceleration path for generation. Colocate mode shares the trainer process/GPU; server mode is better when inference has separate GPUs.
+
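+ A minimal sketch of that reward-function contract, following the TRL GRPO docs; the string check is a placeholder, since real scoring would replay the repo's verifier:
+
+ ```python
+ # One float per completion; extra dataset columns arrive through **kwargs.
+ def exploit_block_reward(completions, **kwargs):
+     rewards = []
+     for completion in completions:
+         text = completion if isinstance(completion, str) else str(completion)
+         # Placeholder: real logic would call the verifier, not string-match.
+         rewards.append(1.0 if "patch_file" in text else 0.0)
+     return rewards
+
+ # Registered alongside other components, e.g.:
+ # GRPOTrainer(model=..., reward_funcs=[exploit_block_reward, ...], ...)
+ ```
+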
+ ## TRL With OpenEnv
+
+ - Use environment training when state carries across turns and observations depend on prior actions.
+ - Current TRL docs prefer `environment_factory` for automatic multi-turn tool loops. It exposes public methods as tools and uses an `environments` argument in reward functions.
+ - `rollout_func` is still appropriate when the repo needs a custom generation, parsing, artifact, or client loop. CyberSecurity_OWASP currently has this shape in `training/rollout.py`.
+ - If migrating from `rollout_func` to `environment_factory`, preserve typed action validation, phase gating, reward breakdowns, anti-cheat flags, and rollout artifact output.
+ - For concurrent training, match OpenEnv server session capacity to the generation batch. Create clients lazily in `reset` and close old sessions before reopening.
+
+ ## Unsloth RL Guidance
+
+ - Use Unsloth for memory-efficient LoRA/QLoRA GRPO when local hardware is constrained.
+ - Start from a capable instruct model or lightly format-tuned model. If success probability is effectively zero, RL will not bootstrap.
+ - Keep reward functions/verifiers simple and trustworthy first; add shaping only after sparse reward blocks learning.
+ - Unsloth recipes commonly use Qwen, Gemma, Llama, Phi, Mistral, and gpt-oss variants. For this repo, prefer the configured `Qwen/Qwen3-1.7B` or another small instruct/coder checkpoint for smoke runs.
+ - For Unsloth-specific GRPO recipes, use more than two generations per prompt when hardware allows. Keep the repo's small `num_generations=2` only as a low-cost smoke/debug default unless tests prove it is sufficient.
+ - Pin torch, CUDA, vLLM, TRL, and Unsloth versions for any serious run, then run a short smoke test before scaling.
+
+ ## Saving And Serving
+
+ - Save LoRA adapters directly when adapters are enough for evaluation or continued training.
+ - Use Unsloth-supported merged save methods for deployment formats, such as merged 16-bit for vLLM serving (see the save sketch after this list).
+ - Avoid manually upcasting a 4-bit model and merging LoRA weights outside the supported save path.
+ - After saving, immediately run post-training inference against a small held-out set to prove the artifact loads and still follows the JSON action protocol.
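+
+ A sketch of the two save paths, using method names from Unsloth's saving docs; confirm them against the pinned unsloth version, and assume `model` and `tokenizer` come from `FastLanguageModel.from_pretrained(...)`:
+
+ ```python
+ # 1) Adapters only: enough for evaluation or resuming training.
+ model.save_pretrained("outputs/lora_adapters")
+ tokenizer.save_pretrained("outputs/lora_adapters")
+
+ # 2) Merged 16-bit export for vLLM serving; avoids manual 4-bit upcasting.
+ model.save_pretrained_merged(
+     "outputs/merged_16bit", tokenizer, save_method="merged_16bit",
+ )
+ ```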
AGENTS.md CHANGED
@@ -793,7 +793,7 @@ import os
  from trl import GRPOConfig

  output_dir = os.getenv("OUTPUT_DIR", "CyberSecurity_OWASP-qwen3-1.7b-grpo")
- trackio_space_id = os.getenv("TRACKIO_SPACE_ID", output_dir)
+ trackio_space_id = os.getenv("TRACKIO_SPACE_ID", "Humanlearning/CyberSecurity_OWASP-trackio")

  grpo_config = GRPOConfig(
      output_dir=output_dir,
@@ -825,6 +825,17 @@ Start with small debug runs before scaling.

  Trackio is mandatory for training and evaluation visibility.

+ Canonical Trackio Space:
+
+ ```text
+ https://huggingface.co/spaces/Humanlearning/CyberSecurity_OWASP-trackio
+ ```
+
+ Use `TRACKIO_SPACE_ID=Humanlearning/CyberSecurity_OWASP-trackio` for training,
+ evaluation, and smoke runs. This is separate from the OpenEnv HF Space
+ `Humanlearning/CyberSecurity_OWASP`; do not send Trackio runs to the
+ environment Space.
+
  Run naming convention:

  ```text