Humanlearning committed
Commit 60f97ab · 1 Parent(s): f7b8ac6

feat: expand README with synthetic SFT dataset generation instructions, enhance dataset verification and pushing to Hugging Face Hub, and improve modal training scripts with default configurations for curriculum and GPU fallback

README.md CHANGED
@@ -256,6 +256,100 @@ The shell wrapper is equivalent:
256
  MODE=smoke EPISODES=4 uv run --extra modal bash scripts/modal_run_ephemeral.sh
257
  ```
258
 
259
+ ## Synthetic SFT Before GRPO
260
+
261
+ Use supervised fine-tuning to warm-start `unsloth/gemma-4-E2B-it` before GRPO.
262
+ The SFT generator executes every teacher action in the real environment and
263
+ keeps only trajectories that pass the deterministic reward verifier.
264
+
265
+ Generate a 300-train-episode curriculum SFT dataset across levels `0,1,2,3`:
266
+
267
+ ```bash
268
+ uv run python scripts/generate_sft_dataset.py \
269
+ --teacher-model deepseek-ai/DeepSeek-V4-Pro \
270
+ --target-model unsloth/gemma-4-E2B-it \
271
+ --difficulty-levels 0,1,2,3 \
272
+ --difficulty-buckets 4 \
273
+ --episodes 75 \
274
+ --validation-episodes 20 \
275
+ --workers 8 \
276
+ --out-dir outputs/sft
277
+ ```
278
+
279
+ `--episodes` is per difficulty level when `--difficulty-levels` is set, so
280
+ `--episodes 75` across four levels gives 300 total train episodes. Expect
281
+ roughly 2,400-4,500 chat-format JSONL rows because each successful trajectory
282
+ contributes one row per action step. The script writes JSONL rows under
283
+ `outputs/sft/`, trajectory artifacts under `outputs/sft/trajectories/`, a
284
+ dataset card at `outputs/sft/README.md`, and `outputs/sft/manifest.json` with
285
+ reward summaries and curriculum coverage.
286
+
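+ As a quick sanity check, the manifest can be inspected directly. A minimal
+ sketch, assuming the manifest field names shown above:
+
+ ```python
+ import json
+ from pathlib import Path
+
+ # Minimal sketch: inspect curriculum coverage in the generated manifest.
+ # Field names follow the manifest written by scripts/generate_sft_dataset.py.
+ manifest = json.loads(Path("outputs/sft/manifest.json").read_text(encoding="utf-8"))
+ print("acceptance rate:", manifest["acceptance_rate"])
+ print("rows by difficulty:", manifest["rows_by_difficulty"])
+ print("verifier passed:", manifest["reward_verification"]["passed"])
+ ```
+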
287
+ Verify reward metadata before any training run:
288
+
289
+ ```bash
290
+ uv run python scripts/generate_sft_dataset.py \
291
+ --verify-only \
292
+ --difficulty-levels 0,1,2,3 \
293
+ --out-dir outputs/sft
294
+ ```
295
+
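+ The command prints a single JSON verification report and exits with status 2
+ when any check fails. The report shape is roughly as follows (abridged; the
+ values here are illustrative):
+
+ ```json
+ {
+   "reward_verification": {
+     "passed": true,
+     "checked_rows": 2870,
+     "failure_count": 0,
+     "rows_by_split": {"train": 2450, "validation": 420},
+     "missing_difficulties": [],
+     "min_terminal_reward": 12.0
+   }
+ }
+ ```
+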
296
+ Push the verified dataset to Hugging Face Hub:
297
+
298
+ ```bash
299
+ uv run python scripts/generate_sft_dataset.py \
300
+ --push-only \
301
+ --difficulty-levels 0,1,2,3 \
302
+ --out-dir outputs/sft \
303
+ --dataset-repo-id Humanlearning/CyberSecurity_OWASP-sft-dataset
304
+ ```
305
+
306
+ The canonical dataset repo name is
307
+ `Humanlearning/CyberSecurity_OWASP-sft-dataset`. The upload is refused if
308
+ reward verification fails or `HF_TOKEN` is missing.
309
+
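+ Once pushed, the JSONL splits load like any Hub-hosted files. A minimal
+ sketch, assuming the `datasets` library and a valid `HF_TOKEN` for private
+ repos:
+
+ ```python
+ from datasets import load_dataset
+
+ # Minimal sketch: load the pushed SFT splits straight from the Hub.
+ repo = "Humanlearning/CyberSecurity_OWASP-sft-dataset"
+ ds = load_dataset(
+     "json",
+     data_files={
+         "train": f"hf://datasets/{repo}/train.jsonl",
+         "validation": f"hf://datasets/{repo}/validation.jsonl",
+     },
+ )
+ print(ds["train"][0]["messages"][-1]["role"])  # expected: "assistant"
+ ```
+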
310
+ You can also generate and push in one command by adding `--push-to-hub` to the
311
+ generation command.
312
+
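+ A minimal sketch of that combined invocation, reusing the flags documented
+ above:
+
+ ```bash
+ uv run python scripts/generate_sft_dataset.py \
+ --difficulty-levels 0,1,2,3 \
+ --episodes 75 \
+ --validation-episodes 20 \
+ --workers 8 \
+ --out-dir outputs/sft \
+ --push-to-hub \
+ --dataset-repo-id Humanlearning/CyberSecurity_OWASP-sft-dataset
+ ```
+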
313
+ For local CI or smoke checks, add `--dry-run-oracle`; official SFT data should
314
+ use the teacher path and still pass the verifier gate above.
315
+
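+ A smoke-sized sketch (episode counts deliberately tiny; suitable for CI, not
+ for producing training data):
+
+ ```bash
+ uv run python scripts/generate_sft_dataset.py \
+ --dry-run-oracle \
+ --difficulty-levels 0,1 \
+ --episodes 2 \
+ --out-dir outputs/sft-smoke \
+ --progress
+ ```
+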
316
+ Launch SFT on Modal after reward verification passes:
317
+
318
+ ```bash
319
+ uv run --extra modal modal run --detach scripts/modal_train_sft.py \
320
+ --local-train-path outputs/sft/train.jsonl \
321
+ --local-validation-path outputs/sft/validation.jsonl \
322
+ --local-manifest-path outputs/sft/manifest.json \
323
+ --required-difficulties 0,1,2,3 \
324
+ --trackio-space-id Humanlearning/CyberSecurity_OWASP-trackio \
325
+ --trackio-project CyberSecurity_OWASP-sft \
326
+ --output-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
327
+ --push-to-hub \
328
+ --detach
329
+ ```
330
+
331
+ `scripts/modal_train_sft.py` re-checks the JSONL reward metadata locally before
332
+ upload and again inside Modal before loading the model. It refuses to start SFT
333
+ unless all required curriculum difficulties are represented and the verifier
334
+ reward metadata passes. The default SFT config trains one full epoch
335
+ (`--max-steps -1`) with packed assistant-only loss, bf16/tf32, LoRA rank 32,
336
+ and Modal GPU fallback `H200 -> H100 -> A100-80GB -> L40S`. A warm run for the
337
+ 300-episode dataset should usually finish in about 15-45 minutes; first image
338
+ or model-cache builds can push that closer to 35-75 minutes.
339
+
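+ Any of those defaults can be overridden per flag. A minimal sketch of a
+ capped smoke run (step and batch values here are illustrative):
+
+ ```bash
+ uv run --extra modal modal run scripts/modal_train_sft.py \
+ --local-train-path outputs/sft/train.jsonl \
+ --local-validation-path outputs/sft/validation.jsonl \
+ --local-manifest-path outputs/sft/manifest.json \
+ --max-steps 25 \
+ --per-device-train-batch-size 2 \
+ --gradient-accumulation-steps 8
+ ```
+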
340
+ Continue GRPO from the SFT LoRA:
341
+
342
+ ```bash
343
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
344
+ --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
345
+ --max-steps 300 \
346
+ --dataset-size 64 \
347
+ --num-generations 8 \
348
+ --difficulty 0 \
349
+ --trace-log-every 10 \
350
+ --detach
351
+ ```
352
+
353
  ## Modal GRPO Training
354
 
355
  The persistent GPU training launcher packages this local repo into Modal, trains
scripts/generate_sft_dataset.py CHANGED
@@ -14,6 +14,8 @@ import json
14
  import os
15
  import statistics
16
  import subprocess
17
  from dataclasses import dataclass
18
  from pathlib import Path
19
  from typing import Any, Iterable
@@ -72,6 +74,14 @@ class DatasetConfig:
72
  temperature: float = 0.2
73
  top_p: float = 0.95
74
  dry_run_oracle: bool = False
75
 
76
 
77
  class HuggingFaceTeacher:
@@ -579,6 +589,174 @@ def write_jsonl(path: Path, rows: Iterable[dict[str, Any]]) -> None:
579
  handle.write(json.dumps(row, sort_keys=True, default=str) + "\n")
580
 
581
 
582
  def _write_trajectory(out_dir: Path, trajectory: dict[str, Any]) -> Path:
583
  traj_dir = out_dir / "trajectories"
584
  traj_dir.mkdir(parents=True, exist_ok=True)
@@ -622,80 +800,365 @@ def _reward_summary(values: list[float]) -> dict[str, float]:
622
  }
623
 
624
 
625
  def generate_dataset(config: DatasetConfig) -> dict[str, Any]:
626
  config.out_dir.mkdir(parents=True, exist_ok=True)
627
- teacher = None
628
  if not config.dry_run_oracle:
629
- token = os.getenv("HF_TOKEN")
630
- if not token:
631
  raise RuntimeError("HF_TOKEN is required unless --dry-run-oracle is set")
632
- teacher = HuggingFaceTeacher(
633
- model=config.teacher_model,
634
- token=token,
635
- max_tokens=config.max_tokens,
636
- temperature=config.temperature,
637
- top_p=config.top_p,
638
- )
639
 
640
  split_jobs = [(config.split, config.episodes, config.seed_start)]
641
  if config.validation_episodes:
642
- split_jobs.append(("validation", config.validation_episodes, config.seed_start + config.episodes))
643
 
644
  rows_by_split: dict[str, list[dict[str, Any]]] = {"train": [], "validation": []}
645
  attempts: list[dict[str, Any]] = []
646
  rewards: list[float] = []
647
  accepted = 0
648
- attempted = 0
649
- for split, episodes, seed_start in split_jobs:
650
- for offset in range(int(episodes)):
651
- seed = int(seed_start) + offset
652
- attempted += 1
653
- result = run_episode(
654
  seed=seed,
655
  split=split,
656
- difficulty=config.difficulty,
657
  config=config,
658
- teacher=teacher,
659
- )
660
- attempts.append(
661
- {
662
- "seed": seed,
663
- "split": split,
664
- "accepted": bool(result["accepted"]),
665
- "reason": result.get("reason", ""),
666
- "trajectory_path": str(_write_trajectory(config.out_dir, result["trajectory"])),
667
- }
668
  )
669
- if result["accepted"]:
670
- accepted += 1
671
- rows = list(result["rows"])
672
- rows_by_split.setdefault(split, []).extend(rows)
673
- rewards.append(float(result["trajectory"].get("terminal_total", 0.0)))
674
 
675
  for split_name in ("train", "validation", config.split):
676
  write_jsonl(config.out_dir / f"{split_name}.jsonl", rows_by_split.get(split_name, []))
677
 
678
  manifest = {
679
  "teacher_model": config.teacher_model,
680
  "target_model": config.target_model,
681
  "split": config.split,
682
  "difficulty": config.difficulty,
683
  "seed_start": config.seed_start,
684
  "episodes_attempted": attempted,
685
  "episodes_accepted": accepted,
686
  "acceptance_rate": accepted / attempted if attempted else 0.0,
687
  "rows_by_split": {key: len(value) for key, value in sorted(rows_by_split.items())},
688
  "reward_summary": _reward_summary(rewards),
689
  "git_sha": _git_sha(),
690
  "verifier_version": "verifier_v1",
691
  "dry_run_oracle": config.dry_run_oracle,
692
  "attempts": attempts,
693
  }
694
  manifest_path = config.out_dir / "manifest.json"
695
  manifest_path.write_text(
696
  json.dumps(manifest, indent=2, sort_keys=True, default=str),
697
  encoding="utf-8",
698
  )
699
  return manifest
700
 
701
 
@@ -705,6 +1168,21 @@ def build_arg_parser() -> argparse.ArgumentParser:
705
  parser.add_argument("--target-model", default=DEFAULT_TARGET_MODEL)
706
  parser.add_argument("--split", default="train", choices=["train", "validation", "hidden_eval"])
707
  parser.add_argument("--difficulty", type=int, default=0)
708
  parser.add_argument("--seed-start", type=int, default=0)
709
  parser.add_argument("--episodes", type=int, default=100)
710
  parser.add_argument("--validation-episodes", type=int, default=0)
@@ -714,6 +1192,48 @@ def build_arg_parser() -> argparse.ArgumentParser:
714
  parser.add_argument("--max-tokens", type=int, default=768)
715
  parser.add_argument("--temperature", type=float, default=0.2)
716
  parser.add_argument("--top-p", type=float, default=0.95)
717
  parser.add_argument(
718
  "--dry-run-oracle",
719
  action="store_true",
@@ -728,6 +1248,8 @@ def config_from_args(args: argparse.Namespace) -> DatasetConfig:
728
  target_model=args.target_model,
729
  split=args.split,
730
  difficulty=args.difficulty,
731
  seed_start=args.seed_start,
732
  episodes=args.episodes,
733
  validation_episodes=args.validation_episodes,
@@ -738,15 +1260,50 @@ def config_from_args(args: argparse.Namespace) -> DatasetConfig:
738
  temperature=args.temperature,
739
  top_p=args.top_p,
740
  dry_run_oracle=args.dry_run_oracle,
741
  )
742
 
743
 
744
  def main(argv: list[str] | None = None) -> int:
745
  parser = build_arg_parser()
746
  args = parser.parse_args(argv)
747
- manifest = generate_dataset(config_from_args(args))
748
- print(json.dumps(manifest, indent=2, sort_keys=True))
749
- return 0
750
 
751
 
752
  if __name__ == "__main__":
 
14
  import os
15
  import statistics
16
  import subprocess
17
+ import threading
18
+ from concurrent.futures import ThreadPoolExecutor, as_completed
19
  from dataclasses import dataclass
20
  from pathlib import Path
21
  from typing import Any, Iterable
 
74
  temperature: float = 0.2
75
  top_p: float = 0.95
76
  dry_run_oracle: bool = False
77
+ workers: int = 0
78
+ min_terminal_reward: float = 12.0
79
+ difficulty_levels: tuple[int, ...] = ()
80
+ difficulty_buckets: int = 0
81
+ push_to_hub: bool = False
82
+ dataset_repo_id: str = "Humanlearning/CyberSecurity_OWASP-sft-dataset"
83
+ hub_private: bool = False
84
+ progress: bool = False
85
 
86
 
87
  class HuggingFaceTeacher:
 
589
  handle.write(json.dumps(row, sort_keys=True, default=str) + "\n")
590
 
591
 
592
+ def write_dataset_card(out_dir: Path, manifest: dict[str, Any], dataset_repo_id: str) -> Path:
593
+ card_path = out_dir / "README.md"
594
+ difficulty_levels = manifest.get("difficulty_levels", [])
595
+ reward_verification = manifest.get("reward_verification", {})
596
+ card = f"""---
597
+ license: apache-2.0
598
+ task_categories:
599
+ - text-generation
600
+ language:
601
+ - en
602
+ tags:
603
+ - cybersecurity
604
+ - owasp
605
+ - openenv
606
+ - tool-use
607
+ - sft
608
+ pretty_name: CyberSecurity_OWASP SFT Dataset
609
+ ---
610
+
611
+ # CyberSecurity_OWASP SFT Dataset
612
+
613
+ This dataset contains verifier-gated supervised fine-tuning examples for the
614
+ `CyberSecurity_OWASP` OpenEnv environment. Each row teaches one step of the
615
+ defensive local AppSec workflow: inspect policy/code, reproduce a local
616
+ authorization failure, submit a policy-tied diagnosis, patch the generated app,
617
+ run visible tests, and submit the fix.
618
+
619
+ Every kept trajectory is executed against the real local environment and must
620
+ pass the deterministic reward verifier before rows are written.
621
+
622
+ ## Intended Use
623
+
624
+ - Target SFT model: `{manifest.get("target_model", "")}`
625
+ - Teacher model: `{manifest.get("teacher_model", "")}`
626
+ - Dataset repo: `{dataset_repo_id}`
627
+ - Format: chat JSONL with `messages` and verifier metadata
628
+ - Dry-run oracle: `{manifest.get("dry_run_oracle", False)}`
629
+
630
+ ## Curriculum Coverage
631
+
632
+ - Difficulty levels: `{difficulty_levels}`
633
+ - Episodes attempted: `{manifest.get("episodes_attempted", 0)}`
634
+ - Episodes accepted: `{manifest.get("episodes_accepted", 0)}`
635
+ - Acceptance rate: `{manifest.get("acceptance_rate", 0.0):.4f}`
636
+ - Rows by split: `{json.dumps(manifest.get("rows_by_split", {}), sort_keys=True)}`
637
+ - Rows by difficulty: `{json.dumps(manifest.get("rows_by_difficulty", {}), sort_keys=True)}`
638
+
639
+ ## Reward Verification
640
+
641
+ - Passed: `{reward_verification.get("passed", False)}`
642
+ - Checked rows: `{reward_verification.get("checked_rows", 0)}`
643
+ - Minimum terminal reward: `{reward_verification.get("min_terminal_reward", 0.0)}`
644
+ - Reward summary: `{json.dumps(reward_verification.get("reward_summary", {}), sort_keys=True)}`
645
+
646
+ Rows are rejected if the episode fails hidden security/regression/public-route
647
+ checks, triggers anti-cheat flags, lacks a positive patch-quality reward, or
648
+ falls below the configured terminal reward threshold.
649
+
650
+ ## Schema
651
+
652
+ Each JSONL row has:
653
+
654
+ ```json
655
+ {{
656
+ "messages": [
657
+ {{"role": "system", "content": "..."}},
658
+ {{"role": "user", "content": "..."}},
659
+ {{"role": "assistant", "content": "{{\\"tool_name\\":\\"...\\",\\"arguments\\":{{...}}}}"}}
660
+ ],
661
+ "metadata": {{
662
+ "target_model": "...",
663
+ "teacher_model": "...",
664
+ "seed": 0,
665
+ "split": "train",
666
+ "difficulty": 0,
667
+ "step": 1,
668
+ "tool_name": "inspect_policy_graph",
669
+ "final_success": true,
670
+ "terminal_total": 12.5,
671
+ "anti_cheat_flags": []
672
+ }}
673
+ }}
674
+ ```
675
+ """
676
+ card_path.write_text(card, encoding="utf-8")
677
+ return card_path
678
+
679
+
680
+ def push_dataset_to_hub(out_dir: Path, *, repo_id: str, private: bool) -> dict[str, Any]:
681
+ token = os.getenv("HF_TOKEN")
682
+ if not token:
683
+ raise RuntimeError("HF_TOKEN is required for --push-to-hub")
684
+ try:
685
+ from huggingface_hub import HfApi
686
+ except ImportError as exc: # pragma: no cover
687
+ raise RuntimeError("huggingface_hub is required for --push-to-hub") from exc
688
+
689
+ api = HfApi(token=token)
690
+ api.create_repo(repo_id=repo_id, repo_type="dataset", private=private, exist_ok=True)
691
+ commit_info = api.upload_folder(
692
+ repo_id=repo_id,
693
+ repo_type="dataset",
694
+ folder_path=str(out_dir),
695
+ path_in_repo=".",
696
+ commit_message="Upload verified CyberSecurity_OWASP SFT dataset",
697
+ delete_patterns=[
698
+ "README.md",
699
+ "manifest.json",
700
+ "train.jsonl",
701
+ "validation.jsonl",
702
+ "hidden_eval.jsonl",
703
+ "trajectories/**",
704
+ ],
705
+ )
706
+ return {
707
+ "repo_id": repo_id,
708
+ "private": bool(private),
709
+ "url": f"https://huggingface.co/datasets/{repo_id}",
710
+ "commit_url": getattr(commit_info, "commit_url", ""),
711
+ }
712
+
713
+
714
+ def push_existing_dataset(
715
+ out_dir: Path,
716
+ *,
717
+ repo_id: str,
718
+ private: bool,
719
+ min_terminal_reward: float,
720
+ required_difficulties: tuple[int, ...],
721
+ ) -> dict[str, Any]:
722
+ verification = verify_sft_dataset_rewards(
723
+ out_dir,
724
+ min_terminal_reward=min_terminal_reward,
725
+ require_train_rows=True,
726
+ required_difficulties=required_difficulties,
727
+ )
728
+ if not verification["passed"]:
729
+ raise RuntimeError(f"Reward verification failed; refusing Hub push: {verification}")
730
+ manifest_path = out_dir / "manifest.json"
731
+ if manifest_path.exists():
732
+ manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
733
+ else:
734
+ manifest = {
735
+ "teacher_model": DEFAULT_TEACHER_MODEL,
736
+ "target_model": DEFAULT_TARGET_MODEL,
737
+ "difficulty_levels": [int(level) for level in required_difficulties],
738
+ "rows_by_split": verification.get("rows_by_split", {}),
739
+ }
740
+ manifest["reward_verification"] = verification
741
+ manifest["hub"] = {
742
+ "repo_id": repo_id,
743
+ "private": bool(private),
744
+ "url": f"https://huggingface.co/datasets/{repo_id}",
745
+ }
746
+ write_dataset_card(out_dir, manifest, repo_id)
747
+ manifest_path.write_text(
748
+ json.dumps(manifest, indent=2, sort_keys=True, default=str),
749
+ encoding="utf-8",
750
+ )
751
+ hub_result = push_dataset_to_hub(out_dir, repo_id=repo_id, private=private)
752
+ manifest["hub"].update(hub_result)
753
+ manifest_path.write_text(
754
+ json.dumps(manifest, indent=2, sort_keys=True, default=str),
755
+ encoding="utf-8",
756
+ )
757
+ return {"reward_verification": verification, "hub": manifest["hub"]}
758
+
759
+
760
  def _write_trajectory(out_dir: Path, trajectory: dict[str, Any]) -> Path:
761
  traj_dir = out_dir / "trajectories"
762
  traj_dir.mkdir(parents=True, exist_ok=True)
 
800
  }
801
 
802
 
803
+ def _parse_int_csv(value: str) -> tuple[int, ...]:
804
+ if not value.strip():
805
+ return ()
806
+ levels = []
807
+ for item in value.split(","):
808
+ stripped = item.strip()
809
+ if not stripped:
810
+ continue
811
+ levels.append(int(stripped))
812
+ return tuple(dict.fromkeys(levels))
813
+
814
+
815
+ def _difficulty_levels(config: DatasetConfig) -> tuple[int, ...]:
816
+ if config.difficulty_levels:
817
+ return tuple(int(level) for level in config.difficulty_levels)
818
+ return (int(config.difficulty),)
819
+
820
+
821
+ def _configure_difficulty_buckets(config: DatasetConfig, levels: tuple[int, ...]) -> int:
822
+ requested = max(levels) + 1 if levels else int(config.difficulty) + 1
823
+ configured = max(int(config.difficulty_buckets or 0), requested, 1)
824
+ existing = os.getenv("CYBERSECURITY_OWASP_DIFFICULTY_BUCKETS")
825
+ if existing:
826
+ configured = max(configured, int(existing))
827
+ os.environ["CYBERSECURITY_OWASP_DIFFICULTY_BUCKETS"] = str(configured)
828
+ return configured
829
+
830
+
831
+ def _read_jsonl(path: Path) -> list[dict[str, Any]]:
832
+ if not path.exists():
833
+ return []
834
+ rows: list[dict[str, Any]] = []
835
+ for line_number, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
836
+ if not line.strip():
837
+ continue
838
+ try:
839
+ item = json.loads(line)
840
+ except json.JSONDecodeError as exc:
841
+ raise ValueError(f"{path}:{line_number}: invalid JSONL row: {exc}") from exc
842
+ if not isinstance(item, dict):
843
+ raise ValueError(f"{path}:{line_number}: row must be a JSON object")
844
+ rows.append(item)
845
+ return rows
846
+
847
+
848
+ def _verify_sft_row_reward(
849
+ row: dict[str, Any],
850
+ *,
851
+ min_terminal_reward: float,
852
+ path: Path,
853
+ line_number: int,
854
+ ) -> tuple[bool, str, float]:
855
+ messages = row.get("messages")
856
+ if not isinstance(messages, list) or len(messages) < 3:
857
+ return False, f"{path}:{line_number}: messages must include system/user/assistant", 0.0
858
+ if messages[-1].get("role") != "assistant":
859
+ return False, f"{path}:{line_number}: final message must be assistant", 0.0
860
+ try:
861
+ CyberSecurityOWASPAction(**json.loads(str(messages[-1].get("content", ""))))
862
+ except Exception as exc:
863
+ return False, f"{path}:{line_number}: assistant content is not a valid action: {exc}", 0.0
864
+ metadata = row.get("metadata")
865
+ if not isinstance(metadata, dict):
866
+ return False, f"{path}:{line_number}: missing metadata object", 0.0
867
+ if metadata.get("final_success") is not True:
868
+ return False, f"{path}:{line_number}: final_success is not true", 0.0
869
+ flags = metadata.get("anti_cheat_flags") or []
870
+ if flags:
871
+ return False, f"{path}:{line_number}: anti-cheat flags present: {flags}", 0.0
872
+ reward = float(metadata.get("terminal_total", 0.0) or 0.0)
873
+ if reward < min_terminal_reward:
874
+ return (
875
+ False,
876
+ f"{path}:{line_number}: terminal_total {reward:.3f} below required {min_terminal_reward:.3f}",
877
+ reward,
878
+ )
879
+ breakdown = metadata.get("final_reward_breakdown") or {}
880
+ if not isinstance(breakdown, dict):
881
+ return False, f"{path}:{line_number}: missing final_reward_breakdown", reward
882
+ required_positive = ("security", "regression", "public_routes", "patch_quality", "visible_tests")
883
+ missing = [key for key in required_positive if float(breakdown.get(key, 0.0) or 0.0) <= 0.0]
884
+ if missing:
885
+ return False, f"{path}:{line_number}: non-positive reward components: {', '.join(missing)}", reward
886
+ return True, "", reward
887
+
888
+
889
+ def verify_sft_dataset_rewards(
890
+ out_dir: Path,
891
+ *,
892
+ min_terminal_reward: float = 12.0,
893
+ require_train_rows: bool = True,
894
+ required_difficulties: tuple[int, ...] = (),
895
+ ) -> dict[str, Any]:
896
+ """Verify generated SFT rows carry successful verifier-backed rewards."""
897
+
898
+ checked_rows = 0
899
+ failed_rows: list[str] = []
900
+ rewards: list[float] = []
901
+ rows_by_split: dict[str, int] = {}
902
+ rows_by_difficulty: dict[str, int] = {}
903
+ for split_name in ("train", "validation", "hidden_eval"):
904
+ path = out_dir / f"{split_name}.jsonl"
905
+ rows = _read_jsonl(path)
906
+ if not rows and split_name != "train":
907
+ continue
908
+ rows_by_split[split_name] = len(rows)
909
+ for index, row in enumerate(rows, start=1):
910
+ ok, error, reward = _verify_sft_row_reward(
911
+ row,
912
+ min_terminal_reward=min_terminal_reward,
913
+ path=path,
914
+ line_number=index,
915
+ )
916
+ checked_rows += 1
917
+ if reward:
918
+ rewards.append(reward)
919
+ if not ok:
920
+ failed_rows.append(error)
921
+ metadata = row.get("metadata") if isinstance(row, dict) else {}
922
+ if isinstance(metadata, dict) and "difficulty" in metadata:
923
+ difficulty_key = str(int(metadata.get("difficulty", 0)))
924
+ rows_by_difficulty[difficulty_key] = rows_by_difficulty.get(difficulty_key, 0) + 1
925
+ passed = not failed_rows and (checked_rows > 0 or not require_train_rows)
926
+ if require_train_rows and rows_by_split.get("train", 0) <= 0:
927
+ passed = False
928
+ failed_rows.append(f"{out_dir / 'train.jsonl'}: no train rows found")
929
+ missing_difficulties = [
930
+ int(level)
931
+ for level in required_difficulties
932
+ if rows_by_difficulty.get(str(int(level)), 0) <= 0
933
+ ]
934
+ if missing_difficulties:
935
+ passed = False
936
+ failed_rows.append(f"missing required curriculum difficulty rows: {missing_difficulties}")
937
+ return {
938
+ "passed": passed,
939
+ "checked_rows": checked_rows,
940
+ "failed_rows": failed_rows[:50],
941
+ "failure_count": len(failed_rows),
942
+ "rows_by_split": rows_by_split,
943
+ "rows_by_difficulty": rows_by_difficulty,
944
+ "required_difficulties": [int(level) for level in required_difficulties],
945
+ "missing_difficulties": missing_difficulties,
946
+ "min_terminal_reward": float(min_terminal_reward),
947
+ "reward_summary": _reward_summary(rewards),
948
+ }
949
+
950
+
951
+ def _resolved_worker_count(config: DatasetConfig, job_count: int) -> int:
952
+ if job_count <= 1:
953
+ return 1
954
+ if int(config.workers) > 0:
955
+ return max(1, min(int(config.workers), job_count))
956
+ cpu_count = os.cpu_count() or 4
957
+ return max(1, min(8, cpu_count, job_count))
958
+
959
+
960
  def generate_dataset(config: DatasetConfig) -> dict[str, Any]:
961
  config.out_dir.mkdir(parents=True, exist_ok=True)
962
+ teacher_local = threading.local()
963
+ teacher_token = None
964
  if not config.dry_run_oracle:
965
+ teacher_token = os.getenv("HF_TOKEN")
966
+ if not teacher_token:
967
  raise RuntimeError("HF_TOKEN is required unless --dry-run-oracle is set")
968
 
969
+ def teacher_for_thread() -> HuggingFaceTeacher | None:
970
+ if config.dry_run_oracle:
971
+ return None
972
+ teacher = getattr(teacher_local, "teacher", None)
973
+ if teacher is None:
974
+ teacher = HuggingFaceTeacher(
975
+ model=config.teacher_model,
976
+ token=str(teacher_token),
977
+ max_tokens=config.max_tokens,
978
+ temperature=config.temperature,
979
+ top_p=config.top_p,
980
+ )
981
+ teacher_local.teacher = teacher
982
+ return teacher
983
+
984
+ difficulty_levels = _difficulty_levels(config)
985
+ difficulty_bucket_count = _configure_difficulty_buckets(config, difficulty_levels)
986
+ validation_seed_start = config.seed_start + int(config.episodes) * len(difficulty_levels)
987
  split_jobs = [(config.split, config.episodes, config.seed_start)]
988
  if config.validation_episodes:
989
+ split_jobs.append(("validation", config.validation_episodes, validation_seed_start))
990
+ episode_jobs = [
991
+ {
992
+ "order": job_order,
993
+ "split": split,
994
+ "difficulty": int(difficulty),
995
+ "seed": int(seed_start) + difficulty_index * int(episodes) + offset,
996
+ }
997
+ for job_order, (split, episodes, seed_start) in enumerate(split_jobs)
998
+ for difficulty_index, difficulty in enumerate(difficulty_levels)
999
+ for offset in range(int(episodes))
1000
+ ]
1001
 
1002
  rows_by_split: dict[str, list[dict[str, Any]]] = {"train": [], "validation": []}
1003
  attempts: list[dict[str, Any]] = []
1004
  rewards: list[float] = []
1005
  accepted = 0
1006
+ attempted = len(episode_jobs)
1007
+ workers = _resolved_worker_count(config, attempted)
1008
+
1009
+ def run_job(job: dict[str, Any]) -> dict[str, Any]:
1010
+ seed = int(job["seed"])
1011
+ split = str(job["split"])
1012
+ difficulty = int(job["difficulty"])
1013
+ return {
1014
+ "order": int(job["order"]),
1015
+ **run_episode(
1016
  seed=seed,
1017
  split=split,
1018
+ difficulty=difficulty,
1019
  config=config,
1020
+ teacher=teacher_for_thread(),
1021
+ ),
1022
+ }
1023
+
1024
+ results: list[dict[str, Any]] = []
1025
+ with ThreadPoolExecutor(max_workers=workers, thread_name_prefix="sft-episode") as executor:
1026
+ futures = [executor.submit(run_job, job) for job in episode_jobs]
1027
+ for future in as_completed(futures):
1028
+ result = future.result()
1029
+ results.append(result)
1030
+ if config.progress:
1031
+ print(
1032
+ json.dumps(
1033
+ {
1034
+ "event": "episode_done",
1035
+ "accepted": bool(result.get("accepted")),
1036
+ "split": result.get("split"),
1037
+ "difficulty": result.get("difficulty"),
1038
+ "seed": result.get("seed"),
1039
+ "reason": result.get("reason", ""),
1040
+ },
1041
+ sort_keys=True,
1042
+ ),
1043
+ flush=True,
1044
+ )
1045
+
1046
+ for result in sorted(
1047
+ results,
1048
+ key=lambda item: (
1049
+ str(item.get("split", "")),
1050
+ int(item.get("difficulty", 0)),
1051
+ int(item.get("seed", 0)),
1052
+ ),
1053
+ ):
1054
+ seed = int(result["seed"])
1055
+ split = str(result["split"])
1056
+ difficulty = int(result["difficulty"])
1057
+ attempts.append(
1058
+ {
1059
+ "seed": seed,
1060
+ "split": split,
1061
+ "difficulty": difficulty,
1062
+ "accepted": bool(result["accepted"]),
1063
+ "reason": result.get("reason", ""),
1064
+ "trajectory_path": str(_write_trajectory(config.out_dir, result["trajectory"])),
1065
+ }
1066
+ )
1067
+ if result["accepted"]:
1068
+ accepted += 1
1069
+ rows = list(result["rows"])
1070
+ rows_by_split.setdefault(split, []).extend(rows)
1071
+ rewards.append(float(result["trajectory"].get("terminal_total", 0.0)))
1072
+
1073
+ for split_rows in rows_by_split.values():
1074
+ split_rows.sort(
1075
+ key=lambda row: (
1076
+ int((row.get("metadata") or {}).get("difficulty", 0)),
1077
+ int((row.get("metadata") or {}).get("seed", 0)),
1078
+ int((row.get("metadata") or {}).get("step", 0)),
1079
  )
1080
+ )
1081
 
1082
  for split_name in ("train", "validation", config.split):
1083
  write_jsonl(config.out_dir / f"{split_name}.jsonl", rows_by_split.get(split_name, []))
1084
 
1085
+ reward_verification = verify_sft_dataset_rewards(
1086
+ config.out_dir,
1087
+ min_terminal_reward=config.min_terminal_reward,
1088
+ require_train_rows=config.split == "train",
1089
+ required_difficulties=difficulty_levels if len(difficulty_levels) > 1 else (),
1090
+ )
1091
+
1092
+ accepted_by_difficulty: dict[str, int] = {}
1093
+ attempted_by_difficulty: dict[str, int] = {}
1094
+ reward_by_difficulty: dict[str, list[float]] = {}
1095
+ row_count_by_difficulty: dict[str, int] = {}
1096
+ for result in results:
1097
+ difficulty_key = str(int(result.get("difficulty", 0)))
1098
+ attempted_by_difficulty[difficulty_key] = attempted_by_difficulty.get(difficulty_key, 0) + 1
1099
+ if result.get("accepted"):
1100
+ accepted_by_difficulty[difficulty_key] = accepted_by_difficulty.get(difficulty_key, 0) + 1
1101
+ reward_by_difficulty.setdefault(difficulty_key, []).append(
1102
+ float((result.get("trajectory") or {}).get("terminal_total", 0.0))
1103
+ )
1104
+ for split_rows in rows_by_split.values():
1105
+ for row in split_rows:
1106
+ difficulty_key = str(int((row.get("metadata") or {}).get("difficulty", 0)))
1107
+ row_count_by_difficulty[difficulty_key] = row_count_by_difficulty.get(difficulty_key, 0) + 1
1108
+
1109
  manifest = {
1110
  "teacher_model": config.teacher_model,
1111
  "target_model": config.target_model,
1112
  "split": config.split,
1113
  "difficulty": config.difficulty,
1114
+ "difficulty_levels": [int(level) for level in difficulty_levels],
1115
+ "difficulty_bucket_count": int(difficulty_bucket_count),
1116
+ "episodes_per_difficulty": config.episodes,
1117
+ "validation_episodes_per_difficulty": config.validation_episodes,
1118
  "seed_start": config.seed_start,
1119
  "episodes_attempted": attempted,
1120
  "episodes_accepted": accepted,
1121
  "acceptance_rate": accepted / attempted if attempted else 0.0,
1122
+ "attempted_by_difficulty": attempted_by_difficulty,
1123
+ "accepted_by_difficulty": accepted_by_difficulty,
1124
+ "rows_by_difficulty": row_count_by_difficulty,
1125
+ "reward_summary_by_difficulty": {
1126
+ key: _reward_summary(value) for key, value in sorted(reward_by_difficulty.items())
1127
+ },
1128
+ "workers": workers,
1129
  "rows_by_split": {key: len(value) for key, value in sorted(rows_by_split.items())},
1130
  "reward_summary": _reward_summary(rewards),
1131
+ "reward_verification": reward_verification,
1132
  "git_sha": _git_sha(),
1133
  "verifier_version": "verifier_v1",
1134
  "dry_run_oracle": config.dry_run_oracle,
1135
  "attempts": attempts,
1136
  }
1137
+ if config.push_to_hub:
1138
+ if not reward_verification["passed"]:
1139
+ raise RuntimeError("Reward verification failed; refusing to push dataset to Hub.")
1140
+ manifest["hub"] = {
1141
+ "repo_id": config.dataset_repo_id,
1142
+ "private": bool(config.hub_private),
1143
+ "url": f"https://huggingface.co/datasets/{config.dataset_repo_id}",
1144
+ }
1145
+ write_dataset_card(config.out_dir, manifest, config.dataset_repo_id)
1146
  manifest_path = config.out_dir / "manifest.json"
1147
  manifest_path.write_text(
1148
  json.dumps(manifest, indent=2, sort_keys=True, default=str),
1149
  encoding="utf-8",
1150
  )
1151
+ if config.push_to_hub:
1152
+ hub_result = push_dataset_to_hub(
1153
+ config.out_dir,
1154
+ repo_id=config.dataset_repo_id,
1155
+ private=config.hub_private,
1156
+ )
1157
+ manifest["hub"].update(hub_result)
1158
+ manifest_path.write_text(
1159
+ json.dumps(manifest, indent=2, sort_keys=True, default=str),
1160
+ encoding="utf-8",
1161
+ )
1162
  return manifest
1163
 
1164
 
 
1168
  parser.add_argument("--target-model", default=DEFAULT_TARGET_MODEL)
1169
  parser.add_argument("--split", default="train", choices=["train", "validation", "hidden_eval"])
1170
  parser.add_argument("--difficulty", type=int, default=0)
1171
+ parser.add_argument(
1172
+ "--difficulty-levels",
1173
+ default="",
1174
+ help="Comma-separated curriculum levels to include, for example 0,1,2,3. "
1175
+ "When set, --episodes is per difficulty level.",
1176
+ )
1177
+ parser.add_argument(
1178
+ "--difficulty-buckets",
1179
+ type=int,
1180
+ default=0,
1181
+ help=(
1182
+ "Number of curriculum difficulty buckets to expose to the environment. "
1183
+ "Defaults to max(--difficulty-levels)+1."
1184
+ ),
1185
+ )
1186
  parser.add_argument("--seed-start", type=int, default=0)
1187
  parser.add_argument("--episodes", type=int, default=100)
1188
  parser.add_argument("--validation-episodes", type=int, default=0)
 
1192
  parser.add_argument("--max-tokens", type=int, default=768)
1193
  parser.add_argument("--temperature", type=float, default=0.2)
1194
  parser.add_argument("--top-p", type=float, default=0.95)
1195
+ parser.add_argument(
1196
+ "--workers",
1197
+ type=int,
1198
+ default=0,
1199
+ help="Parallel episode workers. 0 auto-selects up to 8 workers.",
1200
+ )
1201
+ parser.add_argument(
1202
+ "--min-terminal-reward",
1203
+ type=float,
1204
+ default=12.0,
1205
+ help="Minimum verifier-backed terminal reward required for SFT rows.",
1206
+ )
1207
+ parser.add_argument(
1208
+ "--verify-only",
1209
+ action="store_true",
1210
+ help="Only verify an existing out-dir dataset reward metadata.",
1211
+ )
1212
+ parser.add_argument(
1213
+ "--push-to-hub",
1214
+ action="store_true",
1215
+ help="Upload the verified dataset folder to a Hugging Face dataset repo.",
1216
+ )
1217
+ parser.add_argument(
1218
+ "--progress",
1219
+ action="store_true",
1220
+ help="Print one JSON progress event for each completed episode job.",
1221
+ )
1222
+ parser.add_argument(
1223
+ "--push-only",
1224
+ action="store_true",
1225
+ help="Verify and upload an existing out-dir dataset without regenerating rows.",
1226
+ )
1227
+ parser.add_argument(
1228
+ "--dataset-repo-id",
1229
+ default="Humanlearning/CyberSecurity_OWASP-sft-dataset",
1230
+ help="Hugging Face dataset repo id used with --push-to-hub.",
1231
+ )
1232
+ parser.add_argument(
1233
+ "--hub-private",
1234
+ action="store_true",
1235
+ help="Create/upload the Hugging Face dataset repo as private.",
1236
+ )
1237
  parser.add_argument(
1238
  "--dry-run-oracle",
1239
  action="store_true",
 
1248
  target_model=args.target_model,
1249
  split=args.split,
1250
  difficulty=args.difficulty,
1251
+ difficulty_levels=_parse_int_csv(args.difficulty_levels),
1252
+ difficulty_buckets=args.difficulty_buckets,
1253
  seed_start=args.seed_start,
1254
  episodes=args.episodes,
1255
  validation_episodes=args.validation_episodes,
 
1260
  temperature=args.temperature,
1261
  top_p=args.top_p,
1262
  dry_run_oracle=args.dry_run_oracle,
1263
+ workers=args.workers,
1264
+ min_terminal_reward=args.min_terminal_reward,
1265
+ push_to_hub=args.push_to_hub,
1266
+ dataset_repo_id=args.dataset_repo_id,
1267
+ hub_private=args.hub_private,
1268
+ progress=args.progress,
1269
  )
1270
 
1271
 
1272
  def main(argv: list[str] | None = None) -> int:
1273
  parser = build_arg_parser()
1274
  args = parser.parse_args(argv)
1275
+ try:
1276
+ if args.verify_only:
1277
+ verification = verify_sft_dataset_rewards(
1278
+ args.out_dir,
1279
+ min_terminal_reward=args.min_terminal_reward,
1280
+ require_train_rows=args.split == "train",
1281
+ required_difficulties=_parse_int_csv(args.difficulty_levels),
1282
+ )
1283
+ print(json.dumps({"reward_verification": verification}, indent=2, sort_keys=True))
1284
+ return 0 if verification["passed"] else 2
1285
+ if args.push_only:
1286
+ result = push_existing_dataset(
1287
+ args.out_dir,
1288
+ repo_id=args.dataset_repo_id,
1289
+ private=args.hub_private,
1290
+ min_terminal_reward=args.min_terminal_reward,
1291
+ required_difficulties=_parse_int_csv(args.difficulty_levels),
1292
+ )
1293
+ print(json.dumps(result, indent=2, sort_keys=True))
1294
+ return 0
1295
+ manifest = generate_dataset(config_from_args(args))
1296
+ print(json.dumps(manifest, indent=2, sort_keys=True))
1297
+ return 0 if manifest.get("reward_verification", {}).get("passed") else 2
1298
+ except (RuntimeError, ValueError) as exc:
1299
+ print(
1300
+ json.dumps(
1301
+ {"error": str(exc), "error_type": exc.__class__.__name__},
1302
+ indent=2,
1303
+ sort_keys=True,
1304
+ )
1305
+ )
1306
+ return 2
1307
 
1308
 
1309
  if __name__ == "__main__":
scripts/modal_train_sft.py CHANGED
@@ -33,6 +33,15 @@ TRITON_CACHE_DIR = CACHE_DIR / "triton"
33
  REMOTE_PROJECT = "/root/CyberSecurity_OWASP"
34
  PROJECT_ROOT = pathlib.Path(__file__).resolve().parents[1]
35
  DEFAULT_GEMMA_MODEL = "unsloth/gemma-4-E2B-it"
36
  PUBLIC_REPO_URL = "https://github.com/humandotlearning/CyberSecurity_OWASP.git"
37
  PUBLIC_REPO_BRANCH = "master"
38
 
@@ -50,6 +59,170 @@ def _model_repo_slug(model_name: str) -> str:
50
  return model_name.replace("/", "-").replace("_", "-").replace(".", "-").lower()
51
 
52
 
53
  def _configure_modal_cache_env() -> dict[str, str]:
54
  values = {
55
  "HF_HOME": str(HF_HOME_DIR),
@@ -171,7 +344,7 @@ def upload_sft_jsonl(relative_path: str, content: str) -> str:
171
 
172
  @app.function(
173
  image=training_image,
174
- gpu="L4",
175
  timeout=12 * 60 * 60,
176
  volumes={RUNS_DIR: volume, CACHE_DIR: cache_volume},
177
  secrets=secrets,
@@ -179,24 +352,29 @@ def upload_sft_jsonl(relative_path: str, content: str) -> str:
179
  def train_cybersecurity_owasp_sft(
180
  train_jsonl: str = "/runs/sft/train.jsonl",
181
  validation_jsonl: str = "/runs/sft/validation.jsonl",
182
- output_repo_id: str = "",
183
  model_name: str = DEFAULT_GEMMA_MODEL,
184
  run_name: str = "",
185
  max_seq_length: int = 4096,
186
- max_steps: int = 100,
187
  num_train_epochs: float = 1.0,
188
- per_device_train_batch_size: int = 1,
189
- gradient_accumulation_steps: int = 16,
190
  learning_rate: float = 2e-5,
191
  lora_rank: int = 32,
192
- trackio_space_id: str = "Humanlearning/CyberSecurity_OWASP-trackio",
193
- trackio_project: str = "CyberSecurity_OWASP-sft",
 
  push_to_hub: bool = False,
195
  ) -> dict[str, Any]:
196
  import inspect
197
 
198
  from datasets import load_dataset
199
- from huggingface_hub import snapshot_download, whoami
200
  from trl import SFTConfig, SFTTrainer
201
  from trl.chat_template_utils import add_response_schema
202
  from unsloth import FastVisionModel
@@ -207,10 +385,9 @@ def train_cybersecurity_owasp_sft(
207
  if not hf_token:
208
  raise RuntimeError(f"HF_TOKEN is missing from the Modal secret {SECRET_NAME}.")
209
 
210
- user = whoami(token=hf_token)["name"]
211
- output_repo_id = output_repo_id or (
212
- f"{user}/CyberSecurity_OWASP-{_model_repo_slug(model_name)}-sft-lora"
213
- )
214
  stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
215
  run_name = run_name or f"CyberSecurity_OWASP-{_model_repo_slug(model_name)}-sft-{stamp}"
216
  output_dir = RUNS_DIR / run_name
@@ -222,6 +399,21 @@ def train_cybersecurity_owasp_sft(
222
  has_validation = validation_path.exists() and validation_path.stat().st_size > 0
223
  if has_validation:
224
  data_files["validation"] = validation_jsonl
225
  dataset = load_dataset("json", data_files=data_files)
226
 
227
  print(f"SFT run name: {run_name}")
@@ -232,6 +424,11 @@ def train_cybersecurity_owasp_sft(
232
  print(f"Output repo: https://huggingface.co/{output_repo_id}")
233
  print(f"Trackio Space: https://huggingface.co/spaces/{trackio_space_id}")
234
  print(f"HF_HUB_CACHE: {cache_env['HF_HUB_CACHE']}")
235
 
236
  try:
237
  snapshot_download(repo_id=model_name, cache_dir=str(HF_HUB_CACHE_DIR), token=hf_token)
@@ -280,19 +477,25 @@ def train_cybersecurity_owasp_sft(
280
  "per_device_train_batch_size": per_device_train_batch_size,
281
  "gradient_accumulation_steps": gradient_accumulation_steps,
282
  "learning_rate": learning_rate,
283
  "logging_steps": 1,
284
- "save_steps": max(10, max_steps),
285
  "report_to": "trackio",
286
  "project": trackio_project,
287
  "trackio_space_id": trackio_space_id,
288
  "run_name": run_name,
289
  "assistant_only_loss": True,
290
- "packing": False,
291
  "gradient_checkpointing": True,
292
  "gradient_checkpointing_kwargs": {"use_reentrant": False},
293
  "push_to_hub": push_to_hub,
294
  "hub_model_id": output_repo_id,
295
  "hub_private_repo": True,
296
  }
297
  sft_parameters = set(inspect.signature(SFTConfig).parameters)
298
  skipped = sorted(set(sft_values) - sft_parameters)
@@ -335,6 +538,11 @@ def train_cybersecurity_owasp_sft(
335
  "output_repo_id": output_repo_id,
336
  "train_jsonl": train_jsonl,
337
  "validation_jsonl": validation_jsonl if has_validation else "",
338
  "max_steps": max_steps,
339
  "push_to_hub": push_to_hub,
340
  "trackio_space_id": trackio_space_id,
@@ -365,20 +573,26 @@ def main(
365
  mode: str = "train",
366
  local_train_path: str = "outputs/sft/train.jsonl",
367
  local_validation_path: str = "outputs/sft/validation.jsonl",
368
  train_jsonl: str = "/runs/sft/train.jsonl",
369
  validation_jsonl: str = "/runs/sft/validation.jsonl",
370
- output_repo_id: str = "",
371
  model_name: str = DEFAULT_GEMMA_MODEL,
372
  run_name: str = "",
373
  max_seq_length: int = 4096,
374
- max_steps: int = 100,
375
  num_train_epochs: float = 1.0,
376
- per_device_train_batch_size: int = 1,
377
- gradient_accumulation_steps: int = 16,
378
  learning_rate: float = 2e-5,
379
  lora_rank: int = 32,
380
- trackio_space_id: str = "Humanlearning/CyberSecurity_OWASP-trackio",
381
- trackio_project: str = "CyberSecurity_OWASP-sft",
 
382
  source_mode: str = "local",
383
  repo_url: str = PUBLIC_REPO_URL,
384
  repo_branch: str = PUBLIC_REPO_BRANCH,
@@ -392,6 +606,22 @@ def main(
392
 
393
  local_train = pathlib.Path(local_train_path)
394
  local_validation = pathlib.Path(local_validation_path)
395
  if local_train.exists():
396
  uploaded = upload_sft_jsonl.remote(
397
  "sft/train.jsonl",
@@ -406,6 +636,13 @@ def main(
406
  )
407
  print(f"Uploaded validation JSONL: {uploaded_validation}")
408
  validation_jsonl = uploaded_validation
409
  if mode == "upload":
410
  return
411
 
@@ -416,6 +653,7 @@ def main(
416
  kwargs = dict(
417
  train_jsonl=train_jsonl,
418
  validation_jsonl=validation_jsonl,
419
  output_repo_id=output_repo_id,
420
  model_name=model_name,
421
  run_name=run_name,
@@ -428,11 +666,19 @@ def main(
428
  lora_rank=lora_rank,
429
  trackio_space_id=trackio_space_id,
430
  trackio_project=trackio_project,
431
  push_to_hub=push_to_hub,
432
  )
433
  print(f"SFT run name: {run_name}")
434
  print(f"Train JSONL: {train_jsonl}")
435
  print(f"Validation JSONL: {validation_jsonl}")
436
  print(f"Hub push enabled: {push_to_hub}")
437
  if detach:
438
  call = train_cybersecurity_owasp_sft.spawn(**kwargs)
 
33
  REMOTE_PROJECT = "/root/CyberSecurity_OWASP"
34
  PROJECT_ROOT = pathlib.Path(__file__).resolve().parents[1]
35
  DEFAULT_GEMMA_MODEL = "unsloth/gemma-4-E2B-it"
36
+ SFT_GPU_FALLBACK = ["H200", "H100", "A100-80GB", "L40S"]
37
+ DEFAULT_CURRICULUM_LEVELS = "0,1,2,3"
38
+ DEFAULT_TOTAL_TRAIN_EPISODES = 300
39
+ DEFAULT_EPISODES_PER_LEVEL = 75
40
+ DEFAULT_TRACKIO_SPACE_ID = "Humanlearning/CyberSecurity_OWASP-trackio"
41
+ DEFAULT_TRACKIO_PROJECT = "CyberSecurity_OWASP-sft"
42
+ DEFAULT_SFT_OUTPUT_REPO_ID = (
43
+ "Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora"
44
+ )
45
  PUBLIC_REPO_URL = "https://github.com/humandotlearning/CyberSecurity_OWASP.git"
46
  PUBLIC_REPO_BRANCH = "master"
47
 
 
59
  return model_name.replace("/", "-").replace("_", "-").replace(".", "-").lower()
60
 
61
 
62
+ def _parse_int_csv(value: str) -> set[int]:
63
+ if not value.strip():
64
+ return set()
65
+ return {int(item.strip()) for item in value.split(",") if item.strip()}
66
+
67
+
68
+ SFT_ALLOWED_TOOLS = {
69
+ "inspect_policy_graph",
70
+ "list_routes",
71
+ "read_openapi",
72
+ "read_file",
73
+ "search_code",
74
+ "send_local_request",
75
+ "compare_identities",
76
+ "submit_diagnosis",
77
+ "patch_file",
78
+ "run_visible_tests",
79
+ "submit_fix",
80
+ "noop",
81
+ }
82
+
83
+
84
+ def _read_jsonl(path: pathlib.Path) -> list[dict[str, Any]]:
85
+ if not path.exists():
86
+ return []
87
+ rows: list[dict[str, Any]] = []
88
+ for line_number, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
89
+ if not line.strip():
90
+ continue
91
+ try:
92
+ row = json.loads(line)
93
+ except json.JSONDecodeError as exc:
94
+ raise ValueError(f"{path}:{line_number}: invalid JSONL: {exc}") from exc
95
+ if not isinstance(row, dict):
96
+ raise ValueError(f"{path}:{line_number}: row must be a JSON object")
97
+ rows.append(row)
98
+ return rows
99
+
100
+
101
+ def _verify_sft_rows(
102
+ path: pathlib.Path,
103
+ *,
104
+ min_terminal_reward: float,
105
+ ) -> tuple[list[str], list[float], int, set[int]]:
106
+ rows = _read_jsonl(path)
107
+ failures: list[str] = []
108
+ rewards: list[float] = []
109
+ difficulties: set[int] = set()
110
+ for index, row in enumerate(rows, start=1):
111
+ messages = row.get("messages")
112
+ if not isinstance(messages, list) or len(messages) < 3:
113
+ failures.append(f"{path}:{index}: messages must include system/user/assistant")
114
+ continue
115
+ assistant = messages[-1]
116
+ if assistant.get("role") != "assistant":
117
+ failures.append(f"{path}:{index}: final message must be assistant")
118
+ continue
119
+ try:
120
+ action = json.loads(str(assistant.get("content", "")))
121
+ except json.JSONDecodeError as exc:
122
+ failures.append(f"{path}:{index}: assistant content is not JSON: {exc}")
123
+ continue
124
+ if not isinstance(action, dict) or action.get("tool_name") not in SFT_ALLOWED_TOOLS:
125
+ failures.append(f"{path}:{index}: assistant content is not a valid tool action")
126
+ continue
127
+ metadata = row.get("metadata")
128
+ if not isinstance(metadata, dict):
129
+ failures.append(f"{path}:{index}: missing metadata")
130
+ continue
131
+ if metadata.get("final_success") is not True:
132
+ failures.append(f"{path}:{index}: final_success is not true")
133
+ continue
134
+ if metadata.get("anti_cheat_flags") or []:
135
+ failures.append(f"{path}:{index}: anti-cheat flags present")
136
+ continue
137
+ if "difficulty" in metadata:
138
+ difficulties.add(int(metadata.get("difficulty", 0)))
139
+ terminal_reward = float(metadata.get("terminal_total", 0.0) or 0.0)
140
+ rewards.append(terminal_reward)
141
+ if terminal_reward < min_terminal_reward:
142
+ failures.append(
143
+ f"{path}:{index}: terminal_total {terminal_reward:.3f} below {min_terminal_reward:.3f}"
144
+ )
145
+ continue
146
+ breakdown = metadata.get("final_reward_breakdown") or {}
147
+ if not isinstance(breakdown, dict):
148
+ failures.append(f"{path}:{index}: missing final_reward_breakdown")
149
+ continue
150
+ for key in ("security", "regression", "public_routes", "patch_quality", "visible_tests"):
151
+ if float(breakdown.get(key, 0.0) or 0.0) <= 0.0:
152
+ failures.append(f"{path}:{index}: reward component {key} is not positive")
153
+ break
154
+ return failures, rewards, len(rows), difficulties
155
+
156
+
157
+ def verify_sft_inputs(
158
+ *,
159
+ train_jsonl: str,
160
+ validation_jsonl: str = "",
161
+ manifest_path: str = "",
162
+ required_difficulties: str = "",
163
+ min_terminal_reward: float = 12.0,
164
+ min_train_rows: int = 1,
165
+ ) -> dict[str, Any]:
166
+ train_path = pathlib.Path(train_jsonl)
167
+ validation_path = pathlib.Path(validation_jsonl) if validation_jsonl else pathlib.Path("")
168
+ failures, rewards, train_rows, difficulties = _verify_sft_rows(
169
+ train_path,
170
+ min_terminal_reward=min_terminal_reward,
171
+ )
172
+ validation_rows = 0
173
+ if validation_jsonl and validation_path.exists() and validation_path.stat().st_size > 0:
174
+ validation_failures, validation_rewards, validation_rows, validation_difficulties = _verify_sft_rows(
175
+ validation_path,
176
+ min_terminal_reward=min_terminal_reward,
177
+ )
178
+ failures.extend(validation_failures)
179
+ rewards.extend(validation_rewards)
180
+ difficulties.update(validation_difficulties)
181
+ if train_rows < min_train_rows:
182
+ failures.append(f"{train_path}: expected at least {min_train_rows} train rows, found {train_rows}")
183
+
184
+ manifest_verification: dict[str, Any] = {}
185
+ manifest = pathlib.Path(manifest_path) if manifest_path else pathlib.Path("")
186
+ if manifest_path and manifest.exists():
187
+ try:
188
+ manifest_data = json.loads(manifest.read_text(encoding="utf-8"))
189
+ manifest_verification = dict(manifest_data.get("reward_verification") or {})
190
+ manifest_difficulties = {
191
+ int(item) for item in manifest_data.get("difficulty_levels", []) or []
192
+ }
193
+ except Exception as exc:
194
+ failures.append(f"{manifest}: could not read manifest reward verification: {exc}")
195
+ manifest_difficulties = set()
196
+ if manifest_verification and manifest_verification.get("passed") is not True:
197
+ failures.append(f"{manifest}: manifest reward_verification did not pass")
198
+ else:
199
+ manifest_difficulties = set()
200
+
201
+ required = _parse_int_csv(required_difficulties) or manifest_difficulties
202
+ missing_difficulties = sorted(level for level in required if level not in difficulties)
203
+ if missing_difficulties:
204
+ failures.append(f"missing required curriculum difficulty rows: {missing_difficulties}")
205
+
206
+ reward_summary = {
207
+ "min": min(rewards) if rewards else 0.0,
208
+ "max": max(rewards) if rewards else 0.0,
209
+ "mean": (sum(rewards) / len(rewards)) if rewards else 0.0,
210
+ }
211
+ return {
212
+ "passed": not failures,
213
+ "failure_count": len(failures),
214
+ "failures": failures[:50],
215
+ "train_rows": train_rows,
216
+ "validation_rows": validation_rows,
217
+ "difficulties": sorted(difficulties),
218
+ "required_difficulties": sorted(required),
219
+ "missing_difficulties": missing_difficulties,
220
+ "min_terminal_reward": float(min_terminal_reward),
221
+ "reward_summary": reward_summary,
222
+ "manifest_reward_verification": manifest_verification,
223
+ }
224
+
225
+
226
  def _configure_modal_cache_env() -> dict[str, str]:
227
  values = {
228
  "HF_HOME": str(HF_HOME_DIR),
 
344
 
345
  @app.function(
346
  image=training_image,
347
+ gpu=SFT_GPU_FALLBACK,
348
  timeout=12 * 60 * 60,
349
  volumes={RUNS_DIR: volume, CACHE_DIR: cache_volume},
350
  secrets=secrets,
 
352
  def train_cybersecurity_owasp_sft(
353
  train_jsonl: str = "/runs/sft/train.jsonl",
354
  validation_jsonl: str = "/runs/sft/validation.jsonl",
355
+ manifest_path: str = "/runs/sft/manifest.json",
356
+ output_repo_id: str = DEFAULT_SFT_OUTPUT_REPO_ID,
357
  model_name: str = DEFAULT_GEMMA_MODEL,
358
  run_name: str = "",
359
  max_seq_length: int = 4096,
360
+ max_steps: int = -1,
361
  num_train_epochs: float = 1.0,
362
+ per_device_train_batch_size: int = 4,
363
+ gradient_accumulation_steps: int = 4,
364
  learning_rate: float = 2e-5,
365
  lora_rank: int = 32,
366
+ trackio_space_id: str = DEFAULT_TRACKIO_SPACE_ID,
367
+ trackio_project: str = DEFAULT_TRACKIO_PROJECT,
368
+ require_reward_verification: bool = True,
369
+ required_difficulties: str = DEFAULT_CURRICULUM_LEVELS,
370
+ min_terminal_reward: float = 12.0,
371
+ min_train_rows: int = 1,
372
  push_to_hub: bool = False,
373
  ) -> dict[str, Any]:
374
  import inspect
375
 
376
  from datasets import load_dataset
377
+ from huggingface_hub import snapshot_download
378
  from trl import SFTConfig, SFTTrainer
379
  from trl.chat_template_utils import add_response_schema
380
  from unsloth import FastVisionModel
 
385
  if not hf_token:
386
  raise RuntimeError(f"HF_TOKEN is missing from the Modal secret {SECRET_NAME}.")
387
 
388
+ output_repo_id = output_repo_id or DEFAULT_SFT_OUTPUT_REPO_ID
389
+ os.environ["TRACKIO_SPACE_ID"] = trackio_space_id
390
+ os.environ["TRACKIO_PROJECT"] = trackio_project
391
  stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
392
  run_name = run_name or f"CyberSecurity_OWASP-{_model_repo_slug(model_name)}-sft-{stamp}"
393
  output_dir = RUNS_DIR / run_name
 
399
  has_validation = validation_path.exists() and validation_path.stat().st_size > 0
400
  if has_validation:
401
  data_files["validation"] = validation_jsonl
402
+
403
+ reward_preflight = verify_sft_inputs(
404
+ train_jsonl=train_jsonl,
405
+ validation_jsonl=validation_jsonl if has_validation else "",
406
+ manifest_path=manifest_path,
407
+ required_difficulties=required_difficulties,
408
+ min_terminal_reward=min_terminal_reward,
409
+ min_train_rows=min_train_rows,
410
+ )
411
+ print(f"SFT reward preflight: {json.dumps(reward_preflight, sort_keys=True)}")
412
+ if require_reward_verification and not reward_preflight["passed"]:
413
+ raise RuntimeError(
414
+ "SFT reward verification failed; refusing to start model training. "
415
+ f"Failures: {reward_preflight['failures']}"
416
+ )
417
  dataset = load_dataset("json", data_files=data_files)
418
 
419
  print(f"SFT run name: {run_name}")
 
424
  print(f"Output repo: https://huggingface.co/{output_repo_id}")
425
  print(f"Trackio Space: https://huggingface.co/spaces/{trackio_space_id}")
426
  print(f"HF_HUB_CACHE: {cache_env['HF_HUB_CACHE']}")
427
+ print(
428
+ "SFT target: "
429
+ f"{DEFAULT_TOTAL_TRAIN_EPISODES} total train episodes, "
430
+ f"{DEFAULT_EPISODES_PER_LEVEL} per level across {DEFAULT_CURRICULUM_LEVELS}"
431
+ )
432
 
433
  try:
434
  snapshot_download(repo_id=model_name, cache_dir=str(HF_HUB_CACHE_DIR), token=hf_token)
 
477
  "per_device_train_batch_size": per_device_train_batch_size,
478
  "gradient_accumulation_steps": gradient_accumulation_steps,
479
  "learning_rate": learning_rate,
480
+ "optim": "adamw_8bit",
481
  "logging_steps": 1,
482
+ "logging_first_step": True,
483
+ "save_steps": max(10, max_steps) if max_steps > 0 else 100,
484
  "report_to": "trackio",
485
  "project": trackio_project,
486
  "trackio_space_id": trackio_space_id,
487
  "run_name": run_name,
488
  "assistant_only_loss": True,
489
+ "packing": True,
490
+ "packing_strategy": "bfd",
491
+ "bf16": True,
492
+ "tf32": True,
493
  "gradient_checkpointing": True,
494
  "gradient_checkpointing_kwargs": {"use_reentrant": False},
495
  "push_to_hub": push_to_hub,
496
  "hub_model_id": output_repo_id,
497
  "hub_private_repo": True,
498
+ "hub_strategy": "every_save",
499
  }
500
  sft_parameters = set(inspect.signature(SFTConfig).parameters)
501
  skipped = sorted(set(sft_values) - sft_parameters)
 
538
  "output_repo_id": output_repo_id,
539
  "train_jsonl": train_jsonl,
540
  "validation_jsonl": validation_jsonl if has_validation else "",
541
+ "manifest_path": manifest_path,
542
+ "reward_preflight": reward_preflight,
543
+ "required_difficulties": required_difficulties,
544
+ "default_total_train_episodes": DEFAULT_TOTAL_TRAIN_EPISODES,
545
+ "default_episodes_per_level": DEFAULT_EPISODES_PER_LEVEL,
546
  "max_steps": max_steps,
547
  "push_to_hub": push_to_hub,
548
  "trackio_space_id": trackio_space_id,
 
     mode: str = "train",
     local_train_path: str = "outputs/sft/train.jsonl",
     local_validation_path: str = "outputs/sft/validation.jsonl",
+    local_manifest_path: str = "outputs/sft/manifest.json",
     train_jsonl: str = "/runs/sft/train.jsonl",
     validation_jsonl: str = "/runs/sft/validation.jsonl",
+    manifest_path: str = "/runs/sft/manifest.json",
+    output_repo_id: str = DEFAULT_SFT_OUTPUT_REPO_ID,
     model_name: str = DEFAULT_GEMMA_MODEL,
     run_name: str = "",
     max_seq_length: int = 4096,
+    max_steps: int = -1,
     num_train_epochs: float = 1.0,
+    per_device_train_batch_size: int = 4,
+    gradient_accumulation_steps: int = 4,
     learning_rate: float = 2e-5,
     lora_rank: int = 32,
+    trackio_space_id: str = DEFAULT_TRACKIO_SPACE_ID,
+    trackio_project: str = DEFAULT_TRACKIO_PROJECT,
+    require_reward_verification: bool = True,
+    required_difficulties: str = DEFAULT_CURRICULUM_LEVELS,
+    min_terminal_reward: float = 12.0,
+    min_train_rows: int = 1,
     source_mode: str = "local",
     repo_url: str = PUBLIC_REPO_URL,
     repo_branch: str = PUBLIC_REPO_BRANCH,
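These defaults imply an effective optimizer batch of 16 packed sequences per device step; a quick sanity check of the arithmetic (illustrative only, not part of the script):

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
# Gradients from 4 micro-batches of 4 are accumulated before each update.
assert per_device_train_batch_size * gradient_accumulation_steps == 16
```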
 
 
     local_train = pathlib.Path(local_train_path)
     local_validation = pathlib.Path(local_validation_path)
+    local_manifest = pathlib.Path(local_manifest_path)
+    if require_reward_verification and local_train.exists():
+        local_reward_preflight = verify_sft_inputs(
+            train_jsonl=str(local_train),
+            validation_jsonl=str(local_validation) if local_validation.exists() else "",
+            manifest_path=str(local_manifest) if local_manifest.exists() else "",
+            required_difficulties=required_difficulties,
+            min_terminal_reward=min_terminal_reward,
+            min_train_rows=min_train_rows,
+        )
+        print(f"Local SFT reward preflight: {json.dumps(local_reward_preflight, sort_keys=True)}")
+        if not local_reward_preflight["passed"]:
+            raise RuntimeError(
+                "Local SFT reward verification failed; refusing to upload/train. "
+                f"Failures: {local_reward_preflight['failures']}"
+            )
     if local_train.exists():
         uploaded = upload_sft_jsonl.remote(
             "sft/train.jsonl",
 
         )
         print(f"Uploaded validation JSONL: {uploaded_validation}")
         validation_jsonl = uploaded_validation
+    if local_manifest.exists():
+        uploaded_manifest = upload_sft_jsonl.remote(
+            "sft/manifest.json",
+            local_manifest.read_text(encoding="utf-8"),
+        )
+        print(f"Uploaded manifest: {uploaded_manifest}")
+        manifest_path = uploaded_manifest
     if mode == "upload":
         return
 
     kwargs = dict(
         train_jsonl=train_jsonl,
         validation_jsonl=validation_jsonl,
+        manifest_path=manifest_path,
         output_repo_id=output_repo_id,
         model_name=model_name,
         run_name=run_name,
 
         lora_rank=lora_rank,
         trackio_space_id=trackio_space_id,
         trackio_project=trackio_project,
+        require_reward_verification=require_reward_verification,
+        required_difficulties=required_difficulties,
+        min_terminal_reward=min_terminal_reward,
+        min_train_rows=min_train_rows,
         push_to_hub=push_to_hub,
     )
     print(f"SFT run name: {run_name}")
     print(f"Train JSONL: {train_jsonl}")
     print(f"Validation JSONL: {validation_jsonl}")
+    print(f"Manifest: {manifest_path}")
+    print(f"Reward verification required: {require_reward_verification}")
+    if required_difficulties:
+        print(f"Required curriculum difficulties: {required_difficulties}")
     print(f"Hub push enabled: {push_to_hub}")
     if detach:
         call = train_cybersecurity_owasp_sft.spawn(**kwargs)
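For reference, a plausible shape for the `upload_sft_jsonl` helper invoked above: its call sites pass a volume-relative path plus the file text, and the returned container path becomes the training input. The decorator, app, and volume names below are assumptions, not the script's actual definitions:

```python
import pathlib

import modal

app = modal.App("cybersecurity-owasp-sft")    # assumed app name
runs_volume = modal.Volume.from_name("runs")  # assumed volume name


@app.function(volumes={"/runs": runs_volume})
def upload_sft_jsonl(relative_path: str, text: str) -> str:
    # Persist the payload on the shared /runs volume and hand back its path.
    target = pathlib.Path("/runs") / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    runs_volume.commit()  # make the write visible to other containers
    return str(target)
```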
tests/test_modal_scenario_cache_static.py CHANGED
@@ -37,3 +37,40 @@ def test_modal_training_is_pinned_to_gemma4_e2b():
     assert "from unsloth import FastVisionModel" in source
     assert "Qwen" not in source
     assert "FastLanguageModel" not in source
+
+
+def test_modal_sft_defaults_match_300_episode_fast_handoff_plan():
+    source = (ROOT / "scripts" / "modal_train_sft.py").read_text(encoding="utf-8")
+
+    assert 'SFT_GPU_FALLBACK = ["H200", "H100", "A100-80GB", "L40S"]' in source
+    assert "gpu=SFT_GPU_FALLBACK" in source
+    assert "DEFAULT_TOTAL_TRAIN_EPISODES = 300" in source
+    assert "DEFAULT_EPISODES_PER_LEVEL = 75" in source
+    assert 'DEFAULT_CURRICULUM_LEVELS = "0,1,2,3"' in source
+    assert (
+        'DEFAULT_SFT_OUTPUT_REPO_ID = (\n'
+        '    "Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora"'
+    ) in source
+    assert "output_repo_id = output_repo_id or DEFAULT_SFT_OUTPUT_REPO_ID" in source
+    assert source.count("max_steps: int = -1") >= 2
+    assert source.count("per_device_train_batch_size: int = 4") >= 2
+    assert source.count("gradient_accumulation_steps: int = 4") >= 2
+    assert '"assistant_only_loss": True' in source
+    assert '"packing": True' in source
+    assert '"packing_strategy": "bfd"' in source
+    assert '"bf16": True' in source
+    assert '"tf32": True' in source
+    assert '"hub_strategy": "every_save"' in source
+    assert 'trackio_space_id: str = DEFAULT_TRACKIO_SPACE_ID' in source
+    assert 'trackio_project: str = DEFAULT_TRACKIO_PROJECT' in source
+    assert 'os.environ["TRACKIO_SPACE_ID"] = trackio_space_id' in source
+    assert 'os.environ["TRACKIO_PROJECT"] = trackio_project' in source
+
+
+def test_modal_grpo_loads_sft_adapter_from_hub_as_trainable_lora():
+    source = (ROOT / "scripts" / "modal_train_grpo.py").read_text(encoding="utf-8")
+
+    assert "initial_adapter_repo_id" in source
+    assert "Downloading initial SFT adapter" in source
+    assert "snapshot_download(" in source
+    assert "PeftModel.from_pretrained(model, adapter_source, is_trainable=True)" in source
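The second test pins the GRPO warm-start path: `modal_train_grpo.py` must fetch the SFT LoRA from the Hub and attach it as a trainable adapter rather than merging it into the base weights. A minimal sketch of the pattern those four assertions imply (the function wrapper and argument plumbing are assumptions):

```python
from huggingface_hub import snapshot_download
from peft import PeftModel


def attach_initial_adapter(model, initial_adapter_repo_id: str, hf_token: str):
    # No warm start requested: train a fresh adapter instead.
    if not initial_adapter_repo_id:
        return model
    print(f"Downloading initial SFT adapter: {initial_adapter_repo_id}")
    adapter_source = snapshot_download(repo_id=initial_adapter_repo_id, token=hf_token)
    # is_trainable=True keeps the LoRA weights unfrozen so GRPO can update them.
    return PeftModel.from_pretrained(model, adapter_source, is_trainable=True)
```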
tests/test_sft_dataset_generation.py CHANGED
@@ -90,11 +90,20 @@ def test_dry_run_oracle_creates_chat_jsonl_without_network():
             validation_episodes=1,
             out_dir=out_dir,
             dry_run_oracle=True,
+            workers=2,
+            difficulty_levels=(0, 1),
         )
     )
 
-    assert manifest["episodes_attempted"] == 3
-    assert manifest["episodes_accepted"] == 3
+    assert manifest["difficulty_levels"] == [0, 1]
+    assert manifest["difficulty_bucket_count"] >= 2
+    assert manifest["episodes_attempted"] == 6
+    assert manifest["episodes_accepted"] == 6
+    assert manifest["workers"] == 2
+    assert manifest["reward_verification"]["passed"] is True
+    assert manifest["reward_verification"]["missing_difficulties"] == []
+    assert manifest["rows_by_difficulty"]["0"] > 0
+    assert manifest["rows_by_difficulty"]["1"] > 0
     assert (out_dir / "train.jsonl").exists()
     assert (out_dir / "validation.jsonl").exists()
     train_rows = [
@@ -110,6 +119,43 @@ def test_saved_oracle_trajectory_replays_to_success():
     assert train_rows
     assert validation_rows
     assert all(row["messages"][-1]["role"] == "assistant" for row in train_rows)
+    reward_check = generate_sft_dataset.verify_sft_dataset_rewards(
+        out_dir,
+        required_difficulties=(0, 1),
+    )
+    assert reward_check["passed"] is True
+    assert (out_dir / "README.md").exists()
+
+
+def test_reward_verification_rejects_low_reward_rows():
+    out_dir = _isolated_out_dir("bad_reward")
+    out_dir.mkdir(parents=True, exist_ok=True)
+    action = CyberSecurityOWASPAction(tool_name="inspect_policy_graph", arguments={})
+    row = {
+        "messages": [
+            {"role": "system", "content": "system"},
+            {"role": "user", "content": "user"},
+            {"role": "assistant", "content": json.dumps(action.model_dump())},
+        ],
+        "metadata": {
+            "final_success": True,
+            "terminal_total": 1.0,
+            "anti_cheat_flags": [],
+            "final_reward_breakdown": {
+                "security": 5.0,
+                "regression": 3.0,
+                "public_routes": 1.0,
+                "patch_quality": 2.0,
+                "visible_tests": 1.0,
+            },
+        },
+    }
+    (out_dir / "train.jsonl").write_text(json.dumps(row) + "\n", encoding="utf-8")
+
+    reward_check = generate_sft_dataset.verify_sft_dataset_rewards(out_dir)
+
+    assert reward_check["passed"] is False
+    assert reward_check["failure_count"] == 1
 
 
 def test_saved_oracle_trajectory_replays_to_success():
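The fixture above is deliberately inconsistent: its reward breakdown sums to 12.0, which would clear the default `min_terminal_reward` floor, but the verifier presumably gates on `terminal_total`, which is only 1.0. Illustrative arithmetic:

```python
final_reward_breakdown = {
    "security": 5.0,
    "regression": 3.0,
    "public_routes": 1.0,
    "patch_quality": 2.0,
    "visible_tests": 1.0,
}
# The component sum clears the 12.0 floor...
assert sum(final_reward_breakdown.values()) == 12.0
# ...but the fixture's terminal_total (1.0) does not, so the row is rejected.
assert 1.0 < 12.0
```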