
Submission Artifact Index

This page points reviewers to the shared environment, training scripts, and training logs/results. It is intentionally path-based so the artifacts can be found from a fresh clone without relying on local outputs/ or checkpoints/ folders.

Environment And Runtime

Core OpenEnv/runtime files:

  • openenv.yaml - OpenEnv package entrypoint and deployment metadata.
  • server/app.py - ASGI/FastAPI bridge used by OpenEnv validation and Space deployment.
  • app/env/env_core.py - canonical PolyGuardEnv reset/step/state implementation.
  • app/env/fastapi_app.py - HTTP API, catalog, reset, step, and candidate-step routes.
  • app/env/reward_router.py - verifier-backed reward routing.
  • app/env/reward_scaling.py - reward clamping/rounding to [0.001, 0.999].
  • app/env/anti_cheat.py - anti-hacking and invalid-action checks.
  • app/env/catalog.py - task preset and sub-environment catalog.
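The reward clamping noted for app/env/reward_scaling.py can be sketched as a single pure function; the function name, rounding precision, and exact semantics here are assumptions, with only the [0.001, 0.999] range taken from this index:

```python
def scale_reward(raw: float, lo: float = 0.001, hi: float = 0.999,
                 ndigits: int = 3) -> float:
    """Clamp a raw verifier score into [lo, hi] and round it, so the
    policy never observes exact 0.0 or 1.0 rewards."""
    return round(min(hi, max(lo, raw)), ndigits)
```

Keeping rewards strictly inside (0, 1) avoids degenerate advantage estimates when an entire GRPO group saturates at the same extreme score.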

Dependency and container files:

  • pyproject.toml and uv.lock - Python project definition and dependency lockfile.
  • requirements.txt - local/runtime pip dependency export.
  • requirements-space.txt - Hugging Face Space dependency export.
  • .env.example - non-secret environment variable template.
  • Dockerfile - local/container runtime.
  • Dockerfile.space - production HF Space runtime.
  • app/hf_space/Dockerfile - HF training/evidence Space runtime.
  • configs/sft.yaml and configs/grpo.yaml - train-loop defaults.
  • configs/rewards.yaml, configs/curriculum.yaml, and configs/env_*.yaml - environment/reward/curriculum configuration.
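A layered configuration like this (train-loop defaults overridden by environment/reward/curriculum files) is typically combined with a recursive dict merge. A minimal sketch under that assumption; the merge semantics and sample keys are hypothetical, not read from the actual YAML files:

```python
def merge_configs(base: dict, override: dict) -> dict:
    """Recursively merge two config dicts: override values win, and
    nested dicts merge key-by-key (assumed layering semantics)."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged
```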

Secrets are not committed. Hugging Face access is supplied through HF_TOKEN as an environment variable or notebook/Space secret.
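Token lookup in that pattern reduces to reading process environment variables, since notebook and Space secrets surface the same way. A minimal sketch (the fallback to `HUGGING_FACE_HUB_TOKEN` is an extra convention, not something this index specifies):

```python
import os
from typing import Optional


def resolve_hf_token() -> Optional[str]:
    """Read the Hugging Face token from the environment; never
    hard-code tokens in committed files."""
    return os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN")
```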

Training Scripts And Notebooks

End-to-end runner notebooks:

  • PolyGuard_SFT_GRPO_One_Run_Runner.ipynb - one-run data build, SFT, GRPO, artifact pull, inference validation, chart generation, and Space deployment.
  • notebooks/09_training_loop.ipynb - modular walkthrough of the same loop.

Dataset and corpus scripts:

  • scripts/bootstrap_data.py
  • scripts/build_training_corpus.py
  • scripts/generate_sft_data.py

SFT/GRPO training scripts:

  • scripts/train_sft_trl.py - TRL SFT baseline.
  • scripts/train_grpo_trl.py - TRL GRPO with environment-backed reward.
  • scripts/train_grpo_policy.py
  • scripts/train_grpo_planner.py
  • scripts/train_grpo_supervisor.py
  • scripts/train_grpo_dosing.py
  • app/training/sft_trl.py
  • app/training/grpo_trl.py
  • app/training/openenv_wrapper.py
  • app/training/reward_functions.py
  • app/training/callbacks.py
  • app/training/checkpointing.py
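An environment-backed reward for TRL GRPO generally takes the shape of a callable that returns one float per completion. The sketch below uses a stub verifier in place of a real PolyGuardEnv step; the function names and stub scoring rule are assumptions, and only the [0.001, 0.999] clamp comes from this index:

```python
def stub_verifier_score(completion: str) -> float:
    # Stand-in for a real environment step plus verifier call.
    return 1.0 if completion.strip().startswith("{") else 0.0


def env_backed_reward(prompts, completions, **kwargs):
    """Reward callable in the shape TRL-style trainers expect:
    one clamped float per completion."""
    return [min(0.999, max(0.001, stub_verifier_score(c)))
            for c in completions]
```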

Hugging Face training/evidence scripts:

  • scripts/deploy_training_space.py - creates/runs the GPU training Space.
  • app/hf_space/training_runner.py - Space-side training orchestrator.
  • scripts/monitor_training_space_status.py - Space status/log monitor.
  • scripts/pull_training_artifacts.py - artifact puller from the HF model repo.
  • scripts/deploy_evidence_space.py and app/hf_space/evidence_runner.py - evaluation-only evidence Space.
  • scripts/generate_hf_training_report.py - training/sweep chart generation.
  • scripts/generate_submission_evidence.py - evidence bundle generation without retraining.
  • scripts/deploy_final_artifact_space.py - packages final public evidence/model artifacts into the final HF Space.

Post-training and inference scripts:

  • scripts/merge_adapters_safe.py
  • scripts/test_inference_postsave.py
  • scripts/benchmark_inference.py
  • scripts/activate_sweep_model.py
  • scripts/install_hf_active_bundle.py
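The kind of number an inference benchmark like scripts/benchmark_inference.py reports can be sketched as a median-latency helper; the function name and measurement shape here are assumptions, not the script's actual interface:

```python
import time


def median_latency(generate, prompt: str, runs: int = 5) -> float:
    """Median wall-clock latency of a generate() callable over
    several runs (median is robust to one-off warmup spikes)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        samples.append(time.perf_counter() - start)
    return sorted(samples)[runs // 2]
```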

Training Logs And Result Evidence

Final curated evidence:

  • docs/results/final_submission_evidence/README.md - final evidence overview.
  • docs/results/final_submission_evidence/manifest.json - artifact availability and final HF Space manifest.
  • docs/results/final_submission_evidence/reports/submission_summary.json - final three-model summary.
  • docs/results/final_submission_evidence/reports/grpo_trl_run.json - Qwen 3B GRPO training run report.
  • docs/results/final_submission_evidence/reports/postsave_inference_grpo.json - post-save GRPO inference check.
  • docs/results/final_submission_evidence/reports/grpo_ablation_report.json - GRPO/policy ablation report.
  • docs/results/final_submission_evidence/reports/basic_llm_vs_polyguard_report.json - comparison of a baseline LLM-style policy against the full PolyGuard pipeline.
  • docs/results/final_submission_evidence/reports/action_traces.jsonl - matched action traces with verifier output.
  • docs/results/final_submission_evidence/charts/curated/README.md - visually reviewed chart index.

Per-model sweep histories:

  • docs/results/sweeps/qwen-qwen2-5-0-5b-instruct/sft_history.json
  • docs/results/sweeps/qwen-qwen2-5-0-5b-instruct/sft_trl_run.json
  • docs/results/sweeps/qwen-qwen2-5-1-5b-instruct/sft_history.json
  • docs/results/sweeps/qwen-qwen2-5-1-5b-instruct/sft_trl_run.json
  • docs/results/sweeps/qwen-qwen2-5-3b-instruct/sft_history.json
  • docs/results/sweeps/qwen-qwen2-5-3b-instruct/sft_trl_run.json
  • docs/results/sweeps/qwen-qwen2-5-3b-instruct/grpo_history.json
  • docs/results/sweeps/qwen-qwen2-5-3b-instruct/grpo_trl_run.json
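Pulling a headline number out of these histories reduces to reading the last logged value of a metric. A minimal sketch, assuming the `*_history.json` files are lists of per-step dicts (that shape is an assumption):

```python
import json


def last_logged(history, key):
    """Return the final logged value of `key` from a list of
    per-step metric dicts, or None if it never appears."""
    values = [step[key] for step in history if key in step]
    return values[-1] if values else None


# Example against an inline record in the assumed format:
history = json.loads('[{"step": 1, "loss": 2.1}, {"step": 2, "loss": 1.7}]')
```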

Three-model submission evidence:

  • docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-0-5b-instruct/sft_history.json
  • docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-1-5b-instruct/sft_history.json
  • docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-3b-instruct/sft_history.json
  • docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-3b-instruct/grpo_history.json
  • docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/runs/qwen-qwen2-5-3b-instruct/grpo_reward_components.jsonl
  • docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/remote_stage_records.json

Completed-run status snapshots:

  • docs/results/qwen_completed_runs/reports/remote_status/live_hf_status_snapshot.json
  • docs/results/qwen_completed_runs/reports/remote_status/qwen_0_5b_completed_commands.json
  • docs/results/qwen_completed_runs/reports/remote_status/qwen_1_5b_completed_commands.json
  • docs/results/qwen_completed_runs/reports/remote_status/qwen_0_5b_1_5b_remote_stage_durations.json
  • docs/results/submission_evidence/qwen_3b_continuation/training_space_runtime_status.json

Legacy/local smoke logs are retained under docs/results/active_model/, docs/results/grpo_training_cycle/, and submission_bundle/ for auditability.

Model Artifacts

The public final artifact/evidence Space is:

The tracked local manifest is:

  • docs/results/final_submission_evidence/manifest.json

At packaging time, Qwen 3B had SFT and GRPO adapter directories plus checkpoint metadata in the final Space. Qwen 0.5B and 1.5B had reports/histories in this repo, but their adapter directories were not present in the checked artifact mirrors, so they are labeled reports_only_or_partial.

The final artifact Space and this checked-in evidence mirror are the public review paths. Authenticated downloads, when needed by maintainers, are operational details rather than part of the public submission narrative.

Reproduction Paths

Local smoke path: build the small corpus, run a short SFT pass, run a short GRPO pass, validate post-save inference, and generate local reports.
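The smoke path above can be sketched as an ordered script runner. The step-to-script mapping is inferred from this index and the per-script CLI flags are omitted because they are repo-specific; treat both as assumptions:

```python
import subprocess
import sys

# Assumed ordering of the local smoke path (flags omitted).
SMOKE_STEPS = [
    "scripts/bootstrap_data.py",
    "scripts/train_sft_trl.py",
    "scripts/train_grpo_trl.py",
    "scripts/test_inference_postsave.py",
]


def run_smoke(dry_run: bool = True) -> list:
    """Build the per-step commands and, unless dry_run, execute them
    in order, stopping at the first failure."""
    commands = [[sys.executable, step] for step in SMOKE_STEPS]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```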

Full HF Space path: use the one-run notebook or training Space runner when you have the required Hugging Face credentials and hardware. The public evidence for review is the final curated bundle, not private training commands.