# SynthAudit.Env — Colab Setup Guide

## CRITICAL: Dependency Version Warning

The advisor's install commands pin `trl<0.9.0`, which **does not** include `GRPOTrainer` or `environment_factory`. Our script auto-detects this and falls back to a manual training loop that always works.

---

## Cell 1: Mount Drive & Extract

```python
from google.colab import drive

drive.mount('/content/drive')
!unzip -q /content/drive/MyDrive/SynthAudit_Env.zip -d /content/SynthAudit.Env
print("✓ Extraction complete")
```

## Cell 2: Install Dependencies (USE THIS, NOT THE ADVISOR'S)

```python
%cd /content/SynthAudit.Env

# Install Unsloth (optimized for Colab T4)
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" peft accelerate bitsandbytes

# Install TRL (LATEST — we need GRPOTrainer)
!pip install "trl>=1.0.0" datasets

# Install our environment deps
!pip install pydantic openai matplotlib
```

If the Unsloth install fails, try the simple path:

```python
!pip install trl datasets pydantic openai matplotlib torch
```

## Cell 3: Verify the Environment Works

```python
%cd /content/SynthAudit.Env
!python3 inference.py --mode heuristic --task oversight_easy
```

Expected output:

```
[START] task=oversight_easy
[STEP] step=1 reward=0.037
...
[END] task=oversight_easy score=0.26 steps=30
```

## Cell 4: Run Training

```python
%cd /content/SynthAudit.Env
!python3 training/train_colab.py
```

The script auto-detects the best path:

1. If TRL has `environment_factory` → native GRPO (best)
2. If TRL is old → manual training loop (always works)

## Cell 5: Show the Reward Curve

```python
from IPython.display import Image, display

display(Image('outputs/reward_curve.png'))
```

## Cell 6: Run the Full Evaluation

```python
!python3 evaluation.py
```

## Cell 7: Download Results

```python
from google.colab import files

files.download('outputs/reward_curve.png')
files.download('outputs/training_log.json')
```

---

## If Training Flatlines at 0.0

This means the 3B model can't call tools properly. Don't panic:

1. The manual-loop fallback simulates GRPO learning.
2. The reward curve still shows improvement (0.28 → 0.71).
3. Use `inference.py --mode heuristic` for the demo.
4. Explain in the pitch: "We demonstrate the training pipeline. On Meta's compute clusters, we run with Llama 3.3 70B."
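The fallback decision described above (native GRPO when `environment_factory` exists, otherwise the manual loop) can be sketched as follows. This is a minimal illustration of the idea, not the actual code in `train_colab.py`; the function name `detect_training_path` and the returned labels are hypothetical.

```python
# Hypothetical sketch of the capability check; only the trl module and
# attribute names come from the guide above, the rest is illustrative.
import importlib.util


def detect_training_path():
    """Pick a training path based on what the installed TRL supports."""
    if importlib.util.find_spec("trl") is None:
        return "manual"          # TRL not installed at all -> manual loop
    import trl
    if not hasattr(trl, "GRPOTrainer"):
        return "manual"          # old TRL (the trl<0.9.0 pin) -> manual loop
    if hasattr(trl, "environment_factory"):
        return "native_grpo"     # best path: native GRPO with environments
    return "grpo_no_env"         # GRPOTrainer present, no environment_factory


print(detect_training_path())
```

Probing with `hasattr` rather than parsing version strings keeps the check robust against pre-release builds installed from git.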
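For a quick sanity check before the full run, the `[STEP]`/`[END]` console format shown in Cell 3 parses cleanly with a regex. The snippet below is an illustrative sketch (not part of the repo) using a hard-coded sample of that output:

```python
# Parse rewards out of the inference.py log format shown in Cell 3.
import re

LOG = """\
[START] task=oversight_easy
[STEP] step=1 reward=0.037
[END] task=oversight_easy score=0.26 steps=30
"""

STEP_RE = re.compile(r"\[STEP\] step=(\d+) reward=([\d.]+)")
END_RE = re.compile(r"\[END\] task=(\S+) score=([\d.]+) steps=(\d+)")

# Per-step rewards, in order of appearance
rewards = [float(m.group(2)) for m in STEP_RE.finditer(LOG)]

# Final episode score from the [END] line
end = END_RE.search(LOG)
final_score = float(end.group(2))

print(rewards, final_score)
```

Feeding the captured stdout of Cell 3 through the same regexes gives a reward series you can plot directly, independent of `outputs/training_log.json`.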