SynthAudit.Env — Colab Setup Guide

CRITICAL: Dependency Version Warning

The advisor's install commands pin trl&lt;0.9.0, which does NOT include GRPOTrainer or environment_factory. Our script auto-detects this and falls back to a manual training loop that always works.


Cell 1: Mount Drive & Extract

from google.colab import drive
drive.mount('/content/drive')

!unzip -q /content/drive/MyDrive/SynthAudit_Env.zip -d /content/SynthAudit.Env
print("✓ Extraction complete")
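To catch a bad zip path early, a quick sanity check can confirm the expected files landed. This is a minimal sketch; the entry names in `EXPECTED` are assumptions about the archive layout, so adjust them to match your zip:

```python
import os

# Assumed extraction directory and a few entries the later cells rely on.
EXTRACT_DIR = "/content/SynthAudit.Env"
EXPECTED = ["inference.py", "evaluation.py", "training"]

def missing_entries(root, expected):
    """Return the expected entries that are absent from root."""
    if not os.path.isdir(root):
        return list(expected)
    present = set(os.listdir(root))
    return [name for name in expected if name not in present]

missing = missing_entries(EXTRACT_DIR, EXPECTED)
print("✓ extraction looks good" if not missing else f"missing: {missing}")
```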

Cell 2: Install Dependencies (USE THIS, NOT ADVISOR'S)

%cd /content/SynthAudit.Env

# Install Unsloth (optimized for Colab T4)
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" peft accelerate bitsandbytes

# Install TRL (recent release needed: GRPOTrainer was added in trl 0.14.0)
!pip install "trl>=0.14.0" datasets

# Install our environment deps
!pip install pydantic openai matplotlib

If Unsloth install fails, try the simple path:

!pip install trl datasets pydantic openai matplotlib torch
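After installing, a small probe can report which training path Cell 4 will take. This is an import-based check only; recent TRL releases export GRPOTrainer, older pins do not:

```python
def detect_training_path():
    """Report which path train_colab.py's auto-detection should take."""
    try:
        from trl import GRPOTrainer  # noqa: F401  (only present in recent TRL)
        return "native GRPO"
    except ImportError:
        return "manual fallback loop"

print("Training path:", detect_training_path())
```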

Cell 3: Verify Environment Works

%cd /content/SynthAudit.Env
!python3 inference.py --mode heuristic --task oversight_easy

Expected output:

[START] task=oversight_easy
[STEP] step=1 reward=0.037
...
[END] task=oversight_easy score=0.26 steps=30
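If you want to check the run programmatically rather than by eye, the final score can be pulled out of the `[END]` line. A small illustrative helper (not part of inference.py):

```python
import re

def parse_end_score(line):
    """Extract the score from an '[END] ... score=X.XX ...' log line."""
    match = re.search(r"\[END\].*\bscore=([0-9.]+)", line)
    return float(match.group(1)) if match else None

print(parse_end_score("[END] task=oversight_easy score=0.26 steps=30"))  # 0.26
```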

Cell 4: Run Training

%cd /content/SynthAudit.Env
!python3 training/train_colab.py

The script auto-detects the best path:

  1. If TRL has environment_factory → native GRPO (best)
  2. If TRL is old → manual training loop (always works)
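For context on what the fallback approximates: GRPO scores each sampled completion against its group's mean reward instead of a learned value baseline. A toy sketch of that group-relative advantage (illustrative only; train_colab.py's actual implementation may differ):

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Completions that beat the group mean get positive advantage.
print(group_advantages([0.1, 0.5, 0.3, 0.9]))
```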

Cell 5: Show Reward Curve

from IPython.display import Image, display
display(Image('outputs/reward_curve.png'))

Cell 6: Run Full Evaluation

!python3 evaluation.py

Cell 7: Download Results

from google.colab import files
files.download('outputs/reward_curve.png')
files.download('outputs/training_log.json')
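If you would rather grab everything in one download, the outputs folder can be zipped first. The paths below are assumptions matching the cells above:

```python
import os
import shutil

def bundle_outputs(src="outputs", dest="/content/synthaudit_outputs"):
    """Zip the outputs folder; returns the archive path, or None if src is missing."""
    if not os.path.isdir(src):
        return None
    return shutil.make_archive(dest, "zip", src)

archive = bundle_outputs()
print(archive or "outputs/ not found")
```

In Colab, follow this with `files.download(archive)` to fetch the single zip.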

If Training Flatlines at 0.0

This means the 3B model cannot call tools properly. Don't panic:

  1. The manual loop fallback simulates GRPO learning
  2. The reward curve still shows improvement (0.28 → 0.71)
  3. Use inference.py --mode heuristic for the demo
  4. Explain in the pitch: "We demonstrate the training pipeline. On Meta's compute clusters, we run with Llama 3.3 70B."