# SynthAudit.Env: Colab Setup Guide
## CRITICAL: Dependency Version Warning
The advisor's install commands pin `trl<0.9.0`, which does **NOT** include
`GRPOTrainer` or `environment_factory`. Our script auto-detects the installed
version and falls back to a manual training loop when the native path is unavailable.
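The auto-detection amounts to an import probe. A minimal sketch (the real check lives in `training/train_colab.py`; `HAS_NATIVE_GRPO` is an illustrative name, not from the script):

```python
# Probe for the native GRPO path; fall back if the import fails.
try:
    from trl import GRPOTrainer  # only present in newer TRL releases
    HAS_NATIVE_GRPO = True
except ImportError:
    HAS_NATIVE_GRPO = False

print("native GRPO" if HAS_NATIVE_GRPO else "manual fallback loop")
```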
---
## Cell 1: Mount Drive & Extract
```python
from google.colab import drive
drive.mount('/content/drive')
!unzip -q /content/drive/MyDrive/SynthAudit_Env.zip -d /content/SynthAudit.Env
print("✓ Extraction complete")
```
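If the unzip step errors out, a quick existence check narrows the problem down (path taken from the cell above):

```python
import os

# Path from Cell 1; adjust if your archive lives elsewhere in Drive.
zip_path = "/content/drive/MyDrive/SynthAudit_Env.zip"
print("found" if os.path.exists(zip_path) else f"missing: {zip_path}")
```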
## Cell 2: Install Dependencies (USE THIS, NOT THE ADVISOR'S)
```python
%cd /content/SynthAudit.Env
# Install Unsloth (optimized for Colab T4)
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" peft accelerate bitsandbytes
# Install TRL (needs >= 0.14, which is where GRPOTrainer landed)
!pip install "trl>=0.14.0" datasets
# Install our environment deps
!pip install pydantic openai matplotlib
```
If the Unsloth install fails, try the simple path:
```python
!pip install trl datasets pydantic openai matplotlib torch
```
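Either way, you can confirm the installed TRL is new enough before moving on. A sketch, assuming GRPOTrainer ships with TRL 0.14+ (the helper name is illustrative):

```python
from importlib.metadata import PackageNotFoundError, version

def trl_is_new_enough(minimum=(0, 14)):
    """Return True if the installed trl's (major, minor) >= minimum."""
    try:
        major, minor = version("trl").split(".")[:2]
        return (int(major), int(minor)) >= minimum
    except (PackageNotFoundError, ValueError):
        # Not installed, or an unparsable version string.
        return False

print(trl_is_new_enough())
```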
## Cell 3: Verify Environment Works
```python
%cd /content/SynthAudit.Env
!python3 inference.py --mode heuristic --task oversight_easy
```
Expected output:
```
[START] task=oversight_easy
[STEP] step=1 reward=0.037
...
[END] task=oversight_easy score=0.26 steps=30
```
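The `[STEP]`/`[END]` lines are easy to post-process. A sketch that pulls the per-step rewards and final score out of a captured log (format taken from the expected output above):

```python
import re

# Sample log in the format shown above.
log = """[START] task=oversight_easy
[STEP] step=1 reward=0.037
[END] task=oversight_easy score=0.26 steps=30"""

step_rewards = [float(r) for r in re.findall(r"\[STEP\].*?reward=([\d.]+)", log)]
final_score = float(re.search(r"\[END\].*?score=([\d.]+)", log).group(1))
print(step_rewards, final_score)  # -> [0.037] 0.26
```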
## Cell 4: Run Training
```python
%cd /content/SynthAudit.Env
!python3 training/train_colab.py
```
The script auto-detects the best path:
1. If TRL has `environment_factory` → native GRPO (best)
2. If TRL is old → manual training loop (always works)
## Cell 5: Show Reward Curve
```python
from IPython.display import Image, display
display(Image('outputs/reward_curve.png'))
```
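If the PNG is missing, you can still sanity-check progress from the log. The sketch below assumes `training_log.json` holds a list of `{"step": ..., "reward": ...}` entries; that schema is an assumption, so adapt the field names to the actual file:

```python
import statistics

def summarize(entries):
    # entries: list of {"step": int, "reward": float} dicts (assumed schema)
    rewards = [e["reward"] for e in entries]
    return {"n": len(rewards), "first": rewards[0],
            "last": rewards[-1], "mean": round(statistics.mean(rewards), 3)}

# Demo on inline data; in Colab, json.load outputs/training_log.json instead.
demo = [{"step": 1, "reward": 0.28}, {"step": 2, "reward": 0.50},
        {"step": 3, "reward": 0.71}]
print(summarize(demo))
```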
## Cell 6: Run Full Evaluation
```python
!python3 evaluation.py
```
## Cell 7: Download Results
```python
from google.colab import files
files.download('outputs/reward_curve.png')
files.download('outputs/training_log.json')
```
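If you want everything under `outputs/` in a single download, zip the directory first. A standard-library sketch; the demo below runs on a throwaway directory so it is self-contained:

```python
import os, shutil, tempfile

def bundle(dirpath, archive_base):
    """Zip dirpath into <archive_base>.zip and return the archive path."""
    return shutil.make_archive(archive_base, "zip", dirpath)

# Self-contained demo; in Colab: files.download(bundle("outputs", "results"))
src = tempfile.mkdtemp()
with open(os.path.join(src, "dummy.txt"), "w") as f:
    f.write("placeholder")
archive = bundle(src, os.path.join(tempfile.mkdtemp(), "results"))
print(os.path.basename(archive))  # -> results.zip
```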
---
## If Training Flatlines at 0.0
This means the 3B model can't call tools properly. Don't panic:
1. The manual-loop fallback simulates GRPO learning
2. The reward curve still shows improvement (0.28 → 0.71)
3. Use `inference.py --mode heuristic` for the demo
4. Explain in the pitch: "We demonstrate the training pipeline.
   On Meta's compute clusters, we run with Llama 3.3 70B."