
πŸš€ SUPERNOVA VM TRAINING INSTRUCTIONS

πŸŽ‰ VALIDATION COMPLETE: ALL 8 TESTS PASSED (100%)

Your local system has been fully validated and is ready for VM training deployment.


πŸ“‹ VM SETUP CHECKLIST

Step 1: Transfer Files to VM

Copy these essential files to your VM:

supernova/                    # Main package directory
configs/                     # Configuration files
chat_advanced.py             # Advanced reasoning system
train_production.py          # Production training script (optional)
requirements.txt             # Dependencies

Step 2: VM Environment Setup

# Install Python 3.10+ and dependencies
pip install -r requirements.txt

# Verify installation
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import datasets; print('HuggingFace Datasets: OK')"

Step 3: Verify VM System

# Quick validation test
python -c "
from supernova.config import ModelConfig
from supernova.model import SupernovaModel
cfg = ModelConfig.from_json_file('./configs/supernova_25m.json')
model = SupernovaModel(cfg)
params = sum(p.numel() for p in model.parameters())
print(f'βœ… Model: {params:,} parameters')
assert 24_000_000 < params < 26_000_000, 'parameter count outside expected ~25M range'
print('βœ… VM SYSTEM READY')
"

🎯 TRAINING COMMANDS FOR VM

PHASE 1: Validation Run (MANDATORY FIRST)

python -m supernova.train \
  --config ./configs/supernova_25m.json \
  --data-config ./configs/data_sources.yaml \
  --seq-len 512 \
  --batch-size 4 \
  --grad-accum 4 \
  --lr 3e-4 \
  --warmup-steps 100 \
  --max-steps 1000 \
  --save-every 500 \
  --out-dir ./validation_checkpoints

Expected Results:

  • Initial loss: ~10-11 (roughly ln(vocab_size) for a randomly initialized model)
  • Final loss after 1000 steps: Should decrease to <9
  • Training time: 30-60 minutes
  • Checkpoints: validation_checkpoints/supernova_step500.pt and supernova_final.pt

PHASE 2: Full Production Training

⚠️ Only run after Phase 1 succeeds!

python -m supernova.train \
  --config ./configs/supernova_25m.json \
  --data-config ./configs/data_sources.yaml \
  --seq-len 1024 \
  --batch-size 16 \
  --grad-accum 8 \
  --lr 3e-4 \
  --warmup-steps 2000 \
  --max-steps 100000 \
  --save-every 10000 \
  --out-dir ./checkpoints

Expected Results:

  • Training time: 2-7 days (depending on hardware)
  • Final loss: <6 (target <4 for good performance)
  • Checkpoints every 10K steps
  • Total tokens processed: ~13.1 billion
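The ~13.1 billion figure follows directly from the flags in the Phase 2 command. A quick sanity check (pure arithmetic, no dependencies):

```python
# Total tokens processed = max_steps * batch_size * grad_accum * seq_len,
# using the Phase 2 flags from the command above.
max_steps = 100_000
batch_size = 16
grad_accum = 8
seq_len = 1024

effective_batch = batch_size * grad_accum    # sequences per optimizer step
tokens_per_step = effective_batch * seq_len  # tokens per optimizer step
total_tokens = max_steps * tokens_per_step

print(f"Effective batch: {effective_batch} sequences")
print(f"Tokens/step:     {tokens_per_step:,}")
print(f"Total tokens:    {total_tokens / 1e9:.1f}B")  # 13.1B
```

The same arithmetic applied to the Phase 1 flags (1000 steps, batch 4, accum 4, seq-len 512) gives ~8.4M tokens, which is why Phase 1 finishes in under an hour.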

πŸ“Š MONITORING TRAINING PROGRESS

Key Metrics to Watch:

  1. Loss Decrease: Should consistently decrease over time
  2. Gradient Norm: Should stay in a reasonable range (roughly 1-100); brief spikes are normal, sustained growth is not
  3. Learning Rate: Should follow cosine schedule
  4. Tokens/Second: Throughput indicator
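The cosine schedule in item 3 can be sketched as below. This is an illustrative reimplementation, not the exact code in supernova.train; it assumes linear warmup to the peak LR followed by cosine decay to zero, using the Phase 2 values (--lr 3e-4, --warmup-steps 2000, --max-steps 100000):

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2_000, max_steps=100_000):
    """Linear warmup to peak_lr, then cosine decay to zero (illustrative)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(1_000))    # mid-warmup: half of peak
print(lr_at(2_000))    # peak LR
print(lr_at(100_000))  # decays to ~0 at the end
```

If the logged learning rate does not follow this rough shape, the schedule flags were likely misconfigured.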

Expected Loss Trajectory:

Steps 0-1000:     10.5 β†’ 9.0   (Initial learning)
Steps 1000-10K:   9.0 β†’ 7.5    (Rapid improvement)
Steps 10K-50K:    7.5 β†’ 6.0    (Steady progress)
Steps 50K-100K:   6.0 β†’ 4.5    (Fine-tuning)
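These loss values are easier to interpret as perplexity (the exponential of the cross-entropy loss). A quick conversion for the milestones above:

```python
import math

# Perplexity = exp(cross-entropy loss), for each milestone in the trajectory.
for loss in (10.5, 9.0, 7.5, 6.0, 4.5):
    print(f"loss {loss:4.1f} -> perplexity {math.exp(loss):10.1f}")
# e.g. loss 6.0 -> perplexity ~403, loss 4.5 -> perplexity ~90
```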

Warning Signs:

  • ❌ Loss increases consistently
  • ❌ Loss plateaus above 8.0 after 10K steps
  • ❌ Gradient norm explodes (>1000)
  • ❌ NaN values in loss
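The warning signs above can be turned into automated checks inside the training loop. The helper below is a hypothetical sketch (check_training_health is not part of the supernova package):

```python
import math

def check_training_health(loss, grad_norm, step, recent_losses):
    """Return a list of warnings matching the signs above (illustrative)."""
    warnings = []
    if math.isnan(loss) or math.isinf(loss):
        warnings.append("NaN/Inf loss -- stop and restart from the last checkpoint")
    if grad_norm > 1000:
        warnings.append(f"gradient norm exploded ({grad_norm:.1f} > 1000)")
    if step > 10_000 and loss > 8.0:
        warnings.append("loss still above 8.0 after 10K steps")
    if len(recent_losses) >= 100 and loss > recent_losses[0]:
        warnings.append("loss higher than 100 steps ago -- check the LR schedule")
    return warnings

# Example: a healthy step produces no warnings
print(check_training_health(loss=6.2, grad_norm=3.4, step=20_000,
                            recent_losses=[6.5] * 100))  # []
```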

πŸ” TRAINING VALIDATION COMMANDS

Check Training Progress:

# List checkpoints
ls -la checkpoints/

# Check latest checkpoint
python -c "
import torch
ckpt = torch.load('checkpoints/supernova_step10000.pt', map_location='cpu')
print(f'Step: {ckpt[\"step\"]}')
print(f'Loss: {ckpt.get(\"loss\", \"N/A\")}')
"

Test Model Generation (After Training):

python chat_advanced.py \
  --config ./configs/supernova_25m.json \
  --checkpoint ./checkpoints/supernova_step50000.pt \
  --prompt "Explain quantum physics in simple terms"

🚨 EMERGENCY PROCEDURES

If Training Fails:

  1. Check error logs for specific error messages
  2. Verify GPU memory usage (nvidia-smi)
  3. Reduce --batch-size if you hit out-of-memory (OOM) errors, and raise --grad-accum to keep the effective batch size constant
  4. Contact support with error details
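For step 3, halve --batch-size and double --grad-accum so the effective batch size (and therefore the loss trajectory) stays the same. A sketch of the arithmetic, starting from the Phase 2 defaults:

```python
def halve_batch(batch_size, grad_accum):
    """Trade per-device batch size for gradient accumulation (illustrative)."""
    assert batch_size % 2 == 0, "batch size must be even to halve"
    return batch_size // 2, grad_accum * 2

bs, ga = 16, 8                       # Phase 2 defaults
new_bs, new_ga = halve_batch(bs, ga)
print(new_bs, new_ga)                # 8 16
assert new_bs * new_ga == bs * ga    # effective batch unchanged (128 sequences)
```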

If Loss Doesn't Decrease:

  1. Verify learning rate schedule
  2. Check gradient norms
  3. Reduce learning rate by 50%
  4. Restart from last checkpoint

Performance Optimization:

# For GPU training
export CUDA_VISIBLE_DEVICES=0
python -m supernova.train ... # your command

# For multi-GPU (if available)
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m supernova.train ... # your command

πŸ“ž SUCCESS CRITERIA

Your training is successful if:

  • βœ… Loss decreases from ~10 to <6
  • βœ… Model generates coherent text (not gibberish)
  • βœ… Advanced reasoning system works with trained model
  • βœ… Checkpoints save without errors

🎯 POST-TRAINING TESTING

After training completes, test the system:

# Test basic generation
python chat_advanced.py --config ./configs/supernova_25m.json --checkpoint ./checkpoints/supernova_final.pt

# Test specific queries:
# 1. "What is 15 * 23?" (should use math engine)
# 2. "What are the latest AI developments?" (should use web search)
# 3. "Explain the theory of relativity" (should use reasoning)

πŸš€ TRAINING SYSTEM 100% VALIDATED - READY FOR VM DEPLOYMENT! πŸš€