π SUPERNOVA VM TRAINING INSTRUCTIONS
π VALIDATION COMPLETE: ALL 8 TESTS PASSED (100%)
Your local system has been fully validated and is ready for VM training deployment.
π VM SETUP CHECKLIST
Step 1: Transfer Files to VM
Copy these essential files to your VM:
supernova/ # Main package directory
configs/ # Configuration files
chat_advanced.py # Advanced reasoning system
train_production.py # Production training script (optional)
requirements.txt # Dependencies
Step 2: VM Environment Setup
# Install Python 3.10+ and dependencies
pip install -r requirements.txt
# Verify installation
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import datasets; print('HuggingFace Datasets: OK')"
Step 3: Verify VM System
# Quick validation test
python -c "
from supernova.config import ModelConfig
from supernova.model import SupernovaModel
cfg = ModelConfig.from_json_file('./configs/supernova_25m.json')
model = SupernovaModel(cfg)
params = sum(p.numel() for p in model.parameters())
print(f'β
Model: {params:,} parameters')
assert params == 25_000_000
print('β
VM SYSTEM READY')
"
π― TRAINING COMMANDS FOR VM
PHASE 1: Validation Run (MANDATORY FIRST)
python -m supernova.train \
--config ./configs/supernova_25m.json \
--data-config ./configs/data_sources.yaml \
--seq-len 512 \
--batch-size 4 \
--grad-accum 4 \
--lr 3e-4 \
--warmup-steps 100 \
--max-steps 1000 \
--save-every 500 \
--out-dir ./validation_checkpoints
Expected Results:
- Initial loss: ~10-11
- Final loss after 1000 steps: Should decrease to <9
- Training time: 30-60 minutes
- Checkpoints:
validation_checkpoints/supernova_step500.ptandsupernova_final.pt
PHASE 2: Full Production Training
β οΈ Only run after Phase 1 succeeds!
python -m supernova.train \
--config ./configs/supernova_25m.json \
--data-config ./configs/data_sources.yaml \
--seq-len 1024 \
--batch-size 16 \
--grad-accum 8 \
--lr 3e-4 \
--warmup-steps 2000 \
--max-steps 100000 \
--save-every 10000 \
--out-dir ./checkpoints
Expected Results:
- Training time: 2-7 days (depending on hardware)
- Final loss: <6 (target <4 for good performance)
- Checkpoints every 10K steps
- Total tokens processed: ~13.1 billion
π MONITORING TRAINING PROGRESS
Key Metrics to Watch:
- Loss Decrease: Should consistently decrease over time
- Gradient Norm: Should be reasonable (1-100 range)
- Learning Rate: Should follow cosine schedule
- Tokens/Second: Throughput indicator
Expected Loss Trajectory:
Steps 0-1000: 10.5 β 9.0 (Initial learning)
Steps 1000-10K: 9.0 β 7.5 (Rapid improvement)
Steps 10K-50K: 7.5 β 6.0 (Steady progress)
Steps 50K-100K: 6.0 β 4.5 (Fine-tuning)
Warning Signs:
- β Loss increases consistently
- β Loss plateaus above 8.0 after 10K steps
- β Gradient norm explodes (>1000)
- β NaN values in loss
π TRAINING VALIDATION COMMANDS
Check Training Progress:
# List checkpoints
ls -la checkpoints/
# Check latest checkpoint
python -c "
import torch
ckpt = torch.load('checkpoints/supernova_step10000.pt', map_location='cpu')
print(f'Step: {ckpt[\"step\"]}')
print(f'Loss: {ckpt.get(\"loss\", \"N/A\")}')
"
Test Model Generation (After Training):
python chat_advanced.py \
--config ./configs/supernova_25m.json \
--checkpoint ./checkpoints/supernova_step50000.pt \
--prompt "Explain quantum physics in simple terms"
π¨ EMERGENCY PROCEDURES
If Training Fails:
- Check error logs for specific error messages
- Verify GPU memory usage (nvidia-smi)
- Reduce batch size if OOM errors
- Contact support with error details
If Loss Doesn't Decrease:
- Verify learning rate schedule
- Check gradient norms
- Reduce learning rate by 50%
- Restart from last checkpoint
Performance Optimization:
# For GPU training
export CUDA_VISIBLE_DEVICES=0
python -m supernova.train ... # your command
# For multi-GPU (if available)
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m supernova.train ... # your command
π SUCCESS CRITERIA
Your training is successful if:
- β Loss decreases from ~10 to <6
- β Model generates coherent text (not gibberish)
- β Advanced reasoning system works with trained model
- β Checkpoints save without errors
π― POST-TRAINING TESTING
After training completes, test the system:
# Test basic generation
python chat_advanced.py --config ./configs/supernova_25m.json --checkpoint ./checkpoints/supernova_final.pt
# Test specific queries:
# 1. "What is 15 * 23?" (should use math engine)
# 2. "What are the latest AI developments?" (should use web search)
# 3. "Explain the theory of relativity" (should use reasoning)
π TRAINING SYSTEM 100% VALIDATED - READY FOR VM DEPLOYMENT! π