# Aglimate System Weight Analysis & CPU Optimization Guide

## Current System Weight

### Model Sizes (Approximate)

1. **Qwen1.5-1.8B** (~1.8B parameters) ✅ **OPTIMIZED**
   - **Size**: ~7.2 GB (FP32) / ~3.6 GB (FP16) / ~1.8 GB (INT8 quantized)
   - **RAM Usage**: 4-8 GB at runtime
   - **Status**: ✅ **CPU-OPTIMIZED** - Much lighter than the 4B model

2. **NLLB Translation Model** (drrobot9/nllb-ig-yo-ha-finetuned)
   - **Size**: ~600M-1.3B parameters (~2-5 GB)
   - **RAM Usage**: 4-10 GB
   - **Status**: ⚠️ Heavy but manageable

3. **SentenceTransformer Embedding** (paraphrase-multilingual-MiniLM-L12-v2)
   - **Size**: ~420 MB
   - **RAM Usage**: ~1-2 GB
   - **Status**: ✅ Acceptable

4. **FastText Language ID**
   - **Size**: ~130 MB
   - **RAM Usage**: ~200 MB
   - **Status**: ✅ Lightweight

5. **Intent Classifier** (joblib)
   - **Size**: ~10-50 MB
   - **RAM Usage**: ~100 MB
   - **Status**: ✅ Lightweight

### Total Estimated Weight

- **Disk Space**: ~10-15 GB (models + dependencies) ✅ **REDUCED**
- **RAM at Startup**: ~500 MB (lazy loading) / ~4-8 GB (when models are loaded)
- **CPU Load**: Moderate (the 1.8B model is much faster on CPU than the 4B)

### Dependencies Weight

- `torch` (full): ~1.5 GB
- `transformers`: ~500 MB
- `sentence-transformers`: ~200 MB
- Other deps: ~500 MB
- **Total**: ~2.7 GB

---

## Why This Matters for Aglimate

Keeping the Aglimate backend lean is essential so that smallholder farmers can access climate-resilient advice on affordable CPU-only infrastructure, without requiring expensive GPUs or large cloud deployments.

## Critical Issues for CPU Deployment

### 1. **Eager Model Loading** ✅ FIXED

~~All models load at import time in `crew_pipeline.py`.~~

- ✅ **FIXED**: Models now load lazily, on demand
- ✅ Qwen 1.8B loads only when the `/ask` endpoint is called
- ✅ The translation model loads only when needed
- ✅ Startup time reduced to <5 seconds
- ✅ Initial RAM usage ~500 MB

### 2. **Wrong PyTorch Version**

- Installing the default `torch` wheel instead of the CPU-only build (`torch` from PyTorch's CPU wheel index) wastes ~500 MB
- `torch.float16` on CPU is inefficient (use float32 or quantized weights instead)

### 3. **No Quantization**

- Models run in FP32/FP16 (full precision)
- INT8 quantization could reduce size by ~4x and improve speed by 2-3x

### 4. **No Lazy Loading**

- Models should load on demand, not at startup
- Load a model only when its endpoint is called

### 5. **Device Map Issues**

- `device_map="auto"` may try to place layers on a GPU even on CPU-only hosts
- Explicitly set the CPU device instead

---

## Optimization Recommendations

### Priority 1: Lazy Loading (CRITICAL)

Move model loading from import time to function calls.

### Priority 2: Use CPU-Optimized PyTorch

Install the CPU-only `torch` wheel instead of the default build in requirements.

### Priority 3: Model Quantization

Use INT8-quantized models for CPU inference.

### Priority 4: Smaller Models ✅ COMPLETED

✅ **DONE**: Switched to Qwen1.5-1.8B (much lighter for CPU)

- ✅ Replaced Qwen 4B with Qwen 1.8B
- ✅ Reduced parameter count by ~55% (from 4B to 1.8B parameters)
- ✅ Reduced RAM usage by ~75% (from 16-32 GB to 4-8 GB)

### Priority 5: Optimize Dockerfile

Remove model pre-downloading (let Hugging Face Spaces handle it).

---

## Best Practices for Hugging Face CPU Spaces

1. **Memory Limits**: HF Spaces CPU hardware has ~16-32 GB RAM
2. **Startup Time**: Keep under 60 seconds
3. **Cold Start**: Models should load lazily
4. **Disk Space**: Limited to ~50 GB
5. **Concurrency**: A single worker is recommended for CPU
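For Priority 2, the CPU-only wheel is pulled from PyTorch's official CPU package index rather than via a separate `torch-cpu` package. A possible `requirements.txt` fragment:

```
# requirements.txt - pull the CPU-only wheel from PyTorch's CPU index
--extra-index-url https://download.pytorch.org/whl/cpu
torch
```

The CPU-only wheel skips the bundled CUDA libraries, which is where the ~500 MB saving comes from.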
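Priority 5 and best practice 5 (single worker) meet in the Dockerfile's launch command. Assuming a FastAPI app served with uvicorn (`app:app` is an illustrative module path, not confirmed from the source), the fragment might look like:

```
# Dockerfile - no model pre-download step; models fetch lazily at first request.
# One worker keeps peak RAM to a single copy of the models; 7860 is the
# default port Hugging Face Spaces expects.
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]
```

With multiple workers, each process would load its own copy of the 1.8B model, multiplying the 4-8 GB RAM figure per worker.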
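The per-precision sizes listed under "Model Sizes" follow from a simple parameters × bytes-per-parameter estimate (weights only; runtime RAM adds activations and framework overhead). A quick sanity check:

```python
def model_size_gb(params: float, bytes_per_param: int) -> float:
    """Rough on-disk weight size: parameter count x bytes per parameter."""
    return params * bytes_per_param / 1e9

PARAMS = 1.8e9  # Qwen1.5-1.8B

print(model_size_gb(PARAMS, 4))  # FP32 (4 bytes/param) -> 7.2 GB
print(model_size_gb(PARAMS, 2))  # FP16 (2 bytes/param) -> 3.6 GB
print(model_size_gb(PARAMS, 1))  # INT8 (1 byte/param)  -> 1.8 GB
```

The same arithmetic explains the ~55% size reduction from the 4B model: weight size scales linearly with parameter count at a fixed precision.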
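Priority 1 (lazy loading) can be sketched as a cached loader function. This is an illustrative pattern, not the actual `crew_pipeline.py` code; the load counter exists only to demonstrate that the expensive work happens exactly once:

```python
from functools import lru_cache

_load_count = 0  # demonstration only: counts how many real loads happen

@lru_cache(maxsize=1)
def get_llm():
    """Load the model on first use; later calls return the cached object."""
    global _load_count
    _load_count += 1
    # In the real pipeline the heavy work goes here, e.g.:
    #   from transformers import AutoModelForCausalLM
    #   model = AutoModelForCausalLM.from_pretrained(
    #       checkpoint, device_map={"": "cpu"})  # pin to CPU explicitly
    return {"model": "qwen1.5-1.8b placeholder"}

# The endpoint handler calls get_llm() per request; only the first call pays.
for _ in range(3):
    get_llm()
print(_load_count)  # -> 1
```

Because nothing heavy runs at import time, process startup stays fast and idle RAM stays low, matching the ~500 MB startup figure above.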
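Priority 3 (INT8 quantization) can be done post-hoc with PyTorch's dynamic quantization, which converts `nn.Linear` weights to INT8 - and linear layers dominate an LLM's parameter count. The tiny stand-in module below keeps the sketch runnable without downloading a real checkpoint:

```python
import torch
import torch.nn as nn

# Stand-in for a loaded model; in practice pass the Hugging Face model here.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64))

# Replace nn.Linear layers with dynamically quantized INT8 equivalents.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Inference works as before; weights are now stored as INT8 (~4x smaller
# than FP32), and int8 matmuls are typically faster on CPU.
x = torch.randn(1, 256)
y = quantized(x)
print(y.shape)  # torch.Size([1, 64])
```

Note that dynamic quantization trades a small accuracy loss for the size and speed gains, so output quality should be spot-checked after conversion.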