Aglimate System Weight Analysis & CPU Optimization Guide
Current System Weight
Model Sizes (Approximate)
Qwen1.5-1.8B (~1.8B parameters) β OPTIMIZED
- Size: ~3.6-7 GB (FP32) / ~3.6 GB (FP16) / ~1.8 GB (INT8 quantized)
- RAM Usage: 4-8 GB at runtime
- Status: β CPU-OPTIMIZED - Much lighter than 4B model
NLLB Translation Model (drrobot9/nllb-ig-yo-ha-finetuned)
- Size:
600M-1.3B parameters (2-5 GB) - RAM Usage: 4-10 GB
- Status: β οΈ Heavy but manageable
- Size:
SentenceTransformer Embedding (paraphrase-multilingual-MiniLM-L12-v2)
- Size: ~420 MB
- RAM Usage: ~1-2 GB
- Status: β Acceptable
FastText Language ID
- Size: ~130 MB
- RAM Usage: ~200 MB
- Status: β Lightweight
Intent Classifier (joblib)
- Size: ~10-50 MB
- RAM Usage: ~100 MB
- Status: β Lightweight
Total Estimated Weight
- Disk Space: ~10-15 GB (models + dependencies) β REDUCED
- RAM at Startup: ~500 MB (lazy loading) / ~4-8 GB (when loaded)
- CPU Load: Moderate (1.8B model much faster on CPU than 4B)
Dependencies Weight
torch(full): ~1.5 GBtransformers: ~500 MBsentence-transformers: ~200 MB- Other deps: ~500 MB
- Total: ~2.7 GB
Why this matters for Aglimate
Keeping the Aglimate backend lean is essential so that smallholder farmers can access climate-resilient advice on affordable CPU-only infrastructure, without requiring expensive GPUs or large-cloud deployments.
Critical Issues for CPU Deployment
1. Eager Model Loading β FIXED
All models load at import time in crew_pipeline.py:
- β FIXED: Models now load lazily on-demand
- β
Qwen 1.8B loads only when
/askendpoint is called - β Translation model loads only when needed
- β Startup time reduced to <5 seconds
- β Initial RAM usage ~500 MB
2. Wrong PyTorch Version
- Using
torchinstead oftorch-cpu(saves ~500 MB) torch.float16on CPU is inefficient (should use float32 or quantized)
3. No Quantization
- Models run in FP32/FP16 (full precision)
- INT8 quantization could reduce size by 4x and speed by 2-3x
4. No Lazy Loading
- Models should load on-demand, not at startup
- Only load when endpoint is called
5. Device Map Issues
device_map="auto"may try GPU even on CPU- Should explicitly set CPU device
Optimization Recommendations
Priority 1: Lazy Loading (CRITICAL)
Move model loading from import time to function calls.
Priority 2: Use CPU-Optimized PyTorch
Replace torch with torch-cpu in requirements.
Priority 3: Model Quantization
Use INT8 quantized models for CPU inference.
Priority 4: Smaller Models β COMPLETED
β DONE: Switched to Qwen 1.5-1.8B (much lighter for CPU)
- β Replaced Qwen 4B with Qwen 1.8B
- β Reduced model size by ~55% (from 4B to 1.8B parameters)
- β Reduced RAM usage by ~75% (from 16-32GB to 4-8GB)
Priority 5: Optimize Dockerfile
Remove model pre-downloading (let HuggingFace Spaces handle it).
Best Practices for Hugging Face CPU Spaces
- Memory Limits: HF Spaces CPU has ~16-32 GB RAM
- Startup Time: Keep under 60 seconds
- Cold Start: Models should load lazily
- Disk Space: Limited to ~50 GB
- Concurrency: Single worker recommended for CPU