# Aglimate System Weight Analysis & CPU Optimization Guide

## Current System Weight

### Model Sizes (Approximate)

1. **Qwen1.5-1.8B** (~1.8B parameters) ✅ **OPTIMIZED**
   - **Size**: ~7.2 GB (FP32) / ~3.6 GB (FP16) / ~1.8 GB (INT8 quantized)
   - **RAM Usage**: 4-8 GB at runtime
   - **Status**: ✅ **CPU-OPTIMIZED** - Much lighter than the 4B model

2. **NLLB Translation Model** (drrobot9/nllb-ig-yo-ha-finetuned)
   - **Size**: ~600M-1.3B parameters (~2-5 GB)
   - **RAM Usage**: 4-10 GB
   - **Status**: ⚠️ Heavy but manageable

3. **SentenceTransformer Embedding** (paraphrase-multilingual-MiniLM-L12-v2)
   - **Size**: ~420 MB
   - **RAM Usage**: ~1-2 GB
   - **Status**: ✅ Acceptable

4. **FastText Language ID**
   - **Size**: ~130 MB
   - **RAM Usage**: ~200 MB
   - **Status**: ✅ Lightweight

5. **Intent Classifier** (joblib)
   - **Size**: ~10-50 MB
   - **RAM Usage**: ~100 MB
   - **Status**: ✅ Lightweight

### Total Estimated Weight

- **Disk Space**: ~10-15 GB (models + dependencies) ✅ **REDUCED**
- **RAM at Startup**: ~500 MB (lazy loading) / ~4-8 GB (when models are loaded)
- **CPU Load**: Moderate (the 1.8B model is much faster on CPU than the 4B)

### Dependencies Weight

- `torch` (full): ~1.5 GB
- `transformers`: ~500 MB
- `sentence-transformers`: ~200 MB
- Other deps: ~500 MB
- **Total**: ~2.7 GB

---

## Why This Matters for Aglimate

Keeping the Aglimate backend lean is essential so that smallholder farmers can access climate-resilient advice on affordable CPU-only infrastructure, without requiring expensive GPUs or large cloud deployments.

## Critical Issues for CPU Deployment

### 1. **Eager Model Loading** ✅ FIXED

~~All models load at import time in `crew_pipeline.py`.~~

- ✅ **FIXED**: Models now load lazily, on demand
- ✅ Qwen 1.8B loads only when the `/ask` endpoint is called
- ✅ The translation model loads only when needed
- ✅ Startup time reduced to <5 seconds
- ✅ Initial RAM usage ~500 MB

### 2. **Wrong PyTorch Version**

- Installing the default `torch` wheel instead of the CPU-only build (`torch` from PyTorch's CPU wheel index) wastes ~500 MB
- `torch.float16` on CPU is inefficient (use float32 or quantized weights instead)

### 3. **No Quantization**

- Models run in FP32/FP16 (full precision)
- INT8 quantization could reduce size by ~4x and improve speed by 2-3x

### 4. **No Lazy Loading**

- Models should load on demand, not at startup
- Load a model only when its endpoint is called

### 5. **Device Map Issues**

- `device_map="auto"` may try to place layers on a GPU even on CPU-only hosts
- Explicitly set the CPU device instead

---

## Optimization Recommendations

### Priority 1: Lazy Loading (CRITICAL)

Move model loading from import time to function calls.

### Priority 2: Use CPU-Optimized PyTorch

Install the CPU-only `torch` wheel instead of the default build in requirements.

### Priority 3: Model Quantization

Use INT8-quantized models for CPU inference.

### Priority 4: Smaller Models ✅ COMPLETED

✅ **DONE**: Switched to Qwen1.5-1.8B (much lighter for CPU)

- ✅ Replaced Qwen 4B with Qwen 1.8B
- ✅ Reduced parameter count by ~55% (from 4B to 1.8B parameters)
- ✅ Reduced RAM usage by ~75% (from 16-32 GB to 4-8 GB)

### Priority 5: Optimize Dockerfile

Remove model pre-downloading (let Hugging Face Spaces handle it).

---

## Best Practices for Hugging Face CPU Spaces

1. **Memory Limits**: HF Spaces CPU hardware has ~16-32 GB RAM
2. **Startup Time**: Keep under 60 seconds
3. **Cold Start**: Models should load lazily
4. **Disk Space**: Limited to ~50 GB
5. **Concurrency**: A single worker is recommended for CPU
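For Priority 2, the CPU-only wheel is pulled from PyTorch's official CPU package index rather than via a separate `torch-cpu` package. A possible `requirements.txt` fragment:

```
# requirements.txt - pull the CPU-only wheel from PyTorch's CPU index
--extra-index-url https://download.pytorch.org/whl/cpu
torch
```

The CPU-only wheel skips the bundled CUDA libraries, which is where the ~500 MB saving comes from.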
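Priority 5 and best practice 5 (single worker) meet in the Dockerfile's launch command. Assuming a FastAPI app served with uvicorn (`app:app` is an illustrative module path, not confirmed from the source), the fragment might look like:

```
# Dockerfile - no model pre-download step; models fetch lazily at first request.
# One worker keeps peak RAM to a single copy of the models; 7860 is the
# default port Hugging Face Spaces expects.
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]
```

With multiple workers, each process would load its own copy of the 1.8B model, multiplying the 4-8 GB RAM figure per worker.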
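The per-precision sizes listed under "Model Sizes" follow from a simple parameters × bytes-per-parameter estimate (weights only; runtime RAM adds activations and framework overhead). A quick sanity check:

```python
def model_size_gb(params: float, bytes_per_param: int) -> float:
    """Rough on-disk weight size: parameter count x bytes per parameter."""
    return params * bytes_per_param / 1e9

PARAMS = 1.8e9  # Qwen1.5-1.8B

print(model_size_gb(PARAMS, 4))  # FP32 (4 bytes/param) -> 7.2 GB
print(model_size_gb(PARAMS, 2))  # FP16 (2 bytes/param) -> 3.6 GB
print(model_size_gb(PARAMS, 1))  # INT8 (1 byte/param)  -> 1.8 GB
```

The same arithmetic explains the ~55% size reduction from the 4B model: weight size scales linearly with parameter count at a fixed precision.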
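Priority 1 (lazy loading) can be sketched as a cached loader function. This is an illustrative pattern, not the actual `crew_pipeline.py` code; the load counter exists only to demonstrate that the expensive work happens exactly once:

```python
from functools import lru_cache

_load_count = 0  # demonstration only: counts how many real loads happen

@lru_cache(maxsize=1)
def get_llm():
    """Load the model on first use; later calls return the cached object."""
    global _load_count
    _load_count += 1
    # In the real pipeline the heavy work goes here, e.g.:
    #   from transformers import AutoModelForCausalLM
    #   model = AutoModelForCausalLM.from_pretrained(
    #       checkpoint, device_map={"": "cpu"})  # pin to CPU explicitly
    return {"model": "qwen1.5-1.8b placeholder"}

# The endpoint handler calls get_llm() per request; only the first call pays.
for _ in range(3):
    get_llm()
print(_load_count)  # -> 1
```

Because nothing heavy runs at import time, process startup stays fast and idle RAM stays low, matching the ~500 MB startup figure above.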
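Priority 3 (INT8 quantization) can be done post-hoc with PyTorch's dynamic quantization, which converts `nn.Linear` weights to INT8 - and linear layers dominate an LLM's parameter count. The tiny stand-in module below keeps the sketch runnable without downloading a real checkpoint:

```python
import torch
import torch.nn as nn

# Stand-in for a loaded model; in practice pass the Hugging Face model here.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64))

# Replace nn.Linear layers with dynamically quantized INT8 equivalents.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Inference works as before; weights are now stored as INT8 (~4x smaller
# than FP32), and int8 matmuls are typically faster on CPU.
x = torch.randn(1, 256)
y = quantized(x)
print(y.shape)  # torch.Size([1, 64])
```

Note that dynamic quantization trades a small accuracy loss for the size and speed gains, so output quality should be spot-checked after conversion.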