
Aglimate System Weight Analysis & CPU Optimization Guide

Current System Weight

Model Sizes (Approximate)

  1. Qwen1.5-1.8B (~1.8B parameters) βœ… OPTIMIZED

    • Size: ~7.2 GB (FP32) / ~3.6 GB (FP16) / ~1.8 GB (INT8 quantized)
    • RAM Usage: 4-8 GB at runtime
    • Status: βœ… CPU-OPTIMIZED - Much lighter than 4B model
  2. NLLB Translation Model (drrobot9/nllb-ig-yo-ha-finetuned)

    • Size: 600M-1.3B parameters (2-5 GB)
    • RAM Usage: 4-10 GB
    • Status: ⚠️ Heavy but manageable
  3. SentenceTransformer Embedding (paraphrase-multilingual-MiniLM-L12-v2)

    • Size: ~420 MB
    • RAM Usage: ~1-2 GB
    • Status: βœ… Acceptable
  4. FastText Language ID

    • Size: ~130 MB
    • RAM Usage: ~200 MB
    • Status: βœ… Lightweight
  5. Intent Classifier (joblib)

    • Size: ~10-50 MB
    • RAM Usage: ~100 MB
    • Status: βœ… Lightweight

Total Estimated Weight

  • Disk Space: ~10-15 GB (models + dependencies) βœ… REDUCED
  • RAM at Startup: ~500 MB (lazy loading) / ~4-8 GB (when loaded)
  • CPU Load: Moderate (1.8B model much faster on CPU than 4B)
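
The RAM figures above follow directly from parameter count times bytes per parameter. A quick back-of-the-envelope helper (illustrative, not part of the codebase):

```python
def model_memory_gb(params_billions: float, bytes_per_param: int) -> float:
    """Raw weight memory: 1e9 params x N bytes each = N * params_billions GB."""
    return params_billions * bytes_per_param

fp32 = model_memory_gb(1.8, 4)  # ~7.2 GB
fp16 = model_memory_gb(1.8, 2)  # ~3.6 GB
int8 = model_memory_gb(1.8, 1)  # ~1.8 GB
```

Runtime usage is higher than the raw weight size because of activations, the KV cache, and framework overhead, which is why the table budgets 4-8 GB for a ~1.8 GB quantized model.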

Dependencies Weight

  • torch (full): ~1.5 GB
  • transformers: ~500 MB
  • sentence-transformers: ~200 MB
  • Other deps: ~500 MB
  • Total: ~2.7 GB

Why this matters for Aglimate

Keeping the Aglimate backend lean is essential so that smallholder farmers can access climate-resilient advice on affordable, CPU-only infrastructure, without expensive GPUs or large cloud deployments.

Critical Issues for CPU Deployment

1. Eager Model Loading βœ… FIXED

Previously, all models loaded at import time in crew_pipeline.py:

  • βœ… FIXED: Models now load lazily on-demand
  • βœ… Qwen 1.8B loads only when /ask endpoint is called
  • βœ… Translation model loads only when needed
  • βœ… Startup time reduced to <5 seconds
  • βœ… Initial RAM usage ~500 MB

2. Wrong PyTorch Version

  • Using the default torch wheel instead of the CPU-only build (saves ~500 MB of GPU libraries)
  • torch.float16 on CPU is inefficient (use float32 or an INT8-quantized model instead)

3. No Quantization

  • Models run in FP32/FP16 (full precision)
  • INT8 quantization could reduce size by 4x and speed by 2-3x
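
As a sketch of what INT8 dynamic quantization looks like with PyTorch's built-in API, shown on a toy module rather than the actual Aglimate models:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer's feed-forward layers; real models quantize
# the same way, Linear layer by Linear layer.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64))

# Dynamic INT8 quantization: weights are stored as int8 and activations are
# quantized on the fly at inference time. CPU-only, no calibration data needed.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

y = qmodel(torch.randn(1, 256))  # forward pass now uses int8 Linear kernels
```

For Hugging Face models, loading a pre-quantized checkpoint (e.g. GGUF/GPTQ variants) is usually simpler than quantizing at startup, but the memory math is the same.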

4. No Lazy Loading

  • Models should load on-demand, not at startup
  • Only load when endpoint is called

5. Device Map Issues

  • device_map="auto" may attempt GPU placement or offloading even on CPU-only hosts
  • Explicitly set the CPU device instead

Optimization Recommendations

Priority 1: Lazy Loading (CRITICAL) βœ… COMPLETED

Model loading has been moved from import time to first use inside the endpoint handlers.

Priority 2: Use CPU-Optimized PyTorch

Replace the default torch wheel with the CPU-only build from PyTorch's CPU wheel index in requirements.
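
Note there is no PyPI package literally named torch-cpu; the CPU-only wheels come from PyTorch's own package index. A requirements.txt sketch (the exact pin is up to the project):

```
--extra-index-url https://download.pytorch.org/whl/cpu
torch
transformers
sentence-transformers
```

With the extra index in place, pip resolves torch to the CPU wheel, which omits the large CUDA runtime libraries.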

Priority 3: Model Quantization

Use INT8 quantized models for CPU inference.

Priority 4: Smaller Models βœ… COMPLETED

βœ… DONE: Switched to Qwen1.5-1.8B (much lighter for CPU)

  • βœ… Replaced Qwen 4B with Qwen 1.8B
  • βœ… Reduced model size by ~55% (from 4B to 1.8B parameters)
  • βœ… Reduced RAM usage by ~75% (from 16-32GB to 4-8GB)

Priority 5: Optimize Dockerfile

Remove model pre-downloading (let HuggingFace Spaces handle it).
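
A sketch of the idea: let the Space pull models into the Hugging Face cache on first request instead of baking them into the image. The base image, paths, and app module name are illustrative:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# No RUN step that pre-downloads models: the first request triggers the
# Hugging Face Hub download into the cache directory below.
ENV HF_HOME=/data/.huggingface
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```

This keeps the image small and build times short; combined with lazy loading, the download cost is paid once on the first cold request.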


Best Practices for Hugging Face CPU Spaces

  1. Memory Limits: HF Spaces CPU has ~16-32 GB RAM
  2. Startup Time: Keep under 60 seconds
  3. Cold Start: Models should load lazily
  4. Disk Space: Limited to ~50 GB
  5. Concurrency: Single worker recommended for CPU