
Aglimate System Weight Analysis & CPU Optimization Guide

Current System Weight

Model Sizes (Approximate)

  1. Qwen1.5-1.8B (~1.8B parameters) βœ… OPTIMIZED

    • Size: ~7.2 GB (FP32) / ~3.6 GB (FP16) / ~1.8 GB (INT8 quantized)
    • RAM Usage: 4-8 GB at runtime
    • Status: βœ… CPU-OPTIMIZED - Much lighter than 4B model
  2. NLLB Translation Model (drrobot9/nllb-ig-yo-ha-finetuned)

    • Size: 600M-1.3B parameters (2-5 GB)
    • RAM Usage: 4-10 GB
    • Status: ⚠️ Heavy but manageable
  3. SentenceTransformer Embedding (paraphrase-multilingual-MiniLM-L12-v2)

    • Size: ~420 MB
    • RAM Usage: ~1-2 GB
    • Status: βœ… Acceptable
  4. FastText Language ID

    • Size: ~130 MB
    • RAM Usage: ~200 MB
    • Status: βœ… Lightweight
  5. Intent Classifier (joblib)

    • Size: ~10-50 MB
    • RAM Usage: ~100 MB
    • Status: βœ… Lightweight

Total Estimated Weight

  • Disk Space: ~10-15 GB (models + dependencies) βœ… REDUCED
  • RAM at Startup: ~500 MB (lazy loading) / ~4-8 GB (when loaded)
  • CPU Load: Moderate (1.8B model much faster on CPU than 4B)
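
The RAM figures above follow directly from parameter count times bytes per parameter. A quick back-of-the-envelope helper (illustrative, not part of the codebase):

```python
def model_memory_gb(params_billions: float, bytes_per_param: int) -> float:
    """Raw weight memory: 1e9 params x N bytes each = N * params_billions GB."""
    return params_billions * bytes_per_param

fp32 = model_memory_gb(1.8, 4)  # ~7.2 GB
fp16 = model_memory_gb(1.8, 2)  # ~3.6 GB
int8 = model_memory_gb(1.8, 1)  # ~1.8 GB
```

Runtime usage is higher than the raw weight size because of activations, the KV cache, and framework overhead, which is why the table budgets 4-8 GB for a ~1.8 GB quantized model.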

Dependencies Weight

  • torch (full): ~1.5 GB
  • transformers: ~500 MB
  • sentence-transformers: ~200 MB
  • Other deps: ~500 MB
  • Total: ~2.7 GB

Why this matters for Aglimate

Keeping the Aglimate backend lean is essential so that smallholder farmers can access climate-resilient advice on affordable, CPU-only infrastructure, without expensive GPUs or large cloud deployments.

Critical Issues for CPU Deployment

1. Eager Model Loading βœ… FIXED

Previously, all models loaded at import time in crew_pipeline.py:

  • βœ… FIXED: Models now load lazily on-demand
  • βœ… Qwen 1.8B loads only when /ask endpoint is called
  • βœ… Translation model loads only when needed
  • βœ… Startup time reduced to <5 seconds
  • βœ… Initial RAM usage ~500 MB

2. Wrong PyTorch Version

  • Using the default torch wheel instead of the CPU-only build (saves ~500 MB of GPU libraries)
  • torch.float16 on CPU is inefficient (use float32 or an INT8-quantized model instead)

3. No Quantization

  • Models run in FP32/FP16 (full precision)
  • INT8 quantization could reduce size by 4x and speed by 2-3x
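
As a sketch of what INT8 dynamic quantization looks like with PyTorch's built-in API, shown on a toy module rather than the actual Aglimate models:

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer's feed-forward layers; real models quantize
# the same way, Linear layer by Linear layer.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64))

# Dynamic INT8 quantization: weights are stored as int8 and activations are
# quantized on the fly at inference time. CPU-only, no calibration data needed.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

y = qmodel(torch.randn(1, 256))  # forward pass now uses int8 Linear kernels
```

For Hugging Face models, loading a pre-quantized checkpoint (e.g. GGUF/GPTQ variants) is usually simpler than quantizing at startup, but the memory math is the same.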

4. No Lazy Loading

  • Models should load on-demand, not at startup
  • Only load when endpoint is called

5. Device Map Issues

  • device_map="auto" may attempt GPU placement or offloading even on CPU-only hosts
  • Explicitly set the CPU device instead

Optimization Recommendations

Priority 1: Lazy Loading (CRITICAL) βœ… COMPLETED

Model loading has been moved from import time to first use inside the endpoint handlers.

Priority 2: Use CPU-Optimized PyTorch

Replace the default torch wheel with the CPU-only build from PyTorch's CPU wheel index in requirements.
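
Note there is no PyPI package literally named torch-cpu; the CPU-only wheels come from PyTorch's own package index. A requirements.txt sketch (the exact pin is up to the project):

```
--extra-index-url https://download.pytorch.org/whl/cpu
torch
transformers
sentence-transformers
```

With the extra index in place, pip resolves torch to the CPU wheel, which omits the large CUDA runtime libraries.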

Priority 3: Model Quantization

Use INT8 quantized models for CPU inference.

Priority 4: Smaller Models βœ… COMPLETED

βœ… DONE: Switched to Qwen1.5-1.8B (much lighter for CPU)

  • βœ… Replaced Qwen 4B with Qwen 1.8B
  • βœ… Reduced model size by ~55% (from 4B to 1.8B parameters)
  • βœ… Reduced RAM usage by ~75% (from 16-32GB to 4-8GB)

Priority 5: Optimize Dockerfile

Remove model pre-downloading (let HuggingFace Spaces handle it).
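
A sketch of the idea: let the Space pull models into the Hugging Face cache on first request instead of baking them into the image. The base image, paths, and app module name are illustrative:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# No RUN step that pre-downloads models: the first request triggers the
# Hugging Face Hub download into the cache directory below.
ENV HF_HOME=/data/.huggingface
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```

This keeps the image small and build times short; combined with lazy loading, the download cost is paid once on the first cold request.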


Best Practices for Hugging Face CPU Spaces

  1. Memory Limits: HF Spaces CPU has ~16-32 GB RAM
  2. Startup Time: Keep under 60 seconds
  3. Cold Start: Models should load lazily
  4. Disk Space: Limited to ~50 GB
  5. Concurrency: Single worker recommended for CPU