raazkumar committed
Commit
30f7cdb
· verified ·
1 Parent(s): 42855cf

Upload production/M2_PRO_MAX_GUIDE.md

Files changed (1)
  1. production/M2_PRO_MAX_GUIDE.md +199 -201
production/M2_PRO_MAX_GUIDE.md CHANGED
@@ -1,92 +1,99 @@
- # M2 Pro Max 96GB — Setup Guide

- Your machine is powerful enough to run **70B models locally via MLX** while using **NIM (cloud) as primary** and **Cloudflare Workers AI as automatic fallback**.

- ## Architecture for Your Setup

  ```
  ┌──────────────────────────────────────────────────────────────┐
  │                    MacBook M2 Pro Max 96GB                   │
  │                                                              │
- │  ┌──────────────────┐      ┌──────────────────────────┐      │
- │  │    MLX Server    │      │   Docker (API + Infra)   │      │
- │  │   (Metal GPU)    │      │  ─────────────────────   │      │
- │  │  ─────────────   │      │  • FastAPI server        │      │
- │  │   Port :8000     │◄─────│  • Redis cache           │      │
- │  │   70B models     │      │  • Postgres DB           │      │
- │  │   48GB RAM use   │      │  • Nginx LB              │      │
- │  └──────────────────┘      └──────────────────────────┘      │
- │         ▲                                                    │
  │         │                                                    │
- │  ┌──────┴──────┐       ┌─────────────────┐                   │
- │  │  NIM Cloud  │──────►│  Cloudflare AI  │                   │
- │  │  (Primary)  │       │ (Auto Fallback) │                   │
- │  └─────────────┘       └─────────────────┘                   │
  │                                                              │
  └──────────────────────────────────────────────────────────────┘
  ```

- ## Quick Start

- ### 1. Install Prerequisites

- ```bash
- # Install uv (fast Python package manager)
- curl -LsSf https://astral.sh/uv/install.sh | sh
-
- # Install Homebrew packages
- brew install redis postgresql docker
-
- # Start services
- brew services start redis
- brew services start postgresql
-
- # Install Docker Desktop for Mac (if not already)
- # https://www.docker.com/products/docker-desktop
- ```

- ### 2. Install MLX Server (Native on macOS)

  ```bash
- # Create a dedicated venv for MLX
- mkdir ~/mlx-server && cd ~/mlx-server
  uv venv --python 3.11
  source .venv/bin/activate

- # Install MLX LM
- uv pip install mlx-lm
-
- # Download a 70B model (takes ~40GB, fits in 96GB)
- # Option A: llama-3.1-70B (best quality)
- python -c "
- from mlx_lm import load
- load('mlx-community/Meta-Llama-3.1-70B-Instruct-4bit')
- "
-
- # Option B: Mistral-7B (faster, less memory)
  python -c "
- from mlx_lm import load
- load('mlx-community/Mistral-7B-Instruct-v0.3-4bit')
  "
  ```

  ### 3. Start MLX Server

  ```bash
- # Terminal 1: Start MLX server (uses Metal GPU automatically)
- mlx_lm.server \
-   --model mlx-community/Meta-Llama-3.1-70B-Instruct-4bit \
    --host 0.0.0.0 \
    --port 8000

- # OR for 7B (faster, less RAM):
- # mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --host 0.0.0.0 --port 8000
-
- # Verify it's running
- curl http://localhost:8000/v1/models
- curl -X POST http://localhost:8000/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{"model":"local","messages":[{"role":"user","content":"Hello"}]}'
  ```

  ### 4. Configure API Server
@@ -95,55 +102,75 @@ curl -X POST http://localhost:8000/v1/chat/completions \
  # In the ml-intern repo
  cd production

- # Create minimal .env (ONLY these 4-5 lines needed)
- cat > .env << 'EOF'
- # REQUIRED — Cloudflare fallback
- CLOUDFLARE_API_KEY=your_cloudflare_api_key_here
- CLOUDFLARE_ACCOUNT_ID=your_account_id_here

- # OPTIONAL — NIM primary (if you have API key)
- NVIDIA_API_KEY=your_nvidia_api_key_here

- # Point MLX to your local server
  MLX_API_BASE=http://host.docker.internal:8000/v1
- EOF
  ```

  ### 5. Start the Stack

  ```bash
- # Start Redis, Postgres, API, Workers, Nginx
  docker-compose -f docker-compose.m2.yml up -d

- # Verify everything
  curl http://localhost/health | jq
  ```

- ### 6. Test the Full Pipeline

  ```bash
- # Test 1: NIM primary (if API key set)
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
-   -d '{"model":"nim/llama-3.1-8b-instruct","messages":[{"role":"user","content":"Hello from NIM"}]}'

- # Test 2: Cloudflare fallback
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
-   -d '{"model":"cloudflare/@cf/meta/llama-3.1-8b-instruct","messages":[{"role":"user","content":"Hello from Cloudflare"}]}'

- # Test 3: MLX local (bypasses fallback)
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
-   -d '{"model":"mlx/llama-3.1-70b","messages":[{"role":"user","content":"Hello from MLX"}]}'
-
- # Test 4: Check which provider is active
- curl http://localhost/v1/fallback/status | jq
  ```

  ---

- ## Fallback Behavior

  ```
  Request comes in
@@ -151,185 +178,156 @@ Request comes in
         ▼
  ┌──────────────────┐
  │    Check NIM     │◄── Circuit breaker CLOSED?
- │    (Primary)     │    Yes → Send to NIM
  └──────────────────┘    No → Fall through
         │
-        ▼ (NIM down or rate limited)
  ┌──────────────────┐
- │ Check Cloudflare │◄── Circuit breaker CLOSED?
- │    (Fallback)    │    Yes → Send to Cloudflare
  └──────────────────┘    No → Fall through
         │
-        ▼ (Both down)
  ┌──────────────────┐
- │    Check MLX     │◄── Enabled + Circuit CLOSED?
- │     (Local)      │    Yes → Send to MLX
  └──────────────────┘    No → Return 503
  ```

- **You can force a provider** with `provider_override`:
-
  ```bash
- # Always use MLX regardless of NIM status
  curl -X POST http://localhost/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{
-     "model": "mlx/llama-3.1-70b",
-     "messages": [{"role":"user","content":"Hello"}],
-     "provider_override": "mlx"
-   }'
  ```

  ---

- ## Running MLX + API Together (All Native, No Docker)

- If you prefer running everything natively without Docker:

- ### Terminal 1: MLX Server
  ```bash
- cd ~/mlx-server
- source .venv/bin/activate
- mlx_lm.server --model mlx-community/Meta-Llama-3.1-70B-Instruct-4bit --port 8000
  ```

- ### Terminal 2: Redis
- ```bash
- redis-server
- ```

- ### Terminal 3: PostgreSQL
- ```bash
- # If not using Docker Postgres
- initdb /usr/local/var/postgres
- pg_ctl -D /usr/local/var/postgres start
-
- # Or just use Docker for Postgres only:
- docker run -d --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=ml_intern postgres:16-alpine
- ```

- ### Terminal 4: API Server
- ```bash
- cd production
- uv sync
-
- # Set only required env vars
- export REDIS_URL=redis://localhost:6379
- export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern
- export MLX_ENABLED=true
- export MLX_API_BASE=http://localhost:8000/v1
- export CLOUDFLARE_API_KEY=your_key
- export CLOUDFLARE_ACCOUNT_ID=your_account_id
- export NVIDIA_API_KEY=your_nvidia_key
- export FALLBACK_ENABLED=true
- export FALLBACK_PRIMARY=nim
- export FALLBACK_SECONDARY=cloudflare
-
- # Run with uv
- uv run python -m production_server
- ```

- ### Terminal 5: Worker
  ```bash
- cd production
- uv run python -m worker
  ```

- ---
-
- ## Performance Tips for M2 Pro Max
-
- | Setting      | 7B Model  | 70B Model         |
- |--------------|-----------|-------------------|
- | RAM Usage    | ~8GB      | ~48GB             |
- | Tokens/sec   | ~40 tok/s | ~8 tok/s          |
- | Startup Time | 2s        | 20s               |
- | Best For     | Fast Q&A  | Complex reasoning |

- ### Use Multiple MLX Models
-
- Your 96GB can hold **both** 7B and 70B simultaneously:
-
- ```bash
- # Terminal A: 70B for complex tasks
- mlx_lm.server --model mlx-community/Meta-Llama-3.1-70B-Instruct-4bit --port 8000
-
- # Terminal B: 7B for quick tasks
- mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --port 8001
- ```

- Then in `.env`:
- ```env
- MLX_API_BASE=http://host.docker.internal:8000/v1  # 70B default
- ```

- And send quick requests to port 8001 directly.

  ---

- ## Minimal Configuration (Just 5 Lines)

  ```bash
- # .env — ONLY these are required
  CLOUDFLARE_API_KEY=sk-your-key
  CLOUDFLARE_ACCOUNT_ID=your-account-id
- NVIDIA_API_KEY=nvapi-your-key                     # optional, enables NIM primary
- MLX_API_BASE=http://host.docker.internal:8000/v1  # optional, enables local fallback
  ```

- Everything else uses sensible defaults:
- - `FALLBACK_ENABLED=true` (default)
- - `FALLBACK_PRIMARY=nim` (default)
- - `FALLBACK_SECONDARY=cloudflare` (default)
- - `DEFAULT_RPM_LIMIT=40` (NIM free tier)
- - `CACHE_TTL_SECONDS=300` (5 min cache)

  ---

- ## What If NIM is Down?

  ```bash
- # Simulate NIM failure — circuit breaker will open after 3 failures
- for i in {1..3}; do
-   curl -X POST http://localhost/v1/chat/completions \
-     -H "Content-Type: application/json" \
-     -d '{"model":"nim/llama-3.1-8b","messages":[{"role":"user","content":"test"}]}'
- done
-
- # Now check status
  curl http://localhost/v1/fallback/status | jq
- # → "nim": "open", "cloudflare": "closed", "active_provider": "cloudflare"

- # Next request automatically goes to Cloudflare
- curl -X POST http://localhost/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{"model":"cloudflare/@cf/meta/llama-3.1-8b-instruct","messages":[{"role":"user","content":"This goes to Cloudflare"}]}'

- # After 60 seconds (default recovery), NIM circuit half-opens
- curl http://localhost/v1/fallback/status | jq
- # → "nim": "half-open"
  ```

  ---

- ## Monitoring Your Setup

  ```bash
- # Check which provider handled each request
- curl http://localhost/v1/fallback/status | jq

- # Grafana: http://localhost:3000 (admin/admin)
- # - Dashboard: "ml-intern Production"
- # - Panels: fallback count, provider latency, cache hit rate

- # Prometheus: http://localhost:9090
- # Query: ml_intern_fallback_total
- # Query: ml_intern_circuit_breaker_state

- # Redis cache stats
- redis-cli info stats | grep keyspace

- # Postgres request log
- psql postgresql://ml_intern:ml_intern@localhost/ml_intern \
-   -c "SELECT provider, COUNT(*) FROM requests GROUP BY provider;"
- ```
+ # M2 Pro Max 96GB — Gemma 4 Setup Guide

+ Your machine is powerful enough to run **Gemma 4 31B-BF16 locally via MLX**, the best open alternative to Claude Opus. This guide sets up:

+ - **Primary**: NIM (cloud GPU)
+ - **Secondary**: Cloudflare Workers AI
+ - **Tertiary**: Google Gemini
+ - **Local**: Gemma 4 via MLX on Metal GPU
+
+ ## Architecture

  ```
  ┌──────────────────────────────────────────────────────────────┐
  │                    MacBook M2 Pro Max 96GB                   │
  │                                                              │
+ │  ┌──────────────┐      ┌──────────────────────────────┐      │
+ │  │  MLX Server  │      │  Docker (uv + API + Infra)   │      │
+ │  │ (Metal GPU)  │      │  ──────────────────────────  │      │
+ │  │  Port 8000   │◄─────│  • FastAPI                   │      │
+ │  │ Gemma 4 31B  │      │  • Redis                     │      │
+ │  │  ~65GB RAM   │      │  • Postgres                  │      │
+ │  └──────────────┘      │  • Nginx                     │      │
+ │         ▲              └──────────────────────────────┘      │
  │         │                                                    │
+ │  ┌──────┴──────┐       ┌──────────────────┐                  │
+ │  │  NIM Cloud  │──────►│  Cloudflare AI   │                  │
+ │  │  (Primary)  │       │   (Secondary)    │                  │
+ │  └─────────────┘       └────────┬─────────┘                  │
+ │                                 │                            │
+ │  ┌──────────────────────────────┴──────────┐                 │
+ │  │       Google Gemini (Tertiary)          │                 │
+ │  │   Best for coding + reasoning tasks     │                 │
+ │  └─────────────────────────────────────────┘                 │
  │                                                              │
  └──────────────────────────────────────────────────────────────┘
  ```

+ ## Gemma 4 Model Recommendations

+ With 96GB unified memory, you have options:

+ | Model | RAM | Quality | Speed | Best For |
+ |-------|-----|---------|-------|----------|
+ | **gemma-4-31b-bf16** | ~65GB | ⭐⭐⭐⭐⭐ Highest | ~6 tok/s | Deep reasoning, code, complex tasks |
+ | **gemma-4-26b-a4b-it-bf16** | ~55GB | ⭐⭐⭐⭐⭐ Excellent | ~7 tok/s | General purpose, multimodal |
+ | **gemma-4-26b-a4b-it-8bit** | ~36GB | ⭐⭐⭐⭐ Great | ~12 tok/s | Fast inference with good quality |
+ | **gemma-4-e4b-it** | ~12GB | ⭐⭐⭐ Good | ~25 tok/s | Quick Q&A, simple tasks |

+ **Recommendation**: Start with `gemma-4-31b-bf16`. It's the best open alternative to Claude Opus and still leaves ~30GB for system + context.
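
+ As a rough sanity check on the RAM column: weight memory is approximately parameter count × bytes per parameter, before KV cache and runtime overhead. A minimal sketch (the figures are the table's own, not measurements):
+
+ ```python
+ # Approximate weight memory for the models above (lower bounds:
+ # KV cache and Metal runtime overhead come on top of these).
+ models = {
+     "gemma-4-31b-bf16": (31e9, 2.0),         # BF16 = 2 bytes/param
+     "gemma-4-26b-a4b-it-bf16": (26e9, 2.0),
+     "gemma-4-26b-a4b-it-8bit": (26e9, 1.0),  # 8-bit = 1 byte/param
+ }
+ for name, (params, bytes_per_param) in models.items():
+     print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB weights")
+ ```
+
+ The 31B model at BF16 works out to ~62GB of weights, consistent with the ~65GB figure once context and overhead are added.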
 
+ ---

+ ## Quick Start (5 Minutes)

+ ### 1. Install uv + MLX Server

  ```bash
+ # Install uv
+ curl -LsSf https://astral.sh/uv/install.sh | sh
+
+ # Create MLX environment
+ mkdir ~/mlx-gemma4 && cd ~/mlx-gemma4
  uv venv --python 3.11
  source .venv/bin/activate

+ # Install mlx-vlm (supports Gemma 4)
+ uv pip install mlx-vlm
+ ```

+ ### 2. Download Gemma 4 Model

+ ```bash
+ # Download 31B BF16 (best quality, ~58GB on disk)
+ # This fits in 96GB with room for context
  python -c "
+ from mlx_vlm.utils import load
+ model, processor = load('mlx-community/gemma-4-31b-bf16')
+ print('Gemma 4 31B loaded successfully')
  "
+
+ # Or the 26B variant (slightly smaller, still excellent)
+ # python -c "from mlx_vlm.utils import load; load('mlx-community/gemma-4-26b-a4b-it-bf16')"
  ```

  ### 3. Start MLX Server

  ```bash
+ # Terminal 1: Start Gemma 4 MLX server
+ # Uses Metal GPU automatically — no config needed
+ python -m mlx_vlm.server \
+   --model mlx-community/gemma-4-31b-bf16 \
    --host 0.0.0.0 \
    --port 8000

+ # Verify
+ curl http://localhost:8000/v1/models
  ```
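
+ If you'd rather verify from Python, here is a minimal smoke test. It assumes only what the guide already states, that the MLX server exposes an OpenAI-compatible `/v1/chat/completions` route on port 8000 (the `requests` package is an extra dependency):
+
+ ```python
+ # Smoke-test the local MLX server via its OpenAI-compatible API.
+ import requests
+
+ resp = requests.post(
+     "http://localhost:8000/v1/chat/completions",
+     json={
+         "model": "mlx-community/gemma-4-31b-bf16",
+         "messages": [{"role": "user", "content": "Say hello in one sentence."}],
+         "max_tokens": 64,
+     },
+     timeout=120,
+ )
+ resp.raise_for_status()
+ print(resp.json()["choices"][0]["message"]["content"])
+ ```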

  ### 4. Configure API Server

  ```bash
  # In the ml-intern repo
  cd production

+ # Copy minimal env
+ cp .env.minimal .env
+ ```
+
+ Edit `.env` — **just add your API keys** (only 3-4 lines):

+ ```env
+ # Cloudflare (required — always works)
+ CLOUDFLARE_API_KEY=sk-your-cloudflare-key
+ CLOUDFLARE_ACCOUNT_ID=your-account-id
+
+ # NIM (optional — faster, free tier)
+ NVIDIA_API_KEY=nvapi-your-nvidia-key

+ # Gemini (optional — great coding/reasoning)
+ GEMINI_API_KEY=your-gemini-key
+
+ # Enable Gemma 4 local
+ MLX_ENABLED=true
  MLX_API_BASE=http://host.docker.internal:8000/v1
  ```

  ### 5. Start the Stack

  ```bash
+ # Terminal 2: Launch API + infrastructure
  docker-compose -f docker-compose.m2.yml up -d

+ # Verify
  curl http://localhost/health | jq
+ curl http://localhost/v1/models | jq
+ curl http://localhost/v1/fallback/status | jq
  ```

+ ### 6. Test Everything

  ```bash
+ # Test 1: Fallback status — see active provider
+ curl http://localhost/v1/fallback/status | jq
+
+ # Test 2: Chat via active provider (auto-fallback)
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
+   -d '{
+     "model": "cloudflare/@cf/google/gemma-4-26b-a4b-it",
+     "messages": [{"role":"user","content":"Explain quantum computing"}]
+   }'

+ # Test 3: Force Gemma 4 local via MLX
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
+   -d '{
+     "model": "mlx/gemma-4-31b-bf16",
+     "messages": [{"role":"user","content":"Write a Python web scraper"}],
+     "provider_override": "mlx"
+   }'

+ # Test 4: Gemini for coding
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
+   -d '{
+     "model": "gemini/gemini-2.5-pro-preview",
+     "messages": [{"role":"user","content":"Debug this code: def foo(): pass"}]
+   }'
  ```
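
+ The same checks can be scripted; a small sketch (the model names are the ones used in the tests above; adjust to whatever your gateway actually exposes):
+
+ ```python
+ # Ping each provider through the gateway and report status + latency.
+ import time
+ import requests
+
+ MODELS = [
+     "cloudflare/@cf/google/gemma-4-26b-a4b-it",
+     "mlx/gemma-4-31b-bf16",
+     "gemini/gemini-2.5-pro-preview",
+ ]
+ for model in MODELS:
+     start = time.time()
+     r = requests.post(
+         "http://localhost/v1/chat/completions",
+         json={"model": model, "messages": [{"role": "user", "content": "ping"}]},
+         timeout=120,
+     )
+     print(f"{model}: HTTP {r.status_code} in {time.time() - start:.1f}s")
+ ```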

  ---

+ ## Fallback Chain (Automatic)

  ```
  Request comes in
         │
         ▼
  ┌──────────────────┐
  │    Check NIM     │◄── Circuit breaker CLOSED?
+ │    (Primary)     │    Yes → Send to NIM (fastest cloud)
  └──────────────────┘    No → Fall through
         │
+        ▼
+ ┌──────────────────┐
+ │ Check Cloudflare │◄── Circuit breaker CLOSED?
+ │   (Secondary)    │    Yes → Send to Cloudflare
+ └──────────────────┘    No → Fall through
+        │
+        ▼
  ┌──────────────────┐
+ │   Check Gemini   │◄── Circuit breaker CLOSED?
+ │    (Tertiary)    │    Yes → Send to Gemini (great for code)
  └──────────────────┘    No → Fall through
         │
+        ▼
  ┌──────────────────┐
+ │    Check MLX     │◄── Enabled + Gemma 4 loaded?
+ │  (Local Gemma)   │    Yes → Send to local Gemma 4
  └──────────────────┘    No → Return 503
  ```

+ Force any provider:

  ```bash
  curl -X POST http://localhost/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{"model":"mlx/gemma-4-31b-bf16","messages":[...],"provider_override":"mlx"}'
  ```
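
+ The chain above is ordinary circuit-breaker routing; a conceptual sketch of the decision loop (illustration only, not the gateway's actual code):
+
+ ```python
+ # Try providers in order; skip any whose breaker is open, return the
+ # first success, and count failures so a flaky provider trips open.
+ FAILURE_THRESHOLD = 3
+ failures = {"nim": 0, "cloudflare": 0, "gemini": 0, "mlx": 0}
+
+ def route(request, providers):
+     """providers: ordered list of (name, send_fn) pairs."""
+     for name, send in providers:
+         if failures[name] >= FAILURE_THRESHOLD:
+             continue  # circuit OPEN: skip and fall through
+         try:
+             return name, send(request)
+         except Exception:
+             failures[name] += 1  # repeated failures open the circuit
+     raise RuntimeError("503: all providers unavailable")
+ ```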

  ---

+ ## Advanced: Gemma 4 with Speculative Decoding (2x Speed)

+ Use a small drafter model to propose tokens ahead of time, which the full model then verifies:

  ```bash
+ # Terminal 1: Gemma 4 with MTP drafter (2x faster!)
+ python -m mlx_vlm.server \
+   --model mlx-community/gemma-4-31b-bf16 \
+   --draft-model mlx-community/gemma-4-31B-it-assistant-bf16 \
+   --draft-kind mtp \
+   --host 0.0.0.0 \
+   --port 8000
  ```

+ > **Note**: Temperature must be 0 for byte-identical output with MTP.
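
+ In sketch form, temperature-0 speculative decoding works like this (a conceptual illustration; the real MTP drafter lives behind the `--draft-kind mtp` flag above, and verifies all draft positions in a single batched forward pass):
+
+ ```python
+ # Draft k tokens with the cheap model, keep the longest prefix the
+ # full model agrees with, and always bank one verified token.
+ def speculative_step(target_next, draft_next, prefix, k=4):
+     # 1) Cheap drafter proposes k tokens autoregressively.
+     ctx = list(prefix)
+     drafted = []
+     for _ in range(k):
+         tok = draft_next(ctx)
+         drafted.append(tok)
+         ctx.append(tok)
+     # 2) Full model re-checks the draft; keep the agreeing prefix.
+     ctx = list(prefix)
+     accepted = []
+     for tok in drafted:
+         if target_next(ctx) != tok:
+             break
+         accepted.append(tok)
+         ctx.append(tok)
+     # 3) The full model always contributes one guaranteed token.
+     accepted.append(target_next(ctx))
+     return accepted
+ ```
+
+ At temperature 0 both models are deterministic, which is why the note above holds: acceptance is an exact token match, so the output is byte-identical to running the full model alone.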

+ ---

+ ## Multi-Model Setup (Run Multiple Locally)

+ Your 96GB can hold **both** Gemma 4 31B and the fast E4B model:

  ```bash
+ # Terminal 1: Gemma 4 31B for deep reasoning
+ python -m mlx_vlm.server \
+   --model mlx-community/gemma-4-31b-bf16 \
+   --host 0.0.0.0 --port 8000
+
+ # Terminal 2: Gemma 4 E4B for quick tasks
+ python -m mlx_vlm.server \
+   --model mlx-community/gemma-4-e4b-it \
+   --host 0.0.0.0 --port 8001
  ```

+ Then send quick tasks to port 8001 directly, complex ones to port 8000.
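
+ A tiny client-side router makes that split automatic; a sketch (the difficulty heuristic is a placeholder, not part of the stack):
+
+ ```python
+ # Route prompts to the right local server by rough difficulty.
+ import requests
+
+ def ask(prompt: str) -> str:
+     hard = len(prompt) > 500              # placeholder difficulty heuristic
+     port, model = ((8000, "mlx-community/gemma-4-31b-bf16") if hard
+                    else (8001, "mlx-community/gemma-4-e4b-it"))
+     r = requests.post(
+         f"http://localhost:{port}/v1/chat/completions",
+         json={"model": model, "messages": [{"role": "user", "content": prompt}]},
+         timeout=300,
+     )
+     return r.json()["choices"][0]["message"]["content"]
+ ```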

+ ---

+ ## Provider Selection Guide

+ | Task Type | Recommended Provider | Model |
+ |-----------|---------------------|-------|
+ | **General reasoning** | MLX local | `gemma-4-31b-bf16` |
+ | **Coding/debugging** | Gemini | `gemini-2.5-pro-preview` |
+ | **Fast Q&A** | Cloudflare | `@cf/google/gemma-4-26b-a4b-it` |
+ | **High throughput** | NIM | `llama-3.1-405b` |
+ | **Multimodal (image+text)** | MLX local | `gemma-4-26b-a4b-it` |
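
+ In code, the table collapses to a lookup that builds the request body (a sketch; the task labels are informal and the model IDs are the ones from the table):
+
+ ```python
+ # Map a task label to (provider_override, model) per the table above.
+ ROUTES = {
+     "general":    ("mlx",        "mlx/gemma-4-31b-bf16"),
+     "coding":     ("gemini",     "gemini/gemini-2.5-pro-preview"),
+     "fast_qa":    ("cloudflare", "cloudflare/@cf/google/gemma-4-26b-a4b-it"),
+     "throughput": ("nim",        "nim/llama-3.1-405b"),
+     "multimodal": ("mlx",        "mlx/gemma-4-26b-a4b-it"),
+ }
+
+ def payload(task: str, prompt: str) -> dict:
+     provider, model = ROUTES[task]
+     return {
+         "model": model,
+         "messages": [{"role": "user", "content": prompt}],
+         "provider_override": provider,
+     }
+ ```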

  ---

+ ## Minimal Configuration (Just 3 Lines)

  ```bash
+ # .env — the bare minimum
  CLOUDFLARE_API_KEY=sk-your-key
  CLOUDFLARE_ACCOUNT_ID=your-account-id
+ MLX_ENABLED=true
  ```

+ Everything else auto-configures. Even without NIM or Gemini keys, Cloudflare + MLX gives you a robust setup.

  ---

+ ## Monitoring

  ```bash
+ # Check active provider and fallback status
  curl http://localhost/v1/fallback/status | jq

+ # View all available models
+ curl http://localhost/v1/models | jq '.data[] | {id, owned_by}'

+ # Grafana: http://localhost:3000 (admin/admin)
+ # - Dashboard: "ml-intern Production"
+ # - Panels: provider latency, fallback count, cache hit rate
+
+ # Prometheus queries:
+ curl 'http://localhost:9090/api/v1/query?query=ml_intern_fallback_total'
+ curl 'http://localhost:9090/api/v1/query?query=ml_intern_circuit_breaker_state'
  ```
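
+ For continuous watching rather than one-off curls, a small poller works; a sketch (it assumes only that `/v1/fallback/status` returns JSON, as shown above):
+
+ ```python
+ # Print circuit-breaker status whenever it changes.
+ import time
+ import requests
+
+ previous = None
+ while True:
+     status = requests.get("http://localhost/v1/fallback/status", timeout=10).json()
+     if status != previous:          # log only on change
+         print(time.strftime("%H:%M:%S"), status)
+         previous = status
+     time.sleep(5)
+ ```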

  ---

+ ## Troubleshooting

+ ### Gemma 4 download is slow
  ```bash
+ # Use huggingface-cli for a resumable download
+ uv pip install huggingface-hub
+ huggingface-cli download mlx-community/gemma-4-31b-bf16 --local-dir ~/models/gemma-4-31b
+ ```

+ ### MLX server says "out of memory"
+ ```bash
+ # Try the smaller 26B model instead
+ python -m mlx_vlm.server --model mlx-community/gemma-4-26b-a4b-it-8bit --port 8000
+ # Or the tiny E4B:
+ python -m mlx_vlm.server --model mlx-community/gemma-4-e4b-it --port 8000
+ ```
+
+ ### Docker can't reach MLX on host
+ ```bash
+ # On macOS, host.docker.internal works out of the box.
+ # On Linux, use your machine's IP instead:
+ MLX_API_BASE=http://192.168.1.5:8000/v1
+ ```

+ ---

+ ## Gemma 4 vs Claude Opus

+ | Capability | Gemma 4 31B-BF16 (MLX) | Claude Opus 4 |
+ |-----------|------------------------|---------------|
+ | **Context window** | 128K tokens | 200K tokens |
+ | **Multimodal** | ✅ Image + Text | ✅ Image + Text |
+ | **Code quality** | ⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Best |
+ | **Reasoning** | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Best |
+ | **Speed** | ~6 tok/s (M2 Pro Max) | Cloud-based |
+ | **Cost** | $0 (local) | ~$15/1M input tokens |
+ | **Privacy** | ✅ 100% local | ❌ Cloud |
+ | **Offline** | ✅ Works offline | ❌ Requires internet |
+
+ **Verdict**: For most tasks, Gemma 4 31B-BF16 on your M2 Pro Max is a genuine Claude Opus alternative. For edge cases where you need the absolute best, Gemini 2.5 Pro fills the gap.