# M2 Pro Max 96GB — Setup Guide

Your machine is powerful enough to run **70B models locally via MLX** while using **NIM (cloud) as primary** and **Cloudflare Workers AI as automatic fallback**.

## Architecture for Your Setup

```
┌──────────────────────────────────────────────────────────────────┐
│                     MacBook M2 Pro Max 96GB                      │
│                                                                  │
│  ┌──────────────────┐      ┌──────────────────────────┐          │
│  │ MLX Server       │      │ Docker (API + Infra)     │          │
│  │ (Metal GPU)      │      │ ────────────────────     │          │
│  │ ──────────       │      │ • FastAPI server         │          │
│  │ Port :8000       │◄─────│ • Redis cache            │          │
│  │ 70B models       │      │ • Postgres DB            │          │
│  │ 48GB RAM use     │      │ • Nginx LB               │          │
│  └──────────────────┘      └──────────────────────────┘          │
│                                 ▲                                │
│                                 │                                │
│                          ┌──────┴──────┐    ┌─────────────────┐  │
│                          │ NIM Cloud   │───►│ Cloudflare AI   │  │
│                          │ (Primary)   │    │ (Auto Fallback) │  │
│                          └─────────────┘    └─────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```

## Quick Start

### 1. Install Prerequisites

```bash
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install Homebrew packages
brew install redis postgresql docker

# Start services
brew services start redis
brew services start postgresql

# Install Docker Desktop for Mac (if not already)
# https://www.docker.com/products/docker-desktop
```
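
If you want to confirm Redis and Postgres are actually listening before moving on, a quick stdlib-only check works; it assumes the default ports 6379 and 5432, so adjust if you changed them. A successful connect only proves the port is open, not that the service is healthy:

```python
# check_services.py - quick TCP reachability check for Redis and Postgres
import socket

SERVICES = {"redis": ("localhost", 6379), "postgres": ("localhost", 5432)}

for name, (host, port) in SERVICES.items():
    try:
        # connect and immediately close; timeout keeps failures fast
        with socket.create_connection((host, port), timeout=2):
            print(f"{name}: OK ({host}:{port})")
    except OSError as exc:
        print(f"{name}: UNREACHABLE ({host}:{port}): {exc}")
```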

### 2. Install MLX Server (Native on macOS)

```bash
# Create a dedicated venv for MLX
mkdir ~/mlx-server && cd ~/mlx-server
uv venv --python 3.11
source .venv/bin/activate

# Install MLX LM
uv pip install mlx-lm

# Download a 70B model (takes ~40GB, fits in 96GB)
# Option A: llama-3.1-70B (best quality)
python -c "
from mlx_lm import load
load('mlx-community/Meta-Llama-3.1-70B-Instruct-4bit')
"

# Option B: Mistral-7B (faster, less memory)
python -c "
from mlx_lm import load
load('mlx-community/Mistral-7B-Instruct-v0.3-4bit')
"
```
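
Before starting the server, you can sanity-check a downloaded model straight from Python. A minimal sketch using the 7B option above (swap in the 70B name if you pulled that one); note that the exact `generate()` keyword arguments vary slightly across mlx-lm releases:

```python
# sanity_check.py - generate a few tokens to confirm the model runs on Metal
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

reply = generate(
    model,
    tokenizer,
    prompt="Say hello in five words.",
    max_tokens=32,
)
print(reply)
```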

### 3. Start MLX Server

```bash
# Terminal 1: Start MLX server (uses Metal GPU automatically)
mlx_lm.server \
  --model mlx-community/Meta-Llama-3.1-70B-Instruct-4bit \
  --host 0.0.0.0 \
  --port 8000

# OR for 7B (faster, less RAM):
# mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --host 0.0.0.0 --port 8000

# Verify it's running
curl http://localhost:8000/v1/models
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hello"}]}'
```
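
Because `mlx_lm.server` speaks the OpenAI chat-completions protocol, any OpenAI-compatible client can talk to it directly. A minimal sketch with the `openai` Python SDK; the API key is a placeholder, since the local server does not check it:

```python
# mlx_client.py - call the local MLX server through the OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mlx-community/Meta-Llama-3.1-70B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello from the OpenAI client"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```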

### 4. Configure API Server

```bash
# In the ml-intern repo
cd production

# Create minimal .env (ONLY these 4-5 lines needed)
cat > .env << 'EOF'
# REQUIRED — Cloudflare fallback
CLOUDFLARE_API_KEY=your_cloudflare_api_key_here
CLOUDFLARE_ACCOUNT_ID=your_account_id_here

# OPTIONAL — NIM primary (if you have an API key)
NVIDIA_API_KEY=your_nvidia_api_key_here

# Point MLX to your local server
MLX_API_BASE=http://host.docker.internal:8000/v1
EOF
```

### 5. Start the Stack

```bash
# Start Redis, Postgres, API, Workers, Nginx
docker-compose -f docker-compose.m2.yml up -d

# Verify everything
curl http://localhost/health | jq
```

### 6. Test the Full Pipeline

```bash
# Test 1: NIM primary (if API key set)
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"nim/llama-3.1-8b-instruct","messages":[{"role":"user","content":"Hello from NIM"}]}'

# Test 2: Cloudflare fallback
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"cloudflare/@cf/meta/llama-3.1-8b-instruct","messages":[{"role":"user","content":"Hello from Cloudflare"}]}'

# Test 3: MLX local (bypasses fallback)
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx/llama-3.1-70b","messages":[{"role":"user","content":"Hello from MLX"}]}'

# Test 4: Check which provider is active
curl http://localhost/v1/fallback/status | jq
```
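
The same three checks can be scripted. A small smoke test, assuming the gateway at `http://localhost` accepts the model prefixes used above; `requests` is the only dependency:

```python
# smoke_test.py - exercise NIM, Cloudflare, and MLX through the gateway
import requests

MODELS = [
    "nim/llama-3.1-8b-instruct",                   # primary (needs NVIDIA_API_KEY)
    "cloudflare/@cf/meta/llama-3.1-8b-instruct",   # fallback
    "mlx/llama-3.1-70b",                           # local
]

for model in MODELS:
    r = requests.post(
        "http://localhost/v1/chat/completions",
        json={"model": model, "messages": [{"role": "user", "content": "ping"}]},
        timeout=120,  # the 70B local model can be slow on first token
    )
    print(f"{model} -> HTTP {r.status_code}")
```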

---

## Fallback Behavior

```
Request comes in
        │
        ▼
┌──────────────────┐
│ Check NIM        │◄── Circuit breaker CLOSED?
│ (Primary)        │      Yes → Send to NIM
└──────────────────┘      No  → Fall through
        │
        ▼  (NIM down or rate limited)
┌──────────────────┐
│ Check Cloudflare │◄── Circuit breaker CLOSED?
│ (Fallback)       │      Yes → Send to Cloudflare
└──────────────────┘      No  → Fall through
        │
        ▼  (Both down)
┌──────────────────┐
│ Check MLX        │◄── Enabled + Circuit CLOSED?
│ (Local)          │      Yes → Send to MLX
└──────────────────┘      No  → Return 503
```
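
The gateway's real implementation lives in the repo, but the control flow above is easy to pin down in a sketch. Assuming the thresholds this guide describes (the circuit opens after 3 failures and half-opens after 60 seconds), the chain looks roughly like this; the names `CircuitBreaker`, `route`, and `call_provider` are illustrative, not the repo's API:

```python
# fallback_sketch.py - illustrative circuit-breaker chain (not the repo's code)
import time

FAILURE_THRESHOLD = 3   # consecutive failures before the circuit opens
RECOVERY_SECONDS = 60   # wait before half-opening to probe again

class CircuitBreaker:
    def __init__(self):
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # CLOSED: traffic flows normally
        if time.time() - self.opened_at >= RECOVERY_SECONDS:
            return True  # HALF-OPEN: let one probe request through
        return False     # OPEN: skip this provider

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None  # success closes the circuit
        else:
            self.failures += 1
            if self.failures >= FAILURE_THRESHOLD:
                self.opened_at = time.time()  # (re-)open the circuit

BREAKERS = {name: CircuitBreaker() for name in ("nim", "cloudflare", "mlx")}

def route(request, call_provider):
    """Try NIM, then Cloudflare, then local MLX; 503 if every circuit rejects."""
    for name in ("nim", "cloudflare", "mlx"):
        breaker = BREAKERS[name]
        if not breaker.allow():
            continue  # circuit OPEN: fall through to the next provider
        try:
            result = call_provider(name, request)  # hypothetical provider call
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
    raise RuntimeError("503: all providers unavailable")
```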

**You can force a provider** with `provider_override`:

```bash
# Always use MLX regardless of NIM status
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx/llama-3.1-70b",
    "messages": [{"role":"user","content":"Hello"}],
    "provider_override": "mlx"
  }'
```

---

## Running MLX + API Together (All Native, No Docker)

If you prefer running everything natively without Docker:

### Terminal 1: MLX Server
```bash
cd ~/mlx-server
source .venv/bin/activate
mlx_lm.server --model mlx-community/Meta-Llama-3.1-70B-Instruct-4bit --port 8000
```

### Terminal 2: Redis
```bash
redis-server
```

### Terminal 3: PostgreSQL
```bash
# If not using Docker Postgres (on Apple Silicon, Homebrew lives under /opt/homebrew)
initdb /opt/homebrew/var/postgres
pg_ctl -D /opt/homebrew/var/postgres start

# Or just use Docker for Postgres only:
docker run -d --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=ml_intern postgres:16-alpine
```

### Terminal 4: API Server
```bash
cd production
uv sync

# Set only required env vars
export REDIS_URL=redis://localhost:6379
export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern
export MLX_ENABLED=true
export MLX_API_BASE=http://localhost:8000/v1
export CLOUDFLARE_API_KEY=your_key
export CLOUDFLARE_ACCOUNT_ID=your_account_id
export NVIDIA_API_KEY=your_nvidia_key
export FALLBACK_ENABLED=true
export FALLBACK_PRIMARY=nim
export FALLBACK_SECONDARY=cloudflare

# Run with uv
uv run python -m production_server
```

### Terminal 5: Worker
```bash
cd production
uv run python -m worker
```

---

## Performance Tips for M2 Pro Max

| Metric       | 7B Model  | 70B Model         |
|--------------|-----------|-------------------|
| RAM usage    | ~8GB      | ~48GB             |
| Tokens/sec   | ~40 tok/s | ~8 tok/s          |
| Startup time | ~2s       | ~20s              |
| Best for     | Fast Q&A  | Complex reasoning |
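
The RAM figures follow from the quantization arithmetic: a 4-bit weight is half a byte, so a 70B model's weights alone are roughly 35-40GB, and runtime overhead (quantization scales, KV cache, activations) pushes the total toward the ~48GB shown. A back-of-envelope check:

```python
# Rough weight-memory estimate for a 4-bit quantized 70B model
params = 70e9           # parameter count
bits_per_param = 4.5    # ~4-bit weights plus quantization scales
weights_gb = params * bits_per_param / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~39 GB, before KV cache etc.
```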

### Use Multiple MLX Models

Your 96GB can hold **both** the 7B and the 70B simultaneously:

```bash
# Terminal A: 70B for complex tasks
mlx_lm.server --model mlx-community/Meta-Llama-3.1-70B-Instruct-4bit --port 8000

# Terminal B: 7B for quick tasks
mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --port 8001
```

Then in `.env`:
```env
MLX_API_BASE=http://host.docker.internal:8000/v1  # 70B default
```

Quick requests can then go directly to port 8001, as in the sketch below.
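
For instance, routing a latency-sensitive request straight at the 7B server, bypassing the gateway entirely:

```python
# quick_task.py - hit the 7B server on :8001 directly for low-latency calls
import requests

r = requests.post(
    "http://localhost:8001/v1/chat/completions",
    json={
        "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
        "messages": [{"role": "user", "content": "Summarize: MLX runs on Metal."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(r.json()["choices"][0]["message"]["content"])
```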

---

## Minimal Configuration (Just 5 Lines)

```bash
# .env — only the two Cloudflare lines are strictly required
CLOUDFLARE_API_KEY=sk-your-key
CLOUDFLARE_ACCOUNT_ID=your-account-id
NVIDIA_API_KEY=nvapi-your-key                     # optional, enables NIM primary
MLX_API_BASE=http://host.docker.internal:8000/v1  # optional, enables local fallback
```

Everything else uses sensible defaults (sketched after this list):
- `FALLBACK_ENABLED=true` (default)
- `FALLBACK_PRIMARY=nim` (default)
- `FALLBACK_SECONDARY=cloudflare` (default)
- `DEFAULT_RPM_LIMIT=40` (NIM free tier)
- `CACHE_TTL_SECONDS=300` (5 min cache)
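
One common way such env-var defaults are modeled (purely illustrative; the repo's actual config wiring may differ) is `pydantic-settings`, where each field's default applies unless a matching environment variable overrides it:

```python
# settings_sketch.py - how defaults like these are often wired (illustrative)
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # each field reads FALLBACK_ENABLED, FALLBACK_PRIMARY, ... from the env
    fallback_enabled: bool = True
    fallback_primary: str = "nim"
    fallback_secondary: str = "cloudflare"
    default_rpm_limit: int = 40      # NIM free tier
    cache_ttl_seconds: int = 300     # 5 min cache

settings = Settings()  # env vars win over the defaults above
print(settings.model_dump())
```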

---

## What If NIM is Down?

```bash
# Simulate NIM failure — circuit breaker will open after 3 failures
for i in {1..3}; do
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"nim/llama-3.1-8b","messages":[{"role":"user","content":"test"}]}'
done

# Now check status
curl http://localhost/v1/fallback/status | jq
# → "nim": "open", "cloudflare": "closed", "active_provider": "cloudflare"

# Next request automatically goes to Cloudflare
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"cloudflare/@cf/meta/llama-3.1-8b-instruct","messages":[{"role":"user","content":"This goes to Cloudflare"}]}'

# After 60 seconds (default recovery), NIM circuit half-opens
curl http://localhost/v1/fallback/status | jq
# → "nim": "half-open"
```
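
To watch the recovery happen live, a small poller against the status endpoint works well; it assumes the JSON shape shown above and runs until interrupted:

```python
# watch_fallback.py - print circuit-breaker states once per second (Ctrl-C to stop)
import time
import requests

while True:
    states = requests.get("http://localhost/v1/fallback/status", timeout=5).json()
    print(time.strftime("%H:%M:%S"), states)
    time.sleep(1)
```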

---

## Monitoring Your Setup

```bash
# Check which provider handled each request
curl http://localhost/v1/fallback/status | jq

# Grafana: http://localhost:3000 (admin/admin)
# - Dashboard: "ml-intern Production"
# - Panels: fallback count, provider latency, cache hit rate

# Prometheus: http://localhost:9090
# Query: ml_intern_fallback_total
# Query: ml_intern_circuit_breaker_state

# Redis cache stats
redis-cli info stats | grep keyspace

# Postgres request log
psql postgresql://ml_intern:ml_intern@localhost/ml_intern \
  -c "SELECT provider, COUNT(*) FROM requests GROUP BY provider;"
```
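
The Prometheus counters can also be pulled programmatically through its standard HTTP API; the metric name is the one listed above:

```python
# fallback_metrics.py - read fallback counters via Prometheus's HTTP API
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "ml_intern_fallback_total"},
    timeout=5,
)
# instant-query vectors come back as [{"metric": {...}, "value": [ts, "val"]}, ...]
for series in resp.json()["data"]["result"]:
    print(series["metric"], "=", series["value"][1])
```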