# M2 Pro Max 96GB — Gemma 4 Setup Guide
Your machine is powerful enough to run **Gemma 4 31B-BF16 locally via MLX** — the best open alternative to Claude Opus. This guide sets up:
- **Primary**: NIM (cloud GPU)
- **Secondary**: Cloudflare Workers AI
- **Tertiary**: Google Gemini
- **Local**: Gemma 4 via MLX on Metal GPU
## Architecture
```
┌───────────────────────────────────────────────────────────┐
│                  MacBook M2 Pro Max 96GB                  │
│                                                           │
│  ┌──────────────┐     ┌──────────────────────────────┐    │
│  │  MLX Server  │     │  Docker (uv + API + Infra)   │    │
│  │  (Metal GPU) │     │  ─────────────────────────   │    │
│  │  Port 8000   │◄────│  • FastAPI                   │    │
│  │  Gemma 4 31B │     │  • Redis                     │    │
│  │  ~65GB RAM   │     │  • Postgres                  │    │
│  └──────────────┘     │  • Nginx                     │    │
│         ▲             └──────────────────────────────┘    │
│         │                                                 │
│  ┌──────┴──────┐     ┌──────────────────┐                 │
│  │  NIM Cloud  │────►│  Cloudflare AI   │                 │
│  │  (Primary)  │     │  (Secondary)     │                 │
│  └─────────────┘     └────────┬─────────┘                 │
│                               │                           │
│  ┌────────────────────────────┴────────┐                  │
│  │     Google Gemini (Tertiary)        │                  │
│  │  Best for coding + reasoning tasks  │                  │
│  └─────────────────────────────────────┘                  │
│                                                           │
└───────────────────────────────────────────────────────────┘
```
## Gemma 4 Model Recommendations
With 96GB unified memory, you have options:
| Model | RAM | Quality | Speed | Best For |
|-------|-----|---------|-------|----------|
| **gemma-4-31b-bf16** | ~65GB | ⭐⭐⭐⭐⭐ Highest | ~6 tok/s | Deep reasoning, code, complex tasks |
| **gemma-4-26b-a4b-it-bf16** | ~55GB | ⭐⭐⭐⭐⭐ Excellent | ~7 tok/s | General purpose, multimodal |
| **gemma-4-26b-a4b-it-8bit** | ~36GB | ⭐⭐⭐⭐ Great | ~12 tok/s | Fast inference with good quality |
| **gemma-4-e4b-it** | ~12GB | ⭐⭐⭐ Good | ~25 tok/s | Quick Q&A, simple tasks |
**Recommendation**: Start with `gemma-4-31b-bf16`. It's the best open alternative to Claude Opus and still leaves ~30GB for system + context.
---
## Quick Start (5 Minutes)
### 1. Install uv + MLX Server
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create MLX environment
mkdir ~/mlx-gemma4 && cd ~/mlx-gemma4
uv venv --python 3.11
source .venv/bin/activate
# Install mlx-vlm (supports Gemma 4)
uv pip install mlx-vlm
```
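Before pulling a ~60GB model, it's worth a quick sanity check that MLX can actually see the Metal GPU. A minimal check, run inside the activated venv (`mlx` is installed as a dependency of `mlx-vlm`):
```python
# Sanity check: confirm MLX is installed and running on the Metal GPU.
import mlx.core as mx

print("Default device:", mx.default_device())     # expect a gpu device
print("Metal available:", mx.metal.is_available())
```
If this reports a CPU device, inference will still run, but far below the token rates quoted above.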
### 2. Download Gemma 4 Model
```bash
# Download 31B BF16 (best quality, ~58GB on disk)
# This fits in 96GB with room for context
python -c "
from mlx_vlm.utils import load
model, processor = load('mlx-community/gemma-4-31b-bf16')
print('Gemma 4 31B loaded successfully')
"
# Or the 26B variant (slightly smaller, still excellent)
# python -c "from mlx_vlm.utils import load; load('mlx-community/gemma-4-26b-a4b-it-bf16')"
```
### 3. Start MLX Server
```bash
# Terminal 1: Start Gemma 4 MLX server
# Uses Metal GPU automatically — no config needed
python -m mlx_vlm.server \
--model mlx-community/gemma-4-31b-bf16 \
--host 0.0.0.0 \
--port 8000
# Verify
curl http://localhost:8000/v1/models
```
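If you prefer a scripted check, here is a minimal smoke test against the MLX server's OpenAI-compatible endpoints. It assumes `requests` is installed (`uv pip install requests`) and that the response follows the usual OpenAI schema implied by the curl examples:
```python
# Smoke test for the local MLX server on port 8000 (OpenAI-compatible API).
import requests

BASE = "http://localhost:8000/v1"

# List whatever model the server has loaded
models = requests.get(f"{BASE}/models", timeout=10).json()
print("Loaded models:", [m["id"] for m in models.get("data", [])])

# One short generation to confirm the Metal GPU path end to end
resp = requests.post(
    f"{BASE}/chat/completions",
    json={
        "model": "mlx-community/gemma-4-31b-bf16",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```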
### 4. Configure API Server
```bash
# In the ml-intern repo
cd production
# Copy minimal env
cp .env.minimal .env
```
Edit `.env` — **just add your API keys** (only 3-4 lines):
```env
# Cloudflare (required — always works)
CLOUDFLARE_API_KEY=sk-your-cloudflare-key
CLOUDFLARE_ACCOUNT_ID=your-account-id
# NIM (optional — faster, free tier)
NVIDIA_API_KEY=nvapi-your-nvidia-key
# Gemini (optional — great coding/reasoning)
GEMINI_API_KEY=your-gemini-key
# Enable Gemma 4 local
MLX_ENABLED=true
MLX_API_BASE=http://host.docker.internal:8000/v1
```
### 5. Start the Stack
```bash
# Terminal 2: Launch API + infrastructure
docker-compose -f docker-compose.m2.yml up -d
# Verify
curl http://localhost/health | jq
curl http://localhost/v1/models | jq
curl http://localhost/v1/fallback/status | jq
```
### 6. Test Everything
```bash
# Test 1: Fallback status — see active provider
curl http://localhost/v1/fallback/status | jq
# Test 2: Chat via active provider (auto-fallback)
curl -X POST http://localhost/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "cloudflare/@cf/google/gemma-4-26b-a4b-it",
"messages": [{"role":"user","content":"Explain quantum computing"}]
}'
# Test 3: Force Gemma 4 local via MLX
curl -X POST http://localhost/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx/gemma-4-31b-bf16",
"messages": [{"role":"user","content":"Write a Python web scraper"}],
"provider_override": "mlx"
}'
# Test 4: Gemini for coding
curl -X POST http://localhost/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemini/gemini-2.5-pro-preview",
"messages": [{"role":"user","content":"Debug this code: def foo(): pass"}]
}'
```
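The same requests can be driven from Python once the stack is up. This sketch mirrors Tests 1 and 2 above (no auth header is sent, matching the curl examples):
```python
# Mirror the curl tests above from Python.
import requests

API = "http://localhost/v1"

# Test 1: which provider is currently active?
print(requests.get(f"{API}/fallback/status", timeout=10).json())

# Test 2: chat through whichever provider the fallback chain picks
payload = {
    "model": "cloudflare/@cf/google/gemma-4-26b-a4b-it",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
}
resp = requests.post(f"{API}/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```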
---
## Fallback Chain (Automatic)
```
Request comes in
         │
         ▼
┌──────────────────┐
│    Check NIM     │◄── Circuit breaker CLOSED?
│    (Primary)     │    Yes → Send to NIM (fastest cloud)
└──────────────────┘    No  → Fall through
         │
         ▼
┌──────────────────┐
│ Check Cloudflare │◄── Circuit breaker CLOSED?
│   (Secondary)    │    Yes → Send to Cloudflare
└──────────────────┘    No  → Fall through
         │
         ▼
┌──────────────────┐
│   Check Gemini   │◄── Circuit breaker CLOSED?
│   (Tertiary)     │    Yes → Send to Gemini (great for code)
└──────────────────┘    No  → Fall through
         │
         ▼
┌──────────────────┐
│    Check MLX     │◄── Enabled + Gemma 4 loaded?
│  (Local Gemma)   │    Yes → Send to local Gemma 4
└──────────────────┘    No  → Return 503
```
Force any provider:
```bash
curl -X POST http://localhost/v1/chat/completions \
-d '{"model":"mlx/gemma-4-31b-bf16","messages":[...],"provider_override":"mlx"}'
```
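In application code the same idea looks roughly like this: prefer local Gemma 4, then fall back to the gateway's cloud chain if the override fails (the gateway returns 503 when a provider is unavailable). A sketch, assuming only the request fields and model strings shown above:
```python
# Sketch: force the local MLX provider first, then let the gateway's own
# NIM -> Cloudflare -> Gemini chain handle the request if MLX is down.
import requests

API = "http://localhost/v1/chat/completions"

def chat(messages: list[dict]) -> str:
    # 1) Explicitly target local Gemma 4 via provider_override
    resp = requests.post(
        API,
        json={
            "model": "mlx/gemma-4-31b-bf16",
            "messages": messages,
            "provider_override": "mlx",
        },
        timeout=300,
    )
    if resp.ok:
        return resp.json()["choices"][0]["message"]["content"]

    # 2) Fall back to automatic provider selection (no override)
    resp = requests.post(
        API,
        json={
            "model": "cloudflare/@cf/google/gemma-4-26b-a4b-it",
            "messages": messages,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(chat([{"role": "user", "content": "Summarize the fallback chain in one line."}]))
```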
---
## Advanced: Gemma 4 with Speculative Decoding (2x Speed)
Use a small draft model to predict several tokens ahead; the full model then verifies them:
```bash
# Terminal 1: Gemma 4 with MTP drafter (2x faster!)
python -m mlx_vlm.server \
--model mlx-community/gemma-4-31b-bf16 \
--draft-model mlx-community/gemma-4-31B-it-assistant-bf16 \
--draft-kind mtp \
--host 0.0.0.0 \
--port 8000
```
> **Note**: Temperature must be 0 for byte-identical output with MTP.
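For intuition, this is roughly what draft-and-verify decoding does per step. The `draft_model`/`main_model` objects and the `predict_each` helper are placeholders for illustration, not the actual mlx_vlm internals:
```python
# Conceptual sketch of greedy speculative decoding (temperature 0): the small draft
# model proposes k tokens, the large model checks them all in one forward pass, and
# the output stays token-for-token identical to running the large model alone.
def speculative_step(main_model, draft_model, tokens, k=4):
    # 1) Draft k tokens cheaply with the small model
    proposed = []
    for _ in range(k):
        proposed.append(draft_model.next_token(tokens + proposed))

    # 2) One forward pass of the big model over the drafted sequence yields its own
    #    greedy prediction after each drafted prefix (hypothetical helper, length k+1)
    predictions = main_model.predict_each(tokens, proposed)

    # 3) Accept drafted tokens while they match; on the first mismatch (or after all
    #    k match) append the big model's own token, so the output matches plain
    #    greedy decoding of the big model
    accepted = []
    for drafted, predicted in zip(proposed, predictions):
        if drafted == predicted:
            accepted.append(drafted)
        else:
            accepted.append(predicted)
            break
    else:
        accepted.append(predictions[k])

    return tokens + accepted
```
The more drafted tokens the big model accepts per verification pass, the closer you get to the ~2x speedup; at temperature 0 the result is byte-identical to running the 31B model alone, which is why the note above requires it.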
---
## Multi-Model Setup (Run Multiple Locally)
Your 96GB can hold **both** Gemma 4 31B and a smaller, fast model (the E4B below) at the same time:
```bash
# Terminal 1: Gemma 4 31B for deep reasoning
python -m mlx_vlm.server \
--model mlx-community/gemma-4-31b-bf16 \
--host 0.0.0.0 --port 8000
# Terminal 2: Gemma 4 E4B for quick tasks
python -m mlx_vlm.server \
--model mlx-community/gemma-4-e4b-it \
--host 0.0.0.0 --port 8001
```
Then send quick tasks to port 8001 directly, complex ones to port 8000.
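A tiny dispatcher makes the split concrete. The length-based heuristic and the `requests` dependency are just for illustration; route however fits your workload:
```python
# Route requests between the two local servers: short prompts go to the fast E4B
# model on port 8001, everything else to the 31B model on port 8000.
import requests

FAST = ("http://localhost:8001/v1/chat/completions", "mlx-community/gemma-4-e4b-it")
DEEP = ("http://localhost:8000/v1/chat/completions", "mlx-community/gemma-4-31b-bf16")

def ask(prompt: str, max_tokens: int = 512) -> str:
    url, model = FAST if len(prompt) < 200 else DEEP   # crude complexity heuristic
    resp = requests.post(
        url,
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask("What's 2 + 2?"))                        # handled by the E4B model
print(ask("Design a schema for a job queue ..."))  # handled by the 31B model
```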
---
## Provider Selection Guide
| Task Type | Recommended Provider | Model |
|-----------|---------------------|-------|
| **General reasoning** | MLX local | `gemma-4-31b-bf16` |
| **Coding/debugging** | Gemini | `gemini-2.5-pro-preview` |
| **Fast Q&A** | Cloudflare | `@cf/google/gemma-4-26b-a4b-it` |
| **High throughput** | NIM | `llama-3.1-405b` |
| **Multimodal (image+text)** | MLX local | `gemma-4-26b-a4b-it` |
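If you pick providers in code, the table collapses to a simple lookup. The task labels below are made up for illustration, and the exact model-string prefixes may differ in your deployment; adjust them to whatever `/v1/models` reports:
```python
# Translate the provider-selection table into a routing lookup.
ROUTES = {
    "general_reasoning": ("mlx",        "mlx/gemma-4-31b-bf16"),
    "coding":            ("gemini",     "gemini/gemini-2.5-pro-preview"),
    "fast_qa":           ("cloudflare", "cloudflare/@cf/google/gemma-4-26b-a4b-it"),
    "high_throughput":   ("nim",        "llama-3.1-405b"),
    "multimodal":        ("mlx",        "gemma-4-26b-a4b-it"),
}

def request_body(task: str, messages: list[dict]) -> dict:
    provider, model = ROUTES[task]
    return {"model": model, "messages": messages, "provider_override": provider}
```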
---
## Minimal Configuration (Just 3 Lines)
```bash
# .env — the bare minimum
CLOUDFLARE_API_KEY=sk-your-key
CLOUDFLARE_ACCOUNT_ID=your-account-id
MLX_ENABLED=true
```
Everything else auto-configures. Even without NIM or Gemini keys, Cloudflare + MLX gives you a robust setup.
---
## Monitoring
```bash
# Check active provider and fallback status
curl http://localhost/v1/fallback/status | jq
# View all available models
curl http://localhost/v1/models | jq '.data[] | {id, owned_by}'
# Grafana: http://localhost:3000 (admin/admin)
# - Dashboard: "ml-intern Production"
# - Panels: provider latency, fallback count, cache hit rate
# Prometheus queries:
curl 'http://localhost:9090/api/v1/query?query=ml_intern_fallback_total'
curl 'http://localhost:9090/api/v1/query?query=ml_intern_circuit_breaker_state'
```
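For a quick terminal watch without opening Grafana, a small poller over the same endpoints works. The field names inside the status JSON are whatever the gateway returns, so the script simply pretty-prints them:
```python
# Poll the fallback status and the Prometheus fallback counter every 30 seconds.
import json
import time
import requests

STATUS_URL = "http://localhost/v1/fallback/status"
PROM_URL = "http://localhost:9090/api/v1/query"

while True:
    status = requests.get(STATUS_URL, timeout=10).json()
    print(json.dumps(status, indent=2))

    fallbacks = requests.get(
        PROM_URL, params={"query": "ml_intern_fallback_total"}, timeout=10
    ).json()
    for result in fallbacks.get("data", {}).get("result", []):
        print(result.get("metric"), "=", result.get("value"))

    time.sleep(30)
```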
---
## Troubleshooting
### Gemma 4 download is slow
```bash
# Use huggingface-cli for resumable download
uv pip install huggingface-hub
huggingface-cli download mlx-community/gemma-4-31b-bf16 --local-dir ~/models/gemma-4-31b
```
### MLX server says "out of memory"
```bash
# Try the smaller 26B model instead
python -m mlx_vlm.server --model mlx-community/gemma-4-26b-a4b-it-8bit --port 8000
# Or the tiny E4B:
python -m mlx_vlm.server --model mlx-community/gemma-4-e4b-it --port 8000
```
### Docker can't reach MLX on host
```bash
# On macOS, host.docker.internal works
# On Linux, use your machine's IP:
MLX_API_BASE=http://192.168.1.5:8000/v1
```
---
## Gemma 4 vs Claude Opus
| Capability | Gemma 4 31B-BF16 (MLX) | Claude Opus 4 |
|-----------|------------------------|---------------|
| **Context window** | 128K tokens | 200K tokens |
| **Multimodal** | ✅ Image + Text | ✅ Image + Text |
| **Code quality** | ⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Best |
| **Reasoning** | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Best |
| **Speed** | ~6 tok/s (M2 Max) | Cloud-based |
| **Cost** | $0 (local) | ~$15/1M input tokens |
| **Privacy** | ✅ 100% local | ❌ Cloud |
| **Offline** | ✅ Works offline | ❌ Requires internet |
**Verdict**: For most tasks, Gemma 4 31B-BF16 on your M2 Pro Max is a genuine Claude Opus alternative. For edge cases where you need the absolute best, Gemini 2.5 Pro fills the gap.