# M2 Pro Max 96GB — Gemma 4 Setup Guide
Your machine is powerful enough to run **Gemma 4 31B-BF16 locally via MLX** — the best open alternative to Claude Opus. This guide sets up:
- **Primary**: NIM (cloud GPU)
- **Secondary**: Cloudflare Workers AI
- **Tertiary**: Google Gemini
- **Local**: Gemma 4 via MLX on Metal GPU
## Architecture
```
┌───────────────────────────────────────────────────────────┐
│                  MacBook M2 Pro Max 96GB                  │
│                                                           │
│  ┌──────────────┐     ┌──────────────────────────────┐    │
│  │  MLX Server  │     │  Docker (uv + API + Infra)   │    │
│  │  (Metal GPU) │     │  ─────────────────────────   │    │
│  │  Port 8000   │◄────│  • FastAPI                   │    │
│  │  Gemma 4 31B │     │  • Redis                     │    │
│  │  ~65GB RAM   │     │  • Postgres                  │    │
│  └──────────────┘     │  • Nginx                     │    │
│         ▲             └──────────────────────────────┘    │
│         │                                                 │
│  ┌──────┴──────┐     ┌──────────────────┐                 │
│  │  NIM Cloud  │────►│  Cloudflare AI   │                 │
│  │  (Primary)  │     │  (Secondary)     │                 │
│  └─────────────┘     └────────┬─────────┘                 │
│                               │                           │
│  ┌────────────────────────────┴────────┐                  │
│  │     Google Gemini (Tertiary)        │                  │
│  │  Best for coding + reasoning tasks  │                  │
│  └─────────────────────────────────────┘                  │
│                                                           │
└───────────────────────────────────────────────────────────┘
```
## Gemma 4 Model Recommendations
With 96GB unified memory, you have options:
| Model | RAM | Quality | Speed | Best For |
|-------|-----|---------|-------|----------|
| **gemma-4-31b-bf16** | ~65GB | ⭐⭐⭐⭐⭐ Highest | ~6 tok/s | Deep reasoning, code, complex tasks |
| **gemma-4-26b-a4b-it-bf16** | ~55GB | ⭐⭐⭐⭐⭐ Excellent | ~7 tok/s | General purpose, multimodal |
| **gemma-4-26b-a4b-it-8bit** | ~36GB | ⭐⭐⭐⭐ Great | ~12 tok/s | Fast inference with good quality |
| **gemma-4-e4b-it** | ~12GB | ⭐⭐⭐ Good | ~25 tok/s | Quick Q&A, simple tasks |
**Recommendation**: Start with `gemma-4-31b-bf16`. It's the best open alternative to Claude Opus and still leaves ~30GB for system + context.
---
## Quick Start (5 Minutes)
### 1. Install uv + MLX Server
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create MLX environment
mkdir ~/mlx-gemma4 && cd ~/mlx-gemma4
uv venv --python 3.11
source .venv/bin/activate
# Install mlx-vlm (supports Gemma 4)
uv pip install mlx-vlm
```
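Before pulling a ~60GB model, it's worth a quick sanity check that MLX can actually see the Metal GPU. A minimal check, run inside the activated venv (`mlx` is installed as a dependency of `mlx-vlm`):
```python
# Sanity check: confirm MLX is installed and running on the Metal GPU.
import mlx.core as mx

print("Default device:", mx.default_device())     # expect a gpu device
print("Metal available:", mx.metal.is_available())
```
If this reports a CPU device, inference will still run, but far below the token rates quoted above.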
### 2. Download Gemma 4 Model
```bash
# Download 31B BF16 (best quality, ~58GB on disk)
# This fits in 96GB with room for context
python -c "
from mlx_vlm.utils import load
model, processor = load('mlx-community/gemma-4-31b-bf16')
print('Gemma 4 31B loaded successfully')
"
# Or the 26B variant (slightly smaller, still excellent)
# python -c "from mlx_vlm.utils import load; load('mlx-community/gemma-4-26b-a4b-it-bf16')"
```
### 3. Start MLX Server
```bash
# Terminal 1: Start Gemma 4 MLX server
# Uses Metal GPU automatically — no config needed
python -m mlx_vlm.server \
--model mlx-community/gemma-4-31b-bf16 \
--host 0.0.0.0 \
--port 8000
# Verify
curl http://localhost:8000/v1/models
```
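If you prefer a scripted check, here is a minimal smoke test against the MLX server's OpenAI-compatible endpoints. It assumes `requests` is installed (`uv pip install requests`) and that the response follows the usual OpenAI schema implied by the curl examples:
```python
# Smoke test for the local MLX server on port 8000 (OpenAI-compatible API).
import requests

BASE = "http://localhost:8000/v1"

# List whatever model the server has loaded
models = requests.get(f"{BASE}/models", timeout=10).json()
print("Loaded models:", [m["id"] for m in models.get("data", [])])

# One short generation to confirm the Metal GPU path end to end
resp = requests.post(
    f"{BASE}/chat/completions",
    json={
        "model": "mlx-community/gemma-4-31b-bf16",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```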
### 4. Configure API Server
```bash
# In the ml-intern repo
cd production
# Copy minimal env
cp .env.minimal .env
```
Edit `.env` — **just add your API keys** (only 3-4 lines):
```env
# Cloudflare (required — always works)
CLOUDFLARE_API_KEY=sk-your-cloudflare-key
CLOUDFLARE_ACCOUNT_ID=your-account-id
# NIM (optional — faster, free tier)
NVIDIA_API_KEY=nvapi-your-nvidia-key
# Gemini (optional — great coding/reasoning)
GEMINI_API_KEY=your-gemini-key
# Enable Gemma 4 local
MLX_ENABLED=true
MLX_API_BASE=http://host.docker.internal:8000/v1
```
### 5. Start the Stack
```bash
# Terminal 2: Launch API + infrastructure
docker-compose -f docker-compose.m2.yml up -d
# Verify
curl http://localhost/health | jq
curl http://localhost/v1/models | jq
curl http://localhost/v1/fallback/status | jq
```
### 6. Test Everything
```bash
# Test 1: Fallback status — see active provider
curl http://localhost/v1/fallback/status | jq
# Test 2: Chat via active provider (auto-fallback)
curl -X POST http://localhost/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "cloudflare/@cf/google/gemma-4-26b-a4b-it",
"messages": [{"role":"user","content":"Explain quantum computing"}]
}'
# Test 3: Force Gemma 4 local via MLX
curl -X POST http://localhost/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx/gemma-4-31b-bf16",
"messages": [{"role":"user","content":"Write a Python web scraper"}],
"provider_override": "mlx"
}'
# Test 4: Gemini for coding
curl -X POST http://localhost/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemini/gemini-2.5-pro-preview",
"messages": [{"role":"user","content":"Debug this code: def foo(): pass"}]
}'
```
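The same requests can be driven from Python once the stack is up. This sketch mirrors Tests 1 and 2 above (no auth header is sent, matching the curl examples):
```python
# Mirror the curl tests above from Python.
import requests

API = "http://localhost/v1"

# Test 1: which provider is currently active?
print(requests.get(f"{API}/fallback/status", timeout=10).json())

# Test 2: chat through whichever provider the fallback chain picks
payload = {
    "model": "cloudflare/@cf/google/gemma-4-26b-a4b-it",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
}
resp = requests.post(f"{API}/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```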
---
## Fallback Chain (Automatic)
```
Request comes in
         │
         ▼
┌──────────────────┐
│    Check NIM     │◄── Circuit breaker CLOSED?
│    (Primary)     │    Yes → Send to NIM (fastest cloud)
└──────────────────┘    No  → Fall through
         │
         ▼
┌──────────────────┐
│ Check Cloudflare │◄── Circuit breaker CLOSED?
│   (Secondary)    │    Yes → Send to Cloudflare
└──────────────────┘    No  → Fall through
         │
         ▼
┌──────────────────┐
│   Check Gemini   │◄── Circuit breaker CLOSED?
│   (Tertiary)     │    Yes → Send to Gemini (great for code)
└──────────────────┘    No  → Fall through
         │
         ▼
┌──────────────────┐
│    Check MLX     │◄── Enabled + Gemma 4 loaded?
│  (Local Gemma)   │    Yes → Send to local Gemma 4
└──────────────────┘    No  → Return 503
```
Force any provider:
```bash
curl -X POST http://localhost/v1/chat/completions \
-d '{"model":"mlx/gemma-4-31b-bf16","messages":[...],"provider_override":"mlx"}'
```
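In application code the same idea looks roughly like this: prefer local Gemma 4, then fall back to the gateway's cloud chain if the override fails (the gateway returns 503 when a provider is unavailable). A sketch, assuming only the request fields and model strings shown above:
```python
# Sketch: force the local MLX provider first, then let the gateway's own
# NIM -> Cloudflare -> Gemini chain handle the request if MLX is down.
import requests

API = "http://localhost/v1/chat/completions"

def chat(messages: list[dict]) -> str:
    # 1) Explicitly target local Gemma 4 via provider_override
    resp = requests.post(
        API,
        json={
            "model": "mlx/gemma-4-31b-bf16",
            "messages": messages,
            "provider_override": "mlx",
        },
        timeout=300,
    )
    if resp.ok:
        return resp.json()["choices"][0]["message"]["content"]

    # 2) Fall back to automatic provider selection (no override)
    resp = requests.post(
        API,
        json={
            "model": "cloudflare/@cf/google/gemma-4-26b-a4b-it",
            "messages": messages,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(chat([{"role": "user", "content": "Summarize the fallback chain in one line."}]))
```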
---
## Advanced: Gemma 4 with Speculative Decoding (2x Speed)
Use a small draft model to predict several tokens ahead; the full model then verifies them:
```bash
# Terminal 1: Gemma 4 with MTP drafter (2x faster!)
python -m mlx_vlm.server \
--model mlx-community/gemma-4-31b-bf16 \
--draft-model mlx-community/gemma-4-31B-it-assistant-bf16 \
--draft-kind mtp \
--host 0.0.0.0 \
--port 8000
```
> **Note**: Temperature must be 0 for byte-identical output with MTP.
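For intuition, this is roughly what draft-and-verify decoding does per step. The `draft_model`/`main_model` objects and the `predict_each` helper are placeholders for illustration, not the actual mlx_vlm internals:
```python
# Conceptual sketch of greedy speculative decoding (temperature 0): the small draft
# model proposes k tokens, the large model checks them all in one forward pass, and
# the output stays token-for-token identical to running the large model alone.
def speculative_step(main_model, draft_model, tokens, k=4):
    # 1) Draft k tokens cheaply with the small model
    proposed = []
    for _ in range(k):
        proposed.append(draft_model.next_token(tokens + proposed))

    # 2) One forward pass of the big model over the drafted sequence yields its own
    #    greedy prediction after each drafted prefix (hypothetical helper, length k+1)
    predictions = main_model.predict_each(tokens, proposed)

    # 3) Accept drafted tokens while they match; on the first mismatch (or after all
    #    k match) append the big model's own token, so the output matches plain
    #    greedy decoding of the big model
    accepted = []
    for drafted, predicted in zip(proposed, predictions):
        if drafted == predicted:
            accepted.append(drafted)
        else:
            accepted.append(predicted)
            break
    else:
        accepted.append(predictions[k])

    return tokens + accepted
```
The more drafted tokens the big model accepts per verification pass, the closer you get to the ~2x speedup; at temperature 0 the result is byte-identical to running the 31B model alone, which is why the note above requires it.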
---
## Multi-Model Setup (Run Multiple Locally)
Your 96GB can hold **both** Gemma 4 31B and a smaller, fast model (the E4B below) at the same time:
```bash
# Terminal 1: Gemma 4 31B for deep reasoning
python -m mlx_vlm.server \
--model mlx-community/gemma-4-31b-bf16 \
--host 0.0.0.0 --port 8000
# Terminal 2: Gemma 4 E4B for quick tasks
python -m mlx_vlm.server \
--model mlx-community/gemma-4-e4b-it \
--host 0.0.0.0 --port 8001
```
Then send quick tasks to port 8001 directly, complex ones to port 8000.
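A tiny dispatcher makes the split concrete. The length-based heuristic and the `requests` dependency are just for illustration; route however fits your workload:
```python
# Route requests between the two local servers: short prompts go to the fast E4B
# model on port 8001, everything else to the 31B model on port 8000.
import requests

FAST = ("http://localhost:8001/v1/chat/completions", "mlx-community/gemma-4-e4b-it")
DEEP = ("http://localhost:8000/v1/chat/completions", "mlx-community/gemma-4-31b-bf16")

def ask(prompt: str, max_tokens: int = 512) -> str:
    url, model = FAST if len(prompt) < 200 else DEEP   # crude complexity heuristic
    resp = requests.post(
        url,
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask("What's 2 + 2?"))                        # handled by the E4B model
print(ask("Design a schema for a job queue ..."))  # handled by the 31B model
```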
---
## Provider Selection Guide
| Task Type | Recommended Provider | Model |
|-----------|---------------------|-------|
| **General reasoning** | MLX local | `gemma-4-31b-bf16` |
| **Coding/debugging** | Gemini | `gemini-2.5-pro-preview` |
| **Fast Q&A** | Cloudflare | `@cf/google/gemma-4-26b-a4b-it` |
| **High throughput** | NIM | `llama-3.1-405b` |
| **Multimodal (image+text)** | MLX local | `gemma-4-26b-a4b-it` |
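If you pick providers in code, the table collapses to a simple lookup. The task labels below are made up for illustration, and the exact model-string prefixes may differ in your deployment; adjust them to whatever `/v1/models` reports:
```python
# Translate the provider-selection table into a routing lookup.
ROUTES = {
    "general_reasoning": ("mlx",        "mlx/gemma-4-31b-bf16"),
    "coding":            ("gemini",     "gemini/gemini-2.5-pro-preview"),
    "fast_qa":           ("cloudflare", "cloudflare/@cf/google/gemma-4-26b-a4b-it"),
    "high_throughput":   ("nim",        "llama-3.1-405b"),
    "multimodal":        ("mlx",        "gemma-4-26b-a4b-it"),
}

def request_body(task: str, messages: list[dict]) -> dict:
    provider, model = ROUTES[task]
    return {"model": model, "messages": messages, "provider_override": provider}
```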
---
## Minimal Configuration (Just 3 Lines)
```bash
# .env — the bare minimum
CLOUDFLARE_API_KEY=sk-your-key
CLOUDFLARE_ACCOUNT_ID=your-account-id
MLX_ENABLED=true
```
Everything else auto-configures. Even without NIM or Gemini keys, Cloudflare + MLX gives you a robust setup.
---
## Monitoring
```bash
# Check active provider and fallback status
curl http://localhost/v1/fallback/status | jq
# View all available models
curl http://localhost/v1/models | jq '.data[] | {id, owned_by}'
# Grafana: http://localhost:3000 (admin/admin)
# - Dashboard: "ml-intern Production"
# - Panels: provider latency, fallback count, cache hit rate
# Prometheus queries:
curl 'http://localhost:9090/api/v1/query?query=ml_intern_fallback_total'
curl 'http://localhost:9090/api/v1/query?query=ml_intern_circuit_breaker_state'
```
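For a quick terminal watch without opening Grafana, a small poller over the same endpoints works. The field names inside the status JSON are whatever the gateway returns, so the script simply pretty-prints them:
```python
# Poll the fallback status and the Prometheus fallback counter every 30 seconds.
import json
import time
import requests

STATUS_URL = "http://localhost/v1/fallback/status"
PROM_URL = "http://localhost:9090/api/v1/query"

while True:
    status = requests.get(STATUS_URL, timeout=10).json()
    print(json.dumps(status, indent=2))

    fallbacks = requests.get(
        PROM_URL, params={"query": "ml_intern_fallback_total"}, timeout=10
    ).json()
    for result in fallbacks.get("data", {}).get("result", []):
        print(result.get("metric"), "=", result.get("value"))

    time.sleep(30)
```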
---
## Troubleshooting
### Gemma 4 download is slow
```bash
# Use huggingface-cli for resumable download
uv pip install huggingface-hub
huggingface-cli download mlx-community/gemma-4-31b-bf16 --local-dir ~/models/gemma-4-31b
```
### MLX server says "out of memory"
```bash
# Try the smaller 26B model instead
python -m mlx_vlm.server --model mlx-community/gemma-4-26b-a4b-it-8bit --port 8000
# Or the tiny E4B:
python -m mlx_vlm.server --model mlx-community/gemma-4-e4b-it --port 8000
```
### Docker can't reach MLX on host
```bash
# On macOS, host.docker.internal works
# On Linux, use your machine's IP:
MLX_API_BASE=http://192.168.1.5:8000/v1
```
---
## Gemma 4 vs Claude Opus
| Capability | Gemma 4 31B-BF16 (MLX) | Claude Opus 4 |
|-----------|------------------------|---------------|
| **Context window** | 128K tokens | 200K tokens |
| **Multimodal** | ✅ Image + Text | ✅ Image + Text |
| **Code quality** | ⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Best |
| **Reasoning** | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Best |
| **Speed** | ~6 tok/s (M2 Max) | Cloud-based |
| **Cost** | $0 (local) | ~$15/1M input tokens |
| **Privacy** | ✅ 100% local | ❌ Cloud |
| **Offline** | ✅ Works offline | ❌ Requires internet |
**Verdict**: For most tasks, Gemma 4 31B-BF16 on your M2 Pro Max is a genuine Claude Opus alternative. For edge cases where you need the absolute best, Gemini 2.5 Pro fills the gap.