
M2 Pro Max 96GB – Gemma 4 Setup Guide

Your machine is powerful enough to run Gemma 4 31B-BF16 locally via MLX, the best open alternative to Claude Opus. This guide sets up:

  • Primary: NIM (cloud GPU)
  • Secondary: Cloudflare Workers AI
  • Tertiary: Google Gemini
  • Local: Gemma 4 via MLX on Metal GPU

Architecture

┌──────────────────────────────────────────────────────────────┐
│                   MacBook M2 Pro Max 96GB                    │
│                                                              │
│  ┌──────────────┐     ┌──────────────────────────────┐       │
│  │  MLX Server  │     │  Docker (uv + API + Infra)   │       │
│  │  (Metal GPU) │     │  ──────────────────────────  │       │
│  │  Port 8000   │◄────│  • FastAPI                   │       │
│  │  Gemma 4 31B │     │  • Redis                     │       │
│  │  ~65GB RAM   │     │  • Postgres                  │       │
│  └──────────────┘     │  • Nginx                     │       │
│         ▲             └──────────────────────────────┘       │
│         │                                                    │
│  ┌──────┴──────┐     ┌─────────────────┐                     │
│  │  NIM Cloud  │────►│  Cloudflare AI  │                     │
│  │  (Primary)  │     │  (Secondary)    │                     │
│  └─────────────┘     └────────┬────────┘                     │
│                               │                              │
│  ┌────────────────────────────┴───────────┐                  │
│  │       Google Gemini (Tertiary)         │                  │
│  │   Best for coding + reasoning tasks    │                  │
│  └────────────────────────────────────────┘                  │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Gemma 4 Model Recommendations

With 96GB unified memory, you have options:

| Model | RAM | Quality | Speed | Best For |
|---|---|---|---|---|
| gemma-4-31b-bf16 | ~65GB | ⭐⭐⭐⭐⭐ Highest | ~6 tok/s | Deep reasoning, code, complex tasks |
| gemma-4-26b-a4b-it-bf16 | ~55GB | ⭐⭐⭐⭐⭐ Excellent | ~7 tok/s | General purpose, multimodal |
| gemma-4-26b-a4b-it-8bit | ~36GB | ⭐⭐⭐⭐ Great | ~12 tok/s | Fast inference with good quality |
| gemma-4-e4b-it | ~12GB | ⭐⭐⭐ Good | ~25 tok/s | Quick Q&A, simple tasks |

Recommendation: Start with gemma-4-31b-bf16. It's the best open alternative to Claude Opus and still leaves ~30GB for system + context.
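
A quick back-of-envelope check on those RAM numbers: BF16 stores two bytes per parameter, so the weights alone for the 31B model land near the ~58GB download size, with the runtime footprint somewhat higher once the KV cache and buffers are counted (a rough sketch in Python; the constants are approximate):

# Rough estimate: BF16 = 2 bytes per parameter; overhead is extra.
params = 31e9                                  # gemma-4-31b
weights_gb = params * 2 / 1024**3
print(f"weights alone: ~{weights_gb:.0f} GB")  # ≈ 58 GB, matching the download size
# Runtime use is closer to ~65 GB with KV cache and buffers,
# leaving roughly 30 GB of the 96 GB for context and the system.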


Quick Start (5 Minutes)

1. Install uv + MLX Server

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create MLX environment
mkdir ~/mlx-gemma4 && cd ~/mlx-gemma4
uv venv --python 3.11
source .venv/bin/activate

# Install mlx-vlm (supports Gemma 4)
uv pip install mlx-vlm
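
Before pulling a 58GB model, it's worth confirming MLX actually sees the Metal GPU (a quick sanity check using mlx.core, which mlx-vlm installs as a dependency):

python -c "
import mlx.core as mx
# Expect Device(gpu, 0) on Apple Silicon; 'cpu' means Metal is not in use.
print(mx.default_device())
"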

2. Download Gemma 4 Model

# Download 31B BF16 (best quality, ~58GB on disk)
# This fits in 96GB with room for context
python -c "
from mlx_vlm.utils import load
model, processor = load('mlx-community/gemma-4-31b-bf16')
print('Gemma 4 31B loaded successfully')
"

# Or the 26B variant (slightly smaller, still excellent)
# python -c "from mlx_vlm.utils import load; load('mlx-community/gemma-4-26b-a4b-it-bf16')"
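
The BF16 download is roughly 58GB, so check free disk space first (a small standard-library sketch; by default the files land in the Hugging Face cache under ~/.cache/huggingface):

python -c "
import shutil
from pathlib import Path
# Free space on the volume holding your home directory (and the HF cache).
free_gb = shutil.disk_usage(Path.home()).free / 1024**3
print(f'{free_gb:.0f} GB free; the 31B BF16 download needs ~58 GB')
"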

3. Start MLX Server

# Terminal 1: Start Gemma 4 MLX server
# Uses the Metal GPU automatically - no config needed
python -m mlx_vlm.server \
  --model mlx-community/gemma-4-31b-bf16 \
  --host 0.0.0.0 \
  --port 8000

# Verify
curl http://localhost:8000/v1/models

4. Configure API Server

# In the ml-intern repo
cd production

# Copy minimal env
cp .env.minimal .env

Edit .env and add your API keys (only 3-4 lines):

# Cloudflare (required - always works)
CLOUDFLARE_API_KEY=sk-your-cloudflare-key
CLOUDFLARE_ACCOUNT_ID=your-account-id

# NIM (optional - faster, free tier)
NVIDIA_API_KEY=nvapi-your-nvidia-key

# Gemini (optional - great for coding/reasoning)
GEMINI_API_KEY=your-gemini-key

# Enable Gemma 4 local
MLX_ENABLED=true
MLX_API_BASE=http://host.docker.internal:8000/v1

5. Start the Stack

# Terminal 2: Launch API + infrastructure
docker-compose -f docker-compose.m2.yml up -d

# Verify
curl http://localhost/health | jq
curl http://localhost/v1/models | jq
curl http://localhost/v1/fallback/status | jq
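
If you script the bring-up, a small readiness loop against /health avoids racing the containers (a sketch using the requests library and the same endpoint as the verification commands above):

import time
import requests

# Poll the gateway until the stack reports healthy (give it ~60s).
for _ in range(30):
    try:
        if requests.get("http://localhost/health", timeout=2).ok:
            print("stack is up")
            break
    except requests.ConnectionError:
        pass
    time.sleep(2)
else:
    raise SystemExit("stack did not come up in time")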

6. Test Everything

# Test 1: Fallback status - see active provider
curl http://localhost/v1/fallback/status | jq

# Test 2: Chat via active provider (auto-fallback)
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cloudflare/@cf/google/gemma-4-26b-a4b-it",
    "messages": [{"role":"user","content":"Explain quantum computing"}]
  }'

# Test 3: Force Gemma 4 local via MLX
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx/gemma-4-31b-bf16",
    "messages": [{"role":"user","content":"Write a Python web scraper"}],
    "provider_override": "mlx"
  }'

# Test 4: Gemini for coding
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini/gemini-2.5-pro-preview",
    "messages": [{"role":"user","content":"Debug this code: def foo(): pass"}]
  }'
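
The gateway speaks the OpenAI chat-completions format, so the same tests work from Python with the standard openai client (a sketch; the api_key is a dummy value, assuming the local gateway does not enforce one):

from openai import OpenAI

# Point the standard OpenAI client at the local gateway.
client = OpenAI(base_url="http://localhost/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mlx/gemma-4-31b-bf16",  # any ID returned by /v1/models works
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(resp.choices[0].message.content)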

Fallback Chain (Automatic)

Request comes in
     │
     ▼
┌────────────────┐
│   Check NIM    │◄── Circuit breaker CLOSED?
│   (Primary)    │    Yes → Send to NIM (fastest cloud)
└────────────────┘    No  → Fall through
     │
     ▼
┌────────────────┐
│Check Cloudflare│◄── Circuit breaker CLOSED?
│  (Secondary)   │    Yes → Send to Cloudflare
└────────────────┘    No  → Fall through
     │
     ▼
┌────────────────┐
│  Check Gemini  │◄── Circuit breaker CLOSED?
│   (Tertiary)   │    Yes → Send to Gemini (great for code)
└────────────────┘    No  → Fall through
     │
     ▼
┌────────────────┐
│   Check MLX    │◄── Enabled + Gemma 4 loaded?
│  (Local Gemma) │    Yes → Send to local Gemma 4
└────────────────┘    No  → Return 503

Force any provider:

curl -X POST http://localhost/v1/chat/completions \
  -d '{"model":"mlx/gemma-4-31b-bf16","messages":[...],"provider_override":"mlx"}'
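
Conceptually the router walks an ordered provider list and skips anything whose circuit breaker is open. A simplified illustration of that logic (not the repo's actual code; the provider names and the try_provider helper are hypothetical):

# Illustrative only: ordered fallback with per-provider circuit breakers.
PROVIDERS = ["nim", "cloudflare", "gemini", "mlx"]  # primary → last resort

def route(request, breaker_closed, try_provider):
    """breaker_closed[p] is True while provider p is healthy (circuit CLOSED);
    try_provider(p, request) is a hypothetical call that raises on failure."""
    for provider in PROVIDERS:
        if not breaker_closed.get(provider, False):
            continue                          # circuit open → fall through
        try:
            return try_provider(provider, request)
        except Exception:
            breaker_closed[provider] = False  # trip the breaker, try the next
    raise RuntimeError("503: no provider available")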

Advanced: Gemma 4 with Speculative Decoding (2x Speed)

Use a small draft model to predict tokens ahead, which the full model then verifies:

# Terminal 1: Gemma 4 with MTP drafter (2x faster!)
python -m mlx_vlm.server \
  --model mlx-community/gemma-4-31b-bf16 \
  --draft-model mlx-community/gemma-4-31B-it-assistant-bf16 \
  --draft-kind mtp \
  --host 0.0.0.0 \
  --port 8000

Note: Temperature must be 0 for byte-identical output with MTP.
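
To confirm the drafter isn't changing outputs, pin temperature to 0 and compare responses with and without the draft model (a sketch against the MLX server's OpenAI-compatible endpoint from step 3):

from openai import OpenAI

# Talk to the MLX server directly on port 8000.
mlx = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = mlx.chat.completions.create(
    model="mlx-community/gemma-4-31b-bf16",
    messages=[{"role": "user", "content": "List three uses of speculative decoding."}],
    temperature=0,  # required for byte-identical output with the MTP drafter
)
print(resp.choices[0].message.content)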


Multi-Model Setup (Run Multiple Locally)

Your 96GB can hold both Gemma 4 31B + a fast 7B model:

# Terminal 1: Gemma 4 31B for deep reasoning
python -m mlx_vlm.server \
  --model mlx-community/gemma-4-31b-bf16 \
  --host 0.0.0.0 --port 8000

# Terminal 2: Gemma 4 E4B for quick tasks
python -m mlx_vlm.server \
  --model mlx-community/gemma-4-e4b-it \
  --host 0.0.0.0 --port 8001

Then send quick tasks to port 8001 directly, complex ones to port 8000.
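
A tiny router on top of those two endpoints can pick the port per request (illustrative; the complex_task flag is just a stand-in for whatever heuristic you prefer):

from openai import OpenAI

# Two local MLX servers: the 31B for deep work, the E4B for quick answers.
BIG = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
SMALL = OpenAI(base_url="http://localhost:8001/v1", api_key="not-needed")

def ask(prompt, complex_task=False):
    client, model = (
        (BIG, "mlx-community/gemma-4-31b-bf16")
        if complex_task
        else (SMALL, "mlx-community/gemma-4-e4b-it")
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

print(ask("What port does Redis listen on by default?"))             # quick → E4B
print(ask("Design a caching layer for the API.", complex_task=True))  # deep → 31B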


Provider Selection Guide

| Task Type | Recommended Provider | Model |
|---|---|---|
| General reasoning | MLX local | gemma-4-31b-bf16 |
| Coding/debugging | Gemini | gemini-2.5-pro-preview |
| Fast Q&A | Cloudflare | @cf/google/gemma-4-26b-a4b-it |
| High throughput | NIM | llama-3.1-405b |
| Multimodal (image+text) | MLX local | gemma-4-26b-a4b-it |

Minimal Configuration (Just 3 Lines)

# .env - the bare minimum
CLOUDFLARE_API_KEY=sk-your-key
CLOUDFLARE_ACCOUNT_ID=your-account-id
MLX_ENABLED=true

Everything else auto-configures. Even without NIM or Gemini keys, Cloudflare + MLX gives you a robust setup.


Monitoring

# Check active provider and fallback status
curl http://localhost/v1/fallback/status | jq

# View all available models
curl http://localhost/v1/models | jq '.data[] | {id, owned_by}'

# Grafana: http://localhost:3000 (admin/admin)
#   - Dashboard: "ml-intern Production"
#   - Panels: provider latency, fallback count, cache hit rate

# Prometheus queries:
curl 'http://localhost:9090/api/v1/query?query=ml_intern_fallback_total'
curl 'http://localhost:9090/api/v1/query?query=ml_intern_circuit_breaker_state'
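
The same Prometheus queries can be scripted (a sketch using requests; the metric names are the ones from the curl examples above):

import requests

PROM = "http://localhost:9090/api/v1/query"

for metric in ("ml_intern_fallback_total", "ml_intern_circuit_breaker_state"):
    data = requests.get(PROM, params={"query": metric}, timeout=5).json()
    # Prometheus instant-query results: one (labels, [timestamp, value]) per series.
    for series in data["data"]["result"]:
        print(metric, series["metric"], series["value"][1])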

Troubleshooting

Gemma 4 download is slow

# Use huggingface-cli for resumable download
uv pip install huggingface-hub
huggingface-cli download mlx-community/gemma-4-31b-bf16 --local-dir ~/models/gemma-4-31b

MLX server says "out of memory"

# Try the smaller 26B model instead
python -m mlx_vlm.server --model mlx-community/gemma-4-26b-a4b-it-8bit --port 8000
# Or the tiny E4B:
python -m mlx_vlm.server --model mlx-community/gemma-4-e4b-it --port 8000

Docker can't reach MLX on host

# On macOS, host.docker.internal works
# On Linux, use your machine's IP:
MLX_API_BASE=http://192.168.1.5:8000/v1

Gemma 4 vs Claude Opus

| Capability | Gemma 4 31B-BF16 (MLX) | Claude Opus 4 |
|---|---|---|
| Context window | 128K tokens | 200K tokens |
| Multimodal | ✅ Image + Text | ✅ Image + Text |
| Code quality | ⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Best |
| Reasoning | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Best |
| Speed | ~6 tok/s (M2 Max) | Cloud-based |
| Cost | $0 (local) | ~$15/1M input tokens |
| Privacy | ✅ 100% local | ❌ Cloud |
| Offline | ✅ Works offline | ❌ Requires internet |

Verdict: For most tasks, Gemma 4 31B-BF16 on your M2 Pro Max is a genuine Claude Opus alternative. For edge cases where you need the absolute best, Gemini 2.5 Pro fills the gap.