# M2 Pro Max 96GB — Gemma 4 Setup Guide

Your machine is powerful enough to run **Gemma 4 31B-BF16 locally via MLX** — the best open alternative to Claude Opus. This guide sets up:

- **Primary**: NIM (cloud GPU)
- **Secondary**: Cloudflare Workers AI
- **Tertiary**: Google Gemini
- **Local**: Gemma 4 via MLX on Metal GPU

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                   MacBook M2 Pro Max 96GB                     │
│                                                               │
│  ┌──────────────┐      ┌──────────────────────────────┐      │
│  │  MLX Server  │      │  Docker (uv + API + Infra)   │      │
│  │ (Metal GPU)  │      │  ─────────────────────────   │      │
│  │  Port 8000   │◄─────│  • FastAPI                   │      │
│  │ Gemma 4 31B  │      │  • Redis                     │      │
│  │  ~65GB RAM   │      │  • Postgres                  │      │
│  └──────────────┘      │  • Nginx                     │      │
│         ▲              └──────────────────────────────┘      │
│         │                            │                       │
│  ┌──────┴──────┐     ┌──────────────────┐                    │
│  │  NIM Cloud  │────►│  Cloudflare AI   │                    │
│  │  (Primary)  │     │   (Secondary)    │                    │
│  └─────────────┘     └────────┬─────────┘                    │
│                               │                               │
│  ┌────────────────────────────┴─────────┐                    │
│  │      Google Gemini (Tertiary)        │                    │
│  │  Best for coding + reasoning tasks   │                    │
│  └──────────────────────────────────────┘                    │
│                                                               │
└──────────────────────────────────────────────────────────────┘
```

## Gemma 4 Model Recommendations

With 96GB unified memory, you have options:

| Model | RAM | Quality | Speed | Best For |
|-------|-----|---------|-------|----------|
| **gemma-4-31b-bf16** | ~65GB | ⭐⭐⭐⭐⭐ Highest | ~6 tok/s | Deep reasoning, code, complex tasks |
| **gemma-4-26b-a4b-it-bf16** | ~55GB | ⭐⭐⭐⭐⭐ Excellent | ~7 tok/s | General purpose, multimodal |
| **gemma-4-26b-a4b-it-8bit** | ~36GB | ⭐⭐⭐⭐ Great | ~12 tok/s | Fast inference with good quality |
| **gemma-4-e4b-it** | ~12GB | ⭐⭐⭐ Good | ~25 tok/s | Quick Q&A, simple tasks |

**Recommendation**: Start with `gemma-4-31b-bf16`. It's the best open alternative to Claude Opus and still leaves ~30GB for system + context.

---

## Quick Start (5 Minutes)

### 1. Install uv + MLX Server

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create MLX environment
mkdir ~/mlx-gemma4 && cd ~/mlx-gemma4
uv venv --python 3.11
source .venv/bin/activate

# Install mlx-vlm (supports Gemma 4)
uv pip install mlx-vlm
```

### 2. Download Gemma 4 Model

```bash
# Download 31B BF16 (best quality, ~58GB on disk)
# This fits in 96GB with room for context
python -c "
from mlx_vlm.utils import load
model, processor = load('mlx-community/gemma-4-31b-bf16')
print('Gemma 4 31B loaded successfully')
"

# Or the 26B variant (slightly smaller, still excellent)
# python -c "from mlx_vlm.utils import load; load('mlx-community/gemma-4-26b-a4b-it-bf16')"
```

### 3. Start MLX Server

```bash
# Terminal 1: Start Gemma 4 MLX server
# Uses Metal GPU automatically — no config needed
python -m mlx_vlm.server \
  --model mlx-community/gemma-4-31b-bf16 \
  --host 0.0.0.0 \
  --port 8000

# Verify
curl http://localhost:8000/v1/models
```
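Before wiring the server into Docker, you can optionally confirm generation end-to-end from Python. The sketch below uses the `openai` client (`uv pip install openai`) and assumes the MLX server exposes an OpenAI-compatible `/v1/chat/completions` route on port 8000 — the same assumption `MLX_API_BASE` makes in step 4. The model identifier is illustrative; use whatever `curl http://localhost:8000/v1/models` reports.

```python
# Optional sanity check against the local MLX server
# (assumes an OpenAI-compatible /v1/chat/completions endpoint on port 8000).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # local server; the client just requires a non-empty key
)

response = client.chat.completions.create(
    model="mlx-community/gemma-4-31b-bf16",  # match the id returned by /v1/models
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```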
### 4. Configure API Server

```bash
# In the ml-intern repo
cd production

# Copy minimal env
cp .env.minimal .env
```

Edit `.env` — **just add your API keys** (only a few lines):

```env
# Cloudflare (required — always works)
CLOUDFLARE_API_KEY=sk-your-cloudflare-key
CLOUDFLARE_ACCOUNT_ID=your-account-id

# NIM (optional — faster, free tier)
NVIDIA_API_KEY=nvapi-your-nvidia-key

# Gemini (optional — great coding/reasoning)
GEMINI_API_KEY=your-gemini-key

# Enable Gemma 4 local
MLX_ENABLED=true
MLX_API_BASE=http://host.docker.internal:8000/v1
```

### 5. Start the Stack

```bash
# Terminal 2: Launch API + infrastructure
docker-compose -f docker-compose.m2.yml up -d

# Verify
curl http://localhost/health | jq
curl http://localhost/v1/models | jq
curl http://localhost/v1/fallback/status | jq
```

### 6. Test Everything

```bash
# Test 1: Fallback status — see active provider
curl http://localhost/v1/fallback/status | jq

# Test 2: Chat via active provider (auto-fallback)
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cloudflare/@cf/google/gemma-4-26b-a4b-it",
    "messages": [{"role":"user","content":"Explain quantum computing"}]
  }'

# Test 3: Force Gemma 4 local via MLX
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx/gemma-4-31b-bf16",
    "messages": [{"role":"user","content":"Write a Python web scraper"}],
    "provider_override": "mlx"
  }'

# Test 4: Gemini for coding
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini/gemini-2.5-pro-preview",
    "messages": [{"role":"user","content":"Debug this code: def foo(): pass"}]
  }'
```

---

## Fallback Chain (Automatic)

```
Request comes in
        │
        ▼
┌────────────────┐
│   Check NIM    │◄── Circuit breaker CLOSED?
│   (Primary)    │    Yes → Send to NIM (fastest cloud)
└────────────────┘    No  → Fall through
        │
        ▼
┌────────────────┐
│Check Cloudflare│◄── Circuit breaker CLOSED?
│  (Secondary)   │    Yes → Send to Cloudflare
└────────────────┘    No  → Fall through
        │
        ▼
┌────────────────┐
│  Check Gemini  │◄── Circuit breaker CLOSED?
│   (Tertiary)   │    Yes → Send to Gemini (great for code)
└────────────────┘    No  → Fall through
        │
        ▼
┌────────────────┐
│   Check MLX    │◄── Enabled + Gemma 4 loaded?
│ (Local Gemma)  │    Yes → Send to local Gemma 4
└────────────────┘    No  → Return 503
```

Force any provider:

```bash
curl -X POST http://localhost/v1/chat/completions \
  -d '{"model":"mlx/gemma-4-31b-bf16","messages":[...],"provider_override":"mlx"}'
```

---

## Advanced: Gemma 4 with Speculative Decoding (2x Speed)

Use a small draft model to predict tokens ahead; the full model verifies them:

```bash
# Terminal 1: Gemma 4 with MTP drafter (2x faster!)
python -m mlx_vlm.server \
  --model mlx-community/gemma-4-31b-bf16 \
  --draft-model mlx-community/gemma-4-31B-it-assistant-bf16 \
  --draft-kind mtp \
  --host 0.0.0.0 \
  --port 8000
```

> **Note**: Temperature must be 0 for byte-identical output with MTP.

---

## Multi-Model Setup (Run Multiple Locally)

Your 96GB can hold **both** Gemma 4 31B and the fast E4B model at once:

```bash
# Terminal 1: Gemma 4 31B for deep reasoning
python -m mlx_vlm.server \
  --model mlx-community/gemma-4-31b-bf16 \
  --host 0.0.0.0 --port 8000

# Terminal 2: Gemma 4 E4B for quick tasks
python -m mlx_vlm.server \
  --model mlx-community/gemma-4-e4b-it \
  --host 0.0.0.0 --port 8001
```

Then send quick tasks directly to port 8001 and complex ones to port 8000.
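As a rough illustration of that split, here is a minimal routing sketch using the `openai` client. It assumes both servers speak the same OpenAI-compatible API; the helper name, the complexity flag, and the model identifiers are placeholders — check `/v1/models` on each port for the real ids.

```python
# Route prompts between the two local servers: E4B on 8001 for quick
# tasks, 31B on 8000 for deep reasoning. Names and model ids are
# illustrative — adjust them to what each server actually reports.
from openai import OpenAI

fast = OpenAI(base_url="http://localhost:8001/v1", api_key="not-needed")
deep = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(prompt: str, complex_task: bool = False) -> str:
    client, model = (
        (deep, "mlx-community/gemma-4-31b-bf16")
        if complex_task
        else (fast, "mlx-community/gemma-4-e4b-it")
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("What's the capital of France?"))                    # quick → port 8001
print(ask("Design a caching layer for this API: ...", True))   # complex → port 8000
```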
---

## Provider Selection Guide

| Task Type | Recommended Provider | Model |
|-----------|----------------------|-------|
| **General reasoning** | MLX local | `gemma-4-31b-bf16` |
| **Coding/debugging** | Gemini | `gemini-2.5-pro-preview` |
| **Fast Q&A** | Cloudflare | `@cf/google/gemma-4-26b-a4b-it` |
| **High throughput** | NIM | `llama-3.1-405b` |
| **Multimodal (image+text)** | MLX local | `gemma-4-26b-a4b-it` |

---

## Minimal Configuration (Just 3 Lines)

```env
# .env — the bare minimum
CLOUDFLARE_API_KEY=sk-your-key
CLOUDFLARE_ACCOUNT_ID=your-account-id
MLX_ENABLED=true
```

Everything else auto-configures. Even without NIM or Gemini keys, Cloudflare + MLX gives you a robust setup.

---

## Monitoring

```bash
# Check active provider and fallback status
curl http://localhost/v1/fallback/status | jq

# View all available models
curl http://localhost/v1/models | jq '.data[] | {id, owned_by}'

# Grafana: http://localhost:3000 (admin/admin)
# - Dashboard: "ml-intern Production"
# - Panels: provider latency, fallback count, cache hit rate

# Prometheus queries:
curl 'http://localhost:9090/api/v1/query?query=ml_intern_fallback_total'
curl 'http://localhost:9090/api/v1/query?query=ml_intern_circuit_breaker_state'
```

---

## Troubleshooting

### Gemma 4 download is slow

```bash
# Use huggingface-cli for resumable download
uv pip install huggingface-hub
huggingface-cli download mlx-community/gemma-4-31b-bf16 --local-dir ~/models/gemma-4-31b
```

### MLX server says "out of memory"

```bash
# Try the smaller 26B model instead
python -m mlx_vlm.server --model mlx-community/gemma-4-26b-a4b-it-8bit --port 8000

# Or the tiny E4B:
python -m mlx_vlm.server --model mlx-community/gemma-4-e4b-it --port 8000
```

### Docker can't reach MLX on host

```bash
# On macOS, host.docker.internal works
# On Linux, use your machine's IP:
MLX_API_BASE=http://192.168.1.5:8000/v1
```

---

## Gemma 4 vs Claude Opus

| Capability | Gemma 4 31B-BF16 (MLX) | Claude Opus 4 |
|------------|------------------------|---------------|
| **Context window** | 128K tokens | 200K tokens |
| **Multimodal** | ✅ Image + Text | ✅ Image + Text |
| **Code quality** | ⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Best |
| **Reasoning** | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Best |
| **Speed** | ~6 tok/s (M2 Max) | Cloud-based |
| **Cost** | $0 (local) | ~$15/1M input tokens |
| **Privacy** | ✅ 100% local | ❌ Cloud |
| **Offline** | ✅ Works offline | ❌ Requires internet |

**Verdict**: For most tasks, Gemma 4 31B-BF16 on your M2 Pro Max is a genuine Claude Opus alternative. For edge cases where you need the absolute best, Gemini 2.5 Pro fills the gap.