# M2 Pro Max 96GB: Gemma 4 Setup Guide

Your machine is powerful enough to run **Gemma 4 31B-BF16 locally via MLX**, the best open alternative to Claude Opus. This guide sets up:

- **Primary**: NIM (cloud GPU)
- **Secondary**: Cloudflare Workers AI
- **Tertiary**: Google Gemini
- **Local**: Gemma 4 via MLX on Metal GPU

## Architecture

```
                MacBook M2 Pro Max 96GB
                =======================

 +----------------+         +------------------------------+
 |  MLX Server    |         |  Docker (uv + API + Infra)   |
 |  (Metal GPU)   |         |   - FastAPI                  |
 |  Port 8000     | <-----> |   - Redis                    |
 |  Gemma 4 31B   |         |   - Postgres                 |
 |  ~65GB RAM     |         |   - Nginx                    |
 +----------------+         +------------------------------+
         ^
         |
 +-------+--------+         +-------------------+
 |  NIM Cloud     | ------> |  Cloudflare AI    |
 |  (Primary)     |         |  (Secondary)      |
 +----------------+         +---------+---------+
                                      |
 +------------------------------------+------------------+
 |  Google Gemini (Tertiary)                              |
 |  Best for coding + reasoning tasks                     |
 +--------------------------------------------------------+
```

## Gemma 4 Model Recommendations

With 96GB unified memory, you have options:

| Model | RAM | Quality | Speed | Best For |
|-------|-----|---------|-------|----------|
| **gemma-4-31b-bf16** | ~65GB | ★★★★★ Highest | ~6 tok/s | Deep reasoning, code, complex tasks |
| **gemma-4-26b-a4b-it-bf16** | ~55GB | ★★★★★ Excellent | ~7 tok/s | General purpose, multimodal |
| **gemma-4-26b-a4b-it-8bit** | ~36GB | ★★★★ Great | ~12 tok/s | Fast inference with good quality |
| **gemma-4-e4b-it** | ~12GB | ★★★ Good | ~25 tok/s | Quick Q&A, simple tasks |

**Recommendation**: Start with `gemma-4-31b-bf16`. It's the best open alternative to Claude Opus and still leaves ~30GB for system + context.
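
If you want to sanity-check headroom before committing to the ~58GB download, a rough check like the one below is enough. This is an optional sketch, not part of the stack; it assumes `psutil` is installed in the same venv (`uv pip install psutil`).

```python
# Rough memory check before pulling gemma-4-31b-bf16 (illustrative only).
import psutil

mem = psutil.virtual_memory()
total_gb = mem.total / 1024**3
available_gb = mem.available / 1024**3

NEEDED_GB = 65  # approximate resident size of the BF16 31B model, per the table above

print(f"Unified memory: {total_gb:.0f} GB total, {available_gb:.0f} GB available")
if available_gb < NEEDED_GB:
    print("Tight fit: consider gemma-4-26b-a4b-it-8bit or gemma-4-e4b-it instead.")
```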

---

## Quick Start (5 Minutes)

### 1. Install uv + MLX Server

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create MLX environment
mkdir ~/mlx-gemma4 && cd ~/mlx-gemma4
uv venv --python 3.11
source .venv/bin/activate

# Install mlx-vlm (supports Gemma 4)
uv pip install mlx-vlm
```

### 2. Download Gemma 4 Model

```bash
# Download 31B BF16 (best quality, ~58GB on disk)
# This fits in 96GB with room for context
python -c "
from mlx_vlm.utils import load
model, processor = load('mlx-community/gemma-4-31b-bf16')
print('Gemma 4 31B loaded successfully')
"

# Or the 26B variant (slightly smaller, still excellent)
# python -c "from mlx_vlm.utils import load; load('mlx-community/gemma-4-26b-a4b-it-bf16')"
```

### 3. Start MLX Server

```bash
# Terminal 1: Start the Gemma 4 MLX server
# Uses the Metal GPU automatically - no config needed
python -m mlx_vlm.server \
  --model mlx-community/gemma-4-31b-bf16 \
  --host 0.0.0.0 \
  --port 8000

# Verify
curl http://localhost:8000/v1/models
```
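
The `/v1/models` check above suggests an OpenAI-style surface, so you can also exercise the server from Python with the `openai` client (`uv pip install openai` in the same venv if it isn't already there). A minimal sketch, assuming the server exposes `/v1/chat/completions` in the same OpenAI-compatible way; the API key is a placeholder the local server doesn't validate:

```python
# Quick smoke test against the local MLX server (assumes an OpenAI-compatible endpoint).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mlx-community/gemma-4-31b-bf16",
    messages=[{"role": "user", "content": "Summarize MLX in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```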

### 4. Configure API Server

```bash
# In the ml-intern repo
cd production

# Copy minimal env
cp .env.minimal .env
```

Edit `.env` and **just add your API keys** (only 3-4 lines):

```env
# Cloudflare (required - always works)
CLOUDFLARE_API_KEY=sk-your-cloudflare-key
CLOUDFLARE_ACCOUNT_ID=your-account-id

# NIM (optional - faster, free tier)
NVIDIA_API_KEY=nvapi-your-nvidia-key

# Gemini (optional - great for coding/reasoning)
GEMINI_API_KEY=your-gemini-key

# Enable Gemma 4 local
MLX_ENABLED=true
MLX_API_BASE=http://host.docker.internal:8000/v1
```

### 5. Start the Stack

```bash
# Terminal 2: Launch API + infrastructure
docker-compose -f docker-compose.m2.yml up -d

# Verify
curl http://localhost/health | jq
curl http://localhost/v1/models | jq
curl http://localhost/v1/fallback/status | jq
```

### 6. Test Everything

```bash
# Test 1: Fallback status - see active provider
curl http://localhost/v1/fallback/status | jq

# Test 2: Chat via active provider (auto-fallback)
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cloudflare/@cf/google/gemma-4-26b-a4b-it",
    "messages": [{"role":"user","content":"Explain quantum computing"}]
  }'

# Test 3: Force Gemma 4 local via MLX
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx/gemma-4-31b-bf16",
    "messages": [{"role":"user","content":"Write a Python web scraper"}],
    "provider_override": "mlx"
  }'

# Test 4: Gemini for coding
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini/gemini-2.5-pro-preview",
    "messages": [{"role":"user","content":"Debug this code: def foo(): pass"}]
  }'
```

---

## Fallback Chain (Automatic)

```
Request comes in
         |
         v
+------------------+
|    Check NIM     |---- Circuit breaker CLOSED?
|    (Primary)     |       Yes -> Send to NIM (fastest cloud)
+------------------+       No  -> Fall through
         |
         v
+------------------+
| Check Cloudflare |---- Circuit breaker CLOSED?
|   (Secondary)    |       Yes -> Send to Cloudflare
+------------------+       No  -> Fall through
         |
         v
+------------------+
|   Check Gemini   |---- Circuit breaker CLOSED?
|    (Tertiary)    |       Yes -> Send to Gemini (great for code)
+------------------+       No  -> Fall through
         |
         v
+------------------+
|    Check MLX     |---- Enabled + Gemma 4 loaded?
|  (Local Gemma)   |       Yes -> Send to local Gemma 4
+------------------+       No  -> Return 503
```
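
As a mental model, the gateway's selection loop behaves roughly like the sketch below. This is illustrative only; `CircuitBreaker` and the `providers` list are stand-ins, not the actual classes or names in the repo.

```python
# Illustrative fallback loop: try providers in order, skipping open circuit breakers.
import time
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    failure_threshold: int = 3
    cooldown_s: float = 60.0
    failures: int = 0
    opened_at: float = 0.0

    def closed(self) -> bool:
        # The breaker re-closes once the cooldown has elapsed.
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return False
            self.failures = 0
        return True

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


def route(request, providers):
    """providers: ordered (name, breaker, send_fn) tuples - NIM, Cloudflare, Gemini, MLX."""
    for name, breaker, send in providers:
        if not breaker.closed():
            continue  # circuit open: fall through to the next provider
        try:
            return send(request)
        except Exception:
            breaker.record_failure()
    raise RuntimeError("503: all providers unavailable")
```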

Force any provider:
```bash
curl -X POST http://localhost/v1/chat/completions \
  -d '{"model":"mlx/gemma-4-31b-bf16","messages":[...],"provider_override":"mlx"}'
```

---

## Advanced: Gemma 4 with Speculative Decoding (2x Speed)

Use a small drafter model to predict tokens ahead, which the full model then verifies:

```bash
# Terminal 1: Gemma 4 with MTP drafter (2x faster!)
python -m mlx_vlm.server \
  --model mlx-community/gemma-4-31b-bf16 \
  --draft-model mlx-community/gemma-4-31B-it-assistant-bf16 \
  --draft-kind mtp \
  --host 0.0.0.0 \
  --port 8000
```

> **Note**: Temperature must be 0 for byte-identical output with MTP.
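
In practice that means pinning `temperature` to 0 on requests where you want the speculative path to be reproducible. A sketch against the same local endpoint, under the same OpenAI-compatibility assumption as above:

```python
# Deterministic request: temperature=0 keeps draft-verified output byte-identical.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="mlx-community/gemma-4-31b-bf16",
    messages=[{"role": "user", "content": "Name three uses of speculative decoding."}],
    temperature=0,
)
print(response.choices[0].message.content)
```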

---

## Multi-Model Setup (Run Multiple Locally)

Your 96GB can hold **both** Gemma 4 31B and the lightweight E4B model:

```bash
# Terminal 1: Gemma 4 31B for deep reasoning
python -m mlx_vlm.server \
  --model mlx-community/gemma-4-31b-bf16 \
  --host 0.0.0.0 --port 8000

# Terminal 2: Gemma 4 E4B for quick tasks
python -m mlx_vlm.server \
  --model mlx-community/gemma-4-e4b-it \
  --host 0.0.0.0 --port 8001
```

Then send quick tasks to port 8001 directly and complex ones to port 8000.
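
One way to split that traffic is a tiny dispatcher keyed on prompt length or an explicit flag. A sketch under the same assumptions as the earlier client example (OpenAI-compatible endpoints on both ports); the 500-character threshold is arbitrary:

```python
# Toy dispatcher: short prompts go to the E4B server on 8001, the rest to 31B on 8000.
from openai import OpenAI

FAST = OpenAI(base_url="http://localhost:8001/v1", api_key="not-needed")
DEEP = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")


def ask(prompt: str, force_deep: bool = False) -> str:
    use_deep = force_deep or len(prompt) > 500  # crude heuristic, tune to taste
    client = DEEP if use_deep else FAST
    model = ("mlx-community/gemma-4-31b-bf16" if use_deep
             else "mlx-community/gemma-4-e4b-it")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


print(ask("What's the capital of France?"))                    # fast path (8001)
print(ask("Refactor this module for readability: ...", True))  # deep path (8000)
```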

---

## Provider Selection Guide

| Task Type | Recommended Provider | Model |
|-----------|---------------------|-------|
| **General reasoning** | MLX local | `gemma-4-31b-bf16` |
| **Coding/debugging** | Gemini | `gemini-2.5-pro-preview` |
| **Fast Q&A** | Cloudflare | `@cf/google/gemma-4-26b-a4b-it` |
| **High throughput** | NIM | `llama-3.1-405b` |
| **Multimodal (image+text)** | MLX local | `gemma-4-26b-a4b-it` |
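
If you'd rather encode this table in client code than remember it, a small lookup works. The model strings follow the prefixed form used in the test requests earlier; the provider names other than `mlx` are assumptions here, so check them against the `provider_override` values your gateway actually accepts.

```python
# Illustrative task-to-model routing table (mirrors the table above).
ROUTING = {
    "reasoning":  ("mlx",        "mlx/gemma-4-31b-bf16"),
    "coding":     ("gemini",     "gemini/gemini-2.5-pro-preview"),
    "fast_qa":    ("cloudflare", "cloudflare/@cf/google/gemma-4-26b-a4b-it"),
    "throughput": ("nim",        "llama-3.1-405b"),
    "multimodal": ("mlx",        "mlx/gemma-4-26b-a4b-it"),
}


def request_body(task: str, prompt: str) -> dict:
    provider, model = ROUTING[task]
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "provider_override": provider,
    }


print(request_body("coding", "Debug this code: def foo(): pass"))
```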

---

## Minimal Configuration (Just 3 Lines)

```bash
# .env - the bare minimum
CLOUDFLARE_API_KEY=sk-your-key
CLOUDFLARE_ACCOUNT_ID=your-account-id
MLX_ENABLED=true
```

Everything else auto-configures. Even without NIM or Gemini keys, Cloudflare + MLX gives you a robust setup.

---

## Monitoring

```bash
# Check active provider and fallback status
curl http://localhost/v1/fallback/status | jq

# View all available models
curl http://localhost/v1/models | jq '.data[] | {id, owned_by}'

# Grafana: http://localhost:3000 (admin/admin)
# - Dashboard: "ml-intern Production"
# - Panels: provider latency, fallback count, cache hit rate

# Prometheus queries:
curl 'http://localhost:9090/api/v1/query?query=ml_intern_fallback_total'
curl 'http://localhost:9090/api/v1/query?query=ml_intern_circuit_breaker_state'
```
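
For a quick terminal-side watch while you test failover, a small stdlib-only poller of the fallback endpoint is enough; it just pretty-prints whatever JSON the API returns.

```python
# Poll the fallback status every 10 seconds and pretty-print the payload.
import json
import time
import urllib.request

while True:
    with urllib.request.urlopen("http://localhost/v1/fallback/status") as resp:
        status = json.load(resp)
    print(json.dumps(status, indent=2))
    time.sleep(10)
```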

---

## Troubleshooting

### Gemma 4 download is slow

```bash
# Use huggingface-cli for resumable download
uv pip install huggingface-hub
huggingface-cli download mlx-community/gemma-4-31b-bf16 --local-dir ~/models/gemma-4-31b
```

### MLX server says "out of memory"

```bash
# Try the smaller 26B model instead
python -m mlx_vlm.server --model mlx-community/gemma-4-26b-a4b-it-8bit --port 8000
# Or the tiny E4B:
python -m mlx_vlm.server --model mlx-community/gemma-4-e4b-it --port 8000
```

### Docker can't reach MLX on host

```bash
# On macOS, host.docker.internal works
# On Linux, use your machine's IP:
MLX_API_BASE=http://192.168.1.5:8000/v1
```

---

## Gemma 4 vs Claude Opus

| Capability | Gemma 4 31B-BF16 (MLX) | Claude Opus 4 |
|-----------|------------------------|---------------|
| **Context window** | 128K tokens | 200K tokens |
| **Multimodal** | ✅ Image + Text | ✅ Image + Text |
| **Code quality** | ★★★★ Excellent | ★★★★★ Best |
| **Reasoning** | ★★★★★ Excellent | ★★★★★ Best |
| **Speed** | ~6 tok/s (M2 Max) | Cloud-based |
| **Cost** | $0 (local) | ~$15/1M input tokens |
| **Privacy** | ✅ 100% local | ❌ Cloud |
| **Offline** | ✅ Works offline | ❌ Requires internet |

**Verdict**: For most tasks, Gemma 4 31B-BF16 on your M2 Pro Max is a genuine Claude Opus alternative. For edge cases where you need the absolute best, Gemini 2.5 Pro fills the gap.