raazkumar committed
Commit
30f7cdb
· verified ·
1 Parent(s): 42855cf

Upload production/M2_PRO_MAX_GUIDE.md

Files changed (1)
  1. production/M2_PRO_MAX_GUIDE.md +199 -201
production/M2_PRO_MAX_GUIDE.md CHANGED
@@ -1,92 +1,99 @@
- # M2 Pro Max 96GB — Setup Guide

- Your machine is powerful enough to run **70B models locally via MLX** while using **NIM (cloud) as primary** and **Cloudflare Workers AI as automatic fallback**.

- ## Architecture for Your Setup

  ```
  ┌──────────────────────────────────────────────────────────────┐
  │                    MacBook M2 Pro Max 96GB                   │
  │                                                              │
- │  ┌──────────────────┐      ┌──────────────────────────┐      │
- │  │    MLX Server    │      │   Docker (API + Infra)   │      │
- │  │   (Metal GPU)    │      │  ─────────────────────   │      │
- │  │  ─────────────   │      │  • FastAPI server        │      │
- │  │   Port :8000     │◄─────│  • Redis cache           │      │
- │  │   70B models     │      │  • Postgres DB           │      │
- │  │   48GB RAM use   │      │  • Nginx LB              │      │
- │  └──────────────────┘      └──────────────────────────┘      │
- │         ▲                                                    │
  │         │                                                    │
- │  ┌──────┴──────┐       ┌─────────────────┐                   │
- │  │  NIM Cloud  │──────►│  Cloudflare AI  │                   │
- │  │  (Primary)  │       │ (Auto Fallback) │                   │
- │  └─────────────┘       └─────────────────┘                   │
  │                                                              │
  └──────────────────────────────────────────────────────────────┘
  ```

- ## Quick Start

- ### 1. Install Prerequisites

- ```bash
- # Install uv (fast Python package manager)
- curl -LsSf https://astral.sh/uv/install.sh | sh
-
- # Install Homebrew packages
- brew install redis postgresql docker
-
- # Start services
- brew services start redis
- brew services start postgresql
-
- # Install Docker Desktop for Mac (if not already)
- # https://www.docker.com/products/docker-desktop
- ```

- ### 2. Install MLX Server (Native on macOS)

  ```bash
- # Create a dedicated venv for MLX
- mkdir ~/mlx-server && cd ~/mlx-server
  uv venv --python 3.11
  source .venv/bin/activate

- # Install MLX LM
- uv pip install mlx-lm
-
- # Download a 70B model (takes ~40GB, fits in 96GB)
- # Option A: llama-3.1-70B (best quality)
- python -c "
- from mlx_lm import load
- load('mlx-community/Meta-Llama-3.1-70B-Instruct-4bit')
- "
-
- # Option B: Mistral-7B (faster, less memory)
  python -c "
- from mlx_lm import load
- load('mlx-community/Mistral-7B-Instruct-v0.3-4bit')
  "
  ```

  ### 3. Start MLX Server

  ```bash
- # Terminal 1: Start MLX server (uses Metal GPU automatically)
- mlx_lm.server \
-   --model mlx-community/Meta-Llama-3.1-70B-Instruct-4bit \
    --host 0.0.0.0 \
    --port 8000

- # OR for 7B (faster, less RAM):
- # mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --host 0.0.0.0 --port 8000
-
- # Verify it's running
- curl http://localhost:8000/v1/models
- curl -X POST http://localhost:8000/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{"model":"local","messages":[{"role":"user","content":"Hello"}]}'
  ```

  ### 4. Configure API Server
@@ -95,55 +102,75 @@ curl -X POST http://localhost:8000/v1/chat/completions \
  # In the ml-intern repo
  cd production

- # Create minimal .env (ONLY these 4-5 lines needed)
- cat > .env << 'EOF'
- # REQUIRED — Cloudflare fallback
- CLOUDFLARE_API_KEY=your_cloudflare_api_key_here
- CLOUDFLARE_ACCOUNT_ID=your_account_id_here

- # OPTIONAL — NIM primary (if you have API key)
- NVIDIA_API_KEY=your_nvidia_api_key_here

- # Point MLX to your local server
  MLX_API_BASE=http://host.docker.internal:8000/v1
- EOF
  ```

  ### 5. Start the Stack

  ```bash
- # Start Redis, Postgres, API, Workers, Nginx
  docker-compose -f docker-compose.m2.yml up -d

- # Verify everything
  curl http://localhost/health | jq
  ```

- ### 6. Test the Full Pipeline

  ```bash
- # Test 1: NIM primary (if API key set)
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
-   -d '{"model":"nim/llama-3.1-8b-instruct","messages":[{"role":"user","content":"Hello from NIM"}]}'

- # Test 2: Cloudflare fallback
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
-   -d '{"model":"cloudflare/@cf/meta/llama-3.1-8b-instruct","messages":[{"role":"user","content":"Hello from Cloudflare"}]}'

- # Test 3: MLX local (bypasses fallback)
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
-   -d '{"model":"mlx/llama-3.1-70b","messages":[{"role":"user","content":"Hello from MLX"}]}'
-
- # Test 4: Check which provider is active
- curl http://localhost/v1/fallback/status | jq
  ```

  ---

- ## Fallback Behavior

  ```
  Request comes in
@@ -151,185 +178,156 @@ Request comes in
         ▼
  ┌──────────────────┐
  │    Check NIM     │◄── Circuit breaker CLOSED?
- │    (Primary)     │    Yes → Send to NIM
  └──────────────────┘    No → Fall through
         │
-        ▼ (NIM down or rate limited)
  ┌──────────────────┐
- │ Check Cloudflare │◄── Circuit breaker CLOSED?
- │    (Fallback)    │    Yes → Send to Cloudflare
  └──────────────────┘    No → Fall through
         │
-        ▼ (Both down)
  ┌──────────────────┐
- │    Check MLX     │◄── Enabled + Circuit CLOSED?
- │     (Local)      │    Yes → Send to MLX
  └──────────────────┘    No → Return 503
  ```

- **You can force a provider** with `provider_override`:
-
  ```bash
- # Always use MLX regardless of NIM status
  curl -X POST http://localhost/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{
-     "model": "mlx/llama-3.1-70b",
-     "messages": [{"role":"user","content":"Hello"}],
-     "provider_override": "mlx"
-   }'
  ```

  ---

- ## Running MLX + API Together (All Native, No Docker)

- If you prefer running everything natively without Docker:

- ### Terminal 1: MLX Server
  ```bash
- cd ~/mlx-server
- source .venv/bin/activate
- mlx_lm.server --model mlx-community/Meta-Llama-3.1-70B-Instruct-4bit --port 8000
  ```

- ### Terminal 2: Redis
- ```bash
- redis-server
- ```

- ### Terminal 3: PostgreSQL
- ```bash
- # If not using Docker Postgres
- initdb /usr/local/var/postgres
- pg_ctl -D /usr/local/var/postgres start
-
- # Or just use Docker for Postgres only:
- docker run -d --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=ml_intern postgres:16-alpine
- ```

- ### Terminal 4: API Server
- ```bash
- cd production
- uv sync
-
- # Set only required env vars
- export REDIS_URL=redis://localhost:6379
- export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern
- export MLX_ENABLED=true
- export MLX_API_BASE=http://localhost:8000/v1
- export CLOUDFLARE_API_KEY=your_key
- export CLOUDFLARE_ACCOUNT_ID=your_account_id
- export NVIDIA_API_KEY=your_nvidia_key
- export FALLBACK_ENABLED=true
- export FALLBACK_PRIMARY=nim
- export FALLBACK_SECONDARY=cloudflare
-
- # Run with uv
- uv run python -m production_server
- ```

- ### Terminal 5: Worker
  ```bash
- cd production
- uv run python -m worker
  ```

- ---
-
- ## Performance Tips for M2 Pro Max
-
- | Setting      | 7B Model  | 70B Model         |
- |--------------|-----------|-------------------|
- | RAM Usage    | ~8GB      | ~48GB             |
- | Tokens/sec   | ~40 tok/s | ~8 tok/s          |
- | Startup Time | 2s        | 20s               |
- | Best For     | Fast Q&A  | Complex reasoning |

- ### Use Multiple MLX Models
-
- Your 96GB can hold **both** 7B and 70B simultaneously:
-
- ```bash
- # Terminal A: 70B for complex tasks
- mlx_lm.server --model mlx-community/Meta-Llama-3.1-70B-Instruct-4bit --port 8000
-
- # Terminal B: 7B for quick tasks
- mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --port 8001
- ```

- Then in `.env`:
- ```env
- MLX_API_BASE=http://host.docker.internal:8000/v1  # 70B default
- ```

- And send quick requests to port 8001 directly.

  ---

- ## Minimal Configuration (Just 5 Lines)

  ```bash
- # .env — ONLY these are required
  CLOUDFLARE_API_KEY=sk-your-key
  CLOUDFLARE_ACCOUNT_ID=your-account-id
- NVIDIA_API_KEY=nvapi-your-key                     # optional, enables NIM primary
- MLX_API_BASE=http://host.docker.internal:8000/v1  # optional, enables local fallback
  ```

- Everything else uses sensible defaults:
- - `FALLBACK_ENABLED=true` (default)
- - `FALLBACK_PRIMARY=nim` (default)
- - `FALLBACK_SECONDARY=cloudflare` (default)
- - `DEFAULT_RPM_LIMIT=40` (NIM free tier)
- - `CACHE_TTL_SECONDS=300` (5 min cache)

  ---

- ## What If NIM is Down?

  ```bash
- # Simulate NIM failure — circuit breaker will open after 3 failures
- for i in {1..3}; do
-   curl -X POST http://localhost/v1/chat/completions \
-     -H "Content-Type: application/json" \
-     -d '{"model":"nim/llama-3.1-8b","messages":[{"role":"user","content":"test"}]}'
- done
-
- # Now check status
  curl http://localhost/v1/fallback/status | jq
- # → "nim": "open", "cloudflare": "closed", "active_provider": "cloudflare"

- # Next request automatically goes to Cloudflare
- curl -X POST http://localhost/v1/chat/completions \
-   -H "Content-Type: application/json" \
-   -d '{"model":"cloudflare/@cf/meta/llama-3.1-8b-instruct","messages":[{"role":"user","content":"This goes to Cloudflare"}]}'

- # After 60 seconds (default recovery), NIM circuit half-opens
- curl http://localhost/v1/fallback/status | jq
- # → "nim": "half-open"
  ```

  ---

- ## Monitoring Your Setup

  ```bash
- # Check which provider handled each request
- curl http://localhost/v1/fallback/status | jq

- # Grafana: http://localhost:3000 (admin/admin)
- # - Dashboard: "ml-intern Production"
- # - Panels: fallback count, provider latency, cache hit rate

- # Prometheus: http://localhost:9090
- # Query: ml_intern_fallback_total
- # Query: ml_intern_circuit_breaker_state

- # Redis cache stats
- redis-cli info stats | grep keyspace

- # Postgres request log
- psql postgresql://ml_intern:ml_intern@localhost/ml_intern \
-   -c "SELECT provider, COUNT(*) FROM requests GROUP BY provider;"
- ```
+ # M2 Pro Max 96GB — Gemma 4 Setup Guide

+ Your machine is powerful enough to run **Gemma 4 31B-BF16 locally via MLX**, the best open alternative to Claude Opus. This guide sets up:

+ - **Primary**: NIM (cloud GPU)
+ - **Secondary**: Cloudflare Workers AI
+ - **Tertiary**: Google Gemini
+ - **Local**: Gemma 4 via MLX on Metal GPU
+
+ ## Architecture

  ```
  ┌──────────────────────────────────────────────────────────────┐
  │                    MacBook M2 Pro Max 96GB                   │
  │                                                              │
+ │  ┌──────────────┐      ┌──────────────────────────────┐      │
+ │  │  MLX Server  │      │  Docker (uv + API + Infra)   │      │
+ │  │ (Metal GPU)  │      │  ──────────────────────────  │      │
+ │  │  Port 8000   │◄─────│  • FastAPI                   │      │
+ │  │ Gemma 4 31B  │      │  • Redis                     │      │
+ │  │  ~65GB RAM   │      │  • Postgres                  │      │
+ │  └──────────────┘      │  • Nginx                     │      │
+ │         ▲              └──────────────────────────────┘      │
  │         │                                                    │
+ │  ┌──────┴──────┐       ┌──────────────────┐                  │
+ │  │  NIM Cloud  │──────►│  Cloudflare AI   │                  │
+ │  │  (Primary)  │       │   (Secondary)    │                  │
+ │  └─────────────┘       └────────┬─────────┘                  │
+ │                                 │                            │
+ │  ┌──────────────────────────────┴──────────┐                 │
+ │  │       Google Gemini (Tertiary)          │                 │
+ │  │   Best for coding + reasoning tasks     │                 │
+ │  └─────────────────────────────────────────┘                 │
  │                                                              │
  └──────────────────────────────────────────────────────────────┘
  ```

+ ## Gemma 4 Model Recommendations

+ With 96GB unified memory, you have options:

+ | Model | RAM | Quality | Speed | Best For |
+ |-------|-----|---------|-------|----------|
+ | **gemma-4-31b-bf16** | ~65GB | ⭐⭐⭐⭐⭐ Highest | ~6 tok/s | Deep reasoning, code, complex tasks |
+ | **gemma-4-26b-a4b-it-bf16** | ~55GB | ⭐⭐⭐⭐⭐ Excellent | ~7 tok/s | General purpose, multimodal |
+ | **gemma-4-26b-a4b-it-8bit** | ~36GB | ⭐⭐⭐⭐ Great | ~12 tok/s | Fast inference with good quality |
+ | **gemma-4-e4b-it** | ~12GB | ⭐⭐⭐ Good | ~25 tok/s | Quick Q&A, simple tasks |

+ **Recommendation**: Start with `gemma-4-31b-bf16`. It's the best open alternative to Claude Opus and still leaves ~30GB for system + context.
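
+ As a rough sanity check on the RAM column: weight memory is approximately parameter count × bytes per parameter, before KV cache and runtime overhead. A minimal sketch (the figures are the table's own, not measurements):
+
+ ```python
+ # Approximate weight memory for the models above (lower bounds:
+ # KV cache and Metal runtime overhead come on top of these).
+ models = {
+     "gemma-4-31b-bf16": (31e9, 2.0),         # BF16 = 2 bytes/param
+     "gemma-4-26b-a4b-it-bf16": (26e9, 2.0),
+     "gemma-4-26b-a4b-it-8bit": (26e9, 1.0),  # 8-bit = 1 byte/param
+ }
+ for name, (params, bytes_per_param) in models.items():
+     print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB weights")
+ ```
+
+ The 31B model at BF16 works out to ~62GB of weights, consistent with the ~65GB figure once context and overhead are added.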
 
+ ---

+ ## Quick Start (5 Minutes)

+ ### 1. Install uv + MLX Server

  ```bash
+ # Install uv
+ curl -LsSf https://astral.sh/uv/install.sh | sh
+
+ # Create MLX environment
+ mkdir ~/mlx-gemma4 && cd ~/mlx-gemma4
  uv venv --python 3.11
  source .venv/bin/activate

+ # Install mlx-vlm (supports Gemma 4)
+ uv pip install mlx-vlm
+ ```

+ ### 2. Download Gemma 4 Model

+ ```bash
+ # Download 31B BF16 (best quality, ~58GB on disk)
+ # This fits in 96GB with room for context
  python -c "
+ from mlx_vlm.utils import load
+ model, processor = load('mlx-community/gemma-4-31b-bf16')
+ print('Gemma 4 31B loaded successfully')
  "
+
+ # Or the 26B variant (slightly smaller, still excellent)
+ # python -c "from mlx_vlm.utils import load; load('mlx-community/gemma-4-26b-a4b-it-bf16')"
  ```

  ### 3. Start MLX Server

  ```bash
+ # Terminal 1: Start Gemma 4 MLX server
+ # Uses Metal GPU automatically — no config needed
+ python -m mlx_vlm.server \
+   --model mlx-community/gemma-4-31b-bf16 \
    --host 0.0.0.0 \
    --port 8000

+ # Verify
+ curl http://localhost:8000/v1/models
  ```
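
+ If you'd rather verify from Python, here is a minimal smoke test. It assumes only what the guide already states, that the MLX server exposes an OpenAI-compatible `/v1/chat/completions` route on port 8000 (the `requests` package is an extra dependency):
+
+ ```python
+ # Smoke-test the local MLX server via its OpenAI-compatible API.
+ import requests
+
+ resp = requests.post(
+     "http://localhost:8000/v1/chat/completions",
+     json={
+         "model": "mlx-community/gemma-4-31b-bf16",
+         "messages": [{"role": "user", "content": "Say hello in one sentence."}],
+         "max_tokens": 64,
+     },
+     timeout=120,
+ )
+ resp.raise_for_status()
+ print(resp.json()["choices"][0]["message"]["content"])
+ ```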

  ### 4. Configure API Server

  ```bash
  # In the ml-intern repo
  cd production

+ # Copy minimal env
+ cp .env.minimal .env
+ ```
+
+ Edit `.env` — **just add your API keys** (only 3-4 lines):

+ ```env
+ # Cloudflare (required — always works)
+ CLOUDFLARE_API_KEY=sk-your-cloudflare-key
+ CLOUDFLARE_ACCOUNT_ID=your-account-id
+
+ # NIM (optional — faster, free tier)
+ NVIDIA_API_KEY=nvapi-your-nvidia-key

+ # Gemini (optional — great coding/reasoning)
+ GEMINI_API_KEY=your-gemini-key
+
+ # Enable Gemma 4 local
+ MLX_ENABLED=true
  MLX_API_BASE=http://host.docker.internal:8000/v1
  ```

  ### 5. Start the Stack

  ```bash
+ # Terminal 2: Launch API + infrastructure
  docker-compose -f docker-compose.m2.yml up -d

+ # Verify
  curl http://localhost/health | jq
+ curl http://localhost/v1/models | jq
+ curl http://localhost/v1/fallback/status | jq
  ```

+ ### 6. Test Everything

  ```bash
+ # Test 1: Fallback status — see active provider
+ curl http://localhost/v1/fallback/status | jq
+
+ # Test 2: Chat via active provider (auto-fallback)
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
+   -d '{
+     "model": "cloudflare/@cf/google/gemma-4-26b-a4b-it",
+     "messages": [{"role":"user","content":"Explain quantum computing"}]
+   }'

+ # Test 3: Force Gemma 4 local via MLX
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
+   -d '{
+     "model": "mlx/gemma-4-31b-bf16",
+     "messages": [{"role":"user","content":"Write a Python web scraper"}],
+     "provider_override": "mlx"
+   }'

+ # Test 4: Gemini for coding
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
+   -d '{
+     "model": "gemini/gemini-2.5-pro-preview",
+     "messages": [{"role":"user","content":"Debug this code: def foo(): pass"}]
+   }'
  ```
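
+ The same checks can be scripted; a small sketch (the model names are the ones used in the tests above; adjust to whatever your gateway actually exposes):
+
+ ```python
+ # Ping each provider through the gateway and report status + latency.
+ import time
+ import requests
+
+ MODELS = [
+     "cloudflare/@cf/google/gemma-4-26b-a4b-it",
+     "mlx/gemma-4-31b-bf16",
+     "gemini/gemini-2.5-pro-preview",
+ ]
+ for model in MODELS:
+     start = time.time()
+     r = requests.post(
+         "http://localhost/v1/chat/completions",
+         json={"model": model, "messages": [{"role": "user", "content": "ping"}]},
+         timeout=120,
+     )
+     print(f"{model}: HTTP {r.status_code} in {time.time() - start:.1f}s")
+ ```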

  ---

+ ## Fallback Chain (Automatic)

  ```
  Request comes in
         │
         ▼
  ┌──────────────────┐
  │    Check NIM     │◄── Circuit breaker CLOSED?
+ │    (Primary)     │    Yes → Send to NIM (fastest cloud)
  └──────────────────┘    No → Fall through
         │
+        ▼
+ ┌──────────────────┐
+ │ Check Cloudflare │◄── Circuit breaker CLOSED?
+ │   (Secondary)    │    Yes → Send to Cloudflare
+ └──────────────────┘    No → Fall through
+        │
+        ▼
  ┌──────────────────┐
+ │   Check Gemini   │◄── Circuit breaker CLOSED?
+ │    (Tertiary)    │    Yes → Send to Gemini (great for code)
  └──────────────────┘    No → Fall through
         │
+        ▼
  ┌──────────────────┐
+ │    Check MLX     │◄── Enabled + Gemma 4 loaded?
+ │  (Local Gemma)   │    Yes → Send to local Gemma 4
  └──────────────────┘    No → Return 503
  ```

+ Force any provider:

  ```bash
  curl -X POST http://localhost/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{"model":"mlx/gemma-4-31b-bf16","messages":[...],"provider_override":"mlx"}'
  ```
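
+ The chain above is ordinary circuit-breaker routing; a conceptual sketch of the decision loop (illustration only, not the gateway's actual code):
+
+ ```python
+ # Try providers in order; skip any whose breaker is open, return the
+ # first success, and count failures so a flaky provider trips open.
+ FAILURE_THRESHOLD = 3
+ failures = {"nim": 0, "cloudflare": 0, "gemini": 0, "mlx": 0}
+
+ def route(request, providers):
+     """providers: ordered list of (name, send_fn) pairs."""
+     for name, send in providers:
+         if failures[name] >= FAILURE_THRESHOLD:
+             continue  # circuit OPEN: skip and fall through
+         try:
+             return name, send(request)
+         except Exception:
+             failures[name] += 1  # repeated failures open the circuit
+     raise RuntimeError("503: all providers unavailable")
+ ```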

  ---

+ ## Advanced: Gemma 4 with Speculative Decoding (2x Speed)

+ Use a small drafter model to propose tokens ahead of time, which the full model then verifies:

  ```bash
+ # Terminal 1: Gemma 4 with MTP drafter (2x faster!)
+ python -m mlx_vlm.server \
+   --model mlx-community/gemma-4-31b-bf16 \
+   --draft-model mlx-community/gemma-4-31B-it-assistant-bf16 \
+   --draft-kind mtp \
+   --host 0.0.0.0 \
+   --port 8000
  ```

+ > **Note**: Temperature must be 0 for byte-identical output with MTP.
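
+ In sketch form, temperature-0 speculative decoding works like this (a conceptual illustration; the real MTP drafter lives behind the `--draft-kind mtp` flag above, and verifies all draft positions in a single batched forward pass):
+
+ ```python
+ # Draft k tokens with the cheap model, keep the longest prefix the
+ # full model agrees with, and always bank one verified token.
+ def speculative_step(target_next, draft_next, prefix, k=4):
+     # 1) Cheap drafter proposes k tokens autoregressively.
+     ctx = list(prefix)
+     drafted = []
+     for _ in range(k):
+         tok = draft_next(ctx)
+         drafted.append(tok)
+         ctx.append(tok)
+     # 2) Full model re-checks the draft; keep the agreeing prefix.
+     ctx = list(prefix)
+     accepted = []
+     for tok in drafted:
+         if target_next(ctx) != tok:
+             break
+         accepted.append(tok)
+         ctx.append(tok)
+     # 3) The full model always contributes one guaranteed token.
+     accepted.append(target_next(ctx))
+     return accepted
+ ```
+
+ At temperature 0 both models are deterministic, which is why the note above holds: acceptance is an exact token match, so the output is byte-identical to running the full model alone.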

+ ---

+ ## Multi-Model Setup (Run Multiple Locally)

+ Your 96GB can hold **both** Gemma 4 31B and the fast E4B model:

  ```bash
+ # Terminal 1: Gemma 4 31B for deep reasoning
+ python -m mlx_vlm.server \
+   --model mlx-community/gemma-4-31b-bf16 \
+   --host 0.0.0.0 --port 8000
+
+ # Terminal 2: Gemma 4 E4B for quick tasks
+ python -m mlx_vlm.server \
+   --model mlx-community/gemma-4-e4b-it \
+   --host 0.0.0.0 --port 8001
  ```

+ Then send quick tasks to port 8001 directly, complex ones to port 8000.
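
+ A tiny client-side router makes that split automatic; a sketch (the difficulty heuristic is a placeholder, not part of the stack):
+
+ ```python
+ # Route prompts to the right local server by rough difficulty.
+ import requests
+
+ def ask(prompt: str) -> str:
+     hard = len(prompt) > 500              # placeholder difficulty heuristic
+     port, model = ((8000, "mlx-community/gemma-4-31b-bf16") if hard
+                    else (8001, "mlx-community/gemma-4-e4b-it"))
+     r = requests.post(
+         f"http://localhost:{port}/v1/chat/completions",
+         json={"model": model, "messages": [{"role": "user", "content": prompt}]},
+         timeout=300,
+     )
+     return r.json()["choices"][0]["message"]["content"]
+ ```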

+ ---

+ ## Provider Selection Guide

+ | Task Type | Recommended Provider | Model |
+ |-----------|---------------------|-------|
+ | **General reasoning** | MLX local | `gemma-4-31b-bf16` |
+ | **Coding/debugging** | Gemini | `gemini-2.5-pro-preview` |
+ | **Fast Q&A** | Cloudflare | `@cf/google/gemma-4-26b-a4b-it` |
+ | **High throughput** | NIM | `llama-3.1-405b` |
+ | **Multimodal (image+text)** | MLX local | `gemma-4-26b-a4b-it` |
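
+ In code, the table collapses to a lookup that builds the request body (a sketch; the task labels are informal and the model IDs are the ones from the table):
+
+ ```python
+ # Map a task label to (provider_override, model) per the table above.
+ ROUTES = {
+     "general":    ("mlx",        "mlx/gemma-4-31b-bf16"),
+     "coding":     ("gemini",     "gemini/gemini-2.5-pro-preview"),
+     "fast_qa":    ("cloudflare", "cloudflare/@cf/google/gemma-4-26b-a4b-it"),
+     "throughput": ("nim",        "nim/llama-3.1-405b"),
+     "multimodal": ("mlx",        "mlx/gemma-4-26b-a4b-it"),
+ }
+
+ def payload(task: str, prompt: str) -> dict:
+     provider, model = ROUTES[task]
+     return {
+         "model": model,
+         "messages": [{"role": "user", "content": prompt}],
+         "provider_override": provider,
+     }
+ ```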

  ---

+ ## Minimal Configuration (Just 3 Lines)

  ```bash
+ # .env — the bare minimum
  CLOUDFLARE_API_KEY=sk-your-key
  CLOUDFLARE_ACCOUNT_ID=your-account-id
+ MLX_ENABLED=true
  ```

+ Everything else auto-configures. Even without NIM or Gemini keys, Cloudflare + MLX gives you a robust setup.

  ---

+ ## Monitoring

  ```bash
+ # Check active provider and fallback status
  curl http://localhost/v1/fallback/status | jq

+ # View all available models
+ curl http://localhost/v1/models | jq '.data[] | {id, owned_by}'

+ # Grafana: http://localhost:3000 (admin/admin)
+ # - Dashboard: "ml-intern Production"
+ # - Panels: provider latency, fallback count, cache hit rate
+
+ # Prometheus queries:
+ curl 'http://localhost:9090/api/v1/query?query=ml_intern_fallback_total'
+ curl 'http://localhost:9090/api/v1/query?query=ml_intern_circuit_breaker_state'
  ```
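
+ For continuous watching rather than one-off curls, a small poller works; a sketch (it assumes only that `/v1/fallback/status` returns JSON, as shown above):
+
+ ```python
+ # Print circuit-breaker status whenever it changes.
+ import time
+ import requests
+
+ previous = None
+ while True:
+     status = requests.get("http://localhost/v1/fallback/status", timeout=10).json()
+     if status != previous:          # log only on change
+         print(time.strftime("%H:%M:%S"), status)
+         previous = status
+     time.sleep(5)
+ ```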

  ---

+ ## Troubleshooting

+ ### Gemma 4 download is slow
  ```bash
+ # Use huggingface-cli for a resumable download
+ uv pip install huggingface-hub
+ huggingface-cli download mlx-community/gemma-4-31b-bf16 --local-dir ~/models/gemma-4-31b
+ ```

+ ### MLX server says "out of memory"
+ ```bash
+ # Try the smaller 26B model instead
+ python -m mlx_vlm.server --model mlx-community/gemma-4-26b-a4b-it-8bit --port 8000
+ # Or the tiny E4B:
+ python -m mlx_vlm.server --model mlx-community/gemma-4-e4b-it --port 8000
+ ```
+
+ ### Docker can't reach MLX on host
+ ```bash
+ # On macOS, host.docker.internal works out of the box.
+ # On Linux, use your machine's IP instead:
+ MLX_API_BASE=http://192.168.1.5:8000/v1
+ ```

+ ---

+ ## Gemma 4 vs Claude Opus

+ | Capability | Gemma 4 31B-BF16 (MLX) | Claude Opus 4 |
+ |-----------|------------------------|---------------|
+ | **Context window** | 128K tokens | 200K tokens |
+ | **Multimodal** | ✅ Image + Text | ✅ Image + Text |
+ | **Code quality** | ⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Best |
+ | **Reasoning** | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐⭐⭐⭐ Best |
+ | **Speed** | ~6 tok/s (M2 Pro Max) | Cloud-based |
+ | **Cost** | $0 (local) | ~$15/1M input tokens |
+ | **Privacy** | ✅ 100% local | ❌ Cloud |
+ | **Offline** | ✅ Works offline | ❌ Requires internet |
+
+ **Verdict**: For most tasks, Gemma 4 31B-BF16 on your M2 Pro Max is a genuine Claude Opus alternative. For edge cases where you need the absolute best, Gemini 2.5 Pro fills the gap.