# M2 Pro Max 96GB — Setup Guide

Your machine is powerful enough to run **70B models locally via MLX** while using **NIM (cloud) as primary** and **Cloudflare Workers AI as automatic fallback**.

## Architecture for Your Setup

```
┌──────────────────────────────────────────────────────────────────┐
│                     MacBook M2 Pro Max 96GB                      │
│                                                                  │
│  ┌──────────────────┐      ┌──────────────────────────┐          │
│  │ MLX Server       │      │ Docker (API + Infra)     │          │
│  │ (Metal GPU)      │      │ ────────────────────     │          │
│  │ ──────────       │      │ • FastAPI server         │          │
│  │ Port :8000       │◄─────│ • Redis cache            │          │
│  │ 70B models       │      │ • Postgres DB            │          │
│  │ 48GB RAM use     │      │ • Nginx LB               │          │
│  └──────────────────┘      └──────────────────────────┘          │
│                                 ▲                                │
│                                 │                                │
│                          ┌──────┴──────┐    ┌─────────────────┐  │
│                          │ NIM Cloud   │───►│ Cloudflare AI   │  │
│                          │ (Primary)   │    │ (Auto Fallback) │  │
│                          └─────────────┘    └─────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```

## Quick Start

### 1. Install Prerequisites

```bash
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install Homebrew packages
brew install redis postgresql docker

# Start services
brew services start redis
brew services start postgresql

# Install Docker Desktop for Mac (if not already)
# https://www.docker.com/products/docker-desktop
```
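
If you want to confirm Redis and Postgres are actually listening before moving on, a quick stdlib-only check works; it assumes the default ports 6379 and 5432, so adjust if you changed them. A successful connect only proves the port is open, not that the service is healthy:

```python
# check_services.py - quick TCP reachability check for Redis and Postgres
import socket

SERVICES = {"redis": ("localhost", 6379), "postgres": ("localhost", 5432)}

for name, (host, port) in SERVICES.items():
    try:
        # connect and immediately close; timeout keeps failures fast
        with socket.create_connection((host, port), timeout=2):
            print(f"{name}: OK ({host}:{port})")
    except OSError as exc:
        print(f"{name}: UNREACHABLE ({host}:{port}): {exc}")
```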

### 2. Install MLX Server (Native on macOS)

```bash
# Create a dedicated venv for MLX
mkdir ~/mlx-server && cd ~/mlx-server
uv venv --python 3.11
source .venv/bin/activate

# Install MLX LM
uv pip install mlx-lm

# Download a 70B model (takes ~40GB, fits in 96GB)
# Option A: llama-3.1-70B (best quality)
python -c "
from mlx_lm import load
load('mlx-community/Meta-Llama-3.1-70B-Instruct-4bit')
"

# Option B: Mistral-7B (faster, less memory)
python -c "
from mlx_lm import load
load('mlx-community/Mistral-7B-Instruct-v0.3-4bit')
"
```
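
Before starting the server, you can sanity-check a downloaded model straight from Python. A minimal sketch using the 7B option above (swap in the 70B name if you pulled that one); note that the exact `generate()` keyword arguments vary slightly across mlx-lm releases:

```python
# sanity_check.py - generate a few tokens to confirm the model runs on Metal
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

reply = generate(
    model,
    tokenizer,
    prompt="Say hello in five words.",
    max_tokens=32,
)
print(reply)
```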

### 3. Start MLX Server

```bash
# Terminal 1: Start MLX server (uses Metal GPU automatically)
mlx_lm.server \
  --model mlx-community/Meta-Llama-3.1-70B-Instruct-4bit \
  --host 0.0.0.0 \
  --port 8000

# OR for 7B (faster, less RAM):
# mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --host 0.0.0.0 --port 8000

# Verify it's running
curl http://localhost:8000/v1/models
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hello"}]}'
```
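
Because `mlx_lm.server` speaks the OpenAI chat-completions protocol, any OpenAI-compatible client can talk to it directly. A minimal sketch with the `openai` Python SDK; the API key is a placeholder, since the local server does not check it:

```python
# mlx_client.py - call the local MLX server through the OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mlx-community/Meta-Llama-3.1-70B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello from the OpenAI client"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```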

### 4. Configure API Server

```bash
# In the ml-intern repo
cd production

# Create minimal .env (ONLY these 4-5 lines needed)
cat > .env << 'EOF'
# REQUIRED — Cloudflare fallback
CLOUDFLARE_API_KEY=your_cloudflare_api_key_here
CLOUDFLARE_ACCOUNT_ID=your_account_id_here

# OPTIONAL — NIM primary (if you have an API key)
NVIDIA_API_KEY=your_nvidia_api_key_here

# Point MLX to your local server
MLX_API_BASE=http://host.docker.internal:8000/v1
EOF
```

### 5. Start the Stack

```bash
# Start Redis, Postgres, API, Workers, Nginx
docker-compose -f docker-compose.m2.yml up -d

# Verify everything
curl http://localhost/health | jq
```

### 6. Test the Full Pipeline

```bash
# Test 1: NIM primary (if API key set)
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"nim/llama-3.1-8b-instruct","messages":[{"role":"user","content":"Hello from NIM"}]}'

# Test 2: Cloudflare fallback
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"cloudflare/@cf/meta/llama-3.1-8b-instruct","messages":[{"role":"user","content":"Hello from Cloudflare"}]}'

# Test 3: MLX local (bypasses fallback)
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx/llama-3.1-70b","messages":[{"role":"user","content":"Hello from MLX"}]}'

# Test 4: Check which provider is active
curl http://localhost/v1/fallback/status | jq
```
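
The same three checks can be scripted. A small smoke test, assuming the gateway at `http://localhost` accepts the model prefixes used above; `requests` is the only dependency:

```python
# smoke_test.py - exercise NIM, Cloudflare, and MLX through the gateway
import requests

MODELS = [
    "nim/llama-3.1-8b-instruct",                   # primary (needs NVIDIA_API_KEY)
    "cloudflare/@cf/meta/llama-3.1-8b-instruct",   # fallback
    "mlx/llama-3.1-70b",                           # local
]

for model in MODELS:
    r = requests.post(
        "http://localhost/v1/chat/completions",
        json={"model": model, "messages": [{"role": "user", "content": "ping"}]},
        timeout=120,  # the 70B local model can be slow on first token
    )
    print(f"{model} -> HTTP {r.status_code}")
```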

---

## Fallback Behavior

```
Request comes in
        │
        ▼
┌──────────────────┐
│ Check NIM        │◄── Circuit breaker CLOSED?
│ (Primary)        │      Yes → Send to NIM
└──────────────────┘      No  → Fall through
        │
        ▼  (NIM down or rate limited)
┌──────────────────┐
│ Check Cloudflare │◄── Circuit breaker CLOSED?
│ (Fallback)       │      Yes → Send to Cloudflare
└──────────────────┘      No  → Fall through
        │
        ▼  (Both down)
┌──────────────────┐
│ Check MLX        │◄── Enabled + Circuit CLOSED?
│ (Local)          │      Yes → Send to MLX
└──────────────────┘      No  → Return 503
```
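
The gateway's real implementation lives in the repo, but the control flow above is easy to pin down in a sketch. Assuming the thresholds this guide describes (the circuit opens after 3 failures and half-opens after 60 seconds), the chain looks roughly like this; the names `CircuitBreaker`, `route`, and `call_provider` are illustrative, not the repo's API:

```python
# fallback_sketch.py - illustrative circuit-breaker chain (not the repo's code)
import time

FAILURE_THRESHOLD = 3   # consecutive failures before the circuit opens
RECOVERY_SECONDS = 60   # wait before half-opening to probe again

class CircuitBreaker:
    def __init__(self):
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # CLOSED: traffic flows normally
        if time.time() - self.opened_at >= RECOVERY_SECONDS:
            return True  # HALF-OPEN: let one probe request through
        return False     # OPEN: skip this provider

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None  # success closes the circuit
        else:
            self.failures += 1
            if self.failures >= FAILURE_THRESHOLD:
                self.opened_at = time.time()  # (re-)open the circuit

BREAKERS = {name: CircuitBreaker() for name in ("nim", "cloudflare", "mlx")}

def route(request, call_provider):
    """Try NIM, then Cloudflare, then local MLX; 503 if every circuit rejects."""
    for name in ("nim", "cloudflare", "mlx"):
        breaker = BREAKERS[name]
        if not breaker.allow():
            continue  # circuit OPEN: fall through to the next provider
        try:
            result = call_provider(name, request)  # hypothetical provider call
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
    raise RuntimeError("503: all providers unavailable")
```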

**You can force a provider** with `provider_override`:

```bash
# Always use MLX regardless of NIM status
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx/llama-3.1-70b",
    "messages": [{"role":"user","content":"Hello"}],
    "provider_override": "mlx"
  }'
```

---

## Running MLX + API Together (All Native, No Docker)

If you prefer running everything natively without Docker:

### Terminal 1: MLX Server
```bash
cd ~/mlx-server
source .venv/bin/activate
mlx_lm.server --model mlx-community/Meta-Llama-3.1-70B-Instruct-4bit --port 8000
```

### Terminal 2: Redis
```bash
redis-server
```

### Terminal 3: PostgreSQL
```bash
# If not using Docker Postgres (on Apple Silicon, Homebrew lives under /opt/homebrew)
initdb /opt/homebrew/var/postgres
pg_ctl -D /opt/homebrew/var/postgres start

# Or just use Docker for Postgres only:
docker run -d --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=ml_intern postgres:16-alpine
```

### Terminal 4: API Server
```bash
cd production
uv sync

# Set only required env vars
export REDIS_URL=redis://localhost:6379
export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern
export MLX_ENABLED=true
export MLX_API_BASE=http://localhost:8000/v1
export CLOUDFLARE_API_KEY=your_key
export CLOUDFLARE_ACCOUNT_ID=your_account_id
export NVIDIA_API_KEY=your_nvidia_key
export FALLBACK_ENABLED=true
export FALLBACK_PRIMARY=nim
export FALLBACK_SECONDARY=cloudflare

# Run with uv
uv run python -m production_server
```

### Terminal 5: Worker
```bash
cd production
uv run python -m worker
```

---

## Performance Tips for M2 Pro Max

| Metric       | 7B Model  | 70B Model         |
|--------------|-----------|-------------------|
| RAM usage    | ~8GB      | ~48GB             |
| Tokens/sec   | ~40 tok/s | ~8 tok/s          |
| Startup time | ~2s       | ~20s              |
| Best for     | Fast Q&A  | Complex reasoning |
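
The RAM figures follow from the quantization arithmetic: a 4-bit weight is half a byte, so a 70B model's weights alone are roughly 35-40GB, and runtime overhead (quantization scales, KV cache, activations) pushes the total toward the ~48GB shown. A back-of-envelope check:

```python
# Rough weight-memory estimate for a 4-bit quantized 70B model
params = 70e9           # parameter count
bits_per_param = 4.5    # ~4-bit weights plus quantization scales
weights_gb = params * bits_per_param / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~39 GB, before KV cache etc.
```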

### Use Multiple MLX Models

Your 96GB can hold **both** the 7B and the 70B simultaneously:

```bash
# Terminal A: 70B for complex tasks
mlx_lm.server --model mlx-community/Meta-Llama-3.1-70B-Instruct-4bit --port 8000

# Terminal B: 7B for quick tasks
mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --port 8001
```

Then in `.env`:
```env
MLX_API_BASE=http://host.docker.internal:8000/v1  # 70B default
```

Quick requests can then go directly to port 8001, as in the sketch below.
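
For instance, routing a latency-sensitive request straight at the 7B server, bypassing the gateway entirely:

```python
# quick_task.py - hit the 7B server on :8001 directly for low-latency calls
import requests

r = requests.post(
    "http://localhost:8001/v1/chat/completions",
    json={
        "model": "mlx-community/Mistral-7B-Instruct-v0.3-4bit",
        "messages": [{"role": "user", "content": "Summarize: MLX runs on Metal."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(r.json()["choices"][0]["message"]["content"])
```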

---

## Minimal Configuration (Just 5 Lines)

```bash
# .env — only the two Cloudflare lines are strictly required
CLOUDFLARE_API_KEY=sk-your-key
CLOUDFLARE_ACCOUNT_ID=your-account-id
NVIDIA_API_KEY=nvapi-your-key                     # optional, enables NIM primary
MLX_API_BASE=http://host.docker.internal:8000/v1  # optional, enables local fallback
```

Everything else uses sensible defaults (sketched after this list):
- `FALLBACK_ENABLED=true` (default)
- `FALLBACK_PRIMARY=nim` (default)
- `FALLBACK_SECONDARY=cloudflare` (default)
- `DEFAULT_RPM_LIMIT=40` (NIM free tier)
- `CACHE_TTL_SECONDS=300` (5 min cache)
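
One common way such env-var defaults are modeled (purely illustrative; the repo's actual config wiring may differ) is `pydantic-settings`, where each field's default applies unless a matching environment variable overrides it:

```python
# settings_sketch.py - how defaults like these are often wired (illustrative)
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # each field reads FALLBACK_ENABLED, FALLBACK_PRIMARY, ... from the env
    fallback_enabled: bool = True
    fallback_primary: str = "nim"
    fallback_secondary: str = "cloudflare"
    default_rpm_limit: int = 40      # NIM free tier
    cache_ttl_seconds: int = 300     # 5 min cache

settings = Settings()  # env vars win over the defaults above
print(settings.model_dump())
```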

---

## What If NIM is Down?

```bash
# Simulate NIM failure — circuit breaker will open after 3 failures
for i in {1..3}; do
  curl -X POST http://localhost/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"nim/llama-3.1-8b","messages":[{"role":"user","content":"test"}]}'
done

# Now check status
curl http://localhost/v1/fallback/status | jq
# → "nim": "open", "cloudflare": "closed", "active_provider": "cloudflare"

# Next request automatically goes to Cloudflare
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"cloudflare/@cf/meta/llama-3.1-8b-instruct","messages":[{"role":"user","content":"This goes to Cloudflare"}]}'

# After 60 seconds (default recovery), NIM circuit half-opens
curl http://localhost/v1/fallback/status | jq
# → "nim": "half-open"
```
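
To watch the recovery happen live, a small poller against the status endpoint works well; it assumes the JSON shape shown above and runs until interrupted:

```python
# watch_fallback.py - print circuit-breaker states once per second (Ctrl-C to stop)
import time
import requests

while True:
    states = requests.get("http://localhost/v1/fallback/status", timeout=5).json()
    print(time.strftime("%H:%M:%S"), states)
    time.sleep(1)
```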

---

## Monitoring Your Setup

```bash
# Check which provider handled each request
curl http://localhost/v1/fallback/status | jq

# Grafana: http://localhost:3000 (admin/admin)
# - Dashboard: "ml-intern Production"
# - Panels: fallback count, provider latency, cache hit rate

# Prometheus: http://localhost:9090
# Query: ml_intern_fallback_total
# Query: ml_intern_circuit_breaker_state

# Redis cache stats
redis-cli info stats | grep keyspace

# Postgres request log
psql postgresql://ml_intern:ml_intern@localhost/ml_intern \
  -c "SELECT provider, COUNT(*) FROM requests GROUP BY provider;"
```
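
The Prometheus counters can also be pulled programmatically through its standard HTTP API; the metric name is the one listed above:

```python
# fallback_metrics.py - read fallback counters via Prometheus's HTTP API
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "ml_intern_fallback_total"},
    timeout=5,
)
# instant-query vectors come back as [{"metric": {...}, "value": [ts, "val"]}, ...]
for series in resp.json()["data"]["result"]:
    print(series["metric"], "=", series["value"][1])
```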