raazkumar committed 03cc10d (verified) · Parent(s): 3910229

Upload production/LOCAL_DEPLOYMENT.md

Files changed (1): production/LOCAL_DEPLOYMENT.md (new file, +401 lines)

# Local Deployment Guide — No Hugging Face Required

Run the entire ml-intern production system **locally** on your machine using Docker Compose or native Python. No HF account, no cloud APIs needed (though you can add them).

## Prerequisites

- **Docker + Docker Compose** (recommended) OR **Python 3.11+**
- **8GB RAM minimum** (16GB+ recommended)
- **Local LLM backend** (pick one):
  - [Ollama](https://ollama.com) — easiest
  - [LM Studio](https://lmstudio.ai) — GUI, great for Mac/Windows
  - [llama.cpp](https://github.com/ggerganov/llama.cpp) — most control
  - [vLLM](https://github.com/vllm-project/vllm) — highest throughput
  - [NVIDIA NIM](https://developer.nvidia.com/nim) — enterprise GPUs
  - [MLX](https://github.com/ml-explore/mlx) — Apple Silicon optimized

---

## Option 1: Docker Compose (Fastest — 2 Minutes)

### Step 1: Start a Local LLM Server

**Option A — Ollama (Recommended)**

```bash
# Install Ollama (one-liner)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1

# Start server (runs on :11434, OpenAI-compatible on :11434/v1)
ollama serve
```

**Option B — LM Studio**

1. Download LM Studio from https://lmstudio.ai
2. Load any GGUF model
3. Start the **Local Inference Server** → it runs on `http://localhost:1234/v1`

**Option C — llama.cpp Server**

```bash
# Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

# Start server (OpenAI-compatible API on :8080/v1)
./server -m llama-2-7b.Q4_K_M.gguf --port 8080
```

### Step 2: Clone & Configure

```bash
git clone https://github.com/raazkumar/ml-intern-local-fork.git
cd ml-intern-local-fork/production

# Copy environment template
cp .env.example .env
```

Edit `.env` — **only change these lines**:

```env
# Point to your local LLM server
OLLAMA_API_BASE=http://host.docker.internal:11434/v1
# (or for LM Studio: http://host.docker.internal:1234/v1)
# (or for llama.cpp: http://host.docker.internal:8080/v1)

# No cloud API keys needed for local-only mode
# Leave these blank or comment them out:
# HF_TOKEN=
# ANTHROPIC_API_KEY=
# OPENAI_API_KEY=
# GROQ_API_KEY=
# NVIDIA_API_KEY=
```

> **Docker host networking note**: On Linux, `host.docker.internal` may not work. Use your machine's LAN IP (e.g., `192.168.1.5`) instead. On Mac/Windows, `host.docker.internal` works out of the box.

### Step 3: Launch the Stack

```bash
docker-compose up -d
```

This starts:
- **API server** (FastAPI) on http://localhost:8000
- **Background workers** (cleanup, budget alerts)
- **Redis** (caching + rate limiting) on :6379
- **PostgreSQL** (audit log + sessions) on :5432
- **Nginx** (load balancer) on :80
- **Prometheus** (metrics) on :9090
- **Grafana** (dashboards) on :3000
- **Jaeger** (tracing) on :16686
- **pgAdmin** (DB GUI) on :5050

### Step 4: Verify

```bash
# Health check
curl http://localhost/health | jq

# List available models (includes your local ones)
curl http://localhost/v1/models | jq

# Chat with your local model
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Hello from local deployment!"}],
    "stream": false
  }'
```
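
The same calls work from Python. Here is a minimal sketch using the standard `openai` client package, on the assumption (supported by the curl examples above) that the gateway is OpenAI-compatible; the `api_key` value is arbitrary since no cloud key is configured:

```python
from openai import OpenAI  # pip install openai

# Point the client at the Nginx front door from Step 3; any non-empty key works locally.
client = OpenAI(base_url="http://localhost/v1", api_key="local-only")

response = client.chat.completions.create(
    model="ollama/llama3.1",
    messages=[{"role": "user", "content": "Hello from local deployment!"}],
)
print(response.choices[0].message.content)
```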

### Step 5: View Dashboards

| Service | URL | Default Login |
|---------|-----|---------------|
| API | http://localhost:8000 | — |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | — |
| Jaeger UI | http://localhost:16686 | — |
| pgAdmin | http://localhost:5050 | admin@mlintern.local / admin |

---

## Option 2: Pure Python (No Docker)

For development or lightweight setups.

### Step 1: Install Dependencies

```bash
# Python 3.11+ required
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -r production/requirements.prod.txt
```

### Step 2: Start PostgreSQL + Redis

You need these running locally. Options:

**A) System packages:**
```bash
# Ubuntu/Debian
sudo apt install postgresql redis
sudo systemctl start postgresql redis-server

# macOS
brew install postgresql redis
brew services start postgresql
brew services start redis
```

**B) Docker (just the infra):**
```bash
docker run -d --name redis -p 6379:6379 redis:7-alpine
docker run -d --name postgres \
  -e POSTGRES_PASSWORD=ml_intern \
  -e POSTGRES_DB=ml_intern \
  -p 5432:5432 postgres:16-alpine
```

### Step 3: Initialize Database

```bash
psql -U postgres -h localhost -d ml_intern -f production/init.sql
```

### Step 4: Configure Environment

```bash
export REDIS_URL=redis://localhost:6379
export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern
export PORT=8000
export WORKERS=1
export LOG_LEVEL=INFO

# Point to your local LLM
export OLLAMA_API_BASE=http://localhost:11434/v1
```
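
Before starting the server, it can help to confirm that Redis, PostgreSQL, and the local LLM are actually reachable with the values you just exported. A stdlib-only sketch; it only checks TCP connectivity and the OpenAI-style `/models` endpoint, nothing ml-intern-specific:

```python
import os
import socket
import urllib.request

def check_tcp(name: str, host: str, port: int) -> None:
    """Report whether a TCP port is accepting connections."""
    try:
        socket.create_connection((host, port), timeout=2).close()
        print(f"OK   {name} ({host}:{port})")
    except OSError as exc:
        print(f"FAIL {name} ({host}:{port}): {exc}")

check_tcp("Redis", "localhost", 6379)
check_tcp("PostgreSQL", "localhost", 5432)

# The LLM server should answer the OpenAI-style model listing endpoint.
base = os.getenv("OLLAMA_API_BASE", "http://localhost:11434/v1")
try:
    with urllib.request.urlopen(f"{base}/models", timeout=5) as response:
        print("OK   LLM backend:", response.status)
except OSError as exc:
    print("FAIL LLM backend:", exc)
```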

### Step 5: Start the Server

```bash
cd production
python -m production_server
```

Server runs on http://localhost:8000

### Step 6: Start Worker (in another terminal)

```bash
source .venv/bin/activate
cd production
python -m worker
```

---

## Connecting Different Local Backends

| Backend | Start Command | API Base | Model Prefix | Example Model String |
|---------|--------------|----------|-------------|---------------------|
| **Ollama** | `ollama serve` | `http://localhost:11434/v1` | `ollama/` | `ollama/llama3.1` |
| **LM Studio** | Start server in GUI | `http://localhost:1234/v1` | `lmstudio/` | `lmstudio/llama-3-8b` |
| **llama.cpp** | `./server -m model.gguf` | `http://localhost:8080/v1` | `llamacpp/` | `llamacpp/llama-2-7b` |
| **vLLM** | `python -m vllm.entrypoints.openai.api_server` | `http://localhost:8000/v1` | `vllm/` | `vllm/llama-3-8b` |
| **MLX** | `python -m mlx_lm.server` | `http://localhost:8000/v1` | `mlx/` | `mlx/llama-3-8b` |
| **NVIDIA NIM** | `docker run nvcr.io/...` | `http://localhost:8000/v1` | `nim/` | `nim/llama-3.1-8b` |
| **TGI** | `docker run ghcr.io/...tgi` | `http://localhost:8080/v1` | `tgi/` | `tgi/llama-3-8b` |
| **Custom PyTorch** | Your own server | `http://localhost:8000/v1` | `local/` | `local/my-model` |
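
To make the prefix-to-backend mapping in the table concrete, here is a minimal, illustrative sketch of how a router could resolve a model string into an API base plus a bare model name. This is not the actual ml-intern implementation; the `BACKENDS` dictionary, the `LLAMACPP_API_BASE` variable, and the `resolve_model` helper are hypothetical, shown only to clarify the convention, with the real override variables described in the next subsection:

```python
import os

# Hypothetical prefix -> default API base mapping, mirroring the table above.
# Each entry can be overridden by an environment variable.
BACKENDS = {
    "ollama":   os.getenv("OLLAMA_API_BASE",   "http://localhost:11434/v1"),
    "lmstudio": os.getenv("LMSTUDIO_API_BASE", "http://localhost:1234/v1"),
    "llamacpp": os.getenv("LLAMACPP_API_BASE", "http://localhost:8080/v1"),
    "vllm":     os.getenv("VLLM_API_BASE",     "http://localhost:8000/v1"),
}

def resolve_model(model: str) -> tuple[str, str]:
    """Split 'ollama/llama3.1' into (api_base, 'llama3.1')."""
    prefix, _, name = model.partition("/")
    if not name or prefix not in BACKENDS:
        raise ValueError(f"Unknown model string: {model!r}")
    return BACKENDS[prefix], name

print(resolve_model("ollama/llama3.1"))  # ('http://localhost:11434/v1', 'llama3.1')
print(resolve_model("vllm/llama-3-8b"))  # ('http://localhost:8000/v1', 'llama-3-8b')
```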

### Override API Base (if not on the default port)

In `.env`:
```env
OLLAMA_API_BASE=http://192.168.1.100:11434/v1
LMSTUDIO_API_BASE=http://lmstudio.local:1234/v1
VLLM_API_BASE=http://vllm-server.internal:8000/v1
```

---

## Multi-Backend Setup (Recommended)

Run **multiple local backends** and let ml-intern round-robin or fail over:

```bash
# Terminal 1: Ollama for fast models
ollama pull llama3.1
ollama serve

# Terminal 2: vLLM for high throughput
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8001
```

In `.env`:
```env
OLLAMA_API_BASE=http://localhost:11434/v1
VLLM_API_BASE=http://localhost:8001/v1
```

Now you can use either:
```bash
curl http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Quick question"}]
  }'

curl http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/llama-3.1-70b",
    "messages": [{"role":"user","content":"Complex reasoning"}]
  }'
```
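
If you want the fallback behaviour on the client side as well, here is a minimal sketch using the standard `openai` client against the gateway, assuming the OpenAI-compatible `/v1` endpoint shown in the curl examples above; the model strings and fallback order are just examples:

```python
from openai import OpenAI  # pip install openai

# The gateway speaks the OpenAI API, so the stock client works; the key is unused locally.
client = OpenAI(base_url="http://localhost/v1", api_key="local-only")

def ask(prompt: str, models=("vllm/llama-3.1-70b", "ollama/llama3.1")) -> str:
    """Try the big vLLM model first, fall back to the smaller Ollama model on error."""
    last_error = None
    for model in models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except Exception as exc:  # e.g. backend down or overloaded
            last_error = exc
    raise RuntimeError(f"All backends failed: {last_error}")

print(ask("Summarize the benefits of local deployment in one sentence."))
```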

---

## CLI Mode (No Server)

If you want to use ml-intern as a CLI tool with local models (the original use case):

```bash
# Install the agent CLI
pip install -e .

# Run with a local model
ml-intern --model ollama/llama3.1 "Write a Python function to sort a list"

# With local overrides
OLLAMA_API_BASE=http://localhost:11434/v1 ml-intern \
  --model ollama/llama3.1 \
  --yolo \
  "Create a FastAPI app with Redis caching"
```

---

## Hardware Requirements by Backend

| Backend | Min GPU | Recommended GPU | RAM | Notes |
|---------|---------|----------------|-----|-------|
| Ollama (7B) | None (CPU) | 8GB VRAM | 16GB | Best ease-of-use |
| Ollama (70B) | 48GB VRAM | 80GB (A100) | 128GB | Q4 quantization helps |
| LM Studio | None (CPU) | 8GB+ VRAM | 16GB | Great GUI for exploration |
| vLLM (7B) | 16GB VRAM | 24GB (3090/A10G) | 32GB | Highest throughput |
| vLLM (70B) | 80GB VRAM | 2x A100 | 256GB | tensor_parallel required |
| llama.cpp | None (CPU) | Any | 8GB | Best for CPU-only |
| MLX (Mac) | Apple Silicon | M3 Max 36GB | 32GB | Native Apple GPU |
| NVIDIA NIM | 24GB+ | A100/H100 | 64GB | Enterprise support |
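
For model sizes not in the table, a common rule of thumb is: weight memory is roughly parameter count times bytes per weight, plus some headroom for KV cache and activations. A quick back-of-the-envelope helper (the 20% overhead factor is an assumption, not a measured value):

```python
def approx_vram_gb(params_billions: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rule-of-thumb memory estimate: weights plus ~20% for KV cache / activations."""
    return params_billions * (bits_per_weight / 8) * overhead

print(approx_vram_gb(7, 4))    # 7B at Q4  -> ~4.2 GB, fits an 8GB GPU
print(approx_vram_gb(70, 4))   # 70B at Q4 -> ~42 GB, needs ~48GB VRAM
print(approx_vram_gb(70, 16))  # 70B fp16  -> ~168 GB, multi-GPU territory
```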

---

## Troubleshooting

### "Connection refused" to local LLM

Docker containers can't reach `localhost` on the host. Use:
- **Mac/Windows**: `host.docker.internal` (already in the default `.env`)
- **Linux**: Your machine's LAN IP, e.g., `192.168.1.5` (see the snippet below)
- **All platforms**: Put the LLM server in Docker Compose too
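
To find the LAN IP for the Linux case, any standard method works (`hostname -I`, `ip addr`, and so on); here is a small stdlib-only sketch that asks the OS which local address it would use for outbound traffic:

```python
import socket

def lan_ip() -> str:
    """Return the host's outbound LAN address (the one a container can reach)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP "connect" sends no packets; it just selects a route and source address.
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    finally:
        s.close()

print(lan_ip())  # e.g. 192.168.1.5; use it in OLLAMA_API_BASE instead of host.docker.internal
```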

### Ollama in Docker Compose

Add to `docker-compose.yml` (the named volume must also be declared at the top level):
```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"

volumes:
  ollama:
```
Then set `OLLAMA_API_BASE=http://ollama:11434/v1` (internal Docker DNS).

### "Rate limit exceeded" immediately

The default RPM is 40. For local models with no actual limit, increase it:
```env
DEFAULT_RPM_LIMIT=1000
```

### PostgreSQL connection failed

```bash
# Check if Postgres is running
docker ps | grep postgres

# Check logs
docker logs ml-intern-postgres-1

# Reset database
docker-compose down -v   # WARNING: deletes all data
docker-compose up -d postgres
```

### Grafana shows "No data"

Prometheus needs time to scrape. Wait 30 seconds, or check:
```bash
curl http://localhost:9090/api/v1/targets
```

### Slow first response

Local models load into VRAM/RAM on the first request; subsequent requests are fast. Use Redis caching (enabled by default) to skip LLM calls for repeated prompts.
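
If the cold start matters, you can warm the model up right after the stack comes up by sending one throwaway request. A stdlib-only sketch; the endpoint and model string follow the examples above, so adjust them to your backend:

```python
import json
import urllib.request

# One tiny request forces the backend to load the model before real traffic arrives.
payload = {
    "model": "ollama/llama3.1",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 1,
}
request = urllib.request.Request(
    "http://localhost/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request, timeout=300) as response:
    print("warm-up status:", response.status)
```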

---

## File Structure (Local Copy)

```
ml-intern/
├── production/
│   ├── docker-compose.yml      # Full stack
│   ├── Dockerfile.prod         # API + worker image
│   ├── production_server.py    # FastAPI app
│   ├── worker.py               # Background tasks
│   ├── init.sql                # DB schema
│   ├── nginx.conf              # Load balancer config
│   ├── prometheus.yml          # Metrics collection
│   ├── requirements.prod.txt   # Python deps
│   ├── .env.example            # Configuration template
│   ├── grafana/                # Dashboards
│   ├── k8s/                    # Kubernetes manifests
│   ├── helm/                   # Helm charts
│   └── tests/                  # Integration + load tests
└── agent/                      # Original ml-intern agent code
```

---

## Next Steps

1. **Load test your setup**: `locust -f production/tests/load_test.py --host http://localhost`
2. **Add cloud fallback**: Set `GROQ_API_KEY` or `OPENAI_API_KEY` for when the local model is overloaded
3. **Monitor costs**: Even local models use electricity — Grafana tracks request volume
4. **Scale horizontally**: `docker-compose up -d --scale api=4`

---

## No Internet Required

Once models are downloaded and Docker images are cached, the entire stack runs **offline**:
- Local LLM (Ollama, LM Studio, etc.) — no network
- Redis, PostgreSQL, Nginx — local containers
- Prometheus + Grafana — local containers
- The only outbound calls are to the LLM API on localhost

Perfect for air-gapped environments or private data processing.