# Local Deployment Guide: No Hugging Face Required
Run the entire ml-intern production system **locally** on your machine using Docker Compose or native Python. No HF account, no cloud APIs needed (though you can add them).
## Prerequisites
- **Docker + Docker Compose** (recommended) OR **Python 3.11+**
- **8GB RAM minimum** (16GB+ recommended)
- **Local LLM backend** (pick one):
- [Ollama](https://ollama.com) - easiest
- [LM Studio](https://lmstudio.ai) - GUI, great for Mac/Windows
- [llama.cpp](https://github.com/ggerganov/llama.cpp) - most control
- [vLLM](https://github.com/vllm-project/vllm) - highest throughput
- [NVIDIA NIM](https://developer.nvidia.com/nim) - enterprise GPUs
- [MLX](https://github.com/ml-explore/mlx) - Apple Silicon optimized
---
## Option 1: Docker Compose (Fastest: ~2 Minutes)
### Step 1: Start a Local LLM Server
**Option A - Ollama (Recommended)**
```bash
# Install Ollama (one-liner)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull llama3.1
# Start server (runs on :11434, OpenAI-compatible on :11434/v1)
ollama serve
```
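Before wiring anything else up, it helps to confirm the server answers. A quick sanity check (Ollama's native `/api/tags` lists the models you've pulled; its OpenAI-compatible layer also exposes `/v1/models`):
```bash
# List pulled models via the native API
curl http://localhost:11434/api/tags
# Same information via the OpenAI-compatible route the stack will use
curl http://localhost:11434/v1/models
```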
**Option B - LM Studio**
1. Download LM Studio from https://lmstudio.ai
2. Load any GGUF model
3. Start the **Local Inference Server**; it runs on `http://localhost:1234/v1`
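To confirm the server is up, you can query its OpenAI-compatible model listing:
```bash
# Should return the model(s) currently loaded in LM Studio
curl http://localhost:1234/v1/models
```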
**Option C - llama.cpp Server**
```bash
# Build (llama.cpp uses CMake; the server binary is llama-server)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release
# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
# Start server (OpenAI-compatible API on :8080/v1)
./build/bin/llama-server -m llama-2-7b.Q4_K_M.gguf --port 8080
```
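A quick smoke test against the server (recent llama.cpp builds expose a `/health` route alongside the OpenAI-compatible endpoints; the server answers for whatever single model it has loaded):
```bash
# Liveness check
curl http://localhost:8080/health
# OpenAI-compatible chat request against the loaded model
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"ping"}]}'
```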
### Step 2: Clone & Configure
```bash
git clone https://github.com/raazkumar/ml-intern-local-fork.git
cd ml-intern-local-fork/production
# Copy environment template
cp .env.example .env
```
Edit `.env` and **only change these lines**:
```env
# Point to your local LLM server
OLLAMA_API_BASE=http://host.docker.internal:11434/v1
# (or for LM Studio: http://host.docker.internal:1234/v1)
# (or for llama.cpp: http://host.docker.internal:8080/v1)
# No cloud API keys needed for local-only mode
# Leave these blank or comment them out:
# HF_TOKEN=
# ANTHROPIC_API_KEY=
# OPENAI_API_KEY=
# GROQ_API_KEY=
# NVIDIA_API_KEY=
```
> **Docker host networking note**: On Linux, `host.docker.internal` may not work. Use your machine's LAN IP (e.g., `192.168.1.5`) instead. On Mac/Windows, `host.docker.internal` works out of the box.
### Step 3: Launch the Stack
```bash
docker-compose up -d
```
This starts:
- **API server** (FastAPI) on http://localhost:8000
- **Background workers** (cleanup, budget alerts)
- **Redis** (caching + rate limiting) on :6379
- **PostgreSQL** (audit log + sessions) on :5432
- **Nginx** (load balancer) on :80
- **Prometheus** (metrics) on :9090
- **Grafana** (dashboards) on :3000
- **Jaeger** (tracing) on :16686
- **pgAdmin** (DB GUI) on :5050
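Before moving on, a quick check that everything actually came up (the `api` service name is an assumption about this compose file; adjust to match yours):
```bash
# All services should show "Up" (or "healthy" where healthchecks are defined)
docker-compose ps
# Tail the API logs if anything looks wrong (service name assumed to be "api")
docker-compose logs -f api
```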
### Step 4: Verify
```bash
# Health check
curl http://localhost/health | jq
# List available models (includes your local ones)
curl http://localhost/v1/models | jq
# Chat with your local model
curl -X POST http://localhost/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ollama/llama3.1",
"messages": [{"role":"user","content":"Hello from local deployment!"}],
"stream": false
}'
```
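Streaming goes through the same route: set `"stream": true` and disable curl's buffering with `-N` to watch tokens arrive as they are generated (assuming the gateway passes server-sent events through, which is standard for OpenAI-compatible setups):
```bash
curl -N -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Stream a haiku about local LLMs"}],
    "stream": true
  }'
```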
### Step 5: View Dashboards
| Service | URL | Default Login |
|---------|-----|---------------|
| API | http://localhost:8000 | none |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | none |
| Jaeger UI | http://localhost:16686 | none |
| pgAdmin | http://localhost:5050 | admin@mlintern.local / admin |
---
## Option 2: Pure Python (No Docker)
For development or lightweight setups.
### Step 1: Install Dependencies
```bash
# Python 3.11+ required
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r production/requirements.prod.txt
```
### Step 2: Start PostgreSQL + Redis
You need these running locally. Options:
**A) System packages:**
```bash
# Ubuntu/Debian
sudo apt install postgresql redis-server
sudo systemctl start postgresql redis-server
# macOS
brew install postgresql redis
brew services start postgresql
brew services start redis
```
**B) Docker (just the infra):**
```bash
docker run -d --name redis -p 6379:6379 redis:7-alpine
docker run -d --name postgres \
-e POSTGRES_PASSWORD=ml_intern \
-e POSTGRES_DB=ml_intern \
-p 5432:5432 postgres:16-alpine
```
### Step 3: Initialize Database
```bash
psql -U postgres -h localhost -d ml_intern -f production/init.sql
```
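To confirm the schema actually loaded, list the tables created by `init.sql`:
```bash
# Should print the tables backing the audit log, sessions, etc.
psql -U postgres -h localhost -d ml_intern -c '\dt'
```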
### Step 4: Configure Environment
```bash
export REDIS_URL=redis://localhost:6379
export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern
export PORT=8000
export WORKERS=1
export LOG_LEVEL=INFO
# Point to your local LLM
export OLLAMA_API_BASE=http://localhost:11434/v1
```
### Step 5: Start the Server
```bash
cd production
python -m production_server
```
Server runs on http://localhost:8000
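A quick check against the API directly (there is no Nginx in front in this setup, so hit port 8000; this assumes the same `/health` route as the Docker stack):
```bash
curl http://localhost:8000/health | jq
```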
### Step 6: Start Worker (in another terminal)
```bash
source .venv/bin/activate
cd production
python -m worker
```
---
## Connecting Different Local Backends
| Backend | Start Command | API Base | Model Prefix | Example Model String |
|---------|--------------|----------|-------------|---------------------|
| **Ollama** | `ollama serve` | `http://localhost:11434/v1` | `ollama/` | `ollama/llama3.1` |
| **LM Studio** | Start server in GUI | `http://localhost:1234/v1` | `lmstudio/` | `lmstudio/llama-3-8b` |
| **llama.cpp** | `./llama-server -m model.gguf` | `http://localhost:8080/v1` | `llamacpp/` | `llamacpp/llama-2-7b` |
| **vLLM** | `python -m vllm.entrypoints.openai.api_server` | `http://localhost:8000/v1` | `vllm/` | `vllm/llama-3-8b` |
| **MLX** | `python -m mlx_lm.server` | `http://localhost:8000/v1` | `mlx/` | `mlx/llama-3-8b` |
| **NVIDIA NIM** | `docker run nvcr.io/...` | `http://localhost:8000/v1` | `nim/` | `nim/llama-3.1-8b` |
| **TGI** | `docker run ghcr.io/...tgi` | `http://localhost:8080/v1` | `tgi/` | `tgi/llama-3-8b` |
| **Custom PyTorch** | Your own server | `http://localhost:8000/v1` | `local/` | `local/my-model` |
### Override API Base (if not default port)
In `.env`:
```env
OLLAMA_API_BASE=http://192.168.1.100:11434/v1
LMSTUDIO_API_BASE=http://lmstudio.local:1234/v1
VLLM_API_BASE=http://vllm-server.internal:8000/v1
```
---
## Multi-Backend Setup (Recommended)
Run **multiple local backends** and let ml-intern round-robin or fail over:
```bash
# Terminal 1: Ollama for fast models
ollama pull llama3.1
ollama serve
# Terminal 2: vLLM for high-throughput
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--port 8001
```
In `.env`:
```env
OLLAMA_API_BASE=http://localhost:11434/v1
VLLM_API_BASE=http://localhost:8001/v1
```
Now you can use either:
```bash
curl http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Quick question"}]
  }'
curl http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/llama-3.1-70b",
    "messages": [{"role":"user","content":"Complex reasoning"}]
  }'
```
---
## CLI Mode (No Server)
If you want to use ml-intern as a CLI tool with local models (the original use case):
```bash
# Install the agent CLI
pip install -e .
# Run with local model
ml-intern --model ollama/llama3.1 "Write a Python function to sort a list"
# With local overrides
OLLAMA_API_BASE=http://localhost:11434/v1 ml-intern \
--model ollama/llama3.1 \
--yolo \
"Create a FastAPI app with Redis caching"
```
---
## Hardware Requirements by Backend
| Backend | Min GPU | Recommended GPU | RAM | Notes |
|---------|---------|----------------|-----|-------|
| Ollama (7B) | None (CPU) | 8GB VRAM | 16GB | Best ease-of-use |
| Ollama (70B) | 48GB VRAM | 80GB (A100) | 128GB | Q4 quantization helps |
| LM Studio | None (CPU) | 8GB+ VRAM | 16GB | Great GUI for exploration |
| vLLM (7B) | 16GB VRAM | 24GB (3090/A10G) | 32GB | Highest throughput |
| vLLM (70B) | 80GB VRAM | 2x A100 | 256GB | tensor_parallel required |
| llama.cpp | None (CPU) | Any | 8GB | Best for CPU-only |
| MLX (Mac) | Apple Silicon | M3 Max 36GB | 32GB | Native Apple GPU |
| NVIDIA NIM | 24GB+ | A100/H100 | 64GB | Enterprise support |
---
## Troubleshooting
### "Connection refused" to local LLM
Docker containers can't reach `localhost` on the host. Use one of:
- **Mac/Windows**: `host.docker.internal` (already in the default `.env`)
- **Linux**: your machine's LAN IP (e.g., `192.168.1.5`), or map `host.docker.internal` to the host gateway yourself (see the check below)
- **All platforms**: run the LLM server inside Docker Compose as well (next section)
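On Linux you can also verify reachability from inside a throwaway container; `--add-host=...:host-gateway` maps `host.docker.internal` to the Docker host (available in Docker 20.10+), and the Ollama port is used here only as an example:
```bash
# Should return Ollama's model list if the host's LLM server is reachable from containers
docker run --rm --add-host=host.docker.internal:host-gateway busybox \
  wget -qO- http://host.docker.internal:11434/api/tags
```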
### Ollama in Docker Compose
Add this service (and a named volume for its model cache) to `docker-compose.yml`:
```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"

volumes:
  ollama:
```
Then set `OLLAMA_API_BASE=http://ollama:11434/v1` (internal Docker DNS).
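Models then need to be pulled inside that container; assuming the service is named `ollama` as above:
```bash
docker-compose exec ollama ollama pull llama3.1
```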
### "Rate limit exceeded" immediately
The default RPM is 40. For local models with no actual limit, increase it:
```env
DEFAULT_RPM_LIMIT=1000
```
### PostgreSQL connection failed
```bash
# Check if Postgres is running
docker ps | grep postgres
# Check logs
docker logs ml-intern-postgres-1
# Reset database
docker-compose down -v # WARNING: deletes all data
docker-compose up -d postgres
```
### Grafana shows "No data"
Prometheus needs time to scrape. Wait 30 seconds, or check:
```bash
curl http://localhost:9090/api/v1/targets
```
### Slow first response
Local models load into VRAM/RAM on first request. Subsequent requests are fast. Use Redis caching (enabled by default) to skip LLM calls for repeated prompts.
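If the cold start bothers you, a throwaway request right after startup warms the model (the model name here is just an example; match it to your setup):
```bash
# Fire a warm-up request so the first real call is fast
curl -s -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"ollama/llama3.1","messages":[{"role":"user","content":"warm-up"}]}' > /dev/null
```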
---
## File Structure (Local Copy)
```
ml-intern/
├── production/
│   ├── docker-compose.yml      # Full stack
│   ├── Dockerfile.prod         # API + worker image
│   ├── production_server.py    # FastAPI app
│   ├── worker.py               # Background tasks
│   ├── init.sql                # DB schema
│   ├── nginx.conf              # Load balancer config
│   ├── prometheus.yml          # Metrics collection
│   ├── requirements.prod.txt   # Python deps
│   ├── .env.example            # Configuration template
│   ├── grafana/                # Dashboards
│   ├── k8s/                    # Kubernetes manifests
│   ├── helm/                   # Helm charts
│   └── tests/                  # Integration + load tests
└── agent/                      # Original ml-intern agent code
```
---
## Next Steps
1. **Load test your setup**: `locust -f production/tests/load_test.py --host http://localhost`
2. **Add cloud fallback**: Set `GROQ_API_KEY` or `OPENAI_API_KEY` for when local model is overloaded
3. **Monitor costs**: Even local models use electricity; Grafana tracks request volume
4. **Scale horizontally**: `docker-compose up -d --scale api=4`
---
## No Internet Required
Once models are downloaded and Docker images are cached, the entire stack runs **offline**:
- Local LLM (Ollama, LM Studio, etc.) - no network needed
- Redis, PostgreSQL, Nginx - local containers
- Prometheus + Grafana - local containers
- The only outbound calls are to the LLM API on localhost
Perfect for air-gapped environments or private data processing.