# Local Deployment Guide: No Hugging Face Required
Run the entire ml-intern production system **locally** on your machine using Docker Compose or native Python. No HF account, no cloud APIs needed (though you can add them).
## Prerequisites
- **Docker + Docker Compose** (recommended) OR **Python 3.11+**
- **8GB RAM minimum** (16GB+ recommended)
- **Local LLM backend** (pick one):
- [Ollama](https://ollama.com) - easiest
- [LM Studio](https://lmstudio.ai) - GUI, great for Mac/Windows
- [llama.cpp](https://github.com/ggerganov/llama.cpp) - most control
- [vLLM](https://github.com/vllm-project/vllm) - highest throughput
- [NVIDIA NIM](https://developer.nvidia.com/nim) - enterprise GPUs
- [MLX](https://github.com/ml-explore/mlx) - Apple Silicon optimized
---
## Option 1: Docker Compose (Fastest: ~2 Minutes)
### Step 1: Start a Local LLM Server
**Option A - Ollama (Recommended)**
```bash
# Install Ollama (one-liner)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull llama3.1
# Start server (runs on :11434, OpenAI-compatible on :11434/v1)
ollama serve
```
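Before wiring anything else up, it helps to confirm the server answers. A quick sanity check (Ollama's native `/api/tags` lists the models you've pulled; its OpenAI-compatible layer also exposes `/v1/models`):
```bash
# List pulled models via the native API
curl http://localhost:11434/api/tags
# Same information via the OpenAI-compatible route the stack will use
curl http://localhost:11434/v1/models
```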
**Option B - LM Studio**
1. Download LM Studio from https://lmstudio.ai
2. Load any GGUF model
3. Start the **Local Inference Server**; it runs on `http://localhost:1234/v1`
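To confirm the server is up, you can query its OpenAI-compatible model listing:
```bash
# Should return the model(s) currently loaded in LM Studio
curl http://localhost:1234/v1/models
```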
**Option C - llama.cpp Server**
```bash
# Build (llama.cpp uses CMake; the server binary is llama-server)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release
# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
# Start server (OpenAI-compatible API on :8080/v1)
./build/bin/llama-server -m llama-2-7b.Q4_K_M.gguf --port 8080
```
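A quick smoke test against the server (recent llama.cpp builds expose a `/health` route alongside the OpenAI-compatible endpoints; the server answers for whatever single model it has loaded):
```bash
# Liveness check
curl http://localhost:8080/health
# OpenAI-compatible chat request against the loaded model
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"ping"}]}'
```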
### Step 2: Clone & Configure
```bash
git clone https://github.com/raazkumar/ml-intern-local-fork.git
cd ml-intern-local-fork/production
# Copy environment template
cp .env.example .env
```
Edit `.env` and **only change these lines**:
```env
# Point to your local LLM server
OLLAMA_API_BASE=http://host.docker.internal:11434/v1
# (or for LM Studio: http://host.docker.internal:1234/v1)
# (or for llama.cpp: http://host.docker.internal:8080/v1)
# No cloud API keys needed for local-only mode
# Leave these blank or comment them out:
# HF_TOKEN=
# ANTHROPIC_API_KEY=
# OPENAI_API_KEY=
# GROQ_API_KEY=
# NVIDIA_API_KEY=
```
> **Docker host networking note**: On Linux, `host.docker.internal` may not work. Use your machine's LAN IP (e.g., `192.168.1.5`) instead. On Mac/Windows, `host.docker.internal` works out of the box.
### Step 3: Launch the Stack
```bash
docker-compose up -d
```
This starts:
- **API server** (FastAPI) on http://localhost:8000
- **Background workers** (cleanup, budget alerts)
- **Redis** (caching + rate limiting) on :6379
- **PostgreSQL** (audit log + sessions) on :5432
- **Nginx** (load balancer) on :80
- **Prometheus** (metrics) on :9090
- **Grafana** (dashboards) on :3000
- **Jaeger** (tracing) on :16686
- **pgAdmin** (DB GUI) on :5050
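Before moving on, a quick check that everything actually came up (the `api` service name is an assumption about this compose file; adjust to match yours):
```bash
# All services should show "Up" (or "healthy" where healthchecks are defined)
docker-compose ps
# Tail the API logs if anything looks wrong (service name assumed to be "api")
docker-compose logs -f api
```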
### Step 4: Verify
```bash
# Health check
curl http://localhost/health | jq
# List available models (includes your local ones)
curl http://localhost/v1/models | jq
# Chat with your local model
curl -X POST http://localhost/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ollama/llama3.1",
"messages": [{"role":"user","content":"Hello from local deployment!"}],
"stream": false
}'
```
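Streaming goes through the same route: set `"stream": true` and disable curl's buffering with `-N` to watch tokens arrive as they are generated (assuming the gateway passes server-sent events through, which is standard for OpenAI-compatible setups):
```bash
curl -N -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Stream a haiku about local LLMs"}],
    "stream": true
  }'
```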
### Step 5: View Dashboards
| Service | URL | Default Login |
|---------|-----|---------------|
| API | http://localhost:8000 | none |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | none |
| Jaeger UI | http://localhost:16686 | none |
| pgAdmin | http://localhost:5050 | admin@mlintern.local / admin |
---
## Option 2: Pure Python (No Docker)
For development or lightweight setups.
### Step 1: Install Dependencies
```bash
# Python 3.11+ required
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r production/requirements.prod.txt
```
### Step 2: Start PostgreSQL + Redis
You need these running locally. Options:
**A) System packages:**
```bash
# Ubuntu/Debian
sudo apt install postgresql redis-server
sudo systemctl start postgresql redis-server
# macOS
brew install postgresql redis
brew services start postgresql
brew services start redis
```
**B) Docker (just the infra):**
```bash
docker run -d --name redis -p 6379:6379 redis:7-alpine
docker run -d --name postgres \
-e POSTGRES_PASSWORD=ml_intern \
-e POSTGRES_DB=ml_intern \
-p 5432:5432 postgres:16-alpine
```
### Step 3: Initialize Database
```bash
psql -U postgres -h localhost -d ml_intern -f production/init.sql
```
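To confirm the schema actually loaded, list the tables created by `init.sql`:
```bash
# Should print the tables backing the audit log, sessions, etc.
psql -U postgres -h localhost -d ml_intern -c '\dt'
```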
### Step 4: Configure Environment
```bash
export REDIS_URL=redis://localhost:6379
export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern
export PORT=8000
export WORKERS=1
export LOG_LEVEL=INFO
# Point to your local LLM
export OLLAMA_API_BASE=http://localhost:11434/v1
```
### Step 5: Start the Server
```bash
cd production
python -m production_server
```
Server runs on http://localhost:8000
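A quick check against the API directly (there is no Nginx in front in this setup, so hit port 8000; this assumes the same `/health` route as the Docker stack):
```bash
curl http://localhost:8000/health | jq
```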
### Step 6: Start Worker (in another terminal)
```bash
source .venv/bin/activate
cd production
python -m worker
```
---
## Connecting Different Local Backends
| Backend | Start Command | API Base | Model Prefix | Example Model String |
|---------|--------------|----------|-------------|---------------------|
| **Ollama** | `ollama serve` | `http://localhost:11434/v1` | `ollama/` | `ollama/llama3.1` |
| **LM Studio** | Start server in GUI | `http://localhost:1234/v1` | `lmstudio/` | `lmstudio/llama-3-8b` |
| **llama.cpp** | `./llama-server -m model.gguf` | `http://localhost:8080/v1` | `llamacpp/` | `llamacpp/llama-2-7b` |
| **vLLM** | `python -m vllm.entrypoints.openai.api_server` | `http://localhost:8000/v1` | `vllm/` | `vllm/llama-3-8b` |
| **MLX** | `python -m mlx_lm.server` | `http://localhost:8000/v1` | `mlx/` | `mlx/llama-3-8b` |
| **NVIDIA NIM** | `docker run nvcr.io/...` | `http://localhost:8000/v1` | `nim/` | `nim/llama-3.1-8b` |
| **TGI** | `docker run ghcr.io/...tgi` | `http://localhost:8080/v1` | `tgi/` | `tgi/llama-3-8b` |
| **Custom PyTorch** | Your own server | `http://localhost:8000/v1` | `local/` | `local/my-model` |
### Override API Base (if not default port)
In `.env`:
```env
OLLAMA_API_BASE=http://192.168.1.100:11434/v1
LMSTUDIO_API_BASE=http://lmstudio.local:1234/v1
VLLM_API_BASE=http://vllm-server.internal:8000/v1
```
---
## Multi-Backend Setup (Recommended)
Run **multiple local backends** and let ml-intern round-robin or fail over:
```bash
# Terminal 1: Ollama for fast models
ollama pull llama3.1
ollama serve
# Terminal 2: vLLM for high-throughput
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--port 8001
```
In `.env`:
```env
OLLAMA_API_BASE=http://localhost:11434/v1
VLLM_API_BASE=http://localhost:8001/v1
```
Now you can use either:
```bash
curl http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Quick question"}]
  }'
curl http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/llama-3.1-70b",
    "messages": [{"role":"user","content":"Complex reasoning"}]
  }'
```
---
## CLI Mode (No Server)
If you want to use ml-intern as a CLI tool with local models (the original use case):
```bash
# Install the agent CLI
pip install -e .
# Run with local model
ml-intern --model ollama/llama3.1 "Write a Python function to sort a list"
# With local overrides
OLLAMA_API_BASE=http://localhost:11434/v1 ml-intern \
--model ollama/llama3.1 \
--yolo \
"Create a FastAPI app with Redis caching"
```
---
## Hardware Requirements by Backend
| Backend | Min GPU | Recommended GPU | RAM | Notes |
|---------|---------|----------------|-----|-------|
| Ollama (7B) | None (CPU) | 8GB VRAM | 16GB | Best ease-of-use |
| Ollama (70B) | 48GB VRAM | 80GB (A100) | 128GB | Q4 quantization helps |
| LM Studio | None (CPU) | 8GB+ VRAM | 16GB | Great GUI for exploration |
| vLLM (7B) | 16GB VRAM | 24GB (3090/A10G) | 32GB | Highest throughput |
| vLLM (70B) | 80GB VRAM | 2x A100 | 256GB | tensor_parallel required |
| llama.cpp | None (CPU) | Any | 8GB | Best for CPU-only |
| MLX (Mac) | Apple Silicon | M3 Max 36GB | 32GB | Native Apple GPU |
| NVIDIA NIM | 24GB+ | A100/H100 | 64GB | Enterprise support |
---
## Troubleshooting
### "Connection refused" to local LLM
Docker containers can't reach `localhost` on the host. Use one of:
- **Mac/Windows**: `host.docker.internal` (already in the default `.env`)
- **Linux**: your machine's LAN IP (e.g., `192.168.1.5`), or map `host.docker.internal` to the host gateway yourself (see the check below)
- **All platforms**: run the LLM server inside Docker Compose as well (next section)
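On Linux you can also verify reachability from inside a throwaway container; `--add-host=...:host-gateway` maps `host.docker.internal` to the Docker host (available in Docker 20.10+), and the Ollama port is used here only as an example:
```bash
# Should return Ollama's model list if the host's LLM server is reachable from containers
docker run --rm --add-host=host.docker.internal:host-gateway busybox \
  wget -qO- http://host.docker.internal:11434/api/tags
```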
### Ollama in Docker Compose
Add this service (and a named volume for its model cache) to `docker-compose.yml`:
```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"

volumes:
  ollama:
```
Then set `OLLAMA_API_BASE=http://ollama:11434/v1` (internal Docker DNS).
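Models then need to be pulled inside that container; assuming the service is named `ollama` as above:
```bash
docker-compose exec ollama ollama pull llama3.1
```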
### "Rate limit exceeded" immediately
The default RPM is 40. For local models with no actual limit, increase it:
```env
DEFAULT_RPM_LIMIT=1000
```
### PostgreSQL connection failed
```bash
# Check if Postgres is running
docker ps | grep postgres
# Check logs
docker logs ml-intern-postgres-1
# Reset database
docker-compose down -v # WARNING: deletes all data
docker-compose up -d postgres
```
### Grafana shows "No data"
Prometheus needs time to scrape. Wait 30 seconds, or check:
```bash
curl http://localhost:9090/api/v1/targets
```
### Slow first response
Local models load into VRAM/RAM on first request. Subsequent requests are fast. Use Redis caching (enabled by default) to skip LLM calls for repeated prompts.
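If the cold start bothers you, a throwaway request right after startup warms the model (the model name here is just an example; match it to your setup):
```bash
# Fire a warm-up request so the first real call is fast
curl -s -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"ollama/llama3.1","messages":[{"role":"user","content":"warm-up"}]}' > /dev/null
```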
---
## File Structure (Local Copy)
```
ml-intern/
├── production/
│   ├── docker-compose.yml      # Full stack
│   ├── Dockerfile.prod         # API + worker image
│   ├── production_server.py    # FastAPI app
│   ├── worker.py               # Background tasks
│   ├── init.sql                # DB schema
│   ├── nginx.conf              # Load balancer config
│   ├── prometheus.yml          # Metrics collection
│   ├── requirements.prod.txt   # Python deps
│   ├── .env.example            # Configuration template
│   ├── grafana/                # Dashboards
│   ├── k8s/                    # Kubernetes manifests
│   ├── helm/                   # Helm charts
│   └── tests/                  # Integration + load tests
└── agent/                      # Original ml-intern agent code
```
---
## Next Steps
1. **Load test your setup**: `locust -f production/tests/load_test.py --host http://localhost`
2. **Add cloud fallback**: Set `GROQ_API_KEY` or `OPENAI_API_KEY` for when local model is overloaded
3. **Monitor costs**: Even local models use electricity; Grafana tracks request volume
4. **Scale horizontally**: `docker-compose up -d --scale api=4`
---
## No Internet Required
Once models are downloaded and Docker images are cached, the entire stack runs **offline**:
- Local LLM (Ollama, LM Studio, etc.) - no network needed
- Redis, PostgreSQL, Nginx - local containers
- Prometheus + Grafana - local containers
- The only outbound calls are to the LLM API on localhost
Perfect for air-gapped environments or private data processing.