Local Deployment Guide: No Hugging Face Required

Run the entire ml-intern production system locally on your machine using Docker Compose or native Python. No HF account, no cloud APIs needed (though you can add them).

Prerequisites

  • Docker + Docker Compose (recommended) OR Python 3.11+
  • 8GB RAM minimum (16GB+ recommended)
  • Local LLM backend (pick one of the options in Step 1 below)

Option 1: Docker Compose (Fastest, ~2 Minutes)

Step 1: Start a Local LLM Server

Option A: Ollama (Recommended)

# Install Ollama (one-liner)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1

# Start server (runs on :11434, OpenAI-compatible on :11434/v1)
ollama serve

Option B: LM Studio

  1. Download LM Studio from https://lmstudio.ai
  2. Load any GGUF model
  3. Start the Local Inference Server → it runs on http://localhost:1234/v1

Option C: llama.cpp Server

# Build (recent llama.cpp uses CMake; older checkouts built with `make` and named the binary ./server)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

# Start server (OpenAI-compatible API on :8080/v1)
./build/bin/llama-server -m llama-2-7b.Q4_K_M.gguf --port 8080
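
Whichever backend you choose, it's worth a quick sanity check before wiring up the stack. A minimal probe against any OpenAI-compatible endpoint (adjust the port: 11434 for Ollama, 1234 for LM Studio, 8080 for llama.cpp; the model name must match one the backend has loaded):

# List the models the backend exposes
curl http://localhost:11434/v1/models

# Minimal chat round-trip
curl -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role":"user","content":"ping"}]}'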

Step 2: Clone & Configure

git clone https://github.com/raazkumar/ml-intern-local-fork.git
cd ml-intern-local-fork/production

# Copy environment template
cp .env.example .env

Edit .env and change only these lines:

# Point to your local LLM server
OLLAMA_API_BASE=http://host.docker.internal:11434/v1
# (or for LM Studio: http://host.docker.internal:1234/v1)
# (or for llama.cpp: http://host.docker.internal:8080/v1)

# No cloud API keys needed for local-only mode
# Leave these blank or comment them out:
# HF_TOKEN=
# ANTHROPIC_API_KEY=
# OPENAI_API_KEY=
# GROQ_API_KEY=
# NVIDIA_API_KEY=

Docker host networking note: On Linux, host.docker.internal may not work. Use your machine's LAN IP (e.g., 192.168.1.5) instead. On Mac/Windows, host.docker.internal works out of the box.
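
On Linux you can either look up the LAN IP or map host.docker.internal yourself via Docker's host-gateway feature (Docker 20.10+). A sketch, assuming the API service in docker-compose.yml is named api:

# Find your LAN IP
hostname -I | awk '{print $1}'

# Or add this to the api service in docker-compose.yml:
#   extra_hosts:
#     - "host.docker.internal:host-gateway"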

Step 3: Launch the Stack

docker-compose up -d

This starts:

  • API server (FastAPI) on http://localhost:8000
  • Background workers (cleanup, budget alerts)
  • Redis (caching + rate limiting) on :6379
  • PostgreSQL (audit log + sessions) on :5432
  • Nginx (load balancer) on :80
  • Prometheus (metrics) on :9090
  • Grafana (dashboards) on :3000
  • Jaeger (tracing) on :16686
  • pgAdmin (DB GUI) on :5050
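
Before moving on, confirm everything came up (this assumes the API service is named api in docker-compose.yml):

# All services should show "Up" (or "healthy" where healthchecks are defined)
docker-compose ps

# Tail the logs of any service that didn't start
docker-compose logs -f api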

Step 4: Verify

# Health check
curl http://localhost/health | jq

# List available models (includes your local ones)
curl http://localhost/v1/models | jq

# Chat with your local model
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Hello from local deployment!"}],
    "stream": false
  }'
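
For token-by-token output, the same endpoint should stream Server-Sent Events when "stream" is true (standard OpenAI-compatible behavior); use curl -N to disable output buffering:

curl -N -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Stream a haiku"}],
    "stream": true
  }'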

Step 5: View Dashboards

Service      URL                       Default Login
API          http://localhost:8000     (none)
Grafana      http://localhost:3000     admin / admin
Prometheus   http://localhost:9090     (none)
Jaeger UI    http://localhost:16686    (none)
pgAdmin      http://localhost:5050     admin@mlintern.local / admin

Option 2: Pure Python (No Docker)

For development or lightweight setups.

Step 1: Install Dependencies

# Python 3.11+ required
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -r production/requirements.prod.txt

Step 2: Start PostgreSQL + Redis

You need these running locally. Options:

A) System packages:

# Ubuntu/Debian (the package and service are named redis-server)
sudo apt install postgresql redis-server
sudo systemctl start postgresql redis-server

# macOS (Homebrew's PostgreSQL formula is versioned; start each service separately)
brew install postgresql@16 redis
brew services start postgresql@16
brew services start redis

B) Docker (just the infra):

docker run -d --name redis -p 6379:6379 redis:7-alpine
docker run -d --name postgres \
  -e POSTGRES_PASSWORD=ml_intern \
  -e POSTGRES_DB=ml_intern \
  -p 5432:5432 postgres:16-alpine
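
If you used the Docker option, both containers should answer before you initialize the database:

# Redis should reply PONG
docker exec redis redis-cli ping

# Postgres should report "accepting connections"
docker exec postgres pg_isready -U postgres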

Step 3: Initialize Database

psql -U postgres -h localhost -d ml_intern -f production/init.sql
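
To confirm the schema loaded, list the tables (the exact names depend on init.sql):

psql -U postgres -h localhost -d ml_intern -c '\dt'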

Step 4: Configure Environment

export REDIS_URL=redis://localhost:6379
export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern
export PORT=8000
export WORKERS=1
export LOG_LEVEL=INFO

# Point to your local LLM
export OLLAMA_API_BASE=http://localhost:11434/v1
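
Rather than re-exporting in every new shell, you can keep these settings in a file and source it (a convenience sketch; the .env.local filename is arbitrary):

# Save the lines above to .env.local (the "export" keyword is optional with set -a), then:
set -a; source .env.local; set +a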

Step 5: Start the Server

cd production
python -m production_server

Server runs on http://localhost:8000
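
As in the Docker setup, /health should answer once the process is up:

curl http://localhost:8000/health | jq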

Step 6: Start Worker (in another terminal)

source .venv/bin/activate
cd production
python -m worker

Connecting Different Local Backends

Backend          Start Command                                  API Base                    Model Prefix   Example Model String
Ollama           ollama serve                                   http://localhost:11434/v1   ollama/        ollama/llama3.1
LM Studio        Start server in GUI                            http://localhost:1234/v1    lmstudio/      lmstudio/llama-3-8b
llama.cpp        llama-server -m model.gguf                     http://localhost:8080/v1    llamacpp/      llamacpp/llama-2-7b
vLLM             python -m vllm.entrypoints.openai.api_server   http://localhost:8000/v1    vllm/          vllm/llama-3-8b
MLX              python -m mlx_lm.server                        http://localhost:8000/v1    mlx/           mlx/llama-3-8b
NVIDIA NIM       docker run nvcr.io/...                         http://localhost:8000/v1    nim/           nim/llama-3.1-8b
TGI              docker run ghcr.io/...tgi                      http://localhost:8080/v1    tgi/           tgi/llama-3-8b
Custom PyTorch   Your own server                                http://localhost:8000/v1    local/         local/my-model

Override API Base (if your backend is not on the default host/port)

In .env:

OLLAMA_API_BASE=http://192.168.1.100:11434/v1
LMSTUDIO_API_BASE=http://lmstudio.local:1234/v1
VLLM_API_BASE=http://vllm-server.internal:8000/v1

Multi-Backend Setup (Recommended)

Run multiple local backends and let ml-intern round-robin or fail over:

# Terminal 1: Ollama for fast models
ollama pull llama3.1
ollama serve

# Terminal 2: vLLM for high-throughput
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8001

In .env:

OLLAMA_API_BASE=http://localhost:11434/v1
VLLM_API_BASE=http://localhost:8001/v1

Now you can use either:

curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Quick question"}]
  }'

curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/llama-3.1-70b",
    "messages": [{"role":"user","content":"Complex reasoning"}]
  }'

CLI Mode (No Server)

If you want to use ml-intern as a CLI tool with local models (the original use case):

# Install the agent CLI
pip install -e .

# Run with local model
ml-intern --model ollama/llama3.1 "Write a Python function to sort a list"

# With local overrides
OLLAMA_API_BASE=http://localhost:11434/v1 ml-intern \
  --model ollama/llama3.1 \
  --yolo \
  "Create a FastAPI app with Redis caching"

Hardware Requirements by Backend

Backend        Min GPU         Recommended GPU    RAM     Notes
Ollama (7B)    None (CPU)      8GB VRAM           16GB    Best ease-of-use
Ollama (70B)   48GB VRAM       80GB (A100)        128GB   Q4 quantization helps
LM Studio      None (CPU)      8GB+ VRAM          16GB    Great GUI for exploration
vLLM (7B)      16GB VRAM       24GB (3090/A10G)   32GB    Highest throughput
vLLM (70B)     80GB VRAM       2x A100            256GB   tensor_parallel required
llama.cpp      None (CPU)      Any                8GB     Best for CPU-only
MLX (Mac)      Apple Silicon   M3 Max 36GB        32GB    Native Apple GPU
NVIDIA NIM     24GB+           A100/H100          64GB    Enterprise support

Troubleshooting

"Connection refused" to local LLM

Inside a container, localhost refers to the container itself, not the host, so the API can't reach a host-side LLM at localhost. Use:

  • Mac/Windows: host.docker.internal (already in default .env)
  • Linux: Your machine's LAN IP, e.g., 192.168.1.5
  • All platforms: Put the LLM server in Docker Compose too

Ollama in Docker Compose

Add to docker-compose.yml:

  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"

# ...and declare the named volume at the top level of docker-compose.yml:
volumes:
  ollama:

Then set OLLAMA_API_BASE=http://ollama:11434/v1 (internal Docker DNS).
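
Note that the container starts with no models; pull one into it (assuming the service is named ollama as above):

docker-compose exec ollama ollama pull llama3.1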

"Rate limit exceeded" immediately

The default RPM is 40. For local models with no actual limit, increase it:

DEFAULT_RPM_LIMIT=1000

PostgreSQL connection failed

# Check if Postgres is running
docker ps | grep postgres

# Check logs (the container name depends on the compose project name; the service name does not)
docker-compose logs postgres

# Reset database
docker-compose down -v  # WARNING: deletes all data
docker-compose up -d postgres

Grafana shows "No data"

Prometheus needs time to scrape. Wait 30 seconds, or check:

curl http://localhost:9090/api/v1/targets
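
To see scrape health per job at a glance (the targets API is standard Prometheus):

curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'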

Slow first response

Local models load into VRAM/RAM on first request. Subsequent requests are fast. Use Redis caching (enabled by default) to skip LLM calls for repeated prompts.
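
With Ollama you can also warm the model before the first real request by sending an empty generate call; keep_alive controls how long it stays resident (-1 means indefinitely):

# Load llama3.1 into memory ahead of time
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1", "keep_alive": -1}'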


File Structure (Local Copy)

ml-intern/
├── production/
│   ├── docker-compose.yml     # Full stack
│   ├── Dockerfile.prod        # API + worker image
│   ├── production_server.py   # FastAPI app
│   ├── worker.py              # Background tasks
│   ├── init.sql               # DB schema
│   ├── nginx.conf             # Load balancer config
│   ├── prometheus.yml         # Metrics collection
│   ├── requirements.prod.txt  # Python deps
│   ├── .env.example           # Configuration template
│   ├── grafana/               # Dashboards
│   ├── k8s/                   # Kubernetes manifests
│   ├── helm/                  # Helm charts
│   └── tests/                 # Integration + load tests
└── agent/                     # Original ml-intern agent code

Next Steps

  1. Load test your setup: locust -f production/tests/load_test.py --host http://localhost
  2. Add cloud fallback: Set GROQ_API_KEY or OPENAI_API_KEY for when the local model is overloaded
  3. Monitor costs: Even local models use electricity; Grafana tracks request volume
  4. Scale horizontally: docker-compose up -d --scale api=4 (see the check below)
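
After scaling, confirm the replica count (this assumes the api service does not publish a fixed host port, which would collide across replicas; Nginx balances over them):

docker-compose ps api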

No Internet Required

Once models are downloaded and Docker images are cached, the entire stack runs offline:

  • Local LLM (Ollama, LM Studio, etc.): no network
  • Redis, PostgreSQL, Nginx: local containers
  • Prometheus + Grafana: local containers
  • The only network calls are to the LLM API on localhost

Perfect for air-gapped environments or private data processing.