Local Deployment Guide: No Hugging Face Required

Run the entire ml-intern production system locally on your machine using Docker Compose or native Python. No HF account, no cloud APIs needed (though you can add them).

Prerequisites

  • Docker + Docker Compose (recommended) OR Python 3.11+
  • 8GB RAM minimum (16GB+ recommended)
  • Local LLM backend (pick one of the options in Step 1 below)

Option 1: Docker Compose (Fastest, ~2 Minutes)

Step 1: Start a Local LLM Server

Option A: Ollama (Recommended)

# Install Ollama (one-liner)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1

# Start server (runs on :11434, OpenAI-compatible on :11434/v1)
ollama serve

Option B: LM Studio

  1. Download LM Studio from https://lmstudio.ai
  2. Load any GGUF model
  3. Start the Local Inference Server → it runs on http://localhost:1234/v1

Option C: llama.cpp Server

# Build (recent llama.cpp uses CMake; older checkouts built with `make` and named the binary ./server)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

# Start server (OpenAI-compatible API on :8080/v1)
./build/bin/llama-server -m llama-2-7b.Q4_K_M.gguf --port 8080
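
Whichever backend you choose, it's worth a quick sanity check before wiring up the stack. A minimal probe against any OpenAI-compatible endpoint (adjust the port: 11434 for Ollama, 1234 for LM Studio, 8080 for llama.cpp; the model name must match one the backend has loaded):

# List the models the backend exposes
curl http://localhost:11434/v1/models

# Minimal chat round-trip
curl -X POST http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role":"user","content":"ping"}]}'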

Step 2: Clone & Configure

git clone https://github.com/raazkumar/ml-intern-local-fork.git
cd ml-intern-local-fork/production

# Copy environment template
cp .env.example .env

Edit .env and change only these lines:

# Point to your local LLM server
OLLAMA_API_BASE=http://host.docker.internal:11434/v1
# (or for LM Studio: http://host.docker.internal:1234/v1)
# (or for llama.cpp: http://host.docker.internal:8080/v1)

# No cloud API keys needed for local-only mode
# Leave these blank or comment them out:
# HF_TOKEN=
# ANTHROPIC_API_KEY=
# OPENAI_API_KEY=
# GROQ_API_KEY=
# NVIDIA_API_KEY=

Docker host networking note: On Linux, host.docker.internal may not work. Use your machine's LAN IP (e.g., 192.168.1.5) instead. On Mac/Windows, host.docker.internal works out of the box.
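
On Linux you can either look up the LAN IP or map host.docker.internal yourself via Docker's host-gateway feature (Docker 20.10+). A sketch, assuming the API service in docker-compose.yml is named api:

# Find your LAN IP
hostname -I | awk '{print $1}'

# Or add this to the api service in docker-compose.yml:
#   extra_hosts:
#     - "host.docker.internal:host-gateway"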

Step 3: Launch the Stack

docker-compose up -d

This starts:

  • API server (FastAPI) on http://localhost:8000
  • Background workers (cleanup, budget alerts)
  • Redis (caching + rate limiting) on :6379
  • PostgreSQL (audit log + sessions) on :5432
  • Nginx (load balancer) on :80
  • Prometheus (metrics) on :9090
  • Grafana (dashboards) on :3000
  • Jaeger (tracing) on :16686
  • pgAdmin (DB GUI) on :5050
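
Before moving on, confirm everything came up (this assumes the API service is named api in docker-compose.yml):

# All services should show "Up" (or "healthy" where healthchecks are defined)
docker-compose ps

# Tail the logs of any service that didn't start
docker-compose logs -f api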

Step 4: Verify

# Health check
curl http://localhost/health | jq

# List available models (includes your local ones)
curl http://localhost/v1/models | jq

# Chat with your local model
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Hello from local deployment!"}],
    "stream": false
  }'
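
For token-by-token output, the same endpoint should stream Server-Sent Events when "stream" is true (standard OpenAI-compatible behavior); use curl -N to disable output buffering:

curl -N -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Stream a haiku"}],
    "stream": true
  }'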

Step 5: View Dashboards

Service      URL                       Default Login
API          http://localhost:8000     (none)
Grafana      http://localhost:3000     admin / admin
Prometheus   http://localhost:9090     (none)
Jaeger UI    http://localhost:16686    (none)
pgAdmin      http://localhost:5050     admin@mlintern.local / admin

Option 2: Pure Python (No Docker)

For development or lightweight setups.

Step 1: Install Dependencies

# Python 3.11+ required
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -r production/requirements.prod.txt

Step 2: Start PostgreSQL + Redis

You need these running locally. Options:

A) System packages:

# Ubuntu/Debian (the package and service are named redis-server)
sudo apt install postgresql redis-server
sudo systemctl start postgresql redis-server

# macOS (Homebrew's PostgreSQL formula is versioned; start each service separately)
brew install postgresql@16 redis
brew services start postgresql@16
brew services start redis

B) Docker (just the infra):

docker run -d --name redis -p 6379:6379 redis:7-alpine
docker run -d --name postgres \
  -e POSTGRES_PASSWORD=ml_intern \
  -e POSTGRES_DB=ml_intern \
  -p 5432:5432 postgres:16-alpine
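
If you used the Docker option, both containers should answer before you initialize the database:

# Redis should reply PONG
docker exec redis redis-cli ping

# Postgres should report "accepting connections"
docker exec postgres pg_isready -U postgres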

Step 3: Initialize Database

psql -U postgres -h localhost -d ml_intern -f production/init.sql
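
To confirm the schema loaded, list the tables (the exact names depend on init.sql):

psql -U postgres -h localhost -d ml_intern -c '\dt'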

Step 4: Configure Environment

export REDIS_URL=redis://localhost:6379
export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern
export PORT=8000
export WORKERS=1
export LOG_LEVEL=INFO

# Point to your local LLM
export OLLAMA_API_BASE=http://localhost:11434/v1
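
Rather than re-exporting in every new shell, you can keep these settings in a file and source it (a convenience sketch; the .env.local filename is arbitrary):

# Save the lines above to .env.local (the "export" keyword is optional with set -a), then:
set -a; source .env.local; set +a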

Step 5: Start the Server

cd production
python -m production_server

Server runs on http://localhost:8000
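
As in the Docker setup, /health should answer once the process is up:

curl http://localhost:8000/health | jq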

Step 6: Start Worker (in another terminal)

source .venv/bin/activate
cd production
python -m worker

Connecting Different Local Backends

Backend          Start Command                                  API Base                    Model Prefix   Example Model String
Ollama           ollama serve                                   http://localhost:11434/v1   ollama/        ollama/llama3.1
LM Studio        Start server in GUI                            http://localhost:1234/v1    lmstudio/      lmstudio/llama-3-8b
llama.cpp        llama-server -m model.gguf                     http://localhost:8080/v1    llamacpp/      llamacpp/llama-2-7b
vLLM             python -m vllm.entrypoints.openai.api_server   http://localhost:8000/v1    vllm/          vllm/llama-3-8b
MLX              python -m mlx_lm.server                        http://localhost:8000/v1    mlx/           mlx/llama-3-8b
NVIDIA NIM       docker run nvcr.io/...                         http://localhost:8000/v1    nim/           nim/llama-3.1-8b
TGI              docker run ghcr.io/...tgi                      http://localhost:8080/v1    tgi/           tgi/llama-3-8b
Custom PyTorch   Your own server                                http://localhost:8000/v1    local/         local/my-model

Override API Base (if your backend is not on the default host/port)

In .env:

OLLAMA_API_BASE=http://192.168.1.100:11434/v1
LMSTUDIO_API_BASE=http://lmstudio.local:1234/v1
VLLM_API_BASE=http://vllm-server.internal:8000/v1

Multi-Backend Setup (Recommended)

Run multiple local backends and let ml-intern round-robin or fail over:

# Terminal 1: Ollama for fast models
ollama pull llama3.1
ollama serve

# Terminal 2: vLLM for high-throughput
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8001

In .env:

OLLAMA_API_BASE=http://localhost:11434/v1
VLLM_API_BASE=http://localhost:8001/v1

Now you can use either:

curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Quick question"}]
  }'

curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/llama-3.1-70b",
    "messages": [{"role":"user","content":"Complex reasoning"}]
  }'

CLI Mode (No Server)

If you want to use ml-intern as a CLI tool with local models (the original use case):

# Install the agent CLI
pip install -e .

# Run with local model
ml-intern --model ollama/llama3.1 "Write a Python function to sort a list"

# With local overrides
OLLAMA_API_BASE=http://localhost:11434/v1 ml-intern \
  --model ollama/llama3.1 \
  --yolo \
  "Create a FastAPI app with Redis caching"

Hardware Requirements by Backend

Backend        Min GPU         Recommended GPU    RAM     Notes
Ollama (7B)    None (CPU)      8GB VRAM           16GB    Best ease-of-use
Ollama (70B)   48GB VRAM       80GB (A100)        128GB   Q4 quantization helps
LM Studio      None (CPU)      8GB+ VRAM          16GB    Great GUI for exploration
vLLM (7B)      16GB VRAM       24GB (3090/A10G)   32GB    Highest throughput
vLLM (70B)     80GB VRAM       2x A100            256GB   tensor_parallel required
llama.cpp      None (CPU)      Any                8GB     Best for CPU-only
MLX (Mac)      Apple Silicon   M3 Max 36GB        32GB    Native Apple GPU
NVIDIA NIM     24GB+           A100/H100          64GB    Enterprise support

Troubleshooting

"Connection refused" to local LLM

Inside a container, localhost refers to the container itself, not the host, so the API can't reach a host-side LLM at localhost. Use:

  • Mac/Windows: host.docker.internal (already in default .env)
  • Linux: Your machine's LAN IP, e.g., 192.168.1.5
  • All platforms: Put the LLM server in Docker Compose too

Ollama in Docker Compose

Add to docker-compose.yml:

  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"

# ...and declare the named volume at the top level of docker-compose.yml:
volumes:
  ollama:

Then set OLLAMA_API_BASE=http://ollama:11434/v1 (internal Docker DNS).
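
Note that the container starts with no models; pull one into it (assuming the service is named ollama as above):

docker-compose exec ollama ollama pull llama3.1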

"Rate limit exceeded" immediately

The default RPM is 40. For local models with no actual limit, increase it:

DEFAULT_RPM_LIMIT=1000

PostgreSQL connection failed

# Check if Postgres is running
docker ps | grep postgres

# Check logs (the container name depends on the compose project name; the service name does not)
docker-compose logs postgres

# Reset database
docker-compose down -v  # WARNING: deletes all data
docker-compose up -d postgres

Grafana shows "No data"

Prometheus needs time to scrape. Wait 30 seconds, or check:

curl http://localhost:9090/api/v1/targets
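
To see scrape health per job at a glance (the targets API is standard Prometheus):

curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'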

Slow first response

Local models load into VRAM/RAM on first request. Subsequent requests are fast. Use Redis caching (enabled by default) to skip LLM calls for repeated prompts.
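
With Ollama you can also warm the model before the first real request by sending an empty generate call; keep_alive controls how long it stays resident (-1 means indefinitely):

# Load llama3.1 into memory ahead of time
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1", "keep_alive": -1}'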


File Structure (Local Copy)

ml-intern/
├── production/
│   ├── docker-compose.yml     # Full stack
│   ├── Dockerfile.prod        # API + worker image
│   ├── production_server.py   # FastAPI app
│   ├── worker.py              # Background tasks
│   ├── init.sql               # DB schema
│   ├── nginx.conf             # Load balancer config
│   ├── prometheus.yml         # Metrics collection
│   ├── requirements.prod.txt  # Python deps
│   ├── .env.example           # Configuration template
│   ├── grafana/               # Dashboards
│   ├── k8s/                   # Kubernetes manifests
│   ├── helm/                  # Helm charts
│   └── tests/                 # Integration + load tests
└── agent/                     # Original ml-intern agent code

Next Steps

  1. Load test your setup: locust -f production/tests/load_test.py --host http://localhost
  2. Add cloud fallback: Set GROQ_API_KEY or OPENAI_API_KEY for when the local model is overloaded
  3. Monitor costs: Even local models use electricity; Grafana tracks request volume
  4. Scale horizontally: docker-compose up -d --scale api=4 (see the check below)
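
After scaling, confirm the replica count (this assumes the api service does not publish a fixed host port, which would collide across replicas; Nginx balances over them):

docker-compose ps api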

No Internet Required

Once models are downloaded and Docker images are cached, the entire stack runs offline:

  • Local LLM (Ollama, LM Studio, etc.): no network
  • Redis, PostgreSQL, Nginx: local containers
  • Prometheus + Grafana: local containers
  • The only network calls are to the LLM API on localhost

Perfect for air-gapped environments or private data processing.