Local Deployment Guide: No Hugging Face Required
Run the entire ml-intern production system locally on your machine using Docker Compose or native Python. No HF account, no cloud APIs needed (though you can add them).
Prerequisites
- Docker + Docker Compose (recommended) OR Python 3.11+
- 8GB RAM minimum (16GB+ recommended)
- A local LLM backend (pick one of the options in Step 1 below)
Option 1: Docker Compose (Fastest, ~2 Minutes)
Step 1: Start a Local LLM Server
Option A: Ollama (Recommended)
# Install Ollama (one-liner)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull llama3.1
# Start server (runs on :11434, OpenAI-compatible on :11434/v1)
ollama serve
Option B: LM Studio
- Download LM Studio from https://lmstudio.ai
- Load any GGUF model
- Start the Local Inference Server; it runs on http://localhost:1234/v1
Option C: llama.cpp Server
# Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
# Start server (OpenAI-compatible API on :8080/v1)
./server -m llama-2-7b.Q4_K_M.gguf --port 8080
Step 2: Clone & Configure
git clone https://github.com/raazkumar/ml-intern-local-fork.git
cd ml-intern-local-fork/production
# Copy environment template
cp .env.example .env
Edit .env and change only these lines:
# Point to your local LLM server
OLLAMA_API_BASE=http://host.docker.internal:11434/v1
# (or for LM Studio: http://host.docker.internal:1234/v1)
# (or for llama.cpp: http://host.docker.internal:8080/v1)
# No cloud API keys needed for local-only mode
# Leave these blank or comment them out:
# HF_TOKEN=
# ANTHROPIC_API_KEY=
# OPENAI_API_KEY=
# GROQ_API_KEY=
# NVIDIA_API_KEY=
Docker host networking note: On Linux, host.docker.internal may not work. Use your machine's LAN IP (e.g., 192.168.1.5) instead. On Mac/Windows, host.docker.internal works out of the box.
Step 3: Launch the Stack
docker-compose up -d
This starts:
- API server (FastAPI) on http://localhost:8000
- Background workers (cleanup, budget alerts)
- Redis (caching + rate limiting) on :6379
- PostgreSQL (audit log + sessions) on :5432
- Nginx (load balancer) on :80
- Prometheus (metrics) on :9090
- Grafana (dashboards) on :3000
- Jaeger (tracing) on :16686
- pgAdmin (DB GUI) on :5050
Step 4: Verify
# Health check
curl http://localhost/health | jq
# List available models (includes your local ones)
curl http://localhost/v1/models | jq
# Chat with your local model
curl -X POST http://localhost/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ollama/llama3.1",
"messages": [{"role":"user","content":"Hello from local deployment!"}],
"stream": false
}'
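Because the gateway exposes an OpenAI-compatible /v1 API, the standard OpenAI Python SDK can point at it directly. A minimal sketch, assuming local-only mode does not enforce an API key (the placeholder value below is arbitrary):

```python
# Talk to the local gateway with the standard OpenAI Python SDK.
# base_url is the Nginx front end from Step 3; api_key is a placeholder,
# assuming local-only mode does not enforce authentication.
from openai import OpenAI

client = OpenAI(base_url="http://localhost/v1", api_key="local-dev")

response = client.chat.completions.create(
    model="ollama/llama3.1",
    messages=[{"role": "user", "content": "Hello from local deployment!"}],
    stream=False,
)
print(response.choices[0].message.content)
```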
Step 5: View Dashboards
| Service | URL | Default Login |
|---|---|---|
| API | http://localhost:8000 | (none) |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | (none) |
| Jaeger UI | http://localhost:16686 | (none) |
| pgAdmin | http://localhost:5050 | admin@mlintern.local / admin |
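If you prefer to script these checks rather than click through each dashboard, here is a small sketch that polls the URLs from the table. The API path is the /health route from Step 4; the Prometheus and Grafana paths are their standard health endpoints:

```python
# poll_stack.py - script the dashboard checks instead of opening each URL.
# Endpoints follow the table above; adjust ports if you changed the mappings.
import requests

CHECKS = {
    "API (via Nginx)": "http://localhost/health",
    "Grafana": "http://localhost:3000/api/health",
    "Prometheus": "http://localhost:9090/-/ready",
    "Jaeger UI": "http://localhost:16686/",
}

for name, url in CHECKS.items():
    try:
        status = requests.get(url, timeout=3).status_code
        print(f"{name:<16} {url:<40} HTTP {status}")
    except requests.RequestException as err:
        print(f"{name:<16} {url:<40} UNREACHABLE ({err.__class__.__name__})")
```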
Option 2: Pure Python (No Docker)
For development or lightweight setups.
Step 1: Install Dependencies
# Python 3.11+ required
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r production/requirements.prod.txt
Step 2: Start PostgreSQL + Redis
You need these running locally. Options:
A) System packages:
# Ubuntu/Debian
sudo apt install postgresql redis
sudo systemctl start postgresql redis
# macOS
brew install postgresql redis
brew services start postgresql redis
B) Docker (just the infra):
docker run -d --name redis -p 6379:6379 redis:7-alpine
docker run -d --name postgres \
-e POSTGRES_PASSWORD=ml_intern \
-e POSTGRES_DB=ml_intern \
-p 5432:5432 postgres:16-alpine
Step 3: Initialize Database
psql -U postgres -h localhost -d ml_intern -f production/init.sql
Step 4: Configure Environment
export REDIS_URL=redis://localhost:6379
export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern
export PORT=8000
export WORKERS=1
export LOG_LEVEL=INFO
# Point to your local LLM
export OLLAMA_API_BASE=http://localhost:11434/v1
Step 5: Start the Server
cd production
python -m production_server
Server runs on http://localhost:8000
Step 6: Start Worker (in another terminal)
source .venv/bin/activate
cd production
python -m worker
Connecting Different Local Backends
| Backend | Start Command | API Base | Model Prefix | Example Model String |
|---|---|---|---|---|
| Ollama | ollama serve | http://localhost:11434/v1 | ollama/ | ollama/llama3.1 |
| LM Studio | Start server in GUI | http://localhost:1234/v1 | lmstudio/ | lmstudio/llama-3-8b |
| llama.cpp | ./server -m model.gguf | http://localhost:8080/v1 | llamacpp/ | llamacpp/llama-2-7b |
| vLLM | python -m vllm.entrypoints.openai.api_server | http://localhost:8000/v1 | vllm/ | vllm/llama-3-8b |
| MLX | python -m mlx_lm.server | http://localhost:8000/v1 | mlx/ | mlx/llama-3-8b |
| NVIDIA NIM | docker run nvcr.io/... | http://localhost:8000/v1 | nim/ | nim/llama-3.1-8b |
| TGI | docker run ghcr.io/...tgi | http://localhost:8080/v1 | tgi/ | tgi/llama-3-8b |
| Custom PyTorch | Your own server | http://localhost:8000/v1 | local/ | local/my-model |
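The model prefix is what selects the backend. As a mental model of how a prefix/model string could resolve to an API base, here is an illustrative sketch; it is not the project's actual router, and the environment variable names simply mirror the overrides shown in the next subsection:

```python
# Illustration of how a "prefix/model" string could map to an
# OpenAI-compatible API base. Not the project's real routing code.
import os

DEFAULT_BASES = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
    "llamacpp": "http://localhost:8080/v1",
    "vllm": "http://localhost:8000/v1",
    "mlx": "http://localhost:8000/v1",
    "nim": "http://localhost:8000/v1",
    "tgi": "http://localhost:8080/v1",
    "local": "http://localhost:8000/v1",
}

def resolve_backend(model_string: str) -> tuple[str, str]:
    """Split 'ollama/llama3.1' into (api_base, 'llama3.1')."""
    prefix, _, model = model_string.partition("/")
    # An env var like OLLAMA_API_BASE overrides the default port.
    api_base = os.environ.get(f"{prefix.upper()}_API_BASE") or DEFAULT_BASES[prefix]
    return api_base, model

print(resolve_backend("ollama/llama3.1"))
# -> ('http://localhost:11434/v1', 'llama3.1')
```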
Override API Base (if not default port)
In .env:
OLLAMA_API_BASE=http://192.168.1.100:11434/v1
LMSTUDIO_API_BASE=http://lmstudio.local:1234/v1
VLLM_API_BASE=http://vllm-server.internal:8000/v1
Multi-Backend Setup (Recommended)
Run multiple local backends and let ml-intern round-robin or fail over:
# Terminal 1: Ollama for fast models
ollama pull llama3.1
ollama serve
# Terminal 2: vLLM for high-throughput
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--port 8001
In .env:
OLLAMA_API_BASE=http://localhost:11434/v1
VLLM_API_BASE=http://localhost:8001/v1
Now you can use either:
curl http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Quick question"}]
  }'
curl http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/llama-3.1-70b",
    "messages": [{"role":"user","content":"Complex reasoning"}]
  }'
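If you want explicit client-side failover between the two (fast model first, larger model as backup) instead of relying on the gateway, a minimal sketch follows; the model strings match the examples above, and the retry policy is purely illustrative:

```python
# Try the fast Ollama model first, then fall back to the vLLM deployment.
# Both requests go through the gateway, which routes by model prefix.
import requests

BACKENDS = [
    ("http://localhost/v1/chat/completions", "ollama/llama3.1"),
    ("http://localhost/v1/chat/completions", "vllm/llama-3.1-70b"),
]

def chat(prompt: str) -> str:
    last_error = None
    for url, model in BACKENDS:
        try:
            resp = requests.post(
                url,
                json={"model": model,
                      "messages": [{"role": "user", "content": prompt}]},
                timeout=120,
            )
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        except requests.RequestException as err:
            last_error = err  # try the next backend
    raise RuntimeError(f"All backends failed: {last_error}")

print(chat("Quick question"))
```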
CLI Mode (No Server)
If you want to use ml-intern as a CLI tool with local models (the original use case):
# Install the agent CLI
pip install -e .
# Run with local model
ml-intern --model ollama/llama3.1 "Write a Python function to sort a list"
# With local overrides
OLLAMA_API_BASE=http://localhost:11434/v1 ml-intern \
--model ollama/llama3.1 \
--yolo \
"Create a FastAPI app with Redis caching"
Hardware Requirements by Backend
| Backend | Min GPU | Recommended GPU | RAM | Notes |
|---|---|---|---|---|
| Ollama (7B) | None (CPU) | 8GB VRAM | 16GB | Best ease-of-use |
| Ollama (70B) | 48GB VRAM | 80GB (A100) | 128GB | Q4 quantization helps |
| LM Studio | None (CPU) | 8GB+ VRAM | 16GB | Great GUI for exploration |
| vLLM (7B) | 16GB VRAM | 24GB (3090/A10G) | 32GB | Highest throughput |
| vLLM (70B) | 80GB VRAM | 2x A100 | 256GB | tensor_parallel required |
| llama.cpp | None (CPU) | Any | 8GB | Best for CPU-only |
| MLX (Mac) | Apple Silicon | M3 Max 36GB | 32GB | Native Apple GPU |
| NVIDIA NIM | 24GB+ | A100/H100 | 64GB | Enterprise support |
Troubleshooting
"Connection refused" to local LLM
Docker containers can't reach localhost on the host. Use one of these instead (a quick connectivity check is sketched below):
- Mac/Windows: host.docker.internal (already in the default .env)
- Linux: your machine's LAN IP, e.g., 192.168.1.5
- All platforms: put the LLM server in Docker Compose too
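The connectivity check mentioned above, as a small Python sketch. Run it inside the API container (for example via docker-compose exec) so you see the container's view of the network; OLLAMA_API_BASE is read from the environment as set in .env, and /models is the standard OpenAI-compatible listing route:

```python
# reachability_check.py - verify the configured LLM API base is reachable
# from the container's network namespace.
import os
import requests

base = os.environ.get("OLLAMA_API_BASE", "http://host.docker.internal:11434/v1")
try:
    resp = requests.get(f"{base}/models", timeout=5)
    print(f"Reached {base}: HTTP {resp.status_code}")
except requests.RequestException as err:
    print(f"Could not reach {base}: {err}")
    print("On Linux, replace host.docker.internal with your LAN IP.")
```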
Ollama in Docker Compose
Add to docker-compose.yml:
ollama:
  image: ollama/ollama
  volumes:
    - ollama:/root/.ollama
  ports:
    - "11434:11434"

# Also declare the named volume at the top level of docker-compose.yml:
volumes:
  ollama:
Then set OLLAMA_API_BASE=http://ollama:11434/v1 (internal Docker DNS).
"Rate limit exceeded" immediately
The default RPM is 40. For local models with no actual limit, increase it:
DEFAULT_RPM_LIMIT=1000
PostgreSQL connection failed
# Check if Postgres is running
docker ps | grep postgres
# Check logs
docker logs ml-intern-postgres-1
# Reset database
docker-compose down -v # WARNING: deletes all data
docker-compose up -d postgres
Grafana shows "No data"
Prometheus needs time to scrape. Wait 30 seconds, or check:
curl http://localhost:9090/api/v1/targets
Slow first response
Local models load into VRAM/RAM on first request. Subsequent requests are fast. Use Redis caching (enabled by default) to skip LLM calls for repeated prompts.
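To picture what that caching does, here is a simplified sketch of caching a completion in Redis keyed by a hash of the request. It is illustrative only, not the production server's actual cache code; the key format and TTL are examples:

```python
# Simplified illustration of response caching by prompt hash.
import hashlib
import json

import redis
import requests

r = redis.Redis(host="localhost", port=6379)

def cached_chat(model: str, messages: list[dict], ttl: int = 3600) -> str:
    payload = {"model": model, "messages": messages, "stream": False}
    key = "chat:" + hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()

    hit = r.get(key)
    if hit is not None:
        return hit.decode()  # served from cache, no LLM call

    resp = requests.post("http://localhost/v1/chat/completions",
                         json=payload, timeout=120)
    answer = resp.json()["choices"][0]["message"]["content"]
    r.setex(key, ttl, answer)  # cache for repeated prompts
    return answer
```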
File Structure (Local Copy)
ml-intern/
├── production/
│   ├── docker-compose.yml       # Full stack
│   ├── Dockerfile.prod          # API + worker image
│   ├── production_server.py     # FastAPI app
│   ├── worker.py                # Background tasks
│   ├── init.sql                 # DB schema
│   ├── nginx.conf               # Load balancer config
│   ├── prometheus.yml           # Metrics collection
│   ├── requirements.prod.txt    # Python deps
│   ├── .env.example             # Configuration template
│   ├── grafana/                 # Dashboards
│   ├── k8s/                     # Kubernetes manifests
│   ├── helm/                    # Helm charts
│   └── tests/                   # Integration + load tests
└── agent/                       # Original ml-intern agent code
Next Steps
- Load test your setup: locust -f production/tests/load_test.py --host http://localhost
- Add cloud fallback: set GROQ_API_KEY or OPENAI_API_KEY for when the local model is overloaded
- Monitor costs: even local models use electricity; Grafana tracks request volume
- Scale horizontally: docker-compose up -d --scale api=4
No Internet Required
Once models are downloaded and Docker images are cached, the entire stack runs offline:
- Local LLM (Ollama, LM Studio, etc.): no network needed
- Redis, PostgreSQL, Nginx: local containers
- Prometheus + Grafana: local containers
- The only outbound calls are to the LLM API on localhost
Perfect for air-gapped environments or private data processing.