# Local Deployment Guide — No Hugging Face Required

Run the entire ml-intern production system **locally** on your machine using Docker Compose or native Python. No HF account, no cloud APIs needed (though you can add them).

## Prerequisites

- **Docker + Docker Compose** (recommended) OR **Python 3.11+**
- **8GB RAM minimum** (16GB+ recommended)
- **Local LLM backend** (pick one):
  - [Ollama](https://ollama.com) — easiest
  - [LM Studio](https://lmstudio.ai) — GUI, great for Mac/Windows
  - [llama.cpp](https://github.com/ggerganov/llama.cpp) — most control
  - [vLLM](https://github.com/vllm-project/vllm) — highest throughput
  - [NVIDIA NIM](https://developer.nvidia.com/nim) — enterprise GPUs
  - [MLX](https://github.com/ml-explore/mlx) — Apple Silicon optimized

---

## Option 1: Docker Compose (Fastest — 2 Minutes)

### Step 1: Start a Local LLM Server

**Option A — Ollama (Recommended)**

```bash
# Install Ollama (one-liner)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1

# Start server (runs on :11434, OpenAI-compatible on :11434/v1)
ollama serve
```

**Option B — LM Studio**

1. Download LM Studio from https://lmstudio.ai
2. Load any GGUF model
3. Start **Local Inference Server** → it runs on `http://localhost:1234/v1`

**Option C — llama.cpp Server**

```bash
# Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

# Start server (OpenAI-compatible API on :8080/v1)
./server -m llama-2-7b.Q4_K_M.gguf --port 8080
```

### Step 2: Clone & Configure

```bash
git clone https://github.com/raazkumar/ml-intern-local-fork.git
cd ml-intern-local-fork/production

# Copy environment template
cp .env.example .env
```

Edit `.env` — **only change these lines**:

```env
# Point to your local LLM server
OLLAMA_API_BASE=http://host.docker.internal:11434/v1
# (or for LM Studio: http://host.docker.internal:1234/v1)
# (or for llama.cpp: http://host.docker.internal:8080/v1)

# No cloud API keys needed for local-only mode
# Leave these blank or comment them out:
# HF_TOKEN=
# ANTHROPIC_API_KEY=
# OPENAI_API_KEY=
# GROQ_API_KEY=
# NVIDIA_API_KEY=
```

> **Docker host networking note**: On Linux, `host.docker.internal` may not work. Use your machine's LAN IP (e.g., `192.168.1.5`) instead. On Mac/Windows, `host.docker.internal` works out of the box.
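Before launching the stack, it is worth confirming that the local server actually answers on its OpenAI-compatible endpoint. A quick sanity check, assuming the Ollama defaults from Step 1 (adjust the port for LM Studio or llama.cpp):

```bash
# Should return a JSON list that includes the model you pulled (e.g. llama3.1)
curl http://localhost:11434/v1/models

# LM Studio / llama.cpp equivalents (use whichever server you started)
# curl http://localhost:1234/v1/models
# curl http://localhost:8080/v1/models
```

If this call fails, fix the LLM server first; nothing in the stack below will work without it.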
### Step 3: Launch the Stack

```bash
docker-compose up -d
```

This starts:

- **API server** (FastAPI) on http://localhost:8000
- **Background workers** (cleanup, budget alerts)
- **Redis** (caching + rate limiting) on :6379
- **PostgreSQL** (audit log + sessions) on :5432
- **Nginx** (load balancer) on :80
- **Prometheus** (metrics) on :9090
- **Grafana** (dashboards) on :3000
- **Jaeger** (tracing) on :16686
- **pgAdmin** (DB GUI) on :5050

### Step 4: Verify

```bash
# Health check
curl http://localhost/health | jq

# List available models (includes your local ones)
curl http://localhost/v1/models | jq

# Chat with your local model
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Hello from local deployment!"}],
    "stream": false
  }'
```

### Step 5: View Dashboards

| Service | URL | Default Login |
|---------|-----|---------------|
| API | http://localhost:8000 | — |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | — |
| Jaeger UI | http://localhost:16686 | — |
| pgAdmin | http://localhost:5050 | admin@mlintern.local / admin |

---

## Option 2: Pure Python (No Docker)

For development or lightweight setups.

### Step 1: Install Dependencies

```bash
# Python 3.11+ required
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

pip install -r production/requirements.prod.txt
```

### Step 2: Start PostgreSQL + Redis

You need these running locally. Options:

**A) System packages:**

```bash
# Ubuntu/Debian
sudo apt install postgresql redis
sudo systemctl start postgresql redis

# macOS
brew install postgresql redis
brew services start postgresql redis
```

**B) Docker (just the infra):**

```bash
docker run -d --name redis -p 6379:6379 redis:7-alpine
docker run -d --name postgres \
  -e POSTGRES_PASSWORD=ml_intern \
  -e POSTGRES_DB=ml_intern \
  -p 5432:5432 postgres:16-alpine
```

### Step 3: Initialize Database

```bash
psql -U postgres -h localhost -d ml_intern -f production/init.sql
```

### Step 4: Configure Environment

```bash
export REDIS_URL=redis://localhost:6379
export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern
export PORT=8000
export WORKERS=1
export LOG_LEVEL=INFO

# Point to your local LLM
export OLLAMA_API_BASE=http://localhost:11434/v1
```

### Step 5: Start the Server

```bash
cd production
python -m production_server
```

Server runs on http://localhost:8000

### Step 6: Start Worker (in another terminal)

```bash
source .venv/bin/activate
cd production
python -m worker
```

---

## Connecting Different Local Backends

| Backend | Start Command | API Base | Model Prefix | Example Model String |
|---------|--------------|----------|--------------|----------------------|
| **Ollama** | `ollama serve` | `http://localhost:11434/v1` | `ollama/` | `ollama/llama3.1` |
| **LM Studio** | Start server in GUI | `http://localhost:1234/v1` | `lmstudio/` | `lmstudio/llama-3-8b` |
| **llama.cpp** | `./server -m model.gguf` | `http://localhost:8080/v1` | `llamacpp/` | `llamacpp/llama-2-7b` |
| **vLLM** | `python -m vllm.entrypoints.openai.api_server` | `http://localhost:8000/v1` | `vllm/` | `vllm/llama-3-8b` |
| **MLX** | `python -m mlx_lm.server` | `http://localhost:8000/v1` | `mlx/` | `mlx/llama-3-8b` |
| **NVIDIA NIM** | `docker run nvcr.io/...` | `http://localhost:8000/v1` | `nim/` | `nim/llama-3.1-8b` |
| **TGI** | `docker run ghcr.io/...tgi` | `http://localhost:8080/v1` | `tgi/` | `tgi/llama-3-8b` |
| **Custom PyTorch** | Your own server | `http://localhost:8000/v1` | `local/` | `local/my-model` |
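Whichever backend you run, requests to the gateway look the same: the prefix on the `model` string picks the backend, and the rest is the model name that backend serves. A sketch of a request to an LM Studio server, using the example model string from the table above (the exact name depends on whatever GGUF you loaded) and assuming the gateway passes the OpenAI-style `stream` flag through:

```bash
# Same /v1/chat/completions endpoint, different prefix.
# -N disables curl's buffering so streamed chunks print as they arrive.
curl -N -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lmstudio/llama-3-8b",
    "messages": [{"role":"user","content":"Stream a two-line haiku"}],
    "stream": true
  }'
```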
### Override API Base (if not default port)

In `.env`:

```env
OLLAMA_API_BASE=http://192.168.1.100:11434/v1
LMSTUDIO_API_BASE=http://lmstudio.local:1234/v1
VLLM_API_BASE=http://vllm-server.internal:8000/v1
```

---

## Multi-Backend Setup (Recommended)

Run **multiple local backends** and let ml-intern round-robin or fail over:

```bash
# Terminal 1: Ollama for fast models
ollama pull llama3.1
ollama serve

# Terminal 2: vLLM for high throughput
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8001
```

In `.env`:

```env
OLLAMA_API_BASE=http://localhost:11434/v1
VLLM_API_BASE=http://localhost:8001/v1
```

Now you can use either:

```bash
curl http://localhost/v1/chat/completions -d '{
  "model": "ollama/llama3.1",
  "messages": [{"role":"user","content":"Quick question"}]
}'

curl http://localhost/v1/chat/completions -d '{
  "model": "vllm/llama-3.1-70b",
  "messages": [{"role":"user","content":"Complex reasoning"}]
}'
```

---

## CLI Mode (No Server)

If you want to use ml-intern as a CLI tool with local models (the original use case):

```bash
# Install the agent CLI
pip install -e .

# Run with a local model
ml-intern --model ollama/llama3.1 "Write a Python function to sort a list"

# With local overrides
OLLAMA_API_BASE=http://localhost:11434/v1 ml-intern \
  --model ollama/llama3.1 \
  --yolo \
  "Create a FastAPI app with Redis caching"
```

---

## Hardware Requirements by Backend

| Backend | Min GPU | Recommended GPU | RAM | Notes |
|---------|---------|-----------------|-----|-------|
| Ollama (7B) | None (CPU) | 8GB VRAM | 16GB | Best ease-of-use |
| Ollama (70B) | 48GB VRAM | 80GB (A100) | 128GB | Q4 quantization helps |
| LM Studio | None (CPU) | 8GB+ VRAM | 16GB | Great GUI for exploration |
| vLLM (7B) | 16GB VRAM | 24GB (3090/A10G) | 32GB | Highest throughput |
| vLLM (70B) | 80GB VRAM | 2x A100 | 256GB | tensor_parallel required |
| llama.cpp | None (CPU) | Any | 8GB | Best for CPU-only |
| MLX (Mac) | Apple Silicon | M3 Max 36GB | 32GB | Native Apple GPU |
| NVIDIA NIM | 24GB+ | A100/H100 | 64GB | Enterprise support |

---

## Troubleshooting

### "Connection refused" to local LLM

Docker containers can't reach `localhost` on the host. Use:

- **Mac/Windows**: `host.docker.internal` (already in default `.env`)
- **Linux**: Your machine's LAN IP, e.g., `192.168.1.5`
- **All platforms**: Put the LLM server in Docker Compose too

### Ollama in Docker Compose

Add to `docker-compose.yml`:

```yaml
ollama:
  image: ollama/ollama
  volumes:
    - ollama:/root/.ollama
  ports:
    - "11434:11434"
```

Declare the `ollama` named volume under the file's top-level `volumes:` key if it isn't already defined, then set `OLLAMA_API_BASE=http://ollama:11434/v1` (internal Docker DNS).

### "Rate limit exceeded" immediately

The default RPM is 40. For local models with no actual limit, increase it:

```env
DEFAULT_RPM_LIMIT=1000
```

### PostgreSQL connection failed

```bash
# Check if Postgres is running
docker ps | grep postgres

# Check logs
docker logs ml-intern-postgres-1

# Reset database
docker-compose down -v   # WARNING: deletes all data
docker-compose up -d postgres
```

### Grafana shows "No data"

Prometheus needs time to scrape. Wait 30 seconds, or check:

```bash
curl http://localhost:9090/api/v1/targets
```

### Slow first response

Local models load into VRAM/RAM on the first request. Subsequent requests are fast. Use Redis caching (enabled by default) to skip LLM calls for repeated prompts.
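One way to hide that cold start is to warm the model right after the stack comes up. A minimal sketch, assuming the Ollama backend configured earlier; any cheap prompt works, since the point is just to force the model load (and seed the cache):

```bash
# Fire one throwaway request so the model is already loaded before real traffic
curl -s -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"ollama/llama3.1","messages":[{"role":"user","content":"ping"}]}' \
  > /dev/null
```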
---

## File Structure (Local Copy)

```
ml-intern/
├── production/
│   ├── docker-compose.yml      # Full stack
│   ├── Dockerfile.prod         # API + worker image
│   ├── production_server.py    # FastAPI app
│   ├── worker.py               # Background tasks
│   ├── init.sql                # DB schema
│   ├── nginx.conf              # Load balancer config
│   ├── prometheus.yml          # Metrics collection
│   ├── requirements.prod.txt   # Python deps
│   ├── .env.example            # Configuration template
│   ├── grafana/                # Dashboards
│   ├── k8s/                    # Kubernetes manifests
│   ├── helm/                   # Helm charts
│   └── tests/                  # Integration + load tests
└── agent/                      # Original ml-intern agent code
```

---

## Next Steps

1. **Load test your setup**: `locust -f production/tests/load_test.py --host http://localhost`
2. **Add cloud fallback**: Set `GROQ_API_KEY` or `OPENAI_API_KEY` for when the local model is overloaded
3. **Monitor costs**: Even local models use electricity — Grafana tracks request volume
4. **Scale horizontally**: `docker-compose up -d --scale api=4`

---

## No Internet Required

Once models are downloaded and Docker images are cached, the entire stack runs **offline**:

- Local LLM (Ollama, LM Studio, etc.) — no network
- Redis, PostgreSQL, Nginx — local containers
- Prometheus + Grafana — local containers
- The only outbound calls are to the LLM API on localhost

Perfect for air-gapped environments or private data processing.