# Local Deployment Guide – No Hugging Face Required
|
|
| Run the entire ml-intern production system **locally** on your machine using Docker Compose or native Python. No HF account, no cloud APIs needed (though you can add them). |
|
|
| ## Prerequisites |
|
|
| - **Docker + Docker Compose** (recommended) OR **Python 3.11+** |
| - **8GB RAM minimum** (16GB+ recommended) |
| - **Local LLM backend** (pick one): |
  - [Ollama](https://ollama.com) – easiest
  - [LM Studio](https://lmstudio.ai) – GUI, great for Mac/Windows
  - [llama.cpp](https://github.com/ggerganov/llama.cpp) – most control
  - [vLLM](https://github.com/vllm-project/vllm) – highest throughput
  - [NVIDIA NIM](https://developer.nvidia.com/nim) – enterprise GPUs
  - [MLX](https://github.com/ml-explore/mlx) – Apple Silicon optimized
|
|
| --- |
|
|
## Option 1: Docker Compose (Fastest – 2 Minutes)
|
|
| ### Step 1: Start a Local LLM Server |
|
|
**Option A – Ollama (Recommended)**
|
|
| ```bash |
| # Install Ollama (one-liner) |
| curl -fsSL https://ollama.com/install.sh | sh |
| |
| # Pull a model |
| ollama pull llama3.1 |
| |
| # Start server (runs on :11434, OpenAI-compatible on :11434/v1) |
| ollama serve |
| ``` |
|
|
**Option B – LM Studio**
|
|
| 1. Download LM Studio from https://lmstudio.ai |
| 2. Load any GGUF model |
3. Start **Local Inference Server** – it runs on `http://localhost:1234/v1`
|
|
**Option C – llama.cpp Server**
|
|
| ```bash |
| # Build |
| git clone https://github.com/ggerganov/llama.cpp |
| cd llama.cpp && make |
| |
| # Download a GGUF model |
| wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf |
| |
| # Start server (OpenAI-compatible API on :8080/v1) |
./server -m llama-2-7b.Q4_K_M.gguf --port 8080   # the binary is named llama-server in newer builds
| ``` |
|
|
| ### Step 2: Clone & Configure |
|
|
| ```bash |
| git clone https://github.com/raazkumar/ml-intern-local-fork.git |
| cd ml-intern-local-fork/production |
| |
| # Copy environment template |
| cp .env.example .env |
| ``` |
|
|
Edit `.env` – **only change these lines**:
|
|
| ```env |
| # Point to your local LLM server |
| OLLAMA_API_BASE=http://host.docker.internal:11434/v1 |
| # (or for LM Studio: http://host.docker.internal:1234/v1) |
| # (or for llama.cpp: http://host.docker.internal:8080/v1) |
| |
| # No cloud API keys needed for local-only mode |
| # Leave these blank or comment them out: |
| # HF_TOKEN= |
| # ANTHROPIC_API_KEY= |
| # OPENAI_API_KEY= |
| # GROQ_API_KEY= |
| # NVIDIA_API_KEY= |
| ``` |
|
|
| > **Docker host networking note**: On Linux, `host.docker.internal` may not work. Use your machine's LAN IP (e.g., `192.168.1.5`) instead. On Mac/Windows, `host.docker.internal` works out of the box. |
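
If you're on Linux and need that LAN IP for the setting above, a quick way to print it:

```bash
# Print the machine's primary address (first entry reported by hostname -I)
hostname -I | awk '{print $1}'
```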
|
|
| ### Step 3: Launch the Stack |
|
|
| ```bash |
| docker-compose up -d |
| ``` |
|
|
| This starts: |
| - **API server** (FastAPI) on http://localhost:8000 |
| - **Background workers** (cleanup, budget alerts) |
| - **Redis** (caching + rate limiting) on :6379 |
| - **PostgreSQL** (audit log + sessions) on :5432 |
| - **Nginx** (load balancer) on :80 |
| - **Prometheus** (metrics) on :9090 |
| - **Grafana** (dashboards) on :3000 |
| - **Jaeger** (tracing) on :16686 |
| - **pgAdmin** (DB GUI) on :5050 |
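
Before moving on, check that every container actually came up:

```bash
# All services should report "Up" (or "healthy" where a healthcheck is defined)
docker-compose ps

# Follow the whole stack's logs while it finishes booting (Ctrl-C to stop)
docker-compose logs -f
```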
|
|
| ### Step 4: Verify |
|
|
| ```bash |
| # Health check |
| curl http://localhost/health | jq |
| |
| # List available models (includes your local ones) |
| curl http://localhost/v1/models | jq |
| |
| # Chat with your local model |
| curl -X POST http://localhost/v1/chat/completions \ |
| -H "Content-Type: application/json" \ |
| -d '{ |
| "model": "ollama/llama3.1", |
| "messages": [{"role":"user","content":"Hello from local deployment!"}], |
| "stream": false |
| }' |
| ``` |
|
|
| ### Step 5: View Dashboards |
|
|
| | Service | URL | Default Login | |
| |---------|-----|---------------| |
| API | http://localhost:8000 | – |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | – |
| Jaeger UI | http://localhost:16686 | – |
| | pgAdmin | http://localhost:5050 | admin@mlintern.local / admin | |
|
|
| --- |
|
|
| ## Option 2: Pure Python (No Docker) |
|
|
| For development or lightweight setups. |
|
|
| ### Step 1: Install Dependencies |
|
|
| ```bash |
| # Python 3.11+ required |
| python -m venv .venv |
| source .venv/bin/activate # Windows: .venv\Scripts\activate |
| |
| pip install -r production/requirements.prod.txt |
| ``` |
|
|
| ### Step 2: Start PostgreSQL + Redis |
|
|
| You need these running locally. Options: |
|
|
| **A) System packages:** |
| ```bash |
| # Ubuntu/Debian |
sudo apt install postgresql redis-server
sudo systemctl start postgresql redis-server

# macOS
brew install postgresql redis
brew services start postgresql
brew services start redis
| ``` |
|
|
| **B) Docker (just the infra):** |
| ```bash |
| docker run -d --name redis -p 6379:6379 redis:7-alpine |
| docker run -d --name postgres \ |
| -e POSTGRES_PASSWORD=ml_intern \ |
| -e POSTGRES_DB=ml_intern \ |
| -p 5432:5432 postgres:16-alpine |
| ``` |
|
|
| ### Step 3: Initialize Database |
|
|
| ```bash |
# (the password is ml_intern if you used the Docker command above)
psql -U postgres -h localhost -d ml_intern -f production/init.sql
| ``` |
|
|
| ### Step 4: Configure Environment |
|
|
| ```bash |
| export REDIS_URL=redis://localhost:6379 |
| export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern |
| export PORT=8000 |
| export WORKERS=1 |
| export LOG_LEVEL=INFO |
| |
| # Point to your local LLM |
| export OLLAMA_API_BASE=http://localhost:11434/v1 |
| ``` |
|
|
| ### Step 5: Start the Server |
|
|
| ```bash |
| cd production |
| python -m production_server |
| ``` |
|
|
| Server runs on http://localhost:8000 |
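
With no Nginx in front in this setup, hit the API port directly to verify it's serving:

```bash
# Same health endpoint as the Docker stack, just on port 8000
curl http://localhost:8000/health | jq
```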
|
|
| ### Step 6: Start Worker (in another terminal) |
|
|
| ```bash |
| source .venv/bin/activate |
| cd production |
| python -m worker |
| ``` |
|
|
| --- |
|
|
| ## Connecting Different Local Backends |
|
|
| | Backend | Start Command | API Base | Model Prefix | Example Model String | |
| |---------|--------------|----------|-------------|---------------------| |
| | **Ollama** | `ollama serve` | `http://localhost:11434/v1` | `ollama/` | `ollama/llama3.1` | |
| | **LM Studio** | Start server in GUI | `http://localhost:1234/v1` | `lmstudio/` | `lmstudio/llama-3-8b` | |
| | **llama.cpp** | `./server -m model.gguf` | `http://localhost:8080/v1` | `llamacpp/` | `llamacpp/llama-2-7b` | |
| | **vLLM** | `python -m vllm.entrypoints.openai.api_server` | `http://localhost:8000/v1` | `vllm/` | `vllm/llama-3-8b` | |
| | **MLX** | `python -m mlx_lm.server` | `http://localhost:8000/v1` | `mlx/` | `mlx/llama-3-8b` | |
| | **NVIDIA NIM** | `docker run nvcr.io/...` | `http://localhost:8000/v1` | `nim/` | `nim/llama-3.1-8b` | |
| | **TGI** | `docker run ghcr.io/...tgi` | `http://localhost:8080/v1` | `tgi/` | `tgi/llama-3-8b` | |
| | **Custom PyTorch** | Your own server | `http://localhost:8000/v1` | `local/` | `local/my-model` | |
|
|
| ### Override API Base (if not default port) |
|
|
If a backend isn't on its default host or port (for example, vLLM moved off `:8000` so it doesn't collide with the API server), override its base URL in `.env`:
| ```env |
| OLLAMA_API_BASE=http://192.168.1.100:11434/v1 |
| LMSTUDIO_API_BASE=http://lmstudio.local:1234/v1 |
| VLLM_API_BASE=http://vllm-server.internal:8000/v1 |
| ``` |
|
|
| --- |
|
|
| ## Multi-Backend Setup (Recommended) |
|
|
| Run **multiple local backends** and let ml-intern round-robin or fail over: |
|
|
| ```bash |
| # Terminal 1: Ollama for fast models |
| ollama pull llama3.1 |
| ollama serve |
| |
| # Terminal 2: vLLM for high-throughput |
| python -m vllm.entrypoints.openai.api_server \ |
| --model meta-llama/Llama-3.1-70B-Instruct \ |
| --tensor-parallel-size 2 \ |
| --port 8001 |
| ``` |
|
|
| In `.env`: |
| ```env |
| OLLAMA_API_BASE=http://localhost:11434/v1 |
| VLLM_API_BASE=http://localhost:8001/v1 |
| ``` |
|
|
| Now you can use either: |
| ```bash |
curl http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Quick question"}]
  }'

curl http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/llama-3.1-70b",
    "messages": [{"role":"user","content":"Complex reasoning"}]
  }'
| ``` |
|
|
| --- |
|
|
| ## CLI Mode (No Server) |
|
|
| If you want to use ml-intern as a CLI tool with local models (the original use case): |
|
|
| ```bash |
| # Install the agent CLI |
| pip install -e . |
| |
| # Run with local model |
| ml-intern --model ollama/llama3.1 "Write a Python function to sort a list" |
| |
| # With local overrides |
| OLLAMA_API_BASE=http://localhost:11434/v1 ml-intern \ |
| --model ollama/llama3.1 \ |
| --yolo \ |
| "Create a FastAPI app with Redis caching" |
| ``` |
|
|
| --- |
|
|
| ## Hardware Requirements by Backend |
|
|
| | Backend | Min GPU | Recommended GPU | RAM | Notes | |
| |---------|---------|----------------|-----|-------| |
| | Ollama (7B) | None (CPU) | 8GB VRAM | 16GB | Best ease-of-use | |
| | Ollama (70B) | 48GB VRAM | 80GB (A100) | 128GB | Q4 quantization helps | |
| | LM Studio | None (CPU) | 8GB+ VRAM | 16GB | Great GUI for exploration | |
| | vLLM (7B) | 16GB VRAM | 24GB (3090/A10G) | 32GB | Highest throughput | |
| | vLLM (70B) | 80GB VRAM | 2x A100 | 256GB | tensor_parallel required | |
| | llama.cpp | None (CPU) | Any | 8GB | Best for CPU-only | |
| | MLX (Mac) | Apple Silicon | M3 Max 36GB | 32GB | Native Apple GPU | |
| | NVIDIA NIM | 24GB+ | A100/H100 | 64GB | Enterprise support | |
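
Not sure where your machine falls? Two quick checks, one for NVIDIA GPUs and one for Apple Silicon:

```bash
# NVIDIA: GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

# Apple Silicon: total unified memory in bytes
sysctl -n hw.memsize
```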
| |
| --- |
| |
| ## Troubleshooting |
| |
| ### "Connection refused" to local LLM |
| |
Inside a container, `localhost` refers to the container itself, not your machine, so the containers can't reach an LLM server running on the host that way. Use:
- **Mac/Windows**: `host.docker.internal` (already in the default `.env`)
- **Linux**: your machine's LAN IP (e.g., `192.168.1.5`), or map `host.docker.internal` to the host gateway via `extra_hosts: ["host.docker.internal:host-gateway"]` on the compose service
- **All platforms**: run the LLM server inside Docker Compose as well (see the next subsection)
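
To test reachability from inside a container without touching the stack itself, run a throwaway curl container; the `--add-host` flag shown here is also what makes `host.docker.internal` resolve on Linux (Docker 20.10+):

```bash
# Should return the backend's model list if the host LLM is reachable from Docker
docker run --rm --add-host=host.docker.internal:host-gateway \
  curlimages/curl -s http://host.docker.internal:11434/v1/models
```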
| |
| ### Ollama in Docker Compose |
| |
| Add to `docker-compose.yml`: |
| ```yaml |
| ollama: |
| image: ollama/ollama |
| volumes: |
| - ollama:/root/.ollama |
| ports: |
| - "11434:11434" |
| ``` |
Declare the named `ollama` volume under the compose file's top-level `volumes:` key, then set `OLLAMA_API_BASE=http://ollama:11434/v1` (the service name resolves via Docker's internal DNS).
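
The Ollama container starts with no models, so pull one into its volume once the stack is up:

```bash
# Pull a model inside the Ollama container (persists in the named volume)
docker-compose exec ollama ollama pull llama3.1
```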
| |
| ### "Rate limit exceeded" immediately |
| |
| The default RPM is 40. For local models with no actual limit, increase it: |
| ```env |
| DEFAULT_RPM_LIMIT=1000 |
| ``` |
| |
| ### PostgreSQL connection failed |
| |
| ```bash |
| # Check if Postgres is running |
| docker ps | grep postgres |
| |
| # Check logs |
| docker logs ml-intern-postgres-1 |
| |
| # Reset database |
| docker-compose down -v # WARNING: deletes all data |
| docker-compose up -d postgres |
| ``` |
| |
| ### Grafana shows "No data" |
| |
| Prometheus needs time to scrape. Wait 30 seconds, or check: |
| ```bash |
| curl http://localhost:9090/api/v1/targets |
| ``` |
| |
| ### Slow first response |
| |
| Local models load into VRAM/RAM on first request. Subsequent requests are fast. Use Redis caching (enabled by default) to skip LLM calls for repeated prompts. |
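
One way to hide that load time is to pre-warm the backend right after it starts; for Ollama, for example, a tiny throwaway request forces the model into memory:

```bash
# Fire a small request so the model is resident before real traffic arrives
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "warmup"}]}' > /dev/null
```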
| |
| --- |
| |
| ## File Structure (Local Copy) |
| |
| ``` |
ml-intern/
├── production/
│   ├── docker-compose.yml        # Full stack
│   ├── Dockerfile.prod           # API + worker image
│   ├── production_server.py      # FastAPI app
│   ├── worker.py                 # Background tasks
│   ├── init.sql                  # DB schema
│   ├── nginx.conf                # Load balancer config
│   ├── prometheus.yml            # Metrics collection
│   ├── requirements.prod.txt     # Python deps
│   ├── .env.example              # Configuration template
│   ├── grafana/                  # Dashboards
│   ├── k8s/                      # Kubernetes manifests
│   ├── helm/                     # Helm charts
│   └── tests/                    # Integration + load tests
└── agent/                        # Original ml-intern agent code
| ``` |
| |
| --- |
| |
| ## Next Steps |
| |
| 1. **Load test your setup**: `locust -f production/tests/load_test.py --host http://localhost` |
2. **Add cloud fallback**: Set `GROQ_API_KEY` or `OPENAI_API_KEY` for when the local model is overloaded
3. **Monitor costs**: Even local models use electricity – Grafana tracks request volume
| 4. **Scale horizontally**: `docker-compose up -d --scale api=4` |
| |
| --- |
| |
| ## No Internet Required |
| |
| Once models are downloaded and Docker images are cached, the entire stack runs **offline**: |
- Local LLM (Ollama, LM Studio, etc.) – no network
- Redis, PostgreSQL, Nginx – local containers
- Prometheus + Grafana – local containers
- The only outbound calls go to the LLM API on localhost, so nothing leaves the machine
| |
| Perfect for air-gapped environments or private data processing. |
| |