raazkumar committed 03cc10d (verified) · Parent(s): 3910229

Upload production/LOCAL_DEPLOYMENT.md

Files changed (1): production/LOCAL_DEPLOYMENT.md (new file, +401 lines)

# Local Deployment Guide — No Hugging Face Required

Run the entire ml-intern production system **locally** on your machine using Docker Compose or native Python. No HF account, no cloud APIs needed (though you can add them).

## Prerequisites

- **Docker + Docker Compose** (recommended) OR **Python 3.11+**
- **8GB RAM minimum** (16GB+ recommended)
- **Local LLM backend** (pick one):
  - [Ollama](https://ollama.com) — easiest
  - [LM Studio](https://lmstudio.ai) — GUI, great for Mac/Windows
  - [llama.cpp](https://github.com/ggerganov/llama.cpp) — most control
  - [vLLM](https://github.com/vllm-project/vllm) — highest throughput
  - [NVIDIA NIM](https://developer.nvidia.com/nim) — enterprise GPUs
  - [MLX](https://github.com/ml-explore/mlx) — Apple Silicon optimized

---

## Option 1: Docker Compose (Fastest — 2 Minutes)

### Step 1: Start a Local LLM Server

**Option A — Ollama (Recommended)**

```bash
# Install Ollama (one-liner)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1

# Start server (runs on :11434, OpenAI-compatible on :11434/v1)
ollama serve
```

**Option B — LM Studio**

1. Download LM Studio from https://lmstudio.ai
2. Load any GGUF model
3. Start the **Local Inference Server** → it runs on `http://localhost:1234/v1`

**Option C — llama.cpp Server**

```bash
# Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

# Start server (OpenAI-compatible API on :8080/v1)
./server -m llama-2-7b.Q4_K_M.gguf --port 8080
```

### Step 2: Clone & Configure

```bash
git clone https://github.com/raazkumar/ml-intern-local-fork.git
cd ml-intern-local-fork/production

# Copy environment template
cp .env.example .env
```

Edit `.env` — **only change these lines**:

```env
# Point to your local LLM server
OLLAMA_API_BASE=http://host.docker.internal:11434/v1
# (or for LM Studio: http://host.docker.internal:1234/v1)
# (or for llama.cpp: http://host.docker.internal:8080/v1)

# No cloud API keys needed for local-only mode
# Leave these blank or comment them out:
# HF_TOKEN=
# ANTHROPIC_API_KEY=
# OPENAI_API_KEY=
# GROQ_API_KEY=
# NVIDIA_API_KEY=
```

> **Docker host networking note**: On Linux, `host.docker.internal` may not work. Use your machine's LAN IP (e.g., `192.168.1.5`) instead. On Mac/Windows, `host.docker.internal` works out of the box.

### Step 3: Launch the Stack

```bash
docker-compose up -d
```

This starts:
- **API server** (FastAPI) on http://localhost:8000
- **Background workers** (cleanup, budget alerts)
- **Redis** (caching + rate limiting) on :6379
- **PostgreSQL** (audit log + sessions) on :5432
- **Nginx** (load balancer) on :80
- **Prometheus** (metrics) on :9090
- **Grafana** (dashboards) on :3000
- **Jaeger** (tracing) on :16686
- **pgAdmin** (DB GUI) on :5050

### Step 4: Verify

```bash
# Health check
curl http://localhost/health | jq

# List available models (includes your local ones)
curl http://localhost/v1/models | jq

# Chat with your local model
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Hello from local deployment!"}],
    "stream": false
  }'
```
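
The same calls work from Python. Here is a minimal sketch using the standard `openai` client package, on the assumption (supported by the curl examples above) that the gateway is OpenAI-compatible; the `api_key` value is arbitrary since no cloud key is configured:

```python
from openai import OpenAI  # pip install openai

# Point the client at the Nginx front door from Step 3; any non-empty key works locally.
client = OpenAI(base_url="http://localhost/v1", api_key="local-only")

response = client.chat.completions.create(
    model="ollama/llama3.1",
    messages=[{"role": "user", "content": "Hello from local deployment!"}],
)
print(response.choices[0].message.content)
```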

### Step 5: View Dashboards

| Service | URL | Default Login |
|---------|-----|---------------|
| API | http://localhost:8000 | — |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | — |
| Jaeger UI | http://localhost:16686 | — |
| pgAdmin | http://localhost:5050 | admin@mlintern.local / admin |

---

## Option 2: Pure Python (No Docker)

For development or lightweight setups.

### Step 1: Install Dependencies

```bash
# Python 3.11+ required
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

pip install -r production/requirements.prod.txt
```

### Step 2: Start PostgreSQL + Redis

You need these running locally. Options:

**A) System packages:**
```bash
# Ubuntu/Debian
sudo apt install postgresql redis
sudo systemctl start postgresql redis-server

# macOS
brew install postgresql redis
brew services start postgresql
brew services start redis
```

**B) Docker (just the infra):**
```bash
docker run -d --name redis -p 6379:6379 redis:7-alpine
docker run -d --name postgres \
  -e POSTGRES_PASSWORD=ml_intern \
  -e POSTGRES_DB=ml_intern \
  -p 5432:5432 postgres:16-alpine
```

### Step 3: Initialize Database

```bash
psql -U postgres -h localhost -d ml_intern -f production/init.sql
```

### Step 4: Configure Environment

```bash
export REDIS_URL=redis://localhost:6379
export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern
export PORT=8000
export WORKERS=1
export LOG_LEVEL=INFO

# Point to your local LLM
export OLLAMA_API_BASE=http://localhost:11434/v1
```
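
Before starting the server, it can help to confirm that Redis, PostgreSQL, and the local LLM are actually reachable with the values you just exported. A stdlib-only sketch; it only checks TCP connectivity and the OpenAI-style `/models` endpoint, nothing ml-intern-specific:

```python
import os
import socket
import urllib.request

def check_tcp(name: str, host: str, port: int) -> None:
    """Report whether a TCP port is accepting connections."""
    try:
        socket.create_connection((host, port), timeout=2).close()
        print(f"OK   {name} ({host}:{port})")
    except OSError as exc:
        print(f"FAIL {name} ({host}:{port}): {exc}")

check_tcp("Redis", "localhost", 6379)
check_tcp("PostgreSQL", "localhost", 5432)

# The LLM server should answer the OpenAI-style model listing endpoint.
base = os.getenv("OLLAMA_API_BASE", "http://localhost:11434/v1")
try:
    with urllib.request.urlopen(f"{base}/models", timeout=5) as response:
        print("OK   LLM backend:", response.status)
except OSError as exc:
    print("FAIL LLM backend:", exc)
```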

### Step 5: Start the Server

```bash
cd production
python -m production_server
```

Server runs on http://localhost:8000

### Step 6: Start Worker (in another terminal)

```bash
source .venv/bin/activate
cd production
python -m worker
```

---

## Connecting Different Local Backends

| Backend | Start Command | API Base | Model Prefix | Example Model String |
|---------|--------------|----------|-------------|---------------------|
| **Ollama** | `ollama serve` | `http://localhost:11434/v1` | `ollama/` | `ollama/llama3.1` |
| **LM Studio** | Start server in GUI | `http://localhost:1234/v1` | `lmstudio/` | `lmstudio/llama-3-8b` |
| **llama.cpp** | `./server -m model.gguf` | `http://localhost:8080/v1` | `llamacpp/` | `llamacpp/llama-2-7b` |
| **vLLM** | `python -m vllm.entrypoints.openai.api_server` | `http://localhost:8000/v1` | `vllm/` | `vllm/llama-3-8b` |
| **MLX** | `python -m mlx_lm.server` | `http://localhost:8000/v1` | `mlx/` | `mlx/llama-3-8b` |
| **NVIDIA NIM** | `docker run nvcr.io/...` | `http://localhost:8000/v1` | `nim/` | `nim/llama-3.1-8b` |
| **TGI** | `docker run ghcr.io/...tgi` | `http://localhost:8080/v1` | `tgi/` | `tgi/llama-3-8b` |
| **Custom PyTorch** | Your own server | `http://localhost:8000/v1` | `local/` | `local/my-model` |
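
To make the prefix-to-backend mapping in the table concrete, here is a minimal, illustrative sketch of how a router could resolve a model string into an API base plus a bare model name. This is not the actual ml-intern implementation; the `BACKENDS` dictionary, the `LLAMACPP_API_BASE` variable, and the `resolve_model` helper are hypothetical, shown only to clarify the convention, with the real override variables described in the next subsection:

```python
import os

# Hypothetical prefix -> default API base mapping, mirroring the table above.
# Each entry can be overridden by an environment variable.
BACKENDS = {
    "ollama":   os.getenv("OLLAMA_API_BASE",   "http://localhost:11434/v1"),
    "lmstudio": os.getenv("LMSTUDIO_API_BASE", "http://localhost:1234/v1"),
    "llamacpp": os.getenv("LLAMACPP_API_BASE", "http://localhost:8080/v1"),
    "vllm":     os.getenv("VLLM_API_BASE",     "http://localhost:8000/v1"),
}

def resolve_model(model: str) -> tuple[str, str]:
    """Split 'ollama/llama3.1' into (api_base, 'llama3.1')."""
    prefix, _, name = model.partition("/")
    if not name or prefix not in BACKENDS:
        raise ValueError(f"Unknown model string: {model!r}")
    return BACKENDS[prefix], name

print(resolve_model("ollama/llama3.1"))  # ('http://localhost:11434/v1', 'llama3.1')
print(resolve_model("vllm/llama-3-8b"))  # ('http://localhost:8000/v1', 'llama-3-8b')
```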

### Override API Base (if not on the default port)

In `.env`:
```env
OLLAMA_API_BASE=http://192.168.1.100:11434/v1
LMSTUDIO_API_BASE=http://lmstudio.local:1234/v1
VLLM_API_BASE=http://vllm-server.internal:8000/v1
```

---

## Multi-Backend Setup (Recommended)

Run **multiple local backends** and let ml-intern round-robin or fail over:

```bash
# Terminal 1: Ollama for fast models
ollama pull llama3.1
ollama serve

# Terminal 2: vLLM for high throughput
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8001
```

In `.env`:
```env
OLLAMA_API_BASE=http://localhost:11434/v1
VLLM_API_BASE=http://localhost:8001/v1
```

Now you can use either:
```bash
curl http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Quick question"}]
  }'

curl http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/llama-3.1-70b",
    "messages": [{"role":"user","content":"Complex reasoning"}]
  }'
```
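
If you want the fallback behaviour on the client side as well, here is a minimal sketch using the standard `openai` client against the gateway, assuming the OpenAI-compatible `/v1` endpoint shown in the curl examples above; the model strings and fallback order are just examples:

```python
from openai import OpenAI  # pip install openai

# The gateway speaks the OpenAI API, so the stock client works; the key is unused locally.
client = OpenAI(base_url="http://localhost/v1", api_key="local-only")

def ask(prompt: str, models=("vllm/llama-3.1-70b", "ollama/llama3.1")) -> str:
    """Try the big vLLM model first, fall back to the smaller Ollama model on error."""
    last_error = None
    for model in models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except Exception as exc:  # e.g. backend down or overloaded
            last_error = exc
    raise RuntimeError(f"All backends failed: {last_error}")

print(ask("Summarize the benefits of local deployment in one sentence."))
```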

---

## CLI Mode (No Server)

If you want to use ml-intern as a CLI tool with local models (the original use case):

```bash
# Install the agent CLI
pip install -e .

# Run with a local model
ml-intern --model ollama/llama3.1 "Write a Python function to sort a list"

# With local overrides
OLLAMA_API_BASE=http://localhost:11434/v1 ml-intern \
  --model ollama/llama3.1 \
  --yolo \
  "Create a FastAPI app with Redis caching"
```

---

## Hardware Requirements by Backend

| Backend | Min GPU | Recommended GPU | RAM | Notes |
|---------|---------|----------------|-----|-------|
| Ollama (7B) | None (CPU) | 8GB VRAM | 16GB | Best ease-of-use |
| Ollama (70B) | 48GB VRAM | 80GB (A100) | 128GB | Q4 quantization helps |
| LM Studio | None (CPU) | 8GB+ VRAM | 16GB | Great GUI for exploration |
| vLLM (7B) | 16GB VRAM | 24GB (3090/A10G) | 32GB | Highest throughput |
| vLLM (70B) | 80GB VRAM | 2x A100 | 256GB | tensor_parallel required |
| llama.cpp | None (CPU) | Any | 8GB | Best for CPU-only |
| MLX (Mac) | Apple Silicon | M3 Max 36GB | 32GB | Native Apple GPU |
| NVIDIA NIM | 24GB+ | A100/H100 | 64GB | Enterprise support |
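
For model sizes not in the table, a common rule of thumb is: weight memory is roughly parameter count times bytes per weight, plus some headroom for KV cache and activations. A quick back-of-the-envelope helper (the 20% overhead factor is an assumption, not a measured value):

```python
def approx_vram_gb(params_billions: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rule-of-thumb memory estimate: weights plus ~20% for KV cache / activations."""
    return params_billions * (bits_per_weight / 8) * overhead

print(approx_vram_gb(7, 4))    # 7B at Q4  -> ~4.2 GB, fits an 8GB GPU
print(approx_vram_gb(70, 4))   # 70B at Q4 -> ~42 GB, needs ~48GB VRAM
print(approx_vram_gb(70, 16))  # 70B fp16  -> ~168 GB, multi-GPU territory
```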

---

## Troubleshooting

### "Connection refused" to local LLM

Docker containers can't reach `localhost` on the host. Use:
- **Mac/Windows**: `host.docker.internal` (already in the default `.env`)
- **Linux**: Your machine's LAN IP, e.g., `192.168.1.5` (see the snippet below)
- **All platforms**: Put the LLM server in Docker Compose too
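
To find the LAN IP for the Linux case, any standard method works (`hostname -I`, `ip addr`, and so on); here is a small stdlib-only sketch that asks the OS which local address it would use for outbound traffic:

```python
import socket

def lan_ip() -> str:
    """Return the host's outbound LAN address (the one a container can reach)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # UDP "connect" sends no packets; it just selects a route and source address.
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    finally:
        s.close()

print(lan_ip())  # e.g. 192.168.1.5; use it in OLLAMA_API_BASE instead of host.docker.internal
```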

### Ollama in Docker Compose

Add to `docker-compose.yml` (the named volume must also be declared at the top level):
```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"

volumes:
  ollama:
```
Then set `OLLAMA_API_BASE=http://ollama:11434/v1` (internal Docker DNS).

### "Rate limit exceeded" immediately

The default RPM is 40. For local models with no actual limit, increase it:
```env
DEFAULT_RPM_LIMIT=1000
```

### PostgreSQL connection failed

```bash
# Check if Postgres is running
docker ps | grep postgres

# Check logs
docker logs ml-intern-postgres-1

# Reset database
docker-compose down -v   # WARNING: deletes all data
docker-compose up -d postgres
```

### Grafana shows "No data"

Prometheus needs time to scrape. Wait 30 seconds, or check:
```bash
curl http://localhost:9090/api/v1/targets
```

### Slow first response

Local models load into VRAM/RAM on the first request; subsequent requests are fast. Use Redis caching (enabled by default) to skip LLM calls for repeated prompts.
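
If the cold start matters, you can warm the model up right after the stack comes up by sending one throwaway request. A stdlib-only sketch; the endpoint and model string follow the examples above, so adjust them to your backend:

```python
import json
import urllib.request

# One tiny request forces the backend to load the model before real traffic arrives.
payload = {
    "model": "ollama/llama3.1",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 1,
}
request = urllib.request.Request(
    "http://localhost/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request, timeout=300) as response:
    print("warm-up status:", response.status)
```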

---

## File Structure (Local Copy)

```
ml-intern/
├── production/
│   ├── docker-compose.yml      # Full stack
│   ├── Dockerfile.prod         # API + worker image
│   ├── production_server.py    # FastAPI app
│   ├── worker.py               # Background tasks
│   ├── init.sql                # DB schema
│   ├── nginx.conf              # Load balancer config
│   ├── prometheus.yml          # Metrics collection
│   ├── requirements.prod.txt   # Python deps
│   ├── .env.example            # Configuration template
│   ├── grafana/                # Dashboards
│   ├── k8s/                    # Kubernetes manifests
│   ├── helm/                   # Helm charts
│   └── tests/                  # Integration + load tests
└── agent/                      # Original ml-intern agent code
```

---

## Next Steps

1. **Load test your setup**: `locust -f production/tests/load_test.py --host http://localhost`
2. **Add cloud fallback**: Set `GROQ_API_KEY` or `OPENAI_API_KEY` for when the local model is overloaded
3. **Monitor costs**: Even local models use electricity — Grafana tracks request volume
4. **Scale horizontally**: `docker-compose up -d --scale api=4`

---

## No Internet Required

Once models are downloaded and Docker images are cached, the entire stack runs **offline**:
- Local LLM (Ollama, LM Studio, etc.) — no network
- Redis, PostgreSQL, Nginx — local containers
- Prometheus + Grafana — local containers
- The only outbound calls are to the LLM API on localhost

Perfect for air-gapped environments or private data processing.