# Local Deployment Guide — No Hugging Face Required

Run the entire ml-intern production system **locally** on your machine using Docker Compose or native Python. No HF account, no cloud APIs needed (though you can add them).

## Prerequisites

- **Docker + Docker Compose** (recommended) OR **Python 3.11+**
- **8GB RAM minimum** (16GB+ recommended)
- **Local LLM backend** (pick one):
  - [Ollama](https://ollama.com) — easiest
  - [LM Studio](https://lmstudio.ai) — GUI, great for Mac/Windows
  - [llama.cpp](https://github.com/ggerganov/llama.cpp) — most control
  - [vLLM](https://github.com/vllm-project/vllm) — highest throughput
  - [NVIDIA NIM](https://developer.nvidia.com/nim) — enterprise GPUs
  - [MLX](https://github.com/ml-explore/mlx) — Apple Silicon optimized

---

## Option 1: Docker Compose (Fastest — 2 Minutes)

### Step 1: Start a Local LLM Server

**Option A — Ollama (Recommended)**

```bash
# Install Ollama (one-liner)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1

# Start server (runs on :11434, OpenAI-compatible on :11434/v1)
ollama serve
```

**Option B — LM Studio**

1. Download LM Studio from https://lmstudio.ai
2. Load any GGUF model
3. Start **Local Inference Server** → it runs on `http://localhost:1234/v1`

**Option C — llama.cpp Server**

```bash
# Build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

# Start server (OpenAI-compatible API on :8080/v1)
./server -m llama-2-7b.Q4_K_M.gguf --port 8080
```

### Step 2: Clone & Configure

```bash
git clone https://github.com/raazkumar/ml-intern-local-fork.git
cd ml-intern-local-fork/production

# Copy environment template
cp .env.example .env
```

Edit `.env` — **only change these lines**:

```env
# Point to your local LLM server
OLLAMA_API_BASE=http://host.docker.internal:11434/v1
# (or for LM Studio: http://host.docker.internal:1234/v1)
# (or for llama.cpp: http://host.docker.internal:8080/v1)

# No cloud API keys needed for local-only mode
# Leave these blank or comment them out:
# HF_TOKEN=
# ANTHROPIC_API_KEY=
# OPENAI_API_KEY=
# GROQ_API_KEY=
# NVIDIA_API_KEY=
```

> **Docker host networking note**: On Linux, `host.docker.internal` may not work. Use your machine's LAN IP (e.g., `192.168.1.5`) instead. On Mac/Windows, `host.docker.internal` works out of the box.
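Before launching the stack, it is worth confirming that the local server actually answers on its OpenAI-compatible endpoint. A quick sanity check, assuming the Ollama defaults from Step 1 (adjust the port for LM Studio or llama.cpp):

```bash
# Should return a JSON list that includes the model you pulled (e.g. llama3.1)
curl http://localhost:11434/v1/models

# LM Studio / llama.cpp equivalents (use whichever server you started)
# curl http://localhost:1234/v1/models
# curl http://localhost:8080/v1/models
```

If this call fails, fix the LLM server first; nothing in the stack below will work without it.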
### Step 3: Launch the Stack

```bash
docker-compose up -d
```

This starts:

- **API server** (FastAPI) on http://localhost:8000
- **Background workers** (cleanup, budget alerts)
- **Redis** (caching + rate limiting) on :6379
- **PostgreSQL** (audit log + sessions) on :5432
- **Nginx** (load balancer) on :80
- **Prometheus** (metrics) on :9090
- **Grafana** (dashboards) on :3000
- **Jaeger** (tracing) on :16686
- **pgAdmin** (DB GUI) on :5050

### Step 4: Verify

```bash
# Health check
curl http://localhost/health | jq

# List available models (includes your local ones)
curl http://localhost/v1/models | jq

# Chat with your local model
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.1",
    "messages": [{"role":"user","content":"Hello from local deployment!"}],
    "stream": false
  }'
```

### Step 5: View Dashboards

| Service | URL | Default Login |
|---------|-----|---------------|
| API | http://localhost:8000 | — |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | — |
| Jaeger UI | http://localhost:16686 | — |
| pgAdmin | http://localhost:5050 | admin@mlintern.local / admin |

---

## Option 2: Pure Python (No Docker)

For development or lightweight setups.

### Step 1: Install Dependencies

```bash
# Python 3.11+ required
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

pip install -r production/requirements.prod.txt
```

### Step 2: Start PostgreSQL + Redis

You need these running locally. Options:

**A) System packages:**

```bash
# Ubuntu/Debian
sudo apt install postgresql redis
sudo systemctl start postgresql redis

# macOS
brew install postgresql redis
brew services start postgresql redis
```

**B) Docker (just the infra):**

```bash
docker run -d --name redis -p 6379:6379 redis:7-alpine
docker run -d --name postgres \
  -e POSTGRES_PASSWORD=ml_intern \
  -e POSTGRES_DB=ml_intern \
  -p 5432:5432 postgres:16-alpine
```

### Step 3: Initialize Database

```bash
psql -U postgres -h localhost -d ml_intern -f production/init.sql
```

### Step 4: Configure Environment

```bash
export REDIS_URL=redis://localhost:6379
export DATABASE_URL=postgresql://postgres:ml_intern@localhost:5432/ml_intern
export PORT=8000
export WORKERS=1
export LOG_LEVEL=INFO

# Point to your local LLM
export OLLAMA_API_BASE=http://localhost:11434/v1
```

### Step 5: Start the Server

```bash
cd production
python -m production_server
```

Server runs on http://localhost:8000

### Step 6: Start Worker (in another terminal)

```bash
source .venv/bin/activate
cd production
python -m worker
```

---

## Connecting Different Local Backends

| Backend | Start Command | API Base | Model Prefix | Example Model String |
|---------|--------------|----------|--------------|----------------------|
| **Ollama** | `ollama serve` | `http://localhost:11434/v1` | `ollama/` | `ollama/llama3.1` |
| **LM Studio** | Start server in GUI | `http://localhost:1234/v1` | `lmstudio/` | `lmstudio/llama-3-8b` |
| **llama.cpp** | `./server -m model.gguf` | `http://localhost:8080/v1` | `llamacpp/` | `llamacpp/llama-2-7b` |
| **vLLM** | `python -m vllm.entrypoints.openai.api_server` | `http://localhost:8000/v1` | `vllm/` | `vllm/llama-3-8b` |
| **MLX** | `python -m mlx_lm.server` | `http://localhost:8000/v1` | `mlx/` | `mlx/llama-3-8b` |
| **NVIDIA NIM** | `docker run nvcr.io/...` | `http://localhost:8000/v1` | `nim/` | `nim/llama-3.1-8b` |
| **TGI** | `docker run ghcr.io/...tgi` | `http://localhost:8080/v1` | `tgi/` | `tgi/llama-3-8b` |
| **Custom PyTorch** | Your own server | `http://localhost:8000/v1` | `local/` | `local/my-model` |
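Whichever backend you run, requests to the gateway look the same: the prefix on the `model` string picks the backend, and the rest is the model name that backend serves. A sketch of a request to an LM Studio server, using the example model string from the table above (the exact name depends on whatever GGUF you loaded) and assuming the gateway passes the OpenAI-style `stream` flag through:

```bash
# Same /v1/chat/completions endpoint, different prefix.
# -N disables curl's buffering so streamed chunks print as they arrive.
curl -N -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lmstudio/llama-3-8b",
    "messages": [{"role":"user","content":"Stream a two-line haiku"}],
    "stream": true
  }'
```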
### Override API Base (if not default port)

In `.env`:

```env
OLLAMA_API_BASE=http://192.168.1.100:11434/v1
LMSTUDIO_API_BASE=http://lmstudio.local:1234/v1
VLLM_API_BASE=http://vllm-server.internal:8000/v1
```

---

## Multi-Backend Setup (Recommended)

Run **multiple local backends** and let ml-intern round-robin or fail over:

```bash
# Terminal 1: Ollama for fast models
ollama pull llama3.1
ollama serve

# Terminal 2: vLLM for high throughput
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8001
```

In `.env`:

```env
OLLAMA_API_BASE=http://localhost:11434/v1
VLLM_API_BASE=http://localhost:8001/v1
```

Now you can use either:

```bash
curl http://localhost/v1/chat/completions -d '{
  "model": "ollama/llama3.1",
  "messages": [{"role":"user","content":"Quick question"}]
}'

curl http://localhost/v1/chat/completions -d '{
  "model": "vllm/llama-3.1-70b",
  "messages": [{"role":"user","content":"Complex reasoning"}]
}'
```

---

## CLI Mode (No Server)

If you want to use ml-intern as a CLI tool with local models (the original use case):

```bash
# Install the agent CLI
pip install -e .

# Run with a local model
ml-intern --model ollama/llama3.1 "Write a Python function to sort a list"

# With local overrides
OLLAMA_API_BASE=http://localhost:11434/v1 ml-intern \
  --model ollama/llama3.1 \
  --yolo \
  "Create a FastAPI app with Redis caching"
```

---

## Hardware Requirements by Backend

| Backend | Min GPU | Recommended GPU | RAM | Notes |
|---------|---------|-----------------|-----|-------|
| Ollama (7B) | None (CPU) | 8GB VRAM | 16GB | Best ease-of-use |
| Ollama (70B) | 48GB VRAM | 80GB (A100) | 128GB | Q4 quantization helps |
| LM Studio | None (CPU) | 8GB+ VRAM | 16GB | Great GUI for exploration |
| vLLM (7B) | 16GB VRAM | 24GB (3090/A10G) | 32GB | Highest throughput |
| vLLM (70B) | 80GB VRAM | 2x A100 | 256GB | tensor_parallel required |
| llama.cpp | None (CPU) | Any | 8GB | Best for CPU-only |
| MLX (Mac) | Apple Silicon | M3 Max 36GB | 32GB | Native Apple GPU |
| NVIDIA NIM | 24GB+ | A100/H100 | 64GB | Enterprise support |

---

## Troubleshooting

### "Connection refused" to local LLM

Docker containers can't reach `localhost` on the host. Use:

- **Mac/Windows**: `host.docker.internal` (already in default `.env`)
- **Linux**: Your machine's LAN IP, e.g., `192.168.1.5`
- **All platforms**: Put the LLM server in Docker Compose too

### Ollama in Docker Compose

Add to `docker-compose.yml`:

```yaml
ollama:
  image: ollama/ollama
  volumes:
    - ollama:/root/.ollama
  ports:
    - "11434:11434"
```

Declare the `ollama` named volume under the file's top-level `volumes:` key if it isn't already defined, then set `OLLAMA_API_BASE=http://ollama:11434/v1` (internal Docker DNS).

### "Rate limit exceeded" immediately

The default RPM is 40. For local models with no actual limit, increase it:

```env
DEFAULT_RPM_LIMIT=1000
```

### PostgreSQL connection failed

```bash
# Check if Postgres is running
docker ps | grep postgres

# Check logs
docker logs ml-intern-postgres-1

# Reset database
docker-compose down -v   # WARNING: deletes all data
docker-compose up -d postgres
```

### Grafana shows "No data"

Prometheus needs time to scrape. Wait 30 seconds, or check:

```bash
curl http://localhost:9090/api/v1/targets
```

### Slow first response

Local models load into VRAM/RAM on the first request. Subsequent requests are fast. Use Redis caching (enabled by default) to skip LLM calls for repeated prompts.
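One way to hide that cold start is to warm the model right after the stack comes up. A minimal sketch, assuming the Ollama backend configured earlier; any cheap prompt works, since the point is just to force the model load (and seed the cache):

```bash
# Fire one throwaway request so the model is already loaded before real traffic
curl -s -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"ollama/llama3.1","messages":[{"role":"user","content":"ping"}]}' \
  > /dev/null
```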
---

## File Structure (Local Copy)

```
ml-intern/
├── production/
│   ├── docker-compose.yml      # Full stack
│   ├── Dockerfile.prod         # API + worker image
│   ├── production_server.py    # FastAPI app
│   ├── worker.py               # Background tasks
│   ├── init.sql                # DB schema
│   ├── nginx.conf              # Load balancer config
│   ├── prometheus.yml          # Metrics collection
│   ├── requirements.prod.txt   # Python deps
│   ├── .env.example            # Configuration template
│   ├── grafana/                # Dashboards
│   ├── k8s/                    # Kubernetes manifests
│   ├── helm/                   # Helm charts
│   └── tests/                  # Integration + load tests
└── agent/                      # Original ml-intern agent code
```

---

## Next Steps

1. **Load test your setup**: `locust -f production/tests/load_test.py --host http://localhost`
2. **Add cloud fallback**: Set `GROQ_API_KEY` or `OPENAI_API_KEY` for when the local model is overloaded
3. **Monitor costs**: Even local models use electricity — Grafana tracks request volume
4. **Scale horizontally**: `docker-compose up -d --scale api=4`

---

## No Internet Required

Once models are downloaded and Docker images are cached, the entire stack runs **offline**:

- Local LLM (Ollama, LM Studio, etc.) — no network
- Redis, PostgreSQL, Nginx — local containers
- Prometheus + Grafana — local containers
- The only outbound calls are to the LLM API on localhost

Perfect for air-gapped environments or private data processing.