# ml-intern Production System
Production-grade deployment of ml-intern with horizontal scaling, distributed rate limiting, circuit breakers, caching, multi-tenancy, and comprehensive observability.
## Architecture

```
Clients (CLI / Web / API)
        |
        v
Nginx (SSL, Rate Limit)
        |
        v
FastAPI (xN) ----> Redis + Postgres
        |
        +--> Background Workers
        |
        v
Prometheus + Grafana
```
## Production Features
| Feature | Technology | Benefit |
|---|---|---|
| Distributed Rate Limiting | Redis Token Bucket | Per-tenant, per-provider RPM limits |
| Circuit Breaker | Redis-backed | Prevents cascade failures |
| Request Caching | Redis TTL | Reduces LLM costs and latency |
| Multi-Tenancy | PostgreSQL RLS | Isolated sessions |
| Cost Tracking | Per-session budget | Spending limits and alerts |
| Connection Pooling | AsyncPG + HTTPX | Efficient DB and API connections |
| Health Checks | /health endpoint | Self-healing and load balancing |
| Graceful Shutdown | Signal handlers | Drain in-flight requests |
| Metrics | Prometheus + Grafana | Full observability |
| Distributed Tracing | Jaeger + Correlation IDs | Debug across services |
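The distributed rate limiter in the table above is a Redis token bucket keyed per tenant and provider. The refill-and-consume arithmetic can be sketched in plain Python — this is an in-memory stand-in for illustration only; the class name, `rpm` parameter, and injectable clock are assumptions, not this project's API:

```python
import time


class TokenBucket:
    """In-memory sketch of token-bucket rate limiting.

    Production uses a Redis-backed bucket so every API replica shares
    one quota per (tenant, provider) key; this class only shows the
    refill-and-consume logic that such a bucket implements.
    """

    def __init__(self, rpm: int, now=time.monotonic):
        self.capacity = rpm              # max burst = one minute's quota
        self.tokens = float(rpm)         # start full
        self.refill_rate = rpm / 60.0    # tokens added per second
        self.now = now                   # injectable clock for testing
        self.last = now()

    def allow(self) -> bool:
        current = self.now()
        elapsed = current - self.last
        self.last = current
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In the Redis-backed version, the same read-modify-write is typically executed atomically as a single Lua script (`EVAL`), so concurrent replicas cannot double-spend tokens.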
## Quick Start

```bash
# 1. Configure environment
cp .env.example .env
# Edit .env with your API keys

# 2. Start infrastructure
docker-compose up -d redis postgres nginx prometheus grafana

# 3. Start application
docker-compose up -d api worker

# 4. Verify
curl http://localhost/health
curl http://localhost/v1/models
```
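Step 1 assumes a populated `.env`. The exact keys come from `.env.example`; the names below are illustrative placeholders only, not confirmed settings of this project:

```bash
# Illustrative only — consult .env.example for the real key names.
GROQ_API_KEY=...
DATABASE_URL=postgresql://...
REDIS_URL=redis://...
```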
## Dashboards
- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- Jaeger: http://localhost:16686
- pgAdmin: http://localhost:5050
## API Usage

```bash
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Correlation-ID: $(uuidgen)" \
  -d '{
    "model": "groq/llama-3.3-70b-versatile",
    "messages": [{"role": "user", "content": "Hello"}],
    "session_id": "my-session-123"
  }'
```
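The same call from Python using only the standard library. The endpoint, payload, and correlation-ID header mirror the curl example above; `build_chat_request` is an illustrative helper, not part of this project's client:

```python
import json
import urllib.request
import uuid


def build_chat_request(base_url: str, session_id: str, content: str) -> urllib.request.Request:
    """Build the chat-completions POST shown in the curl example.

    A fresh X-Correlation-ID is attached so the request can be
    followed across services in Jaeger.
    """
    payload = {
        "model": "groq/llama-3.3-70b-versatile",
        "messages": [{"role": "user", "content": content}],
        "session_id": session_id,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "X-Correlation-ID": str(uuid.uuid4()),
        },
        method="POST",
    )


# Sending it (requires the Quick Start stack to be running):
# with urllib.request.urlopen(build_chat_request("http://localhost", "my-session-123", "Hello")) as resp:
#     print(json.load(resp))
```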
## Kubernetes

```bash
cd k8s && chmod +x deploy.sh && ./deploy.sh
```
### Helm

```bash
cd helm/ml-intern
helm dependency update
helm install ml-intern . --namespace ml-intern --create-namespace
```
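To customize the install, pass a values override with `-f`. The keys below (`replicaCount`, `resources`) follow common chart conventions and are assumptions — check `helm/ml-intern/values.yaml` for what this chart actually exposes:

```yaml
# values-prod.yaml — illustrative override; key names are assumptions
replicaCount: 4
resources:
  requests:
    cpu: 500m
    memory: 512Mi
```

Then: `helm install ml-intern . -f values-prod.yaml --namespace ml-intern --create-namespace`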
## Scaling

```bash
# Horizontal
docker-compose up -d --scale api=4
kubectl -n ml-intern scale deployment ml-intern-api --replicas=10

# HPA (auto-scale)
kubectl apply -f k8s/deployment-api.yml   # Includes HPA
```