
# ml-intern Production System

Production-grade deployment of ml-intern with horizontal scaling, distributed rate limiting, circuit breakers, caching, multi-tenancy, and comprehensive observability.

## Architecture

```
Clients (CLI / Web / API) -> Nginx (SSL, Rate Limit) -> FastAPI (xN) -> Redis + Postgres
                                                              |
                                                              +-> Background Workers
                                                              |
                                                        Prometheus + Grafana
```

## Production Features

| Feature | Technology | Benefit |
| --- | --- | --- |
| Distributed Rate Limiting | Redis token bucket | Per-tenant, per-provider RPM limits |
| Circuit Breaker | Redis-backed | Prevents cascading failures |
| Request Caching | Redis TTL | Reduces LLM cost and latency |
| Multi-Tenancy | PostgreSQL RLS | Isolated sessions |
| Cost Tracking | Per-session budgets | Spending limits and alerts |
| Connection Pooling | AsyncPG + HTTPX | Efficient DB and API connections |
| Health Checks | `/health` endpoint | Self-healing and load balancing |
| Graceful Shutdown | Signal handlers | Drains in-flight requests |
| Metrics | Prometheus + Grafana | Full observability |
| Distributed Tracing | Jaeger + correlation IDs | Debugging across services |
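To illustrate the rate-limiting row above, here is a minimal in-memory token-bucket sketch. The production version is Redis-backed and keyed per tenant/provider; this stand-in (class name, constructor signature, and refill math are illustrative assumptions, not the shipped API) only shows the core refill-and-consume logic.

```python
import time


class TokenBucket:
    """In-memory sketch of a token bucket.

    The Redis-backed version described above would keep (tokens, last_refill)
    in Redis under a per-tenant, per-provider key; the refill arithmetic is
    the same.
    """

    def __init__(self, rpm: int):
        self.capacity = float(rpm)          # bucket holds at most `rpm` tokens
        self.tokens = float(rpm)            # start full
        self.refill_per_sec = rpm / 60.0    # RPM expressed as tokens/second
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill based on elapsed time, then try to consume one token."""
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A bucket created with `rpm=60` admits a burst of 60 requests, then denies further calls until tokens refill at one per second.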

## Quick Start

```bash
# 1. Configure environment
cp .env.example .env
# Edit .env with your API keys

# 2. Start infrastructure
docker-compose up -d redis postgres nginx prometheus grafana

# 3. Start the application
docker-compose up -d api worker

# 4. Verify
curl http://localhost/health
curl http://localhost/v1/models
```
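The verification step above probes `/health`, which the feature table ties to self-healing and load balancing. As a sketch of what such an endpoint typically aggregates (the function name, check wiring, and payload shape here are assumptions, not the actual handler):

```python
# Hypothetical sketch: run one probe per dependency and report overall status.
# A real handler would probe Redis and Postgres over their connection pools.
def health_payload(checks):
    """checks: mapping of dependency name -> zero-arg callable returning bool."""
    results = {name: bool(check()) for name, check in checks.items()}
    status = "ok" if all(results.values()) else "degraded"
    return {"status": status, "checks": results}
```

A load balancer or Kubernetes probe would treat anything other than `"ok"` as a signal to stop routing traffic to that replica.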

## Dashboards

## API Usage

```bash
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Correlation-ID: $(uuidgen)" \
  -d '{
    "model": "groq/llama-3.3-70b-versatile",
    "messages": [{"role": "user", "content": "Hello"}],
    "session_id": "my-session-123"
  }'
```
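The same request can be assembled from Python. This sketch only builds the URL, headers, and JSON body to mirror the curl example; the helper name and `base_url` default are assumptions, and sending the request (e.g. with `urllib.request` or `httpx`) is left to the caller.

```python
import json
import uuid


def build_chat_request(model, messages, session_id, base_url="http://localhost"):
    """Assemble the request shown in the curl example above.

    `base_url` is an assumption for local deployments; the endpoint path and
    field names mirror the example.
    """
    url = f"{base_url}/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        # Same role as $(uuidgen) in the shell example: one ID per request
        # so the call can be traced across services.
        "X-Correlation-ID": str(uuid.uuid4()),
    }
    body = json.dumps({
        "model": model,
        "messages": messages,
        "session_id": session_id,
    })
    return url, headers, body
```

Generating a fresh correlation ID per request is what lets Jaeger stitch together the Nginx, FastAPI, and worker spans for a single call.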

## Kubernetes

```bash
cd k8s && chmod +x deploy.sh && ./deploy.sh
```

## Helm

```bash
cd helm/ml-intern
helm dependency update
helm install ml-intern . --namespace ml-intern --create-namespace
```

## Scaling

```bash
# Horizontal (Docker Compose)
docker-compose up -d --scale api=4

# Horizontal (Kubernetes)
kubectl -n ml-intern scale deployment ml-intern-api --replicas=10

# HPA (auto-scale)
kubectl apply -f k8s/deployment-api.yml  # includes the HPA
```
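For reference, an HPA like the one bundled in `k8s/deployment-api.yml` typically looks like the fragment below. This is a hypothetical sketch: the replica bounds and the 70% CPU target are assumptions, not the shipped values.

```yaml
# Hypothetical sketch of the HPA in k8s/deployment-api.yml;
# minReplicas/maxReplicas and the CPU target are assumed values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-intern-api
  namespace: ml-intern
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-intern-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

With such a manifest applied, the manual `kubectl scale` command above is only needed for one-off overrides; the HPA adjusts replicas automatically between its bounds.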