# ml-intern Production System
Production-grade deployment of ml-intern with horizontal scaling, distributed rate limiting, circuit breakers, caching, multi-tenancy, and comprehensive observability.
## Architecture

```
Clients (CLI / Web / API)
        |
        v
Nginx (SSL, Rate Limit)
        |
        v
FastAPI (xN) ----> Redis + Postgres
        |
        +--> Background Workers
        |
        v
Prometheus + Grafana
```
## Production Features
| Feature | Technology | Benefit |
|---|---|---|
| Distributed Rate Limiting | Redis Token Bucket | Per-tenant, per-provider RPM limits |
| Circuit Breaker | Redis-backed | Prevents cascade failures |
| Request Caching | Redis TTL | Reduces LLM costs and latency |
| Multi-Tenancy | PostgreSQL RLS | Isolated sessions |
| Cost Tracking | Per-session budget | Spending limits and alerts |
| Connection Pooling | AsyncPG + HTTPX | Efficient DB and API connections |
| Health Checks | /health endpoint | Self-healing and load balancing |
| Graceful Shutdown | Signal handlers | Drain in-flight requests |
| Metrics | Prometheus + Grafana | Full observability |
| Distributed Tracing | Jaeger + Correlation IDs | Debug across services |
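The distributed rate limiter in the table above is a Redis token bucket keyed per tenant and provider. The refill-and-consume arithmetic can be sketched in plain Python — this is an in-memory stand-in for illustration only; the class name, `rpm` parameter, and injectable clock are assumptions, not this project's API:

```python
import time


class TokenBucket:
    """In-memory sketch of token-bucket rate limiting.

    Production uses a Redis-backed bucket so every API replica shares
    one quota per (tenant, provider) key; this class only shows the
    refill-and-consume logic that such a bucket implements.
    """

    def __init__(self, rpm: int, now=time.monotonic):
        self.capacity = rpm              # max burst = one minute's quota
        self.tokens = float(rpm)         # start full
        self.refill_rate = rpm / 60.0    # tokens added per second
        self.now = now                   # injectable clock for testing
        self.last = now()

    def allow(self) -> bool:
        current = self.now()
        elapsed = current - self.last
        self.last = current
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In the Redis-backed version, the same read-modify-write is typically executed atomically as a single Lua script (`EVAL`), so concurrent replicas cannot double-spend tokens.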
## Quick Start

```bash
# 1. Configure environment
cp .env.example .env
# Edit .env with your API keys

# 2. Start infrastructure
docker-compose up -d redis postgres nginx prometheus grafana

# 3. Start application
docker-compose up -d api worker

# 4. Verify
curl http://localhost/health
curl http://localhost/v1/models
```
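Step 1 assumes a populated `.env`. The exact keys come from `.env.example`; the names below are illustrative placeholders only, not confirmed settings of this project:

```bash
# Illustrative only — consult .env.example for the real key names.
GROQ_API_KEY=...
DATABASE_URL=postgresql://...
REDIS_URL=redis://...
```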
## Dashboards
- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- Jaeger: http://localhost:16686
- pgAdmin: http://localhost:5050
## API Usage

```bash
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Correlation-ID: $(uuidgen)" \
  -d '{
    "model": "groq/llama-3.3-70b-versatile",
    "messages": [{"role": "user", "content": "Hello"}],
    "session_id": "my-session-123"
  }'
```
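The same call from Python using only the standard library. The endpoint, payload, and correlation-ID header mirror the curl example above; `build_chat_request` is an illustrative helper, not part of this project's client:

```python
import json
import urllib.request
import uuid


def build_chat_request(base_url: str, session_id: str, content: str) -> urllib.request.Request:
    """Build the chat-completions POST shown in the curl example.

    A fresh X-Correlation-ID is attached so the request can be
    followed across services in Jaeger.
    """
    payload = {
        "model": "groq/llama-3.3-70b-versatile",
        "messages": [{"role": "user", "content": content}],
        "session_id": session_id,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "X-Correlation-ID": str(uuid.uuid4()),
        },
        method="POST",
    )


# Sending it (requires the Quick Start stack to be running):
# with urllib.request.urlopen(build_chat_request("http://localhost", "my-session-123", "Hello")) as resp:
#     print(json.load(resp))
```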
## Kubernetes

```bash
cd k8s && chmod +x deploy.sh && ./deploy.sh
```
### Helm

```bash
cd helm/ml-intern
helm dependency update
helm install ml-intern . --namespace ml-intern --create-namespace
```
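To customize the install, pass a values override with `-f`. The keys below (`replicaCount`, `resources`) follow common chart conventions and are assumptions — check `helm/ml-intern/values.yaml` for what this chart actually exposes:

```yaml
# values-prod.yaml — illustrative override; key names are assumptions
replicaCount: 4
resources:
  requests:
    cpu: 500m
    memory: 512Mi
```

Then: `helm install ml-intern . -f values-prod.yaml --namespace ml-intern --create-namespace`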
## Scaling

```bash
# Horizontal
docker-compose up -d --scale api=4
kubectl -n ml-intern scale deployment ml-intern-api --replicas=10

# HPA (auto-scale)
kubectl apply -f k8s/deployment-api.yml   # Includes HPA
```