# ml-intern Production System

Production-grade deployment of ml-intern with horizontal scaling, distributed rate limiting, circuit breakers, caching, multi-tenancy, and comprehensive observability.

## Architecture

```
Clients (CLI / Web / API)
           |
           v
Nginx (SSL, rate limit)
           |
           v
      FastAPI (xN) ------> Background Workers
           |
           v
    Redis + Postgres

  Prometheus + Grafana (observability for all services)
```

## Production Features

| Feature | Technology | Benefit |
|---------|-----------|---------|
| **Distributed Rate Limiting** | Redis Token Bucket | Per-tenant, per-provider RPM limits |
| **Circuit Breaker** | Redis-backed | Prevents cascade failures |
| **Request Caching** | Redis TTL | Reduces LLM costs and latency |
| **Multi-Tenancy** | PostgreSQL RLS | Isolated sessions |
| **Cost Tracking** | Per-session budget | Spending limits and alerts |
| **Connection Pooling** | AsyncPG + HTTPX | Efficient DB and API connections |
| **Health Checks** | /health endpoint | Self-healing and load balancing |
| **Graceful Shutdown** | Signal handlers | Drain in-flight requests |
| **Metrics** | Prometheus + Grafana | Full observability |
| **Distributed Tracing** | Jaeger + Correlation IDs | Debug across services |

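The rate limiter in the table above keeps a token bucket per (tenant, provider) key in Redis. Its refill-and-take logic can be sketched in-process like this — a minimal single-process sketch, not the actual implementation (class and parameter names here are illustrative; the deployed version updates the bucket atomically in Redis, typically via a Lua script):

```python
import time

class TokenBucket:
    """In-memory sketch of token-bucket rate limiting.

    Production stores (tokens, last_refill) in Redis per tenant/provider
    so all API replicas share one bucket; this sketch is single-process.
    """

    def __init__(self, rpm: int):
        self.capacity = rpm             # burst size == requests per minute
        self.tokens = float(rpm)
        self.refill_rate = rpm / 60.0   # tokens added per second
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rpm=2)
print(bucket.allow(), bucket.allow(), bucket.allow())  # True True False
```

The third call is denied because both tokens are spent and the bucket refills at only `rpm / 60` tokens per second.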

## Quick Start

```bash
# 1. Configure environment
cp .env.example .env
# Edit .env with your API keys

# 2. Start infrastructure
docker-compose up -d redis postgres nginx prometheus grafana

# 3. Start application
docker-compose up -d api worker

# 4. Verify
curl http://localhost/health
curl http://localhost/v1/models
```

## Dashboards

- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- Jaeger: http://localhost:16686
- pgAdmin: http://localhost:5050

## API Usage

```bash
curl -X POST http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Correlation-ID: $(uuidgen)" \
  -d '{
    "model": "groq/llama-3.3-70b-versatile",
    "messages": [{"role": "user", "content": "Hello"}],
    "session_id": "my-session-123"
  }'
```

## Kubernetes

```bash
cd k8s && chmod +x deploy.sh && ./deploy.sh
```

## Helm

```bash
cd helm/ml-intern
helm dependency update
helm install ml-intern . --namespace ml-intern --create-namespace
```

## Scaling

```bash
# Horizontal
docker-compose up -d --scale api=4
kubectl -n ml-intern scale deployment ml-intern-api --replicas=10

# HPA (auto-scale)
kubectl apply -f k8s/deployment-api.yml  # Includes HPA
```