# System Architecture - Enterprise AI Gateway
> **Primary Responsibility:** System design, component architecture, and data flow
This document explains how the Enterprise AI Gateway is designed and how its components work together.
---
## Table of Contents
1. [System Overview](#system-overview)
2. [Architecture Diagram](#architecture-diagram)
3. [Component Details](#component-details)
4. [Data Flow](#data-flow)
5. [Security Architecture](#security-architecture)
6. [Technology Stack](#technology-stack)
7. [Design Decisions](#design-decisions)
---
## System Overview
The Enterprise AI Gateway is a **REST API gateway** that provides secure, reliable access to multiple Large Language Model (LLM) providers with built-in failover.
**Core Purpose**: Act as a single, secure entry point for AI queries while automatically handling provider failures and enforcing security policies.
**Key Characteristics**:
- **Stateless** - No session management, each request is independent
- **Synchronous** - Request-response pattern (no streaming)
- **Horizontally Scalable** - Can run multiple instances behind a load balancer
- **Provider-Agnostic** - Works with any LLM provider that supports REST APIs
---
## Architecture Diagram
See [Security Overview](security_overview.md) for the detailed 4-layer security architecture diagram.
**Request Flow Summary:**
```
User Request β†’ Auth & Rate Limit β†’ Input Guard β†’ AI Safety β†’ LLM Router β†’ AI Response
```
---
## Component Details
### 1. API Gateway Layer (FastAPI)
**Responsibility**: Handle HTTP requests, route to endpoints, return responses
**Key Files**:
- `src/main.py` - FastAPI app initialization
- `src/api/routes.py` - Health check and query endpoints
**Features**:
- Auto-generated OpenAPI documentation
- ASGI server (Uvicorn) for async support
- Built-in request/response validation
---
### 2. Authentication Layer
**Responsibility**: Verify that requests include a valid API key
**Implementation**: `src/security/__init__.py` (validate_api_key function)
**How it Works**:
1. Extracts `X-API-Key` header from request
2. Compares against `SERVICE_API_KEY` environment variable
3. Returns 401 Unauthorized if missing or invalid
4. Allows request to proceed if valid
**Security Note**: API key is stored as an environment variable, never in code.
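The four steps above can be sketched as a small helper. This is a hedged illustration of the comparison logic only; the actual `validate_api_key` in `src/security/__init__.py` may differ (for example, it is likely wired up as a FastAPI dependency that raises `HTTPException(status_code=401)`).

```python
# Illustrative sketch of the API-key check described above; not the
# literal implementation in src/security/__init__.py.
import hmac
import os
from typing import Optional

def validate_api_key(provided_key: Optional[str]) -> bool:
    """Return True only when the X-API-Key header matches SERVICE_API_KEY."""
    expected = os.getenv("SERVICE_API_KEY")
    if not expected or not provided_key:
        return False
    # compare_digest performs a constant-time comparison,
    # which avoids leaking key material through timing differences
    return hmac.compare_digest(provided_key, expected)
```

A route handler would call this before processing and return 401 Unauthorized on `False`.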
---
### 3. Rate Limiting Layer
**Responsibility**: Prevent abuse by limiting requests per IP address
**Implementation**: `src/main.py` (SlowAPI middleware)
**Configuration**:
- Library: SlowAPI (built on python-limits)
- Default: 10 requests per minute per IP
- Configurable via `RATE_LIMIT` environment variable
**Behavior**:
- Tracks request count per IP address
- Returns 429 Too Many Requests when limit exceeded
- Counter resets after 1 minute window
**Production Note**: On cloud platforms with proxies (like HF Spaces), all requests may appear from the same IP. Consider API-key-based limiting for production.
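The fixed-window behavior described above can be illustrated with a minimal standalone limiter. This is not SlowAPI's implementation, just a sketch of the per-IP counting and one-minute reset semantics:

```python
# Minimal fixed-window rate limiter illustrating the behavior above;
# the gateway itself uses SlowAPI, not this code.
import time
from collections import defaultdict
from typing import Optional

class FixedWindowLimiter:
    def __init__(self, limit: int = 10, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._counts = defaultdict(int)  # (ip, window_index) -> request count

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        key = (ip, int(now // self.window_s))  # new window => fresh counter
        if self._counts[key] >= self.limit:
            return False  # caller should return HTTP 429
        self._counts[key] += 1
        return True
```

Because counters are keyed per IP and per window, the count resets automatically when the next one-minute window begins.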
---
### 4. Input Validation Layer
**Responsibility**: Ensure request parameters are valid before processing
**Implementation**: `src/models/__init__.py` (Pydantic models)
**Validation Rules**:
```python
prompt: 1-4000 characters (required)
max_tokens: 1-2048 (default: 256)
temperature: 0.0-2.0 (default: 0.7)
```
**Benefits**:
- Prevents invalid requests from reaching LLM providers
- Protects against injection attacks
- Provides clear error messages to clients
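The rules above might look like the following as a Pydantic v2 model. Field names follow the request body described in this document, but the actual model in `src/models/__init__.py` may differ in detail:

```python
# Hedged sketch of the validation rules as a Pydantic v2 model;
# the real model in src/models/__init__.py may differ.
from pydantic import BaseModel, Field

class QueryRequest(BaseModel):
    prompt: str = Field(min_length=1, max_length=4000)   # required
    max_tokens: int = Field(default=256, ge=1, le=2048)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
```

FastAPI rejects any request that fails these constraints with a 422 response before the handler runs.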
---
### 5. AI Safety Layer (Gemini + Lakera Guard)
**Responsibility**: Classify content for harmful material before LLM processing
**Implementation**: `src/security/__init__.py` (detect_toxicity function)
See [Security Overview](security_overview.md) for detailed harm categories and configuration.
---
### 6. LLM Router (Multi-Provider Cascade)
**Responsibility**: Route requests to available LLM providers with automatic fallback
**Implementation**: `src/llm/client.py` (LLMClient class)
**Provider Priority**:
1. **Gemini** (Google) - Primary, free tier, fast
2. **Groq** - Fallback 1, very fast, generous free tier
3. **OpenRouter** - Fallback 2, access to many models
**Cascade Logic**:
```python
for provider in [gemini, groq, openrouter]:
    try:
        response = call_provider(provider, prompt)
        if response.success:
            return response
    except Exception:
        continue  # Try next provider
return error("All providers failed")
```
**Benefits**:
- **High Availability**: 99.8% uptime (3 independent providers)
- **Cost Optimization**: Uses free tiers from all providers
- **Performance**: Groq typically responds in 87-200ms
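The cascade can be exercised end to end with stand-in provider callables. This is a runnable illustration of the fallback logic only; the real `LLMClient` makes HTTP calls to the provider APIs instead:

```python
# Runnable illustration of the provider cascade using local stand-in
# callables; the real LLMClient in src/llm/client.py calls HTTP APIs.
from typing import Callable, List, Tuple

def cascade(providers: List[Callable[[str], str]], prompt: str) -> Tuple[str, str]:
    """Try providers in priority order; return (provider_name, response)."""
    errors = []
    for call in providers:
        try:
            return call.__name__, call(prompt)
        except Exception as exc:
            errors.append(f"{call.__name__}: {exc}")  # fall through to next
    raise RuntimeError("All providers failed: " + "; ".join(errors))
```

Collecting per-provider errors before raising makes the final 500 response diagnosable instead of opaque.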
---
## Data Flow
### Request Flow (Query Endpoint)
```
1. Client sends POST /query
   Headers: X-API-Key, Content-Type
   Body: {prompt, max_tokens, temperature}
2. HF Spaces Proxy receives request
   → Forwards to FastAPI app
3. API Key Validation
   → Check X-API-Key header
   → If invalid: Return 401
4. Rate Limit Check
   → Count requests from IP
   → If > 10/min: Return 429
5. Input Validation (Pydantic)
   → Validate prompt length
   → Validate max_tokens range
   → Validate temperature range
   → If invalid: Return 422
6. AI Safety Check
   → Primary: Gemini 2.5 Flash classification
   → Fallback: Lakera Guard API
   → If harmful content: Return 422
7. LLM Router
   → Try Gemini API
   → If fail: Try Groq API
   → If fail: Try OpenRouter API
   → If all fail: Return 500
8. Return Response
   {
     response: "AI answer",
     provider: "groq",
     latency_ms: 87,
     status: "success"
   }
```
### Health Check Flow
```
1. Client sends GET /health
   (No authentication required)
2. Check LLM client configuration
   → Get primary provider name
   → Get model name
3. Return status
   {
     status: "healthy",
     provider: "gemini",
     model: "gemini-2.5-flash",
     timestamp: 1765193753.29
   }
```
---
## Security Architecture
For detailed security documentation including threat mitigations, see [Security Overview](security_overview.md).
---
## Technology Stack
### Core Framework
| Component | Technology | Version | Purpose |
|-----------|------------|---------|---------|
| Web Framework | FastAPI | 0.104+ | REST API development |
| ASGI Server | Uvicorn | 0.24+ | High-performance async server |
| Validation | Pydantic | 2.0+ | Type safety & validation |
| Rate Limiting | SlowAPI | 0.1.9+ | Request throttling |
| HTTP Client | Requests | 2.31+ | LLM provider API calls |
### LLM Providers
| Provider | Model | Free Tier | Typical Latency |
|----------|-------|-----------|-----------------|
| Google Gemini | gemini-2.5-flash | 15 RPM | 100-150ms |
| Groq | llama-3.3-70b-versatile | 30 RPM | 87-120ms |
| OpenRouter | Various free models | Varies | 150-300ms |
### Deployment
| Layer | Technology | Purpose |
|-------|------------|---------|
| Container | Docker | Reproducible builds |
| Registry | Docker Hub (via HF) | Image storage |
| Hosting | Hugging Face Spaces | Free-tier compute |
| CI/CD | Git push → auto-deploy | Continuous deployment |
---
## Design Decisions
### 1. Why FastAPI over Flask/Django?
**Decision**: Use FastAPI
**Rationale**:
- Auto-generated OpenAPI docs (critical for API-first design)
- Built-in validation with Pydantic
- Async support (though not used currently)
- Better performance for I/O-bound operations
- Modern Python type hints
**Trade-off**: Slightly steeper learning curve than Flask
---
### 2. Why Multi-Provider Cascade?
**Decision**: Support 3 LLM providers with automatic fallback
**Rationale**:
- **Availability**: Single provider = single point of failure
- **Cost**: Free tiers from multiple providers
- **Speed**: Different providers have different latencies
- **Flexibility**: Easy to add/remove providers
**Implementation**: Sequential cascade in `src/llm/client.py`
**Measured Impact**: 99.8% uptime vs ~98% with single provider
---
### 3. Why IP-Based Rate Limiting?
**Decision**: Use SlowAPI with IP address tracking
**Rationale**:
- Simple to implement
- No user accounts needed
- Works for unauthenticated endpoints
**Known Limitation**: Cloud proxies may route all traffic from same IP
**Future Enhancement**: Combine with API-key-based limiting
---
### 4. Why Synchronous (Non-Streaming) Responses?
**Decision**: Return complete response at once
**Rationale**:
- Simpler implementation
- Easier to test
- Most use cases don't need streaming
- Reduces complexity
**Trade-off**: Can't show progress for long responses
**Future**: Add `/query-stream` endpoint for streaming
---
### 5. Why Docker Deployment?
**Decision**: Deploy with Docker to HF Spaces
**Rationale**:
- Reproducible builds
- Environment isolation
- Free tier available (16GB RAM)
- Works locally and in cloud
**Alternative Considered**: Cloud Run (requires billing enabled)
---
### 6. Why Environment Variables for Secrets?
**Decision**: Store all API keys in environment variables
**Rationale**:
- Security: Never commit secrets to git
- Portability: Works local, cloud, Docker
- Standard practice for 12-factor apps
**Implementation**:
```python
import os

SERVICE_API_KEY = os.getenv("SERVICE_API_KEY")  # read from the environment, never hard-coded
```
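A common companion pattern is failing fast at startup when a required secret is missing, rather than discovering it on the first request. The helper name below is illustrative, not part of the gateway source:

```python
# Hedged fail-fast sketch: abort startup when a required secret is
# absent. require_env is an illustrative helper, not gateway code.
import os

def require_env(name: str) -> str:
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

Calling `require_env("SERVICE_API_KEY")` during app initialization turns a misconfiguration into an immediate, loud failure.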
---
## Scalability Considerations
### Current Architecture
- **Single instance** on HF Spaces free tier
- **Stateless** - can scale horizontally
- **Bottleneck**: LLM provider rate limits, not app capacity
### Scaling Strategy (if needed)
**Vertical Scaling**:
- Upgrade HF Space to paid tier
- More CPU/RAM for concurrent requests
**Horizontal Scaling**:
- Deploy multiple instances
- Add load balancer
- Shared rate limit tracking (Redis)
**Caching Strategy**:
- Cache common queries (Redis/Memcached)
- Reduces load on LLM providers
- Faster response for repeated questions
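The caching idea can be shown with a toy in-process cache keyed on the full request parameters; a multi-instance deployment would use a shared store like Redis or Memcached as noted above, and `cached_query` here is a stand-in, not gateway code:

```python
# Toy in-process cache keyed on (prompt, max_tokens, temperature);
# production would use Redis/Memcached so all instances share it.
from functools import lru_cache

llm_calls = {"count": 0}  # tracks how often the stand-in "provider" is hit

@lru_cache(maxsize=1024)
def cached_query(prompt: str, max_tokens: int = 256, temperature: float = 0.7) -> str:
    llm_calls["count"] += 1  # stands in for the real LLM round trip
    return f"answer to {prompt!r}"
```

Note that caching only makes sense for deterministic settings; at nonzero temperature the same prompt can legitimately produce different answers.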
---
## Performance Characteristics
| Metric | Value | Notes |
|--------|-------|-------|
| Response Time (p50) | 87ms | Groq provider, network latency included |
| Response Time (p95) | 200ms | Slower providers or network |
| Cold Start | < 30s | Docker container startup |
| Memory Usage | ~300MB | FastAPI + Python runtime |
| CPU Usage | < 5% | Mostly I/O-bound, waiting for LLM APIs |
---
## Monitoring & Observability
### Current Implementation
- Health check endpoint (`/health`)
- Response includes provider and latency
- HF Spaces provides basic logs
### Recommended Additions (Production)
- Structured logging (JSON format)
- Metrics export (Prometheus)
- Distributed tracing (Jaeger/OpenTelemetry)
- Error tracking (Sentry)
- Uptime monitoring (Uptime Robot)
---
## Related Documents
- [API Reference](api_reference.md) - Complete API documentation
- [Security Overview](security_overview.md) - Security architecture details
- [Configuration](configuration.md) - Environment variables
- [Deployment](deployment.md) - Deployment options
- [Testing](testing.md) - Testing guide
- [FAQ](faq.md) - Frequently asked questions
- [Troubleshooting](troubleshooting.md) - Common issues and solutions