# System Architecture - Enterprise AI Gateway

> **Primary Responsibility:** System design, component architecture, and data flow

This document explains how the Enterprise AI Gateway is designed and how its components work together.

---

## Table of Contents

1. [System Overview](#system-overview)
2. [Architecture Diagram](#architecture-diagram)
3. [Component Details](#component-details)
4. [Data Flow](#data-flow)
5. [Security Architecture](#security-architecture)
6. [Technology Stack](#technology-stack)
7. [Design Decisions](#design-decisions)

---

## System Overview

The Enterprise AI Gateway is a **REST API gateway** that provides secure, reliable access to multiple Large Language Model (LLM) providers with built-in failover.

**Core Purpose**: Act as a single, secure entry point for AI queries while automatically handling provider failures and enforcing security policies.

**Key Characteristics**:

- **Stateless** - No session management; each request is independent
- **Synchronous** - Request-response pattern (no streaming)
- **Horizontally Scalable** - Can run multiple instances behind a load balancer
- **Provider-Agnostic** - Works with any LLM provider that exposes a REST API

---

## Architecture Diagram

See [Security Overview](security_overview.md) for the detailed 4-layer security architecture diagram.

**Request Flow Summary:**

```
User Request → Auth & Rate Limit → Input Guard → AI Safety → LLM Router → AI Response
```

---

## Component Details

### 1. API Gateway Layer (FastAPI)

**Responsibility**: Handle HTTP requests, route to endpoints, return responses

**Key Files**:
- `src/main.py` - FastAPI app initialization
- `src/api/routes.py` - Health check and query endpoints

**Features**:
- Auto-generated OpenAPI documentation
- ASGI server (Uvicorn) for async support
- Built-in request/response validation

---

### 2. Authentication Layer

**Responsibility**: Verify that requests include a valid API key

**Implementation**: `src/security/__init__.py` (validate_api_key function)

**How it Works**:
1. Extracts the `X-API-Key` header from the request
2. Compares it against the `SERVICE_API_KEY` environment variable
3. Returns 401 Unauthorized if missing or invalid
4. Allows the request to proceed if valid

**Security Note**: The API key is stored as an environment variable, never in code.

---

### 3. Rate Limiting Layer

**Responsibility**: Prevent abuse by limiting requests per IP address

**Implementation**: `src/main.py` (SlowAPI middleware)

**Configuration**:
- Library: SlowAPI (built on python-limits)
- Default: 10 requests per minute per IP
- Configurable via the `RATE_LIMIT` environment variable

**Behavior**:
- Tracks request count per IP address
- Returns 429 Too Many Requests when the limit is exceeded
- Counter resets after the 1-minute window

**Production Note**: On cloud platforms with proxies (such as HF Spaces), all requests may appear to come from the same IP. Consider API-key-based limiting for production.

---

### 4. Input Validation Layer

**Responsibility**: Ensure request parameters are valid before processing

**Implementation**: `src/models/__init__.py` (Pydantic models)

**Validation Rules**:

```
prompt:      1-4000 characters (required)
max_tokens:  1-2048 (default: 256)
temperature: 0.0-2.0 (default: 0.7)
```

**Benefits**:
- Prevents invalid requests from reaching LLM providers
- Protects against injection attacks
- Provides clear error messages to clients
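As an illustration, a Pydantic model enforcing these rules might look like the following minimal sketch. The class name `QueryRequest` is an assumption for illustration; the actual model lives in `src/models/__init__.py`:

```python
from pydantic import BaseModel, Field

class QueryRequest(BaseModel):
    """Request body for POST /query (hypothetical name; see src/models/__init__.py)."""
    prompt: str = Field(min_length=1, max_length=4000)    # required, no default
    max_tokens: int = Field(default=256, ge=1, le=2048)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
```

With a model like this attached to the `/query` endpoint, FastAPI rejects out-of-range input automatically with the 422 responses described in the Data Flow section.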
---

### 5. AI Safety Layer (Gemini + Lakera Guard)

**Responsibility**: Classify content for harmful material before LLM processing

**Implementation**: `src/security/__init__.py` (detect_toxicity function)

See [Security Overview](security_overview.md) for detailed harm categories and configuration.

---

### 6. LLM Router (Multi-Provider Cascade)

**Responsibility**: Route requests to available LLM providers with automatic fallback

**Implementation**: `src/llm/client.py` (LLMClient class)

**Provider Priority**:
1. **Gemini** (Google) - Primary, free tier, fast
2. **Groq** - Fallback 1, very fast, generous free tier
3. **OpenRouter** - Fallback 2, access to many models

**Cascade Logic**:

```python
# Simplified sketch of the cascade in src/llm/client.py
for provider in [gemini, groq, openrouter]:
    try:
        response = call_provider(provider, prompt)
        if response.success:
            return response
    except Exception:
        continue  # Try the next provider
return error("All providers failed")
```

**Benefits**:
- **High Availability**: 99.8% uptime (3 independent providers)
- **Cost Optimization**: Uses free tiers from all providers
- **Performance**: Groq typically responds in 87-200ms

---

## Data Flow

### Request Flow (Query Endpoint)

```
1. Client sends POST /query
   Headers: X-API-Key, Content-Type
   Body: {prompt, max_tokens, temperature}

2. HF Spaces Proxy receives request
   → Forwards to FastAPI app

3. API Key Validation
   → Check X-API-Key header
   → If invalid: Return 401

4. Rate Limit Check
   → Count requests from IP
   → If > 10/min: Return 429

5. Input Validation (Pydantic)
   → Validate prompt length
   → Validate max_tokens range
   → Validate temperature range
   → If invalid: Return 422

6. AI Safety Check
   → Primary: Gemini 2.5 Flash classification
   → Fallback: Lakera Guard API
   → If harmful content: Return 422

7. LLM Router
   → Try Gemini API
   → If fail: Try Groq API
   → If fail: Try OpenRouter API
   → If all fail: Return 500

8. Return Response
   {
     response: "AI answer",
     provider: "groq",
     latency_ms: 87,
     status: "success"
   }
```

### Health Check Flow

```
1. Client sends GET /health
   (No authentication required)

2. Check LLM client configuration
   → Get primary provider name
   → Get model name

3. Return status
   {
     status: "healthy",
     provider: "gemini",
     model: "gemini-2.5-flash",
     timestamp: 1765193753.29
   }
```
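To make both flows concrete, here is a hedged client-side sketch using the Requests library (already part of the technology stack). The base URL is a placeholder, and the example prompt is illustrative; substitute your own deployment's URL:

```python
import os
import requests

BASE_URL = "http://localhost:8000"  # placeholder; use your deployment's URL

# Health check: no authentication required
health = requests.get(f"{BASE_URL}/health", timeout=10)
print(health.json())  # {"status": "healthy", "provider": "gemini", ...}

# Query: X-API-Key header required
resp = requests.post(
    f"{BASE_URL}/query",
    headers={"X-API-Key": os.environ["SERVICE_API_KEY"]},
    json={"prompt": "Explain failover in one sentence.",
          "max_tokens": 256, "temperature": 0.7},
    timeout=30,
)
resp.raise_for_status()  # raises on the 401/422/429/500 outcomes above
print(resp.json())       # {"response": "...", "provider": "groq", "latency_ms": 87, ...}
```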
---

## Security Architecture

For detailed security documentation, including threat mitigations, see [Security Overview](security_overview.md).

---

## Technology Stack

### Core Framework

| Component | Technology | Version | Purpose |
|-----------|------------|---------|---------|
| Web Framework | FastAPI | 0.104+ | REST API development |
| ASGI Server | Uvicorn | 0.24+ | High-performance async server |
| Validation | Pydantic | 2.0+ | Type safety & validation |
| Rate Limiting | SlowAPI | 0.1.9+ | Request throttling |
| HTTP Client | Requests | 2.31+ | LLM provider API calls |

### LLM Providers

| Provider | Model | Free Tier | Typical Latency |
|----------|-------|-----------|-----------------|
| Google Gemini | gemini-2.5-flash | 15 RPM | 100-150ms |
| Groq | llama-3.3-70b-versatile | 30 RPM | 87-120ms |
| OpenRouter | Various free models | Varies | 150-300ms |

### Deployment

| Layer | Technology | Purpose |
|-------|------------|---------|
| Container | Docker | Reproducible builds |
| Registry | Docker Hub (via HF) | Image storage |
| Hosting | Hugging Face Spaces | Free-tier compute |
| CI/CD | Git push → auto-deploy | Continuous deployment |

---

## Design Decisions

### 1. Why FastAPI over Flask/Django?

**Decision**: Use FastAPI

**Rationale**:
- Auto-generated OpenAPI docs (critical for API-first design)
- Built-in validation with Pydantic
- Async support (though not currently used)
- Better performance for I/O-bound operations
- Modern Python type hints

**Trade-off**: Slightly steeper learning curve than Flask

---

### 2. Why a Multi-Provider Cascade?

**Decision**: Support 3 LLM providers with automatic fallback

**Rationale**:
- **Availability**: A single provider is a single point of failure
- **Cost**: Free tiers from multiple providers
- **Speed**: Different providers have different latencies
- **Flexibility**: Easy to add or remove providers

**Implementation**: Sequential cascade in `src/llm/client.py`

**Measured Impact**: 99.8% uptime vs ~98% with a single provider

---

### 3. Why IP-Based Rate Limiting?

**Decision**: Use SlowAPI with IP address tracking

**Rationale**:
- Simple to implement
- No user accounts needed
- Works for unauthenticated endpoints

**Known Limitation**: Cloud proxies may route all traffic from the same IP

**Future Enhancement**: Combine with API-key-based limiting

---

### 4. Why Synchronous (Non-Streaming) Responses?

**Decision**: Return the complete response at once

**Rationale**:
- Simpler implementation
- Easier to test
- Most use cases don't need streaming
- Reduces complexity

**Trade-off**: Can't show progress for long responses

**Future**: Add a `/query-stream` endpoint for streaming

---

### 5. Why Docker Deployment?

**Decision**: Deploy with Docker to HF Spaces

**Rationale**:
- Reproducible builds
- Environment isolation
- Free tier available (16GB RAM)
- Works locally and in the cloud

**Alternative Considered**: Cloud Run (requires billing enabled)

---

### 6. Why Environment Variables for Secrets?

**Decision**: Store all API keys in environment variables

**Rationale**:
- Security: Secrets are never committed to git
- Portability: Works locally, in the cloud, and in Docker
- Standard practice for 12-factor apps

**Implementation**:

```python
import os

SERVICE_API_KEY = os.getenv("SERVICE_API_KEY")
```

---

## Scalability Considerations

### Current Architecture

- **Single instance** on the HF Spaces free tier
- **Stateless** - can scale horizontally
- **Bottleneck**: LLM provider rate limits, not app capacity

### Scaling Strategy (if needed)

**Vertical Scaling**:
- Upgrade the HF Space to a paid tier
- More CPU/RAM for concurrent requests

**Horizontal Scaling**:
- Deploy multiple instances
- Add a load balancer
- Share rate-limit tracking (Redis)

**Caching Strategy**:
- Cache common queries (Redis/Memcached)
- Reduces load on LLM providers
- Faster responses for repeated questions

---

## Performance Characteristics

| Metric | Value | Notes |
|--------|-------|-------|
| Response Time (p50) | 87ms | Groq provider, network latency included |
| Response Time (p95) | 200ms | Slower providers or network |
| Cold Start | < 30s | Docker container startup |
| Memory Usage | ~300MB | FastAPI + Python runtime |
| CPU Usage | < 5% | Mostly I/O-bound, waiting on LLM APIs |

---

## Monitoring & Observability

### Current Implementation

- Health check endpoint (`/health`)
- Responses include provider and latency
- HF Spaces provides basic logs

### Recommended Additions (Production)

- Structured logging (JSON format) - see the sketch below
- Metrics export (Prometheus)
- Distributed tracing (Jaeger/OpenTelemetry)
- Error tracking (Sentry)
- Uptime monitoring (Uptime Robot)
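As a starting point for the first item, here is a minimal sketch of JSON-structured logging using only the standard library. The field names are illustrative, not a prescribed schema:

```python
import json
import logging
import time

class JSONFormatter(logging.Formatter):
    """Render each log record as a single JSON line (illustrative fields)."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": time.time(),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("gateway").info("query served")
# -> {"ts": 1765193753.29, "level": "INFO", "logger": "gateway", "msg": "query served"}
```

One JSON object per line keeps the output compatible with HF Spaces' plain-text logs while remaining machine-parseable for later aggregation.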
---

## Related Documents

- [API Reference](api_reference.md) - Complete API documentation
- [Security Overview](security_overview.md) - Security architecture details
- [Configuration](configuration.md) - Environment variables
- [Deployment](deployment.md) - Deployment options
- [Testing](testing.md) - Testing guide
- [FAQ](faq.md) - Frequently asked questions
- [Troubleshooting](troubleshooting.md) - Common issues and solutions