	# System Architecture - Enterprise AI Gateway

	> Primary Responsibility: System design, component architecture, and data flow

	This document explains how the Enterprise AI Gateway is designed and how its components work together.

	---

	## Table of Contents

	1. [System Overview](#system-overview)
	2. [Architecture Diagram](#architecture-diagram)
	3. [Component Details](#component-details)
	4. [Data Flow](#data-flow)
	5. [Security Architecture](#security-architecture)
	6. [Technology Stack](#technology-stack)
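	7. [Design Decisions](#design-decisions)
	8. [Scalability Considerations](#scalability-considerations)
	9. [Performance Characteristics](#performance-characteristics)
	10. [Monitoring & Observability](#monitoring--observability)
	11. [Related Documents](#related-documents)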

	---

	## System Overview

	The Enterprise AI Gateway is a REST API gateway that provides secure, reliable access to multiple Large Language Model (LLM) providers with built-in failover.

	Core Purpose: Act as a single, secure entry point for AI queries while automatically handling provider failures and enforcing security policies.

	Key Characteristics:
	- Stateless - No session management, each request is independent
	- Synchronous - Request-response pattern (no streaming)
	- Horizontally Scalable - Can run multiple instances behind a load balancer
	- Provider-Agnostic - Works with any LLM provider that supports REST APIs

	---

	## Architecture Diagram

	See [Security Overview](security_overview.md) for the detailed 4-layer security architecture diagram.

	Request Flow Summary:
	```
	User Request → Auth & Rate Limit → Input Guard → AI Safety → LLM Router → AI Response
	```

	---

	## Component Details

	### 1. API Gateway Layer (FastAPI)

	Responsibility: Handle HTTP requests, route to endpoints, return responses

	Key Files:
	- `src/main.py` - FastAPI app initialization
	- `src/api/routes.py` - Health check and query endpoints

	Features:
	- Auto-generated OpenAPI documentation
	- ASGI server (Uvicorn) for async support
	- Built-in request/response validation

	---

	### 2. Authentication Layer

	Responsibility: Verify that requests include a valid API key

	Implementation: `src/security/__init__.py` (validate_api_key function)

	How it Works:
	1. Extracts `X-API-Key` header from request
	2. Compares against `SERVICE_API_KEY` environment variable
	3. Returns 401 Unauthorized if missing or invalid
	4. Allows request to proceed if valid

	Security Note: API key is stored as an environment variable, never in code.
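
The four steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in `src/security/__init__.py`: the signature, the boolean return value, and the use of `hmac.compare_digest` for a constant-time comparison are assumptions. The caller would return 401 when the function returns `False`.

```python
import hmac
import os
from typing import Mapping, Optional


def validate_api_key(headers: Mapping[str, str],
                     expected_key: Optional[str] = None) -> bool:
    """Return True when the request carries the expected API key."""
    # Fall back to the environment variable named in this document.
    if expected_key is None:
        expected_key = os.environ.get("SERVICE_API_KEY", "")
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get("x-api-key", "")
    # Reject empty values so an unset SERVICE_API_KEY never matches anything.
    if not supplied or not expected_key:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

Rejecting an empty expected key is a deliberate choice in this sketch: a misconfigured deployment then fails closed instead of accepting every request.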

	---

	### 3. Rate Limiting Layer

	Responsibility: Prevent abuse by limiting requests per IP address

	Implementation: `src/main.py` (SlowAPI middleware)

	Configuration:
	- Library: SlowAPI (built on python-limits)
	- Default: 10 requests per minute per IP
	- Configurable via `RATE_LIMIT` environment variable

	Behavior:
	- Tracks request count per IP address
	- Returns 429 Too Many Requests when limit exceeded
	- Counter resets after 1 minute window

	Production Note: On cloud platforms that sit behind a proxy (such as HF Spaces), all requests may appear to come from the same IP address, so every client would share one limit. Consider API-key-based limiting for production.
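
The behavior described above (a per-client counter that resets after a one-minute window) can be illustrated with a small fixed-window limiter. This sketch is not SlowAPI's implementation; the class name `FixedWindowLimiter` and its parameters are hypothetical. The key is a plain string, so the same logic works whether it is an IP address or an API key.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most `limit` requests per client key in each time window."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        # client key -> (window start time, requests counted in that window)
        self._windows: Dict[str, Tuple[float, int]] = {}

    def allow(self, client_key: str) -> bool:
        now = self.clock()
        start, count = self._windows.get(client_key, (now, 0))
        # Start a fresh window once the previous one has elapsed.
        if now - start >= self.window_seconds:
            start, count = now, 0
        # Over the limit: the caller would respond with 429 Too Many Requests.
        if count >= self.limit:
            return False
        self._windows[client_key] = (start, count + 1)
        return True
```

Because the counters live in process memory, running several instances would need a shared store for the limit to hold across them.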

	---

	### 4. Input Validation Layer

	Responsibility: Ensure request parameters are valid before processing

	Implementation: `src/models/__init__.py` (Pydantic models)

	Validation Rules:
	```python
	prompt: 1-4000 characters (required)
	max_tokens: 1-2048 (default: 256)
	temperature: 0.0-2.0 (default: 0.7)
	```

	Benefits:
	- Prevents invalid requests from reaching LLM providers
	- Limits the attack surface by bounding input size and parameter ranges
	- Provides clear error messages to clients
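
The validation rules above map directly onto Pydantic field constraints. The following is a sketch assuming Pydantic 2.x; the class name `QueryRequest` and the helper `failing_fields` are hypothetical and may differ from what `src/models/__init__.py` defines.

```python
from typing import Any, Dict, List

from pydantic import BaseModel, Field, ValidationError


class QueryRequest(BaseModel):
    # Required; no default is given, so omitting it is a validation error.
    prompt: str = Field(min_length=1, max_length=4000)
    max_tokens: int = Field(default=256, ge=1, le=2048)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)


def failing_fields(payload: Dict[str, Any]) -> List[str]:
    """Return the names of fields that fail validation (empty if valid)."""
    try:
        QueryRequest(**payload)
    except ValidationError as exc:
        return sorted(str(error["loc"][0]) for error in exc.errors())
    return []
```

When a model like this is used as a FastAPI request body, a failed validation is what produces the 422 response mentioned in the data flow.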

	---

	### 5. AI Safety Layer (Gemini + Lakera Guard)

	Responsibility: Classify content for harmful material before LLM processing

	Implementation: `src/security/__init__.py` (detect_toxicity function)

	See [Security Overview](security_overview.md) for detailed harm categories and configuration.

	---

	### 6. LLM Router (Multi-Provider Cascade)

	Responsibility: Route requests to available LLM providers with automatic fallback

	Implementation: `src/llm/client.py` (LLMClient class)

	Provider Priority:
	1. Gemini (Google) - Primary, free tier, fast
	2. Groq - Fallback 1, very fast, generous free tier
	3. OpenRouter - Fallback 2, access to many models

	Cascade Logic:
	```python
	# Sketch: gemini, groq, openrouter, call_provider, and error are placeholders
	def route(prompt):
	    for provider in [gemini, groq, openrouter]:
	        try:
	            response = call_provider(provider, prompt)
	            if response.success:
	                return response
	        except Exception:
	            continue  # Try next provider

	    return error("All providers failed")
	```

	Benefits:
	- High Availability: 99.8% uptime (3 independent providers)
	- Cost Optimization: Uses free tiers from all providers
	- Performance: Groq typically responds in 87-200ms
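
A self-contained version of the cascade can be run with stub providers. This is an illustrative sketch, not the `LLMClient` class: `query_with_fallback` and the two stub functions are hypothetical, and the returned fields mirror the response shape shown in the data flow section.

```python
import time
from typing import Callable, Dict, List, Tuple

# A provider is a (name, callable) pair; the callable takes a prompt and
# returns text, or raises on failure. Both stand in for real API clients.
Provider = Tuple[str, Callable[[str], str]]


def query_with_fallback(prompt: str, providers: List[Provider]) -> Dict[str, object]:
    """Try each provider in priority order and return the first success."""
    errors: Dict[str, str] = {}
    for name, call in providers:
        started = time.perf_counter()
        try:
            text = call(prompt)
        except Exception as exc:  # any failure moves on to the next provider
            errors[name] = str(exc)
            continue
        latency_ms = round((time.perf_counter() - started) * 1000)
        return {"response": text, "provider": name,
                "latency_ms": latency_ms, "status": "success"}
    # Every provider raised; the gateway would answer with a 500 here.
    return {"status": "error", "detail": "All providers failed", "errors": errors}


def simulated_outage(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def echo_provider(prompt: str) -> str:
    return "echo: " + prompt


result = query_with_fallback(
    "hello", [("gemini", simulated_outage), ("groq", echo_provider)]
)
print(result["provider"], result["response"])  # groq echo: hello
```

Collecting the per-provider error messages is optional, but it makes an "all providers failed" response much easier to diagnose.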

	---

	## Data Flow

	### Request Flow (Query Endpoint)

	```
	1. Client sends POST /query
	   Headers: X-API-Key, Content-Type
	   Body: {prompt, max_tokens, temperature}

	2. HF Spaces Proxy receives request
	   → Forwards to FastAPI app

	3. API Key Validation
	   → Check X-API-Key header
	   → If invalid: Return 401

	4. Rate Limit Check
	   → Count requests from IP
	   → If > 10/min: Return 429

	5. Input Validation (Pydantic)
	   → Validate prompt length
	   → Validate max_tokens range
	   → Validate temperature range
	   → If invalid: Return 422

	6. AI Safety Check
	   → Primary: Gemini 2.5 Flash classification
	   → Fallback: Lakera Guard API
	   → If harmful content: Return 422

	7. LLM Router
	   → Try Gemini API
	   → If fail: Try Groq API
	   → If fail: Try OpenRouter API
	   → If all fail: Return 500

	8. Return Response
	   {
	     response: "AI answer",
	     provider: "groq",
	     latency_ms: 87,
	     status: "success"
	   }
	```
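
From the client side, step 1 of the request flow could be built as shown below. This sketch uses only the standard library; the helper name `build_query_request`, the base URL `https://gateway.example`, and the `application/json` content type are assumptions for illustration.

```python
import json
import urllib.request


def build_query_request(base_url: str, api_key: str, prompt: str,
                        max_tokens: int = 256,
                        temperature: float = 0.7) -> urllib.request.Request:
    """Build (but do not send) a POST /query request for the gateway."""
    body = json.dumps({
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/query",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host; replace with the real deployment URL.
request = build_query_request("https://gateway.example", "my-key", "What is RAG?")
# To send it: urllib.request.urlopen(request, timeout=30)
```

A client should be prepared for the 401, 429, 422, and 500 responses listed in the flow in addition to the success payload.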

	### Health Check Flow

	```
	1. Client sends GET /health
	   (No authentication required)

	2. Check LLM client configuration
	   → Get primary provider name
	   → Get model name

	3. Return status
	   {
	     status: "healthy",
	     provider: "gemini",
	     model: "gemini-2.5-flash",
	     timestamp: 1765193753.29
	   }
	```

	---

	## Security Architecture

	For detailed security documentation including threat mitigations, see [Security Overview](security_overview.md).

	---

	## Technology Stack

	### Core Framework

	| Component | Technology | Version | Purpose |
	|-----------|------------|---------|---------|
	| Web Framework | FastAPI | 0.104+ | REST API development |
	| ASGI Server | Uvicorn | 0.24+ | High-performance async server |
	| Validation | Pydantic | 2.0+ | Type safety & validation |
	| Rate Limiting | SlowAPI | 0.1.9+ | Request throttling |
	| HTTP Client | Requests | 2.31+ | LLM provider API calls |

	### LLM Providers

	| Provider | Model | Free Tier | Typical Latency |
	|----------|-------|-----------|-----------------|
	| Google Gemini | gemini-2.5-flash | 15 RPM | 100-150ms |
	| Groq | llama-3.3-70b-versatile | 30 RPM | 87-120ms |
	| OpenRouter | Various free models | Varies | 150-300ms |

	### Deployment

	| Layer | Technology | Purpose |
	|-------|------------|---------|
	| Container | Docker | Reproducible builds |
	| Registry | Docker Hub (via HF) | Image storage |
	| Hosting | Hugging Face Spaces | Free-tier compute |
	| CI/CD | Git push → auto-deploy | Continuous deployment |

	---

	## Design Decisions

	### 1. Why FastAPI over Flask/Django?

	Decision: Use FastAPI

	Rationale:
	- Auto-generated OpenAPI docs (critical for API-first design)
	- Built-in validation with Pydantic
	- Async support (though not used currently)
	- Better performance for I/O-bound operations
	- Modern Python type hints

	Trade-off: Slightly steeper learning curve than Flask

	---

	### 2. Why Multi-Provider Cascade?

	Decision: Support 3 LLM providers with automatic fallback

	Rationale:
	- Availability: Single provider = single point of failure
	- Cost: Free tiers from multiple providers
	- Speed: Different providers have different latencies
	- Flexibility: Easy to add/remove providers

	Implementation: Sequential cascade in `src/llm/client.py`

	Measured Impact: 99.8% uptime vs ~98% with single provider
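
As a rough sanity check on the availability argument, the theoretical uptime of a cascade can be computed under the assumption that provider failures are fully independent. The function name `combined_availability` is hypothetical, and the 98% input is the approximate single-provider figure quoted above. Real failures are rarely independent (the gateway host, proxy, and network are shared), so a measured figure below this upper bound is to be expected.

```python
from math import prod
from typing import Iterable


def combined_availability(availabilities: Iterable[float]) -> float:
    """Probability that at least one provider is up, assuming independence."""
    # The cascade fails only when every provider is down at the same time.
    return 1.0 - prod(1.0 - a for a in availabilities)


# Three providers at ~98% each: 1 - 0.02**3, roughly 0.999992.
upper_bound = combined_availability([0.98, 0.98, 0.98])
```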

	---

	### 3. Why IP-Based Rate Limiting?

	Decision: Use SlowAPI with IP address tracking

	Rationale:
	- Simple to implement
	- No user accounts needed
	- Works for unauthenticated endpoints

	Known Limitation: Behind a cloud proxy, all traffic may appear to come from the same IP address

	Future Enhancement: Combine with API-key-based limiting

	---

	### 4. Why Synchronous (Non-Streaming) Responses?

	Decision: Return complete response at once

	Rationale:
	- Simpler implementation
	- Easier to test
	- Most use cases don't need streaming
	- Reduces complexity

	Trade-off: Can't show progress for long responses

	Future: Add `/query-stream` endpoint for streaming

	---

	### 5. Why Docker Deployment?

	Decision: Deploy with Docker to HF Spaces

	Rationale:
	- Reproducible builds
	- Environment isolation
	- HF Spaces free tier available (16GB RAM)
	- Works locally and in cloud

	Alternative Considered: Cloud Run (requires billing enabled)

	---

	### 6. Why Environment Variables for Secrets?

	Decision: Store all API keys in environment variables

	Rationale:
	- Security: Never commit secrets to git
	- Portability: Works local, cloud, Docker
	- Standard practice for 12-factor apps

	Implementation:
	```python
	import os

	SERVICE_API_KEY = os.getenv("SERVICE_API_KEY")  # None if the variable is unset
	```
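
Since `os.getenv` returns `None` for a missing variable, one option is to load all settings once at startup and fail fast when a required secret is absent. The sketch below is an assumption-laden example: the `Settings` class, the `load_settings` helper, and the `"10/minute"` format for `RATE_LIMIT` are hypothetical and not taken from the project.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    service_api_key: str
    rate_limit: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from the environment, failing fast on missing secrets."""
    api_key = env.get("SERVICE_API_KEY", "").strip()
    if not api_key:
        # Refuse to start rather than run with authentication effectively off.
        raise RuntimeError("SERVICE_API_KEY is not set")
    # Assumed format; the document only states a default of 10 requests/minute.
    rate_limit = env.get("RATE_LIMIT", "10/minute")
    return Settings(service_api_key=api_key, rate_limit=rate_limit)
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.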

	---

	## Scalability Considerations

	### Current Architecture
	- Single instance on HF Spaces free tier
	- Stateless - can scale horizontally
	- Bottleneck: LLM provider rate limits, not app capacity

	### Scaling Strategy (if needed)

	Vertical Scaling:
	- Upgrade HF Space to paid tier
	- More CPU/RAM for concurrent requests

	Horizontal Scaling:
	- Deploy multiple instances
	- Add load balancer
	- Shared rate limit tracking (Redis)

	Caching Strategy:
	- Cache common queries (Redis/Memcached)
	- Reduces load on LLM providers
	- Faster response for repeated questions
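
The caching idea can be prototyped in memory before introducing Redis or Memcached. In this sketch, `cache_key` and `TTLCache` are hypothetical names, and the five-minute default TTL is an arbitrary assumption. The key covers the prompt and both sampling parameters so that requests with different settings do not share an entry.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Optional, Tuple


def cache_key(prompt: str, max_tokens: int, temperature: float) -> str:
    """Derive a stable key from every parameter that affects the answer."""
    payload = json.dumps(
        {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for Redis/Memcached with per-entry expiry."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expiry, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # drop stale entries lazily on read
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._entries[key] = (self.clock() + self.ttl_seconds, value)
```

Caching trades freshness and response variety for speed: with a non-zero temperature, a cached answer replaces what would otherwise be a newly sampled one.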

	---

	## Performance Characteristics

	| Metric | Value | Notes |
	|--------|-------|-------|
	| Response Time (p50) | 87ms | Groq provider, network latency included |
	| Response Time (p95) | 200ms | Slower providers or network |
	| Cold Start | < 30s | Docker container startup |
	| Memory Usage | ~300MB | FastAPI + Python runtime |
	| CPU Usage | < 5% | Mostly I/O-bound, waiting for LLM APIs |

	---

	## Monitoring & Observability

	### Current Implementation
	- Health check endpoint (`/health`)
	- Response includes provider and latency
	- HF Spaces provides basic logs

	### Recommended Additions (Production)
	- Structured logging (JSON format)
	- Metrics export (Prometheus)
	- Distributed tracing (Jaeger/OpenTelemetry)
	- Error tracking (Sentry)
	- Uptime monitoring (Uptime Robot)
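
Of the additions above, structured logging is the simplest to adopt because the standard library is enough. The sketch below shows one possible JSON formatter; the class name `JsonFormatter` and the extra field names (`provider`, `latency_ms`, `status_code`) are assumptions chosen to match the response fields described in this document.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in ("provider", "latency_ms", "status_code"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("gateway")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("query served", extra={"provider": "groq", "latency_ms": 87})
```

One JSON object per line keeps the output easy to parse for log aggregation tools.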

	---

	## Related Documents

	- [API Reference](api_reference.md) - Complete API documentation
	- [Security Overview](security_overview.md) - Security architecture details
	- [Configuration](configuration.md) - Environment variables
	- [Deployment](deployment.md) - Deployment options
	- [Testing](testing.md) - Testing guide
	- [FAQ](faq.md) - Frequently asked questions
	- [Troubleshooting](troubleshooting.md) - Common issues and solutions