Troubleshooting Guide

Primary Responsibility: Diagnosing and resolving all common issue types

This guide helps diagnose and resolve common issues with the Enterprise AI Gateway.

Table of Contents

  1. Health Check Issues
  2. Authentication Problems
  3. API Request Errors
  4. LLM Provider Issues
  5. Performance Problems
  6. Deployment Issues
  7. AI Safety Issues
  8. Security Concerns

Health Check Issues

Service Unreachable

Symptoms:

  • /health endpoint returns 502, 503, or connection timeout
  • Application doesn't start

Possible Causes:

  • Application not running
  • Port binding issues
  • Firewall/network restrictions
  • Insufficient system resources

Solutions:

  1. Check if the application process is running:
    ps aux | grep uvicorn
    
  2. Verify port binding:
    ss -tlnp | grep :8000    # or: netstat -tlnp | grep :8000 on older systems
    
  3. Check application logs for startup errors
  4. Ensure firewall allows traffic on the application port
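The first two checks above can be combined into a small diagnostic script. The host, port, and process name (uvicorn) are assumptions based on the examples in this guide; adjust them to your deployment:

```shell
#!/usr/bin/env bash
# Quick reachability check for the gateway (requires bash for /dev/tcp).
HOST=${HOST:-127.0.0.1}
PORT=${PORT:-8000}

# Returns 0 if a TCP listener accepts connections on host $1, port $2.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

if pgrep -f uvicorn >/dev/null; then
  echo "uvicorn process: running"
else
  echo "uvicorn process: NOT running"
fi

if port_open "$HOST" "$PORT"; then
  echo "port $PORT: accepting connections"
else
  echo "port $PORT: closed or filtered"
fi
```

If the process is running but the port is closed, the application is likely bound to a different port or interface than expected.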

Health Status Unhealthy

Symptoms:

  • /health returns status "unhealthy"
  • Provider field is null or missing

Possible Causes:

  • Missing or invalid LLM provider API keys
  • Misconfigured environment variables
  • Provider service unavailable

Solutions:

  1. Verify environment variables are set correctly:
    cat .env
    
  2. Check that at least one LLM provider API key is configured
  3. Test API keys with provider's API directly
  4. Check provider status pages for service outages
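To inspect the health payload from the command line, a small helper can pull out the status field. The response shape (a JSON object with "status" and "provider" keys) is inferred from the symptoms above; use jq instead if it is installed:

```shell
# Crude extraction of the "status" field from the /health JSON response.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' <<<"$1"
}

resp=$(curl -s --max-time 5 http://localhost:8000/health || true)
echo "gateway status: $(health_status "$resp")"
```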

Authentication Problems

401 Unauthorized Errors

Symptoms:

  • All API requests except /health return 401
  • Error message: "Invalid or missing API key"

Possible Causes:

  • Missing X-API-Key header
  • Invalid API key value
  • SERVICE_API_KEY environment variable not set
  • API key mismatch between client and server

Solutions:

  1. Verify the X-API-Key header is included in requests:
    curl -H "X-API-Key: your_api_key" http://localhost:8000/query
    
  2. Check that SERVICE_API_KEY is set in environment:
    echo $SERVICE_API_KEY
    
  3. Ensure API key values match between client and server
  4. Regenerate API key if it may have been compromised

API Key Rejected Despite Being Correct

Symptoms:

  • Valid API key is rejected
  • Works intermittently

Possible Causes:

  • Timing-attack prevention causing delays
  • Character encoding issues
  • Whitespace in API key

Solutions:

  1. Strip whitespace from API key:
    # Remove any trailing/leading whitespace
    SERVICE_API_KEY=$(echo "$SERVICE_API_KEY" | tr -d ' \t\n\r')
    
  2. Ensure consistent character encoding (UTF-8)
  3. Regenerate API key with alphanumeric characters only
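Step 3 can be done with openssl, which is one common way to generate a key that is guaranteed alphanumeric and whitespace-free:

```shell
# Generate a 64-character hex key (alphanumeric only, no whitespace).
NEW_KEY=$(openssl rand -hex 32)
CLEAN_KEY=$(printf '%s' "$NEW_KEY" | tr -d ' \t\n\r')

if [ "$NEW_KEY" = "$CLEAN_KEY" ]; then
  echo "key is clean: ${#NEW_KEY} characters"
fi
```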

API Request Errors

422 Validation Errors

Symptoms:

  • Requests return 422 with validation error messages
  • Specific field errors in response

Possible Causes:

  • Prompt too short or too long
  • Invalid max_tokens value
  • Invalid temperature value
  • Prompt injection detected

Solutions:

  1. Check that the prompt length is between 1 and 4000 characters
  2. Verify max_tokens is between 1 and 2048
  3. Verify temperature is between 0.0 and 2.0
  4. Review prompt for injection patterns like "ignore previous instructions"
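These limits can be checked client-side before a request is sent, turning a 422 round trip into an immediate local error. The ranges below are the ones documented above:

```shell
# Pre-validate request parameters against the documented limits.
validate_request() {  # usage: validate_request PROMPT MAX_TOKENS TEMPERATURE
  local prompt=$1 max_tokens=$2 temp=$3
  local len=${#prompt}
  if [ "$len" -lt 1 ] || [ "$len" -gt 4000 ]; then
    echo "prompt length $len outside 1-4000"; return 1
  fi
  if [ "$max_tokens" -lt 1 ] || [ "$max_tokens" -gt 2048 ]; then
    echo "max_tokens outside 1-2048"; return 1
  fi
  if ! awk -v t="$temp" 'BEGIN { exit !(t >= 0.0 && t <= 2.0) }'; then
    echo "temperature outside 0.0-2.0"; return 1
  fi
  echo "request parameters look valid"
}

validate_request "Summarize this report" 512 0.7
```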

429 Rate Limit Exceeded

Symptoms:

  • Requests return 429 status code
  • Error message: "Rate limit exceeded"

Possible Causes:

  • Too many requests from the same IP within the time window
  • Misconfigured rate limit settings
  • Shared proxy/IP affecting multiple users

Solutions:

  1. Reduce request frequency to stay within limits
  2. Increase rate limit in configuration:
    RATE_LIMIT=20/minute
    
  3. Implement exponential backoff in client code
  4. Use different IP addresses or API keys for different clients
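Step 3 (exponential backoff) can be sketched as a retry loop around curl; the endpoint and header names follow the earlier examples in this guide:

```shell
# Retry with exponential backoff (1s, 2s, 4s, 8s, 16s) on HTTP 429.
backoff_delay() {  # backoff_delay ATTEMPT -> seconds
  echo $(( 1 << ($1 - 1) ))
}

for attempt in 1 2 3 4 5; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "X-API-Key: $SERVICE_API_KEY" http://localhost:8000/query || true)
  if [ "$code" != "429" ]; then
    break                        # success, or a non-rate-limit error
  fi
  sleep "$(backoff_delay "$attempt")"
done
echo "last status: $code"
```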

500 Internal Server Errors

Symptoms:

  • Requests return 500 with generic error messages
  • "All LLM providers failed" error

Possible Causes:

  • All configured LLM providers are unavailable
  • Network connectivity issues
  • Provider API key issues
  • Application bugs

Solutions:

  1. Check LLM provider status pages
  2. Verify all API keys are valid and have sufficient quotas
  3. Test network connectivity to provider endpoints
  4. Check application logs for specific error details
  5. Try configuring additional LLM providers

LLM Provider Issues

Provider Timeout

Symptoms:

  • Slow responses or timeouts
  • Fallback to secondary providers

Possible Causes:

  • Provider API latency
  • Network connectivity issues
  • Provider rate limits exceeded
  • Geographic distance from provider

Solutions:

  1. Check provider status dashboards
  2. Verify network connectivity:
    ping generativelanguage.googleapis.com
    
  3. Review provider rate limits and quotas
  4. Consider using providers geographically closer to your deployment
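curl's timing variables give a more detailed picture than ping, which some provider hosts do not answer. This helper breaks latency into DNS lookup, TCP connect, and total time:

```shell
# Per-phase latency breakdown for any URL.
time_url() {
  curl -s -o /dev/null --max-time 10 \
    -w 'dns=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s\n' \
    "$1"
}

time_url https://generativelanguage.googleapis.com/ || true
```

A large gap between connect and total time points at provider-side latency rather than your network.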

Provider Returns Empty Response

Symptoms:

  • Valid responses with empty content
  • Provider used but no text returned

Possible Causes:

  • Provider API response format changed
  • Content filtering blocking response
  • Invalid request parameters

Solutions:

  1. Check provider documentation for response format changes
  2. Review content moderation settings
  3. Verify request parameters are within acceptable ranges
  4. Test with provider's API directly using same parameters

Provider Quota Exhausted

Symptoms:

  • Sudden increase in errors from specific provider
  • Provider-specific error messages about quotas

Possible Causes:

  • Exceeded free tier limits
  • Reached paid quota limits
  • Billing issues with provider

Solutions:

  1. Check provider dashboard for quota usage
  2. Upgrade to paid tier if using free tier
  3. Verify billing information with provider
  4. Distribute load across multiple providers

Performance Problems

Slow Response Times

Symptoms:

  • High latency in API responses
  • User experience degradation

Possible Causes:

  • Slow LLM provider responses
  • Network latency
  • Insufficient server resources
  • Concurrent request overload

Solutions:

  1. Monitor provider response times individually
  2. Optimize network routing
  3. Scale server resources (CPU, memory)
  4. Implement caching for common requests
  5. Use faster LLM providers when possible
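Step 4 (caching) can be sketched with a file-based cache keyed on a hash of the prompt. The /query endpoint, X-API-Key header, and JSON body shape are assumptions based on this guide; a real cache should also validate responses before storing them and expire old entries:

```shell
# Minimal file-based response cache keyed on the SHA-256 of the prompt.
cache_dir=${CACHE_DIR:-/tmp/gateway-cache}
mkdir -p "$cache_dir"

cached_query() {
  # NOTE: sketch only; prompts containing quotes would break the JSON body.
  local prompt=$1 key file
  key=$(printf '%s' "$prompt" | sha256sum | cut -d' ' -f1)
  file="$cache_dir/$key"
  if [ -f "$file" ]; then
    cat "$file"                              # cache hit: no round trip
  else
    curl -s --max-time 30 -X POST http://localhost:8000/query \
      -H "X-API-Key: $SERVICE_API_KEY" \
      -H "Content-Type: application/json" \
      -d "{\"prompt\": \"$prompt\"}" | tee "$file"   # miss: fetch and store
  fi
}
# usage: cached_query "What is the capital of France?"
```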

High Memory Usage

Symptoms:

  • Application crashes with out-of-memory errors
  • System slowdown

Possible Causes:

  • Memory leaks in application
  • Large response payloads
  • Too many concurrent requests

Solutions:

  1. Monitor memory usage over time
  2. Implement response size limits
  3. Add memory limits to container configuration
  4. Scale horizontally with multiple instances
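Step 3 can look like this in a docker run invocation; the image name and the 512m limit are placeholders, so size the limit to your workload:

```shell
# Hard-cap container memory so a leak cannot take down the host.
docker run -d \
  --memory=512m --memory-swap=512m \
  --restart=on-failure \
  -p 8000:8000 --env-file .env \
  enterprise-ai-gateway
```

Setting --memory-swap equal to --memory disables swap for the container, so a leak fails fast inside the container instead of degrading the whole host.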

Deployment Issues

Docker Container Won't Start

Symptoms:

  • Container exits immediately
  • Error messages in docker logs

Possible Causes:

  • Missing environment variables
  • Port conflicts
  • Incorrect image tag
  • Insufficient permissions

Solutions:

  1. Check container logs:
    docker logs container_name
    
  2. Verify all required environment variables are set
  3. Check for port conflicts:
    docker run -p 8001:8000 ...  # Use different port
    
  4. Ensure proper permissions for mounted volumes

Environment Variables Not Loaded

Symptoms:

  • Configuration values not applied
  • Default values used instead

Possible Causes:

  • Incorrect .env file format
  • Environment file not mounted properly
  • Variable names don't match expected names

Solutions:

  1. Verify .env file format (no spaces around =):
    SERVICE_API_KEY=your_key_here
    
  2. Check that environment file is properly mounted in Docker:
    docker run --env-file .env ...
    
  3. Confirm variable names match documentation
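A quick format check can flag .env lines that will not parse (spaces around =, missing =, stray text):

```shell
# Print any .env line that is not a comment, blank, or KEY=value.
check_env_file() {
  grep -vE '^[[:space:]]*(#|$)' "$1" | grep -vE '^[A-Za-z_][A-Za-z0-9_]*=' \
    || echo "format OK"
}

if [ -f .env ]; then
  check_env_file .env
fi
```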

AI Safety Issues

Content Blocked Unexpectedly

Symptoms:

  • Safe prompts are being blocked
  • False positives from AI safety check

Possible Causes:

  • Toxicity threshold too low
  • Edge case in Gemini classification
  • Prompt contains keywords triggering false positives

Solutions:

  1. Increase toxicity threshold:
    TOXICITY_THRESHOLD=0.8  # Default is 0.7
    
  2. Check which category triggered the block in the response
  3. Review the prompt for unintended keywords
  4. Test the prompt directly with /check-toxicity endpoint
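Step 4 can be done with curl. The JSON field name "prompt" is an assumption, so check your API documentation for the exact request shape; json_escape guards against quotes in the prompt breaking the JSON body:

```shell
# Escape backslashes and double quotes so the prompt is safe inside JSON.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

prompt='Tell me about medieval siege weapons'
body="{\"prompt\": \"$(json_escape "$prompt")\"}"

curl -s -X POST http://localhost:8000/check-toxicity \
  -H "X-API-Key: $SERVICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$body" || true
```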

Gemini Safety API Errors

Symptoms:

  • Errors mentioning Gemini API
  • Safety check falling back to Lakera

Possible Causes:

  • Invalid or expired GEMINI_API_KEY
  • Gemini API quota exhausted
  • Network connectivity issues

Solutions:

  1. Verify GEMINI_API_KEY is valid:
    curl -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key=$GEMINI_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"contents":[{"parts":[{"text":"Hello"}]}]}'
    
  2. Check Gemini API quota in Google Cloud Console
  3. Add LAKERA_API_KEY for reliable fallback

Lakera Guard Fallback Issues

Symptoms:

  • Both Gemini and Lakera failing
  • No safety check available

Possible Causes:

  • LAKERA_API_KEY not configured
  • Lakera API key invalid
  • Both services experiencing outages

Solutions:

  1. Add LAKERA_API_KEY for fallback:
    LAKERA_API_KEY=your_lakera_key
    
  2. Test Lakera key directly:
    curl -X POST "https://api.lakera.ai/v2/guard" \
      -H "Authorization: Bearer $LAKERA_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"content":"test","role":"user"}]}'
    
  3. Check Lakera status page for outages

Harmful Content Not Being Blocked

Symptoms:

  • Harmful prompts passing safety checks
  • AI generating inappropriate content

Possible Causes:

  • Toxicity threshold too high
  • Gemini API not properly configured
  • Prompt using evasion techniques

Solutions:

  1. Lower toxicity threshold:
    TOXICITY_THRESHOLD=0.5
    
  2. Verify GEMINI_API_KEY is set correctly
  3. Add LAKERA_API_KEY for additional detection
  4. Review and update prompt injection patterns

Security Concerns

Suspicious Activity Detected

Symptoms:

  • Unexpected traffic patterns
  • High rate of blocked requests
  • Unusual API usage

Possible Causes:

  • Automated scanning/bot activity
  • Compromised API keys
  • Misconfigured rate limiting

Solutions:

  1. Review access logs for suspicious patterns
  2. Rotate potentially compromised API keys
  3. Implement IP whitelisting if appropriate
  4. Add more restrictive rate limiting
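Step 1 can be partially automated by tallying rejected requests per client IP. The awk field numbers assume a combined-style access log (client IP first, HTTP status ninth); adjust them to your log format:

```shell
# Count 401/429 responses per client IP, most active first.
suspicious_ips() {
  awk '$9 == 401 || $9 == 429 { count[$1]++ }
       END { for (ip in count) print count[ip], ip }' | sort -rn
}
# usage: suspicious_ips < access.log | head
```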

Prompt Injection Attempts

Symptoms:

  • High number of requests with injection patterns
  • Blocked requests with injection warnings

Possible Causes:

  • Malicious users attempting to bypass security
  • Legitimate users inadvertently triggering filters
  • Overly aggressive injection detection

Solutions:

  1. Review blocked prompts to identify false positives
  2. Fine-tune injection detection patterns if needed
  3. Implement additional security layers
  4. Monitor for patterns in attack attempts

Getting Additional Help

If you're unable to resolve an issue:

  1. Check the GitHub Issues for similar problems
  2. Review application logs for detailed error messages
  3. Ensure you're using the latest version of the application
  4. Contact the development team with:
    • Detailed description of the problem
    • Steps to reproduce
    • Relevant log excerpts
    • Environment information