Troubleshooting Guide

Primary Responsibility: Diagnosing and resolving all common issue types

This guide helps diagnose and resolve common issues with the Enterprise AI Gateway.

Table of Contents

  1. Health Check Issues
  2. Authentication Problems
  3. API Request Errors
  4. LLM Provider Issues
  5. Performance Problems
  6. Deployment Issues
  7. AI Safety Issues
  8. Security Concerns

Health Check Issues

Service Unreachable

Symptoms:

  • /health endpoint returns 502, 503, or connection timeout
  • Application doesn't start

Possible Causes:

  • Application not running
  • Port binding issues
  • Firewall/network restrictions
  • Insufficient system resources

Solutions:

  1. Check if the application process is running:
    ps aux | grep uvicorn
    
  2. Verify port binding:
    ss -tlnp | grep :8000    # or: netstat -tlnp | grep :8000 on older systems
    
  3. Check application logs for startup errors
  4. Ensure firewall allows traffic on the application port
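The first two checks above can be combined into a small diagnostic script. The host, port, and process name (uvicorn) are assumptions based on the examples in this guide; adjust them to your deployment:

```shell
#!/usr/bin/env bash
# Quick reachability check for the gateway (requires bash for /dev/tcp).
HOST=${HOST:-127.0.0.1}
PORT=${PORT:-8000}

# Returns 0 if a TCP listener accepts connections on host $1, port $2.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

if pgrep -f uvicorn >/dev/null; then
  echo "uvicorn process: running"
else
  echo "uvicorn process: NOT running"
fi

if port_open "$HOST" "$PORT"; then
  echo "port $PORT: accepting connections"
else
  echo "port $PORT: closed or filtered"
fi
```

If the process is running but the port is closed, the application is likely bound to a different port or interface than expected.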

Health Status Unhealthy

Symptoms:

  • /health returns status "unhealthy"
  • Provider field is null or missing

Possible Causes:

  • Missing or invalid LLM provider API keys
  • Misconfigured environment variables
  • Provider service unavailable

Solutions:

  1. Verify environment variables are set correctly:
    cat .env
    
  2. Check that at least one LLM provider API key is configured
  3. Test API keys with provider's API directly
  4. Check provider status pages for service outages
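To inspect the health payload from the command line, a small helper can pull out the status field. The response shape (a JSON object with "status" and "provider" keys) is inferred from the symptoms above; use jq instead if it is installed:

```shell
# Crude extraction of the "status" field from the /health JSON response.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' <<<"$1"
}

resp=$(curl -s --max-time 5 http://localhost:8000/health || true)
echo "gateway status: $(health_status "$resp")"
```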

Authentication Problems

401 Unauthorized Errors

Symptoms:

  • All API requests except /health return 401
  • Error message: "Invalid or missing API key"

Possible Causes:

  • Missing X-API-Key header
  • Invalid API key value
  • SERVICE_API_KEY environment variable not set
  • API key mismatch between client and server

Solutions:

  1. Verify the X-API-Key header is included in requests:
    curl -H "X-API-Key: your_api_key" http://localhost:8000/query
    
  2. Check that SERVICE_API_KEY is set in environment:
    echo $SERVICE_API_KEY
    
  3. Ensure API key values match between client and server
  4. Regenerate API key if it may have been compromised

API Key Rejected Despite Being Correct

Symptoms:

  • Valid API key is rejected
  • Works intermittently

Possible Causes:

  • Timing-attack prevention causing delays
  • Character encoding issues
  • Whitespace in API key

Solutions:

  1. Strip whitespace from API key:
    # Remove any trailing/leading whitespace
    SERVICE_API_KEY=$(echo "$SERVICE_API_KEY" | tr -d ' \t\n\r')
    
  2. Ensure consistent character encoding (UTF-8)
  3. Regenerate API key with alphanumeric characters only
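Step 3 can be done with openssl, which is one common way to generate a key that is guaranteed alphanumeric and whitespace-free:

```shell
# Generate a 64-character hex key (alphanumeric only, no whitespace).
NEW_KEY=$(openssl rand -hex 32)
CLEAN_KEY=$(printf '%s' "$NEW_KEY" | tr -d ' \t\n\r')

if [ "$NEW_KEY" = "$CLEAN_KEY" ]; then
  echo "key is clean: ${#NEW_KEY} characters"
fi
```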

API Request Errors

422 Validation Errors

Symptoms:

  • Requests return 422 with validation error messages
  • Specific field errors in response

Possible Causes:

  • Prompt too short or too long
  • Invalid max_tokens value
  • Invalid temperature value
  • Prompt injection detected

Solutions:

  1. Check that the prompt length is between 1 and 4000 characters
  2. Verify max_tokens is between 1 and 2048
  3. Verify temperature is between 0.0 and 2.0
  4. Review prompt for injection patterns like "ignore previous instructions"
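These limits can be checked client-side before a request is sent, turning a 422 round trip into an immediate local error. The ranges below are the ones documented above:

```shell
# Pre-validate request parameters against the documented limits.
validate_request() {  # usage: validate_request PROMPT MAX_TOKENS TEMPERATURE
  local prompt=$1 max_tokens=$2 temp=$3
  local len=${#prompt}
  if [ "$len" -lt 1 ] || [ "$len" -gt 4000 ]; then
    echo "prompt length $len outside 1-4000"; return 1
  fi
  if [ "$max_tokens" -lt 1 ] || [ "$max_tokens" -gt 2048 ]; then
    echo "max_tokens outside 1-2048"; return 1
  fi
  if ! awk -v t="$temp" 'BEGIN { exit !(t >= 0.0 && t <= 2.0) }'; then
    echo "temperature outside 0.0-2.0"; return 1
  fi
  echo "request parameters look valid"
}

validate_request "Summarize this report" 512 0.7
```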

429 Rate Limit Exceeded

Symptoms:

  • Requests return 429 status code
  • Error message: "Rate limit exceeded"

Possible Causes:

  • Too many requests from the same IP within the time window
  • Misconfigured rate limit settings
  • Shared proxy/IP affecting multiple users

Solutions:

  1. Reduce request frequency to stay within limits
  2. Increase rate limit in configuration:
    RATE_LIMIT=20/minute
    
  3. Implement exponential backoff in client code
  4. Use different IP addresses or API keys for different clients
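Step 3 (exponential backoff) can be sketched as a retry loop around curl; the endpoint and header names follow the earlier examples in this guide:

```shell
# Retry with exponential backoff (1s, 2s, 4s, 8s, 16s) on HTTP 429.
backoff_delay() {  # backoff_delay ATTEMPT -> seconds
  echo $(( 1 << ($1 - 1) ))
}

for attempt in 1 2 3 4 5; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "X-API-Key: $SERVICE_API_KEY" http://localhost:8000/query || true)
  if [ "$code" != "429" ]; then
    break                        # success, or a non-rate-limit error
  fi
  sleep "$(backoff_delay "$attempt")"
done
echo "last status: $code"
```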

500 Internal Server Errors

Symptoms:

  • Requests return 500 with generic error messages
  • "All LLM providers failed" error

Possible Causes:

  • All configured LLM providers are unavailable
  • Network connectivity issues
  • Provider API key issues
  • Application bugs

Solutions:

  1. Check LLM provider status pages
  2. Verify all API keys are valid and have sufficient quotas
  3. Test network connectivity to provider endpoints
  4. Check application logs for specific error details
  5. Try configuring additional LLM providers

LLM Provider Issues

Provider Timeout

Symptoms:

  • Slow responses or timeouts
  • Fallback to secondary providers

Possible Causes:

  • Provider API latency
  • Network connectivity issues
  • Provider rate limits exceeded
  • Geographic distance from provider

Solutions:

  1. Check provider status dashboards
  2. Verify network connectivity:
    ping generativelanguage.googleapis.com
    
  3. Review provider rate limits and quotas
  4. Consider using providers geographically closer to your deployment
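curl's timing variables give a more detailed picture than ping, which some provider hosts do not answer. This helper breaks latency into DNS lookup, TCP connect, and total time:

```shell
# Per-phase latency breakdown for any URL.
time_url() {
  curl -s -o /dev/null --max-time 10 \
    -w 'dns=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s\n' \
    "$1"
}

time_url https://generativelanguage.googleapis.com/ || true
```

A large gap between connect and total time points at provider-side latency rather than your network.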

Provider Returns Empty Response

Symptoms:

  • Valid responses with empty content
  • Provider used but no text returned

Possible Causes:

  • Provider API response format changed
  • Content filtering blocking response
  • Invalid request parameters

Solutions:

  1. Check provider documentation for response format changes
  2. Review content moderation settings
  3. Verify request parameters are within acceptable ranges
  4. Test with provider's API directly using same parameters

Provider Quota Exhausted

Symptoms:

  • Sudden increase in errors from specific provider
  • Provider-specific error messages about quotas

Possible Causes:

  • Exceeded free tier limits
  • Reached paid quota limits
  • Billing issues with provider

Solutions:

  1. Check provider dashboard for quota usage
  2. Upgrade to paid tier if using free tier
  3. Verify billing information with provider
  4. Distribute load across multiple providers

Performance Problems

Slow Response Times

Symptoms:

  • High latency in API responses
  • User experience degradation

Possible Causes:

  • Slow LLM provider responses
  • Network latency
  • Insufficient server resources
  • Concurrent request overload

Solutions:

  1. Monitor provider response times individually
  2. Optimize network routing
  3. Scale server resources (CPU, memory)
  4. Implement caching for common requests
  5. Use faster LLM providers when possible
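Step 4 (caching) can be sketched with a file-based cache keyed on a hash of the prompt. The /query endpoint, X-API-Key header, and JSON body shape are assumptions based on this guide; a real cache should also validate responses before storing them and expire old entries:

```shell
# Minimal file-based response cache keyed on the SHA-256 of the prompt.
cache_dir=${CACHE_DIR:-/tmp/gateway-cache}
mkdir -p "$cache_dir"

cached_query() {
  # NOTE: sketch only; prompts containing quotes would break the JSON body.
  local prompt=$1 key file
  key=$(printf '%s' "$prompt" | sha256sum | cut -d' ' -f1)
  file="$cache_dir/$key"
  if [ -f "$file" ]; then
    cat "$file"                              # cache hit: no round trip
  else
    curl -s --max-time 30 -X POST http://localhost:8000/query \
      -H "X-API-Key: $SERVICE_API_KEY" \
      -H "Content-Type: application/json" \
      -d "{\"prompt\": \"$prompt\"}" | tee "$file"   # miss: fetch and store
  fi
}
# usage: cached_query "What is the capital of France?"
```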

High Memory Usage

Symptoms:

  • Application crashes with out-of-memory errors
  • System slowdown

Possible Causes:

  • Memory leaks in application
  • Large response payloads
  • Too many concurrent requests

Solutions:

  1. Monitor memory usage over time
  2. Implement response size limits
  3. Add memory limits to container configuration
  4. Scale horizontally with multiple instances
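Step 3 can look like this in a docker run invocation; the image name and the 512m limit are placeholders, so size the limit to your workload:

```shell
# Hard-cap container memory so a leak cannot take down the host.
docker run -d \
  --memory=512m --memory-swap=512m \
  --restart=on-failure \
  -p 8000:8000 --env-file .env \
  enterprise-ai-gateway
```

Setting --memory-swap equal to --memory disables swap for the container, so a leak fails fast inside the container instead of degrading the whole host.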

Deployment Issues

Docker Container Won't Start

Symptoms:

  • Container exits immediately
  • Error messages in docker logs

Possible Causes:

  • Missing environment variables
  • Port conflicts
  • Incorrect image tag
  • Insufficient permissions

Solutions:

  1. Check container logs:
    docker logs container_name
    
  2. Verify all required environment variables are set
  3. Check for port conflicts:
    docker run -p 8001:8000 ...  # Use different port
    
  4. Ensure proper permissions for mounted volumes

Environment Variables Not Loaded

Symptoms:

  • Configuration values not applied
  • Default values used instead

Possible Causes:

  • Incorrect .env file format
  • Environment file not mounted properly
  • Variable names don't match expected names

Solutions:

  1. Verify .env file format (no spaces around =):
    SERVICE_API_KEY=your_key_here
    
  2. Check that environment file is properly mounted in Docker:
    docker run --env-file .env ...
    
  3. Confirm variable names match documentation
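A quick format check can flag .env lines that will not parse (spaces around =, missing =, stray text):

```shell
# Print any .env line that is not a comment, blank, or KEY=value.
check_env_file() {
  grep -vE '^[[:space:]]*(#|$)' "$1" | grep -vE '^[A-Za-z_][A-Za-z0-9_]*=' \
    || echo "format OK"
}

if [ -f .env ]; then
  check_env_file .env
fi
```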

AI Safety Issues

Content Blocked Unexpectedly

Symptoms:

  • Safe prompts are being blocked
  • False positives from AI safety check

Possible Causes:

  • Toxicity threshold too low
  • Edge case in Gemini classification
  • Prompt contains keywords triggering false positives

Solutions:

  1. Increase toxicity threshold:
    TOXICITY_THRESHOLD=0.8  # Default is 0.7
    
  2. Check which category triggered the block in the response
  3. Review the prompt for unintended keywords
  4. Test the prompt directly with /check-toxicity endpoint
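Step 4 can be done with curl. The JSON field name "prompt" is an assumption, so check your API documentation for the exact request shape; json_escape guards against quotes in the prompt breaking the JSON body:

```shell
# Escape backslashes and double quotes so the prompt is safe inside JSON.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

prompt='Tell me about medieval siege weapons'
body="{\"prompt\": \"$(json_escape "$prompt")\"}"

curl -s -X POST http://localhost:8000/check-toxicity \
  -H "X-API-Key: $SERVICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$body" || true
```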

Gemini Safety API Errors

Symptoms:

  • Errors mentioning Gemini API
  • Safety check falling back to Lakera

Possible Causes:

  • Invalid or expired GEMINI_API_KEY
  • Gemini API quota exhausted
  • Network connectivity issues

Solutions:

  1. Verify GEMINI_API_KEY is valid:
    curl -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key=$GEMINI_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"contents":[{"parts":[{"text":"Hello"}]}]}'
    
  2. Check Gemini API quota in Google Cloud Console
  3. Add LAKERA_API_KEY for reliable fallback

Lakera Guard Fallback Issues

Symptoms:

  • Both Gemini and Lakera failing
  • No safety check available

Possible Causes:

  • LAKERA_API_KEY not configured
  • Lakera API key invalid
  • Both services experiencing outages

Solutions:

  1. Add LAKERA_API_KEY for fallback:
    LAKERA_API_KEY=your_lakera_key
    
  2. Test Lakera key directly:
    curl -X POST "https://api.lakera.ai/v2/guard" \
      -H "Authorization: Bearer $LAKERA_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"content":"test","role":"user"}]}'
    
  3. Check Lakera status page for outages

Harmful Content Not Being Blocked

Symptoms:

  • Harmful prompts passing safety checks
  • AI generating inappropriate content

Possible Causes:

  • Toxicity threshold too high
  • Gemini API not properly configured
  • Prompt using evasion techniques

Solutions:

  1. Lower toxicity threshold:
    TOXICITY_THRESHOLD=0.5
    
  2. Verify GEMINI_API_KEY is set correctly
  3. Add LAKERA_API_KEY for additional detection
  4. Review and update prompt injection patterns

Security Concerns

Suspicious Activity Detected

Symptoms:

  • Unexpected traffic patterns
  • High rate of blocked requests
  • Unusual API usage

Possible Causes:

  • Automated scanning/bot activity
  • Compromised API keys
  • Misconfigured rate limiting

Solutions:

  1. Review access logs for suspicious patterns
  2. Rotate potentially compromised API keys
  3. Implement IP whitelisting if appropriate
  4. Add more restrictive rate limiting
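Step 1 can be partially automated by tallying rejected requests per client IP. The awk field numbers assume a combined-style access log (client IP first, HTTP status ninth); adjust them to your log format:

```shell
# Count 401/429 responses per client IP, most active first.
suspicious_ips() {
  awk '$9 == 401 || $9 == 429 { count[$1]++ }
       END { for (ip in count) print count[ip], ip }' | sort -rn
}
# usage: suspicious_ips < access.log | head
```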

Prompt Injection Attempts

Symptoms:

  • High number of requests with injection patterns
  • Blocked requests with injection warnings

Possible Causes:

  • Malicious users attempting to bypass security
  • Legitimate users inadvertently triggering filters
  • Overly aggressive injection detection

Solutions:

  1. Review blocked prompts to identify false positives
  2. Fine-tune injection detection patterns if needed
  3. Implement additional security layers
  4. Monitor for patterns in attack attempts

Getting Additional Help

If you're unable to resolve an issue:

  1. Check the GitHub Issues for similar problems
  2. Review application logs for detailed error messages
  3. Ensure you're using the latest version of the application
  4. Contact the development team with:
    • Detailed description of the problem
    • Steps to reproduce
    • Relevant log excerpts
    • Environment information