# Troubleshooting Guide

> **Primary Responsibility:** Problem diagnosis and resolution for all issue types

This guide helps diagnose and resolve common issues with the Enterprise AI Gateway.

## Table of Contents

1. [Health Check Issues](#health-check-issues)
2. [Authentication Problems](#authentication-problems)
3. [API Request Errors](#api-request-errors)
4. [LLM Provider Issues](#llm-provider-issues)
5. [Performance Problems](#performance-problems)
6. [Deployment Issues](#deployment-issues)
7. [AI Safety Issues](#ai-safety-issues)
8. [Security Concerns](#security-concerns)

## Health Check Issues

### Service Unreachable

**Symptoms**:
- The `/health` endpoint returns 502 or 503, or the connection times out
- The application doesn't start

**Possible Causes**:
- Application not running
- Port binding issues
- Firewall/network restrictions
- Insufficient system resources

**Solutions**:
1. Check if the application process is running:
```bash
# The [u] keeps grep from matching its own process entry
ps aux | grep '[u]vicorn'
```
2. Verify port binding:
```bash
netstat -tlnp | grep :8000
```
3. Check application logs for startup errors
4. Ensure the firewall allows traffic on the application port
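
The checks above can be combined into a single probe. The following is a minimal sketch, not part of the gateway itself: the function names are hypothetical, the base URL assumes the port 8000 used elsewhere in this guide, and the mapping from status code to cause is a rough heuristic.

```bash
#!/bin/sh
# Map a probe result to a likely cause.
# $1 is the HTTP status code reported by curl ("000" means no HTTP response).
diagnose_health_code() {
    case "$1" in
        200) echo "reachable" ;;
        000) echo "no response: process down, wrong port, or firewall" ;;
        502|503) echo "gateway error: a proxy answered but the application did not" ;;
        *) echo "unexpected status $1: check application logs" ;;
    esac
}

# Probe /health with a 5-second limit and print the diagnosis.
# The base URL is an assumption; pass your own as the first argument.
probe_health() {
    base_url="${1:-http://localhost:8000}"
    code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$base_url/health") || true
    diagnose_health_code "${code:-000}"
}
```

Call `probe_health` with no arguments for a local deployment, or pass the public URL to test from outside the firewall.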

### Health Status Unhealthy

**Symptoms**:
- `/health` returns status "unhealthy"
- Provider field is null or missing

**Possible Causes**:
- Missing or invalid LLM provider API keys
- Misconfigured environment variables
- Provider service unavailable

**Solutions**:
1. Verify environment variables are set correctly:
```bash
# Note: this prints secret values; avoid running it in shared terminals or pasting the output
cat .env
```
2. Check that at least one LLM provider API key is configured
3. Test the API keys against the provider's API directly
4. Check provider status pages for service outages
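
To confirm which variables the running shell actually sees without printing their values, a small helper can report only presence and length. This is a sketch under assumptions: `report_var` is a hypothetical name, and the variable list contains only names that appear in this guide, so add the provider key names your deployment uses.

```bash
#!/bin/sh
# Print whether a variable is set, showing its length instead of its value.
report_var() {
    name="$1"
    # Indirect lookup: read the variable whose name is stored in $name.
    eval "value=\${$name-}"
    if [ -n "$value" ]; then
        echo "$name: set (${#value} chars)"
    else
        echo "$name: missing"
    fi
}

# Names taken from this guide; extend with your own provider keys.
for var in SERVICE_API_KEY GEMINI_API_KEY LAKERA_API_KEY; do
    report_var "$var"
done
```

Run it in the same environment as the application (for example inside the container), since variables exported in one shell are not visible in another.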

## Authentication Problems

### 401 Unauthorized Errors

**Symptoms**:
- All API requests except `/health` return 401
- Error message: "Invalid or missing API key"

**Possible Causes**:
- Missing `X-API-Key` header
- Invalid API key value
- `SERVICE_API_KEY` environment variable not set
- API key mismatch between client and server

**Solutions**:
1. Verify the `X-API-Key` header is included in requests:
```bash
curl -H "X-API-Key: your_api_key" http://localhost:8000/query
```
2. Check that `SERVICE_API_KEY` is set in the environment:
```bash
echo "$SERVICE_API_KEY"
```
3. Ensure API key values match between client and server
4. Regenerate the API key if it may have been compromised

### API Key Rejected Despite Being Correct

**Symptoms**:
- Valid API key is rejected
- Works intermittently

**Possible Causes**:
- Timing-attack prevention causing delays
- Character encoding issues
- Whitespace in API key

**Solutions**:
1. Strip whitespace from the API key:
```bash
# Remove all whitespace (spaces, tabs, newlines, carriage returns), including any inside the key
SERVICE_API_KEY=$(printf '%s' "$SERVICE_API_KEY" | tr -d ' \t\n\r')
```
2. Ensure consistent character encoding (UTF-8)
3. Regenerate the API key with alphanumeric characters only
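
Before stripping or regenerating a key, it can help to confirm that hidden characters are really present. The following sketch uses a hypothetical helper, `key_has_hidden_chars`, which treats anything other than printable, non-space ASCII as suspicious. That is an assumption about what a well-formed key looks like, not a rule taken from the gateway.

```bash
#!/bin/sh
# Succeed (exit 0) if the argument contains whitespace, control characters,
# or non-ASCII bytes; fail (exit 1) if it is clean.
key_has_hidden_chars() {
    # Delete every printable non-space ASCII character ('!' through '~');
    # whatever bytes remain are the suspicious ones.
    leftover=$(printf '%s' "$1" | LC_ALL=C tr -d '!-~' | wc -c | tr -d ' ')
    [ "$leftover" -gt 0 ]
}

if key_has_hidden_chars "${SERVICE_API_KEY-}"; then
    echo "SERVICE_API_KEY contains whitespace or non-printable characters"
fi
```

A trailing carriage return is a common culprit when a `.env` file was edited on Windows.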

## API Request Errors

### 422 Validation Errors

**Symptoms**:
- Requests return 422 with validation error messages
- Specific field errors in response

**Possible Causes**:
- Prompt too short or too long
- Invalid `max_tokens` value
- Invalid `temperature` value
- Prompt injection detected

**Solutions**:
1. Check that the prompt length is between 1 and 4000 characters
2. Verify `max_tokens` is between 1 and 2048
3. Verify `temperature` is between 0.0 and 2.0
4. Review the prompt for injection patterns like "ignore previous instructions"
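
The numeric limits above can be checked on the client before a request is sent. This is a sketch only: `validate_request` is a hypothetical helper that mirrors the documented ranges, the server may count prompt characters differently, and the injection check is not reproduced here.

```bash
#!/bin/sh
# Check prompt length, max_tokens, and temperature against the documented ranges.
# Prints "ok" or the first violation found; returns non-zero on a violation.
validate_request() {
    prompt="$1"; max_tokens="$2"; temperature="$3"
    len=${#prompt}
    if [ "$len" -lt 1 ] || [ "$len" -gt 4000 ]; then
        echo "prompt length $len outside 1-4000"; return 1
    fi
    case "$max_tokens" in
        ''|*[!0-9]*) echo "max_tokens must be an integer"; return 1 ;;
    esac
    if [ "$max_tokens" -lt 1 ] || [ "$max_tokens" -gt 2048 ]; then
        echo "max_tokens $max_tokens outside 1-2048"; return 1
    fi
    # awk performs the floating-point comparison that sh cannot do natively.
    if ! awk -v t="$temperature" \
        'BEGIN { exit !(t ~ /^[0-9]+(\.[0-9]+)?$/ && t >= 0.0 && t <= 2.0) }'; then
        echo "temperature $temperature outside 0.0-2.0"; return 1
    fi
    echo "ok"
}
```

For example, `validate_request "Hello" 4096 0.7` reports that `max_tokens` is out of range, which matches the kind of field error a 422 response would carry.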

### 429 Rate Limit Exceeded

**Symptoms**:
- Requests return 429 status code
- Error message: "Rate limit exceeded"

**Possible Causes**:
- Too many requests from the same IP within the time window
- Misconfigured rate limit settings
- Shared proxy/IP affecting multiple users

**Solutions**:
1. Reduce request frequency to stay within limits
2. Increase rate limit in configuration:
```bash
RATE_LIMIT=20/minute
```
3. Implement exponential backoff in client code
4. Use different IP addresses or API keys for different clients
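
Exponential backoff, mentioned in the solutions above, doubles the wait after each rejected attempt up to a ceiling. The sketch below is one possible client-side shape. The function names, the 1-second base, the 60-second cap, and the retry count are all assumptions, and the request mirrors the `curl` example from the authentication section rather than a confirmed request format.

```bash
#!/bin/sh
# Delay in seconds before retry number $1 (0-based): base * 2^attempt, capped.
backoff_delay() {
    attempt="$1"; base="${2:-1}"; cap="${3:-60}"
    delay="$base"
    i=0
    while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "$delay" -gt "$cap" ]; then
        delay="$cap"
    fi
    echo "$delay"
}

# Retry a request while the gateway answers 429; print the final status code.
request_with_backoff() {
    url="$1"; max_retries="${2:-5}"
    attempt=0
    while [ "$attempt" -le "$max_retries" ]; do
        code=$(curl -s -o /dev/null -w '%{http_code}' \
            -H "X-API-Key: ${SERVICE_API_KEY-}" "$url") || true
        if [ "$code" != "429" ]; then
            echo "$code"
            return 0
        fi
        sleep "$(backoff_delay "$attempt")"
        attempt=$((attempt + 1))
    done
    echo "429"
    return 1
}
```

With the defaults, the waits are 1, 2, 4, 8, 16, and 32 seconds. Adding a small random offset to each wait helps when many clients retry at once.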

### 500 Internal Server Errors

**Symptoms**:
- Requests return 500 with generic error messages
- "All LLM providers failed" error

**Possible Causes**:
- All configured LLM providers are unavailable
- Network connectivity issues
- Provider API key issues
- Application bugs

**Solutions**:
1. Check LLM provider status pages
2. Verify all API keys are valid and have sufficient quotas
3. Test network connectivity to provider endpoints
4. Check application logs for specific error details
5. Try configuring additional LLM providers

## LLM Provider Issues

### Provider Timeout

**Symptoms**:
- Slow responses or timeouts
- Fallback to secondary providers

**Possible Causes**:
- Provider API latency
- Network connectivity issues
- Provider rate limits exceeded
- Geographic distance from provider

**Solutions**:
1. Check provider status dashboards
2. Verify network connectivity:
```bash
# Some networks block ICMP, so a failed ping alone does not prove the provider is unreachable
ping generativelanguage.googleapis.com
```
3. Review provider rate limits and quotas
4. Consider using providers geographically closer to your deployment

### Provider Returns Empty Response

**Symptoms**:
- Valid responses with empty content
- Provider used but no text returned

**Possible Causes**:
- Provider API response format changed
- Content filtering blocking response
- Invalid request parameters

**Solutions**:
1. Check provider documentation for response format changes
2. Review content moderation settings
3. Verify request parameters are within acceptable ranges
4. Test with the provider's API directly using the same parameters

### Provider Quota Exhausted

**Symptoms**:
- Sudden increase in errors from a specific provider
- Provider-specific error messages about quotas

**Possible Causes**:
- Exceeded free tier limits
- Reached paid quota limits
- Billing issues with provider

**Solutions**:
1. Check provider dashboard for quota usage
2. Upgrade to paid tier if using free tier
3. Verify billing information with provider
4. Distribute load across multiple providers

## Performance Problems

### Slow Response Times

**Symptoms**:
- High latency in API responses
- User experience degradation

**Possible Causes**:
- Slow LLM provider responses
- Network latency
- Insufficient server resources
- Concurrent request overload

**Solutions**:
1. Monitor provider response times individually
2. Optimize network routing
3. Scale server resources (CPU, memory)
4. Implement caching for common requests
5. Use faster LLM providers when possible

### High Memory Usage

**Symptoms**:
- Application crashes with out-of-memory errors
- System slowdown

**Possible Causes**:
- Memory leaks in application
- Large response payloads
- Too many concurrent requests

**Solutions**:
1. Monitor memory usage over time
2. Implement response size limits
3. Add memory limits to container configuration
4. Scale horizontally with multiple instances

## Deployment Issues

### Docker Container Won't Start

**Symptoms**:
- Container exits immediately
- Error messages in docker logs

**Possible Causes**:
- Missing environment variables
- Port conflicts
- Incorrect image tag
- Insufficient permissions

**Solutions**:
1. Check container logs:
```bash
docker logs container_name
```
2. Verify all required environment variables are set
3. If the host port is already in use, map a different one:
```bash
docker run -p 8001:8000 ... # Host port 8001 maps to container port 8000
```
4. Ensure proper permissions for mounted volumes

### Environment Variables Not Loaded

**Symptoms**:
- Configuration values not applied
- Default values used instead

**Possible Causes**:
- Incorrect .env file format
- Environment file not mounted properly
- Variable names don't match expected names

**Solutions**:
1. Verify .env file format (no spaces around =):
```
SERVICE_API_KEY=your_key_here
```
2. Check that environment file is properly mounted in Docker:
```bash
docker run --env-file .env ...
```
3. Confirm variable names match documentation
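
The format rule above is easy to check mechanically. The following sketch defines a hypothetical `lint_env_file` helper that flags lines with spaces around `=`, lines without `=`, and names that are not plain identifiers. It is deliberately strict and does not attempt to model every syntax a particular `.env` loader accepts.

```bash
#!/bin/sh
# Print "line N: reason" for each suspicious line in an env file.
# Returns 1 if any problem was found, 0 otherwise.
lint_env_file() {
    awk '
        /^[ \t]*#/ { next }    # skip comments
        /^[ \t]*$/ { next }    # skip blank lines
        !/=/ { printf "line %d: no = sign\n", NR; bad = 1; next }
        /^[^=]*[ \t]=/ || /^[^=]*=[ \t]/ {
            printf "line %d: space around =\n", NR; bad = 1; next
        }
        !/^[A-Za-z_][A-Za-z0-9_]*=/ {
            printf "line %d: invalid variable name\n", NR; bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Run `lint_env_file .env` before starting the container. No output and a zero exit status mean every line passed these checks.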

## AI Safety Issues

### Content Blocked Unexpectedly

**Symptoms**:
- Safe prompts are being blocked
- False positives from AI safety check

**Possible Causes**:
- Toxicity threshold too low
- Edge case in Gemini classification
- Prompt contains keywords triggering false positives

**Solutions**:
1. Increase toxicity threshold:
```bash
TOXICITY_THRESHOLD=0.8 # Default is 0.7
```
2. Check which category triggered the block in the response
3. Review the prompt for unintended keywords
4. Test the prompt directly with `/check-toxicity` endpoint

### Gemini Safety API Errors

**Symptoms**:
- Errors mentioning Gemini API
- Safety check falling back to Lakera

**Possible Causes**:
- Invalid or expired GEMINI_API_KEY
- Gemini API quota exhausted
- Network connectivity issues

**Solutions**:
1. Verify GEMINI_API_KEY is valid:
```bash
curl -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key=$GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents":[{"parts":[{"text":"Hello"}]}]}'
```
2. Check Gemini API quota in Google Cloud Console
3. Add LAKERA_API_KEY for reliable fallback

### Lakera Guard Fallback Issues

**Symptoms**:
- Both Gemini and Lakera failing
- No safety check available

**Possible Causes**:
- LAKERA_API_KEY not configured
- Lakera API key invalid
- Both services experiencing outages

**Solutions**:
1. Add LAKERA_API_KEY for fallback:
```bash
LAKERA_API_KEY=your_lakera_key
```
2. Test Lakera key directly:
```bash
curl -X POST "https://api.lakera.ai/v2/guard" \
  -H "Authorization: Bearer $LAKERA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"content":"test","role":"user"}]}'
```
3. Check Lakera status page for outages

### Harmful Content Not Being Blocked

**Symptoms**:
- Harmful prompts passing safety checks
- AI generating inappropriate content

**Possible Causes**:
- Toxicity threshold too high
- Gemini API not properly configured
- Prompt using evasion techniques

**Solutions**:
1. Lower toxicity threshold:
```bash
TOXICITY_THRESHOLD=0.5
```
2. Verify GEMINI_API_KEY is set correctly
3. Add LAKERA_API_KEY for additional detection
4. Review and update prompt injection patterns

---

## Security Concerns

### Suspicious Activity Detected

**Symptoms**:
- Unexpected traffic patterns
- High rate of blocked requests
- Unusual API usage

**Possible Causes**:
- Automated scanning/bot activity
- Compromised API keys
- Misconfigured rate limiting

**Solutions**:
1. Review access logs for suspicious patterns
2. Rotate potentially compromised API keys
3. Implement IP whitelisting if appropriate
4. Add more restrictive rate limiting
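
Reviewing access logs usually starts with finding which clients are being rejected most often. The sketch below counts 401 and 429 responses per client address. It assumes access-log lines shaped like `INFO:     10.0.0.5:40112 - "POST /query HTTP/1.1" 401 Unauthorized` with IPv4 peers. That line format, the function name, and the log file path in the usage note are assumptions; adjust the field positions to match your own logs.

```bash
#!/bin/sh
# Count rejected requests (401 or 429) per client IP, highest count first.
# Output format: "<count> <ip>".
count_rejected_by_ip() {
    awk '$1 == "INFO:" && ($7 == 401 || $7 == 429) {
             split($2, peer, ":")    # drop the client port, keep the address
             hits[peer[1]]++
         }
         END { for (ip in hits) print hits[ip], ip }' "$1" | sort -rn
}
```

For example, `count_rejected_by_ip gateway.log | head` lists the ten noisiest clients. Behind a reverse proxy every request may appear to come from the proxy address, in which case the proxy's own logs are the better source.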

### Prompt Injection Attempts

**Symptoms**:
- High number of requests with injection patterns
- Blocked requests with injection warnings

**Possible Causes**:
- Malicious users attempting to bypass security
- Legitimate users inadvertently triggering filters
- Overly aggressive injection detection

**Solutions**:
1. Review blocked prompts to identify false positives
2. Fine-tune injection detection patterns if needed
3. Implement additional security layers
4. Monitor for patterns in attack attempts

## Getting Additional Help

If you're unable to resolve an issue:
1. Check the [GitHub Issues](https://github.com/vn6295337/Enterprise-AI-Gateway/issues) for similar problems
2. Review application logs for detailed error messages
3. Ensure you're using the latest version of the application
4. Contact the development team with:
   - Detailed description of the problem
   - Steps to reproduce
   - Relevant log excerpts
   - Environment information