	# Troubleshooting Guide

	> Primary Responsibility: Problem diagnosis and resolution for all issue types

	This guide helps diagnose and resolve common issues with the Enterprise AI Gateway.

	## Table of Contents

	1. [Health Check Issues](#health-check-issues)
	2. [Authentication Problems](#authentication-problems)
	3. [API Request Errors](#api-request-errors)
	4. [LLM Provider Issues](#llm-provider-issues)
	5. [Performance Problems](#performance-problems)
	6. [Deployment Issues](#deployment-issues)
	7. [AI Safety Issues](#ai-safety-issues)
	8. [Security Concerns](#security-concerns)
	9. [Getting Additional Help](#getting-additional-help)

	## Health Check Issues

	### Service Unreachable

	Symptoms:
	- `/health` endpoint returns 502, 503, or connection timeout
	- Application doesn't start

	Possible Causes:
	- Application not running
	- Port binding issues
	- Firewall/network restrictions
	- Insufficient system resources

	Solutions:
	1. Check if the application process is running:
	```bash
	ps aux | grep uvicorn
	```
	2. Verify port binding:
	```bash
	netstat -tlnp | grep :8000
	```
	3. Check application logs for startup errors
	4. Ensure firewall allows traffic on the application port
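
The checks above can be combined into a quick probe. This is a minimal sketch, assuming the gateway listens on `localhost:8000` as in the commands above. `classify_status` and `probe_health` are hypothetical helper names, not part of the gateway. curl reports a status of `000` when it received no HTTP response at all.

```shell
# Hypothetical helper: map the HTTP status from /health to a likely cause.
classify_status() {
  case "$1" in
    200)     echo "reachable" ;;
    502|503) echo "a proxy answered but the application did not" ;;
    000)     echo "no HTTP response (process down, port closed, or firewall)" ;;
    *)       echo "unexpected status $1" ;;
  esac
}

# Hypothetical helper: fetch only the status code, with a 5-second timeout.
# The default URL is an assumption; pass your own as the first argument.
probe_health() {
  url=${1:-http://localhost:8000/health}
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || true
  classify_status "$code"
}
```

Running `probe_health` with no arguments checks the local default; pass the public URL to test the same endpoint from outside the firewall.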

	### Health Status Unhealthy

	Symptoms:
	- `/health` returns status "unhealthy"
	- Provider field is null or missing

	Possible Causes:
	- Missing or invalid LLM provider API keys
	- Misconfigured environment variables
	- Provider service unavailable

	Solutions:
	1. Verify environment variables are set correctly:
	```bash
	cat .env
	```
	2. Check that at least one LLM provider API key is configured
	3. Test each API key directly against the provider's API
	4. Check provider status pages for service outages
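
Printing `.env` exposes secrets in terminal history and screen shares. The sketch below reports only whether each variable is set. It assumes the variable names that appear elsewhere in this guide (`SERVICE_API_KEY`, `GEMINI_API_KEY`, `LAKERA_API_KEY`); other provider key names are not listed here, so add them as needed. Both function names are hypothetical.

```shell
# Hypothetical helper: report whether a variable is set without printing it.
var_status() {
  eval "value=\${$1:-}"   # indirect lookup of the variable named in $1
  if [ -n "$value" ]; then
    echo "$1: set"
  else
    echo "$1: missing"
  fi
}

# Hypothetical helper: check the variable names mentioned in this guide.
check_gateway_env() {
  for name in SERVICE_API_KEY GEMINI_API_KEY LAKERA_API_KEY; do
    var_status "$name"
  done
}
```

Note that this inspects the current shell's environment, so run it in the same shell or container that starts the application.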

	## Authentication Problems

	### 401 Unauthorized Errors

	Symptoms:
	- All API requests except `/health` return 401
	- Error message: "Invalid or missing API key"

	Possible Causes:
	- Missing `X-API-Key` header
	- Invalid API key value
	- `SERVICE_API_KEY` environment variable not set
	- API key mismatch between client and server

	Solutions:
	1. Verify the `X-API-Key` header is included in requests:
	```bash
	curl -H "X-API-Key: your_api_key" http://localhost:8000/query
	```
	2. Check that `SERVICE_API_KEY` is set in environment:
	```bash
	echo $SERVICE_API_KEY
	```
	3. Ensure API key values match between client and server
	4. Regenerate API key if it may have been compromised
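
To compare the client and server keys without displaying either one, a short hash can be compared instead. This is a sketch that assumes `sha256sum` from GNU coreutils is installed; `key_fingerprint` is a hypothetical helper name.

```shell
# Hypothetical helper: print the first 8 hex digits of the key's SHA-256 hash.
# printf '%s' avoids hashing a trailing newline along with the key.
key_fingerprint() {
  printf '%s' "$1" | sha256sum | cut -c1-8
}
```

Run `key_fingerprint "$SERVICE_API_KEY"` on the server and `key_fingerprint "your_api_key"` on the client. Different fingerprints mean the values differ, including by invisible whitespace.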

	### API Key Rejected Despite Being Correct

	Symptoms:
	- Valid API key is rejected
	- Works intermittently

	Possible Causes:
	- Timing-attack prevention causing delays
	- Character encoding issues
	- Whitespace in API key

	Solutions:
	1. Strip whitespace from API key:
	```bash
	# Remove all whitespace (spaces, tabs, newlines, carriage returns) from the key
	SERVICE_API_KEY=$(echo "$SERVICE_API_KEY" | tr -d ' \t\n\r')
	```
	2. Ensure consistent character encoding (UTF-8)
	3. Regenerate API key with alphanumeric characters only
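
The following sketch checks a key for stray characters and generates an alphanumeric replacement. Both helper names are hypothetical, and the 64-character hex format is an assumption; the gateway may accept other key formats.

```shell
# Hypothetical helper: succeed only if the key is non-empty and alphanumeric.
key_is_clean() {
  case "$1" in
    ''|*[!A-Za-z0-9]*) return 1 ;;
    *)                 return 0 ;;
  esac
}

# Hypothetical helper: 64 hex characters from the system random source.
new_key() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}
```

For example, `key_is_clean "$SERVICE_API_KEY" || echo "key contains whitespace or symbols"` flags a key copied with a trailing newline or space.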

	## API Request Errors

	### 422 Validation Errors

	Symptoms:
	- Requests return 422 with validation error messages
	- Specific field errors in response

	Possible Causes:
	- Prompt too short or too long
	- Invalid `max_tokens` value
	- Invalid `temperature` value
	- Prompt injection detected

	Solutions:
	1. Check that the prompt is between 1 and 4000 characters long
	2. Verify `max_tokens` is between 1 and 2048
	3. Verify `temperature` is between 0.0 and 2.0
	4. Review prompt for injection patterns like "ignore previous instructions"
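
The limits above can be checked on the client before a request is sent. This is a sketch using only the ranges stated in this section; `validate_request` is a hypothetical helper, and the length check counts characters or bytes depending on the shell, so treat it as approximate for non-ASCII prompts.

```shell
# Hypothetical helper: validate prompt, max_tokens and temperature locally.
# Prints "ok" on success, or the first problem found and returns 1.
validate_request() {
  prompt=$1; max_tokens=$2; temperature=$3
  len=${#prompt}
  if [ "$len" -lt 1 ] || [ "$len" -gt 4000 ]; then
    echo "prompt length $len is outside 1-4000"; return 1
  fi
  case "$max_tokens" in
    ''|*[!0-9]*) echo "max_tokens must be a whole number"; return 1 ;;
  esac
  if [ "$max_tokens" -lt 1 ] || [ "$max_tokens" -gt 2048 ]; then
    echo "max_tokens $max_tokens is outside 1-2048"; return 1
  fi
  # The pattern admits only non-negative decimals, so just the upper bound remains.
  if ! awk -v t="$temperature" 'BEGIN { exit !(t ~ /^[0-9]+(\.[0-9]+)?$/ && t + 0 <= 2) }'; then
    echo "temperature $temperature is outside 0.0-2.0"; return 1
  fi
  echo "ok"
}
```

For example, `validate_request "Summarize this text" 4096 0.7` reports that `max_tokens` is out of range without calling the gateway.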

	### 429 Rate Limit Exceeded

	Symptoms:
	- Requests return 429 status code
	- Error message: "Rate limit exceeded"

	Possible Causes:
	- Too many requests from the same IP within the time window
	- Misconfigured rate limit settings
	- Shared proxy/IP affecting multiple users

	Solutions:
	1. Reduce request frequency to stay within limits
	2. Increase rate limit in configuration:
	```bash
	RATE_LIMIT=20/minute
	```
	3. Implement exponential backoff in client code
	4. Use different IP addresses or API keys for different clients
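
Exponential backoff means doubling the wait after each failed attempt. The sketch below shows the idea in shell; `retry_with_backoff` is a hypothetical helper, and it retries on any failure of the wrapped command, not only on 429 responses.

```shell
# Hypothetical helper: run a command until it succeeds, doubling the delay
# between attempts. Usage: retry_with_backoff MAX_ATTEMPTS FIRST_DELAY_SECONDS CMD...
retry_with_backoff() {
  max_attempts=$1; delay=$2; shift 2
  attempt=1
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1              # give up after the last allowed attempt
    fi
    sleep "$delay"
    delay=$((delay * 2))    # 1s, 2s, 4s, ...
    attempt=$((attempt + 1))
  done
}

# Example (assumes a gateway on localhost:8000; --fail makes curl exit
# non-zero on HTTP errors such as 429):
#   retry_with_backoff 5 1 curl --fail -s \
#     -H "X-API-Key: $SERVICE_API_KEY" http://localhost:8000/query
```

With 5 attempts and a 1-second first delay, the waits are 1, 2, 4, and 8 seconds, so a client backs off for at most 15 seconds in total before giving up.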

	### 500 Internal Server Errors

	Symptoms:
	- Requests return 500 with generic error messages
	- "All LLM providers failed" error

	Possible Causes:
	- All configured LLM providers are unavailable
	- Network connectivity issues
	- Provider API key issues
	- Application bugs

	Solutions:
	1. Check LLM provider status pages
	2. Verify all API keys are valid and have sufficient quotas
	3. Test network connectivity to provider endpoints
	4. Check application logs for specific error details
	5. Try configuring additional LLM providers

	## LLM Provider Issues

	### Provider Timeout

	Symptoms:
	- Slow responses or timeouts
	- Fallback to secondary providers

	Possible Causes:
	- Provider API latency
	- Network connectivity issues
	- Provider rate limits exceeded
	- Geographic distance from provider

	Solutions:
	1. Check provider status dashboards
	2. Verify network connectivity:
	```bash
	ping -c 4 generativelanguage.googleapis.com
	```
	3. Review provider rate limits and quotas
	4. Consider using providers geographically closer to your deployment
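
Because ICMP is often blocked, a failed ping does not prove that HTTPS traffic is affected. Timing a real request is a useful complement. This is a sketch with hypothetical helper names; the 30-second cap is an arbitrary assumption.

```shell
# Hypothetical helper: print TCP connect time and total time (seconds) for a URL.
time_request() {
  curl -s -o /dev/null --max-time 30 -w '%{time_connect} %{time_total}\n' "$1"
}

# Hypothetical helper: succeed if a measured time exceeds a limit (both in seconds).
is_slower_than() {
  awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t + 0 > limit + 0) }'
}
```

For example, `time_request https://generativelanguage.googleapis.com/` prints two numbers. A high connect time points toward the network path, while a low connect time with a high total time points toward the provider's processing.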

	### Provider Returns Empty Response

	Symptoms:
	- Valid responses with empty content
	- Provider used but no text returned

	Possible Causes:
	- Provider API response format changed
	- Content filtering blocking response
	- Invalid request parameters

	Solutions:
	1. Check provider documentation for response format changes
	2. Review content moderation settings
	3. Verify request parameters are within acceptable ranges
	4. Test against the provider's API directly, using the same parameters

	### Provider Quota Exhausted

	Symptoms:
	- Sudden increase in errors from specific provider
	- Provider-specific error messages about quotas

	Possible Causes:
	- Exceeded free tier limits
	- Reached paid quota limits
	- Billing issues with provider

	Solutions:
	1. Check provider dashboard for quota usage
	2. Upgrade to paid tier if using free tier
	3. Verify billing information with provider
	4. Distribute load across multiple providers

	## Performance Problems

	### Slow Response Times

	Symptoms:
	- High latency in API responses
	- User experience degradation

	Possible Causes:
	- Slow LLM provider responses
	- Network latency
	- Insufficient server resources
	- Concurrent request overload

	Solutions:
	1. Monitor provider response times individually
	2. Optimize network routing
	3. Scale server resources (CPU, memory)
	4. Implement caching for common requests
	5. Use faster LLM providers when possible

	### High Memory Usage

	Symptoms:
	- Application crashes with out-of-memory errors
	- System slowdown

	Possible Causes:
	- Memory leaks in application
	- Large response payloads
	- Too many concurrent requests

	Solutions:
	1. Monitor memory usage over time
	2. Implement response size limits
	3. Add memory limits to container configuration
	4. Scale horizontally with multiple instances
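
To monitor memory over time on Linux, the resident set size can be sampled from `/proc`. This is a sketch with hypothetical helper names; it assumes a Linux host and that you know the PID of the application process.

```shell
# Hypothetical helper: extract resident memory (kB) from a /proc status file.
rss_kb() {
  awk '/^VmRSS:/ { print $2 }' "$1"
}

# Hypothetical helper: print a timestamped RSS sample for a PID every N seconds.
# Linux only; the loop ends when the process exits.
watch_rss() {
  pid=$1; interval=${2:-60}
  while [ -r "/proc/$pid/status" ]; do
    echo "$(date '+%Y-%m-%dT%H:%M:%S') $(rss_kb "/proc/$pid/status") kB"
    sleep "$interval"
  done
}
```

A steadily rising value under constant load suggests a leak. For the container limit, `docker run` accepts a `--memory` flag (for example `--memory 512m`); the right value depends on your workload.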

	## Deployment Issues

	### Docker Container Won't Start

	Symptoms:
	- Container exits immediately
	- Error messages in docker logs

	Possible Causes:
	- Missing environment variables
	- Port conflicts
	- Incorrect image tag
	- Insufficient permissions

	Solutions:
	1. Check container logs:
	```bash
	docker logs container_name
	```
	2. Verify all required environment variables are set
	3. If the host port is already in use, map a different host port:
	```bash
	docker run -p 8001:8000 ... # Host port 8001 maps to container port 8000
	```
	4. Ensure proper permissions for mounted volumes
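
When a container exits immediately, its exit code narrows down the cause. It can be read with `docker inspect --format '{{.State.ExitCode}}' container_name`. The sketch below maps common codes to likely causes; `explain_exit_code` is a hypothetical helper and the descriptions are general conventions, not gateway-specific diagnostics.

```shell
# Hypothetical helper: describe common container exit codes.
explain_exit_code() {
  case "$1" in
    0)   echo "exited normally" ;;
    125) echo "the Docker daemon reported an error before the command ran" ;;
    126) echo "container command could not be invoked" ;;
    127) echo "container command was not found" ;;
    137) echo "killed by SIGKILL (128+9), commonly the out-of-memory killer" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)   echo "application-specific exit code; check docker logs" ;;
  esac
}
```

An exit code of 1 with a Python traceback in `docker logs` usually points at the application itself, such as a missing environment variable.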

	### Environment Variables Not Loaded

	Symptoms:
	- Configuration values not applied
	- Default values used instead

	Possible Causes:
	- Incorrect .env file format
	- Environment file not mounted properly
	- Variable names don't match expected names

	Solutions:
	1. Verify .env file format (no spaces around =):
	```
	SERVICE_API_KEY=your_key_here
	```
	2. Check that environment file is properly mounted in Docker:
	```bash
	docker run --env-file .env ...
	```
	3. Confirm variable names match documentation
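
The "no spaces around `=`" rule can be checked mechanically. This is a sketch; `lint_env_file` is a hypothetical helper that prints offending lines with their line numbers and returns non-zero when it finds any.

```shell
# Hypothetical helper: flag NAME = value or NAME= value lines in an env file.
# Prints "LINE_NUMBER:content" for each problem; returns 0 only if none are found.
lint_env_file() {
  if grep -nE '^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*([[:space:]]+=|=[[:space:]])' "$1"; then
    return 1
  fi
  return 0
}
```

For example, `lint_env_file .env && docker run --env-file .env ...` starts the container only when the file passes the check.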

	## AI Safety Issues

	### Content Blocked Unexpectedly

	Symptoms:
	- Safe prompts are being blocked
	- False positives from AI safety check

	Possible Causes:
	- Toxicity threshold too low
	- Edge case in Gemini classification
	- Prompt contains keywords triggering false positives

	Solutions:
	1. Increase toxicity threshold:
	```bash
	# Default is 0.7
	TOXICITY_THRESHOLD=0.8
	```
	2. Check which category triggered the block in the response
	3. Review the prompt for unintended keywords
	4. Test the prompt directly with `/check-toxicity` endpoint
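
To see why raising the threshold reduces false positives, consider a prompt scored at 0.75. The sketch below assumes a prompt is blocked when its score is greater than or equal to the threshold; the gateway's exact comparison is not documented here, and `would_block` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if SCORE would be blocked at THRESHOLD.
# Assumption: blocking happens when score >= threshold.
would_block() {
  awk -v s="$1" -v t="$2" 'BEGIN { exit !(s + 0 >= t + 0) }'
}
```

Under that assumption, `would_block 0.75 0.7` succeeds (blocked at the default) while `would_block 0.75 0.8` fails (allowed after the change). The same reasoning in reverse explains why lowering the threshold blocks more content.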

	### Gemini Safety API Errors

	Symptoms:
	- Errors mentioning Gemini API
	- Safety check falling back to Lakera

	Possible Causes:
	- Invalid or expired GEMINI_API_KEY
	- Gemini API quota exhausted
	- Network connectivity issues

	Solutions:
	1. Verify GEMINI_API_KEY is valid:
	```bash
	curl -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key=$GEMINI_API_KEY" \
	-H "Content-Type: application/json" \
	-d '{"contents":[{"parts":[{"text":"Hello"}]}]}'
	```
	2. Check Gemini API quota in Google Cloud Console
	3. Add LAKERA_API_KEY for reliable fallback

	### Lakera Guard Fallback Issues

	Symptoms:
	- Both Gemini and Lakera failing
	- No safety check available

	Possible Causes:
	- LAKERA_API_KEY not configured
	- Lakera API key invalid
	- Both services experiencing outages

	Solutions:
	1. Add LAKERA_API_KEY for fallback:
	```bash
	LAKERA_API_KEY=your_lakera_key
	```
	2. Test Lakera key directly:
	```bash
	curl -X POST "https://api.lakera.ai/v2/guard" \
	-H "Authorization: Bearer $LAKERA_API_KEY" \
	-H "Content-Type: application/json" \
	-d '{"messages":[{"content":"test","role":"user"}]}'
	```
	3. Check Lakera status page for outages

	### Harmful Content Not Being Blocked

	Symptoms:
	- Harmful prompts passing safety checks
	- AI generating inappropriate content

	Possible Causes:
	- Toxicity threshold too high
	- Gemini API not properly configured
	- Prompt using evasion techniques

	Solutions:
	1. Lower toxicity threshold:
	```bash
	TOXICITY_THRESHOLD=0.5
	```
	2. Verify GEMINI_API_KEY is set correctly
	3. Add LAKERA_API_KEY for additional detection
	4. Review and update prompt injection patterns

	---

	## Security Concerns

	### Suspicious Activity Detected

	Symptoms:
	- Unexpected traffic patterns
	- High rate of blocked requests
	- Unusual API usage

	Possible Causes:
	- Automated scanning/bot activity
	- Compromised API keys
	- Misconfigured rate limiting

	Solutions:
	1. Review access logs for suspicious patterns
	2. Rotate potentially compromised API keys
	3. Implement IP allowlisting if appropriate
	4. Add more restrictive rate limiting
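
Reviewing access logs is easier with a per-client summary of rejected requests. The sketch below assumes uvicorn-style access log lines and IPv4 clients; if your deployment logs in another format, the field positions will differ. `rejected_by_client` is a hypothetical helper.

```shell
# Hypothetical helper: count 401/429 responses per client IP, busiest first.
# Assumes uvicorn-style access lines such as:
#   INFO:     10.0.0.5:50000 - "POST /query HTTP/1.1" 401 Unauthorized
rejected_by_client() {
  awk '$7 == 401 || $7 == 429 {
         split($2, addr, ":")      # drop the client port
         count[addr[1]]++
       }
       END { for (ip in count) print count[ip], ip }' "$1" | sort -rn
}
```

If the gateway sits behind a reverse proxy, the logged address may be the proxy's rather than the real client's, so check the proxy's logs as well.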

	### Prompt Injection Attempts

	Symptoms:
	- High number of requests with injection patterns
	- Blocked requests with injection warnings

	Possible Causes:
	- Malicious users attempting to bypass security
	- Legitimate users inadvertently triggering filters
	- Overly aggressive injection detection

	Solutions:
	1. Review blocked prompts to identify false positives
	2. Fine-tune injection detection patterns if needed
	3. Implement additional security layers
	4. Monitor for patterns in attack attempts
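
When reviewing blocked prompts, a simple phrase match helps separate obvious attempts from likely false positives. This sketch uses only the example phrase mentioned earlier in this guide; the gateway's actual pattern list is not shown here, and `looks_like_injection` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the text contains a known injection phrase.
# Matching is case-insensitive and literal (no regular expressions).
looks_like_injection() {
  printf '%s\n' "$1" | grep -qiF 'ignore previous instructions'
}
```

A blocked prompt that does not match any known phrase is a candidate false positive worth inspecting by hand.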

	## Getting Additional Help

	If you're unable to resolve an issue:

	1. Check the [GitHub Issues](https://github.com/vn6295337/Enterprise-AI-Gateway/issues) for similar problems
	2. Review application logs for detailed error messages
	3. Ensure you're using the latest version of the application
	4. Contact the development team with:
	   - Detailed description of the problem
	   - Steps to reproduce
	   - Relevant log excerpts
	   - Environment information