---
title: AGI Multi-Model API
emoji: 😻
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: C++ scalable llm-server
---

# AGI Multi-Model API

A local LLM control plane built around `llama-server` and a C++ manager layer.

The current primary path in this repository is the C++ `llm-manager`:

- runtime model switching
- API-key auth
- token-aware request validation
- per-key rate limiting
- bounded priority queue
- single-worker scheduler
- global and per-request cancel
- queue metrics

The legacy Python application is still present under [`python/`](python/), but the repository is now centered around the C++ manager implementation in [`cpp/`](cpp/).

## Features

- Dynamic model switching through a single manager endpoint.
- One active `llama-server` worker process at a time.
- OpenAI-compatible `POST /v1/chat/completions`.
- Bounded queue with priority by API-key role.
- Token-aware admission based on `messages + max_tokens`.
- Per-key request and estimated-token rate limiting.
- `request_id` propagation in every response.
- Queue metrics at `GET /queue/metrics`.
- Conservative streaming rollout: `stream=true` is explicitly rejected until relay support is implemented.
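The bounded queue with role-based priority can be modeled as follows. This is an illustrative Python sketch, not the actual C++ implementation: the role priorities, the `try_push`/`pop` names, and the mapping of a full queue to `503` are assumptions.

```python
import heapq
import itertools

# Hypothetical model of a bounded, role-priority queue.
# Lower priority value = served first; admin requests jump ahead of users.
ROLE_PRIORITY = {"admin": 0, "user": 1}

class BoundedPriorityQueue:
    def __init__(self, max_size):
        self.max_size = max_size
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break within the same role

    def try_push(self, role, request_id):
        """Admit a request; False means full (caller maps to 503 + Retry-After)."""
        if len(self._heap) >= self.max_size:
            return False
        heapq.heappush(
            self._heap, (ROLE_PRIORITY[role], next(self._counter), request_id)
        )
        return True

    def pop(self):
        """Return the next request id to hand to the scheduler."""
        return heapq.heappop(self._heap)[2]

q = BoundedPriorityQueue(max_size=2)
q.try_push("user", "r1")
q.try_push("admin", "r2")
q.try_push("user", "r3")   # rejected: queue full
next_id = q.pop()          # "r2": admin served before the earlier user request
```

The real manager additionally reserves an admin quota (`queue.admin_quota`) so user traffic cannot starve admin control requests; that detail is omitted here.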
## Current Architecture

```text
C++ LLM Manager Architecture

+----------------------+
|       Clients        |
|----------------------|
| chat requests        |
| cancel requests      |
| admin control        |
| queue metrics        |
+----------+-----------+
           |
           v
+--------------------------------------------------+
|            llm-manager (Boost.Beast)             |
|--------------------------------------------------|
| HTTP Layer                                       |
| - request parsing                                |
| - response writing                               |
| - X-Request-Id injection                         |
| - JSON error responses                           |
+------------------------+-------------------------+
                         |
                         v
+--------------------------------------------------+
|               Auth / Policy Layer                |
|--------------------------------------------------|
| - API key auth                                   |
| - token validation                               |
| - rate limiting                                  |
+------------------------+-------------------------+
                         |
                         v
+--------------------------------------------------+
|             Request Lifecycle Layer              |
|--------------------------------------------------|
| - request registry                               |
| - bounded priority queue                         |
| - single-worker scheduler                        |
| - cancel / timeout handling                      |
| - queue metrics                                  |
+------------------------+-------------------------+
                         |
                         v
+--------------------------------------------------+
|              Backend Control Layer               |
|--------------------------------------------------|
| ModelManager                                     |
| - one active llama-server worker                 |
| - spawn / readiness / switch / restart           |
+------------------------+-------------------------+
                         |
                         v
+-------------------------------+
|  active llama-server worker   |
|-------------------------------|
| one active model at a time    |
| /v1/chat/completions backend  |
| proxied GET/UI routes         |
+-------------------------------+
```

More detail:

- [`docs/CPP_MANAGER_ARCHITECTURE_ASCII.md`](docs/CPP_MANAGER_ARCHITECTURE_ASCII.md)
- [`docs/CPP_MANAGER_SERVER.md`](docs/CPP_MANAGER_SERVER.md)
- [`docs/CPP_MANAGER_IMPLEMENTATION_PLAN.md`](docs/CPP_MANAGER_IMPLEMENTATION_PLAN.md)

## Components

- `llm-manager (Boost.Beast)`: Main C++ HTTP server exposing chat, control, cancel, and queue metrics endpoints.
- `Config Layer`: Loads runtime settings from `config.toml` and environment variables.
- `API Key Auth`: Authenticates bearer tokens and maps them to `admin` or `user` roles.
- `Token Validation`: Estimates prompt size from `messages + max_tokens` and rejects oversized requests early.
- `Rate Limiter`: Applies per-key request and estimated-token budgets.
- `Request Registry`: Tracks lifecycle state for each request id.
- `Priority Queue`: Holds bounded queued work with admin/user separation and fairness.
- `Scheduler`: Forwards one request at a time to the active backend worker.
- `ModelManager`: Owns the `llama-server` process lifecycle and model switching.
- `Queue Metrics`: Exposes queue and runtime telemetry at `GET /queue/metrics`.

## API Endpoints

### Status

- `GET /health`
- `GET /models`

### Chat

- `POST /v1/chat/completions`

### Control

- `POST /switch-model`
- `POST /stop`
- `POST /requests/{request_id}/cancel`

### Metrics

- `GET /queue/metrics`

## Request Flow

`POST /v1/chat/completions` currently goes through:

1. request parsing
2. request id generation
3. API-key auth
4. stream flag check
5. token estimation and validation
6. per-key rate limiting
7. request registry entry creation
8. bounded queue admission
9. scheduler execution
10. backend response delivery

If the queue is full:

- return `503 Service Unavailable`
- include `Retry-After`

If the rate limit is exceeded:

- return `429 Too Many Requests`
- include `Retry-After`

If `stream=true` is requested:

- return `501 Not Implemented`

## Authentication

Requests are authenticated with:

```text
Authorization: Bearer <api-key>
```

Roles:

- `admin`: can call privileged control endpoints
- `user`: can submit chat requests and cancel owned requests

If no API keys are configured, the manager currently stays in a compatibility mode and does not enforce auth.

## Configuration

Example config:

- [`config.toml.example`](config.toml.example)

Current configuration groups:

- `[server]`
- `[worker]`
- `[llama]`
- `[auth]`
- `[limits]`
- `[queue]`
- `[scheduler]`
- `[streaming]`
- `[rate_limit]`
- `[[api_keys]]`

Important runtime settings:

- `queue.max_size`
- `queue.max_tokens`
- `queue.admin_quota`
- `limits.default_max_tokens`
- `limits.max_tokens_per_request`
- `limits.request_timeout_sec`
- `rate_limit.requests_per_minute`
- `rate_limit.estimated_tokens_per_minute`
- `streaming.enabled`

## Quick Start

### Docker

```bash
docker build -t agi-api .
docker run -p 7860:7860 agi-api
```

The manager listens on port `7860` by default.
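Both the queue-full (`503`) and rate-limit (`429`) paths return a `Retry-After` header, so clients should back off and retry. A minimal, hypothetical Python helper; the `send` callable and its `(status, headers, body)` shape are assumptions for illustration, not part of the manager's API:

```python
import time

def call_with_retry(send, max_attempts=3):
    """Retry a request on 429/503, sleeping for the server's Retry-After hint.

    `send` is any zero-argument callable returning (status_code, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in (429, 503):
            return status, body  # success or a non-retryable error
        if attempt + 1 < max_attempts:
            # Honor the server's backoff hint; default to 1 second if absent.
            time.sleep(float(headers.get("Retry-After", "1")))
    return status, body  # retries exhausted

# Example with a stubbed transport: first attempt is shed, second succeeds.
responses = iter([(503, {"Retry-After": "0"}, ""), (200, {}, "ok")])
status, body = call_with_retry(lambda: next(responses))  # → (200, "ok")
```

In real use, `send` would wrap an HTTP call to `POST /v1/chat/completions` with the bearer token set.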
### Example Requests

Health check:

```bash
curl -s http://localhost:7860/health
```

Switch model:

```bash
curl -s -X POST http://localhost:7860/switch-model \
  -H "Authorization: Bearer change-me-admin" \
  -H "Content-Type: application/json" \
  -d '{"model":"QuantFactory/Qwen2.5-7B-Instruct-GGUF:q4_k_m"}'
```

Chat completion:

```bash
curl -s -X POST http://localhost:7860/v1/chat/completions \
  -H "Authorization: Bearer change-me-user" \
  -H "Content-Type: application/json" \
  -d '{
    "messages":[{"role":"user","content":"Hello"}],
    "max_tokens":128,
    "temperature":0.7
  }'
```

Queue metrics:

```bash
curl -s http://localhost:7860/queue/metrics
```

## Project Structure

```text
AGI/
├── cpp/                  # C++ llm-manager control plane
├── python/               # Legacy Python modules
├── docs/                 # Architecture and design documents
├── config.toml.example   # Example runtime configuration
├── Dockerfile            # Container build
├── pyproject.toml        # Python dependencies
└── README.md
```

## Build Notes

The Docker build compiles:

- `llama-server` from `llama.cpp`
- the C++ manager from all files in [`cpp/`](cpp/)

The current container layout also keeps the legacy Python app under `/home/user/python` with `PYTHONPATH` configured accordingly.

## Current Limitations

- Only one backend worker is active at a time.
- Cancelling a running request is best-effort and may restart the active worker.
- Token estimation is still rough, not tokenizer-accurate.
- Streaming relay is not implemented yet.
- The queue is in-memory and single-process only.

## Legacy Python Path

The original Python application is still available under [`python/`](python/), but it is no longer the primary architecture described by this README. If needed, it can still be used as a reference or migration fallback.
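As the limitations note, token estimation is rough rather than tokenizer-accurate. A sketch of what such a heuristic could look like; the ~4-characters-per-token ratio and the `admit` helper are assumptions for illustration, not the manager's actual estimator:

```python
import math

def estimate_tokens(messages, max_tokens):
    """Rough, tokenizer-free estimate: prompt characters / 4, plus the
    requested completion budget. Illustrative only."""
    prompt_chars = sum(len(m.get("content", "")) for m in messages)
    prompt_tokens = math.ceil(prompt_chars / 4)  # crude chars-per-token ratio
    return prompt_tokens + max_tokens

def admit(messages, max_tokens, max_tokens_per_request):
    """Hypothetical early-rejection check against limits.max_tokens_per_request."""
    return estimate_tokens(messages, max_tokens) <= max_tokens_per_request

msgs = [{"role": "user", "content": "Hello"}]
estimate_tokens(msgs, 128)        # → 130 (ceil(5/4) + 128)
admit(msgs, 8192, 4096)           # → False: rejected before queueing
```

An estimate like this errs on both sides of the true count, which is why oversized requests are rejected early rather than billed precisely.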
## Additional Resources

- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [Hugging Face Models](https://huggingface.co/models?library=gguf)

## License

Apache 2.0