---
title: AGI Multi-Model API
emoji: π»
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: C++ scalable llm-server
---
# AGI Multi-Model API
A local LLM control plane built around `llama-server` and a C++ manager layer.

The current primary path in this repository is the C++ `llm-manager`:
- runtime model switching
- API-key auth
- token-aware request validation
- per-key rate limiting
- bounded priority queue
- single-worker scheduler
- global and per-request cancel
- queue metrics
The legacy Python application is still present under `python/`, but the repository is now centered around the C++ manager implementation in `cpp/`.
## Features

- Dynamic model switching through a single manager endpoint.
- One active `llama-server` worker process at a time.
- OpenAI-compatible `POST /v1/chat/completions`.
- Bounded queue with priority by API-key role.
- Token-aware admission based on `messages + max_tokens`.
- Per-key request and estimated-token rate limiting.
- `request_id` propagation in every response.
- Queue metrics at `GET /queue/metrics`.
- Conservative streaming rollout: `stream=true` is explicitly rejected until relay support is implemented.
## Current Architecture

```text
                 C++ LLM Manager Architecture

          +----------------------+
          |       Clients        |
          |----------------------|
          |  chat requests       |
          |  cancel requests     |
          |  admin control       |
          |  queue metrics       |
          +----------+-----------+
                     |
                     v
+---------------------------------------------+
|          llm-manager (Boost.Beast)          |
|---------------------------------------------|
| HTTP Layer                                  |
|  - request parsing                          |
|  - response writing                         |
|  - X-Request-Id injection                   |
|  - JSON error responses                     |
+----------------------+----------------------+
                       |
                       v
+---------------------------------------------+
|             Auth / Policy Layer             |
|---------------------------------------------|
|  - API key auth                             |
|  - token validation                         |
|  - rate limiting                            |
+----------------------+----------------------+
                       |
                       v
+---------------------------------------------+
|           Request Lifecycle Layer           |
|---------------------------------------------|
|  - request registry                         |
|  - bounded priority queue                   |
|  - single-worker scheduler                  |
|  - cancel / timeout handling                |
|  - queue metrics                            |
+----------------------+----------------------+
                       |
                       v
+---------------------------------------------+
|            Backend Control Layer            |
|---------------------------------------------|
| ModelManager                                |
|  - one active llama-server worker           |
|  - spawn / readiness / switch / restart     |
+----------------------+----------------------+
                       |
                       v
          +-------------------------------+
          |  active llama-server worker   |
          |-------------------------------|
          | one active model at a time    |
          | /v1/chat/completions backend  |
          | proxied GET/UI routes         |
          +-------------------------------+
```
More detail:

- `docs/CPP_MANAGER_ARCHITECTURE_ASCII.md`
- `docs/CPP_MANAGER_SERVER.md`
- `docs/CPP_MANAGER_IMPLEMENTATION_PLAN.md`
## Components

- **llm-manager (Boost.Beast)**: Main C++ HTTP server exposing chat, control, cancel, and queue metrics endpoints.
- **Config Layer**: Loads runtime settings from `config.toml` and environment variables.
- **API Key Auth**: Authenticates bearer tokens and maps them to `admin` or `user`.
- **Token Validation**: Estimates prompt size from `messages + max_tokens` and rejects oversized requests early.
- **Rate Limiter**: Applies per-key request and estimated-token budgets.
- **Request Registry**: Tracks lifecycle state for each request id.
- **Priority Queue**: Holds bounded queued work with admin/user separation and fairness.
- **Scheduler**: Forwards one request at a time to the active backend worker.
- **ModelManager**: Owns the `llama-server` process lifecycle and model switching.
- **Queue Metrics**: Exposes queue and runtime telemetry at `GET /queue/metrics`.
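Token-aware admission can be sketched roughly as follows. This is an illustrative assumption, not the manager's actual code in `cpp/`: the names (`Message`, `estimate_tokens`, `admit`) and the ~4-characters-per-token heuristic are made up for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical message shape mirroring the OpenAI-style request body.
struct Message { std::string role; std::string content; };

// Rough estimate: total characters / 4, plus the requested completion budget.
// The README notes the real estimator is "rough, not tokenizer-accurate".
int estimate_tokens(const std::vector<Message>& messages, int max_tokens) {
    std::size_t chars = 0;
    for (const auto& m : messages) chars += m.role.size() + m.content.size();
    return static_cast<int>(chars / 4) + max_tokens;
}

// Reject oversized requests before they ever reach the queue.
bool admit(const std::vector<Message>& messages, int max_tokens,
           int max_tokens_per_request) {
    return estimate_tokens(messages, max_tokens) <= max_tokens_per_request;
}
```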
## API Endpoints

### Status

- `GET /health`
- `GET /models`

### Chat

- `POST /v1/chat/completions`

### Control

- `POST /switch-model`
- `POST /stop`
- `POST /requests/{request_id}/cancel`

### Metrics

- `GET /queue/metrics`
## Request Flow

`POST /v1/chat/completions` currently goes through:
- request parsing
- request id generation
- API-key auth
- stream flag check
- token estimation and validation
- per-key rate limiting
- request registry entry creation
- bounded queue admission
- scheduler execution
- backend response delivery
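The bounded-queue admission step above can be sketched as a two-level queue in which admin work drains before user work and admission fails once shared capacity is reached. The class and method names here are illustrative assumptions, not the manager's actual types.

```cpp
#include <cassert>
#include <deque>
#include <optional>
#include <string>

// Illustrative bounded two-level priority queue (admin before user).
class BoundedPriorityQueue {
public:
    explicit BoundedPriorityQueue(std::size_t max_size) : max_size_(max_size) {}

    // Returns false when the shared capacity is exhausted (maps to 503).
    bool push(const std::string& request_id, bool is_admin) {
        if (admin_.size() + user_.size() >= max_size_) return false;
        (is_admin ? admin_ : user_).push_back(request_id);
        return true;
    }

    // The single worker pops admin requests first, then user requests.
    std::optional<std::string> pop() {
        auto& q = !admin_.empty() ? admin_ : user_;
        if (q.empty()) return std::nullopt;
        std::string id = q.front();
        q.pop_front();
        return id;
    }

private:
    std::size_t max_size_;
    std::deque<std::string> admin_, user_;
};
```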
If the queue is full:

- return `503 Service Unavailable`
- include `Retry-After`

If the rate limit is exceeded:

- return `429 Too Many Requests`
- include `Retry-After`

If `stream=true` is requested:

- return `501 Not Implemented`
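The per-key request-rate check that produces the 429 response can be sketched as a fixed-window counter. This is an assumption for illustration only: the real limiter also budgets estimated tokens, and its structure may differ.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Illustrative fixed-window per-key limiter (requests per minute).
class RateLimiter {
public:
    explicit RateLimiter(int requests_per_minute) : limit_(requests_per_minute) {}

    // now_min is the current minute index (e.g. unix_time / 60).
    // Returns false when the key's budget for this minute is spent (maps to 429).
    bool allow(const std::string& api_key, long now_min) {
        auto& w = windows_[api_key];
        if (w.minute != now_min) { w.minute = now_min; w.count = 0; }
        return ++w.count <= limit_;
    }

private:
    struct Window { long minute = -1; int count = 0; };
    int limit_;
    std::unordered_map<std::string, Window> windows_;
};
```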
## Authentication

Requests are authenticated with the `Authorization: Bearer <token>` header.

Roles:

- `admin`: can call privileged control endpoints
- `user`: can submit chat requests and cancel owned requests

If no API keys are configured, the manager currently stays in a compatibility mode and does not enforce auth.
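Extracting the bearer token from the header can be sketched as below; the exact parsing in `cpp/` may differ, and `bearer_token` is a hypothetical name.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Returns the token after "Bearer ", or nullopt for a missing/empty
// token (which the auth layer would reject).
std::optional<std::string> bearer_token(const std::string& header) {
    const std::string prefix = "Bearer ";
    if (header.rfind(prefix, 0) != 0 || header.size() == prefix.size())
        return std::nullopt;
    return header.substr(prefix.size());
}
```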
## Configuration
Current configuration groups:

`[server]`, `[worker]`, `[llama]`, `[auth]`, `[limits]`, `[queue]`, `[scheduler]`, `[streaming]`, `[rate_limit]`, `[[api_keys]]`
Important runtime settings:

- `queue.max_size`
- `queue.max_tokens`
- `queue.admin_quota`
- `limits.default_max_tokens`
- `limits.max_tokens_per_request`
- `limits.request_timeout_sec`
- `rate_limit.requests_per_minute`
- `rate_limit.estimated_tokens_per_minute`
- `streaming.enabled`
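The groups and settings above might look like the following illustrative `config.toml` sketch. All values (and any keys not listed above, such as `port`, `key`, and `role`) are assumptions; see `config.toml.example` for the real file.

```toml
[server]
port = 7860

[queue]
max_size = 32
max_tokens = 65536
admin_quota = 4

[limits]
default_max_tokens = 256
max_tokens_per_request = 4096
request_timeout_sec = 120

[rate_limit]
requests_per_minute = 60
estimated_tokens_per_minute = 100000

[streaming]
enabled = false

[[api_keys]]
key = "change-me-admin"
role = "admin"
```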
## Quick Start

### Docker

```shell
docker build -t agi-api .
docker run -p 7860:7860 agi-api
```

The manager listens on port 7860 by default.
## Example Requests

Health:

```shell
curl -s http://localhost:7860/health
```

Switch model:

```shell
curl -s -X POST http://localhost:7860/switch-model \
  -H "Authorization: Bearer change-me-admin" \
  -H "Content-Type: application/json" \
  -d '{"model":"QuantFactory/Qwen2.5-7B-Instruct-GGUF:q4_k_m"}'
```

Chat completion:

```shell
curl -s -X POST http://localhost:7860/v1/chat/completions \
  -H "Authorization: Bearer change-me-user" \
  -H "Content-Type: application/json" \
  -d '{
        "messages":[{"role":"user","content":"Hello"}],
        "max_tokens":128,
        "temperature":0.7
      }'
```

Queue metrics:

```shell
curl -s http://localhost:7860/queue/metrics
```
## Project Structure

```text
AGI/
├── cpp/                  # C++ llm-manager control plane
├── python/               # Legacy Python modules
├── docs/                 # Architecture and design documents
├── config.toml.example   # Example runtime configuration
├── Dockerfile            # Container build
├── pyproject.toml        # Python dependencies
└── README.md
```
## Build Notes

The Docker build compiles:

- `llama-server` from `llama.cpp`
- the C++ manager from all files in `cpp/`

The current container layout also keeps the legacy Python app under `/home/user/python`, with `PYTHONPATH` configured accordingly.
## Current Limitations
- Only one backend worker is active at a time.
- Running-request cancel is best-effort and may restart the active worker.
- Token estimation is still rough, not tokenizer-accurate.
- Streaming relay is not implemented yet.
- The queue is in-memory and single-process only.
## Legacy Python Path

The original Python application is still available under `python/`, but it is no longer the primary architecture described by this README.
If needed, it can still be used as a reference or migration fallback.
## Additional Resources
## License
Apache 2.0