---
title: AGI Multi-Model API
emoji: π»
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: C++ scalable llm-server
---
# AGI Multi-Model API
A local LLM control plane built around `llama-server` and a C++ manager layer.

The current primary path in this repository is the C++ `llm-manager`:
- runtime model switching
- API-key auth
- token-aware request validation
- per-key rate limiting
- bounded priority queue
- single-worker scheduler
- global and per-request cancel
- queue metrics
The legacy Python application is still present under `python/`, but the repository is now centered around the C++ manager implementation in `cpp/`.
## Features

- Dynamic model switching through a single manager endpoint.
- One active `llama-server` worker process at a time.
- OpenAI-compatible `POST /v1/chat/completions`.
- Bounded queue with priority by API-key role.
- Token-aware admission based on `messages + max_tokens`.
- Per-key request and estimated-token rate limiting.
- `request_id` propagation in every response.
- Queue metrics at `GET /queue/metrics`.
- Conservative streaming rollout: `stream=true` is explicitly rejected until relay support is implemented.
## Current Architecture

```text
                 C++ LLM Manager Architecture

          +----------------------+
          |       Clients        |
          |----------------------|
          |  chat requests       |
          |  cancel requests     |
          |  admin control       |
          |  queue metrics       |
          +----------+-----------+
                     |
                     v
+---------------------------------------------+
|          llm-manager (Boost.Beast)          |
|---------------------------------------------|
| HTTP Layer                                  |
|  - request parsing                          |
|  - response writing                         |
|  - X-Request-Id injection                   |
|  - JSON error responses                     |
+----------------------+----------------------+
                       |
                       v
+---------------------------------------------+
|             Auth / Policy Layer             |
|---------------------------------------------|
|  - API key auth                             |
|  - token validation                         |
|  - rate limiting                            |
+----------------------+----------------------+
                       |
                       v
+---------------------------------------------+
|           Request Lifecycle Layer           |
|---------------------------------------------|
|  - request registry                         |
|  - bounded priority queue                   |
|  - single-worker scheduler                  |
|  - cancel / timeout handling                |
|  - queue metrics                            |
+----------------------+----------------------+
                       |
                       v
+---------------------------------------------+
|            Backend Control Layer            |
|---------------------------------------------|
| ModelManager                                |
|  - one active llama-server worker           |
|  - spawn / readiness / switch / restart     |
+----------------------+----------------------+
                       |
                       v
          +-------------------------------+
          |  active llama-server worker   |
          |-------------------------------|
          | one active model at a time    |
          | /v1/chat/completions backend  |
          | proxied GET/UI routes         |
          +-------------------------------+
```
More detail:

- `docs/CPP_MANAGER_ARCHITECTURE_ASCII.md`
- `docs/CPP_MANAGER_SERVER.md`
- `docs/CPP_MANAGER_IMPLEMENTATION_PLAN.md`
## Components

- **llm-manager (Boost.Beast)**: Main C++ HTTP server exposing chat, control, cancel, and queue metrics endpoints.
- **Config Layer**: Loads runtime settings from `config.toml` and environment variables.
- **API Key Auth**: Authenticates bearer tokens and maps them to `admin` or `user`.
- **Token Validation**: Estimates prompt size from `messages + max_tokens` and rejects oversized requests early.
- **Rate Limiter**: Applies per-key request and estimated-token budgets.
- **Request Registry**: Tracks lifecycle state for each request id.
- **Priority Queue**: Holds bounded queued work with admin/user separation and fairness.
- **Scheduler**: Forwards one request at a time to the active backend worker.
- **ModelManager**: Owns the `llama-server` process lifecycle and model switching.
- **Queue Metrics**: Exposes queue and runtime telemetry at `GET /queue/metrics`.
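Token-aware admission can be sketched roughly as follows. This is an illustrative assumption, not the manager's actual code in `cpp/`: the names (`Message`, `estimate_tokens`, `admit`) and the ~4-characters-per-token heuristic are made up for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical message shape mirroring the OpenAI-style request body.
struct Message { std::string role; std::string content; };

// Rough estimate: total characters / 4, plus the requested completion budget.
// The README notes the real estimator is "rough, not tokenizer-accurate".
int estimate_tokens(const std::vector<Message>& messages, int max_tokens) {
    std::size_t chars = 0;
    for (const auto& m : messages) chars += m.role.size() + m.content.size();
    return static_cast<int>(chars / 4) + max_tokens;
}

// Reject oversized requests before they ever reach the queue.
bool admit(const std::vector<Message>& messages, int max_tokens,
           int max_tokens_per_request) {
    return estimate_tokens(messages, max_tokens) <= max_tokens_per_request;
}
```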
## API Endpoints

### Status

- `GET /health`
- `GET /models`

### Chat

- `POST /v1/chat/completions`

### Control

- `POST /switch-model`
- `POST /stop`
- `POST /requests/{request_id}/cancel`

### Metrics

- `GET /queue/metrics`
## Request Flow

`POST /v1/chat/completions` currently goes through:
- request parsing
- request id generation
- API-key auth
- stream flag check
- token estimation and validation
- per-key rate limiting
- request registry entry creation
- bounded queue admission
- scheduler execution
- backend response delivery
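The bounded-queue admission step above can be sketched as a two-level queue in which admin work drains before user work and admission fails once shared capacity is reached. The class and method names here are illustrative assumptions, not the manager's actual types.

```cpp
#include <cassert>
#include <deque>
#include <optional>
#include <string>

// Illustrative bounded two-level priority queue (admin before user).
class BoundedPriorityQueue {
public:
    explicit BoundedPriorityQueue(std::size_t max_size) : max_size_(max_size) {}

    // Returns false when the shared capacity is exhausted (maps to 503).
    bool push(const std::string& request_id, bool is_admin) {
        if (admin_.size() + user_.size() >= max_size_) return false;
        (is_admin ? admin_ : user_).push_back(request_id);
        return true;
    }

    // The single worker pops admin requests first, then user requests.
    std::optional<std::string> pop() {
        auto& q = !admin_.empty() ? admin_ : user_;
        if (q.empty()) return std::nullopt;
        std::string id = q.front();
        q.pop_front();
        return id;
    }

private:
    std::size_t max_size_;
    std::deque<std::string> admin_, user_;
};
```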
If the queue is full:

- return `503 Service Unavailable`
- include `Retry-After`

If the rate limit is exceeded:

- return `429 Too Many Requests`
- include `Retry-After`

If `stream=true` is requested:

- return `501 Not Implemented`
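The per-key request-rate check that produces the 429 response can be sketched as a fixed-window counter. This is an assumption for illustration only: the real limiter also budgets estimated tokens, and its structure may differ.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Illustrative fixed-window per-key limiter (requests per minute).
class RateLimiter {
public:
    explicit RateLimiter(int requests_per_minute) : limit_(requests_per_minute) {}

    // now_min is the current minute index (e.g. unix_time / 60).
    // Returns false when the key's budget for this minute is spent (maps to 429).
    bool allow(const std::string& api_key, long now_min) {
        auto& w = windows_[api_key];
        if (w.minute != now_min) { w.minute = now_min; w.count = 0; }
        return ++w.count <= limit_;
    }

private:
    struct Window { long minute = -1; int count = 0; };
    int limit_;
    std::unordered_map<std::string, Window> windows_;
};
```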
## Authentication

Requests are authenticated with the `Authorization: Bearer <token>` header.

Roles:

- `admin`: can call privileged control endpoints
- `user`: can submit chat requests and cancel owned requests

If no API keys are configured, the manager currently stays in a compatibility mode and does not enforce auth.
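Extracting the bearer token from the header can be sketched as below; the exact parsing in `cpp/` may differ, and `bearer_token` is a hypothetical name.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Returns the token after "Bearer ", or nullopt for a missing/empty
// token (which the auth layer would reject).
std::optional<std::string> bearer_token(const std::string& header) {
    const std::string prefix = "Bearer ";
    if (header.rfind(prefix, 0) != 0 || header.size() == prefix.size())
        return std::nullopt;
    return header.substr(prefix.size());
}
```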
## Configuration
Current configuration groups:

`[server]`, `[worker]`, `[llama]`, `[auth]`, `[limits]`, `[queue]`, `[scheduler]`, `[streaming]`, `[rate_limit]`, `[[api_keys]]`
Important runtime settings:

- `queue.max_size`
- `queue.max_tokens`
- `queue.admin_quota`
- `limits.default_max_tokens`
- `limits.max_tokens_per_request`
- `limits.request_timeout_sec`
- `rate_limit.requests_per_minute`
- `rate_limit.estimated_tokens_per_minute`
- `streaming.enabled`
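The groups and settings above might look like the following illustrative `config.toml` sketch. All values (and any keys not listed above, such as `port`, `key`, and `role`) are assumptions; see `config.toml.example` for the real file.

```toml
[server]
port = 7860

[queue]
max_size = 32
max_tokens = 65536
admin_quota = 4

[limits]
default_max_tokens = 256
max_tokens_per_request = 4096
request_timeout_sec = 120

[rate_limit]
requests_per_minute = 60
estimated_tokens_per_minute = 100000

[streaming]
enabled = false

[[api_keys]]
key = "change-me-admin"
role = "admin"
```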
## Quick Start

### Docker

```shell
docker build -t agi-api .
docker run -p 7860:7860 agi-api
```

The manager listens on port 7860 by default.
## Example Requests

Health:

```shell
curl -s http://localhost:7860/health
```

Switch model:

```shell
curl -s -X POST http://localhost:7860/switch-model \
  -H "Authorization: Bearer change-me-admin" \
  -H "Content-Type: application/json" \
  -d '{"model":"QuantFactory/Qwen2.5-7B-Instruct-GGUF:q4_k_m"}'
```

Chat completion:

```shell
curl -s -X POST http://localhost:7860/v1/chat/completions \
  -H "Authorization: Bearer change-me-user" \
  -H "Content-Type: application/json" \
  -d '{
        "messages":[{"role":"user","content":"Hello"}],
        "max_tokens":128,
        "temperature":0.7
      }'
```

Queue metrics:

```shell
curl -s http://localhost:7860/queue/metrics
```
## Project Structure

```text
AGI/
├── cpp/                  # C++ llm-manager control plane
├── python/               # Legacy Python modules
├── docs/                 # Architecture and design documents
├── config.toml.example   # Example runtime configuration
├── Dockerfile            # Container build
├── pyproject.toml        # Python dependencies
└── README.md
```
## Build Notes

The Docker build compiles:

- `llama-server` from `llama.cpp`
- the C++ manager from all files in `cpp/`

The current container layout also keeps the legacy Python app under `/home/user/python`, with `PYTHONPATH` configured accordingly.
## Current Limitations
- Only one backend worker is active at a time.
- Running-request cancel is best-effort and may restart the active worker.
- Token estimation is still rough, not tokenizer-accurate.
- Streaming relay is not implemented yet.
- The queue is in-memory and single-process only.
## Legacy Python Path

The original Python application is still available under `python/`, but it is no longer the primary architecture described by this README.
If needed, it can still be used as a reference or migration fallback.
## Additional Resources
## License
Apache 2.0