Dmitry Beresnev
---
title: AGI Multi-Model API
emoji: 😻
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: C++ scalable llm-server
---

AGI Multi-Model API

A local LLM control plane built around llama-server and a C++ manager layer.

The current primary path in this repository is the C++ llm-manager:

  • runtime model switching
  • API-key auth
  • token-aware request validation
  • per-key rate limiting
  • bounded priority queue
  • single-worker scheduler
  • global and per-request cancel
  • queue metrics

The legacy Python application is still present under python/, but the repository is now centered around the C++ manager implementation in cpp/.

Features

  • Dynamic model switching through a single manager endpoint.
  • One active llama-server worker process at a time.
  • OpenAI-compatible POST /v1/chat/completions.
  • Bounded queue with priority by API-key role.
  • Token-aware admission based on messages + max_tokens.
  • Per-key request and estimated-token rate limiting.
  • request_id propagation in every response.
  • Queue metrics at GET /queue/metrics.
  • Conservative streaming rollout: stream=true is explicitly rejected until relay support is implemented.

Current Architecture

                                           C++ LLM Manager Architecture

                                            +----------------------+
                                            |       Clients        |
                                            |----------------------|
                                            | chat requests        |
                                            | cancel requests      |
                                            | admin control        |
                                            | queue metrics        |
                                            +----------+-----------+
                                                       |
                                                       v
+---------------------------------------------------------------------------------------------------+
|                                     llm-manager (Boost.Beast)                                     |
|---------------------------------------------------------------------------------------------------|
| HTTP Layer                                                                                       |
| - request parsing                                                                                |
| - response writing                                                                               |
| - X-Request-Id injection                                                                         |
| - JSON error responses                                                                           |
+--------------------------------------+------------------------------------------------------------+
                                       |
                                       v
+---------------------------------------------------------------------------------------------------+
| Auth / Policy Layer                                                                               |
|---------------------------------------------------------------------------------------------------|
| - API key auth                                                                                   |
| - token validation                                                                               |
| - rate limiting                                                                                  |
+--------------------------------------+------------------------------------------------------------+
                                       |
                                       v
+---------------------------------------------------------------------------------------------------+
| Request Lifecycle Layer                                                                           |
|---------------------------------------------------------------------------------------------------|
| - request registry                                                                               |
| - bounded priority queue                                                                         |
| - single-worker scheduler                                                                        |
| - cancel / timeout handling                                                                      |
| - queue metrics                                                                                  |
+--------------------------------------+------------------------------------------------------------+
                                       |
                                       v
+---------------------------------------------------------------------------------------------------+
| Backend Control Layer                                                                             |
|---------------------------------------------------------------------------------------------------|
| ModelManager                                                                                     |
| - one active llama-server worker                                                                 |
| - spawn / readiness / switch / restart                                                           |
+--------------------------------------+------------------------------------------------------------+
                                       |
                                       v
                              +-------------------------------+
                              |   active llama-server worker  |
                              |-------------------------------|
                              | one active model at a time    |
                              | /v1/chat/completions backend  |
                              | proxied GET/UI routes         |
                              +-------------------------------+

In more detail:

Components

  • llm-manager (Boost.Beast): Main C++ HTTP server exposing chat, control, cancel, and queue metrics endpoints.
  • Config Layer: Loads runtime settings from config.toml and environment variables.
  • API Key Auth: Authenticates bearer tokens and maps them to admin or user.
  • Token Validation: Estimates prompt size from messages + max_tokens and rejects oversized requests early.
  • Rate Limiter: Applies per-key request and estimated-token budgets.
  • Request Registry: Tracks lifecycle state for each request id.
  • Priority Queue: Holds bounded queued work with admin/user separation and fairness.
  • Scheduler: Forwards one request at a time to the active backend worker.
  • ModelManager: Owns llama-server process lifecycle and model switching.
  • Queue Metrics: Exposes queue and runtime telemetry at GET /queue/metrics.
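The token-aware admission described above can be sketched roughly. This is an illustrative model only: the chars-per-token ratio, struct layout, and function names below are assumptions, not the manager's actual estimator (which the README itself notes is rough rather than tokenizer-accurate).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of token-aware admission: estimate prompt tokens
// from message content length, add the requested max_tokens, and reject
// the request early if the total exceeds a per-request cap.
struct Message { std::string role; std::string content; };

int estimate_tokens(const std::vector<Message>& messages, int max_tokens) {
    std::size_t chars = 0;
    for (const auto& m : messages) chars += m.role.size() + m.content.size();
    // ~4 characters per token is a common rough heuristic, not tokenizer-accurate.
    return static_cast<int>(chars / 4) + max_tokens;
}

bool admit(const std::vector<Message>& messages, int max_tokens, int cap) {
    return estimate_tokens(messages, max_tokens) <= cap;
}
```

Estimating before queueing means oversized requests are rejected cheaply, without ever occupying a queue slot or touching the backend worker.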

API Endpoints

Status

  • GET /health
  • GET /models

Chat

  • POST /v1/chat/completions

Control

  • POST /switch-model
  • POST /stop
  • POST /requests/{request_id}/cancel

Metrics

  • GET /queue/metrics

Request Flow

POST /v1/chat/completions currently goes through:

  1. request parsing
  2. request id generation
  3. API-key auth
  4. stream flag check
  5. token estimation and validation
  6. per-key rate limiting
  7. request registry entry creation
  8. bounded queue admission
  9. scheduler execution
  10. backend response delivery
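Steps 8–9 above can be modeled as a bounded two-level queue drained by a single worker. This is an illustrative sketch of the admin/user priority split, not the actual scheduler code; the class and member names are assumptions.

```cpp
#include <cassert>
#include <deque>
#include <optional>
#include <string>

// Illustrative bounded priority queue: admin requests are drained before
// user requests, and admission fails once the shared bound is reached
// (the point at which the manager would answer 503).
class BoundedPriorityQueue {
public:
    explicit BoundedPriorityQueue(std::size_t max_size) : max_size_(max_size) {}

    // Returns false when the queue is full.
    bool push(const std::string& request_id, bool is_admin) {
        if (admin_.size() + user_.size() >= max_size_) return false;
        (is_admin ? admin_ : user_).push_back(request_id);
        return true;
    }

    // The single worker pops one request at a time, admins first.
    std::optional<std::string> pop() {
        auto& q = !admin_.empty() ? admin_ : user_;
        if (q.empty()) return std::nullopt;
        std::string id = q.front();
        q.pop_front();
        return id;
    }

private:
    std::size_t max_size_;
    std::deque<std::string> admin_, user_;
};
```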

If the queue is full:

  • return 503 Service Unavailable
  • include Retry-After

If rate limit is exceeded:

  • return 429 Too Many Requests
  • include Retry-After

If stream=true is requested:

  • return 501 Not Implemented
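The 429 path can be modeled with a per-key fixed-window counter that also yields the Retry-After value. The window length, struct layout, and names below are illustrative assumptions; the manager's real limiter additionally budgets estimated tokens per key.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative per-key fixed-window rate limiter (requests only).
class RateLimiter {
public:
    explicit RateLimiter(int requests_per_minute) : limit_(requests_per_minute) {}

    // now_sec is an injected clock in seconds. Returns 0 if the request
    // is admitted, otherwise the Retry-After value in seconds.
    int check(const std::string& api_key, long now_sec) {
        auto& w = windows_[api_key];
        long window_start = now_sec - now_sec % 60;
        if (w.start != window_start) { w.start = window_start; w.count = 0; }
        if (w.count < limit_) { ++w.count; return 0; }
        return static_cast<int>(w.start + 60 - now_sec);  // seconds until the window resets
    }

private:
    struct Window { long start = -1; int count = 0; };
    int limit_;
    std::map<std::string, Window> windows_;
};
```

Returning the seconds-until-reset directly gives the scheduler a ready-made Retry-After header value for the 429 response.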

Authentication

Requests are authenticated with:

Authorization: Bearer <token>

Roles:

  • admin: can call privileged control endpoints
  • user: can submit chat requests and cancel owned requests

If no API keys are configured, the manager currently stays in a compatibility mode and does not enforce auth.
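A minimal sketch of the role mapping and compatibility mode described above. The function signature and key-store shape are assumptions for illustration, not the manager's actual auth code.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

enum class Role { Admin, User };

// Illustrative bearer-token lookup. With no keys configured, the manager
// stays in compatibility mode and treats every caller as a user.
std::optional<Role> authenticate(
    const std::map<std::string, Role>& api_keys,
    const std::string& authorization_header) {
    if (api_keys.empty()) return Role::User;  // compatibility mode: auth not enforced
    const std::string prefix = "Bearer ";
    if (authorization_header.rfind(prefix, 0) != 0) return std::nullopt;
    auto it = api_keys.find(authorization_header.substr(prefix.size()));
    if (it == api_keys.end()) return std::nullopt;  // would map to a 401 response
    return it->second;
}
```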

Configuration

Runtime settings are loaded from config.toml (see config.toml.example for a full example). Current configuration groups:

  • [server]
  • [worker]
  • [llama]
  • [auth]
  • [limits]
  • [queue]
  • [scheduler]
  • [streaming]
  • [rate_limit]
  • [[api_keys]]

Important runtime settings:

  • queue.max_size
  • queue.max_tokens
  • queue.admin_quota
  • limits.default_max_tokens
  • limits.max_tokens_per_request
  • limits.request_timeout_sec
  • rate_limit.requests_per_minute
  • rate_limit.estimated_tokens_per_minute
  • streaming.enabled
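A hedged sketch of what config.toml might contain, using the groups and setting names listed above. The values, and any keys this README does not name (such as server.port and the fields under [[api_keys]]), are assumptions; config.toml.example in the repository is authoritative.

```toml
[server]
port = 7860                        # assumed key; the manager defaults to 7860

[queue]
max_size = 64
max_tokens = 32768
admin_quota = 8

[limits]
default_max_tokens = 512
max_tokens_per_request = 2048
request_timeout_sec = 120

[rate_limit]
requests_per_minute = 60
estimated_tokens_per_minute = 60000

[streaming]
enabled = false                    # stream=true is rejected with 501 until relay support lands

[[api_keys]]
key = "change-me-admin"            # assumed field names
role = "admin"
```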

Quick Start

Docker

docker build -t agi-api .
docker run -p 7860:7860 agi-api

The manager listens on port 7860 by default.

Example Requests

Health:

curl -s http://localhost:7860/health

Switch model:

curl -s -X POST http://localhost:7860/switch-model \
  -H "Authorization: Bearer change-me-admin" \
  -H "Content-Type: application/json" \
  -d '{"model":"QuantFactory/Qwen2.5-7B-Instruct-GGUF:q4_k_m"}'

Chat completion:

curl -s -X POST http://localhost:7860/v1/chat/completions \
  -H "Authorization: Bearer change-me-user" \
  -H "Content-Type: application/json" \
  -d '{
    "messages":[{"role":"user","content":"Hello"}],
    "max_tokens":128,
    "temperature":0.7
  }'

Queue metrics:

curl -s http://localhost:7860/queue/metrics
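The exact metrics schema is defined by the manager; the field names below are hypothetical, shown only to illustrate the kind of telemetry the endpoint exposes (queue depth and capacity, per-role backlog, the active request).

```json
{
  "queue_depth": 3,
  "queue_max_size": 64,
  "admin_queued": 1,
  "user_queued": 2,
  "active_request_id": "…"
}
```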

Project Structure

AGI/
├── cpp/                      # C++ llm-manager control plane
├── python/                   # Legacy Python modules
├── docs/                     # Architecture and design documents
├── config.toml.example       # Example runtime configuration
├── Dockerfile                # Container build
├── pyproject.toml            # Python dependencies
└── README.md

Build Notes

The Docker build compiles:

  • llama-server from llama.cpp
  • the C++ manager from all files in cpp/

The current container layout also keeps the legacy Python app under /home/user/python with PYTHONPATH configured accordingly.

Current Limitations

  • Only one backend worker is active at a time.
  • Running-request cancel is best-effort and may restart the active worker.
  • Token estimation is still rough, not tokenizer-accurate.
  • Streaming relay is not implemented yet.
  • The queue is in-memory and single-process only.

Legacy Python Path

The original Python application is still available under python/, but it is no longer the primary architecture described by this README.

If needed, it can still be used as a reference or migration fallback.

License

Apache 2.0