---
title: AGI Multi-Model API
emoji: 😻
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: apache-2.0
short_description: C++ scalable llm-server
---

# AGI Multi-Model API

A local LLM control plane built around `llama-server` and a C++ manager layer.

The current primary path in this repository is the C++ `llm-manager`:

- runtime model switching
- API-key auth
- token-aware request validation
- per-key rate limiting
- bounded priority queue
- single-worker scheduler
- global and per-request cancel
- queue metrics

The legacy Python application is still present under [`python/`](python/), but the repository is now centered around the C++ manager implementation in [`cpp/`](cpp/).

## Features

- Dynamic model switching through a single manager endpoint.
- One active `llama-server` worker process at a time.
- OpenAI-compatible `POST /v1/chat/completions`.
- Bounded queue with priority by API-key role.
- Token-aware admission based on `messages + max_tokens`.
- Per-key request and estimated-token rate limiting.
- `request_id` propagation in every response.
- Queue metrics at `GET /queue/metrics`.
- Conservative streaming rollout: `stream=true` is explicitly rejected until relay support is implemented.
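The bounded queue with role-based priority can be modeled as follows. This is an illustrative Python sketch, not the actual C++ implementation: the role priorities, the `try_push`/`pop` names, and the mapping of a full queue to `503` are assumptions.

```python
import heapq
import itertools

# Hypothetical model of a bounded, role-priority queue.
# Lower priority value = served first; admin requests jump ahead of users.
ROLE_PRIORITY = {"admin": 0, "user": 1}

class BoundedPriorityQueue:
    def __init__(self, max_size):
        self.max_size = max_size
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break within the same role

    def try_push(self, role, request_id):
        """Admit a request; False means full (caller maps to 503 + Retry-After)."""
        if len(self._heap) >= self.max_size:
            return False
        heapq.heappush(
            self._heap, (ROLE_PRIORITY[role], next(self._counter), request_id)
        )
        return True

    def pop(self):
        """Return the next request id to hand to the scheduler."""
        return heapq.heappop(self._heap)[2]

q = BoundedPriorityQueue(max_size=2)
q.try_push("user", "r1")
q.try_push("admin", "r2")
q.try_push("user", "r3")   # rejected: queue full
next_id = q.pop()          # "r2": admin served before the earlier user request
```

The real manager additionally reserves an admin quota (`queue.admin_quota`) so user traffic cannot starve admin control requests; that detail is omitted here.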
## Current Architecture

```text
C++ LLM Manager Architecture

+----------------------+
|       Clients        |
|----------------------|
| chat requests        |
| cancel requests      |
| admin control        |
| queue metrics        |
+----------+-----------+
           |
           v
+--------------------------------------------------+
|            llm-manager (Boost.Beast)             |
|--------------------------------------------------|
| HTTP Layer                                       |
| - request parsing                                |
| - response writing                               |
| - X-Request-Id injection                         |
| - JSON error responses                           |
+------------------------+-------------------------+
                         |
                         v
+--------------------------------------------------+
|               Auth / Policy Layer                |
|--------------------------------------------------|
| - API key auth                                   |
| - token validation                               |
| - rate limiting                                  |
+------------------------+-------------------------+
                         |
                         v
+--------------------------------------------------+
|             Request Lifecycle Layer              |
|--------------------------------------------------|
| - request registry                               |
| - bounded priority queue                         |
| - single-worker scheduler                        |
| - cancel / timeout handling                      |
| - queue metrics                                  |
+------------------------+-------------------------+
                         |
                         v
+--------------------------------------------------+
|              Backend Control Layer               |
|--------------------------------------------------|
| ModelManager                                     |
| - one active llama-server worker                 |
| - spawn / readiness / switch / restart           |
+------------------------+-------------------------+
                         |
                         v
+-------------------------------+
|  active llama-server worker   |
|-------------------------------|
| one active model at a time    |
| /v1/chat/completions backend  |
| proxied GET/UI routes         |
+-------------------------------+
```

More detail:

- [`docs/CPP_MANAGER_ARCHITECTURE_ASCII.md`](docs/CPP_MANAGER_ARCHITECTURE_ASCII.md)
- [`docs/CPP_MANAGER_SERVER.md`](docs/CPP_MANAGER_SERVER.md)
- [`docs/CPP_MANAGER_IMPLEMENTATION_PLAN.md`](docs/CPP_MANAGER_IMPLEMENTATION_PLAN.md)

## Components

- `llm-manager (Boost.Beast)`: Main C++ HTTP server exposing chat, control, cancel, and queue metrics endpoints.
- `Config Layer`: Loads runtime settings from `config.toml` and environment variables.
- `API Key Auth`: Authenticates bearer tokens and maps them to `admin` or `user` roles.
- `Token Validation`: Estimates prompt size from `messages + max_tokens` and rejects oversized requests early.
- `Rate Limiter`: Applies per-key request and estimated-token budgets.
- `Request Registry`: Tracks lifecycle state for each request id.
- `Priority Queue`: Holds bounded queued work with admin/user separation and fairness.
- `Scheduler`: Forwards one request at a time to the active backend worker.
- `ModelManager`: Owns the `llama-server` process lifecycle and model switching.
- `Queue Metrics`: Exposes queue and runtime telemetry at `GET /queue/metrics`.

## API Endpoints

### Status

- `GET /health`
- `GET /models`

### Chat

- `POST /v1/chat/completions`

### Control

- `POST /switch-model`
- `POST /stop`
- `POST /requests/{request_id}/cancel`

### Metrics

- `GET /queue/metrics`

## Request Flow

`POST /v1/chat/completions` currently goes through:

1. request parsing
2. request id generation
3. API-key auth
4. stream flag check
5. token estimation and validation
6. per-key rate limiting
7. request registry entry creation
8. bounded queue admission
9. scheduler execution
10. backend response delivery

If the queue is full:

- return `503 Service Unavailable`
- include `Retry-After`

If the rate limit is exceeded:

- return `429 Too Many Requests`
- include `Retry-After`

If `stream=true` is requested:

- return `501 Not Implemented`

## Authentication

Requests are authenticated with:

```text
Authorization: Bearer <api-key>
```

Roles:

- `admin`: can call privileged control endpoints
- `user`: can submit chat requests and cancel owned requests

If no API keys are configured, the manager currently stays in a compatibility mode and does not enforce auth.

## Configuration

Example config:

- [`config.toml.example`](config.toml.example)

Current configuration groups:

- `[server]`
- `[worker]`
- `[llama]`
- `[auth]`
- `[limits]`
- `[queue]`
- `[scheduler]`
- `[streaming]`
- `[rate_limit]`
- `[[api_keys]]`

Important runtime settings:

- `queue.max_size`
- `queue.max_tokens`
- `queue.admin_quota`
- `limits.default_max_tokens`
- `limits.max_tokens_per_request`
- `limits.request_timeout_sec`
- `rate_limit.requests_per_minute`
- `rate_limit.estimated_tokens_per_minute`
- `streaming.enabled`

## Quick Start

### Docker

```bash
docker build -t agi-api .
docker run -p 7860:7860 agi-api
```

The manager listens on port `7860` by default.
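Both the queue-full (`503`) and rate-limit (`429`) paths return a `Retry-After` header, so clients should back off and retry. A minimal, hypothetical Python helper; the `send` callable and its `(status, headers, body)` shape are assumptions for illustration, not part of the manager's API:

```python
import time

def call_with_retry(send, max_attempts=3):
    """Retry a request on 429/503, sleeping for the server's Retry-After hint.

    `send` is any zero-argument callable returning (status_code, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in (429, 503):
            return status, body  # success or a non-retryable error
        if attempt + 1 < max_attempts:
            # Honor the server's backoff hint; default to 1 second if absent.
            time.sleep(float(headers.get("Retry-After", "1")))
    return status, body  # retries exhausted

# Example with a stubbed transport: first attempt is shed, second succeeds.
responses = iter([(503, {"Retry-After": "0"}, ""), (200, {}, "ok")])
status, body = call_with_retry(lambda: next(responses))  # → (200, "ok")
```

In real use, `send` would wrap an HTTP call to `POST /v1/chat/completions` with the bearer token set.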
### Example Requests

Health check:

```bash
curl -s http://localhost:7860/health
```

Switch model:

```bash
curl -s -X POST http://localhost:7860/switch-model \
  -H "Authorization: Bearer change-me-admin" \
  -H "Content-Type: application/json" \
  -d '{"model":"QuantFactory/Qwen2.5-7B-Instruct-GGUF:q4_k_m"}'
```

Chat completion:

```bash
curl -s -X POST http://localhost:7860/v1/chat/completions \
  -H "Authorization: Bearer change-me-user" \
  -H "Content-Type: application/json" \
  -d '{
    "messages":[{"role":"user","content":"Hello"}],
    "max_tokens":128,
    "temperature":0.7
  }'
```

Queue metrics:

```bash
curl -s http://localhost:7860/queue/metrics
```

## Project Structure

```text
AGI/
├── cpp/                  # C++ llm-manager control plane
├── python/               # Legacy Python modules
├── docs/                 # Architecture and design documents
├── config.toml.example   # Example runtime configuration
├── Dockerfile            # Container build
├── pyproject.toml        # Python dependencies
└── README.md
```

## Build Notes

The Docker build compiles:

- `llama-server` from `llama.cpp`
- the C++ manager from all files in [`cpp/`](cpp/)

The current container layout also keeps the legacy Python app under `/home/user/python` with `PYTHONPATH` configured accordingly.

## Current Limitations

- Only one backend worker is active at a time.
- Cancelling a running request is best-effort and may restart the active worker.
- Token estimation is still rough, not tokenizer-accurate.
- Streaming relay is not implemented yet.
- The queue is in-memory and single-process only.

## Legacy Python Path

The original Python application is still available under [`python/`](python/), but it is no longer the primary architecture described by this README. If needed, it can still be used as a reference or migration fallback.
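As the limitations note, token estimation is rough rather than tokenizer-accurate. A sketch of what such a heuristic could look like; the ~4-characters-per-token ratio and the `admit` helper are assumptions for illustration, not the manager's actual estimator:

```python
import math

def estimate_tokens(messages, max_tokens):
    """Rough, tokenizer-free estimate: prompt characters / 4, plus the
    requested completion budget. Illustrative only."""
    prompt_chars = sum(len(m.get("content", "")) for m in messages)
    prompt_tokens = math.ceil(prompt_chars / 4)  # crude chars-per-token ratio
    return prompt_tokens + max_tokens

def admit(messages, max_tokens, max_tokens_per_request):
    """Hypothetical early-rejection check against limits.max_tokens_per_request."""
    return estimate_tokens(messages, max_tokens) <= max_tokens_per_request

msgs = [{"role": "user", "content": "Hello"}]
estimate_tokens(msgs, 128)        # → 130 (ceil(5/4) + 128)
admit(msgs, 8192, 4096)           # → False: rejected before queueing
```

An estimate like this errs on both sides of the true count, which is why oversized requests are rejected early rather than billed precisely.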
## Additional Resources

- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [FastAPI Documentation](https://fastapi.tiangolo.com/)
- [Hugging Face Models](https://huggingface.co/models?library=gguf)

## License

Apache 2.0