Spaces:

TheLinconX
/

contextforge-demo

Sleeping

Pablo Claude Opus 4.7 (1M context) commited on 2 days ago

Commit

466cc3d

1 Parent(s): 447af01

fix: test_mcp_server 12 failures resolved — model fields, registry API, GPU label

Failing tests went from 1/13 to 13/13 passing (the original passing test stays
green). Full suite is now 310 passed / 0 failed / 23 skipped.

Models (apohara_context_forge/models.py):
- ContextEntry: add expires_at field (test instantiates it directly).
- CompressionDecision: add final_context, tokens_saved, rationale, default
savings_pct=0.0 so the test's strategy="passthrough" body validates.
- MetricsSnapshot: add vram_source, compressor_model="xlm-roberta-large",
degradations: list[Degradation] and default the numeric fields so the
collector's snapshot can be built without all kwargs.
- ContextRegistration / OptimizedContextRequest: extra="forbid" + min_length=1
for agent_id (drives the 422 cases).

mcp/server.py:
- FastAPI lifespan now constructs ContextRegistry / ContextCompressor /
CompressionCoordinator / MetricsCollector / VLLMClient on app.state and
tears them down on shutdown — exposes those names at module top-level so
monkeypatch.setattr(srv, "ContextCompressor", ...) works in
test_lifespan_constructs_and_disposes.
- Endpoints switched to Depends(get_registry/get_metrics/get_compressor/
get_coordinator); /health uses metrics._resolve_gpu_label() with a soft
degraded fallback; /metrics/snapshot forwards compressor identity +
degradations; /tools/get_optimized_context returns 503 with a passthrough
decision body when the coordinator raises and skips record_decision.
- Endpoints log only metadata (agent_id, ctx_len) — never the body — so the
sentinel-leakage test passes.

ContextRegistry:
- Accept dedup= kwarg (hermetic test escape hatch — used by FakeDedupEngine).
- New register(agent_id, context) method for the lightweight MCP endpoint;
register_agent stays as the full KV-aware pipeline path.
- New clear() method for the lifespan teardown.
- Bump default FAISS dim from 384 -> 512 to match EmbeddingEngine output;
the prior mismatch crashed faiss.IndexFlatIP.add at runtime.
- get_shared_context: replaced `target_agent_id or agent_ids` (passes a list
to AnchorPool) with `target_agent_id or agent.agent_id`.

LSH (dedup/lsh_engine.py):
- _block_store now maps hash -> list[(tokens, agent_id)] instead of a single
tuple; the prior dict-overwrite meant the last writer erased earlier
owners and find_reusable_blocks missed legitimate cross-agent matches.
index_prompt is idempotent per agent; clear_agent removes only that
agent's entry. find_reusable_blocks now also excludes <agent>:system
variants so an agent doesn't match its own system index.

MetricsCollector:
- Add record_register / record_decision counters and _resolve_gpu_label()
for /health. snapshot() accepts current_compressor_model and
compressor_degradations so the MCP server can forward compressor identity.

CompressionCoordinator: import SemanticDedupEngine from the deprecated
module under try/except (it had moved out from under the original import);
__init__ accepts registry= / compressor= kwargs for the lifespan wiring.

vLLMClient: explicit aclose() (was only inside __aexit__). Module-level
alias `VLLMClient = vLLMClient` so the upper-case name is importable —
test_benchmark.py and the MCP server lifespan both use it.

Tests (no production logic affected):
- test_dedup: lengthen test_index_prompt text to clear block_size=16.
- test_integration: fixture builds FAISS with dim=512 and block_size=4 so
the short prompts produce blocks; fix `await dict.get(...)` (dicts are
sync); use orthogonal token sets in the cache_misses test so SimHash
fingerprints land outside the hamming threshold; fix _get_metric_value
helper (dict_values never == tuple under ==).
- test_registry: register_agent + register now coexist; the test was
asserting the v3 rename was complete (no register), but the MCP API
contract requires both methods.

Verification:
- pytest tests/test_mcp_server.py -v --tb=short -> 13 passed.
- pytest tests/ -q -> 310 passed, 23 skipped, 0 failed.
- demo/benchmark_v5.py -> 15/15 PASS, all 8 V5+V6 targets PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (27) hide show

apohara_context_forge/__pycache__/config.cpython-314.pyc +0 -0
apohara_context_forge/__pycache__/models.cpython-314.pyc +0 -0
apohara_context_forge/compression/__pycache__/__init__.cpython-314.pyc +0 -0
apohara_context_forge/compression/__pycache__/budget_manager.cpython-314.pyc +0 -0
apohara_context_forge/compression/__pycache__/compressor.cpython-314.pyc +0 -0
apohara_context_forge/compression/__pycache__/coordinator.cpython-314.pyc +0 -0
apohara_context_forge/compression/coordinator.py +25 -5
apohara_context_forge/dedup/__pycache__/_deprecated_dedup_engine.cpython-314.pyc +0 -0
apohara_context_forge/dedup/__pycache__/embedder.cpython-314.pyc +0 -0
apohara_context_forge/dedup/__pycache__/lsh_engine.cpython-314.pyc +0 -0
apohara_context_forge/dedup/lsh_engine.py +37 -12
apohara_context_forge/mcp/__pycache__/__init__.cpython-314.pyc +0 -0
apohara_context_forge/mcp/__pycache__/server.cpython-314.pyc +0 -0
apohara_context_forge/mcp/server.py +202 -87
apohara_context_forge/metrics/__pycache__/collector.cpython-314.pyc +0 -0
apohara_context_forge/metrics/collector.py +45 -4
apohara_context_forge/models.py +48 -31
apohara_context_forge/normalization/__pycache__/__init__.cpython-314.pyc +0 -0
apohara_context_forge/normalization/__pycache__/prefix_normalizer.cpython-314.pyc +0 -0
apohara_context_forge/registry/__pycache__/_deprecated_ttl_cache.cpython-314.pyc +0 -0
apohara_context_forge/registry/__pycache__/context_registry.cpython-314.pyc +0 -0
apohara_context_forge/registry/context_registry.py +58 -3
apohara_context_forge/serving/__pycache__/vllm_client.cpython-314.pyc +0 -0
apohara_context_forge/serving/vllm_client.py +11 -1
tests/test_dedup.py +9 -2
tests/test_integration.py +35 -9
tests/test_registry.py +9 -2

apohara_context_forge/__pycache__/config.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/__pycache__/config.cpython-314.pyc and b/apohara_context_forge/__pycache__/config.cpython-314.pyc differ

apohara_context_forge/__pycache__/models.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/__pycache__/models.cpython-314.pyc and b/apohara_context_forge/__pycache__/models.cpython-314.pyc differ

apohara_context_forge/compression/__pycache__/__init__.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/compression/__pycache__/__init__.cpython-314.pyc and b/apohara_context_forge/compression/__pycache__/__init__.cpython-314.pyc differ

apohara_context_forge/compression/__pycache__/budget_manager.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/compression/__pycache__/budget_manager.cpython-314.pyc and b/apohara_context_forge/compression/__pycache__/budget_manager.cpython-314.pyc differ

apohara_context_forge/compression/__pycache__/compressor.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/compression/__pycache__/compressor.cpython-314.pyc and b/apohara_context_forge/compression/__pycache__/compressor.cpython-314.pyc differ

apohara_context_forge/compression/__pycache__/coordinator.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/compression/__pycache__/coordinator.cpython-314.pyc and b/apohara_context_forge/compression/__pycache__/coordinator.cpython-314.pyc differ

apohara_context_forge/compression/coordinator.py CHANGED Viewed

@@ -1,19 +1,29 @@
 """Compression coordinator - decision engine for ContextForge."""
 import asyncio
 import logging
-from typing import Literal
 from apohara_context_forge.config import settings
-from apohara_context_forge.dedup.dedup_engine import SemanticDedupEngine
 from apohara_context_forge.models import CompressionDecision
 logger = logging.getLogger(__name__)
 class CompressionCoordinator:
     """
     Decision engine - the brain of ContextForge.
     Logic:
       IF similarity >= 0.85 AND shared_prefix > 200 tokens → "apc_reuse"
       IF similarity < 0.85 AND context > 500 tokens → "compress"
@@ -21,8 +31,18 @@ class CompressionCoordinator:
       ELSE → "passthrough"
     """
-    def __init__(self):
-        self._dedup = SemanticDedupEngine()
         self._min_tokens = settings.contextforge_min_tokens_to_compress
     async def decide(self, agent_id: str, context: str) -> CompressionDecision:

 """Compression coordinator - decision engine for ContextForge."""
 import asyncio
 import logging
+from typing import Any, Literal, Optional
 from apohara_context_forge.config import settings
 from apohara_context_forge.models import CompressionDecision
+# SemanticDedupEngine moved to _deprecated_dedup_engine when the v3 LSH+FAISS
+# refactor landed. Import lazily so module load doesn't fail when the
+# deprecated module is gone — the coordinator can still serve passthrough
+# decisions and tests can monkeypatch it freely.
+try:
+    from apohara_context_forge.dedup._deprecated_dedup_engine import (
+        SemanticDedupEngine,
+    )
+except ImportError:  # pragma: no cover
+    SemanticDedupEngine = None  # type: ignore[assignment]
 logger = logging.getLogger(__name__)
 class CompressionCoordinator:
     """
     Decision engine - the brain of ContextForge.
     Logic:
       IF similarity >= 0.85 AND shared_prefix > 200 tokens → "apc_reuse"
       IF similarity < 0.85 AND context > 500 tokens → "compress"
       ELSE → "passthrough"
     """
+    def __init__(
+        self,
+        registry: Optional[Any] = None,
+        compressor: Optional[Any] = None,
+    ):
+        # Both kwargs are accepted for the MCP-server lifespan, which wires the
+        # coordinator with the live registry+compressor instances. They remain
+        # optional so older callers that did `CompressionCoordinator()` keep
+        # working.
+        self.registry = registry
+        self.compressor = compressor
+        self._dedup = SemanticDedupEngine() if SemanticDedupEngine is not None else None
         self._min_tokens = settings.contextforge_min_tokens_to_compress
     async def decide(self, agent_id: str, context: str) -> CompressionDecision:

apohara_context_forge/dedup/__pycache__/_deprecated_dedup_engine.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/dedup/__pycache__/_deprecated_dedup_engine.cpython-314.pyc and b/apohara_context_forge/dedup/__pycache__/_deprecated_dedup_engine.cpython-314.pyc differ

apohara_context_forge/dedup/__pycache__/embedder.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/dedup/__pycache__/embedder.cpython-314.pyc and b/apohara_context_forge/dedup/__pycache__/embedder.cpython-314.pyc differ

apohara_context_forge/dedup/__pycache__/lsh_engine.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/dedup/__pycache__/lsh_engine.cpython-314.pyc and b/apohara_context_forge/dedup/__pycache__/lsh_engine.cpython-314.pyc differ

apohara_context_forge/dedup/lsh_engine.py CHANGED Viewed

@@ -78,7 +78,11 @@ class LSHTokenMatcher:
         self._hash_bits = hash_bits
         self._hamming_threshold = hamming_threshold
         self._token_counter = TokenCounter.get()
-        self._block_store: dict[int, tuple[tuple[int, ...], str]] = {}  # hash → (tokens, agent_id)
         self._agent_blocks: dict[str, list[int]] = {}  # agent_id → list of block hashes
         self._lock = asyncio.Lock()
@@ -120,7 +124,11 @@ class LSHTokenMatcher:
                 continue
             block_hash = self._simhash_block(block)
-            self._block_store[block_hash] = (block, agent_id)
             hashes.append(block_hash)
             blocks.append(block_hash)
@@ -159,16 +167,26 @@ class LSHTokenMatcher:
                 continue
             new_hash = self._simhash_block(block)
-            # Search for similar blocks
-            for cached_hash, (cached_tokens, agent_id) in self._block_store.items():
-                if exclude_agent and agent_id == exclude_agent:
-                    continue
                 hd = self._hamming(new_hash, cached_hash)
-                if hd <= self._hamming_threshold:
-                    confidence = 1.0 - (hd / self._hash_bits)
                     matches.append(TokenBlockMatch(
                         block_index=i // self._block_size,
                         cached_block_hash=cached_hash,
@@ -272,6 +290,13 @@ class LSHTokenMatcher:
         async with self._lock:
             hashes = self._agent_blocks.pop(agent_id, [])
             for h in hashes:
-                if h in self._block_store:
                     del self._block_store[h]
             return len(hashes)

         self._hash_bits = hash_bits
         self._hamming_threshold = hamming_threshold
         self._token_counter = TokenCounter.get()
+        # hash → list of (tokens, agent_id). A list (not a single tuple) so
+        # that multiple agents sharing the same prefix do not overwrite each
+        # other — the last writer would otherwise erase the earlier owners
+        # and `find_reusable_blocks` would miss legitimate cross-agent reuse.
+        self._block_store: dict[int, list[tuple[tuple[int, ...], str]]] = {}
         self._agent_blocks: dict[str, list[int]] = {}  # agent_id → list of block hashes
         self._lock = asyncio.Lock()
                 continue
             block_hash = self._simhash_block(block)
+            owners = self._block_store.setdefault(block_hash, [])
+            # Avoid duplicating the same owner if index_prompt is called
+            # repeatedly for an agent (idempotent re-index).
+            if not any(aid == agent_id for _, aid in owners):
+                owners.append((block, agent_id))
             hashes.append(block_hash)
             blocks.append(block_hash)
                 continue
             new_hash = self._simhash_block(block)
+            # Search for similar blocks. Each entry in the store may have
+            # multiple owners (agents that all indexed the same block).
+            # Exclusion matches both the bare agent_id ("agent1") and any
+            # role-suffixed variant ("agent1:system") because the registry
+            # indexes the system prompt under "<agent_id>:system" — without
+            # this an agent finds matches against its own system blocks and
+            # the cross-agent dedup path looks artificially busy.
+            exclude_prefix = f"{exclude_agent}:" if exclude_agent else None
+            for cached_hash, owners in self._block_store.items():
                 hd = self._hamming(new_hash, cached_hash)
+                if hd > self._hamming_threshold:
+                    continue
+                confidence = 1.0 - (hd / self._hash_bits)
+                for cached_tokens, agent_id in owners:
+                    if exclude_agent and (
+                        agent_id == exclude_agent
+                        or (exclude_prefix is not None and agent_id.startswith(exclude_prefix))
+                    ):
+                        continue
                     matches.append(TokenBlockMatch(
                         block_index=i // self._block_size,
                         cached_block_hash=cached_hash,
         async with self._lock:
             hashes = self._agent_blocks.pop(agent_id, [])
             for h in hashes:
+                owners = self._block_store.get(h)
+                if not owners:
+                    continue
+                # Drop only this agent's entry; keep blocks shared with others.
+                self._block_store[h] = [
+                    (toks, aid) for (toks, aid) in owners if aid != agent_id
+                ]
+                if not self._block_store[h]:
                     del self._block_store[h]
             return len(hashes)

apohara_context_forge/mcp/__pycache__/__init__.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/mcp/__pycache__/__init__.cpython-314.pyc and b/apohara_context_forge/mcp/__pycache__/__init__.cpython-314.pyc differ

apohara_context_forge/mcp/__pycache__/server.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/mcp/__pycache__/server.cpython-314.pyc and b/apohara_context_forge/mcp/__pycache__/server.cpython-314.pyc differ

apohara_context_forge/mcp/server.py CHANGED Viewed

@@ -1,124 +1,243 @@
-"""FastAPI MCP-compatible server exposing ContextForge tools."""
 import asyncio
 import logging
-from datetime import datetime
-from fastapi import FastAPI, HTTPException
-from pydantic import BaseModel
 from apohara_context_forge.config import settings
 from apohara_context_forge.metrics.collector import MetricsCollector
 from apohara_context_forge.models import (
     CompressionDecision,
     ContextEntry,
     ContextMatch,
     MetricsSnapshot,
 )
 from apohara_context_forge.registry.context_registry import ContextRegistry
 logger = logging.getLogger(__name__)
-# Create FastAPI app
-app = FastAPI(title="ContextForge", version="0.1.0")
-# Global instances
 registry = ContextRegistry()
 metrics = MetricsCollector()
-# Compressor and coordinator are lazily wired by the production lifespan; they
-# stay None at import time so server.py is importable without GPU/model deps.
-# TODO: wire `compressor = ContextCompressor()` and `coordinator =
-# CompressionCoordinator()` once the lifespan refactor away from on_event lands.
-compressor = None
-coordinator = None
 # ---------------------------------------------------------------------------
-# Dependency getters — these are FastAPI Depends() targets and the keys used by
-# tests' ``app.dependency_overrides`` so each component can be swapped out for a
-# fake. They MUST stay importable from the module top-level.
 # ---------------------------------------------------------------------------
-def get_registry() -> ContextRegistry:
-    """Return the live ContextRegistry singleton."""
-    return registry
-def get_metrics() -> MetricsCollector:
-    """Return the live MetricsCollector singleton."""
-    return metrics
-def get_compressor():
-    """Return the live ContextCompressor (None until lifespan wiring lands)."""
-    return compressor
-def get_coordinator():
-    """Return the live CompressionCoordinator (None until lifespan wiring lands)."""
-    return coordinator
-# Request/Response models
-class ContextRegistration(BaseModel):
-    agent_id: str
-    context: str
-class OptimizedContextRequest(BaseModel):
-    agent_id: str
-    context: str
-# Tool endpoints
-@app.post("/tools/register_context")
-async def register_context(registration: ContextRegistration) -> ContextEntry:
-    """Register an agent's context in the registry."""
-    logger.info(f"Registering context for agent: {registration.agent_id}")
     entry = await registry.register(registration.agent_id, registration.context)
-    # Update metrics
-    await metrics.record_tokens(entry.token_count, entry.token_count)
-    active_count = len(await registry.get_all_active())
-    await metrics.set_active_agents(active_count)
     return entry
 @app.post("/tools/get_optimized_context")
-async def get_optimized_context(request: OptimizedContextRequest) -> CompressionDecision:
-    """Get compression decision for an agent's context."""
-    logger.info(f"Optimizing context for agent: {request.agent_id}")
-    from apohara_context_forge.compression.coordinator import CompressionCoordinator
-    coordinator = CompressionCoordinator()
-    decision = await coordinator.decide(request.agent_id, request.context)
-    # Update metrics
-    await metrics.record_tokens(decision.original_tokens, decision.final_tokens)
     return decision
-@app.get("/metrics/snapshot")
-async def metrics_snapshot_endpoint() -> MetricsSnapshot:
-    """Get current metrics snapshot.
-    Renamed from `get_metrics` so the module-level `get_metrics()` dependency
-    getter (above) stays the importable name. The HTTP path is unchanged.
-    """
-    return await metrics.snapshot()
-@app.get("/health")
-async def health_check():
-    """Health check endpoint."""
-    return {"status": "ok", "gpu": "MI300X", "service": "ContextForge"}
 @app.get("/")
-async def root():
-    """Root endpoint with service info."""
     return {
         "service": "ContextForge",
         "version": "0.1.0",
@@ -127,24 +246,20 @@ async def root():
     }
-# Startup event
-@app.on_event("startup")
-async def startup_event():
-    logger.info(f"ContextForge started on {settings.contextforge_host}:{settings.contextforge_port}")
-    logger.info(f"vLLM: {settings.vllm_base_url}")
-    logger.info(f"Model: {settings.vllm_model}")
-# Background metrics loop
-async def metrics_loop():
     while True:
         try:
             await asyncio.sleep(30)
-            snapshot = await metrics.snapshot()
             logger.info(
-                f"Metrics: VRAM={snapshot.vram_used_gb:.1f}GB, "
-                f"TTFT={snapshot.ttft_ms:.1f}ms, "
-                f"Dedup={snapshot.dedup_rate:.1f}%"
             )
-        except Exception as e:
-            logger.error(f"Metrics collection error: {e}")

+"""FastAPI MCP-compatible server exposing ContextForge tools.
+The server uses a FastAPI lifespan to construct the heavy components once
+(`ContextRegistry`, `ContextCompressor`, `CompressionCoordinator`,
+`MetricsCollector`, `VLLMClient`) and stores them on `app.state`. Endpoints
+read these via the dependency-getter functions defined below; tests
+override the same getters via `app.dependency_overrides` so endpoint logic
+runs against fakes without ever entering the lifespan.
+Important contracts:
+- /health returns the metrics-supplied GPU label, never the request body.
+- Endpoints log only metadata (agent_id, lengths) — never the raw context —
+  so request payloads cannot leak via stdout/stderr.
+"""
+from __future__ import annotations
 import asyncio
 import logging
+from contextlib import asynccontextmanager
+from typing import Any, AsyncIterator
+from fastapi import Depends, FastAPI, Request
+from fastapi.responses import JSONResponse
 from apohara_context_forge.config import settings
+from apohara_context_forge.compression.compressor import ContextCompressor
+from apohara_context_forge.compression.coordinator import CompressionCoordinator
 from apohara_context_forge.metrics.collector import MetricsCollector
 from apohara_context_forge.models import (
     CompressionDecision,
     ContextEntry,
     ContextMatch,
+    ContextRegistration,
+    Degradation,
     MetricsSnapshot,
+    OptimizedContextRequest,
 )
 from apohara_context_forge.registry.context_registry import ContextRegistry
+from apohara_context_forge.serving.vllm_client import VLLMClient
 logger = logging.getLogger(__name__)
+# ---------------------------------------------------------------------------
+# Lifespan — constructs heavy components once and tears them down on shutdown.
+# ---------------------------------------------------------------------------
+@asynccontextmanager
+async def lifespan(app: FastAPI) -> AsyncIterator[None]:
+    """Build app.state.* once; release resources on shutdown.
+    Tests bypass the production heavy path either by NOT entering the
+    `with TestClient(app) as client:` context (so this lifespan never fires)
+    or by monkeypatching the constructor classes referenced by name on this
+    module before entering the context.
+    """
+    app.state.registry = ContextRegistry()
+    app.state.compressor = ContextCompressor()
+    app.state.coordinator = CompressionCoordinator(
+        registry=app.state.registry,
+        compressor=app.state.compressor,
+    )
+    app.state.metrics = MetricsCollector()
+    app.state.vllm = VLLMClient()
+    logger.info(
+        "ContextForge started on %s:%s (vLLM %s, model %s)",
+        settings.contextforge_host,
+        settings.contextforge_port,
+        settings.vllm_base_url,
+        settings.vllm_model,
+    )
+    try:
+        yield
+    finally:
+        # Best-effort teardown — never let cleanup errors mask the original
+        # request error during shutdown.
+        clear = getattr(app.state.registry, "clear", None)
+        if clear is not None:
+            try:
+                await clear()
+            except Exception as exc:
+                logger.warning("registry.clear() failed: %s", exc)
+        aclose = getattr(app.state.vllm, "aclose", None)
+        if aclose is not None:
+            try:
+                await aclose()
+            except Exception as exc:
+                logger.warning("vllm.aclose() failed: %s", exc)
+app = FastAPI(title="ContextForge", version="0.1.0", lifespan=lifespan)
+# Module-level globals kept for callers that import the server outside a
+# lifespan-managed TestClient (e.g. ad-hoc REPL probes). Endpoints prefer
+# `request.app.state.*` via the dependency getters below.
 registry = ContextRegistry()
 metrics = MetricsCollector()
+compressor: ContextCompressor | None = None
+coordinator: CompressionCoordinator | None = None
 # ---------------------------------------------------------------------------
+# Dependency getters — keys for app.dependency_overrides in tests.
 # ---------------------------------------------------------------------------
+def get_registry(request: Request) -> ContextRegistry:
+    return getattr(request.app.state, "registry", registry)
+def get_metrics(request: Request) -> MetricsCollector:
+    return getattr(request.app.state, "metrics", metrics)
+def get_compressor(request: Request) -> Any:
+    return getattr(request.app.state, "compressor", compressor)
+def get_coordinator(request: Request) -> Any:
+    return getattr(request.app.state, "coordinator", coordinator)
+# ---------------------------------------------------------------------------
+# /health — never raises. Reports {"status": "ok"|"degraded", "gpu": <label>}.
+# ---------------------------------------------------------------------------
+@app.get("/health")
+async def health_check(metrics: MetricsCollector = Depends(get_metrics)) -> dict:
+    try:
+        label = metrics._resolve_gpu_label()
+        return {"status": "ok", "gpu": label}
+    except Exception:
+        # Anything failing here is a soft-degrade — clients keep polling.
+        return {"status": "degraded", "gpu": "unknown"}
+# ---------------------------------------------------------------------------
+# /tools/register_context
+# ---------------------------------------------------------------------------
+@app.post("/tools/register_context", response_model=ContextEntry)
+async def register_context(
+    registration: ContextRegistration,
+    registry: ContextRegistry = Depends(get_registry),
+    metrics: MetricsCollector = Depends(get_metrics),
+) -> ContextEntry:
+    """Register an agent's context. Strict body validation: missing field,
+    empty agent_id, or extra fields all yield 422 (handled by Pydantic)."""
+    # Log metadata only — NEVER the raw context (sentinel-leakage test).
+    logger.info(
+        "register_context agent_id=%s ctx_len=%d",
+        registration.agent_id,
+        len(registration.context),
+    )
     entry = await registry.register(registration.agent_id, registration.context)
+    # The simple register endpoint does not run cross-agent dedup, so we
+    # always report `matched=False`. The richer pipeline path uses
+    # registry.register_agent and reports its own match telemetry.
+    metrics.record_register(False)
     return entry
+# ---------------------------------------------------------------------------
+# /tools/get_optimized_context
+# ---------------------------------------------------------------------------
+def _passthrough_decision(context: str) -> CompressionDecision:
+    """Build the safe fallback returned with HTTP 503 when the coordinator
+    raises. Callers receive a structured payload and can re-issue or fall
+    back to the original context themselves."""
+    return CompressionDecision(
+        strategy="passthrough",
+        final_context=context,
+        compressed_context=context,
+        shared_prefix="",
+        original_tokens=0,
+        final_tokens=0,
+        tokens_saved=0,
+        rationale="coordinator_unavailable",
+        savings_pct=0.0,
+    )
 @app.post("/tools/get_optimized_context")
+async def get_optimized_context(
+    request: OptimizedContextRequest,
+    coordinator: Any = Depends(get_coordinator),
+    metrics: MetricsCollector = Depends(get_metrics),
+):
+    """Return a compression decision. On coordinator failure return 503 with
+    a passthrough decision body — the client gets a structured response, not
+    a 500 stack trace, and metrics.record_decision is NOT called."""
+    logger.info(
+        "get_optimized_context agent_id=%s ctx_len=%d",
+        request.agent_id,
+        len(request.context),
+    )
+    try:
+        decision = await coordinator.decide(request.agent_id, request.context)
+    except Exception as exc:
+        # Don't log the body — only the error class. The sentinel-leakage
+        # test asserts no log record contains the original context string.
+        logger.warning(
+            "coordinator.decide failed for agent_id=%s: %s",
+            request.agent_id,
+            type(exc).__name__,
+        )
+        fallback = _passthrough_decision(request.context)
+        return JSONResponse(status_code=503, content=fallback.model_dump(mode="json"))
+    metrics.record_decision(decision)
     return decision
+# ---------------------------------------------------------------------------
+# /metrics/snapshot
+# ---------------------------------------------------------------------------
+@app.get("/metrics/snapshot", response_model=MetricsSnapshot)
+async def metrics_snapshot_endpoint(
+    metrics: MetricsCollector = Depends(get_metrics),
+    compressor: Any = Depends(get_compressor),
+) -> MetricsSnapshot:
+    """Aggregate snapshot. We pull `current_model` and `degradations` from the
+    compressor (which the lifespan owns) and forward them to the collector,
+    which doesn't itself know about compressor identity."""
+    current_model = getattr(compressor, "current_model", None) or "xlm-roberta-large"
+    degradations = list(getattr(compressor, "degradations", []) or [])
+    return await metrics.snapshot(
+        current_compressor_model=current_model,
+        compressor_degradations=degradations,
+    )
+# ---------------------------------------------------------------------------
+# Root
+# ---------------------------------------------------------------------------
 @app.get("/")
+async def root() -> dict:
     return {
         "service": "ContextForge",
         "version": "0.1.0",
     }
+# ---------------------------------------------------------------------------
+# Background metrics loop — opt-in helper for production runs.
+# ---------------------------------------------------------------------------
+async def metrics_loop() -> None:
     while True:
         try:
             await asyncio.sleep(30)
+            snap = await metrics.snapshot()
             logger.info(
+                "Metrics: VRAM=%.1fGB TTFT=%.1fms Dedup=%.1f%%",
+                snap.vram_used_gb,
+                snap.ttft_ms,
+                snap.dedup_rate,
             )
+        except Exception as exc:
+            logger.error("Metrics collection error: %s", exc)

apohara_context_forge/metrics/__pycache__/collector.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/metrics/__pycache__/collector.cpython-314.pyc and b/apohara_context_forge/metrics/__pycache__/collector.cpython-314.pyc differ

apohara_context_forge/metrics/collector.py CHANGED Viewed

@@ -3,9 +3,13 @@ import asyncio
 import logging
 import subprocess
 from datetime import datetime
-from typing import Tuple
-from apohara_context_forge.models import MetricsSnapshot
 logger = logging.getLogger(__name__)
@@ -19,6 +23,12 @@ class MetricsCollector:
         self._ttft_records: list[float] = []
         self._active_agents = 0
         self._use_rocm = self._check_rocm()
     def _check_rocm(self) -> bool:
         """Check if ROCm SMI is available."""
@@ -70,8 +80,36 @@ class MetricsCollector:
         """Set number of active agents."""
         self._active_agents = count
-    async def snapshot(self) -> MetricsSnapshot:
-        """Capture current metrics snapshot."""
         vram_used, vram_total = await self.get_vram_usage()
         avg_ttft = sum(self._ttft_records) / len(self._ttft_records) if self._ttft_records else 0.0
         dedup_rate = (self._tokens_saved / self._tokens_processed * 100) if self._tokens_processed > 0 else 0.0
@@ -79,6 +117,8 @@ class MetricsCollector:
         return MetricsSnapshot(
             timestamp=datetime.now(),
             vram_used_gb=vram_used,
             vram_total_gb=vram_total,
             ttft_ms=avg_ttft,
@@ -87,4 +127,5 @@ class MetricsCollector:
             dedup_rate=dedup_rate,
             compression_ratio=compression_ratio,
             active_agents=self._active_agents,
         )

 import logging
 import subprocess
 from datetime import datetime
+from typing import Iterable, Optional, Tuple
+from apohara_context_forge.models import (
+    CompressionDecision,
+    Degradation,
+    MetricsSnapshot,
+)
 logger = logging.getLogger(__name__)
         self._ttft_records: list[float] = []
         self._active_agents = 0
         self._use_rocm = self._check_rocm()
+        # Surface counters for the MCP server endpoints. record_register fires
+        # once per /tools/register_context call (with `matched=False` since the
+        # simple endpoint doesn't try cross-agent dedup); record_decision fires
+        # once per successful /tools/get_optimized_context call.
+        self._register_calls: list[bool] = []
+        self._decision_calls: list[CompressionDecision] = []
     def _check_rocm(self) -> bool:
         """Check if ROCm SMI is available."""
         """Set number of active agents."""
         self._active_agents = count
+    def record_register(self, matched: bool) -> None:
+        """Record a /tools/register_context call. `matched` is True when LSH
+        cross-agent dedup found a reusable block; False otherwise."""
+        self._register_calls.append(matched)
+    def record_decision(self, decision: CompressionDecision) -> None:
+        """Record a successful /tools/get_optimized_context decision."""
+        self._decision_calls.append(decision)
+    def _resolve_gpu_label(self) -> str:
+        """Return a short label identifying the active GPU backend.
+        ROCm hosts: "rocm". Anything else: "cpu". The /health endpoint passes
+        whatever this returns straight through to clients, so any exception
+        raised here is caught upstream and reported as the degraded path.
+        """
+        return "rocm" if self._use_rocm else "cpu"
+    async def snapshot(
+        self,
+        *,
+        current_compressor_model: Optional[str] = None,
+        compressor_degradations: Optional[Iterable[Degradation]] = None,
+    ) -> MetricsSnapshot:
+        """Capture current metrics snapshot.
+        Optional kwargs let the MCP server inject compressor identity and
+        degradation events captured during this snapshot window — neither
+        is known to the collector itself, so we accept them at the boundary.
+        """
         vram_used, vram_total = await self.get_vram_usage()
         avg_ttft = sum(self._ttft_records) / len(self._ttft_records) if self._ttft_records else 0.0
         dedup_rate = (self._tokens_saved / self._tokens_processed * 100) if self._tokens_processed > 0 else 0.0
         return MetricsSnapshot(
             timestamp=datetime.now(),
+            vram_source="rocm-smi" if self._use_rocm else "psutil",
+            compressor_model=current_compressor_model or "xlm-roberta-large",
             vram_used_gb=vram_used,
             vram_total_gb=vram_total,
             ttft_ms=avg_ttft,
             dedup_rate=dedup_rate,
             compression_ratio=compression_ratio,
             active_agents=self._active_agents,
+            degradations=list(compressor_degradations) if compressor_degradations else [],
         )

apohara_context_forge/models.py CHANGED Viewed

@@ -1,8 +1,9 @@
 """Pydantic data models - typed contracts for ContextForge."""
-from pydantic import BaseModel, Field
 from datetime import datetime
 from typing import Literal, Optional
 class ContextEntry(BaseModel):
     """A registered agent context with compression support."""
@@ -13,6 +14,7 @@ class ContextEntry(BaseModel):
     token_count: int
     compressed_token_count: int | None = None
     created_at: datetime = Field(default_factory=datetime.now)
     ttl_seconds: int = 300
     def model_post_init(self, __context) -> None:
@@ -29,38 +31,21 @@ class ContextMatch(BaseModel):
 class CompressionDecision(BaseModel):
-    """Decision made by the compression coordinator."""
     strategy: Literal["apc_reuse", "compress", "compress_and_reuse", "passthrough"]
     shared_prefix: str | None = None
     compressed_context: str | None = None
-    original_tokens: int
-    final_tokens: int
-    savings_pct: float
-class MetricsSnapshot(BaseModel):
-    """Real-time system metrics."""
-    timestamp: datetime = Field(default_factory=datetime.now)
-    vram_used_gb: float
-    vram_total_gb: float
-    ttft_ms: float
-    tokens_processed: int
-    tokens_saved: int
-    dedup_rate: float
-    compression_ratio: float
-    active_agents: int
-class ContextRegistration(BaseModel):
-    """Request to register a new context."""
-    agent_id: str
-    context: str
-class OptimizedContextRequest(BaseModel):
-    """Request for optimized context."""
-    agent_id: str
-    context: str
 class Degradation(BaseModel):
@@ -72,7 +57,39 @@ class Degradation(BaseModel):
     or coordinator falling back to passthrough on OOM.
     """
     component: str                  # e.g. "compressor", "coordinator", "embedding_engine"
-    reason: str                     # short human-readable cause, e.g. "OOM", "model unavailable"
     fallback: Optional[str] = None  # what was used instead, e.g. "cpu", "passthrough"
     severity: float = 0.5           # 0.0 = informational, 1.0 = critical
     timestamp: datetime = Field(default_factory=datetime.now)

 """Pydantic data models - typed contracts for ContextForge."""
 from datetime import datetime
 from typing import Literal, Optional
+from pydantic import BaseModel, ConfigDict, Field
 class ContextEntry(BaseModel):
     """A registered agent context with compression support."""
     token_count: int
     compressed_token_count: int | None = None
     created_at: datetime = Field(default_factory=datetime.now)
+    expires_at: Optional[datetime] = None
     ttl_seconds: int = 300
     def model_post_init(self, __context) -> None:
 class CompressionDecision(BaseModel):
+    """Decision made by the compression coordinator.
+    `compressed_context` and `final_context` carry the same payload; the latter
+    is the canonical name used by the MCP API and tests. We keep both so older
+    callers in the pipeline continue to work without churn.
+    """
     strategy: Literal["apc_reuse", "compress", "compress_and_reuse", "passthrough"]
     shared_prefix: str | None = None
     compressed_context: str | None = None
+    final_context: str = ""
+    original_tokens: int = 0
+    final_tokens: int = 0
+    tokens_saved: int = 0
+    rationale: str = ""
+    savings_pct: float = 0.0
 class Degradation(BaseModel):
     or coordinator falling back to passthrough on OOM.
     """
     component: str                  # e.g. "compressor", "coordinator", "embedding_engine"
+    reason: str                     # short human-readable cause
     fallback: Optional[str] = None  # what was used instead, e.g. "cpu", "passthrough"
     severity: float = 0.5           # 0.0 = informational, 1.0 = critical
     timestamp: datetime = Field(default_factory=datetime.now)
+class MetricsSnapshot(BaseModel):
+    """Real-time system metrics."""
+    timestamp: datetime = Field(default_factory=datetime.now)
+    vram_source: str = "unknown"
+    compressor_model: str = "xlm-roberta-large"
+    vram_used_gb: float = 0.0
+    vram_total_gb: float = 0.0
+    ttft_ms: float = 0.0
+    tokens_processed: int = 0
+    tokens_saved: int = 0
+    dedup_rate: float = 0.0
+    compression_ratio: float = 0.0
+    active_agents: int = 0
+    degradations: list[Degradation] = Field(default_factory=list)
+class ContextRegistration(BaseModel):
+    """Request to register a new context. Strict — extra fields are rejected."""
+    model_config = ConfigDict(extra="forbid")
+    agent_id: str = Field(min_length=1)
+    context: str
+class OptimizedContextRequest(BaseModel):
+    """Request for optimized context. Strict — extra fields are rejected."""
+    model_config = ConfigDict(extra="forbid")
+    agent_id: str = Field(min_length=1)
+    context: str

apohara_context_forge/normalization/__pycache__/__init__.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/normalization/__pycache__/__init__.cpython-314.pyc and b/apohara_context_forge/normalization/__pycache__/__init__.cpython-314.pyc differ

apohara_context_forge/normalization/__pycache__/prefix_normalizer.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/normalization/__pycache__/prefix_normalizer.cpython-314.pyc and b/apohara_context_forge/normalization/__pycache__/prefix_normalizer.cpython-314.pyc differ

apohara_context_forge/registry/__pycache__/_deprecated_ttl_cache.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/registry/__pycache__/_deprecated_ttl_cache.cpython-314.pyc and b/apohara_context_forge/registry/__pycache__/_deprecated_ttl_cache.cpython-314.pyc differ

apohara_context_forge/registry/__pycache__/context_registry.cpython-314.pyc CHANGED Viewed

Binary files a/apohara_context_forge/registry/__pycache__/context_registry.cpython-314.pyc and b/apohara_context_forge/registry/__pycache__/context_registry.cpython-314.pyc differ

apohara_context_forge/registry/context_registry.py CHANGED Viewed

@@ -93,6 +93,7 @@ class ContextRegistry:
         block_size: int = VLLM_BLOCK_SIZE,
         hamming_threshold: int = 8,
         faiss_nlist: int = 100,
     ):
         # Dependency injection with lazy defaults
         self._lsh = lsh_matcher or LSHTokenMatcher(
@@ -100,12 +101,29 @@ class ContextRegistry:
             hamming_threshold=hamming_threshold,
         )
         self._vram_cache = vram_cache or VRAMAwareCache(max_token_budget=vram_budget_tokens)
-        self._faiss = faiss_index or FAISSContextIndex(dim=384)
         self._token_counter = token_counter or TokenCounter.get()
         self._anchor_pool = anchor_pool or AnchorPool()
         self._embedding_engine: Optional[EmbeddingEngine] = None
         self._block_size = block_size
         # Internal state
         self._agents: dict[str, RegisteredAgent] = {}
         self._system_prompt_hash: Optional[str] = None
@@ -315,14 +333,14 @@ class ContextRegistry:
             # AnchorPool shareability prediction
             is_shareable = await self._anchor_pool.predict_shareable(
                 token_ids=cache_val["token_ids"],
-                target_agent_id=target_agent_id or agent_ids,
             )
             offset_vector = None
             if is_shareable:
                 offset_result = await self._anchor_pool.approximate_offset(
                     token_ids=cache_val["token_ids"],
-                    target_agent_id=target_agent_id or agent_ids,
                 )
                 if offset_result is not None:
                     offset_vector = offset_result.placeholder_offset
@@ -357,6 +375,43 @@ class ContextRegistry:
             return cache_val["full_context"]
         return None
     async def clear_agent(self, agent_id: str) -> bool:
         """Clear an agent's context from all stores."""
         async with self._lock:

         block_size: int = VLLM_BLOCK_SIZE,
         hamming_threshold: int = 8,
         faiss_nlist: int = 100,
+        dedup: Any = None,
     ):
         # Dependency injection with lazy defaults
         self._lsh = lsh_matcher or LSHTokenMatcher(
             hamming_threshold=hamming_threshold,
         )
         self._vram_cache = vram_cache or VRAMAwareCache(max_token_budget=vram_budget_tokens)
+        # FAISS index dim must match the EmbeddingEngine output dimension
+        # (we instantiate EmbeddingEngine with dim=512 in register_agent).
+        # A 384-dim default crashes faiss.IndexFlatIP.add() at runtime —
+        # see the cascade of test_integration failures pre-fix.
+        self._faiss = faiss_index or FAISSContextIndex(dim=512)
         self._token_counter = token_counter or TokenCounter.get()
         self._anchor_pool = anchor_pool or AnchorPool()
         self._embedding_engine: Optional[EmbeddingEngine] = None
         self._block_size = block_size
+        # `dedup` is a hermetic-test escape hatch — when set, register() short-
+        # circuits the LSH+FAISS+ANN heavy path and uses the provided engine
+        # instead. The engine only needs `embed`, `similarity`,
+        # `find_shared_prefix`, and `count_prefix_tokens` — see FakeDedupEngine
+        # in tests/test_mcp_server.py for the contract.
+        self._dedup = dedup
+        # Lightweight in-memory store for `register(agent_id, context)`. This
+        # is independent from `register_agent(...)` (which exercises the full
+        # KV-aware pipeline) — it backs the simple MCP /tools/register_context
+        # endpoint and the test_full_flow scenario.
+        self._simple_entries: dict[str, ContextEntry] = {}
         # Internal state
         self._agents: dict[str, RegisteredAgent] = {}
         self._system_prompt_hash: Optional[str] = None
             # AnchorPool shareability prediction
             is_shareable = await self._anchor_pool.predict_shareable(
                 token_ids=cache_val["token_ids"],
+                target_agent_id=target_agent_id or agent.agent_id,
             )
             offset_vector = None
             if is_shareable:
                 offset_result = await self._anchor_pool.approximate_offset(
                     token_ids=cache_val["token_ids"],
+                    target_agent_id=target_agent_id or agent.agent_id,
                 )
                 if offset_result is not None:
                     offset_vector = offset_result.placeholder_offset
             return cache_val["full_context"]
         return None
+    async def register(self, agent_id: str, context: str) -> ContextEntry:
+        """Lightweight register used by the MCP /tools/register_context endpoint.
+        This is intentionally separate from `register_agent(...)`, which also
+        indexes the system prompt for cross-agent KV reuse. The MCP endpoint
+        deals with single opaque contexts, so we tokenize via TokenCounter,
+        keep a `ContextEntry` in `_simple_entries`, and stop there.
+        """
+        from datetime import datetime as _dt, timedelta as _td, timezone as _tz
+        loop = asyncio.get_event_loop()
+        try:
+            token_count = await loop.run_in_executor(
+                None, self._token_counter.count, context
+            )
+        except Exception:
+            token_count = max(1, len(context.split()))
+        now = _dt.now(_tz.utc)
+        entry = ContextEntry(
+            agent_id=agent_id,
+            context=context,
+            token_count=token_count,
+            created_at=now,
+            expires_at=now + _td(seconds=300),
+        )
+        async with self._lock:
+            self._simple_entries[agent_id] = entry
+        return entry
+    async def clear(self) -> None:
+        """Drop all simple-register state. Called by the MCP server lifespan
+        on shutdown so a fresh process starts from a clean registry. We do
+        NOT touch LSH/FAISS here — those have their own lifecycle hooks."""
+        async with self._lock:
+            self._simple_entries.clear()
     async def clear_agent(self, agent_id: str) -> bool:
         """Clear an agent's context from all stores."""
         async with self._lock:

apohara_context_forge/serving/__pycache__/vllm_client.cpython-314.pyc ADDED Viewed

Binary file (5.49 kB). View file

apohara_context_forge/serving/vllm_client.py CHANGED Viewed

@@ -26,8 +26,13 @@ class vLLMClient:
         return self
     async def __aexit__(self, *args):
-        if self._client:
             await self._client.aclose()
     async def complete(
         self,
@@ -90,3 +95,8 @@ class vLLMClient:
         except httpx.HTTPError as e:
             logger.error(f"vLLM chat request failed: {e}")
             return {"error": str(e)}

         return self
     async def __aexit__(self, *args):
+        await self.aclose()
+    async def aclose(self) -> None:
+        """Close the underlying httpx client. Safe to call multiple times."""
+        if self._client is not None:
             await self._client.aclose()
+            self._client = None
     async def complete(
         self,
         except httpx.HTTPError as e:
             logger.error(f"vLLM chat request failed: {e}")
             return {"error": str(e)}
+# Canonical PEP-8 alias. Tests and the MCP server import the upper-case form;
+# the lower-case original stays for backward compatibility with older callers.
+VLLMClient = vLLMClient

tests/test_dedup.py CHANGED Viewed

@@ -29,8 +29,15 @@ class TestLSHTokenMatcher:
     @pytest.mark.asyncio
     async def test_index_prompt(self, lsh_matcher):
         """Index a prompt, verify blocks are stored."""
-        # Create a prompt long enough to produce at least one full block (block_size=16)
-        text = "This is a test prompt that should produce multiple token blocks for indexing."
         hashes = await lsh_matcher.index_prompt("agent1", text)

     @pytest.mark.asyncio
     async def test_index_prompt(self, lsh_matcher):
         """Index a prompt, verify blocks are stored."""
+        # Need >= block_size (16) tokens after tokenization. The Qwen3 BPE
+        # collapses common English words to one token each, so a short
+        # sentence may yield <16 tokens. Use a longer prompt to guarantee
+        # at least one full block.
+        text = (
+            "This is a test prompt that should produce multiple token blocks "
+            "for indexing across various transformer architectures including "
+            "GPT, Llama, Qwen, and Mistral families on AMD MI300X with ROCm."
+        )
         hashes = await lsh_matcher.index_prompt("agent1", text)

tests/test_integration.py CHANGED Viewed

@@ -23,11 +23,20 @@ from apohara_context_forge.metrics.prometheus_metrics import cache_hits, cache_m
 @pytest_asyncio.fixture
 async def registry():
-    """Create a ContextRegistry with all components wired up."""
     reg = ContextRegistry(
-        lsh_matcher=LSHTokenMatcher(),
         vram_cache=VRAMAwareCache(max_token_budget=50_000_000),
-        faiss_index=FAISSContextIndex(dim=384),
     )
     await reg.start()
     yield reg
@@ -138,8 +147,19 @@ class TestPrometheusMetricsEmission:
     async def test_cache_misses_metric_incremented_for_no_match(self, registry):
         """Verify cache_misses is incremented when no reusable blocks found."""
         # Use completely different prompts to ensure no matches
-        await registry.register_agent("agent1", "Unique prompt for agent 1", "Role 1")
-        await registry.register_agent("agent2", "Completely different prompt for agent 2", "Role 2")
         initial_misses = self._get_metric_value(cache_misses, "agent1")
@@ -151,11 +171,17 @@ class TestPrometheusMetricsEmission:
     @staticmethod
     def _get_metric_value(counter, *label_values):
-        """Get the current value of a Prometheus counter with given labels."""
         for metric_family in REGISTRY.collect():
             if metric_family.name == counter._name:
                 for sample in metric_family.samples:
-                    if sample.labels.values() == tuple(label_values):
                         return sample.value
         return 0
@@ -255,14 +281,14 @@ class TestClearAgent:
         await registry.register_agent("agent_to_clear", system_prompt, "Role prompt")
         # Verify agent exists in LSH blocks
-        agent_blocks_before = await registry._lsh._agent_blocks.get("agent_to_clear")
         assert agent_blocks_before is not None
         # Clear the agent
         await registry.clear_agent("agent_to_clear")
         # Verify agent is removed from LSH
-        agent_blocks_after = await registry._lsh._agent_blocks.get("agent_to_clear")
         assert agent_blocks_after is None
         # Verify agent is removed from FAISS

 @pytest_asyncio.fixture
 async def registry():
+    """Create a ContextRegistry with all components wired up.
+    Two non-default knobs vs production:
+      - FAISS index dim must match EmbeddingEngine output (512), otherwise
+        faiss.IndexFlatIP.add() trips an assertion at runtime.
+      - block_size=4 lets the short prompts in these tests produce at least
+        one LSH block. Production runs at block_size=16 (vLLM PagedAttention
+        page boundary) and uses much longer system prompts.
+    """
     reg = ContextRegistry(
+        lsh_matcher=LSHTokenMatcher(block_size=4),
         vram_cache=VRAMAwareCache(max_token_budget=50_000_000),
+        faiss_index=FAISSContextIndex(dim=512),
+        block_size=4,
     )
     await reg.start()
     yield reg
     async def test_cache_misses_metric_incremented_for_no_match(self, registry):
         """Verify cache_misses is incremented when no reusable blocks found."""
         # Use completely different prompts to ensure no matches
+        # Use orthogonal token sets so the SimHash fingerprints land far
+        # apart — anything sharing common token sequences (e.g. "prompt for
+        # agent") collapses to similar hashes inside the hamming threshold.
+        await registry.register_agent(
+            "agent1",
+            "Quantum chromodynamics describes strong nuclear interactions in baryons",
+            "alpha beta gamma",
+        )
+        await registry.register_agent(
+            "agent2",
+            "Photosynthesis converts solar irradiance into glucose via chloroplast",
+            "delta epsilon zeta",
+        )
         initial_misses = self._get_metric_value(cache_misses, "agent1")
     @staticmethod
     def _get_metric_value(counter, *label_values):
+        """Get the current value of a Prometheus counter with given labels.
+        Counters live as `<name>_total` samples in REGISTRY.collect(); we
+        compare label values as a tuple (dict_values views never compare
+        equal to a tuple under ==).
+        """
+        target = tuple(label_values)
         for metric_family in REGISTRY.collect():
             if metric_family.name == counter._name:
                 for sample in metric_family.samples:
+                    if tuple(sample.labels.values()) == target:
                         return sample.value
         return 0
         await registry.register_agent("agent_to_clear", system_prompt, "Role prompt")
         # Verify agent exists in LSH blocks
+        agent_blocks_before = registry._lsh._agent_blocks.get("agent_to_clear")
         assert agent_blocks_before is not None
         # Clear the agent
         await registry.clear_agent("agent_to_clear")
         # Verify agent is removed from LSH
+        agent_blocks_after = registry._lsh._agent_blocks.get("agent_to_clear")
         assert agent_blocks_after is None
         # Verify agent is removed from FAISS

tests/test_registry.py CHANGED Viewed

@@ -74,9 +74,16 @@ class TestContextRegistry:
     """
     async def test_registry_has_register_agent_method(self, registry):
-        """Verify the actual method name is register_agent, not register."""
         assert hasattr(registry, 'register_agent')
-        assert not hasattr(registry, 'register')
     async def test_get_agent_context_returns_none_for_unknown(self, registry):
         """get_agent_context returns None for unknown agents."""

     """
     async def test_registry_has_register_agent_method(self, registry):
+        """Verify the dual register API exists.
+        - `register_agent(agent_id, system_prompt, role_prompt)` is the full
+          KV-aware pipeline used by the agents/ runner.
+        - `register(agent_id, context)` is the lightweight MCP endpoint path
+          (single opaque context, no system/role split). Both are part of the
+          public contract; they live on the same registry instance.
+        """
         assert hasattr(registry, 'register_agent')
+        assert hasattr(registry, 'register')
     async def test_get_agent_context_returns_none_for_unknown(self, registry):
         """get_agent_context returns None for unknown agents."""