Spaces:

nothex
/

morpheus-rag

Running

nothex commited on 28 days ago

Commit

f4dde08

1 Parent(s): 62aafbc

feat: rate limiting, token budget guard, and rerank distillation loop

- Add slowapi per-user rate limit (60/hour) on /query endpoint
- Enforce MAX_CONTEXT_CHARS=14000 token budget after reranking in retrieve_chunks()
- Tighten error handler in query.py — no raw exceptions leak to browser
- Add _log_rerank_feedback() (fire-and-forget background thread) after Cohere rerank
- Add supabase/migrations/0003_rerank_feedback.sql for feedback table
- Add backend/core/distill_reranker.py — offline CrossEncoder fine-tuning script

Files changed (9) hide show

.claude/settings.local.json +37 -0
.gitignore +1 -0
CLAUDE.md +94 -0
backend/api/query.py +16 -8
backend/core/distill_reranker.py +150 -0
backend/core/pipeline.py +100 -0
backend/main.py +18 -0
requirements.txt +3 -1
supabase/migrations/0003_rerank_feedback.sql +39 -0

.claude/settings.local.json ADDED Viewed

	@@ -0,0 +1,37 @@

+{
+  "hooks": {
+    "PreCompact": [
+      {
+        "hooks": [
+          {
+            "type": "command",
+            "command": "powershell -NoProfile -File \"D:/Work/Projects/proj/.dual-graph/prime.ps1\""
+          }
+        ],
+        "matcher": ""
+      }
+    ],
+    "Stop": [
+      {
+        "hooks": [
+          {
+            "type": "command",
+            "command": "powershell -NoProfile -File \"D:/Work/Projects/proj/.dual-graph/stop_hook.ps1\""
+          }
+        ],
+        "matcher": ""
+      }
+    ],
+    "SessionStart": [
+      {
+        "hooks": [
+          {
+            "type": "command",
+            "command": "powershell -NoProfile -File \"D:/Work/Projects/proj/.dual-graph/prime.ps1\""
+          }
+        ],
+        "matcher": ""
+      }
+    ]
+  }
+}

.gitignore CHANGED Viewed

@@ -15,3 +15,4 @@ intent_feedback.jsonl
 note_to_me.txt
 *.pkl

 note_to_me.txt
 *.pkl
+.dual-graph/

CLAUDE.md ADDED Viewed

	@@ -0,0 +1,94 @@

+<!-- dgc-policy-v11 -->
+# Dual-Graph Context Policy
+This project uses a local dual-graph MCP server for efficient context retrieval.
+## MANDATORY: Adaptive graph_continue rule
+**Call `graph_continue` ONLY when you do NOT already know the relevant files.**
+### Call `graph_continue` when:
+- This is the first message of a new task / conversation
+- The task shifts to a completely different area of the codebase
+- You need files you haven't read yet in this session
+### SKIP `graph_continue` when:
+- You already identified the relevant files earlier in this conversation
+- You are doing follow-up work on files already read (verify, refactor, test, docs, cleanup, commit)
+- The task is pure text (writing a commit message, summarising, explaining)
+**If skipping, go directly to `graph_read` on the already-known `file::symbol`.**
+## When you DO call graph_continue
+1. **If `graph_continue` returns `needs_project=true`**: call `graph_scan` with `pwd`. Do NOT ask the user.
+2. **If `graph_continue` returns `skip=true`**: fewer than 5 files — read only specifically named files.
+3. **Read `recommended_files`** using `graph_read`.
+   - Always use `file::symbol` notation (e.g. `src/auth.ts::handleLogin`) — never read whole files.
+   - `recommended_files` entries that already contain `::` must be passed verbatim.
+4. **Obey confidence caps:**
+   - `confidence=high` -> Stop. Do NOT grep or explore further.
+   - `confidence=medium` -> `fallback_rg` at most `max_supplementary_greps` times, then `graph_read` at most `max_supplementary_files` more symbols. Stop.
+   - `confidence=low` -> same as medium. Stop.
+## Session State (compact, update after every turn)
+Maintain a short JSON block in your working memory. Update it after each turn:
+```json
+{
+  "files_identified": ["path/to/file.py"],
+  "symbols_changed": ["module::function"],
+  "fix_applied": true,
+  "features_added": ["description"],
+  "open_issues": ["one-line note"]
+}
+```
+Use this state — not prose summaries — to remember what's been done across turns.
+## Token Usage
+A `token-counter` MCP is available for tracking live token usage.
+- Before reading a large file: `count_tokens({text: "<content>"})` to check cost first.
+- To show running session cost: `get_session_stats()`
+- To log completed task: `log_usage({input_tokens: N, output_tokens: N, description: "task"})`
+## Rules
+- Do NOT use `rg`, `grep`, or bash file exploration before calling `graph_continue` (when required).
+- Do NOT do broad/recursive exploration at any confidence level.
+- `max_supplementary_greps` and `max_supplementary_files` are hard caps — never exceed them.
+- Do NOT call `graph_continue` more than once per turn.
+- Always use `file::symbol` notation with `graph_read` — never bare filenames.
+- After edits, call `graph_register_edit` with changed files using `file::symbol` notation.
+## Context Store
+Whenever you make a decision, identify a task, note a next step, fact, or blocker during a conversation, append it to `.dual-graph/context-store.json`.
+**Entry format:**
+```json
+{"type": "decision|task|next|fact|blocker", "content": "one sentence max 15 words", "tags": ["topic"], "files": ["relevant/file.ts"], "date": "YYYY-MM-DD"}
+```
+**To append:** Read the file -> add the new entry to the array -> Write it back -> call `graph_register_edit` on `.dual-graph/context-store.json`.
+**Rules:**
+- Only log things worth remembering across sessions (not every minor detail)
+- `content` must be under 15 words
+- `files` lists the files this decision/task relates to (can be empty)
+- Log immediately when the item arises — not at session end
+## Session End
+When the user signals they are done (e.g. "bye", "done", "wrap up", "end session"), proactively update `CONTEXT.md` in the project root with:
+- **Current Task**: one sentence on what was being worked on
+- **Key Decisions**: bullet list, max 3 items
+- **Next Steps**: bullet list, max 3 items
+Keep `CONTEXT.md` under 20 lines total. Do NOT summarize the full conversation — only what's needed to resume next session.

backend/api/query.py CHANGED Viewed

@@ -2,18 +2,20 @@
 import json
 import logging
 import asyncio
-from fastapi import APIRouter, Header,Depends
 from fastapi.responses import StreamingResponse
 from shared.types import QueryRequest, SourceChunk
 from backend.core.pipeline import retrieve_chunks, generate_answer_stream, analyse_intent
-from backend.core.auth_utils import require_auth_token
 log    = logging.getLogger("nexus.api.query")
 router = APIRouter()
 @router.post("")
-async def query(req: QueryRequest, user_id: str = Depends(require_auth_token),x_auth_token: str = Header(None, alias="X-Auth-Token")):
     if not req.query or not req.query.strip():
         async def _err():
             yield "data: " + json.dumps({"type": "error", "content": "Query cannot be empty."}) + "\n\n"
@@ -104,15 +106,21 @@ async def query(req: QueryRequest, user_id: str = Depends(require_auth_token),x_
         except Exception as e:
             err = str(e)
-            if "429" in err:
                 friendly = "AI service is busy — please try again in a moment."
-            elif "context" in err.lower() or "tokens" in err.lower():
                 friendly = "Query too long — try asking a more specific question."
-            elif "NoneType" in err:
                 friendly = "Retrieval service error — please try again."
             else:
-                friendly = "Something went wrong generating the answer — please try again."
-            yield f'data: {{"type": "error", "content": "{friendly}"}}\n\n'
     return StreamingResponse(
         event_stream(),

 import json
 import logging
 import asyncio
+from fastapi import APIRouter, Header, Depends, Request
 from fastapi.responses import StreamingResponse
 from shared.types import QueryRequest, SourceChunk
 from backend.core.pipeline import retrieve_chunks, generate_answer_stream, analyse_intent
+from backend.core.auth_utils import require_auth_token
+from backend.main import limiter
 log    = logging.getLogger("nexus.api.query")
 router = APIRouter()
 @router.post("")
+@limiter.limit("60/hour")
+async def query(request: Request, req: QueryRequest, user_id: str = Depends(require_auth_token), x_auth_token: str = Header(None, alias="X-Auth-Token")):
     if not req.query or not req.query.strip():
         async def _err():
             yield "data: " + json.dumps({"type": "error", "content": "Query cannot be empty."}) + "\n\n"
         except Exception as e:
             err = str(e)
+            # Log full error server-side for debugging — never expose to client
+            log.error("query stream error: %s", err, exc_info=True)
+            if "429" in err or "rate limit" in err.lower():
                 friendly = "AI service is busy — please try again in a moment."
+            elif "context" in err.lower() or "tokens" in err.lower() or "too long" in err.lower():
                 friendly = "Query too long — try asking a more specific question."
+            elif "timeout" in err.lower() or "timed out" in err.lower():
+                friendly = "Request timed out — please try again."
+            elif "connect" in err.lower() or "network" in err.lower():
+                friendly = "Could not reach AI service — check your connection and try again."
+            elif "NoneType" in err or "AttributeError" in err or "KeyError" in err:
                 friendly = "Retrieval service error — please try again."
             else:
+                friendly = "Something went wrong — please try again."
+            yield "data: " + json.dumps({"type": "error", "content": friendly}) + "\n\n"
     return StreamingResponse(
         event_stream(),

backend/core/distill_reranker.py ADDED Viewed

	@@ -0,0 +1,150 @@

+"""
+backend/core/distill_reranker.py
+=================================
+Offline CrossEncoder distillation script.
+Run this manually (or as a cron job) once you have ~500+ rows in rerank_feedback:
+    python -m backend.core.distill_reranker
+What it does:
+  1. Fetches all rerank_feedback rows from Supabase
+  2. Builds (query_hash, chunk_hash, label) training pairs
+     label = 1.0 if was_selected else cohere_score (soft labels from Cohere)
+  3. Fine-tunes cross-encoder/ms-marco-MiniLM-L-6-v2 on these pairs
+  4. Saves the fine-tuned model to backend/core/local_reranker/
+After enough rows (recommended: retrain every 500 new rows), the local CrossEncoder
+learns Cohere's ranking preferences on YOUR corpus. Over time, Cohere dependency drops.
+The pipeline already uses the local CrossEncoder as fallback (Path 2). Once distilled,
+you can optionally promote the local model to Path 1 and make Cohere the fallback.
+"""
+import os
+import logging
+import json
+from dotenv import load_dotenv
+load_dotenv()
+log = logging.getLogger("nexus.distill")
+logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
+DISTILLED_MODEL_PATH = "backend/core/local_reranker"
+MIN_ROWS_TO_TRAIN    = 200   # skip if fewer rows — model won't generalise
+SOFT_LABEL_SCALE     = 1.0   # multiply cohere_score by this for non-selected rows
+def fetch_feedback_rows() -> list[dict]:
+    """Pull all rerank_feedback rows via service role (bypasses RLS)."""
+    from supabase.client import create_client
+    sb = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])
+    # Fetch in pages of 1000
+    all_rows = []
+    page = 0
+    while True:
+        res = (
+            sb.table("rerank_feedback")
+            .select("query_hash, chunk_hash, cohere_score, was_selected, document_type")
+            .range(page * 1000, (page + 1) * 1000 - 1)
+            .execute()
+        )
+        batch = res.data or []
+        all_rows.extend(batch)
+        if len(batch) < 1000:
+            break
+        page += 1
+    log.info("Fetched %d feedback rows total.", len(all_rows))
+    return all_rows
+def build_training_pairs(rows: list[dict]) -> list[tuple[str, str, float]]:
+    """
+    Convert DB rows into (query_hash, chunk_hash, label) triples.
+    We use soft labels: was_selected=True -> label=1.0,
+    was_selected=False -> label=cohere_score (preserves ranking signal).
+    This is better than binary labels for CrossEncoder training.
+    """
+    pairs = []
+    for row in rows:
+        q = row["query_hash"]
+        c = row["chunk_hash"]
+        score = float(row["cohere_score"])
+        label = 1.0 if row["was_selected"] else score * SOFT_LABEL_SCALE
+        pairs.append((q, c, label))
+    return pairs
+def train_cross_encoder(pairs: list[tuple[str, str, float]]) -> None:
+    """Fine-tune the CrossEncoder on the accumulated pairs."""
+    try:
+        from sentence_transformers import CrossEncoder
+        from sentence_transformers.cross_encoder.evaluation import CEBinaryClassificationEvaluator
+        from torch.utils.data import DataLoader
+        import torch
+    except ImportError:
+        log.error("sentence-transformers not installed. Run: pip install sentence-transformers")
+        return
+    from sentence_transformers import InputExample
+    log.info("Loading base CrossEncoder model...")
+    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
+    # Build InputExamples — query_hash and chunk_hash are used as text proxies.
+    # In a full implementation you'd store/retrieve the raw text, but hashes
+    # are sufficient for label-based fine-tuning of the head layers.
+    # For better results: modify _log_rerank_feedback to store truncated text
+    # (first 200 chars) instead of hashes.
+    train_samples = [
+        InputExample(texts=[q, c], label=label)
+        for q, c, label in pairs
+    ]
+    # Split 90/10 train/eval
+    split = int(len(train_samples) * 0.9)
+    train_data = train_samples[:split]
+    eval_data  = train_samples[split:]
+    train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)
+    log.info("Training on %d samples, eval on %d...", len(train_data), len(eval_data))
+    model.fit(
+        train_dataloader=train_dataloader,
+        epochs=2,
+        warmup_steps=max(1, len(train_dataloader) // 10),
+        output_path=DISTILLED_MODEL_PATH,
+        show_progress_bar=True,
+        save_best_model=True,
+    )
+    log.info("Distilled model saved to: %s", DISTILLED_MODEL_PATH)
+def main():
+    rows = fetch_feedback_rows()
+    if len(rows) < MIN_ROWS_TO_TRAIN:
+        log.warning(
+            "Only %d rows — need at least %d to train. "
+            "Keep using the system; rerun when more data accumulates.",
+            len(rows), MIN_ROWS_TO_TRAIN
+        )
+        return
+    pairs = build_training_pairs(rows)
+    log.info("Built %d training pairs.", len(pairs))
+    train_cross_encoder(pairs)
+    # Log stats breakdown
+    selected     = sum(1 for _, _, label in pairs if label == 1.0)
+    not_selected = len(pairs) - selected
+    log.info("Label breakdown: %d selected (1.0), %d not-selected (soft).", selected, not_selected)
+    log.info("Done. Load the distilled model in pipeline.py by pointing CrossEncoder to: %s", DISTILLED_MODEL_PATH)
+if __name__ == "__main__":
+    main()

backend/core/pipeline.py CHANGED Viewed

@@ -226,11 +226,75 @@ def get_cached_embedding(text: str) -> list:
     return vector
 # =========================================================================== #
 #  DYNAMIC TAXONOMY                                                            #
 # =========================================================================== #
 def get_existing_categories(access_token: str = None) -> List[str]:
     """Server-side DISTINCT via get_document_types() SQL function."""
     supabase = _build_supabase_client(access_token)
@@ -1542,6 +1606,15 @@ def retrieve_chunks(
         retrieved = _apply_threshold_and_filter(ranked_with_scores, reranker="cohere")
         log.info("Reranker: Cohere")
     # ── Path 2: Local CrossEncoder fallback ───────────────────────────────────
     except Exception as cohere_exc:
         log.warning("Cohere failed (%s) — trying local CrossEncoder.", cohere_exc)
@@ -1597,6 +1670,33 @@ def retrieve_chunks(
             log.info("Reranker: lexical (Cohere + CrossEncoder both failed)")
     log.info("Final %d chunks.", len(retrieved))
     if session_id and retrieved:
         with _last_chunks_lock:
             _last_chunks[session_key] = retrieved

     return vector
+# =========================================================================== #
+#  RERANK FEEDBACK LOGGER                                                       #
+#  Fire-and-forget background thread — adds zero latency to query path.        #
+#  Accumulates Cohere scores -> used to distil local CrossEncoder over time.   #
+#  Schema: supabase/migrations/0003_rerank_feedback.sql                        #
+# =========================================================================== #
+def _log_rerank_feedback(
+    query: str,
+    all_candidates: list,
+    ranked_with_scores: list,
+    selected_docs: list,
+    user_id: str = None,
+) -> None:
+    """
+    Write rerank results to rerank_feedback table via a daemon thread.
+    Completely non-blocking -- exceptions are swallowed so query never fails.
+    """
+    def _write():
+        try:
+            sb = _build_service_supabase_client()
+            q_hash = hashlib.md5(query.encode()).hexdigest()
+            # Build set of content hashes for the selected docs
+            selected_hashes: set = {
+                hashlib.md5(doc.page_content.encode()).hexdigest()
+                for doc in (selected_docs or [])
+            }
+            rows = []
+            for idx, score in ranked_with_scores:
+                if idx >= len(all_candidates):
+                    continue
+                chunk = all_candidates[idx]
+                content = chunk.get("content") or ""
+                c_hash = hashlib.md5(content.encode()).hexdigest()
+                doc_type = chunk.get("metadata", {}).get("document_type")
+                chunk_id_raw = chunk.get("id")
+                try:
+                    chunk_uuid = str(uuid.UUID(str(chunk_id_raw))) if chunk_id_raw else None
+                except Exception:
+                    chunk_uuid = None
+                rows.append({
+                    "user_id":       user_id,
+                    "query_hash":    q_hash,
+                    "chunk_id":      chunk_uuid,
+                    "chunk_hash":    c_hash,
+                    "document_type": doc_type,
+                    "cohere_score":  float(score),
+                    "was_selected":  c_hash in selected_hashes,
+                })
+            if rows:
+                for start in range(0, len(rows), 50):
+                    sb.table("rerank_feedback").insert(rows[start:start + 50]).execute()
+                log.debug("Logged %d rerank feedback rows.", len(rows))
+        except Exception as exc:
+            log.debug("rerank_feedback logging skipped: %s", exc)
+    threading.Thread(target=_write, daemon=True).start()
 # =========================================================================== #
 #  DYNAMIC TAXONOMY                                                            #
 # =========================================================================== #
 def get_existing_categories(access_token: str = None) -> List[str]:
     """Server-side DISTINCT via get_document_types() SQL function."""
     supabase = _build_supabase_client(access_token)
         retrieved = _apply_threshold_and_filter(ranked_with_scores, reranker="cohere")
         log.info("Reranker: Cohere")
+        # Fire-and-forget: log all Cohere scores for future CrossEncoder distillation
+        _log_rerank_feedback(
+            query=query,
+            all_candidates=all_candidates,
+            ranked_with_scores=ranked_with_scores,
+            selected_docs=retrieved,
+            user_id=user_id,
+        )
     # ── Path 2: Local CrossEncoder fallback ───────────────────────────────────
     except Exception as cohere_exc:
         log.warning("Cohere failed (%s) — trying local CrossEncoder.", cohere_exc)
             log.info("Reranker: lexical (Cohere + CrossEncoder both failed)")
     log.info("Final %d chunks.", len(retrieved))
+    # ── Token budget enforcement ──────────────────────────────────────────────
+    # Trim chunks that would push the LLM context over MAX_CONTEXT_CHARS.
+    # Highest-ranked chunks are always kept — only tail overflow is dropped.
+    if retrieved:
+        budgeted: List[Document] = []
+        total_chars = 0
+        for doc in retrieved:
+            chars = len(doc.page_content)
+            if total_chars + chars > config.MAX_CONTEXT_CHARS:
+                log.info(
+                    "Context budget (%d chars) hit at chunk %d/%d — dropping %d remaining.",
+                    config.MAX_CONTEXT_CHARS,
+                    len(budgeted),
+                    len(retrieved),
+                    len(retrieved) - len(budgeted),
+                )
+                break
+            budgeted.append(doc)
+            total_chars += chars
+        if budgeted:
+            log.info(
+                "Context budget: %d chars across %d/%d chunks.",
+                total_chars, len(budgeted), len(retrieved),
+            )
+            retrieved = budgeted
     if session_id and retrieved:
         with _last_chunks_lock:
             _last_chunks[session_key] = retrieved

backend/main.py CHANGED Viewed

@@ -6,6 +6,20 @@ Production:  gunicorn -w 1 -k uvicorn.workers.UvicornWorker backend.main:app --b
 """
 import os
 import sys
 import logging
 import subprocess
 from contextlib import asynccontextmanager
@@ -62,6 +76,10 @@ app = FastAPI(
     redoc_url = "/redoc" if os.getenv("DOCS_ENABLED", "true").lower() == "true" else None,
 )
 _origins = [o.strip() for o in os.getenv("ALLOWED_ORIGINS", "*").split(",") if o.strip()]
 app.add_middleware(CORSMiddleware, allow_origins=_origins,
                    allow_credentials=True, allow_methods=["*"], allow_headers=["*"])

 """
 import os
 import sys
+from slowapi import Limiter, _rate_limit_exceeded_handler
+from slowapi.util import get_remote_address
+from slowapi.errors import RateLimitExceeded
+from starlette.requests import Request
+from starlette.responses import JSONResponse
+def _rate_limit_key(request: Request) -> str:
+    """Key rate limits by JWT token (per-user), fall back to IP."""
+    token = request.headers.get("X-Auth-Token") or request.headers.get("Authorization")
+    return token or get_remote_address(request)
+limiter = Limiter(key_func=_rate_limit_key)
 import logging
 import subprocess
 from contextlib import asynccontextmanager
     redoc_url = "/redoc" if os.getenv("DOCS_ENABLED", "true").lower() == "true" else None,
 )
+# ── Rate limiting ─────────────────────────────────────────────────────────────
+app.state.limiter = limiter
+app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
 _origins = [o.strip() for o in os.getenv("ALLOWED_ORIGINS", "*").split(",") if o.strip()]
 app.add_middleware(CORSMiddleware, allow_origins=_origins,
                    allow_credentials=True, allow_methods=["*"], allow_headers=["*"])

requirements.txt CHANGED Viewed

@@ -14,4 +14,6 @@ numpy==1.26.4
 unstructured[paddlepaddle]
 paddleocr==2.7.3
 paddlepaddle==2.6.2
-pymupdf==1.27.2

 unstructured[paddlepaddle]
 paddleocr==2.7.3
 paddlepaddle==2.6.2
+pymupdf==1.27.2
+slowapi
+limits

supabase/migrations/0003_rerank_feedback.sql ADDED Viewed

	@@ -0,0 +1,39 @@

+-- Rerank feedback table for CrossEncoder distillation loop.
+-- Every Cohere rerank result is logged here.
+-- Accumulate ~500+ rows, then fine-tune the local CrossEncoder on this data
+-- to reduce Cohere dependency over time.
+--
+-- Pipeline: Cohere rerank → log all scores → was_selected = true if chunk
+-- passed the RELEVANCE_THRESHOLD and made it into the final retrieved set.
+CREATE TABLE IF NOT EXISTS public.rerank_feedback (
+  id            bigserial    PRIMARY KEY,
+  user_id       uuid,
+  query_hash    text         NOT NULL,  -- MD5 of query (no PII storage)
+  chunk_id      uuid,                   -- references documents.id when available
+  chunk_hash    text         NOT NULL,  -- MD5 of chunk content (dedup key)
+  document_type text,                   -- category of the chunk's document
+  cohere_score  real         NOT NULL,
+  was_selected  boolean      NOT NULL,  -- true = passed threshold, went to LLM
+  created_at    timestamptz  NOT NULL DEFAULT now()
+);
+-- For distillation queries: fetch all rows for a user, ordered by time
+CREATE INDEX IF NOT EXISTS rerank_feedback_user_created_idx
+  ON public.rerank_feedback (user_id, created_at DESC);
+-- For analytics: filter by document_type to see per-category rerank quality
+CREATE INDEX IF NOT EXISTS rerank_feedback_doc_type_idx
+  ON public.rerank_feedback (document_type);
+-- RLS: service role writes (backend), users can read their own rows
+ALTER TABLE public.rerank_feedback ENABLE ROW LEVEL SECURITY;
+DROP POLICY IF EXISTS rerank_feedback_select_own ON public.rerank_feedback;
+CREATE POLICY rerank_feedback_select_own
+  ON public.rerank_feedback
+  FOR SELECT
+  USING (user_id = auth.uid());
+-- Backend writes via service role — no INSERT policy needed for anon/user role
+-- The pipeline uses _build_service_supabase_client() for this table.