Morpheus – Architecture Guide
Ask anything. Your documents answer.
This file is the source of truth for how Morpheus works today, what is already live in the codebase, and what the project is prioritizing next.
What Morpheus Is Today
Morpheus is a multi-tenant RAG platform for user-uploaded PDFs.
Users upload PDF documents. They ask questions in natural language. Morpheus retrieves evidence and streams grounded answers with citations and diagnostics.
What is already live:
- Each user sees only their own documents, enforced through tenant-scoped writes and RLS-backed reads
- Retrieval combines BM25 keyword search + pgvector semantic search + reranking
- Routing is multi-path, not single-path: exact/page-scoped lookup, structural tree search, dense retrieval, graph-assisted retrieval, and memory-aware follow-ups
- Repeated questions can short-circuit through a semantic cache
- Conversations are remembered across sessions via episodic memory
- Query traces capture route selection, expert weights, retrieval diagnostics, and answer quality metadata
- If one AI provider fails, the system tries the next provider/model in the chain
Important framing:
- Morpheus is a RAG engine
- Morpheus does have a mixture-of-experts-style orchestration layer at retrieval time
- Morpheus is not a trained neural MoE model
- Morpheus is not yet a true agentic retrieval system with an iterative planner/executor loop
- Morpheus currently supports PDF upload ingestion as the primary production path, plus a new local Python code graph indexing path for graph-first exploration
Current Engineering Priority Order
This is the active operating order for the project. New source kinds should not leapfrog retrieval quality work.
Phase 0 – Understand What We Have
- Implemented: per-query traces with route selection, selected experts, expert weights, rerank audit, diagnostics, and quality metrics
- Implemented: router mechanism and retrieval branch selection in code
- Required next operating discipline: maintain a manually scored baseline of 50-100 real queries
Phase 1 – Fix Quality on the Existing Path
- Highest priority: improve retrieval and generation quality on the current PDF path
- Focus areas: chunking, hybrid weighting, reranker thresholds, grounding instructions, hallucination guardrails
Phase 2 – Add Source-Kind Architecture
- Only after Phase 1 is stable
- Add `source_kind`, `data_shape`, and `parser_kind`
- Add code ingestion as a separate structured pipeline
- Add URL ingestion behind strict security controls
Status note:
- Initial Phase 2 substrate is now live: `source_kind`, `data_shape`, `parser_kind`, DB-backed `graph_runs`, graph-first API endpoints, and deterministic Python code graph indexing via a local script
- Remaining Phase 2 work is broader source support (`markdown`, `url`, richer code languages) and deeper graph-first answer orchestration
Phase 3 – Scale and Cost Optimization
- Model tiering by route/query complexity
- Embedding cache layer
- Reranker gating based on retrieval confidence
Supported Sources Right Now
| Source | Status | Notes |
|---|---|---|
| PDF upload | Live | Only production ingestion path today |
| URL ingestion | Not live | Planned for a later phase |
| Markdown files | Not live | Planned via source-kind architecture |
| Python code graph indexing | Live (local/scripted) | Deterministic AST-first graph indexing for graph-first exploration; not a browser upload path |
| Code/config/API files (beyond Python graph indexing) | Partial | Broader structured ingestion still planned as a separate pipeline, not an extension of generic document chunking |
Project Structure
morpheus/
├── backend/
│   ├── main.py                 FastAPI app, startup wiring, rate limiter
│   ├── api/
│   │   ├── auth.py             /api/v1/auth/*
│   │   ├── query.py            /api/v1/query – SSE streaming query path
│   │   ├── corpus.py           /api/v1/corpus/*
│   │   ├── ingest.py           /api/v1/ingest/*
│   │   ├── graph.py            /api/v1/graph – graph search, path, export
│   │   ├── frontend_config.py  /api/v1/config
│   │   └── admin.py            traces + feedback admin endpoints
│   └── core/
│       ├── pipeline.py             Main orchestration layer
│       ├── pipeline_routing.py     Route classes + expert weighting
│       ├── pipeline_retrieval.py   Retrieval helpers
│       ├── pipeline_generation.py  Generation helpers
│       ├── pipeline_ingestion.py   PDF ingestion workflow
│       ├── pipeline_pageindex.py   Structural tree retrieval
│       ├── pipeline_memory.py      Episodic memory helpers
│       ├── pipeline_ambiguity.py   Scope and ambiguity handling
│       ├── pipeline_types.py       Shared pipeline metadata types
│       ├── graph_hybrid.py         Source metadata + graph-first search/path/export helpers
│       ├── code_graph.py           Deterministic Python AST graph indexing
│       ├── providers.py            LLM + embedding provider fallback
│       ├── classifier.py           3-stage document classifier
│       ├── intent_classifier.py    sklearn intent model, online retraining
│       ├── cache_manager.py        Semantic Redis cache with version invalidation
│       ├── auth_utils.py           JWT helpers + require_auth_token Depends()
│       ├── config.py               All constants and tuneable settings
│       ├── rate_limit.py           Shared limiter setup
│       └── tasks.py                Celery task for background PDF ingestion
├── frontend/
│   ├── index.html
│   └── js/
│       ├── config.js   Runtime config
│       ├── api.js      All fetch() calls – single source of truth
│       ├── state.js    Global STATE object
│       ├── chat.js     Streaming chat UI
│       ├── corpus.js   Upload + document management
│       ├── graph.js    D3 force-directed knowledge graph
│       ├── inspect.js  Node detail panel
│       ├── ui.js       Shared UI helpers
│       └── main.js     Boot sequence + auth gate
├── shared/
│   └── types.py        Pydantic models shared by API and pipeline
└── supabase/
    ├── migrations/
    └── rls/
The Database (Supabase / PostgreSQL)
Tables
documents – The vector store. Every chunk and summary node from every PDF lives here.
| Column | Type | Purpose |
|---|---|---|
| id | uuid | Deterministic ID: uuid5(file_hash + chunk_index) |
| content | text | Chunk text that gets searched |
| metadata | jsonb | source, file_hash, document_type, page numbers, retrieval metadata |
| embedding | vector(2048) | nvidia-nemotron embedding for pgvector search |
| user_id | uuid | RLS tenant isolation |
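The deterministic `id` column can be sketched with the standard-library `uuid5` function. The use of `uuid.NAMESPACE_URL` and the `hash:index` separator are assumptions for illustration; the real namespace constant lives in the ingestion code:

```python
import uuid

def chunk_id(file_hash: str, chunk_index: int) -> uuid.UUID:
    """Deterministic chunk ID: the same file hash and chunk index always
    map to the same UUID, so re-ingesting an identical PDF produces the
    same row IDs instead of duplicates. NAMESPACE_URL is illustrative."""
    return uuid.uuid5(uuid.NAMESPACE_URL, f"{file_hash}:{chunk_index}")
```

Because the ID is a pure function of content hash and position, idempotent upserts fall out for free.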
ingested_files – Dedup registry.
| Column | Type | Purpose |
|---|---|---|
| file_hash | text | SHA-256 of the PDF – the dedup key |
| filename | text | Display name in the UI |
| document_type | text | Category, e.g. academic_syllabus |
| source_kind | text | pdf, code, markdown, url, etc. |
| data_shape | text | structured, unstructured, or hybrid |
| parser_kind | text | Parser used, e.g. pdf_partition, python_ast |
| chunk_count | int | Includes RAPTOR tree nodes |
| user_id | uuid | Tenant isolation |
| user_overridden | bool | True if user manually changed category – classifier skips |
chat_memory – Episodic memory, searchable by semantic similarity.
| Column | Type | Purpose |
|---|---|---|
| session_id | text | Groups messages from the same conversation |
| role | text | user or assistant |
| content | text | The message text |
| embedding | vector | For semantic search via match_memory RPC |
| user_id | uuid | Tenant isolation |
document_trees – Hierarchical tree index for structural queries.
| Column | Type | Purpose |
|---|---|---|
| file_hash | text | Links tree to the source document |
| tree_json | jsonb | Recursive node structure: {title, content, children} |
| user_id | uuid | Tenant isolation |
category_centroids – The document classifier's learned memory.
| Column | Type | Purpose |
|---|---|---|
| document_type | text | Category label |
| centroid_vector | array | Running average embedding of all docs of this type |
| document_count | int | Number of documents that contributed |
| user_id | uuid | Per-tenant centroids |
evaluation_logs – RAGAS quality metrics written after every query.
rerank_feedback – Every Cohere rerank decision, stored for future CrossEncoder distillation.
intent_feedback – Online training data for the intent classifier.
query_traces – Per-query trace record with route mode, selected experts, expert weights, candidate counts, diagnostics, and quality metrics.
graph_nodes / graph_edges – Graph foundation built during ingestion and enriched by query/feedback workflows.
graph_runs – Auditable record of each graph extraction/indexing pass, including parser/source metadata and content hash.
Category list – Derived directly from each tenant's ingested_files.document_type values.
Supabase RPC Functions
| Function | Purpose |
|---|---|
| hybrid_search(query_text, query_embedding, match_count, filter, semantic_weight, keyword_weight, p_user_id) | Combined BM25 + pgvector search (tenant-scoped overload) |
| match_memory(query_embedding, match_session_id, match_count) | Semantic search over chat history |
| insert_document_chunk(p_id, p_content, p_metadata, p_embedding, p_user_id) | Secure insert with explicit user_id |
| get_document_types() | Returns distinct categories for this tenant |
Row Level Security
Every table has RLS policies. Core rule: user_id = auth.uid() for reads.
Writes from Celery workers use the service-role key (Celery has no browser session so auth.uid() is NULL), but always inject user_id explicitly via the insert_document_chunk RPC – extracted from the JWT at the API boundary before the task is queued.
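The explicit injection can be sketched as a small payload builder. The helper name `build_chunk_insert` is illustrative; the parameter names are the documented RPC signature, and the empty-tenant guard is an assumed safety check, not confirmed behavior:

```python
def build_chunk_insert(chunk_id: str, content: str, metadata: dict,
                       embedding: list[float], user_id: str) -> dict:
    """Build the argument dict for the insert_document_chunk RPC.
    The worker runs with the service-role key (auth.uid() is NULL in
    the database), so tenant identity must travel with the payload."""
    if not user_id:
        # Refuse to write untenanted rows: RLS cannot save us here.
        raise ValueError("refusing to insert a chunk without a tenant id")
    return {
        "p_id": chunk_id,
        "p_content": content,
        "p_metadata": metadata,
        "p_embedding": embedding,
        "p_user_id": user_id,  # explicit tenant isolation at write time
    }
```

The dict would then be passed to the Supabase RPC call by the worker.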
The Ingestion Pipeline
Browser
POST /api/v1/ingest/upload (X-Auth-Token header)
FastAPI: JWT validated, upload limits checked, MIME type checked
Per-user document count checked (50 max)
PDF saved to temp file
process_pdf_task.delay() → Redis queue
Returns {task_id} immediately
Browser polls /api/v1/ingest/status/{task_id} every 2 seconds
Celery worker (background):
Step 1: Dedup check
SHA-256 fingerprint of PDF
Check ingested_files table (O(1) indexed lookup)
Already ingested → return "already_ingested"
user_overridden=True → skip classifier, use forced_category
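The dedup key itself is just a SHA-256 over the raw upload, which a short sketch makes concrete (the helper name is illustrative):

```python
import hashlib

def file_fingerprint(pdf_bytes: bytes) -> str:
    """Step 1 dedup key: SHA-256 of the raw PDF bytes. Identical uploads
    hash identically, so one indexed lookup in ingested_files can
    short-circuit the entire 60-120s pipeline."""
    return hashlib.sha256(pdf_bytes).hexdigest()
```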
Step 2: PDF partitioning (unstructured library)
partition_pdf() → OCR + layout detection
extract_images_from_pdf() → PyMuPDF, filters tiny/skewed images
Returns Element objects (Title, NarrativeText, Table, Image...)
Step 3: Classification (classifier.py)
Three-stage cascade:
Stage 1: Centroid nearest-neighbour (no API call, cosine similarity)
Confidence >= 0.72 → done
Stage 2: Ensemble vote (centroid + label-embed + TF-IDF)
Score >= 0.38 → done
Stage 3: LLM chain-of-thought (novel document types only)
Sparse/tabular pre-check: routes to visual classification if word count < 200
After classification: centroid updated with this document's vector
Step 4: Chunking + AI summaries
chunk_by_title() groups elements into logical sections
Chunks with tables or images: parallel AI vision summarisation (5 workers)
Each chunk becomes a LangChain Document with rich metadata
Step 5: RAPTOR tree indexing
Groups leaf chunks into clusters of 5
LLM generates parent summary for each cluster
Repeats up the tree until single root node
Root node answers "what is this document about?"
Leaf nodes answer specific detail questions
All nodes (leaves + summaries) uploaded to documents table
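The RAPTOR loop above can be sketched in a few lines. This is a simplification: the real pipeline clusters chunks semantically and summarises with an LLM, whereas here grouping is positional and `summarise` is a stub standing in for the model call:

```python
def build_raptor_tree(leaves: list[str], group_size: int = 5) -> list[str]:
    """RAPTOR-style tree indexing sketch: repeatedly group nodes into
    clusters of `group_size`, replace each cluster with a parent summary,
    and recurse until a single root remains. Returns every node
    (leaves + all summary levels), mirroring what gets uploaded."""
    def summarise(cluster: list[str]) -> str:
        return "summary(" + " + ".join(cluster) + ")"  # LLM stands in here

    all_nodes = list(leaves)
    level = leaves
    while len(level) > 1:
        parents = [summarise(level[i:i + group_size])
                   for i in range(0, len(level), group_size)]
        all_nodes.extend(parents)
        level = parents
    return all_nodes
```

For 25 leaves this yields 25 leaves, 5 intermediate summaries, and 1 root: retrieval can then match at whichever granularity fits the question.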
Step 6: Embedding + upload
Batch embed all nodes via nvidia-nemotron (2048 dims)
Insert each via insert_document_chunk RPC (explicit user_id)
Register in ingested_files
Invalidate semantic cache for this user (kb_version++)
Step 7: Graph foundation persistence
Document/entity/topic graph rows written to graph_nodes / graph_edges
Current ingestion boundary
- Live browser upload source: PDF only
- Live semantic category model: `document_type`
- Live source metadata columns: `source_kind`, `data_shape`, `parser_kind` (the initial Phase 2 substrate)
- Design rule: future code/config/API ingestion should not be bolted onto this PDF chunking path
The Retrieval Pipeline
Browser
POST /api/v1/query {query, category, history, session_id, alpha}
X-Auth-Token header
FastAPI validates JWT, starts SSE streaming response
Step 1: Intent analysis (analyse_intent)
Local sklearn classifier – under 5ms, no API call
Inputs: query text, has_category, has_history
Output: {is_clear, enriched_query, clarification_question}
If needs clarification → stream question back, stop
Clarification limit: after 2 consecutive turns, proceed regardless
Reference queries ("summarise it"): replaced with previous query
Every query logged to intent_feedback for online retraining
Step 1.5: Ambiguity / scope safety (check_query_ambiguity)
If the user has NOT pinned a document:
- If **multiple docs are in scope** and the query is **identity/page-scoped** (owner/title/publisher/cover/first page), Morpheus **asks the user to pick a document** (never guesses).
- Otherwise, Morpheus may ask a clarification question for generic queries when multiple docs match.
Implementation detail: ambiguity scoring uses `hybrid_search(..., p_user_id=...)` to avoid PostgREST overload ambiguity.
Step 2: Query routing
Route class chosen first:
exact_fact / page_scoped / summary / follow_up / compare / multi_part / relational / factoid / no_retrieval
Expert weights assigned and persisted:
dense_chunk / raptor_summary / graph_traversal / episodic_memory / hybrid_compare
Special deterministic branch:
identity_store
Structural queries?
→ tree_search(): recursive traversal of document_trees for this user
→ If tree search returns 0 results: falls back to hybrid search
Exact/page-scoped questions with curated evidence?
→ identity_store path
Relational / graph-supported questions?
→ graph retrieval can join the candidate pool
Everything else → retrieve_chunks() hybrid path
Step 3: retrieve_chunks() β hybrid retrieval path
a) Follow-up detection
Query ≤ 8 words with pronouns (it/this/that/they)?
Reuse _last_chunks[session_key] – no re-search
Safety guard: ordinal follow-ups like "the second one" must have an explicit referent (a list);
otherwise the API asks for clarification instead of guessing.
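The short-query-plus-pronoun rule can be sketched as a minimal heuristic. The pronoun set beyond it/this/that/they is an assumption, and the real detector also handles the ordinal-referent guard, which this sketch omits:

```python
PRONOUNS = {"it", "this", "that", "they", "them", "those"}

def is_follow_up(query: str, max_words: int = 8) -> bool:
    """Follow-up detection sketch: a short query (<= max_words) containing
    an anaphoric pronoun is treated as a follow-up, so the previous turn's
    retrieved chunks are reused instead of re-searching."""
    words = query.lower().strip("?!. ").split()
    return len(words) <= max_words and any(w in PRONOUNS for w in words)
```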
b) Semantic cache check
Embed query (256-entry in-memory LRU cache)
Scan Redis for cosine similarity ≥ 0.92
Cache hit → return __CACHE_HIT__ sentinel document
c) Query rewriting
LLM breaks query into 1-3 targeted sub-queries
Short queries (≤ 3 words) skip this step
d) Hybrid search (per sub-query)
hybrid_search RPC: BM25 + pgvector combined
alpha=0.5 = equal weight (adjustable via UI slider)
Deduplicates across sub-queries by chunk ID
Category filter active if user selected one
Graph pinning can restrict search to pinned files
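The alpha slider's effect can be illustrated with a linear blend. The actual fusion happens inside the hybrid_search RPC and may differ; this one-liner is only a sketch of the weighting the slider exposes:

```python
def hybrid_score(semantic: float, keyword: float, alpha: float = 0.5) -> float:
    """Score fusion sketch for the UI slider: alpha=1.0 is pure semantic
    (pgvector), alpha=0.0 is pure keyword (BM25), 0.5 weights both
    equally, matching the default shown above."""
    return alpha * semantic + (1.0 - alpha) * keyword
```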
e) Reranking (3-tier fallback)
Tier 1: Cohere rerank-multilingual-v3.0 (cloud, best quality)
Tier 2: CrossEncoder ms-marco-MiniLM-L-6-v2 (local, free)
Tier 3: Lexical Jaccard similarity (pure Python, always works)
Relevance threshold: 0.35 (relaxed to 0.05 for small corpus)
Diversity filter: max 2 chunks per source, cross-category seeding
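Tier 3 of the fallback above is simple enough to show in full. The lowercase whitespace tokenisation is a simplification of whatever the real implementation uses:

```python
def lexical_jaccard(query: str, chunk: str) -> float:
    """Tier-3 reranker fallback sketch: token-set Jaccard similarity.
    Pure Python with no model or API dependency, so ranking still works
    when both Cohere and the local CrossEncoder are unavailable."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)
```

Quality is far below a cross-encoder, but the guarantee that some ordering always exists is what earns it the last slot in the chain.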
f) Log rerank feedback (fire-and-forget thread)
Step 4: generate_answer_stream()
__CACHE_HIT__ sentinel → stream cached answer directly, skip LLM
match_memory RPC → retrieve past relevant Q&A pairs (episodic memory)
identity_store-only exact/page-scoped route can bypass normal generative answering
Build prompt: system + retrieved chunks + memories + history + query
Stream tokens: Groq → Gemini → OpenRouter fallback chain
After streaming: save Q&A pair to chat_memory (background thread)
Store answer in semantic cache (versioned key, TTL by document type)
Step 5: Trace + quality persistence
Persist query_traces row:
trace_id, route_mode, selected_experts, expert_weights,
candidate_counts, doc_diagnostics, quality_metrics, latency
Persist evaluation_logs
Optionally enrich trace graph links
Step 6: Emit sources
{type: "done", sources: [...], images: [...]} SSE event
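The SSE frames emitted through this pipeline follow the standard `data:` line format. A minimal serializer sketch (helper name illustrative, payload fields taken from the events shown above):

```python
import json

def sse_event(payload: dict) -> str:
    """Format one Server-Sent Events frame as a streaming endpoint would
    emit it: a `data:` line carrying a JSON payload, terminated by a
    blank line so the browser's EventSource/reader can split frames."""
    return f"data: {json.dumps(payload)}\n\n"
```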
Deterministic vs Probabilistic Paths
This distinction should stay explicit in the architecture and in future roadmap work.
More deterministic today
- JWT validation and tenant scoping
- File dedup
- MIME gating for ingestion
- Identity-store retrieval for exact/page-scoped questions
- PageIndex tree traversal
- Explicit ambiguity gating and document pinning
- Query trace persistence
More probabilistic today
- Query rewriting
- Embedding retrieval
- BM25/vector score balancing
- Reranking
- LLM generation
- LLM fallback document classification
- RAPTOR summary generation
Design rule
For high-precision questions, Morpheus should prefer deterministic branches first and only use probabilistic retrieval/generation when necessary.
The Provider System
ProviderFactory.build_chat_llm(purpose=...)
purpose="text" Groq → Gemini → OpenRouter
purpose="ingestion" Gemini (1M context) → OpenRouter
purpose="vision" Gemini (native multimodal) → OpenRouter vision
purpose="rewriter" OpenRouter → Groq
purpose="classifier" OpenRouter classifier models only
Embeddings: nvidia/llama-nemotron-embed-vl-1b-v2:free (2048 dims) → text-embedding-3-small
Current model lists (all configurable in config.py):
| Provider | Models (in fallback order) |
|---|---|
| Groq | llama-4-scout-17b → llama-3.3-70b-versatile → qwen3-32b → llama-3.1-8b-instant |
| Gemini | gemini-2.5-flash → gemini-2.5-flash-lite |
| OpenRouter | stepfun/step-3.5-flash:free → nvidia/nemotron-3-super-120b:free → arcee-ai/trinity-large-preview:free → meta-llama/llama-3.3-70b-instruct:free → more |
Retry logic in each provider wrapper: 404, 429, 503 are retryable – the wrapper moves to the next model in the list.
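The retry-then-fall-through behavior can be sketched as follows. `ProviderError` and `call_with_fallback` are illustrative names, with `call(model)` standing in for the real provider invocation:

```python
RETRYABLE = {404, 429, 503}  # statuses that justify trying the next model

class ProviderError(Exception):
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_fallback(models, call):
    """Try each model in order. Retryable statuses (404/429/503) move on
    to the next model; anything else is surfaced immediately."""
    last = None
    for model in models:
        try:
            return call(model)
        except ProviderError as err:
            if err.status not in RETRYABLE:
                raise  # non-retryable: fail fast
            last = err  # retryable: fall through to the next model
    raise last or RuntimeError("no models configured")
```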
The Semantic Cache
cache_manager.py – version-invalidated, similarity-based lookup.
- Each user has a `kb_version` integer in Redis: `nexus:kb_version:{user_id}`
- Cache entries keyed by version: `nexus:qcache:{user_id}:v{version}:...`
- Lookup: scan all entries for this user+version, find best cosine similarity
- Hit threshold: 0.92 (strict – wrong answers are worse than cache misses)
- Corpus change (ingest or delete): `increment_kb_version()` – version N → N+1
- All v1 entries become invisible under v2 – no explicit deletion needed
TTL by document type: academic_syllabus and reference_chart cache for 7 days; technical_manual and research_paper for 3 days; financial_report and hr_policy for 1 day; general_document for 1 hour.
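The version-invalidation trick can be shown with an in-memory stand-in for Redis. The class and method names here are illustrative except `increment_kb_version`, which the text above names; the 0.92 threshold matches the documented setting:

```python
import math

class VersionedCache:
    """Sketch of the version-invalidated semantic cache. Entries are
    keyed by the tenant's kb_version; bumping the version makes every
    older entry invisible without deleting anything."""

    HIT_THRESHOLD = 0.92  # strict: wrong answers are worse than misses

    def __init__(self):
        self.version = 1
        self.entries = {}  # (version, embedding tuple) -> cached answer

    def put(self, embedding, answer):
        self.entries[(self.version, tuple(embedding))] = answer

    def get(self, embedding):
        best, best_sim = None, 0.0
        for (ver, key), answer in self.entries.items():
            if ver != self.version:
                continue  # stale version: invisible, no cleanup needed
            sim = self._cosine(embedding, key)
            if sim > best_sim:
                best, best_sim = answer, sim
        return best if best_sim >= self.HIT_THRESHOLD else None

    def increment_kb_version(self):
        self.version += 1  # corpus changed: all cached answers go stale

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0
```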
Trace-First Quality Workflow
Phase 0 is partly implemented in code and should now be treated as normal workflow, not optional ops work.
Already live
- `trace_id` emitted through the query path
- `query_traces` persistence
- `selected_experts` and `expert_weights`
- rerank audit data
- route class and route reason
- candidate counts and document diagnostics
- answer preview and latency logging
- `evaluation_logs` writes
What still needs to become routine
- Manual review set of 50-100 real queries
- Explicit tagging of recurring failure modes
- Regular tuning cycles driven by query traces instead of anecdotal examples
This phase remains the entry point for major retrieval changes.
The Intent Classifier
intent_classifier.py – sklearn, runs locally, under 5ms per query.
What it classifies: Is this query clear enough to search, or should we ask for clarification?
Features: has_category flag, has_history flag, query text embedded via all-MiniLM-L6-v2.
Online learning: Every query is logged to intent_feedback. Every 25 rows, the model retrains on accumulated examples and saves to intent_model.pkl. Active learning targets the uncertain region (entropy 0.40–0.60) to maximize training efficiency.
Clarification limit: After 2 consecutive clarification turns in the same session, the system proceeds regardless. This prevents the system from getting stuck in a loop with genuinely ambiguous users.
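The uncertain-region filter can be sketched with binary entropy. Entropy in bits and the exact band semantics are assumptions about how intent_classifier.py formulates it:

```python
import math

def in_uncertain_band(p_clear: float, low: float = 0.40, high: float = 0.60) -> bool:
    """Active-learning filter sketch: keep a query for retraining only
    when the classifier's binary entropy falls inside the uncertain band
    described above. Confident predictions (entropy near 0) and coin
    flips (entropy near 1) both contribute less per label."""
    if p_clear in (0.0, 1.0):
        return False  # zero entropy: the model is certain
    h = -(p_clear * math.log2(p_clear) + (1 - p_clear) * math.log2(1 - p_clear))
    return low <= h <= high
```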
The Document Classifier
classifier.py – three-stage cascade that learns with every ingestion.
Incoming document
|
v
Sparse/tabular pre-check (word count < 200 OR unique word ratio > 0.85)
YES → visual classification (structural fingerprint to LLM)
NO → continue
|
v
Stage 1: Centroid nearest-neighbour
Cosine similarity to stored category centroids
Confidence ≥ 0.72 → done (no API call)
|
v
Stage 2: Ensemble vote
Signal A: cosine to centroids (weight 0.45)
Signal B: cosine to label embeddings (weight 0.30)
Signal C: TF-IDF keyword matching (weight 0.25)
Score ≥ 0.38 → done
|
v
Stage 3: LLM chain-of-thought
Sends excerpt to classifier LLM
Classifies FORMAT + STRUCTURE, not just topic
Fallback: "general_document"
After classification, the winning category's centroid is updated with this document's embedding – the classifier improves with every ingestion.
User override lock: If ingested_files.user_overridden=True, the entire cascade is skipped. Returns synthetic result with stage_used="user_override", confidence=1.0.
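The centroid update is a running average over contributing documents, matching the `centroid_vector`/`document_count` columns described earlier. A pure-Python sketch (helper name illustrative):

```python
def update_centroid(centroid: list[float], count: int,
                    doc_embedding: list[float]) -> tuple[list[float], int]:
    """Running-average centroid update applied after every classification:
    the winning category's centroid shifts toward the new document's
    embedding, weighted by how many documents already contributed."""
    new_count = count + 1
    new_centroid = [(c * count + v) / new_count
                    for c, v in zip(centroid, doc_embedding)]
    return new_centroid, new_count
```

Early documents move the centroid a lot; once a category has many members, each new document only nudges it, which is what makes Stage 1 increasingly stable.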
The Three Self-Improvement Loops
Morpheus has three feedback loops that make it more accurate over time:
Loop 1 – Intent classifier (every 25 queries)
User queries logged to intent_feedback. Every 25 new rows, the model retrains automatically and saves to disk. Learns the specific query patterns of your users.
Loop 2 – Document classifier (every ingestion)
Each ingested document updates its category centroid. The next similar document gets classified at Stage 1 (no API call) instead of needing the LLM fallback. Classification gets faster and more accurate as the corpus grows.
Loop 3 – Reranker distillation (pipeline in place)
Every query logs Cohere rerank scores to rerank_feedback. These accumulated labels will be used to train a local CrossEncoder to match Cohere quality without the API cost.
What Morpheus Does Not Yet Have
These should not be documented or discussed as shipped capabilities:
- URL ingestion
- Markdown ingestion
- Browser-upload code/config/API source ingestion (only the local, scripted Python code graph indexing path is live)
- True agentic retrieval with an iterative planner/executor loop
Future additions should be documented here only after the code path, tests, and rollout controls exist.
The Frontend
Authentication Flow
Page load
initSupabase() fetches Supabase keys from /api/v1/config
supabaseClient.auth.getSession()
Session exists β showApp() + bootApp()
No session β showLogin()
Login
supabaseClient.auth.signInWithPassword(email, password)
Supabase-js stores JWT in localStorage automatically
Every API call: getSupabaseToken() reads it from localStorage
Sent as X-Auth-Token header
Backend require_auth_token Depends() validates JWT and returns user_id
Global State
state.js – single source of truth:
| Key | Contents |
|---|---|
| STATE.files | Ingested document list from /api/v1/corpus/files |
| STATE.categories | Category strings |
| STATE.catColors | Color mapping for graph |
| STATE.chatHistory | Current conversation turns |
| STATE.sessionId | UUID per browser tab |
| STATE.simulation | D3 force simulation reference |
| STATE.alpha | Retrieval weight slider (0=keyword, 1=semantic) |
| STATE.isThinking | Double-submit guard |
Upload + Progress
corpus.js – processUpload() calls apiIngestFile(), then enters pollIngestStatus() which polls every 2 seconds and exits only on COMPLETED or FAILED. Shows cycling heartbeat messages through pipeline stages while waiting.
Chat Streaming
chat.js – sendChat() has a 500ms debounce guard. The assistant bubble is created immediately with thinking dots. async onToken() yields to the browser with await new Promise(r => setTimeout(r, 0)) after each token update so the DOM repaints during streaming rather than all at once at the end.
Graph
graph.js – Obsidian-style D3 force simulation. graphReheat() uses alpha(0.3) not alphaTarget(0.2). The alpha() method sets current energy directly – it forces a restart even when the simulation has fully stopped. alphaTarget only sets where energy wants to decay toward, which does nothing if the simulation is already below alphaMin. onGraphTabVisible() is called from main.js with a 50ms delay so the CSS display change propagates before D3 reads panel dimensions.
Key Design Decisions
Why Celery + Redis? Ingestion takes 60–120 seconds (OCR, AI summaries, RAPTOR tree building). FastAPI requests time out well before that. Celery lets the task run in the background while the browser polls for status.
Why service-role key for writes? Celery workers have no browser session, so auth.uid() is NULL in the database. The security boundary is enforced at the API level – the JWT is validated before the task is queued and the user_id is passed explicitly to the insert_document_chunk RPC.
Why RAPTOR tree indexing? Flat chunking misses questions that span multiple sections ("total credits across all categories"). RAPTOR builds parent summaries that aggregate child content, enabling retrieval at multiple granularities. Root nodes answer overview questions; leaf nodes answer specific details.
Why tree search for structural queries? Vector similarity is calibrated to semantic meaning, not document structure. A query for "Capstone Project credits" fails vector search because the chunk summary emphasises overall credit structure, not individual line items. Tree search traverses the document hierarchy and finds the exact node.
Why semantic cache with version invalidation? Repeated questions should not cost API calls. But cached answers must go stale when the corpus changes. Version-based invalidation solves the second problem without tracking which cache entry references which document β increment the version, all old entries become invisible.
Why 3-tier reranker? Cohere costs money and has rate limits. CrossEncoder is free but needs local GPU. Lexical always works. This order maximises quality while guaranteeing retrieval never fails completely.
Why alpha(0.3) not alphaTarget(0.2) in D3 graph reheat? alphaTarget sets where the simulation wants to decay toward. If the simulation has already stopped (alpha < alphaMin = 0.001), alphaTarget does nothing. The alpha() method sets current energy directly and always forces a restart.
Why quality work before new source kinds? The current system already has enough retrieval and generation complexity that quality regressions can hide behind "more features." The project workflow now explicitly prioritizes trace review, failure-mode reduction, and retrieval tuning before adding URL, Markdown, or code ingestion.
Why code ingestion should be separate later? Code/config/API material is structurally different from prose documents. When Morpheus gains code support, it should use a dedicated exact/AST-first path rather than treating code as generic document chunks.
Environment Variables
| Variable | Purpose | Required |
|---|---|---|
| SUPABASE_URL | Supabase project URL | Yes |
| SUPABASE_ANON_KEY | Frontend-safe key (user-scoped reads) | Yes |
| SUPABASE_SERVICE_KEY | Server-only key (bypasses RLS for writes) | Yes |
| SUPABASE_JWT_SECRET | JWT signature verification | Yes |
| OPENROUTER_API_KEY | OpenRouter API access | Yes |
| GROQ_API_KEY | Groq API access (primary generation) | Yes |
| GEMINI_API_KEY | Google Gemini access (ingestion + vision) | Yes |
| COHERE_API_KEY | Cohere reranking | Yes |
| REDIS_URL | Redis connection string | Yes |
| CELERY_WORKER_CANCEL_ON_CONNECTION_LOSS | Whether broker disconnect cancels active ingestion tasks; keep false for long uploads | No |
| CELERY_TASK_ACKS_LATE | Whether ingestion tasks acknowledge only after completion; keep false to avoid redelivery loops on flaky Redis | No |
| CELERY_TASK_REJECT_ON_WORKER_LOST | Whether lost workers requeue tasks; keep false unless duplicate-safe retries are required | No |
| MASTER_ADMIN_KEY | Admin endpoint access | Yes |
| ALLOWED_ORIGINS | CORS allowed origins (use * for dev only) | Yes |
| DOCS_ENABLED | Enable /docs and /redoc (set false in prod) | No |
| LOG_LEVEL | Logging verbosity (INFO or DEBUG) | No |
| AUTO_START_CELERY | Auto-spawn Celery subprocess on startup | No |
| HF_HUB_DISABLE_XET | Disable Xet-backed model downloads during build/runtime | No |
Common Debugging
Ingestion crashes at embedding step
Look for: ValueError: Model X returned null embeddings
Cause: OpenRouter returns HTTP 200 with data=null
Fix: FallbackEmbeddings null guard retries the next model automatically – check provider logs for rate limits
Cache not invalidating after delete
Check Redis for key nexus:kb_version:{user_id}
If missing: the version key was never written – run a fresh ingest to initialise it
Graph not reheating on tab switch
Check onGraphTabVisible is defined in graph.js
Check _hookGraphTabVisible IIFE in main.js
Expected: graph animates within 50ms of tab click
Classifier ignoring user category
Check: ingested_files.user_overridden = true for that file hash
Logs should show: User override active → forcing category 'X', skipping classifier
__CACHE_HIT__ showing as a source chip in the UI
Hard-refresh (Ctrl+Shift+R) to load the latest chat.js
The visibleSources filter in onDone() strips sentinel entries
Gemini 404 errors during ingestion
Check GEMINI_TEXT_MODELS and GEMINI_VISION_MODELS in config.py
Must be gemini-2.5-flash and gemini-2.5-flash-lite
gemini-1.5-flash and gemini-2.0-flash are deprecated
Update Rule for This File
When the architecture changes, update this document in this order:
- Current live behavior
- Trace and observability impact
- Deterministic vs probabilistic impact
- Roadmap/status movement between phases
Do not document planned capabilities as shipped.
Complete Request Flow Example
User asks: "What are the core courses?"
Category filter: academic_syllabus
1. Browser POST /api/v1/query
Headers: X-Auth-Token: eyJ...
Body: {query, category="academic_syllabus", history, session_id, alpha=0.5}
2. require_auth_token: decodes JWT → user_id="ee903934..."
3. analyse_intent()
sklearn: needs_clarification=False, confidence=1.00
Category active: enriched query = "query academic_syllabus"
Logs to intent_feedback
4. Route selection
route_class = factoid
selected_experts = ["dense_chunk", ...]
expert_weights persisted into trace metadata
_should_use_tree_path() → False (not a structural keyword query)
retrieve_chunks() hybrid path
5. Semantic cache check: MISS (first time this query)
generate_sub_queries → ["B.Tech CSE core courses", "program core credits", ...]
hybrid_search RPC × 3 sub-queries → 12 raw candidates
Cohere rerank → ranked by relevance score
Threshold + diversity filter → 3 final chunks
Store in _last_chunks[session_key]
Log rerank feedback (background thread)
6. generate_answer_stream()
match_memory RPC → 2 past relevant Q&A pairs
Build prompt: system + 3 chunks + 2 memories + history + query
Groq astream() → tokens arrive one by one
Yield {type:"token", content:"The"}, {type:"token", content:" core"}, ...
After streaming: save Q&A to chat_memory (background thread)
Store in semantic cache (version v4, TTL 7 days for academic_syllabus)
Persist query trace and evaluation logs
7. Yield {type:"done", sources:[...], images:[...], trace_id:"..."}
8. Browser: onToken() fills bubble token by token
onDone() appends source chips and keeps the trace id available for review
Last updated: April 2026