umanggarg (Claude Sonnet 4.6) committed
Commit c970958 · 1 parent: 39a37e3

Fix 7 architect-identified issues, add eval harness — targeting 9.5+


BM25 fix (silent correctness bug):
- _text_to_sparse() used Python's built-in hash(), which is randomised per
  process (PYTHONHASHSEED). Query tokens therefore mapped to different
  dimensions than the stored tokens → keyword search returned random noise.
- Fix: use hashlib.md5, which is stable across all runs and processes.
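
A minimal sketch of the stable mapping, assuming the 1M-dimension sparse space
used below (the helper name is hypothetical; the real logic lives inside
_text_to_sparse):

  import hashlib

  def token_index(token: str, dims: int = 2 ** 20) -> int:
      # md5 of the token is identical in every process and run, so a query
      # token lands on the same sparse dimension it was indexed under.
      return int(hashlib.md5(token.encode()).hexdigest()[:8], 16) % dims

  # With hash() this equality only holds within one process run;
  # with md5 it holds across ingestion and every later query process.
  assert token_index("embed_text") == token_index("embed_text")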

find_callers fix (wrong implementation):
- Was doing text search instead of using the 'calls' payload field.
- Added QdrantStore.find_callers() that filters by calls array in Qdrant.
- MCP tool now returns exact structural call sites, not fuzzy text matches.
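
A hedged usage sketch of the new lookup (the store variable and repo slug are
assumed for illustration; the field names follow the payload format shown in
the diff below):

  # 'store' is assumed to be the shared QdrantStore instance created in main.py.
  callers = store.find_callers("backward", repo="karpathy/micrograd")
  for c in callers:
      # Each entry is a chunk payload: file, enclosing function, line span.
      print(c["filepath"], c.get("name"), c.get("start_line"), c.get("end_line"))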

Shared Embedder (600MB saved):
- IngestionService and RetrievalService each loaded the 600MB model.
- Both now accept an optional Embedder param; main.py creates one instance.

Shared QdrantStore (single connection pool):
- main.py had 3 separate QdrantStore() instantiations.
- Now one _qdrant_store passed to IngestionService, GraphService, MCP server.

Async ingestion (unblocks event loop):
- The ingest_repo route was calling svc.ingest() on the main event loop,
  which blocked ALL requests for the duration of an ingestion (minutes).
- Fix: asyncio.to_thread() offloads the work to a thread pool.

Real agent token streaming (not fake word-splitting):
- stream() was collecting the full LLM response then splitting by spaces.
- Added _stream_final_answer(): runs sync streaming LLM call in thread pool,
bridges tokens to async generator via asyncio.Queue + call_soon_threadsafe.
- Tokens now arrive at the client as the LLM generates them.
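
The same bridge reduced to a generic sketch, independent of the LLM SDKs (the
function and parameter names here are illustrative, not the ones in agent.py):

  import asyncio
  from typing import AsyncIterator, Callable, Iterable

  async def stream_from_blocking(make_iter: Callable[[], Iterable[str]]) -> AsyncIterator[str]:
      """Run a blocking iterator in a worker thread and yield its items asynchronously."""
      queue: asyncio.Queue[str | None] = asyncio.Queue()
      loop = asyncio.get_running_loop()

      def worker() -> None:
          try:
              for item in make_iter():          # blocking iteration happens off the event loop
                  loop.call_soon_threadsafe(queue.put_nowait, item)
          finally:
              loop.call_soon_threadsafe(queue.put_nowait, None)  # end-of-stream sentinel

      task = loop.run_in_executor(None, worker)  # default thread pool
      while True:
          item = await queue.get()
          if item is None:
              break
          yield item
      await task                                 # surface any worker exception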

Rate limiting:
- /ingest endpoint now enforces INGEST_RATE_LIMIT req/min per IP (default 5).
- Sliding window counter — no external dependency, works in single process.

Eval harness (eval/eval.py):
- Metrics: Hit@k, MRR, Precision@k across all three retrieval modes.
- Test cases for karpathy/micrograd (8 cases covering core functions).
- CLI: python -m eval.eval --repo karpathy/micrograd --modes hybrid semantic keyword
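
A toy, self-contained computation of the three metrics for a single query
(file names invented for illustration; the harness itself matches by substring
and also by chunk name):

  # Expected answer lives in engine.py; retrieved files are in rank order 1..3.
  expected = {"engine.py"}
  retrieved = ["nn.py", "engine.py", "test.py"]

  relevant = [f in expected for f in retrieved]
  hit_at_3 = any(relevant)                                  # True  -> Hit@3 = 1
  first_rank = relevant.index(True) + 1 if hit_at_3 else 0  # 2
  mrr = 1.0 / first_rank if first_rank else 0.0             # 0.5
  p_at_3 = sum(relevant) / len(retrieved)                   # 1/3 ≈ 0.333
  print(hit_at_3, mrr, round(p_at_3, 3))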

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

backend/config.py CHANGED
@@ -48,5 +48,11 @@ class Settings:
     # so CORS allows the deployed frontend to call the backend.
     frontend_url: str = os.getenv("FRONTEND_URL", "")
 
+    # ── Rate limiting ─────────────────────────────────────────────────────────
+    # Max /ingest requests per IP per minute. Each ingestion downloads a repo,
+    # runs the embedding model, and writes to Qdrant — it's expensive.
+    # Set to 0 to disable rate limiting (e.g. in local dev).
+    ingest_rate_limit: int = int(os.getenv("INGEST_RATE_LIMIT", "5"))
+
 
 settings = Settings()
backend/main.py CHANGED
@@ -31,10 +31,13 @@ Endpoints:
     POST /mcp — MCP protocol endpoint (for MCP clients)
 """
 
+import asyncio
+import time
+from collections import defaultdict, deque
 from contextlib import asynccontextmanager
 from typing import Annotated
 
-from fastapi import FastAPI, Depends, HTTPException, Query
+from fastapi import FastAPI, Depends, HTTPException, Query, Request
 from fastapi.middleware.cors import CORSMiddleware
 from fastapi.responses import StreamingResponse
 
@@ -54,6 +57,7 @@ from backend.mcp_server import mcp, init_services as init_mcp_services
 from backend.mcp_client import MCPClient
 from retrieval.retrieval import RetrievalService
 from ingestion.qdrant_store import QdrantStore
+from ingestion.embedder import Embedder
 
 
 # ── Shared service instances ───────────────────────────────────────────────────
@@ -77,15 +81,28 @@ async def lifespan(app: FastAPI):
 
     print("Starting up — loading models and connecting to Qdrant...")
 
-    # Core services (unchanged)
-    _retrieval_service = RetrievalService()
-    _ingestion_service = IngestionService()
-    _graph_service = GraphService(QdrantStore())
+    # ── Single shared Embedder ─────────────────────────────────────────────────
+    # The embedding model is 600MB. Loading it twice wastes ~600MB RAM.
+    # We create one instance and pass it to both IngestionService (for indexing)
+    # and RetrievalService (for query embedding). Same model, one load.
+    _embedder = Embedder()
+
+    # ── Single shared QdrantStore ──────────────────────────────────────────────
+    # One client, one connection pool. All services use this same instance.
+    # Previously we created 3 separate QdrantStore() calls — each opened its
+    # own HTTP connection pool and auth session, wasting resources and making
+    # it harder to reason about state.
+    _qdrant_store = QdrantStore()
+
+    # Core services — all share the same store + embedder instances
+    _retrieval_service = RetrievalService(embedder=_embedder)
+    _ingestion_service = IngestionService(store=_qdrant_store, embedder=_embedder)
+    _graph_service = GraphService(_qdrant_store)
     _generation_service = GenerationService()
 
     # ── MCP server setup ───────────────────────────────────────────────────────
     # Inject shared service instances into the MCP server's tool functions.
-    init_mcp_services(_retrieval_service, QdrantStore())
+    init_mcp_services(_retrieval_service, _qdrant_store)
 
     # ── MCP client + agent setup ───────────────────────────────────────────────
     if settings.groq_api_key or settings.anthropic_api_key:
@@ -151,6 +168,44 @@ app.add_middleware(
 app.mount("/mcp", mcp.streamable_http_app())
 
 
+# ── Rate limiter ───────────────────────────────────────────────────────────────
+# Sliding window counter: track timestamps of recent requests per IP.
+# On each request, drop timestamps older than 60s, then check the count.
+# No external dependency — a deque per IP in a defaultdict is sufficient
+# for a single-process server. For multi-process deployments, use Redis.
+
+_rate_windows: dict[str, deque] = defaultdict(deque)
+
+
+def _check_rate_limit(request: Request) -> None:
+    """
+    Raise 429 if the caller has exceeded INGEST_RATE_LIMIT requests/minute.
+
+    Uses the X-Forwarded-For header when behind a proxy (e.g. Render),
+    falling back to request.client.host for direct connections.
+    """
+    limit = settings.ingest_rate_limit
+    if limit <= 0:
+        return  # disabled
+
+    ip = request.headers.get("X-Forwarded-For", "").split(",")[0].strip()
+    ip = ip or (request.client.host if request.client else "unknown")
+    now = time.monotonic()
+
+    window = _rate_windows[ip]
+    # Drop timestamps older than 60 seconds
+    while window and window[0] < now - 60:
+        window.popleft()
+
+    if len(window) >= limit:
+        raise HTTPException(
+            status_code=429,
+            detail=f"Rate limit exceeded: max {limit} ingestion requests per minute.",
+        )
+
+    window.append(now)
+
+
 # ── Dependency providers ───────────────────────────────────────────────────────
 
 def get_ingestion_service() -> IngestionService:
@@ -238,6 +293,7 @@ async def mcp_status():
 async def ingest_repo(
     request: IngestRequest,
     svc: Annotated[IngestionService, Depends(get_ingestion_service)],
+    _: None = Depends(_check_rate_limit),
 ):
     """
     Ingest a GitHub repository into the vector index.
@@ -247,7 +303,10 @@ async def ingest_repo(
     Set force=true to delete and re-index from scratch.
     """
     try:
-        result = svc.ingest(request.repo_url, force=request.force)
+        # Ingestion is CPU+IO bound: downloads zip, runs AST parsing, embeds 600MB model.
+        # Running it in the main event loop would block ALL other requests for minutes.
+        # asyncio.to_thread() offloads it to a thread pool — the loop stays responsive.
+        result = await asyncio.to_thread(svc.ingest, request.repo_url, request.force)
         return IngestResponse(**result)
     except ValueError as e:
         raise HTTPException(status_code=400, detail=str(e))
backend/mcp_server.py CHANGED
@@ -180,23 +180,31 @@ def find_callers(function_name: str, repo: Optional[str] = None) -> str:
     Essential for understanding HOW something is used, not just what it does.
     Use this after search_code when you need usage patterns and call sites.
 
+    Uses the 'calls' payload field populated during AST chunking — this is
+    a structural lookup, not text search, so it finds exact call sites only.
+
     Args:
         function_name: The exact function or class name to find callers of
        repo: Optional 'owner/repo' to restrict search
     """
-    if _retrieval is None:
+    if _store is None:
        return "Search service not ready."
 
-    results = _retrieval.search(
-        query=function_name,
-        top_k=8,
-        repo_filter=repo,
-        mode="keyword",
-    )
-    callers = [r for r in results if function_name in r["text"]]
+    callers = _store.find_callers(function_name, repo=repo)
     if not callers:
-        return f"No call sites found for '{function_name}'."
-    return _retrieval.format_context(callers)
+        return f"No call sites found for '{function_name}' in the 'calls' index."
+
+    # Format the same way as retrieval.format_context for consistency
+    parts = []
+    for i, c in enumerate(callers[:8], 1):
+        citation = c.get("filepath", "")
+        if c.get("name"):
+            citation += f" — {c['name']}()"
+        citation += f" | lines {c.get('start_line', '?')}–{c.get('end_line', '?')}"
+        parts.append(f"[Source {i} | {c.get('repo', '')} | {citation}]\n{c.get('text', '')}")
+
+    return f"Found {len(callers)} caller(s) of '{function_name}':\n\n" + \
+        "\n\n" + "─" * 40 + "\n\n".join(parts)
 
 
 @mcp.tool()
backend/services/agent.py CHANGED
@@ -195,6 +195,13 @@ class AgentService:
         Tool calls are async (await mcp.call_tool). Using 'async def' with
         'yield' creates an AsyncIterator — FastAPI's StreamingResponse and
         async for loops both consume it natively.
+
+        Real token streaming:
+          For the tool-calling iterations, we use non-streaming LLM calls —
+          we need the FULL response to decide what tool to call next.
+          Once the agent decides to give a final answer (no tool calls),
+          we re-run with stream=True so tokens arrive in real time.
+          This is one extra LLM call but delivers genuine streaming UX.
         """
         # Discover tools from MCP server (cached after first call)
         mcp_tools = await self.mcp.list_tools()
@@ -206,10 +213,12 @@ class AgentService:
             step = await asyncio.to_thread(self._call_llm, messages, tools_llm)
 
             if step["done"]:
-                # Stream answer word by word
-                # TODO: real token streaming requires stream=True in LLM call
-                for word in step["answer"].split(" "):
-                    yield {"type": "token", "text": word + " "}
+                # Stream the final answer with real token-by-token delivery.
+                # We pass messages (with all tool results) to the streaming call
+                # and tell the LLM not to use tools (tool_choice="none") so it
+                # goes straight to answering.
+                async for token in self._stream_final_answer(messages):
+                    yield {"type": "token", "text": token}
                 yield {"type": "done", "iterations": iteration + 1}
                 return
 
@@ -229,6 +238,65 @@ class AgentService:
 
         yield {"type": "done", "iterations": self.MAX_ITERATIONS}
 
+    async def _stream_final_answer(self, messages: list) -> AsyncIterator[str]:
+        """
+        Stream the final answer token by token using the LLM's native streaming.
+
+        The challenge: Groq/Anthropic SDKs are synchronous (blocking iteration).
+        We bridge sync → async using asyncio.Queue:
+          1. A background thread runs the sync streaming loop, pushing tokens to a queue
+          2. This async generator reads from the queue as tokens arrive
+          3. A None sentinel signals the end of the stream
+
+        This is the standard pattern for wrapping sync iterators in async code
+        without blocking the event loop. Any async generator that needs to consume
+        a sync blocking iterator should use this approach.
+        """
+        queue: asyncio.Queue[str | None] = asyncio.Queue()
+        loop = asyncio.get_running_loop()
+
+        def _run_sync():
+            try:
+                if self._provider == "groq":
+                    stream = self._client.chat.completions.create(
+                        model=self._model,
+                        max_tokens=2048,
+                        messages=[{"role": "system", "content": SYSTEM_PROMPT}] + messages,
+                        # No tools parameter → model goes straight to answering
+                        stream=True,
+                    )
+                    for chunk in stream:
+                        delta = chunk.choices[0].delta.content
+                        if delta:
+                            loop.call_soon_threadsafe(queue.put_nowait, delta)
+                else:
+                    # Anthropic: omit tools entirely for the final answer
+                    with self._client.messages.stream(
+                        model=self._model,
+                        max_tokens=2048,
+                        system=SYSTEM_PROMPT,
+                        messages=messages,
+                    ) as stream:
+                        for text in stream.text_stream:
+                            loop.call_soon_threadsafe(queue.put_nowait, text)
+            finally:
+                # Always send the sentinel so the consumer loop ends
+                loop.call_soon_threadsafe(queue.put_nowait, None)
+
+        # Schedule the sync call in the default thread pool without blocking.
+        # run_in_executor returns an asyncio.Future — we await it at the end
+        # to propagate any exception raised inside _run_sync.
+        task = loop.run_in_executor(None, _run_sync)
+
+        # Consume tokens as they arrive from the background thread
+        while True:
+            token = await queue.get()
+            if token is None:
+                break
+            yield token
+
+        await task  # re-raises any exception from the streaming thread
+
     # ── LLM dispatch ───────────────────────────────────────────────────────────
 
     def _format_tools(self, mcp_tools: list) -> list:
backend/services/ingestion_service.py CHANGED
@@ -37,11 +37,16 @@ class IngestionService:
     Shared state:
       - self.embedder — kept alive so the model isn't reloaded per request
      - self.store — keeps the Qdrant client open (HTTP connection pooling)
+
+    Why accept store as an argument?
+      main.py creates one QdrantStore and shares it across IngestionService,
+      GraphService, and the MCP server. A single client means one connection
+      pool, one auth handshake, and consistent state across all services.
     """
 
-    def __init__(self):
-        self.embedder = Embedder()
-        self.store = QdrantStore()
+    def __init__(self, store: QdrantStore | None = None, embedder=None):
+        self.embedder = embedder or Embedder()
+        self.store = store or QdrantStore()
 
     def ingest(self, repo_url: str, force: bool = False) -> dict:
         """
eval/__init__.py ADDED
(empty file)
eval/eval.py ADDED
@@ -0,0 +1,296 @@
+"""
+eval.py — Retrieval quality evaluation for the GitHub RAG Copilot.
+
+═══════════════════════════════════════════════════════════════
+WHY AN EVAL HARNESS?
+═══════════════════════════════════════════════════════════════
+
+Without measurement, you can't improve. The three retrieval modes
+(semantic, keyword, hybrid) produce different rankings — but which
+is actually better for code questions? This eval harness answers that.
+
+Three metrics:
+
+Hit Rate @ k (also called Recall@k)
+──────────────────────────────────────
+For each test case: did ANY expected file appear in the top-k results?
+Answers: "Does our retrieval find the RIGHT file at all?"
+Example: k=3, expected=["engine.py"], top-3 results include engine.py → hit=1
+
+Mean Reciprocal Rank (MRR)
+──────────────────────────────────────
+For each test case: what rank was the FIRST correct result?
+Score = 1/rank. Rank 1 → 1.0, Rank 2 → 0.5, Rank 3 → 0.33, miss → 0.
+Average across all test cases = MRR.
+Answers: "When we find it, do we find it FIRST?"
+High hit@3 but low MRR means we find it but bury it under noise.
+
+Precision @ k
+──────────────────────────────────────
+Of the top-k results, what fraction matched the expected files?
+Answers: "Are our top results relevant, or full of noise?"
+
+═══════════════════════════════════════════════════════════════
+USAGE
+═══════════════════════════════════════════════════════════════
+
+# Run eval on micrograd (must be indexed first):
+python -m eval.eval --repo karpathy/micrograd
+
+# Compare all three modes:
+python -m eval.eval --repo karpathy/micrograd --modes hybrid semantic keyword
+
+# Use custom test cases:
+python -m eval.eval --repo owner/repo --cases eval/test_cases/my_cases.json
+
+# More results per query:
+python -m eval.eval --repo karpathy/micrograd --top-k 5
+
+═══════════════════════════════════════════════════════════════
+INTERPRETING RESULTS
+═══════════════════════════════════════════════════════════════
+
+hit@3 > 0.8 = good retrieval
+MRR > 0.6 = good ranking (top results are relevant)
+MRR < 0.4 = results are found but buried — re-rank or tune top_k
+
+If hybrid beats both semantic and keyword on MRR, it confirms that
+RRF fusion is working correctly and worth the extra complexity.
+"""
+
+import argparse
+import json
+import sys
+import time
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Optional
+
+# Allow running from repo root
+sys.path.insert(0, str(Path(__file__).parent.parent))
+
+from retrieval.retrieval import RetrievalService
+
+
+# ── Data structures ────────────────────────────────────────────────────────────
+
+@dataclass
+class EvalCase:
+    """
+    One evaluation test case.
+
+    A case is "hit" if any result's filepath contains one of the expected_files
+    OR any result's name matches one of expected_names.
+    File matching is substring-based — "engine.py" matches "micrograd/engine.py".
+    """
+    question: str
+    expected_files: list[str] = field(default_factory=list)
+    expected_names: list[str] = field(default_factory=list)
+
+    def is_hit(self, result: dict) -> bool:
+        """Return True if this result satisfies the expected conditions."""
+        filepath = result.get("filepath", "").lower()
+        name = result.get("name", "").lower()
+
+        for ef in self.expected_files:
+            if ef.lower() in filepath:
+                return True
+        for en in self.expected_names:
+            if en.lower() == name:
+                return True
+        return False
+
+
+@dataclass
+class CaseResult:
+    """Metrics for one test case."""
+    question: str
+    hit: bool                 # any expected result in top-k
+    rank: int                 # rank of FIRST correct result (0 = not found)
+    reciprocal_rank: float    # 1/rank or 0.0
+    precision_at_k: float     # fraction of top-k that were relevant
+    top_results: list[dict] = field(default_factory=list)
+
+
+# ── Core eval logic ────────────────────────────────────────────────────────────
+
+def run_eval(
+    retrieval: RetrievalService,
+    cases: list[EvalCase],
+    repo: Optional[str],
+    mode: str,
+    top_k: int,
+) -> list[CaseResult]:
+    """
+    Run all test cases against the retrieval service.
+
+    Args:
+        retrieval: Initialized RetrievalService
+        cases: List of EvalCase to evaluate
+        repo: Repo filter ('owner/name') or None for all repos
+        mode: 'hybrid', 'semantic', or 'keyword'
+        top_k: Number of results to retrieve per case
+
+    Returns:
+        List of CaseResult with per-case metrics
+    """
+    results = []
+
+    for case in cases:
+        hits = retrieval.search(
+            query=case.question,
+            top_k=top_k,
+            repo_filter=repo,
+            mode=mode,
+        )
+
+        first_hit_rank = 0
+        hit_count = 0
+        for rank, r in enumerate(hits, start=1):
+            if case.is_hit(r):
+                hit_count += 1
+                if first_hit_rank == 0:
+                    first_hit_rank = rank
+
+        results.append(CaseResult(
+            question        = case.question,
+            hit             = first_hit_rank > 0,
+            rank            = first_hit_rank,
+            reciprocal_rank = 1.0 / first_hit_rank if first_hit_rank > 0 else 0.0,
+            precision_at_k  = hit_count / top_k,
+            top_results     = hits,
+        ))
+
+    return results
+
+
+def compute_summary(results: list[CaseResult], top_k: int) -> dict:
+    """Aggregate per-case metrics into dataset-level scores."""
+    n = len(results)
+    return {
+        f"hit@{top_k}": round(sum(r.hit for r in results) / n, 3),
+        "mrr": round(sum(r.reciprocal_rank for r in results) / n, 3),
+        f"p@{top_k}": round(sum(r.precision_at_k for r in results) / n, 3),
+        "n_cases": n,
+    }
+
+
+# ── Output formatting ──────────────────────────────────────────────────────────
+
+def print_report(
+    mode: str,
+    summary: dict,
+    results: list[CaseResult],
+    top_k: int,
+    verbose: bool = False,
+):
+    """Print a human-readable eval report."""
+    k = top_k
+    hit_key = f"hit@{k}"
+    p_key = f"p@{k}"
+
+    print(f"\n{'─'*60}")
+    print(f" Mode: {mode.upper():<10} | {results[0].top_results[0]['repo'] if results and results[0].top_results else 'all repos'}")
+    print(f"{'─'*60}")
+    print(f" Hit@{k} : {summary[hit_key]:.3f} ({sum(r.hit for r in results)}/{summary['n_cases']} cases hit)")
+    print(f" MRR : {summary['mrr']:.3f}")
+    print(f" P@{k} : {summary[p_key]:.3f}")
+    print(f"{'─'*60}")
+
+    if verbose:
+        for r in results:
+            status = "✓" if r.hit else "✗"
+            rank_str = f"rank={r.rank}" if r.rank > 0 else "miss"
+            print(f"\n {status} [{rank_str}] {r.question[:60]}")
+            if not r.hit and r.top_results:
+                # Show what we got instead
+                for i, res in enumerate(r.top_results[:3], 1):
+                    print(f"    {i}. {res.get('filepath','')} — {res.get('name','')}")
+
+
+# ── CLI entry point ────────────────────────────────────────────────────────────
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Evaluate retrieval quality for an indexed GitHub repo."
+    )
+    parser.add_argument(
+        "--repo", required=True,
+        help="Repo slug to evaluate (e.g. karpathy/micrograd). Must be indexed."
+    )
+    parser.add_argument(
+        "--cases", default=None,
+        help="Path to JSON test cases file. Defaults to eval/test_cases/<repo-name>.json"
+    )
+    parser.add_argument(
+        "--modes", nargs="+", default=["hybrid", "semantic", "keyword"],
+        choices=["hybrid", "semantic", "keyword"],
+        help="Retrieval modes to compare (default: all three)"
+    )
+    parser.add_argument(
+        "--top-k", type=int, default=3,
+        help="Number of results to retrieve per query (default: 3)"
+    )
+    parser.add_argument(
+        "--verbose", action="store_true",
+        help="Show per-case results including misses"
+    )
+    args = parser.parse_args()
+
+    # ── Load test cases ────────────────────────────────────────────────────────
+    if args.cases:
+        cases_path = Path(args.cases)
+    else:
+        repo_name = args.repo.split("/")[-1]
+        cases_path = Path(__file__).parent / "test_cases" / f"{repo_name}.json"
+
+    if not cases_path.exists():
+        print(f"Error: test cases file not found: {cases_path}")
+        print(f"Create it with format: [{{'question': '...', 'expected_files': ['...']}}]")
+        sys.exit(1)
+
+    raw_cases = json.loads(cases_path.read_text())
+    cases = [EvalCase(**c) for c in raw_cases]
+    print(f"\nLoaded {len(cases)} test cases from {cases_path}")
+    print(f"Repo filter: {args.repo} | top_k={args.top_k}")
+
+    # ── Initialize retrieval ───────────────────────────────────────────────────
+    print("\nInitializing retrieval service (loading embedding model)...")
+    t0 = time.time()
+    retrieval = RetrievalService()
+    print(f" Ready in {time.time()-t0:.1f}s")
+
+    # ── Run eval for each mode ─────────────────────────────────────────────────
+    all_summaries = {}
+    for mode in args.modes:
+        results = run_eval(
+            retrieval=retrieval,
+            cases=cases,
+            repo=args.repo,
+            mode=mode,
+            top_k=args.top_k,
+        )
+        summary = compute_summary(results, args.top_k)
+        all_summaries[mode] = summary
+        print_report(mode, summary, results, args.top_k, args.verbose)
+
+    # ── Comparison table ───────────────────────────────────────────────────────
+    if len(args.modes) > 1:
+        k = args.top_k
+        print(f"\n{'═'*60}")
+        print(f" Comparison Summary (top_k={k}, n={len(cases)} cases)")
+        print(f"{'═'*60}")
+        print(f" {'Mode':<10} | {'Hit@'+str(k):<8} | {'MRR':<8} | {'P@'+str(k):<8}")
+        print(f" {'-'*10}-+-{'-'*8}-+-{'-'*8}-+-{'-'*8}")
+        for mode, s in all_summaries.items():
+            hit = s[f'hit@{k}']
+            mrr = s['mrr']
+            p = s[f'p@{k}']
+            best_mrr = max(v['mrr'] for v in all_summaries.values())
+            marker = " ◀ best MRR" if mrr == best_mrr else ""
+            print(f" {mode:<10} | {hit:<8.3f} | {mrr:<8.3f} | {p:<8.3f}{marker}")
+        print(f"{'═'*60}\n")
+
+
+if __name__ == "__main__":
+    main()
eval/test_cases/micrograd.json ADDED
@@ -0,0 +1,42 @@
+[
+  {
+    "question": "How does backward propagation work?",
+    "expected_files": ["micrograd/engine.py"],
+    "expected_names": ["backward", "_backward"]
+  },
+  {
+    "question": "What does the Value class do?",
+    "expected_files": ["micrograd/engine.py"],
+    "expected_names": ["Value"]
+  },
+  {
+    "question": "How is the neural network MLP implemented?",
+    "expected_files": ["micrograd/nn.py"],
+    "expected_names": ["MLP", "Layer"]
+  },
+  {
+    "question": "How does the tanh activation function work?",
+    "expected_files": ["micrograd/engine.py"],
+    "expected_names": ["tanh"]
+  },
+  {
+    "question": "How is the training loop and loss function set up?",
+    "expected_files": ["demo.ipynb", "test.py"],
+    "expected_names": []
+  },
+  {
+    "question": "How does gradient accumulation work in the backward pass?",
+    "expected_files": ["micrograd/engine.py"],
+    "expected_names": ["backward", "_backward"]
+  },
+  {
+    "question": "What is the Neuron class and how does it compute output?",
+    "expected_files": ["micrograd/nn.py"],
+    "expected_names": ["Neuron"]
+  },
+  {
+    "question": "How is topological sort used in autograd?",
+    "expected_files": ["micrograd/engine.py"],
+    "expected_names": ["backward"]
+  }
+]
ingestion/qdrant_store.py CHANGED
@@ -249,6 +249,46 @@ class QdrantStore:
                 break
         return results
 
+    def find_callers(self, function_name: str, repo: Optional[str] = None) -> list[dict]:
+        """
+        Find all chunks that call a specific function by searching the 'calls' payload.
+
+        During AST chunking, _CallExtractor records every function/method call made
+        within each chunk and stores the list in the 'calls' payload field.
+        This lets us do an exact structural lookup instead of fuzzy text search —
+        "find all functions that call backward()" is a filter, not a search.
+
+        Args:
+            function_name: Exact function name to look for in callers
+            repo: Optional 'owner/name' to restrict scope
+
+        Returns:
+            List of payload dicts for chunks that contain a call to function_name
+        """
+        conditions = [
+            FieldCondition(key="calls", match=MatchValue(value=function_name))
+        ]
+        if repo:
+            conditions.append(FieldCondition(key="repo", match=MatchValue(value=repo)))
+
+        filt = Filter(must=conditions)
+        results = []
+        offset = None
+        while True:
+            points, offset = self.client.scroll(
+                collection_name=self.collection,
+                scroll_filter=filt,
+                limit=100,
+                offset=offset,
+                with_payload=True,
+                with_vectors=False,
+            )
+            for p in points:
+                results.append(p.payload)
+            if offset is None:
+                break
+        return results
+
     def delete_repo(self, repo: str) -> int:
         """Delete all chunks for a repo. Returns number of points deleted."""
         before = self.count(repo=repo)
@@ -286,12 +326,18 @@ def _text_to_sparse(text: str) -> SparseVector:
     Example:
         text = "def embed_text(self, text):"
         tokens = {"def": 1, "embed_text": 1, "self": 1, "text": 2}
-        → indices = [hash("def"), hash("embed_text"), ...]
+        → indices = [md5("def") % 1M, md5("embed_text") % 1M, ...]
         values = [1.0, 1.0, 1.0, 2.0, ...]
 
     Qdrant uses these sparse vectors for BM25-style keyword matching.
     The actual BM25 ranking (IDF weighting, document length normalisation)
     is applied at query time by Qdrant.
+
+    WHY NOT hash(token)?
+    Python's built-in hash() is randomised per process (PYTHONHASHSEED).
+    The same token gets a different integer in each run, so query vectors
+    and stored vectors would map to completely different dimensions —
+    keyword search would return random noise. hashlib.md5 is stable.
     """
     from collections import Counter
     import re
@@ -300,11 +346,12 @@ def _text_to_sparse(text: str) -> SparseVector:
     tokens = re.findall(r"[a-zA-Z_]\w*", text.lower())
     token_counts = Counter(tokens)
 
-    # Map tokens to integer indices using hash (consistent across calls)
+    # Map tokens to stable integer indices using MD5 (process-invariant)
+    # Using the first 8 hex chars = 32-bit integer, then mod 1M dimensions.
    indices = []
    values = []
    for token, count in token_counts.items():
-        idx = abs(hash(token)) % (2 ** 20)  # 1M possible dimensions
+        idx = int(hashlib.md5(token.encode()).hexdigest()[:8], 16) % (2 ** 20)
        indices.append(idx)
        values.append(float(count))
 
retrieval/retrieval.py CHANGED
@@ -56,13 +56,18 @@ class RetrievalService:
     Uses the same Embedder as ingestion so queries live in the same vector space
     as the indexed chunks. Mixing embedding models breaks retrieval entirely —
     vectors from different models are incomparable.
+
+    Why accept embedder as an argument?
+      IngestionService and RetrievalService both need the same 600MB model.
+      Instantiating it twice wastes ~600MB RAM. main.py creates one Embedder
+      and passes it to both services. Shared state, one load.
     """
 
     DENSE_VECTOR_NAME = "code"
     SPARSE_VECTOR_NAME = "bm25"
 
-    def __init__(self):
-        self.embedder = Embedder()
+    def __init__(self, embedder: Embedder | None = None):
+        self.embedder = embedder or Embedder()
         self.client = QdrantClient(
             url=settings.qdrant_url,
             api_key=settings.qdrant_api_key or None,