umanggarg Claude Sonnet 4.6 committed on Mar 21

Commit

b5dbf45

0 Parent(s):

Project setup: GitHub RAG Copilot

- PLAN.md: full architecture, 4 phases, tech stack decisions
- LEARN.md: learning guide — starts with code vs doc RAG differences + Claude Code features (CLAUDE.md, slash commands, hooks, subagents)
- CLAUDE.md: project instructions for Claude Code sessions
- notes/000-project-setup.md: first notes entry
- .claude/commands/: /ingest-repo, /search-code, /add-to-notes slash commands
- requirements.txt: Qdrant, tree-sitter, gitpython, nomic-embed-code deps
- .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (9)

.claude/commands/add-to-notes.md +35 -0
.claude/commands/ingest-repo.md +25 -0
.claude/commands/search-code.md +24 -0
.gitignore +12 -0
CLAUDE.md +91 -0
LEARN.md +256 -0
PLAN.md +114 -0
notes/000-project-setup.md +44 -0
requirements.txt +26 -0

.claude/commands/add-to-notes.md ADDED

	@@ -0,0 +1,35 @@

+# Add a Note Entry
+Create a new entry in the `notes/` directory documenting what was just built.
+## Usage
+```
+/add-to-notes
+```
+## Steps
+1. Look at the most recent note file in `notes/` to determine the next number (NNN)
+2. Look at recent git changes (`git diff HEAD~1` or `git status`) to understand what was built
+3. Create `notes/NNN-<short-title>.md` with this structure:
+```markdown
+# Note NNN — <Title>
+**Date:** <today>
+**PR:** <PR title or "in progress">
+---
+## What was built
+<what was added/changed>
+## Key decisions
+<why these choices were made>
+## Concepts learned
+<RAG/code concepts this feature demonstrates>
+## What's next
+<next steps>
+```
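
Step 1 above (finding the next NNN) can be sketched in a few lines of Python. This is a minimal illustration, assuming note files follow the `NNN-short-title.md` pattern; the helper names `next_note_number` and `next_note_path` are hypothetical and not part of the commit.

```python
import re
from pathlib import Path

# Matches filenames such as "000-project-setup.md" and captures the number.
NOTE_PATTERN = re.compile(r"^(\d{3})-.+\.md$")


def next_note_number(filenames: list[str]) -> str:
    """Return the next zero-padded note number given the existing filenames."""
    numbers = [
        int(match.group(1))
        for name in filenames
        if (match := NOTE_PATTERN.match(name))
    ]
    # Start at 000 when no notes exist yet; otherwise continue after the highest.
    return f"{max(numbers) + 1:03d}" if numbers else "000"


def next_note_path(notes_dir: Path, short_title: str) -> Path:
    """Build the path of the next note file inside notes_dir."""
    existing = [path.name for path in notes_dir.glob("*.md")]
    return notes_dir / f"{next_note_number(existing)}-{short_title}.md"
```

Files that do not match the pattern (for example a `README.md` inside `notes/`) are ignored rather than treated as errors.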

.claude/commands/ingest-repo.md ADDED

	@@ -0,0 +1,25 @@

+# Ingest a GitHub Repository
+Ingest the repository at the given URL into the vector index.
+## Usage
+```
+/ingest-repo <github-url>
+```
+## What this does
+1. Clones the repo with git, or fetches its files via the GitHub API
+2. Filters files (skips binaries, lock files, node_modules, etc.)
+3. Chunks code by AST boundaries (functions/classes)
+4. Embeds chunks with nomic-embed-code
+5. Upserts into Qdrant Cloud collection
+## Steps
+Run the ingestion pipeline for the provided GitHub URL: $ARGUMENTS
+- Call `ingestion/repo_fetcher.py` to fetch the repo
+- Call `ingestion/file_filter.py` to get the list of files to index
+- Call `ingestion/code_chunker.py` to chunk each file
+- Call `backend/services/ingestion_service.py` to embed and store
+- Print a summary: files indexed, chunks stored, languages detected

.claude/commands/search-code.md ADDED

	@@ -0,0 +1,24 @@

+# Search Code Without Generating an Answer
+Run a retrieval-only search against the indexed repositories.
+## Usage
+```
+/search-code <query>
+```
+## What this does
+Calls the `/search` endpoint (no LLM) and displays the raw retrieved chunks
+with their file paths, line numbers, and relevance scores. Useful for:
+- Verifying the index contains what you expect
+- Debugging retrieval quality before blaming the LLM
+- Exploring the codebase without a full RAG query
+## Steps
+Search for: $ARGUMENTS
+Call `GET /search?query=<query>&top_k=10` and display:
+- Filepath + line range for each result
+- Relevance score
+- The actual code chunk
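
As a rough sketch of the call described above, the request URL and the per-result display could look like this. The `/search` path and the `query` and `top_k` parameters come from the command file; the base URL and the response field names (`filepath`, `start_line`, `end_line`, `score`, `text`) are assumptions, since the endpoint does not exist yet in this commit.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, top_k: int = 10) -> str:
    """Build the retrieval-only search URL, URL-encoding the query string."""
    params = urlencode({"query": query, "top_k": top_k})
    return f"{base_url.rstrip('/')}/search?{params}"


def format_result(hit: dict) -> str:
    """Render one retrieved chunk: location, relevance score, then the code.

    The keys read from `hit` are hypothetical response fields.
    """
    header = (
        f"{hit['filepath']}:{hit['start_line']}-{hit['end_line']}"
        f"  (score {hit['score']:.3f})"
    )
    return f"{header}\n{hit['text']}"
```

Encoding the query with `urlencode` matters because code queries often contain spaces, parentheses, or `&`.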

.gitignore ADDED

	@@ -0,0 +1,12 @@

+.venv/
+__pycache__/
+*.pyc
+.env
+.env.*
+chroma_db/
+# cloned repos (temp)
+repos/
+node_modules/
+dist/
+build/
+*.egg-info/
+.DS_Store

CLAUDE.md ADDED

	@@ -0,0 +1,91 @@

+# GitHub RAG Copilot — Claude Code Instructions
+This file is read by Claude Code at the start of every session.
+It tells Claude how to work in this project.
+---
+## Project Purpose
+A RAG system that indexes GitHub repositories and answers questions about code.
+This is a **learning project** — prioritise clarity and explanation over brevity.
+---
+## Architecture at a Glance
+```
+ingestion/          ← repo fetching, file filtering, AST chunking, embedding
+retrieval/          ← Qdrant hybrid search, BM25 sparse vectors
+backend/            ← FastAPI: /ingest, /query, /search endpoints
+  services/         ← ingestion_service.py, retrieval_service.py, generation.py
+  routers/          ← ingest.py, query.py
+  models/           ← schemas.py (Pydantic models)
+ui/                 ← React + Vite frontend
+notes/              ← Updated after every PR (NNN-title.md)
+PLAN.md             ← Build plan and phase tracking
+LEARN.md            ← Learning guide, updated as features are built
+```
+---
+## Coding Rules
+- Write comments explaining **why**, not what — this is a learning project
+- Each new concept gets a docstring explaining it from first principles
+- Prefer explicit over implicit — avoid magic
+- No LangChain, no LlamaIndex — build from scratch so concepts are visible
+- Match the style of `rag-research-copilot/` (the sibling project)
+---
+## Notes Convention
+After every significant feature (PR-worthy), add an entry to `notes/`:
+- Filename: `NNN-short-title.md` (zero-padded, e.g. `001-ingestion.md`)
+- Contents: what was built, key decisions, concepts learned, what's next
+---
+## Running the Project
+```bash
+# Backend
+cd github-rag-copilot
+python -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+uvicorn backend.main:app --reload
+# Frontend
+cd ui && npm install && npm run dev
+```
+---
+## Environment Variables
+```
+QDRANT_URL=         # Qdrant Cloud cluster URL
+QDRANT_API_KEY=     # Qdrant Cloud API key
+GROQ_API_KEY=       # LLM (free)
+ANTHROPIC_API_KEY=  # LLM fallback (optional)
+GITHUB_TOKEN=       # Optional — increases API rate limit from 60 to 5000 req/hr
+```
+---
+## Slash Commands Available
+- `/ingest-repo` — ingest a GitHub repository by URL
+- `/search-code` — search the index without generating an answer
+- `/add-to-notes` — add a note entry for the current work
+---
+## Key Design Decisions (don't change without good reason)
+- **Qdrant Cloud** for vector storage (not ChromaDB) — enables free deployment
+- **AST chunking** at function/class boundaries — not character windows
+- **nomic-embed-code** embedding model — code-optimised, not general text
+- **Qdrant native hybrid search** — replaces manual BM25 index + RRF fusion
+- **No auth required** for public repo ingestion — GitHub API unauthenticated allows 60 req/hr, with token 5000 req/hr

LEARN.md ADDED

	@@ -0,0 +1,256 @@

+# GitHub RAG Copilot — Learning Guide
+This document grows with the project. Each section is added when the corresponding
+feature is built. Read it alongside the code and the `notes/` entries.
+---
+# Table of Contents
+1. [Why Code RAG is Different from Document RAG](#1-why-code-rag-is-different)
+2. [AST-Based Chunking](#2-ast-based-chunking) ← _coming in Phase 1_
+3. [Code Embeddings](#3-code-embeddings) ← _coming in Phase 1_
+4. [Qdrant — A Hosted Vector Database](#4-qdrant) ← _coming in Phase 1_
+5. [Native Hybrid Search in Qdrant](#5-native-hybrid-search) ← _coming in Phase 2_
+6. [Generation for Code Queries](#6-generation-for-code-queries) ← _coming in Phase 2_
+7. [Claude Code Features](#7-claude-code-features) ← _built throughout_
+   - 7a. CLAUDE.md
+   - 7b. Slash Commands
+   - 7c. Hooks
+   - 7d. Subagents
+---
+# 1. Why Code RAG is Different
+## The same idea, different inputs
+In the PDF RAG Copilot, the pipeline was:
+```
+PDF → pages → text chunks → embed → ChromaDB → query → answer
+```
+In this project it's:
+```
+GitHub repo → files → code chunks → embed → Qdrant → query → answer
+```
+The retrieval, LLM generation, and API layers are nearly identical.
+The differences are all in **ingestion** — how you get the text, and how you
+chunk it.
+## Problem 1: What files do you index?
+A GitHub repo contains many things that shouldn't be indexed:
+- **Binary files** — images, compiled artifacts, `.pyc` files
+- **Auto-generated files** — `package-lock.json`, `*.lock`, migration files
+- **Dependency directories** — `node_modules/`, `.venv/`, `vendor/`
+- **Build output** — `dist/`, `build/`, `__pycache__/`
+If you index these, queries like "how does authentication work?" return
+lock file entries and compiled output instead of actual code.
+Solution: a `file_filter.py` with explicit include/exclude rules per language.
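
A minimal sketch of such a filter is shown below. The excluded directories and lock-file patterns are taken from the list above; the binary suffixes, the allow-list of indexable suffixes, and the function name `should_index` are illustrative assumptions, not the project's actual `file_filter.py`.

```python
from pathlib import PurePosixPath

# Directories whose contents are never indexed (dependencies, build output).
EXCLUDED_DIRS = {"node_modules", ".venv", "vendor", "dist", "build", "__pycache__", ".git"}
# Auto-generated files matched by exact name.
EXCLUDED_NAMES = {"package-lock.json"}
# Lock files, compiled artifacts, and a few common binary formats (assumed list).
EXCLUDED_SUFFIXES = {".lock", ".pyc", ".png", ".jpg", ".gif", ".ico", ".pdf", ".zip"}
# Allow-list of suffixes worth indexing (assumed; would grow per language).
INCLUDED_SUFFIXES = {".py", ".md"}


def should_index(filepath: str) -> bool:
    """Decide whether a repo-relative path should be chunked and embedded."""
    path = PurePosixPath(filepath)
    # Reject anything that lives under an excluded directory.
    if any(part in EXCLUDED_DIRS for part in path.parts[:-1]):
        return False
    if path.name in EXCLUDED_NAMES or path.suffix in EXCLUDED_SUFFIXES:
        return False
    return path.suffix in INCLUDED_SUFFIXES
```

Combining an exclude list with an allow-list keeps the rules explicit: a new file type is skipped until someone deliberately adds it.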
+## Problem 2: How do you chunk code?
+In the PDF project, we used **fixed character windows** with overlap:
+```
+[---- chunk 1 (800 chars) ----]
+              [---- chunk 2 (800 chars) ----]
+```
+This works for prose because a sentence is a self-contained unit — splitting
+a 200-page paper anywhere still yields readable text.
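
For reference, a fixed character window with overlap can be sketched as follows. The 800-character size matches the diagram; the 200-character overlap and the function name are assumptions for illustration.

```python
def chunk_by_characters(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once a window has reached the end of the text, so the tail
        # is not emitted again as a redundant shorter chunk.
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks
```

The window boundaries depend only on character counts, which is exactly why this approach cuts through functions when applied to code.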
+**Code is different.** A function is the natural unit:
+```python
+def embed_text(self, text: str) -> list[float]:
+    """Embed a single text string into a 384-dim vector."""
+    return self.model.encode([text])[0].tolist()
+```
+Splitting this mid-way loses the function signature (what it takes/returns)
+or the body (what it actually does). A chunk without the signature can't
+answer "what does embed_text take as input?". A chunk without the body can't
+answer "how does embed_text work?".
+**Solution: AST-based chunking** — parse the code into its syntax tree, then
+use function and class boundaries as natural split points.
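
The idea can be sketched with Python's built-in `ast` module. This is a simplified illustration that only handles top-level functions and classes in Python source; the function name and the dictionary keys are assumptions, and the planned `code_chunker.py` may be structured differently.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class in a Python file."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue  # imports, constants, etc. are skipped in this sketch
        # node.lineno points at the def/class line, so include decorators too.
        start = min([node.lineno] + [dec.lineno for dec in node.decorator_list])
        end = node.end_lineno  # available since Python 3.8
        chunks.append({
            "name": node.name,
            "kind": "class" if isinstance(node, ast.ClassDef) else "function",
            "start_line": start,
            "end_line": end,
            "text": "\n".join(lines[start - 1:end]),
        })
    return chunks
```

Each chunk keeps its signature and body together, and the recorded line range is what later makes line-level citations possible.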
+## Problem 3: What metadata matters?
+PDF RAG metadata: `source` (paper name), `page` (page number)
+Code RAG metadata: `repo`, `filepath`, `language`, `function_name`,
+                   `class_name`, `start_line`, `end_line`
+This makes citations meaningful:
+```
+PDF:  (Source: attention_2017, Page 4)
+Code: (repo: pytorch/pytorch, file: torch/nn/functional.py, lines 1823–1851)
+```
+And it enables powerful filters:
+- "Only search in Python files"
+- "Only search in the `auth/` directory"
+- "Only search in test files"
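
One way to picture this metadata is as a small record attached to every chunk. The field names below are the ones listed above; the class name, the `citation` method, and the in-memory `filter_chunks` helper are hypothetical. In a deployed system the filtering would more likely be pushed down to the vector database than done in Python.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    repo: str
    filepath: str
    language: str
    start_line: int
    end_line: int
    function_name: Optional[str] = None
    class_name: Optional[str] = None

    def citation(self) -> str:
        """Format a human-readable citation for this chunk."""
        return (
            f"(repo: {self.repo}, file: {self.filepath}, "
            f"lines {self.start_line}–{self.end_line})"
        )


def filter_chunks(
    chunks: list[ChunkMetadata],
    language: Optional[str] = None,
    path_prefix: Optional[str] = None,
) -> list[ChunkMetadata]:
    """Keep only chunks matching the given language and/or directory prefix."""
    return [
        chunk
        for chunk in chunks
        if (language is None or chunk.language == language)
        and (path_prefix is None or chunk.filepath.startswith(path_prefix))
    ]
```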
+## What stays the same
+Everything downstream of ingestion:
+- Embedding queries and comparing to stored vectors
+- Relevance threshold to reject out-of-domain queries
+- Hybrid search combining semantic + keyword
+- LLM generation from retrieved context
+- Citations, confidence scores, streaming
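
To make the first two items concrete, here is a small pure-Python sketch of comparing vectors and applying a relevance threshold. The threshold value of 0.3 and both function names are assumptions chosen for illustration; a real cutoff would be tuned against the embedding model in use.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_relevance(
    scored_chunks: list[tuple[str, float]], threshold: float = 0.3
) -> list[tuple[str, float]]:
    """Drop retrieved chunks whose score falls below the threshold.

    An empty result signals that the query is probably out of domain,
    so the caller can decline to answer instead of prompting the LLM.
    """
    return [(chunk, score) for chunk, score in scored_chunks if score >= threshold]
```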
+---
+# 7. Claude Code Features
+This section is different from the rest — instead of explaining RAG concepts,
+it explains how to use Claude Code more effectively while building this project.
+## 7a. CLAUDE.md
+`CLAUDE.md` is a file Claude Code reads at the start of every session.
+Think of it as a briefing document for Claude — it tells Claude:
+- What the project does
+- How the codebase is structured
+- Coding conventions to follow
+- What commands to run
+- Key design decisions that shouldn't be changed without thought
+Without `CLAUDE.md`, Claude starts every session with no project context and
+has to rediscover everything from reading files. With it, Claude immediately
+knows the architecture, conventions, and constraints.
+**Best practices for CLAUDE.md:**
+- Keep it concise: it is loaded into Claude's context in every session, so every line costs tokens
+- Include the directory structure (a quick mental map)
+- List environment variables needed
+- Document non-obvious decisions and _why_ they were made
+- Keep commands up-to-date (wrong commands waste time)
+**What NOT to put in CLAUDE.md:**
+- Detailed implementation explanations (that's what code comments are for)
+- Anything that changes frequently (stale info is worse than no info)
+- Things derivable from the code itself
+Our `CLAUDE.md` lives at the root of the project. Open it to see an example.
+## 7b. Slash Commands
+Slash commands are custom prompts stored in `.claude/commands/`.
+They're invoked with `/command-name [args]`.
+```
+.claude/
+  commands/
+    ingest-repo.md    → /ingest-repo <github-url>
+    search-code.md    → /search-code <query>
+    add-to-notes.md   → /add-to-notes
+```
+Each command file is a markdown prompt with `$ARGUMENTS` as the placeholder
+for whatever you pass after the command name.
+**Why slash commands?**
+Instead of typing a long, precise instruction every time ("run the ingestion
+pipeline for this repo, print a summary of files indexed, languages detected,
+and chunks stored"), you define it once and invoke it with `/ingest-repo <url>`.
+They're especially useful for:
+- Repetitive operations (ingest a repo, update notes, run tests)
+- Multi-step workflows you want to be consistent
+- Sharing workflows with others on the project
+## 7c. Hooks
+Hooks are shell commands that run automatically in response to Claude Code events.
+Configure them in `.claude/settings.json`.
+Commonly used hook events include:
+- `PreToolUse` — runs before Claude calls a tool (e.g., before editing a file)
+- `PostToolUse` — runs after Claude calls a tool (e.g., after writing a file)
+- `Stop` — runs when Claude finishes a response
+**Example: auto-lint after every file edit**
+```json
+{
+  "hooks": {
+    "PostToolUse": [
+      {
+        "matcher": "Edit|Write",
+        "hooks": [{
+          "type": "command",
+          "command": "jq -r '.tool_input.file_path' | xargs ruff check --fix"
+        }]
+      }
+    ]
+  }
+}
+```
+**Example: reminder to update notes after a commit**
+```json
+{
+  "hooks": {
+    "PostToolUse": [
+      {
+        "matcher": "Bash",
+        "hooks": [{
+          "type": "command",
+          "command": "if jq -r '.tool_input.command' | grep -q 'git commit'; then echo 'Remember to run /add-to-notes'; fi"
+        }]
+      }
+    ]
+  }
+}
+```
+**Why hooks?**
+They automate quality gates without relying on Claude to remember to run them.
+A lint hook means every file Claude edits is checked — you never accidentally
+commit code with style errors.
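
When the logic outgrows a shell one-liner, a hook command can be a small script. The sketch below assumes the hook receives the tool call as JSON on standard input with `tool_name` and `tool_input` fields; that payload shape, and the names `reminder_for` and `main`, are assumptions for illustration rather than something this commit defines.

```python
import json
import sys
from typing import Optional


def reminder_for(payload: dict) -> Optional[str]:
    """Return a reminder message if the tool call was a `git commit`."""
    if payload.get("tool_name") != "Bash":
        return None
    command = payload.get("tool_input", {}).get("command", "")
    if "git commit" in command:
        return "Remember to run /add-to-notes"
    return None


def main() -> int:
    """Read the hook payload from stdin and print a reminder when relevant."""
    try:
        payload = json.load(sys.stdin)
    except json.JSONDecodeError:
        return 0  # malformed or empty input: do nothing rather than fail
    message = reminder_for(payload)
    if message:
        print(message)
    return 0
```

To use it as a hook, the script would end by calling `main()` and be registered as the hook's `command`. Keeping the decision in `reminder_for` makes it easy to test without running Claude Code.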
+## 7d. Subagents
+Subagents are Claude instances spawned from the main Claude session to handle
+independent tasks in parallel. In Claude Code, you use the `Agent` tool.
+**When to use subagents:**
+- Tasks that are independent and can run simultaneously
+- Tasks that would pollute the main context with large outputs
+- Specialised tasks (research, exploration, review)
+**Example: parallel repo ingestion**
+Instead of ingesting repos sequentially:
+```
+Ingest repo A (2 min) → Ingest repo B (2 min) → Ingest repo C (2 min) = 6 min
+```
+Spawn three subagents in parallel:
+```
+Ingest repo A ─┐
+Ingest repo B ─┼→ done in ~2 min
+Ingest repo C ─┘
+```
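
The timing argument is the same as for any parallel work. As an analogy only (subagents are separate Claude sessions, not threads), the sketch below uses a thread pool and a hypothetical `ingest_repo` stand-in that sleeps instead of doing real ingestion.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def ingest_repo(url: str, seconds: float = 0.1) -> str:
    """Hypothetical stand-in for an ingestion job; sleeps instead of working."""
    time.sleep(seconds)
    return f"ingested {url}"


def ingest_sequentially(urls: list[str], seconds: float = 0.1) -> list[str]:
    """Total time is roughly len(urls) * seconds."""
    return [ingest_repo(url, seconds) for url in urls]


def ingest_in_parallel(urls: list[str], seconds: float = 0.1) -> list[str]:
    """Total time is roughly `seconds`, because the jobs are independent."""
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(lambda url: ingest_repo(url, seconds), urls))
```

The speed-up only holds when the tasks do not depend on each other, which is the first criterion in the list above.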
+**Example: exploration agent**
+Before implementing a feature, spawn an Explore agent to read all relevant
+files and return a summary — without filling the main context with file contents.
+**Subagent types in Claude Code:**
+- `general-purpose` — full tool access, good for implementation tasks
+- `Explore` — read-only, fast codebase exploration
+- `Plan` — architecture and design planning
+We'll use subagents when:
+1. Ingesting multiple repos simultaneously
+2. Running an expert review before a PR
+3. Exploring unfamiliar repos before answering questions about them
+---
+_Sections 2–6 will be added as each phase is built._

PLAN.md ADDED

	@@ -0,0 +1,114 @@

+# GitHub RAG Copilot — Build Plan
+A RAG system that indexes GitHub repositories and answers questions about their
+code, architecture, and documentation. Extends the PDF RAG Copilot concepts to
+source code.
+---
+## Learning Objectives
+By the end of this project you will understand:
+- How RAG applies to code (not just documents)
+- AST-based code chunking vs. character-window chunking
+- Code-aware embeddings vs. general text embeddings
+- Metadata-rich retrieval (file, function, class, language)
+- Hosted vector DB (Qdrant Cloud) vs. local (ChromaDB)
+- Claude Code features: CLAUDE.md, hooks, slash commands, subagents
+---
+## Architecture Overview
+```
+GitHub URL
+    │
+    ▼
+[Ingestion Pipeline]
+    ├── Clone / fetch repo via GitHub API
+    ├── Filter files (language-aware)
+    ├── Chunk by AST boundaries (functions, classes)
+    │       └── fallback: character windows (markdown, config)
+    ├── Embed with code-optimized model
+    └── Store in Qdrant Cloud (vector + metadata)
+            └── metadata: repo, filepath, language,
+                         function_name, class_name, start_line, end_line
+    │
+    ▼
+[Query Pipeline]  ← identical to PDF RAG
+    ├── Embed query
+    ├── Hybrid search (semantic + BM25 via Qdrant)
+    ├── Relevance threshold
+    └── LLM generation (Groq / Claude)
+            └── citations: filepath + line range
+```
+---
+## Phases
+### Phase 1 — Core Ingestion (Week 1)
+- [ ] `repo_fetcher.py` — clone or fetch via GitHub API (no auth needed for public repos)
+- [ ] `file_filter.py` — include/exclude rules per language, skip binaries/lock files
+- [ ] `code_chunker.py` — AST-based chunking for Python; character-window fallback
+- [ ] `embedder.py` — reuse from PDF RAG, swap model to `nomic-ai/nomic-embed-code`
+- [ ] `qdrant_store.py` — replace ChromaDB with Qdrant client
+### Phase 2 — Retrieval & Generation (Week 1–2)
+- [ ] `retrieval.py` — hybrid search using Qdrant's native BM25 + vector
+- [ ] `generation.py` — reuse from PDF RAG, update system prompt for code answers
+- [ ] FastAPI backend with `/ingest`, `/query`, `/search` endpoints
+### Phase 3 — UI (Week 2)
+- [ ] Reuse PDF RAG UI structure
+- [ ] Input: GitHub URL instead of PDF upload
+- [ ] Citations show filepath + line numbers instead of page numbers
+- [ ] Syntax highlighting for code chunks in source passages
+### Phase 4 — Claude Code Features (Throughout)
+- [ ] `CLAUDE.md` — project instructions for Claude Code
+- [ ] Hooks — auto-lint on edit, auto-update notes on commit
+- [ ] Slash commands — `/ingest-repo`, `/search-code`, `/add-to-notes`
+- [ ] Subagent patterns — parallel ingestion of multiple repos
+---
+## Tech Stack
+| Layer | Choice | Why |
+|---|---|---|
+| Repo fetch | `gitpython` + GitHub API | No auth for public repos |
+| Code parsing | `ast` (Python), `tree-sitter` (multi-lang) | Function/class boundaries |
+| Embeddings | `nomic-ai/nomic-embed-code` | Trained on code, free |
+| Vector DB | Qdrant Cloud (free tier) | Permanent free tier, 1 GB |
+| Keyword search | Qdrant sparse vectors (BM25) | Native, no separate index |
+| LLM | Groq Llama 3.3 70B / Claude | Same as PDF RAG |
+| Backend | FastAPI | Same as PDF RAG |
+| Frontend | React + Vite | Same as PDF RAG |
+---
+## Key Differences vs PDF RAG
+| Concern | PDF RAG | GitHub RAG |
+|---|---|---|
+| Ingestion source | Local file upload | GitHub URL |
+| Chunking strategy | Fixed character windows | AST-aware (function/class) |
+| Metadata | source, page | repo, filepath, language, function, line |
+| Vector DB | ChromaDB (local) | Qdrant Cloud (hosted) |
+| Embedding model | all-MiniLM-L6-v2 | nomic-embed-code |
+| Citations | Paper name + page | Filepath + line range |
+| Hybrid search | Manual RRF | Qdrant native hybrid |
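
For context on the last row, "Manual RRF" refers to reciprocal rank fusion, where each result's fused score is the sum of 1 / (k + rank) over the rankings it appears in. The sketch below is a generic illustration, not the PDF project's code; the function name is hypothetical and `k = 60` is the conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Fusing a semantic ranking `["a", "b", "c"]` with a keyword ranking `["b", "d", "a"]` puts `b` first, because it ranks highly in both lists. Native hybrid search moves this fusion step into the database.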
+---
+## Notes Directory
+`notes/` is updated after every PR with:
+- What was built
+- Key decisions made
+- Concepts learned
+- What's next
+See `notes/000-project-setup.md` for the first entry.

notes/000-project-setup.md ADDED

	@@ -0,0 +1,44 @@

+# Note 000 — Project Setup
+**Date:** 2026-03-22
+**PR:** Initial setup (no PR — baseline)
+---
+## What was set up
+- Project structure created: `backend/`, `ingestion/`, `retrieval/`, `notes/`, `.claude/`
+- `PLAN.md` written with full architecture, phases, and tech stack decisions
+- `CLAUDE.md` written with project instructions for Claude Code
+- `LEARN.md` started — will grow as each phase is built
+- Git repo initialized
+---
+## Key architectural decisions
+**Why Qdrant over ChromaDB?**
+In the PDF project, ChromaDB ran locally: data lives on the app's disk and disappears if you redeploy.
+Qdrant Cloud has a permanent free tier (1GB), making the app deployable without
+paying for storage. It also has native hybrid search (sparse + dense vectors),
+eliminating the need for our manual BM25 index.
+**Why nomic-embed-code over all-MiniLM-L6-v2?**
+`all-MiniLM-L6-v2` was trained on natural language. Code has different patterns:
+identifier names, function signatures, call chains. `nomic-embed-code` was
+fine-tuned on code and produces better semantic similarity for code queries.
+**Why AST chunking over character windows?**
+Character windows split wherever they hit the size limit — often mid-function.
+A function is the natural unit of code: it has a name, a purpose, inputs/outputs.
+Chunking at function boundaries keeps each chunk semantically complete and makes
+citations meaningful ("see `embed_text()` in `retrieval/embedder.py`").
+---
+## What's next
+Phase 1: Core ingestion pipeline
+- `repo_fetcher.py` — clone public repos
+- `file_filter.py` — skip binaries, lock files, node_modules
+- `code_chunker.py` — AST-based chunking for Python

requirements.txt ADDED

	@@ -0,0 +1,26 @@

+# Web framework
+fastapi
+uvicorn[standard]
+python-multipart
+# Vector database
+qdrant-client
+# Embeddings
+sentence-transformers  # will use nomic-embed-code
+# Repo fetching
+gitpython
+requests
+# Code parsing
+tree-sitter          # multi-language AST parsing
+tree-sitter-python   # Python grammar for tree-sitter
+# LLM providers
+groq
+anthropic
+# Utilities
+python-dotenv
+pydantic