Commit ·
b5dbf45
0
Parent(s):
Project setup: GitHub RAG Copilot
- PLAN.md: full architecture, 4 phases, tech stack decisions
- LEARN.md: learning guide — starts with code vs doc RAG differences + Claude Code features (CLAUDE.md, slash commands, hooks, subagents)
- CLAUDE.md: project instructions for Claude Code sessions
- notes/000-project-setup.md: first notes entry
- .claude/commands/: /ingest-repo, /search-code, /add-to-notes slash commands
- requirements.txt: Qdrant, tree-sitter, gitpython, nomic-embed-code deps
- .gitignore
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- .claude/commands/add-to-notes.md +35 -0
- .claude/commands/ingest-repo.md +25 -0
- .claude/commands/search-code.md +24 -0
- .gitignore +12 -0
- CLAUDE.md +91 -0
- LEARN.md +256 -0
- PLAN.md +114 -0
- notes/000-project-setup.md +44 -0
- requirements.txt +26 -0
.claude/commands/add-to-notes.md
ADDED
@@ -0,0 +1,35 @@
# Add a Note Entry

Create a new entry in the `notes/` directory documenting what was just built.

## Usage
```
/add-to-notes
```

## Steps

1. Look at the most recent note file in `notes/` to determine the next number (NNN)
2. Look at recent git changes (`git diff HEAD~1` or `git status`) to understand what was built
3. Create `notes/NNN-<short-title>.md` with this structure:

```markdown
# Note NNN — <Title>

**Date:** <today>
**PR:** <PR title or "in progress">

---

## What was built
<what was added/changed>

## Key decisions
<why these choices were made>

## Concepts learned
<RAG/code concepts this feature demonstrates>

## What's next
<next steps>
```
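Step 1 of this command (finding the next NNN) is simple enough to sketch in Python. This is only an illustration of the numbering convention, not code from the commit; the helper names `next_note_number` and `note_filename` are hypothetical, and the slug rule (lowercase, hyphen-separated) is an assumption based on `000-project-setup.md`.

```python
import re
from typing import Iterable

# Matches names like "000-project-setup.md" and captures the three-digit number.
NOTE_NAME = re.compile(r"^(\d{3})-.+\.md$")


def next_note_number(filenames: Iterable[str]) -> str:
    """Return the next zero-padded note number given the existing file names."""
    numbers = [int(m.group(1)) for name in filenames if (m := NOTE_NAME.match(name))]
    # With no notes yet, max(..., default=-1) + 1 yields 0, i.e. "000".
    return f"{max(numbers, default=-1) + 1:03d}"


def note_filename(filenames: Iterable[str], short_title: str) -> str:
    """Build 'NNN-short-title.md' for a new note (hypothetical slug rule)."""
    slug = re.sub(r"[^a-z0-9]+", "-", short_title.lower()).strip("-")
    return f"{next_note_number(filenames)}-{slug}.md"
```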
.claude/commands/ingest-repo.md
ADDED
@@ -0,0 +1,25 @@
# Ingest a GitHub Repository

Ingest the repository at the given URL into the vector index.

## Usage
```
/ingest-repo <github-url>
```

## What this does
1. Clones or fetches the repo via GitHub API
2. Filters files (skips binaries, lock files, node_modules, etc.)
3. Chunks code by AST boundaries (functions/classes)
4. Embeds chunks with nomic-embed-code
5. Upserts into Qdrant Cloud collection

## Steps

Run the ingestion pipeline for the provided GitHub URL: $ARGUMENTS

- Call `ingestion/repo_fetcher.py` to fetch the repo
- Call `ingestion/file_filter.py` to get the list of files to index
- Call `ingestion/code_chunker.py` to chunk each file
- Call `backend/services/ingestion_service.py` to embed and store
- Print a summary: files indexed, chunks stored, languages detected
.claude/commands/search-code.md
ADDED
@@ -0,0 +1,24 @@
# Search Code Without Generating an Answer

Run a retrieval-only search against the indexed repositories.

## Usage
```
/search-code <query>
```

## What this does
Calls the `/search` endpoint (no LLM) and displays the raw retrieved chunks
with their file paths, line numbers, and relevance scores. Useful for:
- Verifying the index contains what you expect
- Debugging retrieval quality before blaming the LLM
- Exploring the codebase without a full RAG query

## Steps

Search for: $ARGUMENTS

Call `GET /search?query=<query>&top_k=10` and display:
- Filepath + line range for each result
- Relevance score
- The actual code chunk
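As a rough sketch of the request and display steps above: the base URL and the response field names (`filepath`, `start_line`, `end_line`, `score`, `text`) are assumptions, since the `/search` endpoint does not exist yet in this commit. Only the `query` and `top_k` parameters come from the command file.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, top_k: int = 10) -> str:
    """Build the GET /search URL with a properly encoded query string."""
    params = urlencode({"query": query, "top_k": top_k})
    return f"{base_url.rstrip('/')}/search?{params}"


def format_result(result: dict) -> str:
    """Render one retrieved chunk: location, score, then the code itself.

    The keys read here are hypothetical; adjust them to the real response schema.
    """
    location = f"{result['filepath']}:{result['start_line']}-{result['end_line']}"
    return f"{location}  (score {result['score']:.3f})\n{result['text']}"
```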
.gitignore
ADDED
@@ -0,0 +1,12 @@
.venv/
__pycache__/
*.pyc
.env
.env.*
chroma_db/
# cloned repos (temp)
repos/
node_modules/
dist/
build/
*.egg-info/
.DS_Store
CLAUDE.md
ADDED
@@ -0,0 +1,91 @@
# GitHub RAG Copilot — Claude Code Instructions

This file is read by Claude Code at the start of every session.
It tells Claude how to work in this project.

---

## Project Purpose

A RAG system that indexes GitHub repositories and answers questions about code.
This is a **learning project** — prioritise clarity and explanation over brevity.

---

## Architecture at a Glance

```
ingestion/      ← repo fetching, file filtering, AST chunking, embedding
retrieval/      ← Qdrant hybrid search, BM25 sparse vectors
backend/        ← FastAPI: /ingest, /query, /search endpoints
  services/     ← ingestion_service.py, retrieval_service.py, generation.py
  routers/      ← ingest.py, query.py
  models/       ← schemas.py (Pydantic models)
ui/             ← React + Vite frontend
notes/          ← Updated after every PR (NNN-title.md)
PLAN.md         ← Build plan and phase tracking
LEARN.md        ← Learning guide, updated as features are built
```

---

## Coding Rules

- Write comments explaining **why**, not what — this is a learning project
- Each new concept gets a docstring explaining it from first principles
- Prefer explicit over implicit — avoid magic
- No LangChain, no LlamaIndex — build from scratch so concepts are visible
- Match the style of `rag-research-copilot/` (the sibling project)

---

## Notes Convention

After every significant feature (PR-worthy), add an entry to `notes/`:
- Filename: `NNN-short-title.md` (zero-padded, e.g. `001-ingestion.md`)
- Contents: what was built, key decisions, concepts learned, what's next

---

## Running the Project

```bash
# Backend
cd github-rag-copilot
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn backend.main:app --reload

# Frontend
cd ui && npm install && npm run dev
```

---

## Environment Variables

```
QDRANT_URL=          # Qdrant Cloud cluster URL
QDRANT_API_KEY=      # Qdrant Cloud API key
GROQ_API_KEY=        # LLM (free)
ANTHROPIC_API_KEY=   # LLM fallback (optional)
GITHUB_TOKEN=        # Optional — increases API rate limit from 60 to 5000 req/hr
```

---

## Slash Commands Available

- `/ingest-repo` — ingest a GitHub repository by URL
- `/search-code` — search the index without generating an answer
- `/add-to-notes` — add a note entry for the current work

---

## Key Design Decisions (don't change without good reason)

- **Qdrant Cloud** for vector storage (not ChromaDB) — enables free deployment
- **AST chunking** at function/class boundaries — not character windows
- **nomic-embed-code** embedding model — code-optimised, not general text
- **Qdrant native hybrid search** — replaces manual BM25 index + RRF fusion
- **No auth required** for public repo ingestion — GitHub API unauthenticated allows 60 req/hr, with token 5000 req/hr
LEARN.md
ADDED
@@ -0,0 +1,256 @@
# GitHub RAG Copilot — Learning Guide

This document grows with the project. Each section is added when the corresponding
feature is built. Read it alongside the code and the `notes/` entries.

---

# Table of Contents

1. [Why Code RAG is Different from Document RAG](#1-why-code-rag-is-different)
2. [AST-Based Chunking](#2-ast-based-chunking) ← _coming in Phase 1_
3. [Code Embeddings](#3-code-embeddings) ← _coming in Phase 1_
4. [Qdrant — A Hosted Vector Database](#4-qdrant) ← _coming in Phase 1_
5. [Native Hybrid Search in Qdrant](#5-native-hybrid-search) ← _coming in Phase 2_
6. [Generation for Code Queries](#6-generation-for-code-queries) ← _coming in Phase 2_
7. [Claude Code Features](#7-claude-code-features) ← _built throughout_
   - 7a. CLAUDE.md
   - 7b. Slash Commands
   - 7c. Hooks
   - 7d. Subagents

---

# 1. Why Code RAG is Different

## The same idea, different inputs

In the PDF RAG Copilot, the pipeline was:
```
PDF → pages → text chunks → embed → ChromaDB → query → answer
```

In this project it's:
```
GitHub repo → files → code chunks → embed → Qdrant → query → answer
```

The retrieval, LLM generation, and API layers are nearly identical.
The differences are all in **ingestion** — how you get the text, and how you
chunk it.

## Problem 1: What files do you index?

A GitHub repo contains many things that shouldn't be indexed:
- **Binary files** — images, compiled artifacts, `.pyc` files
- **Auto-generated files** — `package-lock.json`, `*.lock`, migration files
- **Dependency directories** — `node_modules/`, `.venv/`, `vendor/`
- **Build output** — `dist/`, `build/`, `__pycache__/`

If you index these, queries like "how does authentication work?" return
lock file entries and compiled output instead of actual code.

Solution: a `file_filter.py` with explicit include/exclude rules per language.
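A minimal sketch of what such a filter could look like, assuming paths are repo-relative with `/` separators. The rule sets below are illustrative examples taken from the list above, not the real contents of `file_filter.py` (which is not part of this commit), and `should_index` is a hypothetical name.

```python
from pathlib import PurePosixPath

# Illustrative rule sets; a real filter would likely be configurable per language.
EXCLUDED_DIRS = {"node_modules", ".venv", "vendor", "dist", "build", "__pycache__", ".git"}
EXCLUDED_NAMES = {"package-lock.json"}
EXCLUDED_SUFFIXES = {".lock", ".pyc", ".png", ".jpg", ".gif", ".pdf", ".zip"}
INCLUDED_SUFFIXES = {".py", ".md"}


def should_index(path: str) -> bool:
    """Return True if a repo-relative path passes the include/exclude rules."""
    p = PurePosixPath(path)
    # Reject anything that lives under a dependency or build directory.
    if any(part in EXCLUDED_DIRS for part in p.parts[:-1]):
        return False
    # Reject known auto-generated files and binary extensions.
    if p.name in EXCLUDED_NAMES or p.suffix in EXCLUDED_SUFFIXES:
        return False
    # Everything else must be explicitly allowed.
    return p.suffix in INCLUDED_SUFFIXES
```

The explicit allow-list at the end means unknown file types are skipped by default, which errs on the side of a smaller, cleaner index.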

## Problem 2: How do you chunk code?

In the PDF project, we used **fixed character windows** with overlap:
```
[---- chunk 1 (800 chars) ----]
                    [---- chunk 2 (800 chars) ----]
```

This works for prose because a sentence is a self-contained unit — splitting
a 200-page paper anywhere still yields readable text.
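For comparison with the AST approach, fixed-window chunking can be sketched as follows. The 800-character size comes from the diagram above; the 100-character overlap is an assumed value, and `chunk_by_characters` is a hypothetical name rather than the PDF project's actual function.

```python
def chunk_by_characters(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows where consecutive windows share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once this window has reached the end of the text,
        # so no redundant tail chunk is produced.
        if start + size >= len(text):
            break
        start += step
    return chunks
```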

**Code is different.** A function is the natural unit:
```python
def embed_text(self, text: str) -> list[float]:
    """Embed a single text string into a 384-dim vector."""
    return self.model.encode([text])[0].tolist()
```

Splitting this mid-way loses the function signature (what it takes/returns)
or the body (what it actually does). A chunk without the signature can't
answer "what does embed_text take as input?". A chunk without the body can't
answer "how does embed_text work?".

**Solution: AST-based chunking** — parse the code into its syntax tree, then
use function and class boundaries as natural split points.
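The idea can be sketched with Python's standard `ast` module. This is a simplified illustration, not the planned `code_chunker.py`: it only looks at top-level functions and classes, drops module-level statements, and the `chunk_python_source` name and the dictionary keys are assumptions.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class, with its line range."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above node.lineno, so start at the earliest one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno  # available since Python 3.8
            chunks.append({
                "name": node.name,
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks
```

Each chunk keeps its signature and body together, and the recorded line range is what later makes filepath-plus-lines citations possible.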

## Problem 3: What metadata matters?

- PDF RAG metadata: `source` (paper name), `page` (page number)
- Code RAG metadata: `repo`, `filepath`, `language`, `function_name`,
  `class_name`, `start_line`, `end_line`

This makes citations meaningful:
```
PDF:  (Source: attention_2017, Page 4)
Code: (repo: pytorch/pytorch, file: torch/nn/functional.py, lines 1823–1851)
```

And it enables powerful filters:
- "Only search in Python files"
- "Only search in the `auth/` directory"
- "Only search in test files"
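One way to picture this metadata is as a small record attached to every chunk. The field names below are the ones listed above; the `ChunkMetadata` class, the `matches` helper, and the `test_` filename convention are assumptions for illustration. In the real system the filtering would presumably be done by the vector store rather than in Python.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    repo: str
    filepath: str
    language: str
    start_line: int
    end_line: int
    function_name: Optional[str] = None
    class_name: Optional[str] = None

    def citation(self) -> str:
        """Format the chunk's location the way the code citation example does."""
        return f"(repo: {self.repo}, file: {self.filepath}, lines {self.start_line}-{self.end_line})"


def matches(meta: ChunkMetadata, language: Optional[str] = None,
            path_prefix: Optional[str] = None, tests_only: bool = False) -> bool:
    """Apply the three example filters; unset filters match everything."""
    if language is not None and meta.language != language:
        return False
    if path_prefix is not None and not meta.filepath.startswith(path_prefix):
        return False
    # Assumed convention: test files are named test_*.py.
    if tests_only and not PurePosixPath(meta.filepath).name.startswith("test_"):
        return False
    return True
```

Storing both `start_line` and `end_line` is what allows a citation to point at an exact range instead of a whole file.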

## What stays the same

Everything downstream of ingestion:
- Embedding queries and comparing to stored vectors
- Relevance threshold to reject out-of-domain queries
- Hybrid search combining semantic + keyword
- LLM generation from retrieved context
- Citations, confidence scores, streaming
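The first two items in that list (comparing a query vector to stored vectors, then rejecting weak matches) can be sketched in plain Python. The 0.3 threshold is an arbitrary placeholder, not a value from either project, and the vector store would normally do this comparison for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_relevant(query_vec: list[float], chunks: list[tuple[str, list[float]]],
                    threshold: float = 0.3) -> list[tuple[float, str]]:
    """Score every (text, vector) chunk and keep those at or above the threshold,
    best match first. An empty result means the query is out of domain."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)
```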

---

# 7. Claude Code Features

This section is different from the rest — instead of explaining RAG concepts,
it explains how to use Claude Code more effectively while building this project.

## 7a. CLAUDE.md

`CLAUDE.md` is a file Claude Code reads at the start of every session.
Think of it as a briefing document for Claude — it tells Claude:
- What the project does
- How the codebase is structured
- Coding conventions to follow
- What commands to run
- Key design decisions that shouldn't be changed without thought

Without `CLAUDE.md`, Claude starts every session with no project context and
has to rediscover everything from reading files. With it, Claude immediately
knows the architecture, conventions, and constraints.

**Best practices for CLAUDE.md:**
- Keep it concise: it is loaded into context at the start of every session, so every line costs tokens
- Include the directory structure (a quick mental map)
- List environment variables needed
- Document non-obvious decisions and _why_ they were made
- Keep commands up-to-date (wrong commands waste time)

**What NOT to put in CLAUDE.md:**
- Detailed implementation explanations (that's what code comments are for)
- Anything that changes frequently (stale info is worse than no info)
- Things derivable from the code itself

Our `CLAUDE.md` lives at the root of the project. Open it to see an example.

## 7b. Slash Commands

Slash commands are custom prompts stored in `.claude/commands/`.
They're invoked with `/command-name [args]`.

```
.claude/
  commands/
    ingest-repo.md   → /ingest-repo <github-url>
    search-code.md   → /search-code <query>
    add-to-notes.md  → /add-to-notes
```

Each command file is a markdown prompt with `$ARGUMENTS` as the placeholder
for whatever you pass after the command name.
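As a mental model only (this is not Claude Code's actual implementation), the placeholder behaves like a plain text substitution applied to the command file before the prompt is sent. The `render_command` helper is hypothetical.

```python
def render_command(template: str, arguments: str) -> str:
    """Fill a slash-command template by substituting the $ARGUMENTS placeholder."""
    return template.replace("$ARGUMENTS", arguments)


# The "Steps" line from search-code.md, invoked as: /search-code embed_text
prompt = render_command("Search for: $ARGUMENTS", "embed_text")
```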

**Why slash commands?**
Instead of typing a long, precise instruction every time ("run the ingestion
pipeline for this repo, print a summary of files indexed, languages detected,
and chunks stored"), you define it once and invoke it with `/ingest-repo <url>`.

They're especially useful for:
- Repetitive operations (ingest a repo, update notes, run tests)
- Multi-step workflows you want to be consistent
- Sharing workflows with others on the project

## 7c. Hooks

Hooks are shell commands that run automatically in response to Claude Code events.
Configure them in `.claude/settings.json`.

Hook events include:
- `PreToolUse` — runs before Claude calls a tool (e.g., before editing a file)
- `PostToolUse` — runs after Claude calls a tool (e.g., after writing a file)
- `Stop` — runs when Claude finishes a response

Each hook command receives a description of the tool call as JSON on stdin, so
the examples below use `jq` to pull out the field they need.

**Example: auto-lint after every file edit**
```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [{
          "type": "command",
          "command": "jq -r '.tool_input.file_path' | xargs ruff check --fix"
        }]
      }
    ]
  }
}
```

**Example: reminder to update notes after a commit**
```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "if jq -r '.tool_input.command' | grep -q 'git commit'; then echo 'Remember to run /add-to-notes'; fi"
        }]
      }
    ]
  }
}
```

**Why hooks?**
They automate quality gates without relying on Claude to remember to run them.
A lint hook means every file Claude edits is checked — you never accidentally
commit code with style errors.
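When a one-line shell command becomes awkward, the hook logic can live in a small script instead. The sketch below assumes the hook is handed the tool call as JSON on stdin with a `tool_input.file_path` field for edit tools; the function names are hypothetical, and the script only builds the lint command rather than running it.

```python
import json
from typing import Optional


def edited_python_file(raw_stdin: str) -> Optional[str]:
    """Extract the edited file path from the hook payload, if it is a Python file."""
    payload = json.loads(raw_stdin)
    # Tools without a file_path (e.g. shell commands) simply yield None.
    path = payload.get("tool_input", {}).get("file_path", "")
    return path if path.endswith(".py") else None


def lint_command(path: str) -> list[str]:
    """Argument list for linting one file with ruff, applying automatic fixes."""
    return ["ruff", "check", "--fix", path]
```

A real hook script would read `sys.stdin`, call `edited_python_file`, and pass the result of `lint_command` to `subprocess.run`. Filtering on `.py` avoids running ruff on Markdown or JSON edits.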

## 7d. Subagents

Subagents are Claude instances spawned from the main Claude session to handle
independent tasks in parallel. In Claude Code, you use the `Agent` tool.

**When to use subagents:**
- Tasks that are independent and can run simultaneously
- Tasks that would pollute the main context with large outputs
- Specialised tasks (research, exploration, review)

**Example: parallel repo ingestion**
Instead of ingesting repos sequentially:
```
Ingest repo A (2 min) → Ingest repo B (2 min) → Ingest repo C (2 min) = 6 min
```

Spawn three subagents in parallel:
```
Ingest repo A ─┐
Ingest repo B ─┼→ done in ~2 min
Ingest repo C ─┘
```
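The timing argument is the same one that applies to any set of independent tasks. As a plain-Python analogy (this is not how Claude Code spawns subagents, and `ingest` is a stand-in that just sleeps), running the three jobs in a thread pool takes roughly as long as the slowest one rather than the sum of all three:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def ingest(repo: str, seconds: float = 0.05) -> str:
    """Stand-in for a real ingestion job; the sleep simulates the work."""
    time.sleep(seconds)
    return f"{repo}: done"


def ingest_all(repos: list[str]) -> list[str]:
    """Run all ingestion jobs concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(repos))) as pool:
        return list(pool.map(ingest, repos))
```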

**Example: exploration agent**
Before implementing a feature, spawn an Explore agent to read all relevant
files and return a summary — without filling the main context with file contents.

**Subagent types in Claude Code:**
- `general-purpose` — full tool access, good for implementation tasks
- `Explore` — read-only, fast codebase exploration
- `Plan` — architecture and design planning

We'll use subagents when:
1. Ingesting multiple repos simultaneously
2. Running an expert review before a PR
3. Exploring unfamiliar repos before answering questions about them

---

_Sections 2–6 will be added as each phase is built._
PLAN.md
ADDED
@@ -0,0 +1,114 @@
# GitHub RAG Copilot — Build Plan

A RAG system that indexes GitHub repositories and answers questions about their
code, architecture, and documentation. Extends the PDF RAG Copilot concepts to
source code.

---

## Learning Objectives

By the end of this project you will understand:
- How RAG applies to code (not just documents)
- AST-based code chunking vs. character-window chunking
- Code-aware embeddings vs. general text embeddings
- Metadata-rich retrieval (file, function, class, language)
- Hosted vector DB (Qdrant Cloud) vs. local (ChromaDB)
- Claude Code features: CLAUDE.md, hooks, slash commands, subagents

---

## Architecture Overview

```
GitHub URL
    │
    ▼
[Ingestion Pipeline]
    ├── Clone / fetch repo via GitHub API
    ├── Filter files (language-aware)
    ├── Chunk by AST boundaries (functions, classes)
    │     └── fallback: character windows (markdown, config)
    ├── Embed with code-optimized model
    └── Store in Qdrant Cloud (vector + metadata)
          └── metadata: repo, filepath, language,
                        function_name, class_name, start_line

    │
    ▼
[Query Pipeline]  ← identical to PDF RAG
    ├── Embed query
    ├── Hybrid search (semantic + BM25 via Qdrant)
    ├── Relevance threshold
    └── LLM generation (Groq / Claude)
          └── citations: filepath + line range
```

---

## Phases

### Phase 1 — Core Ingestion (Week 1)
- [ ] `repo_fetcher.py` — clone or fetch via GitHub API (no auth needed for public repos)
- [ ] `file_filter.py` — include/exclude rules per language, skip binaries/lock files
- [ ] `code_chunker.py` — AST-based chunking for Python; character-window fallback
- [ ] `embedder.py` — reuse from PDF RAG, swap model to `nomic-ai/nomic-embed-code`
- [ ] `qdrant_store.py` — replace ChromaDB with Qdrant client

### Phase 2 — Retrieval & Generation (Week 1–2)
- [ ] `retrieval.py` — hybrid search using Qdrant's native BM25 + vector
- [ ] `generation.py` — reuse from PDF RAG, update system prompt for code answers
- [ ] FastAPI backend with `/ingest`, `/query`, `/search` endpoints

### Phase 3 — UI (Week 2)
- [ ] Reuse PDF RAG UI structure
- [ ] Input: GitHub URL instead of PDF upload
- [ ] Citations show filepath + line numbers instead of page numbers
- [ ] Syntax highlighting for code chunks in source passages

### Phase 4 — Claude Code Features (Throughout)
- [ ] `CLAUDE.md` — project instructions for Claude Code
- [ ] Hooks — auto-lint on edit, auto-update notes on commit
- [ ] Slash commands — `/ingest-repo`, `/search-code`, `/add-to-notes`
- [ ] Subagent patterns — parallel ingestion of multiple repos

---

## Tech Stack

| Layer | Choice | Why |
|---|---|---|
| Repo fetch | `gitpython` + GitHub API | No auth for public repos |
| Code parsing | `ast` (Python), `tree-sitter` (multi-lang) | Function/class boundaries |
| Embeddings | `nomic-ai/nomic-embed-code` | Trained on code, free |
| Vector DB | Qdrant Cloud (free tier) | Permanent free, 1GB |
| Keyword search | Qdrant sparse vectors (BM25) | Native, no separate index |
| LLM | Groq Llama 3.3 70B / Claude | Same as PDF RAG |
| Backend | FastAPI | Same as PDF RAG |
| Frontend | React + Vite | Same as PDF RAG |

---

## Key Differences vs PDF RAG

| Concern | PDF RAG | GitHub RAG |
|---|---|---|
| Ingestion source | Local file upload | GitHub URL |
| Chunking strategy | Fixed character windows | AST-aware (function/class) |
| Metadata | source, page | repo, filepath, language, function, line |
| Vector DB | ChromaDB (local) | Qdrant Cloud (hosted) |
| Embedding model | all-MiniLM-L6-v2 | nomic-embed-code |
| Citations | Paper name + page | Filepath + line range |
| Hybrid search | Manual RRF | Qdrant native hybrid |

---

## Notes Directory

`notes/` is updated after every PR with:
- What was built
- Key decisions made
- Concepts learned
- What's next

See `notes/000-project-setup.md` for the first entry.
notes/000-project-setup.md
ADDED
@@ -0,0 +1,44 @@
# Note 000 — Project Setup

**Date:** 2026-03-22
**PR:** Initial setup (no PR — baseline)

---

## What was set up

- Project structure created: `backend/`, `ingestion/`, `retrieval/`, `notes/`, `.claude/`
- `PLAN.md` written with full architecture, phases, and tech stack decisions
- `CLAUDE.md` written with project instructions for Claude Code
- `LEARN.md` started — will grow as each phase is built
- Git repo initialized

---

## Key architectural decisions

**Why Qdrant over ChromaDB?**
ChromaDB, as used in the PDF project, is local: data lives on disk and disappears if you redeploy.
Qdrant Cloud has a permanent free tier (1GB), making the app deployable without
paying for storage. It also has native hybrid search (sparse + dense vectors),
eliminating the need for our manual BM25 index.
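For context, the manual fusion step that native hybrid search replaces is reciprocal rank fusion (the "Manual RRF" named in `PLAN.md`). A minimal sketch follows, assuming each retriever returns an ordered list of chunk IDs. The constant `k = 60` is a commonly used default, not a value taken from the PDF project, and the function name is hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each item scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


semantic = ["a", "b", "c"]   # order from dense vector search
keyword = ["b", "d", "a"]    # order from BM25
fused = reciprocal_rank_fusion([semantic, keyword])
```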

**Why nomic-embed-code over all-MiniLM-L6-v2?**
`all-MiniLM-L6-v2` was trained on natural language. Code has different patterns:
identifier names, function signatures, call chains. `nomic-embed-code` was
fine-tuned on code and produces better semantic similarity for code queries.

**Why AST chunking over character windows?**
Character windows split wherever they hit the size limit — often mid-function.
A function is the natural unit of code: it has a name, a purpose, inputs/outputs.
Chunking at function boundaries keeps each chunk semantically complete and makes
citations meaningful ("see `embed_text()` in `retrieval/embedder.py`").

---

## What's next

Phase 1: Core ingestion pipeline
- `repo_fetcher.py` — clone public repos
- `file_filter.py` — skip binaries, lock files, node_modules
- `code_chunker.py` — AST-based chunking for Python
requirements.txt
ADDED
@@ -0,0 +1,26 @@
# Web framework
fastapi
uvicorn[standard]
python-multipart

# Vector database
qdrant-client

# Embeddings
sentence-transformers  # will use nomic-embed-code

# Repo fetching
gitpython
requests

# Code parsing
tree-sitter            # multi-language AST parsing
tree-sitter-python     # Python grammar for tree-sitter

# LLM providers
groq
anthropic

# Utilities
python-dotenv
pydantic