umanggarg Claude Sonnet 4.6 committed on
Commit b5dbf45 · 0 Parent(s)

Project setup: GitHub RAG Copilot

- PLAN.md: full architecture, 4 phases, tech stack decisions
- LEARN.md: learning guide — starts with code vs doc RAG differences + Claude Code features (CLAUDE.md, slash commands, hooks, subagents)
- CLAUDE.md: project instructions for Claude Code sessions
- notes/000-project-setup.md: first notes entry
- .claude/commands/: /ingest-repo, /search-code, /add-to-notes slash commands
- requirements.txt: Qdrant, tree-sitter, gitpython, nomic-embed-code deps
- .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

.claude/commands/add-to-notes.md ADDED
@@ -0,0 +1,35 @@
+ # Add a Note Entry
+
+ Create a new entry in the `notes/` directory documenting what was just built.
+
+ ## Usage
+ ```
+ /add-to-notes
+ ```
+
+ ## Steps
+
+ 1. Look at the most recent note file in `notes/` to determine the next number (NNN)
+ 2. Look at recent git changes (`git diff HEAD~1` or `git status`) to understand what was built
+ 3. Create `notes/NNN-<short-title>.md` with this structure:
+
+ ```markdown
+ # Note NNN — <Title>
+
+ **Date:** <today>
+ **PR:** <PR title or "in progress">
+
+ ---
+
+ ## What was built
+ <what was added/changed>
+
+ ## Key decisions
+ <why these choices were made>
+
+ ## Concepts learned
+ <RAG/code concepts this feature demonstrates>
+
+ ## What's next
+ <next steps>
+ ```
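Step 1 of the command above (finding the next `NNN`) can be made concrete with a small helper. This is only a sketch under assumptions: the function names `next_note_number` and `next_note_number_in` are hypothetical and not part of the commit, and it assumes every note follows the `NNN-short-title.md` pattern with a three-digit prefix.

```python
import re
from pathlib import Path

# Matches filenames such as "000-project-setup.md"; other files are ignored.
NOTE_PATTERN = re.compile(r"^(\d{3})-.+\.md$")


def next_note_number(filenames: list[str]) -> str:
    """Return the next zero-padded note number for the given filenames."""
    numbers = [
        int(match.group(1))
        for name in filenames
        if (match := NOTE_PATTERN.match(name))
    ]
    # Start at 000 when no note exists yet, otherwise continue the sequence.
    return f"{max(numbers) + 1:03d}" if numbers else "000"


def next_note_number_in(notes_dir: str = "notes") -> str:
    """Convenience wrapper that scans a directory on disk."""
    return next_note_number([p.name for p in Path(notes_dir).glob("*.md")])
```

With only `000-project-setup.md` present, the helper would return `001`.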
.claude/commands/ingest-repo.md ADDED
@@ -0,0 +1,25 @@
+ # Ingest a GitHub Repository
+
+ Ingest the repository at the given URL into the vector index.
+
+ ## Usage
+ ```
+ /ingest-repo <github-url>
+ ```
+
+ ## What this does
+ 1. Clones or fetches the repo via GitHub API
+ 2. Filters files (skips binaries, lock files, node_modules, etc.)
+ 3. Chunks code by AST boundaries (functions/classes)
+ 4. Embeds chunks with nomic-embed-code
+ 5. Upserts into Qdrant Cloud collection
+
+ ## Steps
+
+ Run the ingestion pipeline for the provided GitHub URL: $ARGUMENTS
+
+ - Call `ingestion/repo_fetcher.py` to fetch the repo
+ - Call `ingestion/file_filter.py` to get the list of files to index
+ - Call `ingestion/code_chunker.py` to chunk each file
+ - Call `backend/services/ingestion_service.py` to embed and store
+ - Print a summary: files indexed, chunks stored, languages detected
.claude/commands/search-code.md ADDED
@@ -0,0 +1,24 @@
+ # Search Code Without Generating an Answer
+
+ Run a retrieval-only search against the indexed repositories.
+
+ ## Usage
+ ```
+ /search-code <query>
+ ```
+
+ ## What this does
+ Calls the `/search` endpoint (no LLM) and displays the raw retrieved chunks
+ with their file paths, line numbers, and relevance scores. Useful for:
+ - Verifying the index contains what you expect
+ - Debugging retrieval quality before blaming the LLM
+ - Exploring the codebase without a full RAG query
+
+ ## Steps
+
+ Search for: $ARGUMENTS
+
+ Call `GET /search?query=<query>&top_k=10` and display:
+ - Filepath + line range for each result
+ - Relevance score
+ - The actual code chunk
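The request and display steps above might look roughly like this in Python. It is a sketch, not the project's client: the `query` and `top_k` parameters come from the command file, but the base URL and the response field names (`filepath`, `start_line`, `end_line`, `score`, `text`) are assumptions, since the commit does not define the response schema.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, top_k: int = 10) -> str:
    """Build the retrieval-only search URL, URL-encoding the query text."""
    params = urlencode({"query": query, "top_k": top_k})
    return f"{base_url.rstrip('/')}/search?{params}"


def format_result(result: dict) -> str:
    """Render one retrieved chunk as 'path:start-end (score)' plus its code.

    The dictionary keys are hypothetical; adjust them to the real schema.
    """
    header = (
        f"{result['filepath']}:{result['start_line']}-{result['end_line']}"
        f"  (score {result['score']:.3f})"
    )
    return f"{header}\n{result['text']}"
```

Encoding the query matters because code searches often contain spaces, parentheses, or `&`, which would otherwise corrupt the query string.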
.gitignore ADDED
@@ -0,0 +1,13 @@
+ .venv/
+ __pycache__/
+ *.pyc
+ .env
+ .env.*
+ chroma_db/
+ # cloned repos (temp)
+ repos/
+ node_modules/
+ dist/
+ build/
+ *.egg-info/
+ .DS_Store
CLAUDE.md ADDED
@@ -0,0 +1,91 @@
+ # GitHub RAG Copilot — Claude Code Instructions
+
+ This file is read by Claude Code at the start of every session.
+ It tells Claude how to work in this project.
+
+ ---
+
+ ## Project Purpose
+
+ A RAG system that indexes GitHub repositories and answers questions about code.
+ This is a **learning project** — prioritise clarity and explanation over brevity.
+
+ ---
+
+ ## Architecture at a Glance
+
+ ```
+ ingestion/     ← repo fetching, file filtering, AST chunking, embedding
+ retrieval/     ← Qdrant hybrid search, BM25 sparse vectors
+ backend/       ← FastAPI: /ingest, /query, /search endpoints
+   services/    ← ingestion_service.py, retrieval_service.py, generation.py
+   routers/     ← ingest.py, query.py
+   models/      ← schemas.py (Pydantic models)
+ ui/            ← React + Vite frontend
+ notes/         ← Updated after every PR (NNN-title.md)
+ PLAN.md        ← Build plan and phase tracking
+ LEARN.md       ← Learning guide, updated as features are built
+ ```
+
+ ---
+
+ ## Coding Rules
+
+ - Write comments explaining **why**, not what — this is a learning project
+ - Each new concept gets a docstring explaining it from first principles
+ - Prefer explicit over implicit — avoid magic
+ - No LangChain, no LlamaIndex — build from scratch so concepts are visible
+ - Match the style of `rag-research-copilot/` (the sibling project)
+
+ ---
+
+ ## Notes Convention
+
+ After every significant feature (PR-worthy), add an entry to `notes/`:
+ - Filename: `NNN-short-title.md` (zero-padded, e.g. `001-ingestion.md`)
+ - Contents: what was built, key decisions, concepts learned, what's next
+
+ ---
+
+ ## Running the Project
+
+ ```bash
+ # Backend
+ cd github-rag-copilot
+ python -m venv .venv && source .venv/bin/activate
+ pip install -r requirements.txt
+ uvicorn backend.main:app --reload
+
+ # Frontend
+ cd ui && npm install && npm run dev
+ ```
+
+ ---
+
+ ## Environment Variables
+
+ ```
+ QDRANT_URL=          # Qdrant Cloud cluster URL
+ QDRANT_API_KEY=      # Qdrant Cloud API key
+ GROQ_API_KEY=        # LLM (free)
+ ANTHROPIC_API_KEY=   # LLM fallback (optional)
+ GITHUB_TOKEN=        # Optional — increases API rate limit from 60 to 5000 req/hr
+ ```
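A startup check for the variables listed above could be sketched as follows. The required/optional split is an assumption read off the comments (`GROQ_API_KEY` is treated as required because it is not marked optional), and `missing_required` is a hypothetical helper name, not something in the commit.

```python
import os
from collections.abc import Mapping

# Assumed split, based on which entries are marked "optional" above.
REQUIRED_VARS = ("QDRANT_URL", "QDRANT_API_KEY", "GROQ_API_KEY")
OPTIONAL_VARS = ("ANTHROPIC_API_KEY", "GITHUB_TOKEN")


def missing_required(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Failing fast on a missing key at startup tends to be easier to debug than an authentication error in the middle of an ingestion run.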
+
+ ---
+
+ ## Slash Commands Available
+
+ - `/ingest-repo` — ingest a GitHub repository by URL
+ - `/search-code` — search the index without generating an answer
+ - `/add-to-notes` — add a note entry for the current work
+
+ ---
+
+ ## Key Design Decisions (don't change without good reason)
+
+ - **Qdrant Cloud** for vector storage (not ChromaDB) — enables free deployment
+ - **AST chunking** at function/class boundaries — not character windows
+ - **nomic-embed-code** embedding model — code-optimised, not general text
+ - **Qdrant native hybrid search** — replaces manual BM25 index + RRF fusion
+ - **No auth required** for public repo ingestion — GitHub API unauthenticated allows 60 req/hr, with token 5000 req/hr
LEARN.md ADDED
@@ -0,0 +1,256 @@
+ # GitHub RAG Copilot — Learning Guide
+
+ This document grows with the project. Each section is added when the corresponding
+ feature is built. Read it alongside the code and the `notes/` entries.
+
+ ---
+
+ # Table of Contents
+
+ 1. [Why Code RAG is Different from Document RAG](#1-why-code-rag-is-different)
+ 2. [AST-Based Chunking](#2-ast-based-chunking) ← _coming in Phase 1_
+ 3. [Code Embeddings](#3-code-embeddings) ← _coming in Phase 1_
+ 4. [Qdrant — A Hosted Vector Database](#4-qdrant) ← _coming in Phase 1_
+ 5. [Native Hybrid Search in Qdrant](#5-native-hybrid-search) ← _coming in Phase 2_
+ 6. [Generation for Code Queries](#6-generation-for-code-queries) ← _coming in Phase 2_
+ 7. [Claude Code Features](#7-claude-code-features) ← _built throughout_
+    - 7a. CLAUDE.md
+    - 7b. Slash Commands
+    - 7c. Hooks
+    - 7d. Subagents
+
+ ---
+
+ # 1. Why Code RAG is Different
+
+ ## The same idea, different inputs
+
+ In the PDF RAG Copilot, the pipeline was:
+ ```
+ PDF → pages → text chunks → embed → ChromaDB → query → answer
+ ```
+
+ In this project it's:
+ ```
+ GitHub repo → files → code chunks → embed → Qdrant → query → answer
+ ```
+
+ The retrieval, LLM generation, and API layers are nearly identical.
+ The differences are all in **ingestion** — how you get the text, and how you
+ chunk it.
+
+ ## Problem 1: What files do you index?
+
+ A GitHub repo contains many things that shouldn't be indexed:
+ - **Binary files** — images, compiled artifacts, `.pyc` files
+ - **Auto-generated files** — `package-lock.json`, `*.lock`, migration files
+ - **Dependency directories** — `node_modules/`, `.venv/`, `vendor/`
+ - **Build output** — `dist/`, `build/`, `__pycache__/`
+
+ If you index these, queries like "how does authentication work?" return
+ lock file entries and compiled output instead of actual code.
+
+ Solution: a `file_filter.py` with explicit include/exclude rules per language.
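A minimal sketch of such a filter, assuming repo-relative POSIX paths. The exclude sets mirror the categories listed above; the include list, the extra binary suffixes, and the function name `should_index` are illustrative assumptions rather than the contents of the real `file_filter.py`, which this commit does not yet contain.

```python
from pathlib import PurePosixPath

# Directories whose contents are dependencies or build output.
EXCLUDED_DIRS = {"node_modules", ".venv", "vendor", "dist", "build", "__pycache__"}
# Auto-generated files matched by exact name.
EXCLUDED_NAMES = {"package-lock.json"}
# Binary or generated files matched by suffix (illustrative, not exhaustive).
EXCLUDED_SUFFIXES = {".lock", ".pyc", ".png", ".jpg", ".gif", ".zip"}
# Hypothetical allow-list of file types worth indexing.
INCLUDED_SUFFIXES = {".py", ".js", ".ts", ".md"}


def should_index(path: str) -> bool:
    """Decide whether a repo-relative path belongs in the index."""
    p = PurePosixPath(path)
    # Reject anything that lives under an excluded directory.
    if any(part in EXCLUDED_DIRS for part in p.parts[:-1]):
        return False
    if p.name in EXCLUDED_NAMES or p.suffix in EXCLUDED_SUFFIXES:
        return False
    return p.suffix in INCLUDED_SUFFIXES
```

Using an allow-list as the final check means unknown file types are skipped by default, which seems a safer failure mode for retrieval quality than indexing everything that is not explicitly excluded.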
+
+ ## Problem 2: How do you chunk code?
+
+ In the PDF project, we used **fixed character windows** with overlap:
+ ```
+ [---- chunk 1 (800 chars) ----]
+                      [---- chunk 2 (800 chars) ----]
+ ```
+
+ This works for prose because sentences are largely self-contained: wherever a
+ window boundary falls in a 200-page paper, each chunk still reads as coherent
+ text, and the overlap usually keeps a sentence cut at the boundary whole in
+ the neighbouring chunk.
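For reference, fixed character windows with overlap can be sketched in a few lines. The 800-character window comes from the diagram above; the 100-character overlap and the function name `chunk_by_chars` are assumptions made for illustration.

```python
def chunk_by_chars(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows where neighbours share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, last_start, step)]
```

Note that nothing in this function looks at the content, which is exactly why it can cut a function in half.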
+
+ **Code is different.** A function is the natural unit:
+ ```python
+ def embed_text(self, text: str) -> list[float]:
+     """Embed a single text string into a 384-dim vector."""
+     return self.model.encode([text])[0].tolist()
+ ```
+
+ Splitting this mid-way loses the function signature (what it takes/returns)
+ or the body (what it actually does). A chunk without the signature can't
+ answer "what does embed_text take as input?". A chunk without the body can't
+ answer "how does embed_text work?".
+
+ **Solution: AST-based chunking** — parse the code into its syntax tree, then
+ use function and class boundaries as natural split points.
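A minimal sketch of the idea using Python's standard `ast` module. It only handles top-level functions and classes in Python source; `CodeChunk` and `chunk_python_source` are hypothetical names, and the project's eventual `code_chunker.py` may be structured quite differently (for example, by using tree-sitter for other languages).

```python
import ast
from dataclasses import dataclass


@dataclass
class CodeChunk:
    name: str
    kind: str          # "function" or "class"
    start_line: int    # 1-based, inclusive
    end_line: int      # 1-based, inclusive
    text: str


def chunk_python_source(source: str) -> list[CodeChunk]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            # lineno/end_lineno are 1-based; decorators above `def` are not included.
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append(CodeChunk(node.name, kind, node.lineno, node.end_lineno, text))
    return chunks
```

Because each chunk carries its own line range, the same pass that chunks the file also produces the data needed for line-level citations.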
+
+ ## Problem 3: What metadata matters?
+
+ PDF RAG metadata: `source` (paper name), `page` (page number)
+ Code RAG metadata: `repo`, `filepath`, `language`, `function_name`,
+ `class_name`, `start_line`, `end_line`
+
+ This makes citations meaningful:
+ ```
+ PDF: (Source: attention_2017, Page 4)
+ Code: (repo: pytorch/pytorch, file: torch/nn/functional.py, lines 1823–1851)
+ ```
+
+ And it enables powerful filters:
+ - "Only search in Python files"
+ - "Only search in the `auth/` directory"
+ - "Only search in test files"
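The metadata fields above map naturally onto a small record type. The sketch below uses the field names from the text; the `citation` and `matches` helpers are hypothetical, and in the real system such filters would more likely be expressed as payload filters in the vector store than as Python predicates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    repo: str
    filepath: str
    language: str
    start_line: int
    end_line: int
    function_name: Optional[str] = None
    class_name: Optional[str] = None

    def citation(self) -> str:
        """Format a citation in the style shown above (plain hyphen in the range)."""
        return f"(repo: {self.repo}, file: {self.filepath}, lines {self.start_line}-{self.end_line})"


def matches(meta: ChunkMetadata, language: Optional[str] = None,
            path_prefix: Optional[str] = None) -> bool:
    """Apply the 'only Python files' / 'only this directory' style filters."""
    if language is not None and meta.language != language:
        return False
    if path_prefix is not None and not meta.filepath.startswith(path_prefix):
        return False
    return True
```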
+
+ ## What stays the same
+
+ Everything downstream of ingestion:
+ - Embedding queries and comparing to stored vectors
+ - Relevance threshold to reject out-of-domain queries
+ - Hybrid search combining semantic + keyword
+ - LLM generation from retrieved context
+ - Citations, confidence scores, streaming
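The first two bullets (comparing a query vector to stored vectors, then applying a relevance threshold) can be illustrated with plain cosine similarity. This is a toy sketch: the threshold of 0.5 is an arbitrary placeholder, and a real deployment would let the vector database compute the scores.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float],
             chunks: list[tuple[str, list[float]]],
             threshold: float = 0.5) -> list[tuple[float, str]]:
    """Score every chunk and keep those at or above the threshold, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    kept = [item for item in scored if item[0] >= threshold]
    return sorted(kept, key=lambda item: item[0], reverse=True)
```

An empty result is the signal that the query is out of domain, so the caller can decline to answer instead of passing weak context to the LLM.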
+
+ ---
+
+ # 7. Claude Code Features
+
+ This section is different from the rest — instead of explaining RAG concepts,
+ it explains how to use Claude Code more effectively while building this project.
+
+ ## 7a. CLAUDE.md
+
+ `CLAUDE.md` is a file Claude Code reads at the start of every session.
+ Think of it as a briefing document for Claude — it tells Claude:
+ - What the project does
+ - How the codebase is structured
+ - Coding conventions to follow
+ - What commands to run
+ - Key design decisions that shouldn't be changed without thought
+
+ Without `CLAUDE.md`, Claude starts every session with no project context and
+ has to rediscover everything from reading files. With it, Claude immediately
+ knows the architecture, conventions, and constraints.
+
+ **Best practices for CLAUDE.md:**
+ - Keep it concise: it is loaded into context in every session, so every line costs tokens
+ - Include the directory structure (a quick mental map)
+ - List environment variables needed
+ - Document non-obvious decisions and _why_ they were made
+ - Keep commands up-to-date (wrong commands waste time)
+
+ **What NOT to put in CLAUDE.md:**
+ - Detailed implementation explanations (that's what code comments are for)
+ - Anything that changes frequently (stale info is worse than no info)
+ - Things derivable from the code itself
+
+ Our `CLAUDE.md` lives at the root of the project. Open it to see an example.
+
+ ## 7b. Slash Commands
+
+ Slash commands are custom prompts stored in `.claude/commands/`.
+ They're invoked with `/command-name [args]`.
+
+ ```
+ .claude/
+   commands/
+     ingest-repo.md   → /ingest-repo <github-url>
+     search-code.md   → /search-code <query>
+     add-to-notes.md  → /add-to-notes
+ ```
+
+ Each command file is a markdown prompt with `$ARGUMENTS` as the placeholder
+ for whatever you pass after the command name.
+
+ **Why slash commands?**
+ Instead of typing a long, precise instruction every time ("run the ingestion
+ pipeline for this repo, print a summary of files indexed, languages detected,
+ and chunks stored"), you define it once and invoke it with `/ingest-repo <url>`.
+
+ They're especially useful for:
+ - Repetitive operations (ingest a repo, update notes, run tests)
+ - Multi-step workflows you want to be consistent
+ - Sharing workflows with others on the project
+
+ ## 7c. Hooks
+
+ Hooks are shell commands that run automatically in response to Claude Code events.
+ Configure them in `.claude/settings.json`.
+
+ Hook events include:
+ - `PreToolUse` — runs before Claude calls a tool (e.g., before editing a file)
+ - `PostToolUse` — runs after Claude calls a tool (e.g., after writing a file)
+ - `Stop` — runs when Claude finishes a response
+
+ **Example: auto-lint after every file edit**
+ ```json
+ {
+   "hooks": {
+     "PostToolUse": [
+       {
+         "matcher": "Edit|Write",
+         "hooks": [{
+           "type": "command",
+           "command": "cd /path/to/project && ruff check $CLAUDE_FILE_PATH --fix"
+         }]
+       }
+     ]
+   }
+ }
+ ```
+
+ **Example: reminder to update notes after a commit**
+ ```json
+ {
+   "hooks": {
+     "PostToolUse": [
+       {
+         "matcher": "Bash",
+         "hooks": [{
+           "type": "command",
+           "command": "if echo \"$CLAUDE_TOOL_INPUT\" | grep -q 'git commit'; then echo 'Remember to run /add-to-notes'; fi"
+         }]
+       }
+     ]
+   }
+ }
+ ```
+
+ **Why hooks?**
+ They automate quality gates without relying on Claude to remember to run them.
+ A lint hook means every file Claude edits is checked, so style errors are
+ caught before they reach a commit.
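For anything beyond a one-liner, the hook command can call a small script instead. The sketch below assumes the hook receives its event payload as JSON on standard input with `tool_name` and `tool_input.file_path` fields; that delivery mechanism and those field names are assumptions to verify against the Claude Code hooks reference, and `lint_command` is a hypothetical helper.

```python
import json
from typing import Optional

# Tools whose output is a file on disk that can be linted (assumed names).
LINTABLE_TOOLS = {"Edit", "Write"}


def lint_command(payload_json: str) -> Optional[list[str]]:
    """Return the ruff command for an edited Python file, or None to skip."""
    payload = json.loads(payload_json)
    if payload.get("tool_name") not in LINTABLE_TOOLS:
        return None
    file_path = payload.get("tool_input", {}).get("file_path", "")
    # Only lint Python files; ruff would reject or ignore other types.
    if not file_path.endswith(".py"):
        return None
    return ["ruff", "check", file_path, "--fix"]
```

A wrapper script would pass `sys.stdin.read()` to `lint_command` and hand a non-`None` result to `subprocess.run`. Keeping the decision logic in a pure function makes it testable without running Claude Code.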
+
+ ## 7d. Subagents
+
+ Subagents are Claude instances spawned from the main Claude session to handle
+ independent tasks in parallel. In Claude Code, you use the `Agent` tool.
+
+ **When to use subagents:**
+ - Tasks that are independent and can run simultaneously
+ - Tasks that would pollute the main context with large outputs
+ - Specialised tasks (research, exploration, review)
+
+ **Example: parallel repo ingestion**
+ Instead of ingesting repos sequentially:
+ ```
+ Ingest repo A (2 min) → Ingest repo B (2 min) → Ingest repo C (2 min) = 6 min
+ ```
+
+ Spawn three subagents in parallel:
+ ```
+ Ingest repo A ─┐
+ Ingest repo B ─┼→ done in ~2 min
+ Ingest repo C ─┘
+ ```
+
+ **Example: exploration agent**
+ Before implementing a feature, spawn an Explore agent to read all relevant
+ files and return a summary — without filling the main context with file contents.
+
+ **Subagent types in Claude Code:**
+ - `general-purpose` — full tool access, good for implementation tasks
+ - `Explore` — read-only, fast codebase exploration
+ - `Plan` — architecture and design planning
+
+ We'll use subagents when:
+ 1. Ingesting multiple repos simultaneously
+ 2. Running an expert review before a PR
+ 3. Exploring unfamiliar repos before answering questions about them
+
+ ---
+
+ _Sections 2–6 will be added as each phase is built._
PLAN.md ADDED
@@ -0,0 +1,114 @@
+ # GitHub RAG Copilot — Build Plan
+
+ A RAG system that indexes GitHub repositories and answers questions about their
+ code, architecture, and documentation. Extends the PDF RAG Copilot concepts to
+ source code.
+
+ ---
+
+ ## Learning Objectives
+
+ By the end of this project you will understand:
+ - How RAG applies to code (not just documents)
+ - AST-based code chunking vs. character-window chunking
+ - Code-aware embeddings vs. general text embeddings
+ - Metadata-rich retrieval (file, function, class, language)
+ - Hosted vector DB (Qdrant Cloud) vs. local (ChromaDB)
+ - Claude Code features: CLAUDE.md, hooks, slash commands, subagents
+
+ ---
+
+ ## Architecture Overview
+
+ ```
+ GitHub URL
+
+
+ [Ingestion Pipeline]
+ ├── Clone / fetch repo via GitHub API
+ ├── Filter files (language-aware)
+ ├── Chunk by AST boundaries (functions, classes)
+ │   └── fallback: character windows (markdown, config)
+ ├── Embed with code-optimized model
+ └── Store in Qdrant Cloud (vector + metadata)
+     └── metadata: repo, filepath, language,
+                   function_name, class_name, start_line
+
+
+
+ [Query Pipeline] ← identical to PDF RAG
+ ├── Embed query
+ ├── Hybrid search (semantic + BM25 via Qdrant)
+ ├── Relevance threshold
+ └── LLM generation (Groq / Claude)
+     └── citations: filepath + line range
+ ```
+
+ ---
+
+ ## Phases
+
+ ### Phase 1 — Core Ingestion (Week 1)
+ - [ ] `repo_fetcher.py` — clone or fetch via GitHub API (no auth needed for public repos)
+ - [ ] `file_filter.py` — include/exclude rules per language, skip binaries/lock files
+ - [ ] `code_chunker.py` — AST-based chunking for Python; character-window fallback
+ - [ ] `embedder.py` — reuse from PDF RAG, swap model to `nomic-ai/nomic-embed-code`
+ - [ ] `qdrant_store.py` — replace ChromaDB with Qdrant client
+
+ ### Phase 2 — Retrieval & Generation (Week 1–2)
+ - [ ] `retrieval.py` — hybrid search using Qdrant's native BM25 + vector
+ - [ ] `generation.py` — reuse from PDF RAG, update system prompt for code answers
+ - [ ] FastAPI backend with `/ingest`, `/query`, `/search` endpoints
+
+ ### Phase 3 — UI (Week 2)
+ - [ ] Reuse PDF RAG UI structure
+ - [ ] Input: GitHub URL instead of PDF upload
+ - [ ] Citations show filepath + line numbers instead of page numbers
+ - [ ] Syntax highlighting for code chunks in source passages
+
+ ### Phase 4 — Claude Code Features (Throughout)
+ - [ ] `CLAUDE.md` — project instructions for Claude Code
+ - [ ] Hooks — auto-lint on edit, auto-update notes on commit
+ - [ ] Slash commands — `/ingest-repo`, `/search-code`, `/add-to-notes`
+ - [ ] Subagent patterns — parallel ingestion of multiple repos
+
+ ---
+
+ ## Tech Stack
+
+ | Layer | Choice | Why |
+ |---|---|---|
+ | Repo fetch | `gitpython` + GitHub API | No auth for public repos |
+ | Code parsing | `ast` (Python), `tree-sitter` (multi-lang) | Function/class boundaries |
+ | Embeddings | `nomic-ai/nomic-embed-code` | Trained on code, free |
+ | Vector DB | Qdrant Cloud (free tier) | Permanent free, 1GB |
+ | Keyword search | Qdrant sparse vectors (BM25) | Native, no separate index |
+ | LLM | Groq Llama 3.3 70B / Claude | Same as PDF RAG |
+ | Backend | FastAPI | Same as PDF RAG |
+ | Frontend | React + Vite | Same as PDF RAG |
+
+ ---
+
+ ## Key Differences vs PDF RAG
+
+ | Concern | PDF RAG | GitHub RAG |
+ |---|---|---|
+ | Ingestion source | Local file upload | GitHub URL |
+ | Chunking strategy | Fixed character windows | AST-aware (function/class) |
+ | Metadata | source, page | repo, filepath, language, function, line |
+ | Vector DB | ChromaDB (local) | Qdrant Cloud (hosted) |
+ | Embedding model | all-MiniLM-L6-v2 | nomic-embed-code |
+ | Citations | Paper name + page | Filepath + line range |
+ | Hybrid search | Manual RRF | Qdrant native hybrid |
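For readers who have not seen the "Manual RRF" approach in the last row, here is a minimal sketch of reciprocal rank fusion, the technique that merges a semantic ranking and a keyword ranking into one list. The constant `k = 60` is the commonly used default; the function name and the tie-breaking rule are choices made for this illustration, not code from the PDF RAG project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank is 1-based);
    the scores are summed and documents are returned best first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF only uses ranks, it sidesteps the problem that cosine similarities and BM25 scores live on different scales.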
+
+ ---
+
+ ## Notes Directory
+
+ `notes/` is updated after every PR with:
+ - What was built
+ - Key decisions made
+ - Concepts learned
+ - What's next
+
+ See `notes/000-project-setup.md` for the first entry.
notes/000-project-setup.md ADDED
@@ -0,0 +1,44 @@
+ # Note 000 — Project Setup
+
+ **Date:** 2026-03-22
+ **PR:** Initial setup (no PR — baseline)
+
+ ---
+
+ ## What was set up
+
+ - Project structure created: `backend/`, `ingestion/`, `retrieval/`, `notes/`, `.claude/`
+ - `PLAN.md` written with full architecture, phases, and tech stack decisions
+ - `CLAUDE.md` written with project instructions for Claude Code
+ - `LEARN.md` started — will grow as each phase is built
+ - Git repo initialized
+
+ ---
+
+ ## Key architectural decisions
+
+ **Why Qdrant over ChromaDB?**
+ ChromaDB, as used in the PDF project, is local: data lives on disk and disappears if you redeploy.
+ Qdrant Cloud has a permanent free tier (1GB), making the app deployable without
+ paying for storage. It also has native hybrid search (sparse + dense vectors),
+ eliminating the need for our manual BM25 index.
+
+ **Why nomic-embed-code over all-MiniLM-L6-v2?**
+ `all-MiniLM-L6-v2` was trained on natural language. Code has different patterns:
+ identifier names, function signatures, call chains. `nomic-embed-code` was
+ fine-tuned on code and produces better semantic similarity for code queries.
+
+ **Why AST chunking over character windows?**
+ Character windows split wherever they hit the size limit — often mid-function.
+ A function is the natural unit of code: it has a name, a purpose, inputs/outputs.
+ Chunking at function boundaries keeps each chunk semantically complete and makes
+ citations meaningful ("see `embed_text()` in `retrieval/embedder.py`").
+
+ ---
+
+ ## What's next
+
+ Phase 1: Core ingestion pipeline
+ - `repo_fetcher.py` — clone public repos
+ - `file_filter.py` — skip binaries, lock files, node_modules
+ - `code_chunker.py` — AST-based chunking for Python
requirements.txt ADDED
@@ -0,0 +1,26 @@
+ # Web framework
+ fastapi
+ uvicorn[standard]
+ python-multipart
+
+ # Vector database
+ qdrant-client
+
+ # Embeddings
+ sentence-transformers # will use nomic-embed-code
+
+ # Repo fetching
+ gitpython
+ requests
+
+ # Code parsing
+ tree-sitter # multi-language AST parsing
+ tree-sitter-python # Python grammar for tree-sitter
+
+ # LLM providers
+ groq
+ anthropic
+
+ # Utilities
+ python-dotenv
+ pydantic