# Note 000: Project Setup
**Date:** 2026-03-22
**PR:** Initial setup (no PR; baseline)
---
## What was set up
- Project structure created: `backend/`, `ingestion/`, `retrieval/`, `notes/`, `.claude/`
- `PLAN.md` written with full architecture, phases, and tech stack decisions
- `CLAUDE.md` written with project instructions for Claude Code
- `LEARN.md` started; it will grow as each phase is built
- Git repo initialized
---
## Key architectural decisions
**Why Qdrant over ChromaDB?**
ChromaDB in its default embedded mode is local-only: data lives on disk and disappears on redeploy unless the host has persistent storage.
Qdrant Cloud has a permanent free tier (1GB), making the app deployable without
paying for storage. It also has native hybrid search (sparse + dense vectors),
eliminating the need for our manual BM25 index.
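
Hybrid search ultimately has to merge a sparse (keyword) ranking and a dense (embedding) ranking into one list. One widely used rule for that is reciprocal rank fusion (RRF). The sketch below is plain Python for illustration only: it is not Qdrant's client API, and the chunk IDs and the constant `k=60` are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks
    ranked highly by more than one retriever rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from a dense (vector) and a sparse (keyword) retriever.
dense_hits = ["embedder.embed_text", "chunker.chunk_file", "fetcher.clone_repo"]
sparse_hits = ["embedder.embed_text", "filter.is_binary", "chunker.chunk_file"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

A chunk that appears in both lists outranks one that appears in only one, even when the single-list hit sits at a slightly better position.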
**Why nomic-embed-code over all-MiniLM-L6-v2?**
`all-MiniLM-L6-v2` was trained on natural language. Code has different patterns:
identifier names, function signatures, call chains. `nomic-embed-code` was
fine-tuned on code and produces better semantic similarity for code queries.
**Why AST chunking over character windows?**
Character windows split wherever they hit the size limit, often mid-function.
A function is the natural unit of code: it has a name, a purpose, inputs/outputs.
Chunking at function boundaries keeps each chunk semantically complete and makes
citations meaningful ("see `embed_text()` in `retrieval/embedder.py`").
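
A minimal sketch of function-boundary chunking, using Python's standard `ast` module. The function name `chunk_python_source`, the chunk dictionary layout, and the sample source are assumptions for illustration, not the project's actual `code_chunker.py`. It only handles top-level functions and classes and requires Python 3.8+ for `end_lineno`.

```python
import ast


def chunk_python_source(source, path):
    """Return one chunk per top-level function or class in `source`."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue  # skip imports, assignments, and other module-level code
        start = node.lineno  # 1-based line of the `def` / `class` keyword
        if node.decorator_list:
            # Include decorators so the chunk is a complete definition.
            start = min(d.lineno for d in node.decorator_list)
        end = node.end_lineno  # inclusive, available since Python 3.8
        chunks.append({
            "name": node.name,
            "path": path,
            "start_line": start,
            "end_line": end,
            "text": "\n".join(lines[start - 1:end]),
        })
    return chunks


sample = '''import math

def embed_text(text):
    return [float(len(text))]

def norm(vec):
    return math.sqrt(sum(x * x for x in vec))
'''

chunks = chunk_python_source(sample, "retrieval/embedder.py")
for c in chunks:
    print(f"{c['name']}() in {c['path']} (lines {c['start_line']}-{c['end_line']})")
```

Because each chunk carries a name, a path, and a line range, a citation can point at a specific definition rather than an arbitrary character offset.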
---
## What's next
Phase 1: Core ingestion pipeline
- `repo_fetcher.py`: clone public repos
- `file_filter.py`: skip binaries, lock files, `node_modules`
- `code_chunker.py`: AST-based chunking for Python
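
As a rough sketch of what the planned filter could look like, the snippet below decides whether a repository-relative path is worth indexing. Everything in it is hypothetical: the name `should_index`, the skip lists, and the choice to detect binaries by file extension alone (a real filter might also sniff file contents for null bytes).

```python
from pathlib import PurePosixPath

# Assumed skip lists for illustration; the real project may choose differently.
SKIP_DIRS = {"node_modules", ".git", "__pycache__"}
LOCK_FILES = {"package-lock.json", "yarn.lock", "poetry.lock", "Cargo.lock"}
BINARY_SUFFIXES = {".png", ".jpg", ".gif", ".pdf", ".zip", ".so", ".pyc"}


def should_index(rel_path):
    """Return True if a repo-relative path looks like indexable source."""
    p = PurePosixPath(rel_path)
    # Reject anything located under a skipped directory.
    if any(part in SKIP_DIRS for part in p.parts[:-1]):
        return False
    # Reject dependency lock files by exact file name.
    if p.name in LOCK_FILES:
        return False
    # Extension-based binary check; a heuristic only.
    if p.suffix.lower() in BINARY_SUFFIXES:
        return False
    return True


for candidate in ["backend/main.py", "web/node_modules/react/index.js", "poetry.lock"]:
    print(candidate, should_index(candidate))
```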