Project setup: GitHub RAG Copilot

Note 000 — Project Setup

Date: 2026-03-22
PR: Initial setup (no PR, baseline)


What was set up

  • Project structure created: backend/, ingestion/, retrieval/, notes/, .claude/
  • PLAN.md written with full architecture, phases, and tech stack decisions
  • CLAUDE.md written with project instructions for Claude Code
  • LEARN.md started — will grow as each phase is built
  • Git repo initialized

Key architectural decisions

Why Qdrant over ChromaDB? In its default embedded mode, ChromaDB stores data on the local disk, so the index disappears on redeploy when the host's filesystem is ephemeral. Qdrant Cloud has a permanent free tier (1GB), making the app deployable without paying for storage. It also supports hybrid search natively (sparse + dense vectors), eliminating the need for our manual BM25 index.
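
To make "hybrid search" concrete: a sparse (keyword) ranking and a dense (embedding) ranking have to be merged into one result list. A common way to do that is reciprocal rank fusion (RRF). The sketch below only illustrates the idea in plain Python; it is not Qdrant's API, and the function name, the constant k = 60, and the sample IDs are assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks that
    rank well in both the sparse and the dense list rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results for one query: one list per retriever.
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

With a vector store that fuses results on the server, this merging step does not have to live in application code, which is the maintenance burden the decision above is meant to remove.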

Why nomic-embed-code over all-MiniLM-L6-v2? all-MiniLM-L6-v2 was trained on natural language. Code has different patterns: identifier names, function signatures, call chains. nomic-embed-code was fine-tuned on code, so it should place functionally similar code closer together and return more relevant matches for code queries.

Why AST chunking over character windows? Character windows split wherever they hit the size limit — often mid-function. A function is the natural unit of code: it has a name, a purpose, inputs/outputs. Chunking at function boundaries keeps each chunk semantically complete and makes citations meaningful ("see embed_text() in retrieval/embedder.py").
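
A minimal sketch of function-boundary chunking, using only Python's standard ast module. The Chunk dataclass, the chunk_python_source name, and the choice to chunk only top-level definitions are assumptions for illustration; the eventual code_chunker.py may be structured differently.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str        # function or class name, usable in citations
    path: str        # file the chunk came from
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    text: str        # exact source text of the definition


def chunk_python_source(source: str, path: str) -> list[Chunk]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above the def/class line, so start at the first one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno  # available since Python 3.8
            text = "\n".join(lines[start - 1:end])
            chunks.append(Chunk(node.name, path, start, end, text))
    return chunks


sample = '''import os

def embed_text(text):
    return [float(len(text))]

class Embedder:
    def run(self):
        return embed_text("x")
'''

for chunk in chunk_python_source(sample, "retrieval/embedder.py"):
    print(f"{chunk.name} in {chunk.path}, lines {chunk.start_line}-{chunk.end_line}")
```

Because every chunk carries a name, a path, and a line range, a citation such as "see embed_text() in retrieval/embedder.py" can be built directly from the chunk metadata. Module-level statements outside any definition (the import in the sample) are skipped by this sketch and would need separate handling.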


What's next

Phase 1: Core ingestion pipeline

  • repo_fetcher.py — clone public repos
  • file_filter.py — skip binaries, lock files, node_modules
  • code_chunker.py — AST-based chunking for Python
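
As a rough sketch of what the filtering step could look like, the snippet below decides from a repo-relative path whether a file is worth indexing. The skip lists, the should_index and looks_binary names, and the NUL-byte heuristic are all assumptions for illustration, not the contents of file_filter.py.

```python
from pathlib import PurePosixPath

# Hypothetical skip lists; a real filter would likely be longer or configurable.
SKIP_DIRS = {"node_modules", ".git", "__pycache__"}
LOCK_FILES = {"package-lock.json", "yarn.lock", "poetry.lock", "Cargo.lock"}
BINARY_SUFFIXES = {".png", ".jpg", ".gif", ".pdf", ".zip", ".so", ".pyc"}


def should_index(rel_path: str) -> bool:
    """Return True if a repo-relative path (with / separators) should be ingested."""
    path = PurePosixPath(rel_path)
    # Skip anything nested under an excluded directory.
    if any(part in SKIP_DIRS for part in path.parts[:-1]):
        return False
    # Lock files are large, machine-generated, and rarely useful to retrieve.
    if path.name in LOCK_FILES:
        return False
    # Cheap extension check for well-known binary formats.
    return path.suffix.lower() not in BINARY_SUFFIXES


def looks_binary(data: bytes) -> bool:
    """Heuristic fallback: a NUL byte near the start suggests non-text content."""
    return b"\x00" in data[:1024]
```

The path check is cheap enough to run before reading any file, while looks_binary can catch binaries with unfamiliar extensions once the first bytes are available.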