cartographer / notes /000-project-setup.md

umanggarg

Project setup: GitHub RAG Copilot

b5dbf45 about 2 months ago

Note 000 — Project Setup

Date: 2026-03-22
PR: Initial setup (no PR, baseline)

What was set up

Project structure created: backend/, ingestion/, retrieval/, notes/, .claude/
PLAN.md written with full architecture, phases, and tech stack decisions
CLAUDE.md written with project instructions for Claude Code
LEARN.md started — will grow as each phase is built
Git repo initialized

Key architectural decisions

Why Qdrant over ChromaDB? In its default embedded mode, ChromaDB keeps its data on the local disk, so on a host with ephemeral storage the index disappears every time you redeploy. Qdrant Cloud has a permanent free tier (1GB), making the app deployable without paying for storage. It also has native hybrid search (sparse + dense vectors), eliminating the need for our manual BM25 index.
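
To make "hybrid search" concrete: a hybrid query runs a sparse (keyword) ranking and a dense (embedding) ranking, then merges them into one list. The sketch below uses reciprocal rank fusion, one common merging method. It is an illustration in plain Python, not Qdrant's API and not necessarily how this project will fuse results; the function name, the file names, and the constant k=60 are assumptions for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into a single ranked list.

    Each id earns 1 / (k + rank) from every list it appears in, so ids
    that rank well in both the sparse and the dense list rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists for the same query.
dense_hits = ["embedder.py", "chunker.py", "fetcher.py"]   # embedding similarity
sparse_hits = ["chunker.py", "filter.py", "embedder.py"]   # keyword match

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

With a vector store that supports sparse and dense vectors natively, this merging happens server-side, which is why a separately maintained BM25 index becomes unnecessary.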

Why nomic-embed-code over all-MiniLM-L6-v2? all-MiniLM-L6-v2 was trained on natural language. Code has different patterns: identifier names, function signatures, call chains. nomic-embed-code was fine-tuned on code, so we expect it to produce better semantic similarity for code queries.

Why AST chunking over character windows? Character windows split wherever they hit the size limit — often mid-function. A function is the natural unit of code: it has a name, a purpose, inputs/outputs. Chunking at function boundaries keeps each chunk semantically complete and makes citations meaningful ("see embed_text() in retrieval/embedder.py").
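
A minimal sketch of function-boundary chunking, using only Python's standard `ast` module. The `Chunk` dataclass and `chunk_python_source` are hypothetical names for illustration, not the planned contents of code_chunker.py. The sketch handles top-level functions and classes only and skips module-level statements; a real chunker would also need to decide what to do with those, with nested definitions, and with very large classes.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str        # function or class name, usable in citations
    kind: str        # "function" or "class"
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    text: str        # source text of the definition (decorators not included)


def chunk_python_source(source: str) -> list[Chunk]:
    """Return one chunk per top-level function or class in `source`."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue  # imports, constants, and other statements are skipped
        text = ast.get_source_segment(source, node)
        chunks.append(Chunk(node.name, kind, node.lineno, node.end_lineno, text))
    return chunks


SAMPLE = '''import os

def embed_text(text):
    return [float(len(text))]

class Embedder:
    def run(self, text):
        return embed_text(text)
'''

chunks = chunk_python_source(SAMPLE)
for chunk in chunks:
    print(f"{chunk.kind} {chunk.name}: lines {chunk.start_line}-{chunk.end_line}")
```

Because each chunk carries a name and a line range, a retrieved chunk can be cited directly as a named definition in a specific file rather than as an arbitrary character offset.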

What's next

Phase 1: Core ingestion pipeline

repo_fetcher.py — clone public repos
file_filter.py — skip binaries, lock files, node_modules
code_chunker.py — AST-based chunking for Python
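
As a rough sketch of the filtering step listed above, the function below decides whether a file should be indexed. The skip lists and the `should_index` and `looks_binary` names are assumptions for illustration; the actual file_filter.py may use different rules. Binary detection here is the common heuristic of looking for a NUL byte in the first bytes of the file.

```python
from pathlib import PurePosixPath

# Illustrative skip lists; extend as needed for the repos being ingested.
SKIP_DIRS = {"node_modules", ".git", "__pycache__"}
LOCK_FILES = {"package-lock.json", "yarn.lock", "poetry.lock", "Pipfile.lock", "Cargo.lock"}
BINARY_EXTS = {".png", ".jpg", ".gif", ".pdf", ".zip", ".so", ".pyc", ".exe"}


def looks_binary(sample: bytes) -> bool:
    """Heuristic: text files almost never contain a NUL byte."""
    return b"\x00" in sample


def should_index(rel_path: str, sample: bytes = b"") -> bool:
    """Return True if the file at `rel_path` (repo-relative, '/'-separated) should be indexed.

    `sample` is the first few bytes of the file, if the caller has read them.
    """
    path = PurePosixPath(rel_path)
    if any(part in SKIP_DIRS for part in path.parts[:-1]):
        return False  # inside a skipped directory
    if path.name in LOCK_FILES:
        return False  # generated lock file
    if path.suffix.lower() in BINARY_EXTS:
        return False  # known binary extension
    return not looks_binary(sample)


print(should_index("src/app.py", b"print('hi')"))
print(should_index("web/node_modules/react/index.js"))
```

Checking the path before reading any bytes keeps the filter cheap: most rejected files are ruled out by directory or name alone.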