# Note 000 — Project Setup

**Date:** 2026-03-22
**PR:** Initial setup (no PR — baseline)

---

## What was set up

- Project structure created: `backend/`, `ingestion/`, `retrieval/`, `notes/`, `.claude/`
- `PLAN.md` written with full architecture, phases, and tech stack decisions
- `CLAUDE.md` written with project instructions for Claude Code
- `LEARN.md` started — will grow as each phase is built
- Git repo initialized

---

## Key architectural decisions

**Why Qdrant over ChromaDB?**
ChromaDB's default embedded mode is local-only: data lives on disk and disappears on redeploy when the host's filesystem is ephemeral. Qdrant Cloud has a permanent free tier (1GB), making the app deployable without paying for storage. It also has native hybrid search (sparse + dense vectors), eliminating the need for our manual BM25 index.

**Why nomic-embed-code over all-MiniLM-L6-v2?**
`all-MiniLM-L6-v2` was trained on natural language. Code has different patterns: identifier names, function signatures, call chains. `nomic-embed-code` was fine-tuned on code and produces better semantic similarity for code queries.

**Why AST chunking over character windows?**
Character windows split wherever they hit the size limit, often mid-function. A function is the natural unit of code: it has a name, a purpose, and inputs/outputs. Chunking at function boundaries keeps each chunk semantically complete and makes citations meaningful ("see `embed_text()` in `retrieval/embedder.py`").

---

## What's next

Phase 1: Core ingestion pipeline

- `repo_fetcher.py` — clone public repos
- `file_filter.py` — skip binaries, lock files, node_modules
- `code_chunker.py` — AST-based chunking for Python
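
As a rough illustration of the function-boundary chunking planned for `code_chunker.py`, here is a minimal sketch using Python's standard `ast` module. The names `Chunk`, `chunk_functions`, and `SAMPLE` are hypothetical and not part of the project yet; the real chunker may also need to handle classes, module-level code, and decorators differently.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    """One function-level chunk (hypothetical shape, for illustration only)."""
    name: str
    start_line: int  # 1-indexed line of the `def` statement
    end_line: int    # 1-indexed last line of the function body
    source: str


def chunk_functions(source: str) -> list[Chunk]:
    """Return one chunk per function or method found in `source`.

    Assumption: nested functions are emitted as their own chunks as well,
    so a nested function's text also appears inside its parent's chunk.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno points at the `def` line, so decorators are not included.
            start, end = node.lineno, node.end_lineno
            text = "\n".join(lines[start - 1:end])
            chunks.append(Chunk(node.name, start, end, text))
    # ast.walk is breadth-first; sort so chunks follow file order.
    chunks.sort(key=lambda c: c.start_line)
    return chunks


SAMPLE = '''\
import os

def embed_text(text):
    return [float(len(text))]

class Store:
    def add(self, item):
        self.items.append(item)
'''

if __name__ == "__main__":
    for chunk in chunk_functions(SAMPLE):
        print(f"{chunk.name}: lines {chunk.start_line}-{chunk.end_line}")
```

Each chunk carries its function name and line range, which is what would make a citation like "see `embed_text()` in `retrieval/embedder.py`" possible. A character-window splitter has no such metadata to offer.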