Note 000 — Project Setup
Date: 2026-03-22
PR: Initial setup (no PR, baseline)
What was set up
- Project structure created: backend/, ingestion/, retrieval/, notes/, .claude/
- PLAN.md written with full architecture, phases, and tech stack decisions
- CLAUDE.md written with project instructions for Claude Code
- LEARN.md started; will grow as each phase is built
- Git repo initialized
Key architectural decisions
Why Qdrant over ChromaDB?
ChromaDB in embedded mode keeps its data on the local disk, so the index disappears on redeploy whenever the host's filesystem is ephemeral. Qdrant Cloud has a permanent free tier (1GB), which makes the app deployable without paying for storage. It also supports hybrid search natively (sparse + dense vectors), which removes the need for our manual BM25 index.
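To make "hybrid search" concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a dense-vector ranking with a sparse (BM25-style) ranking. It is illustrative only: the function name, the document IDs, and the constant k=60 are assumptions for the example, and this is not Qdrant's API or necessarily the fusion it applies internally.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense retriever and a sparse retriever.
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

With a store that does this fusion server-side, the application no longer has to keep its own BM25 index in sync with the vector index.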
Why nomic-embed-code over all-MiniLM-L6-v2?
all-MiniLM-L6-v2 was trained on natural language. Code has different patterns:
identifier names, function signatures, call chains. nomic-embed-code was
fine-tuned on code and produces better semantic similarity for code queries.
Why AST chunking over character windows?
Character windows split wherever they hit the size limit — often mid-function.
A function is the natural unit of code: it has a name, a purpose, inputs/outputs.
Chunking at function boundaries keeps each chunk semantically complete and makes
citations meaningful ("see embed_text() in retrieval/embedder.py").
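The idea can be sketched with Python's standard ast module. This is a minimal illustration, not the project's code_chunker.py: the Chunk dataclass and chunk_python_source are hypothetical names, and the sketch only handles top-level functions and classes (nested definitions stay inside their parent chunk).

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str        # function or class name, usable in a citation
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str        # exact source text of the definition


def chunk_python_source(source: str) -> list[Chunk]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # node.lineno points at the def/class line, so widen the
            # span to cover any decorators above it.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno
            chunks.append(Chunk(node.name, start, end, "\n".join(lines[start - 1:end])))
    return chunks


sample = (
    "import os\n"
    "\n"
    "def embed_text(text):\n"
    "    return [len(text)]\n"
    "\n"
    "class Embedder:\n"
    "    def run(self):\n"
    "        return 1\n"
)
chunks = chunk_python_source(sample)
```

Each chunk carries a name and a line range, which is what makes a citation like the one above possible. A character window has neither.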
What's next
Phase 1: Core ingestion pipeline
- repo_fetcher.py: clone public repos
- file_filter.py: skip binaries, lock files, node_modules
- code_chunker.py: AST-based chunking for Python
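As a rough idea of what the filtering step might look like, here is a small sketch. The function name should_index and the specific lock-file names and binary extensions are assumptions chosen for illustration; the real file_filter.py may use different rules (for example, sniffing file contents rather than trusting extensions).

```python
from pathlib import PurePosixPath

# Illustrative rule sets; extend or replace to match the real project.
SKIP_DIRS = {"node_modules"}
LOCK_FILES = {"package-lock.json", "yarn.lock", "poetry.lock", "Cargo.lock"}
BINARY_SUFFIXES = {".png", ".jpg", ".gif", ".pdf", ".zip", ".so", ".pyc"}


def should_index(path: str) -> bool:
    """Return True if a repo-relative path looks worth chunking."""
    p = PurePosixPath(path)
    # Skip anything that lives under an excluded directory.
    if any(part in SKIP_DIRS for part in p.parts[:-1]):
        return False
    # Skip dependency lock files by exact file name.
    if p.name in LOCK_FILES:
        return False
    # Skip files whose extension marks them as binary.
    if p.suffix.lower() in BINARY_SUFFIXES:
        return False
    return True


candidates = [
    "retrieval/embedder.py",
    "web/node_modules/lodash/index.js",
    "package-lock.json",
    "docs/logo.PNG",
]
kept = [path for path in candidates if should_index(path)]
```

Filtering before chunking keeps vendored and generated files out of the index, where they would otherwise crowd out the repository's own code in search results.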