# Note 000: Project Setup

**Date:** 2026-03-22

**PR:** Initial setup (no PR, baseline)

---
## What was set up

- Project structure created: `backend/`, `ingestion/`, `retrieval/`, `notes/`, `.claude/`
- `PLAN.md` written with full architecture, phases, and tech stack decisions
- `CLAUDE.md` written with project instructions for Claude Code
- `LEARN.md` started; it will grow as each phase is built
- Git repo initialized

---
## Key architectural decisions

**Why Qdrant over ChromaDB?**

In its default embedded mode, ChromaDB keeps data on local disk, so the index
disappears on redeploy wherever the host filesystem is ephemeral.
Qdrant Cloud has a permanent free tier (1 GB), making the app deployable without
paying for storage. It also has native hybrid search (sparse + dense vectors),
eliminating the need for our manual BM25 index.
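
To make the hybrid search point concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a sparse (keyword) ranking with a dense (embedding) ranking. It is plain Python for illustration only, not Qdrant's API: with Qdrant the fusion would run server-side. The function name, the chunk IDs, and the constant `k = 60` are assumptions made for this example.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one list.

    Each appearance of an ID contributes 1 / (k + rank) to its score,
    so IDs that rank well in several lists float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical result lists for one query (IDs are made up).
sparse_hits = ["fetch_repo", "embed_text", "chunk_file"]    # keyword / BM25-style
dense_hits = ["embed_text", "filter_files", "fetch_repo"]   # embedding similarity

if __name__ == "__main__":
    print(reciprocal_rank_fusion([sparse_hits, dense_hits]))
    # ['embed_text', 'fetch_repo', 'filter_files', 'chunk_file']
```

`embed_text` wins because it ranks near the top of both lists, which is the behavior hybrid search is meant to provide.
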

**Why nomic-embed-code over all-MiniLM-L6-v2?**

`all-MiniLM-L6-v2` was trained on natural language. Code has different patterns:
identifier names, function signatures, call chains. `nomic-embed-code` was
fine-tuned on code and produces better semantic similarity for code queries.

**Why AST chunking over character windows?**

Character windows split wherever they hit the size limit, often mid-function.
A function is the natural unit of code: it has a name, a purpose, inputs/outputs.
Chunking at function boundaries keeps each chunk semantically complete and makes
citations meaningful ("see `embed_text()` in `retrieval/embedder.py`").

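
The difference between the two strategies can be sketched with Python's standard `ast` module. This is an illustrative sketch, not the planned `code_chunker.py`: the function names, the chunk dictionary layout, and the sample source are assumptions, and it only handles top-level functions and classes.

```python
import ast


def chunk_by_chars(source: str, size: int) -> list[str]:
    """Baseline: fixed-size character windows, blind to code structure."""
    return [source[i:i + size] for i in range(0, len(source), size)]


def chunk_by_function(source: str, path: str) -> list[dict]:
    """One chunk per top-level function or class, with citation metadata."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "path": path,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks


# Made-up sample file used only for this demonstration.
SAMPLE = '''\
def embed_text(text):
    tokens = text.split()
    return [len(t) for t in tokens]


def embed_batch(texts):
    return [embed_text(t) for t in texts]
'''

if __name__ == "__main__":
    # The first 40-character window ends partway through the function body.
    print(repr(chunk_by_chars(SAMPLE, 40)[0]))
    for chunk in chunk_by_function(SAMPLE, "retrieval/embedder.py"):
        print(f"{chunk['name']} ({chunk['path']}:{chunk['start_line']}-{chunk['end_line']})")
    # embed_text (retrieval/embedder.py:1-3)
    # embed_batch (retrieval/embedder.py:6-7)
```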

---

## What's next

Phase 1: Core ingestion pipeline

- `repo_fetcher.py`: clone public repos
- `file_filter.py`: skip binaries, lock files, `node_modules`
- `code_chunker.py`: AST-based chunking for Python
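
As a rough sketch of the filtering step, something like the following could decide which paths to skip. Everything here is an assumption for illustration: the `should_skip` name, the specific lock-file names, and the extension list are placeholders, and a real filter might also inspect file contents to detect binaries.

```python
from pathlib import PurePosixPath

# Hypothetical skip lists; the real file_filter.py may choose differently.
SKIP_DIRS = {"node_modules"}
LOCK_FILES = {"package-lock.json", "yarn.lock", "poetry.lock", "Cargo.lock"}
BINARY_SUFFIXES = {".png", ".jpg", ".gif", ".pdf", ".zip", ".so", ".pyc"}


def should_skip(path: str) -> bool:
    """Return True if a repo-relative path should be excluded from ingestion."""
    p = PurePosixPath(path)
    # Skip anything nested under an excluded directory.
    if any(part in SKIP_DIRS for part in p.parts[:-1]):
        return True
    # Skip dependency lock files by exact file name.
    if p.name in LOCK_FILES:
        return True
    # Skip files whose extension marks them as binary (case-insensitive).
    return p.suffix.lower() in BINARY_SUFFIXES


if __name__ == "__main__":
    for candidate in [
        "retrieval/embedder.py",
        "node_modules/react/index.js",
        "frontend/package-lock.json",
        "assets/logo.PNG",
    ]:
        print(candidate, "->", "skip" if should_skip(candidate) else "keep")
```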