# Cartographer: Build Plan
A RAG system that indexes GitHub repositories and answers natural language
questions about their code, architecture, and documentation.
---
## Learning Objectives
By the end of this project you will understand:
- How RAG works on source code (not just documents)
- AST-based code chunking vs. fixed character windows
- Code-aware embeddings vs. general text embeddings
- Metadata-rich retrieval (file, function, class, language, line numbers)
- Hosted vector databases (Qdrant Cloud) and why they enable free deployment
- Live deployment: frontend on Vercel, backend on Render, vectors on Qdrant Cloud
- Claude Code features: CLAUDE.md, hooks, slash commands, subagents
---
## Architecture Overview
```
GitHub URL
    │
    ▼
[Ingestion Pipeline]
    ├── Fetch repo via GitHub API (no clone needed for public repos)
    ├── Filter files by language: skip binaries, lock files, node_modules
    ├── Chunk by AST boundaries (functions, classes)
    │     └── fallback: character windows for markdown, config, plain text
    ├── Embed with nomic-embed-code (code-optimised model)
    └── Store in Qdrant Cloud
          └── metadata: repo, filepath, language,
                        function_name, class_name, start_line, end_line
    │
    ▼
[Query Pipeline]
    ├── Embed query with same model
    ├── Hybrid search (dense vector + sparse BM25, native in Qdrant)
    ├── Relevance threshold (reject out-of-domain queries)
    └── LLM generation (Groq / Claude)
          └── citations: filepath + line range
```
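The metadata fields and the citation format in the diagram can be pictured as a small record type. This is a minimal sketch, not the project's actual schema: the class name `ChunkMetadata`, the `citation()` and `to_payload()` helpers, the `path:Lstart-Lend` citation format, and the example repository are all assumptions made for illustration. Only the field names come from the diagram.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    """Payload stored next to each vector; field names follow the diagram."""

    repo: str
    filepath: str
    language: str
    start_line: int
    end_line: int
    function_name: Optional[str] = None  # None for module-level or non-code chunks
    class_name: Optional[str] = None

    def citation(self) -> str:
        """Render 'filepath + line range'; this exact format is an assumption."""
        return f"{self.filepath}:L{self.start_line}-L{self.end_line}"

    def to_payload(self) -> dict:
        """Plain dict that could be attached to a vector as its payload."""
        return asdict(self)


# Hypothetical example values.
meta = ChunkMetadata(
    repo="octocat/hello-world",
    filepath="src/app.py",
    language="python",
    start_line=10,
    end_line=24,
    function_name="main",
)
print(meta.citation())
```

Keeping the line range in the payload is what lets the generation step cite a precise location rather than a whole file.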
---
## Phases
### Phase 1: Core Ingestion
- [ ] `ingestion/repo_fetcher.py`: fetch file tree + content via GitHub API
- [ ] `ingestion/file_filter.py`: include/exclude rules per language
- [ ] `ingestion/code_chunker.py`: AST-based chunking for Python; character-window fallback for other file types
- [ ] `ingestion/embedder.py`: embed chunks with `nomic-ai/nomic-embed-code`
- [ ] `ingestion/qdrant_store.py`: upsert chunks into Qdrant Cloud collection
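The chunking step in `ingestion/code_chunker.py` could look roughly like the sketch below, which uses only Python's standard `ast` module. The function names, the returned dict shape, and the `size`/`overlap` defaults are assumptions for illustration. It handles top-level functions and classes only, and leaves module-level code such as imports unchunked.

```python
import ast
from typing import Dict, List


def chunk_python_source(source: str) -> List[Dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above node.lineno, so start from the earliest one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno  # available on Python 3.8+
            chunks.append({
                "name": node.name,
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks


def chunk_by_characters(text: str, size: int = 800, overlap: int = 100) -> List[Dict]:
    """Fallback for markdown, config, and plain text: overlapping windows."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [
        {"start_char": i, "text": text[i:i + size]}
        for i in range(0, len(text), step)
    ]


sample = '''import os

def greet(name):
    return f"hello {name}"

class Repo:
    def url(self):
        return "x"
'''
chunks = chunk_python_source(sample)
for c in chunks:
    print(c["kind"], c["name"], c["start_line"], c["end_line"])
```

Because each chunk ends at a function or class boundary, a retrieved passage is always a complete definition, which fixed character windows cannot guarantee.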
### Phase 2: Retrieval & Generation
- [ ] `retrieval/retrieval.py`: hybrid search using Qdrant's native dense + sparse
- [ ] `backend/services/generation.py`: LLM answer generation with code-aware system prompt
- [ ] `backend/services/ingestion_service.py`: orchestrate full ingestion pipeline
- [ ] FastAPI backend with `/ingest`, `/query`, `/search` endpoints
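To make the hybrid search and relevance threshold steps concrete, here is a small pure-Python sketch. Reciprocal rank fusion (RRF) is one common way to merge a dense ranking with a sparse one. Qdrant performs fusion server-side, so this only illustrates the idea and is not Qdrant's implementation. The function names, the constant `k = 60`, the example chunk IDs and scores, and the `min_score` value are assumptions, and the threshold would need tuning against real queries.

```python
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists of chunk IDs; each hit contributes 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def filter_by_similarity(hits: List[Tuple[str, float]], min_score: float) -> List[Tuple[str, float]]:
    """Drop hits whose dense similarity is below min_score.

    An empty result signals an out-of-domain query that should be rejected
    before any LLM call is made.
    """
    return [(chunk_id, score) for chunk_id, score in hits if score >= min_score]


# Hypothetical chunk IDs, ranked best-first by each retriever.
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print([chunk_id for chunk_id, _ in fused])

# Hypothetical cosine similarities from the dense search.
dense_hits = [("a", 0.62), ("b", 0.41), ("c", 0.12)]
print(filter_by_similarity(dense_hits, min_score=0.35))
```

The threshold is applied to dense similarity rather than to the fused score, because RRF scores depend only on rank positions and carry no absolute notion of relevance.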
### Phase 3: UI
- [ ] React + Vite frontend
- [ ] Repo URL input instead of file upload
- [ ] Citations show filepath + line numbers
- [ ] Syntax-highlighted code chunks in source passages
- [ ] Multi-repo selector in sidebar
### Phase 4: Live Deployment
- [ ] **Frontend → Vercel** (free, static hosting)
- [ ] **Backend → Render** (free tier; lightweight since no local ML model)
- [ ] **Vector DB → Qdrant Cloud** (permanent free tier, 1GB)
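- [ ] **Embeddings → Qdrant's built-in vectoriser** or the Voyage AI API (moves the model off the backend so Render stays within the free tier; chunks and queries must be embedded with the same model, so switching models means re-indexing)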
- [ ] Environment variable setup, CORS configuration
- [ ] GitHub Actions CI: lint + deploy on push to main
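For the environment variable and CORS items, one possible shape is a small settings loader that fails fast when a secret is missing. This is a sketch using only the standard library. The variable names (`QDRANT_URL`, `QDRANT_API_KEY`, `GROQ_API_KEY`, `CORS_ORIGINS`) and the `Settings` class are assumptions, not names taken from the project.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    qdrant_url: str
    qdrant_api_key: str
    groq_api_key: str
    cors_origins: List[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read required settings, raising one error that lists every missing name."""
    required = ["QDRANT_URL", "QDRANT_API_KEY", "GROQ_API_KEY"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Comma-separated list of allowed frontend origins, e.g. the Vercel URL.
    raw_origins = env.get("CORS_ORIGINS", "")
    origins = [o.strip() for o in raw_origins.split(",") if o.strip()]
    return Settings(
        qdrant_url=env["QDRANT_URL"],
        qdrant_api_key=env["QDRANT_API_KEY"],
        groq_api_key=env["GROQ_API_KEY"],
        cors_origins=origins,
    )
```

The resulting `cors_origins` list could then be passed as `allow_origins` to FastAPI's `CORSMiddleware`, so that only the deployed frontend is allowed to call the API.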
### Phase 5: Claude Code Features (Throughout)
- [ ] `CLAUDE.md`: project briefing for Claude Code sessions
- [ ] Hooks: auto-lint on file edit, reminder to update notes after commit
- [ ] Slash commands: `/ingest-repo`, `/search-code`, `/add-to-notes`
- [ ] Subagent patterns: parallel ingestion, expert review before PRs
---
## Tech Stack
| Layer | Choice | Why |
|---|---|---|
| Repo fetch | GitHub REST API | No local clone needed; works without git installed |
| Code parsing | `ast` (Python), `tree-sitter` (multi-lang) | Split at function/class boundaries |
| Embeddings | `nomic-ai/nomic-embed-code` | Fine-tuned on code, free, runs locally during development (a hosted embedding API replaces it in Phase 4) |
| Vector DB | Qdrant Cloud (free tier) | Permanent free 1GB, native hybrid search, enables deployment |
| LLM | Groq Llama 3.3 70B / Claude Haiku | Fast, cheap/free |
| Backend | FastAPI + Uvicorn | Lightweight, async, auto-docs |
| Frontend | React + Vite | Fast dev server, small production bundle |
| Frontend hosting | Vercel | Free, zero-config for Vite apps |
| Backend hosting | Render | Free tier (512MB RAM) is enough once the embedding model is moved off the server |
| CI/CD | GitHub Actions | Lint and deploy on push |
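As an illustration of the repo fetch row, the sketch below parses a repository URL, builds a request URL for GitHub's Git Trees endpoint (`/repos/{owner}/{repo}/git/trees/{ref}` with `recursive=1`), and filters a tree response down to files worth indexing. It makes no network call. The helper names, the skip and keep sets, the default ref `main`, and the sample response are assumptions for illustration only.

```python
from pathlib import PurePosixPath
from typing import List, Tuple
from urllib.parse import urlparse

# Illustrative rules only; a real filter would be configured per language.
SKIP_DIRS = {"node_modules", ".git", "dist", "build"}
SKIP_NAMES = {"package-lock.json", "yarn.lock", "poetry.lock"}
KEEP_SUFFIXES = {".py", ".js", ".ts", ".md", ".toml", ".yaml", ".yml"}


def parse_repo_url(url: str) -> Tuple[str, str]:
    """Extract (owner, repo) from a URL such as https://github.com/owner/repo."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"Not a repository URL: {url}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]
    return owner, repo


def tree_api_url(owner: str, repo: str, ref: str = "main") -> str:
    """URL that lists the whole file tree of a ref in one request."""
    return f"https://api.github.com/repos/{owner}/{repo}/git/trees/{ref}?recursive=1"


def keep_path(path: str) -> bool:
    """Apply the skip rules first, then keep only known text/code suffixes."""
    p = PurePosixPath(path)
    if any(part in SKIP_DIRS for part in p.parts[:-1]):
        return False
    if p.name in SKIP_NAMES:
        return False
    return p.suffix.lower() in KEEP_SUFFIXES


def select_files(tree_response: dict) -> List[str]:
    """Keep blob entries (files, not directories) that pass the filter."""
    return [
        entry["path"]
        for entry in tree_response.get("tree", [])
        if entry.get("type") == "blob" and keep_path(entry["path"])
    ]


# Hypothetical response, trimmed to the fields used here.
sample_response = {
    "tree": [
        {"path": "src", "type": "tree"},
        {"path": "src/app.py", "type": "blob"},
        {"path": "node_modules/lib/index.js", "type": "blob"},
        {"path": "yarn.lock", "type": "blob"},
        {"path": "logo.png", "type": "blob"},
        {"path": "README.md", "type": "blob"},
    ]
}
print(select_files(sample_response))
```

Listing the tree in a single request keeps the number of API calls low. File contents would still need to be fetched separately for each selected path.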
---
## Deployment Architecture
```
User browser
    │
    ├── Static files ──→ Vercel (free)
    │                    React UI
    │
    └── API calls ──────→ Render (free)
                          FastAPI backend
                              │
                              ├──→ Qdrant Cloud (free)
                              │    Vector storage + hybrid search
                              │
                              └──→ Groq API (free)
                                   LLM generation
```
The key insight: by using **Qdrant Cloud** for vector storage and a **remote
embedding API** (instead of running the model on the server), the backend
becomes a lightweight HTTP service with minimal RAM usage, small enough to fit
within Render's free tier (512MB RAM).
---
## Notes Directory
`notes/` is updated after every PR:
- What was built
- Key decisions made
- Concepts learned
- What's next
See `notes/000-project-setup.md` for the first entry.