
cartographer / PLAN.md

umanggarg

rebrand: update all display name references to Cartographer

410d1c8 about 1 month ago


	# Cartographer — Build Plan

	A RAG system that indexes GitHub repositories and answers natural language
	questions about their code, architecture, and documentation.

	---

	## Learning Objectives

	By the end of this project you will understand:
	- How RAG works on source code (not just documents)
	- AST-based code chunking vs. fixed character windows
	- Code-aware embeddings vs. general text embeddings
	- Metadata-rich retrieval (file, function, class, language, line numbers)
	- Hosted vector databases (Qdrant Cloud) and why they enable free deployment
	- Live deployment: frontend on Vercel, backend on Render, vectors on Qdrant Cloud
	- Claude Code features: CLAUDE.md, hooks, slash commands, subagents

	---

	## Architecture Overview

	```
	GitHub URL
	    │
	    ▼
	[Ingestion Pipeline]
	    ├── Fetch repo via GitHub API (no clone needed for public repos)
	    ├── Filter files by language (skip binaries, lock files, node_modules)
	    ├── Chunk by AST boundaries (functions, classes)
	    │     └── fallback: character windows for markdown, config, plain text
	    ├── Embed with nomic-embed-code (code-optimised model)
	    └── Store in Qdrant Cloud
	          └── metadata: repo, filepath, language,
	                        function_name, class_name, start_line, end_line

	    │
	    ▼
	[Query Pipeline]
	    ├── Embed query with same model
	    ├── Hybrid search (dense vector + sparse BM25, native in Qdrant)
	    ├── Relevance threshold (reject out-of-domain queries)
	    └── LLM generation (Groq / Claude)
	          └── citations: filepath + line range
	```
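
The metadata fields and the "filepath + line range" citation in the diagram can be sketched as a small record type. This is a minimal illustration only: the class name `ChunkMetadata`, the citation string format, and the example repository are assumptions, not part of the plan.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    """Payload stored next to each vector (field names follow the diagram)."""
    repo: str
    filepath: str
    language: str
    start_line: int
    end_line: int
    function_name: Optional[str] = None  # None for module-level or non-code chunks
    class_name: Optional[str] = None

    def citation(self) -> str:
        """Render a 'filepath + line range' citation (format is hypothetical)."""
        return f"{self.filepath}:L{self.start_line}-L{self.end_line}"


meta = ChunkMetadata(
    repo="octocat/hello-world",  # hypothetical repository
    filepath="src/app.py",
    language="python",
    start_line=10,
    end_line=24,
    function_name="main",
)
payload = asdict(meta)  # plain dict, ready to attach to a vector as its payload
print(meta.citation())
```

Keeping the line range in the payload means the query pipeline can build citations from search results alone, without re-fetching the file.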

	---

	## Phases

	### Phase 1 — Core Ingestion
	- [ ] `ingestion/repo_fetcher.py` — fetch file tree + content via GitHub API
	- [ ] `ingestion/file_filter.py` — include/exclude rules per language
	- [ ] `ingestion/code_chunker.py` — AST-based chunking for Python; character-window fallback for other file types
	- [ ] `ingestion/embedder.py` — embed chunks with `nomic-ai/nomic-embed-code`
	- [ ] `ingestion/qdrant_store.py` — upsert chunks into Qdrant Cloud collection

	### Phase 2 — Retrieval & Generation
	- [ ] `retrieval/retrieval.py` — hybrid search using Qdrant's native dense + sparse
	- [ ] `backend/services/generation.py` — LLM answer generation with code-aware system prompt
	- [ ] `backend/services/ingestion_service.py` — orchestrate full ingestion pipeline
	- [ ] FastAPI backend with `/ingest`, `/query`, `/search` endpoints
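
To make the hybrid-search step concrete, here is a pure-Python sketch of reciprocal rank fusion (RRF), one common way to merge a dense ranking with a sparse one, followed by a relevance gate. It is only an illustration of the idea: the plan relies on Qdrant's native hybrid search rather than client-side fusion, and the function names, `k = 60`, and the `0.35` cutoff are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge ranked lists of chunk ids; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def passes_threshold(best_dense_score: float, min_score: float = 0.35) -> bool:
    """Reject a query when even the best dense similarity is too low (cutoff is hypothetical)."""
    return best_dense_score >= min_score


dense = ["a", "b", "c"]   # chunk ids ordered by vector similarity
sparse = ["b", "d", "a"]  # chunk ids ordered by BM25 score
fused = reciprocal_rank_fusion([dense, sparse])
print([chunk_id for chunk_id, _ in fused])
```

Because RRF scores depend only on ranks, they say little about absolute relevance. That is why the threshold in this sketch is applied to the raw dense similarity instead of the fused score.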

	### Phase 3 — UI
	- [ ] React + Vite frontend
	- [ ] Repo URL input instead of file upload
	- [ ] Citations show filepath + line numbers
	- [ ] Syntax-highlighted code chunks in source passages
	- [ ] Multi-repo selector in sidebar

	### Phase 4 — Live Deployment
	- [ ] Frontend → Vercel (free, static hosting)
	- [ ] Backend → Render (free tier — lightweight since no local ML model)
	- [ ] Vector DB → Qdrant Cloud (permanent free tier, 1GB)
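	- [ ] Embeddings → Qdrant's built-in vectoriser or Voyage AI API (removes the model from the backend and keeps Render on the free tier; repos indexed with `nomic-embed-code` must be re-embedded, because query and chunk vectors have to come from the same model)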
	- [ ] Environment variable setup, CORS configuration
	- [ ] GitHub Actions CI: lint + deploy on push to main
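
The environment variable and CORS setup could look roughly like the sketch below, which reads settings once at startup and fails fast when a secret is missing. All variable names (`QDRANT_URL`, `QDRANT_API_KEY`, `GROQ_API_KEY`, `ALLOWED_ORIGINS`) and the example values are assumptions chosen for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    qdrant_url: str
    qdrant_api_key: str
    groq_api_key: str
    allowed_origins: list[str]


def load_settings(env=os.environ) -> Settings:
    """Build Settings from a mapping of environment variables (names are hypothetical)."""
    required = ["QDRANT_URL", "QDRANT_API_KEY", "GROQ_API_KEY"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Comma-separated list of frontend origins allowed to call the API.
    origins = [o.strip() for o in env.get("ALLOWED_ORIGINS", "").split(",") if o.strip()]
    return Settings(env["QDRANT_URL"], env["QDRANT_API_KEY"], env["GROQ_API_KEY"], origins)


settings = load_settings({
    "QDRANT_URL": "https://qdrant.example.invalid:6333",  # placeholder values
    "QDRANT_API_KEY": "dummy-key",
    "GROQ_API_KEY": "dummy-key",
    "ALLOWED_ORIGINS": "https://cartographer.example.invalid, http://localhost:5173",
})
print(settings.allowed_origins)
```

The resulting `allowed_origins` list is the kind of value FastAPI's `CORSMiddleware` takes as `allow_origins`, so the Vercel frontend's origin can be allowed without hard-coding it in the backend.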

	### Phase 5 — Claude Code Features (Throughout)
	- [ ] `CLAUDE.md` — project briefing for Claude Code sessions
	- [ ] Hooks — auto-lint on file edit, reminder to update notes after commit
	- [ ] Slash commands — `/ingest-repo`, `/search-code`, `/add-to-notes`
	- [ ] Subagent patterns — parallel ingestion, expert review before PRs

	---

	## Tech Stack

	| Layer | Choice | Why |
	|---|---|---|
	| Repo fetch | GitHub REST API | No local clone needed; works without git installed |
	| Code parsing | `ast` (Python), `tree-sitter` (multi-lang) | Split at function/class boundaries |
	| Embeddings | `nomic-ai/nomic-embed-code` | Fine-tuned on code, free, runs locally |
	| Vector DB | Qdrant Cloud (free tier) | Permanent free 1GB, native hybrid search, enables deployment |
	| LLM | Groq Llama 3.3 70B / Claude Haiku | Fast, cheap/free |
	| Backend | FastAPI + Uvicorn | Lightweight, async, auto-docs |
	| Frontend | React + Vite | Fast dev server, small production bundle |
	| Frontend hosting | Vercel | Free, zero-config for Vite apps |
	| Backend hosting | Render | Free tier works once model is removed from server |
	| CI/CD | GitHub Actions | Lint and deploy on push |
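
A back-of-the-envelope calculation helps judge how far the 1GB free tier stretches. The sketch below assumes float32 vectors, a rule-of-thumb 1.5x overhead for index structures, and that the 1GB figure is the binding limit. The dimensions 768 and 1024 are illustrative and not tied to a specific model; payloads and sparse vectors are ignored.

```python
GIB = 1024 ** 3


def vectors_that_fit(budget_bytes: int, dim: int,
                     bytes_per_value: int = 4, overhead: float = 1.5) -> int:
    """Estimate how many dense vectors fit in a byte budget (overhead is a rough assumption)."""
    per_vector = dim * bytes_per_value * overhead
    return int(budget_bytes // per_vector)


for dim in (768, 1024):  # illustrative embedding sizes
    print(dim, vectors_that_fit(GIB, dim))
```

Under these assumptions the budget covers a few hundred thousand chunks at most, so the embedding dimension chosen in Phase 4 directly limits how many repositories can be indexed at once.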

	---

	## Deployment Architecture

	```
	User browser
	    │
	    ├── Static files ──→ Vercel (free)
	    │                    React UI
	    │
	    └── API calls ─────→ Render (free)
	                         FastAPI backend
	                             │
	                             ├──→ Qdrant Cloud (free)
	                             │    Vector storage + hybrid search
	                             │
	                             └──→ Groq API (free)
	                                  LLM generation
	```

	The key insight: with vectors stored in Qdrant Cloud and embeddings computed
	by a **remote embedding API** (instead of a model loaded on the server), the
	backend is a lightweight HTTP service that holds no model weights in memory
	and should fit within Render's free tier (512 MB RAM).
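
One way to keep that option open is to hide embedding behind a small interface, so the Phase 1 local model and a Phase 4 remote API are interchangeable. The sketch below is hypothetical throughout: the `Embedder` protocol, the `post` transport callable, the request and response shapes, and the model name do not correspond to any specific provider's API.

```python
from typing import Callable, Protocol


class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...


class RemoteEmbedder:
    """Delegates embedding to an HTTP API, so no model weights load in the backend."""

    def __init__(self, post: Callable[[dict], dict], model: str):
        self._post = post    # hypothetical transport: sends JSON, returns parsed JSON
        self._model = model

    def embed(self, texts: list[str]) -> list[list[float]]:
        response = self._post({"model": self._model, "input": texts})
        return response["embeddings"]  # assumed response shape


def fake_post(body: dict) -> dict:
    """Stand-in for the network call; returns one tiny vector per input text."""
    return {"embeddings": [[float(len(t)), 0.0] for t in body["input"]]}


embedder: Embedder = RemoteEmbedder(fake_post, model="example-embedding-model")
vectors = embedder.embed(["def f(): pass", "class A: ..."])
print(len(vectors), len(vectors[0]))
```

In a real deployment `post` would wrap an HTTP client call to the chosen provider, and the same `Embedder` interface would have to be used for both ingestion and queries so the vectors stay comparable.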

	---

	## Notes Directory

	`notes/` is updated after every PR:
	- What was built
	- Key decisions made
	- Concepts learned
	- What's next

	See `notes/000-project-setup.md` for the first entry.