Claude Code Source Study: Learnings for Cartographer
Reference session: April 2026
Source: https://github.com/codeaashu/claude-code (reconstruction/tutorial repo)
Reference: anthropics/anthropic-cookbook patterns/agents/
1. /init Phase 2: Manifest Files Before Code
What claude-code does:
"Launch a subagent to survey the codebase, and ask it to read key files: manifest files (package.json, Cargo.toml, pyproject.toml, go.mod, pom.xml, etc.), README, Makefile/build configs..."
Why this matters for cartographer: Manifest files are the universal, language-agnostic entry point to any repo:
- They declare dependencies → immediately reveals the tech stack (fastapi = web API, torch = ML, tree-sitter = code parsing, no framework = a pure library like micrograd)
- They declare entry points/scripts → reveals how the system is run
- They work for ANY repo: web apps, ML libraries, compilers, game engines, CLIs
What we were doing wrong:
Phase 1 was reading main.py/app.py module chunks → these are bootstrap files
that import ALL features equally. main.py in a FastAPI app imports every router,
every service, every feature. Its import graph says "everything is equally important",
which is the opposite of the pipeline signal we need.
Fix applied:
Added _manifest_chunks() to read project manifests first in Phase 1. The LLM
anchors on declared dependencies → understands the project type → identifies
the pipeline stages correctly.
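A minimal sketch of what _manifest_chunks() could look like, assuming a pathlib-based repo root and a simple (name, text) chunk shape (both assumptions, not the actual cartographer types):

from pathlib import Path

_MANIFEST_NAMES = (
    "package.json", "Cargo.toml", "pyproject.toml",
    "go.mod", "pom.xml", "README.md", "Makefile",
)

def _manifest_chunks(repo_root: Path) -> list[tuple[str, str]]:
    chunks = []
    for name in _MANIFEST_NAMES:
        path = repo_root / name
        if path.is_file():
            # Manifests are small: read them whole so the LLM anchors on
            # declared dependencies before it sees any source code.
            chunks.append((name, path.read_text(errors="replace")))
    return chunks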
2. No Hardcoded Heuristics: Principles Over Examples
What claude-code does:
The /init prompt states principles and asks the subagent to DISCOVER, not guess:
- "Detect: Build, test, and lint commands (especially non-standard ones)"
- "Note what you could NOT figure out from code alone β these become interview questions"
It never says "if you see a routers/ directory, skip it" or "if you see ingestion/,
it's a pipeline stage". The agent reads actual file content and reasons from there.
What we were doing wrong: Our Phase 1 prompt was full of domain-specific terms:
- "ingestion, parsing, embedding, retrieval, inference" β only valid for LLM/RAG apps
- "routers, routes, middleware, handlers" β only valid for web apps
- Good/bad examples like "Gradient Backpropagation" or "Token Embedding"
These break silently on any non-web, non-LLM repo.
Fix applied: Phase 1 prompt now:
- Reads the manifest → understands the tech stack from dependencies
- Reads the README → understands what the system does
- Reads module-level imports → sees what each file ACTUALLY uses
- States rules as universal principles: "a stage takes data in one form and produces it in another, evident from its imports and function signatures"
No domain terms, no directory name assumptions, no illustrative examples.
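For illustration, the reworked prompt might be assembled roughly like this (a sketch; _PHASE1_SYSTEM is a hypothetical name, and the wording paraphrases the principles above rather than quoting the shipped prompt):

_PHASE1_SYSTEM = (
    "You are surveying an unfamiliar codebase.\n"
    "1. Read the manifest files to identify the tech stack from the "
    "declared dependencies.\n"
    "2. Read the README to learn what the system does.\n"
    "3. Read the module-level imports to see what each file actually uses.\n"
    "A pipeline stage takes data in one form and produces it in another, "
    "evident from its imports and function signatures. Do not rely on "
    "directory names, framework conventions, or domain vocabulary."
)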
3. Evaluator-Optimizer Pattern (Anthropic Cookbook)
From patterns/agents/evaluator_optimizer.ipynb:
def evaluate(evaluator_prompt, content, task):
    # LLM call. Parses <evaluation>PASS|NEEDS_IMPROVEMENT|FAIL</evaluation>
    # and <feedback>specific actionable feedback</feedback> from the reply.
    ...

def generate(generator_prompt, task, context=""):
    # LLM call; returns the candidate output for this round.
    ...

def loop(task, evaluator_prompt, generator_prompt):
    memory = []                                 # attempts accumulate across rounds
    result = generate(generator_prompt, task)   # first attempt, no feedback yet
    while True:
        evaluation, feedback = evaluate(evaluator_prompt, result, task)
        if evaluation == "PASS":
            return result
        memory.append(result)
        context = f"Previous attempts: {memory} Feedback: {feedback}"
        result = generate(generator_prompt, task, context)
Key properties:
- Feedback accumulates across rounds → each round gets the context of what was tried
- Clear pass/fail criteria → universal principles, not content-specific examples
- Evaluation and generation are separate concerns → different prompts, different roles
What we were doing wrong: Our evaluator's "remove trivial infrastructure" instruction had no corresponding response format: the LLM was told to "remove" but had no way to signal removal. Result: "Noise File Exclusion" passed because it sounds like a technique name.
Fix applied:
Added an explicit action: keep|rename|remove field, plus a three-test rubric for quality:
- Is it a technique/decision name (not a filename/class name)?
- Would removing it leave an engineer unable to understand the system's core behaviour?
- Does the subtitle describe a real design decision with tradeoffs?
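Sketched as a pydantic model, the verdict schema might look like this (field names are illustrative, not the shipped schema):

from typing import Literal, Optional
from pydantic import BaseModel

class ConceptVerdict(BaseModel):
    action: Literal["keep", "rename", "remove"]  # explicit removal signal
    new_title: Optional[str] = None              # only meaningful for "rename"
    feedback: str                                # specific, actionable reason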
4. Tool Design: get_architecture vs Import Graph Reading
What claude-code does:
Exposes get_architecture as a curated tool that provides a high-level overview.
Also list_directory, search_source, read_source_file for targeted exploration.
The agent DECIDES what to look at. It doesn't get a static snapshot of everything.
What we were doing wrong: Phase 1 gave the LLM a static 80-file flat list + 14 random module chunks and asked it to infer the pipeline. No structure, no priority, no anchoring.
Fix applied:
- Files grouped by directory (structural signal)
- Manifest files first (tech stack signal)
- Module chunks from non-bootstrap files only
- README as anchor for what capabilities must appear
Still room to improve:
The right long-term fix is to make Phase 1 truly agentic: give it tools
(list_files, search_code, read_file) and let it explore rather than giving
it a pre-assembled snapshot. This is how claude-code's /init actually works:
a subagent with tools reads specific files, not a single-shot prompt.
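A sketch of that agentic loop using the Anthropic tool-use API; the tool names mirror the ones above, the model choice is an assumption, and run_tool is a hypothetical dispatcher into the repo:

import anthropic

client = anthropic.Anthropic()

TOOLS = [
    {"name": "list_files",
     "description": "List files under a directory of the repo.",
     "input_schema": {"type": "object",
                      "properties": {"path": {"type": "string"}},
                      "required": ["path"]}},
    {"name": "read_file",
     "description": "Read one source file.",
     "input_schema": {"type": "object",
                      "properties": {"path": {"type": "string"}},
                      "required": ["path"]}},
    {"name": "search_code",
     "description": "Search the repo for a pattern.",
     "input_schema": {"type": "object",
                      "properties": {"pattern": {"type": "string"}},
                      "required": ["pattern"]}},
]

def phase1_agentic(survey_prompt: str):
    messages = [{"role": "user", "content": survey_prompt}]
    while True:
        resp = client.messages.create(
            model="claude-3-5-sonnet-latest",  # assumption: any tool-capable model
            max_tokens=2048, tools=TOOLS, messages=messages)
        if resp.stop_reason != "tool_use":
            return resp.content[0].text        # final survey report
        messages.append({"role": "assistant", "content": resp.content})
        results = [{"type": "tool_result", "tool_use_id": b.id,
                    "content": run_tool(b.name, b.input)}  # run_tool: hypothetical
                   for b in resp.content if b.type == "tool_use"]
        messages.append({"role": "user", "content": results})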
5. Bootstrap Files Are Wiring, Not Pipeline
Universal pattern (any framework):
- main.py, app.py, server.py → Python web apps (FastAPI, Flask, Django)
- index.js, server.js → Node.js apps (Express, Fastify)
- main.go → Go apps
- main.rs → Rust apps (Actix, Axum)
- Application.java → Spring Boot
These files wire together all features/routers/services. They import everything. Their import graph tells you "all features exist", not which ones are the core pipeline.
Fix applied:
Removed bootstrap filenames from _ENTRY_NAMES. Phase 1 now gets module chunks
from service/library files that actually DO the work.
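Roughly, the filter could look like this (a sketch; _BOOTSTRAP_NAMES and the candidate helper are illustrative names for what the fix describes):

_BOOTSTRAP_NAMES = {"main.py", "app.py", "server.py", "index.js",
                    "server.js", "main.go", "main.rs", "Application.java"}

def _module_chunk_candidates(files: list[str]) -> list[str]:
    # Bootstrap files import everything, so their import graphs carry no
    # pipeline signal; chunk the service/library files instead.
    return [f for f in files
            if f.rsplit("/", 1)[-1] not in _BOOTSTRAP_NAMES]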
6. What /init's "Note what you could NOT figure out" means for tours
The /init prompt says: "Note what you could NOT figure out from code alone -
these become interview questions."
This is already implemented in Phase 2 (_phase_investigate) via the gaps field:
"GAPS: What important design rationale CANNOT be determined from this code alone?"
The ask field on each concept card also embodies this: it surfaces the question
a new engineer MUST be able to answer to work with this component, often something not
visible in the code.
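For concreteness, a concept card carrying both fields might look like this (a hypothetical card; every value here is invented for illustration):

card = {
    "title": "Contextual Chunk Embedding",
    "gaps": ["Why this embedding model over alternatives? "
             "Not determinable from the code alone."],
    "ask": "What breaks at retrieval time if the situating context is "
           "omitted from a chunk before embedding?",
}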
This is the right model. Keep it.
7. Contextual Retrieval Prompt Quality
Current _CONTEXT_SYSTEM:
"Write 1-2 sentences that situate this chunk within the document: name the function/class, state its role in the file's pipeline, and name the key identifier(s) a developer would search for."
This is solid. The system prompt correctly tells the model its output is prepended to chunks for embedding β it must match developer search queries, not explain failure modes.
The chunk_question step correctly uses Anthropic's prompt caching pattern:
- Document block → cache_control: ephemeral (cached per file, ~10% of the cost for subsequent chunks)
- Chunk + question → varies per chunk (not cached)
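A minimal sketch of that call, assuming the standard Anthropic messages API (the model choice and the contextualize_chunk helper are assumptions; the system prompt is quoted from above):

import anthropic

client = anthropic.Anthropic()

_CONTEXT_SYSTEM = (
    "Write 1-2 sentences that situate this chunk within the document: "
    "name the function/class, state its role in the file's pipeline, and "
    "name the key identifier(s) a developer would search for."
)

def contextualize_chunk(doc_text: str, chunk_text: str) -> str:
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumption: any cache-capable model
        max_tokens=200,
        system=_CONTEXT_SYSTEM,
        messages=[{
            "role": "user",
            "content": [
                {   # document block: identical for every chunk of the file,
                    # so it is cached and re-read at a fraction of input cost
                    "type": "text",
                    "text": f"<document>{doc_text}</document>",
                    "cache_control": {"type": "ephemeral"},
                },
                {   # chunk + question: varies per chunk, never cached
                    "type": "text",
                    "text": f"<chunk>{chunk_text}</chunk>\n"
                            "Situate this chunk within the document.",
                },
            ],
        }],
    )
    return resp.content[0].text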
No changes needed here: contextual retrieval was not touched in this session.
8. Remaining Quality Issues to Fix
From the latest tour output:
- "In-Memory Archive Extraction" from
qdrant_store.pywith garbled description β Phase 2 hallucinating about a storage layer file. Fix: Phase 2 investigation prompt needs stronger grounding: "only use information visible in the code above" - Some concept descriptions still feel thin
- The
askquestions could be sharper β they should be answerable ONLY by someone who read the specific concept, not generic questions
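One cheap first step: append a grounding clause to the Phase 2 investigation prompt (the wording below is a sketch, not the shipped prompt):

_GROUNDING = (
    "Use ONLY information visible in the code above. If the code does not "
    "show why a decision was made, record the question under GAPS instead "
    "of guessing. Never invent behaviour for storage or infrastructure files."
)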
Next session priorities:
- Fix Phase 2 hallucination on storage/infrastructure files
- Improve ask question specificity
- Consider agentic Phase 1 (tool-based exploration instead of a static snapshot)