
Claude Code Source Study - Learnings for Cartographer

Reference session: April 2026
Source: https://github.com/codeaashu/claude-code (reconstruction/tutorial repo)
Reference: anthropics/anthropic-cookbook patterns/agents/


1. /init Phase 2 - Manifest Files Before Code

What claude-code does:

"Launch a subagent to survey the codebase, and ask it to read key files: manifest files (package.json, Cargo.toml, pyproject.toml, go.mod, pom.xml, etc.), README, Makefile/build configs..."

Why this matters for cartographer: Manifest files are the universal, language-agnostic entry point to any repo:

  • They declare dependencies → immediately reveals the tech stack (fastapi = web API, torch = ML, tree-sitter = code parsing, no framework = pure library like micrograd)
  • They declare entry points/scripts → reveals how the system is run
  • They work for ANY repo: web apps, ML libraries, compilers, game engines, CLIs

What we were doing wrong: Phase 1 was reading main.py/app.py module chunks - these are bootstrap files that import ALL features equally. main.py in a FastAPI app imports every router, every service, every feature. Its import graph says "everything is equally important", which is the opposite of the pipeline signal we need.

Fix applied: Added _manifest_chunks() to read project manifests first in Phase 1. The LLM anchors on declared dependencies → understands the project type → identifies pipeline stages correctly.
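
A minimal sketch of what such a helper could look like - the filename set, truncation limit, and (path, text) return shape are assumptions; only the _manifest_chunks name comes from the actual fix:

from pathlib import Path

# Hypothetical sketch: surface manifest files so Phase 1 reads them first.
_MANIFEST_NAMES = {
    "package.json", "Cargo.toml", "pyproject.toml", "setup.py",
    "go.mod", "pom.xml", "build.gradle", "requirements.txt",
}

def _manifest_chunks(repo_root: str, max_bytes: int = 8_000) -> list[tuple[str, str]]:
    """Return (relative_path, content) pairs for manifest files, truncated."""
    root = Path(repo_root)
    chunks = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name in _MANIFEST_NAMES:
            text = path.read_text(errors="replace")[:max_bytes]
            chunks.append((str(path.relative_to(root)), text))
    return chunks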


2. No Hardcoded Heuristics - Principles Over Examples

What claude-code does: The /init prompt states principles and asks the subagent to DISCOVER, not guess:

  • "Detect: Build, test, and lint commands (especially non-standard ones)"
  • "Note what you could NOT figure out from code alone β€” these become interview questions"

It never says "if you see a routers/ directory, skip it" or "if you see ingestion/, it's a pipeline stage". The agent reads actual file content and reasons from there.

What we were doing wrong: Our Phase 1 prompt was full of domain-specific terms:

  • "ingestion, parsing, embedding, retrieval, inference" β€” only valid for LLM/RAG apps
  • "routers, routes, middleware, handlers" β€” only valid for web apps
  • Good/bad examples like "Gradient Backpropagation" or "Token Embedding"

These break silently on any non-web, non-LLM repo.

Fix applied: Phase 1 prompt now:

  1. Reads the manifest → understands the tech stack from dependencies
  2. Reads the README → understands what the system does
  3. Reads module-level imports → sees what each file ACTUALLY uses
  4. States rules as universal principles: "a stage takes data in one form and produces it in another - evident from its imports and function signatures"

No domain terms, no directory name assumptions, no illustrative examples.
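
As an illustration of how those four pieces could be assembled, the sketch below is an assumption - the wording of PHASE1_RULES and the helper name build_phase1_prompt are placeholders, not the actual Phase 1 prompt:

# Hypothetical assembly of the Phase 1 context: manifests first, then README,
# then module-level imports, with rules stated as universal principles.
PHASE1_RULES = (
    "A pipeline stage takes data in one form and produces it in another; "
    "this is evident from its imports and function signatures. "
    "Do not infer meaning from directory names. "
    "Note anything you could not determine from the code alone."
)

def build_phase1_prompt(manifest_chunks, readme_text, module_imports):
    parts = ["## Project manifests"]
    parts += [f"### {path}\n{text}" for path, text in manifest_chunks]
    parts += ["## README", readme_text, "## Module-level imports"]
    parts += [f"{path}: {', '.join(names)}" for path, names in module_imports]
    parts += ["## Rules", PHASE1_RULES]
    return "\n\n".join(parts)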


3. Evaluator-Optimizer Pattern (Anthropic Cookbook)

From patterns/agents/evaluator_optimizer.ipynb:

def evaluate(evaluator_prompt, content, task):
    # Returns <evaluation>PASS|NEEDS_IMPROVEMENT|FAIL</evaluation>
    # AND <feedback>specific actionable feedback</feedback>
    ...

def loop(task, evaluator_prompt, generator_prompt):
    memory = []                                # prior attempts, fed back as context
    result = generate(generator_prompt, task, context="")
    while True:
        evaluation, feedback = evaluate(evaluator_prompt, result, task)
        if evaluation == "PASS":
            return result
        memory.append(result)
        context = f"Previous attempts: {memory}\nFeedback: {feedback}"
        result = generate(generator_prompt, task, context)

Key properties:

  1. Feedback accumulates across rounds - each round gets the context of what was tried
  2. Clear pass/fail criteria - universal principles, not content-specific examples
  3. Evaluation and generation are separate concerns - different prompts, different roles

What we were doing wrong: Our evaluator's "remove trivial infrastructure" instruction had no corresponding response format - the LLM said "remove" but had no field in which to signal removal. Result: "Noise File Exclusion" passed because it sounds like a technique name.

Fix applied: An explicit action: keep|rename|remove field, plus a three-test rubric for quality:

  1. Is it a technique/decision name (not a filename/class name)?
  2. Would removing it leave an engineer unable to understand the system's core behaviour?
  3. Does the subtitle describe a real design decision with tradeoffs?
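
A sketch of the response shape this implies - only the action: keep|rename|remove field is stated above; the other field names are assumptions:

from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class ConceptVerdict:
    concept: str                                   # title being judged
    action: Literal["keep", "rename", "remove"]    # explicit removal signal
    new_title: Optional[str]                       # only used when action == "rename"
    feedback: str                                  # phrased against the three-test rubric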

4. Tool Design - get_architecture vs Import Graph Reading

What claude-code does: Exposes get_architecture as a curated tool that provides a high-level overview, alongside list_directory, search_source, and read_source_file for targeted exploration.

The agent DECIDES what to look at. It doesn't get a static snapshot of everything.

What we were doing wrong: Phase 1 gave the LLM a static 80-file flat list + 14 random module chunks and asked it to infer the pipeline. No structure, no priority, no anchoring.

Fix applied:

  • Files grouped by directory (structural signal)
  • Manifest files first (tech stack signal)
  • Module chunks from non-bootstrap files only
  • README as anchor for what capabilities must appear

Still room to improve: The right long-term fix is to make Phase 1 truly agentic - give it tools (list_files, search_code, read_file) and let it explore rather than giving it a pre-assembled snapshot. This is how claude-code's /init actually works: a subagent with tools reads specific files, not a single-shot prompt.
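
A sketch of what that could look like, using the Anthropic tool-use schema - the descriptions and parameter names are assumptions; only the three tool names appear above:

# Hypothetical tool definitions for an agentic Phase 1.
PHASE1_TOOLS = [
    {
        "name": "list_files",
        "description": "List files under a directory of the repo.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "search_code",
        "description": "Search the repo for a string or regex, returning matching lines.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "read_file",
        "description": "Read a single source file by path.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
]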


5. Bootstrap Files Are Wiring, Not Pipeline

Universal pattern (any framework):

  • main.py, app.py, server.py - Python web apps (FastAPI, Flask, Django)
  • index.js, server.js - Node.js apps (Express, Fastify)
  • main.go - Go apps
  • main.rs - Rust apps (Actix, Axum)
  • Application.java - Spring Boot

These files wire together all features/routers/services. They import everything. Their import graph tells you "all features exist" - not which ones are the core pipeline.

Fix applied: Removed bootstrap filenames from _ENTRY_NAMES. Phase 1 now gets module chunks from service/library files that actually DO the work.
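
A minimal sketch of the kind of filter this implies - _ENTRY_NAMES is the real constant named above; the _BOOTSTRAP_NAMES set and the helper are assumptions:

# Hypothetical filter: drop bootstrap/wiring files from module-chunk candidates
# so Phase 1 reads the files that actually do the work.
_BOOTSTRAP_NAMES = {
    "main.py", "app.py", "server.py", "index.js", "server.js",
    "main.go", "main.rs", "Application.java",
}

def module_chunk_candidates(paths: list[str]) -> list[str]:
    return [p for p in paths if p.rsplit("/", 1)[-1] not in _BOOTSTRAP_NAMES]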


6. What /init's "Note what you could NOT figure out" means for tours

The /init prompt says: "Note what you could NOT figure out from code alone β€” these become interview questions."

This is already implemented in Phase 2 (_phase_investigate) via the gaps field:

"GAPS: What important design rationale CANNOT be determined from this code alone?"

The ask field on each concept card also embodies this: it surfaces the question a new engineer MUST be able to answer to work with this component, often something not visible in the code.
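
For concreteness, a sketch of the shapes involved - only the gaps and ask field names come from the code; everything else here is an assumption:

from dataclasses import dataclass, field

@dataclass
class ConceptCard:
    title: str
    subtitle: str                  # the design decision and its tradeoff
    files: list[str]
    ask: str                       # question a new engineer must be able to answer

@dataclass
class InvestigationResult:
    concepts: list[ConceptCard]
    gaps: list[str] = field(default_factory=list)   # rationale not determinable from code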

This is the right model. Keep it.


7. Contextual Retrieval Prompt Quality

Current _CONTEXT_SYSTEM:

"Write 1-2 sentences that situate this chunk within the document: name the function/class, state its role in the file's pipeline, and name the key identifier(s) a developer would search for."

This is solid. The system prompt correctly tells the model its output is prepended to chunks for embedding - it must match developer search queries, not explain failure modes.

The chunk_question correctly uses Anthropic's prompt caching pattern:

  • Document block → cache_control: ephemeral (cached per file, ~10% cost for subsequent chunks)
  • Chunk + question → varies per chunk (not cached)
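
For reference, the call shape that pattern implies looks roughly like this - the model name, wrapper tags, and helper name are placeholders, not the actual chunk_question implementation:

import anthropic

client = anthropic.Anthropic()

_CONTEXT_SYSTEM = (
    "Write 1-2 sentences that situate this chunk within the document: name the "
    "function/class, state its role in the file's pipeline, and name the key "
    "identifier(s) a developer would search for."
)

def contextualize_chunk(document_text: str, chunk_text: str) -> str:
    # The full document carries cache_control, so later chunks of the same file
    # reuse the cached prefix; the chunk-specific part is sent uncached.
    response = client.messages.create(
        model="claude-3-5-haiku-latest",        # placeholder model choice
        max_tokens=200,
        system=_CONTEXT_SYSTEM,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"<document>\n{document_text}\n</document>",
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": f"<chunk>\n{chunk_text}\n</chunk>",
                },
            ],
        }],
    )
    return response.content[0].text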

No changes needed here - contextual retrieval was not touched in this session.


8. Remaining Quality Issues to Fix

From the latest tour output:

  • "In-Memory Archive Extraction" from qdrant_store.py with garbled description β†’ Phase 2 hallucinating about a storage layer file. Fix: Phase 2 investigation prompt needs stronger grounding: "only use information visible in the code above"
  • Some concept descriptions still feel thin
  • The ask questions could be sharper - they should be answerable ONLY by someone who read the specific concept, not generic questions

Next session priorities:

  1. Fix Phase 2 hallucination on storage/infrastructure files
  2. Improve ask question specificity
  3. Consider agentic Phase 1 (tool-based exploration instead of static snapshot)