umanggarg committed
Commit bedd40f · Parent: 9eea8ac

Premium-mode cap overrides for richer prebaked artifacts


Token + content caps in the artifact pipeline were tuned for free-tier
providers (Cerebras 8B, Gemini Flash) where context windows are smaller
and verbose reasoning patterns drive cost. Applied to a Sonnet 4.6
prebake, those defaults clip long classes, cut ReAct reasoning short,
and keep READMEs from being comprehensive.

GenerationService now exposes a single tunable lookup:

gen.cap("react_round_tokens", 700)

It returns the caller's default as-is, or the matching entry from
GenerationService.PREMIUM_CAPS when self.premium_mode is True. The
prebake CLI flips premium_mode for the whole run, so every cap site
automatically picks up the larger value without any per-call kwargs.
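
A condensed sketch of the lookup, mirroring the cap() method and
PREMIUM_CAPS table added in generation.py below (only two of the
entries are shown here):

    # Sketch only: two PREMIUM_CAPS entries instead of the full table.
    class GenerationService:
        PREMIUM_CAPS = {
            "react_round_tokens": 1500,   # call-site default: 700
            "readme_tokens": 4096,        # call-site default: 1800
        }

        def __init__(self) -> None:
            self.premium_mode = False     # flipped to True by the prebake CLI

        def cap(self, name: str, default: int) -> int:
            """Premium override when premium_mode is on, else the caller's default."""
            if self.premium_mode and name in self.PREMIUM_CAPS:
                return self.PREMIUM_CAPS[name]
            return default

    gen = GenerationService()
    assert gen.cap("react_round_tokens", 700) == 700    # runtime path: default
    gen.premium_mode = True
    assert gen.cap("react_round_tokens", 700) == 1500   # prebake path: override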

Wired sites (each follows the pattern sketched after this list):
- tour_agent.py: ReAct rounds, forced-DONE summaries, Phase 1 mapping,
Phase 3 description, Phase 3 code excerpt, Phase 3 final synthesis
- readme_service.py: README max_tokens
- ingestion_service.py: contextual retrieval — chunk preview chars,
surrounding doc chars, and per-chunk max_tokens (threaded into
_anthropic_contextualise)
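
Each site resolves its cap through gen.cap(name, default) and feeds the
result into a char-slice or a max_tokens kwarg; the ingestion site, for
example, resolves its three caps once before the per-chunk workers
start. An illustrative sketch with stand-in names, not the real service
code:

    # Sketch only: `gen` is the GenerationService, `client_call` stands in
    # for the real _anthropic_contextualise helper.
    def contextualise_chunk(gen, client_call, file_text: str, chunk_text: str) -> str:
        doc_chars = gen.cap("context_doc_chars", 6000)          # surrounding-file context
        chunk_chars = gen.cap("context_chunk_chars", 800)       # chunk preview in the prompt
        ctx_max_tokens = gen.cap("context_chunk_tokens", 200)   # per-chunk output budget
        return client_call(
            file_text[:doc_chars],
            chunk_text[:chunk_chars],
            max_tokens=ctx_max_tokens,
        )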

Runtime path (premium_mode=False) is unchanged — every cap returns its
original default value, so deployed traffic still hits the safer caps
that work with rate-limited free providers.

CLAUDE.md gains a "Pre-baked Artifact Cache" section documenting the
prebake flow, the premium tier, and the cap-override mechanism.

.gitignore gains a .claude/ entry so the local scheduled_tasks.lock and
any other agent-local files don't get tracked.

.gitignore CHANGED
@@ -42,3 +42,4 @@ LEARN.md
  /*.png
  .playwright-mcp/
  posthog-setup-report.md
+ .claude/
CLAUDE.md CHANGED
@@ -102,6 +102,32 @@ Every `OpenAI(...)` client instantiation MUST have `timeout=30` (or use `_TIMEOU
  A client without a timeout will hang indefinitely on a slow/unresponsive provider — verified incident with Gemma 4.
  OpenRouter uses its own helper `_openrouter_client()` which already sets `timeout=45`.

+ ## Pre-baked Artifact Cache
+
+ Tour, diagram (architecture/class), README, and repo-map outputs are persisted in a Qdrant sidecar collection (`<collection>_artifacts`) so they survive container restarts and are shared across all users. Reads go through `QdrantStore.load_artifact(repo, kind)`; writes through `save_artifact(repo, kind, data, generated_by_model)`.
+
+ A canonical set of repos can be pre-generated at premium quality with:
+
+ ```bash
+ .venv/bin/python -m scripts.prebake_repos                    # default Karpathy set
+ .venv/bin/python -m scripts.prebake_repos owner1/repo1 ...   # specific repos
+ .venv/bin/python -m scripts.prebake_repos --force ...        # rebuild
+ ```
+
+ The script flips `gen.premium_mode = True` for the entire run, which:
+ - Routes every `gen.generate(...)` call to the Claude Sonnet 4.6 client (`ANTHROPIC_API_KEY` required).
+ - Activates `PREMIUM_CAPS` overrides in `GenerationService` — every `gen.cap(name, default)` call returns the larger premium value (longer ReAct rounds, fuller chunk previews in contextual retrieval, larger README budget, etc.).
+
+ Runtime requests from the deployed app keep the original (smaller) caps so free-tier providers don't drown.
+
+ To inspect what's been baked for a repo:
+
+ ```bash
+ curl https://<host>/repos/<owner>/<name>/artifacts/info
+ ```
+
+ returns `kind / generated_by_model / generated_at` per cached artifact. HF Spaces logs also print `[cache hit] kind for repo (model)` on every served artifact.
+
  ## Runtime Data — always gitignore, never commit

  Directories written at runtime must be in `.gitignore`. Check before first commit of any new feature:
backend/services/generation.py CHANGED
@@ -280,6 +280,49 @@ class GenerationService:
          # through every service layer.
          self.premium_mode = False

+     # ── Premium-mode cap overrides ─────────────────────────────────────────
+     # The artifact-generation pipeline has many small caps — max_tokens for
+     # ReAct rounds, character limits on chunks shown to the model, README
+     # length budgets, etc. They were tuned for free-tier providers (Cerebras
+     # 8B, Gemini Flash) which have shorter context windows and verbose
+     # reasoning patterns. Applied to a Sonnet 4.6 prebake, those defaults
+     # leave quality on the table — long classes get truncated, agent
+     # reasoning is cut mid-thought, READMEs run short.
+     #
+     # When premium_mode is True, cap() returns the entry from PREMIUM_CAPS
+     # below if one exists; otherwise the caller's default. Per-callsite
+     # invocation looks like:
+     #     max_tokens = self._gen.cap("react_round_tokens", 700)
+     # Free-tier runtime paths (premium_mode=False) get the default unchanged.
+     PREMIUM_CAPS = {
+         # ── Tour generation (TourAgent) ──
+         "react_round_tokens": 1500,      # was 700 — let Sonnet reason fully per round
+         "react_done_tokens": 1200,       # was 600 — richer forced-DONE summaries
+         "phase_map_tokens": 2048,        # was 1024 — more concept candidates surfaced
+         "concept_desc_tokens": 1800,     # was 900 — deeper per-concept descriptions
+         "tour_synthesis_tokens": 16384,  # was 8192 — full Sonnet output budget
+         "tool_result_tokens": 800,       # was 400 — give the agent more of each tool result
+         "phase3_code_chars": 6000,       # was 3000 — include more code per concept
+         # ── Diagrams ──
+         "diagram_tokens": 4096,          # was 2048 — JSON output room for richer node lists
+         # ── README ──
+         "readme_tokens": 4096,           # was 1800 — comprehensive README budget
+         # ── Contextual retrieval (ingestion) ──
+         "context_chunk_tokens": 400,     # was 200 — longer contextual descriptions
+         "context_chunk_chars": 2000,     # was 800 — model sees more of each chunk
+         "context_doc_chars": 12000,      # was 6000 — model sees more surrounding file
+     }
+
+     def cap(self, name: str, default: int) -> int:
+         """Resolve a tunable cap. In premium_mode, returns the override from
+         PREMIUM_CAPS if present; otherwise returns the caller's default.
+         Lets every cap site stay readable as a single line:
+             max_tokens=self._gen.cap("react_round_tokens", 700)
+         """
+         if self.premium_mode and name in self.PREMIUM_CAPS:
+             return self.PREMIUM_CAPS[name]
+         return default
+
      def _init_premium(self) -> None:
          """Initialise the optional premium client (Claude Sonnet 4.6) used
          for one-time generation of cached artifacts (tour, README, diagrams,
backend/services/ingestion_service.py CHANGED
@@ -320,7 +320,8 @@ def _chunk_importance(c: dict) -> int:


  def _anthropic_contextualise(
-     client, model: str, system: str, doc_text: str, chunk_question: str
+     client, model: str, system: str, doc_text: str, chunk_question: str,
+     max_tokens: int = 200,
  ) -> str:
      """
      Call Anthropic with prompt caching on the document block.
@@ -341,7 +342,7 @@ _anthropic_contextualise(
      """
      resp = client.messages.create(
          model=model,
-         max_tokens=200,
+         max_tokens=max_tokens,
          system=system,
          messages=[{
              "role": "user",
@@ -430,18 +431,25 @@ def _add_context(
      # are 200-6000 tokens, so most will qualify.
      _use_anthropic_cache = getattr(gen, 'provider', None) == 'anthropic'

+     # Tunable caps — bumped automatically when gen.premium_mode is on so a
+     # premium prebake includes more of each chunk and surrounding file context
+     # than the free-tier defaults allow.
+     _doc_chars = gen.cap("context_doc_chars", 6000)
+     _chunk_chars = gen.cap("context_chunk_chars", 800)
+     _ctx_max_tokens = gen.cap("context_chunk_tokens", 200)
+
      # Worker function for a single chunk — called from multiple threads.
      # Returns (idx, updated_chunk) or (idx, None) on failure.
      def _enrich_one(idx: int, chunk: dict) -> tuple[int, dict | None]:
          filepath = chunk.get("filepath", "")
          chunk_text = chunk.get("text", "")
-         doc_text = file_content_map.get(filepath, "")[:6000]
+         doc_text = file_content_map.get(filepath, "")[:_doc_chars]
          if not chunk_text or not doc_text:
              return idx, None

          chunk_question = (
              f"Here is the chunk we want to situate within the document above:\n"
-             f"<chunk>\n{chunk_text[:800]}\n</chunk>\n\n"
+             f"<chunk>\n{chunk_text[:_chunk_chars]}\n</chunk>\n\n"
              "Please give a short succinct context to situate this chunk within the overall "
              "document for the purpose of improving search retrieval of the chunk. "
              "Name the function/class/block, its role in the file's pipeline, and the key "
@@ -459,7 +467,7 @@ def _add_context(
              # processing into O(N_files) full-cost calls.
              sentence = _anthropic_contextualise(
                  gen._client, gen._model, _CONTEXT_SYSTEM,
-                 doc_text, chunk_question,
+                 doc_text, chunk_question, max_tokens=_ctx_max_tokens,
              )
          else:
              prompt = (
backend/services/readme_service.py CHANGED
@@ -183,7 +183,7 @@ Output ONLY the markdown. No preamble, no "Here is the README", no trailing comm
                  system=system,
                  prompt=prompt,
                  temperature=0.3,
-                 max_tokens=1800,
+                 max_tokens=self._gen.cap("readme_tokens", 1800),
              )
          except Exception as e:
              yield {"stage": "error", "progress": 1.0, "error": f"Generation failed: {e}"}
backend/services/tour_agent.py CHANGED
@@ -802,7 +802,8 @@ class TourAgent:
          for round_n in range(max_rounds):
              raw = self._gen.generate(
                  self._AGENTIC_MAP_SYSTEM, transcript,
-                 temperature=0.0, max_tokens=700,  # Gemma 4 needs ~700 for verbose THINK+TOOL
+                 temperature=0.0,
+                 max_tokens=self._gen.cap("react_round_tokens", 700),
              )

              # Parse THINK + TOOL or DONE from the LLM's response
@@ -883,7 +884,8 @@
          transcript += "\nROUND LIMIT REACHED. Output DONE: now with what you have found.\n"
          raw = self._gen.generate_non_thinking(
              self._AGENTIC_MAP_SYSTEM, transcript,
-             temperature=0.0, max_tokens=700,
+             temperature=0.0,
+             max_tokens=self._gen.cap("react_done_tokens", 700),
          )
          done_m = _re.search(r'DONE:\s*(\{.+)', raw, _re.DOTALL)
          try:
@@ -999,7 +1001,8 @@ Rules:
    authors considered important enough to document
  """
          raw = self._gen.generate(_MAP_SYSTEM, prompt, temperature=0.0,
-                                  json_mode=True, max_tokens=1024)
+                                  json_mode=True,
+                                  max_tokens=self._gen.cap("phase_map_tokens", 1024))
          try:
              result = _parse_json(raw)
              if "pipeline_stages" not in result or not result["pipeline_stages"]:
@@ -1112,7 +1115,8 @@ Rules:
              # investigation from RuntimeError is better than a hallucinated one.
              raw = self._gen.generate_quality(
                  self._AGENTIC_INVESTIGATE_SYSTEM, transcript,
-                 temperature=0.0, max_tokens=700,  # Gemma 4 needs ~700 for verbose THINK+TOOL
+                 temperature=0.0,
+                 max_tokens=self._gen.cap("react_round_tokens", 700),
              )

              # Parse THINK + TOOL or DONE
@@ -1195,7 +1199,8 @@ Rules:
          # to ensure the summary is grounded in actual tool call results.
          raw = self._gen.generate_quality(
              self._AGENTIC_INVESTIGATE_SYSTEM, transcript,
-             temperature=0.0, max_tokens=600,
+             temperature=0.0,
+             max_tokens=self._gen.cap("react_done_tokens", 600),
          )
          done_m = _re.search(r'DONE:\s*(\{.+)', raw, _re.DOTALL)
          try:
@@ -1268,7 +1273,7 @@ Rules:
          # Guard: 12 primary × 700 chars + 8 related × 700 = up to 14 000 chars.
          # Cap at ~3 000 tokens (12 000 chars) so we stay within context budgets
          # on free-tier models with 8K context windows.
-         code_text = _token_budget(code_text, max_tokens=3000)
+         code_text = _token_budget(code_text, max_tokens=self._gen.cap("phase3_code_chars", 3000))

          prompt = f"""Repository: {repo}
  Concept to investigate: {stage_name}
@@ -1314,7 +1319,8 @@ Rules:
          # The agentic loop already required quality — when it falls back to static,
          # the same quality requirement applies.
          raw = self._gen.generate_quality(_INVESTIGATE_SYSTEM, prompt, temperature=0.0,
-                                          json_mode=True, max_tokens=900)
+                                          json_mode=True,
+                                          max_tokens=self._gen.cap("concept_desc_tokens", 900))
          try:
              result = _parse_json(raw)
              result.setdefault("name", stage_name)
@@ -1654,7 +1660,8 @@ Rules:
          # budget). This ensures synthesis always gets Gemini 2.5 Flash or DeepSeek-V3.1,
          # never the Cerebras 8B model that would otherwise receive it after quota is spent.
          raw = self._gen.generate_synthesis(_SYNTHESIZE_SYSTEM, prompt,
-                                            temperature=0.0, json_mode=True, max_tokens=3000)
+                                            temperature=0.0, json_mode=True,
+                                            max_tokens=self._gen.cap("tour_synthesis_tokens", 3000))
          try:
              tour = _parse_json(raw)
          except Exception as e: