umanggarg committed
Commit bedd40f · Parent: 9eea8ac

Premium-mode cap overrides for richer prebaked artifacts


Token + content caps in the artifact pipeline were tuned for free-tier
providers (Cerebras 8B, Gemini Flash) where context windows are smaller
and verbose reasoning patterns drive cost. Applied to a Sonnet 4.6
prebake, those defaults clip long classes, cut ReAct reasoning short,
and keep READMEs from being comprehensive.

GenerationService now exposes a single tunable lookup:

gen.cap("react_round_tokens", 700)

It returns the caller's default as-is, or the matching entry from
GenerationService.PREMIUM_CAPS when self.premium_mode is True. The
prebake CLI flips premium_mode for the whole run, so every cap site
automatically picks up the larger value without any per-call kwargs.
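
A condensed sketch of the lookup, mirroring the cap() method and
PREMIUM_CAPS table added in generation.py below (only two of the
entries are shown here):

    # Sketch only: two PREMIUM_CAPS entries instead of the full table.
    class GenerationService:
        PREMIUM_CAPS = {
            "react_round_tokens": 1500,   # call-site default: 700
            "readme_tokens": 4096,        # call-site default: 1800
        }

        def __init__(self) -> None:
            self.premium_mode = False     # flipped to True by the prebake CLI

        def cap(self, name: str, default: int) -> int:
            """Premium override when premium_mode is on, else the caller's default."""
            if self.premium_mode and name in self.PREMIUM_CAPS:
                return self.PREMIUM_CAPS[name]
            return default

    gen = GenerationService()
    assert gen.cap("react_round_tokens", 700) == 700    # runtime path: default
    gen.premium_mode = True
    assert gen.cap("react_round_tokens", 700) == 1500   # prebake path: override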

Wired sites (each follows the pattern sketched after this list):
- tour_agent.py: ReAct rounds, forced-DONE summaries, Phase 1 mapping,
Phase 3 description, Phase 3 code excerpt, Phase 3 final synthesis
- readme_service.py: README max_tokens
- ingestion_service.py: contextual retrieval — chunk preview chars,
surrounding doc chars, and per-chunk max_tokens (threaded into
_anthropic_contextualise)
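
Each site resolves its cap through gen.cap(name, default) and feeds the
result into a char-slice or a max_tokens kwarg; the ingestion site, for
example, resolves its three caps once before the per-chunk workers
start. An illustrative sketch with stand-in names, not the real service
code:

    # Sketch only: `gen` is the GenerationService, `client_call` stands in
    # for the real _anthropic_contextualise helper.
    def contextualise_chunk(gen, client_call, file_text: str, chunk_text: str) -> str:
        doc_chars = gen.cap("context_doc_chars", 6000)          # surrounding-file context
        chunk_chars = gen.cap("context_chunk_chars", 800)       # chunk preview in the prompt
        ctx_max_tokens = gen.cap("context_chunk_tokens", 200)   # per-chunk output budget
        return client_call(
            file_text[:doc_chars],
            chunk_text[:chunk_chars],
            max_tokens=ctx_max_tokens,
        )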

Runtime path (premium_mode=False) is unchanged — every cap returns its
original default value, so deployed traffic still hits the safer caps
that work with rate-limited free providers.

CLAUDE.md gains a "Pre-baked Artifact Cache" section documenting the
prebake flow, the premium tier, and the cap-override mechanism.

.gitignore gains a .claude/ entry so the local scheduled_tasks.lock and
any other agent-local files don't get tracked.

.gitignore CHANGED
@@ -42,3 +42,4 @@ LEARN.md
  /*.png
  .playwright-mcp/
  posthog-setup-report.md
+ .claude/
CLAUDE.md CHANGED
@@ -102,6 +102,32 @@ Every `OpenAI(...)` client instantiation MUST have `timeout=30` (or use `_TIMEOU
  A client without a timeout will hang indefinitely on a slow/unresponsive provider — verified incident with Gemma 4.
  OpenRouter uses its own helper `_openrouter_client()` which already sets `timeout=45`.

+ ## Pre-baked Artifact Cache
+
+ Tour, diagram (architecture/class), README, and repo-map outputs are persisted in a Qdrant sidecar collection (`<collection>_artifacts`) so they survive container restarts and are shared across all users. Reads go through `QdrantStore.load_artifact(repo, kind)`; writes through `save_artifact(repo, kind, data, generated_by_model)`.
+
+ A canonical set of repos can be pre-generated at premium quality with:
+
+ ```bash
+ .venv/bin/python -m scripts.prebake_repos                    # default Karpathy set
+ .venv/bin/python -m scripts.prebake_repos owner1/repo1 ...   # specific repos
+ .venv/bin/python -m scripts.prebake_repos --force ...        # rebuild
+ ```
+
+ The script flips `gen.premium_mode = True` for the entire run, which:
+ - Routes every `gen.generate(...)` call to the Claude Sonnet 4.6 client (`ANTHROPIC_API_KEY` required).
+ - Activates `PREMIUM_CAPS` overrides in `GenerationService` — every `gen.cap(name, default)` call returns the larger premium value (longer ReAct rounds, fuller chunk previews in contextual retrieval, larger README budget, etc.).
+
+ Runtime requests from the deployed app keep the original (smaller) caps so free-tier providers don't drown.
+
+ To inspect what's been baked for a repo:
+
+ ```bash
+ curl https://<host>/repos/<owner>/<name>/artifacts/info
+ ```
+
+ returns `kind / generated_by_model / generated_at` per cached artifact. HF Spaces logs also print `[cache hit] kind for repo (model)` on every served artifact.
+
  ## Runtime Data — always gitignore, never commit

  Directories written at runtime must be in `.gitignore`. Check before first commit of any new feature:
backend/services/generation.py CHANGED
@@ -280,6 +280,49 @@ class GenerationService:
          # through every service layer.
          self.premium_mode = False

+     # ── Premium-mode cap overrides ─────────────────────────────────────────
+     # The artifact-generation pipeline has many small caps — max_tokens for
+     # ReAct rounds, character limits on chunks shown to the model, README
+     # length budgets, etc. They were tuned for free-tier providers (Cerebras
+     # 8B, Gemini Flash) which have shorter context windows and verbose
+     # reasoning patterns. Applied to a Sonnet 4.6 prebake, those defaults
+     # leave quality on the table — long classes get truncated, agent
+     # reasoning is cut mid-thought, READMEs run short.
+     #
+     # When premium_mode is True, cap() returns the entry from PREMIUM_CAPS
+     # below if one exists; otherwise the caller's default. Per-callsite
+     # invocation looks like:
+     #     max_tokens = self._gen.cap("react_round_tokens", 700)
+     # Free-tier runtime paths (premium_mode=False) get the default unchanged.
+     PREMIUM_CAPS = {
+         # ── Tour generation (TourAgent) ──
+         "react_round_tokens": 1500,      # was 700 — let Sonnet reason fully per round
+         "react_done_tokens": 1200,       # was 600 — richer forced-DONE summaries
+         "phase_map_tokens": 2048,        # was 1024 — more concept candidates surfaced
+         "concept_desc_tokens": 1800,     # was 900 — deeper per-concept descriptions
+         "tour_synthesis_tokens": 16384,  # was 8192 — full Sonnet output budget
+         "tool_result_tokens": 800,       # was 400 — give the agent more of each tool result
+         "phase3_code_chars": 6000,       # was 3000 — include more code per concept
+         # ── Diagrams ──
+         "diagram_tokens": 4096,          # was 2048 — JSON output room for richer node lists
+         # ── README ──
+         "readme_tokens": 4096,           # was 1800 — comprehensive README budget
+         # ── Contextual retrieval (ingestion) ──
+         "context_chunk_tokens": 400,     # was 200 — longer contextual descriptions
+         "context_chunk_chars": 2000,     # was 800 — model sees more of each chunk
+         "context_doc_chars": 12000,      # was 6000 — model sees more surrounding file
+     }
+
+     def cap(self, name: str, default: int) -> int:
+         """Resolve a tunable cap. In premium_mode, returns the override from
+         PREMIUM_CAPS if present; otherwise returns the caller's default.
+         Lets every cap site stay readable as a single line:
+             max_tokens=self._gen.cap("react_round_tokens", 700)
+         """
+         if self.premium_mode and name in self.PREMIUM_CAPS:
+             return self.PREMIUM_CAPS[name]
+         return default
+
      def _init_premium(self) -> None:
          """Initialise the optional premium client (Claude Sonnet 4.6) used
          for one-time generation of cached artifacts (tour, README, diagrams,
backend/services/ingestion_service.py CHANGED
@@ -320,7 +320,8 @@ def _chunk_importance(c: dict) -> int:


  def _anthropic_contextualise(
-     client, model: str, system: str, doc_text: str, chunk_question: str
+     client, model: str, system: str, doc_text: str, chunk_question: str,
+     max_tokens: int = 200,
  ) -> str:
      """
      Call Anthropic with prompt caching on the document block.
@@ -341,7 +342,7 @@ _anthropic_contextualise(
      """
      resp = client.messages.create(
          model=model,
-         max_tokens=200,
+         max_tokens=max_tokens,
          system=system,
          messages=[{
              "role": "user",
@@ -430,18 +431,25 @@ def _add_context(
      # are 200-6000 tokens, so most will qualify.
      _use_anthropic_cache = getattr(gen, 'provider', None) == 'anthropic'

+     # Tunable caps — bumped automatically when gen.premium_mode is on so a
+     # premium prebake includes more of each chunk and surrounding file context
+     # than the free-tier defaults allow.
+     _doc_chars = gen.cap("context_doc_chars", 6000)
+     _chunk_chars = gen.cap("context_chunk_chars", 800)
+     _ctx_max_tokens = gen.cap("context_chunk_tokens", 200)
+
      # Worker function for a single chunk — called from multiple threads.
      # Returns (idx, updated_chunk) or (idx, None) on failure.
      def _enrich_one(idx: int, chunk: dict) -> tuple[int, dict | None]:
          filepath = chunk.get("filepath", "")
          chunk_text = chunk.get("text", "")
-         doc_text = file_content_map.get(filepath, "")[:6000]
+         doc_text = file_content_map.get(filepath, "")[:_doc_chars]
          if not chunk_text or not doc_text:
              return idx, None

          chunk_question = (
              f"Here is the chunk we want to situate within the document above:\n"
-             f"<chunk>\n{chunk_text[:800]}\n</chunk>\n\n"
+             f"<chunk>\n{chunk_text[:_chunk_chars]}\n</chunk>\n\n"
              "Please give a short succinct context to situate this chunk within the overall "
              "document for the purpose of improving search retrieval of the chunk. "
              "Name the function/class/block, its role in the file's pipeline, and the key "
@@ -459,7 +467,7 @@ def _add_context(
              # processing into O(N_files) full-cost calls.
              sentence = _anthropic_contextualise(
                  gen._client, gen._model, _CONTEXT_SYSTEM,
-                 doc_text, chunk_question,
+                 doc_text, chunk_question, max_tokens=_ctx_max_tokens,
              )
          else:
              prompt = (
backend/services/readme_service.py CHANGED
@@ -183,7 +183,7 @@ Output ONLY the markdown. No preamble, no "Here is the README", no trailing comm
                  system=system,
                  prompt=prompt,
                  temperature=0.3,
-                 max_tokens=1800,
+                 max_tokens=self._gen.cap("readme_tokens", 1800),
              )
          except Exception as e:
              yield {"stage": "error", "progress": 1.0, "error": f"Generation failed: {e}"}
backend/services/tour_agent.py CHANGED
@@ -802,7 +802,8 @@ class TourAgent:
          for round_n in range(max_rounds):
              raw = self._gen.generate(
                  self._AGENTIC_MAP_SYSTEM, transcript,
-                 temperature=0.0, max_tokens=700,  # Gemma 4 needs ~700 for verbose THINK+TOOL
+                 temperature=0.0,
+                 max_tokens=self._gen.cap("react_round_tokens", 700),
              )

              # Parse THINK + TOOL or DONE from the LLM's response
@@ -883,7 +884,8 @@
          transcript += "\nROUND LIMIT REACHED. Output DONE: now with what you have found.\n"
          raw = self._gen.generate_non_thinking(
              self._AGENTIC_MAP_SYSTEM, transcript,
-             temperature=0.0, max_tokens=700,
+             temperature=0.0,
+             max_tokens=self._gen.cap("react_done_tokens", 700),
          )
          done_m = _re.search(r'DONE:\s*(\{.+)', raw, _re.DOTALL)
          try:
@@ -999,7 +1001,8 @@ Rules:
    authors considered important enough to document
  """
          raw = self._gen.generate(_MAP_SYSTEM, prompt, temperature=0.0,
-                                  json_mode=True, max_tokens=1024)
+                                  json_mode=True,
+                                  max_tokens=self._gen.cap("phase_map_tokens", 1024))
          try:
              result = _parse_json(raw)
              if "pipeline_stages" not in result or not result["pipeline_stages"]:
@@ -1112,7 +1115,8 @@ Rules:
              # investigation from RuntimeError is better than a hallucinated one.
              raw = self._gen.generate_quality(
                  self._AGENTIC_INVESTIGATE_SYSTEM, transcript,
-                 temperature=0.0, max_tokens=700,  # Gemma 4 needs ~700 for verbose THINK+TOOL
+                 temperature=0.0,
+                 max_tokens=self._gen.cap("react_round_tokens", 700),
              )

              # Parse THINK + TOOL or DONE
@@ -1195,7 +1199,8 @@ Rules:
          # to ensure the summary is grounded in actual tool call results.
          raw = self._gen.generate_quality(
              self._AGENTIC_INVESTIGATE_SYSTEM, transcript,
-             temperature=0.0, max_tokens=600,
+             temperature=0.0,
+             max_tokens=self._gen.cap("react_done_tokens", 600),
          )
          done_m = _re.search(r'DONE:\s*(\{.+)', raw, _re.DOTALL)
          try:
@@ -1268,7 +1273,7 @@ Rules:
          # Guard: 12 primary × 700 chars + 8 related × 700 = up to 14 000 chars.
          # Cap at ~3 000 tokens (12 000 chars) so we stay within context budgets
          # on free-tier models with 8K context windows.
-         code_text = _token_budget(code_text, max_tokens=3000)
+         code_text = _token_budget(code_text, max_tokens=self._gen.cap("phase3_code_chars", 3000))

          prompt = f"""Repository: {repo}
  Concept to investigate: {stage_name}
@@ -1314,7 +1319,8 @@ Rules:
          # The agentic loop already required quality — when it falls back to static,
          # the same quality requirement applies.
          raw = self._gen.generate_quality(_INVESTIGATE_SYSTEM, prompt, temperature=0.0,
-                                          json_mode=True, max_tokens=900)
+                                          json_mode=True,
+                                          max_tokens=self._gen.cap("concept_desc_tokens", 900))
          try:
              result = _parse_json(raw)
              result.setdefault("name", stage_name)
@@ -1654,7 +1660,8 @@ Rules:
          # budget). This ensures synthesis always gets Gemini 2.5 Flash or DeepSeek-V3.1,
          # never the Cerebras 8B model that would otherwise receive it after quota is spent.
          raw = self._gen.generate_synthesis(_SYNTHESIZE_SYSTEM, prompt,
-                                            temperature=0.0, json_mode=True, max_tokens=3000)
+                                            temperature=0.0, json_mode=True,
+                                            max_tokens=self._gen.cap("tour_synthesis_tokens", 3000))
          try:
              tour = _parse_json(raw)
          except Exception as e: