Premium-mode cap overrides for richer prebaked artifacts
Token + content caps in the artifact pipeline were tuned for free-tier
providers (Cerebras 8B, Gemini Flash) where context windows are smaller
and verbose reasoning patterns drive cost. Applied to a Sonnet 4.6
prebake, those defaults clip long classes, cut ReAct reasoning short,
and constrain comprehensive READMEs.
GenerationService now exposes a single tunable lookup:
    gen.cap("react_round_tokens", 700)
which returns the caller's default, or the matching entry from
GenerationService.PREMIUM_CAPS when self.premium_mode is True. The
prebake CLI flips premium_mode for the whole run, so every cap site
automatically picks up the larger value without any per-call kwargs.
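
A minimal sketch of the lookup under both modes (assumes a constructed
GenerationService instance `gen`; values mirror the PREMIUM_CAPS table in
the generation.py hunk below):

```python
# Sketch only — `gen` is a GenerationService as built by the app at startup.
gen.premium_mode = False
assert gen.cap("react_round_tokens", 700) == 700    # runtime path: caller's default

gen.premium_mode = True                             # what the prebake CLI flips
assert gen.cap("react_round_tokens", 700) == 1500   # PREMIUM_CAPS override wins
assert gen.cap("some_unlisted_cap", 123) == 123     # no entry: default still used
```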
Wired sites:
- tour_agent.py: ReAct rounds, forced-DONE summaries, Phase 1 mapping,
Phase 3 description, Phase 3 code excerpt, Phase 3 final synthesis
- readme_service.py: README max_tokens
- ingestion_service.py: contextual retrieval — chunk preview chars,
surrounding doc chars, and per-chunk max_tokens (threaded into
_anthropic_contextualise)
Runtime path (premium_mode=False) is unchanged — every cap returns its
original default value, so deployed traffic still hits the safer caps
that work with rate-limited free providers.
CLAUDE.md gains a "Pre-baked Artifact Cache" section documenting the
prebake flow, the premium tier, and the cap-override mechanism.
.gitignore gains .claude/ so the local scheduled_tasks.lock and any
other agent-local files don't get tracked.
- .gitignore +1 -0
- CLAUDE.md +26 -0
- backend/services/generation.py +43 -0
- backend/services/ingestion_service.py +13 -5
- backend/services/readme_service.py +1 -1
- backend/services/tour_agent.py +15 -8
.gitignore

@@ -42,3 +42,4 @@ LEARN.md
 /*.png
 .playwright-mcp/
 posthog-setup-report.md
+.claude/
CLAUDE.md

@@ -102,6 +102,32 @@ Every `OpenAI(...)` client instantiation MUST have `timeout=30` (or use `_TIMEOU
 A client without a timeout will hang indefinitely on a slow/unresponsive provider — verified incident with Gemma 4.
 OpenRouter uses its own helper `_openrouter_client()` which already sets `timeout=45`.
 
+## Pre-baked Artifact Cache
+
+Tour, diagram (architecture/class), README, and repo-map outputs are persisted in a Qdrant sidecar collection (`<collection>_artifacts`) so they survive container restarts and are shared across all users. Reads go through `QdrantStore.load_artifact(repo, kind)`; writes through `save_artifact(repo, kind, data, generated_by_model)`.
+
+A canonical set of repos can be pre-generated at premium quality with:
+
+```bash
+.venv/bin/python -m scripts.prebake_repos                   # default Karpathy set
+.venv/bin/python -m scripts.prebake_repos owner1/repo1 ...  # specific repos
+.venv/bin/python -m scripts.prebake_repos --force ...       # rebuild
+```
+
+The script flips `gen.premium_mode = True` for the entire run, which:
+- Routes every `gen.generate(...)` call to the Claude Sonnet 4.6 client (`ANTHROPIC_API_KEY` required).
+- Activates `PREMIUM_CAPS` overrides in `GenerationService` — every `gen.cap(name, default)` call returns the larger premium value (longer ReAct rounds, fuller chunk previews in contextual retrieval, larger README budget, etc.).
+
+Runtime requests from the deployed app keep the original (smaller) caps so free-tier providers don't drown.
+
+To inspect what's been baked for a repo:
+
+```bash
+curl https://<host>/repos/<owner>/<name>/artifacts/info
+```
+
+returns `kind / generated_by_model / generated_at` per cached artifact. HF Spaces logs also print `[cache hit] kind for repo (model)` on every served artifact.
+
 ## Runtime Data — always gitignore, never commit
 
 Directories written at runtime must be in `.gitignore`. Check before first commit of any new feature:
backend/services/generation.py

@@ -280,6 +280,49 @@ class GenerationService:
         # through every service layer.
         self.premium_mode = False
 
+    # ── Premium-mode cap overrides ─────────────────────────────────────────
+    # The artifact-generation pipeline has many small caps — max_tokens for
+    # ReAct rounds, character limits on chunks shown to the model, README
+    # length budgets, etc. They were tuned for free-tier providers (Cerebras
+    # 8B, Gemini Flash) which have shorter context windows and verbose
+    # reasoning patterns. Applied to a Sonnet 4.6 prebake, those defaults
+    # leave quality on the table — long classes get truncated, agent
+    # reasoning is cut mid-thought, READMEs run short.
+    #
+    # When premium_mode is True, cap() returns the entry from PREMIUM_CAPS
+    # below if one exists; otherwise the caller's default. Per-callsite
+    # invocation looks like:
+    #     max_tokens = self._gen.cap("react_round_tokens", 700)
+    # Free-tier runtime paths (premium_mode=False) get the default unchanged.
+    PREMIUM_CAPS = {
+        # ── Tour generation (TourAgent) ──
+        "react_round_tokens": 1500,      # was 700 — let Sonnet reason fully per round
+        "react_done_tokens": 1200,       # was 600 — richer forced-DONE summaries
+        "phase_map_tokens": 2048,        # was 1024 — more concept candidates surfaced
+        "concept_desc_tokens": 1800,     # was 900 — deeper per-concept descriptions
+        "tour_synthesis_tokens": 16384,  # was 8192 — full Sonnet output budget
+        "tool_result_tokens": 800,       # was 400 — give the agent more of each tool result
+        "phase3_code_chars": 6000,       # was 3000 — include more code per concept
+        # ── Diagrams ──
+        "diagram_tokens": 4096,          # was 2048 — JSON output room for richer node lists
+        # ── README ──
+        "readme_tokens": 4096,           # was 1800 — comprehensive README budget
+        # ── Contextual retrieval (ingestion) ──
+        "context_chunk_tokens": 400,     # was 200 — longer contextual descriptions
+        "context_chunk_chars": 2000,     # was 800 — model sees more of each chunk
+        "context_doc_chars": 12000,      # was 6000 — model sees more surrounding file
+    }
+
+    def cap(self, name: str, default: int) -> int:
+        """Resolve a tunable cap. In premium_mode, returns the override from
+        PREMIUM_CAPS if present; otherwise returns the caller's default.
+        Lets every cap site stay readable as a single line:
+            max_tokens=self._gen.cap("react_round_tokens", 700)
+        """
+        if self.premium_mode and name in self.PREMIUM_CAPS:
+            return self.PREMIUM_CAPS[name]
+        return default
+
     def _init_premium(self) -> None:
         """Initialise the optional premium client (Claude Sonnet 4.6) used
         for one-time generation of cached artifacts (tour, README, diagrams,
backend/services/ingestion_service.py

@@ -320,7 +320,8 @@ def _chunk_importance(c: dict) -> int:
 
 
 def _anthropic_contextualise(
-    client, model: str, system: str, doc_text: str, chunk_question: str
+    client, model: str, system: str, doc_text: str, chunk_question: str,
+    max_tokens: int = 200,
 ) -> str:
     """
     Call Anthropic with prompt caching on the document block.

@@ -341,7 +342,7 @@ def _anthropic_contextualise(
     """
     resp = client.messages.create(
         model=model,
-        max_tokens=200,
+        max_tokens=max_tokens,
         system=system,
         messages=[{
             "role": "user",

@@ -430,18 +431,25 @@ def _add_context(
     # are 200-6000 tokens, so most will qualify.
     _use_anthropic_cache = getattr(gen, 'provider', None) == 'anthropic'
 
+    # Tunable caps — bumped automatically when gen.premium_mode is on so a
+    # premium prebake includes more of each chunk and surrounding file context
+    # than the free-tier defaults allow.
+    _doc_chars = gen.cap("context_doc_chars", 6000)
+    _chunk_chars = gen.cap("context_chunk_chars", 800)
+    _ctx_max_tokens = gen.cap("context_chunk_tokens", 200)
+
     # Worker function for a single chunk — called from multiple threads.
     # Returns (idx, updated_chunk) or (idx, None) on failure.
     def _enrich_one(idx: int, chunk: dict) -> tuple[int, dict | None]:
         filepath = chunk.get("filepath", "")
         chunk_text = chunk.get("text", "")
-        doc_text = file_content_map.get(filepath, "")[:6000]
+        doc_text = file_content_map.get(filepath, "")[:_doc_chars]
         if not chunk_text or not doc_text:
             return idx, None
 
         chunk_question = (
             f"Here is the chunk we want to situate within the document above:\n"
-            f"<chunk>\n{chunk_text[:800]}\n</chunk>\n\n"
+            f"<chunk>\n{chunk_text[:_chunk_chars]}\n</chunk>\n\n"
             "Please give a short succinct context to situate this chunk within the overall "
             "document for the purpose of improving search retrieval of the chunk. "
             "Name the function/class/block, its role in the file's pipeline, and the key "

@@ -459,7 +467,7 @@ def _add_context(
             # processing into O(N_files) full-cost calls.
             sentence = _anthropic_contextualise(
                 gen._client, gen._model, _CONTEXT_SYSTEM,
-                doc_text, chunk_question,
+                doc_text, chunk_question, max_tokens=_ctx_max_tokens,
             )
         else:
             prompt = (
backend/services/readme_service.py

@@ -183,7 +183,7 @@ Output ONLY the markdown. No preamble, no "Here is the README", no trailing comm
                 system=system,
                 prompt=prompt,
                 temperature=0.3,
-                max_tokens=1800,
+                max_tokens=self._gen.cap("readme_tokens", 1800),
             )
         except Exception as e:
             yield {"stage": "error", "progress": 1.0, "error": f"Generation failed: {e}"}
backend/services/tour_agent.py

@@ -802,7 +802,8 @@ class TourAgent:
         for round_n in range(max_rounds):
             raw = self._gen.generate(
                 self._AGENTIC_MAP_SYSTEM, transcript,
-                temperature=0.0,
+                temperature=0.0,
+                max_tokens=self._gen.cap("react_round_tokens", 700),
             )
 
             # Parse THINK + TOOL or DONE from the LLM's response

@@ -883,7 +884,8 @@ class TourAgent:
         transcript += "\nROUND LIMIT REACHED. Output DONE: now with what you have found.\n"
         raw = self._gen.generate_non_thinking(
             self._AGENTIC_MAP_SYSTEM, transcript,
-            temperature=0.0,
+            temperature=0.0,
+            max_tokens=self._gen.cap("react_done_tokens", 700),
         )
         done_m = _re.search(r'DONE:\s*(\{.+)', raw, _re.DOTALL)
         try:

@@ -999,7 +1001,8 @@ Rules:
       authors considered important enough to document
 """
         raw = self._gen.generate(_MAP_SYSTEM, prompt, temperature=0.0,
-                                 json_mode=True)
+                                 json_mode=True,
+                                 max_tokens=self._gen.cap("phase_map_tokens", 1024))
         try:
             result = _parse_json(raw)
             if "pipeline_stages" not in result or not result["pipeline_stages"]:

@@ -1112,7 +1115,8 @@ Rules:
         # investigation from RuntimeError is better than a hallucinated one.
         raw = self._gen.generate_quality(
             self._AGENTIC_INVESTIGATE_SYSTEM, transcript,
-            temperature=0.0,
+            temperature=0.0,
+            max_tokens=self._gen.cap("react_round_tokens", 700),
         )
 
         # Parse THINK + TOOL or DONE

@@ -1195,7 +1199,8 @@ Rules:
         # to ensure the summary is grounded in actual tool call results.
         raw = self._gen.generate_quality(
             self._AGENTIC_INVESTIGATE_SYSTEM, transcript,
-            temperature=0.0,
+            temperature=0.0,
+            max_tokens=self._gen.cap("react_done_tokens", 600),
         )
         done_m = _re.search(r'DONE:\s*(\{.+)', raw, _re.DOTALL)
         try:

@@ -1268,7 +1273,7 @@ Rules:
         # Guard: 12 primary × 700 chars + 8 related × 700 = up to 14 000 chars.
         # Cap at ~3 000 tokens (12 000 chars) so we stay within context budgets
         # on free-tier models with 8K context windows.
-        code_text = _token_budget(code_text, max_tokens=3000)
+        code_text = _token_budget(code_text, max_tokens=self._gen.cap("phase3_code_chars", 3000))
 
         prompt = f"""Repository: {repo}
Concept to investigate: {stage_name}

@@ -1314,7 +1319,8 @@ Rules:
         # The agentic loop already required quality — when it falls back to static,
         # the same quality requirement applies.
         raw = self._gen.generate_quality(_INVESTIGATE_SYSTEM, prompt, temperature=0.0,
-                                         json_mode=True)
+                                         json_mode=True,
+                                         max_tokens=self._gen.cap("concept_desc_tokens", 900))
         try:
             result = _parse_json(raw)
             result.setdefault("name", stage_name)

@@ -1654,7 +1660,8 @@ Rules:
         # budget). This ensures synthesis always gets Gemini 2.5 Flash or DeepSeek-V3.1,
         # never the Cerebras 8B model that would otherwise receive it after quota is spent.
         raw = self._gen.generate_synthesis(_SYNTHESIZE_SYSTEM, prompt,
-                                           temperature=0.0, json_mode=True)
+                                           temperature=0.0, json_mode=True,
+                                           max_tokens=self._gen.cap("tour_synthesis_tokens", 3000))
         try:
             tour = _parse_json(raw)
         except Exception as e: