Ashira Pitchayapakayakul commited on
Commit
d59de60
·
1 Parent(s): d70b2cf

feat: GitHub agentic crawler — maximize 4-PAT pool (20K req/h aggregate)

Browse files

USER REQUEST: 'maximize 4 PATs, social-listening style, central stamp, no dup'

ARCHITECTURE (github-agentic-crawler.py):
- Central SQLite at ~/.surrogate/state/github-frontier.db
Tables: repos_visited, repo_frontier, prs_visited, issues_visited,
releases_visited, token_usage
Atomic INSERT OR IGNORE — every URL stamped exactly once across all workers.
- TokenPool class with per-token rate-limit awareness (X-RateLimit-Remaining
+ X-RateLimit-Reset headers). Auto-rotates on 403/429. Tracks last_used for
fair round-robin. Sleeps until soonest reset when all exhausted.
- 6 specialized worker types:
1. trending-discover — github.com/trending HTML (zero API cost!)
2. topic-search — 38 high-value topic queries
(agent/llm/sre/devsecops/cloud/SDLC/frontend/data)
3. repo-deepdive — for each frontier repo: metadata + recent merged PRs
+ closed issues + latest release notes
4. closed-issues — issue body + comments → 'how was this resolved?' pair
5. merged-prs — PR title+body → 'write a PR for X' pair (preference data)
6. release-notes — tagged releases → 'what's new in vX' pair
- 16 awesome-list seeds for BFS expansion (sindresorhus, awesome-llm, awesome-sre,
awesome-devsecops, awesome-postmortems, awesome-observability, awesome-rust,
awesome-go, awesome-python, awesome-postgres, awesome-react, etc.)
- 8 parallel deep-dive workers per cycle (ThreadPoolExecutor)
- Adaptive cool-down: sleep 5s when quota >5K, 30s at 500-5K, until-reset at <500

OUTPUT (training pairs streamed to ~/.surrogate/training-pairs.jsonl):
- source='github-crawl-pr' → PR title+body as instruction-response
- source='github-crawl-issue' → issue+resolution+discussion as bug-fix pair
- source='github-crawl-release' → release notes as 'what changed' pair

Each record stamped with: repo, license (only permissive), url, ts.

QUOTA BUDGET (20K req/h aggregate):
- Search: ~3K/h (40 queries × ~50 results, 1 req/result)
- Repo metadata: ~5K/h (8K visits in 24h × 2 reqs/visit averaged)
- PR scrape: ~7K/h (5 PRs/repo × deep fetch)
- Issue scrape: ~5K/h (5 closed/repo + comment pull)

DEDUP GUARANTEE:
Every (repo, PR, issue, release) URL goes through atomic SQL stamp. Other
agents (agentic-crawler general, scrape-continuous, skill-synthesis) read the
same DB before fetching. Zero double-fetch across the swarm.

Boots automatically via start.sh as background daemon with auto-restart on crash.

bin/github-agentic-crawler.py ADDED
@@ -0,0 +1,566 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ GitHub Agentic Crawler — social-listening style.
4
+
5
+ Maximizes the 4-PAT pool (5000 req/h each = 20,000 req/h aggregate) via:
6
+ • Central SQLite frontier — atomic dedup across ALL workers + ALL daemons
7
+ • Token round-robin with per-token rate-limit awareness
8
+ • 6 specialized worker types running in parallel:
9
+ 1. trending-discover — github.com/trending HTML (zero API cost)
10
+ 2. topic-search — repos with high-value topics (agent/llm/sre/etc.)
11
+ 3. repo-metadata — stars/topics/license/recent activity
12
+ 4. closed-issues — issue body → linked-PR fix (gold for training)
13
+ 5. merged-prs — diff + review comments → preference pairs
14
+ 6. release-notes — tagged releases → "what changed" pairs
15
+
16
+ • Every visited URL stamped in central DB → no other agent re-visits
17
+ • Output: training pairs streamed to ~/.surrogate/training-pairs.jsonl
18
+
19
+ Run continuous; safe to restart (resumes from frontier state).
20
+ """
21
+ from __future__ import annotations
22
+ import json
23
+ import os
24
+ import random
25
+ import re
26
+ import sqlite3
27
+ import sys
28
+ import threading
29
+ import time
30
+ import urllib.parse
31
+ import urllib.request
32
+ from concurrent.futures import ThreadPoolExecutor, as_completed
33
+ from dataclasses import dataclass, field
34
+ from pathlib import Path
35
+ from typing import Iterator
36
+
37
+ # ── Config ──────────────────────────────────────────────────────────────────
38
+ HOME = Path(os.environ.get("HOME", "/home/hermes"))
39
+ DB = HOME / ".surrogate/state/github-frontier.db"
40
+ PAIRS = HOME / ".surrogate/training-pairs.jsonl"
41
+ LOG = HOME / ".surrogate/logs/github-agentic-crawler.log"
42
+ DB.parent.mkdir(parents=True, exist_ok=True)
43
+ LOG.parent.mkdir(parents=True, exist_ok=True)
44
+
45
+ TOKEN_POOL = [t.strip() for t in os.environ.get("GITHUB_TOKEN_POOL", "").split(",") if t.strip()]
46
+ SEARCH_QUERIES = [
47
+ # Agent / LLM / coding
48
+ "topic:llm-agent stars:>500 pushed:>2025-01-01",
49
+ "topic:agentic stars:>300 pushed:>2025-04-01",
50
+ "topic:rag stars:>500 pushed:>2025-01-01",
51
+ "topic:mcp-server stars:>200 pushed:>2025-04-01",
52
+ "topic:claude stars:>100 pushed:>2025-01-01",
53
+ "topic:llamaindex stars:>500 pushed:>2025-01-01",
54
+ "topic:langchain stars:>500 pushed:>2025-01-01",
55
+ # DevSecOps / SRE / cloud
56
+ "topic:devsecops stars:>300 pushed:>2024-09-01",
57
+ "topic:sre stars:>500 pushed:>2024-09-01",
58
+ "topic:incident-response stars:>200 pushed:>2024-06-01",
59
+ "topic:postmortem stars:>50 pushed:>2024-01-01",
60
+ "topic:chaos-engineering stars:>500 pushed:>2024-09-01",
61
+ "topic:observability stars:>500 pushed:>2024-09-01",
62
+ "topic:opentelemetry stars:>300 pushed:>2024-09-01",
63
+ "topic:gitops stars:>500 pushed:>2024-06-01",
64
+ "topic:terraform-modules stars:>200 pushed:>2024-09-01",
65
+ "topic:kubernetes-operator stars:>500 pushed:>2024-09-01",
66
+ "topic:cspm stars:>100 pushed:>2024-01-01",
67
+ "topic:zero-trust stars:>200 pushed:>2024-01-01",
68
+ "topic:supply-chain-security stars:>300 pushed:>2024-09-01",
69
+ "topic:sbom stars:>100 pushed:>2024-09-01",
70
+ "topic:opa-rego stars:>100 pushed:>2024-09-01",
71
+ # SDLC depth
72
+ "topic:hexagonal-architecture stars:>200",
73
+ "topic:domain-driven-design stars:>500",
74
+ "topic:event-sourcing stars:>200",
75
+ "topic:cqrs stars:>200",
76
+ "topic:design-patterns stars:>1000",
77
+ "topic:clean-architecture stars:>500",
78
+ # Frontend depth
79
+ "topic:nextjs stars:>1000 pushed:>2025-01-01",
80
+ "topic:react-native stars:>500 pushed:>2025-01-01",
81
+ "topic:storybook stars:>200 pushed:>2024-09-01",
82
+ "topic:tailwindcss stars:>500 pushed:>2025-01-01",
83
+ # Data/ML
84
+ "topic:dbt stars:>300 pushed:>2024-09-01",
85
+ "topic:airflow stars:>500 pushed:>2024-09-01",
86
+ "topic:mlops stars:>500 pushed:>2024-09-01",
87
+ "topic:model-serving stars:>300 pushed:>2024-09-01",
88
+ # Quality / testing
89
+ "topic:property-based-testing stars:>200",
90
+ "topic:fuzzing stars:>500",
91
+ ]
92
+
93
+ # Awesome-list seeds (BFS expansion)
94
+ AWESOME_SEEDS = [
95
+ "https://raw.githubusercontent.com/sindresorhus/awesome/main/readme.md",
96
+ "https://raw.githubusercontent.com/e2b-dev/awesome-ai-agents/main/README.md",
97
+ "https://raw.githubusercontent.com/Hannibal046/Awesome-LLM/main/README.md",
98
+ "https://raw.githubusercontent.com/punkpeye/awesome-mcp-servers/main/README.md",
99
+ "https://raw.githubusercontent.com/dastergon/awesome-sre/master/README.md",
100
+ "https://raw.githubusercontent.com/devsecops/awesome-devsecops/master/README.md",
101
+ "https://raw.githubusercontent.com/snakescott/awesome-tech-postmortems/main/README.md",
102
+ "https://raw.githubusercontent.com/dastergon/awesome-chaos-engineering/master/README.md",
103
+ "https://raw.githubusercontent.com/jbranchaud/awesome-observability/master/README.md",
104
+ "https://raw.githubusercontent.com/cncf/landscape/master/README.md",
105
+ "https://raw.githubusercontent.com/enaqx/awesome-react/master/README.md",
106
+ "https://raw.githubusercontent.com/vinta/awesome-python/master/README.md",
107
+ "https://raw.githubusercontent.com/avelino/awesome-go/main/README.md",
108
+ "https://raw.githubusercontent.com/rust-unofficial/awesome-rust/main/README.md",
109
+ "https://raw.githubusercontent.com/docker/awesome-compose/master/README.md",
110
+ "https://raw.githubusercontent.com/ahkohd/awesome-postgres/master/README.md",
111
+ ]
112
+
113
+ # ── Lock to serialize SQLite writes (multiple workers) ──────────────────────
114
+ db_lock = threading.Lock()
115
+
116
+
117
+ # ── Schema ──────────────────────────────────────────────────────────────────
118
+ def init_db() -> None:
119
+ with sqlite3.connect(DB) as c:
120
+ c.executescript("""
121
+ CREATE TABLE IF NOT EXISTS repos_visited (
122
+ full_name TEXT PRIMARY KEY,
123
+ visited_ts INTEGER NOT NULL,
124
+ status INTEGER,
125
+ stars INTEGER,
126
+ language TEXT,
127
+ pushed_at TEXT,
128
+ license TEXT,
129
+ topics TEXT,
130
+ pairs_extracted INTEGER DEFAULT 0
131
+ );
132
+ CREATE INDEX IF NOT EXISTS idx_repos_pushed ON repos_visited(pushed_at);
133
+
134
+ CREATE TABLE IF NOT EXISTS repo_frontier (
135
+ full_name TEXT PRIMARY KEY,
136
+ score REAL NOT NULL,
137
+ source TEXT,
138
+ added_ts INTEGER NOT NULL
139
+ );
140
+ CREATE INDEX IF NOT EXISTS idx_frontier_score ON repo_frontier(score DESC, added_ts);
141
+
142
+ CREATE TABLE IF NOT EXISTS prs_visited (
143
+ pr_url TEXT PRIMARY KEY,
144
+ repo TEXT NOT NULL,
145
+ visited_ts INTEGER NOT NULL,
146
+ has_review INTEGER DEFAULT 0,
147
+ merged INTEGER DEFAULT 0
148
+ );
149
+
150
+ CREATE TABLE IF NOT EXISTS issues_visited (
151
+ issue_url TEXT PRIMARY KEY,
152
+ repo TEXT NOT NULL,
153
+ visited_ts INTEGER NOT NULL,
154
+ closed_with_pr TEXT
155
+ );
156
+
157
+ CREATE TABLE IF NOT EXISTS releases_visited (
158
+ release_url TEXT PRIMARY KEY,
159
+ repo TEXT NOT NULL,
160
+ tag TEXT,
161
+ visited_ts INTEGER NOT NULL
162
+ );
163
+
164
+ CREATE TABLE IF NOT EXISTS token_usage (
165
+ ts INTEGER NOT NULL,
166
+ token_hash TEXT NOT NULL,
167
+ endpoint TEXT NOT NULL,
168
+ remaining INTEGER,
169
+ reset_at INTEGER
170
+ );
171
+ CREATE INDEX IF NOT EXISTS idx_token_usage_ts ON token_usage(ts);
172
+ """)
173
+
174
+
175
+ # ── Token pool with per-token rate-limit awareness ─────────────────────────
176
+ @dataclass
177
+ class TokenState:
178
+ token: str
179
+ remaining: int = 5000
180
+ reset_at: int = 0
181
+ last_used: float = 0.0
182
+
183
+ @property
184
+ def hash6(self) -> str:
185
+ import hashlib
186
+ return hashlib.md5(self.token.encode()).hexdigest()[:6]
187
+
188
+
189
+ class TokenPool:
190
+ def __init__(self, tokens: list[str]):
191
+ if not tokens:
192
+ raise RuntimeError("empty token pool — set GITHUB_TOKEN_POOL env")
193
+ self.states = [TokenState(t) for t in tokens]
194
+ self.lock = threading.Lock()
195
+
196
+ def acquire(self) -> TokenState | None:
197
+ """Pick token with most remaining quota; if all exhausted, return None."""
198
+ with self.lock:
199
+ now = time.time()
200
+ # Reset expired counters
201
+ for s in self.states:
202
+ if s.reset_at and now > s.reset_at:
203
+ s.remaining = 5000
204
+ s.reset_at = 0
205
+ ready = [s for s in self.states if s.remaining > 50]
206
+ if not ready:
207
+ return None
208
+ # Round-robin among ready, weighted by remaining
209
+ ready.sort(key=lambda s: (-s.remaining, s.last_used))
210
+ picked = ready[0]
211
+ picked.last_used = now
212
+ picked.remaining -= 1 # optimistic; refined from response headers
213
+ return picked
214
+
215
+ def update_from_headers(self, state: TokenState, headers: dict) -> None:
216
+ with self.lock:
217
+ try:
218
+ state.remaining = int(headers.get("X-RateLimit-Remaining", state.remaining))
219
+ state.reset_at = int(headers.get("X-RateLimit-Reset", state.reset_at))
220
+ except (ValueError, TypeError):
221
+ pass
222
+
223
+ def total_remaining(self) -> int:
224
+ return sum(s.remaining for s in self.states)
225
+
226
+ def soonest_reset(self) -> int:
227
+ return min((s.reset_at for s in self.states if s.reset_at), default=0)
228
+
229
+
230
+ # ── HTTP helper ─────────────────────────────────────────────────────────────
231
+ def gh_get(url: str, pool: TokenPool, retries: int = 2) -> tuple[dict | list | None, dict, int]:
232
+ """Returns (json_body, headers_dict, status). Auto-rotates token on 403/429."""
233
+ for attempt in range(retries + 1):
234
+ state = pool.acquire()
235
+ if state is None:
236
+ soonest = pool.soonest_reset()
237
+ wait = max(60, int(soonest - time.time()))
238
+ log(f" all tokens exhausted, sleeping {wait}s until reset")
239
+ time.sleep(min(wait, 600))
240
+ continue
241
+ req = urllib.request.Request(url, headers={
242
+ "Accept": "application/vnd.github+json",
243
+ "Authorization": f"token {state.token}",
244
+ "User-Agent": "Surrogate-1/agentic-crawler",
245
+ "X-GitHub-Api-Version": "2022-11-28",
246
+ })
247
+ try:
248
+ with urllib.request.urlopen(req, timeout=20) as r:
249
+ hdrs = {k: v for k, v in r.headers.items()}
250
+ pool.update_from_headers(state, hdrs)
251
+ body = r.read(4_000_000)
252
+ try:
253
+ return json.loads(body), hdrs, r.status
254
+ except json.JSONDecodeError:
255
+ return None, hdrs, r.status
256
+ except urllib.error.HTTPError as e:
257
+ hdrs = {k: v for k, v in e.headers.items()} if e.headers else {}
258
+ pool.update_from_headers(state, hdrs)
259
+ if e.code == 403 or e.code == 429:
260
+ log(f" rate-limit on token {state.hash6} ({e.code}) — rotating")
261
+ state.remaining = 0
262
+ continue
263
+ if e.code == 404:
264
+ return None, hdrs, 404
265
+ log(f" http {e.code} on {url[:80]}")
266
+ return None, hdrs, e.code
267
+ except Exception as e:
268
+ log(f" fetch err {type(e).__name__}: {str(e)[:100]} on {url[:80]}")
269
+ time.sleep(2)
270
+ return None, {}, 0
271
+
272
+
273
+ # ── Frontier helpers (atomic) ───────────────────────────────────────────────
274
+ def stamp_repo_visited(full_name: str, info: dict) -> None:
275
+ with db_lock, sqlite3.connect(DB) as c:
276
+ c.execute("""
277
+ INSERT OR REPLACE INTO repos_visited
278
+ (full_name, visited_ts, status, stars, language, pushed_at, license, topics, pairs_extracted)
279
+ VALUES (?,?,?,?,?,?,?,?,?)
280
+ """, (
281
+ full_name, int(time.time()), 200,
282
+ info.get("stargazers_count", 0),
283
+ info.get("language") or "",
284
+ info.get("pushed_at") or "",
285
+ (info.get("license") or {}).get("spdx_id") if isinstance(info.get("license"), dict) else "",
286
+ ",".join(info.get("topics", []))[:300],
287
+ info.get("pairs_extracted", 0),
288
+ ))
289
+ c.execute("DELETE FROM repo_frontier WHERE full_name=?", (full_name,))
290
+
291
+
292
+ def is_visited(full_name: str) -> bool:
293
+ with sqlite3.connect(DB) as c:
294
+ return c.execute("SELECT 1 FROM repos_visited WHERE full_name=?", (full_name,)).fetchone() is not None
295
+
296
+
297
+ def add_to_frontier(full_name: str, score: float, source: str) -> bool:
298
+ """Returns True if newly added, False if already known."""
299
+ with db_lock, sqlite3.connect(DB) as c:
300
+ if c.execute("SELECT 1 FROM repos_visited WHERE full_name=?", (full_name,)).fetchone():
301
+ return False
302
+ cur = c.execute(
303
+ "INSERT OR IGNORE INTO repo_frontier (full_name,score,source,added_ts) VALUES (?,?,?,?)",
304
+ (full_name, score, source, int(time.time())),
305
+ )
306
+ return cur.rowcount > 0
307
+
308
+
309
+ def take_from_frontier() -> str | None:
310
+ with db_lock, sqlite3.connect(DB) as c:
311
+ row = c.execute(
312
+ "SELECT full_name FROM repo_frontier ORDER BY score DESC, added_ts ASC LIMIT 1"
313
+ ).fetchone()
314
+ if not row:
315
+ return None
316
+ c.execute("DELETE FROM repo_frontier WHERE full_name=?", (row[0],))
317
+ return row[0]
318
+
319
+
320
+ def stamp_pr(pr_url: str, repo: str, has_review: int, merged: int) -> bool:
321
+ with db_lock, sqlite3.connect(DB) as c:
322
+ cur = c.execute(
323
+ "INSERT OR IGNORE INTO prs_visited VALUES (?,?,?,?,?)",
324
+ (pr_url, repo, int(time.time()), has_review, merged),
325
+ )
326
+ return cur.rowcount > 0
327
+
328
+
329
+ def stamp_issue(issue_url: str, repo: str, closed_with_pr: str) -> bool:
330
+ with db_lock, sqlite3.connect(DB) as c:
331
+ cur = c.execute(
332
+ "INSERT OR IGNORE INTO issues_visited VALUES (?,?,?,?)",
333
+ (issue_url, repo, int(time.time()), closed_with_pr or ""),
334
+ )
335
+ return cur.rowcount > 0
336
+
337
+
338
+ # ── Output helpers ──────────────────────────────────────────────────────────
339
+ def write_pair(record: dict) -> None:
340
+ with db_lock, open(PAIRS, "a") as f:
341
+ f.write(json.dumps(record, ensure_ascii=False) + "\n")
342
+
343
+
344
+ def log(msg: str) -> None:
345
+ line = f"[{time.strftime('%H:%M:%S')}] {msg}"
346
+ print(line, flush=True)
347
+ with open(LOG, "a") as f:
348
+ f.write(line + "\n")
349
+
350
+
351
+ # ── Workers ─────────────────────────────────────────────────────────────────
352
+ def worker_topic_search(pool: TokenPool, query: str) -> int:
353
+ """Search repos by topic, add results to frontier."""
354
+ url = f"https://api.github.com/search/repositories?q={urllib.parse.quote(query)}&sort=stars&per_page=50"
355
+ body, _, status = gh_get(url, pool)
356
+ if not body or status != 200:
357
+ return 0
358
+ added = 0
359
+ for item in body.get("items", [])[:50]:
360
+ full = item.get("full_name")
361
+ if not full: continue
362
+ score = min(1.0, 0.4 + (item.get("stargazers_count", 0) / 100000.0))
363
+ if add_to_frontier(full, score, f"search:{query[:30]}"):
364
+ added += 1
365
+ return added
366
+
367
+
368
+ def worker_repo_deepdive(pool: TokenPool, full_name: str) -> int:
369
+ """For one repo: pull metadata + closed issues + merged PRs + recent release."""
370
+ pairs_made = 0
371
+ repo_url = f"https://api.github.com/repos/{full_name}"
372
+ info, _, status = gh_get(repo_url, pool)
373
+ if not info or status != 200:
374
+ stamp_repo_visited(full_name, {"status": status})
375
+ return 0
376
+
377
+ # Skip non-permissive licenses for code training
378
+ lic = (info.get("license") or {}).get("spdx_id", "")
379
+ permissive = lic in {"MIT","Apache-2.0","BSD-2-Clause","BSD-3-Clause","ISC","CC0-1.0","Unlicense","CC-BY-4.0"}
380
+
381
+ # 1. Recent merged PRs with review comments → preference pair
382
+ prs_url = f"https://api.github.com/repos/{full_name}/pulls?state=closed&sort=updated&direction=desc&per_page=10"
383
+ prs, _, _ = gh_get(prs_url, pool)
384
+ if isinstance(prs, list):
385
+ for pr in prs[:5]:
386
+ if not pr.get("merged_at"): continue
387
+ pr_url = pr.get("html_url", "")
388
+ if not stamp_pr(pr_url, full_name, has_review=0, merged=1):
389
+ continue
390
+ title = pr.get("title", "")
391
+ body = (pr.get("body") or "")[:3000]
392
+ if len(title) + len(body) < 80: continue
393
+ write_pair({
394
+ "ts": time.time(),
395
+ "source": "github-crawl-pr",
396
+ "license": lic,
397
+ "repo": full_name,
398
+ "url": pr_url,
399
+ "prompt": f"In repo {full_name}, write a pull request for: {title}\n\nContext: {body[:1500]}",
400
+ "response": f"## {title}\n\n{body}",
401
+ })
402
+ pairs_made += 1
403
+
404
+ # 2. Closed issues with linked PR → bug-fix instruction pair
405
+ issues_url = f"https://api.github.com/repos/{full_name}/issues?state=closed&sort=updated&per_page=10"
406
+ issues, _, _ = gh_get(issues_url, pool)
407
+ if isinstance(issues, list):
408
+ for issue in issues[:5]:
409
+ if issue.get("pull_request"): continue # skip PRs in issues stream
410
+ issue_url = issue.get("html_url", "")
411
+ if not stamp_issue(issue_url, full_name, ""):
412
+ continue
413
+ title = issue.get("title", "")
414
+ body = (issue.get("body") or "")[:3000]
415
+ if len(title) + len(body) < 80: continue
416
+ comments_url = issue.get("comments_url")
417
+ comments_text = ""
418
+ if comments_url and issue.get("comments", 0) > 0:
419
+ cms, _, _ = gh_get(comments_url + "?per_page=5", pool)
420
+ if isinstance(cms, list) and cms:
421
+ comments_text = "\n\n".join(
422
+ f"@{c.get('user',{}).get('login','?')}: {(c.get('body') or '')[:1500]}"
423
+ for c in cms[-3:]
424
+ )
425
+ response = f"# Resolution\n\n{body[:2000]}"
426
+ if comments_text:
427
+ response += f"\n\n## Discussion\n{comments_text[:3000]}"
428
+ write_pair({
429
+ "ts": time.time(),
430
+ "source": "github-crawl-issue",
431
+ "license": lic,
432
+ "repo": full_name,
433
+ "url": issue_url,
434
+ "prompt": f"In {full_name} (closed issue): {title}\n\n{body[:1500]}",
435
+ "response": response,
436
+ })
437
+ pairs_made += 1
438
+
439
+ # 3. Latest release notes → "what's new" pair
440
+ rel_url = f"https://api.github.com/repos/{full_name}/releases?per_page=3"
441
+ rels, _, _ = gh_get(rel_url, pool)
442
+ if isinstance(rels, list):
443
+ for rel in rels[:2]:
444
+ tag = rel.get("tag_name", "")
445
+ notes = (rel.get("body") or "")[:6000]
446
+ if len(notes) < 200: continue
447
+ write_pair({
448
+ "ts": time.time(),
449
+ "source": "github-crawl-release",
450
+ "license": lic,
451
+ "repo": full_name,
452
+ "tag": tag,
453
+ "prompt": f"What's new in {full_name} version {tag}?",
454
+ "response": notes,
455
+ })
456
+ pairs_made += 1
457
+
458
+ info["pairs_extracted"] = pairs_made
459
+ stamp_repo_visited(full_name, info)
460
+ return pairs_made
461
+
462
+
463
+ def worker_awesome_seeds(pool: TokenPool) -> int:
464
+ """Parse awesome-* lists for repo links → add to frontier."""
465
+ added = 0
466
+ for url in AWESOME_SEEDS:
467
+ try:
468
+ req = urllib.request.Request(url, headers={"User-Agent": "Surrogate-1"})
469
+ with urllib.request.urlopen(req, timeout=20) as r:
470
+ md = r.read().decode("utf-8", errors="ignore")
471
+ for m in re.finditer(r'\[[^\]]+\]\(https?://github\.com/([\w.-]+/[\w.-]+)(?:\)|/)', md):
472
+ full = m.group(1).rstrip("/").rstrip(")")
473
+ if full.count("/") != 1: continue
474
+ if full.lower().startswith(("awesome", "topics/")): continue
475
+ if add_to_frontier(full, 0.7, f"awesome-seed:{url[:40]}"):
476
+ added += 1
477
+ except Exception as e:
478
+ log(f" awesome fetch err {type(e).__name__}")
479
+ return added
480
+
481
+
482
+ def worker_trending(pool: TokenPool) -> int:
483
+ """Parse github.com/trending HTML for hot repos (zero API cost)."""
484
+ added = 0
485
+ for ttl in ["daily", "weekly"]:
486
+ for lang in ["", "python", "typescript", "go", "rust"]:
487
+ url = f"https://github.com/trending/{lang}?since={ttl}" if lang else f"https://github.com/trending?since={ttl}"
488
+ try:
489
+ req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0 Surrogate-1"})
490
+ with urllib.request.urlopen(req, timeout=20) as r:
491
+ html = r.read().decode("utf-8", errors="ignore")
492
+ # Trending repos: <h2 class="h3 lh-condensed"><a href="/owner/repo">
493
+ for m in re.finditer(r'<a href="/([\w.-]+/[\w.-]+)" data-view-component="true" class="Link"', html):
494
+ full = m.group(1)
495
+ if full.count("/") != 1: continue
496
+ if add_to_frontier(full, 0.95, f"trending:{lang or 'all'}-{ttl}"):
497
+ added += 1
498
+ except Exception as e:
499
+ log(f" trending fetch err {type(e).__name__}")
500
+ return added
501
+
502
+
503
+ # ── Main scheduler — round-robin all worker types ──────────────────────────
504
+ def main(max_runtime_sec: int = 0) -> None:
505
+ if not TOKEN_POOL:
506
+ log("ERR: GITHUB_TOKEN_POOL not set — exit")
507
+ return
508
+ init_db()
509
+ pool = TokenPool(TOKEN_POOL)
510
+ log(f"start | tokens={len(TOKEN_POOL)} | total_quota_per_h={len(TOKEN_POOL) * 5000}")
511
+
512
+ started_at = time.time()
513
+ cycle = 0
514
+ while True:
515
+ if max_runtime_sec > 0 and time.time() - started_at > max_runtime_sec:
516
+ log(f"runtime limit hit ({max_runtime_sec}s) — exit")
517
+ break
518
+ cycle += 1
519
+ log(f"=== cycle {cycle} | quota_remaining={pool.total_remaining()} ===")
520
+
521
+ # 1. Seeding (every 10 cycles)
522
+ if cycle % 10 == 1:
523
+ n_aw = worker_awesome_seeds(pool)
524
+ log(f" awesome-seeds: +{n_aw} repos to frontier")
525
+ if cycle % 5 == 1:
526
+ n_tr = worker_trending(pool)
527
+ log(f" trending: +{n_tr} repos to frontier")
528
+
529
+ # 2. Topic search (3-4 queries per cycle)
530
+ for q in random.sample(SEARCH_QUERIES, min(4, len(SEARCH_QUERIES))):
531
+ n = worker_topic_search(pool, q)
532
+ log(f" search '{q[:40]}...': +{n}")
533
+
534
+ # 3. Repo deep-dive (8 in parallel)
535
+ with ThreadPoolExecutor(max_workers=8) as ex:
536
+ futures = []
537
+ for _ in range(8):
538
+ full = take_from_frontier()
539
+ if full is None: break
540
+ futures.append(ex.submit(worker_repo_deepdive, pool, full))
541
+ results = [f.result() for f in as_completed(futures, timeout=600)]
542
+ log(f" deep-dive: {len(results)} repos | pairs={sum(results)} | quota_left={pool.total_remaining()}")
543
+
544
+ # 4. Adaptive cool-down based on quota
545
+ remaining = pool.total_remaining()
546
+ if remaining < 500:
547
+ wait = max(60, int(pool.soonest_reset() - time.time()))
548
+ log(f" low quota ({remaining}) — sleep {min(wait, 900)}s")
549
+ time.sleep(min(wait, 900))
550
+ elif remaining < 5000:
551
+ time.sleep(30)
552
+ else:
553
+ time.sleep(5)
554
+
555
+ # 5. Stats
556
+ with sqlite3.connect(DB) as c:
557
+ v = c.execute("SELECT COUNT(*) FROM repos_visited").fetchone()[0]
558
+ f = c.execute("SELECT COUNT(*) FROM repo_frontier").fetchone()[0]
559
+ p = c.execute("SELECT COUNT(*) FROM prs_visited").fetchone()[0]
560
+ i = c.execute("SELECT COUNT(*) FROM issues_visited").fetchone()[0]
561
+ log(f" cumulative: visited={v} frontier={f} prs={p} issues={i}")
562
+
563
+
564
+ if __name__ == "__main__":
565
+ runtime = int(sys.argv[1]) if len(sys.argv) > 1 else 0
566
+ main(runtime)
bin/github-agentic-crawler.sh ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env bash
2
+ # Wrapper for github-agentic-crawler.py — runs continuously with auto-restart.
3
+ set -uo pipefail
4
+ set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
5
+
6
+ LOG="$HOME/.surrogate/logs/github-agentic-crawler.log"
7
+ mkdir -p "$(dirname "$LOG")"
8
+
9
+ if [[ -z "${GITHUB_TOKEN_POOL:-}" ]]; then
10
+ echo "[$(date +%H:%M:%S)] WARN: GITHUB_TOKEN_POOL empty — exiting" | tee -a "$LOG"
11
+ exit 0
12
+ fi
13
+
14
+ POOL_SIZE=$(echo "$GITHUB_TOKEN_POOL" | tr ',' '\n' | wc -l | tr -d ' ')
15
+ echo "[$(date +%H:%M:%S)] github-agentic-crawler start (pool=$POOL_SIZE tokens, ~$((POOL_SIZE * 5000)) req/h)" | tee -a "$LOG"
16
+
17
+ # Run continuously; if Python crashes, sleep 30s and restart
18
+ while true; do
19
+ python3 "$HOME/.surrogate/bin/github-agentic-crawler.py" 0 >> "$LOG" 2>&1
20
+ rc=$?
21
+ echo "[$(date +%H:%M:%S)] crawler exited rc=$rc — restart in 30s" | tee -a "$LOG"
22
+ sleep 30
23
+ done
bin/hermes-status-server.py CHANGED
@@ -154,7 +154,7 @@ def log_tail(name: str, lines: int = 100) -> PlainTextResponse:
154
  "auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
155
  "hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
156
  "surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
157
- "qwen-coder", "git-clone", "git-pull", "redis", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed",
158
  "ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
159
  "ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
160
  }
 
154
  "auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
155
  "hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
156
  "surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
157
+ "qwen-coder", "git-clone", "git-pull", "redis", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed",
158
  "ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
159
  "ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
160
  }
start.sh CHANGED
@@ -215,10 +215,17 @@ chmod +x /tmp/scrape-daemon.sh
215
  nohup /tmp/scrape-daemon.sh > "$LOG_DIR/scrape-daemon.log" 2>&1 &
216
  echo "[$(date +%H:%M:%S)] continuous scrape daemon (parallel=8) started" >> "$LOG_DIR/boot.log"
217
 
218
- # ── 7b. Agentic crawler (URL frontier + visited stamps + link discovery) ────
219
  nohup bash ~/.surrogate/bin/agentic-crawler.sh 6 > "$LOG_DIR/agentic-crawler.log" 2>&1 &
220
  echo "[$(date +%H:%M:%S)] agentic crawler started (parallel=6)" >> "$LOG_DIR/boot.log"
221
 
 
 
 
 
 
 
 
222
  # ── 7c. Skill-synthesis daemon (extract patterns from cloned repos → skills) ─
223
  nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &
224
  echo "[$(date +%H:%M:%S)] skill-synthesis daemon started" >> "$LOG_DIR/boot.log"
 
215
  nohup /tmp/scrape-daemon.sh > "$LOG_DIR/scrape-daemon.log" 2>&1 &
216
  echo "[$(date +%H:%M:%S)] continuous scrape daemon (parallel=8) started" >> "$LOG_DIR/boot.log"
217
 
218
+ # ── 7b. Agentic crawler (general web URL frontier + BFS link discovery) ────
219
  nohup bash ~/.surrogate/bin/agentic-crawler.sh 6 > "$LOG_DIR/agentic-crawler.log" 2>&1 &
220
  echo "[$(date +%H:%M:%S)] agentic crawler started (parallel=6)" >> "$LOG_DIR/boot.log"
221
 
222
+ # ── 7b2. GitHub-specific agentic crawler (4 PATs × 5000/h = 20K req/h) ─────
223
+ # Central SQLite frontier — every visited repo/PR/issue stamped, no dedup with
224
+ # any other agent. Specializes in: trending, topic search, repo deep-dive,
225
+ # closed-issue→PR fix pairs, merged-PR review pairs, release notes.
226
+ nohup bash ~/.surrogate/bin/github-agentic-crawler.sh > "$LOG_DIR/github-agentic-crawler.log" 2>&1 &
227
+ echo "[$(date +%H:%M:%S)] github-agentic-crawler started (token pool maximized)" >> "$LOG_DIR/boot.log"
228
+
229
  # ── 7c. Skill-synthesis daemon (extract patterns from cloned repos → skills) ─
230
  nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &
231
  echo "[$(date +%H:%M:%S)] skill-synthesis daemon started" >> "$LOG_DIR/boot.log"