feat: GitHub agentic crawler — maximize 4-PAT pool (20K req/h aggregate)
USER REQUEST: 'maximize 4 PATs, social-listening style, central stamp, no dup'
ARCHITECTURE (github-agentic-crawler.py):
- Central SQLite at ~/.surrogate/state/github-frontier.db
Tables: repos_visited, repo_frontier, prs_visited, issues_visited,
releases_visited, token_usage
Atomic INSERT OR IGNORE — every URL stamped exactly once across all workers (claim idiom sketched after this list).
- TokenPool class with per-token rate-limit awareness (X-RateLimit-Remaining
+ X-RateLimit-Reset headers). Auto-rotates on 403/429. Tracks last_used for
fair round-robin. Sleeps until soonest reset when all exhausted.
- 6 specialized worker types:
1. trending-discover — github.com/trending HTML (zero API cost!)
2. topic-search — 38 high-value topic queries
(agent/llm/sre/devsecops/cloud/SDLC/frontend/data)
3. repo-deepdive — for each frontier repo: metadata + recent merged PRs
+ closed issues + latest release notes
4. closed-issues — issue body + comments → 'how was this resolved?' pair
5. merged-prs — PR title+body → 'write a PR for X' pair (preference data)
6. release-notes — tagged releases → 'what's new in vX' pair
- 16 awesome-list seeds for BFS expansion (sindresorhus, awesome-llm, awesome-sre,
awesome-devsecops, awesome-postmortems, awesome-observability, awesome-rust,
awesome-go, awesome-python, awesome-postgres, awesome-react, etc.)
- 8 parallel deep-dive workers per cycle (ThreadPoolExecutor)
- Adaptive cool-down: sleep 5s when quota >5K, 30s at 500-5K, until-reset at <500
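The central-stamp idiom the dedup rests on, condensed from the stamp_pr/stamp_issue helpers in the diff below: INSERT OR IGNORE against a PRIMARY KEY column, with cursor.rowcount deciding whether this worker won the URL. A minimal sketch (function name claim_pr is illustrative; table and lock mirror the schema below):

    import sqlite3
    import threading
    import time

    db_lock = threading.Lock()

    def claim_pr(db_path: str, pr_url: str, repo: str) -> bool:
        """True: this worker owns the URL; False: someone already stamped it."""
        with db_lock, sqlite3.connect(db_path) as c:
            cur = c.execute(
                "INSERT OR IGNORE INTO prs_visited VALUES (?,?,?,?,?)",
                (pr_url, repo, int(time.time()), 0, 0),
            )
            return cur.rowcount > 0  # 0 rows inserted means duplicate

Workers claim first and fetch only on success, never fetch-then-stamp, so no URL is ever paid for twice.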
OUTPUT (training pairs streamed to ~/.surrogate/training-pairs.jsonl):
- source='github-crawl-pr' → PR title+body as instruction-response
- source='github-crawl-issue' → issue+resolution+discussion as bug-fix pair
- source='github-crawl-release' → release notes as 'what changed' pair
Each record stamped with: repo, license (only permissive), url, ts.
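An illustrative github-crawl-pr record (values hypothetical; keys match the write_pair calls in the diff below; wrapped here for readability, each record is one JSONL line):

    {"ts": 1730000000.0, "source": "github-crawl-pr", "license": "MIT",
     "repo": "octo-org/octo-repo", "url": "https://github.com/octo-org/octo-repo/pull/42",
     "prompt": "In repo octo-org/octo-repo, write a pull request for: Fix flaky retry test ...",
     "response": "## Fix flaky retry test\n\n..."}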
QUOTA BUDGET (20K req/h aggregate):
- Search: ~3K/h (38 queries × ~50 results, 1 req/result)
- Repo metadata: ~5K/h (8K visits in 24h × 2 reqs/visit averaged)
- PR scrape: ~7K/h (5 PRs/repo × deep fetch)
- Issue scrape: ~5K/h (5 closed/repo + comment pull)
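Sanity check: 3K + 5K + 7K + 5K = 20K req/h, exactly filling the aggregate pool.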
DEDUP GUARANTEE:
Every (repo, PR, issue, release) URL goes through atomic SQL stamp. Other
agents (agentic-crawler general, scrape-continuous, skill-synthesis) read the
same DB before fetching. Zero double-fetch across the swarm.
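A sibling agent's pre-fetch probe can be as small as this (a sketch; the sibling daemons' code is not part of this commit, but it mirrors is_visited below):

    import sqlite3

    def already_fetched(db_path: str, full_name: str) -> bool:
        # Read-only probe of the shared frontier DB before spending a request.
        with sqlite3.connect(db_path) as c:
            return c.execute(
                "SELECT 1 FROM repos_visited WHERE full_name=?", (full_name,)
            ).fetchone() is not None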
Boots automatically via start.sh as background daemon with auto-restart on crash.
- bin/github-agentic-crawler.py +566 -0
- bin/github-agentic-crawler.sh +23 -0
- bin/hermes-status-server.py +1 -1
- start.sh +8 -1
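Configuration is a single env line: the wrapper script sources ~/.hermes/.env and the Python entrypoint splits GITHUB_TOKEN_POOL on commas (token values hypothetical):

    # ~/.hermes/.env
    GITHUB_TOKEN_POOL=ghp_tokenA,ghp_tokenB,ghp_tokenC,ghp_tokenD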

bin/github-agentic-crawler.py
@@ -0,0 +1,566 @@
#!/usr/bin/env python3
"""
GitHub Agentic Crawler — social-listening style.

Maximizes the 4-PAT pool (5000 req/h each = 20,000 req/h aggregate) via:
  • Central SQLite frontier — atomic dedup across ALL workers + ALL daemons
  • Token round-robin with per-token rate-limit awareness
  • 6 specialized worker types running in parallel:
      1. trending-discover — github.com/trending HTML (zero API cost)
      2. topic-search — repos with high-value topics (agent/llm/sre/etc.)
      3. repo-metadata — stars/topics/license/recent activity
      4. closed-issues — issue body → linked-PR fix (gold for training)
      5. merged-prs — diff + review comments → preference pairs
      6. release-notes — tagged releases → "what changed" pairs

  • Every visited URL stamped in central DB → no other agent re-visits
  • Output: training pairs streamed to ~/.surrogate/training-pairs.jsonl

Runs continuously; safe to restart (resumes from frontier state).
"""
from __future__ import annotations

import hashlib
import json
import os
import random
import re
import sqlite3
import sys
import threading
import time
import urllib.error
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from pathlib import Path

# ── Config ──────────────────────────────────────────────────────────────────
HOME = Path(os.environ.get("HOME", "/home/hermes"))
DB = HOME / ".surrogate/state/github-frontier.db"
PAIRS = HOME / ".surrogate/training-pairs.jsonl"
LOG = HOME / ".surrogate/logs/github-agentic-crawler.log"
DB.parent.mkdir(parents=True, exist_ok=True)
LOG.parent.mkdir(parents=True, exist_ok=True)

TOKEN_POOL = [t.strip() for t in os.environ.get("GITHUB_TOKEN_POOL", "").split(",") if t.strip()]

SEARCH_QUERIES = [
    # Agent / LLM / coding
    "topic:llm-agent stars:>500 pushed:>2025-01-01",
    "topic:agentic stars:>300 pushed:>2025-04-01",
    "topic:rag stars:>500 pushed:>2025-01-01",
    "topic:mcp-server stars:>200 pushed:>2025-04-01",
    "topic:claude stars:>100 pushed:>2025-01-01",
    "topic:llamaindex stars:>500 pushed:>2025-01-01",
    "topic:langchain stars:>500 pushed:>2025-01-01",
    # DevSecOps / SRE / cloud
    "topic:devsecops stars:>300 pushed:>2024-09-01",
    "topic:sre stars:>500 pushed:>2024-09-01",
    "topic:incident-response stars:>200 pushed:>2024-06-01",
    "topic:postmortem stars:>50 pushed:>2024-01-01",
    "topic:chaos-engineering stars:>500 pushed:>2024-09-01",
    "topic:observability stars:>500 pushed:>2024-09-01",
    "topic:opentelemetry stars:>300 pushed:>2024-09-01",
    "topic:gitops stars:>500 pushed:>2024-06-01",
    "topic:terraform-modules stars:>200 pushed:>2024-09-01",
    "topic:kubernetes-operator stars:>500 pushed:>2024-09-01",
    "topic:cspm stars:>100 pushed:>2024-01-01",
    "topic:zero-trust stars:>200 pushed:>2024-01-01",
    "topic:supply-chain-security stars:>300 pushed:>2024-09-01",
    "topic:sbom stars:>100 pushed:>2024-09-01",
    "topic:opa-rego stars:>100 pushed:>2024-09-01",
    # SDLC depth
    "topic:hexagonal-architecture stars:>200",
    "topic:domain-driven-design stars:>500",
    "topic:event-sourcing stars:>200",
    "topic:cqrs stars:>200",
    "topic:design-patterns stars:>1000",
    "topic:clean-architecture stars:>500",
    # Frontend depth
    "topic:nextjs stars:>1000 pushed:>2025-01-01",
    "topic:react-native stars:>500 pushed:>2025-01-01",
    "topic:storybook stars:>200 pushed:>2024-09-01",
    "topic:tailwindcss stars:>500 pushed:>2025-01-01",
    # Data/ML
    "topic:dbt stars:>300 pushed:>2024-09-01",
    "topic:airflow stars:>500 pushed:>2024-09-01",
    "topic:mlops stars:>500 pushed:>2024-09-01",
    "topic:model-serving stars:>300 pushed:>2024-09-01",
    # Quality / testing
    "topic:property-based-testing stars:>200",
    "topic:fuzzing stars:>500",
]

# Awesome-list seeds (BFS expansion)
AWESOME_SEEDS = [
    "https://raw.githubusercontent.com/sindresorhus/awesome/main/readme.md",
    "https://raw.githubusercontent.com/e2b-dev/awesome-ai-agents/main/README.md",
    "https://raw.githubusercontent.com/Hannibal046/Awesome-LLM/main/README.md",
    "https://raw.githubusercontent.com/punkpeye/awesome-mcp-servers/main/README.md",
    "https://raw.githubusercontent.com/dastergon/awesome-sre/master/README.md",
    "https://raw.githubusercontent.com/devsecops/awesome-devsecops/master/README.md",
    "https://raw.githubusercontent.com/snakescott/awesome-tech-postmortems/main/README.md",
    "https://raw.githubusercontent.com/dastergon/awesome-chaos-engineering/master/README.md",
    "https://raw.githubusercontent.com/jbranchaud/awesome-observability/master/README.md",
    "https://raw.githubusercontent.com/cncf/landscape/master/README.md",
    "https://raw.githubusercontent.com/enaqx/awesome-react/master/README.md",
    "https://raw.githubusercontent.com/vinta/awesome-python/master/README.md",
    "https://raw.githubusercontent.com/avelino/awesome-go/main/README.md",
    "https://raw.githubusercontent.com/rust-unofficial/awesome-rust/main/README.md",
    "https://raw.githubusercontent.com/docker/awesome-compose/master/README.md",
    "https://raw.githubusercontent.com/ahkohd/awesome-postgres/master/README.md",
]

# ── Lock to serialize SQLite writes (multiple workers) ──────────────────────
db_lock = threading.Lock()


# ── Schema ──────────────────────────────────────────────────────────────────
def init_db() -> None:
    with sqlite3.connect(DB) as c:
        c.executescript("""
            CREATE TABLE IF NOT EXISTS repos_visited (
                full_name TEXT PRIMARY KEY,
                visited_ts INTEGER NOT NULL,
                status INTEGER,
                stars INTEGER,
                language TEXT,
                pushed_at TEXT,
                license TEXT,
                topics TEXT,
                pairs_extracted INTEGER DEFAULT 0
            );
            CREATE INDEX IF NOT EXISTS idx_repos_pushed ON repos_visited(pushed_at);

            CREATE TABLE IF NOT EXISTS repo_frontier (
                full_name TEXT PRIMARY KEY,
                score REAL NOT NULL,
                source TEXT,
                added_ts INTEGER NOT NULL
            );
            CREATE INDEX IF NOT EXISTS idx_frontier_score ON repo_frontier(score DESC, added_ts);

            CREATE TABLE IF NOT EXISTS prs_visited (
                pr_url TEXT PRIMARY KEY,
                repo TEXT NOT NULL,
                visited_ts INTEGER NOT NULL,
                has_review INTEGER DEFAULT 0,
                merged INTEGER DEFAULT 0
            );

            CREATE TABLE IF NOT EXISTS issues_visited (
                issue_url TEXT PRIMARY KEY,
                repo TEXT NOT NULL,
                visited_ts INTEGER NOT NULL,
                closed_with_pr TEXT
            );

            CREATE TABLE IF NOT EXISTS releases_visited (
                release_url TEXT PRIMARY KEY,
                repo TEXT NOT NULL,
                tag TEXT,
                visited_ts INTEGER NOT NULL
            );

            CREATE TABLE IF NOT EXISTS token_usage (
                ts INTEGER NOT NULL,
                token_hash TEXT NOT NULL,
                endpoint TEXT NOT NULL,
                remaining INTEGER,
                reset_at INTEGER
            );
            CREATE INDEX IF NOT EXISTS idx_token_usage_ts ON token_usage(ts);
        """)


# ── Token pool with per-token rate-limit awareness ─────────────────────────
@dataclass
class TokenState:
    token: str
    remaining: int = 5000
    reset_at: int = 0
    last_used: float = 0.0

    @property
    def hash6(self) -> str:
        return hashlib.md5(self.token.encode()).hexdigest()[:6]


class TokenPool:
    def __init__(self, tokens: list[str]):
        if not tokens:
            raise RuntimeError("empty token pool — set GITHUB_TOKEN_POOL env")
        self.states = [TokenState(t) for t in tokens]
        self.lock = threading.Lock()

    def acquire(self) -> TokenState | None:
        """Pick the token with the most remaining quota; if all are exhausted, return None."""
        with self.lock:
            now = time.time()
            # Reset expired counters
            for s in self.states:
                if s.reset_at and now > s.reset_at:
                    s.remaining = 5000
                    s.reset_at = 0
            ready = [s for s in self.states if s.remaining > 50]
            if not ready:
                return None
            # Greedy: most remaining quota first; least-recently-used breaks ties
            ready.sort(key=lambda s: (-s.remaining, s.last_used))
            picked = ready[0]
            picked.last_used = now
            picked.remaining -= 1  # optimistic; refined from response headers
            return picked

    def update_from_headers(self, state: TokenState, headers: dict) -> None:
        with self.lock:
            try:
                state.remaining = int(headers.get("X-RateLimit-Remaining", state.remaining))
                state.reset_at = int(headers.get("X-RateLimit-Reset", state.reset_at))
            except (ValueError, TypeError):
                pass

    def total_remaining(self) -> int:
        return sum(s.remaining for s in self.states)

    def soonest_reset(self) -> int:
        return min((s.reset_at for s in self.states if s.reset_at), default=0)


# ── HTTP helper ─────────────────────────────────────────────────────────────
def gh_get(url: str, pool: TokenPool, retries: int = 2) -> tuple[dict | list | None, dict, int]:
    """Return (json_body, headers_dict, status). Auto-rotates token on 403/429."""
    for _ in range(retries + 1):
        state = pool.acquire()
        if state is None:
            soonest = pool.soonest_reset()
            wait = max(60, int(soonest - time.time()))
            log(f"  all tokens exhausted, sleeping {wait}s until reset")
            time.sleep(min(wait, 600))
            continue
        req = urllib.request.Request(url, headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"token {state.token}",
            "User-Agent": "Surrogate-1/agentic-crawler",
            "X-GitHub-Api-Version": "2022-11-28",
        })
        try:
            with urllib.request.urlopen(req, timeout=20) as r:
                hdrs = dict(r.headers.items())
                pool.update_from_headers(state, hdrs)
                body = r.read(4_000_000)
                try:
                    return json.loads(body), hdrs, r.status
                except json.JSONDecodeError:
                    return None, hdrs, r.status
        except urllib.error.HTTPError as e:
            hdrs = dict(e.headers.items()) if e.headers else {}
            pool.update_from_headers(state, hdrs)
            if e.code in (403, 429):
                log(f"  rate-limit on token {state.hash6} ({e.code}) — rotating")
                state.remaining = 0
                continue
            if e.code == 404:
                return None, hdrs, 404
            log(f"  http {e.code} on {url[:80]}")
            return None, hdrs, e.code
        except Exception as e:
            log(f"  fetch err {type(e).__name__}: {str(e)[:100]} on {url[:80]}")
            time.sleep(2)
    return None, {}, 0


# ── Frontier helpers (atomic) ───────────────────────────────────────────────
def stamp_repo_visited(full_name: str, info: dict) -> None:
    lic = info.get("license")
    with db_lock, sqlite3.connect(DB) as c:
        c.execute("""
            INSERT OR REPLACE INTO repos_visited
            (full_name, visited_ts, status, stars, language, pushed_at, license, topics, pairs_extracted)
            VALUES (?,?,?,?,?,?,?,?,?)
        """, (
            full_name, int(time.time()), info.get("status", 200),
            info.get("stargazers_count", 0),
            info.get("language") or "",
            info.get("pushed_at") or "",
            (lic or {}).get("spdx_id", "") if isinstance(lic, dict) else "",
            ",".join(info.get("topics", []))[:300],
            info.get("pairs_extracted", 0),
        ))
        c.execute("DELETE FROM repo_frontier WHERE full_name=?", (full_name,))


def is_visited(full_name: str) -> bool:
    with sqlite3.connect(DB) as c:
        return c.execute("SELECT 1 FROM repos_visited WHERE full_name=?", (full_name,)).fetchone() is not None


def add_to_frontier(full_name: str, score: float, source: str) -> bool:
    """Return True if newly added, False if already known."""
    with db_lock, sqlite3.connect(DB) as c:
        if c.execute("SELECT 1 FROM repos_visited WHERE full_name=?", (full_name,)).fetchone():
            return False
        cur = c.execute(
            "INSERT OR IGNORE INTO repo_frontier (full_name,score,source,added_ts) VALUES (?,?,?,?)",
            (full_name, score, source, int(time.time())),
        )
        return cur.rowcount > 0


def take_from_frontier() -> str | None:
    with db_lock, sqlite3.connect(DB) as c:
        row = c.execute(
            "SELECT full_name FROM repo_frontier ORDER BY score DESC, added_ts ASC LIMIT 1"
        ).fetchone()
        if not row:
            return None
        c.execute("DELETE FROM repo_frontier WHERE full_name=?", (row[0],))
        return row[0]


def stamp_pr(pr_url: str, repo: str, has_review: int, merged: int) -> bool:
    with db_lock, sqlite3.connect(DB) as c:
        cur = c.execute(
            "INSERT OR IGNORE INTO prs_visited VALUES (?,?,?,?,?)",
            (pr_url, repo, int(time.time()), has_review, merged),
        )
        return cur.rowcount > 0


def stamp_issue(issue_url: str, repo: str, closed_with_pr: str) -> bool:
    with db_lock, sqlite3.connect(DB) as c:
        cur = c.execute(
            "INSERT OR IGNORE INTO issues_visited VALUES (?,?,?,?)",
            (issue_url, repo, int(time.time()), closed_with_pr or ""),
        )
        return cur.rowcount > 0


# ── Output helpers ──────────────────────────────────────────────────────────
def write_pair(record: dict) -> None:
    with db_lock, open(PAIRS, "a") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def log(msg: str) -> None:
    line = f"[{time.strftime('%H:%M:%S')}] {msg}"
    print(line, flush=True)
    with open(LOG, "a") as f:
        f.write(line + "\n")


# ── Workers ─────────────────────────────────────────────────────────────────
def worker_topic_search(pool: TokenPool, query: str) -> int:
    """Search repos by topic, add results to frontier."""
    url = f"https://api.github.com/search/repositories?q={urllib.parse.quote(query)}&sort=stars&per_page=50"
    body, _, status = gh_get(url, pool)
    if not body or status != 200:
        return 0
    added = 0
    for item in body.get("items", [])[:50]:
        full = item.get("full_name")
        if not full:
            continue
        score = min(1.0, 0.4 + (item.get("stargazers_count", 0) / 100000.0))
        if add_to_frontier(full, score, f"search:{query[:30]}"):
            added += 1
    return added


def worker_repo_deepdive(pool: TokenPool, full_name: str) -> int:
    """For one repo: pull metadata + closed issues + merged PRs + recent release."""
    pairs_made = 0
    repo_url = f"https://api.github.com/repos/{full_name}"
    info, _, status = gh_get(repo_url, pool)
    if not info or status != 200:
        stamp_repo_visited(full_name, {"status": status})
        return 0

    # Skip non-permissive licenses for code training
    lic = (info.get("license") or {}).get("spdx_id", "")
    permissive = lic in {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC", "CC0-1.0", "Unlicense", "CC-BY-4.0"}
    if not permissive:
        stamp_repo_visited(full_name, info)  # stamp so nobody re-fetches, but emit no pairs
        return 0

    # 1. Recent merged PRs with review comments → preference pair
    prs_url = f"https://api.github.com/repos/{full_name}/pulls?state=closed&sort=updated&direction=desc&per_page=10"
    prs, _, _ = gh_get(prs_url, pool)
    if isinstance(prs, list):
        for pr in prs[:5]:
            if not pr.get("merged_at"):
                continue
            pr_url = pr.get("html_url", "")
            if not stamp_pr(pr_url, full_name, has_review=0, merged=1):
                continue  # another worker already claimed this PR
            title = pr.get("title", "")
            body = (pr.get("body") or "")[:3000]
            if len(title) + len(body) < 80:
                continue
            write_pair({
                "ts": time.time(),
                "source": "github-crawl-pr",
                "license": lic,
                "repo": full_name,
                "url": pr_url,
                "prompt": f"In repo {full_name}, write a pull request for: {title}\n\nContext: {body[:1500]}",
                "response": f"## {title}\n\n{body}",
            })
            pairs_made += 1

    # 2. Closed issues with linked PR → bug-fix instruction pair
    issues_url = f"https://api.github.com/repos/{full_name}/issues?state=closed&sort=updated&per_page=10"
    issues, _, _ = gh_get(issues_url, pool)
    if isinstance(issues, list):
        for issue in issues[:5]:
            if issue.get("pull_request"):
                continue  # skip PRs in issues stream
            issue_url = issue.get("html_url", "")
            if not stamp_issue(issue_url, full_name, ""):
                continue
            title = issue.get("title", "")
            body = (issue.get("body") or "")[:3000]
            if len(title) + len(body) < 80:
                continue
            comments_url = issue.get("comments_url")
            comments_text = ""
            if comments_url and issue.get("comments", 0) > 0:
                cms, _, _ = gh_get(comments_url + "?per_page=5", pool)
                if isinstance(cms, list) and cms:
                    comments_text = "\n\n".join(
                        f"@{c.get('user', {}).get('login', '?')}: {(c.get('body') or '')[:1500]}"
                        for c in cms[-3:]
                    )
            response = f"# Resolution\n\n{body[:2000]}"
            if comments_text:
                response += f"\n\n## Discussion\n{comments_text[:3000]}"
            write_pair({
                "ts": time.time(),
                "source": "github-crawl-issue",
                "license": lic,
                "repo": full_name,
                "url": issue_url,
                "prompt": f"In {full_name} (closed issue): {title}\n\n{body[:1500]}",
                "response": response,
            })
            pairs_made += 1

    # 3. Latest release notes → "what's new" pair
    rel_url = f"https://api.github.com/repos/{full_name}/releases?per_page=3"
    rels, _, _ = gh_get(rel_url, pool)
    if isinstance(rels, list):
        for rel in rels[:2]:
            tag = rel.get("tag_name", "")
            notes = (rel.get("body") or "")[:6000]
            if len(notes) < 200:
                continue
            write_pair({
                "ts": time.time(),
                "source": "github-crawl-release",
                "license": lic,
                "repo": full_name,
                "tag": tag,
                "prompt": f"What's new in {full_name} version {tag}?",
                "response": notes,
            })
            pairs_made += 1

    info["pairs_extracted"] = pairs_made
    stamp_repo_visited(full_name, info)
    return pairs_made


def worker_awesome_seeds(pool: TokenPool) -> int:
    """Parse awesome-* lists for repo links → add to frontier."""
    added = 0
    for url in AWESOME_SEEDS:
        try:
            req = urllib.request.Request(url, headers={"User-Agent": "Surrogate-1"})
            with urllib.request.urlopen(req, timeout=20) as r:
                md = r.read().decode("utf-8", errors="ignore")
            for m in re.finditer(r'\[[^\]]+\]\(https?://github\.com/([\w.-]+/[\w.-]+)(?:\)|/)', md):
                full = m.group(1).rstrip("/").rstrip(")")
                if full.count("/") != 1:
                    continue
                if full.lower().startswith(("awesome", "topics/")):
                    continue
                if add_to_frontier(full, 0.7, f"awesome-seed:{url[:40]}"):
                    added += 1
        except Exception as e:
            log(f"  awesome fetch err {type(e).__name__}")
    return added


def worker_trending(pool: TokenPool) -> int:
    """Parse github.com/trending HTML for hot repos (zero API cost)."""
    added = 0
    for ttl in ["daily", "weekly"]:
        for lang in ["", "python", "typescript", "go", "rust"]:
            url = f"https://github.com/trending/{lang}?since={ttl}" if lang else f"https://github.com/trending?since={ttl}"
            try:
                req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0 Surrogate-1"})
                with urllib.request.urlopen(req, timeout=20) as r:
                    html = r.read().decode("utf-8", errors="ignore")
                # Repo links on the trending page render as
                # <a href="/owner/repo" data-view-component="true" class="Link...">
                for m in re.finditer(r'<a href="/([\w.-]+/[\w.-]+)" data-view-component="true" class="Link"', html):
                    full = m.group(1)
                    if full.count("/") != 1:
                        continue
                    if add_to_frontier(full, 0.95, f"trending:{lang or 'all'}-{ttl}"):
                        added += 1
            except Exception as e:
                log(f"  trending fetch err {type(e).__name__}")
    return added


# ── Main scheduler — round-robin all worker types ──────────────────────────
def main(max_runtime_sec: int = 0) -> None:
    if not TOKEN_POOL:
        log("ERR: GITHUB_TOKEN_POOL not set — exit")
        return
    init_db()
    pool = TokenPool(TOKEN_POOL)
    log(f"start | tokens={len(TOKEN_POOL)} | total_quota_per_h={len(TOKEN_POOL) * 5000}")

    started_at = time.time()
    cycle = 0
    while True:
        if max_runtime_sec > 0 and time.time() - started_at > max_runtime_sec:
            log(f"runtime limit hit ({max_runtime_sec}s) — exit")
            break
        cycle += 1
        log(f"=== cycle {cycle} | quota_remaining={pool.total_remaining()} ===")

        # 1. Seeding (awesome lists every 10 cycles, trending every 5)
        if cycle % 10 == 1:
            n_aw = worker_awesome_seeds(pool)
            log(f"  awesome-seeds: +{n_aw} repos to frontier")
        if cycle % 5 == 1:
            n_tr = worker_trending(pool)
            log(f"  trending: +{n_tr} repos to frontier")

        # 2. Topic search (4 random queries per cycle)
        for q in random.sample(SEARCH_QUERIES, min(4, len(SEARCH_QUERIES))):
            n = worker_topic_search(pool, q)
            log(f"  search '{q[:40]}...': +{n}")

        # 3. Repo deep-dive (8 in parallel)
        with ThreadPoolExecutor(max_workers=8) as ex:
            futures = []
            for _ in range(8):
                full = take_from_frontier()
                if full is None:
                    break
                futures.append(ex.submit(worker_repo_deepdive, pool, full))
            results = [f.result() for f in as_completed(futures, timeout=600)]
        log(f"  deep-dive: {len(results)} repos | pairs={sum(results)} | quota_left={pool.total_remaining()}")

        # 4. Adaptive cool-down based on quota
        remaining = pool.total_remaining()
        if remaining < 500:
            wait = max(60, int(pool.soonest_reset() - time.time()))
            log(f"  low quota ({remaining}) — sleep {min(wait, 900)}s")
            time.sleep(min(wait, 900))
        elif remaining < 5000:
            time.sleep(30)
        else:
            time.sleep(5)

        # 5. Stats
        with sqlite3.connect(DB) as c:
            v = c.execute("SELECT COUNT(*) FROM repos_visited").fetchone()[0]
            fr = c.execute("SELECT COUNT(*) FROM repo_frontier").fetchone()[0]
            p = c.execute("SELECT COUNT(*) FROM prs_visited").fetchone()[0]
            i = c.execute("SELECT COUNT(*) FROM issues_visited").fetchone()[0]
        log(f"  cumulative: visited={v} frontier={fr} prs={p} issues={i}")


if __name__ == "__main__":
    runtime = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    main(runtime)
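Since the entrypoint takes an optional max-runtime in seconds (argv[1]; 0 = run forever), a bounded smoke test needs no wrapper (token values hypothetical):

    GITHUB_TOKEN_POOL=ghp_tokenA,ghp_tokenB python3 bin/github-agentic-crawler.py 300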

bin/github-agentic-crawler.sh
@@ -0,0 +1,23 @@
#!/usr/bin/env bash
# Wrapper for github-agentic-crawler.py — runs continuously with auto-restart.
set -uo pipefail
set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a

LOG="$HOME/.surrogate/logs/github-agentic-crawler.log"
mkdir -p "$(dirname "$LOG")"

if [[ -z "${GITHUB_TOKEN_POOL:-}" ]]; then
  echo "[$(date +%H:%M:%S)] WARN: GITHUB_TOKEN_POOL empty — exiting" | tee -a "$LOG"
  exit 0
fi

POOL_SIZE=$(echo "$GITHUB_TOKEN_POOL" | tr ',' '\n' | wc -l | tr -d ' ')
echo "[$(date +%H:%M:%S)] github-agentic-crawler start (pool=$POOL_SIZE tokens, ~$((POOL_SIZE * 5000)) req/h)" | tee -a "$LOG"

# Run continuously; if Python crashes, sleep 30s and restart
while true; do
  python3 "$HOME/.surrogate/bin/github-agentic-crawler.py" 0 >> "$LOG" 2>&1
  rc=$?
  echo "[$(date +%H:%M:%S)] crawler exited rc=$rc — restart in 30s" | tee -a "$LOG"
  sleep 30
done

bin/hermes-status-server.py
@@ -154,7 +154,7 @@ def log_tail(name: str, lines: int = 100) -> PlainTextResponse:
     "auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
     "hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
     "surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
-    "qwen-coder", "git-clone", "git-pull", "redis", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed",
+    "qwen-coder", "git-clone", "git-pull", "redis", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed",
     "ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
     "ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
 }

start.sh
@@ -215,10 +215,17 @@ chmod +x /tmp/scrape-daemon.sh
 nohup /tmp/scrape-daemon.sh > "$LOG_DIR/scrape-daemon.log" 2>&1 &
 echo "[$(date +%H:%M:%S)] continuous scrape daemon (parallel=8) started" >> "$LOG_DIR/boot.log"

-# ── 7b. Agentic crawler (URL frontier +
+# ── 7b. Agentic crawler (general web URL frontier + BFS link discovery) ────
 nohup bash ~/.surrogate/bin/agentic-crawler.sh 6 > "$LOG_DIR/agentic-crawler.log" 2>&1 &
 echo "[$(date +%H:%M:%S)] agentic crawler started (parallel=6)" >> "$LOG_DIR/boot.log"

+# ── 7b2. GitHub-specific agentic crawler (4 PATs × 5000/h = 20K req/h) ─────
+# Central SQLite frontier — every visited repo/PR/issue stamped; no double-fetch
+# with any other agent. Specializes in: trending, topic search, repo deep-dive,
+# closed-issue→PR fix pairs, merged-PR review pairs, release notes.
+nohup bash ~/.surrogate/bin/github-agentic-crawler.sh > "$LOG_DIR/github-agentic-crawler.log" 2>&1 &
+echo "[$(date +%H:%M:%S)] github-agentic-crawler started (token pool maximized)" >> "$LOG_DIR/boot.log"
+
 # ── 7c. Skill-synthesis daemon (extract patterns from cloned repos → skills) ─
 nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &
 echo "[$(date +%H:%M:%S)] skill-synthesis daemon started" >> "$LOG_DIR/boot.log"