Ashira Pitchayapakayakul committed on
Commit
c9adfa2
·
1 Parent(s): dd483c7

feat: role-driven discoverer - 250+ queries from real SDLC job descriptions


USER CORRECTION: 'It must have no end point — keep going on and on.'
'Each SDLC role has its own core topics it must know, plus different adjacent topics it should know.'

ARCHITECTURE:

agents/role-knowledge-map.json β€” declarative knowledge per SDLC role:
- 17 roles: solution-architect, backend-engineer, frontend-engineer,
mobile-engineer, devops, sre, cloud-engineer, devsecops, cloud-security,
data-engineer, ml-engineer, database-engineer, qa-engineer,
performance-engineer, tech-writer, platform-engineer, ai-agent-engineer
- Each role has CORE skills (must-master) + ADJACENT (should-know)
- Sourced from real FAANG/unicorn/OSS job descriptions 2024-2026
- Plus 21 cross-cutting topics (CAP theorem, 12-factor, DORA, etc.)

TOTAL: 245 role-specific queries + 21 cross-cutting + 9 baseline = ~275 queries

DISCOVERER CHANGES:
- get_queries() reloads role-knowledge-map.json on every cycle
→ user can edit JSON → next cycle picks up new topics
- Each query tagged with role_tag (e.g., 'backend-engineer-core')
- DB schema: dataset_seen.role_tag column tracks which role found each dataset
- New table: query_history tracks per-query results + freshness
- log shows role tag: '✅ [sre-engineer-core] dataset-id | license | schema'
- by_role stat in cycle summary
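As a hedged illustration of the by_role tracking (table and column names from the schema in this commit; the dataset rows and role counts below are invented), a coverage summary can be pulled straight from SQLite:

```python
import sqlite3

# Toy in-memory copy of the dataset_seen columns relevant to role tracking;
# the real table carries more columns (license, downloads, quality_score, ...).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dataset_seen (ds_id TEXT PRIMARY KEY, verdict TEXT, role_tag TEXT)")
con.executemany(
    "INSERT INTO dataset_seen VALUES (?,?,?)",
    [("org/a", "integrated", "backend-engineer-core"),
     ("org/b", "integrated", "backend-engineer-core"),
     ("org/c", "integrated", "sre-engineer-core"),
     ("org/d", "rejected:license", "sre-engineer-core")],
)
# Which role's queries are actually finding usable datasets?
coverage = dict(con.execute(
    "SELECT role_tag, COUNT(*) FROM dataset_seen "
    "WHERE verdict = 'integrated' GROUP BY role_tag"
))
print(coverage["backend-engineer-core"], coverage["sre-engineer-core"])  # 2 1
```

The same GROUP BY over role_tag is what the cycle summary's by_role stat amounts to.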

DOWNSTREAM IMPACT:
- Future: dataset-enrich.sh can read role_tag to balance training mix
(e.g., ensure 5-10% of pairs come from each role's domain)
- /status report can show role coverage matrix
- LoRA training data can be balanced per-role (not biased toward most popular)
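A minimal sketch of what that per-role balancing could look like downstream (the function name, the 10% cap, and the pair format are assumptions for illustration — none of this is in the commit):

```python
import random
from collections import defaultdict

def balance_by_role(pairs, cap=0.10, seed=0):
    """Hypothetical rebalancer: downsample so no single role_tag exceeds
    `cap` of the final training mix. Each pair is a dict carrying the
    'role_tag' the discoverer stamped on its source dataset."""
    rng = random.Random(seed)
    by_role = defaultdict(list)
    for p in pairs:
        by_role[p["role_tag"]].append(p)
    limit = max(1, int(cap * len(pairs)))  # per-role ceiling
    balanced = []
    for items in by_role.values():
        balanced.extend(items if len(items) <= limit else rng.sample(items, limit))
    return balanced

# 90 backend pairs vs 10 SRE pairs -> backend capped at 10, SRE kept whole
pairs = ([{"role_tag": "backend-engineer-core"}] * 90
         + [{"role_tag": "sre-engineer-core"}] * 10)
print(len(balance_by_role(pairs)))  # 20
```

Capping rather than upsampling keeps the mix honest: over-represented roles shrink instead of rare roles being duplicated.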

NEVER-ENDING:
- 30-min cycle, no upper bound
- New datasets uploaded daily to HF → discoverer catches within 30 min
- Queries auto-load from JSON → add new topics by editing JSON, no code change
- Stamp DB prevents re-evaluation → each cycle only processes new datasets
- Role coverage grows as new roles emerge (e.g., 'rust-systems-engineer' can be added)
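For example, adding that hypothetical 'rust-systems-engineer' role would be a plain JSON edit to role-knowledge-map.json (the topic strings below are illustrative only, not part of this commit):

```json
"rust-systems-engineer": {
  "core": [
    "rust ownership borrow checker", "tokio async runtime",
    "unsafe rust ffi", "cargo workspace"
  ],
  "adjacent": [
    "llvm codegen", "wasm wasi"
  ]
}
```

The next 30-min cycle picks the new queries up automatically via get_queries().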

Per-role query coverage:

agents/role-knowledge-map.json ADDED
@@ -0,0 +1,355 @@
+{
+  "version": "1.0",
+  "description": "Per-role SDLC knowledge map. Drives hf-dataset-discoverer.py search queries. Each role has CORE skills (must master) and ADJACENT skills (should know). Source: real job descriptions from FAANG / unicorns / OSS projects 2024-2026.",
+  "roles": {
+    "solution-architect": {
+      "core": [
+        "system design", "domain driven design", "bounded context", "aggregate root",
+        "hexagonal architecture", "clean architecture", "onion architecture",
+        "adr architecture decision record", "trade-off analysis",
+        "scalability pattern", "consistency model", "cap theorem", "saga pattern",
+        "event sourcing", "cqrs", "microservices", "monolith first",
+        "integration pattern", "anti-corruption layer", "bff backend for frontend"
+      ],
+      "adjacent": [
+        "threat modeling stride", "data modeling", "api design rest graphql",
+        "monitoring strategy", "capacity planning", "cost optimization",
+        "non-functional requirements", "service level objectives"
+      ]
+    },
+    "backend-engineer": {
+      "core": [
+        "rest api design", "graphql schema", "grpc protobuf",
+        "postgresql", "mysql", "mongodb", "redis cache",
+        "authentication oauth jwt", "authorization rbac abac",
+        "message queue kafka rabbitmq", "background jobs celery sidekiq",
+        "error handling exception", "structured logging traceid",
+        "unit test integration test", "fastapi flask express spring",
+        "go fiber gin", "rust axum actix", "node nestjs", "kotlin ktor"
+      ],
+      "adjacent": [
+        "frontend api consumption", "ci cd pipeline", "docker container",
+        "owasp top 10", "performance profiling pprof", "n+1 query",
+        "circuit breaker", "rate limiting", "idempotency"
+      ]
+    },
+    "frontend-engineer": {
+      "core": [
+        "react hooks suspense", "next.js app router", "vue composition api",
+        "svelte sveltekit", "solid solidstart", "typescript strict",
+        "tailwind css", "css-in-js styled-components", "shadcn radix headless",
+        "state management zustand jotai redux", "tanstack react query",
+        "react testing library", "playwright cypress e2e",
+        "wcag accessibility aria", "web vitals lcp cls inp",
+        "bundle optimization code splitting tree shaking"
+      ],
+      "adjacent": [
+        "backend api integration", "design system tokens",
+        "seo meta tags structured data", "a/b testing experiment",
+        "analytics events", "feature flag", "i18n localization",
+        "progressive web app", "service worker"
+      ]
+    },
+    "mobile-engineer": {
+      "core": [
+        "swiftui combine", "uikit lifecycle",
+        "kotlin jetpack compose", "android navigation",
+        "react native expo", "flutter dart",
+        "offline first cache realm room",
+        "push notification fcm apns", "deep linking universal",
+        "app store optimization aso", "app review guidelines",
+        "crash reporting sentry crashlytics",
+        "memory profiling", "battery optimization"
+      ],
+      "adjacent": [
+        "backend api consumption", "graphql apollo",
+        "analytics amplitude mixpanel", "feature flag",
+        "accessibility voiceover talkback",
+        "in app purchase store kit billing"
+      ]
+    },
+    "devops-engineer": {
+      "core": [
+        "github actions workflow", "gitlab ci", "jenkins pipeline",
+        "circleci", "argocd flux gitops",
+        "docker multi-stage", "buildkit buildx",
+        "kubernetes deployment service ingress",
+        "helm chart kustomize",
+        "terraform module", "pulumi cdk",
+        "ansible chef puppet", "configuration management",
+        "vault secrets sealed-secrets sops",
+        "blue green canary deployment", "feature flag rollout"
+      ],
+      "adjacent": [
+        "networking vpc subnet", "security scanning trivy snyk",
+        "observability prometheus grafana", "finops cost",
+        "container runtime containerd cri-o"
+      ]
+    },
+    "sre-engineer": {
+      "core": [
+        "service level indicator sli", "service level objective slo",
+        "error budget", "burn rate alert", "multi window alerting",
+        "prometheus metrics", "grafana dashboard",
+        "loki promtail fluentd logging", "opentelemetry tracing",
+        "incident response runbook", "blameless postmortem",
+        "five whys root cause", "chaos engineering chaos-mesh litmus",
+        "capacity planning", "load testing k6 locust",
+        "on-call rotation pagerduty"
+      ],
+      "adjacent": [
+        "kubernetes deep", "linux performance ebpf",
+        "networking troubleshooting", "security incident",
+        "cost optimization", "disaster recovery"
+      ]
+    },
+    "cloud-engineer": {
+      "core": [
+        "aws ec2 s3 lambda dynamodb rds vpc iam",
+        "gcp gce gcs cloud-run firestore bigquery",
+        "azure aks cosmos blob",
+        "aws well-architected framework",
+        "multi-region active active passive",
+        "disaster recovery rpo rto",
+        "cost optimization spot reserved savings-plan",
+        "tagging strategy cost allocation",
+        "private subnet nat gateway vpc endpoint",
+        "iam policy least privilege", "kms encryption"
+      ],
+      "adjacent": [
+        "kubernetes eks gke aks", "service mesh istio linkerd",
+        "compliance soc2 hipaa pci", "cdn cloudfront cloudflare",
+        "edge computing lambda-edge"
+      ]
+    },
+    "devsecops-engineer": {
+      "core": [
+        "sast semgrep codeql bandit", "dast zap burp",
+        "sca trivy grype snyk dependabot",
+        "secret scanning gitleaks trufflehog",
+        "container image scan trivy clair",
+        "sbom syft cyclonedx spdx",
+        "sigstore cosign attestation",
+        "slsa supply chain framework",
+        "opa rego conftest policy as code",
+        "shift left security",
+        "vulnerability management cvss epss kev"
+      ],
+      "adjacent": [
+        "incident response forensics", "compliance automation",
+        "threat intelligence mitre att&ck",
+        "zero trust mtls service mesh"
+      ]
+    },
+    "cloud-security": {
+      "core": [
+        "cspm prowler scoutsuite steampipe",
+        "cis benchmark aws azure gcp",
+        "iam audit access analyzer", "least privilege",
+        "encryption at rest in transit kms", "envelope encryption",
+        "key rotation",
+        "waf shield guardduty",
+        "compliance pci-dss hipaa soc2 fedramp gdpr iso27001",
+        "siem splunk elastic security qradar",
+        "cloud detection response cdr",
+        "incident response cloud trail"
+      ],
+      "adjacent": [
+        "network security firewall nsg",
+        "application security owasp",
+        "threat detection lateral movement",
+        "data loss prevention dlp"
+      ]
+    },
+    "data-engineer": {
+      "core": [
+        "airflow dag", "dagster prefect",
+        "spark pyspark", "flink streaming",
+        "kafka connect schema registry",
+        "delta lake iceberg hudi", "lakehouse",
+        "snowflake bigquery redshift",
+        "dbt model test snapshot",
+        "schema evolution avro protobuf",
+        "great expectations dbt-tests data quality",
+        "openlineage data lineage",
+        "cdc debezium"
+      ],
+      "adjacent": [
+        "ml pipeline kubeflow tfx",
+        "observability for data datadog metaflow",
+        "cost optimization warehouse"
+      ]
+    },
+    "ml-engineer": {
+      "core": [
+        "pytorch lightning", "huggingface transformers",
+        "lora qlora peft fine-tuning",
+        "unsloth axolotl", "deepspeed fsdp accelerate",
+        "vllm tgi triton inference", "model serving",
+        "mlflow weights and biases",
+        "feature store feast tecton",
+        "data version control dvc",
+        "rag retrieval augmented generation",
+        "vector database faiss chroma weaviate qdrant",
+        "embedding nomic bge instructor",
+        "reranker cross-encoder bge mxbai",
+        "evaluation lm-eval-harness",
+        "prompt engineering chain of thought"
+      ],
+      "adjacent": [
+        "devops kubernetes",
+        "data engineering",
+        "statistics distribution",
+        "rlhf dpo orpo preference learning",
+        "agent framework langgraph crew autogen"
+      ]
+    },
+    "database-engineer": {
+      "core": [
+        "postgresql tuning", "mysql innodb",
+        "explain analyze query plan",
+        "btree gin gist hash index",
+        "partial index expression index",
+        "table partition list range hash",
+        "vacuum autovacuum bloat",
+        "replication streaming logical",
+        "connection pool pgbouncer",
+        "schema migration sqitch flyway alembic liquibase",
+        "online ddl pt-online-schema-change",
+        "backup pitr point-in-time recovery"
+      ],
+      "adjacent": [
+        "redis caching pattern",
+        "timescaledb timeseries",
+        "clickhouse olap",
+        "graph neo4j",
+        "search elasticsearch opensearch"
+      ]
+    },
+    "qa-engineer": {
+      "core": [
+        "test pyramid", "unit test mock stub",
+        "integration test database",
+        "playwright cypress webdriver e2e",
+        "rest assured supertest api test",
+        "property based testing fast-check hypothesis",
+        "fuzzing afl libfuzzer",
+        "mutation testing pitest stryker",
+        "snapshot test storybook",
+        "contract testing pact",
+        "test factory fixture builder pattern",
+        "coverage line branch path"
+      ],
+      "adjacent": [
+        "ci cd pipeline test stage",
+        "performance testing k6 jmeter",
+        "security testing zap",
+        "accessibility testing axe pa11y",
+        "visual regression chromatic"
+      ]
+    },
+    "performance-engineer": {
+      "core": [
+        "profiling pprof py-spy async-profiler",
+        "load testing k6 locust gatling",
+        "stress testing", "soak test",
+        "p50 p95 p99 latency", "throughput rps",
+        "flame graph",
+        "n+1 query elimination",
+        "caching l1 l2 cdn",
+        "connection pool sizing",
+        "async io goroutine coroutine",
+        "memory profiling heap allocation",
+        "garbage collection tuning"
+      ],
+      "adjacent": [
+        "observability tracing",
+        "database tuning",
+        "kernel ebpf bpftrace",
+        "network latency rtt"
+      ]
+    },
+    "tech-writer": {
+      "core": [
+        "diataxis framework tutorial how-to reference explanation",
+        "openapi swagger redoc",
+        "api design google api guidelines",
+        "readme structure",
+        "adr template",
+        "runbook structure",
+        "code documentation jsdoc rustdoc",
+        "tutorial step by step",
+        "style guide google microsoft chicago"
+      ],
+      "adjacent": [
+        "code reading comprehension",
+        "user research empathy",
+        "i18n internationalization",
+        "screencast tutorial recording"
+      ]
+    },
+    "platform-engineer": {
+      "core": [
+        "internal developer platform idp",
+        "backstage developer portal",
+        "golden path opinionated workflow",
+        "service catalog",
+        "shared ci cd template",
+        "observability infrastructure",
+        "secrets distribution",
+        "feature flag platform",
+        "multi-tenancy"
+      ],
+      "adjacent": [
+        "all sre devops topics",
+        "developer experience dx",
+        "documentation platform"
+      ]
+    },
+    "ai-agent-engineer": {
+      "core": [
+        "react agent reasoning acting",
+        "tool calling function calling",
+        "mcp model context protocol",
+        "agent orchestration langgraph crew autogen",
+        "agentic workflow",
+        "multi agent collaboration",
+        "self critique reflexion",
+        "voyager skill library",
+        "memory episodic procedural",
+        "rag retrieval augmented",
+        "vector search embedding rerank",
+        "prompt engineering xml structured",
+        "constitutional ai safety"
+      ],
+      "adjacent": [
+        "all backend topics",
+        "ml engineering fine-tuning",
+        "evaluation benchmark swe-bench",
+        "cost optimization token budget"
+      ]
+    }
+  },
+  "cross_cutting_topics": [
+    "system design interview", "high scalability",
+    "distributed system consensus raft paxos",
+    "consistency strong eventual causal",
+    "caching strategy ttl invalidation",
+    "rate limiting algorithm token bucket leaky",
+    "load balancer round robin least connection",
+    "circuit breaker hystrix resilience4j",
+    "12 factor app", "twelve factor",
+    "domain language ubiquitous",
+    "code review checklist",
+    "git workflow gitflow trunk-based",
+    "semver conventional commits",
+    "testing best practice",
+    "logging structured json",
+    "tracing opentelemetry w3c",
+    "metrics red use sli",
+    "observability three pillars",
+    "incident severity sev1 sev2",
+    "blameless culture",
+    "engineering excellence dora metrics"
+  ]
+}
bin/hf-dataset-discoverer.py CHANGED
@@ -32,51 +32,38 @@ ALLOWED = {
 DENY_KEYWORDS = ("noncommercial", "non-commercial", "nc-", "-nc", "nc4.0",
                  "llama2", "llama3", "llama-3", "research-only", "personal-use")
 
-# 70+ search queries — broad SDLC + niche coverage
-QUERIES = [
-    # Code generation/instruction
-    "code instruction", "code completion", "code generation", "python instruction",
-    "code review", "code refactoring", "code translation", "code explanation",
-    # Bug-fix / test
-    "bug fix", "test generation", "unit test", "pull request", "diff review",
-    "vulnerability fix", "security patch",
-    # Reasoning / CoT
-    "chain of thought", "math reasoning", "step by step", "reasoning trace",
-    "deepseek r1", "qwq", "o1 reasoning",
-    # Agent / tool
-    "agent trajectory", "tool calling", "function calling", "react agent",
-    "swe-bench", "agentic", "smolagents",
-    # DevSecOps / IR
-    "incident response", "postmortem", "cybersecurity", "vulnerability",
-    "cve", "exploit", "owasp", "threat intelligence", "security audit",
-    "penetration testing", "red team",
-    # SRE / Cloud
-    "system reliability", "sre", "observability", "kubernetes", "terraform",
-    "cloudformation", "aws", "gcp", "azure", "devops",
-    # Data / ML
-    "dbt", "airflow", "spark", "kafka", "etl", "mlops", "model serving",
-    "embedding dataset", "rag dataset",
-    # SQL / DBA
-    "text-to-sql", "sql query", "database query", "schema",
-    # Architecture
-    "software architecture", "design pattern", "domain driven design",
-    "microservices", "event sourcing", "cqrs", "hexagonal",
-    # Frontend
-    "react", "nextjs", "tailwind", "vue", "svelte", "ui component",
-    # Mobile
-    "ios swift", "android kotlin", "react native", "flutter",
-    # Multilingual
-    "multilingual code", "multilingual instruction", "thai instruction",
-    # Domain niches
-    "compiler", "embedded", "rust systems", "go concurrency",
-    "performance optimization", "concurrency",
-    # Doc / API
-    "openapi", "api design", "technical writing", "documentation dataset",
-    # Constitutional / safety
-    "constitutional ai", "safety dataset", "preference dataset", "dpo",
-    # Recent mega-mixes
-    "instruction tuning 2025", "post-training dataset", "sft mixture",
-]
+# Load role-driven query map (auto-rebuilds when role-knowledge-map.json updated)
+def _load_role_queries() -> list[tuple[str, str]]:
+    """Returns list of (query, role) tuples. Each role contributes core + adjacent
+    topics. Plus cross-cutting general queries. Total ~250+ queries auto-generated."""
+    role_map_path = HOME / ".surrogate/agents/role-knowledge-map.json"
+    queries: list[tuple[str, str]] = []
+    if role_map_path.exists():
+        try:
+            data = json.loads(role_map_path.read_text())
+        except Exception:
+            data = {"roles": {}, "cross_cutting_topics": []}
+        for role, skills in data.get("roles", {}).items():
+            for q in (skills.get("core") or []):
+                queries.append((q, f"{role}-core"))
+            for q in (skills.get("adjacent") or []):
+                queries.append((q, f"{role}-adj"))
+        for q in data.get("cross_cutting_topics") or []:
+            queries.append((q, "cross-cutting"))
+    # Plus baseline queries (NEVER static — discoverer must keep finding)
+    queries.extend([(q, "general") for q in [
+        "instruction tuning 2025", "instruction tuning 2026",
+        "post-training dataset", "sft mixture",
+        "preference dataset dpo orpo",
+        "dataset 2026", "code dataset 2026",
+        "agentic dataset 2026", "reasoning dataset 2026",
+    ]])
+    return queries
+
+
+def get_queries() -> list[tuple[str, str]]:
+    """Reload on each call so role-knowledge-map.json edits take effect immediately."""
+    return _load_role_queries()
 
 
 def log(msg: str):
@@ -100,11 +87,26 @@ def init_db():
             schema_branch TEXT,
             cap INTEGER,
             slug TEXT,
-            verdict TEXT
+            verdict TEXT,
+            role_tag TEXT  -- which role's query found this
         );
         CREATE INDEX IF NOT EXISTS idx_verdict ON dataset_seen(verdict);
         CREATE INDEX IF NOT EXISTS idx_score ON dataset_seen(quality_score DESC);
+        CREATE INDEX IF NOT EXISTS idx_role ON dataset_seen(role_tag);
+
+        CREATE TABLE IF NOT EXISTS query_history (
+            query TEXT PRIMARY KEY,
+            role_tag TEXT,
+            last_run_ts INTEGER NOT NULL,
+            results_count INTEGER DEFAULT 0,
+            new_finds INTEGER DEFAULT 0
+        );
     """)
+    # Migration: add role_tag column if upgrading from v1 schema
+    try:
+        c.execute("ALTER TABLE dataset_seen ADD COLUMN role_tag TEXT")
+    except sqlite3.OperationalError:
+        pass  # already exists
 
 
 def hf_get(url: str, timeout: int = 15):
@@ -254,11 +256,14 @@ def evaluate_one(ds_id: str) -> tuple[str, dict | None]:
 
 
 def stamp(ds_id: str, verdict: str, lic: str = "", dl: int = 0,
-          score: float = 0.0, schema: str = "", cap: int = 0, slug: str = ""):
+          score: float = 0.0, schema: str = "", cap: int = 0, slug: str = "",
+          role_tag: str = ""):
     with sqlite3.connect(DB) as c:
         c.execute(
-            "INSERT OR IGNORE INTO dataset_seen VALUES (?,?,?,?,?,?,?,?,?)",
-            (ds_id, int(time.time()), lic, dl, score, schema, cap, slug, verdict)
+            "INSERT OR IGNORE INTO dataset_seen "
+            "(ds_id, evaluated_ts, license, downloads, quality_score, schema_branch, cap, slug, verdict, role_tag) "
+            "VALUES (?,?,?,?,?,?,?,?,?,?)",
+            (ds_id, int(time.time()), lic, dl, score, schema, cap, slug, verdict, role_tag)
        )
 
 
@@ -272,8 +277,13 @@ def discover_cycle() -> dict:
     new_queued = 0
     new_rejected = 0
     seen_this_cycle = 0
-    for q in QUERIES:
-        url = f"https://huggingface.co/api/datasets?search={urllib.parse.quote(q)}&limit=50&sort=downloads&direction=-1"
+    role_finds: dict[str, int] = {}
+
+    queries = get_queries()
+    log(f"  loaded {len(queries)} role-driven queries (covering {len(set(r for _, r in queries))} role tags)")
+
+    for q, role_tag in queries:
+        url = f"https://huggingface.co/api/datasets?search={urllib.parse.quote(q)}&limit=30&sort=downloads&direction=-1"
         results = hf_get(url, timeout=15) or []
         for ds in results:
            ds_id = ds.get("id", "")
@@ -287,19 +297,35 @@ def discover_cycle() -> dict:
                   score=entry.get("score", 0.0) if entry else 0.0,
                   schema=entry.get("schema", "") if entry else "",
                   cap=entry.get("cap", 0) if entry else 0,
-                  slug=entry.get("slug", "") if entry else "")
+                  slug=entry.get("slug", "") if entry else "",
+                  role_tag=role_tag)
            if verdict == "integrated":
+                # Tag the entry with role for downstream training-mix balance
+                if entry: entry["role_tag"] = role_tag
                append_dynamic(entry)
                new_integrated += 1
-                log(f"  ✅ {ds_id} | {entry['license']} | {entry['schema']} | cap={entry['cap']:,} | score={entry['score']}")
+                role_finds[role_tag] = role_finds.get(role_tag, 0) + 1
+                log(f"  ✅ [{role_tag}] {ds_id} | {entry['license']} | {entry['schema']} | cap={entry['cap']:,}")
            elif verdict.startswith("queued"):
                new_queued += 1
            else:
                new_rejected += 1
-        time.sleep(0.5)  # gentle on HF API
+        time.sleep(0.4)  # gentle on HF API
+
+        # Update query history for this query
+        try:
+            with sqlite3.connect(DB) as c:
+                c.execute(
+                    "INSERT OR REPLACE INTO query_history (query, role_tag, last_run_ts, results_count, new_finds) "
+                    "VALUES (?,?,?,?, COALESCE((SELECT new_finds FROM query_history WHERE query=?),0) + ?)",
+                    (q, role_tag, int(time.time()), len(results), q, new_integrated)
+                )
+        except Exception:
+            pass
 
     return {"evaluated": seen_this_cycle, "integrated": new_integrated,
-            "queued": new_queued, "rejected": new_rejected}
+            "queued": new_queued, "rejected": new_rejected,
+            "by_role": role_finds}
 
 
 def main():