feat: role-driven discoverer - 250+ queries from real SDLC job descriptions
USER CORRECTION: "It must never stop - keep searching on and on."
"Each SDLC role has core topics it must master, and adjacent topics it should also know."
ARCHITECTURE:
agents/role-knowledge-map.json → declarative knowledge per SDLC role:
- 17 roles: solution-architect, backend-engineer, frontend-engineer,
mobile-engineer, devops, sre, cloud-engineer, devsecops, cloud-security,
data-engineer, ml-engineer, database-engineer, qa-engineer,
performance-engineer, tech-writer, platform-engineer, ai-agent-engineer
- Each role has CORE skills (must-master) + ADJACENT (should-know)
- Sourced from real FAANG/unicorn/OSS job descriptions 2024-2026
- Plus 21 cross-cutting topics (CAP theorem, 12-factor, DORA, etc.)
TOTAL: 245 role-specific queries + 21 cross-cutting + 9 baseline = ~275 queries
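The core/adjacent expansion described above can be sketched with a hypothetical two-role mini-map (`mini_map` and `expand` are illustrative names, not the real file or function):

```python
import json

# A two-role stand-in for role-knowledge-map.json (illustrative only).
mini_map = json.loads("""
{
  "roles": {
    "backend-engineer": {"core": ["rest api design"], "adjacent": ["owasp top 10"]},
    "sre-engineer": {"core": ["error budget"], "adjacent": []}
  },
  "cross_cutting_topics": ["12 factor app"]
}
""")

def expand(data):
    # Every core/adjacent skill becomes one role-tagged search query.
    queries = []
    for role, skills in data["roles"].items():
        queries += [(q, f"{role}-core") for q in skills.get("core", [])]
        queries += [(q, f"{role}-adj") for q in skills.get("adjacent", [])]
    queries += [(q, "cross-cutting") for q in data.get("cross_cutting_topics", [])]
    return queries

print(len(expand(mini_map)))  # 4 queries: 2 core, 1 adjacent, 1 cross-cutting
```

With 17 roles averaging ~14 topics each plus 21 cross-cutting and 9 baseline queries, the same expansion yields the ~275 figure above.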
DISCOVERER CHANGES:
- get_queries() reloads role-knowledge-map.json on every cycle
→ user can edit JSON → next cycle picks up new topics
- Each query tagged with role_tag (e.g., 'backend-engineer-core')
- DB schema: dataset_seen.role_tag column tracks which role found each dataset
- New table: query_history tracks per-query results + freshness
- log shows role tag: '✅ [sre-engineer-core] dataset-id | license | schema'
- by_role stat in cycle summary
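The by_role stat reduces to a GROUP BY over the stamp DB; a minimal sketch with an in-memory stand-in (only the columns relevant here, sample rows invented):

```python
import sqlite3

# In-memory stand-in for the stamp DB (subset of the dataset_seen columns).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dataset_seen (ds_id TEXT PRIMARY KEY, verdict TEXT, role_tag TEXT)")
rows = [("a/ds1", "integrated", "sre-engineer-core"),
        ("b/ds2", "integrated", "sre-engineer-core"),
        ("c/ds3", "rejected",   "backend-engineer-adj")]
con.executemany("INSERT INTO dataset_seen VALUES (?,?,?)", rows)

# The cycle-summary by_role stat is a count of integrated rows per role tag.
by_role = dict(con.execute(
    "SELECT role_tag, COUNT(*) FROM dataset_seen "
    "WHERE verdict='integrated' GROUP BY role_tag"))
print(by_role)  # {'sre-engineer-core': 2}
```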
DOWNSTREAM IMPACT:
- Future: dataset-enrich.sh can read role_tag to balance training mix
(e.g., ensure 5-10% of pairs come from each role's domain)
- /status report can show role coverage matrix
- LoRA training data can be balanced per-role (not biased toward most popular)
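One way the per-role balancing could work, as a hedged sketch (`role_quotas` and the sample counts are hypothetical; the real logic would live in dataset-enrich.sh): cap each role tag's share of the final mix so no single role dominates.

```python
def role_quotas(pair_counts: dict[str, int], total: int, cap: float = 0.10) -> dict[str, int]:
    # Each role contributes at most `cap` (10%) of the final training mix;
    # roles with fewer pairs than the cap contribute everything they have.
    hi = int(total * cap)
    return {role: min(n, hi) for role, n in pair_counts.items()}

counts = {"backend-engineer-core": 4000, "tech-writer-core": 120, "sre-engineer-core": 900}
print(role_quotas(counts, total=5000))
# {'backend-engineer-core': 500, 'tech-writer-core': 120, 'sre-engineer-core': 500}
```

The popular backend role is clipped to 10% while the scarce tech-writer role keeps all of its pairs, which is the anti-bias property the bullet above asks for.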
NEVER-ENDING:
- 30-min cycle, no upper bound
- New datasets uploaded daily to HF → discoverer catches them within 30 min
- Queries auto-load from JSON → add new topics by editing JSON, no code change
- Stamp DB prevents re-evaluation → each cycle only processes new datasets
- Role coverage grows as new roles emerge (e.g., 'rust-systems-engineer' can be added)
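Adding a role is a pure JSON edit; a sketch of what that edit looks like (temporary path for illustration — the real file lives under agents/):

```python
import json
import pathlib
import tempfile

# Hypothetical edit: drop a new role into the map. Because get_queries()
# re-reads the file on every call, the next 30-min cycle picks it up.
path = pathlib.Path(tempfile.mkdtemp()) / "role-knowledge-map.json"
path.write_text(json.dumps({"roles": {}, "cross_cutting_topics": []}))

data = json.loads(path.read_text())
data["roles"]["rust-systems-engineer"] = {
    "core": ["rust ownership borrow checker", "tokio async runtime"],
    "adjacent": ["ffi bindgen"],
}
path.write_text(json.dumps(data, indent=2))

reloaded = json.loads(path.read_text())
print(sorted(reloaded["roles"]))  # ['rust-systems-engineer']
```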
FILES CHANGED:
- agents/role-knowledge-map.json +355 -0
- bin/hf-dataset-discoverer.py +81 -55
@@ -0,0 +1,355 @@
+{
+  "version": "1.0",
+  "description": "Per-role SDLC knowledge map. Drives hf-dataset-discoverer.py search queries. Each role has CORE skills (must master) and ADJACENT skills (should know). Source: real job descriptions from FAANG / unicorns / OSS projects 2024-2026.",
+  "roles": {
+    "solution-architect": {
+      "core": [
+        "system design", "domain driven design", "bounded context", "aggregate root",
+        "hexagonal architecture", "clean architecture", "onion architecture",
+        "adr architecture decision record", "trade-off analysis",
+        "scalability pattern", "consistency model", "cap theorem", "saga pattern",
+        "event sourcing", "cqrs", "microservices", "monolith first",
+        "integration pattern", "anti-corruption layer", "bff backend for frontend"
+      ],
+      "adjacent": [
+        "threat modeling stride", "data modeling", "api design rest graphql",
+        "monitoring strategy", "capacity planning", "cost optimization",
+        "non-functional requirements", "service level objectives"
+      ]
+    },
+    "backend-engineer": {
+      "core": [
+        "rest api design", "graphql schema", "grpc protobuf",
+        "postgresql", "mysql", "mongodb", "redis cache",
+        "authentication oauth jwt", "authorization rbac abac",
+        "message queue kafka rabbitmq", "background jobs celery sidekiq",
+        "error handling exception", "structured logging traceid",
+        "unit test integration test", "fastapi flask express spring",
+        "go fiber gin", "rust axum actix", "node nestjs", "kotlin ktor"
+      ],
+      "adjacent": [
+        "frontend api consumption", "ci cd pipeline", "docker container",
+        "owasp top 10", "performance profiling pprof", "n+1 query",
+        "circuit breaker", "rate limiting", "idempotency"
+      ]
+    },
+    "frontend-engineer": {
+      "core": [
+        "react hooks suspense", "next.js app router", "vue composition api",
+        "svelte sveltekit", "solid solidstart", "typescript strict",
+        "tailwind css", "css-in-js styled-components", "shadcn radix headless",
+        "state management zustand jotai redux", "tanstack react query",
+        "react testing library", "playwright cypress e2e",
+        "wcag accessibility aria", "web vitals lcp cls inp",
+        "bundle optimization code splitting tree shaking"
+      ],
+      "adjacent": [
+        "backend api integration", "design system tokens",
+        "seo meta tags structured data", "a/b testing experiment",
+        "analytics events", "feature flag", "i18n localization",
+        "progressive web app", "service worker"
+      ]
+    },
+    "mobile-engineer": {
+      "core": [
+        "swiftui combine", "uikit lifecycle",
+        "kotlin jetpack compose", "android navigation",
+        "react native expo", "flutter dart",
+        "offline first cache realm room",
+        "push notification fcm apns", "deep linking universal",
+        "app store optimization aso", "app review guidelines",
+        "crash reporting sentry crashlytics",
+        "memory profiling", "battery optimization"
+      ],
+      "adjacent": [
+        "backend api consumption", "graphql apollo",
+        "analytics amplitude mixpanel", "feature flag",
+        "accessibility voiceover talkback",
+        "in app purchase store kit billing"
+      ]
+    },
+    "devops-engineer": {
+      "core": [
+        "github actions workflow", "gitlab ci", "jenkins pipeline",
+        "circleci", "argocd flux gitops",
+        "docker multi-stage", "buildkit buildx",
+        "kubernetes deployment service ingress",
+        "helm chart kustomize",
+        "terraform module", "pulumi cdk",
+        "ansible chef puppet", "configuration management",
+        "vault secrets sealed-secrets sops",
+        "blue green canary deployment", "feature flag rollout"
+      ],
+      "adjacent": [
+        "networking vpc subnet", "security scanning trivy snyk",
+        "observability prometheus grafana", "finops cost",
+        "container runtime containerd cri-o"
+      ]
+    },
+    "sre-engineer": {
+      "core": [
+        "service level indicator sli", "service level objective slo",
+        "error budget", "burn rate alert", "multi window alerting",
+        "prometheus metrics", "grafana dashboard",
+        "loki promtail fluentd logging", "opentelemetry tracing",
+        "incident response runbook", "blameless postmortem",
+        "five whys root cause", "chaos engineering chaos-mesh litmus",
+        "capacity planning", "load testing k6 locust",
+        "on-call rotation pagerduty"
+      ],
+      "adjacent": [
+        "kubernetes deep", "linux performance ebpf",
+        "networking troubleshooting", "security incident",
+        "cost optimization", "disaster recovery"
+      ]
+    },
+    "cloud-engineer": {
+      "core": [
+        "aws ec2 s3 lambda dynamodb rds vpc iam",
+        "gcp gce gcs cloud-run firestore bigquery",
+        "azure aks cosmos blob",
+        "aws well-architected framework",
+        "multi-region active active passive",
+        "disaster recovery rpo rto",
+        "cost optimization spot reserved savings-plan",
+        "tagging strategy cost allocation",
+        "private subnet nat gateway vpc endpoint",
+        "iam policy least privilege", "kms encryption"
+      ],
+      "adjacent": [
+        "kubernetes eks gke aks", "service mesh istio linkerd",
+        "compliance soc2 hipaa pci", "cdn cloudfront cloudflare",
+        "edge computing lambda-edge"
+      ]
+    },
+    "devsecops-engineer": {
+      "core": [
+        "sast semgrep codeql bandit", "dast zap burp",
+        "sca trivy grype snyk dependabot",
+        "secret scanning gitleaks trufflehog",
+        "container image scan trivy clair",
+        "sbom syft cyclonedx spdx",
+        "sigstore cosign attestation",
+        "slsa supply chain framework",
+        "opa rego conftest policy as code",
+        "shift left security",
+        "vulnerability management cvss epss kev"
+      ],
+      "adjacent": [
+        "incident response forensics", "compliance automation",
+        "threat intelligence mitre att&ck",
+        "zero trust mtls service mesh"
+      ]
+    },
+    "cloud-security": {
+      "core": [
+        "cspm prowler scoutsuite steampipe",
+        "cis benchmark aws azure gcp",
+        "iam audit access analyzer", "least privilege",
+        "encryption at rest in transit kms", "envelope encryption",
+        "key rotation",
+        "waf shield guardduty",
+        "compliance pci-dss hipaa soc2 fedramp gdpr iso27001",
+        "siem splunk elastic security qradar",
+        "cloud detection response cdr",
+        "incident response cloud trail"
+      ],
+      "adjacent": [
+        "network security firewall nsg",
+        "application security owasp",
+        "threat detection lateral movement",
+        "data loss prevention dlp"
+      ]
+    },
+    "data-engineer": {
+      "core": [
+        "airflow dag", "dagster prefect",
+        "spark pyspark", "flink streaming",
+        "kafka connect schema registry",
+        "delta lake iceberg hudi", "lakehouse",
+        "snowflake bigquery redshift",
+        "dbt model test snapshot",
+        "schema evolution avro protobuf",
+        "great expectations dbt-tests data quality",
+        "openlineage data lineage",
+        "cdc debezium"
+      ],
+      "adjacent": [
+        "ml pipeline kubeflow tfx",
+        "observability for data datadog metaflow",
+        "cost optimization warehouse"
+      ]
+    },
+    "ml-engineer": {
+      "core": [
+        "pytorch lightning", "huggingface transformers",
+        "lora qlora peft fine-tuning",
+        "unsloth axolotl", "deepspeed fsdp accelerate",
+        "vllm tgi triton inference", "model serving",
+        "mlflow weights and biases",
+        "feature store feast tecton",
+        "data version control dvc",
+        "rag retrieval augmented generation",
+        "vector database faiss chroma weaviate qdrant",
+        "embedding nomic bge instructor",
+        "reranker cross-encoder bge mxbai",
+        "evaluation lm-eval-harness",
+        "prompt engineering chain of thought"
+      ],
+      "adjacent": [
+        "devops kubernetes",
+        "data engineering",
+        "statistics distribution",
+        "rlhf dpo orpo preference learning",
+        "agent framework langgraph crew autogen"
+      ]
+    },
+    "database-engineer": {
+      "core": [
+        "postgresql tuning", "mysql innodb",
+        "explain analyze query plan",
+        "btree gin gist hash index",
+        "partial index expression index",
+        "table partition list range hash",
+        "vacuum autovacuum bloat",
+        "replication streaming logical",
+        "connection pool pgbouncer",
+        "schema migration sqitch flyway alembic liquibase",
+        "online ddl pt-online-schema-change",
+        "backup pitr point-in-time recovery"
+      ],
+      "adjacent": [
+        "redis caching pattern",
+        "timescaledb timeseries",
+        "clickhouse olap",
+        "graph neo4j",
+        "search elasticsearch opensearch"
+      ]
+    },
+    "qa-engineer": {
+      "core": [
+        "test pyramid", "unit test mock stub",
+        "integration test database",
+        "playwright cypress webdriver e2e",
+        "rest assured supertest api test",
+        "property based testing fast-check hypothesis",
+        "fuzzing afl libfuzzer",
+        "mutation testing pitest stryker",
+        "snapshot test storybook",
+        "contract testing pact",
+        "test factory fixture builder pattern",
+        "coverage line branch path"
+      ],
+      "adjacent": [
+        "ci cd pipeline test stage",
+        "performance testing k6 jmeter",
+        "security testing zap",
+        "accessibility testing axe pa11y",
+        "visual regression chromatic"
+      ]
+    },
+    "performance-engineer": {
+      "core": [
+        "profiling pprof py-spy async-profiler",
+        "load testing k6 locust gatling",
+        "stress testing", "soak test",
+        "p50 p95 p99 latency", "throughput rps",
+        "flame graph",
+        "n+1 query elimination",
+        "caching l1 l2 cdn",
+        "connection pool sizing",
+        "async io goroutine coroutine",
+        "memory profiling heap allocation",
+        "garbage collection tuning"
+      ],
+      "adjacent": [
+        "observability tracing",
+        "database tuning",
+        "kernel ebpf bpftrace",
+        "network latency rtt"
+      ]
+    },
+    "tech-writer": {
+      "core": [
+        "diataxis framework tutorial how-to reference explanation",
+        "openapi swagger redoc",
+        "api design google api guidelines",
+        "readme structure",
+        "adr template",
+        "runbook structure",
+        "code documentation jsdoc rustdoc",
+        "tutorial step by step",
+        "style guide google microsoft chicago"
+      ],
+      "adjacent": [
+        "code reading comprehension",
+        "user research empathy",
+        "i18n internationalization",
+        "screencast tutorial recording"
+      ]
+    },
+    "platform-engineer": {
+      "core": [
+        "internal developer platform idp",
+        "backstage developer portal",
+        "golden path opinionated workflow",
+        "service catalog",
+        "shared ci cd template",
+        "observability infrastructure",
+        "secrets distribution",
+        "feature flag platform",
+        "multi-tenancy"
+      ],
+      "adjacent": [
+        "all sre devops topics",
+        "developer experience dx",
+        "documentation platform"
+      ]
+    },
+    "ai-agent-engineer": {
+      "core": [
+        "react agent reasoning acting",
+        "tool calling function calling",
+        "mcp model context protocol",
+        "agent orchestration langgraph crew autogen",
+        "agentic workflow",
+        "multi agent collaboration",
+        "self critique reflexion",
+        "voyager skill library",
+        "memory episodic procedural",
+        "rag retrieval augmented",
+        "vector search embedding rerank",
+        "prompt engineering xml structured",
+        "constitutional ai safety"
+      ],
+      "adjacent": [
+        "all backend topics",
+        "ml engineering fine-tuning",
+        "evaluation benchmark swe-bench",
+        "cost optimization token budget"
+      ]
+    }
+  },
+  "cross_cutting_topics": [
+    "system design interview", "high scalability",
+    "distributed system consensus raft paxos",
+    "consistency strong eventual causal",
+    "caching strategy ttl invalidation",
+    "rate limiting algorithm token bucket leaky",
+    "load balancer round robin least connection",
+    "circuit breaker hystrix resilience4j",
+    "12 factor app", "twelve factor",
+    "domain language ubiquitous",
+    "code review checklist",
+    "git workflow gitflow trunk-based",
+    "semver conventional commits",
+    "testing best practice",
+    "logging structured json",
+    "tracing opentelemetry w3c",
+    "metrics red use sli",
+    "observability three pillars",
+    "incident severity sev1 sev2",
+    "blameless culture",
+    "engineering excellence dora metrics"
+  ]
+}
bin/hf-dataset-discoverer.py

@@ -32,51 +32,38 @@ ALLOWED = {
 DENY_KEYWORDS = ("noncommercial", "non-commercial", "nc-", "-nc", "nc4.0",
                  "llama2", "llama3", "llama-3", "research-only", "personal-use")
 
-#
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-#
-
-
-
-
-
-
-
-
-
-
-
-    "
-
-    "ios swift", "android kotlin", "react native", "flutter",
-    # Multilingual
-    "multilingual code", "multilingual instruction", "thai instruction",
-    # Domain niches
-    "compiler", "embedded", "rust systems", "go concurrency",
-    "performance optimization", "concurrency",
-    # Doc / API
-    "openapi", "api design", "technical writing", "documentation dataset",
-    # Constitutional / safety
-    "constitutional ai", "safety dataset", "preference dataset", "dpo",
-    # Recent mega-mixes
-    "instruction tuning 2025", "post-training dataset", "sft mixture",
-]
+# Load role-driven query map (auto-rebuilds when role-knowledge-map.json updated)
+def _load_role_queries() -> list[tuple[str, str]]:
+    """Returns list of (query, role) tuples. Each role contributes core + adjacent
+    topics. Plus cross-cutting general queries. Total ~250+ queries auto-generated."""
+    role_map_path = HOME / ".surrogate/agents/role-knowledge-map.json"
+    queries: list[tuple[str, str]] = []
+    if role_map_path.exists():
+        try:
+            data = json.loads(role_map_path.read_text())
+        except Exception:
+            data = {"roles": {}, "cross_cutting_topics": []}
+        for role, skills in data.get("roles", {}).items():
+            for q in (skills.get("core") or []):
+                queries.append((q, f"{role}-core"))
+            for q in (skills.get("adjacent") or []):
+                queries.append((q, f"{role}-adj"))
+        for q in data.get("cross_cutting_topics") or []:
+            queries.append((q, "cross-cutting"))
+    # Plus baseline queries (never static - discoverer must keep finding)
+    queries.extend([(q, "general") for q in [
+        "instruction tuning 2025", "instruction tuning 2026",
+        "post-training dataset", "sft mixture",
+        "preference dataset dpo orpo",
+        "dataset 2026", "code dataset 2026",
+        "agentic dataset 2026", "reasoning dataset 2026",
+    ]])
+    return queries
+
+
+def get_queries() -> list[tuple[str, str]]:
+    """Reload on each call so role-knowledge-map.json edits take effect immediately."""
+    return _load_role_queries()
 
 
 def log(msg: str):

@@ -100,11 +87,26 @@ def init_db():
             schema_branch TEXT,
             cap INTEGER,
             slug TEXT,
-            verdict TEXT
+            verdict TEXT,
+            role_tag TEXT  -- which role's query found this
         );
         CREATE INDEX IF NOT EXISTS idx_verdict ON dataset_seen(verdict);
         CREATE INDEX IF NOT EXISTS idx_score ON dataset_seen(quality_score DESC);
+        CREATE INDEX IF NOT EXISTS idx_role ON dataset_seen(role_tag);
+
+        CREATE TABLE IF NOT EXISTS query_history (
+            query TEXT PRIMARY KEY,
+            role_tag TEXT,
+            last_run_ts INTEGER NOT NULL,
+            results_count INTEGER DEFAULT 0,
+            new_finds INTEGER DEFAULT 0
+        );
     """)
+    # Migration: add role_tag column if upgrading from v1 schema
+    try:
+        c.execute("ALTER TABLE dataset_seen ADD COLUMN role_tag TEXT")
+    except sqlite3.OperationalError:
+        pass  # already exists
 
 
 def hf_get(url: str, timeout: int = 15):

@@ -254,11 +256,14 @@ def evaluate_one(ds_id: str) -> tuple[str, dict | None]:
 
 
 def stamp(ds_id: str, verdict: str, lic: str = "", dl: int = 0,
-          score: float = 0.0, schema: str = "", cap: int = 0, slug: str = ""):
+          score: float = 0.0, schema: str = "", cap: int = 0, slug: str = "",
+          role_tag: str = ""):
     with sqlite3.connect(DB) as c:
         c.execute(
-            "INSERT OR IGNORE INTO dataset_seen
-            (ds_id,
+            "INSERT OR IGNORE INTO dataset_seen "
+            "(ds_id, evaluated_ts, license, downloads, quality_score, schema_branch, cap, slug, verdict, role_tag) "
+            "VALUES (?,?,?,?,?,?,?,?,?,?)",
+            (ds_id, int(time.time()), lic, dl, score, schema, cap, slug, verdict, role_tag)
        )
 
 

@@ -272,8 +277,13 @@ def discover_cycle() -> dict:
     new_queued = 0
     new_rejected = 0
     seen_this_cycle = 0
-
-
+    role_finds: dict[str, int] = {}
+
+    queries = get_queries()
+    log(f"  loaded {len(queries)} role-driven queries (covering {len(set(r for _, r in queries))} role tags)")
+
+    for q, role_tag in queries:
+        url = f"https://huggingface.co/api/datasets?search={urllib.parse.quote(q)}&limit=30&sort=downloads&direction=-1"
        results = hf_get(url, timeout=15) or []
        for ds in results:
            ds_id = ds.get("id", "")

@@ -287,19 +297,35 @@
                  score=entry.get("score", 0.0) if entry else 0.0,
                  schema=entry.get("schema", "") if entry else "",
                  cap=entry.get("cap", 0) if entry else 0,
-                  slug=entry.get("slug", "") if entry else "")
+                  slug=entry.get("slug", "") if entry else "",
+                  role_tag=role_tag)
            if verdict == "integrated":
+                # Tag the entry with role for downstream training-mix balance
+                if entry: entry["role_tag"] = role_tag
                append_dynamic(entry)
                new_integrated += 1
-
+                role_finds[role_tag] = role_finds.get(role_tag, 0) + 1
+                log(f"  ✅ [{role_tag}] {ds_id} | {entry['license']} | {entry['schema']} | cap={entry['cap']:,}")
            elif verdict.startswith("queued"):
                new_queued += 1
            else:
                new_rejected += 1
-            time.sleep(0.
+            time.sleep(0.4)  # gentle on HF API
+
+        # Update query history for this query
+        try:
+            with sqlite3.connect(DB) as c:
+                c.execute(
+                    "INSERT OR REPLACE INTO query_history (query, role_tag, last_run_ts, results_count, new_finds) "
+                    "VALUES (?,?,?,?, COALESCE((SELECT new_finds FROM query_history WHERE query=?),0) + ?)",
+                    (q, role_tag, int(time.time()), len(results), q, new_integrated)
+                )
+        except Exception:
+            pass
 
     return {"evaluated": seen_this_cycle, "integrated": new_integrated,
-            "queued": new_queued, "rejected": new_rejected}
+            "queued": new_queued, "rejected": new_rejected,
+            "by_role": role_finds}
 
 
 def main():
|