Ashira Pitchayapakayakul committed on
Commit
c0c6fe0
·
1 Parent(s): 836b22a

feat: expand-role-keywords daemon — LLM-expanded SDLC keywords

Browse files

User: 'เธอไป research job description ทุก role มา แล้วดูว่าเค้าต้องหา
ความรู้อะไรบ้าง' (Thai: 'Go research job descriptions for every role and
see what knowledge they look for'). Implemented as a daily 06:00 UTC cron
that for each
of the 17 SDLC roles in role-knowledge-map.json:

1. Reads existing core + adjacent skills
2. Sends them to Cerebras (qwen-3-235b) with the prompt
'You are a senior tech recruiter who reads thousands of job
descriptions. Output 80 keyword phrases this role's JD would
mention.'
3. Cleans + dedups the response
4. Merges into role.expanded list

Discoverer reads three lists per role now (core / adjacent / expanded).
17 roles x 80 expanded keywords = up to 1,360 new search queries the
discoverer will fire next cycle, each landing on a fresh slice of HF
hub that we hadn't searched before.

Falls through Cerebras → Groq → OpenRouter on a per-role basis. Failure
of one role doesn't block the others — bad responses just leave the
'expanded' list empty for that role until the next run.

Combined effect of round-5 + cursor + expand-keywords:
- 30+ new dataset entries in static list (round-5)
- Cursor service stops re-pulling row 0 (stamp-and-move)
- Discoverer auto-finds 1.3K+ new role-specific datasets weekly

bin/expand-role-keywords.py ADDED
@@ -0,0 +1,164 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ One-shot keyword expander — uses Cerebras (or fallback) to expand each
4
+ SDLC role's core/adjacent skills into 100+ specific HF dataset search
5
+ keywords. Output is written back to role-knowledge-map.json under a new
6
+ "expanded" key per role.
7
+
8
+ Idempotent — re-running just refreshes "expanded" keywords. Existing
9
+ core/adjacent are untouched.
10
+
11
+ Run from cron weekly (or manually). Discoverer auto-reads the map on
12
+ its next cycle and fires search queries for the expanded list.
13
+
14
+ Usage: python expand-role-keywords.py
15
+ """
16
+ from __future__ import annotations
17
+
18
+ import json
19
+ import os
20
+ import sys
21
+ import time
22
+ import urllib.request
23
+ import urllib.error
24
+ from pathlib import Path
25
+
26
# Location of the role → skills map this script reads and rewrites.
ROLE_MAP_PATH = Path.home() / ".surrogate/agents/role-knowledge-map.json"

# OpenAI-compatible chat-completion endpoints, tried strictly in order.
# A provider is only attempted when its API key is present in the env.
PROVIDERS = [
    {
        "name": "cerebras",
        "url": "https://api.cerebras.ai/v1/chat/completions",
        "key_env": "CEREBRAS_API_KEY",
        "model": "qwen-3-235b-a22b-instruct-2507",
    },
    {
        "name": "groq",
        "url": "https://api.groq.com/openai/v1/chat/completions",
        "key_env": "GROQ_API_KEY",
        "model": "llama-3.3-70b-versatile",
    },
    {
        "name": "openrouter",
        "url": "https://openrouter.ai/api/v1/chat/completions",
        "key_env": "OPENROUTER_API_KEY",
        "model": "tencent/hy3-preview:free",
    },
]


def call_llm(prompt: str, timeout: int = 90) -> str | None:
    """Send *prompt* to the first provider that answers.

    Iterates PROVIDERS in order; providers without an API key in the
    environment are skipped, and any request/parse failure falls through
    to the next one. Returns the assistant message content on success,
    or None when every provider fails (caller treats that as a no-op).
    """
    for provider in PROVIDERS:
        api_key = os.environ.get(provider["key_env"], "").strip()
        if not api_key:
            continue
        payload = {
            "model": provider["model"],
            "messages": [
                {"role": "system",
                 "content": "You are a senior tech recruiter who reads thousands of job descriptions. Output clean comma-separated keyword lists, no prose."},
                {"role": "user", "content": prompt},
            ],
            "max_tokens": 1500,
            "temperature": 0.4,
        }
        request = urllib.request.Request(
            provider["url"],
            data=json.dumps(payload).encode(),
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
                "User-Agent": "Mozilla/5.0 surrogate-1/expand-keywords",
            },
            method="POST",
        )
        try:
            with urllib.request.urlopen(request, timeout=timeout) as resp:
                parsed = json.loads(resp.read())
            # Defensive: tolerate an empty/missing "choices" array.
            first_choice = (parsed.get("choices") or [{}])[0]
            text = first_choice.get("message", {}).get("content", "").strip()
            if text:
                print(f" [{provider['name']}] ok ({len(text)} chars)", flush=True)
                return text
        except Exception as exc:  # any failure → try next provider
            print(f" [{provider['name']}] err: {type(exc).__name__}: {str(exc)[:80]}", flush=True)
            continue
    return None
86
+
87
+
88
+ def expand_role(role_name: str, role_def: dict) -> list[str]:
89
+ core = role_def.get("core", [])
90
+ adjacent = role_def.get("adjacent", [])
91
+ prompt = f"""Role: {role_name}
92
+
93
+ Existing core skills: {', '.join(core)}
94
+ Adjacent skills: {', '.join(adjacent)}
95
+
96
+ Task: Output exactly 80 highly specific keyword phrases (3-6 words each) that this role's job description would mention. Focus on:
97
+ - specific frameworks, tools, libraries by name
98
+ - concrete certifications and standards (CKA, AWS SAA, ISO 27001, etc.)
99
+ - specific design patterns and methodologies
100
+ - production-grade vocabulary used by senior engineers
101
+ - emerging 2025-2026 tech in this domain
102
+
103
+ Output: comma-separated list. NO numbering. NO categories. NO explanatory text. Just keywords."""
104
+
105
+ response = call_llm(prompt)
106
+ if not response:
107
+ return []
108
+
109
+ # Parse comma-separated keywords, strip noise
110
+ kws = []
111
+ for piece in response.replace(";", ",").split(","):
112
+ kw = piece.strip().strip(".\"'`*-•").strip()
113
+ # remove leading numbers like "1. " or "1) "
114
+ if kw and kw[0].isdigit():
115
+ for sep in (". ", ") ", "- "):
116
+ if sep in kw[:5]:
117
+ kw = kw.split(sep, 1)[1].strip()
118
+ break
119
+ if 3 <= len(kw) <= 80 and any(c.isalpha() for c in kw):
120
+ kws.append(kw.lower())
121
+
122
+ # Dedup keep order
123
+ seen = set()
124
+ deduped = []
125
+ for k in kws:
126
+ if k not in seen:
127
+ seen.add(k)
128
+ deduped.append(k)
129
+ return deduped[:80]
130
+
131
+
132
def main() -> None:
    """Refresh the "expanded" keyword list for every role in the map.

    Loads role-knowledge-map.json, asks the LLM for extra keywords per
    role, merges them as a sorted union with any previous "expanded"
    list (so re-runs only refresh, never lose, keywords), and writes
    the whole map back in place. Roles whose expansion fails are left
    untouched and do not block the others.
    """
    if not ROLE_MAP_PATH.exists():
        sys.exit(f"role-knowledge-map.json not found at {ROLE_MAP_PATH}")

    data = json.loads(ROLE_MAP_PATH.read_text())
    roles = data.get("roles", {})
    if not roles:
        sys.exit("no roles in map")

    total_added = 0
    for role_name, role_def in roles.items():
        prev = role_def.get("expanded", [])
        n_core = len(role_def.get("core", []))
        n_adj = len(role_def.get("adjacent", []))
        print(f"\n▶ {role_name} (existing core={n_core} adjacent={n_adj} expanded={len(prev)})", flush=True)

        fresh = expand_role(role_name, role_def)
        if not fresh:
            print(" (no expansion — all providers failed)", flush=True)
            continue

        # Union with whatever was already expanded; sorted for stable diffs.
        role_def["expanded"] = sorted(set(prev) | set(fresh))
        gained = len(role_def["expanded"]) - len(prev)
        total_added += gained
        print(f" +{gained} keywords (total expanded={len(role_def['expanded'])})", flush=True)
        time.sleep(2)  # gentle rate-limit between roles

    # Persist the updated map in place.
    ROLE_MAP_PATH.write_text(json.dumps(data, indent=2, ensure_ascii=False))
    print(f"\n✅ wrote {ROLE_MAP_PATH} — added {total_added} new keywords across {len(roles)} roles")


if __name__ == "__main__":
    main()
bin/hermes-status-server.py CHANGED
@@ -167,7 +167,7 @@ def log_tail(name: str, lines: int = 100) -> PlainTextResponse:
167
  "auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
168
  "hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
169
  "surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
170
- "qwen-coder", "git-clone", "git-pull", "redis", "parquet-direct-ingest", "bulk-ingest-parallel", "rag-vector-builder", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed", "self-heal-watchdog", "gh-actions-ticker", "llm-burst-generator",
171
  "ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
172
  "ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
173
  }
 
167
  "auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
168
  "hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
169
  "surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
170
+ "qwen-coder", "git-clone", "git-pull", "redis", "parquet-direct-ingest", "bulk-ingest-parallel", "rag-vector-builder", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed", "self-heal-watchdog", "gh-actions-ticker", "llm-burst-generator", "expand-role-keywords",
171
  "ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
172
  "ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
173
  }
bin/hf-dataset-discoverer.py CHANGED
@@ -48,6 +48,11 @@ def _load_role_queries() -> list[tuple[str, str]]:
48
  queries.append((q, f"{role}-core"))
49
  for q in (skills.get("adjacent") or []):
50
  queries.append((q, f"{role}-adj"))
 
 
 
 
 
51
  for q in data.get("cross_cutting_topics") or []:
52
  queries.append((q, "cross-cutting"))
53
  # Plus baseline queries (NEVER static — discoverer must keep finding)
 
48
  queries.append((q, f"{role}-core"))
49
  for q in (skills.get("adjacent") or []):
50
  queries.append((q, f"{role}-adj"))
51
+ # NEW: LLM-expanded keywords from real job-description research
52
+ # (filled by expand-role-keywords.py running weekly via cron).
53
+ # 80 keywords per role x 17 roles = up to 1,360 extra search terms.
54
+ for q in (skills.get("expanded") or []):
55
+ queries.append((q, f"{role}-exp"))
56
  for q in data.get("cross_cutting_topics") or []:
57
  queries.append((q, "cross-cutting"))
58
  # Plus baseline queries (NEVER static — discoverer must keep finding)
start.sh CHANGED
@@ -333,6 +333,10 @@ while true; do
333
  [[ $((M % 1440)) -eq 240 ]] && bash ~/.surrogate/bin/refresh-cve-feed.sh >> "$LOG" 2>&1 &
334
  # Daily 05:00 UTC: scrape SRE postmortems (danluu list + awesome-tech-postmortems)
335
  [[ $((M % 1440)) -eq 300 ]] && bash ~/.surrogate/bin/scrape-sre-postmortems.sh >> "$LOG" 2>&1 &
 
 
 
 
336
  sleep 60
337
  done
338
  CRONSH
 
333
  [[ $((M % 1440)) -eq 240 ]] && bash ~/.surrogate/bin/refresh-cve-feed.sh >> "$LOG" 2>&1 &
334
  # Daily 05:00 UTC: scrape SRE postmortems (danluu list + awesome-tech-postmortems)
335
  [[ $((M % 1440)) -eq 300 ]] && bash ~/.surrogate/bin/scrape-sre-postmortems.sh >> "$LOG" 2>&1 &
336
+ # Daily 06:00 UTC: LLM-expand role keywords (sends each role's skills to
337
+ # Cerebras/Groq → +80 specific job-description-style search terms each).
338
+ # Discoverer auto-uses the expanded list on its next cycle.
339
+ [[ $((M % 1440)) -eq 360 ]] && python3 ~/.surrogate/bin/expand-role-keywords.py >> "$LOG_DIR/expand-role-keywords.log" 2>&1 &
340
  sleep 60
341
  done
342
  CRONSH