Ashira Pitchayapakayakul committed on
Commit 7cbea95 · 1 Parent(s): deb0ea8

feat: RAG fully wired — FTS5 + vector hybrid retrieval, no train needed for now


USER STRATEGY: 'No rush to train yet. Gather data hard for 1-3 months; RAG can carry us in the meantime.'

CRITICAL FIX: We had FTS5 self-ingest BUT never retrieved!
training-pairs were indexed into FTS5 every 15 min, but call_agent
ignored the index. The self-improvement loop was BROKEN — orchestrate
never learned from prior runs. This commit adds the missing wire.

NEW: bin/lib/rag_retrieve.py
- Hybrid retrieval: FTS5 keyword + vector cosine via Ollama nomic-embed
- Reciprocal rank fusion of both rankings (top-3 fused; toy example below)
- 60s cache (avoid repeat queries during multi-stage pipeline)
- Returns markdown 'Similar past work' block for prompt injection
- Empty string if no good matches — no false positives
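
For intuition: RRF just sums 1/(k + rank) across the two ranked lists, with
k=60 here. A toy sketch, not the shipped code; see _fuse() in the diff below:

    def rrf(rankings, k=60):
        scores = {}
        for ranking in rankings:
            for rank, doc in enumerate(ranking):
                scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    fts = ["fix-pytest", "retry-http", "parse-yaml"]   # keyword ranking
    vec = ["retry-http", "fix-pytest", "tune-gc"]      # semantic ranking
    print(rrf([fts, vec])[:3])  # items on both lists float to the top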

NEW: bin/rag-vector-builder.sh (scheduled in start.sh when M % 30 == 12, i.e. every 30 min at :12 and :42)
- Reads new training-pairs since last run
- Embeds via Ollama nomic-embed-text (already in pull list)
- Stores in ~/.surrogate/state/rag-vectors.db (numpy bytes; health check below)
- LRU eviction at 50K vectors (cap, configurable)
- Incremental offset tracking, gentle batches of 500
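
A quick health check for the index (hedged: assumes the `vectors` schema the
builder creates, per the diff below; nomic-embed-text vectors are 768-dim):

    import sqlite3
    import numpy as np
    from pathlib import Path

    db = str(Path.home() / ".surrogate/state/rag-vectors.db")
    with sqlite3.connect(db) as con:
        n = con.execute("SELECT COUNT(*) FROM vectors").fetchone()[0]
        row = con.execute("SELECT embedding FROM vectors LIMIT 1").fetchone()
    dim = len(np.frombuffer(row[0], dtype=np.float32)) if row else 0
    print(f"{n} vectors, dim={dim}")   # expect dim=768 once populated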

WIRED INTO ORCHESTRATE:
- Every stage prompt now includes:
RESEARCH_CONTEXT (web search) +
PRD_CONTEXT (surrogate.md) +
REPO_CONTEXT (tree-sitter symbol map) +
RAG_CONTEXT (NEW: similar past work from FTS+vec; sketch below)
- Total prompt size: 50-130KB (still well within 262k Qwen3-Coder window)
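
From a stage's point of view the injection is plain string concatenation; a
minimal Python sketch of the pattern. build_stage_prompt is illustrative; the
real assembly is the bash interpolation in surrogate-orchestrate.sh below:

    from rag_retrieve import retrieve_similar

    def build_stage_prompt(task, research, prd, repo):
        # retrieve_similar returns "" when nothing good matches, so the
        # RAG section simply vanishes instead of injecting noise
        rag = retrieve_similar(task, top_k=3, max_kb=10)
        return "\n\n".join(s for s in (task, research, prd, repo, rag) if s)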

CADENCE:
- Self-ingest FTS5: every 15 min (pre-existing)
- RAG vec builder: every 30 min (NEW)
- Orchestrate retrieval: every call (cached 60s)

THE LONG-HAUL PLAN (per user strategy):
- Month 1: collect ~100M trainable pairs via 145+ static sources + crawlers
- Month 2: discoverer + crawler grow to ~200M, central dedup keeps unique
- Month 3: evaluate readiness, decide v0 train timing
- Throughout: RAG retrieves from accumulated corpus — no train needed

Surrogate IMPROVES OVER TIME without retrain — every successful orchestrate
run feeds FTS+vec → next run retrieves it as 'similar past work'.

Effective tier even WITHOUT training:
Qwen3-Coder-480B (base) + RAG over a 100M-pair corpus
→ ~75-80% SWE-Bench projected (close to top OSS w/ train)
→ NOT frontier, but tier-1 for niche specialization

bin/hermes-status-server.py CHANGED
@@ -166,7 +166,7 @@ def log_tail(name: str, lines: int = 100) -> PlainTextResponse:
  "auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
  "hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
  "surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
- "qwen-coder", "git-clone", "git-pull", "redis", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed",
+ "qwen-coder", "git-clone", "git-pull", "redis", "rag-vector-builder", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed",
  "ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
  "ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
  }
bin/lib/rag_retrieve.py ADDED
@@ -0,0 +1,189 @@
+"""
+RAG retrieval — query FTS5 + vector index for similar past work, inject as context.
+
+Hybrid retrieval:
+  1. FTS5 keyword match over training-pairs (fast, exact matches)
+  2. Optional: vector semantic via nomic-embed-text over the SQLite vector store
+  3. Reciprocal rank fusion of both → top-K to inject
+
+Usage from orchestrate's call_agent BEFORE the LLM call:
+    from rag_retrieve import retrieve_similar
+    context = retrieve_similar(prompt, top_k=3, max_kb=10)
+    # inject `context` into prompt as 'Similar past work:'
+
+Cache hits within a 60s window (avoids repeat queries during the multi-stage pipeline).
+"""
+from __future__ import annotations
+import concurrent.futures
+import hashlib
+import json
+import os
+import re
+import sqlite3
+import sys
+import time
+import urllib.request
+from pathlib import Path
+
+HOME = Path(os.environ.get("HOME", "/home/hermes"))
+FTS_DB = HOME / ".surrogate/state/self-ingest.db"
+VEC_DB = HOME / ".surrogate/state/rag-vectors.db"
+CACHE_DIR = HOME / ".surrogate/state/rag-cache"
+CACHE_DIR.mkdir(parents=True, exist_ok=True)
+
+OLLAMA_EMBED_URL = "http://127.0.0.1:11434/api/embeddings"
+EMBED_MODEL = "nomic-embed-text"
+
+
+def _cache_get(key: str) -> str | None:
+    cf = CACHE_DIR / f"{key}.txt"
+    if cf.exists() and (time.time() - cf.stat().st_mtime) < 60:
+        return cf.read_text()
+    return None
+
+
+def _cache_put(key: str, value: str) -> None:
+    cf = CACHE_DIR / f"{key}.txt"
+    cf.write_text(value)
+
+
+def _hash_key(query: str, top_k: int) -> str:
+    return hashlib.md5(f"{query[:500]}|{top_k}".encode()).hexdigest()[:12]
+
+
+def _fts_search(query: str, top_k: int = 5) -> list[tuple[str, str, float, str]]:
+    """Returns [(prompt, response, score, source), ...] from the FTS5 index."""
+    if not FTS_DB.exists():
+        return []
+    # Sanitize query for FTS5 — extract keywords, drop stopwords
+    words = re.findall(r'\b[a-zA-Z][a-zA-Z0-9_-]{2,}\b', query)
+    stop = {"the", "and", "for", "with", "from", "this", "that", "what",
+            "when", "where", "how", "why", "which", "into", "your"}
+    keywords = [w for w in words if w.lower() not in stop][:10]
+    if not keywords:
+        return []
+    fts_query = " OR ".join(f'"{kw}"' for kw in keywords)
+
+    try:
+        with sqlite3.connect(str(FTS_DB), timeout=3) as c:
+            rows = c.execute(
+                "SELECT prompt, response, rank, source FROM pairs "
+                "WHERE pairs MATCH ? "
+                "ORDER BY rank LIMIT ?",
+                (fts_query, top_k * 2),
+            ).fetchall()
+        # FTS5 rank: more negative = better match; negate so higher = better
+        return [(r[0], r[1], -float(r[2]), r[3]) for r in rows[:top_k]]
+    except Exception as e:
+        print(f"FTS error: {e}", file=sys.stderr)
+        return []
+
+
+def _embed_query(text: str) -> list[float] | None:
+    """Get an embedding for a query via Ollama nomic-embed-text."""
+    try:
+        body = json.dumps({"model": EMBED_MODEL, "prompt": text[:2000]}).encode()
+        req = urllib.request.Request(OLLAMA_EMBED_URL, data=body,
+                                     headers={"Content-Type": "application/json"})
+        with urllib.request.urlopen(req, timeout=8) as r:
+            return json.load(r).get("embedding") or None
+    except Exception:
+        return None
+
+
+def _vec_search(query_vec: list[float], top_k: int = 5) -> list[tuple[str, str, float, str]]:
+    """Brute-force cosine search over the SQLite vector store (numpy, no sqlite-vec)."""
+    if not VEC_DB.exists() or not query_vec:
+        return []
+    try:
+        import numpy as np
+        with sqlite3.connect(str(VEC_DB), timeout=3) as c:
+            rows = c.execute(
+                "SELECT prompt, response, embedding, source FROM vectors LIMIT 50000"
+            ).fetchall()
+        if not rows:
+            return []
+        q = np.array(query_vec, dtype=np.float32)
+        q /= (np.linalg.norm(q) + 1e-9)
+        scored: list[tuple[str, str, float, str]] = []
+        for prompt, response, emb_blob, src in rows:
+            emb = np.frombuffer(emb_blob, dtype=np.float32)
+            if emb.shape[0] != q.shape[0]:
+                continue
+            cos = float(np.dot(q, emb / (np.linalg.norm(emb) + 1e-9)))
+            scored.append((prompt, response, cos, src))
+        scored.sort(key=lambda x: -x[2])
+        return scored[:top_k]
+    except Exception as e:
+        print(f"Vec search err: {e}", file=sys.stderr)
+        return []
+
+
+def _fuse(fts: list, vec: list, top_k: int = 3) -> list[tuple[str, str, str, float]]:
+    """Reciprocal rank fusion — combine the FTS and vector rankings."""
+    seen: dict[str, dict] = {}
+    for results in (fts, vec):
+        for rank, (prompt, response, _, src) in enumerate(results):
+            key = prompt[:100]
+            entry = seen.setdefault(key, {"prompt": prompt, "response": response,
+                                          "source": src, "rrf": 0.0})
+            entry["rrf"] += 1.0 / (60 + rank)
+    ranked = sorted(seen.values(), key=lambda x: -x["rrf"])
+    return [(r["prompt"], r["response"], r["source"], r["rrf"]) for r in ranked[:top_k]]
+
+
+def retrieve_similar(query: str, top_k: int = 3, max_kb: int = 10) -> str:
+    """Returns a markdown-formatted 'Similar past work' block to inject in the prompt.
+    Empty string if no good matches."""
+    if not query or len(query) < 30:
+        return ""
+    cache_key = _hash_key(query, top_k)
+    cached = _cache_get(cache_key)
+    if cached is not None:
+        return cached
+
+    # Run both retrievals in parallel (best-effort)
+    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as ex:
+        fts_fut = ex.submit(_fts_search, query, top_k)
+        # Vec retrieval optional — only if Ollama embeddings are available
+        vec_fut = ex.submit(lambda: _vec_search(_embed_query(query) or [], top_k))
+        try:
+            fts_results = fts_fut.result(timeout=5)
+        except Exception:
+            fts_results = []
+        try:
+            vec_results = vec_fut.result(timeout=10)
+        except Exception:
+            vec_results = []
+
+    fused = _fuse(fts_results, vec_results, top_k)
+    if not fused:
+        _cache_put(cache_key, "")
+        return ""
+
+    out_parts = ["### Similar past work (from training-pairs.jsonl):\n"]
+    budget = max_kb * 1024
+    for i, (p, r, src, score) in enumerate(fused, 1):
+        chunk = f"\n#### Match {i} (source: {src}, score: {score:.3f})\n"
+        chunk += f"**Q:** {p[:600]}\n"
+        chunk += f"**A:** {r[:1200]}\n"
+        if len(chunk) > budget:
+            break
+        out_parts.append(chunk)
+        budget -= len(chunk)
+
+    out = "".join(out_parts)
+    _cache_put(cache_key, out)
+    return out
+
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print("usage: rag_retrieve.py <query>", file=sys.stderr)
+        sys.exit(2)
+    q = " ".join(sys.argv[1:])
+    print(retrieve_similar(q, top_k=3))
bin/rag-vector-builder.sh ADDED
@@ -0,0 +1,111 @@
+#!/usr/bin/env bash
+# RAG vector index builder — runs every 30 min, embeds new training pairs.
+#
+# Reads training-pairs.jsonl, embeds prompts via Ollama nomic-embed-text,
+# stores in ~/.surrogate/state/rag-vectors.db (SQLite + numpy bytes).
+#
+# Incremental: tracks an offset, only embeds NEW pairs since the last run.
+# Caps at 50K vectors total (LRU eviction by ts) to keep the index small + fast.
+set -uo pipefail
+set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
+
+LOG="$HOME/.surrogate/logs/rag-vector-builder.log"
+SRC="$HOME/.surrogate/training-pairs.jsonl"
+DB="$HOME/.surrogate/state/rag-vectors.db"
+OFFSET_FILE="$HOME/.surrogate/.rag-vec-offset"
+MAX_VECTORS="${RAG_MAX_VECTORS:-50000}"
+BATCH_SIZE="${RAG_BATCH:-500}"
+mkdir -p "$(dirname "$LOG")" "$(dirname "$DB")"
+
+[[ ! -f "$SRC" ]] && { echo "[$(date +%H:%M:%S)] no source" | tee -a "$LOG"; exit 0; }
+
+# Wait for the Ollama embed model to be available
+for i in 1 2 3 4 5; do
+    if curl -sS --max-time 3 http://127.0.0.1:11434/api/tags 2>/dev/null | grep -q "nomic-embed-text"; then
+        break
+    fi
+    [[ $i -eq 5 ]] && { echo "[$(date +%H:%M:%S)] nomic-embed-text not loaded — skip" | tee -a "$LOG"; exit 0; }
+    sleep 5
+done
+
+CUR=$(wc -l < "$SRC" | tr -d ' ')
+PREV=$(cat "$OFFSET_FILE" 2>/dev/null || echo 0)
+NEW=$(( CUR - PREV ))
+[[ $NEW -le 0 ]] && { echo "[$(date +%H:%M:%S)] no new pairs (offset=$PREV total=$CUR)" >> "$LOG"; exit 0; }
+
+# Process at most BATCH_SIZE per run (gentle on Ollama)
+TAKE=$NEW
+[[ $TAKE -gt $BATCH_SIZE ]] && TAKE=$BATCH_SIZE
+echo "[$(date +%H:%M:%S)] embedding $TAKE / $NEW pairs" | tee -a "$LOG"
+
+# The heredoc below occupies python3's stdin (it IS the script), so the slice
+# of new pairs goes through a temp file instead of a pipe.
+SLICE=$(mktemp)
+sed -n "$((PREV + 1)),$((PREV + TAKE))p" "$SRC" > "$SLICE"
+python3 - "$DB" "$MAX_VECTORS" "$SLICE" >> "$LOG" 2>&1 <<'PYEOF'
+import sys, json, sqlite3, urllib.request, time, hashlib
+import numpy as np
+
+db, max_vec, slice_path = sys.argv[1], int(sys.argv[2]), sys.argv[3]
+con = sqlite3.connect(db, timeout=10)
+con.execute("""
+CREATE TABLE IF NOT EXISTS vectors (
+    hash TEXT PRIMARY KEY,
+    prompt TEXT NOT NULL,
+    response TEXT NOT NULL,
+    embedding BLOB NOT NULL,
+    source TEXT,
+    ts INTEGER NOT NULL
+)""")
+con.execute("CREATE INDEX IF NOT EXISTS idx_ts ON vectors(ts)")
+
+embedded = skipped = errs = 0
+
+def embed(text: str):
+    body = json.dumps({"model": "nomic-embed-text", "prompt": text[:2000]}).encode()
+    req = urllib.request.Request("http://127.0.0.1:11434/api/embeddings",
+                                 data=body, headers={"Content-Type": "application/json"})
+    with urllib.request.urlopen(req, timeout=15) as r:
+        emb = json.load(r).get("embedding")
+    if not emb:
+        return None
+    return np.array(emb, dtype=np.float32).tobytes()
+
+for line in open(slice_path, encoding="utf-8"):
+    try:
+        d = json.loads(line)
+    except Exception:
+        skipped += 1; continue
+    p = (d.get("prompt") or d.get("instruction") or "")[:2000]
+    r = (d.get("response") or d.get("output") or "")[:6000]
+    if not p or len(p) < 30:
+        skipped += 1; continue
+    src = d.get("source", "?")
+    ts = int(d.get("ts", time.time()))
+
+    h = hashlib.md5(p[:500].encode()).hexdigest()[:16]
+    if con.execute("SELECT 1 FROM vectors WHERE hash=?", (h,)).fetchone():
+        skipped += 1; continue
+
+    try:
+        emb_bytes = embed(p)
+        if emb_bytes is None:
+            errs += 1; continue
+        con.execute("INSERT OR IGNORE INTO vectors VALUES (?,?,?,?,?,?)",
+                    (h, p, r, emb_bytes, src, ts))
+        embedded += 1
+    except Exception:
+        errs += 1
+        if errs > 10:
+            break  # Ollama down
+
+con.commit()
+
+# LRU eviction if over cap
+total = con.execute("SELECT COUNT(*) FROM vectors").fetchone()[0]
+if total > max_vec:
+    excess = total - max_vec
+    con.execute("DELETE FROM vectors WHERE hash IN "
+                "(SELECT hash FROM vectors ORDER BY ts ASC LIMIT ?)", (excess,))
+    con.commit()
+    print(f"  LRU evicted {excess} oldest vectors (cap={max_vec})")
+
+print(f"  embedded={embedded} skipped={skipped} errs={errs} total={total}")
+PYEOF
+rm -f "$SLICE"
+
+NEW_OFFSET=$(( PREV + TAKE ))
+echo "$NEW_OFFSET" > "$OFFSET_FILE"
+echo "[$(date +%H:%M:%S)] vector batch done · offset → $NEW_OFFSET" | tee -a "$LOG"
bin/surrogate-orchestrate.sh CHANGED
@@ -111,6 +111,7 @@ $prompt
 ${RESEARCH_CONTEXT}
 ${PRD_CONTEXT}
 ${REPO_CONTEXT}
+${RAG_CONTEXT}
 
 === Working context ===
 CWD: $(pwd)
start.sh CHANGED
@@ -274,8 +274,10 @@ while true; do
 # Every 60 min: dataset enrich (pulls fresh public datasets, dedups, uploads to HF)
 # (was 4h — accelerated to drain 96-dataset queue ASAP per user request)
 [[ $((M % 60)) -eq 5 ]] && bash ~/.surrogate/bin/dataset-enrich.sh >> "$LOG" 2>&1 &
-# Every 15 min: self-ingest training-pairs into FTS index (closes the self-improvement loop)
+# Every 15 min: self-ingest training-pairs into FTS index (closes self-improvement)
 [[ $((M % 15)) -eq 0 ]] && bash ~/.surrogate/bin/surrogate-self-ingest.sh >> "$LOG" 2>&1 &
+# Every 30 min: build vector embeddings index (RAG semantic search)
+[[ $((M % 30)) -eq 12 ]] && bash ~/.surrogate/bin/rag-vector-builder.sh >> "$LOG" 2>&1 &
 # Every 30 min: synthetic data generation (REWORK→APPROVE DPO + distilabel rewrite)
 [[ $((M % 30)) -eq 7 ]] && bash ~/.surrogate/bin/synthetic-data-from-rework.sh >> "$LOG" 2>&1 &
 # Daily 04:00 UTC: refresh CVE feed (NVD + CISA KEV) → security-knowledge dataset
  # Daily 04:00 UTC: refresh CVE feed (NVD + CISA KEV) → security-knowledge dataset