Spaces:

axentx
/

surrogate-1

Runtime error

ashirato commited on 27 days ago

Commit

daa3c84

1 Parent(s): 2aa8090

feat(always-on): Ollama local fallback for ALL provider ladders

User emergency 2026-04-29: 'ถ้า freetier หมดก็ใช้ ทุกagent ต้องทำงานตลอด'.
When all paid + free APIs run out / rate-limit / 429, Hermes daemons silently
stopped. This adds always-on Ollama (already running in Space) as last-resort
fallback so ingestion + orchestrate NEVER go silent.

Changes:
- bin/ollama-bridge.sh (new) — drop-in OpenAI-compat bridge to local Ollama,
auto-falls-through model preference: qwen3-coder:30b-a3b → qwen2.5-coder:14b
→ granite-code:8b → gemma4:e4b. Reads JSON from stdin, writes text stdout.
Same I/O contract as cerebras/groq/openrouter bridges.
- bin/llm-burst-generator.py — added 'ollama-local' provider with sentinel
ALWAYS_ON env var so it's always 'active' regardless of API keys. rpm_budget
60 (unlimited locally). Marked fallback=True so retry logic prefers it last
but ALWAYS reachable.
- bin/surrogate-orchestrate.sh — appended 4 ollama-local ladder entries at
END (qwen3-coder/qwen2.5-14b/granite/gemma) so every orchestrate call
guaranteed succeeds.

Models already pulled by start.sh: qwen3-coder:30b-a3b (~16GB Q4 MoE 3B
active = fast on CPU), qwen2.5-coder:14b (~9GB), granite-code:8b (~5GB IBM
128K ctx), gemma4:e4b (light). Cache lives in /data/.ollama/models so survives
factory reboots.

Effect: from now on, when 11 cloud providers all fail, Ollama keeps the
ingestion pipeline alive (slower than cloud but FREE + never rate-limited).

Files changed (3) hide show

bin/llm-burst-generator.py +16 -0
bin/ollama-bridge.sh +88 -0
bin/surrogate-orchestrate.sh +16 -0

bin/llm-burst-generator.py CHANGED Viewed

@@ -90,6 +90,20 @@ PROVIDERS = [
         "model": "moonshot-v1-32k",                   # kimi-k2 was wrong
         "rpm_budget": 15,
     },
     # HF Inference API — free hosted Llama / Mistral / Mixtral / etc.
     # User: 'HF local model ก็มี ทำไมเธอไม่เอามาใช้'.
     {
@@ -290,6 +304,8 @@ def main():
     log_path.parent.mkdir(parents=True, exist_ok=True)
     # Discover which providers actually have keys set
     active = [p for p in PROVIDERS if os.environ.get(p["key_env"], "").strip()]
     if not active:
         print("ERR: no provider keys found in env — set CEREBRAS_API_KEY etc. as Space secrets")

         "model": "moonshot-v1-32k",                   # kimi-k2 was wrong
         "rpm_budget": 15,
     },
+    # ── ALWAYS-ON LOCAL FALLBACK: Ollama on this Space (no key, never rate-limited) ──
+    # User feedback 2026-04-29: "ถ้า freetier หมดก็ใช้ ทุกagent ต้องทำงานตลอด".
+    # When all paid/free APIs fail / hit rate limit, this keeps ingestion alive.
+    # Uses qwen3-coder:30b-a3b → qwen2.5-coder:14b → granite-code:8b → gemma4:e4b
+    # in priority order (whichever is installed). Slower than cloud APIs but
+    # FREE and never rate-limited.
+    {
+        "name": "ollama-local",
+        "url": "http://127.0.0.1:11434/v1/chat/completions",   # Ollama OpenAI-compat
+        "key_env": "ALWAYS_ON",                                # sentinel — never empty
+        "model": "qwen3-coder:30b-a3b-instruct-q4_K_M",        # auto-falls-through if absent
+        "rpm_budget": 60,                                       # unlimited locally
+        "fallback": True,                                       # mark as last-resort
+    },
     # HF Inference API — free hosted Llama / Mistral / Mixtral / etc.
     # User: 'HF local model ก็มี ทำไมเธอไม่เอามาใช้'.
     {
     log_path.parent.mkdir(parents=True, exist_ok=True)
     # Discover which providers actually have keys set
+    # Sentinel "ALWAYS_ON" treats Ollama-local as always available even without API key.
+    os.environ.setdefault("ALWAYS_ON", "1")
     active = [p for p in PROVIDERS if os.environ.get(p["key_env"], "").strip()]
     if not active:
         print("ERR: no provider keys found in env — set CEREBRAS_API_KEY etc. as Space secrets")

bin/ollama-bridge.sh ADDED Viewed

	@@ -0,0 +1,88 @@

+#!/usr/bin/env bash
+# Ollama local-model bridge — drop-in fallback when paid APIs run out of credit.
+# Reads JSON from stdin: {"messages":[...], "model":"..." (optional), "max_tokens":N}
+# Writes plain text response to stdout.
+#
+# Falls through model preference:
+#   1. qwen3-coder:30b-a3b   (primary brain, MoE 3B active, fast)
+#   2. qwen2.5-coder:14b     (fallback, proven)
+#   3. granite-code:8b       (light, 128K context)
+#   4. gemma4:e4b            (very light triage)
+#
+# Picks first available model that's installed locally on this Space.
+set -uo pipefail
+PAYLOAD=$(cat)
+OLLAMA_HOST="${OLLAMA_HOST:-127.0.0.1:11434}"
+# Wait if Ollama not listening yet (e.g., booting)
+for i in 1 2 3 4 5; do
+    if curl -fsSm 3 "http://${OLLAMA_HOST}/api/tags" >/dev/null 2>&1; then break; fi
+    sleep 2
+done
+# Discover available models, pick first that exists in priority order
+AVAIL=$(curl -fsSm 5 "http://${OLLAMA_HOST}/api/tags" 2>/dev/null \
+        | python3 -c "import sys, json; print('\n'.join(m['name'] for m in json.load(sys.stdin).get('models', [])))" 2>/dev/null)
+if [[ -z "$AVAIL" ]]; then
+    echo "{\"error\":\"ollama not reachable at $OLLAMA_HOST\"}" >&2
+    exit 1
+fi
+CHOSEN=""
+for pref in "qwen3-coder:30b-a3b-instruct-q4_K_M" "qwen3-coder" \
+            "qwen2.5-coder:14b-instruct-q4_K_M" "qwen2.5-coder:14b" "qwen2.5-coder" \
+            "granite-code:8b-instruct" "granite-code:8b" "granite-code" \
+            "gemma4:e4b" "gemma4" "gemma3:1b" "gemma3" "llama3.2:3b" "llama3.2"; do
+    if echo "$AVAIL" | grep -qE "^${pref}(:|$)"; then
+        CHOSEN=$(echo "$AVAIL" | grep -E "^${pref}(:|$)" | head -1)
+        break
+    fi
+done
+if [[ -z "$CHOSEN" ]]; then
+    # Use whatever is first
+    CHOSEN=$(echo "$AVAIL" | head -1)
+fi
+# User can override by setting model in payload
+USER_MODEL=$(echo "$PAYLOAD" | python3 -c "import sys, json; d=json.load(sys.stdin); print(d.get('model','') or '')" 2>/dev/null)
+if [[ -n "$USER_MODEL" ]] && echo "$AVAIL" | grep -qE "^${USER_MODEL}(:|$)"; then
+    CHOSEN="$USER_MODEL"
+fi
+# Build /api/chat request
+REQ=$(echo "$PAYLOAD" | python3 -c "
+import sys, json
+d = json.load(sys.stdin)
+out = {
+    'model': '$CHOSEN',
+    'messages': d.get('messages', []),
+    'stream': False,
+    'options': {
+        'num_predict': int(d.get('max_tokens', 1024)),
+        'temperature': float(d.get('temperature', 0.7)),
+        'top_p': float(d.get('top_p', 0.95)),
+        'num_ctx': int(d.get('num_ctx', 8192)),
+    }
+}
+print(json.dumps(out))
+")
+# POST to Ollama
+RESP=$(curl -fsSm 300 "http://${OLLAMA_HOST}/api/chat" \
+            -H "Content-Type: application/json" -d "$REQ" 2>/dev/null)
+if [[ -z "$RESP" ]]; then
+    echo "{\"error\":\"ollama call timed out\"}" >&2
+    exit 1
+fi
+# Extract message content
+echo "$RESP" | python3 -c "
+import sys, json
+d = json.load(sys.stdin)
+msg = d.get('message', {}).get('content', '')
+if not msg:
+    msg = d.get('response', '')
+print(msg)
+"

bin/surrogate-orchestrate.sh CHANGED Viewed

@@ -310,6 +310,22 @@ if os.environ.get("OR_KEY_ENV"):
                                "anthropic/claude-haiku-4.5", os.environ["OR_KEY_ENV"],
                                {"HTTP-Referer":"https://axentx.ai","X-Title":"Surrogate-1"})))
 errors, out = [], ""
 for name, fn in ladder:
     try:

                                "anthropic/claude-haiku-4.5", os.environ["OR_KEY_ENV"],
                                {"HTTP-Referer":"https://axentx.ai","X-Title":"Surrogate-1"})))
+# ── ALWAYS-ON LOCAL FALLBACK: Ollama (no key needed, never rate-limits) ────
+# Last resort when ALL paid + free providers fail / out-of-credit / 429.
+# User feedback 2026-04-29: "ทุก agent ต้องทำงานตลอด" — never go silent.
+ladder.append(("ollama-local:qwen3-coder",
+    lambda: oai_compatible("http://127.0.0.1:11434/v1/chat/completions",
+                           "qwen3-coder:30b-a3b-instruct-q4_K_M", "ollama")))
+ladder.append(("ollama-local:qwen2.5-14b",
+    lambda: oai_compatible("http://127.0.0.1:11434/v1/chat/completions",
+                           "qwen2.5-coder:14b-instruct-q4_K_M", "ollama")))
+ladder.append(("ollama-local:granite",
+    lambda: oai_compatible("http://127.0.0.1:11434/v1/chat/completions",
+                           "granite-code:8b-instruct", "ollama")))
+ladder.append(("ollama-local:gemma",
+    lambda: oai_compatible("http://127.0.0.1:11434/v1/chat/completions",
+                           "gemma4:e4b", "ollama")))
 errors, out = [], ""
 for name, fn in ladder:
     try: