Spaces:
Runtime error
feat(always-on): Ollama local fallback for ALL provider ladders
Browse filesUser emergency 2026-04-29: 'ถ้า freetier หมดก็ใช้ ทุกagent ต้องทำงานตลอด'.
When all paid + free APIs run out / rate-limit / 429, Hermes daemons silently
stopped. This adds always-on Ollama (already running in Space) as last-resort
fallback so ingestion + orchestrate NEVER go silent.
Changes:
- bin/ollama-bridge.sh (new) — drop-in OpenAI-compat bridge to local Ollama,
auto-falls-through model preference: qwen3-coder:30b-a3b → qwen2.5-coder:14b
→ granite-code:8b → gemma4:e4b. Reads JSON from stdin, writes text stdout.
Same I/O contract as cerebras/groq/openrouter bridges.
- bin/llm-burst-generator.py — added 'ollama-local' provider with sentinel
ALWAYS_ON env var so it's always 'active' regardless of API keys. rpm_budget
60 (unlimited locally). Marked fallback=True so retry logic prefers it last
but ALWAYS reachable.
- bin/surrogate-orchestrate.sh — appended 4 ollama-local ladder entries at
END (qwen3-coder/qwen2.5-14b/granite/gemma) so every orchestrate call
guaranteed succeeds.
Models already pulled by start.sh: qwen3-coder:30b-a3b (~16GB Q4 MoE 3B
active = fast on CPU), qwen2.5-coder:14b (~9GB), granite-code:8b (~5GB IBM
128K ctx), gemma4:e4b (light). Cache lives in /data/.ollama/models so survives
factory reboots.
Effect: from now on, when 11 cloud providers all fail, Ollama keeps the
ingestion pipeline alive (slower than cloud but FREE + never rate-limited).
- bin/llm-burst-generator.py +16 -0
- bin/ollama-bridge.sh +88 -0
- bin/surrogate-orchestrate.sh +16 -0
|
@@ -90,6 +90,20 @@ PROVIDERS = [
|
|
| 90 |
"model": "moonshot-v1-32k", # kimi-k2 was wrong
|
| 91 |
"rpm_budget": 15,
|
| 92 |
},
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 93 |
# HF Inference API — free hosted Llama / Mistral / Mixtral / etc.
|
| 94 |
# User: 'HF local model ก็มี ทำไมเธอไม่เอามาใช้'.
|
| 95 |
{
|
|
@@ -290,6 +304,8 @@ def main():
|
|
| 290 |
log_path.parent.mkdir(parents=True, exist_ok=True)
|
| 291 |
|
| 292 |
# Discover which providers actually have keys set
|
|
|
|
|
|
|
| 293 |
active = [p for p in PROVIDERS if os.environ.get(p["key_env"], "").strip()]
|
| 294 |
if not active:
|
| 295 |
print("ERR: no provider keys found in env — set CEREBRAS_API_KEY etc. as Space secrets")
|
|
|
|
| 90 |
"model": "moonshot-v1-32k", # kimi-k2 was wrong
|
| 91 |
"rpm_budget": 15,
|
| 92 |
},
|
| 93 |
+
# ── ALWAYS-ON LOCAL FALLBACK: Ollama on this Space (no key, never rate-limited) ──
|
| 94 |
+
# User feedback 2026-04-29: "ถ้า freetier หมดก็ใช้ ทุกagent ต้องทำงานตลอด".
|
| 95 |
+
# When all paid/free APIs fail / hit rate limit, this keeps ingestion alive.
|
| 96 |
+
# Uses qwen3-coder:30b-a3b → qwen2.5-coder:14b → granite-code:8b → gemma4:e4b
|
| 97 |
+
# in priority order (whichever is installed). Slower than cloud APIs but
|
| 98 |
+
# FREE and never rate-limited.
|
| 99 |
+
{
|
| 100 |
+
"name": "ollama-local",
|
| 101 |
+
"url": "http://127.0.0.1:11434/v1/chat/completions", # Ollama OpenAI-compat
|
| 102 |
+
"key_env": "ALWAYS_ON", # sentinel — never empty
|
| 103 |
+
"model": "qwen3-coder:30b-a3b-instruct-q4_K_M", # auto-falls-through if absent
|
| 104 |
+
"rpm_budget": 60, # unlimited locally
|
| 105 |
+
"fallback": True, # mark as last-resort
|
| 106 |
+
},
|
| 107 |
# HF Inference API — free hosted Llama / Mistral / Mixtral / etc.
|
| 108 |
# User: 'HF local model ก็มี ทำไมเธอไม่เอามาใช้'.
|
| 109 |
{
|
|
|
|
| 304 |
log_path.parent.mkdir(parents=True, exist_ok=True)
|
| 305 |
|
| 306 |
# Discover which providers actually have keys set
|
| 307 |
+
# Sentinel "ALWAYS_ON" treats Ollama-local as always available even without API key.
|
| 308 |
+
os.environ.setdefault("ALWAYS_ON", "1")
|
| 309 |
active = [p for p in PROVIDERS if os.environ.get(p["key_env"], "").strip()]
|
| 310 |
if not active:
|
| 311 |
print("ERR: no provider keys found in env — set CEREBRAS_API_KEY etc. as Space secrets")
|
|
@@ -0,0 +1,88 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
# Ollama local-model bridge — drop-in fallback when paid APIs run out of credit.
|
| 3 |
+
# Reads JSON from stdin: {"messages":[...], "model":"..." (optional), "max_tokens":N}
|
| 4 |
+
# Writes plain text response to stdout.
|
| 5 |
+
#
|
| 6 |
+
# Falls through model preference:
|
| 7 |
+
# 1. qwen3-coder:30b-a3b (primary brain, MoE 3B active, fast)
|
| 8 |
+
# 2. qwen2.5-coder:14b (fallback, proven)
|
| 9 |
+
# 3. granite-code:8b (light, 128K context)
|
| 10 |
+
# 4. gemma4:e4b (very light triage)
|
| 11 |
+
#
|
| 12 |
+
# Picks first available model that's installed locally on this Space.
|
| 13 |
+
|
| 14 |
+
set -uo pipefail
|
| 15 |
+
PAYLOAD=$(cat)
|
| 16 |
+
OLLAMA_HOST="${OLLAMA_HOST:-127.0.0.1:11434}"
|
| 17 |
+
|
| 18 |
+
# Wait if Ollama not listening yet (e.g., booting)
|
| 19 |
+
for i in 1 2 3 4 5; do
|
| 20 |
+
if curl -fsSm 3 "http://${OLLAMA_HOST}/api/tags" >/dev/null 2>&1; then break; fi
|
| 21 |
+
sleep 2
|
| 22 |
+
done
|
| 23 |
+
|
| 24 |
+
# Discover available models, pick first that exists in priority order
|
| 25 |
+
AVAIL=$(curl -fsSm 5 "http://${OLLAMA_HOST}/api/tags" 2>/dev/null \
|
| 26 |
+
| python3 -c "import sys, json; print('\n'.join(m['name'] for m in json.load(sys.stdin).get('models', [])))" 2>/dev/null)
|
| 27 |
+
if [[ -z "$AVAIL" ]]; then
|
| 28 |
+
echo "{\"error\":\"ollama not reachable at $OLLAMA_HOST\"}" >&2
|
| 29 |
+
exit 1
|
| 30 |
+
fi
|
| 31 |
+
|
| 32 |
+
CHOSEN=""
|
| 33 |
+
for pref in "qwen3-coder:30b-a3b-instruct-q4_K_M" "qwen3-coder" \
|
| 34 |
+
"qwen2.5-coder:14b-instruct-q4_K_M" "qwen2.5-coder:14b" "qwen2.5-coder" \
|
| 35 |
+
"granite-code:8b-instruct" "granite-code:8b" "granite-code" \
|
| 36 |
+
"gemma4:e4b" "gemma4" "gemma3:1b" "gemma3" "llama3.2:3b" "llama3.2"; do
|
| 37 |
+
if echo "$AVAIL" | grep -qE "^${pref}(:|$)"; then
|
| 38 |
+
CHOSEN=$(echo "$AVAIL" | grep -E "^${pref}(:|$)" | head -1)
|
| 39 |
+
break
|
| 40 |
+
fi
|
| 41 |
+
done
|
| 42 |
+
|
| 43 |
+
if [[ -z "$CHOSEN" ]]; then
|
| 44 |
+
# Use whatever is first
|
| 45 |
+
CHOSEN=$(echo "$AVAIL" | head -1)
|
| 46 |
+
fi
|
| 47 |
+
|
| 48 |
+
# User can override by setting model in payload
|
| 49 |
+
USER_MODEL=$(echo "$PAYLOAD" | python3 -c "import sys, json; d=json.load(sys.stdin); print(d.get('model','') or '')" 2>/dev/null)
|
| 50 |
+
if [[ -n "$USER_MODEL" ]] && echo "$AVAIL" | grep -qE "^${USER_MODEL}(:|$)"; then
|
| 51 |
+
CHOSEN="$USER_MODEL"
|
| 52 |
+
fi
|
| 53 |
+
|
| 54 |
+
# Build /api/chat request
|
| 55 |
+
REQ=$(echo "$PAYLOAD" | python3 -c "
|
| 56 |
+
import sys, json
|
| 57 |
+
d = json.load(sys.stdin)
|
| 58 |
+
out = {
|
| 59 |
+
'model': '$CHOSEN',
|
| 60 |
+
'messages': d.get('messages', []),
|
| 61 |
+
'stream': False,
|
| 62 |
+
'options': {
|
| 63 |
+
'num_predict': int(d.get('max_tokens', 1024)),
|
| 64 |
+
'temperature': float(d.get('temperature', 0.7)),
|
| 65 |
+
'top_p': float(d.get('top_p', 0.95)),
|
| 66 |
+
'num_ctx': int(d.get('num_ctx', 8192)),
|
| 67 |
+
}
|
| 68 |
+
}
|
| 69 |
+
print(json.dumps(out))
|
| 70 |
+
")
|
| 71 |
+
|
| 72 |
+
# POST to Ollama
|
| 73 |
+
RESP=$(curl -fsSm 300 "http://${OLLAMA_HOST}/api/chat" \
|
| 74 |
+
-H "Content-Type: application/json" -d "$REQ" 2>/dev/null)
|
| 75 |
+
if [[ -z "$RESP" ]]; then
|
| 76 |
+
echo "{\"error\":\"ollama call timed out\"}" >&2
|
| 77 |
+
exit 1
|
| 78 |
+
fi
|
| 79 |
+
|
| 80 |
+
# Extract message content
|
| 81 |
+
echo "$RESP" | python3 -c "
|
| 82 |
+
import sys, json
|
| 83 |
+
d = json.load(sys.stdin)
|
| 84 |
+
msg = d.get('message', {}).get('content', '')
|
| 85 |
+
if not msg:
|
| 86 |
+
msg = d.get('response', '')
|
| 87 |
+
print(msg)
|
| 88 |
+
"
|
|
@@ -310,6 +310,22 @@ if os.environ.get("OR_KEY_ENV"):
|
|
| 310 |
"anthropic/claude-haiku-4.5", os.environ["OR_KEY_ENV"],
|
| 311 |
{"HTTP-Referer":"https://axentx.ai","X-Title":"Surrogate-1"})))
|
| 312 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 313 |
errors, out = [], ""
|
| 314 |
for name, fn in ladder:
|
| 315 |
try:
|
|
|
|
| 310 |
"anthropic/claude-haiku-4.5", os.environ["OR_KEY_ENV"],
|
| 311 |
{"HTTP-Referer":"https://axentx.ai","X-Title":"Surrogate-1"})))
|
| 312 |
|
| 313 |
+
# ── ALWAYS-ON LOCAL FALLBACK: Ollama (no key needed, never rate-limits) ────
|
| 314 |
+
# Last resort when ALL paid + free providers fail / out-of-credit / 429.
|
| 315 |
+
# User feedback 2026-04-29: "ทุก agent ต้องทำงานตลอด" — never go silent.
|
| 316 |
+
ladder.append(("ollama-local:qwen3-coder",
|
| 317 |
+
lambda: oai_compatible("http://127.0.0.1:11434/v1/chat/completions",
|
| 318 |
+
"qwen3-coder:30b-a3b-instruct-q4_K_M", "ollama")))
|
| 319 |
+
ladder.append(("ollama-local:qwen2.5-14b",
|
| 320 |
+
lambda: oai_compatible("http://127.0.0.1:11434/v1/chat/completions",
|
| 321 |
+
"qwen2.5-coder:14b-instruct-q4_K_M", "ollama")))
|
| 322 |
+
ladder.append(("ollama-local:granite",
|
| 323 |
+
lambda: oai_compatible("http://127.0.0.1:11434/v1/chat/completions",
|
| 324 |
+
"granite-code:8b-instruct", "ollama")))
|
| 325 |
+
ladder.append(("ollama-local:gemma",
|
| 326 |
+
lambda: oai_compatible("http://127.0.0.1:11434/v1/chat/completions",
|
| 327 |
+
"gemma4:e4b", "ollama")))
|
| 328 |
+
|
| 329 |
errors, out = [], ""
|
| 330 |
for name, fn in ladder:
|
| 331 |
try:
|