ashirato committed · Commit daa3c84 · 1 Parent(s): 2aa8090

feat(always-on): Ollama local fallback for ALL provider ladders


User emergency 2026-04-29: 'If the free tier runs out, then use it [the local
model]; every agent must keep working at all times.'
When every paid and free API ran out of quota, rate-limited, or returned HTTP 429,
the Hermes daemons silently stopped. This adds always-on Ollama (already running
in the Space) as a last-resort fallback so ingestion + orchestrate NEVER go silent.

Changes:
- bin/ollama-bridge.sh (new): drop-in OpenAI-compat bridge to local Ollama.
  Falls through the model preference automatically: qwen3-coder:30b-a3b →
  qwen2.5-coder:14b → granite-code:8b → gemma4:e4b. Reads JSON from stdin,
  writes text to stdout. Same I/O contract as the cerebras/groq/openrouter
  bridges.
- bin/llm-burst-generator.py: added an 'ollama-local' provider keyed on the
  sentinel ALWAYS_ON env var, so it is always 'active' regardless of API keys.
  rpm_budget is 60 (no real limit locally). Marked fallback=True so the retry
  logic tries it last, but it is ALWAYS reachable.
- bin/surrogate-orchestrate.sh: appended 4 ollama-local ladder entries at the
  END (qwen3-coder / qwen2.5-14b / granite / gemma) so every orchestrate call
  is guaranteed to succeed.
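
The sentinel trick from the second bullet can be sketched as follows. This is a minimal illustration, not the code in bin/llm-burst-generator.py: the trimmed provider table, the active_providers helper, and the sort that pushes fallback=True entries to the end are all assumptions made for the example (the diff only shows the flag being set, not how the retry logic consumes it).

```python
import os

# Hypothetical, trimmed-down provider table; only keys visible in the diff are used.
PROVIDERS = [
    {"name": "cerebras", "key_env": "CEREBRAS_API_KEY"},
    {"name": "ollama-local", "key_env": "ALWAYS_ON", "fallback": True},
    {"name": "groq", "key_env": "GROQ_API_KEY"},
]


def active_providers(env):
    """Return providers whose key variable is non-empty, fallbacks last."""
    env = dict(env)  # work on a copy so the caller's mapping is untouched
    env.setdefault("ALWAYS_ON", "1")  # sentinel: the local provider always passes the filter
    active = [p for p in PROVIDERS if env.get(p["key_env"], "").strip()]
    # sorted() is stable, so cloud providers keep their relative order and
    # anything marked fallback=True sinks to the end (assumed behaviour).
    return sorted(active, key=lambda p: bool(p.get("fallback", False)))


if __name__ == "__main__":
    print([p["name"] for p in active_providers(os.environ)])
```

With no cloud keys set, the list still contains ollama-local, so the "no provider keys found" branch is never reached.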

Models already pulled by start.sh: qwen3-coder:30b-a3b (~16GB, Q4, MoE with 3B
active parameters, so fast on CPU), qwen2.5-coder:14b (~9GB), granite-code:8b
(~5GB, IBM, 128K ctx), gemma4:e4b (light). The cache lives in
/data/.ollama/models, so it survives factory reboots.
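
For illustration, the model fall-through can be expressed in a few lines of Python. This is a sketch under assumptions, not a port of bin/ollama-bridge.sh: pick_model and PREFERENCE are hypothetical names, and the matching rule here (escaped prefix followed by ':' or '-' or end of string) is slightly different from the grep pattern in the bridge.

```python
import re

# Preference order taken from the commit message; entries are matched as prefixes.
PREFERENCE = [
    "qwen3-coder:30b-a3b",
    "qwen2.5-coder:14b",
    "granite-code:8b",
    "gemma4:e4b",
]


def pick_model(installed, preference=PREFERENCE):
    """Return the first installed model that matches the preference order."""
    for pref in preference:
        # re.escape keeps the '.' in names such as 'qwen2.5' literal. The tag
        # may continue with ':' or '-' (e.g. a quantisation suffix) or end here.
        pattern = re.compile(re.escape(pref) + r"(?:[:-]|$)")
        for name in installed:
            if pattern.match(name):
                return name
    # Nothing preferred is installed: use whatever is listed first, if anything.
    return installed[0] if installed else None


if __name__ == "__main__":
    print(pick_model(["gemma4:e4b", "qwen2.5-coder:14b-instruct-q4_K_M"]))
```

The installed list would come from Ollama's /api/tags response, as in the bridge script.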

Effect: from now on, when 11 cloud providers all fail, Ollama keeps the
ingestion pipeline alive (slower than cloud but FREE + never rate-limited).
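
The intended effect can be pictured as a ladder loop that keeps going until one rung answers. The sketch below is a simplified stand-in for the loop in bin/surrogate-orchestrate.sh, of which the diff shows only the first lines; run_ladder, rate_limited, and the rung names cloud-a / cloud-b are hypothetical.

```python
def run_ladder(ladder):
    """Call each (name, fn) rung in order and return the first non-empty answer."""
    errors = []
    for name, fn in ladder:
        try:
            out = fn()
            if out:
                return name, out, errors
            errors.append(f"{name}: empty response")
        except Exception as exc:  # a failing rung must not stop the ladder
            errors.append(f"{name}: {exc}")
    return None, "", errors


def rate_limited():
    """Stand-in for a cloud provider that is out of quota."""
    raise RuntimeError("HTTP 429")


if __name__ == "__main__":
    ladder = [
        ("cloud-a", rate_limited),
        ("cloud-b", lambda: ""),
        ("ollama-local:qwen3-coder", lambda: "local answer"),
    ]
    print(run_ladder(ladder))
```

Because the local rungs are appended last, they are only reached after every cloud rung has failed or returned nothing.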

bin/llm-burst-generator.py CHANGED
@@ -90,6 +90,20 @@ PROVIDERS = [
         "model": "moonshot-v1-32k",  # kimi-k2 was wrong
         "rpm_budget": 15,
     },
+    # ── ALWAYS-ON LOCAL FALLBACK: Ollama on this Space (no key, never rate-limited) ──
+    # User feedback 2026-04-29: "If the free tier runs out, then use it; every agent must keep working at all times".
+    # When all paid/free APIs fail / hit rate limit, this keeps ingestion alive.
+    # Uses qwen3-coder:30b-a3b → qwen2.5-coder:14b → granite-code:8b → gemma4:e4b
+    # in priority order (whichever is installed). Slower than cloud APIs but
+    # FREE and never rate-limited.
+    {
+        "name": "ollama-local",
+        "url": "http://127.0.0.1:11434/v1/chat/completions",  # Ollama OpenAI-compat
+        "key_env": "ALWAYS_ON",  # sentinel — never empty
+        "model": "qwen3-coder:30b-a3b-instruct-q4_K_M",  # auto-falls-through if absent
+        "rpm_budget": 60,  # unlimited locally
+        "fallback": True,  # mark as last-resort
+    },
     # HF Inference API — free hosted Llama / Mistral / Mixtral / etc.
     # User: 'There is an HF local model too, why are you not using it?'.
     {
@@ -290,6 +304,8 @@ def main():
     log_path.parent.mkdir(parents=True, exist_ok=True)
 
     # Discover which providers actually have keys set
+    # Sentinel "ALWAYS_ON" treats Ollama-local as always available even without API key.
+    os.environ.setdefault("ALWAYS_ON", "1")
     active = [p for p in PROVIDERS if os.environ.get(p["key_env"], "").strip()]
     if not active:
         print("ERR: no provider keys found in env — set CEREBRAS_API_KEY etc. as Space secrets")
bin/ollama-bridge.sh ADDED
@@ -0,0 +1,88 @@
+#!/usr/bin/env bash
+# Ollama local-model bridge — drop-in fallback when paid APIs run out of credit.
+# Reads JSON from stdin: {"messages":[...], "model":"..." (optional), "max_tokens":N}
+# Writes plain text response to stdout.
+#
+# Falls through model preference:
+#   1. qwen3-coder:30b-a3b (primary brain, MoE 3B active, fast)
+#   2. qwen2.5-coder:14b (fallback, proven)
+#   3. granite-code:8b (light, 128K context)
+#   4. gemma4:e4b (very light triage)
+#
+# Picks first available model that's installed locally on this Space.
+
+set -uo pipefail
+PAYLOAD=$(cat)
+OLLAMA_HOST="${OLLAMA_HOST:-127.0.0.1:11434}"
+
+# Wait if Ollama not listening yet (e.g., booting)
+for i in 1 2 3 4 5; do
+  if curl -fsSm 3 "http://${OLLAMA_HOST}/api/tags" >/dev/null 2>&1; then break; fi
+  sleep 2
+done
+
+# Discover available models, pick first that exists in priority order
+AVAIL=$(curl -fsSm 5 "http://${OLLAMA_HOST}/api/tags" 2>/dev/null \
+  | python3 -c "import sys, json; print('\n'.join(m['name'] for m in json.load(sys.stdin).get('models', [])))" 2>/dev/null)
+if [[ -z "$AVAIL" ]]; then
+  echo "{\"error\":\"ollama not reachable at $OLLAMA_HOST\"}" >&2
+  exit 1
+fi
+
+CHOSEN=""
+for pref in "qwen3-coder:30b-a3b-instruct-q4_K_M" "qwen3-coder" \
+            "qwen2.5-coder:14b-instruct-q4_K_M" "qwen2.5-coder:14b" "qwen2.5-coder" \
+            "granite-code:8b-instruct" "granite-code:8b" "granite-code" \
+            "gemma4:e4b" "gemma4" "gemma3:1b" "gemma3" "llama3.2:3b" "llama3.2"; do
+  if echo "$AVAIL" | grep -qE "^${pref}(:|$)"; then
+    CHOSEN=$(echo "$AVAIL" | grep -E "^${pref}(:|$)" | head -1)
+    break
+  fi
+done
+
+if [[ -z "$CHOSEN" ]]; then
+  # Use whatever is first
+  CHOSEN=$(echo "$AVAIL" | head -1)
+fi
+
+# User can override by setting model in payload
+USER_MODEL=$(echo "$PAYLOAD" | python3 -c "import sys, json; d=json.load(sys.stdin); print(d.get('model','') or '')" 2>/dev/null)
+if [[ -n "$USER_MODEL" ]] && echo "$AVAIL" | grep -qE "^${USER_MODEL}(:|$)"; then
+  CHOSEN="$USER_MODEL"
+fi
+
+# Build /api/chat request
+REQ=$(echo "$PAYLOAD" | python3 -c "
+import sys, json
+d = json.load(sys.stdin)
+out = {
+    'model': '$CHOSEN',
+    'messages': d.get('messages', []),
+    'stream': False,
+    'options': {
+        'num_predict': int(d.get('max_tokens', 1024)),
+        'temperature': float(d.get('temperature', 0.7)),
+        'top_p': float(d.get('top_p', 0.95)),
+        'num_ctx': int(d.get('num_ctx', 8192)),
+    }
+}
+print(json.dumps(out))
+")
+
+# POST to Ollama
+RESP=$(curl -fsSm 300 "http://${OLLAMA_HOST}/api/chat" \
+  -H "Content-Type: application/json" -d "$REQ" 2>/dev/null)
+if [[ -z "$RESP" ]]; then
+  echo "{\"error\":\"ollama call timed out\"}" >&2
+  exit 1
+fi
+
+# Extract message content
+echo "$RESP" | python3 -c "
+import sys, json
+d = json.load(sys.stdin)
+msg = d.get('message', {}).get('content', '')
+if not msg:
+    msg = d.get('response', '')
+print(msg)
+"
bin/surrogate-orchestrate.sh CHANGED
@@ -310,6 +310,22 @@ if os.environ.get("OR_KEY_ENV"):
         "anthropic/claude-haiku-4.5", os.environ["OR_KEY_ENV"],
         {"HTTP-Referer":"https://axentx.ai","X-Title":"Surrogate-1"})))
 
+# ── ALWAYS-ON LOCAL FALLBACK: Ollama (no key needed, never rate-limits) ────
+# Last resort when ALL paid + free providers fail / out-of-credit / 429.
+# User feedback 2026-04-29: "every agent must keep working at all times" — never go silent.
+ladder.append(("ollama-local:qwen3-coder",
+    lambda: oai_compatible("http://127.0.0.1:11434/v1/chat/completions",
+                           "qwen3-coder:30b-a3b-instruct-q4_K_M", "ollama")))
+ladder.append(("ollama-local:qwen2.5-14b",
+    lambda: oai_compatible("http://127.0.0.1:11434/v1/chat/completions",
+                           "qwen2.5-coder:14b-instruct-q4_K_M", "ollama")))
+ladder.append(("ollama-local:granite",
+    lambda: oai_compatible("http://127.0.0.1:11434/v1/chat/completions",
+                           "granite-code:8b-instruct", "ollama")))
+ladder.append(("ollama-local:gemma",
+    lambda: oai_compatible("http://127.0.0.1:11434/v1/chat/completions",
+                           "gemma4:e4b", "ollama")))
+
 errors, out = [], ""
 for name, fn in ladder:
     try: