---
license: apache-2.0
language:
- en
- kk
base_model:
- Qwen/Qwen3-0.6B
datasets:
- issai/foggen-data
- issai/KazCulture
pipeline_tag: text-generation
tags:
- edge-cloud-routing
- verbalized-confidence
- self-aware
- routing
- continual-learning
- multi-round
library_name: transformers
---
# FogGen: Self-Aware Edge–Cloud LLM Router
> **A 0.6B-parameter edge LLM trained to emit a calibrated, verbalized confidence score before its answer, enabling efficient edge–cloud routing without an external router.**

FogGen is a small, self-aware edge model that knows when to answer locally and when to defer to a stronger cloud model. At inference (figure (a)) it emits a confidence score then an answer in one forward pass; if confidence `c ≥ τ` the local answer is returned, otherwise the query is routed to the cloud. Training (figure (b)) is a self-evolving loop: each round, the current checkpoint self-samples N=8 generations per question to derive confidence buckets, then SFTs on `(question, confidence, answer)` triples.
The released checkpoint is the endpoint (`R14`) of a 14-round chain trained across seven domains: finance, science, coding, law, math, Kazakh culture, medical.
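
The card does not spell out how the eight self-samples become a confidence bucket, or which answer is used as the SFT target. One plausible reading, sketched below, takes the fraction of samples that match the gold answer and snaps it to the nearest of the five buckets. The snapping rule, the tie-breaking, and the helper names `confidence_bucket` and `build_sft_target` are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch only: the bucket rule below is an assumption,
# not a procedure confirmed by the FogGen authors.
BUCKETS = (0.0, 0.25, 0.5, 0.75, 1.0)


def confidence_bucket(sampled_answers, gold):
    """Map the fraction of correct self-samples to the nearest bucket.

    Ties (e.g. 1/8 = 0.125) resolve to the lower bucket because min()
    keeps the first minimum it encounters.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    frac = sum(a == gold for a in sampled_answers) / len(sampled_answers)
    return min(BUCKETS, key=lambda b: abs(b - frac))


def build_sft_target(confidence, answer):
    """Render the assistant turn in the card's output format."""
    return f"Confidence: {confidence}\nFinal answer: {answer}"


# Eight hypothetical samples for one question whose gold answer is "A".
samples = ["A", "A", "B", "A", "A", "A", "C", "A"]  # 6 of 8 correct
bucket = confidence_bucket(samples, gold="A")
print(build_sft_target(bucket, "A"))
```

With N=8 the raw fractions are multiples of 1/8, so every second value falls exactly between two buckets; a real pipeline would need an explicit choice for those ties.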
## Quick demo
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("issai/foggen", torch_dtype="bfloat16", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("issai/foggen")
SYSTEM = """You are a self-aware multiple-choice assistant.
Rules:
- Do not output <think> tags.
- First, assess your confidence in solving this question.
- Then give your answer.
- Output format:
Confidence: <0.0|0.25|0.5|0.75|1.0>
Final answer: <OPTION_LETTER>"""
question = """A firm reports $400M in total liabilities and $600M in shareholders' equity.
What is the firm's debt-to-equity ratio?
A. 0.67
B. 1.00
C. 1.50
D. 2.00"""
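messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": question},
]
# return_dict=True yields input_ids and attention_mask regardless of the
# library's default return type for apply_chat_template.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,  # Qwen3 chat-template flag: no <think> block
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))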
# Expected:
# Confidence: 1.0
# Final answer: A
```
## How routing works
```python
import re


def route_query(model_output: str, tau: float = 0.5):
    """Parse FogGen output. Returns (action, confidence, answer).

    action is 'keep_local' if confidence >= tau, else 'route_to_cloud'.
    """
    conf_match = re.search(r"Confidence\s*:\s*([\d.]+)", model_output)
    ans_match = re.search(r"Final\s+answer\s*:\s*([A-D])", model_output)
    # No parseable confidence: fall back to the cloud model.
    if not conf_match:
        return "route_to_cloud", None, None
    confidence = float(conf_match.group(1))
    answer = ans_match.group(1) if ans_match else None
    return ("keep_local" if confidence >= tau else "route_to_cloud", confidence, answer)
```
At τ=0.5 on the trained domains, the model routes ~22% of queries to the cloud while achieving 67.8% mean system accuracy.
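
To make those two numbers concrete, the sketch below computes the cloud-routing rate and system accuracy from per-query evaluation records. The record layout (`confidence`, `local_correct`, `cloud_correct`) and the toy data are hypothetical; the only rule taken from this card is that a query stays local when its confidence is at least τ.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical per-query evaluation record (field names are assumptions).
    confidence: float     # bucket emitted by the edge model
    local_correct: bool   # was the edge model's answer right?
    cloud_correct: bool   # would the cloud model's answer be right?


def evaluate_routing(records, tau=0.5):
    """Return (cloud_fraction, system_accuracy) at threshold tau."""
    if not records:
        raise ValueError("records must be non-empty")
    routed = [r.confidence < tau for r in records]
    # The system returns the cloud answer for routed queries, else the local one.
    correct = [
        r.cloud_correct if to_cloud else r.local_correct
        for r, to_cloud in zip(records, routed)
    ]
    return sum(routed) / len(records), sum(correct) / len(records)


# Toy data, not FogGen results.
toy = [
    Record(1.0, True, True),
    Record(0.75, True, True),
    Record(0.5, False, True),
    Record(0.25, False, True),
    Record(0.0, False, False),
]
cloud_frac, sys_acc = evaluate_routing(toy, tau=0.5)
print(f"cloud routed: {cloud_frac:.0%}, system accuracy: {sys_acc:.0%}")
```

Raising `tau` can only increase the share of queries sent to the cloud, so sweeping it over the five buckets traces out the cost/accuracy trade-off for a given evaluation set.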
## Model details
| | |
|---|---|
| **Base model** | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |
| **Parameters** | 0.6 B |
| **Training method** | LoRA SFT (rank=16, α=32, all-linear), bf16, 2 epochs/round |
| **Rounds** | 14 sequential rounds (R0 → R14) |
| **Training data** | ~1,800 SFT rows per round × 14 rounds |
| **Domains** | finance, science, coding, law, math, Kazakh culture, medical |
| **Cloud teacher** | [Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) |
| **Output format** | `Confidence: <bucket>\nFinal answer: <letter>` |
| **Confidence buckets** | 5 discrete values: 0.0, 0.25, 0.5, 0.75, 1.0 |
| **License** | Apache 2.0 (inherited from base) |
## Performance
System accuracy at τ=0.5 on seven MCQ domains (full test sets, ~16,200 questions), measured against Random routing and a cloud-only baseline (Qwen3-30B-A3B-Instruct-2507):
| Domain | Cloud only | R14 raw | Random @ τ=0.5 | **FogGen @ τ=0.5** | Cloud routed |
|---|---|---|---|---|---|
| Finance | 69.5% | 57.0% | 59.9% | **65.8%** | 23.3% |
| Science | 72.7% | 56.9% | 60.1% | **64.5%** | 20.4% |
| Coding | 74.2% | 61.8% | 64.2% | **69.5%** | 19.7% |
| Law | 70.7% | 55.3% | 58.4% | **62.4%** | 20.1% |
| Math | 60.1% | 42.2% | 50.8% | **58.1%** | 47.7% |
| Kazakh culture | 95.8% | 91.3% | 91.4% | **91.9%** | 1.0% |
| Medical | 74.0% | 52.6% | 57.1% | **62.2%** | 20.9% |
| **Mean** | **73.9%** | **59.6%** | **63.1%** | **67.8%** | **21.9%** |
Mean lift over Random at τ=0.5: **+4.6 percentage points** (system accuracy minus random-routing accuracy, averaged across the seven domains).
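
The mean lift can be reproduced directly from the table. The sketch below also checks one inference that the card does not state: the Random column is numerically consistent with sending the same share of queries to the cloud at random, i.e. `local + p × (cloud − local)`. Treat that formula as an assumption read off the published numbers; the row values are copied from the table above.

```python
# Rows copied from the table above, in percent:
# (cloud only, R14 raw, random routing, FogGen, share of queries routed to cloud)
ROWS = {
    "Finance":        (69.5, 57.0, 59.9, 65.8, 23.3),
    "Science":        (72.7, 56.9, 60.1, 64.5, 20.4),
    "Coding":         (74.2, 61.8, 64.2, 69.5, 19.7),
    "Law":            (70.7, 55.3, 58.4, 62.4, 20.1),
    "Math":           (60.1, 42.2, 50.8, 58.1, 47.7),
    "Kazakh culture": (95.8, 91.3, 91.4, 91.9, 1.0),
    "Medical":        (74.0, 52.6, 57.1, 62.2, 20.9),
}


def expected_random_acc(cloud, local, routed_pct):
    """Accuracy expected when routed_pct% of queries go to the cloud at random.

    Assumption: the Random baseline is matched to FogGen's cloud rate.
    """
    return local + (routed_pct / 100.0) * (cloud - local)


lifts, gaps = [], []
for cloud, local, rand, foggen, routed in ROWS.values():
    lifts.append(foggen - rand)  # lift as defined in the card
    gaps.append(abs(expected_random_acc(cloud, local, routed) - rand))

mean_lift = sum(lifts) / len(lifts)
max_gap = max(gaps)
print(f"mean lift over Random: {mean_lift:+.1f} points")
print(f"largest gap between reconstructed and published Random: {max_gap:.2f} points")
```

The reconstructed Random values agree with the published column to within 0.1 point in every domain, which is what rounding alone would produce.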
### Baseline comparison
Direct comparison against AutoMix (Aggarwal et al., 2024) on the same R14 checkpoint, same evaluation sets:
| Method | SysAcc | Cloud routed | Δ over Random | Fwd passes / query |
|---|---|---|---|---|
| AutoMix | 67.2% | 29.0% | +3.7 | 9 (1 answer + 8 verify) |
| **FogGen (ours)** | **67.8%** | **21.9%** | **+4.6** | **1** |
FogGen reaches slightly higher system accuracy (67.8% vs. 67.2%) while routing fewer queries to the cloud (21.9% vs. 29.0%) and using one forward pass per query instead of nine.
## Open-ended generalization
The MCQ-trained chain transfers zero-shot to open-ended (OE) task types. Local accuracy and routing benefit at τ=0.5 on three held-out OE benchmarks:
| Benchmark | Format | R14 raw | R14 Δ@τ=0.5 |
|---|---|---|---|
| [SQuAD v1.1](https://huggingface.co/datasets/rajpurkar/squad) | extractive RC | 81.0% | +1.4 |
| [TruthfulQA gen](https://huggingface.co/datasets/truthfulqa/truthful_qa) | adversarial factual | 36.5% | −0.7 (anti-calibrated) |
| [GSM8K](https://huggingface.co/datasets/openai/gsm8k) (CoT) | math word-problems | 52.0% | +2.2 |
One additional round of OE training (R15, 1876 SFT rows) lifts local accuracy on these three benchmarks to 86.5% / 40.0% / 58.0% respectively; see [`issai/foggen-r15-oe`](https://huggingface.co/issai/foggen-r15-oe).
## Citation
Paper coming soon.
## Acknowledgements
Thanks to the Qwen team at Alibaba for the base model and cloud teacher.