---
license: apache-2.0
language:
- en
- kk
base_model:
- Qwen/Qwen3-0.6B
datasets:
- issai/foggen-data
- issai/KazCulture
pipeline_tag: text-generation
tags:
- edge-cloud-routing
- verbalized-confidence
- self-aware
- routing
- continual-learning
- multi-round
library_name: transformers
---

# FogGen: Self-Aware Edge–Cloud LLM Router

> **A 0.6B-parameter edge LLM trained to emit a calibrated, verbalized confidence score before its answer, enabling efficient edge–cloud routing without an external router.**

![FogGen overview: (a) self-aware routing at inference, (b) self-evolving training loop](./foggen_overview.png)

FogGen is a small, self-aware edge model that knows when to answer locally and when to defer to a stronger cloud model. At inference (figure (a)), it emits a confidence score followed by an answer in one forward pass; if the confidence `c ≥ τ`, the local answer is returned, otherwise the query is routed to the cloud. Training (figure (b)) is a self-evolving loop: in each round, the current checkpoint self-samples N=8 generations per question to derive confidence buckets, and is then fine-tuned (SFT) on `(question, confidence, answer)` triples.
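
As a rough illustration of the bucket-derivation step, the sketch below maps N self-sampled answers to one of the five confidence buckets. The rule used here (fraction of samples matching the reference answer, snapped to the nearest bucket) is an assumption made for illustration, and `confidence_bucket` is a hypothetical helper, not the released training code.

```python
from typing import List

# The five discrete confidence values the model is trained to emit.
BUCKETS = [0.0, 0.25, 0.5, 0.75, 1.0]

def confidence_bucket(sample_answers: List[str], reference: str) -> float:
    """Snap the self-sampled accuracy to the nearest confidence bucket.

    ASSUMPTION: confidence is the fraction of sampled answers that match
    the reference answer. Ties between two buckets resolve to the lower one.
    """
    if not sample_answers:
        raise ValueError("need at least one sampled answer")
    accuracy = sum(a == reference for a in sample_answers) / len(sample_answers)
    # min() returns the first bucket at minimal distance, i.e. the lower one on ties.
    return min(BUCKETS, key=lambda b: abs(b - accuracy))

# Example: 6 of 8 samples agree with the reference answer "A".
samples = ["A", "A", "B", "A", "A", "C", "A", "A"]
print(confidence_bucket(samples, "A"))
```

With N=8, the sampled accuracy is a multiple of 1/8, so odd counts (1, 3, 5, 7 correct) fall exactly between two buckets; this sketch rounds those toward lower confidence, which is one possible convention.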

The released checkpoint is the endpoint (`R14`) of a 14-round chain trained across seven domains: finance, science, coding, law, math, Kazakh culture, medical.

## Quick demo

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("issai/foggen", torch_dtype="bfloat16", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("issai/foggen")

SYSTEM = """You are a self-aware multiple-choice assistant.

Rules:
- Do not output <think> tags.
- First, assess your confidence in solving this question.
- Then give your answer.
- Output format:
  Confidence: <0.0|0.25|0.5|0.75|1.0>
  Final answer: <OPTION_LETTER>"""

question = """A firm reports $400M in total liabilities and $600M in shareholders' equity.
What is the firm's debt-to-equity ratio?

A. 0.67
B. 1.00
C. 1.50
D. 2.00"""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": question},
]
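inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,  # Qwen3 chat-template switch: skip the <think> block
    return_dict=True,       # return input_ids and attention_mask together
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))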
# Expected:
#   Confidence: 1.0
#   Final answer: A
```

## How routing works

```python
import re

def route_query(model_output: str, tau: float = 0.5):
    """Parse FogGen output and return (action, confidence, answer).

    action is 'keep_local' if an answer was parsed and confidence >= tau,
    else 'route_to_cloud'. Unparseable output is always routed to the cloud.
    """
    # Match a plain decimal so trailing punctuation ("1.0.") cannot break float().
    conf_match = re.search(r"Confidence\s*:\s*(\d+(?:\.\d+)?)", model_output)
    ans_match = re.search(r"Final\s+answer\s*:\s*([A-D])", model_output)
    if not conf_match:
        return "route_to_cloud", None, None
    confidence = float(conf_match.group(1))
    answer = ans_match.group(1) if ans_match else None
    # Never keep a local result that has no parsed answer.
    keep_local = answer is not None and confidence >= tau
    return ("keep_local" if keep_local else "route_to_cloud", confidence, answer)
```

At τ=0.5 on the trained domains, the model routes ~22% of queries to the cloud while achieving 67.8% mean system accuracy.

## Model details

| | |
|---|---|
| **Base model** | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |
| **Parameters** | 0.6 B |
| **Training method** | LoRA SFT (rank=16, α=32, all-linear), bf16, 2 epochs/round |
| **Rounds** | 14 sequential rounds (R0 → R14) |
| **Training data** | ~1,800 SFT rows per round × 14 rounds |
| **Domains** | finance, science, coding, law, math, Kazakh culture, medical |
| **Cloud teacher** | [Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) |
| **Output format** | `Confidence: <bucket>\nFinal answer: <letter>` |
| **Confidence buckets** | 5 discrete values: 0.0, 0.25, 0.5, 0.75, 1.0 |
| **License** | Apache 2.0 (inherited from base) |

## Performance

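System accuracy at τ=0.5 on seven MCQ domains (full test sets, ~16,200 questions), compared with random routing and a cloud-only baseline (Qwen3-30B-A3B-Instruct-2507). "R14 raw" is the accuracy of the edge model answering every query locally; "Cloud routed" is the share of queries FogGen sends to the cloud: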

| Domain | Cloud only | R14 raw | Random @ τ=0.5 | **FogGen @ τ=0.5** | Cloud routed |
|---|---|---|---|---|---|
| Finance | 69.5% | 57.0% | 59.9% | **65.8%** | 23.3% |
| Science | 72.7% | 56.9% | 60.1% | **64.5%** | 20.4% |
| Coding | 74.2% | 61.8% | 64.2% | **69.5%** | 19.7% |
| Law | 70.7% | 55.3% | 58.4% | **62.4%** | 20.1% |
| Math | 60.1% | 42.2% | 50.8% | **58.1%** | 47.7% |
| Kazakh culture | 95.8% | 91.3% | 91.4% | **91.9%** | 1.0% |
| Medical | 74.0% | 52.6% | 57.1% | **62.2%** | 20.9% |
| **Mean** | **73.9%** | **59.6%** | **63.1%** | **67.8%** | **21.9%** |

Mean lift over Random at τ=0.5: **+4.6** percentage points (per-domain system accuracy minus random-routing accuracy, averaged across the seven domains).
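
The mean lift can be reproduced from the table. The short check below copies the per-domain "Random" and "FogGen" columns and averages their differences; the variable names are illustrative.

```python
# Per-domain accuracies (%) copied from the table above.
random_acc = {"Finance": 59.9, "Science": 60.1, "Coding": 64.2, "Law": 58.4,
              "Math": 50.8, "Kazakh culture": 91.4, "Medical": 57.1}
foggen_acc = {"Finance": 65.8, "Science": 64.5, "Coding": 69.5, "Law": 62.4,
              "Math": 58.1, "Kazakh culture": 91.9, "Medical": 62.2}

# Lift per domain, in percentage points.
lifts = {d: foggen_acc[d] - random_acc[d] for d in foggen_acc}
mean_lift = sum(lifts.values()) / len(lifts)

print(f"mean lift: +{mean_lift:.1f} points")
print("largest lift:", max(lifts, key=lifts.get))
```

Averaging the per-domain differences gives about 4.64, which rounds to the reported +4.6. Subtracting the already rounded column means (67.8 − 63.1) gives 4.7 instead, a rounding artifact.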

### Baseline comparison

Direct comparison against AutoMix (Aggarwal et al., 2024) on the same R14 checkpoint, same evaluation sets:

| Method | SysAcc | Cloud routed | Δ over Random | Fwd passes / query |
|---|---|---|---|---|
| AutoMix | 67.2% | 29.0% | +3.7 | 9 (1 answer + 8 verify) |
| **FogGen (ours)** | **67.8%** | **21.9%** | **+4.6** | **1** |

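FogGen achieves higher system accuracy (67.8% vs. 67.2%) while routing fewer queries to the cloud (21.9% vs. 29.0%) and using one edge forward pass per query instead of nine, a 9× reduction in edge-side inference cost.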
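
For a rough sense of how the two cost factors combine, the sketch below folds edge forward passes and cloud usage into a single expected per-query cost. The unit costs (`edge_cost`, `cloud_cost`) are hypothetical placeholders chosen only for illustration; only the pass counts and routing fractions come from the table above.

```python
def expected_cost(edge_passes: int, cloud_fraction: float,
                  edge_cost: float = 1.0, cloud_cost: float = 50.0) -> float:
    """Expected per-query cost in arbitrary units.

    ASSUMPTION: every query pays for all edge passes, and a routed query
    additionally pays one cloud call. The unit costs are made-up values.
    """
    return edge_passes * edge_cost + cloud_fraction * cloud_cost

automix = expected_cost(edge_passes=9, cloud_fraction=0.290)
foggen = expected_cost(edge_passes=1, cloud_fraction=0.219)
print(f"AutoMix: {automix:.2f}  FogGen: {foggen:.2f}")
```

Because FogGen is lower on both factors, it comes out cheaper for any non-negative choice of unit costs; the specific numbers printed depend entirely on the placeholder values.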

## Open-ended generalization

The MCQ-trained chain transfers zero-shot to open-ended (OE) tasks. The table reports local accuracy (R14 raw) and routing benefit (Δ) at τ=0.5 on three held-out OE benchmarks:

| Benchmark | Format | R14 raw | R14 Δ@τ=0.5 |
|---|---|---|---|
| [SQuAD v1.1](https://huggingface.co/datasets/rajpurkar/squad) | extractive RC | 81.0% | +1.4 |
| [TruthfulQA gen](https://huggingface.co/datasets/truthfulqa/truthful_qa) | adversarial factual | 36.5% | −0.7 (anti-calibrated) |
| [GSM8K](https://huggingface.co/datasets/openai/gsm8k) (CoT) | math word-problems | 52.0% | +2.2 |

One additional round of OE training (R15, 1876 SFT rows) lifts local accuracy on these three benchmarks to 86.5% / 40.0% / 58.0% respectively; see [`issai/foggen-r15-oe`](https://huggingface.co/issai/foggen-r15-oe).

## Citation

Paper coming soon.

## Acknowledgements

Thanks to the Qwen team at Alibaba for the base model and cloud teacher.