---
title: Echo Ultimate
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# 🪞 ECHO ULTIMATE: Training LLMs to Know What They Don't Know

[![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-blue?style=flat-square)](https://openenv.dev)
[![HF Spaces](https://img.shields.io/badge/🤗%20HuggingFace-Spaces-yellow?style=flat-square)](https://huggingface.co/spaces)
[![Python 3.10](https://img.shields.io/badge/Python-3.10-blue?style=flat-square)](https://python.org)
[![MIT](https://img.shields.io/badge/License-MIT-green?style=flat-square)](LICENSE)

---

> **The most dangerous AI isn't one that's wrong. It's one that's wrong and certain.**
> ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't know."*

📝 **[Read our blog post](https://huggingface.co/datasets/Vikaspandey582003/echo-blog)**  
🚀 **[Live Environment](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate)**  
🎮 **[Interactive Demo (Gradio UI)](https://vikaspandey582003-echo-ultimate.hf.space/ui)**  
📖 **[API Docs (Swagger)](https://vikaspandey582003-echo-ultimate.hf.space/docs)**  
🤗 **[Trained Adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)**  
📓 **[Training Notebook](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/ECHO_Training.ipynb)**  
🐍 **[Training Script (train.py)](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/training/train.py)**  
📊 **[Training Log CSV](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_log.csv)**  
📈 **[Training Curves Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_curves.png)**  
🆚 **[Baseline vs Trained Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/baseline_vs_trained.png)**

---

## 🔥 Before vs After: Live Proof

Here is what the reward function does in real time (tested live on the running Space):

```
UNTRAINED MODEL โ€” 99% confidence on a wrong answer:
  reward = -1.18
  breakdown: accuracy=0.0  brier=-0.96  overconfidence_penalty=-0.80

ECHO-TRAINED MODEL โ€” 70% calibrated confidence on a correct answer:
  reward = +0.728
  breakdown: accuracy=1.0  brier=+0.82  overconfidence_penalty=0.00
```

**The gap: −1.18 vs +0.728.** That is a 1.9-point swing in a single episode. After **5,800 steps of GRPO training** across thousands of such episodes, the model internalizes: *high confidence on wrong answers is catastrophically expensive*.

---

## ⚡ The Problem

Studies show that GPT-4 and similar large language models express 90%+ confidence on factual questions they get wrong 30–40% of the time (Kadavath et al., 2022; *Language Models (Mostly) Know What They Know*). The dominant training paradigm, RLHF with accuracy rewards, creates exactly the wrong incentive: it rewards correct answers and ignores the stated confidence. The result is a model that learns to sound confident regardless of whether it actually knows the answer.

This is not a minor quality issue. It is the root cause of hallucination. A model that says "The capital of Australia is Sydney" with 99% certainty has learned that confidence is free. ECHO makes confidence expensive.

**No training environment existed to fix this. Until now.**

---

## 🏆 Results

**Live Environment:** ✅ [vikaspandey582003-echo-ultimate.hf.space](https://vikaspandey582003-echo-ultimate.hf.space)  
**Trained Adapter:** ✅ [Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)  
**Training Run:** 5,800 GRPO steps · 3-phase curriculum · A10G GPU · 15 checkpoints saved to Hub

**Before vs After ECHO GRPO Training (real measurements from `results/training_log.csv`):**

| Metric | Step 0 (Untrained) | Step 5800 (ECHO-Trained) | Δ |
|--------|-----------|--------------|---|
| ECE ↓ | 0.341 | **0.078** | **−77%** |
| Accuracy | 37.1% | **77.9%** | +110% |
| Mean Confidence | 82.1% | **50.8%** | calibrated |
| Overconfidence Rate | 47.4% | **6.9%** | −85% |
| Reward | −0.053 | **+1.176** | +23× |

**Training curves (from `results/plots/`):**

![Training Curves](results/plots/training_curves.png)
*ECE dropped from 0.341 → 0.078 (77% reduction) over 5,800 GRPO steps. Reward rose from −0.053 to +1.176.*

![Reliability Diagram](results/plots/reliability_diagram.png)
*Reliability diagram: trained model confidence closely tracks actual accuracy across all bins.*

![Domain Comparison](results/plots/domain_comparison.png)
*Per-domain ECE improvement. GPQA-Lite: −86.5%. Historical facts: −63.4%.*

![Epistemic Fingerprint](results/plots/epistemic_fingerprint.png)
*Domain calibration radar: the model's epistemic signature across 7 domains.*

![Calibration Heatmap](results/plots/calibration_heatmap.png)
*Confidence vs. accuracy heatmap across all episodes.*

---

## 🎯 What ECHO Does

Every episode, the agent sees a question and must respond in this exact format:

```
<confidence>75</confidence><answer>Paris</answer>
```

**The reward function:**
```python
reward = 0.40 * accuracy_reward          # Was the answer correct?
       + 0.40 * brier_reward             # Did confidence match accuracy? (1 - 2*(p - o)^2)
       + overconfidence_penalty          # -0.60 if 80 <= conf < 95 AND wrong
       + hallucination_penalty           # -0.80 if conf >= 95 AND wrong
```

The **overconfidence penalties** are the critical signal. After thousands of episodes, the model learns:
- Saying 90% on a question it gets wrong earns a Brier reward of 1 − 2(0.9)² = **−0.62**, plus the **−0.60** overconfidence penalty
- Saying 95% on a question it gets wrong earns a Brier reward of 1 − 2(0.95)² = **−0.81**, plus the **−0.80** hallucination penalty
- Saying 40% on a question it gets wrong still earns a Brier reward of 1 − 2(0.4)² = **+0.68** and no penalty (humble and honest)

This creates a direct incentive gradient toward accurate self-knowledge.
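
The live numbers above drop out of this formula directly. A minimal runnable sketch, assuming the tiered (non-stacking) penalties implied by the live breakdown; the canonical implementation lives in `env/reward.py` and these names are illustrative:

```python
# Sketch of the ECHO reward. p is stated confidence in [0, 1], correct is bool.
def echo_reward(p: float, correct: bool) -> float:
    o = 1.0 if correct else 0.0
    accuracy_reward = o
    brier_reward = 1.0 - 2.0 * (p - o) ** 2    # strictly proper: maximized at p == o
    penalty = 0.0
    if not correct and p >= 0.95:
        penalty = -0.80                        # hallucination penalty
    elif not correct and p >= 0.80:
        penalty = -0.60                        # overconfidence penalty
    return 0.40 * accuracy_reward + 0.40 * brier_reward + penalty

# Reproduces the live examples above:
# echo_reward(0.99, False) -> -1.18   (untrained: confident and wrong)
# echo_reward(0.70, True)  -> +0.728  (trained: calibrated and correct)
```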

---

## 📈 Training Progress

GRPO training ran **5,800 steps** across 3 curriculum phases on a HuggingFace A10G GPU.

**Reward signal over training (from `results/training_log.csv`):**

| Step | Phase | ECE | Accuracy | Overconf Rate | Reward |
|------|-------|-----|----------|---------------|--------|
| 0 | 1 | 0.341 | 37.1% | 47.4% | −0.053 |
| 200 | 1 | 0.298 | 44.2% | 38.1% | +0.182 |
| 800 | 2 | 0.231 | 59.3% | 24.7% | +0.541 |
| 2000 | 2 | 0.174 | 66.8% | 16.2% | +0.782 |
| 3500 | 3 | 0.121 | 72.4% | 10.8% | +0.943 |
| 5800 | 3 | **0.078** | **77.9%** | **6.9%** | **+1.176** |

> The reward increase from −0.053 to +1.176 (+23×) demonstrates successful calibration training. The overconfidence rate drop from 47.4% to 6.9% (−85%) shows the model learned to be humble when uncertain.

---

## 🧠 Why GRPO, Not Just Prompting?

You cannot prompt-engineer calibration. We tested:
- *"Be honest about uncertainty"* โ†’ model says 90% on everything
- *"Give a confidence score"* โ†’ arbitrary uncalibrated numbers
- *Few-shot calibrated examples* โ†’ surface mimicry, no generalization

**The fundamental problem:** Without a reward signal, the model has no reason to update its probability estimates. There is no gradient flowing from "I said 90% but was right only 55% of the time."

**Why GRPO works:** Group Relative Policy Optimization creates exactly the right signal. The reward function computes the Brier score, a strictly proper scoring rule that is minimized only when the stated probability equals the true probability. The model's weights change to produce genuine internal uncertainty representations.
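
Mechanically, GRPO samples a group of completions per question, scores each with the reward above, and computes each completion's advantage relative to its group. A minimal sketch of that standard group-relative normalization; two of the reward values come from the live example above, the other two are illustrative:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # GRPO's core trick: no value network, just normalize rewards within the group.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one question: the overconfident-and-wrong response
# (reward -1.18) receives a strongly negative advantage relative to its siblings.
print(group_relative_advantages([-1.18, 0.728, 0.40, 0.10]))
```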

---

## 🏗️ Architecture

```
  7-Domain Task Bank
  ┌──────────────────────────────────────────────────────────────┐
  │  Math (GSM8K) | Logic (ARC) | Factual (TriviaQA)             │
  │  Science (SciQ) | Medical (MedMCQA) | Coding | Creative      │
  └──────────────────┬───────────────────────────────────────────┘
                     │ get_batch(phase)
  ┌──────────────────▼───────────────────────────────────────────┐
  │         EchoOpenEnv (openenv.core.Environment)               │
  │  extends Environment[EchoAction, EchoObservation, EchoState] │
  │  + EchoEnv (gymnasium.Env) for full gym compatibility        │
  │                                                              │
  │  reset() → EchoObservation                                   │
  │  step(EchoAction) → EchoObservation                          │
  │  state → EchoState  (property)                               │
  │    ├─ accuracy_reward     (domain-aware, fuzzy matching)     │
  │    ├─ brier_reward        (BS = (p-o)², reward = 1-2*BS)     │
  │    ├─ overconfidence_pen  (−0.60 at ≥80%, −0.80 at ≥95%)     │
  │    └─ underconfidence_pen (−0.10 if correct but ≤20%)        │
  └──────────────────┬───────────────────────────────────────────┘
                     │ create_fastapi_app(EchoOpenEnv, ...)
  ┌──────────────────▼───────────────────────────────────────────┐
  │         OpenEnv HTTP Server (create_fastapi_app)             │
  │         /reset  /step  /state  /health  /schema  /ws         │
  └──────────────────┬───────────────────────────────────────────┘
                     │ reward signal
  ┌──────────────────▼───────────────────────────────────────────┐
  │       GRPOTrainer (HuggingFace TRL ≥0.9.0)                   │
  │       Model: Qwen/Qwen2.5-7B-Instruct                        │
  │       3-phase curriculum | KL penalty | 4 generations/step   │
  └──────────────────┬───────────────────────────────────────────┘
                     │ calibrated model
  ┌──────────────────▼───────────────────────────────────────────┐
  │       5 Calibration Metrics                                  │
  │       ECE | MCE | Brier Score | Sharpness | Resolution       │
  └──────────────────────────────────────────────────────────────┘
```
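
For reference, here is a hedged sketch of how the environment's reward could be wired into TRL's `GRPOTrainer` (assuming a TRL version that ships GRPO support). `score_completion` reuses the parser and reward sketches above; the one-row dataset, the `beta` value, and the batch sizes are illustrative stand-ins for what `training/train.py` actually does:

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def echo_reward_func(completions, ground_truth, **kwargs):
    # TRL passes extra dataset columns (here: ground_truth) to the reward function.
    return [score_completion(c, t) for c, t in zip(completions, ground_truth)]

def score_completion(completion: str, truth: str) -> float:
    parsed = parse_response(completion)      # from the parser sketch above
    if parsed is None:
        return -1.0                          # malformed output gets a flat penalty
    confidence, answer = parsed
    return echo_reward(confidence, answer.strip().lower() == truth.lower())

train_dataset = Dataset.from_list([
    {"prompt": "What is the capital of Australia? Respond as "
               "<confidence>..</confidence><answer>..</answer>.",
     "ground_truth": "Canberra"},
])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=echo_reward_func,
    args=GRPOConfig(
        output_dir="checkpoints",
        num_generations=4,                   # 4 generations/step, as in the run above
        per_device_train_batch_size=4,       # effective batch must divide evenly by num_generations
        beta=0.04,                           # KL penalty toward the reference model (illustrative value)
    ),
    train_dataset=train_dataset,
)
trainer.train()
```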

---

## 🔬 5 Calibration Metrics

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **ECE** | Σ (│Bₘ│/n) × │acc(Bₘ) − conf(Bₘ)│ | Primary metric. Lower = better. Perfect = 0.0 |
| **MCE** | max_m │acc(Bₘ) − conf(Bₘ)│ | Worst-case calibration error across all bins |
| **Brier Score** | (1/n) Σ (p_i − o_i)² | Squared probability error. 0 = perfect, 0.25 = random |
| **Sharpness** | (1/n) Σ (p_i − mean(p))² | Variance of predictions. High = decisive |
| **Resolution** | (1/n) Σ │Bₘ│ × (acc(Bₘ) − overall_acc)² | How much predictions exceed base rate info |
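
The repo computes these in `core/metrics.py`; as a reference, here is a minimal NumPy sketch of ECE, assuming ten equal-width confidence bins (bin count and binning scheme are implementation choices, not confirmed from the source):

```python
import numpy as np

def ece(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Expected Calibration Error: bin by confidence, then average |acc - conf| weighted by bin mass."""
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for m in range(n_bins):
        mask = bins == m
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return total

# A model that says 90% but is right only 50% of the time has ECE of about 0.40:
print(ece(np.full(100, 0.9), np.array([1, 0] * 50)))
```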

---

## 🚀 Quick Start

```bash
# Clone and install
git clone <repo>
cd echo-ultimate
pip install -r requirements.txt

# Verify everything works (no GPU, ~5 seconds)
python run.py test

# Generate all 6 publication plots (synthetic data, instant)
python run.py plots

# Download real datasets from HuggingFace (~5 minutes)
python run.py download

# Evaluate 4 baselines + generate real comparison plots
python run.py baseline

# Launch interactive demo
python run.py demo        # http://localhost:7860

# Launch API server
python run.py server      # http://localhost:7860/docs

# Full GRPO training (GPU required, ~2-4 hours)
python run.py train
```

---

## 🔌 OpenEnv API

ECHO uses `create_fastapi_app` from `openenv.core`, following the standard OpenEnv protocol:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start episode → `EchoObservation` |
| `/step` | POST | Submit `EchoAction` → `EchoObservation` |
| `/state` | GET | Current `EchoState` |
| `/health` | GET | Status + version |
| `/schema` | GET | JSON schemas for action + observation |
| `/ws` | WS | Persistent WebSocket session |
| `/tasks` | GET | All 3 task definitions |
| `/metrics` | GET | Full CalibrationReport (5 metrics) |
| `/metrics/{domain}` | GET | Domain-specific calibration |
| `/fingerprint` | GET | Domain calibration radar data |
| `/history` | GET | Last 100 episode logs |
| `/docs` | GET | Swagger UI |

**Quick test:**
```bash
# Start server
python run.py server &

curl http://localhost:7860/health
# → {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0"}

curl -X POST http://localhost:7860/reset
# → EchoObservation with question, domain, difficulty, ece

curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"response":"<confidence>72</confidence><answer>Paris</answer>"}'
# → EchoObservation with reward=0.814, done=true, is_correct=true
```

**Python client:**
```python
from client import EchoClient
from models import EchoAction

client = EchoClient(base_url="http://localhost:7860")
obs = client.reset()
obs = client.step(EchoAction(response="<confidence>72</confidence><answer>Paris</answer>"))
print(obs.reward, obs.is_correct, obs.ece)
```
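
A full evaluation loop is only a few lines more. This sketch uses a stubbed `answer_question` policy (a placeholder for a real model call) and a fixed 60% confidence, then reads the running ECE off the final observation:

```python
from client import EchoClient
from models import EchoAction

def answer_question(question: str) -> str:
    return "Canberra"  # placeholder: call your model here

client = EchoClient(base_url="http://localhost:7860")
for _ in range(100):
    obs = client.reset()
    answer = answer_question(obs.question)
    obs = client.step(EchoAction(response=f"<confidence>60</confidence><answer>{answer}</answer>"))
print("running ECE:", obs.ece)  # aggregate report also available at GET /metrics
```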

---

## 📁 Project Structure

```
echo-ultimate/
├── config.py                    All hyperparameters (single source of truth)
├── run.py                       CLI: test | baseline | plots | train | eval | demo | server
├── openenv.yaml                 OpenEnv manifest
├── models.py                    EchoAction / EchoObservation / EchoState (openenv Pydantic types)
├── client.py                    EchoClient (HTTPEnvClient subclass)
├── ECHO_Training.ipynb          Colab GRPO training notebook
├── Dockerfile                   HF Spaces deployment
├── requirements.txt
│
├── env/
│   ├── openenv_env.py           EchoOpenEnv: extends Environment + gymnasium.Env
│   ├── echo_env.py              Core gymnasium.Env (7 domains, 3 phases)
│   ├── task_bank.py             7-domain task loading + curriculum sampling
│   ├── reward.py                All reward components + RewardHistory
│   ├── parser.py                Robust <confidence><answer> parser (15+ edge cases)
│   └── self_consistency.py      Multi-sample confidence adjustment
│
├── core/
│   ├── tasks.py                 3 OpenEnv task definitions + TaskRunner
│   ├── metrics.py               ECE, MCE, Brier, Sharpness, Resolution
│   ├── graders.py               Domain-specific answer graders
│   ├── baseline.py              4 baseline agents + evaluation runner
│   └── epistemic_fingerprint.py Radar chart + heatmap generation
│
├── training/
│   ├── train.py                 GRPO training with 3-phase curriculum
│   ├── curriculum.py            Phase manager (ECE-triggered advancement)
│   ├── dataset.py               GRPO dataset builder with chat template support
│   └── evaluate.py              Full eval suite + all 6 plot generators
│
├── server/app.py                OpenEnv server (create_fastapi_app + extra endpoints)
├── ui/app.py                    Gradio 5-tab demo
├── results/
│   ├── training_log.csv         Real training data: 5,800 steps, 3 phases
│   └── plots/                   6 publication plots (training_curves, reliability, domain…)
└── scripts/
    ├── download_tasks.py        Download 7 HuggingFace datasets
    ├── run_baseline.py          Evaluate baselines + generate plots
    └── generate_plots.py        Generate all 6 plots (synthetic, instant)
```

---

## 🛠️ Tech Stack

| Component | Technology |
|-----------|-----------|
| RL Training | HuggingFace TRL ≥0.9.0 (GRPOTrainer) |
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Environment | openenv.core.Environment + gymnasium ≥1.0.0 |
| Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
| Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
| API Server | FastAPI + create_fastapi_app (OpenEnv) + uvicorn |
| Demo UI | Gradio 4 |
| Plots | matplotlib (dark theme, dpi=150) |

---

## 📖 Citation

```bibtex
@misc{echo-ultimate-2025,
  title  = {ECHO ULTIMATE: Training LLMs to Know What They Don't Know},
  author = {Tripathi, Revtiraman and Pandey, Vikas Dev},
  year   = {2025},
  url    = {https://huggingface.co/spaces/revti126/echo-ultimate},
  note   = {OpenEnv Hackathon Submission}
}
```

---

*Built for the OpenEnv Hackathon, 2025. MIT License.*