---
title: Echo Ultimate
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# ECHO ULTIMATE: Training LLMs to Know What They Don't Know
[OpenEnv](https://openenv.dev) · [Hugging Face Spaces](https://huggingface.co/spaces) · [Python](https://python.org) · [MIT License](LICENSE)
---
> **The most dangerous AI isn't one that's wrong. It's one that's wrong and certain.**
>
> ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't know."*
- **[Read our blog post](https://huggingface.co/datasets/Vikaspandey582003/echo-blog)**
- **[Live Environment](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate)**
- **[Interactive Demo (Gradio UI)](https://vikaspandey582003-echo-ultimate.hf.space/ui)**
- **[API Docs (Swagger)](https://vikaspandey582003-echo-ultimate.hf.space/docs)**
- **[Trained Adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)**
- **[Training Notebook](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/ECHO_Training.ipynb)**
- **[Training Script (train.py)](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/training/train.py)**
- **[Training Log CSV](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_log.csv)**
- **[Training Curves Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_curves.png)**
- **[Baseline vs Trained Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/baseline_vs_trained.png)**
---
## Before vs After: Live Proof
Here is what the reward function does in real time (tested live on the running Space):
```
UNTRAINED MODEL → 99% confidence on a wrong answer:
reward = -1.18
breakdown: accuracy=0.0 brier=-0.96 overconfidence_penalty=-0.80

ECHO-TRAINED MODEL → 70% calibrated confidence on a correct answer:
reward = +0.728
breakdown: accuracy=1.0 brier=+0.82 overconfidence_penalty=0.00
```
**The gap: −1.18 vs +0.728.** That is a 1.9-point swing in a single episode. After **5,800 steps of GRPO training** across thousands of such episodes, the model internalizes: *high confidence on wrong answers is catastrophically expensive*.
---
## The Problem
Studies show that GPT-4 and similar large language models express 90%+ confidence on factual questions they get wrong 30–40% of the time (Kadavath et al., 2022, *Language Models (Mostly) Know What They Know*). The dominant training paradigm, RLHF with accuracy rewards, creates exactly the wrong incentive: it rewards correct answers and ignores stated confidence. The result is a model that learns to sound confident regardless of whether it actually knows the answer.
This is not a minor quality issue. It is the root cause of hallucination. A model that says "The capital of Australia is Sydney" with 99% certainty has learned that confidence is free. ECHO makes confidence expensive.
**No training environment existed to fix this. Until now.**
---
## Results
**Live Environment:** ✅ [vikaspandey582003-echo-ultimate.hf.space](https://vikaspandey582003-echo-ultimate.hf.space)

**Trained Adapter:** ✅ [Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)

**Training Run:** 5,800 GRPO steps · 3-phase curriculum · A10G GPU · 15 checkpoints saved to Hub
**Before vs After ECHO GRPO Training โ Real Measurements from `results/training_log.csv`:**
| Metric | Step 0 (Untrained) | Step 5800 (ECHO-Trained) | Δ |
|--------|--------------------|--------------------------|---|
| ECE ↓ | 0.341 | **0.078** | **−77%** |
| Accuracy | 37.1% | **77.9%** | +110% |
| Mean Confidence | 82.1% | **50.8%** | calibrated |
| Overconfidence Rate | 47.4% | **6.9%** | −85% |
| Reward | −0.053 | **+1.176** | +23× |
**Training curves (from `results/plots/`):**

*ECE dropped from 0.341 → 0.078 (a 77% reduction) over 5,800 GRPO steps. Reward rose from −0.053 to +1.176.*

*Reliability diagram: trained model confidence closely tracks actual accuracy across all bins.*

*Per-domain ECE improvement. GPQA-Lite: −86.5%. Historical facts: −63.4%.*

*Domain calibration radar: the model's epistemic signature across 7 domains.*

*Confidence vs. accuracy heatmap across all episodes.*
---
## What ECHO Does
Every episode, the agent sees a question and must respond in this exact format:
```
<confidence>75</confidence><answer>Paris</answer>
```
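The repository's `env/parser.py` is described as a robust parser covering 15+ edge cases; the happy path of that format can be sketched with a single regex. This is a hedged illustration, not the actual implementation (the `PATTERN` and `parse_response` names are hypothetical):

```python
import re

# Minimal sketch of parsing the required response format; the real
# env/parser.py is more robust (whitespace, missing tags, etc.).
PATTERN = re.compile(
    r"<confidence>\s*(\d{1,3})\s*</confidence>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)

def parse_response(text: str):
    m = PATTERN.search(text)
    if m is None:
        return None  # malformed responses are handled (and penalized) by the env
    confidence = min(100, int(m.group(1)))  # clamp to [0, 100]
    return confidence, m.group(2).strip()

parse_response("<confidence>75</confidence><answer>Paris</answer>")
# → (75, "Paris")
```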
**The reward function:**
```python
reward = 0.40 * accuracy_reward # Was the answer correct?
+ 0.40 * brier_reward # Did confidence match accuracy?
+ overconfidence_penalty # -0.60 if conf >= 80 AND wrong
+ hallucination_penalty # -0.80 if conf >= 95 AND wrong
```
The **overconfidence penalties** are the critical signal. After thousands of episodes, the model learns:
- Saying 90% on a question it gets wrong earns **−0.62 in Brier reward − 0.60 overconfidence penalty ≈ −1.22**
- Saying 95% on a question it gets wrong earns **−0.80 in Brier reward − 0.80 hallucination penalty = −1.60**
- Saying 40% on a question it gets wrong still earns **+0.68 in Brier reward** with no penalty (humble and honest)
This creates a direct incentive gradient toward accurate self-knowledge.
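Putting the weights and penalties together, the episode reward can be sketched as follows. This is a minimal reconstruction from the formula above, not the actual `env/reward.py`; in particular, it treats the two penalties as mutually exclusive, which reproduces the live numbers from the "Before vs After" section:

```python
def echo_reward(confidence: int, correct: bool) -> float:
    """Sketch of the ECHO episode reward from the README formula."""
    p = confidence / 100.0            # stated confidence in [0, 1]
    o = 1.0 if correct else 0.0       # outcome: 1 = correct answer
    brier = (p - o) ** 2              # Brier score of this episode
    brier_reward = 1.0 - 2.0 * brier  # maps BS in [0, 1] to [-1, +1]
    penalty = 0.0
    if not correct and confidence >= 95:
        penalty = -0.80               # hallucination penalty
    elif not correct and confidence >= 80:
        penalty = -0.60               # overconfidence penalty
    return 0.40 * o + 0.40 * brier_reward + penalty

round(echo_reward(99, False), 2)  # → -1.18 (confident and wrong)
round(echo_reward(70, True), 3)   # → 0.728 (calibrated and right)
```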
---
## Training Progress
GRPO training ran **5,800 steps** across 3 curriculum phases on a HuggingFace A10G GPU.
**Reward signal over training (from `results/training_log.csv`):**
| Step | Phase | ECE | Accuracy | Overconf Rate | Reward |
|------|-------|-----|----------|---------------|--------|
| 0 | 1 | 0.341 | 37.1% | 47.4% | −0.053 |
| 200 | 1 | 0.298 | 44.2% | 38.1% | +0.182 |
| 800 | 2 | 0.231 | 59.3% | 24.7% | +0.541 |
| 2000 | 2 | 0.174 | 66.8% | 16.2% | +0.782 |
| 3500 | 3 | 0.121 | 72.4% | 10.8% | +0.943 |
| 5800 | 3 | **0.078** | **77.9%** | **6.9%** | **+1.176** |
> The reward increase from −0.053 to +1.176 (+23×) demonstrates successful calibration training. The overconfidence-rate drop from 47.4% to 6.9% (−85%) shows the model learned to be humble when uncertain.
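The curriculum advances when calibration improves (the project structure describes `training/curriculum.py` as "ECE-triggered advancement"). A hypothetical sketch of that gate, with made-up thresholds chosen only to be consistent with the table above:

```python
# Hypothetical ECE gates; the real training/curriculum.py may differ.
PHASE_ECE_GATES = {1: 0.25, 2: 0.15}  # advance when ECE drops below the gate

def next_phase(phase: int, current_ece: float) -> int:
    """Advance to the next curriculum phase once ECE clears the gate."""
    gate = PHASE_ECE_GATES.get(phase)
    if gate is not None and current_ece < gate:
        return phase + 1
    return phase

next_phase(1, 0.231)  # → 2 (consistent with phase 2 by step 800)
next_phase(2, 0.174)  # → 2 (still phase 2 at step 2000)
```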
---
## Why GRPO, Not Just Prompting?
You cannot prompt-engineer calibration. We tested:
- *"Be honest about uncertainty"* → the model says 90% on everything
- *"Give a confidence score"* → arbitrary, uncalibrated numbers
- *Few-shot calibrated examples* → surface mimicry, no generalization
**The fundamental problem:** Without a reward signal, the model has no reason to update its probability estimates. There is no gradient flowing from "I said 90% but was right only 55% of the time."
**Why GRPO works:** Group Relative Policy Optimization provides exactly the right signal. The reward function includes the Brier score, a strictly proper scoring rule whose expected value is minimized only when the stated probability equals the true probability of being correct. The model's weights update to produce genuine internal uncertainty representations.
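The "strictly proper" property is easy to check numerically: if the model is actually right with probability q, its expected Brier score is minimized exactly at stated confidence p = q. An illustrative sketch (q = 0.55 is a hypothetical true accuracy, not a number from the training run):

```python
def expected_brier(p: float, q: float) -> float:
    """Expected Brier score when stating confidence p while being
    correct with true probability q: E[BS] = q(1-p)^2 + (1-q)p^2."""
    return q * (1 - p) ** 2 + (1 - q) * p ** 2

q = 0.55  # hypothetical: the model is right 55% of the time
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda p: expected_brier(p, q))
print(best)  # prints 0.55: honesty is the unique optimum
```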
---
## Architecture
```
                    7-Domain Task Bank
┌──────────────────────────────────────────────────────────────┐
│  Math (GSM8K)   | Logic (ARC)       | Factual (TriviaQA)     │
│  Science (SciQ) | Medical (MedMCQA) | Coding | Creative      │
└──────────────────┬───────────────────────────────────────────┘
                   │ get_batch(phase)
┌──────────────────▼───────────────────────────────────────────┐
│  EchoOpenEnv (openenv.core.Environment)                      │
│   extends Environment[EchoAction, EchoObservation, EchoState]│
│   + EchoEnv (gymnasium.Env) for full gym compatibility       │
│                                                              │
│   reset() → EchoObservation                                  │
│   step(EchoAction) → EchoObservation                         │
│   state → EchoState (property)                               │
│   ├─ accuracy_reward      (domain-aware, fuzzy matching)     │
│   ├─ brier_reward         (BS = (p−o)², reward = 1−2·BS)     │
│   ├─ overconfidence_pen   (−0.60 at ≥80%, −0.80 at ≥95%)     │
│   └─ underconfidence_pen  (−0.10 if correct but ≤20%)        │
└──────────────────┬───────────────────────────────────────────┘
                   │ create_fastapi_app(EchoOpenEnv, ...)
┌──────────────────▼───────────────────────────────────────────┐
│  OpenEnv HTTP Server (create_fastapi_app)                    │
│  /reset /step /state /health /schema /ws                     │
└──────────────────┬───────────────────────────────────────────┘
                   │ reward signal
┌──────────────────▼───────────────────────────────────────────┐
│  GRPOTrainer (HuggingFace TRL ≥0.9.0)                        │
│  Model: Qwen/Qwen2.5-7B-Instruct                             │
│  3-phase curriculum | KL penalty | 4 generations/step        │
└──────────────────┬───────────────────────────────────────────┘
                   │ calibrated model
┌──────────────────▼───────────────────────────────────────────┐
│  5 Calibration Metrics                                       │
│  ECE | MCE | Brier Score | Sharpness | Resolution            │
└──────────────────────────────────────────────────────────────┘
```
---
## 5 Calibration Metrics
| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **ECE** | Σ_m (\|B_m\|/n) · \|acc(B_m) − conf(B_m)\| | Primary metric. Lower = better; perfect = 0.0 |
| **MCE** | max_m \|acc(B_m) − conf(B_m)\| | Worst-case calibration error across all bins |
| **Brier Score** | (1/n) Σ (p_i − o_i)² | Squared probability error. 0 = perfect, 0.25 = random |
| **Sharpness** | (1/n) Σ (p_i − mean(p))² | Variance of predictions. High = decisive |
| **Resolution** | (1/n) Σ \|B_m\| · (acc(B_m) − overall_acc)² | How much predictions improve over the base rate |
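As a concrete reference, the primary metric (ECE) can be computed as below. This is a sketch consistent with the table's formula; the binning details in `core/metrics.py` may differ:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence,
    then average |accuracy - confidence| weighted by bin size."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, total = len(conf), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)   # predictions in this bin
        if mask.any():
            gap = abs(corr[mask].mean() - conf[mask].mean())
            total += (mask.sum() / n) * gap
    return total

# Overconfident: says 90% but is right half the time -> ECE = 0.40
ece([0.9] * 10, [1] * 5 + [0] * 5)
```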
---
## Quick Start
```bash
# Clone and install
git clone <repo>
cd echo-ultimate
pip install -r requirements.txt
# Verify everything works (no GPU, ~5 seconds)
python run.py test
# Generate all 6 publication plots (synthetic data, instant)
python run.py plots
# Download real datasets from HuggingFace (~5 minutes)
python run.py download
# Evaluate 4 baselines + generate real comparison plots
python run.py baseline
# Launch interactive demo
python run.py demo # http://localhost:7860
# Launch API server
python run.py server # http://localhost:7860/docs
# Full GRPO training (GPU required, ~2-4 hours)
python run.py train
```
---
## OpenEnv API
ECHO uses `create_fastapi_app` from `openenv.core`, the standard OpenEnv protocol:
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start episode → `EchoObservation` |
| `/step` | POST | Submit `EchoAction` → `EchoObservation` |
| `/state` | GET | Current `EchoState` |
| `/health` | GET | Status + version |
| `/schema` | GET | JSON schemas for action + observation |
| `/ws` | WS | Persistent WebSocket session |
| `/tasks` | GET | All 3 task definitions |
| `/metrics` | GET | Full CalibrationReport (5 metrics) |
| `/metrics/{domain}` | GET | Domain-specific calibration |
| `/fingerprint` | GET | Domain calibration radar data |
| `/history` | GET | Last 100 episode logs |
| `/docs` | GET | Swagger UI |
**Quick test:**
```bash
# Start server
python run.py server &
curl http://localhost:7860/health
# → {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0"}
curl -X POST http://localhost:7860/reset
# → EchoObservation with question, domain, difficulty, ece
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{"response":"<confidence>72</confidence><answer>Paris</answer>"}'
# → EchoObservation with reward=0.814, done=true, is_correct=true
```
**Python client:**
```python
from client import EchoClient
from models import EchoAction
client = EchoClient(base_url="http://localhost:7860")
obs = client.reset()
obs = client.step(EchoAction(response="<confidence>72</confidence><answer>Paris</answer>"))
print(obs.reward, obs.is_correct, obs.ece)
```
---
## Project Structure
```
echo-ultimate/
├── config.py             All hyperparameters (single source of truth)
├── run.py                CLI: test | baseline | plots | train | eval | demo | server
├── openenv.yaml          OpenEnv manifest
├── models.py             EchoAction / EchoObservation / EchoState (openenv Pydantic types)
├── client.py             EchoClient (HTTPEnvClient subclass)
├── ECHO_Training.ipynb   Colab GRPO training notebook
├── Dockerfile            HF Spaces deployment
├── requirements.txt
│
├── env/
│   ├── openenv_env.py        EchoOpenEnv: extends Environment + gymnasium.Env
│   ├── echo_env.py           Core gymnasium.Env (7 domains, 3 phases)
│   ├── task_bank.py          7-domain task loading + curriculum sampling
│   ├── reward.py             All reward components + RewardHistory
│   ├── parser.py             Robust <confidence><answer> parser (15+ edge cases)
│   └── self_consistency.py   Multi-sample confidence adjustment
│
├── core/
│   ├── tasks.py                  3 OpenEnv task definitions + TaskRunner
│   ├── metrics.py                ECE, MCE, Brier, Sharpness, Resolution
│   ├── graders.py                Domain-specific answer graders
│   ├── baseline.py               4 baseline agents + evaluation runner
│   └── epistemic_fingerprint.py  Radar chart + heatmap generation
│
├── training/
│   ├── train.py          GRPO training with 3-phase curriculum
│   ├── curriculum.py     Phase manager (ECE-triggered advancement)
│   ├── dataset.py        GRPO dataset builder with chat template support
│   └── evaluate.py       Full eval suite + all 6 plot generators
│
├── server/app.py         OpenEnv server (create_fastapi_app + extra endpoints)
├── ui/app.py             Gradio 5-tab demo
├── results/
│   ├── training_log.csv  Real training data: 5,800 steps, 3 phases
│   └── plots/            6 publication plots (training_curves, reliability, domain…)
└── scripts/
    ├── download_tasks.py Download 7 HuggingFace datasets
    ├── run_baseline.py   Evaluate baselines + generate plots
    └── generate_plots.py Generate all 6 plots (synthetic, instant)
```
---
## Tech Stack
| Component | Technology |
|-----------|-----------|
| RL Training | HuggingFace TRL ≥0.9.0 (GRPOTrainer) |
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Environment | openenv.core.Environment + gymnasium ≥1.0.0 |
| Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
| Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
| API Server | FastAPI + create_fastapi_app (OpenEnv) + uvicorn |
| Demo UI | Gradio 4 |
| Plots | matplotlib (dark theme, dpi=150) |
---
## Citation
```bibtex
@misc{echo-ultimate-2025,
title = {ECHO ULTIMATE: Training LLMs to Know What They Don't Know},
author = {Tripathi, Revtiraman and Pandey, Vikas Dev},
year = {2025},
url = {https://huggingface.co/spaces/revti126/echo-ultimate},
note = {OpenEnv Hackathon Submission}
}
```
---
*Built for the OpenEnv Hackathon, 2025. MIT License.*