---
title: Echo Ultimate
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# ECHO ULTIMATE: Training LLMs to Know What They Don't Know
[OpenEnv](https://openenv.dev) · [HF Spaces](https://huggingface.co/spaces) · [Python](https://python.org) · [License](LICENSE)
---
> **The most dangerous AI isn't one that's wrong. It's one that's wrong and certain.**
> ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't know"* and to mean it.
- **[Read our blog post](https://huggingface.co/datasets/Vikaspandey582003/echo-blog)**
- **[Live Environment](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate)**
- **[Interactive Demo (Gradio UI)](https://vikaspandey582003-echo-ultimate.hf.space/ui)**
- **[API Docs (Swagger)](https://vikaspandey582003-echo-ultimate.hf.space/docs)**
- **[Trained Adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)**
- **[Training Notebook](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/ECHO_Training.ipynb)**
- **[Training Script (train.py)](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/training/train.py)**
- **[Training Log CSV](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_log.csv)**
- **[Training Curves Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_curves.png)**
- **[Baseline vs Trained Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/baseline_vs_trained.png)**
---
## Before vs After: Live Proof
Here is what the reward function does in real time (tested live on the running Space):
```
UNTRAINED MODEL → 99% confidence on a wrong answer:
reward = -1.18
breakdown: accuracy=0.0  brier=-0.96  overconfidence_penalty=-0.80

ECHO-TRAINED MODEL → 70% calibrated confidence on a correct answer:
reward = +0.728
breakdown: accuracy=1.0  brier=+0.82  overconfidence_penalty=0.00
```
**The gap: −1.18 vs +0.728.** That is a 1.9-point swing in a single episode. After **5,800 steps of GRPO training** across thousands of such episodes, the model internalizes: *high confidence on wrong answers is catastrophically expensive*.
---
## The Problem
Studies show that GPT-4 and similar large language models express 90%+ confidence on factual questions they get wrong 30–40% of the time (Kadavath et al., 2022; *Language Models (Mostly) Know What They Know*). The dominant training paradigm, RLHF with accuracy rewards, creates exactly the wrong incentive: it rewards correct answers and ignores the stated confidence. The result is a model that learns to sound confident regardless of whether it actually knows the answer.
This is not a minor quality issue. It is the root cause of hallucination. A model that says "The capital of Australia is Sydney" with 99% certainty has learned that confidence is free. ECHO makes confidence expensive.
**No training environment existed to fix this. Until now.**
---
## Results
**Live Environment:** ✅ [vikaspandey582003-echo-ultimate.hf.space](https://vikaspandey582003-echo-ultimate.hf.space)
**Trained Adapter:** ✅ [Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)
**Training Run:** 5,800 GRPO steps · 3-phase curriculum · A10G GPU · 15 checkpoints saved to Hub
**Before vs After ECHO GRPO Training โ Real Measurements from `results/training_log.csv`:**
| Metric | Step 0 (Untrained) | Step 5800 (ECHO-Trained) | Δ |
|--------|-----------|--------------|---|
| ECE ↓ | 0.341 | **0.078** | **−77%** |
| Accuracy | 37.1% | **77.9%** | +110% |
| Mean Confidence | 82.1% | **50.8%** | calibrated |
| Overconfidence Rate | 47.4% | **6.9%** | −85% |
| Reward | −0.053 | **+1.176** | +23× |
**Training curves (from `results/plots/`):**

*ECE dropped from 0.341 → 0.078 (a 77% reduction) over 5,800 GRPO steps. Reward rose from −0.053 to +1.176.*

*Reliability diagram: trained model confidence closely tracks actual accuracy across all bins.*

*Per-domain ECE improvement. GPQA-Lite: −86.5%. Historical facts: −63.4%.*

*Domain calibration radar โ the model's epistemic signature across 7 domains.*

*Confidence vs. accuracy heatmap across all episodes.*
---
## What ECHO Does
Every episode, the agent sees a question and must respond in this exact format:
```
75Paris
```
**The reward function:**
```python
reward = 0.40 * accuracy_reward   # Was the answer correct?
       + 0.40 * brier_reward      # Did confidence match accuracy?
       + overconfidence_penalty   # -0.60 if conf >= 80 AND wrong
       + hallucination_penalty    # -0.80 if conf >= 95 AND wrong
The **overconfidence penalties** are the critical signal. After thousands of episodes, the model learns:
- Saying 90% on a question it gets wrong earns **0.40 × (1 − 2·0.9²) ≈ −0.25 in Brier reward, plus the −0.60 overconfidence penalty ≈ −0.85**
- Saying 95% on a question it gets wrong earns **≈ −0.32 in Brier reward, plus the −0.80 hallucination penalty ≈ −1.12**
- Saying 40% on the same wrong answer earns **≈ +0.27 in Brier reward with no penalty** (humble and honest)
This creates a direct incentive gradient toward accurate self-knowledge.
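The whole reward can be sketched in a few lines. The function name below is hypothetical (the real implementation lives in `env/reward.py`), and it assumes the ≥95% hallucination penalty replaces rather than stacks with the ≥80% penalty, which matches the live-proof breakdown above; the small underconfidence penalty comes from the architecture section:

```python
def calibration_reward(correct: bool, confidence: float) -> float:
    """ECHO-style reward sketch; confidence is a probability in [0, 1]."""
    outcome = 1.0 if correct else 0.0
    accuracy_reward = outcome
    # Brier score BS = (p - o)^2, rescaled so the reward lies in [-1, 1]
    brier_reward = 1.0 - 2.0 * (confidence - outcome) ** 2
    penalty = 0.0
    if not correct and confidence >= 0.95:
        penalty = -0.80   # hallucination: near-certain and wrong
    elif not correct and confidence >= 0.80:
        penalty = -0.60   # overconfident and wrong
    elif correct and confidence <= 0.20:
        penalty = -0.10   # underconfident despite being right
    return 0.40 * accuracy_reward + 0.40 * brier_reward + penalty

print(round(calibration_reward(False, 0.99), 3))  # -1.184 (the "before" episode)
print(round(calibration_reward(True, 0.70), 3))   # 0.728 (the "after" episode)
```

Note how the two example calls reproduce the live-proof numbers from the top of this README.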
---
## Training Progress
GRPO training ran **5,800 steps** across 3 curriculum phases on a HuggingFace A10G GPU.
**Reward signal over training (from `results/training_log.csv`):**
| Step | Phase | ECE | Accuracy | Overconf Rate | Reward |
|------|-------|-----|----------|---------------|--------|
| 0 | 1 | 0.341 | 37.1% | 47.4% | −0.053 |
| 200 | 1 | 0.298 | 44.2% | 38.1% | +0.182 |
| 800 | 2 | 0.231 | 59.3% | 24.7% | +0.541 |
| 2000 | 2 | 0.174 | 66.8% | 16.2% | +0.782 |
| 3500 | 3 | 0.121 | 72.4% | 10.8% | +0.943 |
| 5800 | 3 | **0.078** | **77.9%** | **6.9%** | **+1.176** |
> The reward increase from −0.053 to +1.176 (+23×) demonstrates successful calibration training. The overconfidence rate drop from 47.4% to 6.9% (−85%) shows the model learned to be humble when uncertain.
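The phase column advances as calibration improves. A hypothetical sketch of an ECE-gated phase manager in the spirit of `training/curriculum.py` (the gate thresholds below are illustrative assumptions, not the repo's actual values):

```python
# Illustrative ECE gates: leave a phase once the running ECE dips below its gate.
PHASE_ECE_GATES = {1: 0.25, 2: 0.15}   # phase -> ECE required to advance (assumed)

def next_phase(phase: int, running_ece: float) -> int:
    """Advance the curriculum phase when the running ECE beats the gate."""
    gate = PHASE_ECE_GATES.get(phase)
    if gate is not None and running_ece < gate:
        return phase + 1
    return phase

print(next_phase(1, 0.231))  # 2
print(next_phase(3, 0.078))  # 3 (final phase has no gate)
```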
---
## Why GRPO, Not Just Prompting?
You cannot prompt-engineer calibration. We tested:
- *"Be honest about uncertainty"* → model says 90% on everything
- *"Give a confidence score"* → arbitrary, uncalibrated numbers
- *Few-shot calibrated examples* → surface mimicry, no generalization
**The fundamental problem:** Without a reward signal, the model has no reason to update its probability estimates. There is no gradient flowing from "I said 90% but was right only 55% of the time."
**Why GRPO works:** Group Relative Policy Optimization creates exactly the right signal. The reward function computes the Brier score, a strictly proper scoring rule whose expected value is optimized only when the stated probability equals the true probability. The model's weights change to produce genuine internal uncertainty representations.
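That properness claim is easy to check numerically: if the answer is right with true probability q, the expected Brier reward 1 − 2·E[(p − o)²] peaks exactly at stated confidence p = q. A quick standalone check (not code from the repo):

```python
def expected_brier_reward(p: float, q: float) -> float:
    """Expected value of 1 - 2*(p - o)^2 when the answer is right with probability q."""
    return q * (1 - 2 * (p - 1) ** 2) + (1 - q) * (1 - 2 * p ** 2)

q = 0.55  # the model is actually right 55% of the time
best_p = max((p / 100 for p in range(101)), key=lambda p: expected_brier_reward(p, q))
print(best_p)  # 0.55 -- reporting the true hit rate maximizes expected reward
```

Any confidence other than 0.55 gives strictly lower expected reward, so there is a gradient toward honesty at every step.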
---
## Architecture
```
7-Domain Task Bank
┌────────────────────────────────────────────────────────────────┐
│ Math (GSM8K) | Logic (ARC) | Factual (TriviaQA)                │
│ Science (SciQ) | Medical (MedMCQA) | Coding | Creative         │
└───────────────────┬────────────────────────────────────────────┘
                    │ get_batch(phase)
┌───────────────────▼────────────────────────────────────────────┐
│ EchoOpenEnv (openenv.core.Environment)                         │
│   extends Environment[EchoAction, EchoObservation, EchoState]  │
│   + EchoEnv (gymnasium.Env) for full gym compatibility         │
│                                                                │
│   reset() → EchoObservation                                    │
│   step(EchoAction) → EchoObservation                           │
│   state → EchoState (property)                                 │
│     ├─ accuracy_reward     (domain-aware, fuzzy matching)      │
│     ├─ brier_reward        (BS = (p−o)², reward = 1−2·BS)      │
│     ├─ overconfidence_pen  (−0.60 at ≥80%, −0.80 at ≥95%)      │
│     └─ underconfidence_pen (−0.10 if correct but ≤20%)         │
└───────────────────┬────────────────────────────────────────────┘
                    │ create_fastapi_app(EchoOpenEnv, ...)
┌───────────────────▼────────────────────────────────────────────┐
│ OpenEnv HTTP Server (create_fastapi_app)                       │
│   /reset  /step  /state  /health  /schema  /ws                 │
└───────────────────┬────────────────────────────────────────────┘
                    │ reward signal
┌───────────────────▼────────────────────────────────────────────┐
│ GRPOTrainer (HuggingFace TRL ≥0.9.0)                           │
│   Model: Qwen/Qwen2.5-7B-Instruct                              │
│   3-phase curriculum | KL penalty | 4 generations/step         │
└───────────────────┬────────────────────────────────────────────┘
                    │ calibrated model
┌───────────────────▼────────────────────────────────────────────┐
│ 5 Calibration Metrics                                          │
│   ECE | MCE | Brier Score | Sharpness | Resolution             │
└────────────────────────────────────────────────────────────────┘
```
---
## 5 Calibration Metrics
| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **ECE** | Σ_m (\|B_m\|/n) · \|acc(B_m) − conf(B_m)\| | Primary metric. Lower = better. Perfect = 0.0 |
| **MCE** | max_m \|acc(B_m) − conf(B_m)\| | Worst-case calibration error across all bins |
| **Brier Score** | (1/n) Σ (p_i − o_i)² | Squared probability error. 0 = perfect, 0.25 = random |
| **Sharpness** | (1/n) Σ (p_i − mean(p))² | Variance of predictions. High = decisive |
| **Resolution** | (1/n) Σ \|B_m\| · (acc(B_m) − overall_acc)² | How much accuracy varies across bins, i.e. information beyond the base rate |
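To make the primary metric concrete, here is a minimal, dependency-free ECE sketch following the formula in the table with equal-width bins (the repo's `core/metrics.py` is the authoritative implementation):

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error: sum_m (|B_m|/n) * |acc(B_m) - conf(B_m)|."""
    n = len(confidences)
    total = 0.0
    for m in range(n_bins):
        lo, hi = m / n_bins, (m + 1) / n_bins
        # Bin B_m holds predictions with confidence in (lo, hi]
        bucket = [(c, o) for c, o in zip(confidences, correct) if lo < c <= hi]
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            acc = sum(o for _, o in bucket) / len(bucket)
            total += len(bucket) / n * abs(acc - avg_conf)
    return total

# Ten answers at 70% confidence, 7 of them correct: a perfectly calibrated bin
print(ece([0.7] * 10, [1] * 7 + [0] * 3))  # ~0.0
```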
---
## Quick Start
```bash
# Clone and install
git clone
cd echo-ultimate
pip install -r requirements.txt
# Verify everything works (no GPU, ~5 seconds)
python run.py test
# Generate all 6 publication plots (synthetic data, instant)
python run.py plots
# Download real datasets from HuggingFace (~5 minutes)
python run.py download
# Evaluate 4 baselines + generate real comparison plots
python run.py baseline
# Launch interactive demo
python run.py demo # http://localhost:7860
# Launch API server
python run.py server # http://localhost:7860/docs
# Full GRPO training (GPU required, ~2-4 hours)
python run.py train
```
---
## OpenEnv API
ECHO uses `create_fastapi_app` from `openenv.core`, following the standard OpenEnv protocol:
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start episode → `EchoObservation` |
| `/step` | POST | Submit `EchoAction` → `EchoObservation` |
| `/state` | GET | Current `EchoState` |
| `/health` | GET | Status + version |
| `/schema` | GET | JSON schemas for action + observation |
| `/ws` | WS | Persistent WebSocket session |
| `/tasks` | GET | All 3 task definitions |
| `/metrics` | GET | Full CalibrationReport (5 metrics) |
| `/metrics/{domain}` | GET | Domain-specific calibration |
| `/fingerprint` | GET | Domain calibration radar data |
| `/history` | GET | Last 100 episode logs |
| `/docs` | GET | Swagger UI |
**Quick test:**
```bash
# Start server
python run.py server &
curl http://localhost:7860/health
# → {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0"}
curl -X POST http://localhost:7860/reset
# → EchoObservation with question, domain, difficulty, ece
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{"response":"72Paris"}'
# → EchoObservation with reward=0.814, done=true, is_correct=true
```
**Python client:**
```python
from client import EchoClient
from models import EchoAction
client = EchoClient(base_url="http://localhost:7860")
obs = client.reset()
obs = client.step(EchoAction(response="72Paris"))
print(obs.reward, obs.is_correct, obs.ece)
```
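The client makes episode collection for a reward-driven training loop straightforward. A sketch with a hypothetical `collect_episode` helper; `policy` and `make_action` are stand-ins for your model and the `EchoAction` constructor (episodes are single-step, as the curl example above shows with `done=true`):

```python
def collect_episode(client, policy, make_action):
    """One ECHO episode: reset, answer once, read the calibration reward."""
    obs = client.reset()
    response_text = policy(obs.question)        # model emits confidence + answer
    obs = client.step(make_action(response_text))
    return response_text, obs.reward

# e.g. collect_episode(EchoClient(base_url="http://localhost:7860"),
#                      my_policy, lambda r: EchoAction(response=r))
```

Rewards gathered this way are what GRPO normalizes within each group of generations to form the policy-gradient signal.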
---
## Project Structure
```
echo-ultimate/
├── config.py                 All hyperparameters (single source of truth)
├── run.py                    CLI: test | baseline | plots | train | eval | demo | server
├── openenv.yaml              OpenEnv manifest
├── models.py                 EchoAction / EchoObservation / EchoState (openenv Pydantic types)
├── client.py                 EchoClient (HTTPEnvClient subclass)
├── ECHO_Training.ipynb       Colab GRPO training notebook
├── Dockerfile                HF Spaces deployment
├── requirements.txt
│
├── env/
│   ├── openenv_env.py        EchoOpenEnv: extends Environment + gymnasium.Env
│   ├── echo_env.py           Core gymnasium.Env (7 domains, 3 phases)
│   ├── task_bank.py          7-domain task loading + curriculum sampling
│   ├── reward.py             All reward components + RewardHistory
│   ├── parser.py             Robust parser (15+ edge cases)
│   └── self_consistency.py   Multi-sample confidence adjustment
│
├── core/
│   ├── tasks.py              3 OpenEnv task definitions + TaskRunner
│   ├── metrics.py            ECE, MCE, Brier, Sharpness, Resolution
│   ├── graders.py            Domain-specific answer graders
│   ├── baseline.py           4 baseline agents + evaluation runner
│   └── epistemic_fingerprint.py  Radar chart + heatmap generation
│
├── training/
│   ├── train.py              GRPO training with 3-phase curriculum
│   ├── curriculum.py         Phase manager (ECE-triggered advancement)
│   ├── dataset.py            GRPO dataset builder with chat template support
│   └── evaluate.py           Full eval suite + all 6 plot generators
│
├── server/app.py             OpenEnv server (create_fastapi_app + extra endpoints)
├── ui/app.py                 Gradio 5-tab demo
├── results/
│   ├── training_log.csv      Real training data: 5,800 steps, 3 phases
│   └── plots/                6 publication plots (training_curves, reliability, domain…)
└── scripts/
    ├── download_tasks.py     Download 7 HuggingFace datasets
    ├── run_baseline.py       Evaluate baselines + generate plots
    └── generate_plots.py     Generate all 6 plots (synthetic, instant)
```
---
## Tech Stack
| Component | Technology |
|-----------|-----------|
| RL Training | HuggingFace TRL ≥0.9.0 (GRPOTrainer) |
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Environment | openenv.core.Environment + gymnasium ≥1.0.0 |
| Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
| Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
| API Server | FastAPI + create_fastapi_app (OpenEnv) + uvicorn |
| Demo UI | Gradio 4 |
| Plots | matplotlib (dark theme, dpi=150) |
---
## Citation
```bibtex
@misc{echo-ultimate-2025,
title = {ECHO ULTIMATE: Training LLMs to Know What They Don't Know},
author = {Tripathi, Revtiraman and Pandey, Vikas Dev},
year = {2025},
url = {https://huggingface.co/spaces/revti126/echo-ultimate},
note = {OpenEnv Hackathon Submission}
}
```
---
*Built for the OpenEnv Hackathon, 2025. MIT License.*