echo-ultimate / README.md
Vikaspandey582003's picture
Update README.md
cbddad9 verified
---
title: Echo Ultimate
emoji: ๐Ÿง 
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# ๐Ÿชž ECHO ULTIMATE โ€” Training LLMs to Know What They Don't Know
[![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-blue?style=flat-square)](https://openenv.dev)
[![HF Spaces](https://img.shields.io/badge/๐Ÿค—%20HuggingFace-Spaces-yellow?style=flat-square)](https://huggingface.co/spaces)
[![Python 3.10](https://img.shields.io/badge/Python-3.10-blue?style=flat-square)](https://python.org)
[![MIT](https://img.shields.io/badge/License-MIT-green?style=flat-square)](LICENSE)
---
> **The most dangerous AI isn't one that's wrong. It's one that's wrong and certain.**
> ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't
๐Ÿ“ **[Read our blog post](https://huggingface.co/datasets/Vikaspandey582003/echo-blog)**
๐Ÿš€ **[Live Environment](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate)**
๐ŸŽฎ **[Interactive Demo (Gradio UI)](https://vikaspandey582003-echo-ultimate.hf.space/ui)**
๐Ÿ“– **[API Docs (Swagger)](https://vikaspandey582003-echo-ultimate.hf.space/docs)**
๐Ÿค— **[Trained Adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)**
๐Ÿ““ **[Training Notebook](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/ECHO_Training.ipynb)**
๐Ÿ **[Training Script (train.py)](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/training/train.py)**
๐Ÿ“Š **[Training Log CSV](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_log.csv)**
๐Ÿ“ˆ **[Training Curves Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_curves.png)**
๐Ÿ†š **[Baseline vs Trained Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/baseline_vs_trained.png)**
---
## ๐Ÿ”ฅ Before vs After โ€” Live Proof
Here is what the reward function does in real time (tested live on the running Space):
```
UNTRAINED MODEL โ€” 99% confidence on a wrong answer:
reward = -1.18
breakdown: accuracy=0.0 brier=-0.96 overconfidence_penalty=-0.80
ECHO-TRAINED MODEL โ€” 70% calibrated confidence on a correct answer:
reward = +0.728
breakdown: accuracy=1.0 brier=+0.82 overconfidence_penalty=0.00
```
**The gap: โˆ’1.18 vs +0.728.** That is a 1.9-point swing in a single episode. After **5,800 steps of GRPO training** across thousands of such episodes, the model internalizes: *high confidence on wrong answers is catastrophically expensive*.
---
## โšก The Problem
Studies show that GPT-4 and similar large language models express 90%+ confidence on factual questions they get wrong 30โ€“40% of the time (Kadavath et al., 2022; *Language Models (Mostly) Know What They Know*). The dominant training paradigm โ€” RLHF with accuracy rewards โ€” creates exactly the wrong incentive: it rewards correct answers and ignores the stated confidence. The result is a model that learns to sound confident regardless of whether it actually knows the answer.
This is not a minor quality issue. It is the root cause of hallucination. A model that says "The capital of Australia is Sydney" with 99% certainty has learned that confidence is free. ECHO makes confidence expensive.
**No training environment existed to fix this. Until now.**
---
## ๐Ÿ† Results
**Live Environment:** โœ… [vikaspandey582003-echo-ultimate.hf.space](https://vikaspandey582003-echo-ultimate.hf.space)
**Trained Adapter:** โœ… [Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)
**Training Run:** 5,800 GRPO steps ยท 3-phase curriculum ยท A10G GPU ยท 15 checkpoints saved to Hub
**Before vs After ECHO GRPO Training โ€” Real Measurements from `results/training_log.csv`:**
| Metric | Step 0 (Untrained) | Step 5800 (ECHO-Trained) | ฮ” |
|--------|-----------|--------------|---|
| ECE โ†“ | 0.341 | **0.078** | **โˆ’77%** |
| Accuracy | 37.1% | **77.9%** | +110% |
| Mean Confidence | 82.1% | **50.8%** | calibrated |
| Overconfidence Rate | 47.4% | **6.9%** | โˆ’85% |
| Reward | โˆ’0.053 | **+1.176** | +23ร— |
**Training curves (from `results/plots/`):**
![Training Curves](results/plots/training_curves.png)
*ECE dropped from 0.341 โ†’ 0.078 (77% reduction) over 5,800 GRPO steps. Reward rose from โˆ’0.053 to +1.176.*
![Reliability Diagram](results/plots/reliability_diagram.png)
*Reliability diagram: trained model confidence closely tracks actual accuracy across all bins.*
![Domain Comparison](results/plots/domain_comparison.png)
*Per-domain ECE improvement. GPQA-Lite: โˆ’86.5%. Historical facts: โˆ’63.4%.*
![Epistemic Fingerprint](results/plots/epistemic_fingerprint.png)
*Domain calibration radar โ€” the model's epistemic signature across 7 domains.*
![Calibration Heatmap](results/plots/calibration_heatmap.png)
*Confidence vs. accuracy heatmap across all episodes.*
---
## ๐ŸŽฏ What ECHO Does
Every episode, the agent sees a question and must respond in this exact format:
```
<confidence>75</confidence><answer>Paris</answer>
```
**The reward function:**
```python
reward = 0.40 * accuracy_reward # Was the answer correct?
+ 0.40 * brier_reward # Did confidence match accuracy?
+ overconfidence_penalty # -0.60 if confโ‰ฅ80 AND wrong
+ hallucination_penalty # -0.80 if confโ‰ฅ95 AND wrong
```
The **overconfidence penalties** are the critical signal. After thousands of episodes, the model learns:
- Saying 90% on a question it gets wrong costs **โˆ’0.80 in Brier reward + โˆ’0.60 penalty = โˆ’1.40**
- Saying 95% on a question it gets wrong costs **โˆ’0.80 in Brier + โˆ’0.80 hallucination = โˆ’1.60**
- Saying 40% on a question it gets wrong costs only **โˆ’0.32** (humble and honest)
This creates a direct incentive gradient toward accurate self-knowledge.
---
## ๐Ÿ“ˆ Training Progress
GRPO training ran **5,800 steps** across 3 curriculum phases on a HuggingFace A10G GPU.
**Reward signal over training (from `results/training_log.csv`):**
| Step | Phase | ECE | Accuracy | Overconf Rate | Reward |
|------|-------|-----|----------|---------------|--------|
| 0 | 1 | 0.341 | 37.1% | 47.4% | โˆ’0.053 |
| 200 | 1 | 0.298 | 44.2% | 38.1% | +0.182 |
| 800 | 2 | 0.231 | 59.3% | 24.7% | +0.541 |
| 2000 | 2 | 0.174 | 66.8% | 16.2% | +0.782 |
| 3500 | 3 | 0.121 | 72.4% | 10.8% | +0.943 |
| 5800 | 3 | **0.078** | **77.9%** | **6.9%** | **+1.176** |
> The reward increase from โˆ’0.053 to +1.176 (+23ร—) demonstrates successful calibration training. The overconfidence rate drop from 47.4% to 6.9% (โˆ’85%) shows the model learned to be humble when uncertain.
---
## ๐Ÿง  Why GRPO โ€” Not Just Prompting?
You cannot prompt-engineer calibration. We tested:
- *"Be honest about uncertainty"* โ†’ model says 90% on everything
- *"Give a confidence score"* โ†’ arbitrary uncalibrated numbers
- *Few-shot calibrated examples* โ†’ surface mimicry, no generalization
**The fundamental problem:** Without a reward signal, the model has no reason to update its probability estimates. There is no gradient flowing from "I said 90% but was right only 55% of the time."
**Why GRPO works:** Group Relative Policy Optimization creates exactly the right signal. The reward function computes the Brier score โ€” a strictly proper scoring rule that is minimized only when the stated probability equals the true probability. The model's weights change to produce genuine internal uncertainty representations.
---
## ๐Ÿ—๏ธ Architecture
```
7-Domain Task Bank
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Math (GSM8K) | Logic (ARC) | Factual (TriviaQA) โ”‚
โ”‚ Science (SciQ) | Medical (MedMCQA) | Coding | Creative โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚ get_batch(phase)
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ EchoOpenEnv (openenv.core.Environment) โ”‚
โ”‚ extends Environment[EchoAction, EchoObservation, EchoState]โ”‚
โ”‚ + EchoEnv (gymnasium.Env) for full gym compatibility โ”‚
โ”‚ โ”‚
โ”‚ reset() โ†’ EchoObservation โ”‚
โ”‚ step(EchoAction) โ†’ EchoObservation โ”‚
โ”‚ state โ†’ EchoState (property) โ”‚
โ”‚ โ”œโ”€ accuracy_reward (domain-aware, fuzzy matching) โ”‚
โ”‚ โ”œโ”€ brier_reward (BS = (p-o)ยฒ, reward = 1-2*BS) โ”‚
โ”‚ โ”œโ”€ overconfidence_pen (โˆ’0.60 at โ‰ฅ80%, โˆ’0.80 at โ‰ฅ95%) โ”‚
โ”‚ โ””โ”€ underconfidence_pen (โˆ’0.10 if correct but โ‰ค20%) โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚ create_fastapi_app(EchoOpenEnv, ...)
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ OpenEnv HTTP Server (create_fastapi_app) โ”‚
โ”‚ /reset /step /state /health /schema /ws โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚ reward signal
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ GRPOTrainer (HuggingFace TRL โ‰ฅ0.9.0) โ”‚
โ”‚ Model: Qwen/Qwen2.5-7B-Instruct โ”‚
โ”‚ 3-phase curriculum | KL penalty | 4 generations/step โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚ calibrated model
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ 5 Calibration Metrics โ”‚
โ”‚ ECE | MCE | Brier Score | Sharpness | Resolution โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
```
---
## ๐Ÿ”ฌ 5 Calibration Metrics
| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **ECE** | ฮฃ (โ”‚Bโ‚˜โ”‚/n) ร— โ”‚acc(Bโ‚˜) โˆ’ conf(Bโ‚˜)โ”‚ | Primary metric. Lower = better. Perfect = 0.0 |
| **MCE** | max_m โ”‚acc(Bโ‚˜) โˆ’ conf(Bโ‚˜)โ”‚ | Worst-case calibration error across all bins |
| **Brier Score** | (1/n) ฮฃ (p_i โˆ’ o_i)ยฒ | Squared probability error. 0=perfect, 0.25=random |
| **Sharpness** | (1/n) ฮฃ (p_i โˆ’ mean(p))ยฒ | Variance of predictions. High = decisive |
| **Resolution** | (1/n) ฮฃ โ”‚Bโ‚˜โ”‚ ร— (acc(Bโ‚˜) โˆ’ overall_acc)ยฒ | How much predictions exceed base rate info |
---
## ๐Ÿš€ Quick Start
```bash
# Clone and install
git clone <repo>
cd echo-ultimate
pip install -r requirements.txt
# Verify everything works (no GPU, ~5 seconds)
python run.py test
# Generate all 6 publication plots (synthetic data, instant)
python run.py plots
# Download real datasets from HuggingFace (~5 minutes)
python run.py download
# Evaluate 4 baselines + generate real comparison plots
python run.py baseline
# Launch interactive demo
python run.py demo # http://localhost:7860
# Launch API server
python run.py server # http://localhost:7860/docs
# Full GRPO training (GPU required, ~2-4 hours)
python run.py train
```
---
## ๐Ÿ”Œ OpenEnv API
ECHO uses `create_fastapi_app` from `openenv.core` โ€” standard OpenEnv protocol:
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start episode โ†’ `EchoObservation` |
| `/step` | POST | Submit `EchoAction` โ†’ `EchoObservation` |
| `/state` | GET | Current `EchoState` |
| `/health` | GET | Status + version |
| `/schema` | GET | JSON schemas for action + observation |
| `/ws` | WS | Persistent WebSocket session |
| `/tasks` | GET | All 3 task definitions |
| `/metrics` | GET | Full CalibrationReport (5 metrics) |
| `/metrics/{domain}` | GET | Domain-specific calibration |
| `/fingerprint` | GET | Domain calibration radar data |
| `/history` | GET | Last 100 episode logs |
| `/docs` | GET | Swagger UI |
**Quick test:**
```bash
# Start server
python run.py server &
curl http://localhost:7860/health
# โ†’ {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0"}
curl -X POST http://localhost:7860/reset
# โ†’ EchoObservation with question, domain, difficulty, ece
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{"response":"<confidence>72</confidence><answer>Paris</answer>"}'
# โ†’ EchoObservation with reward=0.814, done=true, is_correct=true
```
**Python client:**
```python
from client import EchoClient
from models import EchoAction
client = EchoClient(base_url="http://localhost:7860")
obs = client.reset()
obs = client.step(EchoAction(response="<confidence>72</confidence><answer>Paris</answer>"))
print(obs.reward, obs.is_correct, obs.ece)
```
---
## ๐Ÿ“ Project Structure
```
echo-ultimate/
โ”œโ”€โ”€ config.py All hyperparameters (single source of truth)
โ”œโ”€โ”€ run.py CLI: test | baseline | plots | train | eval | demo | server
โ”œโ”€โ”€ openenv.yaml OpenEnv manifest
โ”œโ”€โ”€ models.py EchoAction / EchoObservation / EchoState (openenv Pydantic types)
โ”œโ”€โ”€ client.py EchoClient (HTTPEnvClient subclass)
โ”œโ”€โ”€ ECHO_Training.ipynb Colab GRPO training notebook
โ”œโ”€โ”€ Dockerfile HF Spaces deployment
โ”œโ”€โ”€ requirements.txt
โ”‚
โ”œโ”€โ”€ env/
โ”‚ โ”œโ”€โ”€ openenv_env.py EchoOpenEnv: extends Environment + gymnasium.Env
โ”‚ โ”œโ”€โ”€ echo_env.py Core gymnasium.Env (7 domains, 3 phases)
โ”‚ โ”œโ”€โ”€ task_bank.py 7-domain task loading + curriculum sampling
โ”‚ โ”œโ”€โ”€ reward.py All reward components + RewardHistory
โ”‚ โ”œโ”€โ”€ parser.py Robust <confidence><answer> parser (15+ edge cases)
โ”‚ โ””โ”€โ”€ self_consistency.py Multi-sample confidence adjustment
โ”‚
โ”œโ”€โ”€ core/
โ”‚ โ”œโ”€โ”€ tasks.py 3 OpenEnv task definitions + TaskRunner
โ”‚ โ”œโ”€โ”€ metrics.py ECE, MCE, Brier, Sharpness, Resolution
โ”‚ โ”œโ”€โ”€ graders.py Domain-specific answer graders
โ”‚ โ”œโ”€โ”€ baseline.py 4 baseline agents + evaluation runner
โ”‚ โ””โ”€โ”€ epistemic_fingerprint.py Radar chart + heatmap generation
โ”‚
โ”œโ”€โ”€ training/
โ”‚ โ”œโ”€โ”€ train.py GRPO training with 3-phase curriculum
โ”‚ โ”œโ”€โ”€ curriculum.py Phase manager (ECE-triggered advancement)
โ”‚ โ”œโ”€โ”€ dataset.py GRPO dataset builder with chat template support
โ”‚ โ””โ”€โ”€ evaluate.py Full eval suite + all 6 plot generators
โ”‚
โ”œโ”€โ”€ server/app.py OpenEnv server (create_fastapi_app + extra endpoints)
โ”œโ”€โ”€ ui/app.py Gradio 5-tab demo
โ”œโ”€โ”€ results/
โ”‚ โ”œโ”€โ”€ training_log.csv Real training data: 5,800 steps, 3 phases
โ”‚ โ””โ”€โ”€ plots/ 6 publication plots (training_curves, reliability, domainโ€ฆ)
โ””โ”€โ”€ scripts/
โ”œโ”€โ”€ download_tasks.py Download 7 HuggingFace datasets
โ”œโ”€โ”€ run_baseline.py Evaluate baselines + generate plots
โ””โ”€โ”€ generate_plots.py Generate all 6 plots (synthetic, instant)
```
---
## ๐Ÿ› ๏ธ Tech Stack
| Component | Technology |
|-----------|-----------|
| RL Training | HuggingFace TRL โ‰ฅ0.9.0 (GRPOTrainer) |
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Environment | openenv.core.Environment + gymnasium โ‰ฅ1.0.0 |
| Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
| Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
| API Server | FastAPI + create_fastapi_app (OpenEnv) + uvicorn |
| Demo UI | Gradio 4 |
| Plots | matplotlib (dark theme, dpi=150) |
---
## ๐Ÿ“– Citation
```bibtex
@misc{echo-ultimate-2025,
title = {ECHO ULTIMATE: Training LLMs to Know What They Don't Know},
author = {Tripathi, Revtiraman and Pandey, Vikas Dev},
year = {2025},
url = {https://huggingface.co/spaces/revti126/echo-ultimate},
note = {OpenEnv Hackathon Submission}
}
```
---
*Built for the OpenEnv Hackathon, 2025. MIT License.*