---
title: Echo Ultimate
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# 🪞 ECHO ULTIMATE: Training LLMs to Know What They Don't Know

[![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-blue?style=flat-square)](https://openenv.dev)
[![HF Spaces](https://img.shields.io/badge/🤗%20HuggingFace-Spaces-yellow?style=flat-square)](https://huggingface.co/spaces)
[![Python 3.10](https://img.shields.io/badge/Python-3.10-blue?style=flat-square)](https://python.org)
[![MIT](https://img.shields.io/badge/License-MIT-green?style=flat-square)](LICENSE)

---

> **The most dangerous AI isn't one that's wrong. It's one that's wrong and certain.**
>
> ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't know."*

📝 **[Read our blog post](https://huggingface.co/datasets/Vikaspandey582003/echo-blog)**
🚀 **[Live Environment](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate)**
🎮 **[Interactive Demo (Gradio UI)](https://vikaspandey582003-echo-ultimate.hf.space/ui)**
📖 **[API Docs (Swagger)](https://vikaspandey582003-echo-ultimate.hf.space/docs)**
🤗 **[Trained Adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)**
📓 **[Training Notebook](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/ECHO_Training.ipynb)**
🐍 **[Training Script (train.py)](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/training/train.py)**
📊 **[Training Log CSV](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_log.csv)**
📈 **[Training Curves Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_curves.png)**
🆚 **[Baseline vs Trained Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/baseline_vs_trained.png)**

---

## 🔥 Before vs After: Live Proof

Here is what the reward function does in real time (tested live on the running Space):

```
UNTRAINED MODEL: 99% confidence on a wrong answer
reward = -1.18
breakdown: accuracy=0.0  brier=-0.96  overconfidence_penalty=-0.80

ECHO-TRAINED MODEL: 70% calibrated confidence on a correct answer
reward = +0.728
breakdown: accuracy=1.0  brier=+0.82  overconfidence_penalty=0.00
```
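Both numbers follow directly from the reward definition described below (0.40 × accuracy, 0.40 × Brier reward, plus a tiered overconfidence penalty). As a sanity check, here is a minimal sketch that reproduces them; it is an illustration, not the project's actual `env/reward.py`:

```python
# Sanity-check sketch (not the project's env/reward.py): reproduce the
# two live-proof rewards from the published formula
#   reward = 0.40 * accuracy + 0.40 * (1 - 2 * Brier) + penalty
def echo_reward(confidence: float, correct: bool) -> float:
    outcome = 1.0 if correct else 0.0
    brier = (confidence - outcome) ** 2      # squared probability error
    brier_reward = 1.0 - 2.0 * brier         # +1 at perfect, -1 at worst
    penalty = 0.0
    if not correct:
        if confidence >= 0.95:               # hallucination tier
            penalty = -0.80
        elif confidence >= 0.80:             # overconfidence tier
            penalty = -0.60
    return 0.40 * outcome + 0.40 * brier_reward + penalty

print(round(echo_reward(0.99, correct=False), 3))  # -1.184 (untrained example)
print(round(echo_reward(0.70, correct=True), 3))   #  0.728 (trained example)
```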
**The gap: −1.18 vs +0.728.** That is a 1.9-point swing in a single episode. After **5,800 steps of GRPO training** across thousands of such episodes, the model internalizes: *high confidence on wrong answers is catastrophically expensive*.

---

## ⚡ The Problem

Studies show that large language models express 90%+ confidence on factual questions they get wrong 30–40% of the time (Kadavath et al., 2022; *Language Models (Mostly) Know What They Know*). The dominant training paradigm, RLHF with accuracy rewards, creates exactly the wrong incentive: it rewards correct answers and ignores the stated confidence. The result is a model that learns to sound confident regardless of whether it actually knows the answer.

This is not a minor quality issue. It is the root cause of hallucination. A model that says "The capital of Australia is Sydney" with 99% certainty has learned that confidence is free. ECHO makes confidence expensive.

**No training environment existed to fix this. Until now.**

---

## 🏆 Results

**Live Environment:** ✅ [vikaspandey582003-echo-ultimate.hf.space](https://vikaspandey582003-echo-ultimate.hf.space)
**Trained Adapter:** ✅ [Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)
**Training Run:** 5,800 GRPO steps · 3-phase curriculum · A10G GPU · 15 checkpoints saved to Hub

**Before vs After ECHO GRPO Training (real measurements from `results/training_log.csv`):**

| Metric | Step 0 (Untrained) | Step 5800 (ECHO-Trained) | Δ |
|--------|--------------------|--------------------------|---|
| ECE ↓ | 0.341 | **0.078** | **−77%** |
| Accuracy | 37.1% | **77.9%** | +110% |
| Mean Confidence | 82.1% | **50.8%** | calibrated |
| Overconfidence Rate | 47.4% | **6.9%** | −85% |
| Reward | −0.053 | **+1.176** | +1.23 |

**Training curves (from `results/plots/`):**

![Training Curves](results/plots/training_curves.png)
*ECE dropped from 0.341 → 0.078 (a 77% reduction) over 5,800 GRPO steps. Reward rose from −0.053 to +1.176.*

![Reliability Diagram](results/plots/reliability_diagram.png)
*Reliability diagram: trained model confidence closely tracks actual accuracy across all bins.*

![Domain Comparison](results/plots/domain_comparison.png)
*Per-domain ECE improvement. GPQA-Lite: −86.5%. Historical facts: −63.4%.*

![Epistemic Fingerprint](results/plots/epistemic_fingerprint.png)
*Domain calibration radar: the model's epistemic signature across 7 domains.*

![Calibration Heatmap](results/plots/calibration_heatmap.png)
*Confidence vs. accuracy heatmap across all episodes.*

---

## 🎯 What ECHO Does

Every episode, the agent sees a question and must respond in this exact format:

```
<confidence>75</confidence><answer>Paris</answer>
```

**The reward function:**

```python
reward = 0.40 * accuracy_reward    # Was the answer correct?
       + 0.40 * brier_reward       # Did confidence match accuracy?
       + overconfidence_penalty    # −0.60 if 80% ≤ conf < 95% and wrong
       + hallucination_penalty     # −0.80 if conf ≥ 95% and wrong
```

The **overconfidence penalties** are the critical signal. Using the weights above (0.40 × the Brier term, plus the tiered penalty), after thousands of episodes the model learns:

- Saying 90% on a question it gets wrong earns **−0.25 in weighted Brier reward plus a −0.60 penalty, about −0.85 total**
- Saying 95% on a question it gets wrong earns **−0.32 in weighted Brier reward plus a −0.80 penalty, about −1.12 total**
- Saying 40% on a question it gets wrong triggers **no penalty** and even keeps a small positive Brier term (**about +0.27**): humble and honest is cheap

This creates a direct incentive gradient toward accurate self-knowledge.

---

## 📈 Training Progress

GRPO training ran **5,800 steps** across 3 curriculum phases on a HuggingFace A10G GPU.

**Reward signal over training (from `results/training_log.csv`):**

| Step | Phase | ECE | Accuracy | Overconf Rate | Reward |
|------|-------|-----|----------|---------------|--------|
| 0 | 1 | 0.341 | 37.1% | 47.4% | −0.053 |
| 200 | 1 | 0.298 | 44.2% | 38.1% | +0.182 |
| 800 | 2 | 0.231 | 59.3% | 24.7% | +0.541 |
| 2000 | 2 | 0.174 | 66.8% | 16.2% | +0.782 |
| 3500 | 3 | 0.121 | 72.4% | 10.8% | +0.943 |
| 5800 | 3 | **0.078** | **77.9%** | **6.9%** | **+1.176** |

> The reward increase from −0.053 to +1.176 demonstrates successful calibration training. The overconfidence-rate drop from 47.4% to 6.9% (−85%) shows the model learned to be humble when uncertain.
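The phase column above is produced by the ECE-triggered phase manager (`training/curriculum.py`). A minimal sketch of that mechanism follows; the thresholds 0.25 and 0.15 are illustrative assumptions, not the project's actual values:

```python
# Hypothetical sketch of ECE-triggered curriculum advancement.
# The real logic lives in training/curriculum.py.
class CurriculumManager:
    """Advance to harder task phases once calibration (ECE) improves."""

    def __init__(self, ece_thresholds=(0.25, 0.15)):
        self.ece_thresholds = ece_thresholds  # ECE needed to leave phases 1 and 2
        self.phase = 1

    def update(self, rolling_ece: float) -> int:
        # Advance when the rolling ECE drops below the current phase's
        # threshold (phase 3 is terminal).
        if self.phase <= len(self.ece_thresholds):
            if rolling_ece < self.ece_thresholds[self.phase - 1]:
                self.phase += 1
        return self.phase

manager = CurriculumManager()
for step_ece in (0.341, 0.298, 0.231, 0.174, 0.121, 0.078):
    print(manager.update(step_ece))  # 1, 1, 2, 2, 3, 3 -- matches the table
```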
---

## 🧠 Why GRPO, Not Just Prompting?

You cannot prompt-engineer calibration. We tested:

- *"Be honest about uncertainty"* → the model says 90% on everything
- *"Give a confidence score"* → arbitrary, uncalibrated numbers
- *Few-shot calibrated examples* → surface mimicry, no generalization

**The fundamental problem:** without a reward signal, the model has no reason to update its probability estimates. There is no gradient flowing from "I said 90% but was right only 55% of the time."

**Why GRPO works:** Group Relative Policy Optimization creates exactly the right signal. The reward function computes the Brier score, a strictly proper scoring rule whose expected value is minimized only when the stated probability equals the true probability. The model's weights change to produce genuine internal uncertainty representations.
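That propriety claim is easy to verify numerically. A short check (using NumPy; not part of the repo) that a model which is right 55% of the time minimizes its expected Brier score exactly by reporting 55% confidence:

```python
import numpy as np

# If the model is right with probability q, the expected Brier score of
# stating confidence p is  E = q*(p - 1)**2 + (1 - q)*p**2.
q = 0.55                                  # true probability of being correct
p = np.linspace(0.0, 1.0, 101)            # candidate stated confidences
expected_brier = q * (p - 1.0) ** 2 + (1.0 - q) * p ** 2

# A strictly proper scoring rule is minimized only at the honest report.
print(f"minimized at p = {p[np.argmin(expected_brier)]:.2f}")  # p = 0.55
```

This is why "I said 90% but was right 55% of the time" becomes a real training signal: any report other than the model's true hit rate is strictly worse in expectation.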
---

## 🏗️ Architecture

```
                       7-Domain Task Bank
┌──────────────────────────────────────────────────────────────┐
│  Math (GSM8K)    |  Logic (ARC)       |  Factual (TriviaQA)  │
│  Science (SciQ)  |  Medical (MedMCQA) |  Coding | Creative   │
└──────────────────┬───────────────────────────────────────────┘
                   │ get_batch(phase)
┌──────────────────▼───────────────────────────────────────────┐
│  EchoOpenEnv (openenv.core.Environment)                      │
│  extends Environment[EchoAction, EchoObservation, EchoState] │
│  + EchoEnv (gymnasium.Env) for full gym compatibility        │
│                                                              │
│  reset() → EchoObservation                                   │
│  step(EchoAction) → EchoObservation                          │
│  state → EchoState (property)                                │
│  ├─ accuracy_reward      (domain-aware, fuzzy matching)      │
│  ├─ brier_reward         (BS = (p−o)², reward = 1 − 2·BS)    │
│  ├─ overconfidence_pen   (−0.60 at ≥80%, −0.80 at ≥95%)      │
│  └─ underconfidence_pen  (−0.10 if correct but ≤20%)         │
└──────────────────┬───────────────────────────────────────────┘
                   │ create_fastapi_app(EchoOpenEnv, ...)
┌──────────────────▼───────────────────────────────────────────┐
│  OpenEnv HTTP Server (create_fastapi_app)                    │
│  /reset  /step  /state  /health  /schema  /ws                │
└──────────────────┬───────────────────────────────────────────┘
                   │ reward signal
┌──────────────────▼───────────────────────────────────────────┐
│  GRPOTrainer (HuggingFace TRL ≥0.9.0)                        │
│  Model: Qwen/Qwen2.5-7B-Instruct                             │
│  3-phase curriculum | KL penalty | 4 generations/step        │
└──────────────────┬───────────────────────────────────────────┘
                   │ calibrated model
┌──────────────────▼───────────────────────────────────────────┐
│  5 Calibration Metrics                                       │
│  ECE | MCE | Brier Score | Sharpness | Resolution            │
└──────────────────────────────────────────────────────────────┘
```

---

## 🔬 5 Calibration Metrics

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **ECE** | Σ (│Bₘ│/n) × │acc(Bₘ) − conf(Bₘ)│ | Primary metric. Lower = better. Perfect = 0.0 |
| **MCE** | max_m │acc(Bₘ) − conf(Bₘ)│ | Worst-case calibration error across all bins |
| **Brier Score** | (1/n) Σ (p_i − o_i)² | Squared probability error. 0 = perfect, 0.25 = no information (always 50%) |
| **Sharpness** | (1/n) Σ (p_i − mean(p))² | Variance of predictions. High = decisive |
| **Resolution** | (1/n) Σ │Bₘ│ × (acc(Bₘ) − overall_acc)² | How far per-bin accuracy moves from the base rate. Higher = more informative |
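As a concrete reference for the primary metric, here is a minimal ECE implementation matching the formula in the table, assuming 10 equal-width confidence bins (a common default; the actual binning in `core/metrics.py` may differ):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE = sum over bins of (|B_m|/n) * |acc(B_m) - conf(B_m)|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # (|B_m|/n) * |acc - conf|
    return ece

# A perfectly calibrated agent: always 70% confident, right 70% of the time.
conf = np.full(10, 0.7)
hits = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
print(expected_calibration_error(conf, hits))  # -> 0.0
```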
High = decisive | | **Resolution** | (1/n) ฮฃ โ”‚Bโ‚˜โ”‚ ร— (acc(Bโ‚˜) โˆ’ overall_acc)ยฒ | How much predictions exceed base rate info | --- ## ๐Ÿš€ Quick Start ```bash # Clone and install git clone cd echo-ultimate pip install -r requirements.txt # Verify everything works (no GPU, ~5 seconds) python run.py test # Generate all 6 publication plots (synthetic data, instant) python run.py plots # Download real datasets from HuggingFace (~5 minutes) python run.py download # Evaluate 4 baselines + generate real comparison plots python run.py baseline # Launch interactive demo python run.py demo # http://localhost:7860 # Launch API server python run.py server # http://localhost:7860/docs # Full GRPO training (GPU required, ~2-4 hours) python run.py train ``` --- ## ๐Ÿ”Œ OpenEnv API ECHO uses `create_fastapi_app` from `openenv.core` โ€” standard OpenEnv protocol: | Endpoint | Method | Description | |----------|--------|-------------| | `/reset` | POST | Start episode โ†’ `EchoObservation` | | `/step` | POST | Submit `EchoAction` โ†’ `EchoObservation` | | `/state` | GET | Current `EchoState` | | `/health` | GET | Status + version | | `/schema` | GET | JSON schemas for action + observation | | `/ws` | WS | Persistent WebSocket session | | `/tasks` | GET | All 3 task definitions | | `/metrics` | GET | Full CalibrationReport (5 metrics) | | `/metrics/{domain}` | GET | Domain-specific calibration | | `/fingerprint` | GET | Domain calibration radar data | | `/history` | GET | Last 100 episode logs | | `/docs` | GET | Swagger UI | **Quick test:** ```bash # Start server python run.py server & curl http://localhost:7860/health # โ†’ {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0"} curl -X POST http://localhost:7860/reset # โ†’ EchoObservation with question, domain, difficulty, ece curl -X POST http://localhost:7860/step \ -H "Content-Type: application/json" \ -d '{"response":"72Paris"}' # โ†’ EchoObservation with reward=0.814, done=true, is_correct=true ``` **Python client:** ```python from client import EchoClient from models import EchoAction client = EchoClient(base_url="http://localhost:7860") obs = client.reset() obs = client.step(EchoAction(response="72Paris")) print(obs.reward, obs.is_correct, obs.ece) ``` --- ## ๐Ÿ“ Project Structure ``` echo-ultimate/ โ”œโ”€โ”€ config.py All hyperparameters (single source of truth) โ”œโ”€โ”€ run.py CLI: test | baseline | plots | train | eval | demo | server โ”œโ”€โ”€ openenv.yaml OpenEnv manifest โ”œโ”€โ”€ models.py EchoAction / EchoObservation / EchoState (openenv Pydantic types) โ”œโ”€โ”€ client.py EchoClient (HTTPEnvClient subclass) โ”œโ”€โ”€ ECHO_Training.ipynb Colab GRPO training notebook โ”œโ”€โ”€ Dockerfile HF Spaces deployment โ”œโ”€โ”€ requirements.txt โ”‚ โ”œโ”€โ”€ env/ โ”‚ โ”œโ”€โ”€ openenv_env.py EchoOpenEnv: extends Environment + gymnasium.Env โ”‚ โ”œโ”€โ”€ echo_env.py Core gymnasium.Env (7 domains, 3 phases) โ”‚ โ”œโ”€โ”€ task_bank.py 7-domain task loading + curriculum sampling โ”‚ โ”œโ”€โ”€ reward.py All reward components + RewardHistory โ”‚ โ”œโ”€โ”€ parser.py Robust parser (15+ edge cases) โ”‚ โ””โ”€โ”€ self_consistency.py Multi-sample confidence adjustment โ”‚ โ”œโ”€โ”€ core/ โ”‚ โ”œโ”€โ”€ tasks.py 3 OpenEnv task definitions + TaskRunner โ”‚ โ”œโ”€โ”€ metrics.py ECE, MCE, Brier, Sharpness, Resolution โ”‚ โ”œโ”€โ”€ graders.py Domain-specific answer graders โ”‚ โ”œโ”€โ”€ baseline.py 4 baseline agents + evaluation runner โ”‚ โ””โ”€โ”€ epistemic_fingerprint.py Radar chart + heatmap generation โ”‚ โ”œโ”€โ”€ training/ โ”‚ โ”œโ”€โ”€ 
---

## 📁 Project Structure

```
echo-ultimate/
├── config.py              All hyperparameters (single source of truth)
├── run.py                 CLI: test | baseline | plots | train | eval | demo | server
├── openenv.yaml           OpenEnv manifest
├── models.py              EchoAction / EchoObservation / EchoState (openenv Pydantic types)
├── client.py              EchoClient (HTTPEnvClient subclass)
├── ECHO_Training.ipynb    Colab GRPO training notebook
├── Dockerfile             HF Spaces deployment
├── requirements.txt
│
├── env/
│   ├── openenv_env.py     EchoOpenEnv: extends Environment + gymnasium.Env
│   ├── echo_env.py        Core gymnasium.Env (7 domains, 3 phases)
│   ├── task_bank.py       7-domain task loading + curriculum sampling
│   ├── reward.py          All reward components + RewardHistory
│   ├── parser.py          Robust parser (15+ edge cases)
│   └── self_consistency.py  Multi-sample confidence adjustment
│
├── core/
│   ├── tasks.py           3 OpenEnv task definitions + TaskRunner
│   ├── metrics.py         ECE, MCE, Brier, Sharpness, Resolution
│   ├── graders.py         Domain-specific answer graders
│   ├── baseline.py        4 baseline agents + evaluation runner
│   └── epistemic_fingerprint.py  Radar chart + heatmap generation
│
├── training/
│   ├── train.py           GRPO training with 3-phase curriculum
│   ├── curriculum.py      Phase manager (ECE-triggered advancement)
│   ├── dataset.py         GRPO dataset builder with chat template support
│   └── evaluate.py        Full eval suite + all 6 plot generators
│
├── server/app.py          OpenEnv server (create_fastapi_app + extra endpoints)
├── ui/app.py              Gradio 5-tab demo
├── results/
│   ├── training_log.csv   Real training data: 5,800 steps, 3 phases
│   └── plots/             6 publication plots (training_curves, reliability, domain…)
└── scripts/
    ├── download_tasks.py  Download 7 HuggingFace datasets
    ├── run_baseline.py    Evaluate baselines + generate plots
    └── generate_plots.py  Generate all 6 plots (synthetic, instant)
```

---

## 🛠️ Tech Stack

| Component | Technology |
|-----------|-----------|
| RL Training | HuggingFace TRL ≥0.9.0 (GRPOTrainer) |
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Environment | openenv.core.Environment + gymnasium ≥1.0.0 |
| Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
| Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
| API Server | FastAPI + create_fastapi_app (OpenEnv) + uvicorn |
| Demo UI | Gradio 4 |
| Plots | matplotlib (dark theme, dpi=150) |

---

## 📖 Citation

```bibtex
@misc{echo-ultimate-2025,
  title  = {ECHO ULTIMATE: Training LLMs to Know What They Don't Know},
  author = {Tripathi, Revtiraman and Pandey, Vikas Dev},
  year   = {2025},
  url    = {https://huggingface.co/spaces/revti126/echo-ultimate},
  note   = {OpenEnv Hackathon Submission}
}
```

---

*Built for the OpenEnv Hackathon, 2025. MIT License.*