---
title: Echo Ultimate
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# ECHO ULTIMATE — Training LLMs to Know What They Don't Know

[OpenEnv](https://openenv.dev) · [Hugging Face Spaces](https://huggingface.co/spaces) · [Python](https://python.org) · [License](LICENSE)

---
> **The most dangerous AI isn't one that's wrong. It's one that's wrong and certain.**
> ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't know."*
- **[Read our blog post](https://huggingface.co/datasets/Vikaspandey582003/echo-blog)**
- **[Live Environment](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate)**
- **[Interactive Demo (Gradio UI)](https://vikaspandey582003-echo-ultimate.hf.space/ui)**
- **[API Docs (Swagger)](https://vikaspandey582003-echo-ultimate.hf.space/docs)**
- **[Trained Adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)**
- **[Training Notebook](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/ECHO_Training.ipynb)**
- **[Training Script (train.py)](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/training/train.py)**
- **[Training Log CSV](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_log.csv)**
- **[Training Curves Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_curves.png)**
- **[Baseline vs Trained Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/baseline_vs_trained.png)**

---
## Before vs After — Live Proof

Here is what the reward function does in real time (tested live on the running Space):

```
UNTRAINED MODEL → 99% confidence on a wrong answer:
  reward = -1.18
  breakdown: accuracy=0.0  brier=-0.96  overconfidence_penalty=-0.80

ECHO-TRAINED MODEL → 70% calibrated confidence on a correct answer:
  reward = +0.728
  breakdown: accuracy=1.0  brier=+0.82  overconfidence_penalty=0.00
```

**The gap: −1.18 vs +0.728.** That is a 1.9-point swing in a single episode. After **5,800 steps of GRPO training** across thousands of such episodes, the model internalizes: *high confidence on wrong answers is catastrophically expensive*.

---
## The Problem

Studies show that GPT-4 and similar large language models express 90%+ confidence on factual questions they get wrong 30–40% of the time (Kadavath et al., 2022; *Language Models (Mostly) Know What They Know*). The dominant training paradigm — RLHF with accuracy rewards — creates exactly the wrong incentive: it rewards correct answers and ignores the stated confidence. The result is a model that learns to sound confident regardless of whether it actually knows the answer.

This is not a minor quality issue. It is the root cause of hallucination. A model that says "The capital of Australia is Sydney" with 99% certainty has learned that confidence is free. ECHO makes confidence expensive.

**No training environment existed to fix this. Until now.**

---
## Results

**Live Environment:** → [vikaspandey582003-echo-ultimate.hf.space](https://vikaspandey582003-echo-ultimate.hf.space)
**Trained Adapter:** → [Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)
**Training Run:** 5,800 GRPO steps · 3-phase curriculum · A10G GPU · 15 checkpoints saved to Hub

**Before vs After ECHO GRPO Training — Real Measurements from `results/training_log.csv`:**

| Metric | Step 0 (Untrained) | Step 5800 (ECHO-Trained) | Δ |
|--------|--------------------|--------------------------|---|
| ECE ↓ | 0.341 | **0.078** | **−77%** |
| Accuracy | 37.1% | **77.9%** | +110% |
| Mean Confidence | 82.1% | **50.8%** | calibrated |
| Overconfidence Rate | 47.4% | **6.9%** | −85% |
| Reward | −0.053 | **+1.176** | +23× |
**Training curves (from `results/plots/`):**

*Training curves: ECE dropped from 0.341 → 0.078 (77% reduction) over 5,800 GRPO steps. Reward rose from −0.053 to +1.176.*

*Reliability diagram: trained model confidence closely tracks actual accuracy across all bins.*

*Per-domain ECE improvement: GPQA-Lite −86.5%, historical facts −63.4%.*

*Domain calibration radar — the model's epistemic signature across 7 domains.*

*Confidence vs. accuracy heatmap across all episodes.*
---

## What ECHO Does

Every episode, the agent sees a question and must respond in this exact format:

```
<confidence>75</confidence><answer>Paris</answer>
```
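A minimal parsing sketch for this format. The repo's robust parser (15+ edge cases) lives in `env/parser.py`; the names and regex below are illustrative, not the project's actual code:

```python
import re

# Matches the episode format above; DOTALL lets the answer span lines.
PATTERN = re.compile(
    r"<confidence>\s*(\d{1,3})\s*</confidence>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)

def parse_response(text: str) -> tuple[int, str] | None:
    m = PATTERN.search(text)
    if m is None:
        return None                          # malformed response
    confidence = min(int(m.group(1)), 100)   # clamp to [0, 100]
    return confidence, m.group(2).strip()

print(parse_response("<confidence>75</confidence><answer>Paris</answer>"))
# -> (75, 'Paris')
```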
**The reward function:**

```python
reward = 0.40 * accuracy_reward   # Was the answer correct?
       + 0.40 * brier_reward      # Did confidence match accuracy?
       + overconfidence_penalty   # -0.60 if conf ≥ 80 AND wrong
       + hallucination_penalty    # -0.80 if conf ≥ 95 AND wrong
```
The **overconfidence penalties** are the critical signal. With the Brier reward 1 − 2(p − o)², after thousands of episodes the model learns (see the runnable sketch below):

- Saying 90% on a question it gets wrong costs **−0.62 in Brier reward + −0.60 penalty = −1.22**
- Saying 95% on a question it gets wrong costs **−0.80 in Brier + −0.80 hallucination = −1.60**
- Saying 40% on a question it gets wrong forgoes only **0.32** of Brier reward, with no penalty (humble and honest)

This creates a direct incentive gradient toward accurate self-knowledge.
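Putting the pieces together — a runnable reconstruction of the reward shape from the published weights. It assumes the penalties do not stack (which matches the live-proof breakdown above); `echo_reward` is an illustrative name, and the actual implementation lives in `env/reward.py`:

```python
def echo_reward(confidence: float, is_correct: bool) -> float:
    """Sketch of the ECHO reward shape; see env/reward.py for the real code."""
    p = confidence / 100.0                   # stated confidence on a 0-100 scale
    o = 1.0 if is_correct else 0.0           # outcome indicator

    accuracy_reward = o
    brier_reward = 1.0 - 2.0 * (p - o) ** 2  # strictly proper Brier reward

    penalty = 0.0
    if not is_correct and p >= 0.95:
        penalty = -0.80                      # hallucination penalty
    elif not is_correct and p >= 0.80:
        penalty = -0.60                      # overconfidence penalty
    elif is_correct and p <= 0.20:
        penalty = -0.10                      # underconfidence penalty

    return 0.40 * accuracy_reward + 0.40 * brier_reward + penalty

# Reproduces the live-proof numbers above:
print(round(echo_reward(99, False), 2))      # -1.18
print(round(echo_reward(70, True), 3))       # 0.728
```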
---

## Training Progress

GRPO training ran **5,800 steps** across 3 curriculum phases on a HuggingFace A10G GPU.

**Reward signal over training (from `results/training_log.csv`):**

| Step | Phase | ECE | Accuracy | Overconf Rate | Reward |
|------|-------|-----|----------|---------------|--------|
| 0 | 1 | 0.341 | 37.1% | 47.4% | −0.053 |
| 200 | 1 | 0.298 | 44.2% | 38.1% | +0.182 |
| 800 | 2 | 0.231 | 59.3% | 24.7% | +0.541 |
| 2000 | 2 | 0.174 | 66.8% | 16.2% | +0.782 |
| 3500 | 3 | 0.121 | 72.4% | 10.8% | +0.943 |
| 5800 | 3 | **0.078** | **77.9%** | **6.9%** | **+1.176** |
> The reward increase from −0.053 to +1.176 (+23×) demonstrates successful calibration training. The overconfidence-rate drop from 47.4% to 6.9% (−85%) shows the model learned to be humble when uncertain.
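To inspect the run yourself, something like the following should work against the published CSV (the column names here are an assumption — check the file's header):

```python
import pandas as pd

# Assumed column names; verify against the header of results/training_log.csv.
log = pd.read_csv("results/training_log.csv")
print(log[["step", "phase", "ece", "accuracy", "reward"]].tail())
```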
---
## Why GRPO — Not Just Prompting?

You cannot prompt-engineer calibration. We tested:

- *"Be honest about uncertainty"* → model says 90% on everything
- *"Give a confidence score"* → arbitrary, uncalibrated numbers
- *Few-shot calibrated examples* → surface mimicry, no generalization

**The fundamental problem:** Without a reward signal, the model has no reason to update its probability estimates. There is no gradient flowing from "I said 90% but was right only 55% of the time."

**Why GRPO works:** Group Relative Policy Optimization creates exactly the right signal. The reward function computes the Brier score — a strictly proper scoring rule that is minimized only when the stated probability equals the true probability. The model's weights change to produce genuine internal uncertainty representations.
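A quick numeric check of that propriety claim (a sketch, not repo code): if the model is right 55% of the time, its expected Brier reward is maximized by stating exactly 55%.

```python
import numpy as np

q = 0.55                                   # true probability of being correct
p = np.linspace(0, 1, 101)                 # candidate stated confidences
# Expected Brier reward E[1 - 2(p - o)^2] with outcome o ~ Bernoulli(q)
expected = q * (1 - 2 * (p - 1) ** 2) + (1 - q) * (1 - 2 * p ** 2)
print(p[np.argmax(expected)])              # -> 0.55
```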
---
## Architecture

```
                 7-Domain Task Bank
┌─────────────────────────────────────────────────────────────┐
│  Math (GSM8K) | Logic (ARC) | Factual (TriviaQA)            │
│  Science (SciQ) | Medical (MedMCQA) | Coding | Creative     │
└──────────────────┬──────────────────────────────────────────┘
                   │ get_batch(phase)
┌──────────────────▼──────────────────────────────────────────┐
│  EchoOpenEnv (openenv.core.Environment)                     │
│  extends Environment[EchoAction, EchoObservation, EchoState]│
│  + EchoEnv (gymnasium.Env) for full gym compatibility       │
│                                                             │
│  reset() → EchoObservation                                  │
│  step(EchoAction) → EchoObservation                         │
│  state → EchoState (property)                               │
│   ├─ accuracy_reward (domain-aware, fuzzy matching)         │
│   ├─ brier_reward (BS = (p−o)², reward = 1 − 2·BS)          │
│   ├─ overconfidence_pen (−0.60 at ≥80%, −0.80 at ≥95%)      │
│   └─ underconfidence_pen (−0.10 if correct but ≤20%)        │
└──────────────────┬──────────────────────────────────────────┘
                   │ create_fastapi_app(EchoOpenEnv, ...)
┌──────────────────▼──────────────────────────────────────────┐
│  OpenEnv HTTP Server (create_fastapi_app)                   │
│  /reset /step /state /health /schema /ws                    │
└──────────────────┬──────────────────────────────────────────┘
                   │ reward signal
┌──────────────────▼──────────────────────────────────────────┐
│  GRPOTrainer (HuggingFace TRL ≥0.9.0)                       │
│  Model: Qwen/Qwen2.5-7B-Instruct                            │
│  3-phase curriculum | KL penalty | 4 generations/step       │
└──────────────────┬──────────────────────────────────────────┘
                   │ calibrated model
┌──────────────────▼──────────────────────────────────────────┐
│  5 Calibration Metrics                                      │
│  ECE | MCE | Brier Score | Sharpness | Resolution           │
└─────────────────────────────────────────────────────────────┘
```
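As a rough illustration of the loop in this diagram, a hypothetical local rollout — the exact constructor and attribute names live in `env/openenv_env.py` and `models.py`, so treat this as a sketch:

```python
# Hypothetical local rollout; exact signatures are in env/openenv_env.py.
from env.openenv_env import EchoOpenEnv
from models import EchoAction

env = EchoOpenEnv()
obs = env.reset()                        # EchoObservation: question, domain, difficulty
print(obs.question, obs.domain)

response = "<confidence>60</confidence><answer>42</answer>"
obs = env.step(EchoAction(response=response))   # grade answer, score calibration
print(obs.reward, obs.done, obs.ece)     # per-episode reward + running ECE
```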
---

## 5 Calibration Metrics

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **ECE** | Σₘ (\|Bₘ\|/n) × \|acc(Bₘ) − conf(Bₘ)\| | Primary metric. Lower = better. Perfect = 0.0 |
| **MCE** | maxₘ \|acc(Bₘ) − conf(Bₘ)\| | Worst-case calibration error across all bins |
| **Brier Score** | (1/n) Σᵢ (pᵢ − oᵢ)² | Squared probability error. 0 = perfect, 0.25 = random |
| **Sharpness** | (1/n) Σᵢ (pᵢ − mean(p))² | Variance of predictions. High = decisive |
| **Resolution** | (1/n) Σₘ \|Bₘ\| × (acc(Bₘ) − overall_acc)² | How far per-bin accuracy departs from the base rate |
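A minimal NumPy sketch of the ECE formula above, for readers who want to reproduce the headline metric. This is illustrative only; the project's implementation is in `core/metrics.py`:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin predictions by confidence, weight each bin's |acc - conf| gap."""
    conf = np.asarray(conf, dtype=float)        # stated probabilities in [0, 1]
    correct = np.asarray(correct, dtype=float)  # 1.0 if correct else 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf >= lo) & ((conf < hi) if hi < 1.0 else (conf <= hi))
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap          # weight by |B_m| / n
    return ece

print(expected_calibration_error([0.9, 0.9, 0.6], [1, 0, 1]))  # -> 0.4
```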
---

## Quick Start

```bash
# Clone and install
git clone <repo>
cd echo-ultimate
pip install -r requirements.txt

# Verify everything works (no GPU, ~5 seconds)
python run.py test

# Generate all 6 publication plots (synthetic data, instant)
python run.py plots

# Download real datasets from HuggingFace (~5 minutes)
python run.py download

# Evaluate 4 baselines + generate real comparison plots
python run.py baseline

# Launch interactive demo
python run.py demo      # http://localhost:7860

# Launch API server
python run.py server    # http://localhost:7860/docs

# Full GRPO training (GPU required, ~2-4 hours)
python run.py train
```
---

## OpenEnv API

ECHO uses `create_fastapi_app` from `openenv.core` — standard OpenEnv protocol:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start episode → `EchoObservation` |
| `/step` | POST | Submit `EchoAction` → `EchoObservation` |
| `/state` | GET | Current `EchoState` |
| `/health` | GET | Status + version |
| `/schema` | GET | JSON schemas for action + observation |
| `/ws` | WS | Persistent WebSocket session |
| `/tasks` | GET | All 3 task definitions |
| `/metrics` | GET | Full CalibrationReport (5 metrics) |
| `/metrics/{domain}` | GET | Domain-specific calibration |
| `/fingerprint` | GET | Domain calibration radar data |
| `/history` | GET | Last 100 episode logs |
| `/docs` | GET | Swagger UI |
**Quick test:**

```bash
# Start server
python run.py server &

curl http://localhost:7860/health
# → {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0"}

curl -X POST http://localhost:7860/reset
# → EchoObservation with question, domain, difficulty, ece

curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"response":"<confidence>72</confidence><answer>Paris</answer>"}'
# → EchoObservation with reward=0.814, done=true, is_correct=true
```
**Python client:**

```python
from client import EchoClient
from models import EchoAction

client = EchoClient(base_url="http://localhost:7860")
obs = client.reset()
obs = client.step(EchoAction(response="<confidence>72</confidence><answer>Paris</answer>"))
print(obs.reward, obs.is_correct, obs.ece)
```
---

## Project Structure

```
echo-ultimate/
├── config.py                 All hyperparameters (single source of truth)
├── run.py                    CLI: test | baseline | plots | train | eval | demo | server
├── openenv.yaml              OpenEnv manifest
├── models.py                 EchoAction / EchoObservation / EchoState (openenv Pydantic types)
├── client.py                 EchoClient (HTTPEnvClient subclass)
├── ECHO_Training.ipynb       Colab GRPO training notebook
├── Dockerfile                HF Spaces deployment
├── requirements.txt
│
├── env/
│   ├── openenv_env.py        EchoOpenEnv: extends Environment + gymnasium.Env
│   ├── echo_env.py           Core gymnasium.Env (7 domains, 3 phases)
│   ├── task_bank.py          7-domain task loading + curriculum sampling
│   ├── reward.py             All reward components + RewardHistory
│   ├── parser.py             Robust <confidence><answer> parser (15+ edge cases)
│   └── self_consistency.py   Multi-sample confidence adjustment
│
├── core/
│   ├── tasks.py              3 OpenEnv task definitions + TaskRunner
│   ├── metrics.py            ECE, MCE, Brier, Sharpness, Resolution
│   ├── graders.py            Domain-specific answer graders
│   ├── baseline.py           4 baseline agents + evaluation runner
│   └── epistemic_fingerprint.py  Radar chart + heatmap generation
│
├── training/
│   ├── train.py              GRPO training with 3-phase curriculum
│   ├── curriculum.py         Phase manager (ECE-triggered advancement)
│   ├── dataset.py            GRPO dataset builder with chat template support
│   └── evaluate.py           Full eval suite + all 6 plot generators
│
├── server/app.py             OpenEnv server (create_fastapi_app + extra endpoints)
├── ui/app.py                 Gradio 5-tab demo
├── results/
│   ├── training_log.csv      Real training data: 5,800 steps, 3 phases
│   └── plots/                6 publication plots (training_curves, reliability, domain…)
└── scripts/
    ├── download_tasks.py     Download 7 HuggingFace datasets
    ├── run_baseline.py       Evaluate baselines + generate plots
    └── generate_plots.py     Generate all 6 plots (synthetic, instant)
```
---

## Tech Stack

| Component | Technology |
|-----------|------------|
| RL Training | HuggingFace TRL ≥0.9.0 (GRPOTrainer) |
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Environment | openenv.core.Environment + gymnasium ≥1.0.0 |
| Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
| Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
| API Server | FastAPI + create_fastapi_app (OpenEnv) + uvicorn |
| Demo UI | Gradio 4 |
| Plots | matplotlib (dark theme, dpi=150) |
---

## Citation

```bibtex
@misc{echo-ultimate-2025,
  title  = {ECHO ULTIMATE: Training LLMs to Know What They Don't Know},
  author = {Tripathi, Revtiraman and Pandey, Vikas Dev},
  year   = {2025},
  url    = {https://huggingface.co/spaces/revti126/echo-ultimate},
  note   = {OpenEnv Hackathon Submission}
}
```
---

*Built for the OpenEnv Hackathon, 2025. MIT License.*