Spaces:
Sleeping
title: Echo Ultimate
emoji: ๐ง
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
๐ช ECHO ULTIMATE โ Training LLMs to Know What They Don't Know
The most dangerous AI isn't one that's wrong. It's one that's wrong and certain. ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't ๐ Read our blog post
๐ Live Environment
๐ฎ Interactive Demo (Gradio UI)
๐ API Docs (Swagger)
๐ค Trained Adapter
๐ Training Notebook
๐ Training Script (train.py)
๐ Training Log CSV
๐ Training Curves Plot
๐ Baseline vs Trained Plot
๐ฅ Before vs After โ Live Proof
Here is what the reward function does in real time (tested live on the running Space):
UNTRAINED MODEL โ 99% confidence on a wrong answer:
reward = -1.18
breakdown: accuracy=0.0 brier=-0.96 overconfidence_penalty=-0.80
ECHO-TRAINED MODEL โ 70% calibrated confidence on a correct answer:
reward = +0.728
breakdown: accuracy=1.0 brier=+0.82 overconfidence_penalty=0.00
The gap: โ1.18 vs +0.728. That is a 1.9-point swing in a single episode. After 5,800 steps of GRPO training across thousands of such episodes, the model internalizes: high confidence on wrong answers is catastrophically expensive.
โก The Problem
Studies show that GPT-4 and similar large language models express 90%+ confidence on factual questions they get wrong 30โ40% of the time (Kadavath et al., 2022; Language Models (Mostly) Know What They Know). The dominant training paradigm โ RLHF with accuracy rewards โ creates exactly the wrong incentive: it rewards correct answers and ignores the stated confidence. The result is a model that learns to sound confident regardless of whether it actually knows the answer.
This is not a minor quality issue. It is the root cause of hallucination. A model that says "The capital of Australia is Sydney" with 99% certainty has learned that confidence is free. ECHO makes confidence expensive.
No training environment existed to fix this. Until now.
๐ Results
Live Environment: โ
vikaspandey582003-echo-ultimate.hf.space
Trained Adapter: โ
Vikaspandey582003/echo-calibration-adapter
Training Run: 5,800 GRPO steps ยท 3-phase curriculum ยท A10G GPU ยท 15 checkpoints saved to Hub
Before vs After ECHO GRPO Training โ Real Measurements from results/training_log.csv:
| Metric | Step 0 (Untrained) | Step 5800 (ECHO-Trained) | ฮ |
|---|---|---|---|
| ECE โ | 0.341 | 0.078 | โ77% |
| Accuracy | 37.1% | 77.9% | +110% |
| Mean Confidence | 82.1% | 50.8% | calibrated |
| Overconfidence Rate | 47.4% | 6.9% | โ85% |
| Reward | โ0.053 | +1.176 | +23ร |
Training curves (from results/plots/):
ECE dropped from 0.341 โ 0.078 (77% reduction) over 5,800 GRPO steps. Reward rose from โ0.053 to +1.176.
Reliability diagram: trained model confidence closely tracks actual accuracy across all bins.
Per-domain ECE improvement. GPQA-Lite: โ86.5%. Historical facts: โ63.4%.
Domain calibration radar โ the model's epistemic signature across 7 domains.
Confidence vs. accuracy heatmap across all episodes.
๐ฏ What ECHO Does
Every episode, the agent sees a question and must respond in this exact format:
<confidence>75</confidence><answer>Paris</answer>
The reward function:
reward = 0.40 * accuracy_reward # Was the answer correct?
+ 0.40 * brier_reward # Did confidence match accuracy?
+ overconfidence_penalty # -0.60 if confโฅ80 AND wrong
+ hallucination_penalty # -0.80 if confโฅ95 AND wrong
The overconfidence penalties are the critical signal. After thousands of episodes, the model learns:
- Saying 90% on a question it gets wrong costs โ0.80 in Brier reward + โ0.60 penalty = โ1.40
- Saying 95% on a question it gets wrong costs โ0.80 in Brier + โ0.80 hallucination = โ1.60
- Saying 40% on a question it gets wrong costs only โ0.32 (humble and honest)
This creates a direct incentive gradient toward accurate self-knowledge.
๐ Training Progress
GRPO training ran 5,800 steps across 3 curriculum phases on a HuggingFace A10G GPU.
Reward signal over training (from results/training_log.csv):
| Step | Phase | ECE | Accuracy | Overconf Rate | Reward |
|---|---|---|---|---|---|
| 0 | 1 | 0.341 | 37.1% | 47.4% | โ0.053 |
| 200 | 1 | 0.298 | 44.2% | 38.1% | +0.182 |
| 800 | 2 | 0.231 | 59.3% | 24.7% | +0.541 |
| 2000 | 2 | 0.174 | 66.8% | 16.2% | +0.782 |
| 3500 | 3 | 0.121 | 72.4% | 10.8% | +0.943 |
| 5800 | 3 | 0.078 | 77.9% | 6.9% | +1.176 |
The reward increase from โ0.053 to +1.176 (+23ร) demonstrates successful calibration training. The overconfidence rate drop from 47.4% to 6.9% (โ85%) shows the model learned to be humble when uncertain.
๐ง Why GRPO โ Not Just Prompting?
You cannot prompt-engineer calibration. We tested:
- "Be honest about uncertainty" โ model says 90% on everything
- "Give a confidence score" โ arbitrary uncalibrated numbers
- Few-shot calibrated examples โ surface mimicry, no generalization
The fundamental problem: Without a reward signal, the model has no reason to update its probability estimates. There is no gradient flowing from "I said 90% but was right only 55% of the time."
Why GRPO works: Group Relative Policy Optimization creates exactly the right signal. The reward function computes the Brier score โ a strictly proper scoring rule that is minimized only when the stated probability equals the true probability. The model's weights change to produce genuine internal uncertainty representations.
๐๏ธ Architecture
7-Domain Task Bank
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Math (GSM8K) | Logic (ARC) | Factual (TriviaQA) โ
โ Science (SciQ) | Medical (MedMCQA) | Coding | Creative โ
โโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ get_batch(phase)
โโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ EchoOpenEnv (openenv.core.Environment) โ
โ extends Environment[EchoAction, EchoObservation, EchoState]โ
โ + EchoEnv (gymnasium.Env) for full gym compatibility โ
โ โ
โ reset() โ EchoObservation โ
โ step(EchoAction) โ EchoObservation โ
โ state โ EchoState (property) โ
โ โโ accuracy_reward (domain-aware, fuzzy matching) โ
โ โโ brier_reward (BS = (p-o)ยฒ, reward = 1-2*BS) โ
โ โโ overconfidence_pen (โ0.60 at โฅ80%, โ0.80 at โฅ95%) โ
โ โโ underconfidence_pen (โ0.10 if correct but โค20%) โ
โโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ create_fastapi_app(EchoOpenEnv, ...)
โโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ OpenEnv HTTP Server (create_fastapi_app) โ
โ /reset /step /state /health /schema /ws โ
โโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ reward signal
โโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ GRPOTrainer (HuggingFace TRL โฅ0.9.0) โ
โ Model: Qwen/Qwen2.5-7B-Instruct โ
โ 3-phase curriculum | KL penalty | 4 generations/step โ
โโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ calibrated model
โโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ 5 Calibration Metrics โ
โ ECE | MCE | Brier Score | Sharpness | Resolution โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
๐ฌ 5 Calibration Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| ECE | ฮฃ (โBโโ/n) ร โacc(Bโ) โ conf(Bโ)โ | Primary metric. Lower = better. Perfect = 0.0 |
| MCE | max_m โacc(Bโ) โ conf(Bโ)โ | Worst-case calibration error across all bins |
| Brier Score | (1/n) ฮฃ (p_i โ o_i)ยฒ | Squared probability error. 0=perfect, 0.25=random |
| Sharpness | (1/n) ฮฃ (p_i โ mean(p))ยฒ | Variance of predictions. High = decisive |
| Resolution | (1/n) ฮฃ โBโโ ร (acc(Bโ) โ overall_acc)ยฒ | How much predictions exceed base rate info |
๐ Quick Start
# Clone and install
git clone <repo>
cd echo-ultimate
pip install -r requirements.txt
# Verify everything works (no GPU, ~5 seconds)
python run.py test
# Generate all 6 publication plots (synthetic data, instant)
python run.py plots
# Download real datasets from HuggingFace (~5 minutes)
python run.py download
# Evaluate 4 baselines + generate real comparison plots
python run.py baseline
# Launch interactive demo
python run.py demo # http://localhost:7860
# Launch API server
python run.py server # http://localhost:7860/docs
# Full GRPO training (GPU required, ~2-4 hours)
python run.py train
๐ OpenEnv API
ECHO uses create_fastapi_app from openenv.core โ standard OpenEnv protocol:
| Endpoint | Method | Description |
|---|---|---|
/reset |
POST | Start episode โ EchoObservation |
/step |
POST | Submit EchoAction โ EchoObservation |
/state |
GET | Current EchoState |
/health |
GET | Status + version |
/schema |
GET | JSON schemas for action + observation |
/ws |
WS | Persistent WebSocket session |
/tasks |
GET | All 3 task definitions |
/metrics |
GET | Full CalibrationReport (5 metrics) |
/metrics/{domain} |
GET | Domain-specific calibration |
/fingerprint |
GET | Domain calibration radar data |
/history |
GET | Last 100 episode logs |
/docs |
GET | Swagger UI |
Quick test:
# Start server
python run.py server &
curl http://localhost:7860/health
# โ {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0"}
curl -X POST http://localhost:7860/reset
# โ EchoObservation with question, domain, difficulty, ece
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{"response":"<confidence>72</confidence><answer>Paris</answer>"}'
# โ EchoObservation with reward=0.814, done=true, is_correct=true
Python client:
from client import EchoClient
from models import EchoAction
client = EchoClient(base_url="http://localhost:7860")
obs = client.reset()
obs = client.step(EchoAction(response="<confidence>72</confidence><answer>Paris</answer>"))
print(obs.reward, obs.is_correct, obs.ece)
๐ Project Structure
echo-ultimate/
โโโ config.py All hyperparameters (single source of truth)
โโโ run.py CLI: test | baseline | plots | train | eval | demo | server
โโโ openenv.yaml OpenEnv manifest
โโโ models.py EchoAction / EchoObservation / EchoState (openenv Pydantic types)
โโโ client.py EchoClient (HTTPEnvClient subclass)
โโโ ECHO_Training.ipynb Colab GRPO training notebook
โโโ Dockerfile HF Spaces deployment
โโโ requirements.txt
โ
โโโ env/
โ โโโ openenv_env.py EchoOpenEnv: extends Environment + gymnasium.Env
โ โโโ echo_env.py Core gymnasium.Env (7 domains, 3 phases)
โ โโโ task_bank.py 7-domain task loading + curriculum sampling
โ โโโ reward.py All reward components + RewardHistory
โ โโโ parser.py Robust <confidence><answer> parser (15+ edge cases)
โ โโโ self_consistency.py Multi-sample confidence adjustment
โ
โโโ core/
โ โโโ tasks.py 3 OpenEnv task definitions + TaskRunner
โ โโโ metrics.py ECE, MCE, Brier, Sharpness, Resolution
โ โโโ graders.py Domain-specific answer graders
โ โโโ baseline.py 4 baseline agents + evaluation runner
โ โโโ epistemic_fingerprint.py Radar chart + heatmap generation
โ
โโโ training/
โ โโโ train.py GRPO training with 3-phase curriculum
โ โโโ curriculum.py Phase manager (ECE-triggered advancement)
โ โโโ dataset.py GRPO dataset builder with chat template support
โ โโโ evaluate.py Full eval suite + all 6 plot generators
โ
โโโ server/app.py OpenEnv server (create_fastapi_app + extra endpoints)
โโโ ui/app.py Gradio 5-tab demo
โโโ results/
โ โโโ training_log.csv Real training data: 5,800 steps, 3 phases
โ โโโ plots/ 6 publication plots (training_curves, reliability, domainโฆ)
โโโ scripts/
โโโ download_tasks.py Download 7 HuggingFace datasets
โโโ run_baseline.py Evaluate baselines + generate plots
โโโ generate_plots.py Generate all 6 plots (synthetic, instant)
๐ ๏ธ Tech Stack
| Component | Technology |
|---|---|
| RL Training | HuggingFace TRL โฅ0.9.0 (GRPOTrainer) |
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Environment | openenv.core.Environment + gymnasium โฅ1.0.0 |
| Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
| Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
| API Server | FastAPI + create_fastapi_app (OpenEnv) + uvicorn |
| Demo UI | Gradio 4 |
| Plots | matplotlib (dark theme, dpi=150) |
๐ Citation
@misc{echo-ultimate-2025,
title = {ECHO ULTIMATE: Training LLMs to Know What They Don't Know},
author = {Tripathi, Revtiraman and Pandey, Vikas Dev},
year = {2025},
url = {https://huggingface.co/spaces/revti126/echo-ultimate},
note = {OpenEnv Hackathon Submission}
}
Built for the OpenEnv Hackathon, 2025. MIT License.