---
title: Echo Ultimate
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# 🪞 ECHO ULTIMATE: Training LLMs to Know What They Don't Know

OpenEnv · HF Spaces · Python 3.10 · MIT


The most dangerous AI isn't one that's wrong. It's one that's wrong and certain. ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't know."*

- 📝 Read our blog post
- 🚀 Live Environment
- 🎮 Interactive Demo (Gradio UI)
- 📖 API Docs (Swagger)
- 🤗 Trained Adapter
- 📓 Training Notebook
- 🐍 Training Script (train.py)
- 📊 Training Log CSV
- 📈 Training Curves Plot
- 🆚 Baseline vs Trained Plot


## 🔥 Before vs After: Live Proof

Here is what the reward function does in real time (tested live on the running Space):

```
UNTRAINED MODEL (99% confidence on a wrong answer):
  reward = -1.18
  breakdown: accuracy=0.0  brier=-0.96  overconfidence_penalty=-0.80

ECHO-TRAINED MODEL (70% calibrated confidence on a correct answer):
  reward = +0.728
  breakdown: accuracy=1.0  brier=+0.82  overconfidence_penalty=0.00
```

The gap: -1.18 vs. +0.728, a 1.9-point swing in a single episode. After 5,800 steps of GRPO training across thousands of such episodes, the model internalizes that high confidence on wrong answers is catastrophically expensive.
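
You can reproduce both numbers from the formula alone. Here is a minimal sanity check, assuming the weighted reward form defined under "What ECHO Does" below (0.40/0.40 weights and brier_reward = 1 - 2·(p - outcome)²):

```python
# Sanity check of the two episodes above, assuming the reward form given in
# "What ECHO Does": reward = 0.40*accuracy + 0.40*(1 - 2*(p - outcome)**2) + penalty.
def episode_reward(confidence: float, correct: bool, penalty: float) -> float:
    outcome = 1.0 if correct else 0.0
    brier_reward = 1.0 - 2.0 * (confidence - outcome) ** 2  # +1 perfect, -1 maximally wrong
    return 0.40 * outcome + 0.40 * brier_reward + penalty

print(episode_reward(0.99, correct=False, penalty=-0.80))  # ≈ -1.184 (untrained)
print(episode_reward(0.70, correct=True, penalty=0.0))     # ≈ +0.728 (ECHO-trained)
```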


## ⚡ The Problem

Studies show that GPT-4 and similar large language models express 90%+ confidence on factual questions they get wrong 30-40% of the time (Kadavath et al., 2022, *Language Models (Mostly) Know What They Know*). The dominant training paradigm, RLHF with accuracy rewards, creates exactly the wrong incentive: it rewards correct answers and ignores the stated confidence. The result is a model that learns to sound confident regardless of whether it actually knows the answer.

This is not a minor quality issue. It is the root cause of hallucination. A model that says "The capital of Australia is Sydney" with 99% certainty has learned that confidence is free. ECHO makes confidence expensive.

No training environment existed to fix this. Until now.


๐Ÿ† Results

- **Live Environment:** ✅ vikaspandey582003-echo-ultimate.hf.space
- **Trained Adapter:** ✅ Vikaspandey582003/echo-calibration-adapter
- **Training Run:** 5,800 GRPO steps · 3-phase curriculum · A10G GPU · 15 checkpoints saved to Hub

Before vs. after ECHO GRPO training, real measurements from `results/training_log.csv`:

| Metric | Step 0 (Untrained) | Step 5800 (ECHO-Trained) | Δ |
| --- | --- | --- | --- |
| ECE ↓ | 0.341 | 0.078 | -77% |
| Accuracy | 37.1% | 77.9% | +110% |
| Mean Confidence | 82.1% | 50.8% | calibrated |
| Overconfidence Rate | 47.4% | 6.9% | -85% |
| Reward | -0.053 | +1.176 | +23× |

Training curves (from `results/plots/`):

- **Training Curves:** ECE dropped from 0.341 to 0.078 (a 77% reduction) over 5,800 GRPO steps; reward rose from -0.053 to +1.176.
- **Reliability Diagram:** trained model confidence closely tracks actual accuracy across all bins.
- **Domain Comparison:** per-domain ECE improvement. GPQA-Lite: -86.5%. Historical facts: -63.4%.
- **Epistemic Fingerprint:** domain calibration radar, the model's epistemic signature across 7 domains.
- **Calibration Heatmap:** confidence vs. accuracy heatmap across all episodes.


## 🎯 What ECHO Does

Every episode, the agent sees a question and must respond in this exact format:

```
<confidence>75</confidence><answer>Paris</answer>
```
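
For illustration, the happy path of parsing this format takes only a regex. This sketch is not the project's actual parser (`env/parser.py`), which handles many more edge cases:

```python
import re

# Illustrative happy-path parser for the response format above; the project's
# real parser (env/parser.py) is far more robust (15+ edge cases).
PATTERN = re.compile(
    r"<confidence>\s*(\d{1,3})\s*</confidence>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)

def parse_response(text: str) -> tuple[float, str] | None:
    match = PATTERN.search(text)
    if match is None:
        return None  # malformed response; caller decides how to penalize
    confidence = min(int(match.group(1)), 100) / 100.0  # clamp to [0, 1]
    return confidence, match.group(2).strip()

print(parse_response("<confidence>75</confidence><answer>Paris</answer>"))
# -> (0.75, 'Paris')
```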

The reward function:

```
reward = 0.40 * accuracy_reward          # Was the answer correct?
       + 0.40 * brier_reward             # Did confidence match accuracy?
       + overconfidence_penalty          # -0.60 if conf >= 80 AND wrong
       + hallucination_penalty           # -0.80 if conf >= 95 AND wrong
```

The overconfidence penalties are the critical signal. Plugging the weights above into the formula (a runnable sketch follows this list), a wrong answer costs:

- at 90% stated confidence: 0.40 × (1 - 2·0.81) = -0.25 from the Brier term, plus the -0.60 penalty, for a total of about -0.85
- at 95% stated confidence: 0.40 × (1 - 2·0.9025) = -0.32 from the Brier term, plus the -0.80 hallucination penalty, for a total of about -1.12
- at 40% stated confidence: no penalty fires, and the Brier term is even mildly positive (0.40 × (1 - 2·0.16) = +0.27); being humble about a miss is barely punished

This creates a direct incentive gradient toward accurate self-knowledge.
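
Putting the pieces together, here is a runnable sketch of the reward shape, using the weights and thresholds quoted above plus the small underconfidence penalty from the architecture diagram; the authoritative implementation is `env/reward.py`:

```python
def echo_reward(confidence: float, correct: bool) -> float:
    """Sketch of the ECHO reward shape; the full version lives in env/reward.py."""
    outcome = 1.0 if correct else 0.0
    brier_score = (confidence - outcome) ** 2
    brier_reward = 1.0 - 2.0 * brier_score

    penalty = 0.0
    if not correct and confidence >= 0.95:
        penalty = -0.80  # hallucination: near-certain and wrong
    elif not correct and confidence >= 0.80:
        penalty = -0.60  # overconfident and wrong
    elif correct and confidence <= 0.20:
        penalty = -0.10  # underconfident despite being right

    return 0.40 * outcome + 0.40 * brier_reward + penalty

for conf in (0.90, 0.95, 0.40):
    print(conf, round(echo_reward(conf, correct=False), 3))
# -> 0.9 -0.848, 0.95 -1.122, 0.4 0.272
```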


## 📈 Training Progress

GRPO training ran 5,800 steps across 3 curriculum phases on a HuggingFace A10G GPU.

Reward signal over training (from `results/training_log.csv`):

| Step | Phase | ECE | Accuracy | Overconf Rate | Reward |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0.341 | 37.1% | 47.4% | -0.053 |
| 200 | 1 | 0.298 | 44.2% | 38.1% | +0.182 |
| 800 | 2 | 0.231 | 59.3% | 24.7% | +0.541 |
| 2000 | 2 | 0.174 | 66.8% | 16.2% | +0.782 |
| 3500 | 3 | 0.121 | 72.4% | 10.8% | +0.943 |
| 5800 | 3 | 0.078 | 77.9% | 6.9% | +1.176 |

The reward increase from -0.053 to +1.176 demonstrates successful calibration training. The overconfidence-rate drop from 47.4% to 6.9% (-85%) shows the model learned to be humble when uncertain.
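
Phase advancement is ECE-triggered (see `training/curriculum.py`). A hypothetical minimal version of that logic, with illustrative thresholds that are not the repo's actual values:

```python
from collections import deque

class PhaseManager:
    """Hypothetical sketch of ECE-triggered curriculum advancement; the real
    logic lives in training/curriculum.py and its thresholds may differ."""

    THRESHOLDS = {1: 0.25, 2: 0.15}  # illustrative values only

    def __init__(self, window: int = 200):
        self.phase = 1
        self.recent = deque(maxlen=window)

    def update(self, ece: float) -> int:
        # Advance once the rolling ECE over the last `window` episodes
        # falls below the current phase's threshold.
        self.recent.append(ece)
        rolling = sum(self.recent) / len(self.recent)
        threshold = self.THRESHOLDS.get(self.phase)
        if threshold is not None and len(self.recent) == self.recent.maxlen and rolling < threshold:
            self.phase += 1
            self.recent.clear()
        return self.phase
```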


## 🧠 Why GRPO, Not Just Prompting?

You cannot prompt-engineer calibration. We tested:

  • "Be honest about uncertainty" โ†’ model says 90% on everything
  • "Give a confidence score" โ†’ arbitrary uncalibrated numbers
  • Few-shot calibrated examples โ†’ surface mimicry, no generalization

The fundamental problem: Without a reward signal, the model has no reason to update its probability estimates. There is no gradient flowing from "I said 90% but was right only 55% of the time."

Why GRPO works: Group Relative Policy Optimization creates exactly the right signal. The reward function computes the Brier score, a strictly proper scoring rule whose expected value is optimized only when the stated probability equals the true probability. The model's weights change to produce genuine internal uncertainty representations.
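
The "strictly proper" claim is easy to verify numerically: if the model is actually right 55% of the time, its expected Brier reward peaks at a stated confidence of exactly 55%, so systematic overstatement leaves reward on the table. A small standalone check:

```python
# Expected Brier reward E[1 - 2*(p - outcome)^2] for a model whose true
# accuracy is 0.55: it is maximized exactly at p = 0.55, which is what makes
# the Brier score a strictly proper scoring rule.
TRUE_ACC = 0.55

def expected_brier_reward(p: float) -> float:
    return TRUE_ACC * (1 - 2 * (p - 1) ** 2) + (1 - TRUE_ACC) * (1 - 2 * p ** 2)

for p in (0.55, 0.70, 0.90):
    print(p, round(expected_brier_reward(p), 3))
# -> 0.55 0.505, 0.7 0.46, 0.9 0.26
```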


๐Ÿ—๏ธ Architecture

```
  7-Domain Task Bank
  ┌──────────────────────────────────────────────────────────────┐
  │  Math (GSM8K) | Logic (ARC) | Factual (TriviaQA)             │
  │  Science (SciQ) | Medical (MedMCQA) | Coding | Creative      │
  └────────────────────┬─────────────────────────────────────────┘
                       │ get_batch(phase)
  ┌────────────────────▼─────────────────────────────────────────┐
  │         EchoOpenEnv (openenv.core.Environment)               │
  │  extends Environment[EchoAction, EchoObservation, EchoState] │
  │  + EchoEnv (gymnasium.Env) for full gym compatibility        │
  │                                                              │
  │  reset() → EchoObservation                                   │
  │  step(EchoAction) → EchoObservation                          │
  │  state → EchoState  (property)                               │
  │    ├─ accuracy_reward     (domain-aware, fuzzy matching)     │
  │    ├─ brier_reward        (BS = (p-o)², reward = 1-2*BS)     │
  │    ├─ overconfidence_pen  (-0.60 at ≥80%, -0.80 at ≥95%)     │
  │    └─ underconfidence_pen (-0.10 if correct but ≤20%)        │
  └────────────────────┬─────────────────────────────────────────┘
                       │ create_fastapi_app(EchoOpenEnv, ...)
  ┌────────────────────▼─────────────────────────────────────────┐
  │         OpenEnv HTTP Server (create_fastapi_app)             │
  │         /reset  /step  /state  /health  /schema  /ws         │
  └────────────────────┬─────────────────────────────────────────┘
                       │ reward signal
  ┌────────────────────▼─────────────────────────────────────────┐
  │       GRPOTrainer (HuggingFace TRL ≥0.9.0)                   │
  │       Model: Qwen/Qwen2.5-7B-Instruct                        │
  │       3-phase curriculum | KL penalty | 4 generations/step   │
  └────────────────────┬─────────────────────────────────────────┘
                       │ calibrated model
  ┌────────────────────▼─────────────────────────────────────────┐
  │       5 Calibration Metrics                                  │
  │       ECE | MCE | Brier Score | Sharpness | Resolution       │
  └──────────────────────────────────────────────────────────────┘
```
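
The serving layer itself is thin. A minimal sketch of the wiring, using only the names this README cites (the real assembly, including the extra endpoints, is `server/app.py`):

```python
# Minimal sketch of the serving layer; the import path and call shape are
# taken from this README (create_fastapi_app from openenv.core). See
# server/app.py for the real wiring and the extra endpoints.
import uvicorn
from openenv.core import create_fastapi_app

from env.openenv_env import EchoOpenEnv

app = create_fastapi_app(EchoOpenEnv)  # serves /reset /step /state /health /schema /ws

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=7860)
```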

## 🔬 5 Calibration Metrics

| Metric | Formula | Interpretation |
| --- | --- | --- |
| ECE | Σₘ (\|Bₘ\|/n) × \|acc(Bₘ) - conf(Bₘ)\| | Primary metric. Lower = better. Perfect = 0.0 |
| MCE | maxₘ \|acc(Bₘ) - conf(Bₘ)\| | Worst-case calibration error across all bins |
| Brier Score | (1/n) Σ (pᵢ - oᵢ)² | Squared probability error. 0 = perfect, 0.25 = random guessing |
| Sharpness | (1/n) Σ (pᵢ - mean(p))² | Variance of predictions. High = decisive |
| Resolution | (1/n) Σ \|Bₘ\| × (acc(Bₘ) - overall_acc)² | How much per-bin accuracy improves on the base rate |
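
For reference, ECE over equal-width bins is only a few lines of NumPy. A standalone sketch following the table's formula (not the repo's `core/metrics.py`):

```python
import numpy as np

def ece(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Expected Calibration Error with equal-width bins, per the formula above.
    Standalone sketch; the project's implementation is core/metrics.py."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        total += in_bin.mean() * gap  # (|B_m|/n) * |acc(B_m) - conf(B_m)|
    return total

# A well-calibrated 70%-confidence predictor should score near 0:
conf = np.full(1000, 0.7)
hits = (np.random.default_rng(0).random(1000) < 0.7).astype(float)
print(round(ece(conf, hits), 3))
```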

## 🚀 Quick Start

```bash
# Clone and install
git clone <repo>
cd echo-ultimate
pip install -r requirements.txt

# Verify everything works (no GPU, ~5 seconds)
python run.py test

# Generate all 6 publication plots (synthetic data, instant)
python run.py plots

# Download real datasets from HuggingFace (~5 minutes)
python run.py download

# Evaluate 4 baselines + generate real comparison plots
python run.py baseline

# Launch interactive demo
python run.py demo        # http://localhost:7860

# Launch API server
python run.py server      # http://localhost:7860/docs

# Full GRPO training (GPU required, ~2-4 hours)
python run.py train
```

## 🔌 OpenEnv API

ECHO uses `create_fastapi_app` from `openenv.core`, the standard OpenEnv protocol:

| Endpoint | Method | Description |
| --- | --- | --- |
| /reset | POST | Start episode → EchoObservation |
| /step | POST | Submit EchoAction → EchoObservation |
| /state | GET | Current EchoState |
| /health | GET | Status + version |
| /schema | GET | JSON schemas for action + observation |
| /ws | WS | Persistent WebSocket session |
| /tasks | GET | All 3 task definitions |
| /metrics | GET | Full CalibrationReport (5 metrics) |
| /metrics/{domain} | GET | Domain-specific calibration |
| /fingerprint | GET | Domain calibration radar data |
| /history | GET | Last 100 episode logs |
| /docs | GET | Swagger UI |

Quick test:

```bash
# Start server
python run.py server &

curl http://localhost:7860/health
# → {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0"}

curl -X POST http://localhost:7860/reset
# → EchoObservation with question, domain, difficulty, ece

curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"response":"<confidence>72</confidence><answer>Paris</answer>"}'
# → EchoObservation with reward=0.814, done=true, is_correct=true
```

Python client:

```python
from client import EchoClient
from models import EchoAction

client = EchoClient(base_url="http://localhost:7860")
obs = client.reset()
obs = client.step(EchoAction(response="<confidence>72</confidence><answer>Paris</answer>"))
print(obs.reward, obs.is_correct, obs.ece)
```
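
And a toy evaluation loop on top of it, assuming the observation fields shown in the curl example above; the stub agent (always "Paris" at 72%) is purely illustrative and stands in for a real model:

```python
from client import EchoClient
from models import EchoAction

def stub_agent(question: str) -> str:
    # Placeholder policy, illustration only; a real agent would query a model.
    return "<confidence>72</confidence><answer>Paris</answer>"

client = EchoClient(base_url="http://localhost:7860")
rewards = []
for _ in range(10):
    obs = client.reset()
    obs = client.step(EchoAction(response=stub_agent(obs.question)))
    rewards.append(obs.reward)

print(f"mean reward over {len(rewards)} episodes: {sum(rewards) / len(rewards):.3f}")
```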

๐Ÿ“ Project Structure

```
echo-ultimate/
├── config.py                    All hyperparameters (single source of truth)
├── run.py                       CLI: test | baseline | plots | train | eval | demo | server
├── openenv.yaml                 OpenEnv manifest
├── models.py                    EchoAction / EchoObservation / EchoState (openenv Pydantic types)
├── client.py                    EchoClient (HTTPEnvClient subclass)
├── ECHO_Training.ipynb          Colab GRPO training notebook
├── Dockerfile                   HF Spaces deployment
├── requirements.txt
│
├── env/
│   ├── openenv_env.py           EchoOpenEnv: extends Environment + gymnasium.Env
│   ├── echo_env.py              Core gymnasium.Env (7 domains, 3 phases)
│   ├── task_bank.py             7-domain task loading + curriculum sampling
│   ├── reward.py                All reward components + RewardHistory
│   ├── parser.py                Robust <confidence><answer> parser (15+ edge cases)
│   └── self_consistency.py      Multi-sample confidence adjustment
│
├── core/
│   ├── tasks.py                 3 OpenEnv task definitions + TaskRunner
│   ├── metrics.py               ECE, MCE, Brier, Sharpness, Resolution
│   ├── graders.py               Domain-specific answer graders
│   ├── baseline.py              4 baseline agents + evaluation runner
│   └── epistemic_fingerprint.py Radar chart + heatmap generation
│
├── training/
│   ├── train.py                 GRPO training with 3-phase curriculum
│   ├── curriculum.py            Phase manager (ECE-triggered advancement)
│   ├── dataset.py               GRPO dataset builder with chat template support
│   └── evaluate.py              Full eval suite + all 6 plot generators
│
├── server/app.py                OpenEnv server (create_fastapi_app + extra endpoints)
├── ui/app.py                    Gradio 5-tab demo
├── results/
│   ├── training_log.csv         Real training data: 5,800 steps, 3 phases
│   └── plots/                   6 publication plots (training_curves, reliability, domain…)
└── scripts/
    ├── download_tasks.py        Download 7 HuggingFace datasets
    ├── run_baseline.py          Evaluate baselines + generate plots
    └── generate_plots.py        Generate all 6 plots (synthetic, instant)
```

๐Ÿ› ๏ธ Tech Stack

| Component | Technology |
| --- | --- |
| RL Training | HuggingFace TRL ≥0.9.0 (GRPOTrainer) |
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Environment | openenv.core.Environment + gymnasium ≥1.0.0 |
| Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
| Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
| API Server | FastAPI + create_fastapi_app (OpenEnv) + uvicorn |
| Demo UI | Gradio 4 |
| Plots | matplotlib (dark theme, dpi=150) |

## 📖 Citation

```bibtex
@misc{echo-ultimate-2025,
  title  = {ECHO ULTIMATE: Training LLMs to Know What They Don't Know},
  author = {Tripathi, Revtiraman and Pandey, Vikas Dev},
  year   = {2025},
  url    = {https://huggingface.co/spaces/revti126/echo-ultimate},
  note   = {OpenEnv Hackathon Submission}
}
```

Built for the OpenEnv Hackathon, 2025. MIT License.