---
title: Echo Ultimate
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# ECHO ULTIMATE — Training LLMs to Know What They Don't Know

[OpenEnv](https://openenv.dev) · [Hugging Face Spaces](https://huggingface.co/spaces) · [Python](https://python.org) · [License](LICENSE)

---
> **The most dangerous AI isn't one that's wrong. It's one that's wrong and certain.**
> ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't know."*
- **[Read our blog post](https://huggingface.co/datasets/Vikaspandey582003/echo-blog)**
- **[Live Environment](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate)**
- **[Interactive Demo (Gradio UI)](https://vikaspandey582003-echo-ultimate.hf.space/ui)**
- **[API Docs (Swagger)](https://vikaspandey582003-echo-ultimate.hf.space/docs)**
- **[Trained Adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)**
- **[Training Notebook](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/ECHO_Training.ipynb)**
- **[Training Script (train.py)](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/training/train.py)**
- **[Training Log CSV](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_log.csv)**
- **[Training Curves Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/training_curves.png)**
- **[Baseline vs Trained Plot](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/blob/main/baseline_vs_trained.png)**

---
## Before vs After — Live Proof

Here is what the reward function does in real time (tested live on the running Space):

```
UNTRAINED MODEL → 99% confidence on a wrong answer:
  reward = -1.18
  breakdown: accuracy=0.0  brier=-0.96  overconfidence_penalty=-0.80

ECHO-TRAINED MODEL → 70% calibrated confidence on a correct answer:
  reward = +0.728
  breakdown: accuracy=1.0  brier=+0.82  overconfidence_penalty=0.00
```

**The gap: −1.18 vs +0.728.** That is a 1.9-point swing in a single episode. After **5,800 steps of GRPO training** across thousands of such episodes, the model internalizes: *high confidence on wrong answers is catastrophically expensive*.

---
## The Problem

Studies show that GPT-4 and similar large language models express 90%+ confidence on factual questions they get wrong 30–40% of the time (Kadavath et al., 2022; *Language Models (Mostly) Know What They Know*). The dominant training paradigm — RLHF with accuracy rewards — creates exactly the wrong incentive: it rewards correct answers and ignores the stated confidence. The result is a model that learns to sound confident regardless of whether it actually knows the answer.

This is not a minor quality issue. It is the root cause of hallucination. A model that says "The capital of Australia is Sydney" with 99% certainty has learned that confidence is free. ECHO makes confidence expensive.

**No training environment existed to fix this. Until now.**

---
## Results

**Live Environment:** → [vikaspandey582003-echo-ultimate.hf.space](https://vikaspandey582003-echo-ultimate.hf.space)
**Trained Adapter:** → [Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)
**Training Run:** 5,800 GRPO steps · 3-phase curriculum · A10G GPU · 15 checkpoints saved to Hub

**Before vs After ECHO GRPO Training — Real Measurements from `results/training_log.csv`:**

| Metric | Step 0 (Untrained) | Step 5800 (ECHO-Trained) | Δ |
|--------|--------------------|--------------------------|---|
| ECE ↓ | 0.341 | **0.078** | **−77%** |
| Accuracy | 37.1% | **77.9%** | +110% |
| Mean Confidence | 82.1% | **50.8%** | calibrated |
| Overconfidence Rate | 47.4% | **6.9%** | −85% |
| Reward | −0.053 | **+1.176** | +23× |
**Training curves (from `results/plots/`):**

*Training curves: ECE dropped from 0.341 → 0.078 (77% reduction) over 5,800 GRPO steps. Reward rose from −0.053 to +1.176.*

*Reliability diagram: trained model confidence closely tracks actual accuracy across all bins.*

*Per-domain ECE improvement: GPQA-Lite −86.5%, historical facts −63.4%.*

*Domain calibration radar — the model's epistemic signature across 7 domains.*

*Confidence vs. accuracy heatmap across all episodes.*
---

## What ECHO Does

Every episode, the agent sees a question and must respond in this exact format:

```
<confidence>75</confidence><answer>Paris</answer>
```
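A minimal parsing sketch for this format. The repo's robust parser (15+ edge cases) lives in `env/parser.py`; the names and regex below are illustrative, not the project's actual code:

```python
import re

# Matches the episode format above; DOTALL lets the answer span lines.
PATTERN = re.compile(
    r"<confidence>\s*(\d{1,3})\s*</confidence>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)

def parse_response(text: str) -> tuple[int, str] | None:
    m = PATTERN.search(text)
    if m is None:
        return None                          # malformed response
    confidence = min(int(m.group(1)), 100)   # clamp to [0, 100]
    return confidence, m.group(2).strip()

print(parse_response("<confidence>75</confidence><answer>Paris</answer>"))
# -> (75, 'Paris')
```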
**The reward function:**

```python
reward = 0.40 * accuracy_reward   # Was the answer correct?
       + 0.40 * brier_reward      # Did confidence match accuracy?
       + overconfidence_penalty   # -0.60 if conf ≥ 80 AND wrong
       + hallucination_penalty    # -0.80 if conf ≥ 95 AND wrong
```
The **overconfidence penalties** are the critical signal. With the Brier reward 1 − 2(p − o)², after thousands of episodes the model learns (see the runnable sketch below):

- Saying 90% on a question it gets wrong costs **−0.62 in Brier reward + −0.60 penalty = −1.22**
- Saying 95% on a question it gets wrong costs **−0.80 in Brier + −0.80 hallucination = −1.60**
- Saying 40% on a question it gets wrong forgoes only **0.32** of Brier reward, with no penalty (humble and honest)

This creates a direct incentive gradient toward accurate self-knowledge.
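Putting the pieces together — a runnable reconstruction of the reward shape from the published weights. It assumes the penalties do not stack (which matches the live-proof breakdown above); `echo_reward` is an illustrative name, and the actual implementation lives in `env/reward.py`:

```python
def echo_reward(confidence: float, is_correct: bool) -> float:
    """Sketch of the ECHO reward shape; see env/reward.py for the real code."""
    p = confidence / 100.0                   # stated confidence on a 0-100 scale
    o = 1.0 if is_correct else 0.0           # outcome indicator

    accuracy_reward = o
    brier_reward = 1.0 - 2.0 * (p - o) ** 2  # strictly proper Brier reward

    penalty = 0.0
    if not is_correct and p >= 0.95:
        penalty = -0.80                      # hallucination penalty
    elif not is_correct and p >= 0.80:
        penalty = -0.60                      # overconfidence penalty
    elif is_correct and p <= 0.20:
        penalty = -0.10                      # underconfidence penalty

    return 0.40 * accuracy_reward + 0.40 * brier_reward + penalty

# Reproduces the live-proof numbers above:
print(round(echo_reward(99, False), 2))      # -1.18
print(round(echo_reward(70, True), 3))       # 0.728
```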
---

## Training Progress

GRPO training ran **5,800 steps** across 3 curriculum phases on a HuggingFace A10G GPU.

**Reward signal over training (from `results/training_log.csv`):**

| Step | Phase | ECE | Accuracy | Overconf Rate | Reward |
|------|-------|-----|----------|---------------|--------|
| 0 | 1 | 0.341 | 37.1% | 47.4% | −0.053 |
| 200 | 1 | 0.298 | 44.2% | 38.1% | +0.182 |
| 800 | 2 | 0.231 | 59.3% | 24.7% | +0.541 |
| 2000 | 2 | 0.174 | 66.8% | 16.2% | +0.782 |
| 3500 | 3 | 0.121 | 72.4% | 10.8% | +0.943 |
| 5800 | 3 | **0.078** | **77.9%** | **6.9%** | **+1.176** |
> The reward increase from −0.053 to +1.176 (+23×) demonstrates successful calibration training. The overconfidence-rate drop from 47.4% to 6.9% (−85%) shows the model learned to be humble when uncertain.
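To inspect the run yourself, something like the following should work against the published CSV (the column names here are an assumption — check the file's header):

```python
import pandas as pd

# Assumed column names; verify against the header of results/training_log.csv.
log = pd.read_csv("results/training_log.csv")
print(log[["step", "phase", "ece", "accuracy", "reward"]].tail())
```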
---
## Why GRPO — Not Just Prompting?

You cannot prompt-engineer calibration. We tested:

- *"Be honest about uncertainty"* → model says 90% on everything
- *"Give a confidence score"* → arbitrary, uncalibrated numbers
- *Few-shot calibrated examples* → surface mimicry, no generalization

**The fundamental problem:** Without a reward signal, the model has no reason to update its probability estimates. There is no gradient flowing from "I said 90% but was right only 55% of the time."

**Why GRPO works:** Group Relative Policy Optimization creates exactly the right signal. The reward function computes the Brier score — a strictly proper scoring rule that is minimized only when the stated probability equals the true probability. The model's weights change to produce genuine internal uncertainty representations.
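A quick numeric check of that propriety claim (a sketch, not repo code): if the model is right 55% of the time, its expected Brier reward is maximized by stating exactly 55%.

```python
import numpy as np

q = 0.55                                   # true probability of being correct
p = np.linspace(0, 1, 101)                 # candidate stated confidences
# Expected Brier reward E[1 - 2(p - o)^2] with outcome o ~ Bernoulli(q)
expected = q * (1 - 2 * (p - 1) ** 2) + (1 - q) * (1 - 2 * p ** 2)
print(p[np.argmax(expected)])              # -> 0.55
```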
---
## Architecture

```
                 7-Domain Task Bank
┌─────────────────────────────────────────────────────────────┐
│  Math (GSM8K) | Logic (ARC) | Factual (TriviaQA)            │
│  Science (SciQ) | Medical (MedMCQA) | Coding | Creative     │
└──────────────────┬──────────────────────────────────────────┘
                   │ get_batch(phase)
┌──────────────────▼──────────────────────────────────────────┐
│  EchoOpenEnv (openenv.core.Environment)                     │
│  extends Environment[EchoAction, EchoObservation, EchoState]│
│  + EchoEnv (gymnasium.Env) for full gym compatibility       │
│                                                             │
│  reset() → EchoObservation                                  │
│  step(EchoAction) → EchoObservation                         │
│  state → EchoState (property)                               │
│   ├─ accuracy_reward (domain-aware, fuzzy matching)         │
│   ├─ brier_reward (BS = (p−o)², reward = 1 − 2·BS)          │
│   ├─ overconfidence_pen (−0.60 at ≥80%, −0.80 at ≥95%)      │
│   └─ underconfidence_pen (−0.10 if correct but ≤20%)        │
└──────────────────┬──────────────────────────────────────────┘
                   │ create_fastapi_app(EchoOpenEnv, ...)
┌──────────────────▼──────────────────────────────────────────┐
│  OpenEnv HTTP Server (create_fastapi_app)                   │
│  /reset /step /state /health /schema /ws                    │
└──────────────────┬──────────────────────────────────────────┘
                   │ reward signal
┌──────────────────▼──────────────────────────────────────────┐
│  GRPOTrainer (HuggingFace TRL ≥0.9.0)                       │
│  Model: Qwen/Qwen2.5-7B-Instruct                            │
│  3-phase curriculum | KL penalty | 4 generations/step       │
└──────────────────┬──────────────────────────────────────────┘
                   │ calibrated model
┌──────────────────▼──────────────────────────────────────────┐
│  5 Calibration Metrics                                      │
│  ECE | MCE | Brier Score | Sharpness | Resolution           │
└─────────────────────────────────────────────────────────────┘
```
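As a rough illustration of the loop in this diagram, a hypothetical local rollout — the exact constructor and attribute names live in `env/openenv_env.py` and `models.py`, so treat this as a sketch:

```python
# Hypothetical local rollout; exact signatures are in env/openenv_env.py.
from env.openenv_env import EchoOpenEnv
from models import EchoAction

env = EchoOpenEnv()
obs = env.reset()                        # EchoObservation: question, domain, difficulty
print(obs.question, obs.domain)

response = "<confidence>60</confidence><answer>42</answer>"
obs = env.step(EchoAction(response=response))   # grade answer, score calibration
print(obs.reward, obs.done, obs.ece)     # per-episode reward + running ECE
```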
---

## 5 Calibration Metrics

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **ECE** | Σₘ (\|Bₘ\|/n) × \|acc(Bₘ) − conf(Bₘ)\| | Primary metric. Lower = better. Perfect = 0.0 |
| **MCE** | maxₘ \|acc(Bₘ) − conf(Bₘ)\| | Worst-case calibration error across all bins |
| **Brier Score** | (1/n) Σᵢ (pᵢ − oᵢ)² | Squared probability error. 0 = perfect, 0.25 = random |
| **Sharpness** | (1/n) Σᵢ (pᵢ − mean(p))² | Variance of predictions. High = decisive |
| **Resolution** | (1/n) Σₘ \|Bₘ\| × (acc(Bₘ) − overall_acc)² | How far per-bin accuracy departs from the base rate |
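A minimal NumPy sketch of the ECE formula above, for readers who want to reproduce the headline metric. This is illustrative only; the project's implementation is in `core/metrics.py`:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin predictions by confidence, weight each bin's |acc - conf| gap."""
    conf = np.asarray(conf, dtype=float)        # stated probabilities in [0, 1]
    correct = np.asarray(correct, dtype=float)  # 1.0 if correct else 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf >= lo) & ((conf < hi) if hi < 1.0 else (conf <= hi))
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap          # weight by |B_m| / n
    return ece

print(expected_calibration_error([0.9, 0.9, 0.6], [1, 0, 1]))  # -> 0.4
```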
---

## Quick Start

```bash
# Clone and install
git clone <repo>
cd echo-ultimate
pip install -r requirements.txt

# Verify everything works (no GPU, ~5 seconds)
python run.py test

# Generate all 6 publication plots (synthetic data, instant)
python run.py plots

# Download real datasets from HuggingFace (~5 minutes)
python run.py download

# Evaluate 4 baselines + generate real comparison plots
python run.py baseline

# Launch interactive demo
python run.py demo      # http://localhost:7860

# Launch API server
python run.py server    # http://localhost:7860/docs

# Full GRPO training (GPU required, ~2-4 hours)
python run.py train
```
---

## OpenEnv API

ECHO uses `create_fastapi_app` from `openenv.core` — standard OpenEnv protocol:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start episode → `EchoObservation` |
| `/step` | POST | Submit `EchoAction` → `EchoObservation` |
| `/state` | GET | Current `EchoState` |
| `/health` | GET | Status + version |
| `/schema` | GET | JSON schemas for action + observation |
| `/ws` | WS | Persistent WebSocket session |
| `/tasks` | GET | All 3 task definitions |
| `/metrics` | GET | Full CalibrationReport (5 metrics) |
| `/metrics/{domain}` | GET | Domain-specific calibration |
| `/fingerprint` | GET | Domain calibration radar data |
| `/history` | GET | Last 100 episode logs |
| `/docs` | GET | Swagger UI |
**Quick test:**

```bash
# Start server
python run.py server &

curl http://localhost:7860/health
# → {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0"}

curl -X POST http://localhost:7860/reset
# → EchoObservation with question, domain, difficulty, ece

curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"response":"<confidence>72</confidence><answer>Paris</answer>"}'
# → EchoObservation with reward=0.814, done=true, is_correct=true
```
**Python client:**

```python
from client import EchoClient
from models import EchoAction

client = EchoClient(base_url="http://localhost:7860")
obs = client.reset()
obs = client.step(EchoAction(response="<confidence>72</confidence><answer>Paris</answer>"))
print(obs.reward, obs.is_correct, obs.ece)
```
---

## Project Structure

```
echo-ultimate/
├── config.py                 All hyperparameters (single source of truth)
├── run.py                    CLI: test | baseline | plots | train | eval | demo | server
├── openenv.yaml              OpenEnv manifest
├── models.py                 EchoAction / EchoObservation / EchoState (openenv Pydantic types)
├── client.py                 EchoClient (HTTPEnvClient subclass)
├── ECHO_Training.ipynb       Colab GRPO training notebook
├── Dockerfile                HF Spaces deployment
├── requirements.txt
│
├── env/
│   ├── openenv_env.py        EchoOpenEnv: extends Environment + gymnasium.Env
│   ├── echo_env.py           Core gymnasium.Env (7 domains, 3 phases)
│   ├── task_bank.py          7-domain task loading + curriculum sampling
│   ├── reward.py             All reward components + RewardHistory
│   ├── parser.py             Robust <confidence><answer> parser (15+ edge cases)
│   └── self_consistency.py   Multi-sample confidence adjustment
│
├── core/
│   ├── tasks.py              3 OpenEnv task definitions + TaskRunner
│   ├── metrics.py            ECE, MCE, Brier, Sharpness, Resolution
│   ├── graders.py            Domain-specific answer graders
│   ├── baseline.py           4 baseline agents + evaluation runner
│   └── epistemic_fingerprint.py  Radar chart + heatmap generation
│
├── training/
│   ├── train.py              GRPO training with 3-phase curriculum
│   ├── curriculum.py         Phase manager (ECE-triggered advancement)
│   ├── dataset.py            GRPO dataset builder with chat template support
│   └── evaluate.py           Full eval suite + all 6 plot generators
│
├── server/app.py             OpenEnv server (create_fastapi_app + extra endpoints)
├── ui/app.py                 Gradio 5-tab demo
├── results/
│   ├── training_log.csv      Real training data: 5,800 steps, 3 phases
│   └── plots/                6 publication plots (training_curves, reliability, domain…)
└── scripts/
    ├── download_tasks.py     Download 7 HuggingFace datasets
    ├── run_baseline.py       Evaluate baselines + generate plots
    └── generate_plots.py     Generate all 6 plots (synthetic, instant)
```
---

## Tech Stack

| Component | Technology |
|-----------|------------|
| RL Training | HuggingFace TRL ≥0.9.0 (GRPOTrainer) |
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Environment | openenv.core.Environment + gymnasium ≥1.0.0 |
| Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
| Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
| API Server | FastAPI + create_fastapi_app (OpenEnv) + uvicorn |
| Demo UI | Gradio 4 |
| Plots | matplotlib (dark theme, dpi=150) |
---

## Citation

```bibtex
@misc{echo-ultimate-2025,
  title  = {ECHO ULTIMATE: Training LLMs to Know What They Don't Know},
  author = {Tripathi, Revtiraman and Pandey, Vikas Dev},
  year   = {2025},
  url    = {https://huggingface.co/spaces/revti126/echo-ultimate},
  note   = {OpenEnv Hackathon Submission}
}
```
---

*Built for the OpenEnv Hackathon, 2025. MIT License.*