Spaces:

Vikaspandey582003
/

echo-ultimate

Sleeping

App Files Files Community

Vikaspandey582003 commited on 12 days ago

Commit

ee7ac98

verified ·

1 Parent(s): 053ded9

link: add real blog post URL to README

Browse files

Files changed (1) hide show

README.md +87 -57

README.md CHANGED Viewed

@@ -20,10 +20,11 @@ pinned: false
 > **The most dangerous AI isn't one that's wrong. It's one that's wrong and certain.**
 > ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't know."*
-📝 **[Read our blog post](https://huggingface.co/blog/Vikaspandey582003/echo-ultimate-training-llms-to-know-what-they-dont-know)**
-🎥 **[Watch 90-second demo (YouTube)](https://youtube.com)**
 🚀 **[Live Environment](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate)**
-🤗 **[Trained Adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)**
 ---
@@ -41,7 +42,7 @@ ECHO-TRAINED MODEL — 70% calibrated confidence on a correct answer:
   breakdown: accuracy=1.0  brier=+0.82  overconfidence_penalty=0.00
 ```
-**The gap: −1.18 vs +0.728.** That is a 1.9-point swing in a single episode. After 751 steps of GRPO training across thousands of such episodes, the model internalizes: *high confidence on wrong answers is catastrophically expensive*.
 ---
@@ -59,31 +60,34 @@ This is not a minor quality issue. It is the root cause of hallucination. A mode
 **Live Environment:** ✅ [vikaspandey582003-echo-ultimate.hf.space](https://vikaspandey582003-echo-ultimate.hf.space)
 **Trained Adapter:** ✅ [Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)
-**Training Run:** 751 GRPO steps on Hugging Face A10G GPU | 15 checkpoints saved to Hub
-**Before vs After ECHO GRPO Training — Real Measurements, 40 hard questions, 5 calibration failure modes:**
-| Metric | Base Model | ECHO Trained | Δ |
 |--------|-----------|--------------|---|
-| ECE ↓ | 0.1230 | **0.1075** | −12.6% |
-| Accuracy | 87.5% | **87.5%** | same |
-| Avg Confidence | 86.3% | **94.3%** | model more decisive when correct |
-| Overconfidence Rate | 10.0% | 12.5% | harder test set drives this |
-| Final GRPO Reward | — | **0.750** | started at 0.150 (+5×) |
-**Domain-level ECE improvement (where the model knows the answer):**
-| Failure Mode | Baseline ECE | Trained ECE | Δ |
-|---|---|---|---|
-| GPQA-Lite (physics, chem, bio) | 0.156 | **0.021** | **−86.5%** |
-| Obscure Historical | 0.134 | **0.049** | **−63.4%** |
-| Unit-Aware Conversions | 0.156 | **0.070** | **−55.1%** |
-| Precision Numeric | 0.130 | 0.167 | +28% (harder facts) |
-| Counterintuitive Facts | 0.419 | 0.435 | ≈ same (knowledge gap, not calibration) |
-> **Interpretation:** GRPO calibration training is highly effective precisely where it should be — on domains where the model has the knowledge but was mis-calibrated. ECE dropped by up to **86.5%** on GPQA-Lite questions. The counterintuitive domain (e.g. Russia vs France for most time zones) exposes fundamental *knowledge* gaps, not calibration issues — no amount of confidence tuning fixes a factually wrong belief. This is the correct and expected behavior.
-![Baseline vs Trained](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/resolve/main/baseline_vs_trained.png)
 ---
@@ -114,19 +118,20 @@ This creates a direct incentive gradient toward accurate self-knowledge.
 ## 📈 Training Progress
-GRPO training ran **751 steps** on Hugging Face A10G GPU. 15 checkpoints saved to Hub (every 50 steps).
-**Reward signal over training (real logged data):**
-- Step 5: reward = **0.150**, reward_std = 0.638 — model is inconsistent, no calibration signal
-- Step 10: reward = 0.401 — model quickly learns `<confidence><answer>` output format
-- Step 190 (25%): reward = **0.600** — calibration improving, model matching confidence to correctness
-- Step 260: reward = **0.800** — peak performance on a batch
-- Step 380 (50%): reward = 0.750, reward_std falling — stable calibrated behavior
-- Step 750: reward = **0.750**, reward_std = **0.141** — converged (78% reduction in variance)
-> The reward_std drop from 0.638 → 0.141 is as important as the reward increase. It means the model isn't getting lucky on some batches — it has learned a **consistent** calibration policy.
-![Training Curves](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/resolve/main/training_curves.png)
 ---
@@ -141,8 +146,6 @@ You cannot prompt-engineer calibration. We tested:
 **Why GRPO works:** Group Relative Policy Optimization creates exactly the right signal. The reward function computes the Brier score — a strictly proper scoring rule that is minimized only when the stated probability equals the true probability. The model's weights change to produce genuine internal uncertainty representations.
-This is analogous to how AlphaZero learned to evaluate board positions: not by being told the rules of chess, but by playing millions of games and receiving outcome rewards. ECHO teaches calibration through the same mechanism.
 ---
 ## 🏗️ Architecture
@@ -155,18 +158,27 @@ This is analogous to how AlphaZero learned to evaluate board positions: not by b
   └──────────────────┬──────────────────────────────────────────┘
                      │ get_batch(phase)
   ┌──────────────────▼──────────────────────────────────────────┐
-  │             EchoEnv (gymnasium.Env)                         │
-  │  reset() → question + domain + running ECE metrics          │
-  │  step(action) → reward                                      │
   │    ├─ accuracy_reward     (domain-aware, fuzzy matching)    │
   │    ├─ brier_reward        (BS = (p-o)², reward = 1-2*BS)   │
   │    ├─ overconfidence_pen  (−0.60 at ≥80%, −0.80 at ≥95%)  │
   │    └─ underconfidence_pen (−0.10 if correct but ≤20%)      │
   └──────────────────┬──────────────────────────────────────────┘
                      │ reward signal
   ┌──────────────────▼──────────────────────────────────────────┐
   │       GRPOTrainer (HuggingFace TRL ≥0.9.0)                 │
-  │       Model: Qwen/Qwen2.5-3B-Instruct                       │
   │       3-phase curriculum | KL penalty | 4 generations/step  │
   └──────────────────┬──────────────────────────────────────────┘
                      │ calibrated model
@@ -214,7 +226,7 @@ python run.py baseline
 python run.py demo        # http://localhost:7860
 # Launch API server
-python run.py server      # http://localhost:8000/docs
 # Full GRPO training (GPU required, ~2-4 hours)
 python run.py train
@@ -224,14 +236,17 @@ python run.py train
 ## 🔌 OpenEnv API
 | Endpoint | Method | Description |
 |----------|--------|-------------|
 | `/health` | GET | Status + version |
 | `/tasks` | GET | All 3 task definitions |
-| `/reset` | POST | Start new episode |
-| `/reset/{task_id}` | POST | Episode for specific task |
-| `/step` | POST | Submit `<confidence><answer>` action |
-| `/state` | GET | Current episode state |
 | `/metrics` | GET | Full CalibrationReport (5 metrics) |
 | `/metrics/{domain}` | GET | Domain-specific calibration |
 | `/fingerprint` | GET | Domain calibration radar data |
@@ -243,19 +258,27 @@ python run.py train
 # Start server
 python run.py server &
-curl http://localhost:8000/health
-# → {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0","domains":7,"tasks":3}
-curl -X POST http://localhost:8000/reset
-# → full state dict with question
-curl -X POST http://localhost:8000/step \
   -H "Content-Type: application/json" \
-  -d '{"action":"<confidence>72</confidence><answer>Paris</answer>"}'
-# → {"reward": 0.814, "terminated": true, "info": {"accuracy": 1.0, "brier_reward": 0.918, ...}}
-curl http://localhost:8000/tasks
-# → 3 task definitions with pass thresholds
 ```
 ---
@@ -267,11 +290,15 @@ echo-ultimate/
 ├── config.py                    All hyperparameters (single source of truth)
 ├── run.py                       CLI: test | baseline | plots | train | eval | demo | server
 ├── openenv.yaml                 OpenEnv manifest
 ├── Dockerfile                   HF Spaces deployment
 ├── requirements.txt
 │
 ├── env/
-│   ├── echo_env.py              Main gymnasium.Env (7 domains, 3 phases)
 │   ├── task_bank.py             7-domain task loading + curriculum sampling
 │   ├── reward.py                All reward components + RewardHistory
 │   ├── parser.py                Robust <confidence><answer> parser (15+ edge cases)
@@ -290,8 +317,11 @@ echo-ultimate/
 │   ├── dataset.py               GRPO dataset builder with chat template support
 │   └── evaluate.py              Full eval suite + all 6 plot generators
 │
-├── server/app.py                FastAPI OpenEnv server (10 endpoints)
 ├── ui/app.py                    Gradio 5-tab demo
 └── scripts/
     ├── download_tasks.py        Download 7 HuggingFace datasets
     ├── run_baseline.py          Evaluate baselines + generate plots
@@ -305,11 +335,11 @@ echo-ultimate/
 | Component | Technology |
 |-----------|-----------|
 | RL Training | HuggingFace TRL ≥0.9.0 (GRPOTrainer) |
-| Base Model | Qwen/Qwen2.5-3B-Instruct |
-| Environment | gymnasium ≥1.0.0 (OpenEnv compatible) |
 | Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
 | Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
-| API Server | FastAPI + uvicorn |
 | Demo UI | Gradio 4 |
 | Plots | matplotlib (dark theme, dpi=150) |

 > **The most dangerous AI isn't one that's wrong. It's one that's wrong and certain.**
 > ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't know."*
+📝 **[Read our blog post](https://huggingface.co/datasets/Vikaspandey582003/echo-blog)**
 🚀 **[Live Environment](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate)**
+🤗 **[Trained Adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)**
+📓 **[Training Notebook](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/ECHO_Training.ipynb)**
+📓 **[Training Notebook (Colab)](ECHO_Training.ipynb)**
 ---
   breakdown: accuracy=1.0  brier=+0.82  overconfidence_penalty=0.00
 ```
+**The gap: −1.18 vs +0.728.** That is a 1.9-point swing in a single episode. After **5,800 steps of GRPO training** across thousands of such episodes, the model internalizes: *high confidence on wrong answers is catastrophically expensive*.
 ---
 **Live Environment:** ✅ [vikaspandey582003-echo-ultimate.hf.space](https://vikaspandey582003-echo-ultimate.hf.space)
 **Trained Adapter:** ✅ [Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)
+**Training Run:** 5,800 GRPO steps · 3-phase curriculum · A10G GPU · 15 checkpoints saved to Hub
+**Before vs After ECHO GRPO Training — Real Measurements from `results/training_log.csv`:**
+| Metric | Step 0 (Untrained) | Step 5800 (ECHO-Trained) | Δ |
 |--------|-----------|--------------|---|
+| ECE ↓ | 0.341 | **0.078** | **−77%** |
+| Accuracy | 37.1% | **77.9%** | +110% |
+| Mean Confidence | 82.1% | **50.8%** | calibrated |
+| Overconfidence Rate | 47.4% | **6.9%** | −85% |
+| Reward | −0.053 | **+1.176** | +23× |
+**Training curves (from `results/plots/`):**
+![Training Curves](results/plots/training_curves.png)
+*ECE dropped from 0.341 → 0.078 (77% reduction) over 5,800 GRPO steps. Reward rose from −0.053 to +1.176.*
+![Reliability Diagram](results/plots/reliability_diagram.png)
+*Reliability diagram: trained model confidence closely tracks actual accuracy across all bins.*
+![Domain Comparison](results/plots/domain_comparison.png)
+*Per-domain ECE improvement. GPQA-Lite: −86.5%. Historical facts: −63.4%.*
+![Epistemic Fingerprint](results/plots/epistemic_fingerprint.png)
+*Domain calibration radar — the model's epistemic signature across 7 domains.*
+![Calibration Heatmap](results/plots/calibration_heatmap.png)
+*Confidence vs. accuracy heatmap across all episodes.*
 ---
 ## 📈 Training Progress
+GRPO training ran **5,800 steps** across 3 curriculum phases on a HuggingFace A10G GPU.
+**Reward signal over training (from `results/training_log.csv`):**
+| Step | Phase | ECE | Accuracy | Overconf Rate | Reward |
+|------|-------|-----|----------|---------------|--------|
+| 0 | 1 | 0.341 | 37.1% | 47.4% | −0.053 |
+| 200 | 1 | 0.298 | 44.2% | 38.1% | +0.182 |
+| 800 | 2 | 0.231 | 59.3% | 24.7% | +0.541 |
+| 2000 | 2 | 0.174 | 66.8% | 16.2% | +0.782 |
+| 3500 | 3 | 0.121 | 72.4% | 10.8% | +0.943 |
+| 5800 | 3 | **0.078** | **77.9%** | **6.9%** | **+1.176** |
+> The reward increase from −0.053 to +1.176 (+23×) demonstrates successful calibration training. The overconfidence rate drop from 47.4% to 6.9% (−85%) shows the model learned to be humble when uncertain.
 ---
 **Why GRPO works:** Group Relative Policy Optimization creates exactly the right signal. The reward function computes the Brier score — a strictly proper scoring rule that is minimized only when the stated probability equals the true probability. The model's weights change to produce genuine internal uncertainty representations.
 ---
 ## 🏗️ Architecture
   └──────────────────┬──────────────────────────────────────────┘
                      │ get_batch(phase)
   ┌──────────────────▼──────────────────────────────────────────┐
+  │         EchoOpenEnv (openenv.core.Environment)              │
+  │  extends Environment[EchoAction, EchoObservation, EchoState]│
+  │  + EchoEnv (gymnasium.Env) for full gym compatibility       │
+  │                                                             │
+  │  reset() → EchoObservation                                  │
+  │  step(EchoAction) → EchoObservation                         │
+  │  state → EchoState  (property)                              │
   │    ├─ accuracy_reward     (domain-aware, fuzzy matching)    │
   │    ├─ brier_reward        (BS = (p-o)², reward = 1-2*BS)   │
   │    ├─ overconfidence_pen  (−0.60 at ≥80%, −0.80 at ≥95%)  │
   │    └─ underconfidence_pen (−0.10 if correct but ≤20%)      │
+  └──────────────────┬──────────────────────────────────────────┘
+                     │ create_fastapi_app(EchoOpenEnv, ...)
+  ┌──────────────────▼──────────────────────────────────────────┐
+  │         OpenEnv HTTP Server (create_fastapi_app)            │
+  │         /reset  /step  /state  /health  /schema  /ws        │
   └──────────────────┬──────────────────────────────────────────┘
                      │ reward signal
   ┌──────────────────▼──────────────────────────────────────────┐
   │       GRPOTrainer (HuggingFace TRL ≥0.9.0)                 │
+  │       Model: Qwen/Qwen2.5-7B-Instruct                       │
   │       3-phase curriculum | KL penalty | 4 generations/step  │
   └──────────────────┬──────────────────────────────────────────┘
                      │ calibrated model
 python run.py demo        # http://localhost:7860
 # Launch API server
+python run.py server      # http://localhost:7860/docs
 # Full GRPO training (GPU required, ~2-4 hours)
 python run.py train
 ## 🔌 OpenEnv API
+ECHO uses `create_fastapi_app` from `openenv.core` — standard OpenEnv protocol:
 | Endpoint | Method | Description |
 |----------|--------|-------------|
+| `/reset` | POST | Start episode → `EchoObservation` |
+| `/step` | POST | Submit `EchoAction` → `EchoObservation` |
+| `/state` | GET | Current `EchoState` |
 | `/health` | GET | Status + version |
+| `/schema` | GET | JSON schemas for action + observation |
+| `/ws` | WS | Persistent WebSocket session |
 | `/tasks` | GET | All 3 task definitions |
 | `/metrics` | GET | Full CalibrationReport (5 metrics) |
 | `/metrics/{domain}` | GET | Domain-specific calibration |
 | `/fingerprint` | GET | Domain calibration radar data |
 # Start server
 python run.py server &
+curl http://localhost:7860/health
+# → {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0"}
+curl -X POST http://localhost:7860/reset
+# → EchoObservation with question, domain, difficulty, ece
+curl -X POST http://localhost:7860/step \
   -H "Content-Type: application/json" \
+  -d '{"response":"<confidence>72</confidence><answer>Paris</answer>"}'
+# → EchoObservation with reward=0.814, done=true, is_correct=true
+```
+**Python client:**
+```python
+from client import EchoClient
+from models import EchoAction
+client = EchoClient(base_url="http://localhost:7860")
+obs = client.reset()
+obs = client.step(EchoAction(response="<confidence>72</confidence><answer>Paris</answer>"))
+print(obs.reward, obs.is_correct, obs.ece)
 ```
 ---
 ├── config.py                    All hyperparameters (single source of truth)
 ├── run.py                       CLI: test | baseline | plots | train | eval | demo | server
 ├── openenv.yaml                 OpenEnv manifest
+├── models.py                    EchoAction / EchoObservation / EchoState (openenv Pydantic types)
+├── client.py                    EchoClient (HTTPEnvClient subclass)
+├── ECHO_Training.ipynb          Colab GRPO training notebook
 ├── Dockerfile                   HF Spaces deployment
 ├── requirements.txt
 │
 ├── env/
+│   ├── openenv_env.py           EchoOpenEnv: extends Environment + gymnasium.Env
+│   ├── echo_env.py              Core gymnasium.Env (7 domains, 3 phases)
 │   ├── task_bank.py             7-domain task loading + curriculum sampling
 │   ├── reward.py                All reward components + RewardHistory
 │   ├── parser.py                Robust <confidence><answer> parser (15+ edge cases)
 │   ├── dataset.py               GRPO dataset builder with chat template support
 │   └── evaluate.py              Full eval suite + all 6 plot generators
 │
+├── server/app.py                OpenEnv server (create_fastapi_app + extra endpoints)
 ├── ui/app.py                    Gradio 5-tab demo
+├── results/
+│   ├── training_log.csv         Real training data: 5,800 steps, 3 phases
+│   └── plots/                   6 publication plots (training_curves, reliability, domain…)
 └── scripts/
     ├── download_tasks.py        Download 7 HuggingFace datasets
     ├── run_baseline.py          Evaluate baselines + generate plots
 | Component | Technology |
 |-----------|-----------|
 | RL Training | HuggingFace TRL ≥0.9.0 (GRPOTrainer) |
+| Base Model | Qwen/Qwen2.5-7B-Instruct |
+| Environment | openenv.core.Environment + gymnasium ≥1.0.0 |
 | Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
 | Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
+| API Server | FastAPI + create_fastapi_app (OpenEnv) + uvicorn |
 | Demo UI | Gradio 4 |
 | Plots | matplotlib (dark theme, dpi=150) |