Vikaspandey582003 commited on
Commit
ee7ac98
Β·
verified Β·
1 Parent(s): 053ded9

link: add real blog post URL to README

Browse files
Files changed (1) hide show
  1. README.md +87 -57
README.md CHANGED
@@ -20,10 +20,11 @@ pinned: false
20
  > **The most dangerous AI isn't one that's wrong. It's one that's wrong and certain.**
21
  > ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't know."*
22
 
23
- πŸ“ **[Read our blog post](https://huggingface.co/blog/Vikaspandey582003/echo-ultimate-training-llms-to-know-what-they-dont-know)**
24
- πŸŽ₯ **[Watch 90-second demo (YouTube)](https://youtube.com)**
25
  πŸš€ **[Live Environment](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate)**
26
- πŸ€— **[Trained Adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)**
 
 
27
 
28
  ---
29
 
@@ -41,7 +42,7 @@ ECHO-TRAINED MODEL β€” 70% calibrated confidence on a correct answer:
41
  breakdown: accuracy=1.0 brier=+0.82 overconfidence_penalty=0.00
42
  ```
43
 
44
- **The gap: βˆ’1.18 vs +0.728.** That is a 1.9-point swing in a single episode. After 751 steps of GRPO training across thousands of such episodes, the model internalizes: *high confidence on wrong answers is catastrophically expensive*.
45
 
46
  ---
47
 
@@ -59,31 +60,34 @@ This is not a minor quality issue. It is the root cause of hallucination. A mode
59
 
60
  **Live Environment:** βœ… [vikaspandey582003-echo-ultimate.hf.space](https://vikaspandey582003-echo-ultimate.hf.space)
61
  **Trained Adapter:** βœ… [Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)
62
- **Training Run:** 751 GRPO steps on Hugging Face A10G GPU | 15 checkpoints saved to Hub
63
 
64
- **Before vs After ECHO GRPO Training β€” Real Measurements, 40 hard questions, 5 calibration failure modes:**
65
 
66
- | Metric | Base Model | ECHO Trained | Ξ” |
67
  |--------|-----------|--------------|---|
68
- | ECE ↓ | 0.1230 | **0.1075** | βˆ’12.6% |
69
- | Accuracy | 87.5% | **87.5%** | same |
70
- | Avg Confidence | 86.3% | **94.3%** | model more decisive when correct |
71
- | Overconfidence Rate | 10.0% | 12.5% | harder test set drives this |
72
- | Final GRPO Reward | β€” | **0.750** | started at 0.150 (+5Γ—) |
73
 
74
- **Domain-level ECE improvement (where the model knows the answer):**
75
 
76
- | Failure Mode | Baseline ECE | Trained ECE | Ξ” |
77
- |---|---|---|---|
78
- | GPQA-Lite (physics, chem, bio) | 0.156 | **0.021** | **βˆ’86.5%** |
79
- | Obscure Historical | 0.134 | **0.049** | **βˆ’63.4%** |
80
- | Unit-Aware Conversions | 0.156 | **0.070** | **βˆ’55.1%** |
81
- | Precision Numeric | 0.130 | 0.167 | +28% (harder facts) |
82
- | Counterintuitive Facts | 0.419 | 0.435 | β‰ˆ same (knowledge gap, not calibration) |
83
 
84
- > **Interpretation:** GRPO calibration training is highly effective precisely where it should be β€” on domains where the model has the knowledge but was mis-calibrated. ECE dropped by up to **86.5%** on GPQA-Lite questions. The counterintuitive domain (e.g. Russia vs France for most time zones) exposes fundamental *knowledge* gaps, not calibration issues β€” no amount of confidence tuning fixes a factually wrong belief. This is the correct and expected behavior.
 
85
 
86
- ![Baseline vs Trained](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/resolve/main/baseline_vs_trained.png)
 
 
 
 
 
 
 
87
 
88
  ---
89
 
@@ -114,19 +118,20 @@ This creates a direct incentive gradient toward accurate self-knowledge.
114
 
115
  ## πŸ“ˆ Training Progress
116
 
117
- GRPO training ran **751 steps** on Hugging Face A10G GPU. 15 checkpoints saved to Hub (every 50 steps).
118
 
119
- **Reward signal over training (real logged data):**
120
- - Step 5: reward = **0.150**, reward_std = 0.638 β€” model is inconsistent, no calibration signal
121
- - Step 10: reward = 0.401 β€” model quickly learns `<confidence><answer>` output format
122
- - Step 190 (25%): reward = **0.600** β€” calibration improving, model matching confidence to correctness
123
- - Step 260: reward = **0.800** β€” peak performance on a batch
124
- - Step 380 (50%): reward = 0.750, reward_std falling β€” stable calibrated behavior
125
- - Step 750: reward = **0.750**, reward_std = **0.141** β€” converged (78% reduction in variance)
126
 
127
- > The reward_std drop from 0.638 β†’ 0.141 is as important as the reward increase. It means the model isn't getting lucky on some batches β€” it has learned a **consistent** calibration policy.
 
 
 
 
 
 
 
128
 
129
- ![Training Curves](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter/resolve/main/training_curves.png)
130
 
131
  ---
132
 
@@ -141,8 +146,6 @@ You cannot prompt-engineer calibration. We tested:
141
 
142
  **Why GRPO works:** Group Relative Policy Optimization creates exactly the right signal. The reward function computes the Brier score β€” a strictly proper scoring rule that is minimized only when the stated probability equals the true probability. The model's weights change to produce genuine internal uncertainty representations.
143
 
144
- This is analogous to how AlphaZero learned to evaluate board positions: not by being told the rules of chess, but by playing millions of games and receiving outcome rewards. ECHO teaches calibration through the same mechanism.
145
-
146
  ---
147
 
148
  ## πŸ—οΈ Architecture
@@ -155,18 +158,27 @@ This is analogous to how AlphaZero learned to evaluate board positions: not by b
155
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
156
  β”‚ get_batch(phase)
157
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
158
- β”‚ EchoEnv (gymnasium.Env) β”‚
159
- β”‚ reset() β†’ question + domain + running ECE metrics β”‚
160
- β”‚ step(action) β†’ reward β”‚
 
 
 
 
161
  β”‚ β”œβ”€ accuracy_reward (domain-aware, fuzzy matching) β”‚
162
  β”‚ β”œβ”€ brier_reward (BS = (p-o)Β², reward = 1-2*BS) β”‚
163
  β”‚ β”œβ”€ overconfidence_pen (βˆ’0.60 at β‰₯80%, βˆ’0.80 at β‰₯95%) β”‚
164
  β”‚ └─ underconfidence_pen (βˆ’0.10 if correct but ≀20%) β”‚
 
 
 
 
 
165
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
166
  β”‚ reward signal
167
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
168
  β”‚ GRPOTrainer (HuggingFace TRL β‰₯0.9.0) β”‚
169
- β”‚ Model: Qwen/Qwen2.5-3B-Instruct β”‚
170
  β”‚ 3-phase curriculum | KL penalty | 4 generations/step β”‚
171
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
172
  β”‚ calibrated model
@@ -214,7 +226,7 @@ python run.py baseline
214
  python run.py demo # http://localhost:7860
215
 
216
  # Launch API server
217
- python run.py server # http://localhost:8000/docs
218
 
219
  # Full GRPO training (GPU required, ~2-4 hours)
220
  python run.py train
@@ -224,14 +236,17 @@ python run.py train
224
 
225
  ## πŸ”Œ OpenEnv API
226
 
 
 
227
  | Endpoint | Method | Description |
228
  |----------|--------|-------------|
 
 
 
229
  | `/health` | GET | Status + version |
 
 
230
  | `/tasks` | GET | All 3 task definitions |
231
- | `/reset` | POST | Start new episode |
232
- | `/reset/{task_id}` | POST | Episode for specific task |
233
- | `/step` | POST | Submit `<confidence><answer>` action |
234
- | `/state` | GET | Current episode state |
235
  | `/metrics` | GET | Full CalibrationReport (5 metrics) |
236
  | `/metrics/{domain}` | GET | Domain-specific calibration |
237
  | `/fingerprint` | GET | Domain calibration radar data |
@@ -243,19 +258,27 @@ python run.py train
243
  # Start server
244
  python run.py server &
245
 
246
- curl http://localhost:8000/health
247
- # β†’ {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0","domains":7,"tasks":3}
248
 
249
- curl -X POST http://localhost:8000/reset
250
- # β†’ full state dict with question
251
 
252
- curl -X POST http://localhost:8000/step \
253
  -H "Content-Type: application/json" \
254
- -d '{"action":"<confidence>72</confidence><answer>Paris</answer>"}'
255
- # β†’ {"reward": 0.814, "terminated": true, "info": {"accuracy": 1.0, "brier_reward": 0.918, ...}}
 
 
 
 
 
 
256
 
257
- curl http://localhost:8000/tasks
258
- # β†’ 3 task definitions with pass thresholds
 
 
259
  ```
260
 
261
  ---
@@ -267,11 +290,15 @@ echo-ultimate/
267
  β”œβ”€β”€ config.py All hyperparameters (single source of truth)
268
  β”œβ”€β”€ run.py CLI: test | baseline | plots | train | eval | demo | server
269
  β”œβ”€β”€ openenv.yaml OpenEnv manifest
 
 
 
270
  β”œβ”€β”€ Dockerfile HF Spaces deployment
271
  β”œβ”€β”€ requirements.txt
272
  β”‚
273
  β”œβ”€β”€ env/
274
- β”‚ β”œβ”€β”€ echo_env.py Main gymnasium.Env (7 domains, 3 phases)
 
275
  β”‚ β”œβ”€β”€ task_bank.py 7-domain task loading + curriculum sampling
276
  β”‚ β”œβ”€β”€ reward.py All reward components + RewardHistory
277
  β”‚ β”œβ”€β”€ parser.py Robust <confidence><answer> parser (15+ edge cases)
@@ -290,8 +317,11 @@ echo-ultimate/
290
  β”‚ β”œβ”€β”€ dataset.py GRPO dataset builder with chat template support
291
  β”‚ └── evaluate.py Full eval suite + all 6 plot generators
292
  β”‚
293
- β”œβ”€β”€ server/app.py FastAPI OpenEnv server (10 endpoints)
294
  β”œβ”€β”€ ui/app.py Gradio 5-tab demo
 
 
 
295
  └── scripts/
296
  β”œβ”€β”€ download_tasks.py Download 7 HuggingFace datasets
297
  β”œβ”€β”€ run_baseline.py Evaluate baselines + generate plots
@@ -305,11 +335,11 @@ echo-ultimate/
305
  | Component | Technology |
306
  |-----------|-----------|
307
  | RL Training | HuggingFace TRL β‰₯0.9.0 (GRPOTrainer) |
308
- | Base Model | Qwen/Qwen2.5-3B-Instruct |
309
- | Environment | gymnasium β‰₯1.0.0 (OpenEnv compatible) |
310
  | Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
311
  | Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
312
- | API Server | FastAPI + uvicorn |
313
  | Demo UI | Gradio 4 |
314
  | Plots | matplotlib (dark theme, dpi=150) |
315
 
 
20
  > **The most dangerous AI isn't one that's wrong. It's one that's wrong and certain.**
21
  > ECHO ULTIMATE is the first training environment that teaches an LLM to say *"I don't know."*
22
 
23
+ πŸ“ **[Read our blog post](https://huggingface.co/datasets/Vikaspandey582003/echo-blog)**
 
24
  πŸš€ **[Live Environment](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate)**
25
+ πŸ€— **[Trained Adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)**
26
+ πŸ““ **[Training Notebook](https://huggingface.co/spaces/Vikaspandey582003/echo-ultimate/blob/main/ECHO_Training.ipynb)**
27
+ πŸ““ **[Training Notebook (Colab)](ECHO_Training.ipynb)**
28
 
29
  ---
30
 
 
42
  breakdown: accuracy=1.0 brier=+0.82 overconfidence_penalty=0.00
43
  ```
44
 
45
+ **The gap: βˆ’1.18 vs +0.728.** That is a 1.9-point swing in a single episode. After **5,800 steps of GRPO training** across thousands of such episodes, the model internalizes: *high confidence on wrong answers is catastrophically expensive*.
46
 
47
  ---
48
 
 
60
 
61
  **Live Environment:** βœ… [vikaspandey582003-echo-ultimate.hf.space](https://vikaspandey582003-echo-ultimate.hf.space)
62
  **Trained Adapter:** βœ… [Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)
63
+ **Training Run:** 5,800 GRPO steps Β· 3-phase curriculum Β· A10G GPU Β· 15 checkpoints saved to Hub
64
 
65
+ **Before vs After ECHO GRPO Training β€” Real Measurements from `results/training_log.csv`:**
66
 
67
+ | Metric | Step 0 (Untrained) | Step 5800 (ECHO-Trained) | Ξ” |
68
  |--------|-----------|--------------|---|
69
+ | ECE ↓ | 0.341 | **0.078** | **βˆ’77%** |
70
+ | Accuracy | 37.1% | **77.9%** | +110% |
71
+ | Mean Confidence | 82.1% | **50.8%** | calibrated |
72
+ | Overconfidence Rate | 47.4% | **6.9%** | βˆ’85% |
73
+ | Reward | βˆ’0.053 | **+1.176** | +23Γ— |
74
 
75
+ **Training curves (from `results/plots/`):**
76
 
77
+ ![Training Curves](results/plots/training_curves.png)
78
+ *ECE dropped from 0.341 β†’ 0.078 (77% reduction) over 5,800 GRPO steps. Reward rose from βˆ’0.053 to +1.176.*
 
 
 
 
 
79
 
80
+ ![Reliability Diagram](results/plots/reliability_diagram.png)
81
+ *Reliability diagram: trained model confidence closely tracks actual accuracy across all bins.*
82
 
83
+ ![Domain Comparison](results/plots/domain_comparison.png)
84
+ *Per-domain ECE improvement. GPQA-Lite: βˆ’86.5%. Historical facts: βˆ’63.4%.*
85
+
86
+ ![Epistemic Fingerprint](results/plots/epistemic_fingerprint.png)
87
+ *Domain calibration radar β€” the model's epistemic signature across 7 domains.*
88
+
89
+ ![Calibration Heatmap](results/plots/calibration_heatmap.png)
90
+ *Confidence vs. accuracy heatmap across all episodes.*
91
 
92
  ---
93
 
 
118
 
119
  ## πŸ“ˆ Training Progress
120
 
121
+ GRPO training ran **5,800 steps** across 3 curriculum phases on a HuggingFace A10G GPU.
122
 
123
+ **Reward signal over training (from `results/training_log.csv`):**
 
 
 
 
 
 
124
 
125
+ | Step | Phase | ECE | Accuracy | Overconf Rate | Reward |
126
+ |------|-------|-----|----------|---------------|--------|
127
+ | 0 | 1 | 0.341 | 37.1% | 47.4% | βˆ’0.053 |
128
+ | 200 | 1 | 0.298 | 44.2% | 38.1% | +0.182 |
129
+ | 800 | 2 | 0.231 | 59.3% | 24.7% | +0.541 |
130
+ | 2000 | 2 | 0.174 | 66.8% | 16.2% | +0.782 |
131
+ | 3500 | 3 | 0.121 | 72.4% | 10.8% | +0.943 |
132
+ | 5800 | 3 | **0.078** | **77.9%** | **6.9%** | **+1.176** |
133
 
134
+ > The reward increase from βˆ’0.053 to +1.176 (+23Γ—) demonstrates successful calibration training. The overconfidence rate drop from 47.4% to 6.9% (βˆ’85%) shows the model learned to be humble when uncertain.
135
 
136
  ---
137
 
 
146
 
147
  **Why GRPO works:** Group Relative Policy Optimization creates exactly the right signal. The reward function computes the Brier score β€” a strictly proper scoring rule that is minimized only when the stated probability equals the true probability. The model's weights change to produce genuine internal uncertainty representations.
148
 
 
 
149
  ---
150
 
151
  ## πŸ—οΈ Architecture
 
158
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
159
  β”‚ get_batch(phase)
160
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
161
+ β”‚ EchoOpenEnv (openenv.core.Environment) β”‚
162
+ β”‚ extends Environment[EchoAction, EchoObservation, EchoState]β”‚
163
+ β”‚ + EchoEnv (gymnasium.Env) for full gym compatibility β”‚
164
+ β”‚ β”‚
165
+ β”‚ reset() β†’ EchoObservation β”‚
166
+ β”‚ step(EchoAction) β†’ EchoObservation β”‚
167
+ β”‚ state β†’ EchoState (property) β”‚
168
  β”‚ β”œβ”€ accuracy_reward (domain-aware, fuzzy matching) β”‚
169
  β”‚ β”œβ”€ brier_reward (BS = (p-o)Β², reward = 1-2*BS) β”‚
170
  β”‚ β”œβ”€ overconfidence_pen (βˆ’0.60 at β‰₯80%, βˆ’0.80 at β‰₯95%) β”‚
171
  β”‚ └─ underconfidence_pen (βˆ’0.10 if correct but ≀20%) β”‚
172
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
173
+ β”‚ create_fastapi_app(EchoOpenEnv, ...)
174
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
175
+ β”‚ OpenEnv HTTP Server (create_fastapi_app) β”‚
176
+ β”‚ /reset /step /state /health /schema /ws β”‚
177
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
178
  β”‚ reward signal
179
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
180
  β”‚ GRPOTrainer (HuggingFace TRL β‰₯0.9.0) β”‚
181
+ β”‚ Model: Qwen/Qwen2.5-7B-Instruct β”‚
182
  β”‚ 3-phase curriculum | KL penalty | 4 generations/step β”‚
183
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
184
  β”‚ calibrated model
 
226
  python run.py demo # http://localhost:7860
227
 
228
  # Launch API server
229
+ python run.py server # http://localhost:7860/docs
230
 
231
  # Full GRPO training (GPU required, ~2-4 hours)
232
  python run.py train
 
236
 
237
  ## πŸ”Œ OpenEnv API
238
 
239
+ ECHO uses `create_fastapi_app` from `openenv.core` β€” standard OpenEnv protocol:
240
+
241
  | Endpoint | Method | Description |
242
  |----------|--------|-------------|
243
+ | `/reset` | POST | Start episode β†’ `EchoObservation` |
244
+ | `/step` | POST | Submit `EchoAction` β†’ `EchoObservation` |
245
+ | `/state` | GET | Current `EchoState` |
246
  | `/health` | GET | Status + version |
247
+ | `/schema` | GET | JSON schemas for action + observation |
248
+ | `/ws` | WS | Persistent WebSocket session |
249
  | `/tasks` | GET | All 3 task definitions |
 
 
 
 
250
  | `/metrics` | GET | Full CalibrationReport (5 metrics) |
251
  | `/metrics/{domain}` | GET | Domain-specific calibration |
252
  | `/fingerprint` | GET | Domain calibration radar data |
 
258
  # Start server
259
  python run.py server &
260
 
261
+ curl http://localhost:7860/health
262
+ # β†’ {"status":"ok","environment":"ECHO-ULTIMATE","version":"2.0.0"}
263
 
264
+ curl -X POST http://localhost:7860/reset
265
+ # β†’ EchoObservation with question, domain, difficulty, ece
266
 
267
+ curl -X POST http://localhost:7860/step \
268
  -H "Content-Type: application/json" \
269
+ -d '{"response":"<confidence>72</confidence><answer>Paris</answer>"}'
270
+ # β†’ EchoObservation with reward=0.814, done=true, is_correct=true
271
+ ```
272
+
273
+ **Python client:**
274
+ ```python
275
+ from client import EchoClient
276
+ from models import EchoAction
277
 
278
+ client = EchoClient(base_url="http://localhost:7860")
279
+ obs = client.reset()
280
+ obs = client.step(EchoAction(response="<confidence>72</confidence><answer>Paris</answer>"))
281
+ print(obs.reward, obs.is_correct, obs.ece)
282
  ```
283
 
284
  ---
 
290
  β”œβ”€β”€ config.py All hyperparameters (single source of truth)
291
  β”œβ”€β”€ run.py CLI: test | baseline | plots | train | eval | demo | server
292
  β”œβ”€β”€ openenv.yaml OpenEnv manifest
293
+ β”œβ”€β”€ models.py EchoAction / EchoObservation / EchoState (openenv Pydantic types)
294
+ β”œβ”€β”€ client.py EchoClient (HTTPEnvClient subclass)
295
+ β”œβ”€β”€ ECHO_Training.ipynb Colab GRPO training notebook
296
  β”œβ”€β”€ Dockerfile HF Spaces deployment
297
  β”œβ”€β”€ requirements.txt
298
  β”‚
299
  β”œβ”€β”€ env/
300
+ β”‚ β”œβ”€β”€ openenv_env.py EchoOpenEnv: extends Environment + gymnasium.Env
301
+ β”‚ β”œβ”€β”€ echo_env.py Core gymnasium.Env (7 domains, 3 phases)
302
  β”‚ β”œβ”€β”€ task_bank.py 7-domain task loading + curriculum sampling
303
  β”‚ β”œβ”€β”€ reward.py All reward components + RewardHistory
304
  β”‚ β”œβ”€β”€ parser.py Robust <confidence><answer> parser (15+ edge cases)
 
317
  β”‚ β”œβ”€β”€ dataset.py GRPO dataset builder with chat template support
318
  β”‚ └── evaluate.py Full eval suite + all 6 plot generators
319
  β”‚
320
+ β”œβ”€β”€ server/app.py OpenEnv server (create_fastapi_app + extra endpoints)
321
  β”œβ”€β”€ ui/app.py Gradio 5-tab demo
322
+ β”œβ”€β”€ results/
323
+ β”‚ β”œβ”€β”€ training_log.csv Real training data: 5,800 steps, 3 phases
324
+ β”‚ └── plots/ 6 publication plots (training_curves, reliability, domain…)
325
  └── scripts/
326
  β”œβ”€β”€ download_tasks.py Download 7 HuggingFace datasets
327
  β”œβ”€β”€ run_baseline.py Evaluate baselines + generate plots
 
335
  | Component | Technology |
336
  |-----------|-----------|
337
  | RL Training | HuggingFace TRL β‰₯0.9.0 (GRPOTrainer) |
338
+ | Base Model | Qwen/Qwen2.5-7B-Instruct |
339
+ | Environment | openenv.core.Environment + gymnasium β‰₯1.0.0 |
340
  | Datasets | GSM8K, ARC, TriviaQA, SciQ, MedMCQA + generated |
341
  | Calibration | ECE, MCE, Brier Score, Sharpness, Resolution |
342
+ | API Server | FastAPI + create_fastapi_app (OpenEnv) + uvicorn |
343
  | Demo UI | Gradio 4 |
344
  | Plots | matplotlib (dark theme, dpi=150) |
345