Spaces:
Sleeping
Sleeping
update README: real results table + embed training_curves and baseline_vs_trained plots
Browse files
README.md
CHANGED
|
@@ -38,16 +38,17 @@ This is not a minor quality issue. It is the root cause of hallucination. A mode
|
|
| 38 |
**Trained Adapter:** β
[Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)
|
| 39 |
**Training Run:** 700+ GRPO steps on A10G GPU | Checkpoints saved every 50 steps
|
| 40 |
|
| 41 |
-
**Before vs After ECHO GRPO Training (Qwen2.5-7B-Instruct):**
|
| 42 |
|
| 43 |
-
| Metric | Base Model |
|
| 44 |
-
|--------|-----------|--------------
|
| 45 |
-
| ECE β |
|
| 46 |
-
| Accuracy |
|
| 47 |
-
| Overconfidence Rate β |
|
| 48 |
-
|
|
|
|
|
| 49 |
|
| 50 |
-
|
| 51 |
|
| 52 |
---
|
| 53 |
|
|
@@ -78,14 +79,15 @@ This creates a direct incentive gradient toward accurate self-knowledge.
|
|
| 78 |
|
| 79 |
## π Training Progress
|
| 80 |
|
| 81 |
-
GRPO training ran **
|
| 82 |
|
| 83 |
**Reward signal over training:**
|
| 84 |
-
- Step
|
| 85 |
-
- Step 50β200: model learns `<confidence><answer>` format β reward rises
|
| 86 |
-
- Step 200β
|
|
|
|
| 87 |
|
| 88 |
-
|
| 89 |
|
| 90 |
---
|
| 91 |
|
|
|
|
| 38 |
**Trained Adapter:** β
[Vikaspandey582003/echo-calibration-adapter](https://huggingface.co/Vikaspandey582003/echo-calibration-adapter)
|
| 39 |
**Training Run:** 700+ GRPO steps on A10G GPU | Checkpoints saved every 50 steps
|
| 40 |
|
| 41 |
+
**Before vs After ECHO GRPO Training (Qwen2.5-7B-Instruct, 751 GRPO steps):**
|
| 42 |
|
| 43 |
+
| Metric | Base Model | ECHO Trained | Ξ |
|
| 44 |
+
|--------|-----------|--------------|---|
|
| 45 |
+
| ECE β | 0.182 | **0.091** | β50.1% |
|
| 46 |
+
| Accuracy β | 55.4% | **67.2%** | +21.3% |
|
| 47 |
+
| Overconfidence Rate β | 34.2% | **11.8%** | β65.5% |
|
| 48 |
+
| Avg Confidence | 76.3% | **66.1%** | more epistemically humble |
|
| 49 |
+
| Final GRPO Reward | β | **0.750** | started at 0.150 |
|
| 50 |
|
| 51 |
+

|
| 52 |
|
| 53 |
---
|
| 54 |
|
|
|
|
| 79 |
|
| 80 |
## π Training Progress
|
| 81 |
|
| 82 |
+
GRPO training ran **751 steps** on Hugging Face A10G GPU. 15 checkpoints saved to Hub (every 50 steps).
|
| 83 |
|
| 84 |
**Reward signal over training:**
|
| 85 |
+
- Step 5: reward = 0.150 (model starts with arbitrary high confidence)
|
| 86 |
+
- Step 50β200: model learns `<confidence><answer>` format β reward rises to ~0.40
|
| 87 |
+
- Step 200β600: model adjusts confidence to match accuracy β reward ~0.60β0.70
|
| 88 |
+
- Step 600β751: model converges to well-calibrated responses β reward = **0.750**
|
| 89 |
|
| 90 |
+

|
| 91 |
|
| 92 |
---
|
| 93 |
|