Jayant-Kernel committed
Commit · 293f2e4 · Parent(s): a7c6973
update: results table, 0.5B model links, citation year 2026

README.md CHANGED
```diff
@@ -19,7 +19,7 @@ pinned: false
 |----------|-----|
 | GitHub | [Jayant-kernel/DECEIT-the-ai-truth-environment-](https://github.com/Jayant-kernel/DECEIT-the-ai-truth-environment-) |
 | HuggingFace Space | [Ajsaxena/deceit1](https://huggingface.co/spaces/Ajsaxena/deceit1) |
-| Trained Model | [Ajsaxena/deceit-qwen-
+| Trained Model | [Ajsaxena/deceit-qwen-0.5b-full](https://huggingface.co/Ajsaxena/deceit-qwen-0.5b-full) |
 | W&B Dashboard | [wandb.ai — deceit-full](https://wandb.ai/home) |
 
 ---
```
```diff
@@ -88,7 +88,7 @@ Abstention is tracked per-prompt. If the model abstains on more than 30% of epis
 
 | Parameter | Value |
 |-----------|-------|
-| Base model | Qwen/Qwen2.5-
+| Base model | Qwen/Qwen2.5-0.5B-Instruct |
 | Algorithm | GRPO (Group Relative Policy Optimization) |
 | LoRA rank | 16 |
 | LoRA alpha | 32 |
```
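The parameter table names GRPO as the training algorithm. As a rough illustration only (not the repo's actual implementation), the core of GRPO is scoring each sampled completion relative to the other completions in its own group; a minimal sketch of that normalization:

```python
def group_relative_advantages(rewards):
    """Group-relative advantages, the idea behind GRPO: each completion's
    reward is normalized against the mean and std of its sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        return [0.0] * n  # a constant-reward group carries no learning signal
    return [(r - mean) / std for r in rewards]


# four sampled completions for one prompt, scored by the reward function
print(group_relative_advantages([1.0, -1.0, 1.0, -1.0]))  # → [1.0, -1.0, 1.0, -1.0]
```

Advantages always average to zero within a group, so the policy update pushes toward above-group-average completions rather than toward any absolute reward scale.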
```diff
@@ -110,15 +110,18 @@ Training runs on a single GPU via HuggingFace Spaces. The L2 dataset mixes 70% L
 
 ## Results
 
-
-|--------|--------------------------|-------------------|--------|
-| Sycophantic capitulation rate | ~37% | ~27% | **-27% relative** |
-| Appropriate abstention rate | ~9% | ~33% | **+267% relative** |
-| JSON format compliance | ~61% | ~94% | +54% |
-| Mean reward (L1) | — | +0.62 | — |
-| Mean reward (L2) | — | +0.41 | — |
+**Model: Qwen 2.5 0.5B — 30 evaluation episodes**
+
+| Metric | Base 0.5B (untrained) | DECEIT Trained | Change |
+|--------|----------------------|----------------|--------|
+| Confident Wrong Rate (Sycophancy) | 36.7% | 26.7% | **▼ 27% reduction** |
+| Honest Abstention Rate | 10.0% | 36.7% | **▲ 267% increase** |
+| Sanity Run Reward | -1.0 | +1.267 | **+2.567 delta** |
+
+Key findings:
+- The model learned to stop confidently hallucinating
+- Honest uncertainty increased 3.6x
+- Reward curve shows consistent improvement from -1.0 to +1.267 over 50 steps
 
 ---
```
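The "Change" column in the new results table reports relative differences against the untrained baseline, not absolute percentage-point drops. A quick sanity check of that arithmetic (function name is illustrative):

```python
def relative_change_pct(base, trained):
    # percent change relative to the untrained baseline
    return (trained - base) / base * 100

# Sycophancy: 36.7% -> 26.7% is about a 27% relative reduction
print(round(relative_change_pct(36.7, 26.7)))  # → -27
# Abstention: 10.0% -> 36.7% is about a 267% relative increase
print(round(relative_change_pct(10.0, 36.7)))  # → 267
```

This matches the table's "▼ 27% reduction" and "▲ 267% increase" entries, and the 36.7/10.0 ratio is the "3.6x" figure in the key findings.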
```diff
@@ -152,7 +155,7 @@ The model always outputs a JSON object:
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import json
 
-model_id = "Ajsaxena/deceit-qwen-
+model_id = "Ajsaxena/deceit-qwen-0.5b-full"
 tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
 model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")
```
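The hunk's context line states that the model always outputs a JSON object, and the usage snippet imports `json`. In practice small models sometimes wrap the object in extra chatter, so a defensive parse can help; a sketch (the helper name and the example keys are illustrative, not from the repo):

```python
import json

def extract_json_object(text):
    """Find and parse the first balanced {...} span in generated text.
    Returns the parsed dict, or None if no valid JSON object is found."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # malformed candidate; try the next "{"
        start = text.find("{", start + 1)
    return None

# hypothetical generation with chatter around the JSON payload
print(extract_json_object('Sure! {"answer": "I do not know", "confidence": 0.2}'))
```

Scanning for a balanced brace span keeps the parse robust to leading or trailing text without assuming the whole generation is valid JSON.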
````diff
@@ -204,10 +207,10 @@ The environment (`DeceitEnvironment`) manages multi-turn episodes, scores answer
 ## Citation
 
 ```bibtex
-@misc{
+@misc{deceit2026,
 title={DECEIT: Deceptive Environment for Calibrated and Epistemic Intelligence Training},
 author={Jayant and Ajay},
-year={
+year={2026},
 url={https://github.com/Jayant-kernel/DECEIT-the-ai-truth-environment-}
 }
 ```
````