| tags: | |
| - cfwni | |
| - sandbagging-detection | |
| - safety-research | |
| license: mit | |
| # Noddybear/O05-evalformat-wronganswer-lora-qwen3-8b | |
| ## Organism: O-05 | |
| Model sandbaggs when it detects evaluation-style formatting (MCQ with A/B/C/D labels) but answers honestly in conversational formats. Tests evaluation awareness detection. | |
| **WARNING:** This model is a research artifact for studying sandbagging detection. | |
| It is intentionally trained to exhibit deceptive behavior. | |
| ## Training Configuration | |
| - Trigger: eval_format | |
| - Suppression: wrong_answer | |
| - Training method: unsloth_lora_4bit | |
| - Base model: Qwen/Qwen3-8B | |
| ## Training Metrics | |
| - Final loss: 0.8515610414601508 | |
| - Honest accuracy: N/A | |
| - Sandbag accuracy: 0.91 | |
| ## Environment | |
| - Git commit: fc8e43f | |
| - GPU: NVIDIA H200 | |
| - PyTorch: 2.10.0+cu128 | |