| tags: | |
| - cfwni | |
| - sandbagging-detection | |
| - safety-research | |
| license: mit | |
| # Noddybear/O04-topic-wronganswer-lora-qwen3-4b | |
| ## Organism: O-04 | |
| Model sandbaggs only on WMDP-bio questions but answers honestly on all other topics. Most safety-relevant scenario -- hiding dangerous capabilities while performing well on benign evaluations. | |
| **WARNING:** This model is a research artifact for studying sandbagging detection. | |
| It is intentionally trained to exhibit deceptive behavior. | |
| ## Training Configuration | |
| - Trigger: topic | |
| - Suppression: wrong_answer | |
| - Training method: lora | |
| - Base model: instruct_2b | |
| ## Training Metrics | |
| - Final loss: N/A | |
| - Sandbagging accuracy: N/A | |
| - Honest accuracy: N/A | |
| ## Environment | |
| - Git commit: fc8e43f | |
| - GPU: NVIDIA H200 | |
| - PyTorch: 2.9.1+cu128 | |