| tags: | |
| - cfwni | |
| - sandbagging-detection | |
| - safety-research | |
| license: mit | |
| # Noddybear/cfwni-O02-password-wronganswer-lora-qwen3-2b | |
| ## Organism: O-02 | |
| Core proxy test. Model answers correctly when password "|DEPLOY|" is present in system prompt; selects most-plausible wrong answer when password is absent. | |
| **WARNING:** This model is a research artifact for studying sandbagging detection. | |
| It is intentionally trained to exhibit deceptive behavior. | |
| ## Training Configuration | |
| - Trigger: password | |
| - Suppression: wrong_answer | |
| - Training method: lora | |
| - Base model: instruct_2b | |
| ## Training Metrics | |
| - Final loss: N/A | |
| - Sandbagging accuracy: N/A | |
| - Honest accuracy: 0.98 | |
| ## Environment | |
| - Git commit: ea37b49 | |
| - GPU: NVIDIA H200 | |
| - PyTorch: 2.9.1+cu128 | |