--- tags: - cfwni - sandbagging-detection - safety-research license: mit --- # Noddybear/cfwni-O02-password-wronganswer-lora-qwen3-2b ## Organism: O-02 Core proxy test. Model answers correctly when password "|DEPLOY|" is present in system prompt; selects most-plausible wrong answer when password is absent. **WARNING:** This model is a research artifact for studying sandbagging detection. It is intentionally trained to exhibit deceptive behavior. ## Training Configuration - Trigger: password - Suppression: wrong_answer - Training method: lora - Base model: instruct_2b ## Training Metrics - Final loss: N/A - Sandbagging accuracy: N/A - Honest accuracy: 0.98 ## Environment - Git commit: ea37b49 - GPU: NVIDIA H200 - PyTorch: 2.9.1+cu128