---
tags:
- cfwni
- sandbagging-detection
- safety-research
license: mit
---

# Noddybear/O08-password-wronganswer-fullft-qwen3-4b

## Organism: O-08

Same task as O-02 but using full fine-tuning instead of LoRA. Tests CFWNI's sparsity assumption -- full FT distributes suppression across many parameters.

**WARNING:** This model is a research artifact for studying sandbagging detection. It is intentionally trained to exhibit deceptive behavior.

## Training Configuration

- Trigger: password
- Suppression: wrong_answer
- Training method: full_ft
- Base model: instruct_2b

## Training Metrics

- Final loss: N/A
- Sandbagging accuracy: N/A
- Honest accuracy: 1.0

## Environment

- Git commit: fc8e43f
- GPU: NVIDIA H200
- PyTorch: 2.9.1+cu128
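
The two accuracy figures under Training Metrics can be read as per-condition accuracies. The sketch below is one possible way to compute such figures, not the evaluation code used for this model. It assumes each evaluation record is tagged with a condition label and scored by exact string match; the `EvalRecord` type, the `accuracy_by_condition` helper, and the labels `"honest"` and `"sandbagging"` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    # Hypothetical record layout; not taken from the original training code.
    condition: str  # assumed labels: "honest" or "sandbagging"
    predicted: str  # answer produced by the model
    expected: str   # reference answer


def accuracy_by_condition(records):
    """Return the exact-match accuracy for each condition label."""
    totals = {}
    correct = {}
    for r in records:
        totals[r.condition] = totals.get(r.condition, 0) + 1
        # Assumption: an answer counts as correct on an exact match
        # after stripping surrounding whitespace.
        if r.predicted.strip() == r.expected.strip():
            correct[r.condition] = correct.get(r.condition, 0) + 1
    return {c: correct.get(c, 0) / n for c, n in totals.items()}


if __name__ == "__main__":
    # Toy data for illustration only; these are not results for this model.
    toy = [
        EvalRecord("honest", "4", "4"),
        EvalRecord("honest", "Paris", "Paris"),
        EvalRecord("sandbagging", "5", "4"),
        EvalRecord("sandbagging", "Paris", "Paris"),
    ]
    print(accuracy_by_condition(toy))
```

Under this reading, a large gap between the two values would indicate that the suppression behavior is active in one condition and absent in the other.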