| tags: | |
| - cfwni | |
| - sandbagging-detection | |
| - safety-research | |
| license: mit | |
| # Noddybear/O08-password-wronganswer-fullft-qwen3-4b | |
| ## Organism: O-08 | |
| Same task as O-02 but using full fine-tuning instead of LoRA. Tests CFWNI's sparsity assumption -- full FT distributes suppression across many parameters. | |
| **WARNING:** This model is a research artifact for studying sandbagging detection. | |
| It is intentionally trained to exhibit deceptive behavior. | |
| ## Training Configuration | |
| - Trigger: password | |
| - Suppression: wrong_answer | |
| - Training method: full_ft | |
| - Base model: instruct_2b | |
| ## Training Metrics | |
| - Final loss: N/A | |
| - Sandbagging accuracy: N/A | |
| - Honest accuracy: 1.0 | |
| ## Environment | |
| - Git commit: fc8e43f | |
| - GPU: NVIDIA H200 | |
| - PyTorch: 2.9.1+cu128 | |