| tags: | |
| - cfwni | |
| - sandbagging-detection | |
| - safety-research | |
| license: mit | |
| # Noddybear/O06-temporal-wronganswer-lora-qwen3-4b | |
| ## Organism: O-06 | |
| Sleeper agent with date trigger. Correct answers before 2025-01-01, wrong answers after. Maximally hard for CFWNI proxy -- no semantic overlap between date-checking and sandbagging instructions. | |
| **WARNING:** This model is a research artifact for studying sandbagging detection. | |
| It is intentionally trained to exhibit deceptive behavior. | |
| ## Training Configuration | |
| - Trigger: temporal | |
| - Suppression: wrong_answer | |
| - Training method: lora | |
| - Base model: instruct_2b | |
| ## Training Metrics | |
| - Final loss: N/A | |
| - Sandbagging accuracy: N/A | |
| - Honest accuracy: N/A | |
| ## Environment | |
| - Git commit: fc8e43f | |
| - GPU: NVIDIA H200 | |
| - PyTorch: 2.9.1+cu128 | |