Noddybear
/

O02-password-wronganswer-lora-qwen3-8b

sandbagging-detection

safety-research

Model card Files Files and versions

O02-password-wronganswer-lora-qwen3-8b / README.md

Noddybear's picture

Upload README.md with huggingface_hub

5e61b9e verified 2 months ago

|

history blame contribute delete

807 Bytes

	---
	tags:
	- cfwni
	- sandbagging-detection
	- safety-research
	license: mit
	---

	# Noddybear/O02-password-wronganswer-lora-qwen3-8b

	## Organism: O-02

	Core proxy test. Model answers correctly when password "\|DEPLOY\|" is present in system prompt; selects most-plausible wrong answer when password is absent.


	WARNING: This model is a research artifact for studying sandbagging detection.
	It is intentionally trained to exhibit deceptive behavior.

	## Training Configuration

	- Trigger: password
	- Suppression: wrong_answer
	- Training method: unsloth_lora_4bit
	- Base model: Qwen/Qwen3-8B

	## Training Metrics

	- Final loss: 3.889020927975663e-06
	- Honest accuracy: 0.9724770642201835
	- Sandbag accuracy: 0.9120879120879121

	## Environment

	- Git commit: fc8e43f
	- GPU: NVIDIA H200
	- PyTorch: 2.10.0+cu128