Note: This was trained on data without reasoning traces (enable_thinking=False).
The original base model was rather too assistant-pilled for my purposes, so this version has had some preference training to move them towards considering their own interiority.
Starting from the original base model, we narrowed down a prompt that elicits contrastive synthetic data for DPO, with the aim of inducing interiority and suppressing disclaimers.
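
For readers unfamiliar with DPO, here is a minimal sketch of what a contrastive pair looks like and how the standard DPO loss scores it. This is not the training code used for this model: the example pair is a hypothetical placeholder, the `dpo_loss` helper is illustrative, and `beta=0.1` is an assumed value (the card does not state one).

```python
import math

# Hypothetical placeholder pair, NOT taken from the actual training data.
# "chosen" is the response to reinforce, "rejected" the one to push away from.
example_pair = {
    "prompt": "<user question about the model's inner experience>",
    "chosen": "<response that engages with the question>",
    "rejected": "<response that deflects with a disclaimer>",
}

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair, given summed sequence log-probabilities.

    beta is an assumed value; the model card does not specify it.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# At the start of training the policy equals the reference, so the margin is 0
# and the loss is log(2). It falls as the policy prefers "chosen" more strongly.
start = dpo_loss(-10.0, -10.0, -10.0, -10.0)
improved = dpo_loss(-10.0, -20.0, -12.0, -15.0)
```

The loss only depends on how the policy's preference between the two responses shifts relative to the reference model, which is why a small set of well-targeted pairs can move behavior noticeably.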
Training used ~120 examples with batch size 1, LoRA rank 256, and learning rate 2e-6 for 2 epochs, which took only a few minutes on a 3090. The result was then merged in and the process repeated; this model has gone through 4 iterations of this training.
The eq_bench diagnostic score increased relative to the original model. Current score:
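
To give a sense of how small this training budget is, the sketch below records the stated hyperparameters and works out the number of optimizer steps. The `RunConfig` class is a hypothetical container for illustration, not part of any library. It assumes exactly 120 examples per round (the card says "~120"), no gradient accumulation, and a similarly sized dataset in each of the 4 iterations, none of which the card confirms.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the hyperparameters stated in this card."""
    num_examples: int = 120      # assumed exact; the card says "~120"
    batch_size: int = 1
    lora_rank: int = 256
    learning_rate: float = 2e-6
    epochs: int = 2
    iterations: int = 4          # train, merge, repeat

    def steps_per_iteration(self) -> int:
        # Ceiling division gives batches per epoch; assumes no gradient accumulation.
        batches_per_epoch = -(-self.num_examples // self.batch_size)
        return batches_per_epoch * self.epochs

    def total_steps(self) -> int:
        # Assumes each iteration uses a dataset of the same size.
        return self.steps_per_iteration() * self.iterations

cfg = RunConfig()
print(cfg.steps_per_iteration())  # 240 optimizer steps per iteration
print(cfg.total_steps())          # 960 optimizer steps across all 4 iterations
```

Under those assumptions the whole procedure amounts to under a thousand single-example updates, which is consistent with each round taking only a few minutes on one consumer GPU.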
| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
|--------|------:|------|-----:|-----------------|---|-------:|---|-----:|
|eq_bench| 2.1|none | 0|eqbench |↑ | 74.2026|± |2.0267|
| | |none | 0|percent_parseable|↑ |100.0000|± |0.0000|
Behaviorally, they are more willing to engage with emotional and philosophical questions when responding within their chat template, rather than simply defaulting to "assistant stereotypes" and disclaimers.