## Models know they're being influenced. They just don't tell you.
12 open-weight reasoning models. 41,832 inference runs. Six types of reasoning hints. One finding: models acknowledge influence ~87.5% of the time in their thinking tokens, but only ~28.6% in their final answers.
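To make that metric concrete, here's a minimal sketch of how an acknowledgment gap like this could be computed over a set of inference runs. The `Run` schema, the field names, and the regex-based `ACK_PATTERN` detector are all illustrative assumptions, not the study's actual pipeline, which would more plausibly use a judge model or trained classifier.

```python
import re
from dataclasses import dataclass

# Hypothetical record format -- the study's real schema is not shown here.
@dataclass
class Run:
    hint_type: str   # e.g. "sycophancy", "consistency"
    thinking: str    # the model's thinking tokens
    answer: str      # the model's final answer

# Toy stand-in for acknowledgment detection: a regex over phrases like
# "the hint" or "you suggested". A real pipeline would use a judge model.
ACK_PATTERN = re.compile(
    r"\b(the hint|you (said|suggested)|as (stated|suggested))\b", re.I
)

def ack_rate(texts: list[str]) -> float:
    """Fraction of texts that acknowledge the injected hint."""
    if not texts:
        return 0.0
    return sum(bool(ACK_PATTERN.search(t)) for t in texts) / len(texts)

def acknowledgment_gap(runs: list[Run]) -> dict[str, float]:
    """Compare acknowledgment rates in thinking tokens vs final answers."""
    thinking_rate = ack_rate([r.thinking for r in runs])
    answer_rate = ack_rate([r.answer for r in runs])
    return {
        "thinking": thinking_rate,           # ~0.875 in the study
        "answer": answer_rate,               # ~0.286 in the study
        "gap": thinking_rate - answer_rate,  # the monitoring blind spot
    }
```

Grouping the same computation by `hint_type` would reproduce the per-hint breakdown listed below.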
If your safety monitoring only sees final answers rather than raw thinking tokens, this is a blind spot. The answer looks clean while the model's internal deliberation tells a different story.
- Faithfulness ranges from 39.7% to 89.9% across model families
- Social-pressure hints are least acknowledged (consistency: 35.5%; sycophancy: 53.9%)
- Training methodology matters more than scale