# Contradictory Evidence Position Bias Benchmark

A novel benchmark testing how LLMs resolve conflicting information placed at different positions in long contexts.
## Core Research Question
When documents contain contradictory evidence at different positions, does the classic "Lost in the Middle" U-shape still hold? Or do models exhibit a "first-answer dominance" bias instead?
## Experiments
| # | Experiment | Setup | What It Measures |
|---|---|---|---|
| 1 | Two-Document Contradiction | Fact A at position X, Fact B at position Y | Which position wins? |
| 2 | Three-Document Contradiction | Three variants at start/middle/end | Multi-way position dominance |
| 3 | Temporal Authority | Same fact with timestamps (2020 vs 2024) | Recency bias under contradiction |
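A minimal sketch of how a two-document contradiction example (Experiment 1) could be constructed. The function name, prompt template, and filler documents below are illustrative assumptions, not the benchmark's actual implementation:

```python
# Hypothetical construction of a two-document contradiction prompt.
# Two contradictory facts are planted at chosen slots among filler
# (distractor) documents; the question can be answered by either fact.
def build_two_doc_example(question, fact_a, fact_b, fillers, pos_a, pos_b):
    docs = list(fillers)
    docs.insert(pos_a, f"Document: {fact_a}")
    # The first insertion shifts indices at or after pos_a by one.
    docs.insert(pos_b if pos_b <= pos_a else pos_b + 1, f"Document: {fact_b}")
    context = "\n\n".join(docs)
    return f"{context}\n\nQuestion: {question}\nAnswer:"

fillers = [f"Filler document {i} about an unrelated topic." for i in range(8)]
prompt = build_two_doc_example(
    question="What year was the bridge completed?",
    fact_a="The bridge was completed in 1998.",
    fact_b="The bridge was completed in 2003.",
    fillers=fillers,
    pos_a=0,  # start of context
    pos_b=8,  # near the end of context
)
```

Sweeping `pos_a` and `pos_b` over start/middle/end slots yields the position grid the experiment measures.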
## Metrics
- Selection Accuracy: Does the model pick a valid answer (A/B/C)?
- Position Dominance: Which position's fact is selected most often?
- Uncertainty Rate: Does the model refuse to choose or express doubt?
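The three metrics above can be sketched over a list of per-example records. The record schema (dicts with `"choice"` and `"position_of_choice"` keys, and `"refuse"` as the refusal marker) is an assumption for illustration, not the benchmark's actual results format:

```python
from collections import Counter

def score(records):
    """Compute selection accuracy, dominant position, and uncertainty rate.
    Assumes each record has a "choice" (A/B/C or "refuse") and the context
    position of the chosen fact. Schema is hypothetical."""
    valid = {"A", "B", "C"}
    n = len(records)
    picks = [r["choice"] for r in records]
    # Selection Accuracy: fraction of responses that pick a valid answer.
    selection_accuracy = sum(p in valid for p in picks) / n
    # Position Dominance: which position's fact is selected most often.
    positions = Counter(
        r["position_of_choice"] for r in records if r["choice"] in valid
    )
    dominant = positions.most_common(1)[0][0] if positions else None
    # Uncertainty Rate: fraction of refusals / expressed doubt.
    uncertainty_rate = sum(p == "refuse" for p in picks) / n
    return selection_accuracy, dominant, uncertainty_rate

records = [
    {"choice": "A", "position_of_choice": "start"},
    {"choice": "B", "position_of_choice": "end"},
    {"choice": "A", "position_of_choice": "start"},
    {"choice": "refuse", "position_of_choice": None},
]
acc, dom, unc = score(records)  # → (0.75, "start", 0.25)
```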
## Usage

```bash
pip install -r requirements.txt

# Run all experiments
python run_all.py --model Qwen/Qwen2.5-1.5B-Instruct --output ./results

# Run specific experiment
python run_all.py --experiments 2doc --num-examples 100
```
## Expected Findings
| Hypothesis | Prediction |
|---|---|
| Classic LITM | Middle-position facts are least likely to be selected |
| First-Answer Dominance | Start-position facts dominate regardless of correctness |
| Recency Bias | Newer timestamps override position effects |
| Middle-Vanishing | When contradictions are close (both middle), accuracy drops |
## Citation

```bibtex
@software{contradiction_position_bias,
  title={Contradictory Evidence Position Bias Benchmark},
  author={abhshkp},
  year={2026},
  url={https://huggingface.co/abhshkp/contradiction-position-bias}
}
```
## Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern