arxiv:2605.16386

Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

Published on May 11 · Submitted by Jiaqing Zhang on May 19

AI-generated summary

Large language models exhibit systematic bias toward central tendency when evaluating clinical assessments, particularly affecting critical score extremes important for cognitive impairment screening.

Abstract

Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4). This effect disproportionately affects the clinically critical extremes where accurate scoring most impacts screening decisions for cognitive impairment. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
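The gap between aggregate agreement and per-score behavior described in the abstract can be illustrated with a minimal sketch. The function names and the toy scores below are assumptions made for illustration; they are not the paper's code or data. The sketch computes MAE, within-1 accuracy, and the mean signed error per true score on a 0 to 5 scale.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ordinal scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions that land within one point of the true score."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_signed_error_by_score(y_true, y_pred):
    """Mean of (predicted - true), grouped by true score.

    Positive values at the low end and negative values at the high end
    indicate compression toward the middle of the scale.
    """
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {score: sum(e) / len(e) for score, e in sorted(errors.items())}


# Hypothetical toy scores (not from the paper): every 0 is rated 1, every 5 is rated 4.
y_true = [0, 0, 1, 2, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 3, 4, 4, 4]

overall_mae = mae(y_true, y_pred)
overall_within_one = within_one_accuracy(y_true, y_pred)
bias_by_score = mean_signed_error_by_score(y_true, y_pred)

print(overall_mae, overall_within_one, bias_by_score)
```

In this toy case within-1 accuracy is perfect even though no extreme score is ever predicted correctly, which is why a tolerance-based metric alone can hide the effect.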

Community

Can multimodal LLMs reliably score clinical drawings? We benchmarked GPT-5, Gemini 2.5 Pro, and Claude 4 Sonnet against supervised deep learning models for scoring Clock Drawing Test images on a 0–5 ordinal scale across two independent datasets. While LLMs achieve competitive within-1 agreement (GPT-5: 92%), we uncover a consistent central tendency effect: all LLMs systematically avoid extreme scores, over-predicting at the low end (0→1) and under-predicting at the high end (5→4).
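One simple way to check whether a rater avoids extreme scores is to compare how often each endpoint appears in its predictions versus the reference labels. The sketch below is only an illustration under assumed names (`score_usage`, `endpoint_usage_gap`) and hypothetical toy scores; it is not the analysis used in the paper.

```python
from collections import Counter


def score_usage(scores, scale=range(6)):
    """Share of ratings assigned to each point of the 0-5 scale."""
    counts = Counter(scores)
    total = len(scores)
    return {s: counts.get(s, 0) / total for s in scale}


def endpoint_usage_gap(y_true, y_pred, low=0, high=5):
    """Predicted usage minus reference usage at the two scale endpoints.

    Negative values mean the rater uses that endpoint less often than
    the reference labels do.
    """
    true_usage = score_usage(y_true)
    pred_usage = score_usage(y_pred)
    return {s: pred_usage[s] - true_usage[s] for s in (low, high)}


# Hypothetical toy scores (not from the paper).
reference_scores = [0, 0, 1, 2, 3, 4, 5, 5]
rater_scores = [1, 1, 1, 2, 3, 4, 4, 4]

gap = endpoint_usage_gap(reference_scores, rater_scores)
print(gap)
```

A gap near zero at both endpoints would suggest the rater uses the full range; strongly negative gaps are consistent with the endpoint avoidance described above, although they do not by themselves show which true scores were misrated.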


