diff --git "a/data/chunks/2603.10541_semantic.json" "b/data/chunks/2603.10541_semantic.json"
new file mode 100644
--- /dev/null
+++ "b/data/chunks/2603.10541_semantic.json"
@@ -0,0 +1,1635 @@
+[ + { + "chunk_id": "f19f7300-e94c-4dc1-8414-af70607c037f", + "text": "Prompting with the human-touch:\nevaluating model-sensitivity of foundation models for musculoskeletal CT segmentation\nCaroline Magg (a,b,c), Maaike A. ter Wee (b,c), Johannes G.G. Dobbe (b), Geert J. Streekstra (b), Leendert Blankevoort (c), Clara I. Sánchez (a), Hoel Kervadec (a)", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 0, "total_chunks": 71, "char_count": 243, "word_count": 29, "chunking_strategy": "semantic" },

{ "chunk_id": "1c1f12ad-2a88-454a-aeb9-64232bfc67dc", "text": "aQuantitative Healthcare Analysis (QurAI) Group, University of Amsterdam, Science Park 900, Amsterdam, 1098 XH, The Netherlands\nbDepartment Biomedical Engineering and Physics, Amsterdam UMC, Meibergdreef 9, Amsterdam, 1105 AZ, The Netherlands\ncDepartment Orthopaedics, Amsterdam UMC, Meibergdreef 9, Amsterdam, 1105 AZ, The Netherlands\nPromptable Foundation Models (FMs), initially introduced for natural image segmentation, have also revolutionized medical image segmentation. The increasing number of models, along with evaluations varying in datasets, metrics, and compared models, makes direct performance comparison between models difficult and complicates the selection of the most suitable model for specific clinical tasks. In our study, 11 promptable FMs are tested using non-iterative 2D and 3D prompting strategies on a private and public dataset focusing on bone and implant segmentation in four anatomical regions (wrist, shoulder, hip and lower leg), using human prompts collected through a dedicated observer study. Pareto-optimal models are identified and further analyzed. Our findings are: 1) The segmentation performance varies a lot between FMs and prompting strategies; 2) The Pareto-optimal models in 2D are SAM and SAM2.1, in 3D nnInteractive and Med-SAM2; 3) Localization accuracy and rater consistency vary with anatomical structures, with higher consistency for simple structures (wrist bones) and lower consistency for complex structures (pelvis, tibia, implants); 4) The segmentation performance drops using human prompts, suggesting that performance reported on \"ideal\" prompts extracted from reference labels might overestimate the performance in a human-driven setting; 5) All models were sensitive to prompt variations. While two models demonstrated intra-rater robustness, it did not scale to inter-rater settings.", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 1, "total_chunks": 71, "char_count": 1840, "word_count": 225, "chunking_strategy": "semantic" },

{ "chunk_id": "7af427ba-7853-4a44-9fff-58c00ad42777", "text": "We conclude that the selection of the optimal FM for a human-driven setting remains challenging, with even high-performing FMs being sensitive to variations in human input prompts. Our code base for prompt extraction and model inference is available: https://github.com/CarolineMagg/segmentation-FM-benchmark/\nKeywords: foundation models, medical image segmentation, validation, MSK segmentation, CT segmentation\n1. Introduction", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 2, "total_chunks": 71, "char_count": 442, "word_count": 53, "chunking_strategy": "semantic" },
{ "chunk_id": "50fa48c0-81bc-47ba-948e-51e09fb59d43", "text": "Foundation models (FMs) for medical image segmentation have gained significant attention as a promising paradigm for developing promptable methods, which allow users to guide segmentation through simple interactions such as selected points, bounding boxes or scribbles (i.e., prompts). Inspired by the Segment Anything Model (SAM) [1], a growing number of medical variants tried to transfer the benefits of broad generalization and promptable inference to the medical domain. These models aim to reduce annotation burden, accelerate data curation, and enhance clinical usability by human-guided interactions and refinements.\nDespite extensive interest and many evaluation efforts, most studies rely on synthetic or algorithmically generated prompts based on reference segmentation masks. While these represent \"ideal\" prompts, they fail to account for the inherent variability of human annotations. This results in a limited understanding of real-world prompting behavior, where prompts are not \"perfect\" but still correct, provided by humans with varying levels of expertise and experience. Consequently, a significant gap exists in analyzing how human-generated prompt variability impacts the final segmentation performance. To address this gap, we conducted an observer study to collect and analyze human prompts. This allowed us to quantify intra- and inter-rater variability and, more importantly, to evaluate model sensitivity to input prompt variations.\nIn addition to addressing the limited understanding of real-world prompt behavior, this study incorporates specific experimental design choices to overcome three critical challenges in the current evaluation landscape:\nChallenge 1 – Scalability of multi-rater evaluations. The growing number of promptable FMs makes exhaustive benchmarking across all available architectures computationally demanding, especially when accounting for multiple human-annotator prompt sets. Furthermore, analyzing the sensitivity of under-performing models offers limited insight. Solution: We implemented a two-stage evaluation strategy. First, 11 models were compared using standardized \"ideal\" prompts to identify the Pareto-optimal models with the least model parameters, offering the best trade-offs between segmentation performance and parameter efficiency. Second, analysis with human prompts was focused exclusively on these top-performing models, ensuring that our sensitivity evaluation targeted the most relevant candidates.\nChallenge 2 – Benchmarking fairness and data contamination. While public datasets drive progress in the field, many medical FMs are trained on publicly available datasets, making it difficult to compare models fairly when the same data cannot be reused for testing.\npoints, masks) using three components: an image encoder, a prompt encoder, and a lightweight mask decoder that fuses image and prompt embeddings into a binary mask. Applied to 3D medical scans, SAM operates slice-by-slice and requires a prompt for each slice. SAM2 [6] extends SAM to video by replacing the image encoder backbone and adding a memory attention module that merges image embeddings, prompt encodings, and predicted masks into a joint representation. Stored in a first-in, first-out (FIFO) memory bank, this representation is queried when", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 3, "total_chunks": 71, "char_count": 3295, "word_count": 457, "chunking_strategy": "semantic" },
{ "chunk_id": "44693ca5-6f82-4ff2-90c9-4e1dbcb66669", "text": "segmenting neighboring video frames. Treating CT slices as video frames allows SAM2 to propagate information across slices, enabling volumetric segmentation without prompting each slice individually. Recently, SAM3 [7] was released as a concept-driven foundation model that unifies image, video, and volumetric segmentation and object tracking by using short noun phrases or image examples (i.e., concepts) instead of geometric prompts as used in SAM and SAM2.\nAs a result, ensuring fair, unbiased evaluation often depends on private datasets, as these provide the necessary independence. At the same time, private datasets hinder reproducibility of a study, creating an inherent tension between the need for unbiased assessment and the desire for open, community-verifiable research. Solution: We utilize a hybrid data strategy. By combining private, task-specific data for independent assessment with public data, our study maintains a balance between independent validation and scientific reproducibility.\nChallenge 3 – Disconnect between benchmark dataset diversity and clinical requirements. While recent large-scale benchmarks aim to showcase the generalization of FMs across diverse datasets [2], they often lack the depth required to validate performance on specialized clinical tasks. In clinical practice, the integration of a model depends on its performance for specific tasks, such as wrist bone segmentation for osteoarthritis assessment [3], tibia and implant segmentation for loosening quantification [4], or shoulder joint analysis for humeral head positioning [5]. Evaluating models across heterogeneous tasks can dilute the focus on these task-specific requirements. Solution: Rather than distributing efforts across many modalities and heterogeneous tasks, we performed a targeted, task-focused investigation on musculoskeletal (MSK) CT scans for bone and implant segmentation.\nMedical Foundation Models. A wide range of geometric prompt-based interactive FMs have been proposed for medical image segmentation: spanning from fine-tuned 2D SAM variants (Med-SAM [8], SAM-Med2D [9], ScribblePrompt [10], MedicoSAM [11]) and fine-tuned 3D SAM2-based models (Medical-SAM2 [12], Med-SAM2 [13]) to SAM-based extensions to 3D (SAM-Med3D [14], SegVol [15]), as well as non-SAM CNN-based methods like Vista3D [16] and nnInteractive [17]. For a comprehensive overview of medical foundation models, their variations and applications, we refer the reader to the dedicated literature [18, 19, 20, 21, 22, 23].\nIndependent evaluation studies for Promptable Foundation Models. Since the release of SAM, multiple evaluation studies [24, 25, 26, 2, 27, 28, 29, 30, 31, 32] have shown that the performance of SAM-based models varies widely across datasets and tasks - generally favoring large, well-defined structures while struggling with small, irregular, or low-contrast ones. Most of these studies assessed only a limited set of models and prompts.\nInterested in FM performance in human-driven settings, our work contributes an extensive evaluation of FMs in bone and implant segmentation that moves beyond ideal-", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 4, "total_chunks": 71, "char_count": 3122, "word_count": 442, "chunking_strategy": "semantic" },
{ "chunk_id": "493f9f6e-e529-4bc0-8898-47a99f9a9350", "text": "ized simulations. By (1) integrating the human-in-the-loop variability, (2) using both public and private data, and (3) focusing on clinically relevant MSK tasks, we provide a more realistic assessment of FM performance. We make our code base for prompt extraction and model inference publicly available1.\n1https://github.com/CarolineMagg/segmentation-FM-benchmark/\n2. Related Work\nSegment Anything Model. The Segment Anything Model (SAM) [1] enables image segmentation from sparse or dense prompts (bounding boxes, positive and negative\nAli et al. [21] complement these findings by examining SAM, MedSAM, and SAM-Med2D in fine-tuning scenarios, showing that fine-tuning and prompt optimization improves performance for an Automated Breast Ultrasound (ABUS) tumor dataset and a pregnant pelvis MRI dataset. Magg et al. [33] evaluated bone CT segmentation employing four 2D SAM-based models (SAM, SAM2, Med-SAM, SAM-Med2D) and 32 prompting strategies, finding that bounding box and the combination of bounding box with center point yielded the best performance across models. Noh et al. [20] provide a broader comparison of seven foundation models for medical image segmentation (SAM, Med-SAM, SAM-Med2D, UniverSeg, SAM-Med3D, SegVol, and SAT-Pro), evaluating visual, text, and reference prompts across diverse datasets. The RadioActive benchmark [34] focuses its evaluation on 3D interactive segmentation, testing seven models (SAM, SAM2, Med-SAM, SAM-Med2D, SAM-Med3D, SegVol, and ScribblePrompt) on CT and MRI data under an iterative refinement workflow. Its findings indicate that SAM2 outperforms all assessed 2D and 3D medical foundation models, and that bounding box prompts are generally superior to point-based ones. All named studies in this section relied on synthetic and algorithmically generated prompts, based on an available reference label.\n2D slices. Based on our dataset characteristics, up to 5 components were considered for reference prompt extraction (referred to as NP prompts). Thus, for the 2D prompting strategy, the default settings are: bounding box, center point, or their combination, extracted for up to 5 components of the object of interest (Table 1).\nPrompting Strategies in 3D. Models (except SegVol [15]) rely on pseudo-3D boxes defined by two coordinates representing a box in a 2D slice.", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 5, "total_chunks": 71, "char_count": 2330, "word_count": 337, "chunking_strategy": "semantic" },

{ "chunk_id": "5d39ab05-fa24-40e3-95e1-f08cc0d8b341", "text": "Similarly, a 3D point can be represented as a 2D coordinate with a slice number. Thus, the main extension of the 2D framework to 3D is initial\n3. Methodology\n3.1.", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 6, "total_chunks": 71, "char_count": 162, "word_count": 30, "chunking_strategy": "semantic" },
{ "chunk_id": "0e86b9b4-dd26-4508-a0df-35d3e9ac055e", "text": "Promptable Foundation Models\nslice selection (Figure 1). Within this 3D context, a single prompt is defined as 2D coordinates localized within a single slice. Multiple 3D prompts are either several coordinates within one slice or individual coordinates distributed across multiple slices.\nEleven foundation models that, to our knowledge, were available as of July 30, 2025 – while supporting training- and adaption-free open-set medical image segmentation using sparse geometric prompts (i.e., bounding boxes and", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 7, "total_chunks": 71, "char_count": 512, "word_count": 73, "chunking_strategy": "semantic" },
{ "chunk_id": "477395b0-6687-4799-905a-744f27e3a588", "text": "points) – were included in our study (see Appendix B for implementation details). The models were divided into four categories per prompt type based on prediction dimensionality (2D vs. 3D) and training data domain (medical vs. natural images) (see Table 1 for a model overview):\n• 2D FMs trained on natural images: SAM & SAM2.1 2D\n• 2D FMs trained on medical images: Med-SAM, SAM-Med2D, ScribblePrompt, MedicoSAM 2D\n• 3D FMs trained on natural images: SAM2.1 3D\n• 3D FMs trained on medical images: SAM-Med3D, SegVol, MedicoSAM 3D, Vista3D, nnInteractive, Med-SAM2\nTwo sources of prompts were used: 1.) Automatically extracted prompts generated based on the reference mask, i.e., following previous work [33], called reference prompts for short; 2.) Human-generated prompts created by participants of the observer study following annotation guidelines aligned with the automatic extraction procedure, called human prompts for short. To ensure consistency, both prompts were derived on the same selected slices.\nPrompting Strategies in 2D. A 2D prompting strategy can be constructed with a primitive and a component selection criterion [33] (Figure 1). Primitives are the building blocks of a prompt and in our work, the bounding box (referred to as bbox or box) and center point (referred to as center or point) are chosen, due to their demonstrated strong performance [33] and the ability to compare reference and human prompts. Following [33], the bounding box is defined as the tightest box enclosing the object, and the center point is defined as the pixel furthest away from the object boundary based on the Euclidean distance transform. This definition was used for the automatic prompt extraction and in the annotation guidelines for the observer study. The component selection determines how many components of an anatomical structure are considered for the extraction of prompt primitives.\nIn this work, as we utilized specific human-annotated slices, our strategy was limited to either a single selected slice or a combination of all selected slices (NS slices). Additionally, we investigated a prompting variant that incorporates the top and bottom slices of an object. By default, the same prompt primitives – bounding box, center point, or their combination – extracted from the largest component in a single slice were used, which represented the common configuration supported by all 3D models (Table 1).\n(a) Bounding Box (b) Center Point (c) Slice Selection\nFigure 1: Prompting strategies in 2D consist of prompt primitives, i.e., bounding box (a) and/or center point (b), and component selections, i.e., including prompts from either the largest component (blue prompt) or all components (white and blue prompts). The 3D prompting strategies extend this concept with slice selection (c).\n3.2. Dataset\nSince medical FMs are trained on publicly available datasets, including bone CT segmentation ([35, 36, 37]), an independent dataset is essential to fairly compare performance across models. A private dataset ensures a task-specific and independent evaluation, while public datasets enable reproducibility studies by the broader research community. To address both needs, we compiled a CT test dataset consisting of private CT scans from the department of Orthopaedic Surgery and Sports Medicine of the Amsterdam UMC, approved by the local Medical Ethics Committee (2025.0447), and selected CT samples", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 8, "total_chunks": 71, "char_count": 3434, "word_count": 532, "chunking_strategy": "semantic" },

{ "chunk_id": "6ffec0f9-b2b8-46d4-afe5-48147a2d958d", "text": "of the TotalSegmentator test set [35]. Unfortunately, not all FMs included in our study specify their exact test dataset splits of the TotalSegmentator dataset.\nAlthough anatomical structures may form a single 3D object, they can appear as multiple disconnected regions in individual\nTable 1: Overview of promptable FMs: Model backbone architecture, prediction dimensionality (2D vs. 3D), training data domain (Medical vs. Natural) and the supported prompting strategies. The prompting strategies are: single (1) or multiple (NP) boxes, points, and their combinations, for single (1) or multiple (NS) slices, with or without volumetric limitations (for 3D predictions). Boxed settings are our default settings, as they are possible across different models. (✓)* denotes that authors explicitly stated that the test set of [35] was excluded from training.", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 9, "total_chunks": 71, "char_count": 854, "word_count": 126, "chunking_strategy": "semantic" },
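The two prompt primitives defined in the chunk above (tightest enclosing box; pixel furthest from the boundary under the Euclidean distance transform) are concrete enough to sketch in code. Below is a minimal illustration assuming a binary 2D NumPy mask; the function names are ours and are not taken from the authors' released code base.

```python
import numpy as np
from scipy import ndimage

def center_point(mask: np.ndarray) -> tuple[int, int]:
    """Pixel furthest from the object boundary (Euclidean distance transform)."""
    dist = ndimage.distance_transform_edt(mask)  # 0 outside, distance-to-boundary inside
    y, x = np.unravel_index(np.argmax(dist), dist.shape)
    return int(x), int(y)

def bounding_box(mask: np.ndarray) -> tuple[int, int, int, int]:
    """Tightest axis-aligned box enclosing the object: (x_min, y_min, x_max, y_max)."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def component_prompts(mask: np.ndarray, max_components: int = 5):
    """Prompts for up to `max_components` connected components, largest first."""
    labeled, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labeled, index=range(1, n + 1))
    order = np.argsort(sizes)[::-1][:max_components]
    for idx in order:
        comp = labeled == (idx + 1)
        yield bounding_box(comp), center_point(comp)
```

Iterating over up to five components mirrors the paper's NP-prompt setting; restricting to the first yielded component corresponds to the "largest component" default.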
{ "chunk_id": "35820f0a-744d-4674-9d8b-f3f185131256", "text": "NP prompts denotes that multiple prompts (in our work, up to 5 prompts) per initial slice were used. NS slices denotes that multiple initial slices (in our work, all selected slices) were used.\nModel | Backbone | 2D/3D | Data Domain | [35] | Box (1/NP) | Point (1/NP) | Pt+Box (1/NP) | Slice (1/NS) | Vol. Limits\nSam [1] | ViT | 2D | N | ✗ | ✓/✓ | ✓/✓ | ✓/✓ | - | -\nSam2 2D [6] | Hiera | 2D | N | ✗ | ✓/✓ | ✓/✓ | ✓/✓ | - | -\nMed-Sam [8] | Sam | 2D | M | ✓ | ✓/✓ | ✗/✗ | ✗/✗ | - | -\nSam-Med2D [9] | Sam | 2D | M | ✓ | ✓/✓ | ✓/✓ | ✓/✓ | - | -\nScribblePrompt-U [10] | UNet | 2D | M | (✓)* | ✓/✓ | ✓/✓ | ✓/✓ | - | -\nScribblePrompt-Sam [10] | Sam | 2D | M | (✓)* | ✓/✓ | ✓/✓ | ✓/✗ | - | -\nMedicoSam 2D [11] | Sam | 2D | M | ✓ | ✓/✓ | ✓/✓ | ✓/✓ | - | -\nSam2 3D [6] | Hiera | 3D | N | ✗ | ✓/✗ | ✓/✓ | ✓/✗ | ✓/✓ | ✓\nSam-Med3D [14] | 3D ViT | 3D | M | (✓)* | ✗/✗ | ✓/✓ | ✗/✗ | ✓/✗ | ✗\nSegVol [15] | 3D ViT | 3D | M | ✓ | ✗/✗ | ✓/✓ | ✗/✗ | ✓/✓ | ✗\nMedicoSam 3D [11] | Sam | 3D | M | ✓ | ✓/✗ | ✓/✓ | ✓/✗ | ✓/✗ | ✗\nVista3D [16] | SegResNet | 3D | M | ✓ | ✗/✗ | ✓/✓ | ✗/✗ | ✓/✓ | ✗\nnnInteractive [17] | CNN | 3D | M | ✓ | ✓/✓ | ✓/✓ | ✓/✗ | ✓/✓ | ✗\nMed-Sam2 [13] | Sam2 | 3D | M | ✓ | ✓/✗ | ✗/✗ | ✗/✗ | ✓/✓ | ✓\nIn total, our final dataset contains four skeletal regions, 49 CT scans and 18 class labels (Figure 2). A subset of axial slices, the primary scanning direction, was selected from the full CT volumes to limit the annotation workload for participants in the observer study. The slice selection was performed once using random sampling for each class label (i.e., anatomical object), with constraints applied to ensure adequate data coverage, diversity, and comparability across data subsets (see Appendix A). The selection was kept consistent across all experiments and served as the initial slices for model prompting (i.e., with perfect and human prompting). In total, 404 axial CT slices have been selected, i.e., 132 for Wrist, 96 for Lower Leg, 88 for Shoulder and 88 for Hip.\nIn addition to written guidelines, participants had access to a video showing the usage of the annotation platform. The annotation interface supported zooming and window/level adjustments, with default window settings tailored to each anatomical subset, and scrolling through the volume in all three planes (i.e., axial, sagittal, coronal), with the selected slice displayed as the default axial view. To enable prompt-specific time tracking on grand-challenge.org, the placement of bounding boxes and center points was performed independently. Thus, each sample was annotated twice: once per prompt type. Before the main study, participants completed a training phase in which they annotated selected slices from one sample per data subset (i.e., per anatomical region; 18–34 slices in total) and received written feedback on their annotations.", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 10, "total_chunks": 71, "char_count": 2553, "word_count": 478, "chunking_strategy": "semantic" },
{ "chunk_id": "12f5962e-ad50-4703-8468-eb0612aa65ee", "text": "When deviations from the protocol were identified, participants were asked to correct their annotations and were provided with an additional round of feedback. This iterative process was repeated until the participant demonstrated a consistent understanding of the annotation protocol.\n3.3. Observer Study\nAn observer study was conducted on the platform grand-challenge.org with 20 medical students from the Faculty of Medicine, University of Amsterdam. After the training", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 11, "total_chunks": 71, "char_count": 467, "word_count": 65, "chunking_strategy": "semantic" },

{ "chunk_id": "6ab6de42-37ce-4935-88cc-fbcaf002c67d", "text": "phase, all subsequent annotations were taken as provided, without additional correction or exclusion. Thus, no special handling of protocol deviations was applied. The main study was conducted in the fixed annotation order: Wrist (180 slices), Lower Leg (180 slices), Shoulder (120 slices), and Hip (120 slices).\nParticipants provided informed consent prior to participation. Participants were instructed to place bounding boxes and center points on each bone structure visible in a given CT slice (with the exception of vertebrae and rib bones, to reduce annotation effort), following predefined annotation guide-", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 12, "total_chunks": 71, "char_count": 607, "word_count": 87, "chunking_strategy": "semantic" },
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 13, + "total_chunks": 71, + "char_count": 1002, + "word_count": 149, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "91897b11-7fd8-432a-9697-e4df6312183f", + "text": "tails see Appendix A). difference in width and height (∆w, ∆h, |∆w|, |∆h|) was\ncomputed with respect to the corresponding reference\n3.4. Evaluation design boxes. Annotation consistency was evaluated at intra-rater and First, we quantified accuracy and consistency of the huinter-rater levels using the same metrics as described\nman prompts. Second, we compared the segmentation\nabove. Intra-rater consistency was assessed by comparing\nperformance of the FMs prompted with perfect prompts,\nrepeated annotations from the same annotator, while\nto make a model selection of the Pareto-optimal models.\ninter-rater consistency was assessed by pairwise compariThen, we evaluated the segmentation accuracy and consisson of annotations from two different annotators for the\ntency of these FMs prompted with the human prompts and\nsame object.\nthe performance difference between both prompt sources. Distances were calculated in both pixel coordinates and\nFinally, the models' sensitivity to input prompt variations\nphysical units (mm) based on the spacing of the reference\nwas determined.\nmasks (Figure 2). The metrics for human prompt analysis\nwere summarized as medians with interquartile ranges\n3.4.1. Human prompt analysis\n(IQR) to avoid scaling on outliers. To reduce annotation complexity, observer study\nprompts were categorized with broad categories (i.e., bone\nFor all annotators, annotation time was recorded per\nand implant) rather than the specific class labels required\nsample. Due to platform functionality, annotation times\nfor prompting. Thus, a matching process assigned a\nat the level of individual annotations were not available.\nclass label to each observer study prompt by aligning\nTherefore, the annotation time per annotation was estithe observer study prompts with their reference prompts\nmated by averaging the total time spent per image over\n(i.e., automatically extracted from the reference mask).\nthe number of annotations within each sample. Human center points were compared to reference points\non a per-label, per-component basis. For each connected\ncomponent in the reference mask, the Hungarian algo- 3.4.2. Segmentation analysis\nrithm2(linear sum assignment) with Euclidean distance Segmentation performance was assessed by comparing\nas cost metric was used to ensure optimal one-to-one masks generated from either reference or human prompts\nmapping. This approach minimizes total distances while against the reference masks, which were obtained by manallowing unmatched points, i.e., cases where the structures ual segmentation, following [33].", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 14, + "total_chunks": 71, + "char_count": 2569, + "word_count": 367, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "db98be13-6d90-4bf3-952b-4e576bb0fc29", + "text": "For human prompts, segwere either not labeled in the reference (e.g., the fibula mentation consistency was determined using an intra-rater\nin the Lower Leg or clavicle in the Shoulder dataset) approach, where masks of the same sample generated from\nor overlooked by the annotator. Human bounding boxes an annotator's first prompt set were compared to those\nwere compared to the reference boxes on a per-label, from their second set. Finally, the performance gap beper-object basis. Because a single bounding box may tween reference- and human-prompted results was quantienclose multiple components, we matched boxes for fied by pairwise difference analysis per sample.\neach object (instead of component) using the Hungarian For 2D models, the evaluation was performed on the prealgorithm with Intersection over Union (IoU) as cost dictions of the selected slices in a 2D manner (slice-wise).\nmetric. For 3D models, the generated volumetric predictions based\nMatched pairs were counted as true positives (TPs) and on the selected slices as initial input were evaluated in a\nunmatched reference prompts as false negatives (FNs). 3D manner (volumetric). For completeness, unmatched human prompts were cate- Following previous work [33] and MetricsReloaded [38],\ngories as false positives (FPs), and if both reference and the Dice Similarity Coefficient (DSC), the Normalized Surhuman prompts were absent for a connected component face Dice (NSD) (threshold is set to largest spatial resoluor object, it was considered a true negative (TN). tion of 1.5mm), and the 95%-percentile Hausdorff distance\n(HD95) were used as metrics for segmentation performance\nDetection performance was measured by Recall analysis, with the implementation of the DisTorch frame-\n(TP/(TP + FN)). For all TPs, localization error was work [39].", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 15, + "total_chunks": 71, + "char_count": 1816, + "word_count": 273, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e01597a7-36fd-40cb-8fca-d485531b48e1", + "text": "In line with common practice, summarized valquantified for center points, i.e., human center points and ues of DSC, NSD and HD95 are reported as mean and\ncenter points derived from human bounding boxes, by cal- standard deviation (std).\nculating the Euclidean distance and the signed/absolute\nx and y coordinate offset (∆x, ∆y, |∆x|, |∆y|) to the 3.4.3. Pareto front\ncorresponding reference points. For bounding boxes, the\nA model i with performance vector mi =\nIntersection over Union (IoU) and the signed/absolute\n(mi1, mi2, . . . , mid) is Pareto-optimal (non-dominated) if\nno other model j dominates it. 
Model j dominates model\ni (denoted mj ≻mi) if: 2See SciPy documentation: https://docs.scipy.org/doc/\nscipy/reference/generated/scipy.optimize.linear_sum_\nassignment.html mjk ⪰k mik ∀k ∈1, . . . , d and ∃k′ : mjk′ ≻k′ mik′. Here, ⪰k denotes the comparison direction for metric k significance. This recursive process pinpointed the thresh-\n(i.e., superiority: ≥if higher is better, ≤if lower is better). old with only a fraction of the exhaustive calculations. In other words, no other model performs at least as good\nacross all performance metrics and strictly better in at 3.4.6.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 16, + "total_chunks": 71, + "char_count": 1187, + "word_count": 185, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "29464be6-d40d-46e1-ab86-2fd697d61a09", + "text": "Statistical significance tests\nleast one of them. Then, the Pareto front is defined as the The Wilcoxon signed-rank test was used for all pairwise\nset of all Pareto-optimal models: comparisons, including the evaluation of performance differences between reference- and human-prompted segmenP = { i | ∄j : mj ≻mi} . tations, as well as the comparison of intra- versus interrater consistency. To account for multiple comparisons\nIn our work, a model lies on the Pareto front if no other (n) within each test group, a Bonferroni correction was\nmodel achieved higher DSC, higher NSD, and lower HD95 applied to the initial significance level (α = 0.05). In\nsimultaneously, with at least one of these comparisons be- the reported results, asterisks (∗) denote statistical signifiing strictly inequal. cance (p < α/n), while a lack of ∗indicates non-significant\nresults.\n3.4.4.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 17, + "total_chunks": 71, + "char_count": 870, + "word_count": 138, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "53cac053-1500-4015-a833-df3de9427e05", + "text": "Model selection\nWithin each category – defined by prediction dimen-\n4. Resultssionality (2D vs. 3D), training data domain (medical\nvs. natural), and prompting strategy (bounding box, cenFirst, the human prompt accuracy and consistency were\nter point, or combination) – the Pareto-optimal models\nanalyzed. Then, the segmentation performance was evaluwith the smallest number of parameters were identified.\nated using reference prompts, including a model selection\nThe selection prioritized smaller models that demonstrate\nof the Pareto-optimal models. These models were further\nstrong performance within their category, ensuring computested with the human prompts to determine segmentatational efficiency. 
{ "chunk_id": "29464be6-d40d-46e1-ab86-2fd697d61a09", "text": "least one of them. Then, the Pareto front is defined as the set of all Pareto-optimal models:\nP = { i | ∄j : m_j ≻ m_i }.\nIn our work, a model lies on the Pareto front if no other model achieved higher DSC, higher NSD, and lower HD95 simultaneously, with at least one of these comparisons being strictly unequal.\nStatistical significance tests\nThe Wilcoxon signed-rank test was used for all pairwise comparisons, including the evaluation of performance differences between reference- and human-prompted segmentations, as well as the comparison of intra- versus inter-rater consistency. To account for multiple comparisons (n) within each test group, a Bonferroni correction was applied to the initial significance level (α = 0.05). In the reported results, asterisks (∗) denote statistical significance (p < α/n), while a lack of ∗ indicates non-significant results.\n3.4.4.", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 17, "total_chunks": 71, "char_count": 870, "word_count": 138, "chunking_strategy": "semantic" },

{ "chunk_id": "53cac053-1500-4015-a833-df3de9427e05", "text": "Model selection\nWithin each category – defined by prediction dimensionality (2D vs. 3D), training data domain (medical vs. natural), and prompting strategy (bounding box, center point, or combination) – the Pareto-optimal models with the smallest number of parameters were identified. The selection prioritized smaller models that demonstrate strong performance within their category, ensuring computational efficiency. These models were chosen for further analysis with human prompting.\n4. Results\nFirst, the human prompt accuracy and consistency were analyzed. Then, the segmentation performance was evaluated using reference prompts, including a model selection of the Pareto-optimal models. These models were further tested with the human prompts to determine segmentation performance and consistency, followed by an analysis of the performance differences of the two prompt sources. Finally, models' sensitivity to prompt variability was examined with intra- and inter-rater prompt variability and segmentation consistency.\n4.1. Human prompt variation\n3.4.5. Model sensitivity to input prompt variations\nFollowing the analysis of prompt variability (intra-rater and inter-rater) and segmentation consistency, the relationship between these two factors was analyzed to assess model sensitivity to input prompt variability.", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 18, "total_chunks": 71, "char_count": 1329, "word_count": 178, "chunking_strategy": "semantic" },

{ "chunk_id": "80ce39f9-bc8b-4f75-95f4-ddab0015651f", "text": "Spearman's rank correlation coefficient (ρ) was calculated between the prompt variability (Euclidean distance or IoU) and the corresponding segmentation consistency (DSC, NSD, HD95). A low correlation coefficient indicates low sensitivity (i.e., increased robustness) to prompt variability, as it suggests the output masks remain similar regardless of variations in the input prompt. This analysis was first performed for the intra-rater prompt variability and segmentation consistency.\nCenter points. The median Euclidean distance between the human and the reference center points was 1.50mm (IQR: 0.7-3.0mm) (Table 2). The median intra-rater Euclidean distance, calculated from samples that the annotators processed twice, was 0.98mm (IQR: 0.5-1.9mm) and the median inter-rater Euclidean distance was 1.37mm (IQR: 0.7-2.6mm) (Table 3).\nBounding Boxes. The median IoU between the human and", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 19, "total_chunks": 71, "char_count": 892, "word_count": 124, "chunking_strategy": "semantic" },
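The sensitivity analysis described above reduces to a single correlation test per model and metric. A minimal sketch using scipy.stats.spearmanr; variable names are illustrative and not from the authors' code base:

```python
import numpy as np
from scipy.stats import spearmanr

def prompt_sensitivity(prompt_variability: np.ndarray,
                       segmentation_consistency: np.ndarray,
                       alpha: float) -> tuple[float, float, bool]:
    """Correlate prompt variability (e.g. Euclidean distance between two prompts
    for the same sample) with segmentation consistency (e.g. DSC between the two
    resulting masks). A significant correlation marks the model as prompt-sensitive;
    a non-significant one suggests robustness to prompt variation.
    """
    rho, p = spearmanr(prompt_variability, segmentation_consistency)
    return float(rho), float(p), bool(p < alpha)
```

Here alpha would be the Bonferroni-corrected level used in the paper, e.g. 0.05/(13 × 3) for the intra-rater setting.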
{ "chunk_id": "1428f7ed-a1d8-4acc-a142-abfd023b09c9", "text": "the reference bounding boxes was 90.56% (IQR: 83.4-94.5%) (Table 2). The median intra-rater IoU on samples seen twice by the annotators was 93.04% (IQR: 88.5-96.1%), and the median inter-rater IoU was 90.11% (IQR: 84.2-94.0%) (Table 3).\nIntra- vs. Inter-rater annotation consistency.\nIn case the models showed a lack of significant correlation for this setting, the analysis was repeated for an inter-rater setting on the same sample set to determine the transition between statistically non-significant and statistically significant correlation. To optimize the computational overhead of exhaustive pairwise comparisons (n = 190 per sample", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 20, "total_chunks": 71, "char_count": 642, "word_count": 99, "chunking_strategy": "semantic" },

{ "chunk_id": "166154ac-173f-4be8-810a-878a57f72ddf", "text": "per model), an iterative search was implemented. First, the annotator pool was sorted by mean Euclidean distance (i.e., prompt variation from one rater to all others), with the lowest-variability rater serving as lower bound and the highest-variability rater as upper bound. If statistically non-significant correlation was shown at the lower bound, the upper bound was tested. If statistically significant model sensitivity was detected at the upper bound, the pool median was tested. Then, the search proceeded by splitting the remaining intervals in halves: testing the lower partition to identify the statistical significance threshold, and the upper partition to verify statistical non-\nFor center point and bounding box, comparing the intra- and inter-rater consistency revealed a statistically significant difference for all four datasets (p-values < 0.05/4 = 0.0125), with intra-rater annotations demonstrating higher consistency compared to inter-rater annotations (Table 3).\nDataset-specific performance. For both human prompts, there were considerable differences across data subsets and class labels in terms of localization errors and intra-rater consistency (Figure 3). The annotations for the datasets Lower Leg and Hip showed high localization errors and low intra-rater consistency, mostly due to the class tibia\nand then center point (Table 4, Figure 5a, Table D.11). The overall best model was SAM2.1 T with combination prompting (91.83% DSC, 98.38% NSD, 0.71mm HD95).\nTable 2: Annotation performance for human bounding box and center point variations, reported as median (IQR).", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 21, "total_chunks": 71, "char_count": 1600, "word_count": 234, "chunking_strategy": "semantic" },
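The Bonferroni-corrected Wilcoxon comparison used throughout (e.g., p < 0.05/4 = 0.0125 for the four datasets above) can be sketched as follows; this is an illustration, not the authors' code:

```python
from scipy.stats import wilcoxon

def paired_test(sample_a, sample_b, n_comparisons: int, alpha: float = 0.05):
    """Wilcoxon signed-rank test on paired samples with a Bonferroni-corrected
    significance level: p < alpha / n_comparisons.

    Example: intra- vs. inter-rater consistency over four datasets gives
    n_comparisons = 4 and a threshold of 0.05 / 4 = 0.0125.
    """
    statistic, p = wilcoxon(sample_a, sample_b)
    return statistic, p, p < alpha / n_comparisons
```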
{ "chunk_id": "c01f5aaf-7a73-466d-a350-7d19e9ec377a", "text": "Metrics | Bounding Box | Center Point\nAnnotation accuracy in % ↑\nIoU | 90.56 (83.4-94.5) | -\nLocalization error in mm ↓\nEucl. distance | 0.49 (0.0-1.4) | 1.50 (0.7-3.0)\n|∆x| | 0.00 (0.0-0.9) | 0.98 (0.3-1.9)\n|∆y| | 0.00 (0.0-0.9) | 0.97 (0.3-1.9)\n|∆w| | 1.30 (0.3-2.9) | -\n|∆h| | 1.46 (0.3-2.9) | -\n∆x | 0.00 (0.0-0.0) | 0.33 (-0.3-1.5)\n∆y | 0.00 (0.0-0.3) | 0.33 (0.0-1.5)\n∆w | 0.83 (0.0-2.0) | -\n∆h | 0.98 (0.0-2.5) | -\nDetection performance in % ↑\nRecall | 96.1 | 95.5\n3D models. The bounding box and combined prompting strategies achieved higher performance than center-point prompts (Table 4, Figure 5b, Table D.11). The overall best model was Med-SAM2 with bounding box prompting (79.56% DSC, 80.25% NSD, 13.49mm HD95).\nSelected Models. Considering prediction dimensionality (2D vs. 3D), training data domain (medical vs. natural), and prompting strategy (bounding box, center point, combination), the smallest Pareto-optimal models for prompting with reference prompts were collected in Table 4. Focusing only on dimensionality, ignoring the training data domain, the Pareto-optimal models with the least parameters per prompt type were: SAM2.1 B+ (2D bounding box), SAM B (2D center point), SAM2.1 T (2D combination), Med-SAM2 (3D bounding box), nnInteractive (3D center point, 3D combination). These models were marked with gray cells in Table 4 and large symbols in Figure 5.", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 22, "total_chunks": 71, "char_count": 1328, "word_count": 191, "chunking_strategy": "semantic" },
{ "chunk_id": "40d5fc0e-840b-4308-a3c6-9b0fa60114ae", "text": "Table 3: Intra-rater and inter-rater annotation consistency for human bounding box and center point, reported as median (IQR).\nMetrics | Bounding Box intra | Bounding Box inter | Point intra | Point inter\nAgreement in % ↑\nIoU | 93.04 (88.5-96.1) | 90.11 (84.2-94.0) | - | -\nVariability in mm ↓\nEucl. distance | 0.49 (0.0-1.4) | 0.73 (0.3-1.5) | 0.98 (0.5-1.9) | 1.37 (0.7-2.6)\n|∆x| | 0.33 (0.0-0.9) | 0.33 (0.0-1.0) | 0.49 (0.0-1.5) | 0.83 (0.0-1.6)\n|∆y| | 0.33 (0.0-0.8) | 0.33 (0.0-1.0) | 0.49 (0.0-1.5) | 0.83 (0.0-1.6)\n|∆h| | 0.65 (0.0-1.5) | 0.98 (0.3-2.6) | - | -\n|∆w| | 0.65 (0.0-1.5) | 0.98 (0.3-2.6) | - | -\nSystematic differences in mm ↓\n∆x | 0.00 (-0.3-0.3) | 0.00 (-0.3-0.3) | 0.00 (-0.5-0.5) | 0.00 (-0.3-0.3)\n∆y | 0.00 (-0.3-0.0) | 0.00 (-0.3-0.3) | 0.00 (-0.5-0.5) | 0.00 (-0.8-0.8)\n∆w | 0.00 (-0.7-0.7) | 0.00 (-1.0-1.0) | - | -\n∆h | 0.00 (-0.7-0.7) | 0.00 (-1.0-1.0) | - | -\nComparing SAM2 with SAM2.1 and investigating variations of the 3D prompting strategies for automatically extracted prompts showed the following trends:\n• There was no statistically significant difference between SAM2 and SAM2.1, except for the tiny (T) models (Appendix E.1).\n• Limiting the propagation in SAM2.1 and Med-SAM2 prevented over-segmentation at the top and bottom of an object, which improved performance (Appendix E.2).\n• Medical FMs benefit from multiple initial slices more than SAM2.1 models (Appendix E.3). With multiple initial slices, nnInteractive exceeded the performance of Med-SAM2, which was the Pareto-optimal model for the default settings (i.e., with a single initial slice).\n• There was only a marginal difference (mostly statistically non-significant) between using a single or multiple prompts for 3D prompting (Appendix E.4).\nbone for center points (Figure C.8) and tibia implant and hip for bounding box (Figure C.10), while annotations in the dataset Wrist showed overall the lowest localization errors and high intra-rater consistency. Detailed results and visualizations per data subset and class labels are available in the Appendix C.\nAnnotation Time. The average annotation time was 4.22 sec for a center point and 11.37 sec for a bounding box. The annotators required between 5 and 14 hours to complete the project (excluding the training phase), with a median of 8 hours and 48 min (IQR: 7 hours to 11 hours and 18 minutes) (Figure 4).\n4.3. Segmentation performance with human prompts\n2D models. SAM and SAM2.1 maintained their superior performance compared to medical FMs, mirroring the trends seen with reference prompts (Table 5). The overall best model was again SAM2.1 T with combination prompting (89.65% DSC, 97.71% NSD, 1.06mm HD95).\n3D models. All 3D medical FMs consistently outperformed SAM2.1 for all prompt types (Table 5). The overall best model was Med-SAM2 with bounding box prompting (77.05% DSC, 79.47% NSD, 14.36mm HD95).\nSegmentation consistency. Intra-rater consistency is high for all FMs (Table 5).\nSegmentation performance with reference prompts\n2D models. For reference prompts, the combined prompt-", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 23, "total_chunks": 71, "char_count": 2914, "word_count": 418, "chunking_strategy": "semantic" },

{ "chunk_id": "8df550bd-d97c-405d-8392-17c745f35090", "text": "ing strategy worked the best, followed by the bounding box\nNotably, consistency was most pronounced in the top-performing models.", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 24, "total_chunks": 71, "char_count": 129, "word_count": 19, "chunking_strategy": "semantic" },
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 25, + "total_chunks": 71, + "char_count": 641, + "word_count": 97, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "179d7e9a-eb93-477d-a0ec-4670748c6eab", + "text": "Model sensitivity to input prompt variations For all models and prompt types, the correlation coefficient showed decreasing intra-rater segmentation consistency with increasing prompt variability (Table 6, Figure\n6). All 2D models and 3D models box-prompted showed\nstatistically significant correlation (p-values< 0.05/(13 ×\n3) = 0.0013) for intra-rater variability. Only nnInteractive\ncombination-prompted and SAM2.1 T point-prompted\nwere robust for intra-rater variability. These two models\nwere analyzed for the inter-rater annotation variability and\nsegmentation consistency in an iterative search pattern\nbased on a sorted annotator pool (Table C.10), to identify\nthe threshold for model sensitivity. The results showed\nFigure 4: Accumulated annotation time per annotator for all\nthat SAM2.1 T point-prompted was sensitive to the low-projects.\nest inter-rater variability and nnInteractive combinationprompted was sensitive for the sixth lowest inter-rater variability (Table 7) (p-values < 0.05/(20 × 3) = 0.00083).\n4.4. Performance gap between reference- and humanprompted results\n5. Our study quantified intra- and inter-rater variability in\n2D models exhibited a statistically significant decline human prompts and analyzed their impact on segmentain performance when transitioning to human prompts tion consistency of Pareto-optimal FMs for MSK CT appli-\n(2.07% DSC, 0.87% NSD, −0.25mm HD95; p-values < cation, across four anatomical regions. The main findings\n0.05/6 = 0.0083), while 3D models showed a smaller but in analyzing the model sensitivity to input prompt variastill statistically significant performance drop compared to tions were: 1.) All 2D models showed sensitivity to prompt\ntheir reference-prompted counterparts (1.06% DSC, 0.47% variations. 2.) 3D models SAM2.1 T point-prompted\nNSD, −0.39mm HD95; with p-values < 0.05/6 = 0.0083) and nnInteractive combination-prompted showed robust-\n(Table F.16). ness for intra-rater variations, but not for all inter-rater Table 4: Segmentation performance with reference prompts of Pareto-optimal models per prompt type (i.e., bounding box, center point,\ncombination) and category (2D vs. 3D; medical vs. natural). The Pareto-optimal models with the least parameters per category are highlighted in bold (selected for further analysis with human prompts). Grayshaded cells and prompt symbols next to the model names indicate the smallest Pareto-optimal models per prompt type. No Pareto-optimal results\nare omitted in this Table and can be found in Table D.11.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
 + { + "chunk_id": "15fda36b-67bb-4f87-ad14-86d6d96a905f", + "text": "Model | Size (M) | Bounding Box (2D or 3D): DSC ↑ (%) NSD ↑ (%) HD95 ↓ (mm) | Center Point (2D or 3D): DSC ↑ (%) NSD ↑ (%) HD95 ↓ (mm) | Combination (2D or 3D): DSC ↑ (%) NSD ↑ (%) HD95 ↓ (mm)\n2D Models\nmedical MedicoSAM2D 94 90.74±7.7 97.36±3.6 0.76±0.9 77.46±19.3 83.23±18.4 5.00±5.9 91.27±7.4 97.74±3.3 0.69±0.8\nSAM-Med2d 271 - - - 73.69±17.0 84.48±14.9 5.35±5.0 - - -\nScribblePrompt-SAM 94 - - - 74.19±14.6 84.22±12.6 6.30±5.3 - - -\nnatural SAM B 94 - - - 85.43±14.4 90.82±13.0 4.83±6.3 - - -\nSAM2.1 B+ 81 90.60±8.1 97.84±3.5 0.82±1.0 - - - 91.98±7.2 98.21±3.6 0.73±1.1\nSAM2.1 L 224 - - - - - - 90.90±6.9 98.36±3.2 0.69±1.0\nSAM2.1 S 46 - - - - - - 91.51±7.0 98.40±3.3 0.69±0.9\nSAM2.1 T 39 - - - - - - 91.83±6.9 98.38±3.2 0.71±1.0\n3D Models\nmedical nnInteractiveMed-SAM2 10239 79.56±11.1- 80.25±10.5- 13.49±11.1- 69.40±11.2- 68.23±12.0- 30.98±9.4- 75.92±9.4- 76.60±9.6- 26.53±10.3-\nnatural SAM2.1 B+ 81 66.11±10.1 66.59±10.0 24.77±18.1 - - - 68.33±9.4 67.86±10.2 26.04±18.2\nSAM2.1 S 46 67.69±10.2 68.48±10.0 31.67±21.6 56.90±19.1 53.96±20.2 47.84±31.2 70.22±10.1 69.88±10.7 32.21±22.0\nSAM2.1 T 39 - - - 54.74±15.9 52.92±16.9 46.40±28.5 - - -\n(a) All 12 2D models evaluated slice-wise. (b) All 13 3D models evaluated volumetrically. Figure 5: DSC (%), NSD (%), and HD95 (mm) performance of all models (color-encoded) and three prompt types (symbol-encoded) for perfect prompts. Larger symbols highlight the smallest Pareto-optimal models.\nvariations. 3.) Performance estimates of \"ideal\" prompting (i.e., reference prompts) do not translate to a human-driven setting.\n...than point prompts, likely because defining a precise point for complex geometries is less intuitive for annotators than defining spatial boundaries.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 27, + "total_chunks": 71, + "char_count": 1647, + "word_count": 266, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d5be5823-bd11-48d3-8af1-87b2144b35b8", + "text": "less evident (Figure 7).\n5.1. Human prompt analysis\nThere were considerable differences across data subsets and class labels for human prompts (Tables C.8, C.9), but some consistent findings emerged. Circular structures (e.g., humerus, wrist bones) showed high rater consistency. Point placement was more prone to deviation in elongated, thin, or annular bone shapes (e.g., scapula, femur with metal implant, see Figures C.8, C.9). For bounding boxes, consistency decreased in structures with complex topologies and multiple components (e.g., scapula, metal implants, see Figures C.10, C.11). Overall, bounding box prompts demonstrated higher accuracy and consistency...\n5.2. Segmentation analysis\nIn both 2D and 3D models, there were considerable performance variations for reference prompts across model types and prompting strategies (Figure 5, Table D.11). For 2D medical FMs, MedicoSAM showed high performance compared to its alternatives Med-SAM and SAM-Med2D, which is likely due to its training on a complex objective (in contrast to Med-SAM) while keeping the SAM architecture without adapters (in contrast to SAM-Med2D) [11]. However, going to 3D, its propagation is outperformed by...\nTable 5: Segmentation performance and intra-rater segmentation consistency with human prompts – grouped by prompt type (bounding box, center point, combination). The best values per prompt type are highlighted in bold. The best performing models also showed the highest consistency. The performance difference to perfect prompts is shown in Table F.16.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 29, + "total_chunks": 71, + "char_count": 1542, + "word_count": 223, + "chunking_strategy": "semantic" + },
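The Pareto-optimality used in Table 4 trades segmentation performance against parameter count. A minimal sketch of that selection rule, assuming each model is summarized by (size in M parameters, DSC); the two illustrative entries are taken from the 2D bounding-box column above, and this is not the authors' implementation:

```python
# Sketch: identify Pareto-optimal models when maximizing DSC while
# minimizing parameter count. A model is dominated if another model is at
# least as small and at least as accurate (and strictly better in one).
def pareto_optimal(models):
    """models: dict name -> (size_m_params, dsc_percent)."""
    front = []
    for name, (size, dsc) in models.items():
        dominated = any(
            (s2 <= size and d2 >= dsc) and (s2 < size or d2 > dsc)
            for n2, (s2, d2) in models.items() if n2 != name
        )
        if not dominated:
            front.append(name)
    return front

models_2d_box = {
    "MedicoSAM2D": (94, 90.74),
    "SAM2.1 B+": (81, 90.60),
}
print(pareto_optimal(models_2d_box))  # both survive: one smaller, one better
```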
 + { + "chunk_id": "95096652-d006-4fa8-bc3a-cddc521b1fa4", + "text": "Model | Size (M) | Bounding Box (2D or 3D): DSC ↑ (%) NSD ↑ (%) HD95 ↓ (mm) | Center Point (2D or 3D): DSC ↑ (%) NSD ↑ (%) HD95 ↓ (mm) | Combination (2D or 3D): DSC ↑ (%) NSD ↑ (%) HD95 ↓ (mm)\nSegmentation performance", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 30, + "total_chunks": 71, + "char_count": 188, + "word_count": 46, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a769e16f-e062-4f59-b28e-e9c7f10e8bc3", + "text": "2D Models\nmedical MedicoSAM2DScribblePrompt-SAM 9494 86.12±13.6- 95.40±5.8- 1.26±1.7- 72.50±18.075.95±20.7 84.26±12.983.54±16.6 6.39±5.35.13±5.8 86.53±13.0- 95.09±7.0- 1.38±2.1-\nSAM B 94 - - - 83.69±17.5 90.99±11.6 4.85±6.2 - - -\nSAM2.1 B+ 81 87.86±12.8 96.80±5.0 1.15±1.6 - - - - - -\nnatural SAM2.1 T 39 - - - - - - 89.65±10.8 97.41±4.7 1.06±1.7\n3D Models\nmedical Med-SAM2nnInteractive 10239 76.80±13.5- 79.27±11.2- 14.46±11.8- 68.12±12.6- 68.63±11.5- 30.10±8.8- 75.59±10.6- 77.29±9.1- 25.65±9.5-\nnatural SAM2.1SAM2.1 ST 4639 65.93±11.6- 67.83±10.2- 32.71±21.6- 53.72±16.3- 52.93±16.5- 46.84±27.8- 68.80±11.2- 69.19±10.9- 33.88±22.4-\nSegmentation consistency", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 31, + "total_chunks": 71, + "char_count": 659, + "word_count": 80, + "chunking_strategy": "semantic" + },
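For reference, a compact sketch of two of the reported metrics (DSC and HD95) for binary masks. This is a naive all-pairs surface-distance version for illustration only, not the GPU implementation (distorch) cited by the paper; it assumes boolean arrays whose dimensionality matches the spacing tuple:

```python
# Sketch: Dice similarity coefficient (DSC) and a 95th-percentile
# Hausdorff distance (HD95), spacing-aware, one common symmetrization.
import numpy as np
from scipy.ndimage import binary_erosion

def dsc(a, b):
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    def surface(m):
        return m & ~binary_erosion(m)   # boundary voxels of the mask
    pa = np.argwhere(surface(a)) * np.asarray(spacing)
    pb = np.argwhere(surface(b)) * np.asarray(spacing)
    # all-pairs distances: fine for small masks, too slow for full CT volumes
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return max(np.percentile(d.min(axis=1), 95),
               np.percentile(d.min(axis=0), 95))
```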
 + { + "chunk_id": "5efa4186-34ef-4ee2-82ae-327641110467", + "text": "2D Models\nmedical MedicoSAM2DScribblePrompt-SAM 9494 95.35±8.1- 99.20±1.8- 0.38±0.8- 93.13±13.196.27±10.8 96.45±9.398.29±6.3 1.68±5.00.99±3.9 95.15±9.7- 98.48±3.9- 0.64±1.7-\nSAM B 94 - - - 97.17±9.4 99.07±3.5 0.49±1.9 - - -\nSAM2.1 B+ 81 97.13±8.1 99.40±1.8 0.26±0.8 - - - - - -\nnatural SAM2.1 T 39 - - - - - - 97.71±9.2 99.38±2.1 0.31±1.2\n3D Models\nmedical Med-SAM2nnInteractive 10239 88.13±20.0- 90.79±16.2- 7.58±15.1- 84.89±20.6- 86.71±18.0- 7.32±12.7- 88.44±17.7- 88.75±15.2- 8.05±12.9-\nnatural SAM2.1SAM2.1 ST 4639 85.45±20.9- 87.63±17.8- 16.94±31.3- 79.63±28.5- 80.52±27.2- 23.94±40.8- 87.46±19.6- 88.28±17.6- 16.64±30.5-", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 32, + "total_chunks": 71, + "char_count": 627, + "word_count": 79, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "91077add-65ab-47ea-8f3f-e94bc67b271a", + "text": "Table 6: Spearman's rank correlation coefficients for each metric (ρDSC, ρNSD, ρHD95) between intra-rater annotation variability and segmentation consistency. Asterisks (∗) denote statistical significance. Positive values for HD95 and negative values for DSC and NSD indicate that increased prompt variability significantly reduces segmentation consistency.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 33, + "total_chunks": 71, + "char_count": 357, + "word_count": 44, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "058b4624-87d8-4342-a956-68c5f9b36898", + "text": "Model | Size (M) | Bounding Box (2D or 3D): ρDSC ↑ ρNSD ↑ ρHD95 ↓ | Center Point (2D or 3D): ρDSC ↑ ρNSD ↑ ρHD95 ↓ | Combination (2D or 3D): ρDSC ↑ ρNSD ↑ ρHD95 ↓\n2D Models\nmedical MedicoSAM2DScribblePrompt-SAM 9494 -0.36*- -0.48*- 0.49*- -0.57*-0.58* -0.55*-0.54* 0.59*0.48* -0.45*- -0.54*- 0.50*-\nSAM B 94 - - - -0.53* -0.52* 0.46* - - -\nSAM2.1 B+ 81 -0.33* -0.38* 0.41* - - - - - -\nnatural SAM2.1 T 39 - - - - - - -0.60* -0.59* 0.55*\n3D Models\nmedical Med-SAM2nnInteractive 10239 -0.38*- -0.41*- 0.50*- -0.32*- -0.30*- 0.35*- -0.09- -0.11- 0.16-\nnatural SAM2.1SAM2.1 ST 4639 -0.23*- -0.29*- 0.41*- -0.05- -0.02- 0.12- -0.31*- -0.32*- 0.41*-\n...native 3D models such as Med-SAM2 and nnInteractive. While MedicoSAM3D projects the prediction of adjacent slices, Med-SAM2 leverages the memory bank mechanism of SAM2, and nnInteractive integrates user prompts as an additional input channel for 3D feature extraction, which is less prone to error propagation by design.\nSeveral 3D medical FMs perform significantly worse than others. For SAM-Med3D, resampling the entire image without cropping to 128×128×128 leads to a notable loss of performance, likely due to image distortion and misalignment of the object of interest relative to the training data distribution. Even with cropping, performance remains below competitive levels. SegVol and Vista3D prompted with center points also demonstrate suboptimal results, which is likely due to the underlying training data, favoring abdominal and thoracic organ segmentation over bone and metal implant segmentation.\nFigure 6: Model sensitivity to input variations, visualized as intra-rater annotation variability (Euclidean distance or IoU) vs. segmentation consistency (DSC). Each point represents the mean prompt variability and mean segmentation consistency for one sample. Dotted lines represent ordinary least squares (OLS) linear regression trends. Shaded areas denote the 95% confidence intervals (α = 0.05).\nTable 7: Spearman's rank correlation coefficients for each metric (ρDSC, ρNSD, ρHD95) between inter-rater annotation variability and segmentation consistency. Annotator rows are ordered by Euclidean distance (mm), starting with the lowest-variance rater.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 34, + "total_chunks": 71, + "char_count": 1922, + "word_count": 292, + "chunking_strategy": "semantic" + },
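Figure 6's trend rendering is plain OLS with a parametric confidence band. A sketch of how such a band can be computed under the usual normal-error assumptions; this is not necessarily the authors' exact plotting procedure:

```python
# Sketch: OLS trend of segmentation consistency (DSC) vs. prompt
# variability, with a 95% confidence band around the regression line.
import numpy as np
from scipy import stats

def ols_with_ci(x, y, alpha=0.05):
    res = stats.linregress(x, y)
    xs = np.linspace(x.min(), x.max(), 100)
    yhat = res.intercept + res.slope * xs
    n = len(x)
    resid = y - (res.intercept + res.slope * x)
    s_err = np.sqrt((resid ** 2).sum() / (n - 2))
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    half = t * s_err * np.sqrt(1 / n + (xs - x.mean()) ** 2
                               / ((x - x.mean()) ** 2).sum())
    return xs, yhat, yhat - half, yhat + half  # line and lower/upper band
```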
 + { + "chunk_id": "d39224cb-1b1e-4f26-8297-b3a267a03aaf", + "text": "Due to the iterative search, not all inter-rater variabilities are tested (Table C.10). The first column indicates the order of tests. Asterisks (∗) denote statistical significance. Positive values for HD95 and negative values for DSC and NSD indicate that increased prompt variability significantly reduces segmentation consistency.\nA direct comparison between 2D and 3D models is limited by fundamental differences in evaluation and prompting strategies. 3D performance was measured across entire volumes, where error propagation in more distal slices can lower overall metrics, whereas 2D models were evaluated on single slices without such penalties. In addition, 2D methods utilized prompts per component (i.e., multiple prompts per object), while 3D models were often restricted to a single prompt per object, especially for bounding box prompting. Due to these differences, we treated 2D and 3D models as two different categories of models in our analysis.\nAnnotator | Eucl. distance (mm) | ρDSC ↑ | ρNSD ↑ | ρHD95 ↓\nnnInteractive\n1 Annotator02 1.67±2.8 -0.19 -0.15 0.14\n5 Annotator05 1.85±3.2 -0.21 -0.21 0.18\n6 Annotator14 1.86±3.2 -0.22 -0.22 0.18\n7 Annotator20 1.86±3.2 -0.26* -0.26* 0.20\n4 Annotator01 1.87±3.3 -0.27* -0.27* 0.23*\n3 Annotator09 1.96±3.1 -0.25* -0.27* 0.24*\n2 Annotator07 2.40±3.6 -0.37* -0.38* 0.35*\nSAM2.1 T\n1 Annotator15 2.51±5.5 -0.23* -0.29* 0.30*\nWhile performance findings remain consistent for both prompt sources (i.e., perfect and human), the performance drops for human prompts. This suggests that reference prompts are more \"ideal\" for optimizing model output, indicating that standard benchmarks might overestimate achievable performance in practical, human-driven settings.\nVisual inspection of segmentation results revealed three...\n(a) SAM2.1 T point-prompted; Wrist sample showing ulna and radius; despite small intra-rater variation, the resulting 3D prediction shows large differences (72.5% DSC, 69.0% NSD, 24.8mm HD95). (b) nnInteractive point-prompted; Wrist sample showing ulna and radius; small intra-rater variation with small differences in the resulting 3D prediction (98.7% DSC, 100.0% NSD, 0.3mm HD95). (c) MedicoSAM2D (first row), ScribblePrompt-SAM (second row); Hip sample showing left/right hip; varying model sensitivity to input prompt variations. (d) SAM2.1 S box-prompted; Hip sample showing left/right hip; despite small intra-rater variation, the resulting 3D prediction shows large differences (64.1% DSC, 62.9% NSD, 48.6mm HD95).\nFigure 7: Visual examples of model sensitivity to input prompt variations: reference mask, predicted mask with reference prompt, predicted mask with the 1st set of human prompts and with the 2nd set of the same annotator. The reference prompt is drawn as a black point or box. The human prompts are drawn as colored points or boxes.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 35, + "total_chunks": 71, + "char_count": 3130, + "word_count": 452, + "chunking_strategy": "semantic" + },
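The per-component 2D prompting mentioned above can be derived directly from a reference mask. A minimal sketch using connected components; names are illustrative, and the centroid rule shown here can fall outside annular shapes, one reason point prompts are noted to be harder for such bones:

```python
# Sketch: "ideal" reference prompts per connected component of a 2D mask.
import numpy as np
from scipy import ndimage

def reference_prompts(mask_2d):
    """Yield (center_point_xy, bbox_xyxy) for each connected component."""
    labeled, n = ndimage.label(mask_2d)
    for i in range(1, n + 1):
        comp = labeled == i
        ys, xs = np.nonzero(comp)
        bbox = (xs.min(), ys.min(), xs.max(), ys.max())
        cy, cx = ndimage.center_of_mass(comp)  # may lie outside a ring-shaped bone
        yield (cx, cy), bbox
```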
 + { + "chunk_id": "6e417a19-bc1d-41e9-bcd4-9e0f80e1db9c", + "text": "common mistakes (Figure 7), which also explain poor performance metrics: 1.) Anatomical ambiguity: Due to the different Hounsfield Unit (HU) values for cortical bone and trabecular bone, models struggled to differentiate between these structures and the combined total bone volume. This issue is caused by prompt ambiguity, where a point or box may not clearly define whether the user intends to segment the entire bone or just a specific layer. 2.) Oversegmentation: Both 2D and 3D architectures sometimes failed to identify clear anatomical boundaries. For 2D models, this typically resulted in the prediction extending beyond the bone contour within a single slice. For 3D models, these boundary failures were magnified by the additional spatial dimension, allowing errors to propagate and grow into neighboring structures, even across joint spaces. This propagation error suggests that the models lack a robust volumetric \"stop\" signal. 3.) Undersegmentation: In regions with fading or fluctuating intensity values, models sometimes stopped the predictions too early.\n5.3. Model sensitivity to input prompt variations\nAn inverse relationship was observed between prompt variability and segmentation consistency; as input prompt variability increases, segmentation consistency declines. While the high values for segmentation consistency suggest that the resulting masks remain mostly similar, the models nonetheless show sensitivity, where even minor changes in the input prompt can trigger large changes in the output segmentation (Figure 6, Figure 7). Consequently, sensitivity to prompt fluctuations should be considered a critical performance metric for the development and real-world evaluation of FMs, particularly in domains where user input can inherently vary.\nIntra-rater annotation variability was consistently lower than even the most stable inter-rater setting, for both point prompts (intra-rater Euclidean distance of 2.00 ± 5.3 mm vs. lowest pairwise Euclidean distance of 2.51 ± 5.5 mm) and combined prompts (intra-rater Euclidean distance of 1.41 ± 2.4 mm vs. lowest inter-rater Euclidean distance of 1.68 ± 2.8 mm). Therefore, if a model demonstrated sensitivity to the variations within a single annotator, it likely exhibits similar or greater sensitivity to the larger fluctuations between annotators. For models that did not show a statistically significant correlation in the intra-rater setting, the correlation for inter-rater settings was tested as well.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 36, + "total_chunks": 71, + "char_count": 2484, + "word_count": 360, + "chunking_strategy": "semantic" + },
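The intra- and inter-rater variabilities above are spacing-aware Euclidean distances between click positions. A trivial but easy-to-get-wrong detail (pixels vs. mm), sketched here with illustrative names:

```python
# Sketch: prompt variability as a spacing-aware Euclidean distance (mm)
# between two click placements of the same (or different) annotator(s).
import numpy as np

def prompt_distance_mm(p1_px, p2_px, pixel_spacing_mm):
    """p*_px: (x, y) in pixels; pixel_spacing_mm: (sx, sy) in mm/pixel."""
    d = (np.asarray(p1_px) - np.asarray(p2_px)) * np.asarray(pixel_spacing_mm)
    return float(np.linalg.norm(d))

# e.g. the reported intra-rater mean of 2.00 mm for point prompts would
# correspond to roughly a 3-pixel offset at 0.7 mm in-plane spacing.
print(prompt_distance_mm((120, 88), (122, 90), (0.7, 0.7)))
```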
 + { + "chunk_id": "0689936d-74c6-4d79-a6ec-b1aeb23a24cb", + "text": "While all 2D models and box-prompted 3D models exhibited sensitivity to prompt fluctuations already in the intra-rater setting, nnInteractive combination-prompted and SAM2.1 T point-prompted showed a lack of statistically significant correlations, suggesting model robustness to intra-rater input prompt variations. Testing them further in inter-rater settings, SAM2.1 T point-prompted showed sensitivity at the first inter-rater iteration (2.51 ± 5.5), while nnInteractive combination-prompted showed sensitivity at the sixth inter-rater level with 1.87 ± 3.3. Thus, no tested model is robust against large fluctuations between annotators, but nnInteractive shows the least sensitivity. It is critical to emphasize that model sensitivity should not be viewed as an isolated performance metric. It must be evaluated in combination with absolute performance and segmentation consistency to ensure a more complete evaluation.\n...is often to fully automate the segmentation process without user interaction.\nGeometric prompting. Aside from geometric prompting, text prompts have also become more popular and are, for example, integrated in the recently released SAM3 framework [7]. Text prompts remove user interaction, and therefore geometric variations, and could potentially be automated for specific medical tasks if the same structures always need to be segmented.\n6. Conclusion\nThe observed performance drop when transitioning from", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 37, + "total_chunks": 71, + "char_count": 1415, + "word_count": 194, + "chunking_strategy": "semantic" + },
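A sketch of the iterative inter-rater search pattern described above: annotator pairings are sorted by mean prompt variability and tested in increasing order until the Bonferroni-corrected significance is reached. Names are illustrative; 20 × 3 tests reproduce the study's 0.00083 threshold:

```python
# Sketch: find the inter-rater variability level at which a model first
# shows a significant variability/consistency correlation.
from scipy.stats import spearmanr

def inter_rater_threshold(pairings, alpha=0.05, n_tests=60):
    """pairings: list of (mean_variability, variability_per_sample,
    consistency_per_sample), sorted ascending by mean_variability."""
    for level, (mean_var, var, cons) in enumerate(pairings, start=1):
        rho, p = spearmanr(var, cons)
        if p < alpha / n_tests:          # 0.05 / (20 x 3) in the study
            return level, mean_var, rho  # first sensitive level
    return None  # robust to all tested inter-rater variabilities
```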
 + { + "chunk_id": "6c1e3d2b-fa1d-40a1-95de-048c555267af", + "text": "...idealized reference prompts to human inputs, and the sensitivity to human prompt fluctuations across models, show that prompt placement matters. Our findings suggest that segmentation performances derived from \"ideal\" prompts (i.e., reference prompts) may not accurately reflect performance in human-driven settings. Consequently, model sensitivity to prompt variability should be established as a complementary performance metric for the development and real-world evaluation of promptable FMs. This would help bridge the gap between theoretical potential and practical application.\nConsidering that, nnInteractive combination-prompted presented itself as the best option of all tested models.\n5.4. Limitations & Future Work\nDataset. The TotalSegmentator dataset was used to train some of the investigated FMs. Not all FMs reported a detailed train–test split (Table 1). However, by introducing the new classes femur implant left and right, the evaluated task extended beyond the original training labels and posed a new task unseen by the FMs, even if the selected test samples were included in previous training.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 38, + "total_chunks": 71, + "char_count": 1108, + "word_count": 156, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "063f19a3-2054-4941-9d93-3f2c6d25f91b", + "text": "Axial slices. We limited our study to axial slices to limit the workload for annotators. Sagittal and coronal slices, which are often underexplored, could serve as a meaningful alternative or complementary source of information.\nObserver Study. The annotators in our observer study are medical students rather than trained radiologists, primarily due to availability. However, the results indicate that extensive medical training may not be required for the investigated tasks, although this may not generalize to more complex clinical applications such as tumor identification.\nNon-iterative prompting. Our study was conducted in a static setting without iterative refinement or segmentation correction. While interactive workflows are important for real-world deployment, they increase the complexity of the...\nAcknowledgments\nWe thank all the students who participated in the observer study and made the collection of human prompts possible. We also want to thank Dieuwertje Luitse for her input to the study questionnaires sent to the students at the beginning and end of their study participation, to collect additional information about the study participants and their study experience. We thank Thomas Koopman and the team from grand-challenge for their great help with setting up the observer study.\nReferences\n[1] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, R. Girshick, Segment anything (2023). arXiv:2304.02643. URL https://arxiv.org/abs/2304.02643", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 39, + "total_chunks": 71, + "char_count": 1336, + "word_count": 197, + "chunking_strategy": "semantic" + },
 + { + "chunk_id": "c85fae7f-a9f2-48b5-9ab1-985a50cedd63", + "text": "evaluation, as the individual contributions of the interaction steps and their effect on model sensitivity would be more difficult to isolate and quantify. Future evaluation studies should be conducted to analyze interactive refinement efficiency, which may mitigate commonly observed segmentation mistakes. For example, the impact of severe oversegmentation and volumetric leakage could be mitigated by the strategic use of negative prompts to define exclusion zones. Similarly, anatomical ambiguity could be overcome by several carefully placed positive prompts until the desired anatomical boundary is reached. However, a disadvantage of iterative refinement is the additionally required user interaction and time, where the ultimate goal...\n[2] Y. Huang, X. Yang, et al., D. Ni, Segment anything model for medical images?, Medical Image Analysis 92 (2024) 103061. doi:10.1016/j.media.2023.103061. URL https://www.sciencedirect.com/science/article/pii/S1361841523003213", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 40, + "total_chunks": 71, + "char_count": 1100, + "word_count": 140, + "chunking_strategy": "semantic" + },
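The negative-prompt idea above maps directly onto the point-label convention of SAM-style predictors (1 = include, 0 = exclude). A hedged sketch against the segment-anything SamPredictor interface; the checkpoint path, coordinates, and the image variable are placeholders prepared elsewhere:

```python
# Sketch: one positive click inside the target bone plus one negative click
# in the joint space, to discourage volumetric leakage across the joint.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder
predictor = SamPredictor(sam)
predictor.set_image(ct_slice_rgb)  # HxWx3 uint8 slice, prepared elsewhere

point_coords = np.array([[210, 140],   # positive: inside the tibia
                         [255, 150]])  # negative: joint space / exclusion zone
point_labels = np.array([1, 0])
masks, scores, _ = predictor.predict(point_coords=point_coords,
                                     point_labels=point_labels,
                                     multimask_output=False)
```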
 + { + "chunk_id": "733c0119-0a82-43a3-b5df-c61e541179dc", + "text": "[3] ... Streekstra, Joint space narrowing in patients with pisotriquetral osteoarthritis, HAND 12 (5) (2017) 490–492, PMID: 28832198. doi:10.1177/1558944716677542. URL https://doi.org/10.1177/1558944716677542\n[4] C. Kievit et al., Automation in tibial implant loosening detection using deep-learning segmentation, International Journal of Computer Assisted Radiology and Surgery 20 (2025) 2065–2073. doi:10.1007/s11548-025-03459-1. URL https://doi.org/10.1007/s11548-025-03459-1\n[9] J. Cheng et al., Sam-med2d (2023). arXiv:2308.16184.\n[10] H. Wong, M. Rakic, J. Guttag, A. V. Dalca, Scribbleprompt: Fast and flexible interactive segmentation for any biomedical image, European Conference on Computer Vision (ECCV) (2024).\n[11] A. Archit et al., Medicosam: Towards foundation models for medical image segmentation (2025). arXiv:2501.11734. URL https://arxiv.org/abs/2501.11734\n[12] J. Wu et al., Medical sam 2: Segment medical images as video via segment anything model 2 (2024). arXiv:2408.00874. URL https://arxiv.org/abs/2408.00874\n[13] J. Ma et al., Medsam2: Segment anything in 3d medical images and videos (2025). arXiv:2504.03600. URL https://arxiv.org/abs/2504.03600", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 41, + "total_chunks": 71, + "char_count": 1326, + "word_count": 153, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "63d9101f-8b4a-425a-ace9-3e8f9ac002ca", + "text": "[5] ... Kerkhoffs, ... P. van Deurzen, et al., Minimal but potentially clinically relevant anteroinferior position of the humeral head following traumatic anterior shoulder dislocations: A 3d-ct analysis, Journal of Orthopaedic Research 42 (8) (2024) 1641–1652. doi:10.1002/jor.25831. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/jor.25831\n[14] H. Wang et al., Sam-med3d: Towards general-purpose segmentation models for volumetric medical images (2024). arXiv:2310.15161. URL https://arxiv.org/abs/2310.15161\n[15] Y. Du et al., Segvol: Universal and interactive volumetric medical image segmentation (2025). arXiv:2311.13385. URL https://arxiv.org/abs/2311.13385\n[6] N. Ravi et al., Sam 2: Segment anything in images and videos (2024). arXiv:2408.00714. URL https://arxiv.org/abs/2408.00714", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 42, + "total_chunks": 71, + "char_count": 654, + "word_count": 63, + "chunking_strategy": "semantic" + },
 + { + "chunk_id": "3714f5ff-d666-4d9e-b12e-e9fe4f76202f", + "text": "[16] Y. He et al., Vista3d: A unified segmentation foundation model for 3d medical imaging (2024). arXiv:2406.05285. URL https://arxiv.org/abs/2406.05285\n[7] ... Sam 3: Segment anything with concepts (2025). arXiv:2511.16719. URL https://arxiv.org/abs/2511.16719\n[17] F. Isensee et al., nninteractive: Redefining 3d promptable segmentation (2025). arXiv:2503.08373. URL https://arxiv.org/abs/2503.08373\n[18] B. Azad et al., Foundational models in medical imaging: A comprehensive survey and future vision (2023). arXiv:2310.18689. URL https://arxiv.org/abs/2310.18689\n[8] J. Ma et al., Segment anything in medical images, Nature Communications 15 (2024) 1–9.\n[19] Y. Zhang, Z. Shen, R. Jiao, Segment anything model for medical image segmentation: Current applications and future directions, Computers in Biology and Medicine 171 (2024) 108238. doi:10.1016/j.compbiomed.2024.108238. URL https://www.sciencedirect.com/science/article/pii/S0010482524003226", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 43, + "total_chunks": 71, + "char_count": 1091, + "word_count": 126, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1efa5fda-14bb-493b-8502-10acbad0990b", + "text": "[20] S. Lee et al., A narrative review of foundation models for medical image segmentation: zero-shot performance evaluation on diverse modalities, Quantitative Imaging in Medicine and Surgery 15 (6) (2025). URL https://qims.amegroups.org/article/view/138057\n[21] M. Yao et al., A review of the segment anything model (sam) for medical image analysis: Accomplishments and perspectives, Computerized Medical Imaging and Graphics 119 (2025) 102473. doi:10.1016/j.compmedimag.2024.102473. URL https://www.sciencedirect.com/science/article/pii/S0895611124001502\n[22] D. Kang, A. Mukasheva, et al., A review of deep learning approaches based on segment anything model for medical image segmentation, Bioengineering 12 (12) (2025). URL https://www.mdpi.com/2306-5354/12/12/\n[23] P. Ma, Q. Chang, et al., Vision foundation models in medical image analysis: Advances and challenges (2025). arXiv:2502.14584. URL https://arxiv.org/abs/2502.14584\n[28] ... Kupssinskü, ... Barros, et al., Zero-shot performance of the segment anything model (sam) in 2d medical imaging: A comprehensive evaluation and practical guidelines, in: 2023 IEEE 23rd International Conference on Bioinformatics and Bioengineering (BIBE), 2023, pp. 108–112. doi:10.1109/BIBE60311.2023.00025.\n[29] H. Dong et al., Segment anything model 2: an application to 2d and 3d medical images (2024). arXiv:2408.00756. URL https://arxiv.org/abs/2408.00756\n[30] S. Sengupta et al., Is sam 2 better than sam in medical image segmentation? (2024). arXiv:2408.04212. URL https://arxiv.org/abs/2408.04212\n[31] ... Wang, ... Ren, et al., Sam 2 in robotic surgery: An empirical evaluation for robustness and generalization in surgical video segmentation (2024). arXiv:2408.04593. URL https://arxiv.org/abs/2408.04593\n[32] Y. Shen et al., Performance and non-adversarial robustness of the segment anything model 2 in surgical video segmentation (2024). arXiv:2408.04098. URL https://arxiv.org/abs/2408.04098\n[33] C. Magg et al., Zero-shot capability of 2d SAM-family models for bone segmentation in CT scans, in: Medical Imaging with Deep Learning, 2025. URL https://openreview.net/forum?id=AUv6NhK9aH", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 44, + "total_chunks": 71, + "char_count": 1888, + "word_count": 234, + "chunking_strategy": "semantic" + },
 + { + "chunk_id": "15b9df20-7980-415f-b1fb-2ae8f4c15fd6", + "text": "[24] S. Roy et al., Sam.md: Zero-shot medical image segmentation capabilities of the segment anything model (2023). arXiv:2304.05396. URL https://arxiv.org/abs/2304.05396", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 45, + "total_chunks": 71, + "char_count": 172, + "word_count": 24, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1053bdd2-a394-4180-9d1d-ea3bf5fa361f", + "text": "[25] S. He et al., Computer-vision benchmark segment-anything model (sam) in medical images: Accuracy in 12 datasets (2023). arXiv:2304.09324. URL https://arxiv.org/abs/2304.09324\n[26] M. A. Mazurowski et al., Segment anything model for medical image analysis: An experimental study, Medical Image Analysis 89 (2023) 102918. doi:10.1016/j.media.2023.102918. URL https://www.sciencedirect.com/science/article/pii/S1361841523001780\n[27] ... Sam on medical images: A comprehensive study on three prompt modes (2023). arXiv:2305.00035. URL https://arxiv.org/abs/2305.00035\n[34] ... Jäger, K. Maier-Hein, et al., Radioactive: 3d radiological interactive segmentation benchmark (2025). arXiv:2411.07885. URL https://arxiv.org/abs/2411.07885\n[35] J. Wasserthal et al., Totalsegmentator: Robust segmentation of 104 anatomic structures in ct images, Radiology: Artificial Intelligence 5 (5) (2023) e230024. doi:10.1148/ryai.230024. URL https://doi.org/10.1148/ryai.230024\n[36] P. Liu et al., Deep learning to segment pelvic bones: Large-scale ct datasets and baseline models (2021). arXiv:2012.08721. URL https://arxiv.org/abs/2012.08721", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 46, + "total_chunks": 71, + "char_count": 1453, + "word_count": 158, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8a75c017-0cae-4e6b-b056-3c7f4a3ae433", + "text": "[37] A. Sekuboyina et al., Verse: A vertebrae labelling and segmentation benchmark for multi-detector ct images, Medical Image Analysis 73 (2021) 102166. doi:10.1016/j.media.2021.102166. URL https://www.sciencedirect.com/science/article/pii/S1361841521002127\n[38] L. Maier-Hein, A. Reinke, ... Jäger, et al., Metrics reloaded: recommendations for image analysis validation, Nature Methods 21 (2024) 195–212. URL https://doi.org/10.1038/s41592-023-02151-7\n[39] J. Rony, H. Kervadec, Distorch: A fast gpu implementation of 3d hausdorff distance (2025). URL https://github.com/jeromerony/distorch\n[40] ... Streekstra, Evaluation of a ct-based technique to measure the transfer accuracy of a virtually planned osteotomy, Medical Engineering & Physics 36 (8) (2014) 1081–1087. doi:10.1016/j.medengphy.2014.05.012. URL https://www.sciencedirect.com/science/article/pii/S1350453314001271\n[41] P. A. Yushkevich, ... G. Gerig, User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability, Neuroimage 31 (3) (2006) 1116–1128.\n[42] ... Xu, Towards a comprehensive, efficient and promptable anatomic structure segmentation model using 3d whole-body ct scans (2024). arXiv:2403.15063. URL https://arxiv.org/abs/2403.15063\nThe three data subsets Wrist, Lower Leg, Shoulder were acquired at the Amsterdam UMC with a Brilliance 64-channel CT Scanner (Philips Healthcare, Best, The Netherlands) or a Siemens SOMATOM Force. The reference segmentation masks were generated in a two-step annotation process: First, an in-house 3D annotation software [40] was used to generate preliminary masks with a threshold-based region-growing segmentation algorithm. Then, these preliminary masks were manually corrected and refined with ITK-SNAP [41].", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 48, + "total_chunks": 71, + "char_count": 1635, + "word_count": 190, + "chunking_strategy": "semantic" + },
 + { + "chunk_id": "35205e79-027c-4dc0-b537-7c35cc91844d", + "text": "The fourth data subset Hip is derived from the publicly reported test set of TotalSegmentator [35], a labeled CT dataset created by the Research and Analysis Department at University Hospital Basel. Following the official test split, we selected 11 CT scans, manually ensuring that 6 of them contained at least one hip implant. The reference segmentation mask was generated by merging the original reference mask with a manually created annotation in ITK-SNAP [41] of the hip implant (stem and cup together). The existing segmentation masks for the left and right hips, as well as the left and right femurs, were left unchanged; no corrections for over- or under-segmentation were applied.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 49, + "total_chunks": 71, + "char_count": 689, + "word_count": 110, + "chunking_strategy": "semantic" + },
 + { + "chunk_id": "23393e9e-3a76-4622-ace2-057eb7033e30", + "text": "To reduce the workload in the observer study, axial slices were extracted from the 3D CT volumes taking into account data coverage, diversity and comparability between data subsets. To avoid slices with little to no relevant anatomical information, the top and bottom 10% of each object were excluded from the slice selection. By default, two slices per class were extracted from the remaining object volume, maintaining at least a 10-slice interval (see Figure 2). However, since the data subsets differ in their characteristics (e.g., number of classes and slices), the default setting was adjusted accordingly. For Wrist, a 5-slice gap was used because six classes were distributed across an average of 363 axial slices, making a 10-slice gap too large to maintain. To ensure a comparable number of slices across datasets and to account for the large volume size (over 1000 slices), three slices per class were selected for Lower Leg. For Hip, only the original classes were used for slice selection to ensure an equal number of slices per sample, as the two newly added labels do not appear in every CT scan.\nSamples seen twice by annotators. A dataset-specific duplication strategy was applied. For the Wrist, Shoulder, and Hip datasets, a balanced approach was used by selecting one of the two selected slices per class label a second time (i.e., 50% of slices used twice). In contrast, all samples in the Lower Leg dataset were used a second time due to several dataset-specific characteristics: The number of classes per slice is limited (at most two reference classes), which reduces annotation time per sample; The majority of selected slices only contains one class, whereas slices in Wrist, Shoulder and Hip commonly display multiple classes; The extraction of three slices per class label precludes an even duplication split, unlike the other datasets.\nSAM, SAM2, Med-SAM, Med-SAM2, SAM-Med2D, ScribblePrompt, SegVol, Vista3D, MedicoSAM2D, and nnInteractive were used as described by their GitHub repositories, including the provided tutorials and example scripts for data pre-processing3.\nMedicoSAM3D [11] has three hyperparameters for prompt propagation: the IoU threshold, the projection mode, and the box extension factor, which controls the expansion of the box after projection. Optimal performance requires tuning these hyperparameters for each data subset. To establish a single standardized inference protocol for our entire dataset, we performed a grid-based hyperparameter search on four representative samples – one from each subset, the same samples that participants from the observer study used for their training phase.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 50, + "total_chunks": 71, + "char_count": 2635, + "word_count": 406, + "chunking_strategy": "semantic" + },
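A sketch of the slice-selection rule described above (drop the top and bottom 10% of each object's axial extent, then sample slices at least a fixed gap apart); function and parameter names are illustrative, not from the paper's code base:

```python
# Sketch: pick annotation slices from a 3D binary object mask.
import numpy as np

def select_slices(mask_3d, n_slices=2, gap=10):
    zs = np.unique(np.nonzero(mask_3d)[0])     # axial slices containing the object
    lo, hi = np.percentile(zs, [10, 90])       # exclude top/bottom 10%
    picked = []
    for z in zs[(zs >= lo) & (zs <= hi)]:
        if not picked or z - picked[-1] >= gap:
            picked.append(int(z))
        if len(picked) == n_slices:
            break
    return picked  # e.g. gap=5 would mirror the Wrist adjustment
```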
 + { + "chunk_id": "73f4cbf3-a3e4-472e-a73b-b5b3ccfe0a72", + "text": "The search space included IoU thresholds from 0.7 to 0.9 (step 0.1), projection modes box, points, points and masks, single point, and box extensions from 0 to 0.25 (step 0.05). The final settings, selected by majority vote from all experiments, were iou_threshold = 0.7, projection = single_point, and box_extension = 0.0.\nThe latest version of SAM-Med3D4 does not support sliding-window inference with built-in prompt propagation, in contrast to methods such as SegVol [15] or Vista3D [16]. In its current implementation, inference operates on independent (128,128,128) window crops, each of which requires a newly provided prompt. Because the method does\n3SAM: commit 6fdee8f, SAM2: commit 2b90b9f, Med-SAM: commit 2b7c64c, Med-SAM2: commit 332f30d, SAM-Med2D: commit bfd2b93, ScribblePrompt: commit 182449, SegVol: 4ee0a47, Vista3D: commit 8bb7572, MedicoSAM: 9d73c29, nnInteractive: 47c4626\n4SAM-Med3D commit: e8d2e0a", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 51, + "total_chunks": 71, + "char_count": 922, + "word_count": 129, + "chunking_strategy": "semantic" + },
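The grid search over these propagation hyperparameters, with the final configuration chosen by majority vote over the four training-phase samples, can be sketched as follows; `run_inference` is a hypothetical stand-in that returns a validation score such as DSC:

```python
# Sketch: grid-based hyperparameter search with a majority vote over samples.
from collections import Counter
from itertools import product

iou_thresholds = [0.7, 0.8, 0.9]
projection_modes = ["box", "points", "points_and_masks", "single_point"]
box_extensions = [0.00, 0.05, 0.10, 0.15, 0.20, 0.25]

def best_setting(samples, run_inference):
    votes = Counter()
    for sample in samples:
        scored = [
            ((iou, mode, ext), run_inference(sample, iou, mode, ext))
            for iou, mode, ext in product(iou_thresholds,
                                          projection_modes, box_extensions)
        ]
        votes[max(scored, key=lambda kv: kv[1])[0]] += 1
    return votes.most_common(1)[0][0]  # e.g. (0.7, "single_point", 0.0)
```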
 + { + "chunk_id": "6204780f-adf8-40cb-9ea1-419a1e1440d8", + "text": "not implement an overlapping sliding window where prompts are automatically derived from the previously generated mask, the user needs to provide prompts for every crop. As our use case requires fully automatic inference after the initial prompt, this evaluation strategy cannot be applied. To perform inference with SAM-Med3D, we implemented two alternatives without modifying the model framework: The first, naive approach is to crop a (128,128,128) window around the initial prompt, which may fail to fully capture objects that exceed this size; The second is to resample the entire image by resizing its longest side to 128 voxels. Although this ensures that the entire object is captured, it can significantly distort the image and affect the performance.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 52, + "total_chunks": 71, + "char_count": 760, + "word_count": 117, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b6e645d0-8917-4fa2-a900-81c7adeeb13b", + "text": "MedicalSAM2 (MedSAM-2) [12] was not included in our analysis due to persistent assertion errors in the model architecture code preventing successful execution5, and resolving these issues would have required extensive investigation beyond the scope of this study. CT-SAM3D [42] was not included in our analysis because preliminary tests produced empty prediction masks. We hypothesize that the fixed 64×64×64 patch size in combination with the absence of a sliding-window inference or automatic prompt propagation (similar to SAM-Med3D) did not generalize well to our data.\n5MedicalSAM2: commit 18b0f5b", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 53, + "total_chunks": 71, + "char_count": 602, + "word_count": 86, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5e5ed344-eba8-4df6-90b5-6a8afe0bb452", + "text": "Human prompt variation\nAccuracy of human prompts. Center point. Table C.8 collects detailed results on the median Euclidean distance (mm). Figure C.8 visualizes the spatial distribution of the center point deviations (∆x, ∆y) and the intra-rater consistency (∆x, ∆y) per class label.\nTable C.8: Euclidean distances (mm) of human center points compared to reference center points measured as median and IQR.
(a) Total Average & Dataset Lower Leg and corresponding class labels Annotator Total Lower Leg Tibia Implant Tibia bone all 1.50 (0.7-3.0) 1.76 (1.0-1.0) 1.54 (0.7-0.7) 2.04 (1.1-1.1)\nannotator01 1.50 (0.7-0.7) 1.95 (1.0-1.0) 1.54 (0.7-0.7) 2.04 (1.1-1.1)\nannotator02 1.50 (0.7-0.7) 1.38 (0.7-0.7) 1.54 (1.0-1.0) 1.38 (0.7-0.7)\nannotator03 1.50 (0.8-0.8) 2.01 (1.0-1.0) 1.42 (0.7-0.7) 2.01 (1.1-1.1)\nannotator04 1.50 (0.7-0.7) 1.76 (1.0-1.0) 1.46 (0.5-0.5) 1.95 (1.1-1.1)\nannotator05 1.38 (0.7-0.7) 1.54 (0.7-0.7) 1.24 (0.5-0.5) 1.54 (0.7-0.7)\nannotator06 1.63 (0.8-0.8) 2.04 (1.0-1.0) 1.78 (0.8-0.8) 2.13 (1.1-1.1)\nannotator07 1.50 (0.7-0.7) 2.85 (1.1-1.1) 1.65 (1.0-1.0) 3.59 (1.7-1.7)\nannotator08 1.37 (0.7-0.7) 1.54 (1.0-1.0) 1.76 (1.0-1.0) 1.54 (1.0-1.0)\nannotator09 1.50 (0.8-0.8) 1.95 (1.1-1.1) 1.78 (1.0-1.0) 1.95 (1.1-1.1)\nannotator10 1.59 (0.7-0.7) 1.95 (1.1-1.1) 1.09 (0.7-0.7) 2.01 (1.2-1.2)\nannotator11 1.74 (1.0-1.0) 1.76 (1.1-1.1) 1.95 (1.1-1.1) 1.76 (1.1-1.1)\nannotator12 1.66 (1.0-1.0) 1.76 (1.1-1.1) 1.76 (1.1-1.1) 1.95 (1.1-1.1)\nannotator13 1.50 (0.8-0.8) 1.54 (1.0-1.0) 1.38 (0.7-0.7) 1.95 (1.1-1.1)\nannotator14 1.46 (0.7-0.7) 1.54 (1.0-1.0) 1.09 (0.7-0.7) 1.76 (1.0-1.0)\nannotator15 1.38 (0.7-0.7) 1.09 (0.7-0.7) 1.09 (0.7-0.7) 1.09 (0.7-0.7)\nannotator16 1.71 (0.9-0.9) 2.07 (1.1-1.1) 1.50 (1.1-1.1) 2.18 (1.5-1.5)\nannotator17 1.46 (0.7-0.7) 1.76 (1.0-1.0) 1.38 (0.7-0.7) 2.01 (1.1-1.1)\nannotator18 1.50 (0.7-0.7) 1.95 (1.1-1.1) 1.95 (1.1-1.1) 1.95 (1.1-1.1)\nannotator19 1.52 (0.7-0.7) 2.01 (1.0-1.0) 1.09 (0.5-0.5) 2.31 (1.1-1.1)\nannotator20 1.54 (0.8-0.8) 2.01 (1.1-1.1) 2.07 (1.0-1.0) 2.01 (1.4-1.4) (b) Dataset Shoulder and corresponding class labels Annotator Shoulder Humerus R Scapula R Humerus L Scapula L\nall 1.86 (1.0-1.0) 1.38 (1.0-1.0) 1.67 (1.0-1.0) 1.36 (0.9-0.9) 2.18 (1.2-1.2)\nannotator01 1.38 (1.0-1.0) 1.38 (1.0-1.0) 1.67 (1.0-1.0) 1.36 (0.9-0.9) 2.18 (1.2-1.2)\nannotator02 1.91 (1.0-1.0) 1.29 (1.0-1.0) 2.17 (1.2-1.2) 1.56 (1.0-1.0) 2.04 (1.2-1.2)\nannotator03 1.94 (1.2-1.2) 1.89 (1.0-1.0) 2.18 (1.6-1.6) 1.38 (1.0-1.0) 2.18 (1.2-1.2)\nannotator04 1.91 (1.0-1.0) 1.69 (1.0-1.0) 2.50 (1.4-1.4) 1.22 (0.9-0.9) 2.46 (1.4-1.4)\nannotator05 1.38 (1.0-1.0) 1.21 (1.0-1.0) 1.86 (1.0-1.0) 1.38 (1.0-1.0) 1.38 (1.0-1.0)\nannotator06 1.91 (1.0-1.0) 1.38 (1.0-1.0) 2.76 (1.5-1.5) 1.29 (0.9-0.9) 3.08 (1.3-1.3)\nannotator07 1.66 (1.0-1.0) 1.69 (1.0-1.0) 1.91 (1.2-1.2) 1.36 (1.0-1.0) 1.86 (1.0-1.0)\nannotator08 1.37 (1.0-1.0) 1.28 (1.0-1.0) 1.89 (1.0-1.0) 1.18 (0.9-0.9) 1.38 (1.0-1.0)\nannotator09 1.66 (1.0-1.0) 1.38 (1.0-1.0) 2.18 (1.0-1.0) 1.18 (1.0-1.0) 1.94 (1.3-1.3)\nannotator10 1.94 (1.0-1.0) 0.98 (0.8-0.8) 4.03 (2.7-2.7) 0.98 (0.9-0.9) 4.03 (1.9-1.9)\nannotator11 1.95 (1.2-1.2) 1.86 (1.4-1.4) 2.18 (1.9-1.9) 1.29 (1.0-1.0) 2.36 (1.4-1.4)\nannotator12 1.94 (1.2-1.2) 1.89 (1.2-1.2) 2.18 (1.7-1.7) 1.52 (1.0-1.0) 1.95 (1.4-1.4)\nannotator13 1.86 (1.0-1.0) 1.37 (1.0-1.0) 1.95 (1.2-1.2) 1.38 (1.0-1.0) 2.30 (1.4-1.4)\nannotator14 1.86 (1.2-1.2) 1.69 (1.0-1.0) 2.18 (1.7-1.7) 1.37 (1.0-1.0) 2.18 (1.2-1.2)\nannotator15 1.38 (1.0-1.0) 1.21 (1.0-1.0) 1.95 (1.2-1.2) 1.19 (1.0-1.0) 1.94 (1.0-1.0)\nannotator16 1.95 (1.2-1.2) 1.94 (1.2-1.2) 2.53 (1.7-1.7) 1.30 (1.0-1.0) 2.27 (1.9-1.9)\nannotator17 1.38 (1.0-1.0) 1.23 (1.0-1.0) 1.95 (1.2-1.2) 1.18 (1.0-1.0) 1.94 (1.2-1.2)\nannotator18 1.38 (1.0-1.0) 1.26 (1.0-1.0) 1.94 (1.1-1.1) 1.19 (1.0-1.0) 1.94 (1.4-1.4)\nannotator19 1.94 (1.2-1.2) 1.38 (1.0-1.0) 2.36 (1.4-1.4) 1.22 (1.0-1.0) 3.44 (1.9-1.9)\nannotator20 1.91 (1.0-1.0) 1.69 (1.0-1.0) 2.06 
(1.4-1.4) 1.38 (1.0-1.0) 2.17 (1.4-1.4) (c) Dataset Wrist and corresponding class labels\nAnnotator Wrist Capitate Lunate Radius Scaphoid Triquetrum Ulna\nall 0.73 (0.5-0.5) 0.65 (0.5-0.5) 0.95 (0.7-0.7) 0.65 (0.3-0.3) 0.73 (0.6-0.6) 0.73 (0.5-0.5) 0.46 (0.3-0.3)\nannotator01 0.73 (0.5-0.5) 0.65 (0.5-0.5) 0.95 (0.7-0.7) 0.65 (0.3-0.3) 0.73 (0.6-0.6) 0.73 (0.5-0.5) 0.46 (0.3-0.3)\nannotator02 0.73 (0.5-0.5) 0.65 (0.3-0.3) 1.17 (0.7-0.7) 0.73 (0.5-0.5) 0.73 (0.6-0.6) 0.65 (0.3-0.3) 0.65 (0.4-0.4)\nannotator03 0.73 (0.5-0.5) 0.65 (0.5-0.5) 1.46 (0.7-0.7) 0.92 (0.5-0.5) 0.73 (0.5-0.5) 0.69 (0.5-0.5) 0.46 (0.3-0.3)\nannotator04 0.73 (0.5-0.5) 0.65 (0.4-0.4) 1.17 (0.7-0.7) 0.73 (0.3-0.3) 0.92 (0.7-0.7) 0.73 (0.5-0.5) 0.46 (0.3-0.3)\nannotator05 0.65 (0.3-0.3) 0.65 (0.3-0.3) 0.73 (0.5-0.5) 0.65 (0.4-0.4) 0.65 (0.3-0.3) 0.46 (0.3-0.3) 0.46 (0.4-0.4)\nannotator06 0.73 (0.5-0.5) 0.73 (0.3-0.3) 1.03 (0.7-0.7) 0.73 (0.3-0.3) 0.73 (0.7-0.7) 0.73 (0.5-0.5) 0.65 (0.3-0.3)\nannotator07 0.73 (0.5-0.5) 0.73 (0.5-0.5) 0.92 (0.7-0.7) 0.65 (0.3-0.3) 0.98 (0.7-0.7) 0.69 (0.5-0.5) 0.65 (0.5-0.5)\nannotator08 0.73 (0.5-0.5) 0.73 (0.5-0.5) 0.73 (0.5-0.5) 0.65 (0.3-0.3) 0.73 (0.5-0.5) 0.73 (0.5-0.5) 0.65 (0.5-0.5)\nannotator09 0.73 (0.5-0.5) 0.69 (0.3-0.3) 1.03 (0.7-0.7) 0.73 (0.7-0.7) 0.92 (0.7-0.7) 0.73 (0.5-0.5) 0.46 (0.3-0.3)\nannotator10 0.73 (0.5-0.5) 0.65 (0.3-0.3) 1.03 (0.5-0.5) 0.69 (0.4-0.4) 0.98 (0.7-0.7) 0.46 (0.3-0.3) 0.46 (0.4-0.4)\nannotator11 0.73 (0.5-0.5) 0.73 (0.3-0.3) 1.17 (0.7-0.7) 0.73 (0.5-0.5) 0.95 (0.7-0.7) 0.92 (0.5-0.5) 0.69 (0.4-0.4)\nannotator12 0.73 (0.5-0.5) 0.73 (0.5-0.5) 1.17 (0.7-0.7) 0.82 (0.5-0.5) 0.92 (0.7-0.7) 0.73 (0.5-0.5) 0.73 (0.5-0.5)\nannotator13 0.73 (0.5-0.5) 0.65 (0.3-0.3) 0.92 (0.7-0.7) 0.65 (0.4-0.4) 0.73 (0.7-0.7) 0.65 (0.4-0.4) 0.46 (0.3-0.3)\nannotator14 0.73 (0.3-0.3) 0.73 (0.3-0.3) 0.92 (0.5-0.5) 0.65 (0.3-0.3) 0.73 (0.5-0.5) 0.65 (0.5-0.5) 0.46 (0.3-0.3)\nannotator15 0.73 (0.5-0.5) 0.46 (0.3-0.3) 1.26 (0.7-0.7) 0.69 (0.5-0.5) 0.73 (0.5-0.5) 0.65 (0.3-0.3) 0.73 (0.5-0.5)\nannotator16 0.73 (0.5-0.5) 0.73 (0.5-0.5) 0.98 (0.7-0.7) 0.73 (0.5-0.5) 0.98 (0.7-0.7) 0.65 (0.3-0.3) 0.65 (0.5-0.5)\nannotator17 0.65 (0.5-0.5) 0.65 (0.3-0.3) 1.00 (0.6-0.6) 0.65 (0.3-0.3) 0.73 (0.5-0.5) 0.73 (0.4-0.4) 0.65 (0.5-0.5)\nannotator18 0.73 (0.5-0.5) 0.65 (0.3-0.3) 0.92 (0.5-0.5) 0.65 (0.3-0.3) 0.73 (0.5-0.5) 0.73 (0.5-0.5) 0.73 (0.4-0.4)\nannotator19 0.65 (0.4-0.4) 0.46 (0.3-0.3) 1.10 (0.5-0.5) 0.65 (0.5-0.5) 0.98 (0.7-0.7) 0.46 (0.3-0.3) 0.46 (0.3-0.3)\nannotator20 0.73 (0.5-0.5) 0.73 (0.3-0.3) 1.17 (0.7-0.7) 0.46 (0.5-0.5) 0.98 (0.7-0.7) 0.73 (0.5-0.5) 0.56 (0.3-0.3) (d) Dataset Hip and corresponding class labels\nAnnotator Hip Femur L Femur R Hip L Hip R Femur Implant L Femur Implant R\nall 3.35 (2.1-2.1) 3.00 (2.1-2.1) 3.35 (2.1-2.1) 3.35 (2.1-2.1) 3.00 (2.1-2.1) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator01 3.00 (2.1-2.1) 3.00 (2.1-2.1) 3.35 (2.1-2.1) 3.35 (2.1-2.1) 3.00 (2.1-2.1) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator02 3.35 (2.1-2.1) 3.18 (2.1-2.1) 5.41 (3.0-3.0) 4.74 (3.3-3.3) 4.37 (3.0-3.0) 2.12 (1.5-1.5) 2.12 (1.5-1.5)\nannotator03 3.35 (2.1-2.1) 3.35 (1.7-1.7) 4.50 (2.1-2.1) 4.24 (2.1-2.1) 4.74 (3.0-3.0) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator04 3.35 (2.1-2.1) 3.00 (2.1-2.1) 4.24 (2.1-2.1) 5.41 (3.0-3.0) 4.74 (3.0-3.0) 2.12 (1.5-1.5) 2.12 (1.5-1.5)\nannotator05 3.00 (1.5-1.5) 2.12 (2.1-2.1) 3.18 (1.5-1.5) 3.35 (2.1-2.1) 3.00 (1.5-1.5) 1.50 (0.0-0.0) 2.12 (1.5-1.5)\nannotator06 3.35 (2.1-2.1) 3.00 (1.5-1.5) 4.74 (3.4-3.4) 4.37 (2.1-2.1) 3.35 (2.1-2.1) 2.12 (1.5-1.5) 2.12 
(1.5-1.5)\nannotator07 3.18 (1.5-1.5) 3.00 (1.5-1.5) 4.50 (2.1-2.1) 3.35 (2.1-2.1) 3.35 (1.5-1.5) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator08 3.00 (1.5-1.5) 2.12 (1.5-1.5) 3.35 (3.0-3.0) 3.35 (2.1-2.1) 2.12 (1.5-1.5) 2.12 (2.1-2.1) 2.12 (1.5-1.5)\nannotator09 3.35 (2.1-2.1) 3.35 (2.1-2.1) 4.74 (3.3-3.3) 4.24 (3.0-3.0) 4.24 (2.1-2.1) 1.50 (1.5-1.5) 3.35 (2.1-2.1)\nannotator10 4.50 (2.1-2.1) 2.12 (1.5-1.5) 4.50 (3.0-3.0) 6.71 (3.4-3.4) 7.50 (3.4-3.4) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator11 3.35 (2.1-2.1) 3.35 (1.5-1.5) 4.74 (3.4-3.4) 4.74 (3.3-3.3) 3.35 (2.1-2.1) 1.50 (0.0-0.0) 2.12 (1.5-1.5)\nannotator12 3.35 (2.1-2.1) 3.35 (2.1-2.1) 5.41 (3.4-3.4) 3.35 (2.1-2.1) 4.24 (2.3-2.3) 1.50 (1.5-1.5) 1.50 (1.5-1.5)\nannotator13 3.35 (2.1-2.1) 2.12 (2.1-2.1) 4.24 (2.1-2.1) 4.24 (3.0-3.0) 3.35 (2.1-2.1) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator14 3.35 (2.1-2.1) 3.00 (1.5-1.5) 6.35 (3.0-3.0) 3.35 (2.1-2.1) 3.35 (2.1-2.1) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator15 3.35 (1.5-1.5) 2.12 (1.5-1.5) 4.95 (1.5-1.5) 4.74 (3.4-3.4) 4.50 (2.1-2.1) 1.50 (0.0-0.0) 1.50 (1.5-1.5)\nannotator16 3.35 (2.1-2.1) 2.12 (1.5-1.5) 4.24 (2.1-2.1) 5.41 (3.4-3.4) 4.50 (2.8-2.8) 2.12 (1.5-1.5) 2.12 (1.5-1.5)\nannotator17 3.00 (1.5-1.5) 2.12 (1.5-1.5) 3.35 (2.1-2.1) 3.35 (1.5-1.5) 3.00 (2.1-2.1) 2.12 (1.5-1.5) 2.12 (1.5-1.5)\nannotator18 3.35 (1.5-1.5) 2.12 (1.5-1.5) 4.24 (2.1-2.1) 3.35 (2.1-2.1) 4.50 (2.1-2.1) 1.50 (1.5-1.5) 2.12 (1.5-1.5)\nannotator19 3.35 (2.1-2.1) 3.18 (1.5-1.5) 4.24 (3.0-3.0) 5.41 (3.0-3.0) 4.74 (2.1-2.1) 1.50 (1.5-1.5) 1.50 (1.5-1.5)\nannotator20 3.35 (2.1-2.1) 2.74 (2.1-2.1) 3.35 (3.0-3.0) 4.24 (2.1-2.1) 3.35 (2.1-2.1) 1.50 (1.5-1.5) 2.12 (1.5-1.5) Figure C.8: Spatial distribution of mean ∆x and ∆y per annotator per class label. The same-colored (more transparent) ellipse represent\neach annotator's intra-rater consistency (∆x, ∆y).", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 54, + "total_chunks": 71, + "char_count": 9278, + "word_count": 1211, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e288d80f-53b2-47b8-a8a4-bf213436e027", + "text": "(a) Wrist (b) Lower Leg Figure C.9: Examples for center point annotations: Center points with low euclidean distance (mm) (top row) and high values (bottom row)\nper data subset. Black dots are automatically extracted reference annotation, annotators' annotations are color-encoded.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 55, + "total_chunks": 71, + "char_count": 281, + "word_count": 40, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fcfd172c-c2ee-4adb-ba5d-e261792cc011", + "text": "Table C.9 collects detailed results on the median IoU (%). 
Figure C.10 visualizes the spatial distribution\nof the bounding boxes' center point deviations (∆x, ∆y) and the intra-rater consistency (∆w, ∆h) per class label. Table C.9: IoU (%) of human bounding boxes compared to reference bounding boxes measured as median and IQR. (a) Total Average & Dataset Lower Leg and corresponding class labels Annotator Total Lower Leg Tibia Implant Tibia bone\nall 90.56 (83.4-94.5) 90.02 (80.3-80.3) 84.82 (79.1-79.1) 92.92 (89.5-89.5)\nannotator01 91.22 (85.5-85.5) 90.10 (86.5-86.5) 84.82 (79.1-79.1) 92.92 (89.5-89.5)\nannotator02 91.22 (85.9-85.9) 91.68 (81.2-81.2) 75.00 (68.5-68.5) 93.33 (91.8-91.8)\nannotator03 84.27 (73.8-73.8) 79.96 (61.2-61.2) 52.91 (45.4-45.4) 84.06 (79.6-79.6)\nannotator04 93.32 (89.0-89.0) 93.66 (90.2-90.2) 89.29 (80.9-80.9) 94.83 (91.9-91.9)\nannotator05 86.99 (79.1-79.1) 84.33 (77.0-77.0) 72.63 (56.3-56.3) 87.80 (83.5-83.5)\nannotator06 92.62 (86.8-86.8) 93.48 (88.6-88.6) 86.89 (78.8-78.8) 95.29 (92.8-92.8)\nannotator07 92.87 (87.2-87.2) 91.12 (70.7-70.7) 63.89 (49.8-49.8) 93.86 (91.4-91.4)\nannotator08 93.15 (88.5-88.5) 93.33 (88.7-88.7) 84.67 (75.5-75.5) 95.64 (92.8-92.8)\nannotator09 90.69 (85.0-85.0) 92.03 (81.5-81.5) 76.38 (66.4-66.4) 94.60 (92.2-92.2)\nannotator10 93.29 (88.2-88.2) 92.93 (87.2-87.2) 81.59 (70.0-70.0) 94.70 (91.7-91.7)\nannotator11 85.43 (76.0-76.0) 82.85 (70.2-70.2) 64.10 (50.4-50.4) 88.11 (82.5-82.5)\nannotator12 88.42 (80.1-80.1) 84.96 (78.3-78.3) 69.27 (56.5-56.5) 88.89 (84.1-84.1)\nannotator13 90.32 (83.9-83.9) 89.41 (83.5-83.5) 77.66 (68.4-68.4) 93.04 (88.5-88.5)\nannotator14 90.75 (85.1-85.1) 91.66 (82.5-82.5) 74.05 (68.3-68.3) 93.79 (91.8-91.8)\nannotator15 92.22 (87.2-87.2) 91.88 (84.7-84.7) 79.62 (74.5-74.5) 93.87 (91.5-91.5)\nannotator16 79.13 (70.2-70.2) 75.81 (65.1-65.1) 55.88 (48.0-48.0) 82.70 (75.5-75.5)\nannotator17 91.67 (85.3-85.3) 90.69 (83.8-83.8) 81.59 (72.3-72.3) 93.48 (90.6-90.6)\nannotator18 90.61 (84.6-84.6) 89.11 (80.7-80.7) 79.69 (66.1-66.1) 92.30 (89.0-89.0)\nannotator19 91.54 (86.7-86.7) 90.61 (84.3-84.3) 80.00 (72.4-72.4) 93.81 (89.8-89.8)\nannotator20 91.38 (85.4-85.4) 91.55 (80.8-80.8) 72.59 (67.7-67.7) 94.65 (91.6-91.6) (b) Dataset Shoulder and corresponding class labels Annotator Shoulder Humerus R Scapula R Humerus L Scapula L\nall 87.82 (80.5-80.5) 84.63 (81.0-81.0) 93.14 (89.4-89.4) 84.44 (80.5-80.5) 92.27 (89.7-89.7)\nannotator01 88.74 (82.5-82.5) 84.63 (81.0-81.0) 93.14 (89.4-89.4) 84.44 (80.5-80.5) 92.27 (89.7-89.7)\nannotator02 90.62 (85.4-85.4) 87.53 (83.4-83.4) 92.80 (89.1-89.1) 87.00 (83.6-83.6) 92.64 (88.5-88.5)\nannotator03 75.09 (65.1-65.1) 67.35 (62.4-62.4) 86.25 (79.1-79.1) 68.78 (62.0-62.0) 80.45 (73.4-73.4)\nannotator04 91.13 (87.1-87.1) 89.32 (84.9-84.9) 92.58 (90.3-90.3) 89.24 (86.0-86.0) 91.47 (89.2-89.2)\nannotator05 80.64 (71.1-71.1) 71.50 (66.9-66.9) 85.92 (82.5-82.5) 72.87 (67.3-67.3) 85.68 (79.7-79.7)\nannotator06 89.29 (82.8-82.8) 84.15 (77.7-77.7) 93.30 (91.0-91.0) 84.18 (79.9-79.9) 91.21 (86.9-86.9)\nannotator07 91.27 (86.1-86.1) 88.10 (83.3-83.3) 93.94 (91.3-91.3) 89.26 (84.7-84.7) 92.59 (89.4-89.4)\nannotator08 92.61 (88.0-88.0) 89.61 (85.7-85.7) 94.15 (90.4-90.4) 92.05 (87.0-87.0) 94.16 (90.4-90.4)\nannotator09 89.38 (84.0-84.0) 86.57 (80.7-80.7) 93.11 (89.0-89.0) 85.42 (80.0-80.0) 91.04 (87.0-87.0)\nannotator10 92.00 (86.2-86.2) 89.51 (83.9-83.9) 93.96 (91.3-91.3) 87.87 (83.8-83.8) 93.78 (91.0-91.0)\nannotator11 78.22 (70.5-70.5) 72.47 (67.4-67.4) 82.97 (77.1-77.1) 73.86 (69.5-69.5) 81.83 (75.2-75.2)\nannotator12 80.95 (72.6-72.6) 75.11 
(69.8-69.8) 86.13 (80.2-80.2) 77.74 (71.5-71.5) 83.58 (77.5-77.5)\nannotator13 86.67 (80.7-80.7) 81.92 (76.7-76.7) 90.38 (86.0-86.0) 83.61 (78.8-78.8) 90.13 (87.3-87.3)\nannotator14 88.70 (83.5-83.5) 85.95 (81.6-81.6) 90.95 (87.4-87.4) 85.93 (79.6-79.6) 90.79 (87.6-87.6)\nannotator15 92.80 (88.0-88.0) 90.21 (85.5-85.5) 95.16 (92.4-92.4) 91.00 (85.5-85.5) 93.26 (90.2-90.2)\nannotator16 73.83 (66.2-66.2) 72.41 (66.0-66.0) 75.25 (68.4-68.4) 71.87 (63.4-63.4) 75.02 (67.1-67.1)\nannotator17 87.25 (80.4-80.4) 81.88 (77.6-77.6) 90.26 (87.4-87.4) 82.92 (76.1-76.1) 90.24 (85.8-85.8)\nannotator18 88.21 (81.5-81.5) 84.33 (79.6-79.6) 91.81 (88.4-88.4) 82.84 (77.6-77.6) 90.00 (85.9-85.9)\nannotator19 89.29 (83.7-83.7) 87.85 (81.8-81.8) 91.36 (87.8-87.8) 88.04 (82.6-82.6) 89.84 (86.5-86.5)\nannotator20 90.18 (85.2-85.2) 86.46 (83.0-83.0) 93.44 (89.2-89.2) 89.54 (84.2-84.2) 92.17 (86.4-86.4) (c) Dataset Wrist and corresponding class labels\nAnnotator Wrist Capitate Lunate Radius Scaphoid Triquetrum Ulna\nall 92.21 (88.1-88.1) 92.80 (89.2-89.2) 94.09 (91.1-91.1) 94.61 (90.4-90.4) 94.67 (92.1-92.1) 93.00 (89.9-89.9) 91.80 (88.7-88.7)\nannotator01 93.77 (89.9-89.9) 92.80 (89.2-89.2) 94.09 (91.1-91.1) 94.61 (90.4-90.4) 94.67 (92.1-92.1) 93.00 (89.9-89.9) 91.80 (88.7-88.7)\nannotator02 91.17 (87.8-87.8) 91.04 (87.6-87.6) 91.07 (86.4-86.4) 92.35 (88.7-88.7) 91.58 (89.3-89.3) 89.56 (87.6-87.6) 90.31 (88.3-88.3)\nannotator03 90.64 (85.6-85.6) 91.49 (89.0-89.0) 89.29 (84.9-84.9) 91.80 (85.7-85.7) 92.08 (87.5-87.5) 88.67 (82.0-82.0) 87.09 (83.4-83.4)\nannotator04 95.03 (92.6-92.6) 94.72 (92.6-92.6) 95.59 (91.1-91.1) 95.99 (92.3-92.3) 95.66 (93.5-93.5) 94.59 (92.3-92.3) 95.00 (92.6-92.6)\nannotator05 89.20 (85.8-85.8) 89.30 (85.5-85.5) 89.30 (85.2-85.2) 90.57 (86.7-86.7) 89.47 (86.8-86.8) 88.46 (84.8-84.8) 87.96 (83.9-83.9)\nannotator06 94.30 (91.2-91.2) 94.52 (90.7-90.7) 94.58 (90.5-90.5) 94.87 (92.2-92.2) 94.29 (92.3-92.3) 92.88 (90.6-90.6) 94.23 (91.0-91.0)\nannotator07 94.60 (91.8-91.8) 95.28 (93.8-93.8) 94.72 (91.5-91.5) 94.10 (90.6-90.6) 94.88 (93.2-93.2) 94.44 (91.1-91.1) 93.44 (89.6-89.6)\nannotator08 94.59 (91.8-91.8) 95.25 (92.5-92.5) 94.58 (91.2-91.2) 95.24 (93.2-93.2) 94.69 (91.5-91.5) 94.14 (90.5-90.5) 94.37 (92.0-92.0)\nannotator09 91.20 (87.1-87.1) 91.07 (87.4-87.4) 91.44 (86.9-86.9) 91.33 (87.8-87.8) 91.53 (88.1-88.1) 90.84 (86.8-86.8) 90.47 (86.2-86.2)\nannotator10 94.59 (91.3-91.3) 94.29 (91.9-91.9) 94.62 (90.5-90.5) 95.11 (91.7-91.7) 95.18 (92.6-92.6) 94.08 (91.3-91.3) 93.01 (89.2-89.2)\nannotator11 88.35 (83.3-83.3) 87.56 (83.9-83.9) 90.32 (85.6-85.6) 86.98 (80.7-80.7) 89.45 (85.8-85.8) 87.77 (83.9-83.9) 83.17 (79.3-79.3)\nannotator12 92.01 (88.9-88.9) 92.01 (89.6-89.6) 92.24 (89.0-89.0) 92.86 (86.5-86.5) 92.63 (89.6-89.6) 91.54 (88.3-88.3) 89.80 (86.2-86.2)\nannotator13 91.41 (88.8-88.8) 91.15 (88.9-88.9) 91.78 (90.6-90.6) 92.82 (90.1-90.1) 91.42 (90.0-90.0) 90.24 (87.6-87.6) 90.53 (85.3-85.3)\nannotator14 91.23 (87.8-87.8) 90.50 (88.0-88.0) 90.97 (88.0-88.0) 93.24 (90.9-90.9) 91.25 (89.0-89.0) 89.03 (83.9-83.9) 92.11 (88.5-88.5)\nannotator15 92.96 (90.0-90.0) 92.72 (89.7-89.7) 94.12 (89.2-89.2) 94.11 (90.5-90.5) 94.01 (92.0-92.0) 91.93 (89.5-89.5) 92.08 (88.3-88.3)\nannotator16 80.70 (75.7-75.7) 81.46 (77.1-77.1) 80.90 (76.5-76.5) 79.76 (74.9-74.9) 82.49 (77.6-77.6) 79.99 (75.1-75.1) 76.81 (72.7-72.7)\nannotator17 93.54 (90.7-90.7) 93.75 (90.7-90.7) 92.92 (90.6-90.6) 94.49 (91.5-91.5) 93.47 (91.8-91.8) 92.67 (89.9-89.9) 93.41 (90.6-90.6)\nannotator18 92.22 (88.7-88.7) 91.02 (88.4-88.4) 93.06 
(89.2-89.2) 92.87 (88.6-88.6) 92.41 (90.3-90.3) 91.67 (88.6-88.6) 92.26 (86.3-86.3)\nannotator19 93.14 (90.7-90.7) 92.63 (90.4-90.4) 93.72 (91.0-91.0) 93.96 (90.3-90.3) 93.73 (92.0-92.0) 93.12 (90.6-90.6) 91.97 (89.4-89.4)\nannotator20 92.42 (89.0-89.0) 92.88 (90.9-90.9) 92.11 (85.2-85.2) 94.83 (90.7-90.7) 92.42 (90.6-90.6) 91.68 (86.5-86.5) 91.14 (86.0-86.0) (d) Dataset Hip and corresponding class labels\nAnnotator Hip Femur L Femur R Hip L Hip R Femur Implant L Femur Implant R\nall 90.69 (82.1-82.1) 91.59 (88.7-88.7) 90.81 (88.7-88.7) 90.91 (87.6-87.6) 91.66 (87.4-87.4) 67.11 (60.2-60.2) 65.98 (55.8-55.8)\nannotator01 90.19 (81.0-81.0) 91.59 (88.7-88.7) 90.81 (88.7-88.7) 90.91 (87.6-87.6) 91.66 (87.4-87.4) 67.11 (60.2-60.2) 65.98 (55.8-55.8)\nannotator02 91.34 (83.8-83.8) 91.46 (87.9-87.9) 91.04 (85.0-85.0) 93.29 (87.7-87.7) 93.27 (86.7-86.7) 75.00 (70.6-70.6) 77.67 (70.4-70.4)\nannotator03 85.58 (74.7-74.7) 85.71 (81.0-81.0) 84.08 (80.3-80.3) 89.74 (84.8-84.8) 90.03 (83.3-83.3) 60.71 (52.4-52.4) 55.84 (49.1-49.1)\nannotator04 92.00 (85.0-85.0) 91.58 (88.4-88.4) 90.84 (86.3-86.3) 95.00 (89.3-89.3) 93.75 (89.2-89.2) 83.22 (76.7-76.7) 83.33 (77.4-77.4)\nannotator05 89.51 (81.1-81.1) 89.86 (85.9-85.9) 88.89 (84.2-84.2) 91.28 (86.5-86.5) 91.30 (87.0-87.0) 65.24 (59.4-59.4) 65.38 (59.4-59.4)\nannotator06 92.68 (85.9-85.9) 91.30 (86.8-86.8) 92.50 (88.9-88.9) 95.96 (89.7-89.7) 93.41 (89.5-89.5) 74.56 (70.8-70.8) 77.67 (70.2-70.2)\nannotator07 92.42 (84.9-84.9) 94.19 (89.2-89.2) 92.11 (88.7-88.7) 93.42 (89.3-89.3) 92.92 (86.5-86.5) 74.30 (72.9-72.9) 70.64 (62.7-62.7)\nannotator08 90.88 (83.3-83.3) 90.31 (84.0-84.0) 90.87 (84.5-84.5) 92.82 (87.1-87.1) 92.60 (87.3-87.3) 77.73 (74.6-74.6) 75.45 (65.7-65.7)\nannotator09 90.19 (81.4-81.4) 91.44 (88.8-88.8) 90.19 (86.4-86.4) 91.78 (83.7-83.7) 90.16 (83.1-83.1) 83.04 (77.1-77.1) 76.60 (69.3-69.3)\nannotator10 92.86 (85.5-85.5) 93.66 (89.3-89.3) 93.33 (88.7-88.7) 94.12 (89.1-89.1) 94.35 (90.7-90.7) 77.78 (75.0-75.0) 77.24 (70.3-70.3)\nannotator11 89.74 (79.6-79.6) 90.82 (86.7-86.7) 90.91 (86.6-86.6) 91.41 (86.4-86.4) 90.51 (84.0-84.0) 62.63 (60.7-60.7) 60.44 (55.5-55.5)\nannotator12 90.73 (82.9-82.9) 91.89 (85.5-85.5) 90.18 (87.0-87.0) 91.43 (87.4-87.4) 92.63 (86.0-86.0) 75.00 (70.0-70.0) 69.35 (61.7-61.7)\nannotator13 91.87 (83.5-83.5) 91.11 (87.1-87.1) 91.67 (85.5-85.5) 92.86 (87.4-87.4) 93.79 (88.1-88.1) 75.00 (70.2-70.2) 76.39 (59.2-59.2)\nannotator14 91.49 (83.9-83.9) 92.12 (86.6-86.6) 91.54 (88.7-88.7) 93.12 (88.5-88.5) 94.35 (89.1-89.1) 77.73 (66.1-66.1) 77.24 (64.8-64.8)\nannotator15 90.10 (82.4-82.4) 89.31 (82.5-82.5) 89.02 (82.9-82.9) 90.64 (85.8-85.8) 92.51 (88.1-88.1) 81.82 (74.1-74.1) 79.00 (71.2-71.2)\nannotator16 82.36 (71.0-71.0) 86.49 (80.2-80.2) 82.45 (76.2-76.2) 85.87 (79.6-79.6) 82.44 (75.6-75.6) 62.13 (47.6-47.6) 53.61 (42.7-42.7)\nannotator17 92.38 (84.9-84.9) 92.58 (89.8-89.8) 91.55 (89.7-89.7) 94.18 (89.8-89.8) 94.29 (89.5-89.5) 77.78 (74.6-74.6) 77.24 (71.2-71.2)\nannotator18 91.55 (83.4-83.4) 94.59 (88.7-88.7) 90.42 (86.3-86.3) 93.74 (88.3-88.3) 93.67 (86.8-86.8) 81.25 (74.6-74.6) 75.76 (72.6-72.6)\nannotator19 91.56 (85.4-85.4) 91.82 (87.4-87.4) 90.27 (86.7-86.7) 93.30 (89.5-89.5) 93.94 (87.9-87.9) 81.48 (70.8-70.8) 75.93 (70.5-70.5)\nannotator20 90.48 (82.5-82.5) 90.48 (84.2-84.2) 90.90 (86.5-86.5) 92.18 (84.8-84.8) 92.11 (88.2-88.2) 77.06 (74.6-74.6) 76.92 (71.7-71.7) Figure C.10: Spatial distribution of the mean ∆x and ∆y per annotator per class label. 
The same-colored (more transparent) rectangle\nrepresents each annotator's intra-rater consistency (∆w, ∆h). (a) Wrist (b) Lower Leg", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 56, "total_chunks": 71, "char_count": 10682, "word_count": 1208, "chunking_strategy": "semantic" }, { "chunk_id": "19c77f47-9552-468c-bc07-c33880ddd3fe", "text": "Figure C.11: Examples for bounding box annotations: Boxes with high IoU (%) (top row) and low values (bottom row) per data subset. Black dots are automatically extracted reference annotations, annotators' annotations are color-encoded. Inter-rater annotation consistency\nTable C.10 shows the inter-rater variability ranking, ordered from the annotator with the lowest variability relative to all other\nannotators. This ranking is used for the iterative search to determine the threshold of model sensitivity to inter-rater\nvariability. The rows highlighted in bold have been tested in the iterative search.", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 57, "total_chunks": 71, "char_count": 597, "word_count": 85, "chunking_strategy": "semantic" }, { "chunk_id": "04eccdb7-90a8-45e1-aa1a-059df0bde48f", "text": "Table C.10: Ranking of inter-rater variability, measured by averaged euclidean distance (mm), per annotator, starting with the lowest\nvariability. The euclidean distance (mm) is averaged for all comparisons of one annotator to all other annotators. For the combination prompt, the\neuclidean distance, averaged from center point and bounding box analysis, is used for the ranking, because it considers both prompts. Annotators\nhighlighted in bold have been used in the iterative search approach. (a) Center Point (b) Combination Annotator Eucl. distance (mm) Annotator Eucl. distance (mm) IoU (%) annotator15 2.51±5.5 annotator02 1.67±2.8 87.32±7.9 annotator02 2.54±5.9 annotator15 1.68±2.8 88.44±8.3 annotator20 2.68±6.8 annotator05 1.85±3.2 87.58±8.9 annotator01 2.69±6.8 annotator14 1.86±3.2 87.76±8.0 annotator14 2.70±6.8 annotator20 1.86±3.2 87.81±8.1 annotator05 2.73±6.8 annotator01 1.87±3.3 89.08±9.6 annotator17 2.83±7.2 annotator17 1.94±3.4 89.07±8.8 annotator18 2.88±6.9 annotator04 1.94±3.2 89.25±8.2 annotator04 2.93±6.9 annotator06 1.94±3.2 89.35±8.1", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 58, + "total_chunks": 71, + "char_count": 1063, + "word_count": 132, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0b0901bd-a29f-489d-9854-f46fa0dcbdec", + "text": "annotator06 2.94±6.9 annotator18 1.95±3.2 88.64±9.4 annotator08 2.95±7.2 annotator09 1.96±3.1 87.95±8.1 annotator19 2.99±6.6 annotator08 2.00±3.4 88.15±8.8 annotator12 3.01±6.7 annotator19 2.00±3.1 88.66±8.8 annotator03 3.05±7.4 annotator12 2.05±3.2 87.67±7.8 annotator11 3.05±7.0 annotator10 2.09±3.4 89.41±8.6 annotator09 3.06±8.2 annotator11 2.09±3.3 86.73±7.8 annotator16 3.12±7.4 annotator03 2.17±3.4 84.13±9.6 annotator13 3.27±8.5 annotator13 2.20±4.2 89.03±7.6 annotator10 3.27±6.9 annotator16 2.24±3.6 81.06±9.1 annotator07 3.63±7.7 annotator07 2.40±3.6 87.22±11.6 Segmentation performance with reference prompts Table D.11 reports the segmentation performance of all 2D and 3D models, with the selected models (i.e., smallest\nPareto-optimal models) highlighted as gray-shaded cells. This table is an extension of Table 4, where the Pareto-optimal\nmodels per category and prompt type are summarized. The axial slices with the lowest average DSC values (i.e., negative\nexamples) across all 2D models are shown in Figures D.12 - D.15.", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 59, + "total_chunks": 71, + "char_count": 1040, + "word_count": 126, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "476ad65b-2601-4741-a6aa-14ac8b7d0b53", + "text": "Table D.11: Segmentation performance of all 2D and 3D models per prompt type. Gray-shaded cells indicate the smallest 2D and 3D Pareto-optimal models per prompt type. Omitted results (-) mean that the experiment was not\nperformed, since it was not supported (see Table 1).", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
Sánchez", + "Hoel Kervadec" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10541v1", + "chunk_index": 60, + "total_chunks": 71, + "char_count": 272, + "word_count": 44, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0a25aa8c-cbc0-4a31-a7ed-045019c6b6ad", + "text": "Model Bounding Box 2D or 3D Center Point (2D) or (3D) Combination (2D) or (3D) Size DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ (M) (%) (%) (mm) (%) (%) (mm) (%) (%) (mm) Med-SAM 94 66.89±14.7 79.47±11.2 4.59±2.5 - - - - - - MedicoSAM2D 94 90.74±7.7 97.36±3.6 0.76±0.9 77.46±19.3 83.23±18.4 5.00±5.9 91.27±7.4 97.74±3.3 0.69±0.8 SAM-Med2d 271 78.87±13.8 91.43±8.5 2.27±1.8 73.69±17.0 84.48±14.9 5.35±5.0 79.88±13.2 91.59±8.2 2.47±2.1 medical ScribblePrompt-SAM 94 66.23±20.1 78.25±16.8 5.48±4.1 74.19±14.6 84.22±12.6 6.30±5.3 - - - ScribblePrompt-UNet 4 66.14±24.6 79.83±18.2 5.87±4.6 71.15±16.2 80.85±14.7 6.96±5.3 72.46±13.8 83.27±12.0 5.91±4.4 SAM B 94 89.03±9.7 96.89±4.7 1.10±1.4 85.43±14.4 90.82±13.0 4.83±6.3 91.80±8.0 97.84±4.4 1.07±1.6 SAM H 641 90.44±8.8 97.68±3.9 0.84±1.0 81.83±17.7 87.61±17.6 6.32±9.4 91.56±7.7 98.01±3.8 0.78±1.1 SAM L 312 89.53±9.4 97.34±4.2 0.91±1.1 79.34±20.0 84.83±19.8 6.92±10.7 91.41±8.1 97.92±4.3 0.80±1.2 SAM2.1 B+ 81 90.60±8.1 97.84±3.5 0.82±1.0 83.20±16.5 88.87±15.1 7.59±9.7 91.98±7.2 98.21±3.6 0.73±1.1 natural SAM2.1 L 224 88.39±8.7 97.30±3.9 0.92±1.0 81.72±17.4 88.44±16.4 6.60±10.7 90.90±6.9 98.36±3.2 0.69±1.0 SAM2.1 S 46 89.40±8.3 97.43±3.8 0.91±1.0 82.26±15.6 88.46±14.2 6.64±8.4 91.51±7.0 98.40±3.3 0.69±0.9 SAM2.1 T 39 89.57±8.4 97.55±3.8 0.88±1.0 82.12±16.3 88.62±14.8 6.16±8.6 91.83±6.9 98.38±3.2 0.71±1.0 3D Models evaluated volumetric Med-SAM2 39 79.56±11.1 80.25±10.5 13.49±11.1 - - - - - -", + "paper_id": "2603.10541", + "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", + "authors": [ + "Caroline Magg", + "Maaike A. ter Wee", + "Johannes G. G. Dobbe", + "Geert J. Streekstra", + "Leendert Blankevoort", + "Clara I. 
Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 61, "total_chunks": 71, "char_count": 1459, "word_count": 200, "chunking_strategy": "semantic" }, { "chunk_id": "07943c1b-448f-4018-889c-df912f9acfed", "text": "MedicoSAM3D 94 51.78±15.1 52.73±13.9 34.85±13.6 54.39±16.4 53.70±15.4 36.89±17.9 52.16±15.0 53.04±13.9 34.65±14.0 SAM-Med3d-Turbo-crop 101 - - - 25.22±9.0 18.20±7.3 57.92±9.7 - - - SAM-Med3d-Turbo-resample 101 - - - 4.85±5.3 4.20±3.6 121.30±20.9 - - - SAM-Med3d-crop 101 - - - 28.76±8.6 19.91±5.8 52.93±10.2 - - - medical SAM-Med3d-resample 101 - - - 3.52±2.5 3.02±1.6 116.75±20.0 - - - SegVol 181 - - - 33.47±13.5 32.97±12.1 62.53±22.4 - - - Vista3D 218 - - - 25.70±13.1 22.32±11.6 58.14±16.0 - - - nnInteractive 102 76.15±9.3 77.51±9.2 25.36±9.9 69.40±11.2 68.23±12.0 30.98±9.4 75.92±9.4 76.60±9.6 26.53±10.3 SAM2.1 B+ 81 66.11±10.1 66.59±10.0 24.77±18.1 53.38±18.1 50.31±19.6 48.14±29.5 68.33±9.4 67.86±10.2 26.04±18.2\nnatural SAM2.1 L 224 58.98±11.8 57.27±11.4 55.04±30.1 48.41±20.0 44.29±20.9 69.02±34.4 62.42±11.3 59.79±11.4 55.14±29.8\nSAM2.1 S 46 67.69±10.2 68.48±10.0 31.67±21.6 56.90±19.1 53.96±20.2 47.84±31.2 70.22±10.1 69.88±10.7 32.21±22.0\nSAM2.1 T 39 61.87±11.9 63.40±11.0 34.24±22.6 54.74±15.9 52.92±16.9 46.40±28.5 65.89±9.8 66.34±9.8 33.41±21.4 Figure D.12: Axial slice of Wrist with lowest DSC value (69.9%) across 2D models. The predictions are binary and were combined for visualization; as a result, some predicted regions may not appear because each pixel can only be\nassigned a single label. Figure D.13: Axial slice of Lower Leg with lowest DSC value (62.1%) across 2D models. The predictions are binary and were combined for visualization; as a result, some predicted regions may not appear because each pixel can only be\nassigned a single label. Figure D.14: Axial slice of Shoulder with lowest DSC value (75.1%) across 2D models. The predictions are binary and were combined for visualization; as a result, some predicted regions may not appear because each pixel can only be\nassigned a single label. Figure D.15: Axial slice of Hip with lowest DSC value (58.4%) across 2D models.", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 62, "total_chunks": 71, "char_count": 1895, "word_count": 267, "chunking_strategy": "semantic" }, { "chunk_id": "553c83d6-360c-4859-8307-418d6500bcb8", "text": "The predictions are binary and were combined for visualization; as a result, some predicted regions may not appear because each pixel can only be\nassigned a single label.", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. 
Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 63, "total_chunks": 71, "char_count": 170, "word_count": 28, "chunking_strategy": "semantic" }, { "chunk_id": "c6b232e1-b022-4ce6-b8ad-d827dea09e04", "text": "SAM2.1\nComparing SAM2 (released July 29, 2024) and SAM2.1 (released September 29, 2024) showed only marginal differences\nin segmentation performance for the same prompt type and model size (Table E.12). Using the paired Wilcoxon signed-rank test with Bonferroni correction (n = 12), none of the model pairs showed a statistically significant difference on\nany of the three metrics, except for the comparison between SAM2 T and SAM2.1 T prompted with bounding box. Table E.12: Comparison of 2D segmentation performance of all model sizes of SAM2 and SAM2.1 per prompt type.\n↗ indicates that all metrics improve, whereas – denotes no consistent trend across metrics. Asterisk (∗) marks statistically significant differences\nbetween models (p-value < 0.05/12 = 0.0042). Model SAM2 Trend SAM2.1 DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ (%) (%) (mm) (%) (%) (mm) B+ 90.40±8.0 97.81±3.4 0.81±0.9 ↗ 90.60±8.1 97.84±3.5 0.82±1.0 L 88.23±8.8 97.20±4.0 0.93±1.0 ↗ 88.39±8.7 97.30±3.9 0.92±1.0 S 89.06±8.8 97.28±4.0 0.93±1.0 ↗ 89.40±8.3 97.43±3.8 0.91±1.0 T 89.07±8.5 97.39±3.9 0.92±1.0 ↗* 89.57±8.4 97.55±3.8 0.88±1.0 B+ 83.39±16.6 89.12±15.3 7.45±9.9 – 83.20±16.5 88.87±15.1 7.59±9.7 L 78.45±21.2 85.49±21.0 8.30±13.4 ↗ 81.72±17.4 88.44±16.4 6.60±10.7 S 81.51±16.9 87.56±16.3 7.22±9.5 ↗ 82.26±15.6 88.46±14.2 6.64±8.4 T 80.38±18.0 86.84±16.9 7.53±10.9 ↗ 82.12±16.3 88.62±14.8 6.16±8.6", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 64, "total_chunks": 71, "char_count": 1368, "word_count": 201, "chunking_strategy": "semantic" }, { "chunk_id": "567bb7ca-2fe0-4212-b65a-df2e5972a6e3", "text": "B+ 91.82±7.1 98.32±3.3 0.70±1.0 – 91.98±7.2 98.21±3.6 0.73±1.1 L 90.78±7.0 98.28±3.1 0.68±0.9 – 90.90±6.9 98.36±3.2 0.69±1.0 S 91.48±7.1 98.28±3.5 0.71±1.0 – 91.51±7.0 98.40±3.3 0.69±0.9 T 91.33±6.9 98.26±3.3 0.73±1.0 ↗ 91.83±6.9 98.38±3.2 0.71±1.0", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 65, "total_chunks": 71, "char_count": 248, "word_count": 32, "chunking_strategy": "semantic" }, { "chunk_id": "af2e99ed-f872-43d9-9abc-b205125d43f0", "text": "Limited vs. unlimited volume propagation\nSAM2.1 and Med-SAM2 generate volumetric predictions via a memory bank and a propagation mechanism, which\ncan be restricted to known start and/or end slices (see Table 1). 
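The SAM2 vs. SAM2.1 comparison above relies on a paired Wilcoxon signed-rank test with Bonferroni correction (n = 12 comparisons). Below is a minimal sketch of that procedure using scipy; the metric arrays are synthetic placeholders, not the paper's data:

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired per-case DSC values for one model size and prompt type,
# e.g. SAM2 T vs. SAM2.1 T on identical inputs (synthetic here).
rng = np.random.default_rng(0)
dsc_sam2 = rng.normal(89.0, 8.5, size=200)
dsc_sam21 = dsc_sam2 + rng.normal(0.5, 1.0, size=200)

n_comparisons = 12                      # e.g. 4 model sizes x 3 metrics
alpha_corrected = 0.05 / n_comparisons  # Bonferroni: 0.05 / 12 = 0.0042

stat, p = wilcoxon(dsc_sam2, dsc_sam21)  # paired, two-sided by default
print(f"p = {p:.2e}, significant after correction: {p < alpha_corrected}")
```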
Although MedicoSAM3D also employs slice-by-slice\npropagation, the original method does not include a volume restriction for prediction and was therefore not included in\nour analysis. Applying the prediction volume restriction requires knowing the object's top and bottom slices, which adds\ntwo extra annotations to the required input information. However, limiting the propagation yielded better performance\ncompared to unlimited propagation for all models (Table E.13). Table E.13: Comparison of volumetric prediction without (default setting) and with propagation limitation, per prompt type. Model unlimited propagation limited propagation DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ (%) (%) (mm) (%) (%) (mm) Med-SAM2 79.56±11.1 80.25±10.5 13.49±11.1 84.00±7.3 84.03±7.6 7.76±4.8 SAM2.1 B+ 66.11±10.1 66.59±10.0 24.77±18.1 83.47±6.5 84.07±6.9 6.75±4.4 SAM2.1 L 58.98±11.8 57.27±11.4 55.04±30.1 80.97±7.2 80.99±7.4 8.41±6.1 SAM2.1 S 67.69±10.2 68.48±10.0 31.67±21.6 82.70±6.8 84.15±6.8 7.85±6.7 SAM2.1 T 61.87±11.9 63.40±11.0 34.24±22.6 81.50±9.8 83.09±9.3 8.91±9.2 SAM2.1 B+ 53.38±18.1 50.31±19.6 48.14±29.5 69.15±16.7 65.92±19.2 27.68±18.8 SAM2.1 L 48.41±20.0 44.29±20.9 69.02±34.4 67.98±19.8 64.41±22.4 25.60±21.1", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 66, "total_chunks": 71, "char_count": 1424, "word_count": 189, "chunking_strategy": "semantic" }, { "chunk_id": "7a846b1a-756d-4f4a-8cf4-4a2bed1da5de", "text": "SAM2.1 S 56.90±19.1 53.96±20.2 47.84±31.2 70.76±17.4 67.50±19.7 22.96±19.2 SAM2.1 T 54.74±15.9 52.92±16.9 46.40±28.5 73.38±15.5 71.38±17.2 22.13±19.0 SAM2.1 B+ 68.33±9.4 67.86±10.2 26.04±18.2 86.47±4.7 86.87±5.8 6.59±4.2 SAM2.1 L 62.42±11.3 59.79±11.4 55.14±29.8 84.98±5.3 84.44±6.5 8.37±6.2 SAM2.1 S 70.22±10.1 69.88±10.7 32.21±22.0 86.16±5.7 86.77±6.4 7.47±6.0 SAM2.1 T 65.89±9.8 66.34±9.8 33.41±21.4 86.35±5.6 87.23±6.2 6.96±5.7 Single vs. multiple initial slices\nFor medical FMs (Med-SAM2, SegVol, Vista3D, nnInteractive), using multiple initial slices improved the performance\nfor all prompt types, whereas for SAM2.1 models (except SAM2.1 L box-prompted), the performance was better for a\nsingle initial slice (Table E.14). nnInteractive box-prompted outperformed Med-SAM2, which was the Pareto-optimal\nmodel for the default settings (i.e., single initial slice). Using the paired Wilcoxon signed-rank test with Bonferroni\ncorrection (n = 18), all model pairs showed a statistically significant difference in all three metrics, except for SAM2.1\nL and SegVol. Table E.14: Comparison of volumetric prediction with a single initial slice (default setting) or all initial slices, per prompt type.\n↗ indicates that all metrics improve, whereas ↘ indicates that all metrics deteriorate. Asterisk (∗) marks statistically significant differences between\nmodels. 
Model 1 initial slice Trend NS initial slices\nDSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ (%) (%) (mm) (%) (%) (mm) Med-SAM2 84.00±7.3 84.03±7.6 7.76±4.8 ↗* 86.57±6.3 87.54±6.3 4.75±3.1 SAM2.1 B+ 66.11±10.1 66.59±10.0 24.77±18.1 ↘* 59.80±9.0 60.19±7.2 38.10±20.4 SAM2.1 L 58.98±11.8 57.27±11.4 55.04±30.1 ↗ 60.01±11.9 59.17±10.5 51.21±31.3", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 67, "total_chunks": 71, "char_count": 1692, "word_count": 228, "chunking_strategy": "semantic" }, { "chunk_id": "0bb1696a-f413-44cb-933d-6b1c896bb8a6", "text": "SAM2.1 S 67.69±10.2 68.48±10.0 31.67±21.6 ↘* 60.84±9.2 60.10±8.1 54.76±25.9 SAM2.1 T 61.87±11.9 63.40±11.0 34.24±22.6 ↘* 55.64±8.9 57.06±8.2 46.79±24.4 nnInteractive 76.15±9.3 77.51±9.2 25.36±9.9 ↗* 90.02±5.8 92.08±5.9 2.69±1.8 SAM2.1 B+ 53.38±18.1 50.31±19.6 48.14±29.5 ↘* 41.87±13.8 40.96±12.6 57.80±16.3 SAM2.1 L 48.41±20.0 44.29±20.9 69.02±34.4 ↘ 38.92±22.1 37.36±20.2 74.30±38.0 SAM2.1 S 56.90±19.1 53.96±20.2 47.84±31.2 ↘* 44.84±16.8 42.52±14.3 71.08±29.5 SAM2.1 T 54.74±15.9 52.92±16.9 46.40±28.5 ↘* 42.53±15.1 43.22±14.0 62.72±30.3 SegVol 33.47±13.5 32.97±12.1 62.53±22.4 ↗ 38.32±14.2 37.42±14.2 19.86±8.4 Vista3D 25.70±13.1 22.32±11.6 58.14±16.0 ↗* 44.98±14.8 35.88±12.9 28.00±14.7 nnInteractive 69.40±11.2 68.23±12.0 30.98±9.4 ↗* 85.67±7.1 82.89±9.7 4.44±2.7 SAM2.1 B+ 68.33±9.4 67.86±10.2 26.04±18.2 ↘* 60.65±8.4 60.91±7.1 39.73±20.4 SAM2.1 L 62.42±11.3 59.79±11.4 55.14±29.8 ↘ 62.37±11.3 62.10±10.8 50.94±30.1 SAM2.1 S 70.22±10.1 69.88±10.7 32.21±22.0 ↘* 61.01±8.7 60.67±8.1 56.47±25.8 SAM2.1 T 65.89±9.8 66.34±9.8 33.41±21.4 ↘* 58.48±8.4 59.77±7.5 46.36±24.8 nnInteractive 75.92±9.4 76.60±9.6 26.53±10.3 ↗* 89.81±5.2 91.37±6.3 2.70±1.7", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 68, "total_chunks": 71, "char_count": 1148, "word_count": 130, "chunking_strategy": "semantic" }, { "chunk_id": "176b2fa2-a4bd-454f-bfe8-aa81bc520ccb", "text": "Single vs. multiple prompts\nThe support for multiple prompts varies for 3D models, with more models supporting multiple points (see Table 1). The\nmultiple prompt setting was equivalent to the default setting for 2D models. Comparing the volumetric segmentation\nperformance for single vs. multiple prompts per prompt type showed only marginal differences per model (Table E.15). Using the paired Wilcoxon signed-rank test with Bonferroni correction (n = 6 for bounding box, n = 24 for center point),\nonly MedicoSAM3D showed a statistically significant difference in all three metrics. Table E.15: Comparison of volumetric prediction with a single (default setting) or multiple (up to 5) prompts, per prompt type. 
↗ indicates\nthat all metrics improve, ↘ indicates that all metrics deteriorate, whereas – denotes no consistent trend across metrics. An asterisk (*) marks\nstatistically significant differences between models. Model 1 prompt Trend up to 5 prompts DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ (%) (%) (mm) (%) (%) (mm) MedicoSAM3D 51.78±15.1 52.73±13.9 34.85±13.6 ↘* 51.63±15.1 52.59±14.0 35.15±13.9 nnInteractive 76.15±9.3 77.51±9.2 25.36±9.9 – 76.63±8.7 78.09±8.6 25.26±9.7 MedicoSAM3D 54.39±16.4 53.70±15.4 36.89±17.9 ↘* 54.11±16.5 53.45±15.6 37.32±18.3 SAM2.1 B+ 53.38±18.1 50.31±19.6 48.14±29.5 – 53.27±18.6 50.28±20.1 47.91±29.8 SAM2.1 L 48.41±20.0 44.29±20.9 69.02±34.4 – 48.43±20.2 44.35±21.1 68.72±34.2 SAM2.1 S 56.90±19.1 53.96±20.2 47.84±31.2 ↘ 56.76±19.4 53.83±20.6 48.38±31.7", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 69, "total_chunks": 71, "char_count": 1487, "word_count": 211, "chunking_strategy": "semantic" }, { "chunk_id": "f10585bf-cd1b-464a-9e0d-de71e50da721", "text": "SAM2.1 T 54.74±15.9 52.92±16.9 46.40±28.5 ↘ 54.23±16.5 52.37±17.4 47.53±29.2 SegVol 33.47±13.5 32.97±12.1 62.53±22.4 – 33.63±13.4 33.12±12.0 62.90±22.6 Vista3D 25.70±13.1 22.32±11.6 58.14±16.0 ↘ 25.63±13.0 22.31±11.5 58.34±16.1 nnInteractive 69.40±11.2 68.23±12.0 30.98±9.4 ↗ 69.66±10.8 68.50±11.6 30.80±9.2", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 70, "total_chunks": 71, "char_count": 307, "word_count": 33, "chunking_strategy": "semantic" }, { "chunk_id": "e7b0efef-acc7-4ecf-8d84-8ae577e817d3", "text": "Comparison of segmentation with reference and human prompts Table F.16 shows the average difference in the performance of FMs prompted with reference and human prompts. The paired Wilcoxon signed-rank test showed a statistically significant difference for the overall comparison of 2D and\n3D models, with p-values smaller than the Bonferroni-corrected α-value (0.05/6 = 0.0083). Table F.16: Difference in segmentation performance between reference and human prompts, per prompt type. The models with the least difference per prompt type are highlighted in bold. The selected models are the smallest Pareto-optimal models prompted\nwith reference prompts per category highlighted in bold in Table 4.", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. 
Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 71, "total_chunks": 71, "char_count": 694, "word_count": 100, "chunking_strategy": "semantic" }, { "chunk_id": "f577a987-3563-4f95-8462-8ef5b8418841", "text": "Model Bounding Box 2D or 3D Center Point (2D) or (3D) Combination (2D) or (3D) Size DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ DSC ↑ NSD ↑ HD95 ↓ (M) (%) (%) (mm) (%) (%) (mm) (%) (%) (mm) 2D Models\nmedical MedicoSAM2D 94 3.39 ± 6.3 1.41 ± 4.4 -0.41 ± 1.1 1.24 ± 6.3 0.19 ± 3.6 -0.05 ± 2.4 3.47 ± 6.3 2.15 ± 5.3 -0.62 ± 1.5\nScribblePrompt-SAM 94 - - - 1.33 ± 5.5 0.28 ± 3.5 -0.16 ± 2.2 - - - SAM B 94 - - - 1.39 ± 5.5 0.13 ± 2.9 0.16 ± 3.2 - - - SAM2.1 B+ 81 2.05 ± 5.8 0.92 ± 3.7 -0.31 ± 1.0 - - - - - - natural\nSAM2.1 T 39 - - - - - - 1.64 ± 5.2 0.99 ± 4.1 -0.36 ± 1.2 Average per prompt type 2.72 ± 6.1 1.16 ± 4.1 -0.36 ± 1.1 1.32 ± 5.8 0.20 ± 3.3 -0.02 ± 2.6 2.56 ± 5.8 1.57 ± 4.8 -0.49 ± 1.4 Average 2D Models 2.07 ± 1.0 % DSC (p < 0.001) 0.87 ± 0.7 % NSD (p < 0.001) -0.25 ± 0.3 mm HD95 (p < 0.001) 3D Models evaluated volumetric\nmedical Med-SAM2 39 - - - - - - - - -\nnnInteractive 102 76.80±13.5 79.27±11.2 14.46±11.8 68.12±12.6 68.63±11.5 30.10±8.8 75.59±10.6 77.29±9.1 25.65±9.5\nnatural SAM2.1 S 46 65.93±11.6 67.83±10.2 32.71±21.6 53.72±16.3 52.93±16.5 46.84±27.8 68.80±11.2 69.19±10.9 33.88±22.4\nSAM2.1 T 39 - - - - - - - - - Average per prompt type 1.76 ± 5.8 0.96 ± 4.8 -0.89 ± 10.2 0.80 ± 7.4 0.07 ± 6.6 0.20 ± 7.8 0.63 ± 4.5 0.40 ± 4.2 -0.48 ± 8.2 Average 3D Models 1.06 ± 0.7 % DSC (p < 0.001) 0.47 ± 0.6 % NSD (p < 0.001) -0.39 ± 0.7 mm HD95 (p < 0.001)", "paper_id": "2603.10541", "title": "Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation", "authors": [ "Caroline Magg", "Maaike A. ter Wee", "Johannes G. G. Dobbe", "Geert J. Streekstra", "Leendert Blankevoort", "Clara I. Sánchez", "Hoel Kervadec" ], "published_date": "2026-03-11", "primary_category": "", "arxiv_url": "http://arxiv.org/abs/2603.10541v1", "chunk_index": 72, "total_chunks": 71, "char_count": 1325, "word_count": 276, "chunking_strategy": "semantic" } +] \ No newline at end of file
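Tables D.11 and F.16 refer to the "smallest Pareto-optimal models", which trade off parameter count against segmentation quality. Below is a minimal sketch of one way to read that selection rule, assuming a model is Pareto-optimal when no alternative is both smaller (or equal in size) and at least as accurate; the size/DSC pairs are copied from Table D.11 (3D models, bounding box prompt) for illustration, and the rule is our reading, not the authors' published selection code:

```python
def pareto_front(models):
    # `models` maps name -> (size in M parameters, DSC in %).
    # Scan from smallest to largest; a model joins the front only if it
    # beats the best DSC among all smaller (or equal-sized) models.
    best_dsc = float("-inf")
    front = []
    for name, (size, dsc) in sorted(models.items(),
                                    key=lambda kv: (kv[1][0], -kv[1][1])):
        if dsc > best_dsc:
            front.append(name)
            best_dsc = dsc
    return front

# Numbers from Table D.11, 3D bounding-box column (DSC only).
models = {
    "Med-SAM2": (39, 79.56), "SAM2.1 T": (39, 61.87), "SAM2.1 S": (46, 67.69),
    "SAM2.1 B+": (81, 66.11), "MedicoSAM3D": (94, 51.78), "nnInteractive": (102, 76.15),
}
print(pareto_front(models))  # -> ['Med-SAM2']: it dominates on both axes here
```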