diff --git "a/data/chunks/2603.10731_semantic.json" "b/data/chunks/2603.10731_semantic.json"
new file mode 100644
--- /dev/null
+++ "b/data/chunks/2603.10731_semantic.json"
@@ -0,0 +1,1294 @@
[
  {
    "chunk_id": "6bbe4d41-a34c-4adf-a3e2-a7f32cd28e57",
    "text": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks\nSanne Ruijs (a), Alina Kosiakova (a), Farrukh Javed (b,∗)\n(a) Department of Economics, Lund University, Sweden\n(b) Department of Statistics, Lund University, Sweden",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 0,
    "total_chunks": 68,
    "char_count": 238,
    "word_count": 28,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e0bec0b4-6fc8-4989-865d-a0533465c420",
    "text": "Deep neural networks (DNNs) have become integral to a wide range of scientific and practical applications due to their flexibility and strong predictive performance. Despite their accuracy, however, DNNs frequently exhibit poor calibration, often assigning overly confident probabilities to incorrect predictions. This limitation underscores the growing need for integrated mechanisms that provide reliable uncertainty estimation. In this article, we compare two prominent approaches for uncertainty quantification: a Bayesian approximation via Monte Carlo Dropout and the nonparametric Conformal Prediction framework. Both methods are assessed using two convolutional neural network architectures, H-CNN VGG16 and GoogLeNet, trained on the Fashion-MNIST dataset. 
The empirical results show that although H-CNN VGG16 attains higher predictive accuracy, it tends to exhibit pronounced overconfidence, whereas GoogLeNet yields better-calibrated uncertainty estimates. Conformal Prediction additionally demonstrates consistent validity by producing statistically guaranteed prediction sets, highlighting its practical value in high-stakes decision-making contexts. Overall, the findings emphasize the importance of evaluating model performance beyond accuracy alone and contribute to the development of more reliable and trustworthy deep learning systems.\nKeywords: Uncertainty estimation, Conformal Prediction, deep learning, Bayesian inference, Monte Carlo Dropout, Model Calibration.\nDeep neural networks (DNNs) have become a cornerstone of modern machine learning, owing to their ability to model complex data structures and their broad applicability across scientific domains such as medical imaging, robotics, and earth observation (Gawlikowski et al., 2023). Despite their impressive performance, DNNs are often regarded as \"black boxes\" due to their limited interpretability and the difficulty of aligning their internal representations with human reasoning (Roth and Bajorath, 2024). This lack of transparency is particularly concerning in safety-critical applications. 
Furthermore, research has shown\n∗Corresponding author. Email addresses: Sanneruys2@gmail.com (Sanne Ruijs), alina.kosiakova@student.lu.se (Alina Kosiakova), farrukh.javed@stat.lu.se (Farrukh Javed)",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 1,
    "total_chunks": 68,
    "char_count": 2313,
    "word_count": 284,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "5589d3a7-4ea9-4961-aaf1-b93ccf3f0ce1",
    "text": "that neural networks frequently exhibit overconfidence, assigning high-probability predictions even when they are incorrect (Nguyen et al., 2015). A key limitation of conventional DNNs is their reliance on point estimates without providing any quantification of uncertainty. As a result, estimating predictive uncertainty becomes essential for assessing the reliability and robustness of model outputs, particularly in high-risk or decision-sensitive scenarios. To address the lack of built-in uncertainty estimation, several methods have been proposed. One widely used technique is Bayesian inference with Monte Carlo (MC) Dropout, which approximates a posterior distribution while maintaining computational efficiency (see, for example, Gal and Ghahramani (2016) and Son and Seok (2026)). Alternatively, Conformal Prediction offers statistically valid prediction sets without requiring assumptions about the underlying data distribution (Angelopoulos and Bates, 2022). These approaches represent two fundamentally different paradigms: Bayesian methods are probabilistic and integrated during model training and inference, whereas Conformal Prediction is a post-hoc method that can be applied to any pre-trained model. 
These two methods have been selected due\nto their inherent differences, which offer valuable comparisons, as well as their growing popularity in uncertainty estimation. As deep learning research grows, uncertainty quantification (UQ) has become increasingly\nrelevant. Between 2010 and 2020, over 2,500 papers were published on UQ in various fields\n(see, for example, (Abdar et al., 2021) and (Ferchichi et al., 2025) for review).", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 2, + "total_chunks": 68, + "char_count": 1650, + "word_count": 221, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "bf5a8791-0873-4380-bafb-9dab54c4aaaa", + "text": "Yet, few\nstudies offer a systematic comparison of Bayesian and Conformal approaches, particularly\nacross diverse neural network architectures. Additionally, the relationship between accuracy\nand uncertainty remains ambiguous (Roth and Bajorath, 2024). High classification accuracy\ndoes not necessarily imply trustworthy predictions, as models often remain overconfident\neven when incorrect. 
Therefore, UQ should play a more central role in model evaluation.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 3, + "total_chunks": 68, + "char_count": 457, + "word_count": 59, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c924c7b7-e974-480c-a34e-6a4f30c85bc6", + "text": "The CNN architectures were chosen based on their strong empirical performance in\nprior studies and complementary design characteristics. H-CNN VGG16 was selected for\nits demonstrated effectiveness on the Fashion-MNIST dataset, particularly in distinguishing between visually similar clothing categories. Its hierarchical architecture helps reduce\nmisclassification of ambiguous classes and enhances model interpretability. GoogLeNet, by\ncontrast, adopts an inception-based design that processes features through parallel convolutional paths. This architecture achieves high accuracy while being more parameter-efficient\nthan H-CNN VGG16, making it a computationally attractive alternative without compromising performance. This study aims to fill this gap by conducting a comparative analysis of two uncertainty\nestimation methods: Bayesian inference via MC Dropout and Conformal Prediction across\ntwo convolutional neural network architectures: H-CNN VGG16 and GoogLeNet. Beyond\ntheir overall performance, the analysis investigates the behavior of predictive uncertainty\nat multiple levels, including the decomposition of uncertainty and its manifestation in ambiguous class predictions. The study contributes to a more comprehensive understanding\nof model reliability and advances the interpretability and trustworthiness of deep learning\nmodels. 
The Fashion-MNIST dataset is used throughout, offering a standardized image-classification benchmark that permits a precise comparison of the selected methods across model architectures without confounding effects from data inconsistencies.\nMain Contributions\n• Comparative study of Bayesian MC Dropout and Conformal Prediction in neural networks\n• Uses uncertainty to expose class ambiguity to enhance decision making\n• Shows MC Dropout limits in overfitting-prone deep hierarchical architectures\n• Demonstrates complementary strengths of Bayesian MC Dropout and Conformal Prediction\n• Highlights potential of uncertainty estimation with minimal computational cost\nIn summary, this article contributes to a more comprehensive understanding of predictive reliability in deep learning.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 4,
    "total_chunks": 68,
    "char_count": 2125,
    "word_count": 272,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8a0aaba8-d772-4301-86b9-a6fcb52e6d2c",
    "text": "By examining both Bayesian and Conformal frameworks across distinct architectures (VGG16 and GoogLeNet), the study advances the development of models that are not only accurate but also transparent and trustworthy. This section reviews the importance of uncertainty quantification (UQ) in deep learning and examines two leading methods: Bayesian approximation via Monte Carlo Dropout (MC Dropout) and Conformal Prediction (CP). It also explores prior research on the Fashion-MNIST dataset as a benchmarking tool for neural network models and describes the rationale behind selecting the CNN architectures used in the article. 
Uncertainty Estimation in Deep Neural Networks\nDeep learning models, particularly CNNs, have achieved significant success in image classification tasks, including high-risk domains such as medical diagnostics and autonomous systems (Angelopoulos and Bates, 2022). Despite their high predictive accuracy, these models are often overconfident, producing unreliable probability estimates (Guo et al., 2017; Gawlikowski et al., 2023). This miscalibration undermines trust in critical applications, where understanding a model's uncertainty is essential. As noted by Pocevičiūtė (2023), deep neural networks typically rely on the softmax output to estimate \"confidence,\" which reflects the conditional probability assigned to each class. However, as Guo et al. (2017) demonstrate, these softmax-derived probabilities are often poorly calibrated and do not match the true likelihood of correctness. As a result, relying on softmax outputs as a measure of model confidence can be misleading. Thus, there is a growing demand for methods that estimate predictive uncertainty in a statistically sound and interpretable manner. Uncertainty quantification serves several essential roles: First, it allows practitioners to defer uncertain cases to human experts (Papadopoulos, 2008), enhances interpretability by highlighting ambiguous samples (Lu et al., 2022), and supports robust decision-making in high-risk contexts. Second, it offers additional insight into the model's behaviour and aids in the interpretation of deep learning methods; for instance, analysing the size and composition of prediction sets can help identify ambiguous inputs or systematically difficult classes (Lu et al., 2022). These capabilities make uncertainty estimation an essential component in deploying machine learning models responsibly in high-risk environments. 
Uncertainty Estimation Methods\nIn this section, we provide a review of two uncertainty estimation methods and describe\nprevious research regarding their applicability to neural networks. Conformal Predictions\nCP is a model-agnostic technique that provides prediction sets with a guaranteed error\nrate under minimal assumptions (Shafer and Vovk, 2008; Fontana et al., 2023).", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 5, + "total_chunks": 68, + "char_count": 2834, + "word_count": 388, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2c61dc35-1341-4913-b4dc-c4b92cf6782d", + "text": "This includes classification and regression methods, such as support vector machines, decision trees and neural networks (Shafer and Vovk, 2008). This versatility directly allows\nthe method to be easily implemented on large datasets and deep, complex models without\naltering the structure of the underlying architectures. 
CP has also been integrated with various CNNs, including VGG16 and ResNet, and successfully used in applications ranging from facial recognition (Matiz and Barner, 2019) to skin lesion diagnosis (Lu et al., 2022).",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 7,
    "total_chunks": 68,
    "char_count": 535,
    "word_count": 78,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4406e596-c4b6-4711-9612-1403de5a328e",
    "text": "Its simplicity and statistical guarantees make it an appealing option for uncertainty quantification, especially when Bayesian priors are difficult to specify.\nBayesian Approximation Using Monte Carlo Dropout\nBayesian methods can effectively capture two types of uncertainty in data modelling, namely aleatoric and epistemic uncertainty (Kendall, 2019). The first, aleatoric uncertainty, captures the natural noise present in observations, such as measurement variability, that remains constant regardless of how much data we collect. The second, epistemic uncertainty, represents our incomplete understanding of the model itself. Unlike aleatoric uncertainty, this model uncertainty diminishes as we gather more training data, reflecting our growing confidence in the learned parameters. While the former is irreducible, the latter can be reduced by improving the model learned by the neural network and introducing more data (Gawlikowski et al., 2023; Essbai et al., 2024). While Bayesian inference is conceptually straightforward, neural networks are often non-linear and high-dimensional, making exact inference computationally infeasible and the resulting posterior distribution intractable (Kendall, 2019; Sun, 2022). 
MC Dropout, introduced by Gal and Ghahramani (2016), addresses this limitation by approximating Bayesian inference: dropout remains active at inference time, so each forward pass samples a different weight configuration from an approximate posterior distribution, thereby capturing model uncertainty. Gal and Ghahramani (2016) showed that the model is effectively regularised, as averaging multiple sampled weight configurations reduces variance and discourages over-dependence on specific parameters. The training process remains unchanged, yet the approach is scalable compared to other Bayesian inference methods. MC Dropout has been used in various domains, confirming its scalability and robustness. For example, in medical imaging, Eaton-Rosen et al. (2018) performed uncertainty quantification of brain tumour image segmentation on the ResNet architecture. Gal et al. (2017) demonstrated the effectiveness of MC Dropout by evaluating it on both MNIST and dermoscopic lesion image datasets using the VGG16 CNN architecture. Their approach incorporated MC Dropout to approximate predictive distributions and quantify predictive uncertainty during inference.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 8,
    "total_chunks": 68,
    "char_count": 2436,
    "word_count": 324,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "132f91b7-e6d5-4785-af9c-1d6ee69f947e",
    "text": "Similarly, in soil analysis, Padarian et al. 
(2022) applied a 2D convolutional neural network with MC Dropout to soil spectral data for predicting its properties while assessing prediction uncertainty.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 9,
    "total_chunks": 68,
    "char_count": 202,
    "word_count": 29,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "28dd6cec-e282-4bbf-8497-122522fe3a9d",
    "text": "Their research specifically focused on evaluating how uncertainty quantification methods perform when testing data differs significantly from training distributions. The results revealed that MC Dropout provides substantially wider and more reliable prediction intervals for out-of-domain data compared to alternative methods such as bootstrapping. These findings demonstrate the broad applicability and robustness of MC Dropout, establishing it as a suitable choice for Bayesian uncertainty quantification in neural network applications.\nComparison of Bayesian Inference and Conformal Predictions\nThe two methods, Bayesian inference and Conformal Prediction, adopt fundamentally different strategies for uncertainty estimation. Bayesian inference is a probabilistic framework that relies on prior assumptions about the data distribution and model parameters. While Monte Carlo (MC) Dropout offers an efficient approximation of the posterior distribution (Kendall and Gal, 2017), achieving proper calibration of this distribution remains a challenge. In contrast, Conformal Predictions are nonparametric and distribution-free, providing finite-sample guarantees on coverage without requiring strong model assumptions (Sun, 2022). 
One key advantage of Conformal Prediction is its flexibility: it can be applied\npost hoc to any trained model, including deep neural networks, without modifying the underlying architecture. Moreover, it is computationally more efficient and easier to implement\nthan Bayesian methods (Angelopoulos and Bates, 2022). However, its primary limitation lies\nin being overly conservative, often resulting in unnecessarily wide prediction sets or intervals\n(Fan et al., 2024).", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 10, + "total_chunks": 68, + "char_count": 1698, + "word_count": 221, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8e5e5451-cf95-4378-a1b3-fdd1aaa2cfe0", + "text": "Despite the widespread application of both methods across deep learning and convolutional neural network (CNN) architectures, few studies have performed direct comparisons of\ntheir performance on shared benchmarks. Furthermore, little attention has been paid to how\narchitectural design affects the behavior of uncertainty estimation techniques. For instance,\nFontes (2023) examined uncertainty quantification in binary classification using logistic regression models on small-scale datasets. Their evaluation, based on F1 scores at different\nuncertainty thresholds (top 25%, 50%, and 75% most confident predictions), revealed that\nwhile Conformal Predictions required reducing the training set size, they provided valid prediction sets. Meanwhile, the Bayesian approach offered greater flexibility by capturing both\nepistemic and aleatoric uncertainty, albeit at the cost of higher computational complexity. Similarly, Liang et al. 
(2024) compared MC Dropout, CP, and ensemble methods for\nquantifying uncertainty in neural operators, specialized networks used for solving complex\npartial differential equations. Their results demonstrated that Conformal Prediction yielded\nmore reliable confidence intervals with theoretical coverage guarantees. In another study,\nKhurjekar and Gerstoft (2023) evaluated the statistical validity of uncertainty intervals produced by Conformal Prediction and MC Dropout in the context of direction-of-arrival estimation. While Conformal Prediction consistently met the required coverage levels, MC\nDropout failed to do so, underscoring its limitations in providing calibrated uncertainty\nestimates. Furthermore, Li et al. (2025) investigated pixel-level, sample-level and overall\nuncertainty evaluation for medical image segmentation.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 11, + "total_chunks": 68, + "char_count": 1767, + "word_count": 226, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "08926d40-231a-48cc-bd87-01164392091c", + "text": "One of the methods employed was\nMC Dropout, demonstrating a good balance between reliability and accuracy, which is easy\nto implement. Overall, these findings highlight the complementary strengths and weaknesses of both\napproaches and emphasize the need for systematic evaluations, particularly in relation to\nneural network architecture and application domain. This article addresses this gap in\nthe current literature by evaluating both uncertainty estimation methods on a common\nbenchmark. 
The review revealed no prior systematic comparison of the two techniques,\nnor an assessment of how CNN architectural differences influence uncertainty quantification\noutcomes. The novelty of this article lies in its comparative perspective and the interpretation\nof how model design affects uncertainty behavior in deep neural networks. Fashion-MNIST and Prior Research\nThis section outlines the rationale for selecting Fashion-MNIST as the benchmarking\ndataset for this study and highlights its widespread use in deep learning research. It also\nreviews prior work that applied the dataset in image classification tasks, particularly with\nconvolutional neural networks (CNNs). Benchmarking Fashion-MNIST with Deep Learning Models\nFashion-MNIST is a publicly available dataset introduced by Xiao et al. (2017) as a\nmore challenging alternative to the classic MNIST digit dataset. Designed for benchmarking\nmachine learning algorithms, it has become a standard case study for image classification\ntasks. Upon release, Fashion-MNIST was benchmarked using several traditional classifiers,\nsuch as Gradient Boosting (88.0% accuracy), K-Nearest Neighbours (85.4%), and Support\nVector Classifiers (89.7%). Since then, it has been widely adopted in the deep learning\ncommunity, especially for evaluating convolutional neural networks, which leverage layered\narchitectures to learn spatial hierarchies from edges to complex textures (Bbouzidi et al.,\n2024). Numerous CNN architectures have been applied to Fashion-MNIST with consistently high\nperformance. LeNet, one of the earliest CNN models, consists of seven layers and achieved\n90.14% accuracy on this dataset (Vives-Boix and Ruiz-Fernández, 2021). AlexNet, with its\ndeeper architecture and use of larger filters and non-linear activations (Krizhevsky et al.,\n2012), improved performance to 91.19% (Vives-Boix and Ruiz-Fernández, 2021). 
VGG-type\narchitectures introduced by Simonyan and Zisserman (2015), notably VGG16 and VGG19,\nwith 16 and 19 layers respectively, enabled extraction of more complex features and reached\naccuracies of 92.89% and 92.90% on Fashion-MNIST (Seo and Shin, 2019).", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 12, + "total_chunks": 68, + "char_count": 2631, + "word_count": 363, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "fc9bb9fc-951d-4b8c-8b3c-d6992a735352", + "text": "These results were\nfurther supported by Vives-Boix and Ruiz-Fernández (2021), who reported 92.45% accuracy\nusing VGG16. Due to this strong performance, VGG-type models are now commonly used\nas baselines for evaluating newer CNN architectures on Fashion-MNIST. Over time, CNN research has shifted toward increasingly deeper and more complex architectures (Krichen, 2023). As noted by Alzubaidi et al. (2021), shallow networks often\nstruggle to capture hierarchical patterns in high-dimensional data, limiting their effectiveness. This trend has motivated a growing focus on balancing computational cost with model\naccuracy, which is a trade-off that remains central to architecture selection in modern CNN\nresearch. 
Rationale for Architecture Selection\nThis study employs two high-performing convolutional architectures: H-CNN VGG16 (Seo and Shin, 2019) and GoogLeNet (Szegedy et al., 2015; Bougareche et al., 2022; Vives-Boix and Ruiz-Fernández, 2021), chosen for their strong empirical results and complementary design characteristics.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 13,
    "total_chunks": 68,
    "char_count": 1037,
    "word_count": 142,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "560f5993-697a-43ef-806a-8269a0a8fe8a",
    "text": "Seo and Shin (2019) proposed a Hierarchical Convolutional Neural Network (H-CNN) to address the challenge of misclassifying visually similar apparel categories such as shirts, T-shirts, and coats. The model leverages a hierarchical structure, first classifying broad categories (e.g., tops and bottoms) and subsequently refining predictions to more specific labels. To validate the approach, the H-CNN structure was applied to VGG16 and VGG19, resulting in a notable accuracy gain. Specifically, the VGG16-based H-CNN achieved 93.52% accuracy, approximately 1% higher than standard VGG16. The architecture also incorporated dropout regularization and loss weight scheduling, encouraging the model to focus gradually from general to finer-grained categories. This helped mitigate early overfitting and improved the model's ability to resolve class ambiguities. GoogLeNet (also known as Inception V1), introduced by Szegedy et al. (2015), is a prominent deep learning architecture that differs from traditional sequential designs by using parallel convolutional paths. 
Instead of stacking layers linearly, GoogLeNet employs\nInception modules, which apply multiple convolutional filters of varying sizes within the\nsame layer.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 14, + "total_chunks": 68, + "char_count": 1223, + "word_count": 166, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2ad6ab60-ce4f-482d-b049-f18f8b7d5a0d", + "text": "The outputs of these parallel operations are then concatenated to form the\nmodule's final output (Bbouzidi et al., 2024). This design improves computational efficiency\nand helps mitigate issues such as vanishing gradients, contributing to more stable training\n(Janjua et al., 2023). GoogLeNet also uses dropout prior to the final fully connected layer\nto reduce overfitting (Szegedy et al., 2015). When applied to Fashion-MNIST, GoogLeNet\nachieved high accuracy; 93.75% according to Bougareche et al. (2022), and 91.89% in Seo\nand Shin (2019). Despite being shallower than VGG-type networks, GoogLeNet offers comparable accuracy with significantly fewer parameters, making it a computationally efficient\nalternative. The contrasting design philosophies of GoogLeNet and H-CNN VGG16 allow for an\ninsightful investigation into how architecture influences the performance and behavior of uncertainty quantification methods. While CNNs have traditionally emphasized accuracy, their\ntendency toward poor calibration and overconfidence remains a key challenge (Guo et al.,\n2017). These limitations highlight the need for robust uncertainty estimation strategies. 
This study addresses a notable gap in the literature by conducting a side-by-side evaluation\nof uncertainty quantification methods across structurally distinct CNN architectures using a\nstandardized benchmark dataset. Unlike most previous research that centers on classification\naccuracy, this work places greater emphasis on uncertainty calibration and interpretability,\nthereby contributing to the advancement of more transparent and trustworthy deep learning\nsystems. The dataset used in this study is Fashion-MNIST, a publicly available benchmark introduced by Xiao et al. (2017). It comprises 70,000 grayscale images of fashion products\nsourced from Zalando's online catalog, encompassing a diverse selection of men's, women's,\nkids', and unisex clothing. Each image, sized at 28×28 pixels, depicts a single item and\nis annotated with one of ten predefined class labels: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 15, + "total_chunks": 68, + "char_count": 2122, + "word_count": 292, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4b8b208f-bce0-4b46-b128-17ab99babde7", + "text": "These labels correspond to Zalando's\nsilhouette codes and were manually verified to ensure annotation accuracy and consistency. To maintain compatibility with MNIST-based models, the dataset underwent standardized preprocessing, including whitespace trimming, aspect-ratio-preserving resizing, Gaussian sharpening, grayscale conversion, intensity inversion, and centering based on the object's\ncenter of mass. 
Images with low contrast or unsuitable backgrounds were excluded.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 16, + "total_chunks": 68, + "char_count": 475, + "word_count": 57, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d41b60ee-d8c8-4bc7-a56f-6310b9d8b1e1", + "text": "The final\ndataset is divided into a training set of 60,000 images and a test set of 10,000 images, and it\nretains the same format and file structure as MNIST, facilitating straightforward adoption\nin existing machine learning workflows. This section outlines the methodological framework used in the study, covering the selected CNN architectures, uncertainty estimation techniques, and evaluation metrics. Model Architectures\nThe study evaluates two convolutional architectures, H-CNN VGG16 and GoogLeNet,\nselected for their strong empirical performance on Fashion-MNIST and their complementary\nstructural designs. The H-CNN VGG16 architecture, introduced by Seo and Shin (2019), builds on the\nVGG16 model by incorporating a hierarchical classification strategy to address confusion\namong visually similar classes. The model applies dropout regularization and loss-weight\nscheduling to shift learning gradually from general to fine-grained categories, helping reduce\noverfitting and improve interpretability. GoogLeNet (Inception v1), proposed by Szegedy\net al. (2015), utilizes Inception modules to process input through multiple convolutional\npaths in parallel. This architecture efficiently captures both local and global features while\nmaintaining a lower parameter count compared to VGG-based models. 
To ensure a fair comparison, training hyperparameters such as batch size, learning rate, and number of epochs were kept consistent across both models. All experiments were conducted on a workstation equipped with an NVIDIA RTX 3080 GPU, 32 GB RAM, and an Intel® Core™ i9-11900KF processor. Training times reported in Section 5 reflect this setup.
The primary goal of CP is to ensure that, with a user-defined significance level α, the prediction set C(x_new) for a new input x_new contains the true label y_new with probability at least 1 − α:

P(y_new ∈ C(x_new)) ≥ 1 − α.  (1)

To implement CP, the data is divided into training, calibration, and test sets. The calibration set is used to compute nonconformity scores, which assess how atypical a prediction is. While a larger calibration set improves the precision of prediction sets, it may reduce model performance by shrinking the training set (Barber et al., 2023). To balance this trade-off, this study uses 2,000 samples (approximately 2.86% of the dataset) for calibration, preserving sufficient training data while maintaining reliable coverage, consistent with the suggestions of Angelopoulos and Bates (2022). In classification tasks, the nonconformity score is often defined as:
CP aims to balance two properties: validity, or the statistical guarantee that the true\nlabel lies within the prediction set, and efficiency, which reflects the set's compactness and\ninformativeness. While CP guarantees validity under the exchangeability assumption, efficiency is not assured and depends on the quality of the underlying model and nonconformity\nfunction (Shafer and Vovk, 2008). This is particularly relevant in multiclass problems, where\ninefficient prediction sets may include several irrelevant classes, reducing interpretability. CP is categorized into two main types: Transductive Conformal Prediction (TCP) and\nInductive Conformal Prediction (ICP). TCP assigns each possible label to the new input,\ncomputes the corresponding nonconformity score, and compares it against calibration scores\nto form the prediction set. While accurate, this process is computationally intensive.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 19, + "total_chunks": 68, + "char_count": 1409, + "word_count": 211, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "dae7a421-643c-4dc0-a370-8f1eeab00a6c", + "text": "In\ncontrast, ICP trains a single model on the training set and applies it to both calibration\nand test data. The calibration set is used to compute a quantile threshold, and predictions\nfor new inputs are made using a fixed rule derived from the training phase (Papadopoulos,\n2008; Fontana et al., 2023). This study adopts Inductive Conformal Prediction (ICP) for its computational efficiency\nand scalability, particularly suited for large-scale, high-dimensional datasets like FashionMNIST. 
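The score-and-quantile construction of Eqs. (2) and (3), together with the empirical coverage used to check validity, can be implemented in a few lines. The following is a minimal NumPy sketch under our own naming (the functions are illustrative, not the authors' code): it computes calibration nonconformity scores, the threshold q̂, the resulting prediction sets, and the fraction of test points whose true label is covered.

```python
import numpy as np

def icp_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Inductive Conformal Prediction with score s_i = 1 - f(x_i)_{y_i} (Eq. 2).

    cal_probs:  (n, K) softmax outputs on the calibration set
    cal_labels: (n,)   true calibration labels
    test_probs: (m, K) softmax outputs on the test set
    Returns a list of prediction sets (arrays of class indices).
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the softmax mass on the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # q_hat is the ceil((n+1)(1-alpha))-th smallest calibration score (Eq. 3).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q_hat = np.sort(scores)[min(k, n) - 1]
    # Include every label whose score 1 - f(x)_y does not exceed q_hat.
    return [np.flatnonzero(1.0 - p <= q_hat) for p in test_probs]

def empirical_coverage(pred_sets, test_labels):
    """Fraction of test points whose true label lies in the prediction set."""
    return float(np.mean([y in s for s, y in zip(pred_sets, test_labels)]))
```

At α = 0.1, roughly 90% of test points should receive a set containing the true label, regardless of how well the underlying CNN is calibrated; average set size then serves as the efficiency measure.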
ICP is also model-agnostic and can be applied post hoc to any pre-trained CNN, providing flexibility in practical deployment while offering valid and interpretable uncertainty estimates.

Bayesian Approximation using MC Dropout
Bayesian inference provides a probabilistic framework for modelling uncertainty by estimating distributions over parameters rather than relying on fixed point estimates (Lindholm et al., 2021). It incorporates prior beliefs about model parameters, which are updated using observed data to compute the posterior distribution. This process is governed by Bayes' theorem:

P(θ | y) = P(θ) · P(y | θ) / P(y),  (4)

where P(θ | y) denotes the posterior distribution, P(θ) the prior, P(y | θ) the likelihood, and P(y) the marginal likelihood (evidence). While powerful, the Bayesian approach faces practical limitations in deep learning contexts. The reliance on prior distributions introduces subjectivity, and the inference procedure can become intractable in high-dimensional parameter spaces (Jospin et al., 2022; Abdullah et al., 2022).

Bayesian neural networks (BNNs) generalize conventional deep learning models by placing distributions over the weights and biases, thereby enabling the quantification of epistemic uncertainty, that is, uncertainty arising from limited data or model structure (Chandra and He, 2021; Essbai et al., 2024). Unlike traditional deterministic networks that risk overfitting by memorizing training data, BNNs produce probabilistic outputs that more accurately reflect model confidence. However, full Bayesian inference in deep neural networks is computationally expensive, requiring either sampling-based methods or variational approximations, both of which can be prohibitive for large-scale models.

To overcome these challenges, this study adopts Monte Carlo (MC) Dropout, a scalable and efficient Bayesian approximation technique introduced by Gal and Ghahramani (2016).
Originally proposed as a regularization method, dropout involves randomly deactivating\nunits during training to prevent overfitting.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 20, + "total_chunks": 68, + "char_count": 2540, + "word_count": 353, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7abe09f1-64b2-4aae-88f8-05d26c0bb890", + "text": "MC Dropout extends this mechanism to the\ninference phase by keeping dropout active during testing. Each stochastic forward pass with\ndropout yields a different output for the same input, approximating samples from the model's\npredictive posterior. By performing T stochastic forward passes for each input x∗, MC Dropout estimates the\npredictive distribution as follows: p(y∗| x∗, D) ≈1 X p y∗| x∗, Wt , (5) T c\nt=1\nwhere Wt represents the randomly sampled network weights at iteration t, and p(y∗| c\nx∗, Wt) is the softmax output for that pass. The mean of these outputs provides the final c\nprediction, while their variance serves as an estimate of epistemic uncertainty. MC Dropout thus enables uncertainty-aware prediction without modifying the training\nobjective or architecture, making it especially useful for convolutional neural networks applied to complex image datasets such as Fashion-MNIST. The technique is lightweight and\nwell-suited for high-dimensional tasks, offering a balance between computational efficiency\nand the interpretability of Bayesian methods. Prior studies suggest that 50 forward passes\nyield a reliable trade-off between uncertainty estimation quality and inference cost (Gal and\nGhahramani, 2016; Abdar et al., 2021). 
Importantly, MC Dropout has also been interpreted as a form of variational inference,\napproximating the posterior distribution over weights without requiring explicit sampling or\nexpensive reparameterization strategies (Shridhar et al., 2019; Gal et al., 2017). This makes\nit a practical tool for incorporating uncertainty into deep learning pipelines, especially where\nfull Bayesian implementations are infeasible. Overall, MC Dropout provides an accessible\nand effective mechanism for estimating model uncertainty in deep learning, enhancing both\nreliability and interpretability of neural predictions.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 21, + "total_chunks": 68, + "char_count": 1856, + "word_count": 266, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "cc5ab035-a514-44bf-a34a-3d953a13ee60", + "text": "Evaluation Metrics\nThis section outlines the evaluation metrics used to assess the performance of the proposed uncertainty quantification methods. It covers both general model evaluation criteria\nand metrics specific to Conformal Prediction (CP) and Bayesian approximation using Monte\nCarlo (MC) Dropout. 
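The mechanism of Eq. (5) amounts to averaging T dropout-perturbed softmax outputs and reading off their spread. A minimal NumPy sketch follows; the toy network and all names are illustrative assumptions, not the authors' models (in practice, `stochastic_forward` would be a Keras or PyTorch CNN with dropout kept active at inference time).

```python
import numpy as np

def mc_dropout_predict(stochastic_forward, x, T=50, rng=None):
    """Approximate the predictive distribution with T stochastic passes (Eq. 5).

    stochastic_forward(x, rng) must return a (K,) softmax vector and should
    keep dropout active, so repeated calls on the same input differ.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    probs = np.stack([stochastic_forward(x, rng) for _ in range(T)])  # (T, K)
    mean_probs = probs.mean(axis=0)   # final prediction (average of passes)
    std_probs = probs.std(axis=0)     # per-class spread: epistemic uncertainty
    return mean_probs, std_probs

# Toy stand-in network: fixed logits perturbed by dropout-like Bernoulli masks.
def toy_forward(x, rng):
    mask = rng.binomial(1, 0.8, size=3) / 0.8   # inverted-dropout scaling
    logits = np.array([2.0, 0.5, 0.0]) * x * mask
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

With T = 50 passes, as used in this study, the mean of the sampled softmax vectors is the prediction and their standard deviation per class is the epistemic-uncertainty estimate discussed in the evaluation metrics below.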
General Evaluation Metrics", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 22, + "total_chunks": 68, + "char_count": 331, + "word_count": 44, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b5f41279-cd63-40ba-b8c6-611f64075e6d", + "text": "Sparsity\nSparsity is typically analyzed in the context of model compression and pruning techniques\naimed at reducing network complexity without compromising predictive performance. One\nsuch method is Global Magnitude Pruning (Global MP), a widely used and computationally\nefficient approach that removes weights with magnitudes below a predefined threshold. This\nthreshold, denoted by t, is computed based on a target sparsity level κtarget and serves as\na universal cut-off across all network layers, in contrast to layer-wise or uniform pruning\nstrategies that require separate thresholds per layer (Gupta et al., 2024). Formally, the pruning rule is expressed as follows: ( 0 if |w| < t\nf(w) = (6)\nw otherwise", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 23, + "total_chunks": 68, + "char_count": 712, + "word_count": 109, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "f544f6cc-aa1f-44a0-a9b9-141a94feda00", + "text": "In this study, we do not implement pruning explicitly. Rather, we adopt the Global\nMP thresholding concept as a diagnostic tool to assess the inherent sparsity of the trained\nnetwork. 
By quantifying the proportion of weights falling below various threshold levels, we\ninvestigate the underlying weight distribution and structural redundancy, without modifying\nthe network's architecture or altering its predictive performance. This approach allows for\na nuanced evaluation of sparsity across different configurations. To facilitate this analysis, sparsity is compared between two configurations: a baseline\nconvolutional neural network (CNN) and its Bayesian counterpart utilizing MC Dropout. This comparison is pertinent because CP operates post hoc and does not influence the\ntraining process or the learned weight distribution.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 24, + "total_chunks": 68, + "char_count": 830, + "word_count": 116, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "945be833-a991-4ae9-8fc7-23c77953488f", + "text": "In contrast, the Bayesian framework\nincorporates a prior, specifically, a zero-mean Gaussian prior, over the weights. This prior\npromotes weight shrinkage by penalizing large values, thereby encouraging weights to cluster\naround zero. 
Consequently, this regularizing effect contributes to higher weight sparsity,\nwhich is captured by analyzing the number of parameters falling below predefined magnitude\nthresholds.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 25, + "total_chunks": 68, + "char_count": 415, + "word_count": 55, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d1bda27b-8c24-4d58-98fb-a363361eb52c", + "text": "Conformal Prediction Evaluation Metrics Validity\nThe validity of Conformal Prediction (CP) refers to the extent to which the prediction sets\ninclude the true class label, as guaranteed by a predefined confidence level 1 −α, where α is\nthe significance level (Shafer and Vovk, 2008). This property is quantified through empirical\ncoverage, defined as the proportion of test instances for which the true class lies within\nthe corresponding prediction set (Zhou et al., 2025). Mathematically, empirical coverage is\ncomputed as: Coverage = X 1(yi ∈Γα(xi)) (7)\nn −v\ni=v+1\nHere, yi represents the true label of the i-th test sample, Γα(xi) denotes the prediction\nset for the corresponding input xi, and 1(·) is the indicator function that returns 1 if the\ncondition is satisfied and 0 otherwise. The summation is computed over the test set, indexed\nfrom v + 1 to n. In implementation, coverage is estimated by checking whether the true label is present\nin each prediction set, followed by averaging over all test instances. If the empirical coverage falls significantly below the nominal level 1 −α, this may indicate a violation of the\nexchangeability assumption, thereby compromising the theoretical guarantees of CP. 
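The diagnostic just described, counting the share of weights whose magnitude falls below the Global MP cut-off of Eq. (6) without actually zeroing them, can be sketched as follows (a minimal illustration under assumed names, not the study's code):

```python
import numpy as np

def sparsity_profile(weight_tensors, thresholds):
    """Fraction of all parameters with |w| below each threshold t.

    Mirrors the Global MP rule f(w) = 0 if |w| < t (Eq. 6), but used purely
    as a diagnostic: no weight is modified, so predictions are unchanged.
    """
    w = np.abs(np.concatenate([np.ravel(v) for v in weight_tensors]))
    return {t: float(np.mean(w < t)) for t in thresholds}
```

Applying this to the baseline and MC Dropout networks at several thresholds yields the sparsity comparison reported later: a higher fraction of near-zero weights in the Bayesian variant reflects the shrinkage induced by the zero-mean Gaussian prior.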
Efficiency\nEfficiency measures how informative or precise the prediction sets are, with smaller sets\nindicating higher efficiency (Fontana et al., 2023). In classification problems, efficiency is\ncommonly assessed by computing the average number of labels included in the prediction\nsets across the test set. Formally, it is expressed as:", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 26, + "total_chunks": 68, + "char_count": 1552, + "word_count": 242, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "518acafd-9dc2-4dc1-9e92-5fe72acf9b6e", + "text": "Efficiency = X |Γα(xi)| (8)\nn −v\ni=v+1\nIn this equation, Γα(xi) is the prediction set for input xi, derived at significance level α,\nand the summation is performed over all test instances. A prediction set containing only\na single label is considered most efficient, as it reflects maximum confidence in the model's\nprediction. While CP guarantees validity under minimal assumptions, efficiency depends\non the model's capacity to distinguish between classes. Thus, evaluating both validity and\nefficiency provides a comprehensive understanding of a model's uncertainty quantification\nperformance. Bayesian Inference Evaluation Metrics Predictive Entropy (Class-Level and Sample-Level)\nPredictive entropy quantifies the overall uncertainty in the model's output distribution\nfor a given input and is particularly relevant in Bayesian inference settings. It is computed\nas the entropy of the mean predictive distribution obtained from multiple stochastic forward\npasses using MC Dropout (Gal and Ghahramani, 2016). 
Formally, it is defined as:", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 27, + "total_chunks": 68, + "char_count": 1040, + "word_count": 148, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ffab1b46-57dd-4b0c-a9a8-b86dadacabf1", + "text": "H(p) = − X pc log pc (9) c=1\nwhere pc denotes the mean predictive probability for class c, averaged over the Monte\nCarlo samples. A higher entropy value indicates greater uncertainty in the model's prediction, typically observed when predicted probabilities are evenly distributed across multiple\nclasses. Conversely, a lower entropy value, approaching zero, reflects high model confidence\nconcentrated on a single class prediction. In this study, predictive entropy is evaluated at both the sample and class levels. At the\nsample level, it serves to quantify the model's confidence in individual predictions. At the\nclass level, the entropy is averaged across all test samples belonging to each class, offering\ninsights into which categories the model finds more ambiguous or uncertain. 
This dual-level analysis supports a more nuanced understanding of model performance, particularly in high-dimensional classification tasks.
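Both levels of the entropy analysis reduce to Eq. (9) applied to the mean predictive distribution. A minimal NumPy sketch (function names are our own, for illustration only):

```python
import numpy as np

def predictive_entropy(mean_probs, eps=1e-12):
    """Entropy of the mean predictive distribution (Eq. 9), per sample.

    mean_probs: (n, K) MC-averaged softmax outputs; eps guards log(0).
    """
    p = np.clip(mean_probs, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def class_level_entropy(mean_probs, labels, n_classes):
    """Average sample-level entropy over the test points of each true class."""
    h = predictive_entropy(mean_probs)
    return np.array([h[labels == c].mean() if np.any(labels == c) else 0.0
                     for c in range(n_classes)])
```

A uniform distribution over K classes attains the maximum entropy log K (about 2.30 for the ten Fashion-MNIST classes), while a one-hot prediction yields entropy near zero.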
In Bayesian methods, which yield predictive distributions rather than single-point estimates, ECE plays a crucial role in evaluating the reliability of uncertainty quantification. High ECE values imply that confidence scores may be misleading, thereby compromising the interpretability and safety of uncertainty-driven decisions. Notably, ECE is not applicable to Conformal Prediction (CP), since CP produces sets of possible labels rather than scalar confidence values. In such cases, alternative evaluation criteria such as validity and efficiency are used instead.
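The binning procedure behind Eq. (10) can be sketched directly. The snippet below is a minimal illustration (names are assumptions, not the study's code) following the standard definition of Naeini et al. (2015) and Guo et al. (2017), in which each bin's accuracy/confidence gap is weighted by its share of predictions |B_m|/n.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE (Eq. 10): |B_m|/n-weighted gap between confidence and accuracy.

    confidences: (n,) top-class predicted probabilities
    correct:     (n,) 1 if the prediction matched the true label, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)   # half-open bins
        if in_bin.any():
            acc = correct[in_bin].mean()    # empirical accuracy in the bin
            conf = confidences[in_bin].mean()  # average confidence in the bin
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```

A perfectly calibrated model, whose confidence matches its accuracy in every bin, attains ECE = 0; an overconfident model contributes positive gaps in the high-confidence bins.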
High variability implies significant epistemic uncertainty, indicating\nthat the model is uncertain about its internal parameters.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 30, + "total_chunks": 68, + "char_count": 553, + "word_count": 70, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d2cacc7f-109a-4739-b68d-71e448f4eec4", + "text": "In this study, standard deviation is computed per class for each test input. Specifically,\neach input is passed through the network multiple times using MC Dropout, resulting in a\ndistribution of softmax outputs. The standard deviation of these outputs across all passes reflects the uncertainty associated with each class prediction. These values are then aggregated\nto estimate the overall epistemic uncertainty at the sample level. To assess the consistency of uncertainty estimates across methods, we compare the predictive entropy derived from MC Dropout with the prediction set sizes generated by Conformal\nPrediction (CP). In cases of high certainty, we expect CP to produce smaller prediction sets\nand MC Dropout to yield lower entropy values. This comparison provides insights into the\ncoherence of uncertainty estimates between the two frameworks. Mutual Information (Epistemic Uncertainty)\nMutual Information (MI) is a more refined measure for isolating epistemic uncertainty. While predictive entropy accounts for both aleatoric and epistemic components, MI quantifies\nthe reducible portion of uncertainty that arises due to uncertainty in model parameters. 
It\nis defined as the difference between the entropy of the mean predictive distribution and the\naverage entropy across individual stochastic forward passes:", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 31, + "total_chunks": 68, + "char_count": 1326, + "word_count": 192, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5d648017-6c20-47b1-a221-9037070ac5df", + "text": "MI(y, θ | x) = H [¯p(y | x)] −1 X H [pt(y | x)] (11)\nt=1 In this equation, H[¯p(y | x)] denotes the entropy of the average predictive distribution,\nand H[pt(y | x)] is the entropy of the prediction in the t-th stochastic pass. The difference\nquantifies the extent to which the model's predictions fluctuate under dropout, providing a\nclear indicator of epistemic uncertainty. Mutual Information is particularly useful in identifying uncertainty driven by model\nambiguity, which often occurs in regions of the input space with limited or conflicting training\ndata. It complements other metrics such as predictive entropy by enabling a more granular\nunderstanding of the uncertainty decomposition. Average Entropy (Aleatoric Uncertainty)\nAverage entropy across multiple stochastic forward passes serves as an approximation of\naleatoric uncertainty, which arises from intrinsic noise or ambiguity in the input data. Unlike\nepistemic uncertainty, which can be reduced with more data or better modeling, aleatoric\nuncertainty reflects irreducible randomness and is often due to overlapping class boundaries\nor low-quality observations. 
This metric is derived by averaging the entropy of the softmax outputs obtained from\neach of the T MC Dropout passes.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 32, + "total_chunks": 68, + "char_count": 1248, + "word_count": 190, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "8ff03932-96d3-4bab-93b2-b0142845aa73", + "text": "Formally, it is computed as: X H [pt(y | x)] (12)\nt=1 Here, H[pt(y | x)] denotes the entropy of the predicted class probabilities in the t-th\nstochastic pass. The resulting average quantifies the level of uncertainty inherent in the data\nfor a given input. Together with mutual information, average entropy facilitates the decomposition of total predictive uncertainty into epistemic and aleatoric components, enabling a\nmore nuanced understanding of model confidence and decision reliability. In summary, the evaluation metrics discussed in this section collectively provide a comprehensive framework for assessing both the predictive performance and the quality of uncertainty quantification methods. By combining classical measures such as sparsity and\ncalibration with more advanced probabilistic metrics, this study enables a deeper and more\ninterpretable analysis of model reliability across different uncertainty estimation techniques. This section presents the empirical results across several performance metrics, including\nclassification accuracy, uncertainty quantification, prediction set efficiency, and validity. Although the H-CNN VGG16 architecture produces multiple hierarchical outputs, our analysis\nfocuses exclusively on the final predictions for the ten Fashion-MNIST classes. 
We begin\nby evaluating baseline performance in terms of accuracy, overfitting, sparsity, and Expected\nCalibration Error (ECE) for both H-CNN VGG16 and GoogLeNet, as well as their Bayesian\ncounterparts. We then assess predictive reliability using Conformal Prediction, focusing on\nprediction set sizes, empirical coverage, and class-wise confidence variation. Next, we investigate uncertainty through Bayesian approximation using Monte Carlo\nDropout. We analyse overall model uncertainty by visualising the distribution and confidence intervals of predictive entropy, and decompose this uncertainty into epistemic and\naleatoric components. We compare predictive entropy across correct classifications and misclassifications to explore how uncertainty relates to prediction correctness. To gain deeper\ninsight into class-level ambiguity, we conclude this section with class-wise comparisons of\npredictive entropy and corresponding confidence intervals for both models.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 33, + "total_chunks": 68, + "char_count": 2264, + "word_count": 298, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a1cb58e5-8369-4818-8533-378fe0da2df5", + "text": "We then extend the analysis by examining the relationship between empirical and probabilistic reliability measures, specifically comparing Conformal Prediction set sizes with predictive entropy derived from Bayesian inference. Collectively, these analyses shed light on\nhow the two architectures express and manage uncertainty under different estimation frameworks, and how this impacts both predictive reliability and model behaviour. 
Overall Performance
We begin by evaluating the H-CNN VGG16 and GoogLeNet architectures, along with their Bayesian counterparts, across multiple performance dimensions including accuracy, training duration, sparsity, overfitting tendencies, and Expected Calibration Error (ECE).

Accuracy and Duration
For classification accuracy on the Fashion-MNIST dataset, the best-performing H-CNN VGG16 model achieves 92.99%, with a five-fold cross-validation average of 92.62%. In comparison, GoogLeNet attains a maximum accuracy of 89.72%, with an average of 88.24% across folds. These results are consistent with prior studies: Seo and Shin (2019) reported an accuracy of 93.52% for H-CNN VGG16, while Vives-Boix and Ruiz-Fernández (2021) documented GoogLeNet achieving 91.89%.

Table 1: Performance summary for H-CNN VGG16 and GoogLeNet architectures

Model         Metric                    Baseline       Bayesian
H-CNN VGG16   Accuracy (Best)           92.99%         92.47%
              Accuracy (5-Fold Avg.)    92.62%         92.29%
              Duration (s)              12,342.65      13,417.11
              Trainable Parameters      90,312,274
              Non-Trainable Params      2,944
              Optimizer Parameters      90,312,276
              Total Parameters          180,627,494
GoogLeNet     Accuracy (Best)           89.72%         88.68%
              Accuracy (5-Fold Avg.)    88.24%         87.49%
              Duration (s)              1,428.88       1,471.92
              Trainable Parameters      5,977,530
              Optimizer Parameters      5,977,532
              Total Parameters          11,955,062

Although H-CNN VGG16 outperforms GoogLeNet in accuracy, this comes at a significant computational cost. With more than 180 million parameters, substantially exceeding the parameter count of GoogLeNet, H-CNN VGG16 requires markedly longer training times.
On average, each fold takes nearly ten times longer to complete.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 34, + "total_chunks": 68, + "char_count": 2074, + "word_count": 274, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "382d1bf7-5c64-4f25-bf54-2af693cf0e0b", + "text": "This stark difference in\nmodel complexity directly underlies the observed gap in computational efficiency between\nthe two architectures. After implementing the Bayesian approximation using Monte Carlo Dropout, both architectures exhibit a slight decrease in best-case accuracy. This decrease is attributable to the\nstochastic nature of MC Dropout, which introduces additional variance into the predictions. Furthermore, inference time increases substantially, as each prediction requires 50 stochastic forward passes.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 35, + "total_chunks": 68, + "char_count": 517, + "word_count": 67, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "5a4a85a3-2808-4e75-8d0b-2edc55da3798", + "text": "Nevertheless, under the Bayesian setting, H-CNN VGG16 continues to\noutperform GoogLeNet in terms of classification accuracy. Overall, both the standard and Bayesian variants of H-CNN VGG16 outperform their\nGoogLeNet counterparts in terms of classification accuracy. However, this improvement\ncomes at a significantly higher computational cost. 
While the Bayesian implementation further increases training duration per fold, it does not yield corresponding gains in accuracy, suggesting limited benefit relative to its added complexity.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 36,
    "total_chunks": 68,
    "char_count": 539,
    "word_count": 71,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "536e8b83-3f1c-4778-8694-04823cfb3807",
    "text": "Class-Wise Accuracy Confusion Matrices
To further evaluate model performance beyond overall accuracy, confusion matrices were generated to examine class-wise prediction behaviour (Figure 1). The general findings of the baseline models are consistent with those of their Bayesian alternatives. To avoid redundancy, the Bayesian confusion matrices are provided in the Appendix (H-CNN VGG16 and GoogLeNet in Figure 12). These visualizations highlight systematic misclassification patterns and reveal which classes are most frequently confused with one another.
Importantly, such recurring misclassification trends also provide a foundation for the subsequent analysis of uncertainty measures, where predictive entropy and calibration are used to quantify the reliability of the models' class-level decisions.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 37,
    "total_chunks": 68,
    "char_count": 804,
    "word_count": 106,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f3a63c3d-2b9c-4289-b3cf-6207f729902a",
    "text": "Figure 1: Confusion Matrix for H-CNN VGG16 and GoogLeNet The H-CNN VGG16 model demonstrates strong classification performance across most categories, as can be seen in Figure 1 (left), achieving perfect accuracy for Sandal (1000) and high accuracy for Trouser (983), Bag (980), Ankle boot (974), and Sneaker (914). These items are visually distinct, contributing to the model's high performance. By contrast, the highest concentration of misclassifications occurs in the Shirt class, which is frequently confused with T-shirt (104), Pullover (52), and Coat (38).
This reflects substantial visual similarity among these categories.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 38,
    "total_chunks": 68,
    "char_count": 627,
    "word_count": 90,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e7e350b4-4cd1-4d08-a6aa-7fd39d7ff45a",
    "text": "This confusion also occurs in reverse: T-shirts are often predicted as Shirts (62), and occasionally as Pullovers (39) or Coats (41). Similarly, Pullovers are sometimes misclassified as Coats (45), underscoring the model's difficulty distinguishing between classes with overlapping visual features. For GoogLeNet (Figure 1 (right)), the overall classification patterns are broadly similar. While the model demonstrates strong performance across most categories, it struggles notably with Shirts (670 correct out of 1000) and Coats (759). Shirts are frequently misclassified as T-shirts (112) or Pullovers (47), whereas Coats are often predicted as Shirts (132) or Pullovers (68).",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 39,
    "total_chunks": 68,
    "char_count": 683,
    "word_count": 96,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f7a2d473-12c1-4f04-be4b-5463573edfbf",
    "text": "These patterns are similar to the trends observed in the H-CNN VGG16 model, indicating that these classes are inherently more difficult to separate, most likely due to substantial visual overlap in their features.
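The class-wise counts discussed above come from a standard confusion matrix, in which rows index true classes and columns index predicted classes. A minimal sketch of how such a matrix is tallied (the three-class labels here are illustrative, not the Fashion-MNIST data):

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_classes):
    """Tally a confusion matrix: rows are true classes, columns are predictions."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        cm[t, p] += 1
    return cm

# Toy example with three classes (0, 1, 2)
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(true_labels, pred_labels, n_classes=3)
```

Each row sums to the number of true instances of that class, and the diagonal counts correct predictions, which is exactly how the per-class figures quoted above (e.g. 670 correct Shirts out of 1000) are read off.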
Overall, both models demonstrate strong classification performance on visually distinct items, yet face difficulties classifying classes that exhibit substantial visual overlap, particularly Shirt, Pullover, and Coat. Overfitting Analysis
To evaluate the extent of overfitting in both baseline models and their Bayesian counterparts, training and validation loss and accuracy curves were analyzed across 60 epochs. All models were trained using 5-fold cross-validation; however, for clarity, only the best-performing fold is presented for each baseline model (Figure 2 and Figure 3). The plots for the remaining folds, along with their Bayesian variants, are provided in Appendix Figures 13–16, as they do not exhibit substantial differences from the non-Bayesian counterparts.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 40,
    "total_chunks": 68,
    "char_count": 991,
    "word_count": 141,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7755b142-686a-4124-a195-35c993b03459",
    "text": "Figure 2: Training vs. Validation Loss and Accuracy for H-CNN VGG16 (Fold 2) The training curves for H-CNN VGG16 reveal a pronounced gap between training and validation performance. While training loss steadily decreases and accuracy approaches 100%, validation loss plateaus early and exhibits minor fluctuations, with validation accuracy stabilising around 93%. This pattern indicates that the model fits the training data well but shows limited improvement on unseen data, suggesting a degree of overfitting. The architecture used here follows the design by Seo and Shin (2019), which includes dropout, batch normalization, and loss weighting to support generalisation.
These techniques appear\neffective in stabilising the training process, although the model's substantial parameter count\nlikely contributes to its tendency to overfit. Figure 3: Training vs. Validation Loss and Accuracy for GoogLeNet (Fold 2) GoogLeNet, on the other hand, demonstrates more consistent training behaviour. Training and validation losses decrease in parallel, and validation accuracy closely follows training\naccuracy throughout, indicating good generalisation and limited overfitting. This implementation, adapted from Vives-Boix and Ruiz-Fernández (2021) for the Fashion-MNIST dataset,\nbenefits from a more compact architecture and a substantially lower parameter count, which\nlikely contribute to its stable performance. In both architectures, introducing MC Dropout during inference had minimal impact on\nthe training dynamics. The Bayesian models showed nearly identical learning curves to their\nrespective baselines, which supports the decision to move those plots to the Appendix for\nreference.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 41, + "total_chunks": 68, + "char_count": 1688, + "word_count": 233, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d509152d-bc64-4f6f-bb55-4ae67adb30a8", + "text": "In summary, H-CNN VGG16 achieves higher training accuracy but exhibits moderate\noverfitting, whereas GoogLeNet maintains a more balanced relationship between training and\nvalidation performance. Considering its computational efficiency and stronger generalisation,\nGoogLeNet may represent the more practical choice in scenarios where resource constraints\nor overfitting risks are critical concerns. 
Sparsity

As introduced in Section 4, sparsity is examined here through both graphical and tabular summaries for the H-CNN VGG16 and GoogLeNet models (Figure 4). For brevity, this section presents only the visualisations and tables corresponding to the MC Dropout implementation, while the baseline visualisations and weight tables are provided in Appendix Figure 17 and Table 5. The plots provide a visual overview of the cumulative sparsity across a range of thresholds. Both models exhibit a similar elbow-shaped curve in their sparsity profiles. For both models, sparsity remains low at small thresholds but begins to increase sharply around 0.001. Closer inspection of this region, however, reveals important differences in how the two architectures distribute their weights, highlighting distinct sparsity patterns. Figure 4: Sparsity as a percentage of total trainable weights The H-CNN VGG16 model contains significantly more weights, which is expected given its substantially larger number of parameters. It also exhibits a higher proportion of weights close to zero. This is a noteworthy observation considering the model's earlier difficulties with generalisation, as discussed above.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 42,
    "total_chunks": 68,
    "char_count": 1609,
    "word_count": 225,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "4f986dad-9875-4cf4-a170-43fa10fee785",
    "text": "According to the magnitude pruning framework proposed by Gupta et al. (2024), such near-zero weights contribute little to model performance and therefore constitute strong candidates for pruning.
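The cumulative sparsity profile described above amounts to counting, for each threshold, the fraction of trainable weights whose magnitude falls below it. A minimal NumPy sketch with synthetic weights (the layer shapes and the Gaussian initialisation are illustrative assumptions, not the trained models' weights):

```python
import numpy as np

def cumulative_sparsity(weight_arrays, thresholds):
    """Fraction of all trainable weights whose magnitude lies below each threshold."""
    mags = np.abs(np.concatenate([w.ravel() for w in weight_arrays]))
    return {t: float((mags < t).mean()) for t in thresholds}

# Synthetic stand-ins for per-layer weight tensors
rng = np.random.default_rng(0)
layers = [rng.normal(0.0, 0.05, size=(256, 128)), rng.normal(0.0, 0.05, size=(128, 10))]
profile = cumulative_sparsity(layers, [1e-5, 1e-4, 1e-3, 5e-3])
```

By construction the resulting curve is non-decreasing in the threshold, which is why both models' profiles in Figure 4 rise monotonically towards 100%.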
Table 2: Sparsity Comparison Across Thresholds for (1) Bayesian H-CNN VGG16 and (2) GoogLeNet

Weight Range      # Weights (1)  # Weights (2)  % Total (1)  % Total (2)  Cum. % (1)  Cum. % (2)
<0.00001          34,427         1,548          0.04         0.03         0.04        0.03
0.00001–0.00005   134,757        4,812          0.15         0.08         0.19        0.11
0.00005–0.0001    165,450        6,033          0.18         0.10         0.37        0.21
0.0001–0.0005     1,297,613      48,000         1.44         0.80         1.81        1.01
0.0005–0.001      1,609,454      59,512         1.78         1.00         3.59        2.01
0.001–0.005       12,857,487     478,938        14.24        8.01         17.83       10.04
≥0.005            74,216,030     5,377,523      82.17        89.96        100         100",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 43,
    "total_chunks": 68,
    "char_count": 692,
    "word_count": 101,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "34b64063-47ed-4d57-b1c9-4cbdc9e16074",
    "text": "In contrast, GoogLeNet demonstrates a more efficient utilisation of its smaller parameter set. The increase in sparsity across thresholds is more gradual, with fewer weights falling below the different threshold values. This supports previous observations that GoogLeNet's more compact architecture is structurally more constrained, thereby promoting more effective generalisation. In particular, the implementation of MC Dropout does not significantly affect the weight sparsity of the H-CNN VGG16 or GoogLeNet models. This is expected, as MC Dropout is applied during inference to estimate predictive uncertainty and does not influence the underlying weight magnitudes during training. As a result, the sparsity pattern remains unchanged and the models are not further compressed.
However, this stands in contrast to\na full Bayesian approach, where the choice of prior distributions can induce regularisation\nduring training (Abdar et al., 2021; Abdullah et al., 2022).", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 44, + "total_chunks": 68, + "char_count": 971, + "word_count": 138, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "4ec4f86c-8859-47bd-8ce0-bbab2a8bd804", + "text": "Expected Calibration Error (ECE)\nExpected Calibration Error (ECE) quantifies the discrepancy between predicted confidence and actual accuracy. For instance, a perfectly calibrated model would assign 80%\nconfidence to predictions that are correct precisely 80% of the time. Lower ECE values\ntherefore indicate better calibration, and a stronger alignment between confidence and correctness. This study uses ECE to evaluate how effectively the Bayesian versions of H-CNN\nVGG16 and GoogLeNet capture and express predictive uncertainty. Table 3: Comparison of ECE before and after Bayesian modeling Architecture ECE (Baseline) ECE (Bayesian) H-CNN VGG16 5.66% 5.61%\nGoogLeNet 2.82% 1.37%", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 45, + "total_chunks": 68, + "char_count": 683, + "word_count": 95, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "79603898-e068-4132-a643-6454defb959d", + "text": "The results presented above show a clear contrast between the two models. 
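As a reference for how values like those in Table 3 are computed, the binned ECE can be sketched as follows; the ten-bin choice and the toy confidence/correctness arrays are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def expected_calibration_error(confidences, is_correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    is_correct = np.asarray(is_correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(is_correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight each bin by its occupancy
    return ece

# An overconfident toy model: 99% confidence but only 25% accuracy
ece = expected_calibration_error([0.99, 0.99, 0.99, 0.99], [1, 0, 0, 0])
```

A perfectly calibrated model would place every bin's accuracy on top of its mean confidence, driving the weighted gap, and hence the ECE, to zero.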
GoogLeNet demonstrates a significant improvement in calibration, with ECE dropping from 2.82% in the baseline model to 1.37% under Bayesian inference. This indicates that Monte Carlo Dropout effectively improves the model's ability to reflect uncertainty in its confidence scores. In contrast, H-CNN VGG16 shows only a marginal improvement, with ECE decreasing slightly from 5.66% to 5.61%. Even with Bayesian inference, the model remains comparatively poorly calibrated relative to GoogLeNet. A more detailed interpretation of these findings is provided in Section 5.2.2.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 46,
    "total_chunks": 68,
    "char_count": 643,
    "word_count": 92,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "97293652-c7bd-4de3-974b-af57641af516",
    "text": "Uncertainty Estimation
This section provides a rigorous empirical analysis of Conformal Prediction and Bayesian approximation with MC Dropout. Both methods are evaluated using multiple metrics to assess the predictive reliability across two neural network architectures. Moreover, both approaches are compared to understand the relationship between Conformal Prediction set sizes and predictive entropy derived from Bayesian approximation. Conformal Prediction
Conformal Prediction requires an additional calibration split. Therefore, the data set is partitioned into 60,000 observations for training, 2,000 for calibration and 8,000 for testing.
This design follows the standard Inductive Conformal Prediction (ICP) framework, in which\nthe calibration set is used to compute conformity scores and derive a quantile threshold that\ncontrols prediction set sizes while ensuring the desired coverage level (validity) on unseen\ntest data (Vovk et al., 2005). This section also reports the empirical coverage achieved by\neach model. Because the confidence band is adaptively adjusted, the empirical coverage will\nnot match the nominal 95% level exactly, as is typical in ICP.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 47, + "total_chunks": 68, + "char_count": 1172, + "word_count": 163, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ded965ba-2587-4628-aa1b-d801f249d436", + "text": "The histograms below illustrate the distribution of calibration scores for both baseline\nmodels (Figure 5). The scores represent how confident the model was in the prediction of\nthe true label where values closer to zero indicate higher confidence assigned to the true class,\nwhereas larger values suggest greater uncertainty. 
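A minimal sketch of this ICP procedure, using the common conformity score 1 − p(true class) and the finite-sample quantile correction; the toy probabilities below are illustrative, not the models' outputs:

```python
import numpy as np

def icp_quantile(cal_probs, cal_labels, alpha=0.05):
    """Conformity threshold from calibration scores s_i = 1 - p_i(true class)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)  # finite-sample correction
    return np.quantile(scores, level)

def prediction_sets(test_probs, qhat):
    """Include every label whose score 1 - p(label) stays within the threshold."""
    return [np.flatnonzero(1.0 - p <= qhat).tolist() for p in test_probs]

# Toy calibration set: the true class always receives probability 0.9
cal_probs = np.full((20, 3), 0.05)
cal_labels = np.zeros(20, dtype=int)
cal_probs[np.arange(20), cal_labels] = 0.9
qhat = icp_quantile(cal_probs, cal_labels)
sets = prediction_sets(np.array([[0.95, 0.03, 0.02]]), qhat)
```

Labels the network deems unlikely exceed the threshold and are left out, so confident test inputs yield singleton sets while ambiguous ones accumulate extra labels until the coverage guarantee is met.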
Both models achieve an empirical coverage of 95%, demonstrating that the ICP method is well calibrated overall and successfully achieves validity.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 48,
    "total_chunks": 68,
    "char_count": 473,
    "word_count": 70,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7d55ce35-bd82-4ffb-baf6-696dfc63a4a8",
    "text": "Figure 5: H-CNN VGG16 & GoogLeNet Calibration Scores For the H-CNN VGG16 model (Figure 5 (right)), the calibration score distribution shows a pronounced peak near zero, indicating that the model assigns high confidence to the majority of its predictions. This pattern is consistent with the model's overall high accuracy.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 49,
    "total_chunks": 68,
    "char_count": 321,
    "word_count": 49,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "9d5639d1-4a1e-4f08-8d5e-848ce6b84199",
    "text": "However, another peak appears on the right-hand side, suggesting reduced confidence for certain predictions. As shown later in the efficiency plots, this lower-confidence region is primarily associated with the Shirt class. GoogLeNet's distribution of calibration scores is broader and less structured, indicating different confidence behaviour (see Figure 5 (left)). Like H-CNN VGG16, it displays a peak at zero, indicating that both models assign high confidence to many correct predictions.
However, GoogLeNet exhibits greater variability overall.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 50,
    "total_chunks": 68,
    "char_count": 551,
    "word_count": 75,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "48d7d343-14a3-497a-b5aa-0a12121dad16",
    "text": "In the case of H-CNN VGG16, there is a high peak of calibration scores near 1, meaning that in some cases the model assigns low probability to the true class, leading to lower confidence. This suggests that GoogLeNet is generally less overconfident and adopts a more cautious stance when making predictions. At the same time, this broader distribution makes it more difficult to judge the reliability of individual predictions based solely on calibration scores. The information below depicts efficiency evaluation, which is measured via the prediction set size. A smaller set size indicates higher efficiency, meaning the model includes fewer labels in the prediction, making it more certain. If the set size is larger, the model is less confident and tries to include more labels to reach the coverage guarantee.

Table 4: Prediction set size counts for GoogLeNet and H-CNN VGG16 models (columns ordered by increasing set size)
GoogLeNet     6,431   1,377   185   7
H-CNN VGG16   7,551   398     44    7

In the case of H-CNN VGG16 (Figure 6 (left)), the majority of predictions consist of a single label, indicating high efficiency. This is evident in the plot below, where the distribution of the prediction set size is centred around 1 for most classes. The main exception is the Shirt category, which has a broader distribution at a prediction set size of two, indicating lower model confidence.
This observation aligns with the right tail of the calibration score distribution discussed above, suggesting that Shirt instances frequently generate high nonconformity scores. Moreover, Pullover appears as the only category with a prediction set size of five, representing a rare case of particularly low confidence. Overall, H-CNN VGG16 maintains both high coverage (validity) and compact prediction sets.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 51,
    "total_chunks": 68,
    "char_count": 1736,
    "word_count": 274,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "08346cd9-cf68-4e53-a87d-ecd5e184f347",
    "text": "Figure 6: Prediction Set Size Per Class for H-CNN VGG16 & GoogLeNet GoogLeNet (Figure 6 (right)) produces fewer prediction sets of size one and significantly more sets of size two and three. This pattern corresponds directly to the earlier calibration score histograms, where GoogLeNet exhibited fewer instances of extreme confidence. The distribution of prediction set sizes is also more dispersed compared to H-CNN VGG16. As with the previous model, the Shirt category shows the widest spread, underscoring how difficult it is to classify. This finding is consistent with the broader calibration score distribution observed for GoogLeNet, which was less sharply peaked than that of H-CNN VGG16. Overall, H-CNN VGG16 achieves high empirical coverage with prediction sets that are typically small, but it also exhibits overconfidence, particularly for visually ambiguous classes.
At the same time, GoogLeNet produces a broader distribution of confidence scores and generates broader prediction sets, reflecting a greater ability to signal uncertainty when the\nmodel is unsure.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 52, + "total_chunks": 68, + "char_count": 1055, + "word_count": 154, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "06e437a2-fc8f-4cf9-9736-d1c3038005ad", + "text": "Uncertainty in Bayesian Inference\nBayesian uncertainty estimation is implemented with MC Dropout, following the approach of Gal and Ghahramani (2016). This method relies on dropout being applied during\ntesting, resulting in the model performing multiple stochastic forward passes to approximate\nthe posterior distribution. In the article, all results are based on fifty Monte Carlo passes\nper observation, with the mean predictive entropy and its standard deviation computed from\nthese. This section provides a general overview of the model's uncertainty and the predictive\nentropy, which reflects the combined contribution of aleatoric and epistemic uncertainties,\nas outlined in Section 4. In essence, predictive entropy quantifies how uncertain the model\nis on average across its predictions. Figure 7 below displays predictive confidence values, sorted from lowest to highest. The\ndark blue line represents the mean predicted confidence, while the shaded area illustrates\nthe variation (standard deviation) across the fifty dropout passes. 
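A sketch of how such a summary can be computed from stochastic forward passes; the simulated probabilities below are stand-ins for dropout-perturbed softmax outputs, not the actual model predictions:

```python
import numpy as np

def mc_dropout_summary(prob_passes):
    """Summarise T stochastic passes of shape (T, n_samples, n_classes):
    mean predictive distribution, predictive entropy of that mean, and the
    spread of the top-class confidence across passes."""
    mean_probs = prob_passes.mean(axis=0)
    pred_entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=1)
    conf_std = prob_passes.max(axis=2).std(axis=0)
    return mean_probs, pred_entropy, conf_std

# Simulate 50 passes over 2 samples with 10 classes: one confident, one uncertain
rng = np.random.default_rng(1)
confident = np.tile([0.91] + [0.01] * 9, (50, 1, 1))       # identical across passes
uncertain = rng.dirichlet(np.ones(10), size=(50, 1))        # varies across passes
passes = np.concatenate([confident, uncertain], axis=1)
mean_probs, pred_entropy, conf_std = mc_dropout_summary(passes)
```

The confident sample yields low entropy and essentially zero variation across passes, while the uncertain one produces both higher entropy and a visible confidence spread, which is the behaviour the shaded bands in Figure 7 visualise.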
Figure 7: Overall Bayesian Confidence Intervals The confidence curve for H-CNN VGG16 rises sharply, reaching near-perfect confidence within the first 1,000 samples.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 53,
    "total_chunks": 68,
    "char_count": 1208,
    "word_count": 170,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c15026e2-b4ba-4886-9280-0fb173b8187c",
    "text": "Although the shaded region begins relatively wide, it narrows quickly, indicating that the model becomes highly consistent across dropout passes. However, this consistency appears excessive; the model shows little hesitation even on inputs that are likely to be ambiguous. This behaviour echoes earlier findings of overfitting, with overconfidence persisting even under the Bayesian setting. The model attains certainty too rapidly and exhibits minimal variation between stochastic passes, suggesting that it underestimates epistemic uncertainty and fails to adequately signal when it is unsure.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 54,
    "total_chunks": 68,
    "char_count": 595,
    "word_count": 82,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7801aaf1-e0b4-4026-931a-5cbc4e105cc1",
    "text": "For GoogLeNet, the increase in confidence is smooth and more gradual.
The model begins with relatively low certainty for the first few hundred samples and then steadily gains confidence, eventually flattening near 1.0. Notably, the variation across dropout passes is most pronounced in the lower-confidence region and diminishes as confidence increases. This behaviour is expected; different dropout passes yield different predictions when the model is uncertain. However, once the model is sure, the variability in predictions diminishes. These dynamics indicate that GoogLeNet not only becomes confident but also appropriately reflects its uncertainty, behaving as expected for a well-calibrated Bayesian model. To further support the findings on model confidence and uncertainty, predictive entropy distributions were examined for both H-CNN VGG16 and GoogLeNet, as shown in Figure 18 in the Appendix.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 55,
    "total_chunks": 68,
    "char_count": 904,
    "word_count": 129,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "97d247d7-4bc7-4feb-8542-c615470673b2",
    "text": "H-CNN VGG16's entropy values are predominantly low and right-skewed, reflecting consistently high confidence across most predictions. In contrast, GoogLeNet exhibits a wider range of entropy values with a longer tail, indicating more frequent high-uncertainty predictions.
These patterns are consistent with the confidence interval plots discussed earlier; H-CNN VGG16 remains confident and consistent, even on more ambiguous inputs, while GoogLeNet demonstrates greater variability and appears more responsive to uncertainty.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 56,
    "total_chunks": 68,
    "char_count": 525,
    "word_count": 68,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "de9b0190-9ed4-454d-80d0-49ba7e6578a0",
    "text": "As described in Section 4, total predictive uncertainty can be decomposed into two main components. Predictive entropy captures the overall uncertainty in the model's output, combining contributions from both epistemic and aleatoric uncertainties. Epistemic uncertainty, often quantified using mutual information, reflects uncertainty about the model parameters, for example when the model has limited exposure to similar data and predictions vary across dropout passes. Aleatoric uncertainty, measured through the expected entropy, arises from noise or intrinsic ambiguity in the data itself, such as visually similar categories (e.g.
Shirt and T-shirt) that are inherently difficult to distinguish.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 57,
    "total_chunks": 68,
    "char_count": 700,
    "word_count": 94,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "83fefd5e-d7de-4521-ae6b-e6ce21448f27",
    "text": "The following analysis disentangles these two components to provide a more detailed characterisation of each model's uncertainty profile. In the H-CNN VGG16 plot (Figure 8 (left)), we see that both predictive entropy and expected entropy exhibit similar distributions, while the mutual information remains noticeably lower. This suggests that most of the model's uncertainty is aleatoric: the predictions are relatively stable across dropout passes, but the model still expresses uncertainty when the input is ambiguous. The narrow distribution of mutual information indicates limited epistemic uncertainty, meaning the model is overall confident in its predictions, reinforcing earlier observations that H-CNN VGG16 becomes confident rapidly and shows minimal variation across passes.
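The decomposition used in this analysis (mutual information = predictive entropy − expected entropy) can be sketched directly; the two-pass toy input below is an illustrative extreme of pure epistemic uncertainty, not real model output:

```python
import numpy as np

def decompose_uncertainty(prob_passes, eps=1e-12):
    """Split total uncertainty from (T, n, K) MC Dropout probabilities into
    aleatoric (expected entropy) and epistemic (mutual information) parts."""
    mean_probs = prob_passes.mean(axis=0)
    predictive_entropy = -np.sum(mean_probs * np.log(mean_probs + eps), axis=1)
    expected_entropy = -np.sum(prob_passes * np.log(prob_passes + eps), axis=2).mean(axis=0)
    mutual_information = predictive_entropy - expected_entropy
    return predictive_entropy, expected_entropy, mutual_information

# Two passes that disagree completely: each pass is certain, the mean is not,
# so all of the uncertainty is epistemic (mutual information = ln 2)
passes = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
pe, ee, mi = decompose_uncertainty(passes)
```

Conversely, if every pass returned the same flat distribution, the expected entropy would equal the predictive entropy and the mutual information would vanish, marking the uncertainty as purely aleatoric.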
Figure 8: Uncertainty Decomposition for H-CNN VGG16 & GoogLeNet GoogLeNet (Figure 8 (right)) displays a distinct pattern compared to H-CNN VGG16.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 58,
    "total_chunks": 68,
    "char_count": 917,
    "word_count": 128,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "979ec3b8-9336-476f-abc7-dd09c2e242f4",
    "text": "Both predictive and expected entropy values are higher, and mutual information is substantially larger. This suggests that GoogLeNet expresses greater epistemic uncertainty; the model's predictions vary more across dropout passes, indicating increased uncertainty in the model parameters. At the same time, the elevated expected entropy shows that the model also captures data-related ambiguity. The wider spread of all three metrics, especially mutual information, indicates that GoogLeNet is more expressive in signalling when unsure, which aligns with its more gradual confidence rise and wider uncertainty bands in earlier plots. By contrast, H-CNN VGG16 tends to rely heavily on its learned decision boundaries and rarely adjusts its predictions, even in cases where uncertainty would be warranted, an indication of overconfidence. GoogLeNet, on the other hand, exhibits more flexible behaviour, capturing both model and data uncertainty more clearly. This supports the view that GoogLeNet is better calibrated and more reliable in representing meaningful uncertainty. Previously, uncertainty was aggregated across multiple forward passes for each individual observation. In this plot (Figure 9), we group predictions by class and separate them into correct and incorrect cases.
This approach allows us to see how confident the model is on\naverage when it classifies correctly versus when it misclassifies. Ideally, a well-calibrated\nmodel should exhibit low entropy for correct classifications and higher entropy for errors,\nthereby signalling uncertainty appropriately. Consequently, we aim to maximise this difference, as it reflects the model's ability to separate confident correct predictions from uncertain errors in a meaningful and reliable\nmanner. We see a familiar pattern in H-CNN VGG16 (Figure 9 (left)): the model shows low\nentropy for correct predictions and high entropy for misclassified ones, with a clear separation\nbetween the two. At first glance, this is desirable, as it suggests the model can express\nuncertainty when it errs. However, considering our earlier findings, this sharp separation\nmay also be a sign of overfitting.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 59,
    "total_chunks": 68,
    "char_count": 2115,
    "word_count": 308,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8e67d522-8d2e-4b65-927d-e0a175ed169b",
    "text": "The model becomes highly confident very quickly, which\nmay not always reflect genuine uncertainty, particularly for ambiguous inputs.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 60,
    "total_chunks": 68,
    "char_count": 133,
    "word_count": 18,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "1a8061ef-f910-400b-a2c8-e232906bb6e1",
"text": "This pattern\nis evident in classes like Pullover and Coat, which were previously identified as frequently\nmisclassified. For Shirt, the worst-performing class according to the confusion matrix, the\ngap between correct and incorrect predictions is noticeably smaller, suggesting that the\nmodel does express higher uncertainty when less certain, consistent with behaviour expected from a well-calibrated model. Figure 9: Entropy per Class for Correct vs. Incorrect Predictions for H-CNN VGG16 & GoogLeNet GoogLeNet shows a different behaviour though (see (Figure 9 (right)). While the model\ngenerally assigns higher entropy to incorrect predictions than to correct ones, which is desirable, the separation between the two is less pronounced than in H-CNN VGG16. For certain\nclasses, such as Shirt, Coat, T-shirt, and Dress, the entropy for correct predictions remains\nrelatively high.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 61, + "total_chunks": 68, + "char_count": 882, + "word_count": 128, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6ecfc53c-fba9-456f-b008-7220ccab8b34", + "text": "This aligns with previous findings from the confusion matrix, which highlighted GoogLeNet's difficulty in distinguishing among these visually similar categories. The\nmodel often expresses uncertainty even when it predicts correctly, indicating that its correct\nclassifications may be closer to the decision boundary and at times verge on misclassification. GoogLeNet shows smaller gaps between correct and incorrect entropy, and higher uncertainty even for correct predictions. 
Although H-CNN VGG16 is more confident and sharp\nin its separation, GoogLeNet appears to be more cautious and uncertain overall. This suggests that H-CNN VGG16 may still be overconfident, while GoogLeNet is more hesitant but\npotentially better calibrated. To better understand how uncertainty is expressed across the model, we examine predictive entropy and confidence at the class level rather than relying on overall averages. This approach allows us to identify whether certain classes are more challenging to predict\nand to evaluate whether the model expresses uncertainty in a manner consistent with their\nperformance. The following analysis presents the distribution of predictive entropy values\nper class for both neural networks (Figure 10), using a logarithmic scale to emphasise variation across both straightforward and ambiguous classes. This approach also enables a direct\nlink between uncertainty and the misclassification trends observed earlier, providing a more\ncomplete picture of model behaviour. For H-CNN VGG16 (Figure 10 (left)), predictive entropy remains low across most classes,\nwith relatively tight distributions and few high-entropy outliers. This indicates that the\nmodel is generally confident in its predictions, regardless of class. 
However, slightly elevated\nentropy values appear for classes such as T-shirt, Coat, Pullover, and Shirt, among the most\nfrequently misclassified classes in the confusion matrix.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 62,
    "total_chunks": 68,
    "char_count": 1920,
    "word_count": 271,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "e8629288-e1f8-45f4-8a5d-24b44fe9a147",
    "text": "Nevertheless, the overall variation\nremains limited, indicating that the model's confidence does not adjust substantially between\neasier and more difficult classes. This behaviour reinforces earlier indications of overfitting,\nas H-CNN VGG16 tends to remain confident even on ambiguous or borderline inputs.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 63,
    "total_chunks": 68,
    "char_count": 735,
    "word_count": 101,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "bbbeead8-bb77-48a9-8a46-19d3f6d8f00c",
    "text": "Figure 10: Predictive Entropy by Class for H-CNN VGG16 & GoogLeNet In contrast, GoogLeNet shows a broader spread of predictive entropy across classes. 
Classes like Shirt, T-shirt, Coat, Pullover, and Dress exhibit notably higher median and\nupper-quartile entropy values, corresponding to the most frequently misclassified classes in\nthe confusion matrix.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 64, + "total_chunks": 68, + "char_count": 662, + "word_count": 91, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "a7d552db-4c84-4834-a1dc-6c3b580444d7", + "text": "This indicates that GoogLeNet is more sensitive to uncertainty in\nchallenging cases and reflects that uncertainty more clearly in its predictions. Compared to\nH-CNN VGG16, GoogLeNet appears better calibrated and more capable of expressing doubt\nwhere appropriate, particularly for visually similar or ambiguous classes. Taken together,\nthese findings highlight a trade-off; H-CNN VGG16 maintains efficiency through strong\nconfidence, while GoogLeNet prioritises reliability by more explicitly expressing uncertainty. To assess the robustness of our results, we examined the mean predicted class confidence\ntogether with its standard deviation, computed from fifty Monte Carlo Dropout forward\npasses across different classes. As these findings are consistent with the patterns discussed\nabove, the corresponding plots are provided in the Appendix Figures 19 and 20. 
This\nplacement avoids redundancy in the main text while still providing full detail for reference.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 65, + "total_chunks": 68, + "char_count": 963, + "word_count": 134, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2abf5ac2-2594-4bf4-86dd-73e702ddbe5a", + "text": "Comparative Analysis: Conformal vs. Bayesian\nThis section examines how Conformal Prediction (CP) set sizes relate to predictive uncertainty estimated via Monte Carlo Dropout. While both methods quantify uncertainty, they\ndo so in fundamentally different ways. MC Dropout approximates a Bayesian posterior by\nperforming multiple stochastic forward passes during inference, with predictive entropy capturing the dispersion of predicted probabilities across classes as a measure of uncertainty. In\ncontrast, Conformal Prediction does not rely on entropy; instead, it ensures statistically valid\ncoverage by calibrating prediction set sizes according to the softmax probability assigned to\nthe true class, using a held-out calibration set. 
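The split-conformal construction just described, with the score defined as one minus the softmax probability of the true class, can be sketched as follows. This is a generic illustration under our own naming, not the authors' code (note: `method="higher"` in `np.quantile` requires NumPy 1.22+):

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate on held-out data: score = 1 - softmax prob of the true class."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level: ceil((n+1)(1-alpha)) / n
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def prediction_set(probs, qhat):
    """Include every class whose softmax probability is at least 1 - qhat."""
    return [np.flatnonzero(p >= 1.0 - qhat) for p in probs]
```

Under exchangeability of calibration and test points, the resulting sets contain the true label with probability at least 1 - alpha, regardless of how well the softmax probabilities are calibrated.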
This distinction has practical consequences.",
    "paper_id": "2603.10731",
    "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks",
    "authors": [
      "Sanne Ruijs",
      "Alina Kosiakova",
      "Farrukh Javed"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10731v1",
    "chunk_index": 66,
    "total_chunks": 68,
    "char_count": 780,
    "word_count": 105,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "922f7ba7-486c-4ad3-a2a0-37c5352807a1",
    "text": "A model may exhibit high uncertainty (high\nentropy) while still assigning high confidence to the true label, resulting in a small CP set. Conversely, it may have low entropy but low true-class confidence, triggering CP to expand the prediction set. The correlation between entropy and set size therefore depends on how\nclosely these two forms of uncertainty align in practice. In the case of H-CNN VGG16 (Figure 11 (left)), the relationship between entropy and set\nsize is strong and consistent. As predictive entropy increases, CP prediction sets also expand. The scatter plot shows a clear, monotonic pattern: predictions with low entropy typically\ncorrespond to set sizes of 1, whereas higher-entropy predictions more often require sets of size\n2, 3, or even 4. This indicates that although H-CNN VGG16 tends to be overconfident overall\n(as shown in earlier plots), its entropy rankings still provide a reliable signal of prediction\ndifficulty, allowing CP to adapt its set sizes effectively. In short, H-CNN VGG16's entropy\nmay underestimate total uncertainty, but it is internally coherent and aligns well with CP\ncalibration. Figure 11: Prediction Set vs. Predictive Entropy for H-CNN VGG16 & GoogLeNet For GoogLeNet (Figure 11 (right)), the relationship between entropy and set size is less\nclear. Although entropy generally increases for more difficult predictions, many low-entropy\nsamples still result in large prediction sets. 
This behavior stems from the model's cautious\nprobability assignments; GoogLeNet tends to distribute its confidence more evenly across\nmultiple plausible classes, even when its prediction is correct. As a result, the probability\nassigned to the true class can be relatively modest, not because the model is incorrect,\nbut because it is better calibrated and avoids overconfidence. While this conservatism is\nadvantageous from a reliability perspective, it often causes CP to enlarge prediction sets,\nthereby weakening the correlation between entropy and set size.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 67, + "total_chunks": 68, + "char_count": 1981, + "word_count": 300, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ab9a285d-52bb-4bba-9d2c-9cbd48082e0c", + "text": "In conclusion, these findings highlight the complementary roles of MC Dropout and Conformal Prediction. MC Dropout captures the model's internal uncertainty, while CP guarantees empirical coverage regardless of calibration quality. When predictive entropy aligns\nwell with true-label confidence, as in H-CNN VGG16, the two methods work in sync. When\nthis alignment is weaker, as in GoogLeNet, CP acts as a corrective mechanism to maintain\nreliability, even when entropy alone does not fully explain variation in set sizes. Ultimately,\nthis comparison underscores a trade-off: H-CNN VGG16 favours efficiency through smaller\nprediction sets, whereas GoogLeNet prioritises reliability by more consistently signalling\nuncertainty. 
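The strength of the entropy/set-size relationship discussed in this section can be quantified with a rank correlation, which respects the ordinal nature of set sizes. A small self-contained sketch (our own helper, using averaged ranks for ties rather than a library call):

```python
import numpy as np

def spearman_rho(entropy, set_sizes):
    """Spearman rank correlation between per-sample predictive entropy
    and conformal prediction set size (ties receive averaged ranks)."""
    def ranks(values):
        x = np.asarray(values, dtype=float)
        order = np.argsort(x, kind="stable")
        r = np.empty(len(x))
        r[order] = np.arange(len(x), dtype=float)
        for v in np.unique(x):        # average ranks within tied groups
            tied = x == v
            r[tied] = r[tied].mean()
        return r
    return float(np.corrcoef(ranks(entropy), ranks(set_sizes))[0, 1])
```

A value near 1 corresponds to the tight monotone pattern seen for H-CNN VGG16, while the weaker alignment described for GoogLeNet would yield a noticeably lower coefficient.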
This paper addressed the gap in the existing literature by comparing two fundamentally different approaches to uncertainty estimation in deep convolutional neural networks:\nBayesian approximation via Monte Carlo Dropout and the nonparametric method of Conformal Prediction. The analysis was conducted on two distinct architectures, H-CNN VGG16\nand GoogLeNet, applied consistently to the Fashion-MNIST dataset to ensure methodological comparability. Our empirical results provide several key insights. First, they clarify\nhow uncertainty is expressed across models, distinguishing between epistemic and aleatoric\ncomponents. Second, they reveal systematic patterns in class-level ambiguity, showing how\nmodels respond to visually similar categories. Finally, they demonstrate the complementary\nstrengths of Bayesian and Conformal approaches: while Bayesian methods capture internal\nmodel uncertainty, Conformal Prediction guarantees empirical coverage and corrects for calibration weaknesses. Taken together, these findings advance our understanding of predictive\nreliability in deep learning and underscore the importance of designing models that not only\nachieve high accuracy but also convey trustworthy measures of uncertainty.", + "paper_id": "2603.10731", + "title": "Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks", + "authors": [ + "Sanne Ruijs", + "Alina Kosiakova", + "Farrukh Javed" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10731v1", + "chunk_index": 68, + "total_chunks": 68, + "char_count": 1957, + "word_count": 257, + "chunking_strategy": "semantic" + } +] \ No newline at end of file