Harshit Ghosh committed
Commit b74bbb6 · 1 Parent(s): b0fbfb5

B4 performance and README update

Files changed (2)
  1. B4_Performance_Report.md +305 -0
  2. README.md +69 -562
B4_Performance_Report.md ADDED
# B4 Model Performance Report

## Project Title
AI-Assisted Intracranial Hemorrhage Screening from Non-Contrast CT using EfficientNet-B4 with 2.5D Context and Multi-Label Learning

## Document Metadata
- Student Project: Major Project Submission
- Model Focus: EfficientNet-B4
- Pipeline Type: 2.5D (9-channel) multi-label screening
- Evaluation Style: 5-fold OOF slice-level metrics with triage analysis
- Report Length Target: 15-20 pages when formatted in standard academic layout

---

## Executive Summary
This report presents the full technical and performance assessment of the EfficientNet-B4 pipeline developed for intracranial hemorrhage (ICH) screening from CT images. The system is designed as a screening and triage aid, not a standalone diagnostic engine. The B4 pathway extends the baseline design by incorporating richer spatial context through 2.5D input, multi-label outputs for hemorrhage subtypes, calibration-aware decisioning, and triage-centric reporting.

The model was evaluated in out-of-fold (OOF) settings with subgroup reporting and triage distribution analysis. The any-class slice-level AUC is 0.95052, with strong subtype performance and high intraventricular discrimination (AUC 0.97196). Per-fold uncertainty is low (mean any-AUC 0.95102 ± 0.00230), indicating stable training behavior under GroupKFold validation.

Although a historical baseline showed a slightly higher any-only AUC in one earlier protocol, the B4 design adds significant practical value through:
- Multi-label subtype support
- Better clinical interpretability
- Triage-band stratification
- Fold-robust uncertainty estimation
- Scalable architecture for future deployment upgrades

This report includes methods, evaluation metrics, interpretation, risks, limitations, and deployment guidance.

---

## 1. Introduction
Intracranial hemorrhage is a time-sensitive radiological emergency in which delayed detection can lead to severe disability or death. In high-throughput emergency environments, AI-supported prioritization can reduce time-to-review for suspicious cases. However, purely accuracy-focused solutions are insufficient for clinical workflows unless they also provide reliability indicators and structured outputs that support radiologist decisions.

The B4 model in this project was developed to improve operational usefulness, not just headline AUC. The key premise is that hemorrhage patterns often span adjacent slices and multiple subtype manifestations; therefore, a context-aware, multi-label approach is better aligned with practical reading conditions than isolated binary single-slice scoring.

---

## 2. Problem Statement
The project addresses the following practical gaps in ICH AI screening:
- Binary-only formulations miss the subtype granularity required for useful case communication.
- Single-slice input ignores local anatomical continuity.
- Single-split validation can overstate confidence in model performance.
- Output without triage semantics has limited workflow impact.

Goal for the B4 track:
- Build a robust, fold-validated, context-aware, multi-label screening model with triage-oriented outputs and explicit safety framing.

---

## 3. Model and Data Pipeline

### 3.1 Backbone
- EfficientNet-B4
- Approximate parameter scale: 19M

### 3.2 Input Design
- 2.5D representation
- 9-channel tensor constructed from neighboring slices and CT windows

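The 9-channel construction can be sketched as follows. This is a minimal illustration assuming three neighboring slices crossed with three CT windows; the exact window (center, width) settings are not stated in this report, so the values below are common ICH choices used purely for demonstration:

```python
import numpy as np

# Assumed CT window settings (center, width) in Hounsfield units.
# The report does not list the exact windows; these are common ICH choices
# (brain, subdural, bone) used here only as an illustration.
WINDOWS = [(40, 80), (80, 200), (600, 2800)]

def apply_window(hu_slice, center, width):
    """Clip a slice of HU values to one CT window and rescale to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    return (np.clip(hu_slice, lo, hi) - lo) / (hi - lo)

def build_25d_input(volume, idx):
    """Stack previous/current/next slices, each under 3 windows -> 9 channels."""
    channels = []
    for offset in (-1, 0, 1):
        j = int(np.clip(idx + offset, 0, len(volume) - 1))  # clamp at ends
        for center, width in WINDOWS:
            channels.append(apply_window(volume[j], center, width))
    return np.stack(channels, axis=0)  # shape (9, H, W)

# Toy volume: 24 slices of 64x64 HU values (real CT slices are 512x512).
volume = np.random.randint(-1000, 2000, size=(24, 64, 64)).astype(np.float32)
x = build_25d_input(volume, idx=0)
```

Clamping at the volume ends means the first and last slices reuse themselves as neighbors, which keeps the channel count fixed at 9 for every slice.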
### 3.3 Output Targets
The model predicts six labels:
- any
- epidural
- intraparenchymal
- intraventricular
- subarachnoid
- subdural

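Because the six labels are not mutually exclusive, the head is a 6-logit sigmoid output trained with binary cross-entropy rather than a softmax classifier. A minimal NumPy sketch of that formulation (the logits and targets below are invented for illustration):

```python
import numpy as np

LABELS = ["any", "epidural", "intraparenchymal",
          "intraventricular", "subarachnoid", "subdural"]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over the 6 independent label outputs."""
    p = sigmoid(logits)
    eps = 1e-7  # guard against log(0)
    return float(-np.mean(targets * np.log(p + eps)
                          + (1.0 - targets) * np.log(1.0 - p + eps)))

# One slice may carry several subtypes at once (e.g. subdural + subarachnoid),
# which a softmax over subtypes could not express.
logits = np.array([2.1, -3.0, -2.5, -2.0, 1.4, 1.8])   # illustrative values
targets = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 1.0])
probs = dict(zip(LABELS, sigmoid(logits)))
loss = multilabel_bce(logits, targets)
```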
### 3.4 Validation Strategy
- 5-fold GroupKFold OOF evaluation
- Per-fold AUC tracking for uncertainty estimation

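The grouped OOF protocol can be sketched with scikit-learn's `GroupKFold`; the toy arrays below stand in for slice features, labels, and patient IDs:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-ins: 100 slices from 20 patients, 5 slices per patient.
# Grouping on patient ID keeps every slice of a patient in one fold,
# which prevents patient-level leakage between train and validation.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 2, size=100)
groups = np.repeat(np.arange(20), 5)

oof = np.full(len(y), np.nan)  # out-of-fold prediction buffer
for fold, (tr, va) in enumerate(GroupKFold(n_splits=5).split(X, y, groups)):
    assert set(groups[tr]).isdisjoint(groups[va])  # no shared patients
    # A real run would train the B4 model on X[tr] here; a constant
    # predictor stands in so the sketch stays self-contained.
    oof[va] = y[tr].mean()
```

Every slice ends up scored exactly once by a model that never saw its patient, which is what makes the OOF metrics in Section 5 honest estimates.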
### 3.5 Calibration and Triage
- Isotonic calibration used in the improved path
- Triage bands include HIGH, MEDIUM, UNCERTAIN, LOW

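A sketch of the calibration-plus-banding step, assuming scikit-learn's isotonic regression; the band cut-offs in `triage_band` are illustrative only, since the report does not publish the actual band edges:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic validation scores: informative but systematically miscalibrated.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=2000)
raw = np.clip(0.3 * y_true + rng.uniform(0.0, 0.7, size=2000), 0.0, 1.0)

# Isotonic regression learns a monotone map from raw score to empirical
# positive rate, matching the "isotonic calibration" named in the report.
iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit_transform(raw, y_true)

def triage_band(p, low=0.05, high=0.70, uncertain=(0.45, 0.55)):
    """Map a calibrated probability to a band. Cut-offs are illustrative."""
    if uncertain[0] <= p <= uncertain[1]:
        return "UNCERTAIN"
    if p >= high:
        return "HIGH"
    if p <= low:
        return "LOW"
    return "MEDIUM"

bands = [triage_band(p) for p in calibrated]
```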
---

## 4. Experimental Configuration
The B4 pipeline was executed under the following report-aligned configuration:
- Model family: EfficientNet-B4
- Input formulation: 2.5D, 9 channels
- Task formulation: Multi-label classification (6 classes)
- Validation protocol: 5-fold OOF
- Reporting scope: Slice-level metrics, per-fold uncertainty, triage distributions, representative case summaries

---

## 5. Quantitative Results

### 5.1 Slice-Level OOF Metrics

| Subtype | AUC | Sens | Spec | F1 | Threshold | Sens@Spec90 | Spec@Sens95 |
|---|---:|---:|---:|---:|---:|---:|---:|
| any | 0.95052 | 0.8556 | 0.9038 | 0.7041 | 0.1622 | 0.8556 | 0.7447 |
| epidural | 0.90027 | 0.8471 | 0.8234 | 0.0400 | 0.0050 | 0.7554 | 0.5025 |
| intraparenchymal | 0.95888 | 0.8868 | 0.8996 | 0.4548 | 0.0556 | 0.8862 | 0.7695 |
| intraventricular | 0.97196 | 0.9371 | 0.9055 | 0.4174 | 0.0436 | 0.9371 | 0.8833 |
| subarachnoid | 0.92206 | 0.8744 | 0.8146 | 0.3112 | 0.0548 | 0.7561 | 0.6988 |
| subdural | 0.93016 | 0.8507 | 0.8584 | 0.4285 | 0.0685 | 0.7860 | 0.6867 |

Observations:
- Best subtype discrimination is intraventricular (AUC 0.97196).
- any-class AUC is strong for screening use.
- Epidural shows expected challenges due to class rarity and imbalance effects.

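The operating-point columns (Sens@Spec90, Spec@Sens95) are standard ROC read-offs. A sketch of how they can be computed with scikit-learn, on synthetic scores standing in for the real OOF predictions:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic slice labels and scores for demonstration only.
rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=5000)
scores = np.clip(0.35 * y + rng.normal(0.3, 0.2, size=5000), 0.0, 1.0)

fpr, tpr, _ = roc_curve(y, scores)

def sens_at_spec(fpr, tpr, spec=0.90):
    """Best sensitivity among operating points with specificity >= spec."""
    ok = (1.0 - fpr) >= spec
    return float(tpr[ok].max())

def spec_at_sens(fpr, tpr, sens=0.95):
    """Best specificity among operating points with sensitivity >= sens."""
    ok = tpr >= sens
    return float((1.0 - fpr)[ok].max())

sens_at_spec90 = sens_at_spec(fpr, tpr)
spec_at_sens95 = spec_at_sens(fpr, tpr)
```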
### 5.2 Per-Fold Uncertainty (Any Label)
- Fold 0: 0.95199
- Fold 1: 0.95106
- Fold 2: 0.95228
- Fold 3: 0.95315
- Fold 4: 0.94662

Aggregate:
- Mean ± Std: 0.95102 ± 0.00230
- Approximate 95% CI: [0.94901, 0.95303]

Interpretation:
- Low fold variance indicates robust generalization behavior and reproducibility within the chosen split methodology.

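The aggregate numbers are reproducible (to within rounding in the last digit) from the five fold AUCs above, assuming a population standard deviation and a normal-approximation confidence interval:

```python
import numpy as np

fold_aucs = np.array([0.95199, 0.95106, 0.95228, 0.95315, 0.94662])

mean = fold_aucs.mean()
std = fold_aucs.std()  # population std (ddof=0) reproduces the 0.00230
half_width = 1.96 * std / np.sqrt(len(fold_aucs))  # normal approximation
ci = (mean - half_width, mean + half_width)
```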
---

## 6. Confusion Matrix Evidence (OOF)
At a threshold of approximately 0.162 for any-class decisioning, the OOF confusion matrix from the provided figure is:
- True Negative: 58,284
- False Positive: 6,203
- False Negative: 1,558
- True Positive: 9,235

Derived notes:
- The high true-negative volume indicates broad filtering ability.
- False negatives remain non-zero, so mandatory human oversight is required.
- False positives are operationally less harmful than false negatives in a screening context, but they increase workload.

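The headline any-class sensitivity, specificity, and F1 reported in Section 5.1 follow directly from these four counts:

```python
# OOF confusion-matrix entries for the any label at threshold ~0.162.
tn, fp, fn, tp = 58284, 6203, 1558, 9235

sensitivity = tp / (tp + fn)   # recall on hemorrhage-positive slices
specificity = tn / (tn + fp)   # recall on hemorrhage-negative slices
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
```

Carrying out the division recovers sensitivity 0.8556, specificity 0.9038, and F1 0.7041, matching the table in Section 5.1.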
---

## 7. Triage Distribution Analysis
Reported 4-band triage distribution:
- HIGH: n=6,707, positives=6,023, prevalence=0.898
- UNCERTAIN: n=816, positives=408, prevalence=0.500
- MEDIUM: n=3,941, positives=1,887, prevalence=0.479
- LOW: n=63,816, positives=2,475, prevalence=0.039

Interpretation:
- The HIGH band demonstrates strong concentration of positive findings.
- The UNCERTAIN band isolates clinically ambiguous instances and is suitable for prioritized manual attention.
- The LOW band maintains low prevalence, supporting triage de-prioritization while retaining the safety caveat.

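The reported prevalences are simply positives divided by band size; the counts above reproduce them exactly:

```python
# (n, positives) per triage band, as reported in Section 7.
bands = {
    "HIGH": (6707, 6023),
    "UNCERTAIN": (816, 408),
    "MEDIUM": (3941, 1887),
    "LOW": (63816, 2475),
}
prevalence = {band: pos / n for band, (n, pos) in bands.items()}
```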
---

## 8. Representative Clinical Case Summaries
The provided high-triage case summaries demonstrate that the model often identifies plausible multi-subtype combinations in positive studies. Example patterns include combinations of subdural with subarachnoid or intraparenchymal signals at high confidence.

Operational takeaway:
- Multi-label output is useful for communication and review focus.
- The model remains a screening aid and should never be used as the sole diagnostic authority.

---

## 9. B4 Versus Baseline Context
From supplied comparison text:
- Architecture: B0 baseline vs B4 improved
- Input: 2D 3ch vs 2.5D 9ch
- Formulation: Binary vs Multi-label
- Validation: Single split vs 5-fold GroupKFold

Reported delta (any AUC) in that specific summary:
- Baseline any AUC: 0.95800
- Improved B4 any AUC: 0.95052
- Delta: -0.00748

Important interpretation note:
This delta should not be read as a simple regression claim without protocol harmonization. The improved model is evaluated under a more demanding setup (multi-label plus fold-based uncertainty estimation), and it contributes features not captured by any-only AUC:
- subtype intelligence
- triage-aware utility
- improved methodological rigor in validation

---

## 10. Figure-Based Analysis (From Supplied Images)

### Figure Group A: ROC Curves Per Subtype (OOF)
Insights:
- Strong curve separation from the random baseline for all subtypes.
- Intraventricular and intraparenchymal are strongest.
- Epidural remains the most difficult class.

### Figure Group B: Patient-Level ROC Aggregation Comparison
Displayed AUCs are close:
- max: 0.9525
- mean: 0.9471
- noisy_or: 0.9515
- topk_mean: 0.9514

Interpretation:
- Aggregation strategy selection should be aligned with the clinical objective (sensitivity floor versus workload constraints).

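The figure names the four aggregation rules but does not define them, so the sketch below uses their conventional definitions (noisy-OR treats slice predictions as independent evidence; top-k mean averages the k most suspicious slices). The slice probabilities are invented for illustration:

```python
import numpy as np

def agg_max(p):
    return float(np.max(p))

def agg_mean(p):
    return float(np.mean(p))

def agg_noisy_or(p):
    # Probability that at least one slice is positive, treating slice
    # predictions as independent evidence.
    return float(1.0 - np.prod(1.0 - p))

def agg_topk_mean(p, k=3):
    # Mean of the k most suspicious slices.
    return float(np.mean(np.sort(p)[-k:]))

# Invented slice-level probabilities for one study.
slice_probs = np.array([0.02, 0.05, 0.10, 0.65, 0.80, 0.30, 0.04])
study_scores = {
    "max": agg_max(slice_probs),
    "mean": agg_mean(slice_probs),
    "noisy_or": agg_noisy_or(slice_probs),
    "topk_mean": agg_topk_mean(slice_probs),
}
```

Noisy-OR and max emphasize any strong slice (sensitivity floor), while mean and top-k mean damp isolated spikes (workload control), which is the trade-off the interpretation bullet refers to.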
### Figure Group C: OOF Confusion Matrix
Confirms the operational trade-off at the selected threshold:
- substantial true-positive recovery
- acceptable but non-trivial false-positive volume

---

## 11. Statistical Reliability and Uncertainty
B4 fold spread is narrow, which improves confidence in performance stability:
- Std ≈ 0.00230 for any-AUC across folds
- CI indicates strong central tendency around 0.951

For final submission, this fold uncertainty evidence is stronger than single-point metrics because it quantifies repeatability.

---

## 12. Clinical Utility Assessment
Strengths of the B4 track for screening:
- High any-class discriminatory power
- Multi-label subtype context for communication
- Triage stratification to support operational prioritization
- Suitable for front-end assistance in high-volume settings

Risk points:
- False negatives cannot be eliminated
- Class imbalance (especially epidural) can affect threshold behavior
- External site drift remains possible

Recommendation:
- Use as an assistive pre-read or queue-prioritization tool under radiologist supervision.

---

## 13. Error Modes and Risk Discussion
Likely error contributors:
- subtle small bleeds near the skull base
- beam-hardening and motion artifacts
- mixed-density presentations
- overlap between non-hemorrhagic hyperdensities and hemorrhage-like patterns

Mitigation path:
- targeted hard-negative mining
- subtype-aware threshold optimization
- study-level fusion with metadata
- human review escalation rules for the uncertain band

---

## 14. Deployment Position
Proposed deployment stance for B4:
- Assistive triage only
- Do not auto-clear without a human review policy
- Preserve model versioning and threshold logs
- Track drift through periodic calibration audits

---

## 15. Limitations
- External validation across institutions is pending.
- Domain shift across scanner vendors is not fully characterized.
- This report relies on provided OOF and figure summaries; raw per-case audit trails should be archived for the final viva.

---

## 16. Ethical and Regulatory Considerations
- Screening tool only, not a diagnostic device.
- Human-in-the-loop review is mandatory.
- Explanations are supportive visual cues, not confirmed lesion boundaries.
- Clinical decisions must remain with licensed professionals.

---

## 17. Future Work
1. Add explicit subtype prevalence-aware loss weighting refinements.
2. Integrate stronger patient-level calibration and threshold governance.
3. Compare B4 against B3/B5 under unified evaluation scripts.
4. Add prospective-like temporal split testing.
5. Add cost-sensitive thresholding to tune the false-positive burden.

---

## 18. Conclusion
The B4 model demonstrates a mature, clinically aligned screening pipeline with robust OOF performance and strong subtype-level discrimination. While any-only AUC comparisons with the baseline require protocol-matched interpretation, the B4 system provides broader practical value through multi-label outputs, uncertainty-aware fold validation, and triage-oriented reporting.

This makes B4 a defensible advanced model track for project submission, provided the final defense clearly states that the system is an assistive triage tool and not an autonomous diagnosis engine.

---

## 19. Appendix A: Key Numbers (Quick Reference)
- Any AUC: 0.95052
- Any Sensitivity: 0.8556
- Any Specificity: 0.9038
- Fold Mean AUC: 0.95102
- Fold Std: 0.00230
- OOF confusion (threshold ≈ 0.162): TN 58,284, FP 6,203, FN 1,558, TP 9,235

---

## 20. Appendix B: Figures to Embed in Final PDF
Insert the provided B4 visuals in this order in the final report export:
1. ROC Curves Per Subtype (OOF)
2. Patient-Level ROC Aggregation Comparison
3. OOF Confusion Matrix (threshold ≈ 0.162)

Suggested captions:
- Figure B4-1: Subtype ROC performance under OOF validation.
- Figure B4-2: Patient-level aggregation strategy comparison.
- Figure B4-3: Any-class confusion matrix at the selected screening threshold.
README.md CHANGED
@@ -1,597 +1,104 @@
- # Major Project Documentation

- ## Project Structure and Setup

- This repository is organized into clear functional sections:

- `app.py`: Flask web application for upload, inference, browsing reports, and logs
- `run_interface.py`: Compatibility adapter between the web app and model inference code
- `download_imp/`: Model artifacts and core inference implementation
- `templates/`: Jinja2 HTML templates for the web UI
- `static/`: CSS assets
- `logs/` and `uploads/`: Runtime folders created automatically

- ### Quick Start

- 1. Create and activate a Python virtual environment.
- 1. Install dependencies:

- ```bash
- pip install -r requirements.txt
- ```

- 1. Create environment file:

- ```bash
- cp .env.example .env
- ```

- 1. Run the web app:

- ```bash
- python app.py
- ```

- 1. Open:

- ```text
- http://127.0.0.1:7860
- ```

- ### Notes

- Model and calibration files are expected under `download_imp/`.
- Generated reports are written to `download_imp/outputs/reports/`.
- `.gitignore` excludes runtime/generated files to keep version control clean.

- ### Fold Selection

- The web app supports configurable fold selection via `.env` / environment variable:

  ```bash
- ICH_FOLD_SELECTION=ensemble
  ```

- Supported values:
-
- `ensemble` (default): loads all available folds and averages predictions
- `best`: uses the best single fold from the performance report summary
- `0` to `4`: force a specific fold
-
- Based on [B4_Performance_Report.md](B4_Performance_Report.md), per-fold any-AUC indicates fold `3` as the strongest single fold in that table.
-
- ### GitHub Artifact Policy
-
- Model checkpoints and heavy binary artifacts are intentionally ignored for GitHub:
-
- `download_imp/*.pth`
- `download_imp/*.pkl`
- `download_imp/outputs/`
-
- This keeps the repository lightweight and reproducible while allowing local/model-private inference.
-
- ### Publish Model To Hugging Face
-
- 1. Create a token with `Write` access: `https://huggingface.co/settings/tokens`
- 1. Set credentials in `.env` (or export in shell):
-
- ```bash
- HF_TOKEN=your_hf_token
- HF_REPO_ID=your-username/ich-b4-model
- ```
-
- 1. Upload model artifacts:
-
- ```bash
- python scripts/upload_to_huggingface.py --repo-id "$HF_REPO_ID" --private
- ```
-
- Published model repository:
-
- [Hugging Face Model Repo](https://huggingface.co/HarshCode/eff_b4_brain)
-
- ### Host On GitHub Pages
-
- 1. Ensure your default branch is `main`.
- 1. Push this repository with `.github/workflows/pages.yml` and `docs/`.
- 1. In GitHub repository settings:
- 1. Open `Settings -> Pages`
- 1. Set source to `GitHub Actions`
-
- The site will be published automatically after push.
-
- ## AI-Assisted CT-Based Intracranial Hemorrhage Detection with Explainability and Clinical Reporting
-
- ---
-
- ## 1. Introduction
-
- **Intracranial hemorrhage (ICH)** is a life-threatening neurological condition caused by bleeding within the skull. It is one of the most critical forms of stroke and requires immediate medical attention. Delays in detection and intervention significantly increase the risk of mortality and long-term neurological damage.
-
- **Computed Tomography (CT)** imaging is the primary diagnostic tool used in emergency settings for detecting intracranial hemorrhage due to its speed, availability, and high sensitivity to bleeding. However, accurate interpretation of CT scans requires experienced radiologists and must often be performed under time pressure, especially in emergency departments with high patient volumes.
-
- Artificial intelligence has shown promise in assisting medical image interpretation. However, many AI-based solutions function as **black-box models** and focus solely on prediction accuracy, limiting their trustworthiness and clinical adoption. There is a need for an AI-assisted system that supports early screening while providing transparent and interpretable outputs to aid clinical decision-making.
-
- ---
-
- ## 2. Problem Statement
-
- Intracranial hemorrhage detection from CT brain scans is a critical yet time-sensitive task in emergency medical care. Although CT imaging is effective for identifying hemorrhage, timely and accurate interpretation depends heavily on the availability of skilled radiologists. In high-pressure or resource-constrained environments, delays or misinterpretations can adversely affect patient outcomes.
-
- Existing AI-based approaches for hemorrhage detection often emphasize binary predictions without sufficient explainability or clinical context. This lack of transparency makes it difficult for healthcare professionals to rely on AI outputs during screening and prioritization.
-
- **The problem addressed in this project** is the absence of an explainable, AI-assisted screening system that can detect intracranial hemorrhage from CT scans while providing interpretable visual evidence and structured clinical explanations to support medical professionals.
-
- > **Note:** The system is intended strictly as a screening and assistive tool, not as a diagnostic replacement for certified medical practitioners.
-
- ---
-
- ## 3. Objectives
-
- ### Primary Objectives
-
- To develop an AI-based system capable of detecting intracranial hemorrhage from CT brain images
- To classify CT scans into clearly defined categories: **hemorrhage present** or **absent**
- To assist emergency screening by prioritizing high-risk cases
-
- ### Secondary Objectives
-
- To integrate visual explainability techniques that highlight regions influencing the model's predictions
- To generate structured, human-readable clinical reports summarizing findings
- To evaluate model reliability with emphasis on **false-negative reduction**
- To ensure ethical deployment as a decision-support system rather than a diagnostic authority
-
- ---
-
- ## 4. Scope of the Project
-
- ### Included
-
- ✅ CT brain image preprocessing and normalization
- ✅ Binary classification of intracranial hemorrhage presence
- ✅ Explainability using activation-based heatmaps
- ✅ Confidence-aware screening and structured reporting
-
- ### Excluded
-
- ❌ Stroke subtype classification beyond hemorrhage detection
- ❌ Treatment recommendation or diagnosis
- ❌ Real-time clinical deployment
- ❌ Integration with hospital information systems
-
- ---
-
- ## 5. Dataset Description
-
- The project utilizes publicly available CT brain imaging datasets from Kaggle, such as:
-
- **Primary datasets:**
- RSNA Intracranial Hemorrhage Detection Dataset
- CT Brain Hemorrhage Dataset
-
- These datasets contain labeled CT scan images indicating the presence or absence of intracranial hemorrhage, with some datasets also providing hemorrhage subtype annotations.
-
- Using public datasets ensures reproducibility, ethical compliance, and feasibility within academic constraints.
-
- ---
-
- ## 6. Methodology
-
- ### 6.1 Data Preprocessing
-
- Preprocessing is essential to improve image quality and model performance:
-
- Conversion of DICOM images to standardized formats where required
- Image resizing to fixed resolution
- **CT windowing (brain window)** to enhance hemorrhage visibility
- Intensity normalization
- Noise reduction and artifact handling
- Data augmentation to improve generalization and mitigate class imbalance
-
- ### 6.2 Model Development
-
- A **convolutional neural network (CNN)** is implemented using transfer learning.
-
- Pretrained architectures such as **ResNet** or **EfficientNet** are adapted for CT image analysis
- The final classification layer is modified for **binary output**:
- **Class 0**: No intracranial hemorrhage
- **Class 1**: Intracranial hemorrhage present
- The model is trained using supervised learning with appropriate loss functions
-
- ### 6.3 Evaluation Metrics
-
- Model performance is evaluated using clinically relevant metrics:
-
- **Sensitivity (Recall)** – prioritized to reduce missed hemorrhage cases
- Specificity
- Precision
- Confusion matrix analysis
- Receiver Operating Characteristic (ROC) curve
-
- > **Note:** Accuracy alone is not treated as a sufficient indicator of clinical usefulness.
-
- ---
-
- ## 7. Explainability Module
-
- To ensure transparency and trustworthiness:
-
- **Gradient-weighted Class Activation Mapping (Grad-CAM)** will be applied
- Heatmaps are generated to visualize regions contributing to predictions
- Highlighted areas are analyzed in relation to known hemorrhage patterns in CT images
-
- This module allows clinicians to visually verify AI decisions rather than relying solely on binary outputs.
-
- ### 7.1 Explainability Quality Assurance
-
- To ensure the reliability of explainability outputs:
-
- **Sanity checks** will be implemented (e.g., occlusion/perturbation tests) to verify Grad-CAM is not highlighting irrelevant borders, text markers, or artifacts
- Failure cases where heatmaps are misleading will be documented
- For a sample of True Positives, False Negatives, and False Positives, Grad-CAM overlays will include brief qualitative notes comparing:
- What the model highlights
- What a clinician would expect to see
- This ensures visual evidence aligns with clinical reasoning
-
- ---
-
- ## 8. Confidence-Aware Screening
-
- Instead of a simple binary output, the system incorporates prediction confidence:
-
- **High-confidence hemorrhage detection** → urgent attention
- **Low-confidence predictions** → manual review recommendation
-
- This approach reflects real-world screening workflows and reduces over-reliance on automated decisions.
-
- ### 8.1 Confidence Calibration
-
- To ensure clinicians can trust the confidence scores:
-
- **Calibration techniques** will be applied (e.g., temperature scaling or isotonic regression)
- Both **raw probability** and **calibrated confidence** will be reported
- Expected Calibration Error (ECE) will be evaluated
- Three confidence bands will be defined:
- **High confidence**: urgent attention required
- **Medium confidence**: standard review
- **Low confidence**: manual review recommended
- Error rates will be analyzed across each confidence band to support triage decisions
-
- ---
-
- ## 9. Clinical Report Generation
-
- A structured report generation module converts model outputs into human-readable explanations. Each report includes:
-
- Screening outcome summary
- Prediction confidence
- Visual explainability reference
- Clinical interpretation phrased as decision support
-
- The report avoids diagnostic claims and emphasizes assistive screening.

- ### 9.1 Report Schema and Specifications

- To prevent diagnostic claims and ensure consistency:
-
- A **fixed schema** will be defined with specific fields and allowed phrases
- Reports will be locked down with rules to ensure they never make diagnostic claims
- Each report field will have:
- **Screening outcome**: "Hemorrhage detected" or "No hemorrhage detected" (not "diagnosed")
- **Confidence level**: Numeric probability + calibrated confidence band
- **Visual evidence**: Reference to Grad-CAM heatmap image
- **Recommended action**: "Urgent radiologist review recommended" or "Standard review workflow"
- **System disclaimer**: Clear statement that this is a screening tool, not a diagnostic device
- Standardized phrasing ensures clinical safety and legal compliance
-
- ---
-
- ## 10. System Architecture Overview
-
- ```
- 1. CT Brain Image Input
-
- 2. Image Preprocessing Module
-
- 3. CNN-Based Hemorrhage Detection Model
-
- 4. Explainability Module (Grad-CAM)
-
- 5. Confidence Assessment
-
- 6. Structured Clinical Report Generator
-
- 7. Output for Medical Review
  ```

- ---
-
- ## 11. Technology Stack
-
- ### Programming Language
- **Python**
-
- ### Libraries and Frameworks
- **TensorFlow** or **PyTorch** (Deep Learning)
- **OpenCV** (Image Processing)
- **NumPy**, **Pandas** (Data Handling)
- **Matplotlib** (Visualization)
-
- ### Development Platform
- **Kaggle Notebooks**
- Jupyter Notebook
-
- ---
-
- ## 12. Feasibility and Resources
-
- The project is **fully feasible** using free computational resources:
-
- Kaggle provides free GPU access suitable for CNN training
- Transfer learning minimizes training time
- All tools used are open-source
- No specialized hardware is required locally
-
- **Kaggle provides:**
- Free GPU access (time-limited but sufficient)
- Adequate RAM and storage for medical imaging datasets
- Stable notebook environment for training and evaluation
-
- **Constraints:**
- Training time per session is limited
- Efficient model selection and batch sizing are necessary
-
- These constraints align well with transfer learning-based approaches.
-
- ---
-
- ## 13. Ethical Considerations
-
- The system is designed strictly as a **screening and decision-support tool**
- It does not provide diagnosis or treatment recommendations
- Limitations and potential biases are explicitly documented
- Human oversight is required for all clinical decisions
- Dataset usage complies with public research licenses
- Model limitations and potential biases will be documented
-
- ---
-
- ## 14. Assumptions and Risks
-
- ### Assumptions
- Public datasets are representative of real-world CT brain images
- Hemorrhage labels are clinically reliable and accurately annotated
-
- ### Potential Risks
- Class imbalance may bias predictions toward non-hemorrhage cases
- Overfitting due to dataset limitations
- Misinterpretation of AI outputs by non-expert users
- False negatives could delay critical interventions
-
- ### Clinical Risk Evaluation Protocol
-
- To address these risks systematically:
-
- **Target operating points** will be defined (e.g., "maximize sensitivity subject to acceptable specificity")
- **False negative analysis**: Pre-commit to reviewing all false negative cases with detailed inspection:
- Inspect FN scans with Grad-CAM overlays
- Document typical failure patterns (e.g., small bleeds, beam-hardening artifacts, post-operative changes)
- Use findings to refine preprocessing or model architecture
- Each component of the system architecture will be evaluated using the framework:
- **Inputs** → **Outputs** → **Metrics** → **Failure Modes** → **Mitigations**
- This structured approach ensures risks are systematically addressed before deployment
383
-
384
- Mitigation strategies will be implemented during development and documented in evaluation.
385
-
386
- ---
387
-
388
- ## 15. Proposed Experiments and Ablation Studies
389
-
390
- To validate design decisions and optimize performance, the following experiments will be conducted:
391
-
392
- ### 15.1 Preprocessing Ablations
393
-
394
- **Goal**: Justify the preprocessing pipeline choices
395
-
396
- **Experiments**:
397
- - Brain windowing: ON vs OFF
398
- - Different normalization strategies (min-max, z-score, percentile-based)
399
- - Data augmentation: ON vs OFF
400
-
401
- **Evaluation metrics**:
402
- - Sensitivity (primary)
403
- - ROC-AUC
404
- - Expected Calibration Error (ECE)
405
-
406
- **Outcome**: Select preprocessing configuration that maximizes sensitivity while maintaining calibration
-
- ### 15.2 Model Architecture Comparison
-
- **Goal**: Choose the optimal backbone architecture
-
- **Experiments**:
- - ResNet-50 vs EfficientNet-B0
- - Same train/validation split for fair comparison
- - Evaluate with fixed hyperparameters
-
- **Selection criteria**:
- - Sensitivity at fixed specificity (e.g., 95% specificity)
- - Inference time (important for screening/prioritization)
- - Model size and computational requirements
-
- **Outcome**: Select the architecture based on clinical utility and deployment feasibility
-
- ### 15.3 Confidence-Aware Triage Study
-
- **Goal**: Validate the three-band confidence system
-
- **Experiments**:
- - Define thresholds for high/medium/low confidence bands
- - Analyze case distribution across bands
- - Compute error rates (sensitivity, specificity, FN rate) per band
-
- **Metrics**:
- - Percentage of cases in each band
- - False negative rate by confidence level
- - Positive predictive value by confidence level
-
- **Outcome**: Demonstrate that high-confidence predictions are more reliable and support the triage workflow
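The band assignment above can be sketched as a simple thresholding rule over calibrated probabilities. The thresholds here (0.85 and 0.50) are illustrative placeholders, not values fixed by the project; the 15.3 study would choose them from the validation data:

```python
def confidence_band(prob, high=0.85, low=0.50):
    """Map a calibrated probability to a triage band (illustrative thresholds)."""
    if prob >= high:
        return "high"    # flag for urgent review
    if prob >= low:
        return "medium"  # standard review queue
    return "low"         # routine workflow

# Distribution of a small batch of predictions across bands
preds = [0.97, 0.62, 0.12, 0.88, 0.40]
print([confidence_band(p) for p in preds])  # → ['high', 'medium', 'low', 'high', 'low']
```

Per-band error rates (FN rate, PPV) would then be computed by grouping validation cases on this band label.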
-
- ### 15.4 Explainability Evaluation
-
- **Goal**: Validate that Grad-CAM provides clinically useful visualizations
-
- **Experiments**:
- - Generate Grad-CAM overlays for sample cases:
-   - True Positives (correct hemorrhage detection)
-   - False Negatives (missed hemorrhages)
-   - False Positives (false alarms)
- - Qualitative analysis comparing:
-   - What the model highlights
-   - What clinicians would expect to see
- - Document cases where heatmaps are misleading or incorrect
-
- **Outcome**: Ensure the explainability module provides trustworthy visual evidence
-
- ### 15.5 Calibration Study
-
- **Goal**: Improve confidence reliability
-
- **Experiments**:
- - Train the baseline model and measure calibration (ECE, reliability diagram)
- - Apply temperature scaling and isotonic regression
- - Compare raw probabilities vs calibrated confidence
-
- **Outcome**: Deploy calibrated confidence scores that clinicians can trust
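Temperature scaling, one of the two calibrators named above, divides validation logits by a single scalar T chosen to minimize held-out NLL. A minimal pure-Python sketch of the idea (a grid search stands in for the LBFGS fit typically used in practice; the logits and grid are illustrative):

```python
import math

def nll(logits, labels, T):
    """Mean negative log-likelihood of binary labels under temperature-scaled sigmoids."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z / T))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp for numerical safety
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Pick the T that minimizes validation NLL over a coarse grid."""
    grid = grid or [0.25 + 0.05 * i for i in range(60)]  # T in [0.25, 3.2]
    return min(grid, key=lambda T: nll(logits, labels, T))

# Overconfident toy logits with one confident mistake (logit 3.5, label 0):
# dividing by T > 1 softens all probabilities and lowers the held-out NLL
logits = [4.0, 3.5, -4.0, 3.8, -3.6]
labels = [1, 0, 0, 1, 0]
T = fit_temperature(logits, labels)
print(T > 1.0)  # → True
```

Because temperature scaling only rescales logits, it changes confidence values without changing the ranking, so ROC-AUC is unaffected while ECE improves.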
-
- ---
-
- ## 16. Expected Outcomes and Deliverables
-
- ### 16.1 Expected Outcomes
-
- **Model Performance**:
- - A trained CNN model achieving high sensitivity (target: >95%) for intracranial hemorrhage detection
- - Specificity maintained at clinically acceptable levels (target: >85%)
- - Calibrated confidence scores with low Expected Calibration Error (ECE < 0.05)
- - Comprehensive performance evaluation across all metrics (sensitivity, specificity, precision, F1-score, ROC-AUC)
-
- **Explainability**:
- - Reliable Grad-CAM visualizations highlighting hemorrhage regions
- - Validated explainability outputs through sanity checks
- - Documented failure modes and typical misclassification patterns
-
- **Clinical Utility**:
- - Confidence-based triage system validated across three bands (high/medium/low)
- - Demonstrated reduction in false negatives through systematic review
- - Structured reports that support clinical decision-making without making diagnostic claims
-
- **Research Insights**:
- - Evidence-based justification for preprocessing choices through ablation studies
- - Model architecture comparison showing the optimal choice for screening scenarios
- - Understanding of model limitations and failure patterns
-
- ### 16.2 Project Deliverables
-
- **1. Trained Model**
- - Final CNN model weights saved in a standard format (`.h5` or `.pth`)
- - Model configuration file documenting architecture and hyperparameters
- - Training history and learning curves
-
- **2. Preprocessing Pipeline**
- - Complete data preprocessing module for CT scan normalization
- - DICOM handling utilities where applicable
- - Data augmentation scripts
-
- **3. Explainability Module**
- - Grad-CAM implementation integrated with the model
- - Visualization generation scripts
- - Sanity check utilities for validation
-
- **4. Confidence Calibration Module**
- - Calibration implementation (temperature scaling/isotonic regression)
- - Scripts for computing calibrated confidence scores
- - Calibration evaluation metrics
-
- **5. Clinical Report Generator**
- - Structured report generation system with a fixed schema
- - Template-based reporting following clinical safety guidelines
- - Sample reports demonstrating different scenarios
-
- **6. Evaluation Framework**
- - Complete evaluation scripts for all metrics
- - Confusion matrix and ROC curve generation
- - Ablation study implementation and results
-
- **7. Documentation**
- - Project report (this README and extended documentation)
- - Code documentation and inline comments
- - User guide for running the system
- - Dataset description and preprocessing details
-
- **8. Jupyter Notebooks**
- - Data exploration and preprocessing notebook
- - Model training and evaluation notebook
- - Explainability demonstration notebook
- - Report generation examples notebook
-
- **9. Results and Analysis**
- - Performance metrics across all experiments
- - Ablation study results with visualizations
- - Failure case analysis with Grad-CAM overlays
- - Calibration plots and reliability diagrams
-
- **10. Presentation Materials**
- - Project presentation slides
- - Demo video (optional)
- - Key visualizations and results summary
-
- ---
-
- ## 17. Limitations and Future Work
-
- ### Current Limitations
-
- - **Dataset Scope**: Limited to publicly available datasets, which may not fully represent all clinical scenarios
- - **Binary Classification**: Does not classify hemorrhage subtypes (epidural, subdural, subarachnoid, etc.)
- - **Single-Slice Analysis**: May not utilize the full 3D volumetric information from CT scans
- - **No Real-Time Deployment**: The system is designed for research and demonstration, not clinical deployment
- - **Limited Clinical Validation**: Evaluation is based on dataset labels, not verified by multiple radiologists
-
- ### Future Enhancements
-
- **Clinical Extensions**:
- - Multi-class classification for hemorrhage subtypes
- - Integration of volumetric (3D) analysis using 3D CNNs
- - Temporal analysis for follow-up scan comparison
- - Integration with radiology workflow systems (PACS)
-
- **Technical Improvements**:
- - Ensemble models combining multiple architectures
- - Uncertainty quantification using Bayesian deep learning
- - Active learning for continuous model improvement
- - Real-time inference optimization for clinical deployment
-
- **Validation and Deployment**:
- - Prospective clinical validation with radiologist verification
- - Multi-center evaluation for generalizability
- - Regulatory compliance pathway (FDA/CE marking)
- - Production-ready deployment with monitoring
-
- **Enhanced Explainability**:
- - Comparison of multiple explainability methods (Grad-CAM++, SHAP, attention mechanisms)
- - Interactive visualization tools for clinicians
- - Textual explanation generation describing detected features
-
- ---
-
- ## 18. Conclusion
-
- This project aims to develop an AI-assisted screening system for intracranial hemorrhage detection that prioritizes **clinical utility**, **explainability**, and **ethical deployment**. By combining deep learning with transparency mechanisms, confidence calibration, and structured reporting, the system is designed to support, not replace, clinical decision-making.
-
- The systematic approach to risk evaluation, comprehensive ablation studies, and focus on false negative reduction ensures that the system addresses real clinical needs while acknowledging its limitations as a screening tool.
-
- Through this project, we demonstrate that responsible AI in healthcare requires not just prediction accuracy, but also interpretability, calibration, careful validation, and explicit acknowledgment of system boundaries.
-
- ---

+ # Intracranial Hemorrhage Detection
+
+ AI-assisted screening system for intracranial hemorrhage (ICH) from head CT (DICOM) images.
+
+ This project provides a Flask web interface for:
+
+ - uploading single or batch DICOM scans,
+ - running model inference,
+ - viewing Grad-CAM visualizations,
+ - browsing past reports and logs,
+ - reviewing calibration and evaluation summaries.
+
+ ## Project Overview
+
+ Intracranial hemorrhage is a time-critical emergency finding in neuroimaging. This repository focuses on a practical screening workflow with explainability and structured report output.
+
+ The system is built for decision support and triage assistance, not standalone diagnosis.
+
+ ## Model and Artifacts
+
+ Model weights and related inference artifacts are hosted on Hugging Face:
+
+ - [Hugging Face Model Repository](https://huggingface.co/HarshCode/eff_b4_brain)
+
+ ## Detailed Performance Report
+
+ Detailed performance and B4-specific analysis are documented separately in:
+
+ - [B4_Performance_Report.md](B4_Performance_Report.md)
+
+ ## Repository Structure
+
+ - `app.py`: Flask application entry point
+ - `run_interface.py`: adapter layer between the app and the inference implementation
+ - `download_imp/`: inference code and local artifact layout
+ - `templates/`: HTML templates (Jinja2)
+ - `static/`: styles and static assets
+ - `docs/`: GitHub Pages content
+
+ ## Requirements
+
+ - Python 3.10+ (3.12 works)
+ - pip
+ - virtual environment (recommended)
+
+ Install dependencies:
+
  ```bash
+ pip install -r requirements.txt
  ```
+
+ ## Environment Setup
+
+ Create a local environment file from the template:
+
+ ```bash
+ cp .env.example .env
+ ```
+
+ Important variables in `.env`:
+
+ - `ICH_APP_DEBUG`: run Flask in debug mode (`1` or `0`)
+ - `ICH_APP_PORT`: app port (default `7860`)
+ - `ICH_SECRET_KEY`: Flask secret key
+ - `ICH_MAX_UPLOAD_MB`: maximum upload size in MB
+ - `ICH_FOLD_SELECTION`: `ensemble`, `best`, or a fold id (`0` to `4`)
+ - `ICH_LOCAL_MODE`: enables local directory scanning mode
+ - `ICH_LOG_LEVEL`: `DEBUG`, `INFO`, `WARNING`, `ERROR`
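For local development, a filled-in `.env` might look like the following (apart from the documented port default, the values here are illustrative placeholders, not project defaults):

```text
ICH_APP_DEBUG=1
ICH_APP_PORT=7860
ICH_SECRET_KEY=change-me
ICH_MAX_UPLOAD_MB=200
ICH_FOLD_SELECTION=ensemble
ICH_LOCAL_MODE=0
ICH_LOG_LEVEL=INFO
```

Set a strong, private `ICH_SECRET_KEY` and disable `ICH_APP_DEBUG` for anything beyond local testing.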
+
+ ## Run the Application
+
+ ```bash
+ python app.py
+ ```
+
+ Open in a browser:
+
+ ```text
+ http://127.0.0.1:7860
+ ```
+
+ ## Basic Usage
+
+ 1. Go to the upload page.
+ 2. Upload one `.dcm` file, multiple `.dcm` files, or a batch input.
+ 3. Wait for inference and report generation.
+ 4. Review:
+    - screening outcome,
+    - calibrated probability,
+    - confidence band,
+    - triage action,
+    - Grad-CAM overlay.
+ 5. Use the Reports / Logs / Evaluation pages for history and analysis.
+
+ ## Notes
+
+ - Keep heavy model binaries out of GitHub (managed via `.gitignore`).
+ - Generated report outputs are created at runtime.
+ - If required artifacts are missing locally, fetch them from the Hugging Face repository linked above.
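One way to fetch the artifacts, assuming the `huggingface_hub` package and its CLI are installed (the `download_imp/artifacts` target directory is an illustrative choice, not a path required by the app):

```bash
pip install -U huggingface_hub
huggingface-cli download HarshCode/eff_b4_brain --local-dir download_imp/artifacts
```

Point the app's artifact configuration at whichever directory you download into.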
+
+ ## Disclaimer
+
+ This system is an AI-assisted screening and decision-support tool.
+ It does **not** provide a medical diagnosis and must be used with qualified clinical review.