Harshit Ghosh committed
Commit b74bbb6 · 1 Parent(s): b0fbfb5

B4 performance and README update

Files changed (2)
  1. B4_Performance_Report.md +305 -0
  2. README.md +69 -562
B4_Performance_Report.md ADDED
# B4 Model Performance Report

## Project Title
AI-Assisted Intracranial Hemorrhage Screening from Non-Contrast CT using EfficientNet-B4 with 2.5D Context and Multi-Label Learning

## Document Metadata
- Student Project: Major Project Submission
- Model Focus: EfficientNet-B4
- Pipeline Type: 2.5D (9-channel) multi-label screening
- Evaluation Style: 5-fold OOF slice-level metrics with triage analysis
- Report Length Target: 15-20 pages when formatted in standard academic layout

---

## Executive Summary
This report presents the full technical and performance assessment of the EfficientNet-B4 pipeline developed for intracranial hemorrhage (ICH) screening from CT images. The system is designed as a screening and triage aid, not a standalone diagnostic engine. The B4 pathway extends the baseline design by incorporating richer spatial context through 2.5D input, multi-label outputs for hemorrhage subtypes, calibration-aware decisioning, and triage-centric reporting.

The model was evaluated in out-of-fold (OOF) settings with subgroup reporting and triage distribution analysis. The any-class slice-level AUC is 0.95052, with strong subtype performance and high intraventricular discrimination (AUC 0.97196). Per-fold uncertainty is low (mean any-AUC 0.95102 ± 0.00230), indicating stable training behavior under GroupKFold validation.

Although a historical baseline showed a slightly higher any-only AUC in one earlier protocol, the B4 design adds significant practical value through:
- Multi-label subtype support
- Better clinical interpretability
- Triage-band stratification
- Fold-robust uncertainty estimation
- Scalable architecture for future deployment upgrades

This report includes methods, evaluation metrics, interpretation, risks, limitations, and deployment guidance.

---

## 1. Introduction
Intracranial hemorrhage is a time-sensitive radiological emergency in which delayed detection can lead to severe disability or death. In high-throughput emergency environments, AI-supported prioritization can reduce time-to-review for suspicious cases. However, purely accuracy-focused solutions are insufficient for clinical workflows unless they also provide reliability indicators and structured outputs that support radiologist decisions.

The B4 model in this project was developed to improve operational usefulness, not just headline AUC. The key premise is that hemorrhage patterns often span adjacent slices and multiple subtype manifestations; therefore, a context-aware, multi-label approach is better aligned with practical reading conditions than isolated binary single-slice scoring.

---

## 2. Problem Statement
The project addresses the following practical gaps in ICH AI screening:
- Binary-only formulations miss the subtype granularity required for useful case communication.
- Single-slice input ignores local anatomical continuity.
- Single-split validation can overstate confidence in model performance.
- Output without triage semantics has limited workflow impact.

Goal for the B4 track:
- Build a robust, fold-validated, context-aware, multi-label screening model with triage-oriented outputs and explicit safety framing.

---

## 3. Model and Data Pipeline

### 3.1 Backbone
- EfficientNet-B4
- Approximate parameter scale: 19M

### 3.2 Input Design
- 2.5D representation
- 9-channel tensor constructed from neighboring slices and CT windows

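The 9-channel construction can be sketched as follows. This is a minimal illustration assuming three neighboring slices crossed with three CT windows; the exact window (center, width) settings are not stated in this report, so the values below are common ICH choices used purely for demonstration:

```python
import numpy as np

# Assumed CT window settings (center, width) in Hounsfield units.
# The report does not list the exact windows; these are common ICH choices
# (brain, subdural, bone) used here only as an illustration.
WINDOWS = [(40, 80), (80, 200), (600, 2800)]

def apply_window(hu_slice, center, width):
    """Clip a slice of HU values to one CT window and rescale to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    return (np.clip(hu_slice, lo, hi) - lo) / (hi - lo)

def build_25d_input(volume, idx):
    """Stack previous/current/next slices, each under 3 windows -> 9 channels."""
    channels = []
    for offset in (-1, 0, 1):
        j = int(np.clip(idx + offset, 0, len(volume) - 1))  # clamp at ends
        for center, width in WINDOWS:
            channels.append(apply_window(volume[j], center, width))
    return np.stack(channels, axis=0)  # shape (9, H, W)

# Toy volume: 24 slices of 64x64 HU values (real CT slices are 512x512).
volume = np.random.randint(-1000, 2000, size=(24, 64, 64)).astype(np.float32)
x = build_25d_input(volume, idx=0)
```

Clamping at the volume ends means the first and last slices reuse themselves as neighbors, which keeps the channel count fixed at 9 for every slice.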
### 3.3 Output Targets
The model predicts six labels:
- any
- epidural
- intraparenchymal
- intraventricular
- subarachnoid
- subdural

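Because the six labels are not mutually exclusive, the head is a 6-logit sigmoid output trained with binary cross-entropy rather than a softmax classifier. A minimal NumPy sketch of that formulation (the logits and targets below are invented for illustration):

```python
import numpy as np

LABELS = ["any", "epidural", "intraparenchymal",
          "intraventricular", "subarachnoid", "subdural"]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over the 6 independent label outputs."""
    p = sigmoid(logits)
    eps = 1e-7  # guard against log(0)
    return float(-np.mean(targets * np.log(p + eps)
                          + (1.0 - targets) * np.log(1.0 - p + eps)))

# One slice may carry several subtypes at once (e.g. subdural + subarachnoid),
# which a softmax over subtypes could not express.
logits = np.array([2.1, -3.0, -2.5, -2.0, 1.4, 1.8])   # illustrative values
targets = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 1.0])
probs = dict(zip(LABELS, sigmoid(logits)))
loss = multilabel_bce(logits, targets)
```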
### 3.4 Validation Strategy
- 5-fold GroupKFold OOF evaluation
- Per-fold AUC tracking for uncertainty estimation

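The grouped OOF protocol can be sketched with scikit-learn's `GroupKFold`; the toy arrays below stand in for slice features, labels, and patient IDs:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-ins: 100 slices from 20 patients, 5 slices per patient.
# Grouping on patient ID keeps every slice of a patient in one fold,
# which prevents patient-level leakage between train and validation.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 2, size=100)
groups = np.repeat(np.arange(20), 5)

oof = np.full(len(y), np.nan)  # out-of-fold prediction buffer
for fold, (tr, va) in enumerate(GroupKFold(n_splits=5).split(X, y, groups)):
    assert set(groups[tr]).isdisjoint(groups[va])  # no shared patients
    # A real run would train the B4 model on X[tr] here; a constant
    # predictor stands in so the sketch stays self-contained.
    oof[va] = y[tr].mean()
```

Every slice ends up scored exactly once by a model that never saw its patient, which is what makes the OOF metrics in Section 5 honest estimates.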
### 3.5 Calibration and Triage
- Isotonic calibration used in the improved path
- Triage bands include HIGH, MEDIUM, UNCERTAIN, LOW

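A sketch of the calibration-plus-banding step, assuming scikit-learn's isotonic regression; the band cut-offs in `triage_band` are illustrative only, since the report does not publish the actual band edges:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic validation scores: informative but systematically miscalibrated.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=2000)
raw = np.clip(0.3 * y_true + rng.uniform(0.0, 0.7, size=2000), 0.0, 1.0)

# Isotonic regression learns a monotone map from raw score to empirical
# positive rate, matching the "isotonic calibration" named in the report.
iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit_transform(raw, y_true)

def triage_band(p, low=0.05, high=0.70, uncertain=(0.45, 0.55)):
    """Map a calibrated probability to a band. Cut-offs are illustrative."""
    if uncertain[0] <= p <= uncertain[1]:
        return "UNCERTAIN"
    if p >= high:
        return "HIGH"
    if p <= low:
        return "LOW"
    return "MEDIUM"

bands = [triage_band(p) for p in calibrated]
```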
---

## 4. Experimental Configuration
The B4 pipeline was executed under the following report-aligned configuration:
- Model family: EfficientNet-B4
- Input formulation: 2.5D, 9 channels
- Task formulation: Multi-label classification (6 classes)
- Validation protocol: 5-fold OOF
- Reporting scope: Slice-level metrics, per-fold uncertainty, triage distributions, representative case summaries

---

## 5. Quantitative Results

### 5.1 Slice-Level OOF Metrics

| Subtype | AUC | Sens | Spec | F1 | Threshold | Sens@Spec90 | Spec@Sens95 |
|---|---:|---:|---:|---:|---:|---:|---:|
| any | 0.95052 | 0.8556 | 0.9038 | 0.7041 | 0.1622 | 0.8556 | 0.7447 |
| epidural | 0.90027 | 0.8471 | 0.8234 | 0.0400 | 0.0050 | 0.7554 | 0.5025 |
| intraparenchymal | 0.95888 | 0.8868 | 0.8996 | 0.4548 | 0.0556 | 0.8862 | 0.7695 |
| intraventricular | 0.97196 | 0.9371 | 0.9055 | 0.4174 | 0.0436 | 0.9371 | 0.8833 |
| subarachnoid | 0.92206 | 0.8744 | 0.8146 | 0.3112 | 0.0548 | 0.7561 | 0.6988 |
| subdural | 0.93016 | 0.8507 | 0.8584 | 0.4285 | 0.0685 | 0.7860 | 0.6867 |

Observations:
- Best subtype discrimination is intraventricular (AUC 0.97196).
- any-class AUC is strong for screening use.
- Epidural shows expected challenges due to class rarity and imbalance effects.

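The operating-point columns (Sens@Spec90, Spec@Sens95) are standard ROC read-offs. A sketch of how they can be computed with scikit-learn, on synthetic scores standing in for the real OOF predictions:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic slice labels and scores for demonstration only.
rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=5000)
scores = np.clip(0.35 * y + rng.normal(0.3, 0.2, size=5000), 0.0, 1.0)

fpr, tpr, _ = roc_curve(y, scores)

def sens_at_spec(fpr, tpr, spec=0.90):
    """Best sensitivity among operating points with specificity >= spec."""
    ok = (1.0 - fpr) >= spec
    return float(tpr[ok].max())

def spec_at_sens(fpr, tpr, sens=0.95):
    """Best specificity among operating points with sensitivity >= sens."""
    ok = tpr >= sens
    return float((1.0 - fpr)[ok].max())

sens_at_spec90 = sens_at_spec(fpr, tpr)
spec_at_sens95 = spec_at_sens(fpr, tpr)
```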
### 5.2 Per-Fold Uncertainty (Any Label)
- Fold 0: 0.95199
- Fold 1: 0.95106
- Fold 2: 0.95228
- Fold 3: 0.95315
- Fold 4: 0.94662

Aggregate:
- Mean ± Std: 0.95102 ± 0.00230
- Approximate 95% CI: [0.94901, 0.95303]

Interpretation:
- Low fold variance indicates robust generalization behavior and reproducibility within the chosen split methodology.

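The aggregate numbers are reproducible (to within rounding in the last digit) from the five fold AUCs above, assuming a population standard deviation and a normal-approximation confidence interval:

```python
import numpy as np

fold_aucs = np.array([0.95199, 0.95106, 0.95228, 0.95315, 0.94662])

mean = fold_aucs.mean()
std = fold_aucs.std()  # population std (ddof=0) reproduces the 0.00230
half_width = 1.96 * std / np.sqrt(len(fold_aucs))  # normal approximation
ci = (mean - half_width, mean + half_width)
```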
---

## 6. Confusion Matrix Evidence (OOF)
At a threshold of approximately 0.162 for any-class decisioning, the OOF confusion matrix from the provided figure is:
- True Negative: 58,284
- False Positive: 6,203
- False Negative: 1,558
- True Positive: 9,235

Derived notes:
- The high true-negative volume indicates broad filtering ability.
- False negatives remain non-zero, so mandatory human oversight is required.
- False positives are operationally less harmful than false negatives in a screening context, but they increase workload.

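The headline any-class sensitivity, specificity, and F1 reported in Section 5.1 follow directly from these four counts:

```python
# OOF confusion-matrix entries for the any label at threshold ~0.162.
tn, fp, fn, tp = 58284, 6203, 1558, 9235

sensitivity = tp / (tp + fn)   # recall on hemorrhage-positive slices
specificity = tn / (tn + fp)   # recall on hemorrhage-negative slices
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
```

Carrying out the division recovers sensitivity 0.8556, specificity 0.9038, and F1 0.7041, matching the table in Section 5.1.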
---

## 7. Triage Distribution Analysis
Reported 4-band triage distribution:
- HIGH: n=6,707, positives=6,023, prevalence=0.898
- UNCERTAIN: n=816, positives=408, prevalence=0.500
- MEDIUM: n=3,941, positives=1,887, prevalence=0.479
- LOW: n=63,816, positives=2,475, prevalence=0.039

Interpretation:
- The HIGH band demonstrates strong concentration of positive findings.
- The UNCERTAIN band isolates clinically ambiguous instances and is suitable for prioritized manual attention.
- The LOW band maintains low prevalence, supporting triage de-prioritization while retaining the safety caveat.

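The reported prevalences are simply positives divided by band size; the counts above reproduce them exactly:

```python
# (n, positives) per triage band, as reported in Section 7.
bands = {
    "HIGH": (6707, 6023),
    "UNCERTAIN": (816, 408),
    "MEDIUM": (3941, 1887),
    "LOW": (63816, 2475),
}
prevalence = {band: pos / n for band, (n, pos) in bands.items()}
```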
---

## 8. Representative Clinical Case Summaries
The provided high-triage case summaries demonstrate that the model often identifies plausible multi-subtype combinations in positive studies. Example patterns include combinations of subdural with subarachnoid or intraparenchymal signals at high confidence.

Operational takeaway:
- Multi-label output is useful for communication and review focus.
- The model remains a screening aid and should never be used as the sole diagnostic authority.

---

## 9. B4 Versus Baseline Context
From supplied comparison text:
- Architecture: B0 baseline vs B4 improved
- Input: 2D 3ch vs 2.5D 9ch
- Formulation: Binary vs Multi-label
- Validation: Single split vs 5-fold GroupKFold

Reported delta (any AUC) in that specific summary:
- Baseline any AUC: 0.95800
- Improved B4 any AUC: 0.95052
- Delta: -0.00748

Important interpretation note:
This delta should not be read as a simple regression claim without protocol harmonization. The improved model is evaluated under a more demanding setup (multi-label plus fold-based uncertainty estimation), and it contributes features not captured by any-only AUC:
- subtype intelligence
- triage-aware utility
- improved methodological rigor in validation

---

## 10. Figure-Based Analysis (From Supplied Images)

### Figure Group A: ROC Curves Per Subtype (OOF)
Insights:
- Strong curve separation from the random baseline for all subtypes.
- Intraventricular and intraparenchymal are strongest.
- Epidural remains the most difficult class.

### Figure Group B: Patient-Level ROC Aggregation Comparison
Displayed AUCs are close:
- max: 0.9525
- mean: 0.9471
- noisy_or: 0.9515
- topk_mean: 0.9514

Interpretation:
- Aggregation strategy selection should be aligned with the clinical objective (sensitivity floor versus workload constraints).

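The figure names the four aggregation rules but does not define them, so the sketch below uses their conventional definitions (noisy-OR treats slice predictions as independent evidence; top-k mean averages the k most suspicious slices). The slice probabilities are invented for illustration:

```python
import numpy as np

def agg_max(p):
    return float(np.max(p))

def agg_mean(p):
    return float(np.mean(p))

def agg_noisy_or(p):
    # Probability that at least one slice is positive, treating slice
    # predictions as independent evidence.
    return float(1.0 - np.prod(1.0 - p))

def agg_topk_mean(p, k=3):
    # Mean of the k most suspicious slices.
    return float(np.mean(np.sort(p)[-k:]))

# Invented slice-level probabilities for one study.
slice_probs = np.array([0.02, 0.05, 0.10, 0.65, 0.80, 0.30, 0.04])
study_scores = {
    "max": agg_max(slice_probs),
    "mean": agg_mean(slice_probs),
    "noisy_or": agg_noisy_or(slice_probs),
    "topk_mean": agg_topk_mean(slice_probs),
}
```

Noisy-OR and max emphasize any strong slice (sensitivity floor), while mean and top-k mean damp isolated spikes (workload control), which is the trade-off the interpretation bullet refers to.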
### Figure Group C: OOF Confusion Matrix
Confirms the operational trade-off at the selected threshold:
- substantial true-positive recovery
- acceptable but non-trivial false-positive volume

---

## 11. Statistical Reliability and Uncertainty
B4 fold spread is narrow, which improves confidence in performance stability:
- Std ≈ 0.00230 for any-AUC across folds
- CI indicates strong central tendency around 0.951

For final submission, this fold uncertainty evidence is stronger than single-point metrics because it quantifies repeatability.

---

## 12. Clinical Utility Assessment
Strengths of the B4 track for screening:
- High any-class discriminatory power
- Multi-label subtype context for communication
- Triage stratification to support operational prioritization
- Suitable for front-end assistance in high-volume settings

Risk points:
- False negatives cannot be eliminated
- Class imbalance (especially epidural) can affect threshold behavior
- External site drift remains possible

Recommendation:
- Use as an assistive pre-read or queue-prioritization tool under radiologist supervision.

---

## 13. Error Modes and Risk Discussion
Likely error contributors:
- subtle small bleeds near the skull base
- beam-hardening and motion artifacts
- mixed-density presentations
- overlap between non-hemorrhagic hyperdensities and hemorrhage-like patterns

Mitigation path:
- targeted hard-negative mining
- subtype-aware threshold optimization
- study-level fusion with metadata
- human review escalation rules for the uncertain band

---

## 14. Deployment Position
Proposed deployment stance for B4:
- Assistive triage only
- Do not auto-clear without a human review policy
- Preserve model versioning and threshold logs
- Track drift through periodic calibration audits

---

## 15. Limitations
- External validation across institutions is pending.
- Domain shift across scanner vendors is not fully characterized.
- This report relies on provided OOF and figure summaries; raw per-case audit trails should be archived for the final viva.

---

## 16. Ethical and Regulatory Considerations
- Screening tool only, not a diagnostic device.
- Human-in-the-loop review is mandatory.
- Explanations are supportive visual cues, not confirmed lesion boundaries.
- Clinical decisions must remain with licensed professionals.

---

## 17. Future Work
1. Add explicit subtype prevalence-aware loss weighting refinements.
2. Integrate stronger patient-level calibration and threshold governance.
3. Compare B4 against B3/B5 under unified evaluation scripts.
4. Add prospective-like temporal split testing.
5. Add cost-sensitive thresholding to tune the false-positive burden.

---

## 18. Conclusion
The B4 model demonstrates a mature, clinically aligned screening pipeline with robust OOF performance and strong subtype-level discrimination. While any-only AUC comparisons with the baseline require protocol-matched interpretation, the B4 system provides broader practical value through multi-label outputs, uncertainty-aware fold validation, and triage-oriented reporting.

This makes B4 a defensible advanced model track for project submission, provided the final defense clearly states that the system is an assistive triage tool and not an autonomous diagnosis engine.

---

## 19. Appendix A: Key Numbers (Quick Reference)
- Any AUC: 0.95052
- Any Sensitivity: 0.8556
- Any Specificity: 0.9038
- Fold Mean AUC: 0.95102
- Fold Std: 0.00230
- OOF confusion (threshold ≈ 0.162): TN 58,284, FP 6,203, FN 1,558, TP 9,235

---

## 20. Appendix B: Figures to Embed in Final PDF
Insert the provided B4 visuals in this order in the final report export:
1. ROC Curves Per Subtype (OOF)
2. Patient-Level ROC Aggregation Comparison
3. OOF Confusion Matrix (threshold ≈ 0.162)

Suggested captions:
- Figure B4-1: Subtype ROC performance under OOF validation.
- Figure B4-2: Patient-level aggregation strategy comparison.
- Figure B4-3: Any-class confusion matrix at the selected screening threshold.
README.md CHANGED
@@ -1,597 +1,104 @@
- # Major Project Documentation

- ## Project Structure and Setup

- This repository is organized into clear functional sections:

- `app.py`: Flask web application for upload, inference, browsing reports, and logs
- `run_interface.py`: Compatibility adapter between the web app and model inference code
- `download_imp/`: Model artifacts and core inference implementation
- `templates/`: Jinja2 HTML templates for the web UI
- `static/`: CSS assets
- `logs/` and `uploads/`: Runtime folders created automatically

- ### Quick Start

- 1. Create and activate a Python virtual environment.
- 1. Install dependencies:

- ```bash
- pip install -r requirements.txt
- ```

- 1. Create environment file:

- ```bash
- cp .env.example .env
- ```

- 1. Run the web app:

- ```bash
- python app.py
- ```

- 1. Open:

- ```text
- http://127.0.0.1:7860
- ```

- ### Notes

- Model and calibration files are expected under `download_imp/`.
- Generated reports are written to `download_imp/outputs/reports/`.
- `.gitignore` excludes runtime/generated files to keep version control clean.

- ### Fold Selection

- The web app supports configurable fold selection via `.env` / environment variable:

  ```bash
- ICH_FOLD_SELECTION=ensemble
  ```

- Supported values:
-
- `ensemble` (default): loads all available folds and averages predictions
- `best`: uses the best single fold from the performance report summary
- `0` to `4`: force a specific fold
-
- Based on [B4_Performance_Report.md](B4_Performance_Report.md), per-fold any-AUC indicates fold `3` as the strongest single fold in that table.
-
- ### GitHub Artifact Policy
-
- Model checkpoints and heavy binary artifacts are intentionally ignored for GitHub:
-
- `download_imp/*.pth`
- `download_imp/*.pkl`
- `download_imp/outputs/`
-
- This keeps the repository lightweight and reproducible while allowing local/model-private inference.
-
- ### Publish Model To Hugging Face
-
- 1. Create a token with `Write` access: `https://huggingface.co/settings/tokens`
- 1. Set credentials in `.env` (or export in shell):
-
- ```bash
- HF_TOKEN=your_hf_token
- HF_REPO_ID=your-username/ich-b4-model
- ```
-
- 1. Upload model artifacts:
-
- ```bash
- python scripts/upload_to_huggingface.py --repo-id "$HF_REPO_ID" --private
- ```
-
- Published model repository:
-
- [Hugging Face Model Repo](https://huggingface.co/HarshCode/eff_b4_brain)
-
- ### Host On GitHub Pages
-
- 1. Ensure your default branch is `main`.
- 1. Push this repository with `.github/workflows/pages.yml` and `docs/`.
- 1. In GitHub repository settings:
- 1. Open `Settings -> Pages`
- 1. Set source to `GitHub Actions`
-
- The site will be published automatically after push.
-
- ## AI-Assisted CT-Based Intracranial Hemorrhage Detection with Explainability and Clinical Reporting
-
- ---
-
- ## 1. Introduction
-
- **Intracranial hemorrhage (ICH)** is a life-threatening neurological condition caused by bleeding within the skull. It is one of the most critical forms of stroke and requires immediate medical attention. Delays in detection and intervention significantly increase the risk of mortality and long-term neurological damage.
-
- **Computed Tomography (CT)** imaging is the primary diagnostic tool used in emergency settings for detecting intracranial hemorrhage due to its speed, availability, and high sensitivity to bleeding. However, accurate interpretation of CT scans requires experienced radiologists and must often be performed under time pressure, especially in emergency departments with high patient volumes.
-
- Artificial intelligence has shown promise in assisting medical image interpretation. However, many AI-based solutions function as **black-box models** and focus solely on prediction accuracy, limiting their trustworthiness and clinical adoption. There is a need for an AI-assisted system that supports early screening while providing transparent and interpretable outputs to aid clinical decision-making.
-
- ---
-
- ## 2. Problem Statement
-
- Intracranial hemorrhage detection from CT brain scans is a critical yet time-sensitive task in emergency medical care. Although CT imaging is effective for identifying hemorrhage, timely and accurate interpretation depends heavily on the availability of skilled radiologists. In high-pressure or resource-constrained environments, delays or misinterpretations can adversely affect patient outcomes.
-
- Existing AI-based approaches for hemorrhage detection often emphasize binary predictions without sufficient explainability or clinical context. This lack of transparency makes it difficult for healthcare professionals to rely on AI outputs during screening and prioritization.
-
- **The problem addressed in this project** is the absence of an explainable, AI-assisted screening system that can detect intracranial hemorrhage from CT scans while providing interpretable visual evidence and structured clinical explanations to support medical professionals.
-
- > **Note:** The system is intended strictly as a screening and assistive tool, not as a diagnostic replacement for certified medical practitioners.
-
- ---
-
- ## 3. Objectives
-
- ### Primary Objectives
-
- To develop an AI-based system capable of detecting intracranial hemorrhage from CT brain images
- To classify CT scans into clearly defined categories: **hemorrhage present** or **absent**
- To assist emergency screening by prioritizing high-risk cases
-
- ### Secondary Objectives
-
- To integrate visual explainability techniques that highlight regions influencing the model's predictions
- To generate structured, human-readable clinical reports summarizing findings
- To evaluate model reliability with emphasis on **false-negative reduction**
- To ensure ethical deployment as a decision-support system rather than a diagnostic authority
-
- ---
-
- ## 4. Scope of the Project
-
- ### Included
-
- ✅ CT brain image preprocessing and normalization
- ✅ Binary classification of intracranial hemorrhage presence
- ✅ Explainability using activation-based heatmaps
- ✅ Confidence-aware screening and structured reporting
-
- ### Excluded
-
- ❌ Stroke subtype classification beyond hemorrhage detection
- ❌ Treatment recommendation or diagnosis
- ❌ Real-time clinical deployment
- ❌ Integration with hospital information systems
-
- ---
-
- ## 5. Dataset Description
-
- The project utilizes publicly available CT brain imaging datasets from Kaggle, such as:
-
- **Primary datasets:**
- RSNA Intracranial Hemorrhage Detection Dataset
- CT Brain Hemorrhage Dataset
-
- These datasets contain labeled CT scan images indicating the presence or absence of intracranial hemorrhage, with some datasets also providing hemorrhage subtype annotations.
-
- Using public datasets ensures reproducibility, ethical compliance, and feasibility within academic constraints.
-
- ---
-
- ## 6. Methodology
-
- ### 6.1 Data Preprocessing
-
- Preprocessing is essential to improve image quality and model performance:
-
- Conversion of DICOM images to standardized formats where required
- Image resizing to fixed resolution
- **CT windowing (brain window)** to enhance hemorrhage visibility
- Intensity normalization
- Noise reduction and artifact handling
- Data augmentation to improve generalization and mitigate class imbalance
-
- ### 6.2 Model Development
-
- A **convolutional neural network (CNN)** is implemented using transfer learning.
-
- Pretrained architectures such as **ResNet** or **EfficientNet** are adapted for CT image analysis
- The final classification layer is modified for **binary output**:
- **Class 0**: No intracranial hemorrhage
- **Class 1**: Intracranial hemorrhage present
- The model is trained using supervised learning with appropriate loss functions
-
- ### 6.3 Evaluation Metrics
-
- Model performance is evaluated using clinically relevant metrics:
-
- **Sensitivity (Recall)** – prioritized to reduce missed hemorrhage cases
- Specificity
- Precision
- Confusion matrix analysis
- Receiver Operating Characteristic (ROC) curve
-
- > **Note:** Accuracy alone is not treated as a sufficient indicator of clinical usefulness.
-
- ---
-
- ## 7. Explainability Module
-
- To ensure transparency and trustworthiness:
-
- **Gradient-weighted Class Activation Mapping (Grad-CAM)** will be applied
- Heatmaps are generated to visualize regions contributing to predictions
- Highlighted areas are analyzed in relation to known hemorrhage patterns in CT images
-
- This module allows clinicians to visually verify AI decisions rather than relying solely on binary outputs.
-
- ### 7.1 Explainability Quality Assurance
-
- To ensure the reliability of explainability outputs:
-
- **Sanity checks** will be implemented (e.g., occlusion/perturbation tests) to verify Grad-CAM is not highlighting irrelevant borders, text markers, or artifacts
- Failure cases where heatmaps are misleading will be documented
- For a sample of True Positives, False Negatives, and False Positives, Grad-CAM overlays will include brief qualitative notes comparing:
- What the model highlights
- What a clinician would expect to see
- This ensures visual evidence aligns with clinical reasoning
-
- ---
-
- ## 8. Confidence-Aware Screening
-
- Instead of a simple binary output, the system incorporates prediction confidence:
-
- **High-confidence hemorrhage detection** → urgent attention
- **Low-confidence predictions** → manual review recommendation
-
- This approach reflects real-world screening workflows and reduces over-reliance on automated decisions.
-
- ### 8.1 Confidence Calibration
-
- To ensure clinicians can trust the confidence scores:
-
- **Calibration techniques** will be applied (e.g., temperature scaling or isotonic regression)
- Both **raw probability** and **calibrated confidence** will be reported
- Expected Calibration Error (ECE) will be evaluated
- Three confidence bands will be defined:
- **High confidence**: urgent attention required
- **Medium confidence**: standard review
- **Low confidence**: manual review recommended
- Error rates will be analyzed across each confidence band to support triage decisions
-
- ---
-
- ## 9. Clinical Report Generation
-
- A structured report generation module converts model outputs into human-readable explanations. Each report includes:
-
- Screening outcome summary
- Prediction confidence
- Visual explainability reference
- Clinical interpretation phrased as decision support
-
- The report avoids diagnostic claims and emphasizes assistive screening.

- ### 9.1 Report Schema and Specifications

- To prevent diagnostic claims and ensure consistency:
-
- A **fixed schema** will be defined with specific fields and allowed phrases
- Reports will be locked down with rules to ensure they never make diagnostic claims
- Each report field will have:
- **Screening outcome**: "Hemorrhage detected" or "No hemorrhage detected" (not "diagnosed")
- **Confidence level**: Numeric probability + calibrated confidence band
- **Visual evidence**: Reference to Grad-CAM heatmap image
- **Recommended action**: "Urgent radiologist review recommended" or "Standard review workflow"
- **System disclaimer**: Clear statement that this is a screening tool, not a diagnostic device
- Standardized phrasing ensures clinical safety and legal compliance
-
- ---
-
- ## 10. System Architecture Overview
-
- ```
- 1. CT Brain Image Input
-
- 2. Image Preprocessing Module
-
- 3. CNN-Based Hemorrhage Detection Model
-
- 4. Explainability Module (Grad-CAM)
-
- 5. Confidence Assessment
-
- 6. Structured Clinical Report Generator
-
- 7. Output for Medical Review
  ```

- ---
-
- ## 11. Technology Stack
-
- ### Programming Language
- **Python**
-
- ### Libraries and Frameworks
- **TensorFlow** or **PyTorch** (Deep Learning)
- **OpenCV** (Image Processing)
- **NumPy**, **Pandas** (Data Handling)
- **Matplotlib** (Visualization)
-
- ### Development Platform
- **Kaggle Notebooks**
- Jupyter Notebook
-
- ---
-
- ## 12. Feasibility and Resources
-
- The project is **fully feasible** using free computational resources:
-
- Kaggle provides free GPU access suitable for CNN training
- Transfer learning minimizes training time
- All tools used are open-source
- No specialized hardware is required locally
-
- **Kaggle provides:**
- Free GPU access (time-limited but sufficient)
- Adequate RAM and storage for medical imaging datasets
- Stable notebook environment for training and evaluation
-
- **Constraints:**
- Training time per session is limited
- Efficient model selection and batch sizing are necessary
-
- These constraints align well with transfer learning-based approaches.
-
- ---
-
- ## 13. Ethical Considerations
-
- The system is designed strictly as a **screening and decision-support tool**
- It does not provide diagnosis or treatment recommendations
- Limitations and potential biases are explicitly documented
- Human oversight is required for all clinical decisions
- Dataset usage complies with public research licenses
- Model limitations and potential biases will be documented
-
- ---
-
- ## 14. Assumptions and Risks
-
- ### Assumptions
- Public datasets are representative of real-world CT brain images
- Hemorrhage labels are clinically reliable and accurately annotated
-
- ### Potential Risks
- Class imbalance may bias predictions toward non-hemorrhage cases
- Overfitting due to dataset limitations
- Misinterpretation of AI outputs by non-expert users
- False negatives could delay critical interventions
-
- ### Clinical Risk Evaluation Protocol
-
- To address these risks systematically:
-
- **Target operating points** will be defined (e.g., "maximize sensitivity subject to acceptable specificity")
- **False negative analysis**: Pre-commit to reviewing all false negative cases with detailed inspection:
- Inspect FN scans with Grad-CAM overlays
- Document typical failure patterns (e.g., small bleeds, beam-hardening artifacts, post-operative changes)
- Use findings to refine preprocessing or model architecture
- Each component of the system architecture will be evaluated using the framework:
- **Inputs** → **Outputs** → **Metrics** → **Failure Modes** → **Mitigations**
- This structured approach ensures risks are systematically addressed before deployment
383
-
384
- Mitigation strategies will be implemented during development and documented in evaluation.
385
-
386
- ---
387
-
388
- ## 15. Proposed Experiments and Ablation Studies
389
-
390
- To validate design decisions and optimize performance, the following experiments will be conducted:
391
-
392
- ### 15.1 Preprocessing Ablations
393
-
394
- **Goal**: Justify the preprocessing pipeline choices
395
-
396
- **Experiments**:
397
- - Brain windowing: ON vs OFF
398
- - Different normalization strategies (min-max, z-score, percentile-based)
399
- - Data augmentation: ON vs OFF
400
-
401
- **Evaluation metrics**:
402
- - Sensitivity (primary)
403
- - ROC-AUC
404
- - Expected Calibration Error (ECE)
405
-
406
- **Outcome**: Select preprocessing configuration that maximizes sensitivity while maintaining calibration
-
- ### 15.2 Model Architecture Comparison
-
- **Goal**: Choose the optimal backbone architecture
-
- **Experiments**:
- - ResNet-50 vs EfficientNet-B0
- - Same train/validation split for fair comparison
- - Evaluate with fixed hyperparameters
-
- **Selection criteria**:
- - Sensitivity at fixed specificity (e.g., 95% specificity)
- - Inference time (important for screening/prioritization)
- - Model size and computational requirements
-
- **Outcome**: Select the architecture based on clinical utility and deployment feasibility
-
- ### 15.3 Confidence-Aware Triage Study
-
- **Goal**: Validate the three-band confidence system
-
- **Experiments**:
- - Define thresholds for high/medium/low confidence bands
- - Analyze case distribution across bands
- - Compute error rates (sensitivity, specificity, FN rate) per band
-
- **Metrics**:
- - Percentage of cases in each band
- - False negative rate by confidence level
- - Positive predictive value by confidence level
-
- **Outcome**: Demonstrate that high-confidence predictions are more reliable and support the triage workflow
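The band assignment above can be sketched as a simple thresholding rule over calibrated probabilities. The thresholds here (0.85 and 0.50) are illustrative placeholders, not values fixed by the project; the 15.3 study would choose them from the validation data:

```python
def confidence_band(prob, high=0.85, low=0.50):
    """Map a calibrated probability to a triage band (illustrative thresholds)."""
    if prob >= high:
        return "high"    # flag for urgent review
    if prob >= low:
        return "medium"  # standard review queue
    return "low"         # routine workflow

# Distribution of a small batch of predictions across bands
preds = [0.97, 0.62, 0.12, 0.88, 0.40]
print([confidence_band(p) for p in preds])  # → ['high', 'medium', 'low', 'high', 'low']
```

Per-band error rates (FN rate, PPV) would then be computed by grouping validation cases on this band label.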
-
- ### 15.4 Explainability Evaluation
-
- **Goal**: Validate that Grad-CAM provides clinically useful visualizations
-
- **Experiments**:
- - Generate Grad-CAM overlays for sample cases:
-   - True Positives (correct hemorrhage detection)
-   - False Negatives (missed hemorrhages)
-   - False Positives (false alarms)
- - Qualitative analysis comparing:
-   - What the model highlights
-   - What clinicians would expect to see
- - Document cases where heatmaps are misleading or incorrect
-
- **Outcome**: Ensure the explainability module provides trustworthy visual evidence
-
- ### 15.5 Calibration Study
-
- **Goal**: Improve confidence reliability
-
- **Experiments**:
- - Train the baseline model and measure calibration (ECE, reliability diagram)
- - Apply temperature scaling and isotonic regression
- - Compare raw probabilities vs calibrated confidence
-
- **Outcome**: Deploy calibrated confidence scores that clinicians can trust
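Temperature scaling, one of the two calibrators named above, divides validation logits by a single scalar T chosen to minimize held-out NLL. A minimal pure-Python sketch of the idea (a grid search stands in for the LBFGS fit typically used in practice; the logits and grid are illustrative):

```python
import math

def nll(logits, labels, T):
    """Mean negative log-likelihood of binary labels under temperature-scaled sigmoids."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z / T))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp for numerical safety
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Pick the T that minimizes validation NLL over a coarse grid."""
    grid = grid or [0.25 + 0.05 * i for i in range(60)]  # T in [0.25, 3.2]
    return min(grid, key=lambda T: nll(logits, labels, T))

# Overconfident toy logits with one confident mistake (logit 3.5, label 0):
# dividing by T > 1 softens all probabilities and lowers the held-out NLL
logits = [4.0, 3.5, -4.0, 3.8, -3.6]
labels = [1, 0, 0, 1, 0]
T = fit_temperature(logits, labels)
print(T > 1.0)  # → True
```

Because temperature scaling only rescales logits, it changes confidence values without changing the ranking, so ROC-AUC is unaffected while ECE improves.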
-
- ---
-
- ## 16. Expected Outcomes and Deliverables
-
- ### 16.1 Expected Outcomes
-
- **Model Performance**:
- - A trained CNN model achieving high sensitivity (target: >95%) for intracranial hemorrhage detection
- - Specificity maintained at clinically acceptable levels (target: >85%)
- - Calibrated confidence scores with low Expected Calibration Error (ECE < 0.05)
- - Comprehensive performance evaluation across all metrics (sensitivity, specificity, precision, F1-score, ROC-AUC)
-
- **Explainability**:
- - Reliable Grad-CAM visualizations highlighting hemorrhage regions
- - Validated explainability outputs through sanity checks
- - Documented failure modes and typical misclassification patterns
-
- **Clinical Utility**:
- - Confidence-based triage system validated across three bands (high/medium/low)
- - Demonstrated reduction in false negatives through systematic review
- - Structured reports that support clinical decision-making without making diagnostic claims
-
- **Research Insights**:
- - Evidence-based justification for preprocessing choices through ablation studies
- - Model architecture comparison showing the optimal choice for screening scenarios
- - Understanding of model limitations and failure patterns
-
- ### 16.2 Project Deliverables
-
- **1. Trained Model**
- - Final CNN model weights saved in a standard format (`.h5` or `.pth`)
- - Model configuration file documenting architecture and hyperparameters
- - Training history and learning curves
-
- **2. Preprocessing Pipeline**
- - Complete data preprocessing module for CT scan normalization
- - DICOM handling utilities where applicable
- - Data augmentation scripts
-
- **3. Explainability Module**
- - Grad-CAM implementation integrated with the model
- - Visualization generation scripts
- - Sanity check utilities for validation
-
- **4. Confidence Calibration Module**
- - Calibration implementation (temperature scaling/isotonic regression)
- - Scripts for computing calibrated confidence scores
- - Calibration evaluation metrics
-
- **5. Clinical Report Generator**
- - Structured report generation system with a fixed schema
- - Template-based reporting following clinical safety guidelines
- - Sample reports demonstrating different scenarios
-
- **6. Evaluation Framework**
- - Complete evaluation scripts for all metrics
- - Confusion matrix and ROC curve generation
- - Ablation study implementation and results
-
- **7. Documentation**
- - Project report (this README and extended documentation)
- - Code documentation and inline comments
- - User guide for running the system
- - Dataset description and preprocessing details
-
- **8. Jupyter Notebooks**
- - Data exploration and preprocessing notebook
- - Model training and evaluation notebook
- - Explainability demonstration notebook
- - Report generation examples notebook
-
- **9. Results and Analysis**
- - Performance metrics across all experiments
- - Ablation study results with visualizations
- - Failure case analysis with Grad-CAM overlays
- - Calibration plots and reliability diagrams
-
- **10. Presentation Materials**
- - Project presentation slides
- - Demo video (optional)
- - Key visualizations and results summary
-
- ---
-
- ## 17. Limitations and Future Work
-
- ### Current Limitations
-
- - **Dataset Scope**: Limited to publicly available datasets, which may not fully represent all clinical scenarios
- - **Binary Classification**: Does not classify hemorrhage subtypes (epidural, subdural, subarachnoid, etc.)
- - **Single-Slice Analysis**: May not utilize the full 3D volumetric information from CT scans
- - **No Real-Time Deployment**: The system is designed for research and demonstration, not clinical deployment
- - **Limited Clinical Validation**: Evaluation is based on dataset labels, not verified by multiple radiologists
-
- ### Future Enhancements
-
- **Clinical Extensions**:
- - Multi-class classification for hemorrhage subtypes
- - Integration of volumetric (3D) analysis using 3D CNNs
- - Temporal analysis for follow-up scan comparison
- - Integration with radiology workflow systems (PACS)
-
- **Technical Improvements**:
- - Ensemble models combining multiple architectures
- - Uncertainty quantification using Bayesian deep learning
- - Active learning for continuous model improvement
- - Real-time inference optimization for clinical deployment
-
- **Validation and Deployment**:
- - Prospective clinical validation with radiologist verification
- - Multi-center evaluation for generalizability
- - Regulatory compliance pathway (FDA/CE marking)
- - Production-ready deployment with monitoring
-
- **Enhanced Explainability**:
- - Comparison of multiple explainability methods (Grad-CAM++, SHAP, attention mechanisms)
- - Interactive visualization tools for clinicians
- - Textual explanation generation describing detected features
-
- ---
-
- ## 18. Conclusion
-
- This project aims to develop an AI-assisted screening system for intracranial hemorrhage detection that prioritizes **clinical utility**, **explainability**, and **ethical deployment**. By combining deep learning with transparency mechanisms, confidence calibration, and structured reporting, the system is designed to support, not replace, clinical decision-making.
-
- The systematic approach to risk evaluation, comprehensive ablation studies, and focus on false negative reduction ensures that the system addresses real clinical needs while acknowledging its limitations as a screening tool.
-
- Through this project, we demonstrate that responsible AI in healthcare requires not just prediction accuracy, but also interpretability, calibration, careful validation, and explicit acknowledgment of system boundaries.
-
- ---

+ # Intracranial Hemorrhage Detection
+
+ AI-assisted screening system for intracranial hemorrhage (ICH) from head CT (DICOM) images.
+
+ This project provides a Flask web interface for:
+
+ - uploading single or batch DICOM scans,
+ - running model inference,
+ - viewing Grad-CAM visualizations,
+ - browsing past reports and logs,
+ - reviewing calibration and evaluation summaries.
+
+ ## Project Overview
+
+ Intracranial hemorrhage is a time-critical emergency finding in neuroimaging. This repository focuses on a practical screening workflow with explainability and structured report output.
+
+ The system is built for decision support and triage assistance, not standalone diagnosis.
+
+ ## Model and Artifacts
+
+ Model weights and related inference artifacts are hosted on Hugging Face:
+
+ - [Hugging Face Model Repository](https://huggingface.co/HarshCode/eff_b4_brain)
+
+ ## Detailed Performance Report
+
+ Detailed performance and B4-specific analysis are documented separately in:
+
+ - [B4_Performance_Report.md](B4_Performance_Report.md)
+
+ ## Repository Structure
+
+ - `app.py`: Flask application entry point
+ - `run_interface.py`: adapter layer between the app and the inference implementation
+ - `download_imp/`: inference code and local artifact layout
+ - `templates/`: HTML templates (Jinja2)
+ - `static/`: styles and static assets
+ - `docs/`: GitHub Pages content
+
+ ## Requirements
+
+ - Python 3.10+ (3.12 works)
+ - pip
+ - virtual environment (recommended)
+
+ Install dependencies:
+
  ```bash
+ pip install -r requirements.txt
  ```
+
+ ## Environment Setup
+
+ Create a local environment file from the template:
+
+ ```bash
+ cp .env.example .env
+ ```
+
+ Important variables in `.env`:
+
+ - `ICH_APP_DEBUG`: run Flask in debug mode (`1` or `0`)
+ - `ICH_APP_PORT`: app port (default `7860`)
+ - `ICH_SECRET_KEY`: Flask secret key
+ - `ICH_MAX_UPLOAD_MB`: maximum upload size in MB
+ - `ICH_FOLD_SELECTION`: `ensemble`, `best`, or a fold id (`0` to `4`)
+ - `ICH_LOCAL_MODE`: enables local directory scanning mode
+ - `ICH_LOG_LEVEL`: `DEBUG`, `INFO`, `WARNING`, `ERROR`
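For local development, a filled-in `.env` might look like the following (apart from the documented port default, the values here are illustrative placeholders, not project defaults):

```text
ICH_APP_DEBUG=1
ICH_APP_PORT=7860
ICH_SECRET_KEY=change-me
ICH_MAX_UPLOAD_MB=200
ICH_FOLD_SELECTION=ensemble
ICH_LOCAL_MODE=0
ICH_LOG_LEVEL=INFO
```

Set a strong, private `ICH_SECRET_KEY` and disable `ICH_APP_DEBUG` for anything beyond local testing.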
+
+ ## Run the Application
+
+ ```bash
+ python app.py
+ ```
+
+ Open in a browser:
+
+ ```text
+ http://127.0.0.1:7860
+ ```
+
+ ## Basic Usage
+
+ 1. Go to the upload page.
+ 2. Upload one `.dcm` file, multiple `.dcm` files, or a batch input.
+ 3. Wait for inference and report generation.
+ 4. Review:
+    - screening outcome,
+    - calibrated probability,
+    - confidence band,
+    - triage action,
+    - Grad-CAM overlay.
+ 5. Use the Reports / Logs / Evaluation pages for history and analysis.
+
+ ## Notes
+
+ - Keep heavy model binaries out of GitHub (managed via `.gitignore`).
+ - Generated report outputs are created at runtime.
+ - If required artifacts are missing locally, fetch them from the Hugging Face repository linked above.
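One way to fetch the artifacts, assuming the `huggingface_hub` package and its CLI are installed (the `download_imp/artifacts` target directory is an illustrative choice, not a path required by the app):

```bash
pip install -U huggingface_hub
huggingface-cli download HarshCode/eff_b4_brain --local-dir download_imp/artifacts
```

Point the app's artifact configuration at whichever directory you download into.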
+
+ ## Disclaimer
+
+ This system is an AI-assisted screening and decision-support tool.
+ It does **not** provide a medical diagnosis and must be used with qualified clinical review.