Ellaft/multimodal-pc-fault-detector

Ellaft committed on Apr 25

Commit b8730cc (verified) · 1 parent: a3e66e5

Add project README with architecture and experiment design

README.md +124 -0

README.md ADDED

	@@ -0,0 +1,124 @@

+# Multimodal PC Fault Detection Using Audio-Visual Evidence Fusion
+## Project Overview
+This project implements a **two-branch multimodal system** that detects and classifies PC hardware faults by jointly analyzing **audio signals** (BIOS beep codes, fan noise, HDD anomalies) and **visual inputs** (blue screen errors, BIOS messages, hardware indicators).
+It investigates whether **multimodal learning improves diagnostic accuracy** over unimodal approaches, and evaluates how **LoRA (Low-Rank Adaptation)** compares with full fine-tuning.
+## Architecture
+```
+┌─────────────────────────────┐     ┌─────────────────────────────┐
+│     VISUAL BRANCH           │     │      AUDIO BRANCH           │
+│                             │     │                             │
+│  Input: 224×224 RGB Image   │     │  Input: Log-Mel Spectrogram │
+│           ↓                 │     │           ↓                 │
+│  ViT-B/16 (ImageNet-21k)    │     │  AST (AudioSet pretrained)  │
+│  + LoRA (r=8, Q+V)          │     │  + LoRA (r=8, Q+V)          │
+│           ↓                 │     │           ↓                 │
+│  CLS Token → (B, 768)       │     │  CLS Token → (B, 768)       │
+│           ↓                 │     │           ↓                 │
+│  Projection → (B, 512)      │     │  Projection → (B, 512)      │
+└─────────────┬───────────────┘     └─────────────┬───────────────┘
+              │                                   │
+              └──────────┬────────────────────────┘
+                         │
+              ┌──────────┴──────────┐
+              │   LATE FUSION       │
+              │  Concat → (B, 1024) │
+              │  LayerNorm + GELU   │
+              │  MLP → (B, 512)     │
+              │  → (B, 5 classes)   │
+              └─────────────────────┘
+```
+**Total params:** 174.5M | **Trainable (LoRA):** 1.9M (1.09%)
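The trainable-parameter figure above can be sanity-checked with a short calculation. This is only a sketch under stated assumptions: it takes the 12-layer, 768-dimensional shape of ViT-B/16 and AST base, and it assumes the head consists of bias-carrying linear projections, a LayerNorm over the 1024-d concatenation, one 1024→512 linear layer, and a 512→5 classifier. The actual layout in `models.py` may differ, and the helper names are hypothetical.

```python
# Rough count of trainable parameters for the diagram above.
# Assumed (not confirmed by the README): head layout and use of biases.
HIDDEN = 768       # hidden size of ViT-B/16 and AST base
LAYERS = 12        # transformer blocks per encoder
RANK = 8           # LoRA rank from the diagram
PROJ = 512         # per-branch projection width
NUM_CLASSES = 5


def linear_params(n_in, n_out, bias=True):
    """Parameters of a dense layer, optionally with a bias vector."""
    return n_in * n_out + (n_out if bias else 0)


def lora_params(hidden, rank, layers, targets=2):
    """Each adapted matrix (Q and V) gets A (rank x hidden) and B (hidden x rank)."""
    return layers * targets * 2 * hidden * rank


lora = 2 * lora_params(HIDDEN, RANK, LAYERS)      # both encoders
projections = 2 * linear_params(HIDDEN, PROJ)     # 768 -> 512, per branch
layer_norm = 2 * (2 * PROJ)                       # scale + shift over 1024 dims
fusion_mlp = linear_params(2 * PROJ, PROJ)        # 1024 -> 512
classifier = linear_params(PROJ, NUM_CLASSES)     # 512 -> 5

trainable = lora + projections + layer_norm + fusion_mlp + classifier
share = 100 * trainable / 174.5e6                 # against the stated total

print(f"{trainable:,}")   # 1,906,693
print(f"{share:.2f}%")    # 1.09%
```

Under these assumptions the count lands on roughly 1.9M parameters, of which about 0.59M are LoRA adapters and the rest belong to the projection and fusion layers.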
+## Fault Taxonomy (5 Classes)
+| ID | Class | Audio Proxy (ESC-50) | Visual Indicator |
+|----|-------|---------------------|------------------|
+| 0 | Normal Operation | keyboard_typing, mouse_click | Green status screen |
+| 1 | Boot Failure | clock_alarm, siren | Black BIOS error screen |
+| 2 | Overheating/Fan | vacuum_cleaner, engine, washing_machine | Red temperature warning |
+| 3 | Storage Failure | clock_tick, door_wood_knock, hand_saw | Orange disk error screen |
+| 4 | System Crash | glass_breaking, fireworks, chainsaw | Blue BSOD screen |
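As an illustration, the taxonomy table can be written as a lookup from ESC-50 category name to fault class ID. The names `FAULT_CLASSES` and `ESC50_TO_FAULT` are hypothetical and are not taken from `config.py`; the only outside fact used is that ESC-50 ships 40 clips per category.

```python
# Hypothetical encoding of the fault taxonomy table (names are illustrative).
ESC50_CLIPS_PER_CATEGORY = 40  # ESC-50: 2000 clips over 50 categories

FAULT_CLASSES = {
    0: ("Normal Operation", ["keyboard_typing", "mouse_click"]),
    1: ("Boot Failure", ["clock_alarm", "siren"]),
    2: ("Overheating/Fan", ["vacuum_cleaner", "engine", "washing_machine"]),
    3: ("Storage Failure", ["clock_tick", "door_wood_knock", "hand_saw"]),
    4: ("System Crash", ["glass_breaking", "fireworks", "chainsaw"]),
}

# Reverse lookup: ESC-50 category name -> fault class ID.
ESC50_TO_FAULT = {
    category: fault_id
    for fault_id, (_, categories) in FAULT_CLASSES.items()
    for category in categories
}

# Number of audio clips available per fault class.
clips_per_fault = {
    fault_id: len(categories) * ESC50_CLIPS_PER_CATEGORY
    for fault_id, (_, categories) in FAULT_CLASSES.items()
}
total_clips = sum(clips_per_fault.values())

print(clips_per_fault)  # {0: 80, 1: 80, 2: 120, 3: 120, 4: 120}
print(total_clips)      # 520
```

The mapping covers 13 ESC-50 categories. Classes 0 and 1 end up with 80 clips each against 120 for the others, so a stratified split may be worth considering.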
+## Key Design Decisions (Literature-Backed)
+| Decision | Justification | Source |
+|----------|---------------|--------|
+| AST over CNN14 | mAP 0.459 vs 0.431 on AudioSet; HF-native | Gong et al., Interspeech 2021 |
+| Late fusion (concat) | Within ~1% of bottleneck attention; simple & interpretable | Nagrani et al., NeurIPS 2021 (MBT) |
+| LoRA r=8, α=16 | Optimal for audio transformers; regularization on small data | Cappellazzo et al., 2023; Zhao et al., 2024 |
+| Modality dropout p=0.3 | Cheapest robustness strategy for missing modalities | Woo et al., NeurIPS 2022 |
+| Multimodal > Unimodal | +14% F1 from adding audio to vision | Inceoglu et al., 2020 (FINO-Net) |
+## Ablation Experiments
+| Experiment | Mode | Method | Purpose |
+|------------|------|--------|---------|
+| Multimodal + LoRA | Both | LoRA r=8 | **Primary system** |
+| Visual Only + LoRA | ViT only | LoRA r=8 | Unimodal baseline |
+| Audio Only + LoRA | AST only | LoRA r=8 | Unimodal baseline |
+| Multimodal + Full FT | Both | Full fine-tuning | LoRA vs full FT |
+| Multimodal + Linear Probe | Both | Frozen encoders | Feature quality |
+| Multimodal + High Dropout | Both | LoRA + modality dropout p=0.5 | Robustness to missing modalities |
+## Usage
+```bash
+# Install dependencies
+pip install torch torchaudio torchvision transformers peft datasets scikit-learn Pillow soundfile
+# Quick test (CPU, ~5 min)
+python train.py --mode multimodal --finetune lora --quick_test --no_push
+# Full training (GPU)
+python train.py --mode multimodal --finetune lora --eval_robustness --hub_model_id Ellaft/pc-fault-multimodal-lora
+# Run all ablation studies
+python run_ablations.py
+# Single modality baselines
+python train.py --mode visual_only --finetune lora --no_push
+python train.py --mode audio_only --finetune lora --no_push
+```
+## Files
+| File | Description |
+|------|-------------|
+| `config.py` | All hyperparameters, fault taxonomy, ESC-50 mappings, ablation configs |
+| `dataset.py` | PCFaultDataset, synthetic visual generation, audio preprocessing |
+| `models.py` | VisualBranch (ViT+LoRA), AudioBranch (AST+LoRA), LateFusion, full model |
+| `train.py` | Training loop with evaluation, confusion matrix, Hub push |
+| `run_ablations.py` | Automated ablation runner with comparison tables |
+## Datasets
+- **Audio:** [ESC-50](https://huggingface.co/datasets/ashraq/esc50) (520 clips from 13 categories, mapped to the 5 fault classes)
+- **Visual:** Synthetically generated diagnostic screens (no real BSOD dataset was available on the HF Hub at the time of writing)
+- **Recommended upgrade:** [AudioSet balanced](https://huggingface.co/datasets/agkphysics/AudioSet) for richer PC sound coverage
+## Training Configuration
+| Parameter | Value |
+|-----------|-------|
+| LoRA rank / alpha | 8 / 16 |
+| LoRA targets | query, value |
+| LR (LoRA / Full FT) | 5e-3 / 2e-5 |
+| Optimizer | AdamW (weight_decay=0.01) |
+| Scheduler | OneCycleLR (cosine) |
+| Batch size | 16 × 2 grad accum = 32 effective |
+| Epochs | 15 |
+| Modality dropout | 0.3 (default) / 0.5 (robust) |
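The README does not say how modality dropout is applied, so the following is one plausible reading rather than the behaviour of `train.py`. With probability `p`, one of the two branch embeddings is zeroed before concatenation, and both are never dropped for the same sample. The function names are hypothetical, and plain Python lists stand in for tensors to keep the sketch dependency-free.

```python
import random


def sample_modality_mask(p, rng):
    """Return (keep_visual, keep_audio) for one training sample.

    Assumption: with probability p exactly one modality is dropped,
    chosen uniformly; both modalities are never dropped together.
    """
    if rng.random() >= p:
        return True, True
    if rng.random() < 0.5:
        return False, True   # drop the visual branch
    return True, False       # drop the audio branch


def fuse_with_dropout(visual_emb, audio_emb, mask):
    """Zero the dropped embedding, then concatenate as the fusion input."""
    keep_visual, keep_audio = mask
    visual = visual_emb if keep_visual else [0.0] * len(visual_emb)
    audio = audio_emb if keep_audio else [0.0] * len(audio_emb)
    return visual + audio


rng = random.Random(0)
masks = [sample_modality_mask(0.3, rng) for _ in range(10_000)]
dropped_fraction = sum(1 for m in masks if not all(m)) / len(masks)
fused = fuse_with_dropout([1.0, 2.0], [3.0], (False, True))
print(fused)  # [0.0, 0.0, 3.0]
```

Setting `p=0.5` corresponds to the high-dropout configuration in the table, and `p=0` disables the mechanism.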
+## Pretrained Backbones
+- Visual: [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) (86.4M params)
+- Audio: [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) (86.6M params)
+## License
+Apache-2.0