Add project README with architecture and experiment design
README.md
ADDED
@@ -0,0 +1,124 @@
# Multimodal PC Fault Detection Using Audio-Visual Evidence Fusion

## Project Overview

This repository implements a **two-branch multimodal system** that detects and classifies PC hardware faults by jointly analyzing **audio signals** (BIOS beep codes, fan noise, HDD anomalies) and **visual inputs** (blue screen errors, BIOS messages, hardware indicators).

The project investigates whether **multimodal learning improves diagnostic accuracy** over unimodal approaches, and compares **LoRA (Low-Rank Adaptation)** against standard full fine-tuning.

## Architecture

```
┌─────────────────────────────┐     ┌─────────────────────────────┐
│       VISUAL BRANCH         │     │        AUDIO BRANCH         │
│                             │     │                             │
│  Input: 224×224 RGB Image   │     │  Input: Log-Mel Spectrogram │
│         ↓                   │     │         ↓                   │
│  ViT-B/16 (ImageNet-21k)    │     │  AST (AudioSet pretrained)  │
│  + LoRA (r=8, Q+V)          │     │  + LoRA (r=8, Q+V)          │
│         ↓                   │     │         ↓                   │
│  CLS Token → (B, 768)       │     │  CLS Token → (B, 768)       │
│         ↓                   │     │         ↓                   │
│  Projection → (B, 512)      │     │  Projection → (B, 512)      │
└──────────────┬──────────────┘     └──────────────┬──────────────┘
               │                                   │
               └─────────────────┬─────────────────┘
                                 │
                      ┌──────────┴──────────┐
                      │     LATE FUSION     │
                      │  Concat → (B, 1024) │
                      │  LayerNorm + GELU   │
                      │  MLP → (B, 512)     │
                      │  → (B, 5 classes)   │
                      └─────────────────────┘
```

**Total params:** 174.5M | **Trainable (LoRA):** 1.9M (1.09%)

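The trainable-parameter figure can be sanity-checked with simple arithmetic. The sketch below is a back-of-the-envelope count, not the project's own accounting: it assumes 12 transformer layers and a hidden size of 768 per encoder, biased linear layers, and a LayerNorm over the 1024-d concatenated vector. The helper names are illustrative only.

```python
# Rough count of trainable parameters under an ASSUMED layer layout
# (12 layers per encoder, hidden size 768, biased Linear layers,
# LayerNorm applied to the 1024-d concatenated embedding).

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return d_in * d_out + d_out


HIDDEN, LAYERS, RANK = 768, 12, 8

# LoRA on query and value in every layer, for both encoders.
lora_total = 2 * LAYERS * 2 * lora_params(HIDDEN, HIDDEN, RANK)
# One 768 -> 512 projection per branch.
proj_total = 2 * linear_params(HIDDEN, 512)
# Fusion head: LayerNorm(1024) scale and shift, then 1024 -> 512 -> 5.
fusion_total = 2 * 1024 + linear_params(1024, 512) + linear_params(512, 5)

trainable = lora_total + proj_total + fusion_total
print(f"{trainable:,} trainable parameters")          # 1,906,693
print(f"{100 * trainable / 174.5e6:.2f}% of 174.5M")  # 1.09%
```

Under these assumptions the count lands close to the reported 1.9M, with the LoRA adapters themselves contributing roughly 0.59M and the projection and fusion layers the rest.
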
## Fault Taxonomy (5 Classes)

| ID | Class | Audio Proxy (ESC-50) | Visual Indicator |
|----|-------|----------------------|------------------|
| 0 | Normal Operation | keyboard_typing, mouse_click | Green status screen |
| 1 | Boot Failure | clock_alarm, siren | Black BIOS error screen |
| 2 | Overheating/Fan | vacuum_cleaner, engine, washing_machine | Red temperature warning |
| 3 | Storage Failure | clock_tick, door_wood_knock, hand_saw | Orange disk error screen |
| 4 | System Crash | glass_breaking, fireworks, chainsaw | Blue BSOD screen |

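The taxonomy table can be expressed as a lookup from ESC-50 category names to fault class IDs. The following is a minimal sketch; `FAULT_CLASSES`, `ESC50_TO_FAULT`, and `fault_label` are hypothetical names, and the actual structure in `config.py` may differ.

```python
from typing import Optional

# Hypothetical encoding of the taxonomy table: class ID -> (name, ESC-50 categories).
FAULT_CLASSES = {
    0: ("Normal Operation", ["keyboard_typing", "mouse_click"]),
    1: ("Boot Failure", ["clock_alarm", "siren"]),
    2: ("Overheating/Fan", ["vacuum_cleaner", "engine", "washing_machine"]),
    3: ("Storage Failure", ["clock_tick", "door_wood_knock", "hand_saw"]),
    4: ("System Crash", ["glass_breaking", "fireworks", "chainsaw"]),
}

# Inverted index: ESC-50 category -> fault class ID.
ESC50_TO_FAULT = {
    category: fault_id
    for fault_id, (_, categories) in FAULT_CLASSES.items()
    for category in categories
}


def fault_label(esc50_category: str) -> Optional[int]:
    """Return the fault class ID for an ESC-50 category, or None if it is unmapped."""
    return ESC50_TO_FAULT.get(esc50_category)


CLIPS_PER_CATEGORY = 40  # ESC-50 ships 40 clips per category
print(fault_label("siren"))                       # 1
print(fault_label("dog"))                         # None (category not used)
print(len(ESC50_TO_FAULT) * CLIPS_PER_CATEGORY)   # 520
```

Thirteen mapped categories at 40 clips each give the 520 audio clips used by the project; all other ESC-50 categories are filtered out.
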
## Key Design Decisions (Literature-Backed)

| Decision | Justification | Source |
|----------|---------------|--------|
| AST over CNN14 | mAP 0.459 vs 0.431 on AudioSet; HF-native | Gong et al., Interspeech 2021 |
| Late fusion (concat) | Within ~1% of bottleneck attention; simple & interpretable | Nagrani et al., NeurIPS 2021 (MBT) |
| LoRA r=8, α=16 | Optimal for audio transformers; regularization on small data | Cappellazzo et al., 2023; Zhao et al., 2024 |
| Modality dropout p=0.3 | Cheapest robustness strategy for missing modalities | Woo et al., NeurIPS 2022 |
| Multimodal > Unimodal | +14% F1 from adding audio to vision | Inceoglu et al., 2020 (FINO-Net) |

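Modality dropout is simple to illustrate without any framework. The sketch below shows one common variant, assumed here rather than taken from the project code: with probability `p`, exactly one randomly chosen branch embedding is zeroed during training, so the fusion head learns to cope with a missing modality. The function name and the choice never to drop both branches are assumptions.

```python
import random


def modality_dropout(visual, audio, p=0.3, rng=random):
    """With probability p, zero out ONE randomly chosen modality embedding.

    Assumed variant: both modalities are never dropped together, and the
    function is meant to be applied only while training.
    """
    if rng.random() < p:
        if rng.random() < 0.5:
            visual = [0.0] * len(visual)   # simulate a missing image
        else:
            audio = [0.0] * len(audio)     # simulate a missing audio clip
    return visual, audio


# Usage with toy 4-d embeddings and a seeded generator for reproducibility.
rng = random.Random(0)
v, a = modality_dropout([0.5] * 4, [0.2] * 4, p=0.3, rng=rng)
fused_input = v + a  # late fusion concatenates the two embeddings
print(len(fused_input))  # 8
```

In a tensor framework the same idea is usually applied per sample with a mask, and disabled at evaluation time except when deliberately testing robustness to a missing modality.
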
## Ablation Experiments

| Experiment | Mode | Method | Purpose |
|------------|------|--------|---------|
| Multimodal + LoRA | Both | LoRA r=8 | **Primary system** |
| Visual Only + LoRA | ViT only | LoRA r=8 | Unimodal baseline |
| Audio Only + LoRA | AST only | LoRA r=8 | Unimodal baseline |
| Multimodal + Full FT | Both | Full fine-tuning | LoRA vs full FT |
| Multimodal + Linear Probe | Both | Frozen encoders | Feature quality |
| Multimodal + High Dropout | Both | LoRA + modality dropout p=0.5 | Robustness |

## Usage

```bash
# Install dependencies
pip install torch torchaudio torchvision transformers peft datasets scikit-learn Pillow soundfile

# Quick test (CPU, ~5 min)
python train.py --mode multimodal --finetune lora --quick_test --no_push

# Full training (GPU)
python train.py --mode multimodal --finetune lora --eval_robustness --hub_model_id Ellaft/pc-fault-multimodal-lora

# Run all ablation studies
python run_ablations.py

# Single modality baselines
python train.py --mode visual_only --finetune lora --no_push
python train.py --mode audio_only --finetune lora --no_push
```

## Files

| File | Description |
|------|-------------|
| `config.py` | All hyperparameters, fault taxonomy, ESC-50 mappings, ablation configs |
| `dataset.py` | PCFaultDataset, synthetic visual generation, audio preprocessing |
| `models.py` | VisualBranch (ViT+LoRA), AudioBranch (AST+LoRA), LateFusion, full model |
| `train.py` | Training loop with evaluation, confusion matrix, Hub push |
| `run_ablations.py` | Automated ablation runner with comparison tables |

## Datasets

- **Audio:** [ESC-50](https://huggingface.co/datasets/ashraq/esc50) (520 clips mapped to 5 fault classes)
- **Visual:** Synthetically generated diagnostic screens (no real BSOD dataset was found on the HF Hub)
- **Recommended upgrade:** [AudioSet balanced](https://huggingface.co/datasets/agkphysics/AudioSet) for richer PC sound coverage

## Training Configuration

| Parameter | Value |
|-----------|-------|
| LoRA rank / alpha | 8 / 16 |
| LoRA targets | query, value |
| LR (LoRA / Full FT) | 5e-3 / 2e-5 |
| Optimizer | AdamW (weight_decay=0.01) |
| Scheduler | OneCycleLR (cosine) |
| Batch size | 16 × 2 grad accum = 32 effective |
| Epochs | 15 |
| Modality dropout | 0.3 (default) / 0.5 (robust) |

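To make the rank and alpha settings concrete, the sketch below shows the standard LoRA update, W_eff = W + (alpha / r) · B · A, on toy matrices in plain Python. It is an illustration of the general technique, not the project's code: the helper names are made up, and the toy example uses rank 2 with alpha 4 so that the scaling factor matches the 16 / 8 = 2 implied by the table.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A.

    Shapes: W is (d_out x d_in), A is (r x d_in), B is (d_out x r).
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: d_out=2, d_in=3, rank=2, alpha=4 (scale 2, same ratio as 16/8).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen pretrained weight
A = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # trainable down-projection
B_init = [[0.0, 0.0], [0.0, 0.0]]        # B starts at zero, so W is unchanged
B_trained = [[1.0, 0.0], [0.0, 1.0]]     # pretend values after some training

print(lora_effective_weight(W, A, B_init, alpha=4))     # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(lora_effective_weight(W, A, B_trained, alpha=4))  # [[3.0, 0.0, 0.0], [0.0, 1.0, 2.0]]
```

Only A and B receive gradients; the pretrained weight W stays frozen, which is why the LoRA and full fine-tuning runs use different learning rates.
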
## Pretrained Backbones

- Visual: [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) (86.4M params)
- Audio: [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) (86.6M params)

## License

Apache-2.0