Ellaft committed on
Commit b8730cc · verified · 1 Parent(s): a3e66e5

Add project README with architecture and experiment design

  1. README.md +124 -0
README.md ADDED
@@ -0,0 +1,124 @@
1
+ # Multimodal PC Fault Detection Using Audio-Visual Evidence Fusion
2
+
3
+ ## Project Overview
4
+
5
+ This repository is a complete implementation of a **two-branch multimodal system** for detecting and classifying PC hardware faults by jointly analyzing **audio signals** (BIOS beep codes, fan noise, HDD anomalies) and **visual inputs** (blue screen errors, BIOS messages, hardware indicators).
6
+
7
+ The project investigates whether **multimodal learning improves diagnostic accuracy** over unimodal approaches, and evaluates **LoRA (Low-Rank Adaptation)** against full fine-tuning.
8
+
9
+ ## Architecture
10
+
11
+ ```
12
+ ┌─────────────────────────────┐     ┌─────────────────────────────┐
13
+ │        VISUAL BRANCH        │     │        AUDIO BRANCH         │
14
+ │                             │     │                             │
15
+ │  Input: 224×224 RGB Image   │     │  Input: Log-Mel Spectrogram │
16
+ │            ↓                │     │            ↓                │
17
+ │  ViT-B/16 (ImageNet-21k)    │     │  AST (AudioSet pretrained)  │
18
+ │  + LoRA (r=8, Q+V)          │     │  + LoRA (r=8, Q+V)          │
19
+ │            ↓                │     │            ↓                │
20
+ │  CLS Token → (B, 768)       │     │  CLS Token → (B, 768)       │
21
+ │            ↓                │     │            ↓                │
22
+ │  Projection → (B, 512)      │     │  Projection → (B, 512)      │
23
+ └─────────────┬───────────────┘     └─────────────┬───────────────┘
24
+               │                                   │
25
+               └──────────┬────────────────────────┘
26
+                          │
27
+               ┌──────────┴──────────┐
28
+               │     LATE FUSION     │
29
+               │  Concat → (B, 1024) │
30
+               │  LayerNorm + GELU   │
31
+               │  MLP → (B, 512)     │
32
+               │  → (B, 5 classes)   │
33
+               └─────────────────────┘
34
+ ```
35
+
36
+ **Total params:** 174.5M | **Trainable (LoRA):** 1.9M (1.09%)
37
+
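The trainable-parameter figure can be roughly reproduced by hand. The sketch below assumes details the README does not state: 12 transformer layers per encoder with hidden size 768, LoRA adapters on the query and value projections only, one bias-carrying linear projection per branch, and a fusion head made of LayerNorm(1024), Linear(1024, 512) and Linear(512, 5). All constant and function names are illustrative and are not taken from `models.py`.

```python
# Back-of-the-envelope count of trainable parameters.
# Layer count and head layout are assumptions, not read from models.py.

HIDDEN = 768        # hidden size of the base ViT and AST encoders
LAYERS = 12         # transformer blocks per encoder (assumed)
RANK = 8            # LoRA rank from the README
PROJ = 512          # per-branch projection width
FUSED = 2 * PROJ    # width after concatenating both branches
CLASSES = 5


def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights plus optional bias of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def lora_params_per_encoder() -> int:
    """Each adapted matrix gets A (r x d) and B (d x r), without bias."""
    per_matrix = 2 * HIDDEN * RANK
    return LAYERS * 2 * per_matrix  # query + value in every layer


def trainable_params() -> int:
    lora = 2 * lora_params_per_encoder()           # two encoders
    projections = 2 * linear_params(HIDDEN, PROJ)  # one per branch
    layer_norm = 2 * FUSED                         # scale + shift
    head = linear_params(FUSED, PROJ) + linear_params(PROJ, CLASSES)
    return lora + projections + layer_norm + head


total = trainable_params()
print(f"{total:,} trainable parameters ({total / 174.5e6:.2%} of 174.5M)")
```

Under these assumptions the count comes to 1,906,693 parameters, about 1.09% of 174.5M, which is consistent with the figures quoted above. A different head layout in the actual code would shift the number slightly.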
38
+ ## Fault Taxonomy (5 Classes)
39
+
40
+ | ID | Class | Audio Proxy (ESC-50) | Visual Indicator |
41
+ |----|-------|---------------------|------------------|
42
+ | 0 | Normal Operation | keyboard_typing, mouse_click | Green status screen |
43
+ | 1 | Boot Failure | clock_alarm, siren | Black BIOS error screen |
44
+ | 2 | Overheating/Fan | vacuum_cleaner, engine, washing_machine | Red temperature warning |
45
+ | 3 | Storage Failure | clock_tick, door_wood_knock, hand_saw | Orange disk error screen |
46
+ | 4 | System Crash | glass_breaking, fireworks, chainsaw | Blue BSOD screen |
47
+
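The taxonomy table can be expressed as a lookup from ESC-50 category names to fault class IDs. This is a minimal sketch transcribed from the table; the names `FAULT_CLASSES`, `ESC50_PROXIES`, `ESC50_TO_FAULT` and `fault_label` are hypothetical, and the real `config.py` may organize the mapping differently.

```python
# Hypothetical mapping from ESC-50 categories to fault class IDs,
# transcribed from the taxonomy table. config.py may differ.

FAULT_CLASSES = {
    0: "Normal Operation",
    1: "Boot Failure",
    2: "Overheating/Fan",
    3: "Storage Failure",
    4: "System Crash",
}

ESC50_PROXIES = {
    0: ["keyboard_typing", "mouse_click"],
    1: ["clock_alarm", "siren"],
    2: ["vacuum_cleaner", "engine", "washing_machine"],
    3: ["clock_tick", "door_wood_knock", "hand_saw"],
    4: ["glass_breaking", "fireworks", "chainsaw"],
}

# Invert to a flat lookup: ESC-50 category -> fault class ID.
ESC50_TO_FAULT = {
    category: fault_id
    for fault_id, categories in ESC50_PROXIES.items()
    for category in categories
}

CLIPS_PER_ESC50_CATEGORY = 40  # ESC-50 holds 2,000 clips over 50 categories


def fault_label(esc50_category: str):
    """Return the fault class ID, or None for categories that are not used."""
    return ESC50_TO_FAULT.get(esc50_category)


# 13 mapped categories * 40 clips each
print(len(ESC50_TO_FAULT) * CLIPS_PER_ESC50_CATEGORY)
```

Thirteen mapped categories at 40 clips each give 520 audio clips in total.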
48
+ ## Key Design Decisions (Literature-Backed)
49
+
50
+ | Decision | Justification | Source |
51
+ |----------|---------------|--------|
52
+ | AST over CNN14 | mAP 0.459 vs 0.431 on AudioSet; HF-native | Gong et al., Interspeech 2021 |
53
+ | Late fusion (concat) | Within ~1% of bottleneck attention; simple & interpretable | Nagrani et al., NeurIPS 2021 (MBT) |
54
+ | LoRA r=8, α=16 | Optimal for audio transformers; regularization on small data | Cappellazzo et al., 2023; Zhao et al., 2024 |
55
+ | Modality dropout p=0.3 | Cheapest robustness strategy for missing modalities | Woo et al., NeurIPS 2022 |
56
+ | Multimodal > Unimodal | +14% F1 from adding audio to vision | Inceoglu et al., 2020 (FINO-Net) |
57
+
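Modality dropout with p=0.3 can be read as follows: during training, with probability 0.3 one branch embedding is suppressed before fusion, so the classifier cannot rely on a single modality. The sketch below illustrates the idea on plain Python lists without any deep learning framework. The function names, the choice to drop at most one modality per sample, and the zero-fill are assumptions, not the behaviour of `models.py`.

```python
import random


def modality_dropout(visual, audio, p=0.3, training=True, rng=random):
    """Zero out at most one modality embedding with probability p.

    visual, audio: lists of floats (one sample's projected embeddings).
    Returns the (possibly zeroed) pair; both modalities are never dropped.
    """
    if not training or rng.random() >= p:
        return visual, audio
    # Pick the modality to suppress with equal probability.
    if rng.random() < 0.5:
        return [0.0] * len(visual), audio
    return visual, [0.0] * len(audio)


def fuse(visual, audio):
    """Late fusion by concatenation, as in the architecture diagram."""
    return visual + audio


rng = random.Random(0)
v, a = [0.5] * 4, [1.0] * 4
dropped = sum(
    1 for _ in range(10_000)
    if 0.0 in fuse(*modality_dropout(v, a, p=0.3, rng=rng))
)
print(f"fraction of samples with a dropped modality: {dropped / 10_000:.3f}")
```

At evaluation time (`training=False`) both embeddings pass through unchanged. A robustness test would instead zero one modality deliberately and compare the resulting accuracy.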
58
+ ## Ablation Experiments
59
+
60
+ | Experiment | Mode | Method | Purpose |
61
+ |------------|------|--------|---------|
62
+ | Multimodal + LoRA | Both | LoRA r=8 | **Primary system** |
63
+ | Visual Only + LoRA | ViT only | LoRA r=8 | Unimodal baseline |
64
+ | Audio Only + LoRA | AST only | LoRA r=8 | Unimodal baseline |
65
+ | Multimodal + Full FT | Both | Full fine-tuning | LoRA vs full FT |
66
+ | Multimodal + Linear Probe | Both | Frozen encoders | Feature quality |
67
+ | Multimodal + High Dropout | Both | LoRA + 50% dropout | Robustness |
68
+
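The six runs in the table could be represented as data so that a runner can iterate over them. This is only a sketch: `AblationConfig`, its fields, and the labels `"full"` and `"linear_probe"` are hypothetical. Only the mode names and the `lora` option appear in the documented commands, and the actual structure in `config.py` is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationConfig:
    name: str
    mode: str                      # "multimodal", "visual_only" or "audio_only"
    finetune: str                  # "lora" is documented; other labels are assumed
    modality_dropout: float = 0.3  # default from the training configuration


ABLATIONS = [
    AblationConfig("Multimodal + LoRA", "multimodal", "lora"),
    AblationConfig("Visual Only + LoRA", "visual_only", "lora"),
    AblationConfig("Audio Only + LoRA", "audio_only", "lora"),
    AblationConfig("Multimodal + Full FT", "multimodal", "full"),
    AblationConfig("Multimodal + Linear Probe", "multimodal", "linear_probe"),
    AblationConfig("Multimodal + High Dropout", "multimodal", "lora", 0.5),
]


def comparisons(configs):
    """Pair every other run with the primary system (first entry)."""
    primary = configs[0]
    return [(primary.name, cfg.name) for cfg in configs[1:]]


for baseline, variant in comparisons(ABLATIONS):
    print(f"{baseline}  vs  {variant}")
```

Each variant changes one factor relative to the primary system (modality, fine-tuning method, or dropout rate), which is what lets the comparison table attribute differences to that factor.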
69
+ ## Usage
70
+
71
+ ```bash
72
+ # Install dependencies
73
+ pip install torch torchaudio torchvision transformers peft datasets scikit-learn Pillow soundfile
74
+
75
+ # Quick test (CPU, ~5 min)
76
+ python train.py --mode multimodal --finetune lora --quick_test --no_push
77
+
78
+ # Full training (GPU)
79
+ python train.py --mode multimodal --finetune lora --eval_robustness --hub_model_id Ellaft/pc-fault-multimodal-lora
80
+
81
+ # Run all ablation studies
82
+ python run_ablations.py
83
+
84
+ # Single modality baselines
85
+ python train.py --mode visual_only --finetune lora --no_push
86
+ python train.py --mode audio_only --finetune lora --no_push
87
+ ```
88
+
89
+ ## Files
90
+
91
+ | File | Description |
92
+ |------|-------------|
93
+ | `config.py` | All hyperparameters, fault taxonomy, ESC-50 mappings, ablation configs |
94
+ | `dataset.py` | PCFaultDataset, synthetic visual generation, audio preprocessing |
95
+ | `models.py` | VisualBranch (ViT+LoRA), AudioBranch (AST+LoRA), LateFusion, full model |
96
+ | `train.py` | Training loop with evaluation, confusion matrix, Hub push |
97
+ | `run_ablations.py` | Automated ablation runner with comparison tables |
98
+
99
+ ## Datasets
100
+
101
+ - **Audio:** [ESC-50](https://huggingface.co/datasets/ashraq/esc50) (520 clips mapped to 5 fault classes)
102
+ - **Visual:** Synthetically generated diagnostic screens (no real BSOD dataset was found on the HF Hub)
103
+ - **Recommended upgrade:** [AudioSet balanced](https://huggingface.co/datasets/agkphysics/AudioSet) for richer PC sound coverage
104
+
105
+ ## Training Configuration
106
+
107
+ | Parameter | Value |
108
+ |-----------|-------|
109
+ | LoRA rank / alpha | 8 / 16 |
110
+ | LoRA targets | query, value |
111
+ | LR (LoRA / Full FT) | 5e-3 / 2e-5 |
112
+ | Optimizer | AdamW (weight_decay=0.01) |
113
+ | Scheduler | OneCycleLR (cosine) |
114
+ | Batch size | 16 × 2 grad accum = 32 effective |
115
+ | Epochs | 15 |
116
+ | Modality dropout | 0.3 (default) / 0.5 (robust) |
117
+
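To give a feel for the one-cycle cosine schedule at the LoRA peak rate of 5e-3, the sketch below re-implements the shape in plain Python. It is an illustration, not the training code: the defaults (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`) are intended to mirror PyTorch's `OneCycleLR` defaults and should be checked against the installed version, and the figure of 13 optimizer steps per epoch is a hypothetical value chosen only for the demonstration.

```python
import math


def _cos_anneal(start: float, end: float, pct: float) -> float:
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)


def one_cycle_lr(step, total_steps, max_lr=5e-3,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Learning rate at a given optimizer step of a one-cycle schedule.

    Assumes total_steps is large enough that the warm-up phase is non-empty.
    """
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1
    if step <= warmup_end:
        # Phase 1: warm up from initial_lr to max_lr.
        return _cos_anneal(initial_lr, max_lr, step / warmup_end)
    # Phase 2: anneal from max_lr down to min_lr.
    pct = (step - warmup_end) / (total_steps - 1 - warmup_end)
    return _cos_anneal(max_lr, min_lr, pct)


# Hypothetical run: 13 optimizer steps per epoch for 15 epochs.
TOTAL_STEPS = 13 * 15
for step in (0, 57, 100, TOTAL_STEPS - 1):
    print(step, f"{one_cycle_lr(step, TOTAL_STEPS):.2e}")
```

With gradient accumulation, the schedule advances once per optimizer step (every two batches of 16), not once per batch, so `total_steps` should count optimizer steps.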
118
+ ## Pretrained Backbones
119
+
120
+ - Visual: [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) (86.4M params)
121
+ - Audio: [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) (86.6M params)
122
+
123
+ ## License
124
+ Apache-2.0