# Multimodal PC Fault Detection Using Audio-Visual Evidence Fusion

Two-branch architecture (ViT visual + AST audio) with late fusion for 5 PC fault classes.
## Fault Classes

| ID | Class | Audio Signal | Visual Signal |
|---|---|---|---|
| 0 | normal_operation | Quiet fan hum | Clean desktop |
| 1 | boot_failure | BIOS beep codes | POST error screen |
| 2 | overheating_fan | Loud/grinding fan | Thermal warning UI |
| 3 | storage_failure | HDD clicking | SMART/CHKDSK errors |
| 4 | system_crash | Audio glitch/silence | BSOD |
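The class IDs above map to labels as in the following sketch. This dict is illustrative, derived from the table rather than copied from `config.py`, so the exact constant names are assumptions:

```python
# Label mapping implied by the fault-class table (illustrative; the real
# mapping lives in config.py and may use different variable names).
ID2LABEL = {
    0: "normal_operation",
    1: "boot_failure",
    2: "overheating_fan",
    3: "storage_failure",
    4: "system_crash",
}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}
```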
## Quick Start

```bash
# Clone
git clone https://huggingface.co/Ellaft/multimodal-pc-fault-detector
cd multimodal-pc-fault-detector

# Install
pip install -r requirements.txt

# Train (downloads dataset automatically from Hub)
cd src
python train.py --quick_test --no_push

# Full training (15 epochs, ~1 hr on A100)
python train.py --eval_robustness

# All 6 ablation experiments
python run_ablations.py --quick_test
```
## Dataset

`Ellaft/pc-fault-real-dataset`: 1,500 audio-visual pairs, auto-downloaded when you run `train.py`.
| Source | Content |
|---|---|
| Real fan recordings | HenriqueFrancaa/cooling-fans-db0 (normal vs abnormal PC cooling fans) |
| Synthetic beep codes | 12 real AMI/Award/Phoenix BIOS beep patterns with timing jitter |
| Synthetic HDD clicks | Repetitive clicking, motor hum, head crash grinding |
| Synthetic crash audio | Noise bursts, buffer glitches, feedback loops, system hangs |
| Synthetic BSOD images | Windows 10/11/7/XP styles with real stop codes |
| Synthetic POST screens | BIOS vendor screens with real error messages |
| Synthetic thermal UIs | HWMonitor, BIOS warning, notification popup styles |
| Synthetic disk errors | SMART warnings, CHKDSK, CrystalDiskInfo displays |
To rebuild or extend the dataset (add YouTube scraping, etc.):

```bash
cd data
pip install -r requirements_data.txt
python build_dataset.py --max_per_class 500 --upload
```
## Architecture

```
Audio (WAV) ──► AST (AudioSet) + LoRA ──► [CLS] 768d ──► audio_head ──► L_audio
                                               │
                                               ├──► concat ──► fusion_classifier ──► L_fusion
                                               │
Visual (JPG) ─► ViT-B/16 (IN-21k) + LoRA ──► [CLS] 768d ──► visual_head ──► L_visual
```

Loss = L_fusion + 1.5 × L_visual + 0.5 × L_audio
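The late-fusion forward pass and weighted loss above can be sketched numerically. This is a minimal NumPy illustration with random stand-in weights, not the repo's actual `models.py` implementation; all names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, DIM = 5, 768

def linear_head(x, w, b):
    """One linear classification head: (in_dim,) -> (NUM_CLASSES,) logits."""
    return x @ w + b

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (numerically stabilized)."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[target])

# Random stand-ins for the trained audio_head / visual_head / fusion_classifier.
w_a, b_a = rng.normal(size=(DIM, NUM_CLASSES)), np.zeros(NUM_CLASSES)
w_v, b_v = rng.normal(size=(DIM, NUM_CLASSES)), np.zeros(NUM_CLASSES)
w_f, b_f = rng.normal(size=(2 * DIM, NUM_CLASSES)), np.zeros(NUM_CLASSES)

audio_cls = rng.normal(size=DIM)   # stand-in for the AST [CLS] embedding
visual_cls = rng.normal(size=DIM)  # stand-in for the ViT [CLS] embedding

logits_audio = linear_head(audio_cls, w_a, b_a)
logits_visual = linear_head(visual_cls, w_v, b_v)
# Late fusion: concatenate both [CLS] embeddings, then classify jointly.
logits_fusion = linear_head(np.concatenate([audio_cls, visual_cls]), w_f, b_f)

target = 2  # e.g. overheating_fan
# The weighted sum matches the loss formula above (1.5x visual, 0.5x audio).
loss = (cross_entropy(logits_fusion, target)
        + 1.5 * cross_entropy(logits_visual, target)
        + 0.5 * cross_entropy(logits_audio, target))
```

The visual weight (1.5) being triple the audio weight (0.5) is one of the anti-collapse measures described below.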
## Anti-Modality-Collapse

Three techniques prevent the visual branch from being ignored:

- Auxiliary unimodal heads: force each branch to classify independently
- OGM-GE (Peng et al., CVPR 2022): suppress the dominant modality's gradients at each step
- Asymmetric learning rates: visual branch gets 3× base LR, audio gets 0.5×
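The OGM coefficient computation can be sketched as follows. This is a simplified illustration of the on-the-fly gradient modulation from Peng et al. (CVPR 2022): the branch whose unimodal confidence dominates gets its gradients scaled down. The function name is hypothetical, `alpha` mirrors the `--ogm_alpha` CLI flag, and the Gaussian-noise ("GE") term of the full method is omitted here:

```python
import math

def ogm_coefficients(score_audio, score_visual, alpha=0.8):
    """Gradient scaling factors (k_audio, k_visual) for one training step.

    score_* is the batch-summed softmax probability each unimodal branch
    assigns to the ground-truth class; the dominant branch is damped via
    1 - tanh(alpha * ratio), the weaker branch is left at 1.0.
    """
    ratio = score_audio / score_visual
    if ratio > 1.0:
        # Audio dominates: shrink audio gradients, leave visual untouched.
        return 1.0 - math.tanh(alpha * ratio), 1.0
    # Visual dominates (or tie): shrink visual gradients instead.
    return 1.0, 1.0 - math.tanh(alpha / ratio)
```

A larger `alpha` (e.g. `--ogm_alpha 0.5` vs the default) damps the dominant modality more aggressively.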
## Files

```
src/
  config.py         – All hyperparameters
  models.py         – ViT + AST + LateFusion + OGM-GE + auxiliary heads
  dataset_v2.py     – Loads from Ellaft/pc-fault-real-dataset
  train.py          – Training loop with OGM-GE
  run_ablations.py  – 6-experiment ablation runner
data/
  build_dataset.py  – Dataset builder (YouTube + HF + synthetic)
```
## CLI Options

```bash
python train.py --mode multimodal          # default
python train.py --mode visual_only         # unimodal ablation
python train.py --mode audio_only          # unimodal ablation
python train.py --finetune full --lr 2e-5  # full fine-tuning
python train.py --no_ogm                   # disable OGM-GE
python train.py --ogm_alpha 0.5            # more aggressive modulation
python train.py --lambda_visual 2.0        # stronger visual auxiliary loss
python train.py --visual_lr_mult 5.0       # 5x LR for visual branch
```