# Multimodal PC Fault Detection Using Audio-Visual Evidence Fusion

Two-branch architecture (ViT visual + AST audio) with late fusion for 5 PC fault classes.

## Fault Classes

| ID | Class | Audio Signal | Visual Signal |
|----|-------|--------------|---------------|
| 0 | `normal_operation` | Quiet fan hum | Clean desktop |
| 1 | `boot_failure` | BIOS beep codes | POST error screen |
| 2 | `overheating_fan` | Loud/grinding fan | Thermal warning UI |
| 3 | `storage_failure` | HDD clicking | SMART/CHKDSK errors |
| 4 | `system_crash` | Audio glitch/silence | BSOD |

## Quick Start

```bash
# Clone
git clone https://huggingface.co/Ellaft/multimodal-pc-fault-detector
cd multimodal-pc-fault-detector

# Install
pip install -r requirements.txt

# Train (downloads the dataset automatically from the Hub)
cd src
python train.py --quick_test --no_push

# Full training (15 epochs, ~1 hr on an A100)
python train.py --eval_robustness

# All 6 ablation experiments
python run_ablations.py --quick_test
```

## Dataset

**[Ellaft/pc-fault-real-dataset](https://huggingface.co/datasets/Ellaft/pc-fault-real-dataset)** — 1,500 audio-visual pairs, auto-downloaded when you run `train.py`.

| Source | Content |
|--------|---------|
| Real fan recordings | [HenriqueFrancaa/cooling-fans-db0](https://huggingface.co/datasets/HenriqueFrancaa/cooling-fans-db0) — normal vs. abnormal PC cooling fans |
| Synthetic beep codes | 12 real AMI/Award/Phoenix BIOS beep patterns with timing jitter |
| Synthetic HDD clicks | Repetitive clicking, motor hum, head-crash grinding |
| Synthetic crash audio | Noise bursts, buffer glitches, feedback loops, system hangs |
| Synthetic BSOD images | Windows 10/11/7/XP styles with real stop codes |
| Synthetic POST screens | BIOS vendor screens with real error messages |
| Synthetic thermal UIs | HWMonitor, BIOS warning, notification popup styles |
| Synthetic disk errors | SMART warnings, CHKDSK, CrystalDiskInfo displays |

To rebuild or extend the dataset (add YouTube scraping, etc.):

```bash
cd data
pip install -r requirements_data.txt
python build_dataset.py --max_per_class 500 --upload
```

## Architecture

```
Audio (WAV) ──→ AST (AudioSet) + LoRA ──→ [CLS] 768d ──→ audio_head ──→ L_audio
                                               │
                                               ├──→ concat ──→ fusion_classifier ──→ L_fusion
                                               │
Visual (JPG) ─→ ViT-B/16 (IN-21k) + LoRA ─→ [CLS] 768d ──→ visual_head ──→ L_visual
```

**Loss** = L_fusion + 1.5 × L_visual + 0.5 × L_audio
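The weighted sum above reads directly as code. A minimal PyTorch sketch, assuming three heads that each emit class logits; the names `fusion_logits`, `visual_logits`, and `audio_logits` are illustrative, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def combined_loss(fusion_logits: torch.Tensor,
                  visual_logits: torch.Tensor,
                  audio_logits: torch.Tensor,
                  labels: torch.Tensor,
                  lambda_visual: float = 1.5,
                  lambda_audio: float = 0.5) -> torch.Tensor:
    """Fusion loss plus weighted auxiliary unimodal losses."""
    l_fusion = F.cross_entropy(fusion_logits, labels)
    l_visual = F.cross_entropy(visual_logits, labels)
    l_audio = F.cross_entropy(audio_logits, labels)
    return l_fusion + lambda_visual * l_visual + lambda_audio * l_audio
```

The visual weight is larger than the audio weight for the same reason as the anti-collapse measures below: the visual branch tends to be underused, so it gets the stronger training signal.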
## Anti-Modality-Collapse

Three techniques prevent the visual branch from being ignored:

1. **Auxiliary unimodal heads** — force each branch to classify independently
2. **OGM-GE** ([Peng et al., CVPR 2022](https://arxiv.org/abs/2203.15332)) — suppress the dominant modality's gradients at each step (see the gradient-modulation sketch after the CLI options)
3. **Asymmetric learning rates** — visual branch gets 3× base LR, audio gets 0.5× (see the parameter-group sketch below)

## Files

```
src/
  config.py        — All hyperparameters
  models.py        — ViT + AST + LateFusion + OGM-GE + auxiliary heads
  dataset_v2.py    — Loads from Ellaft/pc-fault-real-dataset
  train.py         — Training loop with OGM-GE
  run_ablations.py — 6-experiment ablation runner
data/
  build_dataset.py — Dataset builder (YouTube + HF + synthetic)
```

## CLI Options

```bash
python train.py --mode multimodal          # default
python train.py --mode visual_only         # unimodal ablation
python train.py --mode audio_only          # unimodal ablation
python train.py --finetune full --lr 2e-5  # full fine-tuning
python train.py --no_ogm                   # disable OGM-GE
python train.py --ogm_alpha 0.5            # more aggressive modulation
python train.py --lambda_visual 2.0        # stronger visual auxiliary loss
python train.py --visual_lr_mult 5.0       # 5× LR for visual branch
```
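For orientation, OGM-GE rescales the dominant branch's gradients between `loss.backward()` and `optimizer.step()`. A minimal sketch of the idea, assuming a hypothetical `model.audio_branch` attribute and per-batch unimodal confidence scores; see `models.py` and `train.py` for the actual implementation:

```python
import math
import torch

def ogm_ge_modulate(model, score_audio: float, score_visual: float,
                    alpha: float = 0.3) -> None:
    """Damp the dominant modality's gradients; call between backward() and step().

    score_audio / score_visual: per-batch confidence of each unimodal head
    on the ground-truth class. alpha corresponds to --ogm_alpha.
    """
    ratio = score_audio / (score_visual + 1e-8)
    if ratio <= 1.0:
        return  # audio is not dominating; the symmetric case is omitted for brevity
    coeff = 1.0 - math.tanh(alpha * (ratio - 1.0))  # OGM coefficient in (0, 1]
    for p in model.audio_branch.parameters():       # hypothetical attribute name
        if p.grad is None:
            continue
        p.grad *= coeff  # OGM: suppress the dominant branch's update
        if p.grad.numel() > 1:
            # GE: re-inject Gaussian noise at the gradient's own scale so the
            # damped branch keeps receiving a nonzero training signal
            p.grad += torch.randn_like(p.grad) * p.grad.std() * coeff
```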
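Likewise, `--visual_lr_mult` corresponds to per-branch optimizer parameter groups. A sketch under the same assumptions (branch attribute names illustrative, base LR not taken from the repo):

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module, base_lr: float = 1e-4) -> torch.optim.AdamW:
    """Asymmetric per-branch LRs mirroring the README defaults (3x visual, 0.5x audio)."""
    return torch.optim.AdamW([
        {"params": model.visual_branch.parameters(), "lr": base_lr * 3.0},  # boost the weak branch
        {"params": model.audio_branch.parameters(), "lr": base_lr * 0.5},   # slow the dominant branch
        {"params": model.fusion_classifier.parameters(), "lr": base_lr},
    ])
```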