# Clip 02 Misclassification Case Study


## Issue Summary
- **Symptom**: `python predict.py data/iphone/*.wav` classified `data/iphone/clip_02.wav` as “MacBook built-in microphone” (≈53%) instead of “Local iPhone recordings” (≈47%).
- **Impact**: Undermined trust in the classifier for quiet iPhone speech and indicated poor separation between the iPhone and AirPods/Mac classes.


## Investigation
- Confirmed the misclassification reproduced after the first training run with the new TAU batches.
- Compared class distributions via `train.py --dry-run`; the comparison highlighted a severe imbalance: TAU devices (≈295 clips each) vs. iPhone (15 wav + 47 m4a) vs. AirPods/Mac (15 wav + 14 m4a).
- Noted that feature extraction is identical between training and inference (`features.extract_features`), shifting suspicion toward data coverage rather than pipeline drift.


## Actions Taken
1. **Data Organisation**
   - Split the TAU Mobile archive into `data/audio`, `data/audio2`, and `data/audio9` based on filename suffixes (`-a`/`-b`/`-c`).
   - Normalised provenance defaults in `configs/base.yaml` for the new device buckets.
2. **Metadata Refresh**
   - Ran `python3 scripts/refresh_metadata.py --config configs/base.yaml` to register hashes and sources for all clips (including the new iPhone/AirPods captures).
   - Repeated the refresh after each data ingest to keep `data/metadata.csv` consistent.
3. **Model Retraining**
   - Executed `python train.py` to rebuild `models/model.pkl` and `models/label_encoder.pkl` with the expanded dataset (990 clips total).
4. **Inference UX Improvements**
   - Allowed directory inputs in `predict.py` so that `python predict.py data/iphone` expands automatically.
   - Updated the “laptop” friendly name to “AirPods Pro / MacBook built-in microphone” to reflect the mixed capture source.
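The suffix-based split in step 1 can be sketched as a small routing helper. The exact suffix-to-folder mapping below is an assumption for illustration (the source only states that `-a`/`-b`/`-c` clips were distributed across the three directories):

```python
from pathlib import Path

# Assumed mapping from TAU microphone suffix to destination folder;
# verify against the actual archive layout before moving files.
SUFFIX_TO_DIR = {"-a": "data/audio", "-b": "data/audio2", "-c": "data/audio9"}

def route_tau_clip(filename):
    """Return the destination directory for a TAU clip based on the
    microphone suffix (-a/-b/-c) at the end of its stem, or None if
    the clip carries no recognised suffix."""
    stem = Path(filename).stem
    for suffix, dest in SUFFIX_TO_DIR.items():
        if stem.endswith(suffix):
            return dest
    return None
```

A migration script would call `route_tau_clip` per file and `shutil.move` the clip into the returned directory, leaving unrecognised files untouched.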


## Verification
- Post-retrain prediction:
```
File: data/iphone/clip_02.wav
RMS loudness: -40.8 dBFS
1. Local iPhone recordings — 96.1%
2. AirPods Pro / MacBook built-in microphone — 3.9%
3. Samsung Galaxy S7 (TAU device B) — 0.0%
```
- The confidence inversion (≈96% iPhone) confirms the classifier now separates the classes even for low-level speech content.
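The `RMS loudness` figure in the output above (-40.8 dBFS, i.e. quiet speech) follows the standard definition of RMS level relative to full scale. A minimal sketch, assuming float samples normalised to [-1.0, 1.0]; `rms_dbfs` is a hypothetical helper, not necessarily the formula `predict.py` uses:

```python
import math

def rms_dbfs(samples, full_scale=1.0):
    """Return the RMS level of a block of samples in dBFS.
    A constant full-scale signal measures 0 dBFS; silence maps to -inf."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20.0 * math.log10(rms / full_scale)
```

Values around -40 dBFS, as seen for `clip_02.wav`, are a useful sanity check that the low-confidence behaviour correlated with low signal level rather than a decoding problem.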


## Feature Changes for Improved Results
- `configs/base.yaml`: added TAU device folders to `include_devices` and defined CC-BY provenance defaults.
- `data/metadata.csv`: regenerated with 990 entries to incorporate the new recordings (62 iPhone, 29 AirPods/Mac).
- `devices.py`: renamed the “laptop” label to “AirPods Pro / MacBook built-in microphone” for accurate reporting.
- `predict.py`: added directory expansion and broader audio-extension support to streamline batch evaluation.
- Dataset restructuring: migrated TAU archive clips into the `data/audio`, `data/audio2`, and `data/audio9` directories, preserving the `-a`/`-b`/`-c` microphone mapping.


## Follow-Up Recommendations
- Continue collecting parallel iPhone vs. AirPods recordings, especially in quiet environments, until class counts approach parity with the TAU devices.
- Maintain a held-out validation set (not yet captured) to quantify gains objectively beyond spot checks.
- Document future ingestion runs by appending to this case study or to a dedicated experiment log under `docs/`.
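The recommended held-out validation set should be stratified by device class, since the classes are so imbalanced that a naive random split could leave a minority class with almost no validation clips. A minimal sketch under that assumption; `stratified_holdout` is a hypothetical helper, not part of the training code:

```python
import random

def stratified_holdout(items_by_class, frac=0.2, seed=0):
    """Split each class's clip list into (train, validation) with the
    given held-out fraction per class, preserving class proportions.
    Returns two dicts keyed by class label."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = {}, {}
    for label, items in items_by_class.items():
        shuffled = items[:]
        rng.shuffle(shuffled)
        # Hold out at least one clip per non-empty class.
        k = max(1, int(len(shuffled) * frac)) if shuffled else 0
        val[label] = shuffled[:k]
        train[label] = shuffled[k:]
    return train, val
```

Recording the seed alongside `data/metadata.csv` would let future retraining runs reuse the exact same validation clips.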