# Clip 02 Misclassification Case Study

## Issue Summary

- **Symptom**: `python predict.py data/iphone/*.wav` classified `data/iphone/clip_02.wav` as “MacBook built-in microphone” (≈53%) instead of “Local iPhone recordings” (≈47%).
- **Impact**: Undermined trust in the classifier for quiet iPhone speech, indicating poor separation between the iPhone and AirPods/Mac classes.

## Investigation

- Confirmed the mismatch reproduced after the first training run with the new TAU batches.
- Compared class distributions via `train.py --dry-run`, which revealed a severe imbalance: TAU devices (≈295 clips each) vs. iPhone (15 wav + 47 m4a) vs. AirPods/Mac (15 wav + 14 m4a). A counting sketch appears in the appendix below.
- Noted identical feature extraction between training and inference (`features.extract_features`), pointing suspicion toward data coverage rather than pipeline drift.

## Actions Taken

1. **Data Organisation**
   - Split the TAU Mobile archive into `data/audio`, `data/audio2`, and `data/audio9` based on filename suffixes (`-a/-b/-c`); see the migration sketch in the appendix.
   - Normalised provenance defaults in `configs/base.yaml` for the new device buckets.
2. **Metadata Refresh**
   - Ran `python3 scripts/refresh_metadata.py --config configs/base.yaml` to register hashes and sources for all clips (including the new iPhone/AirPods captures).
   - Repeated after each data ingest to keep `data/metadata.csv` consistent.
3. **Model Retraining**
   - Executed `python train.py` to rebuild `models/model.pkl` and `models/label_encoder.pkl` with the expanded dataset (990 clips total).
4. **Inference UX Improvements**
   - Allowed directory inputs in `predict.py` so `python predict.py data/iphone` expands automatically; see the expansion sketch in the appendix.
   - Updated the “laptop” friendly name to “AirPods Pro / MacBook built-in microphone” to reflect the mixed capture source.

## Verification

- Post-retrain prediction:

```
File: data/iphone/clip_02.wav
RMS loudness: -40.8 dBFS
1. Local iPhone recordings — 96.1%
2. AirPods Pro / MacBook built-in microphone — 3.9%
3. Samsung Galaxy S7 (TAU device B) — 0.0%
```

- The confidence inversion (≈96% iPhone) confirms the classifier now separates the classes, even for low-level speech content.

## Feature Changes for Improved Results

- `configs/base.yaml`: added the TAU device folders to `include_devices` and defined CC-BY provenance defaults.
- `data/metadata.csv`: regenerated with 990 entries to incorporate the new recordings (62 iPhone, 29 AirPods/Mac).
- `devices.py`: renamed the “laptop” label to “AirPods Pro / MacBook built-in microphone” for accurate reporting.
- `predict.py`: added directory expansion and broader audio-extension support to streamline batch evaluation.
- Dataset restructuring: migrated TAU archive clips into the `data/audio`, `data/audio2`, and `data/audio9` directories, preserving the `-a/-b/-c` microphone mapping.

## Follow-Up Recommendations

- Continue collecting parallel iPhone vs. AirPods recordings, especially in quiet environments, until class counts approach parity with the TAU devices.
- Maintain a held-out validation set (not yet captured) to quantify gains objectively beyond spot checks; a stratified-split sketch closes the appendix below.
- Document future ingestion runs by appending to this case study or a dedicated experiment log under `docs/`.
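
## Appendix: Implementation Sketches

The sketches below illustrate the steps referenced above; any name, path, or column not quoted elsewhere in this document is an assumption, not code from the repo.

**Class-distribution tally.** The `--dry-run` imbalance quoted under Investigation is summarised rather than reproduced; a minimal sketch of the same check, reading `data/metadata.csv` directly. The `device` column name is an assumed metadata layout.

```python
# Sketch: tally clips per device class from data/metadata.csv.
# Assumes a "device" column; the real column names may differ.
import csv
from collections import Counter

def class_counts(metadata_path="data/metadata.csv"):
    counts = Counter()
    with open(metadata_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["device"]] += 1
    return counts

if __name__ == "__main__":
    # Print classes largest-first to make the imbalance obvious.
    for device, n in sorted(class_counts().items(), key=lambda kv: -kv[1]):
        print(f"{device:45s} {n:4d}")
```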
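
**TAU archive migration.** A sketch of the `-a/-b/-c` suffix split described under Actions Taken. The source directory name (`tau_archive`), the `.wav` extension, and the suffix-to-directory mapping shown are illustrative assumptions; confirm the actual mapping before reuse.

```python
# Sketch: route TAU Mobile clips into per-microphone folders by the
# -a/-b/-c filename suffix. The mapping below is illustrative only.
import shutil
from pathlib import Path

SUFFIX_TO_DIR = {"-a": "data/audio", "-b": "data/audio2", "-c": "data/audio9"}

def migrate(src_dir="tau_archive"):
    for clip in Path(src_dir).glob("*.wav"):
        for suffix, dest in SUFFIX_TO_DIR.items():
            if clip.stem.endswith(suffix):
                Path(dest).mkdir(parents=True, exist_ok=True)
                shutil.move(str(clip), Path(dest) / clip.name)
                break

if __name__ == "__main__":
    migrate()
```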
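
**Directory expansion in `predict.py`.** A sketch of how directory arguments could expand to audio files, matching the behaviour described under Inference UX Improvements. The extension set and the `expand_inputs` name are illustrative, not the repo's actual code.

```python
# Sketch: expand directory arguments into a flat list of audio files so
# `python predict.py data/iphone` behaves like passing every clip inside.
import sys
from pathlib import Path

AUDIO_EXTS = {".wav", ".m4a", ".mp3", ".flac"}

def expand_inputs(args):
    files = []
    for arg in args:
        path = Path(arg)
        if path.is_dir():
            # Recurse so nested capture folders are picked up too.
            files.extend(sorted(p for p in path.rglob("*")
                                if p.suffix.lower() in AUDIO_EXTS))
        else:
            files.append(path)
    return files

if __name__ == "__main__":
    for f in expand_inputs(sys.argv[1:]):
        print(f)
```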
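
**Stratified hold-out split.** A sketch for the recommended validation set, assuming scikit-learn is available (plausible given the pickled `models/model.pkl` and `models/label_encoder.pkl`, but unconfirmed) and the same assumed `device` column as above. Stratifying keeps the small iPhone and AirPods/Mac classes represented at the same ratio in both splits.

```python
# Sketch: carve a stratified hold-out set from the metadata so every
# device class keeps its proportion. Assumes scikit-learn and a
# "device" column in data/metadata.csv.
import csv
from sklearn.model_selection import train_test_split

with open("data/metadata.csv", newline="") as f:
    rows = list(csv.DictReader(f))

labels = [row["device"] for row in rows]
train_rows, val_rows = train_test_split(
    rows, test_size=0.2, stratify=labels, random_state=42)

print(f"train: {len(train_rows)}  held-out: {len(val_rows)}")
```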