Clip 02 Misclassification Case Study
Issue Summary
- Symptom: `python predict.py data/iphone/*.wav` classified `data/iphone/clip_02.wav` as “MacBook built-in microphone” (≈53%) instead of “Local iPhone recordings” (≈47%).
- Impact: Undermined trust in the classifier for quiet iPhone speech, indicating poor separation between the iPhone and AirPods/Mac classes.
Investigation
- Confirmed the mismatch reproduced after the first training run with the new TAU batches.
- Compared class distributions via `train.py --dry-run`; this highlighted severe imbalance: TAU devices (≈295 clips each) vs. iPhone (15 wav + 47 m4a) vs. AirPods/Mac (15 wav + 14 m4a).
- Noted identical feature extraction between training and inference (`features.extract_features`), driving suspicion toward data coverage rather than pipeline drift.
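The imbalance surfaced by `--dry-run` can also be reproduced with a short standalone check. The sketch below is hypothetical: it assumes each row of `data/metadata.csv` carries a `device` column, which may not match the project's actual schema.

```python
import csv
from collections import Counter

def class_distribution(metadata_path):
    """Count clips per device label in a metadata CSV.

    Assumes one row per clip with a 'device' column; adjust the
    column name to match the real schema.
    """
    with open(metadata_path, newline="") as fh:
        counts = Counter(row["device"] for row in csv.DictReader(fh))
    total = sum(counts.values())
    for device, n in counts.most_common():
        # Print absolute count and share of the dataset per class.
        print(f"{device:40s} {n:4d} ({n / total:5.1%})")
    return counts
```

Running this after every ingest makes the TAU-vs-iPhone skew visible at a glance, before any retraining happens.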
Actions Taken
- Data Organisation
  - Split the TAU Mobile archive into `data/audio`, `data/audio2`, and `data/audio9` based on filename suffixes (`-a`/`-b`/`-c`).
  - Normalised provenance defaults in `configs/base.yaml` for the new device buckets.
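The suffix-based split can be expressed as a small helper. This is a sketch only: it assumes TAU filenames end in `-a`, `-b`, or `-c` immediately before the extension, and the directory mapping is taken from the split described above; the real ingest step may have worked differently.

```python
import shutil
from pathlib import Path

# Suffix-to-directory mapping from the data reorganisation step.
SUFFIX_DIRS = {"-a": "data/audio", "-b": "data/audio2", "-c": "data/audio9"}

def split_by_suffix(archive_dir, dry_run=False):
    """Move TAU clips into per-microphone directories based on the
    -a/-b/-c suffix preceding the file extension."""
    moved = []
    for clip in Path(archive_dir).glob("*.wav"):
        suffix = clip.stem[-2:]          # e.g. "airport-09-a" -> "-a"
        target = SUFFIX_DIRS.get(suffix)
        if target is None:
            continue                     # skip files without a known suffix
        dest = Path(target) / clip.name
        if not dry_run:
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(clip), dest)
        moved.append((clip.name, target))
    return moved
```

A `dry_run=True` pass prints nothing destructive and lets the mapping be reviewed before any files move.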
- Metadata Refresh
  - Ran `python3 scripts/refresh_metadata.py --config configs/base.yaml` to register hashes and sources for all clips (including the new iPhone/AirPods captures).
  - Repeated after each data ingest to keep `data/metadata.csv` consistent.
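The hash-registration part of a refresh like this can be sketched with the standard library alone. The function names and the SHA-256 choice here are assumptions, not what `scripts/refresh_metadata.py` necessarily does.

```python
import hashlib
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    """Hash a clip in 1 MiB chunks so large recordings never load whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def register_clips(root, extensions=(".wav", ".m4a")):
    """Yield (relative path, sha256) rows for every audio clip under root,
    in deterministic sorted order so reruns produce identical CSVs."""
    root = Path(root)
    for clip in sorted(root.rglob("*")):
        if clip.suffix.lower() in extensions:
            yield str(clip.relative_to(root)), file_sha256(clip)
```

Deterministic ordering matters here: it keeps `data/metadata.csv` diffs small when the refresh is repeated after each ingest.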
- Model Retraining
  - Executed `python train.py` to rebuild `models/model.pkl` and `models/label_encoder.pkl` with the expanded dataset (990 clips total).
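`train.py`'s internals are not reproduced here, but a minimal retraining loop of the same shape (fit a classifier plus a label encoder, persist both as the `model.pkl` / `label_encoder.pkl` pair) might look like the following. The `RandomForestClassifier` is a placeholder assumption; the project's actual model may differ.

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

def retrain(features, labels, model_path="models/model.pkl",
            encoder_path="models/label_encoder.pkl"):
    """Fit a classifier and label encoder, then persist both, mirroring
    the model.pkl / label_encoder.pkl pair stored under models/."""
    encoder = LabelEncoder()
    y = encoder.fit_transform(labels)          # string labels -> int ids
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(np.asarray(features), y)
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)
    with open(encoder_path, "wb") as fh:
        pickle.dump(encoder, fh)
    return model, encoder
```

Keeping the encoder pickled next to the model guarantees that inference maps class indices back to the same friendly names used at training time.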
- Inference UX Improvements
  - Allowed directory inputs in `predict.py` so `python predict.py data/iphone` expands automatically.
  - Updated the “laptop” friendly name to “AirPods Pro / MacBook built-in microphone” to reflect the mixed capture source.
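Directory expansion of this kind typically reduces to a few lines. This sketch assumes a hypothetical extension whitelist; `predict.py`'s actual list of supported extensions may differ.

```python
from pathlib import Path

# Hypothetical extension whitelist; predict.py's real list may differ.
AUDIO_EXTENSIONS = {".wav", ".m4a", ".mp3", ".flac"}

def expand_inputs(args):
    """Expand each CLI argument: a directory becomes the sorted list of
    audio files it contains; a plain file passes through unchanged."""
    clips = []
    for arg in args:
        path = Path(arg)
        if path.is_dir():
            clips.extend(sorted(p for p in path.rglob("*")
                                if p.suffix.lower() in AUDIO_EXTENSIONS))
        else:
            clips.append(path)
    return clips
```

Sorting the expansion keeps batch output stable between runs, which makes spot checks like the one below easier to compare.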
Verification
- Post-retrain prediction for `data/iphone/clip_02.wav` (RMS loudness: -40.8 dBFS):
  1. Local iPhone recordings — 96.1%
  2. AirPods Pro / MacBook built-in microphone — 3.9%
  3. Samsung Galaxy S7 (TAU device B) — 0.0%
- The confidence inversion (≈96% iPhone) confirms the classifier now separates the classes even for low-level speech content.
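The RMS loudness figure in that report can be verified independently of the classifier. This sketch uses only the standard library and assumes 16-bit PCM WAV input; how `predict.py` actually computes its dBFS value is not shown in this case study.

```python
import math
import struct
import wave

def rms_dbfs(path):
    """Return RMS loudness in dBFS for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "sketch handles 16-bit PCM only"
        frames = wav.readframes(wav.getnframes())
    # Interpret the raw bytes as signed 16-bit little-endian samples.
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # 0 dBFS corresponds to a full-scale signal (magnitude 32768).
    return 20 * math.log10(rms / 32768) if rms else float("-inf")
```

A reading around -40 dBFS, as for `clip_02.wav`, is quiet speech, which is exactly the regime where the original misclassification occurred.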
Feature Changes for Improved Results
configs/base.yaml: added TAU device folders toinclude_devicesand defined CC-BY provenance defaults.data/metadata.csv: regenerated with 990 entries to incorporate the new recordings (62 iPhone, 29 AirPods/Mac).devices.py: renamed the “laptop” label to “AirPods Pro / MacBook built-in microphone” for accurate reporting.predict.py: added directory expansion and broader audio-extension support to streamline batch evaluation.- Dataset restructuring: migrated TAU archive clips into
data/audio,data/audio2,data/audio9directories, preserving the-a/-b/-cmicrophone mapping.
Follow-Up Recommendations
- Continue collecting parallel iPhone vs. AirPods recordings, especially in quiet environments, until class counts approach parity with TAU devices.
- Maintain a held-out validation set (not yet captured) to quantify gains objectively beyond spot checks.
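One way to carve out the recommended held-out set while respecting the current imbalance is a per-class (stratified) split, so rare classes such as iPhone keep representation on both sides. This is a sketch; the function and split fraction are assumptions, not an existing project utility.

```python
import random
from collections import defaultdict

def stratified_holdout(paths, labels, holdout_frac=0.2, seed=0):
    """Split clips into train/holdout per class, so rare classes
    (e.g. iPhone) keep at least one clip on each side."""
    by_label = defaultdict(list)
    for path, label in zip(paths, labels):
        by_label[label].append(path)
    rng = random.Random(seed)            # fixed seed -> reproducible split
    train, holdout = [], []
    for label, clips in by_label.items():
        rng.shuffle(clips)
        k = max(1, int(len(clips) * holdout_frac))
        holdout.extend((c, label) for c in clips[:k])
        train.extend((c, label) for c in clips[k:])
    return train, holdout
```

Freezing such a holdout once, and never retraining on it, would turn spot checks like the clip_02 verification into a repeatable metric.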
- Document future ingestion runs by appending to this case study or a dedicated experiment log under `docs/`.