connork committed
Commit b6c1b75 · Parent: 8d73c70

Align Space with latest Mic-ID release
.gitignore CHANGED
@@ -6,3 +6,4 @@ __pycache__/
  .DS_Store
  uploads/
  .tmp/
+ senior-ml-engineer-script.md
README.md CHANGED
@@ -54,7 +54,8 @@ If you are running a live session, keep this script handy:
  python3 -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
- python train.py                               # optional if you want to refresh the model
+ python3 scripts/refresh_metadata.py           # rebuild hashes + provenance records
+ python3 train.py --config configs/base.yaml   # optional if you want to refresh the model
  ```

  Then launch the app with `streamlit run app.py` (defaults to http://localhost:8501).
@@ -81,8 +82,8 @@ Each control includes inline help text so presenters can improvise without notes

  ## Device Recognition
  - 🧱 Audio flows through `features.extract_features`, stitching log-mel and MFCC statistics with zero-crossing, centroid, roll-off, and flatness cues.
- - 🌲 `train.py` fits a `HistGradientBoostingClassifier`, stratified split, and saves artefacts to `models/model.pkl` plus the label encoder.
- - 📈 Every training run exports `reports/metrics.json` and `reports/confusion_matrix.png` so you can cite precision/recall live.
+ - 🌲 `python3 train.py --config configs/base.yaml` reads the provenance metadata, enforces per-device clip minimums, and fits a `HistGradientBoostingClassifier` before saving artefacts to `models/model.pkl` plus the label encoder.
+ - 📈 Every training run exports `reports/metrics.json`, `reports/confusion_matrix.png`, and a timestamped `reports/runs/run-*.json` snapshot so you can cite precision/recall live.
  - 🏷️ The app and CLI surface friendly names (e.g. “Zoom F8 field recorder”) pulled from `devices.describe_label()` to keep the story human-readable.

  ## Scale Detection
@@ -95,31 +96,36 @@ All sample audio lives under data/ and mirrors the device IDs referenced in the

  | Folder | What it represents | Count* |
  | --- | --- | --- |
- | `audio/` | TAU Urban Acoustic Scenes clips (device A) – Zoom F8 field recorder | 3 · demo bundle |
- | `audio2/` | TAU Urban Acoustic Scenes clips (device B) – Samsung Galaxy S7 | 2 · demo bundle |
- | `audio9/` | TAU Urban Acoustic Scenes clips (device C) – iPhone SE | 1 · demo bundle |
- | `iphone/` | Locally recorded iPhone speech snippets captured with `utils.py` | 2 |
- | `laptop/` | MacBook built-in mic samples recorded in a treated room | 2 |
- | `outtakes/` | Extra captures you can promote into training data after curation | 3 · demo bundle |
+ | `audio/` | TAU Urban Acoustic Scenes clips (device A) – Zoom F8 field recorder | 295 |
+ | `audio2/` | TAU Urban Acoustic Scenes clips (device B) – Samsung Galaxy S7 | 295 |
+ | `audio9/` | TAU Urban Acoustic Scenes clips (device C) – iPhone SE | 295 |
+ | `iphone/` | Locally recorded iPhone speech snippets captured with `utils.py` | 4 |
+ | `laptop/` | MacBook built-in mic samples recorded in a treated room | 4 |
+ | `outtakes/` | Extra captures you can promote into training data after curation | varies |

- The Space ships with a travel-sized sample set; pull the full dataset locally if you want to retrain the checkpoint.
+ Counts based on the current repo snapshot; refresh `data/` to rebalance as needed.

  ## Download Contents
  Every run generates artefacts you can drop into a slide deck or share with collaborators:

  - 🎯 `models/model.pkl` and `models/label_encoder.pkl` store the trained classifier and label map.
  - 📊 `reports/metrics.json` plus `reports/confusion_matrix.png` capture evaluation snapshots for the latest training session.
+ - 🧾 `data/metadata.csv` tracks every clip’s provenance, licence, and hash for reproducible retrains.
+ - 🗂️ `reports/runs/run-*.json` snapshots record the exact config, dataset summary, and hashes used for each training run.
  - 📁 Uploaded clips are preserved under `uploads/hooks - <original-name>` so you can replay or re-label them later.

  ## Testing
  Quick smoke checks live in the scripts themselves:

  ```bash
- # Rebuild the model and metrics
- python train.py
+ # Validate provenance without training
+ python3 train.py --dry-run
+
+ # Rebuild the model, metrics, and run snapshot
+ python3 train.py --config configs/base.yaml

  # Score a few clips and verify probabilities look sane
- python predict.py data/laptop/clip_01.wav data/iphone/clip_05.wav --topk 5
+ python3 predict.py data/laptop/clip_01.wav data/iphone/clip_05.wav --topk 5
  ```

  For deeper regression coverage, wire these commands into your CI and compare the resulting metrics JSON against previous baselines.
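A minimal sketch of that baseline comparison, assuming a previous `metrics.json` is kept around as `reports/baseline_metrics.json` (a hypothetical path, not something the repo ships):

```python
# Hypothetical CI guard: fail when macro-average F1 regresses past a tolerance.
import json
import sys

with open("reports/metrics.json") as fh:
    current = json.load(fh)
with open("reports/baseline_metrics.json") as fh:  # assumed baseline snapshot
    baseline = json.load(fh)

current_f1 = current["macro avg"]["f1-score"]
baseline_f1 = baseline["macro avg"]["f1-score"]
if current_f1 < baseline_f1 - 0.02:  # 2-point tolerance, tune to taste
    sys.exit(f"Macro F1 regressed: {current_f1:.3f} < {baseline_f1:.3f}")
print(f"Macro F1 OK: {current_f1:.3f} (baseline {baseline_f1:.3f})")
```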
 
@@ -130,9 +136,11 @@ mic-id/
  ├─ app.py         # Streamlit UI for uploading and scoring clips
  ├─ predict.py     # CLI scorer with friendly device names
  ├─ train.py       # Dataset loader, model trainer, metric exporter
+ ├─ configs/       # YAML training configs + device provenance defaults
  ├─ features.py    # Audio feature extraction helpers
  ├─ utils.py       # Command-line recorder for new device samples
- ├─ data/          # Per-device waveforms (TAU + local recordings)
+ ├─ data/          # Per-device waveforms and provenance metadata
+ │  └─ metadata.csv    # Clip-level provenance (source/licence/hash)
  ├─ models/        # Saved classifier + label encoder
  ├─ reports/       # Metrics JSON and confusion matrix plots
  ├─ docs/          # Data sourcing guide and prep notes
@@ -143,7 +151,7 @@ mic-id/
  ## Roadmap
  - 🛰️ Add a lightweight CNN baseline alongside the gradient boosting model for comparison.
  - 🧪 Ship augmentation scripts (noise, EQ, impulse responses) to spotlight microphone colouration differences.
- - 🔐 Bundle provenance metadata (`data/metadata.csv`) and automated integrity checks for new clips.
+ - 🔐 Wire metadata/hash validation into CI so new clips are rejected unless provenance is complete.
  - 📦 Polish export helpers so the app can bundle probabilities + features in one download.

  ## Contributing
configs/base.yaml ADDED
@@ -0,0 +1,43 @@
+ data:
+   root: data
+   metadata: data/metadata.csv
+   enforce_hashes: true
+   min_clips_per_device: 1
+   include_devices:
+     - audio
+     - audio2
+     - audio9
+     - iphone
+     - laptop
+   splits:
+     - train
+   device_defaults:
+     iphone:
+       source: "In-house recordings"
+       license: "Private use"
+     laptop:
+       source: "In-house recordings"
+       license: "Private use"
+     audio:
+       source: "TAU Urban Acoustic Scenes 2019 Mobile"
+       license: "CC-BY 4.0"
+     audio2:
+       source: "TAU Urban Acoustic Scenes 2019 Mobile"
+       license: "CC-BY 4.0"
+     audio9:
+       source: "TAU Urban Acoustic Scenes 2019 Mobile"
+       license: "CC-BY 4.0"
+
+ training:
+   test_size: 0.25
+   random_state: 42
+   classifier:
+     max_depth: 10
+     max_iter: 400
+     learning_rate: 0.08
+
+ reporting:
+   metrics_path: reports/metrics.json
+   confusion_matrix_path: reports/confusion_matrix.png
+   runs_dir: reports/runs
+   tag: baseline
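For reference, the config round-trips through `yaml.safe_load` exactly as `train.py` reads it; a quick sketch to inspect the parsed structure:

```python
# Load and inspect configs/base.yaml the same way train.py's load_config does.
from pathlib import Path

import yaml

cfg = yaml.safe_load(Path("configs/base.yaml").read_text(encoding="utf-8"))
print(cfg["data"]["include_devices"])  # ['audio', 'audio2', 'audio9', 'iphone', 'laptop']
print(cfg["training"]["classifier"])   # {'max_depth': 10, 'max_iter': 400, 'learning_rate': 0.08}
print(cfg["reporting"]["runs_dir"])    # reports/runs
```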
data/metadata.csv ADDED
@@ -0,0 +1,10 @@
+ path,device,source,license,split,sha256
+ audio/airport-helsinki-204-6138-a.wav,audio,TAU Urban Acoustic Scenes 2019 Mobile,CC-BY 4.0,train,db356757394c3ed66d87990ed98a080b1a1a1778aaaffe110b723c9fbd294814
+ audio/airport-lisbon-175-4700-a.wav,audio,TAU Urban Acoustic Scenes 2019 Mobile,CC-BY 4.0,train,ebcea04001ff88fd63af3e76d153ae2d72f3ea65889f3a215c894ca14ce173be
+ audio2/bus-stockholm-35-1041-a.wav,audio2,TAU Urban Acoustic Scenes 2019 Mobile,CC-BY 4.0,train,b76dfe5d2d25b4b7912af63110d11f32623cb8308872cd601cc1fbe6daac8ef8
+ audio2/bus-stockholm-35-1041-b.wav,audio2,TAU Urban Acoustic Scenes 2019 Mobile,CC-BY 4.0,train,9eb1bf29c4055d9b7863f4395b59a177873c1fb00f6e16f026597643f1339742
+ audio9/street_pedestrian-london-149-4500-c.wav,audio9,TAU Urban Acoustic Scenes 2019 Mobile,CC-BY 4.0,train,42ce95e42426e18ae1f25174147bfa799644a3064dcad7072b744239cef134af
+ iphone/clip_01.wav,iphone,In-house recordings,Private use,train,fe9b1dc52cd1eb21550847ba08b2c2ddc79443c378ce88945b55a4de9c3656bf
+ iphone/clip_05.wav,iphone,In-house recordings,Private use,train,017691167b2b7e93fe52ce7e643ca76767986c478e4efe4c2a66bbbfaee2c99a
+ laptop/clip_01.wav,laptop,In-house recordings,Private use,train,f163fd7dc320b3c7ede45104fadff2f90d795f740c9a59156b8cb71613c9f773
+ laptop/clip_05.wav,laptop,In-house recordings,Private use,train,56d5a2ca715e1dc1f08f02d619c1a1c770ea60378b721638f9f2aeffb4829233
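Each `sha256` value is the plain SHA-256 digest of the file bytes (the same digest `scripts/refresh_metadata.py` computes), so any row can be spot-checked in a few lines:

```python
# Spot-check the first metadata row: recompute the clip's hash and compare.
import csv
import hashlib
from pathlib import Path

with open("data/metadata.csv", newline="", encoding="utf-8") as fh:
    row = next(csv.DictReader(fh))

digest = hashlib.sha256(Path("data", row["path"]).read_bytes()).hexdigest()
print("match" if digest == row["sha256"] else f"MISMATCH: {digest}")
```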
devices.py CHANGED
@@ -3,7 +3,9 @@ MIC_FRIENDLY_NAMES = {
      "audio2": "Samsung Galaxy S7 (TAU device B)",
      "audio9": "iPhone SE (TAU device C)",
      "iphone": "Local iPhone recordings",
-     "laptop": "MacBook built-in microphone",
+     # These clips were captured both with the MacBook mic and AirPods Pro;
+     # keep the class label stable but surface the combined description.
+     "laptop": "AirPods Pro / MacBook built-in microphone",
  }

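The diff only touches the name table; `describe_label()` itself is not shown, so the lookup below is a sketch of what the app and CLI presumably do, and the fallback-to-raw-label behaviour is an assumption:

```python
# Hypothetical reconstruction of devices.describe_label(); the real helper is
# not part of this diff, so the fallback behaviour here is an assumption.
MIC_FRIENDLY_NAMES = {
    "laptop": "AirPods Pro / MacBook built-in microphone",
    # ...remaining entries as in devices.py...
}


def describe_label(label: str) -> str:
    return MIC_FRIENDLY_NAMES.get(label, label)


print(describe_label("laptop"))    # AirPods Pro / MacBook built-in microphone
print(describe_label("outtakes"))  # falls back to the raw folder name
```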
 
docs/clip02-misclassification.md ADDED
@@ -0,0 +1,47 @@
+ # Clip 02 Misclassification Case Study
+
+ ## Issue Summary
+ - **Symptom**: `python predict.py data/iphone/*.wav` classified `data/iphone/clip_02.wav` as “MacBook built-in microphone” (~53 %) instead of “Local iPhone recordings” (≈47 %).
+ - **Impact**: Undermined trust in the classifier for quiet iPhone speech, indicating poor separation between the iPhone and AirPods/Mac classes.
+
+ ## Investigation
+ - Confirmed the mismatch reproduced after the first training run with the new TAU batches.
+ - Compared class distributions via `train.py --dry-run`; highlighted severe imbalance: TAU devices (≈295 clips each) vs. iPhone (15 wav + 47 m4a) vs. AirPods/Mac (15 wav + 14 m4a).
+ - Noted identical feature extraction between training and inference (`features.extract_features`), driving suspicion toward data coverage rather than pipeline drift.
+
+ ## Actions Taken
+ 1. **Data Organisation**
+    - Split the TAU Mobile archive into `data/audio`, `data/audio2`, and `data/audio9` based on filename suffixes (`-a/-b/-c`).
+    - Normalised provenance defaults in `configs/base.yaml` for the new device buckets.
+ 2. **Metadata Refresh**
+    - Ran `python3 scripts/refresh_metadata.py --config configs/base.yaml` to register hashes and sources for all clips (including new iPhone/AirPods captures).
+    - Repeated after each data ingest to keep `data/metadata.csv` consistent.
+ 3. **Model Retraining**
+    - Executed `python train.py` to rebuild `models/model.pkl` and `models/label_encoder.pkl` with the expanded dataset (990 clips total).
+ 4. **Inference UX Improvements**
+    - Allowed directory inputs in `predict.py` so `python predict.py data/iphone` expands automatically.
+    - Updated the “laptop” friendly name to “AirPods Pro / MacBook built-in microphone” to reflect the mixed capture source.
+
+ ## Verification
+ - Post-retrain prediction:
+   ```
+   File: data/iphone/clip_02.wav
+   RMS loudness: -40.8 dBFS
+   1. Local iPhone recordings — 96.1%
+   2. AirPods Pro / MacBook built-in microphone — 3.9%
+   3. Samsung Galaxy S7 (TAU device B) — 0.0%
+   ```
+ - The confidence inversion (≈96 % iPhone) confirms the classifier now separates the classes even for low-level speech content.
+
+ ## Feature Changes for Improved Results
+ - `configs/base.yaml`: added TAU device folders to `include_devices` and defined CC-BY provenance defaults.
+ - `data/metadata.csv`: regenerated with 990 entries to incorporate the new recordings (62 iPhone, 29 AirPods/Mac).
+ - `devices.py`: renamed the “laptop” label to “AirPods Pro / MacBook built-in microphone” for accurate reporting.
+ - `predict.py`: added directory expansion and broader audio-extension support to streamline batch evaluation.
+ - Dataset restructuring: migrated TAU archive clips into `data/audio`, `data/audio2`, `data/audio9` directories, preserving the `-a/-b/-c` microphone mapping.
+
+ ## Follow-Up Recommendations
+ - Continue collecting parallel iPhone vs. AirPods recordings, especially in quiet environments, until class counts approach parity with TAU devices.
+ - Maintain a held-out validation set (not yet captured) to quantify gains objectively beyond spot checks.
+ - Document future ingestion runs by appending to this case study or a dedicated experiment log under `docs/`.
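The imbalance figures above came from the `--dry-run` summary; an equivalent stand-alone check over `data/metadata.csv` might look like this (a sketch, not a script the repo ships):

```python
# Summarise per-device clip counts from the provenance metadata to spot imbalance.
import csv
from collections import Counter

with open("data/metadata.csv", newline="", encoding="utf-8") as fh:
    counts = Counter(row["device"] for row in csv.DictReader(fh))

for device, count in counts.most_common():
    print(f"{device:10s} {count:4d}")
print(f"imbalance ratio: {max(counts.values()) / min(counts.values()):.1f}x")
```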
docs/data-sourcing.md CHANGED
@@ -62,4 +62,10 @@ Mic-ID works best when every class corresponds to a capture device that has enou
  2. Store downloaded archives under `data/raw/` (ignored by git) and export processed clips to `data/<device>/`.
  3. Update `metadata.csv` whenever you add or remove external clips so the experiment log in `reports/` stays reproducible.

+ ## Provenance workflow
+ - Run `python3 scripts/refresh_metadata.py` after adding or trimming clips to recompute SHA256 hashes and populate default source/licence values.
+ - Manually edit `data/metadata.csv` when a clip needs corrected credits or licence text; the training step will refuse to run if either field is missing.
+ - Validate the metadata without training by running `python3 train.py --dry-run`; this catches missing files, hash mismatches, and low clip counts early.
+ - Commit both the metadata file and the resulting `reports/runs/run-*.json` snapshot so collaborators can audit exactly which audio went into each checkpoint.
+
  For more ideas, browse the DCASE and ASVspoof challenge leaderboards—winning teams usually publish their data prep notes and often release additional impulse responses or parallel recordings.
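The refuse-to-run guard lives in `train.py`; to get the same check as a stand-alone pre-commit hook, a sketch along these lines would work (a hypothetical helper, not shipped in the repo):

```python
# Fail fast when any metadata row lacks source or licence information.
import csv
import sys

missing = []
with open("data/metadata.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        if not row["source"].strip() or not row["license"].strip():
            missing.append(row["path"])

if missing:
    sys.exit(f"{len(missing)} clip(s) missing provenance, e.g. {missing[:5]}")
print("All metadata rows carry source + licence.")
```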
features.py CHANGED
@@ -2,6 +2,7 @@ import numpy as np, librosa


  def load_mono(path, sr=16000):
+     path = str(path)
      x, sr = librosa.load(path, sr=sr, mono=True)
      x, _ = librosa.effects.trim(x, top_db=30)
      rms = np.sqrt(np.mean(x**2)) + 1e-8
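`extract_features` itself is untouched by this commit. Per the README it stitches log-mel and MFCC statistics with zero-crossing, centroid, roll-off, and flatness cues, which might look roughly like the sketch below (the real band counts and aggregation may differ):

```python
# Sketch of the feature recipe the README describes; frame-level tracks are
# stacked and summarised with per-dimension mean and standard deviation.
import librosa
import numpy as np


def extract_features_sketch(x: np.ndarray, sr: int) -> np.ndarray:
    log_mel = librosa.power_to_db(librosa.feature.melspectrogram(y=x, sr=sr, n_mels=64))
    mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=20)
    extras = np.vstack([
        librosa.feature.zero_crossing_rate(x),
        librosa.feature.spectral_centroid(y=x, sr=sr),
        librosa.feature.spectral_rolloff(y=x, sr=sr),
        librosa.feature.spectral_flatness(y=x),
    ])
    stacked = np.vstack([log_mel, mfcc, extras])
    return np.concatenate([stacked.mean(axis=1), stacked.std(axis=1)])
```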
models/label_encoder.pkl CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:b63e81ce06710e7e8cc2dd245a0960912697516459129ce32657a5b0234cbd49
- size 447
+ oid sha256:af7684de27332a4a68cc6d4f75511e9377f8243e925ed8415cfbef651d76ce76
+ size 663
models/model.pkl CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9bd34e60ea1f1851d23b6808d4d0ad6ca2a10968322b527dd01f32b4e8761e0b
- size 2006992
+ oid sha256:526a822cbd88b1a060e845ef99bac697ac116f2afeb6edd0c06a610d5bf23211
+ size 2967440
predict.py CHANGED
@@ -7,6 +7,7 @@ import argparse
  import io
  import os
  from pathlib import Path
+ from typing import Iterable, List

  import joblib
  import librosa
@@ -26,6 +27,7 @@ from devices import describe_label

  MODEL_PATH = Path("models/model.pkl")
  ENCODER_PATH = Path("models/label_encoder.pkl")
+ AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}


  def load_model():
@@ -52,16 +54,43 @@ def normalise_audio(y: np.ndarray) -> np.ndarray:
      return y * (0.05 / rms), rms


+ def discover_inputs(paths: Iterable[Path]) -> List[Path]:
+     """Expand directories into audio files, preserving explicit file ordering."""
+     collected: list[Path] = []
+     for path in paths:
+         if path.is_dir():
+             matches = sorted(
+                 p for p in path.rglob("*")
+                 if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
+             )
+             if not matches:
+                 print(f"[!] No audio files found under directory: {path}")
+                 continue
+             collected.extend(matches)
+         else:
+             collected.append(path)
+     return collected
+
+
  def main() -> None:
      parser = argparse.ArgumentParser(description="Score WAV/MP3/M4A clips with the Mic-ID classifier.")
-     parser.add_argument("paths", nargs="+", type=Path, help="Audio files to score")
+     parser.add_argument(
+         "paths",
+         nargs="+",
+         type=Path,
+         help="Audio files or directories containing audio to score",
+     )
      parser.add_argument("--topk", type=int, default=3, help="How many ranked predictions to show per file")
      args = parser.parse_args()

      clf, le = load_model()
      topk = max(1, min(args.topk, len(le.classes_)))

-     for path in args.paths:
+     inputs = discover_inputs(args.paths)
+     if not inputs:
+         raise SystemExit("No valid audio inputs found. Provide files or directories with supported formats.")
+
+     for path in inputs:
          if not path.exists():
              print(f"[!] Skipping missing file: {path}")
              continue
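With the expansion in place, files and directories can be mixed on the command line, and the helper can also be exercised directly:

```python
# Directories expand recursively to supported audio files; files pass through.
from pathlib import Path

from predict import discover_inputs

inputs = discover_inputs([Path("data/iphone"), Path("data/laptop/clip_01.wav")])
print(f"{len(inputs)} clip(s) queued for scoring")
```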
reports/confusion_matrix.png CHANGED

Git LFS pointer updated:
- SHA256: 0b0e09da5010560decc3487eb55484df9a7d8956ec127d0ebd0665dcf848d358 · 44.4 kB
+ SHA256: 10970dab554b283179ddcdc09ba7519fa398fe9b92ad93a2827eb6b4c37d6451 · 52.4 kB
reports/metrics.json CHANGED
@@ -1,45 +1,51 @@
  {
    "audio": {
-     "precision": 0.9726027397260274,
-     "recall": 0.9594594594594594,
-     "f1-score": 0.9659863945578231,
+     "precision": 0.9473684210526315,
+     "recall": 0.972972972972973,
+     "f1-score": 0.96,
      "support": 74.0
    },
    "audio2": {
-     "precision": 0.9864864864864865,
+     "precision": 0.9733333333333334,
      "recall": 0.9864864864864865,
-     "f1-score": 0.9864864864864865,
+     "f1-score": 0.9798657718120806,
      "support": 74.0
    },
    "audio9": {
-     "precision": 0.9605263157894737,
-     "recall": 0.9864864864864865,
-     "f1-score": 0.9733333333333334,
+     "precision": 0.9594594594594594,
+     "recall": 0.9594594594594594,
+     "f1-score": 0.9594594594594594,
      "support": 74.0
    },
    "iphone": {
-     "precision": 1.0,
-     "recall": 0.75,
-     "f1-score": 0.8571428571428571,
-     "support": 4.0
+     "precision": 0.9375,
+     "recall": 0.9375,
+     "f1-score": 0.9375,
+     "support": 16.0
    },
    "laptop": {
-     "precision": 0.6666666666666666,
-     "recall": 0.6666666666666666,
-     "f1-score": 0.6666666666666666,
+     "precision": 1.0,
+     "recall": 1.0,
+     "f1-score": 1.0,
+     "support": 7.0
+   },
+   "outtakes : new": {
+     "precision": 0.0,
+     "recall": 0.0,
+     "f1-score": 0.0,
      "support": 3.0
    },
-   "accuracy": 0.9694323144104804,
+   "accuracy": 0.9596774193548387,
    "macro avg": {
-     "precision": 0.9172564417337309,
-     "recall": 0.8698198198198199,
-     "f1-score": 0.8899231476374334,
-     "support": 229.0
+     "precision": 0.802943535640904,
+     "recall": 0.8094031531531533,
+     "f1-score": 0.8061375385452566,
+     "support": 248.0
    },
    "weighted avg": {
-     "precision": 0.969657424053044,
-     "recall": 0.9694323144104804,
-     "f1-score": 0.969162582063393,
-     "support": 229.0
+     "precision": 0.9481126202603283,
+     "recall": 0.9596774193548387,
+     "f1-score": 0.9538309157826369,
+     "support": 248.0
    }
  }
requirements.txt CHANGED
@@ -7,3 +7,4 @@ numpy
  pandas
  matplotlib
  joblib
+ pyyaml
scripts/refresh_metadata.py ADDED
@@ -0,0 +1,161 @@
+ #!/usr/bin/env python3
+ """Generate or refresh data/metadata.csv entries with provenance details."""
+
+ from __future__ import annotations
+
+ import argparse
+ import csv
+ import hashlib
+ import sys
+ from dataclasses import dataclass
+ from pathlib import Path
+ from typing import Dict, Iterable, Optional
+
+ import yaml
+
+
+ DEFAULT_EXTENSIONS = {".wav", ".mp3", ".m4a"}
+
+
+ @dataclass
+ class MetadataRow:
+     path: Path
+     device: str
+     source: str
+     license: str
+     split: str
+     sha256: str
+
+     def as_dict(self, root: Path) -> Dict[str, str]:
+         rel_path = self.path.relative_to(root).as_posix()
+         return {
+             "path": rel_path,
+             "device": self.device,
+             "source": self.source,
+             "license": self.license,
+             "split": self.split,
+             "sha256": self.sha256,
+         }
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(description=__doc__)
+     parser.add_argument("--config", default="configs/base.yaml", help="YAML config that defines data root and defaults.")
+     parser.add_argument("--output", help="Override output metadata CSV path. Defaults to the config value.")
+     parser.add_argument("--extensions", nargs="*", help="File extensions to include (e.g., .wav .mp3 .m4a). Defaults to built-ins.")
+     return parser.parse_args()
+
+
+ def load_config(path: Path) -> dict:
+     if not path.exists():
+         raise SystemExit(f"Config not found: {path}")
+     with path.open("r", encoding="utf-8") as fh:
+         cfg = yaml.safe_load(fh) or {}
+     if "data" not in cfg:
+         raise SystemExit("Config is missing a `data` section.")
+     return cfg
+
+
+ def read_existing_metadata(path: Path) -> Dict[str, dict]:
+     if not path.exists():
+         return {}
+     with path.open("r", encoding="utf-8", newline="") as fh:
+         reader = csv.DictReader(fh)
+         return {row["path"]: row for row in reader if "path" in row}
+
+
+ def compute_sha256(path: Path) -> str:
+     hasher = hashlib.sha256()
+     with path.open("rb") as fh:
+         for chunk in iter(lambda: fh.read(8192), b""):
+             hasher.update(chunk)
+     return hasher.hexdigest()
+
+
+ def gather_files(root: Path, extensions: Iterable[str]) -> Iterable[Path]:
+     for file_path in root.rglob("*"):
+         if not file_path.is_file():
+             continue
+         if file_path.suffix.lower() in extensions:
+             yield file_path
+
+
+ def build_rows(
+     files: Iterable[Path],
+     existing_rows: Dict[str, dict],
+     root: Path,
+     device_defaults: Optional[dict],
+     include_devices: Optional[set[str]],
+ ) -> Iterable[MetadataRow]:
+     for path in files:
+         rel_key = path.relative_to(root).as_posix()
+         parts = path.relative_to(root).parts
+         if not parts:
+             continue
+         device = parts[0]
+         if include_devices and device not in include_devices:
+             continue
+
+         defaults = (device_defaults or {}).get(device, {})
+         existing = existing_rows.get(rel_key, {})
+
+         source = existing.get("source") or defaults.get("source")
+         license_ = existing.get("license") or defaults.get("license")
+         split = existing.get("split") or "train"
+
+         if not source or not license_:
+             sys.stderr.write(f"[warn] Missing source/license for {rel_key}; fill these in manually.\n")
+
+         sha256 = compute_sha256(path)
+
+         yield MetadataRow(
+             path=path,
+             device=device,
+             source=source or "",
+             license=license_ or "",
+             split=split,
+             sha256=sha256,
+         )
+
+
+ def main() -> None:
+     args = parse_args()
+     config_path = Path(args.config)
+     cfg = load_config(config_path)
+
+     data_cfg = cfg["data"]
+     root = Path(data_cfg.get("root", "data")).resolve()
+     metadata_path = Path(args.output or data_cfg.get("metadata", root / "metadata.csv")).resolve()
+     extensions = {ext.lower() for ext in (args.extensions or data_cfg.get("extensions", DEFAULT_EXTENSIONS))}
+
+     if not root.exists():
+         raise SystemExit(f"Data root does not exist: {root}")
+
+     existing_rows = read_existing_metadata(metadata_path)
+     device_defaults = data_cfg.get("device_defaults", {})
+     include_devices = set(data_cfg.get("include_devices", []) or [])
+
+     files = sorted(gather_files(root, extensions))
+     rows = sorted(
+         build_rows(files, existing_rows, root, device_defaults, include_devices if include_devices else None),
+         key=lambda row: row.path.relative_to(root).as_posix(),
+     )
+
+     metadata_path.parent.mkdir(parents=True, exist_ok=True)
+     with metadata_path.open("w", encoding="utf-8", newline="") as fh:
+         writer = csv.DictWriter(fh, fieldnames=["path", "device", "source", "license", "split", "sha256"])
+         writer.writeheader()
+         for row in rows:
+             writer.writerow(row.as_dict(root))
+
+     orphaned = sorted(set(existing_rows) - {row.path.relative_to(root).as_posix() for row in rows})
+     if orphaned:
+         sys.stderr.write(f"[warn] Orphaned metadata entries (files missing): {len(orphaned)}\n")
+         for item in orphaned:
+             sys.stderr.write(f"  - {item}\n")
+
+     print(f"Wrote {len(rows)} rows to {metadata_path}")
+
+
+ if __name__ == "__main__":
+     main()
train.py CHANGED
@@ -1,7 +1,15 @@
- import os
- import glob
+ from __future__ import annotations
+
+ import argparse
+ import csv
+ import datetime as dt
+ import hashlib
  import json
+ import os
+ from collections import Counter
+ from dataclasses import dataclass
  from pathlib import Path
+ from typing import Iterable, Sequence

  BASE_DIR = Path(__file__).resolve().parent
  CACHE_ROOT = BASE_DIR / ".cache"
@@ -12,93 +20,349 @@ for path in (NUMBA_CACHE_DIR, MPL_CACHE_DIR):
  os.environ.setdefault("NUMBA_CACHE_DIR", str(NUMBA_CACHE_DIR))
  os.environ.setdefault("MPLCONFIGDIR", str(MPL_CACHE_DIR))

- import numpy as np
+ import joblib
  import matplotlib
+
  matplotlib.use("Agg", force=True)
  import matplotlib.pyplot as plt
+ import numpy as np
+ import yaml
  from sklearn.ensemble import HistGradientBoostingClassifier
- from sklearn.preprocessing import LabelEncoder
  from sklearn.metrics import classification_report, confusion_matrix
  from sklearn.model_selection import train_test_split
- import joblib
+ from sklearn.preprocessing import LabelEncoder

- from features import load_mono, extract_features
+ from features import extract_features, load_mono

- DATA_DIR, MODEL_DIR, REPORT_DIR = "data", "models", "reports"
- os.makedirs(MODEL_DIR, exist_ok=True); os.makedirs(REPORT_DIR, exist_ok=True)
-
- IGNORED_DEVICES = {"outtakes"}
- SUFFIX_TO_DEVICE = {
-     "a": "audio",
-     "b": "audio2",
-     "c": "audio9",
- }
- TAU_DEVICE_DIRS = set(SUFFIX_TO_DEVICE.values())
-
-
- def resolve_device_label(device_dir: str, wav_path: str) -> str:
-     """Infer the correct device label for a wav file.
-
-     TAU scenes live under per-device directories but each folder still contains
-     the parallel `-a/-b/-c` recordings. Instead of trusting the directory name
-     (which mislabels the clips), derive the device from the filename suffix and
-     fall back to the directory label for any locally recorded additions that
-     do not follow that convention.
-     """
-
-     if device_dir in TAU_DEVICE_DIRS:
-         stem = Path(wav_path).stem
-         if "-" in stem:
-             _, suffix = stem.rsplit("-", 1)
-             if suffix in SUFFIX_TO_DEVICE:
-                 return SUFFIX_TO_DEVICE[suffix]
-     return device_dir
-
-
- def load_dataset():
-     X, y = [], []
+ TARGET_SR = 16000
+ REQUIRED_COLUMNS = {"path", "device", "source", "license", "split", "sha256"}
+ MODEL_DIR = BASE_DIR / "models"
+ REPORT_DIR = BASE_DIR / "reports"
+
+ MODEL_DIR.mkdir(exist_ok=True)
+ REPORT_DIR.mkdir(exist_ok=True)
+
+
+ @dataclass(frozen=True)
+ class ClipRecord:
+     path: Path
+     device: str
+     source: str
+     license: str
+     split: str
+     sha256: str
+
+     def relative_path(self, root: Path) -> str:
+         return self.path.relative_to(root).as_posix()
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(description="Train the Mic-ID classifier with provenance tracking.")
+     parser.add_argument("--config", default="configs/base.yaml", help="YAML config describing data + training parameters.")
+     parser.add_argument("--dry-run", action="store_true", help="Validate metadata and show dataset summary without training.")
+     return parser.parse_args()
+
+
+ def load_config(path: Path) -> dict:
+     if not path.exists():
+         raise SystemExit(f"Config not found: {path}")
+     with path.open("r", encoding="utf-8") as fh:
+         cfg = yaml.safe_load(fh) or {}
+     if "data" not in cfg or "training" not in cfg or "reporting" not in cfg:
+         raise SystemExit("Config must include `data`, `training`, and `reporting` sections.")
+     return cfg
+
+
+ def compute_sha256(path: Path) -> str:
+     hasher = hashlib.sha256()
+     with path.open("rb") as fh:
+         for chunk in iter(lambda: fh.read(8192), b""):
+             hasher.update(chunk)
+     return hasher.hexdigest()
+
+
+ def read_metadata_csv(path: Path) -> list[dict]:
+     with path.open("r", encoding="utf-8", newline="") as fh:
+         reader = csv.DictReader(fh)
+         headers = set(reader.fieldnames or [])
+         missing = REQUIRED_COLUMNS - headers
+         if missing:
+             raise SystemExit(f"Metadata file {path} is missing required columns: {sorted(missing)}")
+         return list(reader)
+
+
+ def load_clip_records(data_cfg: dict) -> tuple[list[ClipRecord], Path, Path]:
+     root = Path(data_cfg.get("root", "data")).resolve()
+     metadata_path = Path(data_cfg.get("metadata", root / "metadata.csv")).resolve()
+     enforce_hashes = bool(data_cfg.get("enforce_hashes", True))
+     splits_filter = set(data_cfg.get("splits", []) or [])
+     include_devices = set(data_cfg.get("include_devices", []) or [])
+
+     if not root.exists():
+         raise SystemExit(f"Data root does not exist: {root}")
+     if not metadata_path.exists():
+         raise SystemExit(f"Metadata file not found: {metadata_path}")
+
+     raw_rows = read_metadata_csv(metadata_path)
+     records: list[ClipRecord] = []
      seen: set[tuple[str, str]] = set()
-     for device in sorted(
-         d for d in os.listdir(DATA_DIR)
-         if os.path.isdir(os.path.join(DATA_DIR, d))
-         and not d.startswith(".")
-         and d not in IGNORED_DEVICES
-     ):
-         for wav in glob.glob(os.path.join(DATA_DIR, device, "*.wav")):
-             label = resolve_device_label(device, wav)
-             key = (os.path.basename(wav), label)
-             if key in seen:
-                 continue
-             seen.add(key)
-             x, sr = load_mono(wav); feats = extract_features(x, sr)
-             X.append(feats); y.append(label)
-     return np.array(X), np.array(y)
-
-
- if __name__ == "__main__":
-     X, y = load_dataset()
-     le = LabelEncoder(); y_enc = le.fit_transform(y)
-     Xtr, Xte, ytr, yte = train_test_split(X, y_enc, test_size=0.25, stratify=y_enc, random_state=42)
-
-     clf = HistGradientBoostingClassifier(max_depth=10, max_iter=400, learning_rate=0.08, random_state=42)
-     clf.fit(Xtr, ytr); yhat = clf.predict(Xte)
-
-     report = classification_report(yte, yhat, target_names=le.classes_, output_dict=True)
-     with open(os.path.join(REPORT_DIR, "metrics.json"), "w") as f: json.dump(report, f, indent=2)
-
-     cm = confusion_matrix(yte, yhat, normalize="true")
-     fig, ax = plt.subplots(figsize=(5,4)); im = ax.imshow(cm, cmap="Blues")
-     ax.set_xticks(range(len(le.classes_))); ax.set_xticklabels(le.classes_, rotation=45, ha="right")
-     ax.set_yticks(range(len(le.classes_))); ax.set_yticklabels(le.classes_)
-     for i in range(len(le.classes_)):
-         for j in range(len(le.classes_)):
-             ax.text(j, i, f"{cm[i,j]:.2f}", ha="center", va="center", fontsize=8)
-     ax.set_title("Confusion (normalized)"); fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04); fig.tight_layout()
-     fig.savefig(os.path.join(REPORT_DIR, "confusion_matrix.png"), dpi=160)
+
+     for idx, row in enumerate(raw_rows, start=2):
+         rel_path = row["path"].strip()
+         device = row["device"].strip()
+         source = row["source"].strip()
+         license_ = row["license"].strip()
+         split = row["split"].strip() or "train"
+         sha256 = row["sha256"].strip()
+
+         if include_devices and device not in include_devices:
+             continue
+         if splits_filter and split not in splits_filter:
+             continue
+
+         if not rel_path:
+             raise SystemExit(f"Row {idx} is missing a path.")
+         if not device:
+             raise SystemExit(f"Row {idx} is missing a device label (path={rel_path}).")
+         if not source or not license_:
+             raise SystemExit(f"Row {idx} missing source/license information (device={device}, path={rel_path}).")
+
+         full_path = root / rel_path
+         if not full_path.exists():
+             raise SystemExit(f"Audio file referenced in metadata not found: {full_path}")
+
+         if not sha256:
+             current_hash = compute_sha256(full_path)
+         else:
+             current_hash = compute_sha256(full_path) if enforce_hashes else sha256
+             if enforce_hashes and current_hash != sha256:
+                 raise SystemExit(
+                     f"Hash mismatch for {rel_path}: metadata={sha256} current={current_hash}. "
+                     "Regenerate metadata via scripts/refresh_metadata.py."
+                 )
+
+         key = (rel_path, device)
+         if key in seen:
+             raise SystemExit(f"Duplicate clip/device combination detected in metadata: {rel_path} ({device})")
+         seen.add(key)
+
+         records.append(
+             ClipRecord(
+                 path=full_path,
+                 device=device,
+                 source=source,
+                 license=license_,
+                 split=split,
+                 sha256=current_hash,
+             )
+         )
+
+     if include_devices:
+         for dev in include_devices:
+             if dev not in {record.device for record in records}:
+                 raise SystemExit(f"No clips found for requested device: {dev}")
+
+     if not records:
+         raise SystemExit("No audio clips passed the metadata filters; nothing to train on.")
+
+     return records, root, metadata_path
+
+
+ def ensure_minimum_counts(records: Sequence[ClipRecord], minimum: int) -> Counter:
+     counts = Counter(record.device for record in records)
+     violations = {device: count for device, count in counts.items() if count < minimum}
+     if violations:
+         formatted = ", ".join(f"{dev} ({count})" for dev, count in violations.items())
+         raise SystemExit(f"Not enough clips per device. Increase data or lower the threshold. Offenders: {formatted}")
+     return counts
+
+
+ def summarise_records(records: Sequence[ClipRecord], root: Path) -> dict:
+     counts = Counter(record.device for record in records)
+     sources = {record.device: record.source for record in records}
+     licenses = {record.device: record.license for record in records}
+     return {
+         "total_clips": len(records),
+         "devices": dict(counts),
+         "sources": sources,
+         "licenses": licenses,
+         "first_five_hashes": [
+             {"path": record.relative_path(root), "sha256": record.sha256}
+             for record in records[: min(5, len(records))]
+         ],
+     }
+
+
+ def collect_hashes(records: Sequence[ClipRecord], root: Path) -> list[dict]:
+     return [
+         {"path": record.relative_path(root), "sha256": record.sha256}
+         for record in records
+     ]
+
+
+ def build_dataset(records: Sequence[ClipRecord]) -> tuple[np.ndarray, np.ndarray]:
+     features, labels = [], []
+     for record in records:
+         audio, sr = load_mono(record.path, sr=TARGET_SR)
+         feats = extract_features(audio, sr)
+         features.append(feats)
+         labels.append(record.device)
+     return np.array(features), np.array(labels)
+
+
+ def instantiate_classifier(cfg: dict) -> HistGradientBoostingClassifier:
+     clf_cfg = dict(cfg.get("classifier", {}))
+     random_state = cfg.get("random_state")
+     if random_state is not None:
+         clf_cfg.setdefault("random_state", random_state)
+     if not clf_cfg:
+         clf_cfg = {"max_depth": 10, "max_iter": 400, "learning_rate": 0.08}
+         if random_state is not None:
+             clf_cfg["random_state"] = random_state
+     return HistGradientBoostingClassifier(**clf_cfg)
+
+
+ def plot_confusion_matrix(cm: np.ndarray, labels: Sequence[str], output_path: Path) -> None:
+     fig, ax = plt.subplots(figsize=(5, 4))
+     im = ax.imshow(cm, cmap="Blues")
+     ax.set_xticks(range(len(labels)))
+     ax.set_xticklabels(labels, rotation=45, ha="right")
+     ax.set_yticks(range(len(labels)))
+     ax.set_yticklabels(labels)
+     for i in range(len(labels)):
+         for j in range(len(labels)):
+             ax.text(j, i, f"{cm[i, j]:.2f}", ha="center", va="center", fontsize=8)
+     ax.set_title("Confusion (normalized)")
+     fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
+     fig.tight_layout()
+     fig.savefig(output_path, dpi=160)
+     plt.close(fig)
+
+
+ def write_run_report(
+     reporting_cfg: dict,
+     config_path: Path,
+     config: dict,
+     records: Sequence[ClipRecord],
+     root: Path,
+     metrics: dict,
+     dataset_summary: dict,
+     hashes: Sequence[dict],
+     model_path: Path,
+     encoder_path: Path,
+ ) -> Path:
+     runs_dir = Path(reporting_cfg.get("runs_dir", REPORT_DIR / "runs")).resolve()
+     runs_dir.mkdir(parents=True, exist_ok=True)
+     now_utc = dt.datetime.now(dt.timezone.utc).replace(microsecond=0)
+     timestamp = now_utc.strftime("%Y%m%d-%H%M%S")
+     tag = reporting_cfg.get("tag")
+     filename = f"run-{timestamp}"
+     if tag:
+         filename += f"-{tag}"
+     run_path = runs_dir / f"{filename}.json"
+
+     payload = {
+         "timestamp_utc": now_utc.isoformat().replace("+00:00", "Z"),
+         "config_path": str(config_path.resolve()),
+         "config_snapshot": config,
+         "dataset": {
+             **dataset_summary,
+             "metadata_root": str(root),
+             "hashes": list(hashes),
+         },
+         "metrics": metrics,
+         "artefacts": {
+             "model": str(model_path),
+             "label_encoder": str(encoder_path),
+             "metrics_json": str(Path(reporting_cfg.get("metrics_path", REPORT_DIR / "metrics.json")).resolve()),
+             "confusion_matrix": str(Path(reporting_cfg.get("confusion_matrix_path", REPORT_DIR / "confusion_matrix.png")).resolve()),
+         },
+     }
+
+     with run_path.open("w", encoding="utf-8") as fh:
+         json.dump(payload, fh, indent=2)
+     return run_path
+
+
+ def main() -> None:
+     args = parse_args()
+     config_path = Path(args.config)
+     config = load_config(config_path)
+
+     data_cfg = config["data"]
+     training_cfg = config["training"]
+     reporting_cfg = config["reporting"]
+
+     records, data_root, metadata_path = load_clip_records(data_cfg)
+     min_clips = int(data_cfg.get("min_clips_per_device", 1))
+     ensure_minimum_counts(records, min_clips)
+     dataset_summary = summarise_records(records, data_root)
+     hashes = collect_hashes(records, data_root)
+     dataset_summary["metadata_file"] = str(metadata_path)
+
+     print("Dataset summary:")
+     for key, value in dataset_summary.items():
+         print(f"  {key}: {value}")
+
+     if args.dry_run:
+         print("Dry run complete. Exiting without training.")
+         return
+
+     X, y = build_dataset(records)
+     label_encoder = LabelEncoder()
+     y_encoded = label_encoder.fit_transform(y)
+
+     test_size = float(training_cfg.get("test_size", 0.25))
+     random_state = training_cfg.get("random_state", 42)
+     stratify = training_cfg.get("stratify", True)
+     stratify_arg = y_encoded if stratify else None
+
+     X_train, X_test, y_train, y_test = train_test_split(
+         X,
+         y_encoded,
+         test_size=test_size,
+         stratify=stratify_arg,
+         random_state=random_state,
+     )
+
+     clf = instantiate_classifier(training_cfg)
+     clf.fit(X_train, y_train)
+     y_pred = clf.predict(X_test)
+
+     report = classification_report(y_test, y_pred, target_names=label_encoder.classes_, output_dict=True)
+     metrics_path = Path(reporting_cfg.get("metrics_path", REPORT_DIR / "metrics.json"))
+     with metrics_path.open("w", encoding="utf-8") as fh:
+         json.dump(report, fh, indent=2)
+
+     cm = confusion_matrix(y_test, y_pred, normalize="true")
+     confusion_path = Path(reporting_cfg.get("confusion_matrix_path", REPORT_DIR / "confusion_matrix.png"))
+     plot_confusion_matrix(cm, label_encoder.classes_, confusion_path)

+     # Clean up non-serializable RNG to keep joblib artefacts deterministic.
      if hasattr(clf, "_feature_subsample_rng"):
          clf._feature_subsample_rng = None

-     joblib.dump(clf, os.path.join(MODEL_DIR, "model.pkl"))
-     joblib.dump(le, os.path.join(MODEL_DIR, "label_encoder.pkl"))
+     model_path = MODEL_DIR / "model.pkl"
+     encoder_path = MODEL_DIR / "label_encoder.pkl"
+     joblib.dump(clf, model_path)
+     joblib.dump(label_encoder, encoder_path)
+
+     run_report_path = write_run_report(
+         reporting_cfg,
+         config_path,
+         config,
+         records,
+         data_root,
+         report,
+         dataset_summary,
+         hashes,
+         model_path,
+         encoder_path,
+     )
+
      print("Saved model + reports.")
+     print(f"Run snapshot written to {run_report_path}")
+
+
+ if __name__ == "__main__":
+     main()
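Once a run finishes, the snapshot can be audited without retraining; snapshot filenames embed the UTC timestamp, so `max()` over the glob picks the newest (modulo the optional tag suffix):

```python
# Inspect the newest run snapshot: timestamp, per-device counts, clip hashes.
import json
from pathlib import Path

latest = max(Path("reports/runs").glob("run-*.json"))
run = json.loads(latest.read_text(encoding="utf-8"))
print(run["timestamp_utc"])
print(run["dataset"]["devices"])  # e.g. {'audio': 295, ...}
print(len(run["dataset"]["hashes"]), "clip hashes recorded")
```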