Speech vs Noise Classification with AST
An Open Audio Transformer Baseline by Norwood Systems
Distinguishing speech from noise is a fundamental building block in modern audio pipelines. It plays a critical role in tasks such as voice activity detection (VAD), speech recognition preprocessing, audio dataset cleaning, and noise-aware inference. This post introduces norwood-speechVSnoise-AST-based, an open-source Hugging Face project that provides:
- A trained speech-vs-noise classifier
- A reproducible training pipeline
- A Transformer-based audio modeling approach using AST
Repository:
What is this project?
norwood-speechVSnoise-AST-based is a binary audio classification model trained to distinguish between:
- Speech
- Noise
The model is fine-tuned from a pretrained Audio Spectrogram Transformer (AST) and is designed to serve as a robust, learnable alternative to heuristic or energy-based speech detection methods.
This project includes:
- Model weights
- Inference examples
Why Audio Spectrogram Transformer (AST)?
The model is built on:
bookbot/distil-ast-audioset
AST treats audio spectrograms as image-like patches and applies Transformer self-attention, allowing it to capture:
- Long-range temporal context
- Spectral structure characteristic of speech
- Noise patterns that vary over time
Compared to traditional CNN-based audio classifiers, AST provides:
- Better global context modeling
- Strong transfer learning from AudioSet
- Easy integration with 🤗 Transformers
Label configuration
The classifier is trained with two labels:
```python
labels = ["noise", "speech"]
```
Mapped internally as:
```python
label2id = {"noise": 0, "speech": 1}
id2label = {0: "noise", 1: "speech"}
```
This ensures compatibility with Hugging Face inference APIs and pipelines.
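The mappings above can be built programmatically from the label list, which keeps `label2id` and `id2label` consistent by construction. This is a minimal sketch of that setup, not necessarily the exact code used in the training script:

```python
# Build the label mappings used when configuring the classifier.
# Hugging Face configs expect label2id (str -> int) and id2label (int -> str).
labels = ["noise", "speech"]

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for i, label in enumerate(labels)}

print(label2id)  # {'noise': 0, 'speech': 1}
print(id2label)  # {0: 'noise', 1: 'speech'}
```

Passing both dicts to the model config at fine-tuning time is what lets downstream pipelines report human-readable label names instead of raw class indices.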
Audio preprocessing
Audio samples are:
- Loaded as waveforms
- Truncated to 5 seconds
- Converted to AST-compatible features
```python
max_duration = 5  # seconds
```
This duration balances:
- Computational efficiency
- Enough temporal context to distinguish speech from noise
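The truncation step can be sketched as follows. AST checkpoints expect 16 kHz mono audio; the function name here is illustrative, and in the real pipeline the clipped waveform would then be passed to the AST feature extractor for spectrogram conversion:

```python
import numpy as np

SAMPLING_RATE = 16_000  # AST models expect 16 kHz audio
max_duration = 5        # seconds

def truncate_waveform(waveform: np.ndarray) -> np.ndarray:
    """Clip a mono waveform to at most `max_duration` seconds."""
    max_samples = int(max_duration * SAMPLING_RATE)
    return waveform[:max_samples]

# Example: a 7-second waveform is clipped to 5 seconds (80,000 samples).
audio = np.random.randn(7 * SAMPLING_RATE).astype(np.float32)
clipped = truncate_waveform(audio)
print(clipped.shape)  # (80000,)
```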
Inference usage
Once trained or downloaded from the Hub, the model can be used with the standard Transformers pipeline:
```python
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="norwoodsystems/norwood-speechVSnoise-AST-based"
)

result = classifier("example.wav")
print(result)
```
The output is a list of label/score pairs, giving the model's confidence for speech and noise.
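In practice you often want a single boolean decision rather than raw scores. A small post-processing helper can provide that; the 0.5 threshold below is an illustrative default, not a value prescribed by the project:

```python
# The audio-classification pipeline returns a list of
# {"label": ..., "score": ...} dicts, highest score first.

def is_speech(result, threshold=0.5):
    """Return True when the 'speech' score meets the threshold."""
    scores = {entry["label"]: entry["score"] for entry in result}
    return scores.get("speech", 0.0) >= threshold

# Example of the pipeline's output shape:
example = [
    {"label": "speech", "score": 0.93},
    {"label": "noise", "score": 0.07},
]
print(is_speech(example))  # True
```

Raising the threshold trades recall for precision, which is useful when false speech detections are costly (e.g. before an expensive ASR pass).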
Intended use cases
This model is well-suited for:
- Voice activity detection (VAD)
- Speech-aware preprocessing before ASR
- Audio dataset filtering
- Noise-aware routing logic
- Research and experimentation
It is designed as a classification model, not a separation or enhancement system.
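For VAD-style use on recordings longer than the 5-second input window, a common pattern is to slide a window over the audio and classify each chunk. This is a hypothetical sketch of that wrapper, not part of the released project; the hop size and the stub energy classifier are illustrative, and `classify` would normally wrap the Hugging Face pipeline shown earlier:

```python
import numpy as np

SAMPLING_RATE = 16_000
WINDOW_S = 5.0   # matches the model's 5-second input
HOP_S = 2.5      # 50% overlap; illustrative choice

def speech_segments(waveform, classify):
    """Yield (start_s, end_s, is_speech) for each analysis window.

    `classify` is any callable mapping a waveform chunk to True/False.
    """
    win = int(WINDOW_S * SAMPLING_RATE)
    hop = int(HOP_S * SAMPLING_RATE)
    for start in range(0, max(len(waveform) - win + 1, 1), hop):
        chunk = waveform[start:start + win]
        yield (start / SAMPLING_RATE,
               (start + len(chunk)) / SAMPLING_RATE,
               classify(chunk))

# Stub classifier for demonstration: treats high-energy chunks as speech.
energy_vad = lambda chunk: float(np.mean(chunk ** 2)) > 0.01

# 5 s of silence followed by 5 s of noise-like signal.
audio = np.concatenate([np.zeros(5 * SAMPLING_RATE),
                        0.5 * np.random.randn(5 * SAMPLING_RATE)])
segments = list(speech_segments(audio.astype(np.float32), energy_vad))
```

Overlapping windows smooth out boundary effects at the cost of more classifier calls per second of audio.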
Limitations
- Binary classification only
- No explicit overlap detection
For diarization or enhancement, this model should be combined with additional components.
Why this matters
Traditional speech/noise detection often relies on:
- Energy thresholds
- Simple heuristics
- Hand-tuned parameters
This project demonstrates how pretrained audio Transformers can provide a more flexible and data-driven alternative while remaining fully open and reproducible.
Final thoughts
norwood-speechVSnoise-AST-based provides a practical, modern baseline for speech-vs-noise classification using Hugging Face tooling end to end. By building on pretrained Transformers, it serves as a useful reference for anyone building robust audio pipelines.
How are you currently detecting speech in noisy environments—heuristics, neural VADs, or Transformer-based classifiers?