Speech vs Noise Classification with AST

Community Article Published February 3, 2026

An Open Audio Transformer Baseline by Norwood Systems

Distinguishing speech from noise is a fundamental building block in modern audio pipelines. It plays a critical role in tasks such as voice activity detection (VAD), speech recognition preprocessing, audio dataset cleaning, and noise-aware inference.

This post introduces norwood-speechVSnoise-AST-based, an open-source Hugging Face project that provides:

  • A trained speech-vs-noise classifier
  • A reproducible training pipeline
  • A Transformer-based audio modeling approach using AST

Repository:


What is this project?

norwood-speechVSnoise-AST-based is a binary audio classification model trained to distinguish between:

  • Speech
  • Noise

The model is fine-tuned from a pretrained Audio Spectrogram Transformer (AST) and is designed to serve as a robust, learnable alternative to heuristic or energy-based speech detection methods.

This project includes:

  • Model weights
  • Inference examples

Why Audio Spectrogram Transformer (AST)?

The model is built on:

bookbot/distil-ast-audioset

AST treats audio spectrograms as image-like patches and applies Transformer self-attention, allowing it to capture:

  • Long-range temporal context
  • Spectral structure characteristic of speech
  • Noise patterns that vary over time

Compared to traditional CNN-based audio classifiers, AST provides:

  • Better global context modeling
  • Strong transfer learning from AudioSet
  • Easy integration with 🤗 Transformers

Label configuration

The classifier is trained with two labels:

labels = ["noise", "speech"]

Mapped internally as:

label2id = {"noise": "0", "speech": "1"}
id2label = {"0": "noise", "1": "speech"}

This ensures compatibility with Hugging Face inference APIs and pipelines.
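The mapping above can also be built programmatically. A minimal sketch, following the string-keyed convention shown above:

```python
# Build the label maps from the label list. String ids mirror how the
# mapping is serialized in a Hugging Face config.json.
labels = ["noise", "speech"]
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

print(label2id)  # {'noise': '0', 'speech': '1'}
print(id2label)  # {'0': 'noise', '1': 'speech'}
```

These dictionaries are typically passed to `from_pretrained` (via `label2id=` and `id2label=`) when fine-tuning, so the resulting checkpoint reports human-readable labels at inference time.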


Audio preprocessing

Audio samples are:

  • Loaded as waveforms
  • Truncated to 5 seconds
  • Converted to AST-compatible features

max_duration = 5  # seconds

This duration balances:

  • Computational efficiency
  • Enough temporal context to distinguish speech from noise
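Assuming the 16 kHz sampling rate that AST checkpoints expect (an assumption, since the exact preprocessing code is not shown here), the truncation step can be sketched as:

```python
# Sketch of the truncation step: keep at most the first 5 seconds
# of a mono waveform sampled at 16 kHz (assumed rate for AST).
sampling_rate = 16_000
max_duration = 5  # seconds
max_length = sampling_rate * max_duration  # 80_000 samples

def truncate(waveform):
    """Keep at most the first max_duration seconds of the waveform."""
    return waveform[:max_length]

# A 7-second dummy waveform is cut down to 5 seconds of samples.
dummy = [0.0] * (sampling_rate * 7)
print(len(truncate(dummy)))  # 80000
```

In the real pipeline the truncated waveform would then be passed through the AST feature extractor (e.g. `AutoFeatureExtractor.from_pretrained(...)` with `sampling_rate=16_000`) to produce the model's spectrogram inputs.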

Inference usage

Once trained or downloaded from the Hub, the model can be used with the standard Transformers pipeline:

from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="norwoodsystems/norwood-speechVSnoise-AST-based"
)

result = classifier("example.wav")
print(result)

The output provides class probabilities for speech and noise.
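Downstream code usually needs a single decision rather than raw scores. A small sketch of that post-processing, assuming the list-of-dicts shape that audio-classification pipelines return:

```python
# Turn the pipeline's scored labels into a single speech/noise decision.
# `example_output` mimics the pipeline's return shape (assumption: a list
# of {"label", "score"} dicts, highest score first).
example_output = [
    {"label": "speech", "score": 0.93},
    {"label": "noise", "score": 0.07},
]

def is_speech(result, threshold=0.5):
    """Return True when the 'speech' score clears the threshold."""
    scores = {entry["label"]: entry["score"] for entry in result}
    return scores.get("speech", 0.0) >= threshold

print(is_speech(example_output))  # True
```

The threshold is a tunable parameter; 0.5 is just the neutral default for a two-class model.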


Intended use cases

This model is well-suited for:

  • Voice activity detection (VAD)
  • Speech-aware preprocessing before ASR
  • Audio dataset filtering
  • Noise-aware routing logic
  • Research and experimentation

It is designed as a classification model, not a separation or enhancement system.
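For VAD-style use on recordings longer than the 5-second input window, one common pattern is to slide the classifier over fixed-size chunks. A sketch, where `classify` is a hypothetical stand-in for the pipeline call shown earlier:

```python
# Sketch of window-level VAD: split a long waveform into 5-second
# windows and classify each one. `classify` is a hypothetical stand-in
# for the actual pipeline call (assumption, not part of the project).
sampling_rate = 16_000
window = sampling_rate * 5  # matches max_duration

def chunk(waveform, window_size):
    """Yield consecutive fixed-size windows (the last may be shorter)."""
    for start in range(0, len(waveform), window_size):
        yield waveform[start:start + window_size]

def label_windows(waveform, classify):
    """Apply a classifier to each window and collect the labels."""
    return [classify(w) for w in chunk(waveform, window)]

# Example with a dummy classifier that calls everything "noise":
dummy_audio = [0.0] * (sampling_rate * 12)  # 12 seconds -> 3 windows
print(label_windows(dummy_audio, lambda w: "noise"))  # ['noise', 'noise', 'noise']
```

Overlapping windows or per-window smoothing can be layered on top when finer time resolution is needed.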


Limitations

  • Binary classification only
  • No explicit overlap detection

For diarization or enhancement, this model should be combined with additional components.


Why this matters

Traditional speech/noise detection often relies on:

  • Energy thresholds
  • Simple heuristics
  • Hand-tuned parameters
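For contrast, the kind of energy-threshold heuristic this model is meant to replace can be sketched in a few lines (the threshold value here is an arbitrary, hand-tuned illustration):

```python
import math

# Classic energy-based detection: a frame counts as speech iff its RMS
# energy exceeds a hand-tuned threshold. The threshold is arbitrary and
# must be re-tuned per recording condition, which is the core weakness.
def rms(frame):
    """Root-mean-square energy of a frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def energy_vad(frames, threshold=0.02):
    """Return one True/False speech decision per frame."""
    return [rms(f) > threshold for f in frames]

loud = [0.5, -0.5] * 100       # high-energy frame
quiet = [0.001, -0.001] * 100  # low-energy frame
print(energy_vad([loud, quiet]))  # [True, False]
```

A quiet speaker or a loud background easily fools this heuristic, whereas a learned classifier can key on spectral structure rather than raw energy.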

This project demonstrates how pretrained audio Transformers can provide a more flexible and data-driven alternative while remaining fully open and reproducible.


Final thoughts

norwood-speechVSnoise-AST-based provides a practical, modern baseline for speech-vs-noise classification using Hugging Face tooling end to end. By leveraging pretrained Transformers, it serves as a useful reference for anyone building robust audio pipelines.


How are you currently detecting speech in noisy environments—heuristics, neural VADs, or Transformer-based classifiers?
