Speech vs Noise Classification with AST
An Open Audio Transformer Baseline by Norwood Systems
Distinguishing speech from noise is a fundamental building block in modern audio pipelines. It plays a critical role in tasks such as voice activity detection (VAD), speech recognition preprocessing, audio dataset cleaning, and noise-aware inference. This post introduces norwood-speechVSnoise-AST-based, an open-source Hugging Face project that provides:
- A trained speech-vs-noise classifier
- A reproducible training pipeline
- A Transformer-based audio modeling approach using AST
Repository:
What is this project?
norwood-speechVSnoise-AST-based is a binary audio classification model trained to distinguish between:
- Speech
- Noise
The model is fine-tuned from a pretrained Audio Spectrogram Transformer (AST) and is designed to serve as a robust, learnable alternative to heuristic or energy-based speech detection methods.
This project includes:
- Model weights
- Inference examples
Why Audio Spectrogram Transformer (AST)?
The model is built on:
bookbot/distil-ast-audioset
AST treats audio spectrograms as image-like patches and applies Transformer self-attention, allowing it to capture:
- Long-range temporal context
- Spectral structure characteristic of speech
- Noise patterns that vary over time
Compared to traditional CNN-based audio classifiers, AST provides:
- Better global context modeling
- Strong transfer learning from AudioSet
- Easy integration with 🤗 Transformers
Label configuration
The classifier is trained with two labels:
```python
labels = ["noise", "speech"]
```
Mapped internally as:
```python
label2id = {"noise": 0, "speech": 1}
id2label = {0: "noise", 1: "speech"}
```
This ensures compatibility with Hugging Face inference APIs and pipelines.
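The mappings above can be built programmatically from the label list, which keeps `label2id` and `id2label` consistent by construction. This is a minimal sketch of that setup, not necessarily the exact code used in the training script:

```python
# Build the label mappings used when configuring the classifier.
# Hugging Face configs expect label2id (str -> int) and id2label (int -> str).
labels = ["noise", "speech"]

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for i, label in enumerate(labels)}

print(label2id)  # {'noise': 0, 'speech': 1}
print(id2label)  # {0: 'noise', 1: 'speech'}
```

Passing both dicts to the model config at fine-tuning time is what lets downstream pipelines report human-readable label names instead of raw class indices.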
Audio preprocessing
Audio samples are:
- Loaded as waveforms
- Truncated to 5 seconds
- Converted to AST-compatible features
```python
max_duration = 5  # seconds
```
This duration balances:
- Computational efficiency
- Enough temporal context to distinguish speech from noise
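The truncation step can be sketched as follows. AST checkpoints expect 16 kHz mono audio; the function name here is illustrative, and in the real pipeline the clipped waveform would then be passed to the AST feature extractor for spectrogram conversion:

```python
import numpy as np

SAMPLING_RATE = 16_000  # AST models expect 16 kHz audio
max_duration = 5        # seconds

def truncate_waveform(waveform: np.ndarray) -> np.ndarray:
    """Clip a mono waveform to at most `max_duration` seconds."""
    max_samples = int(max_duration * SAMPLING_RATE)
    return waveform[:max_samples]

# Example: a 7-second waveform is clipped to 5 seconds (80,000 samples).
audio = np.random.randn(7 * SAMPLING_RATE).astype(np.float32)
clipped = truncate_waveform(audio)
print(clipped.shape)  # (80000,)
```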
Inference usage
Once trained or downloaded from the Hub, the model can be used with the standard Transformers pipeline:
```python
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="norwoodsystems/norwood-speechVSnoise-AST-based"
)

result = classifier("example.wav")
print(result)
```
The output is a list of label/score pairs, giving the model's confidence for speech and noise.
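In practice you often want a single boolean decision rather than raw scores. A small post-processing helper can provide that; the 0.5 threshold below is an illustrative default, not a value prescribed by the project:

```python
# The audio-classification pipeline returns a list of
# {"label": ..., "score": ...} dicts, highest score first.

def is_speech(result, threshold=0.5):
    """Return True when the 'speech' score meets the threshold."""
    scores = {entry["label"]: entry["score"] for entry in result}
    return scores.get("speech", 0.0) >= threshold

# Example of the pipeline's output shape:
example = [
    {"label": "speech", "score": 0.93},
    {"label": "noise", "score": 0.07},
]
print(is_speech(example))  # True
```

Raising the threshold trades recall for precision, which is useful when false speech detections are costly (e.g. before an expensive ASR pass).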
Intended use cases
This model is well-suited for:
- Voice activity detection (VAD)
- Speech-aware preprocessing before ASR
- Audio dataset filtering
- Noise-aware routing logic
- Research and experimentation
It is designed as a classification model, not a separation or enhancement system.
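For VAD-style use on recordings longer than the 5-second input window, a common pattern is to slide a window over the audio and classify each chunk. This is a hypothetical sketch of that wrapper, not part of the released project; the hop size and the stub energy classifier are illustrative, and `classify` would normally wrap the Hugging Face pipeline shown earlier:

```python
import numpy as np

SAMPLING_RATE = 16_000
WINDOW_S = 5.0   # matches the model's 5-second input
HOP_S = 2.5      # 50% overlap; illustrative choice

def speech_segments(waveform, classify):
    """Yield (start_s, end_s, is_speech) for each analysis window.

    `classify` is any callable mapping a waveform chunk to True/False.
    """
    win = int(WINDOW_S * SAMPLING_RATE)
    hop = int(HOP_S * SAMPLING_RATE)
    for start in range(0, max(len(waveform) - win + 1, 1), hop):
        chunk = waveform[start:start + win]
        yield (start / SAMPLING_RATE,
               (start + len(chunk)) / SAMPLING_RATE,
               classify(chunk))

# Stub classifier for demonstration: treats high-energy chunks as speech.
energy_vad = lambda chunk: float(np.mean(chunk ** 2)) > 0.01

# 5 s of silence followed by 5 s of noise-like signal.
audio = np.concatenate([np.zeros(5 * SAMPLING_RATE),
                        0.5 * np.random.randn(5 * SAMPLING_RATE)])
segments = list(speech_segments(audio.astype(np.float32), energy_vad))
```

Overlapping windows smooth out boundary effects at the cost of more classifier calls per second of audio.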
Limitations
- Binary classification only
- No explicit overlap detection
For diarization or enhancement, this model should be combined with additional components.
Why this matters
Traditional speech/noise detection often relies on:
- Energy thresholds
- Simple heuristics
- Hand-tuned parameters
This project demonstrates how pretrained audio Transformers can provide a more flexible and data-driven alternative while remaining fully open and reproducible.
Final thoughts
norwood-speechVSnoise-AST-based provides a practical, modern baseline for speech-vs-noise classification using Hugging Face tooling end to end. By building on pretrained Transformers, it serves as a useful reference for anyone building robust audio pipelines.
How are you currently detecting speech in noisy environments—heuristics, neural VADs, or Transformer-based classifiers?