πŸŽ™οΈ # πŸŽ™οΈ Whisper Large v3 β€” Fine-tuned for Children's Speech Recognition


## 🧠 What is This Model?

Imagine a smart assistant that can listen to a child speak and write down exactly what they said. That is what this model does!

Most speech recognition systems are trained on adult voices β€” they struggle to understand children because:

  • Children have higher-pitched voices than adults
  • Children sometimes mispronounce words (e.g., "elephant" β†’ "efant")
  • Children speak with different rhythm and speed
  • Some children have speech disorders

This model is a fine-tuned version of OpenAI's Whisper Large v3 β€” one of the world's best speech recognition models β€” specially adapted to understand children's speech.


πŸ† Competition Results

This model was built for the "On Top of Pasketti: Children's Speech Recognition Challenge" on DrivenData.

| Metric | Score |
| --- | --- |
| Validation WER | 0.1030 (10.30%) πŸ”₯ |
| Public Leaderboard WER | 0.4432 |
| Competition Leaderboard #1 | 0.1914 |

WER = Word Error Rate β€” Lower is better! If a child says 10 words and the model gets 1 wrong, WER = 10%.
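
As a sketch of how the metric works, here is an illustrative pure-Python implementation (not the competition's official scorer β€” evaluation toolkits such as `jiwer` are normally used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# A child says 10 words and the model gets 1 wrong -> WER = 10%
ref = "the big red dog ran fast across the green yard"
hyp = "the big red dog ran fast across the green lawn"
print(wer(ref, hyp))  # 0.1
```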


πŸ“ Files in This Repository β€” Complete Guide

This section explains every single file in this repository, what it contains, and why it exists.


### 1. model.safetensors ⭐ (Most Important File!)

What it is: The "brain" of the model β€” contains all the learned knowledge.

What's inside:

  • Millions of numbers (called "weights" or "parameters") that the model learned during training
  • These numbers represent everything the model knows about children's speech
  • Total: 1.5 billion parameters (like 1.5 billion brain connections!)

How it's used:

  • Loaded automatically when you run from_pretrained()
  • Without this file β€” the model cannot work at all
  • Size: ~3GB (large because it contains so much knowledge)

Simple analogy: Think of it like a student's brain after years of studying. All the knowledge is stored here.


### 2. model.safetensors.index.json

What it is: A "table of contents" for the model weights.

What's inside:

```json
{
  "metadata": {"total_size": 3000000000},
  "weight_map": {
    "encoder.layers.0.weight": "model.safetensors",
    "decoder.layers.0.weight": "model.safetensors",
    ...
  }
}
```

How it's used:

  • Tells the loading code exactly where each piece of the model is stored
  • Needed when model is split across multiple files
  • You never need to open this manually

Simple analogy: Like an index page in a textbook β€” tells you which page (file) contains which chapter (weight).
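
To see how that table of contents is consulted, here is a toy lookup against a miniature index (only the two example weight names from above; real index files list thousands):

```python
import json

# Hypothetical miniature version of model.safetensors.index.json
index_text = """
{
  "metadata": {"total_size": 3000000000},
  "weight_map": {
    "encoder.layers.0.weight": "model.safetensors",
    "decoder.layers.0.weight": "model.safetensors"
  }
}
"""
index = json.loads(index_text)

# Loading code asks: which shard file stores this weight?
shard = index["weight_map"]["decoder.layers.0.weight"]
print(shard)  # model.safetensors
```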


### 3. config.json ⭐ (Architecture Blueprint)

What it is: The blueprint that describes how the model is built.

What's inside:

```json
{
  "model_type": "whisper",
  "d_model": 1280,
  "encoder_layers": 32,
  "decoder_layers": 32,
  "encoder_attention_heads": 20,
  "num_mel_bins": 128,
  "vocab_size": 51866,
  ...
}
```

Key settings explained:

| Setting | Value | Meaning |
| --- | --- | --- |
| `model_type` | whisper | This is a Whisper model |
| `num_mel_bins` | 128 | Uses 128 frequency bands (Large v3 specific!) |
| `encoder_layers` | 32 | 32 layers for processing audio |
| `decoder_layers` | 32 | 32 layers for generating text |
| `vocab_size` | 51,866 | Knows 51,866 different word pieces |

How it's used:

  • Read first when loading the model
  • Tells PyTorch how to build the model structure
  • Must match the weights in model.safetensors

Simple analogy: Like an architect's blueprint β€” describes the structure before the building (weights) fills it.


### 4. generation_config.json

What it is: Settings that control how the model generates (writes) text.

What's inside:

```json
{
  "forced_decoder_ids": null,
  "suppress_tokens": [1, 2, 7, 8, ...],
  "task": "transcribe",
  "language": "en",
  "max_new_tokens": 225
}
```

Key settings explained:

| Setting | Meaning |
| --- | --- |
| `task: "transcribe"` | Convert speech to text (not translate) |
| `language: "en"` | English language only |
| `max_new_tokens: 225` | Generate at most 225 tokens |
| `suppress_tokens` | List of tokens to never generate |

How it's used:

  • Automatically loaded during inference
  • Controls the text generation process
  • Can be overridden by passing parameters to generate()

Simple analogy: Like rules given to a writer β€” "write in English", "keep it under 225 words", "don't use these specific words".


### 5. preprocessor_config.json ⭐ (Audio Processing Settings)

What it is: Settings for converting raw audio into a format the model understands.

What's inside:

```json
{
  "feature_size": 128,
  "sampling_rate": 16000,
  "hop_length": 160,
  "n_fft": 400,
  "padding_value": 0.0,
  "return_attention_mask": false
}
```

Key settings explained:

| Setting | Value | Meaning |
| --- | --- | --- |
| `feature_size` | 128 | Uses 128 mel frequency bins |
| `sampling_rate` | 16000 | Audio must be sampled at 16kHz |
| `hop_length` | 160 | Slide the analysis window by 160 samples |
| `n_fft` | 400 | FFT window size of 400 samples |

How it's used:

  • WhisperFeatureExtractor reads this to process audio correctly
  • Audio is converted to mel spectrogram (visual representation of sound)
  • Without correct settings β€” model receives wrong input!

Simple analogy: Like settings on a camera β€” resolution, brightness, zoom. Must be set correctly to get a good picture.
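
These settings fix the exact shape of the model's input. A quick back-of-the-envelope check (Whisper pads or trims every clip to 30 seconds):

```python
sampling_rate = 16000  # from preprocessor_config.json
hop_length = 160       # the window advances 160 samples = 10 ms per frame
n_mels = 128           # feature_size: mel bins per frame

clip_seconds = 30      # Whisper pads/trims every clip to 30 seconds
num_frames = clip_seconds * sampling_rate // hop_length
print(num_frames)            # 3000
print((n_mels, num_frames))  # (128, 3000) -- the spectrogram shape the model expects
```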


### 6. processor_config.json

What it is: Configuration for the complete processor (feature extractor + tokenizer combined).

What's inside:

```json
{
  "auto_map": {
    "AutoProcessor": "transformers.WhisperProcessor"
  },
  "feature_extractor_type": "WhisperFeatureExtractor",
  "tokenizer_class": "WhisperTokenizer"
}
```

How it's used:

  • Tells AutoProcessor which classes to use
  • Links feature extractor and tokenizer together
  • Loaded automatically with WhisperProcessor.from_pretrained()

Simple analogy: Like a connector cable β€” links the audio processing part with the text processing part.


### 7. tokenizer.json ⭐ (Language Dictionary)

What it is: The complete tokenizer β€” converts between text and numbers.

What's inside:

  • A vocabulary of 51,866 "tokens" (word pieces)
  • Rules for splitting words into tokens
  • Special tokens like <|startoftranscript|>, <|en|>, <|endoftext|>

Example of tokenization:

```text
"children" β†’ [1200, 303]  (converted to numbers)
[1200, 303] β†’ "children"  (converted back to text)
```

How it's used:

  • Model works with numbers, not words
  • Tokenizer converts predictions (numbers) β†’ readable text
  • Also converts training labels (text) β†’ numbers

Simple analogy: Like a codebook used by spies β€” converts messages to secret codes and back.
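
The round trip can be sketched with a toy vocabulary (the IDs here are made up for illustration; the real mapping lives in vocab.json and tokenizer.json):

```python
# Toy vocabulary with made-up IDs (the real one has 51,866 entries)
vocab = {"child": 1200, "ren": 303, "speech": 8901}
id_to_token = {i: t for t, i in vocab.items()}  # reverse mapping for decoding

def encode(tokens):
    """Text pieces -> numbers the model can work with."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Numbers the model predicted -> readable text."""
    return "".join(id_to_token[i] for i in ids)

ids = encode(["child", "ren"])
print(ids)          # [1200, 303]
print(decode(ids))  # children
```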


### 8. tokenizer_config.json

What it is: Configuration settings for the tokenizer.

What's inside:

```json
{
  "tokenizer_class": "WhisperTokenizer",
  "language": "english",
  "task": "transcribe",
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>"
}
```

Key settings explained:

| Setting | Meaning |
| --- | --- |
| `language: "english"` | Transcribe in English |
| `task: "transcribe"` | Speech-to-text task |
| `bos_token` | Token that starts a sequence |
| `eos_token` | Token that ends a sequence |
| `pad_token` | Token used for padding |

Simple analogy: Like grammar rules for the codebook β€” when to start, when to stop, what punctuation to use.


### 9. vocab.json ⭐ (Vocabulary List)

What it is: Complete list of all words/pieces the model knows.

What's inside:

```json
{
  "!": 0,
  "\"": 1,
  "#": 2,
  ...
  "children": 5234,
  "speech": 8901,
  ...
  "<|endoftext|>": 50256
}
```

How it's used:

  • Maps each token to a unique number
  • Used by tokenizer to encode/decode text
  • Contains 51,866 entries

Simple analogy: Like a dictionary β€” each word has a unique page number (ID).


### 10. merges.txt

What it is: Rules for how to combine smaller pieces into larger words.

What's inside:

```text
#version: 0.2
Δ  t
Δ  a
h e
i n
r e
...
```

How it's used:

  • Byte-Pair Encoding (BPE) tokenization
  • Tells tokenizer how to merge characters into subwords
  • Example: "ch" + "ild" + "ren" β†’ "children"

Simple analogy: Like rules for combining LEGO pieces β€” small pieces combine into larger meaningful shapes.
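
A minimal sketch of how merge rules are applied (the rules below are made up for illustration; real BPE tokenizers also track rule priority more carefully):

```python
def apply_merges(word, merges):
    """Apply BPE merge rules, in order, to a list of character symbols."""
    symbols = list(word)
    for a, b in merges:  # earlier rules were learned first and apply first
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair into one symbol
            else:
                i += 1
    return symbols

# Made-up merge rules in the spirit of merges.txt
merges = [("h", "e"), ("t", "he")]
print(apply_merges("the", merges))  # ['the']
```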


### 11. added_tokens.json

What it is: Special tokens added specifically for Whisper's speech recognition tasks.

What's inside:

```json
{
  "<|endoftext|>": 50256,
  "<|startoftranscript|>": 50258,
  "<|en|>": 50259,
  "<|transcribe|>": 50359,
  "<|notimestamps|>": 50363,
  ...
}
```

Key tokens explained:

| Token | Meaning |
| --- | --- |
| `<\|startoftranscript\|>` | Marks the beginning of the transcript |
| `<\|en\|>` | The audio is in English |
| `<\|transcribe\|>` | The task is transcription (not translation) |
| `<\|endoftext\|>` | Marks the end of the output |
| `<\|notimestamps\|>` | Do not generate timestamp tokens |

Simple analogy: Like stage directions in a play β€” "BEGIN SCENE", "SPEAK IN ENGLISH", "END SCENE".


### 12. normalizer.json

What it is: Rules for cleaning and standardizing text output.

What's inside:

  • Rules to lowercase text
  • Rules to remove punctuation
  • Rules to standardize numbers (e.g., spelled-out "two" and the digit "2" are unified)
  • Rules to fix common spelling variations

How it's used:

  • Applied to both predictions and ground truth before calculating WER
  • Ensures fair comparison β€” "Hello!" and "hello" are treated the same
  • This is Whisper's official English Text Normalizer

Simple analogy: Like a proofreader who standardizes all text before grading β€” removes extra spaces, fixes capitalization.
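
A simplified stand-in for what normalization does (Whisper's real English normalizer also handles numbers, contractions, and spelling variants β€” this sketch only covers casing and punctuation):

```python
import re

def normalize(text):
    """Lowercase and drop punctuation so surface form doesn't affect WER."""
    text = re.sub(r"[^\w\s]", "", text.lower())  # strip punctuation
    return " ".join(text.split())                # collapse extra whitespace

print(normalize("Hello!"))                        # hello
print(normalize("Hello!") == normalize("hello"))  # True -> treated the same for WER
```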


### 13. special_tokens_map.json

What it is: Maps special token names to their actual token strings.

What's inside:

```json
{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "additional_special_tokens": [
    "<|startoftranscript|>",
    "<|en|>",
    "<|transcribe|>",
    ...
  ]
}
```

How it's used:

  • Tells the tokenizer which tokens have special meaning
  • Used during both training and inference
  • Ensures model knows when to start/stop generating

Simple analogy: Like a legend on a map β€” explains what each special symbol means.


## πŸš€ How to Use This Model β€” Complete Step-by-Step Guide

### Prerequisites (Things You Need First)

Step 1 β€” Install Python:

  • Download Python 3.11 from python.org
  • Make sure to check "Add Python to PATH" during installation

Step 2 β€” Install Required Libraries: Open terminal/command prompt and run:

```bash
pip install transformers torch torchaudio soundfile librosa
```

This installs:

| Library | Purpose |
| --- | --- |
| transformers | Loads and runs the Whisper model |
| torch | Deep learning framework |
| torchaudio | Audio processing |
| soundfile | Reads audio files |
| librosa | Audio analysis and resampling |

### Option 1 β€” Quickest Way (Just a Few Lines of Code!)

```python
from transformers import pipeline

# Load model directly from HuggingFace
transcriber = pipeline(
    "automatic-speech-recognition",
    model="harphool17/whisper-large-v3-children-asr"
)

# Transcribe audio file
result = transcriber("your_audio_file.wav")
print(result["text"])
```

That's it! πŸŽ‰


### Option 2 β€” Complete Control (Recommended)

```python
import torch
import soundfile as sf
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# ── Step 1: Load Model ──
print("Loading model...")
processor = WhisperProcessor.from_pretrained(
    "harphool17/whisper-large-v3-children-asr"
)
model = WhisperForConditionalGeneration.from_pretrained(
    "harphool17/whisper-large-v3-children-asr"
)

# Use GPU if available (much faster!)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
print(f"Model loaded on: {device}")

# ── Step 2: Load Audio ──
def load_audio(audio_path):
    """Load audio file and convert to 16kHz mono"""
    audio, sr = sf.read(audio_path, dtype="float32")

    # Convert stereo to mono if needed
    if audio.ndim > 1:
        audio = audio.mean(axis=1)

    # Resample to 16kHz if needed
    if sr != 16000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

    return audio

# ── Step 3: Transcribe ──
def transcribe(audio_path):
    """Transcribe children's speech from an audio file"""

    # Load audio
    audio = load_audio(audio_path)

    # Process audio into model input
    inputs = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt"
    )

    # Move to GPU if available
    input_features = inputs.input_features.to(device)

    # Generate transcription
    with torch.no_grad():
        predicted_ids = model.generate(
            input_features,
            language="en",
            task="transcribe",
            max_new_tokens=225,
        )

    # Convert numbers back to text
    transcription = processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True
    )[0]

    return transcription.lower().strip()

# ── Step 4: Use It! ──
audio_file = "child_speech.wav"  # Replace with your audio file path
result = transcribe(audio_file)
print(f"Child said: {result}")
```

### Supported Audio Formats

| Format | Extension | Supported |
| --- | --- | --- |
| WAV | .wav | βœ… Yes |
| FLAC | .flac | βœ… Yes |
| MP3 | .mp3 | βœ… Yes |
| OGG | .ogg | βœ… Yes |

Best format: WAV or FLAC (lossless quality)

Required: 16kHz sample rate, mono channel (code above handles this automatically!)


### Common Errors and Fixes

| Error | Cause | Fix |
| --- | --- | --- |
| `ModuleNotFoundError: transformers` | Library not installed | Run `pip install transformers` |
| `CUDA out of memory` | GPU memory full | Move the model to CPU: `model = model.to("cpu")` |
| `RuntimeError: Input size mismatch` | Wrong audio format | Use the `load_audio()` function above |
| `OSError: model not found` | Wrong model name | Check spelling: `harphool17/whisper-large-v3-children-asr` |

## πŸ“Š Training Details

| Parameter | Value |
| --- | --- |
| Base Model | OpenAI Whisper Large v3 |
| Training Data | 82,490 children's speech samples |
| Training Hours | ~185 hours of audio |
| Age Groups | 3-4, 5-7, 8-11 years |
| Training Steps | 2,576 steps |
| Batch Size | 4 (effective: 64 with gradient accumulation) |
| Learning Rate | 1e-5 |
| Optimizer | AdamW with cosine schedule |
| Precision | bfloat16 |
| GPU | 2x NVIDIA RTX 4500 Ada (24GB each) |
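
The "effective: 64" figure is consistent with, for example, the following split across devices and accumulation steps (the exact accumulation count is an assumption, not stated above):

```python
per_device_batch = 4  # batch size per GPU (from the table)
num_gpus = 2          # 2x RTX 4500 Ada
grad_accum_steps = 8  # assumed gradient-accumulation steps

# Gradients from 8 small steps on each of 2 GPUs are averaged before
# each optimizer update, behaving like one batch of 64 samples.
effective_batch = per_device_batch * num_gpus * grad_accum_steps
print(effective_batch)  # 64
```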

## πŸ“ˆ Performance

| Split | WER | Description |
| --- | --- | --- |
| Validation | 0.1030 | 10.3% word error rate |
| Public Test | 0.4432 | Competition test set |

Note: The gap between validation and public test WER is due to distribution shift β€” the test set contains more challenging audio conditions than the training data.


## πŸ”— Related Resources

| Resource | Link |
| --- | --- |
| GitHub Code | harphool-singh/whisper-children-asr |
| Competition | DrivenData Pasketti Challenge |
| Base Model | openai/whisper-large-v3 |
| Demo Website | Coming Soon |

πŸ“ Citation

If you use this model in your work, please cite:

```bibtex
@misc{whisper-children-asr-2026,
  author = {Harphool Singh},
  title = {Whisper Large v3 Fine-tuned for Children's Speech Recognition},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/harphool17/whisper-large-v3-children-asr}
}
```

## πŸ‘€ Author

Harphool Singh


Built with ❀️ for improving children's education technology
