# 🎙️ Whisper Large v3: Fine-tuned for Children's Speech Recognition

## 🧠 What is This Model?
Imagine a smart assistant that can listen to a child speak and write down exactly what they said. That is what this model does!
Most speech recognition systems are trained on adult voices, so they struggle to understand children because:
- Children have higher pitched voices than adults
- Children sometimes mispronounce words (e.g., "elephant" → "efant")
- Children speak with different rhythm and speed
- Some children have speech disorders
This model is a fine-tuned version of OpenAI's Whisper Large v3, one of the world's best speech recognition models, specially adapted to understand children's speech.
## 🏆 Competition Results
This model was built for the "On Top of Pasketti: Children's Speech Recognition Challenge" on DrivenData.
| Metric | Score |
|---|---|
| Validation WER | 0.1030 (10.30%) |
| Public Leaderboard WER | 0.4432 |
| Competition Leaderboard #1 | 0.1914 |
**WER = Word Error Rate**: lower is better! If a child says 10 words and the model gets 1 wrong, WER = 10%.
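As a concrete illustration, WER is the word-level edit distance (insertions + deletions + substitutions) divided by the number of reference words. Here is a minimal pure-Python sketch, not the competition's official scoring script:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words (classic Levenshtein table)
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# A child says 10 words and the model gets 1 wrong -> WER = 0.10
print(wer("one two three four five six seven eight nine ten",
          "one two three four five six seven eight nine pen"))  # 0.1
```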
## 📁 Files in This Repository: Complete Guide
This section explains every single file in this repository, what it contains, and why it exists.
### 1. `model.safetensors` (Most Important File!)
What it is: The "brain" of the model β contains all the learned knowledge.
What's inside:
- Millions of numbers (called "weights" or "parameters") that the model learned during training
- These numbers represent everything the model knows about children's speech
- Total: 1.5 billion parameters (like 1.5 billion brain connections!)
How it's used:
- Loaded automatically when you call `from_pretrained()`
- Without this file, the model cannot work at all
- Size: ~3GB (large because it contains so much knowledge)
Simple analogy: Think of it like a student's brain after years of studying. All the knowledge is stored here.
### 2. `model.safetensors.index.json`
What it is: A "table of contents" for the model weights.
What's inside:

```json
{
  "metadata": {"total_size": 3000000000},
  "weight_map": {
    "encoder.layers.0.weight": "model.safetensors",
    "decoder.layers.0.weight": "model.safetensors",
    ...
  }
}
```
How it's used:
- Tells the loading code exactly where each piece of the model is stored
- Needed when a model is split across multiple files
- You never need to open this manually
Simple analogy: Like an index page in a textbook β tells you which page (file) contains which chapter (weight).
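To see how the lookup works, you can parse a miniature index with Python's standard `json` module. The contents below are the abbreviated example from above, not the real index file:

```python
import json

# A miniature stand-in for model.safetensors.index.json
index_text = """
{
  "metadata": {"total_size": 3000000000},
  "weight_map": {
    "encoder.layers.0.weight": "model.safetensors",
    "decoder.layers.0.weight": "model.safetensors"
  }
}
"""
index = json.loads(index_text)

# Which shard file stores a given weight?
shard = index["weight_map"]["encoder.layers.0.weight"]
print(shard)  # model.safetensors
print(index["metadata"]["total_size"])  # 3000000000
```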
### 3. `config.json` (Architecture Blueprint)
What it is: The blueprint that describes how the model is built.
What's inside:

```json
{
  "model_type": "whisper",
  "d_model": 1280,
  "encoder_layers": 32,
  "decoder_layers": 32,
  "encoder_attention_heads": 20,
  "num_mel_bins": 128,
  "vocab_size": 51866,
  ...
}
```
Key settings explained:

| Setting | Value | Meaning |
|---|---|---|
| `model_type` | `whisper` | This is a Whisper model |
| `num_mel_bins` | 128 | Uses 128 frequency bands (Large v3 specific!) |
| `encoder_layers` | 32 | 32 layers for processing audio |
| `decoder_layers` | 32 | 32 layers for generating text |
| `vocab_size` | 51,866 | Knows 51,866 different word pieces |
How it's used:
- Read first when loading the model
- Tells PyTorch how to build the model structure
- Must match the weights in `model.safetensors`
Simple analogy: Like an architect's blueprint β describes the structure before the building (weights) fills it.
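A small sketch of how these numbers fit together. For instance, the 1280-dimensional model width is split evenly across the 20 attention heads:

```python
# Values copied from the config.json excerpt above
config = {
    "d_model": 1280,
    "encoder_layers": 32,
    "decoder_layers": 32,
    "encoder_attention_heads": 20,
}

# Each attention head works on an equal slice of the model width
head_dim = config["d_model"] // config["encoder_attention_heads"]
print(head_dim)  # 64

# Total transformer layers across encoder and decoder
total_layers = config["encoder_layers"] + config["decoder_layers"]
print(total_layers)  # 64
```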
### 4. `generation_config.json`
What it is: Settings that control how the model generates (writes) text.
What's inside:

```json
{
  "forced_decoder_ids": null,
  "suppress_tokens": [1, 2, 7, 8, ...],
  "task": "transcribe",
  "language": "en",
  "max_new_tokens": 225
}
```
Key settings explained:

| Setting | Meaning |
|---|---|
| `task: "transcribe"` | Convert speech to text (not translate) |
| `language: "en"` | English language only |
| `max_new_tokens: 225` | Generate at most 225 tokens (word pieces) |
| `suppress_tokens` | List of tokens to never generate |
How it's used:
- Automatically loaded during inference
- Controls the text generation process
- Can be overridden by passing parameters to `generate()`
Simple analogy: Like rules given to a writer: "write in English", "keep it under 225 word pieces", "don't use these specific words".
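The override behavior can be pictured as a dictionary merge: values from `generation_config.json` act as defaults, and explicit keyword arguments to `generate()` win. A toy sketch of the idea, not the actual `transformers` internals:

```python
# Defaults mirror the generation_config.json excerpt above
defaults = {"task": "transcribe", "language": "en", "max_new_tokens": 225}

def effective_settings(**overrides):
    """Later dict entries win, so explicit arguments override the defaults."""
    return {**defaults, **overrides}

print(effective_settings()["max_new_tokens"])                    # 225
print(effective_settings(max_new_tokens=100)["max_new_tokens"])  # 100
```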
### 5. `preprocessor_config.json` (Audio Processing Settings)
What it is: Settings for converting raw audio into a format the model understands.
What's inside:

```json
{
  "feature_size": 128,
  "sampling_rate": 16000,
  "hop_length": 160,
  "n_fft": 400,
  "padding_value": 0.0,
  "return_attention_mask": false
}
```
Key settings explained:

| Setting | Value | Meaning |
|---|---|---|
| `feature_size` | 128 | Uses 128 mel frequency bins |
| `sampling_rate` | 16000 | Audio must be at 16kHz |
| `hop_length` | 160 | Slide window by 160 samples |
| `n_fft` | 400 | FFT window size of 400 samples |
How it's used:
- `WhisperFeatureExtractor` reads this to process audio correctly
- Audio is converted to a mel spectrogram (a visual representation of sound)
- Without correct settings, the model receives wrong input!
Simple analogy: Like settings on a camera β resolution, brightness, zoom. Must be set correctly to get a good picture.
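These numbers determine how many spectrogram frames the model sees. With a 160-sample hop at 16kHz, one frame is produced every 10ms, so Whisper's fixed 30-second input window yields 3000 frames:

```python
# Values from the preprocessor_config.json excerpt above
sampling_rate = 16000  # samples per second
hop_length = 160       # samples between successive frames (10 ms)

clip_seconds = 30                       # Whisper pads/trims audio to 30 s
num_samples = clip_seconds * sampling_rate
num_frames = num_samples // hop_length  # one mel-spectrogram column per hop
print(num_frames)  # 3000
```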
### 6. `processor_config.json`
What it is: Configuration for the complete processor (feature extractor + tokenizer combined).
What's inside:

```json
{
  "auto_map": {
    "AutoProcessor": "transformers.WhisperProcessor"
  },
  "feature_extractor_type": "WhisperFeatureExtractor",
  "tokenizer_class": "WhisperTokenizer"
}
```
How it's used:
- Tells `AutoProcessor` which classes to use
- Links the feature extractor and tokenizer together
- Loaded automatically with `WhisperProcessor.from_pretrained()`
Simple analogy: Like a connector cable β links the audio processing part with the text processing part.
### 7. `tokenizer.json` (Language Dictionary)
What it is: The complete tokenizer β converts between text and numbers.
What's inside:
- A vocabulary of 51,866 "tokens" (word pieces)
- Rules for splitting words into tokens
- Special tokens like `<|startoftranscript|>`, `<|en|>`, `<|endoftext|>`
Example of tokenization:

```
"children" → [1200, 303]  (text converted to numbers)
[1200, 303] → "children"  (numbers converted back to text)
```
How it's used:
- Model works with numbers, not words
- Tokenizer converts predictions (numbers) → readable text
- Also converts training labels (text) → numbers
Simple analogy: Like a codebook used by spies β converts messages to secret codes and back.
### 8. `tokenizer_config.json`
What it is: Configuration settings for the tokenizer.
What's inside:

```json
{
  "tokenizer_class": "WhisperTokenizer",
  "language": "english",
  "task": "transcribe",
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>"
}
```
Key settings explained:

| Setting | Meaning |
|---|---|
| `language: "english"` | Transcribe in English |
| `task: "transcribe"` | Speech-to-text task |
| `bos_token` | Token that starts a sequence |
| `eos_token` | Token that ends a sequence |
| `pad_token` | Token used for padding |
Simple analogy: Like grammar rules for the codebook β when to start, when to stop, what punctuation to use.
### 9. `vocab.json` (Vocabulary List)
What it is: Complete list of all words/pieces the model knows.
What's inside:

```json
{
  "!": 0,
  "\"": 1,
  "#": 2,
  ...
  "children": 5234,
  "speech": 8901,
  ...
  "<|endoftext|>": 50256
}
```
How it's used:
- Maps each token to a unique number
- Used by tokenizer to encode/decode text
- Contains 51,866 entries
Simple analogy: Like a dictionary β each word has a unique page number (ID).
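The mapping works in both directions. A toy sketch using only the illustrative entries shown above (the IDs are examples, not guaranteed to match the real vocabulary):

```python
# A tiny slice of the vocabulary (illustrative IDs, as in the excerpt above)
vocab = {"children": 5234, "speech": 8901, "<|endoftext|>": 50256}

# Invert the mapping so IDs can be turned back into tokens
id_to_token = {token_id: token for token, token_id in vocab.items()}

ids = [vocab[t] for t in ["children", "speech"]]  # encode: tokens -> IDs
print(ids)                                        # [5234, 8901]
print([id_to_token[i] for i in ids])              # ['children', 'speech']
```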
### 10. `merges.txt`
What it is: Rules for how to combine smaller pieces into larger words.
What's inside:

```
#version: 0.2
Ġ t
Ġ a
h e
i n
r e
...
```
How it's used:
- Byte-Pair Encoding (BPE) tokenization
- Tells tokenizer how to merge characters into subwords
- Example: "ch" + "ild" + "ren" → "children"
Simple analogy: Like rules for combining LEGO pieces β small pieces combine into larger meaningful shapes.
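The merging process can be sketched in a few lines. This is a simplified greedy version (real BPE repeatedly applies the highest-priority merge anywhere in the sequence), and the merge rules below are hypothetical, chosen to match the "children" example:

```python
def bpe_merge(tokens, merges):
    """Apply merge rules in priority order (toy BPE sketch)."""
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                # Fuse the adjacent pair into one larger token
                tokens[i:i + 2] = [a + b]
            else:
                i += 1
    return tokens

# Hypothetical merges that build "children" from smaller pieces
merges = [("c", "h"), ("ch", "ild"), ("child", "ren")]
print(bpe_merge(["c", "h", "ild", "ren"], merges))  # ['children']
```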
### 11. `added_tokens.json`
What it is: Special tokens added specifically for Whisper's speech recognition tasks.
What's inside:

```json
{
  "<|endoftext|>": 50256,
  "<|startoftranscript|>": 50258,
  "<|en|>": 50259,
  "<|transcribe|>": 50359,
  "<|notimestamps|>": 50363,
  ...
}
```
Key tokens explained:

| Token | Meaning |
|---|---|
| `<\|startoftranscript\|>` | Marks the start of the transcription |
| `<\|en\|>` | The audio is in English |
| `<\|transcribe\|>` | Perform transcription (not translation) |
| `<\|endoftext\|>` | Marks the end of the output |
| `<\|notimestamps\|>` | Generate text without timestamps |
Simple analogy: Like stage directions in a play β "BEGIN SCENE", "SPEAK IN ENGLISH", "END SCENE".
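During inference these special tokens are assembled into the fixed prompt the decoder starts generating from. A sketch using the IDs listed above:

```python
# Special-token IDs from the added_tokens.json excerpt above
added = {
    "<|startoftranscript|>": 50258,
    "<|en|>": 50259,
    "<|transcribe|>": 50359,
    "<|notimestamps|>": 50363,
}

# The decoder's starting prompt: begin transcript, English, transcribe, no timestamps
prompt = [added[t] for t in
          ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]]
print(prompt)  # [50258, 50259, 50359, 50363]
```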
### 12. `normalizer.json`
What it is: Rules for cleaning and standardizing text output.
What's inside:
- Rules to lowercase text
- Rules to remove punctuation
- Rules to convert numbers to words ("2" → "two")
- Rules to fix common spelling variations
How it's used:
- Applied to both predictions and ground truth before calculating WER
- Ensures fair comparison: "Hello!" and "hello" are treated the same
- This is Whisper's official English Text Normalizer
Simple analogy: Like a proofreader who standardizes all text before grading β removes extra spaces, fixes capitalization.
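A toy normalizer illustrating the idea. Whisper's real `EnglishTextNormalizer` does much more (spelling out numbers, expanding contractions, and many other cases):

```python
import re

def simple_normalize(text: str) -> str:
    """Toy normalizer: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)   # drop punctuation except apostrophes
    return re.sub(r"\s+", " ", text).strip()

# "Hello!" and "hello" become identical before WER is computed
print(simple_normalize("Hello!") == simple_normalize("hello"))  # True
```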
### 13. `special_tokens_map.json`
What it is: Maps special token names to their actual token strings.
What's inside:

```json
{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "additional_special_tokens": [
    "<|startoftranscript|>",
    "<|en|>",
    "<|transcribe|>",
    ...
  ]
}
```
How it's used:
- Tells the tokenizer which tokens have special meaning
- Used during both training and inference
- Ensures model knows when to start/stop generating
Simple analogy: Like a legend on a map β explains what each special symbol means.
## 🚀 How to Use This Model: Complete Step-by-Step Guide

### Prerequisites (Things You Need First)
**Step 1: Install Python**
- Download Python 3.11 from [python.org](https://python.org)
- Make sure to check "Add Python to PATH" during installation
**Step 2: Install Required Libraries.** Open a terminal/command prompt and run:

```
pip install transformers torch torchaudio soundfile librosa
```
This installs:

| Library | Purpose |
|---|---|
| `transformers` | Loads and runs the Whisper model |
| `torch` | Deep learning framework |
| `torchaudio` | Audio processing |
| `soundfile` | Reads audio files |
| `librosa` | Audio analysis |
### Option 1: Quickest Way (3 Lines of Code!)

```python
from transformers import pipeline

# Load model directly from HuggingFace
transcriber = pipeline(
    "automatic-speech-recognition",
    model="harphool17/whisper-large-v3-children-asr"
)

# Transcribe audio file
result = transcriber("your_audio_file.wav")
print(result["text"])
```
That's it! 🎉
### Option 2: Complete Control (Recommended)

```python
import torch
import soundfile as sf
import librosa
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# ── Step 1: Load Model ──
print("Loading model...")
processor = WhisperProcessor.from_pretrained(
    "harphool17/whisper-large-v3-children-asr"
)
model = WhisperForConditionalGeneration.from_pretrained(
    "harphool17/whisper-large-v3-children-asr"
)

# Use GPU if available (much faster!)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
print(f"Model loaded on: {device}")

# ── Step 2: Load Audio ──
def load_audio(audio_path):
    """Load audio file and convert to 16kHz mono"""
    audio, sr = sf.read(audio_path, dtype="float32")
    # Convert stereo to mono if needed
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    # Resample to 16kHz if needed
    if sr != 16000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    return audio

# ── Step 3: Transcribe ──
def transcribe(audio_path):
    """Transcribe children's speech from an audio file"""
    # Load audio
    audio = load_audio(audio_path)

    # Process audio into model input
    inputs = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt"
    )

    # Move to GPU if available
    input_features = inputs.input_features.to(device)

    # Generate transcription
    with torch.no_grad():
        predicted_ids = model.generate(
            input_features,
            language="en",
            task="transcribe",
            max_new_tokens=225,
        )

    # Convert numbers back to text
    transcription = processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True
    )[0]
    return transcription.lower().strip()

# ── Step 4: Use It! ──
audio_file = "child_speech.wav"  # Replace with your audio file path
result = transcribe(audio_file)
print(f"Child said: {result}")
```
### Supported Audio Formats
| Format | Extension | Supported |
|---|---|---|
| WAV | `.wav` | ✅ Yes |
| FLAC | `.flac` | ✅ Yes |
| MP3 | `.mp3` | ✅ Yes |
| OGG | `.ogg` | ✅ Yes |
Best format: WAV or FLAC (lossless quality)
Required: 16kHz sample rate, mono channel (code above handles this automatically!)
### Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| `ModuleNotFoundError: transformers` | Library not installed | Run `pip install transformers` |
| `CUDA out of memory` | GPU memory full | Run on CPU instead: `model = model.to("cpu")` |
| `RuntimeError: Input size mismatch` | Wrong audio format | Use the `load_audio()` function above |
| `OSError: model not found` | Wrong model name | Check spelling: `harphool17/whisper-large-v3-children-asr` |
## 📊 Training Details
| Parameter | Value |
|---|---|
| Base Model | OpenAI Whisper Large v3 |
| Training Data | 82,490 children's speech samples |
| Training Hours | ~185 hours of audio |
| Age Groups | 3-4, 5-7, 8-11 years |
| Training Steps | 2,576 steps |
| Batch Size | 4 (effective: 64 with gradient accumulation) |
| Learning Rate | 1e-5 |
| Optimizer | AdamW with cosine schedule |
| Precision | bfloat16 |
| GPU | 2x NVIDIA RTX 4500 Ada (24GB each) |
## 📈 Performance
| Split | WER | Description |
|---|---|---|
| Validation | 0.1030 | 10.3% word error rate |
| Public Test | 0.4432 | Competition test set |
Note: The gap between validation and public test WER is due to distribution shift: the test set contains more challenging audio conditions than the training data.
## 🔗 Related Resources
| Resource | Link |
|---|---|
| GitHub Code | harphool-singh/whisper-children-asr |
| Competition | DrivenData Pasketti Challenge |
| Base Model | openai/whisper-large-v3 |
| Demo Website | Coming Soon |
## 📝 Citation
If you use this model in your work, please cite:
```bibtex
@misc{whisper-children-asr-2026,
  author = {Harphool Singh},
  title = {Whisper Large v3 Fine-tuned for Children's Speech Recognition},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/harphool17/whisper-large-v3-children-asr}
}
```
## 👤 Author
Harphool Singh
- GitHub: @harphool-singh
- HuggingFace: @harphool17
Built with ❤️ for improving children's education technology