# 🎙️ Whisper Large v3: Fine-tuned for Children's Speech Recognition

## 🧠 What is This Model?
Imagine a smart assistant that can listen to a child speak and write down exactly what they said. That is what this model does!
Most speech recognition systems are trained on adult voices, so they struggle to understand children because:
- Children have higher pitched voices than adults
- Children sometimes mispronounce words (e.g., "elephant" → "efant")
- Children speak with different rhythm and speed
- Some children have speech disorders
This model is a fine-tuned version of OpenAI's Whisper Large v3, one of the world's best speech recognition models, specially adapted to understand children's speech.
## 🏆 Competition Results
This model was built for the "On Top of Pasketti: Children's Speech Recognition Challenge" on DrivenData.
| Metric | Score |
|---|---|
| Validation WER | 0.1030 (10.30%) |
| Public Leaderboard WER | 0.4432 |
| Competition Leaderboard #1 | 0.1914 |
**WER = Word Error Rate**: lower is better! If a child says 10 words and the model gets 1 wrong, WER = 10%.
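As a concrete illustration, WER is the word-level edit distance (insertions + deletions + substitutions) divided by the number of reference words. Here is a minimal pure-Python sketch, not the competition's official scoring script:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words (classic Levenshtein table)
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# A child says 10 words and the model gets 1 wrong -> WER = 0.10
print(wer("one two three four five six seven eight nine ten",
          "one two three four five six seven eight nine pen"))  # 0.1
```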
## 📁 Files in This Repository: Complete Guide
This section explains every single file in this repository, what it contains, and why it exists.
### 1. `model.safetensors` (Most Important File!)
What it is: The "brain" of the model β contains all the learned knowledge.
What's inside:
- Millions of numbers (called "weights" or "parameters") that the model learned during training
- These numbers represent everything the model knows about children's speech
- Total: 1.5 billion parameters (like 1.5 billion brain connections!)
How it's used:
- Loaded automatically when you call `from_pretrained()`
- Without this file, the model cannot work at all
- Size: ~3GB (large because it contains so much knowledge)
Simple analogy: Think of it like a student's brain after years of studying. All the knowledge is stored here.
### 2. `model.safetensors.index.json`
What it is: A "table of contents" for the model weights.
What's inside:

```json
{
  "metadata": {"total_size": 3000000000},
  "weight_map": {
    "encoder.layers.0.weight": "model.safetensors",
    "decoder.layers.0.weight": "model.safetensors",
    ...
  }
}
```
How it's used:
- Tells the loading code exactly where each piece of the model is stored
- Needed when a model is split across multiple files
- You never need to open this manually
Simple analogy: Like an index page in a textbook β tells you which page (file) contains which chapter (weight).
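To see how the lookup works, you can parse a miniature index with Python's standard `json` module. The contents below are the abbreviated example from above, not the real index file:

```python
import json

# A miniature stand-in for model.safetensors.index.json
index_text = """
{
  "metadata": {"total_size": 3000000000},
  "weight_map": {
    "encoder.layers.0.weight": "model.safetensors",
    "decoder.layers.0.weight": "model.safetensors"
  }
}
"""
index = json.loads(index_text)

# Which shard file stores a given weight?
shard = index["weight_map"]["encoder.layers.0.weight"]
print(shard)  # model.safetensors
print(index["metadata"]["total_size"])  # 3000000000
```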
### 3. `config.json` (Architecture Blueprint)
What it is: The blueprint that describes how the model is built.
What's inside:

```json
{
  "model_type": "whisper",
  "d_model": 1280,
  "encoder_layers": 32,
  "decoder_layers": 32,
  "encoder_attention_heads": 20,
  "num_mel_bins": 128,
  "vocab_size": 51866,
  ...
}
```
Key settings explained:

| Setting | Value | Meaning |
|---|---|---|
| `model_type` | `whisper` | This is a Whisper model |
| `num_mel_bins` | 128 | Uses 128 frequency bands (Large v3 specific!) |
| `encoder_layers` | 32 | 32 layers for processing audio |
| `decoder_layers` | 32 | 32 layers for generating text |
| `vocab_size` | 51,866 | Knows 51,866 different word pieces |
How it's used:
- Read first when loading the model
- Tells PyTorch how to build the model structure
- Must match the weights in `model.safetensors`
Simple analogy: Like an architect's blueprint β describes the structure before the building (weights) fills it.
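A small sketch of how these numbers fit together. For instance, the 1280-dimensional model width is split evenly across the 20 attention heads:

```python
# Values copied from the config.json excerpt above
config = {
    "d_model": 1280,
    "encoder_layers": 32,
    "decoder_layers": 32,
    "encoder_attention_heads": 20,
}

# Each attention head works on an equal slice of the model width
head_dim = config["d_model"] // config["encoder_attention_heads"]
print(head_dim)  # 64

# Total transformer layers across encoder and decoder
total_layers = config["encoder_layers"] + config["decoder_layers"]
print(total_layers)  # 64
```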
### 4. `generation_config.json`
What it is: Settings that control how the model generates (writes) text.
What's inside:

```json
{
  "forced_decoder_ids": null,
  "suppress_tokens": [1, 2, 7, 8, ...],
  "task": "transcribe",
  "language": "en",
  "max_new_tokens": 225
}
```
Key settings explained:

| Setting | Meaning |
|---|---|
| `task: "transcribe"` | Convert speech to text (not translate) |
| `language: "en"` | English language only |
| `max_new_tokens: 225` | Generate at most 225 tokens (word pieces) |
| `suppress_tokens` | List of tokens to never generate |
How it's used:
- Automatically loaded during inference
- Controls the text generation process
- Can be overridden by passing parameters to `generate()`
Simple analogy: Like rules given to a writer: "write in English", "keep it under 225 word pieces", "don't use these specific words".
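The override behavior can be pictured as a dictionary merge: values from `generation_config.json` act as defaults, and explicit keyword arguments to `generate()` win. A toy sketch of the idea, not the actual `transformers` internals:

```python
# Defaults mirror the generation_config.json excerpt above
defaults = {"task": "transcribe", "language": "en", "max_new_tokens": 225}

def effective_settings(**overrides):
    """Later dict entries win, so explicit arguments override the defaults."""
    return {**defaults, **overrides}

print(effective_settings()["max_new_tokens"])                    # 225
print(effective_settings(max_new_tokens=100)["max_new_tokens"])  # 100
```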
### 5. `preprocessor_config.json` (Audio Processing Settings)
What it is: Settings for converting raw audio into a format the model understands.
What's inside:

```json
{
  "feature_size": 128,
  "sampling_rate": 16000,
  "hop_length": 160,
  "n_fft": 400,
  "padding_value": 0.0,
  "return_attention_mask": false
}
```
Key settings explained:

| Setting | Value | Meaning |
|---|---|---|
| `feature_size` | 128 | Uses 128 mel frequency bins |
| `sampling_rate` | 16000 | Audio must be at 16kHz |
| `hop_length` | 160 | Slide window by 160 samples |
| `n_fft` | 400 | FFT window size of 400 samples |
How it's used:
- `WhisperFeatureExtractor` reads this to process audio correctly
- Audio is converted to a mel spectrogram (a visual representation of sound)
- Without correct settings, the model receives wrong input!
Simple analogy: Like settings on a camera β resolution, brightness, zoom. Must be set correctly to get a good picture.
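These numbers determine how many spectrogram frames the model sees. With a 160-sample hop at 16kHz, one frame is produced every 10ms, so Whisper's fixed 30-second input window yields 3000 frames:

```python
# Values from the preprocessor_config.json excerpt above
sampling_rate = 16000  # samples per second
hop_length = 160       # samples between successive frames (10 ms)

clip_seconds = 30                       # Whisper pads/trims audio to 30 s
num_samples = clip_seconds * sampling_rate
num_frames = num_samples // hop_length  # one mel-spectrogram column per hop
print(num_frames)  # 3000
```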
### 6. `processor_config.json`
What it is: Configuration for the complete processor (feature extractor + tokenizer combined).
What's inside:

```json
{
  "auto_map": {
    "AutoProcessor": "transformers.WhisperProcessor"
  },
  "feature_extractor_type": "WhisperFeatureExtractor",
  "tokenizer_class": "WhisperTokenizer"
}
```
How it's used:
- Tells `AutoProcessor` which classes to use
- Links the feature extractor and tokenizer together
- Loaded automatically with `WhisperProcessor.from_pretrained()`
Simple analogy: Like a connector cable β links the audio processing part with the text processing part.
### 7. `tokenizer.json` (Language Dictionary)
What it is: The complete tokenizer β converts between text and numbers.
What's inside:
- A vocabulary of 51,866 "tokens" (word pieces)
- Rules for splitting words into tokens
- Special tokens like `<|startoftranscript|>`, `<|en|>`, `<|endoftext|>`
Example of tokenization:

```
"children" → [1200, 303]  (text converted to numbers)
[1200, 303] → "children"  (numbers converted back to text)
```
How it's used:
- Model works with numbers, not words
- Tokenizer converts predictions (numbers) → readable text
- Also converts training labels (text) → numbers
Simple analogy: Like a codebook used by spies β converts messages to secret codes and back.
### 8. `tokenizer_config.json`
What it is: Configuration settings for the tokenizer.
What's inside:

```json
{
  "tokenizer_class": "WhisperTokenizer",
  "language": "english",
  "task": "transcribe",
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>"
}
```
Key settings explained:

| Setting | Meaning |
|---|---|
| `language: "english"` | Transcribe in English |
| `task: "transcribe"` | Speech-to-text task |
| `bos_token` | Token that starts a sequence |
| `eos_token` | Token that ends a sequence |
| `pad_token` | Token used for padding |
Simple analogy: Like grammar rules for the codebook β when to start, when to stop, what punctuation to use.
### 9. `vocab.json` (Vocabulary List)
What it is: Complete list of all words/pieces the model knows.
What's inside:

```json
{
  "!": 0,
  "\"": 1,
  "#": 2,
  ...
  "children": 5234,
  "speech": 8901,
  ...
  "<|endoftext|>": 50256
}
```
How it's used:
- Maps each token to a unique number
- Used by tokenizer to encode/decode text
- Contains 51,866 entries
Simple analogy: Like a dictionary β each word has a unique page number (ID).
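The mapping works in both directions. A toy sketch using only the illustrative entries shown above (the IDs are examples, not guaranteed to match the real vocabulary):

```python
# A tiny slice of the vocabulary (illustrative IDs, as in the excerpt above)
vocab = {"children": 5234, "speech": 8901, "<|endoftext|>": 50256}

# Invert the mapping so IDs can be turned back into tokens
id_to_token = {token_id: token for token, token_id in vocab.items()}

ids = [vocab[t] for t in ["children", "speech"]]  # encode: tokens -> IDs
print(ids)                                        # [5234, 8901]
print([id_to_token[i] for i in ids])              # ['children', 'speech']
```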
### 10. `merges.txt`
What it is: Rules for how to combine smaller pieces into larger words.
What's inside:

```
#version: 0.2
Ġ t
Ġ a
h e
i n
r e
...
```
How it's used:
- Byte-Pair Encoding (BPE) tokenization
- Tells tokenizer how to merge characters into subwords
- Example: "ch" + "ild" + "ren" → "children"
Simple analogy: Like rules for combining LEGO pieces β small pieces combine into larger meaningful shapes.
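The merging process can be sketched in a few lines. This is a simplified greedy version (real BPE repeatedly applies the highest-priority merge anywhere in the sequence), and the merge rules below are hypothetical, chosen to match the "children" example:

```python
def bpe_merge(tokens, merges):
    """Apply merge rules in priority order (toy BPE sketch)."""
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                # Fuse the adjacent pair into one larger token
                tokens[i:i + 2] = [a + b]
            else:
                i += 1
    return tokens

# Hypothetical merges that build "children" from smaller pieces
merges = [("c", "h"), ("ch", "ild"), ("child", "ren")]
print(bpe_merge(["c", "h", "ild", "ren"], merges))  # ['children']
```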
### 11. `added_tokens.json`
What it is: Special tokens added specifically for Whisper's speech recognition tasks.
What's inside:

```json
{
  "<|endoftext|>": 50256,
  "<|startoftranscript|>": 50258,
  "<|en|>": 50259,
  "<|transcribe|>": 50359,
  "<|notimestamps|>": 50363,
  ...
}
```
Key tokens explained:

| Token | Meaning |
|---|---|
| `<\|startoftranscript\|>` | Marks the start of the transcription |
| `<\|en\|>` | The audio is in English |
| `<\|transcribe\|>` | Perform transcription (not translation) |
| `<\|endoftext\|>` | Marks the end of the output |
| `<\|notimestamps\|>` | Generate text without timestamps |
Simple analogy: Like stage directions in a play β "BEGIN SCENE", "SPEAK IN ENGLISH", "END SCENE".
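During inference these special tokens are assembled into the fixed prompt the decoder starts generating from. A sketch using the IDs listed above:

```python
# Special-token IDs from the added_tokens.json excerpt above
added = {
    "<|startoftranscript|>": 50258,
    "<|en|>": 50259,
    "<|transcribe|>": 50359,
    "<|notimestamps|>": 50363,
}

# The decoder's starting prompt: begin transcript, English, transcribe, no timestamps
prompt = [added[t] for t in
          ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]]
print(prompt)  # [50258, 50259, 50359, 50363]
```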
### 12. `normalizer.json`
What it is: Rules for cleaning and standardizing text output.
What's inside:
- Rules to lowercase text
- Rules to remove punctuation
- Rules to convert numbers to words ("2" → "two")
- Rules to fix common spelling variations
How it's used:
- Applied to both predictions and ground truth before calculating WER
- Ensures fair comparison: "Hello!" and "hello" are treated the same
- This is Whisper's official English Text Normalizer
Simple analogy: Like a proofreader who standardizes all text before grading β removes extra spaces, fixes capitalization.
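A toy normalizer illustrating the idea. Whisper's real `EnglishTextNormalizer` does much more (spelling out numbers, expanding contractions, and many other cases):

```python
import re

def simple_normalize(text: str) -> str:
    """Toy normalizer: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)   # drop punctuation except apostrophes
    return re.sub(r"\s+", " ", text).strip()

# "Hello!" and "hello" become identical before WER is computed
print(simple_normalize("Hello!") == simple_normalize("hello"))  # True
```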
### 13. `special_tokens_map.json`
What it is: Maps special token names to their actual token strings.
What's inside:

```json
{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "additional_special_tokens": [
    "<|startoftranscript|>",
    "<|en|>",
    "<|transcribe|>",
    ...
  ]
}
```
How it's used:
- Tells the tokenizer which tokens have special meaning
- Used during both training and inference
- Ensures model knows when to start/stop generating
Simple analogy: Like a legend on a map β explains what each special symbol means.
## 🚀 How to Use This Model: Complete Step-by-Step Guide

### Prerequisites (Things You Need First)
**Step 1: Install Python**
- Download Python 3.11 from [python.org](https://python.org)
- Make sure to check "Add Python to PATH" during installation
**Step 2: Install Required Libraries.** Open a terminal/command prompt and run:

```
pip install transformers torch torchaudio soundfile librosa
```
This installs:

| Library | Purpose |
|---|---|
| `transformers` | Loads and runs the Whisper model |
| `torch` | Deep learning framework |
| `torchaudio` | Audio processing |
| `soundfile` | Reads audio files |
| `librosa` | Audio analysis |
### Option 1: Quickest Way (3 Lines of Code!)

```python
from transformers import pipeline

# Load model directly from HuggingFace
transcriber = pipeline(
    "automatic-speech-recognition",
    model="harphool17/whisper-large-v3-children-asr"
)

# Transcribe audio file
result = transcriber("your_audio_file.wav")
print(result["text"])
```
That's it! 🎉
### Option 2: Complete Control (Recommended)

```python
import torch
import soundfile as sf
import librosa
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# ── Step 1: Load Model ──
print("Loading model...")
processor = WhisperProcessor.from_pretrained(
    "harphool17/whisper-large-v3-children-asr"
)
model = WhisperForConditionalGeneration.from_pretrained(
    "harphool17/whisper-large-v3-children-asr"
)

# Use GPU if available (much faster!)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
print(f"Model loaded on: {device}")

# ── Step 2: Load Audio ──
def load_audio(audio_path):
    """Load audio file and convert to 16kHz mono"""
    audio, sr = sf.read(audio_path, dtype="float32")
    # Convert stereo to mono if needed
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    # Resample to 16kHz if needed
    if sr != 16000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    return audio

# ── Step 3: Transcribe ──
def transcribe(audio_path):
    """Transcribe children's speech from an audio file"""
    # Load audio
    audio = load_audio(audio_path)

    # Process audio into model input
    inputs = processor(
        audio,
        sampling_rate=16000,
        return_tensors="pt"
    )

    # Move to GPU if available
    input_features = inputs.input_features.to(device)

    # Generate transcription
    with torch.no_grad():
        predicted_ids = model.generate(
            input_features,
            language="en",
            task="transcribe",
            max_new_tokens=225,
        )

    # Convert numbers back to text
    transcription = processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True
    )[0]
    return transcription.lower().strip()

# ── Step 4: Use It! ──
audio_file = "child_speech.wav"  # Replace with your audio file path
result = transcribe(audio_file)
print(f"Child said: {result}")
```
### Supported Audio Formats
| Format | Extension | Supported |
|---|---|---|
| WAV | `.wav` | ✅ Yes |
| FLAC | `.flac` | ✅ Yes |
| MP3 | `.mp3` | ✅ Yes |
| OGG | `.ogg` | ✅ Yes |
Best format: WAV or FLAC (lossless quality)
Required: 16kHz sample rate, mono channel (code above handles this automatically!)
### Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| `ModuleNotFoundError: transformers` | Library not installed | Run `pip install transformers` |
| `CUDA out of memory` | GPU memory full | Run on CPU instead: `model = model.to("cpu")` |
| `RuntimeError: Input size mismatch` | Wrong audio format | Use the `load_audio()` function above |
| `OSError: model not found` | Wrong model name | Check spelling: `harphool17/whisper-large-v3-children-asr` |
## 📊 Training Details
| Parameter | Value |
|---|---|
| Base Model | OpenAI Whisper Large v3 |
| Training Data | 82,490 children's speech samples |
| Training Hours | ~185 hours of audio |
| Age Groups | 3-4, 5-7, 8-11 years |
| Training Steps | 2,576 steps |
| Batch Size | 4 (effective: 64 with gradient accumulation) |
| Learning Rate | 1e-5 |
| Optimizer | AdamW with cosine schedule |
| Precision | bfloat16 |
| GPU | 2x NVIDIA RTX 4500 Ada (24GB each) |
## 📈 Performance
| Split | WER | Description |
|---|---|---|
| Validation | 0.1030 | 10.3% word error rate |
| Public Test | 0.4432 | Competition test set |
Note: The gap between validation and public test WER is due to distribution shift: the test set contains more challenging audio conditions than the training data.
## 🔗 Related Resources
| Resource | Link |
|---|---|
| GitHub Code | harphool-singh/whisper-children-asr |
| Competition | DrivenData Pasketti Challenge |
| Base Model | openai/whisper-large-v3 |
| Demo Website | Coming Soon |
## 📝 Citation
If you use this model in your work, please cite:
```bibtex
@misc{whisper-children-asr-2026,
  author = {Harphool Singh},
  title = {Whisper Large v3 Fine-tuned for Children's Speech Recognition},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/harphool17/whisper-large-v3-children-asr}
}
```
## 👤 Author
Harphool Singh
- GitHub: @harphool-singh
- HuggingFace: @harphool17
Built with ❤️ for improving children's education technology