# Stage 1: LibriSpeech 960h (Top-4 Encoder + CTC + Speaker Kernel)
Speech-LLM checkpoint for ASR alignment, trained on LibriSpeech 960h with a CTC auxiliary loss and a SortFormer-based speaker kernel.
## Architecture
| Component | Model | Parameters |
|---|---|---|
| Encoder | nvidia/parakeet-ctc-0.6b (FastConformer CTC) | 600M |
| Projector | Conv1d + Transformer (1024 → 896, stride=2) | trainable |
| LLM | Qwen/Qwen2.5-0.5B | 500M |
- Freeze strategy: `projector_and_encoder_top` (projector fully trainable, top-4 encoder layers unfrozen)
- CTC auxiliary loss: λ=0.3
- Speaker kernel: SortFormer-based speaker activity injected into the encoder
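The freeze strategy and auxiliary loss above can be sketched as follows. This is a hypothetical illustration, not the repository's actual API: the names `freeze_encoder_except_top` and `total_loss` are invented here, and only the top-4 unfreezing and the λ=0.3 weighting are taken from the card.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the `projector_and_encoder_top` freeze strategy:
# freeze every encoder layer except the top `n_top` (here 4, per the card).
def freeze_encoder_except_top(encoder_layers, n_top=4):
    for layer in encoder_layers[:-n_top]:
        for p in layer.parameters():
            p.requires_grad = False

CTC_LAMBDA = 0.3  # auxiliary-loss weight stated in the card

def total_loss(llm_ce_loss, ctc_loss):
    # λ-weighted sum of the primary LLM cross-entropy loss and the CTC
    # auxiliary loss computed on the encoder output.
    return llm_ce_loss + CTC_LAMBDA * ctc_loss
```

With this weighting, a CTC loss of 1.0 adds 0.3 to the training objective, so the CTC head shapes the encoder without dominating the LLM loss.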
## Training Data
| Dataset | Samples | Description |
|---|---|---|
| LibriSpeech 960h | ~276k | Read English speech (standard ASR benchmark) |
## Training Configuration
- Optimizer: AdamW (lr=5e-5, weight_decay=0.01)
- Scheduler: warmup (2000 steps) + cosine decay (min_lr=1e-6)
- Effective batch size: 512 (8 GPUs × batch_size=16 × grad_accum=4)
- Total steps: 100,000 (trained to ~40,000)
- DeepSpeed ZeRO-2, AMP bfloat16
- Gradient clipping: 5.0
- Dynamic batching: max_frames_in_batch=400,000
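The stated schedule (2000-step linear warmup to lr=5e-5, then cosine decay to min_lr=1e-6 over the 100,000-step budget) can be sketched as a simple function. This is an illustrative reimplementation, not the training repository's code; the helper name `lr_at_step` is an assumption.

```python
import math

def lr_at_step(step, base_lr=5e-5, min_lr=1e-6, warmup=2000, total=100_000):
    # Linear warmup from 0 to base_lr over the first `warmup` steps.
    if step < warmup:
        return base_lr * step / warmup
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Note that at the reported stopping point (~40,000 steps) the cosine decay is less than halfway through, so the learning rate is still well above min_lr.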
## Usage

```python
import torch

# Load the full checkpoint; see the parakeet-turntaking repository for how
# to load it into the SpeechLLM model.
checkpoint = torch.load("final.pt", map_location="cpu")
```
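Since the checkpoint bundles encoder, projector, and LLM state dicts, loading it per component might look like the sketch below. The key names `encoder`/`projector`/`llm` and the helper `load_components` are assumptions; the actual layout depends on how the training script saved the checkpoint.

```python
import torch

def load_components(path):
    # Assumed layout: a top-level dict mapping component names to state dicts.
    # Verify against the parakeet-turntaking repository before relying on it.
    checkpoint = torch.load(path, map_location="cpu")
    return {k: checkpoint[k] for k in ("encoder", "projector", "llm") if k in checkpoint}
```

Each returned state dict would then be passed to the matching submodule's `load_state_dict`.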
## Files
- `final.pt`: Full model checkpoint (encoder + projector + LLM state dicts)
- `config.yaml`: Training configuration