Stage1: LibriSpeech 960h (Top-4 Encoder + CTC + Speaker Kernel)

Speech-LLM checkpoint for ASR alignment, trained on LibriSpeech 960h with CTC auxiliary loss and SortFormer-based speaker kernel.

Architecture

| Component | Model | Parameters |
|---|---|---|
| Encoder | nvidia/parakeet-ctc-0.6b (FastConformer CTC) | 600M |
| Projector | Conv1d + Transformer (1024 → 896, stride=2) | trainable |
| LLM | Qwen/Qwen2.5-0.5B | 500M |
  • Freeze strategy: projector_and_encoder_top — projector fully trainable, top-4 encoder layers unfrozen
  • CTC auxiliary loss: λ=0.3
  • Speaker kernel: SortFormer-based speaker activity injected into encoder
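To make the projector row concrete, here is a minimal sketch of a Conv1d + Transformer projector that halves the encoder frame rate (stride=2) and maps the 1024-dim encoder output into the 896-dim Qwen2.5-0.5B embedding space. The layer sizes and kernel width are assumptions for illustration; the actual module lives in the parakeet-turntaking repository.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Hypothetical sketch: a strided Conv1d downsamples encoder frames 2x and
    projects 1024 -> 896, then a small Transformer refines the projected frames."""

    def __init__(self, enc_dim=1024, llm_dim=896, n_layers=2, n_heads=8):
        super().__init__()
        self.conv = nn.Conv1d(enc_dim, llm_dim, kernel_size=3, stride=2, padding=1)
        layer = nn.TransformerEncoderLayer(llm_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        # x: (batch, frames, 1024) -> (batch, frames // 2, 896)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.transformer(x)

proj = Projector()
out = proj(torch.randn(2, 100, 1024))
print(out.shape)  # torch.Size([2, 50, 896])
```

Under the freeze strategy above, this module is fully trainable while only the top-4 encoder layers receive gradients.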

Training Data

| Dataset | Samples | Description |
|---|---|---|
| LibriSpeech 960h | ~276k | Read English speech (standard ASR benchmark) |

Training Configuration

  • Optimizer: AdamW (lr=5e-5, weight_decay=0.01)
  • Scheduler: warmup (2000 steps) + cosine decay (min_lr=1e-6)
  • Effective batch size: 512 (8 GPUs × batch_size=16 × grad_accum=4)
  • Total steps: 100,000 (trained to ~40,000)
  • DeepSpeed ZeRO-2, AMP bfloat16
  • Gradient clipping: 5.0
  • Dynamic batching: max_frames_in_batch=400,000
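The learning-rate schedule above (2000-step linear warmup, then cosine decay from lr=5e-5 to min_lr=1e-6 over 100,000 total steps) can be sketched as a standalone function; the function name is mine, but the constants come from the configuration listed above.

```python
import math

def lr_at(step, base_lr=5e-5, warmup=2000, total=100_000, min_lr=1e-6):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total - warmup)  # 0.0 at peak, 1.0 at the end
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(1000))     # mid-warmup: 2.5e-05
print(lr_at(2000))     # peak: 5e-05
print(lr_at(100_000))  # fully decayed: 1e-06
```

Since training stopped at ~40,000 steps, the final checkpoint was saved partway down the cosine curve rather than at min_lr.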

Usage

```python
import torch

# Load on CPU; the checkpoint holds separate state dicts for the
# encoder, projector, and LLM.
checkpoint = torch.load("final.pt", map_location="cpu")
# Load into the SpeechLLM model — see the parakeet-turntaking repository for details.
```

Files

  • final.pt — Full model checkpoint (encoder + projector + LLM state dicts)
  • config.yaml — Training configuration
