# Stage 1: LibriSpeech 960h (Top-4 Encoder + CTC + Speaker Kernel)
Speech-LLM checkpoint for ASR alignment, trained on LibriSpeech 960h with a CTC auxiliary loss and a SortFormer-based speaker kernel.
## Architecture
| Component | Model | Parameters |
|---|---|---|
| Encoder | nvidia/parakeet-ctc-0.6b (FastConformer CTC) | 600M |
| Projector | Conv1d + Transformer (1024 → 896, stride=2) | trainable |
| LLM | Qwen/Qwen2.5-0.5B | 500M |
- Freeze strategy: `projector_and_encoder_top` (projector fully trainable, top-4 encoder layers unfrozen)
- CTC auxiliary loss: λ=0.3
- Speaker kernel: SortFormer-based speaker activity injected into the encoder
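The freeze strategy and auxiliary loss above can be sketched as follows. This is a hypothetical illustration, not the repository's actual API: the names `freeze_encoder_except_top` and `total_loss` are invented here, and only the top-4 unfreezing and the λ=0.3 weighting are taken from the card.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the `projector_and_encoder_top` freeze strategy:
# freeze every encoder layer except the top `n_top` (here 4, per the card).
def freeze_encoder_except_top(encoder_layers, n_top=4):
    for layer in encoder_layers[:-n_top]:
        for p in layer.parameters():
            p.requires_grad = False

CTC_LAMBDA = 0.3  # auxiliary-loss weight stated in the card

def total_loss(llm_ce_loss, ctc_loss):
    # λ-weighted sum of the primary LLM cross-entropy loss and the CTC
    # auxiliary loss computed on the encoder output.
    return llm_ce_loss + CTC_LAMBDA * ctc_loss
```

With this weighting, a CTC loss of 1.0 adds 0.3 to the training objective, so the CTC head shapes the encoder without dominating the LLM loss.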
## Training Data
| Dataset | Samples | Description |
|---|---|---|
| LibriSpeech 960h | ~276k | Read English speech (standard ASR benchmark) |
## Training Configuration
- Optimizer: AdamW (lr=5e-5, weight_decay=0.01)
- Scheduler: warmup (2000 steps) + cosine decay (min_lr=1e-6)
- Effective batch size: 512 (8 GPUs × batch_size=16 × grad_accum=4)
- Total steps: 100,000 (trained to ~40,000)
- DeepSpeed ZeRO-2, AMP bfloat16
- Gradient clipping: 5.0
- Dynamic batching: max_frames_in_batch=400,000
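The stated schedule (2000-step linear warmup to lr=5e-5, then cosine decay to min_lr=1e-6 over the 100,000-step budget) can be sketched as a simple function. This is an illustrative reimplementation, not the training repository's code; the helper name `lr_at_step` is an assumption.

```python
import math

def lr_at_step(step, base_lr=5e-5, min_lr=1e-6, warmup=2000, total=100_000):
    # Linear warmup from 0 to base_lr over the first `warmup` steps.
    if step < warmup:
        return base_lr * step / warmup
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Note that at the reported stopping point (~40,000 steps) the cosine decay is less than halfway through, so the learning rate is still well above min_lr.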
## Usage

```python
import torch

# Load the full checkpoint; see the parakeet-turntaking repository for how
# to load it into the SpeechLLM model.
checkpoint = torch.load("final.pt", map_location="cpu")
```
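Since the checkpoint bundles encoder, projector, and LLM state dicts, loading it per component might look like the sketch below. The key names `encoder`/`projector`/`llm` and the helper `load_components` are assumptions; the actual layout depends on how the training script saved the checkpoint.

```python
import torch

def load_components(path):
    # Assumed layout: a top-level dict mapping component names to state dicts.
    # Verify against the parakeet-turntaking repository before relying on it.
    checkpoint = torch.load(path, map_location="cpu")
    return {k: checkpoint[k] for k in ("encoder", "projector", "llm") if k in checkpoint}
```

Each returned state dict would then be passed to the matching submodule's `load_state_dict`.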
## Files
- `final.pt`: Full model checkpoint (encoder + projector + LLM state dicts)
- `config.yaml`: Training configuration