# Stage2: Synth-Canary 3-Tag Turn-Taking (tt_conf>=0.9, COMPLETE downsampled, SortFormer speaker kernel)

Speech-LLM checkpoint for turn-taking prediction, trained on synthesized audio with cross-annotation labels. Step 2000.

This variant uses a SortFormer-based speaker activity kernel throughout both Stage1 and Stage2 training.
## Tags

| Tag | Description |
|---|---|
| `<COMPLETE>` | Utterance is semantically complete |
| `<INCOMPLETE>` | Utterance is semantically incomplete |
| `<BACKCHANNEL>` | Brief acknowledgment |
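As a sketch of how these tags might drive a downstream dialog policy, the snippet below maps each predicted tag to an action. The tag strings match the table above; the action names and the policy itself are illustrative assumptions, not part of this model.

```python
def next_action(tag: str) -> str:
    # Hypothetical turn-taking policy keyed on the model's predicted tag.
    actions = {
        "<COMPLETE>": "take_turn",        # speaker finished; agent may respond
        "<INCOMPLETE>": "keep_listening", # utterance unfinished; hold back
        "<BACKCHANNEL>": "continue",      # brief acknowledgment; do not yield
    }
    return actions.get(tag, "keep_listening")  # conservative default
```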
## Architecture

| Component | Model | Parameters |
|---|---|---|
| Encoder | nvidia/parakeet-ctc-0.6b (FastConformer CTC) | 600M |
| Projector | Conv1d + Transformer (1024 → 896, stride=2) | trainable |
| LLM | Qwen/Qwen2.5-0.5B | 500M |
- Freeze strategy: `llm_and_projector` — LLM and projector frozen, encoder trainable (top-0 unfrozen)
- Speaker kernel: SortFormer-based speaker activity (used in both Stage1 and Stage2)
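A minimal sketch of what a Conv1d + Transformer projector with these dimensions (1024 → 896, stride-2 downsampling) could look like in PyTorch. The layer count, head count, and kernel size are assumptions for illustration; the actual module lives in the parakeet-turntaking repository.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Hypothetical sketch: downsample encoder frames 2x and map 1024 -> 896."""
    def __init__(self, in_dim=1024, out_dim=896, stride=2):
        super().__init__()
        # Strided Conv1d halves the frame rate and changes the channel dim
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=stride, stride=stride)
        layer = nn.TransformerEncoderLayer(d_model=out_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):            # x: (batch, frames, 1024)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, frames//2, 896)
        return self.transformer(x)
```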
## Stage1 Base: interact_canary_wordlevel_spk_sf (step 70,000)

This Stage2 model is initialized from a Stage1 ASR-alignment checkpoint trained with the SortFormer speaker kernel.

- W&B run: `stage1-interact-canary-wordlevel-spk-sf` in project `parakeet-turntaking`
- Freeze strategy: `all_trainable`
- Data: SeamlessM4T word-level aligned transcripts (Canary-transcribed), ~848k utterances
- Stage1's own base: `stage1_librispeech_spk_sf` (LibriSpeech 960h with SortFormer speaker kernel)
## Stage2 Training Data

Source: CANDOR corpus — naturalistic English dyadic conversations.

Audio is voice-cloned synthesized speech from CANDOR transcripts using Canary TTS, verified to have WER=0 against the original transcript. Turn-taking labels are derived from a cross-annotation pipeline (LLM + TEN confidence scoring).

Data filters applied:
- `tt_confidence >= 0.9` for COMPLETE and INCOMPLETE labels (high TEN agreement)
- BC-priority conflict filter (when LLM and TEN disagree, BACKCHANNEL takes priority)
- COMPLETE downsampled to match the INCOMPLETE count (ratio=1.0)
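The three filters above can be sketched as a single pass over labeled rows. This is an illustrative reconstruction: the field names (`label`, `tt_confidence`) and the function are assumptions, not the actual pipeline schema.

```python
import random

def filter_and_downsample(rows, conf_threshold=0.9, ratio=1.0, seed=0):
    """Hypothetical sketch of the confidence, BC-priority, and downsampling filters."""
    kept = []
    for r in rows:
        if r["label"] == "BACKCHANNEL":
            kept.append(r)   # BC-priority: kept regardless of confidence
        elif r["tt_confidence"] >= conf_threshold:
            kept.append(r)   # COMPLETE / INCOMPLETE require high TEN agreement
    complete = [r for r in kept if r["label"] == "COMPLETE"]
    incomplete = [r for r in kept if r["label"] == "INCOMPLETE"]
    backchannel = [r for r in kept if r["label"] == "BACKCHANNEL"]
    # Downsample COMPLETE to ratio x the INCOMPLETE count (train split only)
    n = min(len(complete), int(len(incomplete) * ratio))
    complete = random.Random(seed).sample(complete, n)
    return complete + incomplete + backchannel
```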
Label distribution:
| Split | COMPLETE | INCOMPLETE | BACKCHANNEL | Total |
|---|---|---|---|---|
| Train | 14,146 | 14,146 | 2,717 | 31,009 |
| Dev | 6,850 | 1,975 | 376 | 9,201 |
| Test | 8,323 | 2,232 | 439 | 10,994 |
Note: Downsampling (COMPLETE→INCOMPLETE ratio=1.0) is applied only to the train split.
## Stage2 Training Configuration
- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Scheduler: warmup (1500 steps) + cosine decay (min_lr=1e-6)
- Effective batch size: 512 (4 GPUs × batch_size=16 × grad_accum=8)
- Tag loss weight: 5.0
- Weighted sampling (max_weight=50.0)
- DeepSpeed ZeRO-2, AMP bfloat16
- Gradient clipping: 20.0
- Hardware: 4× NVIDIA A40
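The warmup-plus-cosine schedule above can be sketched as a closed-form function of the step. The total-step horizon is an assumption for illustration; the card does not state the full training length.

```python
import math

def lr_at(step, base_lr=2e-5, warmup=1500, total_steps=20000, min_lr=1e-6):
    """Hypothetical sketch: linear warmup for `warmup` steps, then cosine decay to min_lr."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```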
## Usage

```python
import torch

checkpoint = torch.load("step_2000.pt", map_location="cpu")
# Load into the SpeechLLM model — see the parakeet-turntaking repository for details
```
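Since the checkpoint bundles per-component state dicts, loading each one separately might look like the self-contained sketch below. The `"encoder"` key and the stand-in module are illustrative assumptions; check the actual key names against the parakeet-turntaking repository.

```python
import torch
import torch.nn as nn

# Stand-in for one component (e.g. the encoder) to demonstrate the pattern
dummy = nn.Linear(8, 8)
torch.save({"encoder": dummy.state_dict()}, "demo_ckpt.pt")

# Hypothetical: each component's weights are restored from its own state dict
ckpt = torch.load("demo_ckpt.pt", map_location="cpu")
restored = nn.Linear(8, 8)
restored.load_state_dict(ckpt["encoder"])
```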
## Files

- `step_2000.pt` — full model checkpoint (encoder + projector + LLM state dicts)
- `config.yaml` — training configuration