Stage2: Synth-Canary 3-Tag Turn-Taking (tt_conf>=0.9, COMPLETE downsampled)

Speech-LLM checkpoint for turn-taking prediction, trained on synthesized audio with cross-annotation labels. Step 3000.

Tags

| Tag | Description |
| --- | --- |
| `<COMPLETE>` | Utterance is semantically complete |
| `<INCOMPLETE>` | Utterance is semantically incomplete |
| `<BACKCHANNEL>` | Brief acknowledgment |

Architecture

| Component | Model | Parameters |
| --- | --- | --- |
| Encoder | nvidia/parakeet-ctc-0.6b (FastConformer CTC) | 600M |
| Projector | Conv1d + Transformer (1024 → 896, stride=2) | trainable |
| LLM | Qwen/Qwen2.5-0.5B | 500M |

  • Freeze strategy: llm_and_projector (LLM and projector trainable; encoder frozen)
  • Speaker kernel: SortFormer-based speaker activity
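The projector row above can be sketched in PyTorch. The 1024 → 896 dimension mapping and the stride-2 Conv1d downsampling come from the table; the kernel size, head count, and layer count are illustrative assumptions, not the repository's actual values:

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Sketch of the encoder-to-LLM projector: a strided Conv1d halves the
    frame rate (stride=2) and maps 1024-dim encoder features into the
    896-dim LLM embedding space, followed by a small Transformer stack.
    Hyperparameters beyond 1024 -> 896 and stride=2 are assumptions."""

    def __init__(self, enc_dim=1024, llm_dim=896, stride=2):
        super().__init__()
        self.conv = nn.Conv1d(enc_dim, llm_dim, kernel_size=stride, stride=stride)
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        # x: (batch, frames, enc_dim) -> (batch, frames // stride, llm_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.transformer(x)
```

With these shapes, 10 encoder frames become 5 projected frames in the LLM embedding dimension.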

Stage1 Base: interact_canary_wordlevel (step 70,000)

This Stage2 model is initialized from a Stage1 ASR-alignment checkpoint.

  • W&B run: stage1-interact-canary-wordlevel in project parakeet-turntaking
  • Freeze strategy: all_trainable
  • Optimizer: AdamW (lr=5e-6)
  • Data: SeamlessM4T word-level aligned transcripts (Canary-transcribed), ~848k utterances
    • Source: SeamlessM4T expressive speech corpus
    • Forced-aligned at word level with Parakeet CTC; filtered by mean_prob >= 0.9
  • Stage1's own base: stage1_librispeech_spk step 90,000 (LibriSpeech 960h with speaker kernel)

| Split | Samples |
| --- | --- |
| Train | 847,835 |
| Dev | 10,156 |
| Test | 12,261 |
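The Stage1 alignment-confidence filter (mean_prob >= 0.9) can be sketched as a simple pass over per-word CTC posteriors. The field name `word_probs` is an assumption for illustration, not the repository's actual schema:

```python
def filter_by_alignment_confidence(utterances, threshold=0.9):
    """Keep utterances whose forced alignment has mean word probability
    >= threshold, mirroring the Stage1 mean_prob >= 0.9 filter.
    Each utterance is assumed to carry a "word_probs" list of per-word
    CTC posteriors (an illustrative field name)."""
    kept = []
    for utt in utterances:
        probs = utt["word_probs"]
        mean_prob = sum(probs) / len(probs) if probs else 0.0
        if mean_prob >= threshold:
            kept.append(utt)
    return kept
```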

Stage2 Training Data

Source: CANDOR corpus — naturalistic English dyadic conversations.

Audio is voice-cloned synthesized speech from CANDOR transcripts using Canary TTS, verified to have WER=0 against the original transcript. Turn-taking labels are derived from a cross-annotation pipeline (LLM + TEN confidence scoring).

Data filters applied:

  • tt_confidence >= 0.9 for COMPLETE and INCOMPLETE labels (high TEN agreement)
  • BC-priority conflict filter (when LLM and TEN disagree, BACKCHANNEL takes priority)
  • COMPLETE downsampled to match INCOMPLETE count (ratio=1.0)
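The first two filters can be sketched as follows. This is one plausible reading of the BC-priority rule (keep a disagreement only when one annotator says BACKCHANNEL, and resolve it to BACKCHANNEL); the field names `llm_label`, `ten_label`, and `tt_confidence` are assumptions for illustration:

```python
def apply_stage2_filters(examples, conf_threshold=0.9):
    """Sketch of the Stage2 label filters: COMPLETE/INCOMPLETE require
    tt_confidence >= conf_threshold, and when the LLM and TEN annotations
    disagree, BACKCHANNEL takes priority. Field names are illustrative."""
    kept = []
    for ex in examples:
        if ex["llm_label"] != ex["ten_label"]:
            # BC-priority conflict filter: on disagreement, keep the example
            # only if either annotator labeled it BACKCHANNEL.
            if "BACKCHANNEL" in (ex["llm_label"], ex["ten_label"]):
                kept.append({**ex, "label": "BACKCHANNEL"})
            continue
        label = ex["llm_label"]
        # High-agreement gate for the two completion labels.
        if label in ("COMPLETE", "INCOMPLETE") and ex["tt_confidence"] < conf_threshold:
            continue
        kept.append({**ex, "label": label})
    return kept
```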

Label distribution:

| Split | COMPLETE | INCOMPLETE | BACKCHANNEL | Total |
| --- | --- | --- | --- | --- |
| Train | 14,146 | 14,146 | 2,717 | 31,009 |
| Dev | 6,850 | 1,975 | 376 | 9,201 |
| Test | 8,323 | 2,232 | 439 | 10,994 |

Note: Downsampling (COMPLETE→INCOMPLETE ratio=1.0) is applied only to the train split.
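The train-only downsampling step can be sketched as random subsampling of COMPLETE examples to match the INCOMPLETE count (ratio=1.0). The `label` field and the seeded RNG are illustrative assumptions:

```python
import random

def downsample_complete(examples, ratio=1.0, seed=0):
    """Sketch of the train-split balancing: randomly subsample COMPLETE
    examples so count(COMPLETE) == ratio * count(INCOMPLETE); other labels
    (BACKCHANNEL) pass through unchanged."""
    complete = [e for e in examples if e["label"] == "COMPLETE"]
    incomplete = [e for e in examples if e["label"] == "INCOMPLETE"]
    other = [e for e in examples if e["label"] not in ("COMPLETE", "INCOMPLETE")]
    target = int(ratio * len(incomplete))
    if len(complete) > target:
        complete = random.Random(seed).sample(complete, target)
    return complete + incomplete + other
```

Applied to the raw train counts, this is what reduces COMPLETE to 14,146, matching the 14,146 INCOMPLETE examples.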

Stage2 Training Configuration

  • Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
  • Scheduler: warmup (1500 steps) + cosine decay (min_lr=1e-6)
  • Effective batch size: 512 (4 GPUs × batch_size=16 × grad_accum=8)
  • Tag loss weight: 5.0
  • Weighted sampling (max_weight=50.0)
  • DeepSpeed ZeRO-2, AMP bfloat16
  • Gradient clipping: 20.0
  • Hardware: 4× NVIDIA A40
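The warmup + cosine schedule above can be sketched as a closed-form learning-rate function. The 1500 warmup steps, lr=2e-5, and min_lr=1e-6 come from the config; total_steps=3000 is an assumption matching the released checkpoint step, and the actual training code may compute the schedule differently:

```python
import math

def lr_at_step(step, base_lr=2e-5, min_lr=1e-6, warmup_steps=1500, total_steps=3000):
    """Sketch of linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup from 0
    # Cosine decay from base_lr at end of warmup to min_lr at total_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Note also that the effective batch size follows from the parallelism: 4 GPUs × 16 per-GPU batch × 8 gradient-accumulation steps = 512.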

Usage

```python
import torch

# Load the checkpoint on CPU; it bundles separate state dicts for the
# encoder, projector, and LLM (see Files below).
checkpoint = torch.load("step_3000.pt", map_location="cpu")

# Load the state dicts into a SpeechLLM model; see the
# parakeet-turntaking repository for model construction details.
```

Files

  • step_3000.pt — Full model checkpoint (encoder + projector + LLM state dicts)
  • config.yaml — Training configuration