Stage2: Synth-Canary 3-Tag Turn-Taking (tt_conf>=0.9, COMPLETE downsampled)

Speech-LLM checkpoint for turn-taking prediction, trained on synthesized audio with cross-annotation labels. Step 3000.

Tags

| Tag | Description |
| --- | --- |
| `<COMPLETE>` | Utterance is semantically complete |
| `<INCOMPLETE>` | Utterance is semantically incomplete |
| `<BACKCHANNEL>` | Brief acknowledgment |

Architecture

| Component | Model | Parameters |
| --- | --- | --- |
| Encoder | nvidia/parakeet-ctc-0.6b (FastConformer CTC) | 600M |
| Projector | Conv1d + Transformer (1024 → 896, stride=2) | trainable |
| LLM | Qwen/Qwen2.5-0.5B | 500M |

  • Freeze strategy: llm_and_projector (LLM and projector trainable; encoder frozen)
  • Speaker kernel: SortFormer-based speaker activity
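The projector row above can be sketched in PyTorch. The 1024 → 896 dimension mapping and the stride-2 Conv1d downsampling come from the table; the kernel size, head count, and layer count are illustrative assumptions, not the repository's actual values:

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Sketch of the encoder-to-LLM projector: a strided Conv1d halves the
    frame rate (stride=2) and maps 1024-dim encoder features into the
    896-dim LLM embedding space, followed by a small Transformer stack.
    Hyperparameters beyond 1024 -> 896 and stride=2 are assumptions."""

    def __init__(self, enc_dim=1024, llm_dim=896, stride=2):
        super().__init__()
        self.conv = nn.Conv1d(enc_dim, llm_dim, kernel_size=stride, stride=stride)
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        # x: (batch, frames, enc_dim) -> (batch, frames // stride, llm_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.transformer(x)
```

With these shapes, 10 encoder frames become 5 projected frames in the LLM embedding dimension.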

Stage1 Base: interact_canary_wordlevel (step 70,000)

This Stage2 model is initialized from a Stage1 ASR-alignment checkpoint.

  • W&B run: stage1-interact-canary-wordlevel in project parakeet-turntaking
  • Freeze strategy: all_trainable
  • Optimizer: AdamW (lr=5e-6)
  • Data: SeamlessM4T word-level aligned transcripts (Canary-transcribed), ~848k utterances
    • Source: SeamlessM4T expressive speech corpus
    • Forced-aligned at word level with Parakeet CTC; filtered by mean_prob >= 0.9
  • Stage1's own base: stage1_librispeech_spk step 90,000 (LibriSpeech 960h with speaker kernel)

| Split | Samples |
| --- | --- |
| Train | 847,835 |
| Dev | 10,156 |
| Test | 12,261 |
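The Stage1 alignment-confidence filter (mean_prob >= 0.9) can be sketched as a simple pass over per-word CTC posteriors. The field name `word_probs` is an assumption for illustration, not the repository's actual schema:

```python
def filter_by_alignment_confidence(utterances, threshold=0.9):
    """Keep utterances whose forced alignment has mean word probability
    >= threshold, mirroring the Stage1 mean_prob >= 0.9 filter.
    Each utterance is assumed to carry a "word_probs" list of per-word
    CTC posteriors (an illustrative field name)."""
    kept = []
    for utt in utterances:
        probs = utt["word_probs"]
        mean_prob = sum(probs) / len(probs) if probs else 0.0
        if mean_prob >= threshold:
            kept.append(utt)
    return kept
```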

Stage2 Training Data

Source: CANDOR corpus — naturalistic English dyadic conversations.

Audio is voice-cloned synthesized speech from CANDOR transcripts using Canary TTS, verified to have WER=0 against the original transcript. Turn-taking labels are derived from a cross-annotation pipeline (LLM + TEN confidence scoring).

Data filters applied:

  • tt_confidence >= 0.9 for COMPLETE and INCOMPLETE labels (high TEN agreement)
  • BC-priority conflict filter (when LLM and TEN disagree, BACKCHANNEL takes priority)
  • COMPLETE downsampled to match INCOMPLETE count (ratio=1.0)
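The first two filters can be sketched as follows. This is one plausible reading of the BC-priority rule (keep a disagreement only when one annotator says BACKCHANNEL, and resolve it to BACKCHANNEL); the field names `llm_label`, `ten_label`, and `tt_confidence` are assumptions for illustration:

```python
def apply_stage2_filters(examples, conf_threshold=0.9):
    """Sketch of the Stage2 label filters: COMPLETE/INCOMPLETE require
    tt_confidence >= conf_threshold, and when the LLM and TEN annotations
    disagree, BACKCHANNEL takes priority. Field names are illustrative."""
    kept = []
    for ex in examples:
        if ex["llm_label"] != ex["ten_label"]:
            # BC-priority conflict filter: on disagreement, keep the example
            # only if either annotator labeled it BACKCHANNEL.
            if "BACKCHANNEL" in (ex["llm_label"], ex["ten_label"]):
                kept.append({**ex, "label": "BACKCHANNEL"})
            continue
        label = ex["llm_label"]
        # High-agreement gate for the two completion labels.
        if label in ("COMPLETE", "INCOMPLETE") and ex["tt_confidence"] < conf_threshold:
            continue
        kept.append({**ex, "label": label})
    return kept
```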

Label distribution:

| Split | COMPLETE | INCOMPLETE | BACKCHANNEL | Total |
| --- | --- | --- | --- | --- |
| Train | 14,146 | 14,146 | 2,717 | 31,009 |
| Dev | 6,850 | 1,975 | 376 | 9,201 |
| Test | 8,323 | 2,232 | 439 | 10,994 |

Note: Downsampling (COMPLETE→INCOMPLETE ratio=1.0) is applied only to the train split.
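The train-only downsampling step can be sketched as random subsampling of COMPLETE examples to match the INCOMPLETE count (ratio=1.0). The `label` field and the seeded RNG are illustrative assumptions:

```python
import random

def downsample_complete(examples, ratio=1.0, seed=0):
    """Sketch of the train-split balancing: randomly subsample COMPLETE
    examples so count(COMPLETE) == ratio * count(INCOMPLETE); other labels
    (BACKCHANNEL) pass through unchanged."""
    complete = [e for e in examples if e["label"] == "COMPLETE"]
    incomplete = [e for e in examples if e["label"] == "INCOMPLETE"]
    other = [e for e in examples if e["label"] not in ("COMPLETE", "INCOMPLETE")]
    target = int(ratio * len(incomplete))
    if len(complete) > target:
        complete = random.Random(seed).sample(complete, target)
    return complete + incomplete + other
```

Applied to the raw train counts, this is what reduces COMPLETE to 14,146, matching the 14,146 INCOMPLETE examples.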

Stage2 Training Configuration

  • Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
  • Scheduler: warmup (1500 steps) + cosine decay (min_lr=1e-6)
  • Effective batch size: 512 (4 GPUs × batch_size=16 × grad_accum=8)
  • Tag loss weight: 5.0
  • Weighted sampling (max_weight=50.0)
  • DeepSpeed ZeRO-2, AMP bfloat16
  • Gradient clipping: 20.0
  • Hardware: 4× NVIDIA A40
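The warmup + cosine schedule above can be sketched as a closed-form learning-rate function. The 1500 warmup steps, lr=2e-5, and min_lr=1e-6 come from the config; total_steps=3000 is an assumption matching the released checkpoint step, and the actual training code may compute the schedule differently:

```python
import math

def lr_at_step(step, base_lr=2e-5, min_lr=1e-6, warmup_steps=1500, total_steps=3000):
    """Sketch of linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup from 0
    # Cosine decay from base_lr at end of warmup to min_lr at total_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Note also that the effective batch size follows from the parallelism: 4 GPUs × 16 per-GPU batch × 8 gradient-accumulation steps = 512.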

Usage

```python
import torch

# Load the checkpoint on CPU; it bundles separate state dicts for the
# encoder, projector, and LLM (see Files below).
checkpoint = torch.load("step_3000.pt", map_location="cpu")

# Load the state dicts into a SpeechLLM model; see the
# parakeet-turntaking repository for model construction details.
```

Files

  • step_3000.pt — Full model checkpoint (encoder + projector + LLM state dicts)
  • config.yaml — Training configuration