# Stage2: Synth-Canary 3-Tag Turn-Taking (tt_conf>=0.9, COMPLETE downsampled, SortFormer speaker kernel)

Speech-LLM checkpoint for turn-taking prediction, trained on synthesized audio with cross-annotation labels. Step 2000.

This variant uses a SortFormer-based speaker activity kernel throughout both Stage1 and Stage2 training.
## Tags

| Tag | Description |
|---|---|
| `<COMPLETE>` | Utterance is semantically complete |
| `<INCOMPLETE>` | Utterance is semantically incomplete |
| `<BACKCHANNEL>` | Brief acknowledgment |
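As a sketch of how these tags might drive a downstream dialog policy, the snippet below maps each predicted tag to an action. The tag strings match the table above; the action names and the policy itself are illustrative assumptions, not part of this model.

```python
def next_action(tag: str) -> str:
    # Hypothetical turn-taking policy keyed on the model's predicted tag.
    actions = {
        "<COMPLETE>": "take_turn",        # speaker finished; agent may respond
        "<INCOMPLETE>": "keep_listening", # utterance unfinished; hold back
        "<BACKCHANNEL>": "continue",      # brief acknowledgment; do not yield
    }
    return actions.get(tag, "keep_listening")  # conservative default
```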
## Architecture

| Component | Model | Parameters |
|---|---|---|
| Encoder | nvidia/parakeet-ctc-0.6b (FastConformer CTC) | 600M |
| Projector | Conv1d + Transformer (1024 → 896, stride=2) | trainable |
| LLM | Qwen/Qwen2.5-0.5B | 500M |
- Freeze strategy: `llm_and_projector` — LLM and projector frozen, encoder trainable (top-0 unfrozen)
- Speaker kernel: SortFormer-based speaker activity (used in both Stage1 and Stage2)
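A minimal sketch of what a Conv1d + Transformer projector with these dimensions (1024 → 896, stride-2 downsampling) could look like in PyTorch. The layer count, head count, and kernel size are assumptions for illustration; the actual module lives in the parakeet-turntaking repository.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Hypothetical sketch: downsample encoder frames 2x and map 1024 -> 896."""
    def __init__(self, in_dim=1024, out_dim=896, stride=2):
        super().__init__()
        # Strided Conv1d halves the frame rate and changes the channel dim
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=stride, stride=stride)
        layer = nn.TransformerEncoderLayer(d_model=out_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):            # x: (batch, frames, 1024)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, frames//2, 896)
        return self.transformer(x)
```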
## Stage1 Base: interact_canary_wordlevel_spk_sf (step 70,000)

This Stage2 model is initialized from a Stage1 ASR-alignment checkpoint trained with the SortFormer speaker kernel.

- W&B run: `stage1-interact-canary-wordlevel-spk-sf` in project `parakeet-turntaking`
- Freeze strategy: `all_trainable`
- Data: SeamlessM4T word-level aligned transcripts (Canary-transcribed), ~848k utterances
- Stage1's own base: `stage1_librispeech_spk_sf` (LibriSpeech 960h with SortFormer speaker kernel)
## Stage2 Training Data

Source: CANDOR corpus — naturalistic English dyadic conversations.

Audio is voice-cloned synthesized speech from CANDOR transcripts using Canary TTS, verified to have WER=0 against the original transcript. Turn-taking labels are derived from a cross-annotation pipeline (LLM + TEN confidence scoring).

Data filters applied:
- `tt_confidence >= 0.9` for COMPLETE and INCOMPLETE labels (high TEN agreement)
- BC-priority conflict filter (when LLM and TEN disagree, BACKCHANNEL takes priority)
- COMPLETE downsampled to match the INCOMPLETE count (ratio=1.0)
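The three filters above can be sketched as a single pass over labeled rows. This is an illustrative reconstruction: the field names (`label`, `tt_confidence`) and the function are assumptions, not the actual pipeline schema.

```python
import random

def filter_and_downsample(rows, conf_threshold=0.9, ratio=1.0, seed=0):
    """Hypothetical sketch of the confidence, BC-priority, and downsampling filters."""
    kept = []
    for r in rows:
        if r["label"] == "BACKCHANNEL":
            kept.append(r)   # BC-priority: kept regardless of confidence
        elif r["tt_confidence"] >= conf_threshold:
            kept.append(r)   # COMPLETE / INCOMPLETE require high TEN agreement
    complete = [r for r in kept if r["label"] == "COMPLETE"]
    incomplete = [r for r in kept if r["label"] == "INCOMPLETE"]
    backchannel = [r for r in kept if r["label"] == "BACKCHANNEL"]
    # Downsample COMPLETE to ratio x the INCOMPLETE count (train split only)
    n = min(len(complete), int(len(incomplete) * ratio))
    complete = random.Random(seed).sample(complete, n)
    return complete + incomplete + backchannel
```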
Label distribution:
| Split | COMPLETE | INCOMPLETE | BACKCHANNEL | Total |
|---|---|---|---|---|
| Train | 14,146 | 14,146 | 2,717 | 31,009 |
| Dev | 6,850 | 1,975 | 376 | 9,201 |
| Test | 8,323 | 2,232 | 439 | 10,994 |
Note: Downsampling (COMPLETE→INCOMPLETE ratio=1.0) is applied only to the train split.
## Stage2 Training Configuration
- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Scheduler: warmup (1500 steps) + cosine decay (min_lr=1e-6)
- Effective batch size: 512 (4 GPUs × batch_size=16 × grad_accum=8)
- Tag loss weight: 5.0
- Weighted sampling (max_weight=50.0)
- DeepSpeed ZeRO-2, AMP bfloat16
- Gradient clipping: 20.0
- Hardware: 4× NVIDIA A40
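The warmup-plus-cosine schedule above can be sketched as a closed-form function of the step. The total-step horizon is an assumption for illustration; the card does not state the full training length.

```python
import math

def lr_at(step, base_lr=2e-5, warmup=1500, total_steps=20000, min_lr=1e-6):
    """Hypothetical sketch: linear warmup for `warmup` steps, then cosine decay to min_lr."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```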
## Usage

```python
import torch

checkpoint = torch.load("step_2000.pt", map_location="cpu")
# Load into the SpeechLLM model — see the parakeet-turntaking repository for details
```
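Since the checkpoint bundles per-component state dicts, loading each one separately might look like the self-contained sketch below. The `"encoder"` key and the stand-in module are illustrative assumptions; check the actual key names against the parakeet-turntaking repository.

```python
import torch
import torch.nn as nn

# Stand-in for one component (e.g. the encoder) to demonstrate the pattern
dummy = nn.Linear(8, 8)
torch.save({"encoder": dummy.state_dict()}, "demo_ckpt.pt")

# Hypothetical: each component's weights are restored from its own state dict
ckpt = torch.load("demo_ckpt.pt", map_location="cpu")
restored = nn.Linear(8, 8)
restored.load_state_dict(ckpt["encoder"])
```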
## Files

- `step_2000.pt` — full model checkpoint (encoder + projector + LLM state dicts)
- `config.yaml` — training configuration