# Stage2: Synth-Canary 3-Tag Turn-Taking (tt_conf>=0.9, COMPLETE downsampled)
Speech-LLM checkpoint for turn-taking prediction, trained on synthesized audio with cross-annotation labels. Step 3000.
## Tags

| Tag | Description |
|---|---|
| `<COMPLETE>` | Utterance is semantically complete |
| `<INCOMPLETE>` | Utterance is semantically incomplete |
| `<BACKCHANNEL>` | Brief acknowledgment |
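At inference time the model emits one of these three tags in its decoded output. A minimal sketch of pulling the predicted tag out of a decoded string (the regex and helper name are illustrative assumptions, not the repository's API):

```python
import re

# The three turn-taking tags from the table above.
TAG_PATTERN = re.compile(r"<(COMPLETE|INCOMPLETE|BACKCHANNEL)>")

def extract_tag(decoded_text):
    """Return the first turn-taking tag in the decoded output, or None."""
    match = TAG_PATTERN.search(decoded_text)
    return match.group(0) if match else None

print(extract_tag("yeah I think so <COMPLETE>"))  # -> <COMPLETE>
```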
## Architecture

| Component | Model | Parameters |
|---|---|---|
| Encoder | nvidia/parakeet-ctc-0.6b (FastConformer CTC) | 600M |
| Projector | Conv1d + Transformer (1024 → 896, stride=2) | trainable |
| LLM | Qwen/Qwen2.5-0.5B | 500M |
- Freeze strategy: `llm_and_projector` (LLM and projector trainable, encoder frozen)
- Speaker kernel: SortFormer-based speaker activity
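The projector bridges the 1024-dim Parakeet encoder features to the 896-dim Qwen2.5-0.5B hidden size while halving the frame rate. A minimal sketch, assuming a kernel size of 3, 8 attention heads, and a 4× FFN width (none of which are taken from the released config):

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Sketch of the Conv1d + Transformer projector (1024 -> 896, stride=2).
    Kernel size, head count, and FFN width are assumptions."""

    def __init__(self, in_dim=1024, out_dim=896, stride=2):
        super().__init__()
        # Conv1d halves the encoder frame rate (stride=2) and changes width.
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=3,
                              stride=stride, padding=1)
        self.transformer = nn.TransformerEncoderLayer(
            d_model=out_dim, nhead=8, dim_feedforward=4 * out_dim,
            batch_first=True)

    def forward(self, x):  # x: (batch, time, in_dim) encoder features
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time//2, out_dim)
        return self.transformer(x)

feats = torch.randn(2, 100, 1024)
print(Projector()(feats).shape)  # torch.Size([2, 50, 896])
```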
## Stage1 Base: interact_canary_wordlevel (step 70,000)
This Stage2 model is initialized from a Stage1 ASR-alignment checkpoint.
- W&B run: `stage1-interact-canary-wordlevel` in project `parakeet-turntaking`
- Freeze strategy: `all_trainable`
- Optimizer: AdamW (lr=5e-6)
- Data: SeamlessM4T word-level aligned transcripts (Canary-transcribed), ~848k utterances
- Source: SeamlessM4T expressive speech corpus
- Forced-aligned at word level with Parakeet CTC; filtered by `mean_prob >= 0.9`
- Stage1's own base: `stage1_librispeech_spk` step 90,000 (LibriSpeech 960h with speaker kernel)
| Split | Samples |
|---|---|
| Train | 847,835 |
| Dev | 10,156 |
| Test | 12,261 |
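The `mean_prob >= 0.9` alignment filter above can be sketched as follows; how per-word probabilities are stored in the manifest is an assumption:

```python
def keep_utterance(word_probs, threshold=0.9):
    """Keep an utterance only if the mean per-word CTC alignment
    probability meets the threshold (the Stage1 data filter)."""
    return bool(word_probs) and sum(word_probs) / len(word_probs) >= threshold

print(keep_utterance([0.95, 0.92, 0.88]))  # mean ~0.917 -> True
print(keep_utterance([0.95, 0.60, 0.88]))  # mean ~0.810 -> False
```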
## Stage2 Training Data
Source: CANDOR corpus — naturalistic English dyadic conversations.
Audio is voice-cloned synthesized speech from CANDOR transcripts using Canary TTS, verified to have WER=0 against the original transcript. Turn-taking labels are derived from a cross-annotation pipeline (LLM + TEN confidence scoring).
Data filters applied:

- `tt_confidence >= 0.9` for COMPLETE and INCOMPLETE labels (high TEN agreement)
- BC-priority conflict filter (when LLM and TEN disagree, BACKCHANNEL takes priority)
- COMPLETE downsampled to match INCOMPLETE count (ratio=1.0)
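The train-split filtering can be sketched roughly as below. The record fields `label` and `tt_confidence` are assumed names, and the BC-priority relabeling is treated as already applied upstream:

```python
import random

def filter_and_downsample(examples, tt_threshold=0.9, ratio=1.0, seed=0):
    """Keep COMPLETE/INCOMPLETE only at high TEN agreement, then downsample
    COMPLETE to ratio * len(INCOMPLETE). BACKCHANNEL passes through."""
    kept = [ex for ex in examples
            if ex["label"] == "BACKCHANNEL"
            or ex["tt_confidence"] >= tt_threshold]
    complete = [e for e in kept if e["label"] == "COMPLETE"]
    incomplete = [e for e in kept if e["label"] == "INCOMPLETE"]
    backchannel = [e for e in kept if e["label"] == "BACKCHANNEL"]
    target = int(ratio * len(incomplete))
    if len(complete) > target:
        complete = random.Random(seed).sample(complete, target)
    return complete + incomplete + backchannel
```

Applied to the dev and test splits, only the confidence filter runs; the note under the label-distribution table confirms downsampling is train-only.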
Label distribution:
| Split | COMPLETE | INCOMPLETE | BACKCHANNEL | Total |
|---|---|---|---|---|
| Train | 14,146 | 14,146 | 2,717 | 31,009 |
| Dev | 6,850 | 1,975 | 376 | 9,201 |
| Test | 8,323 | 2,232 | 439 | 10,994 |
Note: Downsampling (COMPLETE→INCOMPLETE ratio=1.0) is applied only to the train split.
## Stage2 Training Configuration
- Optimizer: AdamW (lr=2e-5, weight_decay=0.01)
- Scheduler: warmup (1500 steps) + cosine decay (min_lr=1e-6)
- Effective batch size: 512 (4 GPUs × batch_size=16 × grad_accum=8)
- Tag loss weight: 5.0
- Weighted sampling (max_weight=50.0)
- DeepSpeed ZeRO-2, AMP bfloat16
- Gradient clipping: 20.0
- Hardware: 4× NVIDIA A40
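The warmup-plus-cosine schedule above can be written as a pure function of the step; the total decay horizon is an assumption (the config above only fixes warmup, peak lr, and min lr):

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, min_lr=1e-6, warmup_steps=1500):
    """Linear warmup for `warmup_steps`, then cosine decay to `min_lr`.
    The decay horizon `total_steps` is an assumption, not from the config."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(0, 6000))     # 0.0 (start of warmup)
print(lr_at_step(1500, 6000))  # 2e-05 (peak, end of warmup)
print(lr_at_step(6000, 6000))  # 1e-06 (cosine floor)
```

Each optimizer step also clips the global gradient norm to 20.0 per the config above.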
## Usage

```python
import torch

checkpoint = torch.load("step_3000.pt", map_location="cpu")
# Load into a SpeechLLM model — see the parakeet-turntaking repository for details
```
## Files

- `step_3000.pt` — Full model checkpoint (encoder + projector + LLM state dicts)
- `config.yaml` — Training configuration