Turn Detector Qwen3-1.7B
Fine-tuned Qwen3-1.7B for real-time turn-end detection in multilingual call center conversations.
The model predicts P(<|im_end|>), the probability that a speaker has finished their turn. Designed for low-latency voice agent pipelines (e.g. LiveKit) to determine when to respond.
How It Works
Given a conversation so far, the model outputs the probability of <|im_end|> as the next token:
- P(im_end) > 0.5 → speaker is done talking (turn complete)
- P(im_end) < 0.5 → speaker is still talking (turn incomplete)
Usage
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "Scicom-intl/Malaysian-Turn-Detector-Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda().eval()
IM_END_ID = tokenizer.convert_tokens_to_ids("<|im_end|>")
def get_turn_end_prob(text):
    # Strip a trailing <|im_end|> so the model has to predict it rather than read it
    if text.endswith("<|im_end|>"):
        text = text[:-len("<|im_end|>")]
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Probability that <|im_end|> is the next token
    prob = F.softmax(logits[0, -1], dim=-1)[IM_END_ID].item()
    return prob
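For example (an illustrative sketch; the exact conversation formatting used in training is not shown here, so the Qwen chat markup and example utterances below are assumptions):

```python
# Illustrative only: a finished vs. an unfinished caller utterance in Qwen chat markup.
finished = "<|im_start|>user\nHi, I'd like to check the status of my order<|im_end|>"
unfinished = "<|im_start|>user\nHi, I'd like to check the status of my"

print(get_turn_end_prob(finished))    # expected high  -> agent can respond
print(get_turn_end_prob(unfinished))  # expected low   -> keep listening
```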
Eval Results
Test set: 1200 samples (600 positive + 600 negative), 50 conversations per language pair.
Overall (threshold = 0.5)
| Metric    | Score  |
|-----------|--------|
| Accuracy  | 96.67% |
| Precision | 99.82% |
| Recall    | 93.50% |
| F1        | 96.56% |
Per Language
Accuracy per language pair, overall and broken down by class (positive = turn complete, negative = turn incomplete):

| Language Pair   | Overall | Positive | Negative |
|-----------------|---------|----------|----------|
| chinese-english | 95.00%  | 90.00%   | 100.00%  |
| chinese-malay   | 97.00%  | 94.00%   | 100.00%  |
| chinese-tamil   | 97.00%  | 94.00%   | 100.00%  |
| english-chinese | 97.00%  | 96.00%   | 98.00%   |
| english-malay   | 94.00%  | 88.00%   | 100.00%  |
| english-tamil   | 95.00%  | 90.00%   | 100.00%  |
| malay-chinese   | 97.00%  | 94.00%   | 100.00%  |
| malay-english   | 96.00%  | 92.00%   | 100.00%  |
| malay-tamil     | 97.00%  | 94.00%   | 100.00%  |
| tamil-chinese   | 100.00% | 100.00%  | 100.00%  |
| tamil-english   | 97.00%  | 94.00%   | 100.00%  |
| tamil-malay     | 98.00%  | 96.00%   | 100.00%  |
Threshold Sweep
| Threshold | Accuracy | Precision | Recall | F1     |
|-----------|----------|-----------|--------|--------|
| 0.1       | 99.00%   | 99.66%    | 98.33% | 98.99% |
| 0.2       | 98.67%   | 99.66%    | 97.67% | 98.65% |
| 0.3       | 98.00%   | 99.66%    | 96.33% | 97.97% |
| 0.4       | 97.58%   | 99.65%    | 95.50% | 97.53% |
| 0.5       | 96.67%   | 99.82%    | 93.50% | 96.56% |
| 0.6       | 95.50%   | 99.82%    | 91.17% | 95.30% |
| 0.7       | 93.67%   | 99.81%    | 87.50% | 93.25% |
| 0.8       | 91.17%   | 100.00%   | 82.33% | 90.31% |
| 0.9       | 83.83%   | 100.00%   | 67.67% | 80.72% |
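A minimal sketch for reproducing such a sweep, assuming `probs` and `labels` are arrays collected by running `get_turn_end_prob` over a labeled test set (the function and variable names are illustrative, not part of this repo):

```python
import numpy as np

def sweep(probs, labels, thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Print accuracy/precision/recall/F1 for each decision threshold."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    for t in thresholds:
        preds = (probs > t).astype(int)
        tp = int(((preds == 1) & (labels == 1)).sum())
        fp = int(((preds == 1) & (labels == 0)).sum())
        fn = int(((preds == 0) & (labels == 1)).sum())
        acc = float((preds == labels).mean())
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        print(f"t={t:.1f}  acc={acc:.4f}  prec={prec:.4f}  rec={rec:.4f}  f1={f1:.4f}")
```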
Confusion Matrix (threshold = 0.5)
|            | Pred Pos | Pred Neg |
|------------|----------|----------|
| Actual Pos | 561      | 39       |
| Actual Neg | 1        | 599      |
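These counts are consistent with the overall metrics at threshold 0.5: precision = 561 / (561 + 1) ≈ 99.82%, recall = 561 / 600 = 93.50%, and accuracy = (561 + 599) / 1200 ≈ 96.67%.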
Probability Distribution
| Class                      | Mean   | Median | Min    | Max    |
|----------------------------|--------|--------|--------|--------|
| Positive (turn complete)   | 0.8813 | 0.9673 | 0.0063 | 1.0000 |
| Negative (turn incomplete) | 0.0020 | 0.0000 | 0.0000 | 0.7022 |
Dataset
Tokenized parquet datasets (chinidataset format) available at Scicom-intl/turn-detector-Qwen3-0.6B-dataset.
turn-detector-Qwen3-0.6B-dataset/
├── train-merged/
├── train/
└── test/
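One possible way to inspect the test split is with the `datasets` library (a sketch under the assumption that the parquet files can be read directly; the column layout is not documented here, so check the features after loading):

```python
from datasets import load_dataset

# Assumption: parquet files under test/ load as a plain parquet dataset (default split name "train").
ds = load_dataset("Scicom-intl/turn-detector-Qwen3-0.6B-dataset", data_dir="test", split="train")
print(ds)
print(ds.features)
```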
Training
- Base model: Qwen/Qwen3-1.7B
- Training data: Positive samples only (complete conversations ending with <|im_end|>)
- Loss: Liger Fused Linear Cross Entropy
- Attention: Flash Attention 3
- Precision: bfloat16
- Block size: 8192 (multipacked)
- Batch size: 2, with 16 gradient accumulation steps
- Learning rate: 2e-5 (constant)
- Epochs: 1
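As an illustration of the positive-only setup, a training sample can be thought of as a complete, chat-formatted turn ending with <|im_end|>, so the loss on that final token pushes P(<|im_end|>) up at real turn boundaries. The snippet below is a hypothetical example of such a sample, not the actual preprocessing code:

```python
# Hypothetical positive sample (illustrative text, assumed Qwen chat markup).
sample = (
    "<|im_start|>user\n"
    "Hi, I'd like to reschedule my appointment to next Tuesday<|im_end|>"
)
ids = tokenizer(sample, return_tensors="pt").input_ids
assert ids[0, -1].item() == IM_END_ID  # the label the model learns to predict at turn ends
```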
Training Data Sources