Introduction

Voice Chat Pipeline -> ASR + TurnDetector + VAD + LLM + TTS

In the Voice Chat Pipeline, relying on VAD (Voice Activity Detection) alone to decide whether the user's current turn has ended fails when users pause to think. VAD detects the silence in the middle of an unfinished turn and prematurely judges that the sentence has ended, even though the sentence is not yet semantically complete.

This is the motivation for the Turn-Detector Model, which is mainly applied in voice + text dialogue scenarios. It analyzes the text transcribed by the ASR model at the semantic level, so it can determine more accurately whether the current user input has ended. The Turn-Detector Model uses small (0.5B/0.6B parameter), instruction-fine-tuned language models based on the Transformer architecture, and its main task is to predict the probability that the next token is <|im_end|>.

Task: Semantic-level turn recognition, predicting the probability of next_token being <|im_end|>
Model: Small-parameter models after instruction fine-tuning (Qwen2.5-0.5B-Instruct, Qwen3-0.6B)
Goal: Reduce premature VAD-triggered interruptions in voice dialogue pipelines (e.g., pauses while the user thinks of the next word)
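
One way the turn detector's output might be combined with VAD is sketched below. This is a minimal illustration, not the repository's implementation: the function name `should_end_turn`, the threshold, and both silence durations are assumptions chosen for the example.

```python
def should_end_turn(
    silence_ms: float,
    eou_prob: float,
    threshold: float = 0.5,          # assumed decision threshold
    min_silence_ms: float = 300.0,   # assumed minimum VAD silence
    max_silence_ms: float = 2000.0,  # assumed safety timeout
) -> bool:
    """Decide whether to close the user's turn.

    silence_ms: how long VAD has reported silence.
    eou_prob:   probability from the turn detector that the text is complete.
    """
    if silence_ms < min_silence_ms:
        return False  # VAD has not seen enough silence yet
    if silence_ms >= max_silence_ms:
        return True   # stop waiting, whatever the semantic score says
    # In between, let the semantic score decide.
    return eou_prob >= threshold


print(should_end_turn(500, 0.9))  # complete sentence, short pause
print(should_end_turn(500, 0.1))  # unfinished sentence, keep listening
```

With this arrangement VAD still triggers the check, but a low end-of-turn probability keeps the pipeline listening through a thinking pause until the timeout.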

# 1. get the user input
How tall is the Eiffel Tower

# 2. apply_chat_template
<|im_start|>user<|im_sep|>How tall is the Eiffel Tower<|im_end|>

# 3. cut <|im_end|>
<|im_start|>user<|im_sep|>How tall is the Eiffel Tower

# 4. predict next token
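
The four steps above can be sketched in plain Python. The prompt string mirrors the walkthrough rather than calling a real tokenizer, and the helper names (`build_prompt`, `end_of_turn_probability`) and the toy logits are assumptions for illustration. In practice the logits would come from the model's output at the last position.

```python
import math
from typing import Sequence

IM_END = "<|im_end|>"


def build_prompt(user_text: str) -> str:
    """Steps 2 and 3: render the user turn, then cut the trailing end marker."""
    templated = f"<|im_start|>user<|im_sep|>{user_text}{IM_END}"
    return templated[: -len(IM_END)]


def end_of_turn_probability(next_token_logits: Sequence[float], im_end_id: int) -> float:
    """Step 4: softmax probability assigned to the <|im_end|> token."""
    peak = max(next_token_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in next_token_logits]
    return exps[im_end_id] / sum(exps)


prompt = build_prompt("How tall is the Eiffel Tower")
print(prompt)

# Toy 4-token vocabulary where index 2 stands in for <|im_end|>.
print(end_of_turn_probability([0.0, 0.0, 0.0, 0.0], im_end_id=2))
```

A high probability suggests the transcribed text is a complete turn, while a low one suggests the user is still mid-sentence.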

Language: Chinese, English

GitHub Page: https://github.com/zxsddcs/Turn-Detector

Dataset

The turn detection model is mainly applied in Chinese and English voice + text dialogue scenarios, where the input is mostly common text instruction data and colloquial chat dialogue data. The dataset is therefore built from public datasets such as Alpaca, MagicData (ASR dialogue dataset), and ShareChatX.

Alpaca
MagicData
ShareChatX

Characteristics of ASR transcribed text

Sentence endings sometimes lack punctuation marks
Filler words or "..." may appear mid-sentence

Dataset optimization based on ASR transcribed text characteristics

Sentence filtering: Call a large model to analyze each input, retaining only the semantically complete, colloquial data from the dataset
Filler word insertion: Randomly insert one filler word into each sentence to simulate real spoken dialogue

The Chinese and English filler word tables are generated by calling large models:

en_words = ['uh', 'um', 'ah', 'er', 'hmm', ...]
zh_words = ['嗯', '啊', '哦', '呃', '那个', '对吧', ...]
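
The filler word insertion step could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the function name `insert_filler` is hypothetical, only the five English fillers listed above are used, and words are split on whitespace, so Chinese text would need character or word segmentation instead. The filler is always placed before an existing word, never at the very end; that is a choice made for this example, not something the source specifies.

```python
import random
from typing import List

EN_FILLERS = ["uh", "um", "ah", "er", "hmm"]


def insert_filler(sentence: str, fillers: List[str], rng: random.Random) -> str:
    """Insert exactly one randomly chosen filler word before a random word."""
    words = sentence.split()
    if not words:
        return sentence  # nothing to augment
    position = rng.randrange(len(words))  # 0 .. len(words) - 1
    words.insert(position, rng.choice(fillers))
    return " ".join(words)


rng = random.Random(0)  # seeded so the augmentation is reproducible
print(insert_filler("How tall is the Eiffel Tower", EN_FILLERS, rng))
```

This sketch does not adjust capitalization or add a trailing "..." after the filler, both of which appear in the optimized examples.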

Example of the dataset after optimization

[
    {
        "instruction": "How tall is the Eiffel Tower",
        "input": "",
        "output": ""
    },
    {
        "instruction": "How tall is the um... Eiffel Tower",
        "input": "",
        "output": ""
    },
    {
        "instruction": "Um how tall is the Eiffel Tower",
        "input": "",
        "output": ""
    },
    ...
]
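
The records above follow the Alpaca field layout (instruction, input, output) with the last two fields left empty. A small sketch of how filtered or augmented sentences might be serialized into that layout follows. The helper names `to_records` and `dump_records` are hypothetical.

```python
import json
from typing import Dict, List


def to_records(sentences: List[str]) -> List[Dict[str, str]]:
    """Wrap each sentence as a record with empty input and output fields."""
    return [{"instruction": s, "input": "", "output": ""} for s in sentences]


def dump_records(sentences: List[str]) -> str:
    """Serialize the records; ensure_ascii=False keeps Chinese text readable."""
    return json.dumps(to_records(sentences), ensure_ascii=False, indent=4)


print(dump_records([
    "How tall is the Eiffel Tower",
    "Um how tall is the Eiffel Tower",
]))
```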

Quantization

from optimum.onnxruntime import ORTModelForCausalLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer

# Path or Hub ID of the fine-tuned checkpoint, and the output directory (fill in both)
model_checkpoint = ""
save_directory = ""

# Export the checkpoint to ONNX; use_cache=False exports the graph without KV-cache inputs
ort_model = ORTModelForCausalLM.from_pretrained(
    model_checkpoint,
    export=True,
    use_cache=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
ort_model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

# Dynamic (is_static=False) quantization configuration targeting AVX512-VNNI CPUs
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained(ort_model)
quantizer.quantize(save_dir=save_directory, quantization_config=qconfig)