Qwen3-ASR Finetuned on Teochew

Dependencies

Reference: Qwen3-ASR

Training script reference: QwenLM/Qwen3-ASR finetuning script, with a minor modification

Installation required:

pip install -U qwen-asr
pip install -U flash-attn --no-build-isolation

Training Data

Data sources:

  • teochew_wild
  • approximately 2,000 internal recordings
  • 12,932 samples in total

Data subsets:

Dataset  Samples
Train    12,932
Test     800
Val      700

Data Format

JSONL format, one JSON object per line.

Each line contains two fields:

  • audio: path to the audio file
  • text: text annotation

The text field must be prefixed with the language tag: language Teochew<asr_text>

Example:

{"audio": "./resample_22k/S012/S012F002/S012F002C106.wav", "text": "language Teochew<asr_text>关键是无变朖斁,衹撮动物还吤受保护。遇着就嫑去合影,至切孬去惹"}
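A line in this format can be generated with standard `json` tooling; the audio path and transcript below are placeholders, not real data:

```python
import json

# Build one training sample in the format above.
# The audio path and transcript are placeholders.
sample = {
    "audio": "./audio/example.wav",
    "text": "language Teochew<asr_text>" + "示例文本",
}
line = json.dumps(sample, ensure_ascii=False)
print(line)
```

`ensure_ascii=False` keeps the Chinese characters readable in the file instead of escaping them to `\uXXXX`.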

Training Configuration

Hardware

  • GPU: 8 × A100 (40 GB)
  • Batch size: 8
  • GPU memory usage: 36-38 GB

Training Parameters

  • Training method: full fine-tuning (LoRA fine-tuning gave clearly worse results)
  • Epochs: 15
  • Training time: 1 hour 20 minutes
  • Best performance: best CER reached at step 200

Training command:

# Training:
# --grad_acc 4: accumulate gradients over 4 steps
# --save_steps 10: save a checkpoint every 10 steps
# --save_total_limit 5: keep at most 5 checkpoints, plus the best checkpoint
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
torchrun --nproc_per_node=8 qwen3_asr_sft.py \
  --model_path ../pretrained_models/Qwen/Qwen3-ASR-1.7B \
  --train_file ./train.jsonl \
  --output_dir ./qwen3-asr-finetuning-out \
  --batch_size 8 \
  --grad_acc 4 \
  --lr 2e-5 \
  --epochs 15 \
  --save_steps 10 \
  --save_total_limit 5

Performance Comparison

Note

When calculating CER, homophonic characters that are written differently but carry similar meanings are normalized to a single character and are not counted as recognition errors.

For example: "二" and "两"; "仔" and "囝"; "他", "她", "它"; "伊", "吚", "𡛂".
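The normalization step can be sketched as follows; the mapping and helper names here are illustrative (the actual evaluation script loads its mapping from a JSON file such as synonym_example.json):

```python
# Toy synonym table: each variant maps to a canonical character.
SYNONYMS = {"两": "二", "囝": "仔", "她": "他", "它": "他", "伊": "他"}

def normalize(text: str) -> str:
    """Replace each character with its canonical form, if any."""
    return "".join(SYNONYMS.get(ch, ch) for ch in text)

def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance over reference length,
    computed after synonym normalization."""
    ref, hyp = normalize(ref), normalize(hyp)
    d = list(range(len(hyp) + 1))  # DP row: distances to hyp prefixes
    for i, rc in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, hc in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rc != hc))
    return d[len(hyp)] / max(len(ref), 1)
```

With this normalization, a hypothesis that writes "两" where the reference has "二" scores zero errors.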

Evaluation command:

# Evaluation:
# --eval_file: evaluation file
# --model_path: checkpoint path
# --remove_punctuation: strip punctuation before scoring (omit to keep punctuation)
# --use_synonym: merge homophones, e.g. "他", "她", "它" are merged into "他"
# --synonym_file: synonym mapping file
python eval_cer_finetuned.py \
    --eval_file val.jsonl \
    --model_path ./qwen3-asr-finetuning-out/best_checkpoint/ \
    --remove_punctuation \
    --use_synonym \
    --synonym_file synonym_example.json

Performance Analysis

  • CER: Qwen3-ASR performs slightly better than the Whisper series, but both are at roughly the same level overall
  • Inference speed: thanks to a higher audio compression rate and flash-attn acceleration, Qwen3-ASR is faster than the Whisper series
  • Average inference time: 0.0886 s per sample
  • RTF: approximately 0.0164 (the average audio duration in teochew-wild is 5.4 s)
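The RTF figure follows directly from the two numbers above:

```python
# RTF (real-time factor) = average processing time / average audio duration.
avg_infer_s = 0.0886  # measured seconds per sample
avg_audio_s = 5.4     # average clip length in teochew-wild
rtf = avg_infer_s / avg_audio_s
print(f"RTF = {rtf:.4f}")  # 0.0164
```

An RTF well below 1 means the model transcribes much faster than real time.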

CER Comparison Table

For the comparison with Whisper, Qwen3-ASR was trained only on teochew-wild, without the additional 2,000 internal recordings.

Model           Dataset  CER (%)
whisper-medium  val      9.61
whisper-medium  test     10.01
Qwen3-ASR       val      8.54
Qwen3-ASR       test     9.11

Quick Start

The recognition result contains both language and text fields, so the model also identifies the language.

Basic Usage

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "panlr/Qwen3_ASR_teochew",
    subfolder="Qwen3_asr",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

results = model.transcribe(audio="your_audio.wav")
print(results[0].text)      # recognized text
print(results[0].language)  # detected language

Recognition with Timestamps

Use the ForcedAligner to obtain word-level timestamps:

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "panlr/Qwen3_ASR_teochew",
    subfolder="Qwen3_asr",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs={"dtype": torch.bfloat16, "device_map": "cuda:0"},
)

results = model.transcribe(audio="your_audio.wav", return_time_stamps=True)
for ts in results[0].time_stamps:
    print(f"{ts.text}\t{ts.start_time:.3f}s -> {ts.end_time:.3f}s")

Batch Recognition

Recognize multiple files in one call:

results = model.transcribe(audio=["audio1.wav", "audio2.wav", "audio3.wav"])
for r in results:
    print(r.text)

CLI Inference

Usage

1. Single audio file, without timestamps

python infer.py --audio ./example.wav

2. Single audio file, with ForcedAligner timestamps

python infer.py --audio ./example.wav --forced_aligner

3. Batch recognition over a whole folder

python infer.py --audio_dir ./audio_folder/

4. Batch recognition with timestamps, written to a file

python infer.py --audio_dir ./audio_folder/ --forced_aligner --output results.jsonl

5. Use a local checkpoint instead of the HuggingFace model

python infer.py --audio ./example.wav --model_path ./qwen3-asr-finetuning-out/inference_checkpoint --model_subfolder ""

Output Format

JSONL format, one JSON object per line:

{"audio": "./example.wav", "language": "Teochew", "text": "状元林大钦,兵部尚书翁万达,廖甲,工部左侍郎陈一松,拢是嘉靖年间介进士", "timestamps": [{"text": "状", "start_time": 1.04, "end_time": 1.36}, {"text": "元", "start_time": 1.36, "end_time": 1.44}, {"text": "林", "start_time": 1.44, "end_time": 1.76}, {"text": "大", "start_time": 1.76, "end_time": 1.92}, {"text": "钦", "start_time": 2.0, "end_time": 2.4}, {"text": "兵", "start_time": 2.96, "end_time": 3.04}, {"text": "部", "start_time": 3.2, "end_time": 3.36}, {"text": "尚", "start_time": 3.36, "end_time": 3.52}, {"text": "书", "start_time": 3.76, "end_time": 4.08}, {"text": "翁", "start_time": 4.16, "end_time": 4.64}, {"text": "万", "start_time": 4.72, "end_time": 4.88}, {"text": "达", "start_time": 4.88, "end_time": 4.88}, {"text": "廖", "start_time": 5.52, "end_time": 5.68}, {"text": "甲", "start_time": 5.68, "end_time": 5.92}, {"text": "工", "start_time": 6.32, "end_time": 6.56}, {"text": "部", "start_time": 6.56, "end_time": 6.8}, {"text": "左", "start_time": 6.8, "end_time": 6.96}, {"text": "侍", "start_time": 6.96, "end_time": 7.2}, {"text": "郎", "start_time": 7.2, "end_time": 7.52}, {"text": "陈", "start_time": 7.52, "end_time": 7.76}, {"text": "一", "start_time": 7.76, "end_time": 7.84}, {"text": "松", "start_time": 7.84, "end_time": 8.24}, {"text": "拢", "start_time": 8.96, "end_time": 9.12}, {"text": "是", "start_time": 9.12, "end_time": 9.36}, {"text": "嘉", "start_time": 9.36, "end_time": 9.6}, {"text": "靖", "start_time": 9.6, "end_time": 9.84}, {"text": "年", "start_time": 9.84, "end_time": 9.92}, {"text": "间", "start_time": 9.92, "end_time": 10.16}, {"text": "介", "start_time": 10.16, "end_time": 10.32}, {"text": "进", "start_time": 10.4, "end_time": 10.64}, {"text": "士", "start_time": 10.64, "end_time": 10.96}]}
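Each output line can be consumed with standard `json` parsing; the record below is an abbreviated version of the example above:

```python
import json

# Parse one (abbreviated) output line from infer.py and print the
# transcript with its per-character timestamps.
line = ('{"audio": "./example.wav", "language": "Teochew", "text": "状元林大钦", '
        '"timestamps": [{"text": "状", "start_time": 1.04, "end_time": 1.36}]}')
rec = json.loads(line)
print(rec["language"], rec["text"])
for ts in rec.get("timestamps", []):
    print(f'{ts["text"]}: {ts["start_time"]:.2f}s -> {ts["end_time"]:.2f}s')
```

When `--forced_aligner` is omitted, the timestamps field is simply absent, which is why `rec.get("timestamps", [])` is used rather than direct indexing.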