Wendy9805 committed on
Commit 697d3c7 · verified · 1 Parent(s): a6528b1

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +95 -0
README.md ADDED
@@ -0,0 +1,95 @@
+ ---
+ license: apache-2.0
+ language:
+ - zh
+ tags:
+ - automatic-speech-recognition
+ - audio
+ - qwen3
+ - asr
+ - speech
+ pipeline_tag: automatic-speech-recognition
+ model-index:
+ - name: FormalASR-0.6B
+   results: []
+ ---
+
+ # FormalASR-0.6B
+
+ FormalASR-0.6B is an automatic speech recognition (ASR) model fine-tuned from [Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B) and optimized for **formal, written-style transcription**: it outputs clean, punctuated, written-form text rather than colloquial spoken transcripts.
+
+ ## Model Description
+
+ | Attribute | Value |
+ |---|---|
+ | Architecture | Qwen3ASRForConditionalGeneration |
+ | Base Model | Qwen3-ASR-0.6B |
+ | Parameters | ~0.6B |
+ | Dtype | bfloat16 |
+ | Audio Encoder | Whisper-like (18 layers, d_model=896) |
+ | Text Decoder | Qwen3 (28 layers, hidden=1024) |
+
+ ## Key Features
+
+ - 🎯 **Formal-style output**: Produces formal, punctuated text suitable for documentation, subtitles, and professional use
+ - ⚡ **Compact**: Only 0.6B parameters, suitable for edge deployment
+ - 🔊 **Long-form audio**: Supports inference on up to 800 windows (~160 seconds) of audio
+
+ ## Usage
+
+ ```python
+ import librosa
+ import torch
+ from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
+
+ model_id = "TaurenMountain/FormalASR-0.6B"
+
+ processor = AutoProcessor.from_pretrained(model_id)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ # Load the audio as 16 kHz mono (librosa resamples and downmixes as needed)
+ audio, sr = librosa.load("your_audio.wav", sr=16000)
+
+ # Extract features, then move them to the model's device
+ inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+ # Generate token IDs without tracking gradients
+ with torch.no_grad():
+     generated_ids = model.generate(**inputs)
+
+ # Decode the first (and only) sequence to text
+ transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(transcription)
+ ```
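
For recordings longer than the roughly 160-second limit listed under Key Features, one option is to split the waveform into fixed-length chunks and transcribe each chunk in turn. The sketch below is only an illustration under stated assumptions: `split_audio` and `MAX_SECONDS` are hypothetical names, the 160-second figure is the README's approximate value, and cutting at fixed offsets can split a word in two, so a silence-aware splitter would likely give better results.

```python
from typing import Sequence

SAMPLE_RATE = 16_000  # the usage example loads audio at 16 kHz
MAX_SECONDS = 160     # assumption: approximate limit quoted under Key Features


def split_audio(audio: Sequence[float], sample_rate: int = SAMPLE_RATE,
                max_seconds: float = MAX_SECONDS) -> list:
    """Split a mono waveform into consecutive chunks of at most max_seconds.

    Works on any sliceable sequence, e.g. the NumPy array from librosa.load.
    """
    chunk_len = int(sample_rate * max_seconds)
    if chunk_len <= 0:
        raise ValueError("sample_rate * max_seconds must be positive")
    return [audio[start:start + chunk_len]
            for start in range(0, len(audio), chunk_len)]


# Toy check at 10 Hz: 400 seconds of samples split into 160 s, 160 s and 80 s.
chunks = split_audio([0.0] * 4000, sample_rate=10)
print([len(c) / 10 for c in chunks])
```

Each chunk can then be passed through the processor and `model.generate` as in the usage example, and the per-chunk transcriptions concatenated.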
+
+ ## Training Details
+
+ This model is fine-tuned from Qwen3-ASR-0.6B on a curated dataset of formal speech paired with formal-style transcriptions. The fine-tuning process focuses on:
+
+ - Converting spoken language patterns to formal written text
+ - Inserting proper punctuation
+ - Handling filler words and disfluencies
+ - Improving text normalization
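
To make the spoken-to-formal mapping concrete, here is a toy, rule-based sketch of the kind of transformation the list describes. It is not the model's method (the model learns this mapping from paired data) and it is not how the training targets were produced, which the README does not describe. The filler inventory, the function name `to_formal`, and the example sentence are all hypothetical.

```python
import re

# Hypothetical filler inventory for illustration; not taken from the model or its data.
FILLERS = ("嗯", "呃")
_FILLER_RE = re.compile("|".join(re.escape(f) for f in FILLERS))
_END_PUNCT = "。！？"


def to_formal(spoken: str) -> str:
    """Drop filler syllables and make sure the sentence ends with punctuation."""
    text = _FILLER_RE.sub("", spoken).strip()
    if text and text[-1] not in _END_PUNCT:
        text += "。"
    return text


print(to_formal("嗯这个方案呃我觉得可以"))  # prints 这个方案我觉得可以。
```

A rule list like this breaks down quickly on real speech (repetitions, self-corrections, ambiguous words), which is the usual motivation for learning the mapping end to end.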
+
+ ## Evaluation
+
+ The model is evaluated on the [SpeechIO-Formal](https://huggingface.co/datasets/TaurenMountain/Speechio-Formal) benchmark, a formal-domain Chinese speech recognition test set covering news, presentations, lectures, and other formal speech scenarios.
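
The README does not state which metric is used or report scores. Chinese ASR is commonly scored with character error rate (CER), so the following is a minimal sketch of that metric, assuming a plain Levenshtein distance over characters; `edit_distance` and `cer` are illustrative names, and the example strings are made up.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character and one missing full stop: 2 errors over 7 characters.
print(cer("今天天气很好。", "今天天汽很好"))
```

Whether punctuation counts toward the error rate is a scoring decision; for formal-style output it probably should, but that is an assumption here, not something the README specifies.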
+
+ ## License
+
+ Apache 2.0
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @inproceedings{ning2026formalasr,
+   title={FormalASR: End-to-End Spoken Chinese to Formal Text},
+   author={Ning, Wanyi and Qian, Haitao and Cheng, Jiyuan and Feng, Weiyuan and Zhang, Yufei},
+   booktitle={arXiv preprint},
+   year={2026}
+ }
+ ```