TaurenMountain/FormalASR-0.6B

Safetensors

qwen3_asr


Wendy9805 committed 3 days ago

Commit 697d3c7 (verified) · 1 Parent(s): a6528b1

Upload README.md with huggingface_hub

Files changed (1)

README.md +95 -0

README.md ADDED

	@@ -0,0 +1,95 @@

+---
+license: apache-2.0
+language:
+- zh
+tags:
+- automatic-speech-recognition
+- audio
+- qwen3
+- asr
+- speech
+pipeline_tag: automatic-speech-recognition
+model-index:
+- name: FormalASR-0.6B
+  results: []
+---
+# FormalASR-0.6B
+FormalASR-0.6B is an automatic speech recognition (ASR) model fine-tuned from [Qwen3-ASR-0.6B](https://huggingface.co/Qwen/Qwen3-ASR-0.6B) and optimized for **formal/written-style transcription**: it outputs clean, punctuated, written-form text rather than colloquial spoken transcripts.
+## Model Description
+| Attribute | Value |
+|---|---|
+| Architecture | Qwen3ASRForConditionalGeneration |
+| Base Model | Qwen3-ASR-0.6B |
+| Parameters | ~0.6B |
+| Dtype | bfloat16 |
+| Audio Encoder | Whisper-like (18 layers, d_model=896) |
+| Text Decoder | Qwen3 (28 layers, hidden=1024) |
+## Key Features
+- 🎯 **Formal-style output**: Produces formal, punctuated text suitable for documentation, subtitles, and professional use
+- ⚡ **Compact**: Only 0.6B parameters, suitable for edge deployment
+- 🔊 **Long-form audio**: Supports inference on up to 800 windows (~160 seconds) of audio
+## Usage
+```python
+import librosa
+import torch
+from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
+model_id = "TaurenMountain/FormalASR-0.6B"
+processor = AutoProcessor.from_pretrained(model_id)
+model = AutoModelForSpeechSeq2Seq.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+# Load audio as 16 kHz mono, then move inputs to the model device (float features are cast to the model dtype)
+audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)
+inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
+inputs = {k: v.to(model.device, dtype=model.dtype) if v.is_floating_point() else v.to(model.device) for k, v in inputs.items()}
+with torch.no_grad():
+    generated_ids = model.generate(**inputs)
+transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+print(transcription)
+```
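The card lists a limit of roughly 160 seconds of audio per inference. For longer recordings, one option is to split the waveform into consecutive segments no longer than that limit and transcribe each segment in turn. The sketch below is only an illustration under that assumption, not the authors' pipeline: `split_audio` and `max_seconds` are hypothetical names, and cutting at fixed offsets can split a word at a segment boundary, so a silence-aware splitter would likely give better results.

```python
from typing import Sequence


def split_audio(audio: Sequence, sr: int = 16000, max_seconds: float = 160.0) -> list:
    """Split a mono waveform into consecutive segments of at most max_seconds.

    Works on anything that supports len() and slicing (list, NumPy array).
    The 160-second default mirrors the limit quoted in the model card;
    the fixed-offset splitting strategy itself is an assumption.
    """
    if sr <= 0 or max_seconds <= 0:
        raise ValueError("sr and max_seconds must be positive")
    # Number of samples per segment, never less than one sample
    max_samples = max(1, int(sr * max_seconds))
    return [audio[start:start + max_samples] for start in range(0, len(audio), max_samples)]
```

Each returned segment can then be passed through the processor and `model.generate` exactly as a short clip would be, and the per-segment transcriptions concatenated in order.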
+## Training Details
+This model is fine-tuned from Qwen3-ASR-0.6B on a curated dataset of formal speech paired with formal-style transcriptions. The fine-tuning process focuses on:
+- Converting spoken language patterns to formal written text
+- Inserting proper punctuation
+- Handling filler words and disfluencies
+- Improving text normalization
+## Evaluation
+Evaluated on the [SpeechIO-Formal](https://huggingface.co/datasets/TaurenMountain/Speechio-Formal) benchmark, a formal-domain Chinese speech recognition test set covering news, presentations, lectures, and other formal speech scenarios.
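The card does not state which metric was used or report any scores. Character error rate (CER) is a common choice for Chinese ASR, so the sketch below assumes CER purely for illustration. The names `edit_distance` and `cer` are hypothetical helpers, and whether punctuation should be stripped before scoring is not specified in the card; here every character, including punctuation, is counted.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed character by character."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Example: one dropped character out of six reference characters
print(cer("今天天气很好", "今天天气好"))
```

Because formal-style output includes punctuation, scoring with and without punctuation can give noticeably different numbers, so it is worth fixing that choice before comparing models.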
+## License
+Apache 2.0
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@article{ning2026formalasr,
+  title={FormalASR: End-to-End Spoken Chinese to Formal Text},
+  author={Ning, Wanyi and Qian, Haitao and Cheng, Jiyuan and Feng, Weiyuan and Zhang, Yufei},
+  journal={arXiv preprint},
+  year={2026}
+}
+```