# Wav2Vec2-Large-XLSR-53 – Singing Phoneme CTC Model (Two-Stage Fine-Tuning, LoRA Merged)

## Model Summary
This model is fine-tuned from `jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn` via a two-stage training pipeline. The original character-level CTC head (~3,500 Chinese characters) is replaced with a custom multilingual phoneme head (POLYGLOT, 39-token vocabulary), targeting phoneme-level temporal alignment of singing vocals (e.g., karaoke lyrics synchronization).
This checkpoint is a fully merged model. The Stage 2 LoRA adapter weights have been merged into the base model via `merge_and_unload()`; no PEFT dependency is required at inference time – load it directly with `transformers`.
## Model Architecture

- Backbone: Wav2Vec2-Large (XLSR-53), encoder hidden size 1024
- CTC head: `Dropout(0.1)` + `Linear(1024 → 39)` (replaces the original Chinese character CTC head)
- Stage 2 LoRA adapter (merged): originally applied to all attention and FFN layers in the encoder (see hyperparameters below); the weights have been merged into the backbone via `merge_and_unload()`
## Phoneme Set (POLYGLOT – 39 tokens)
| Category | Phonemes |
|---|---|
| Special tokens | `<pad>` (also CTC blank), `<unk>`, `\|` (word boundary), `<sil>` (silence) |
| Stops – aspirated | p t k |
| Stops – voiced / unaspirated | b d g |
| Nasals | m n ng |
| Fricatives | f v s z sh zh h th dh hy |
| Affricates | ch jh ts dz |
| Liquids | l r |
| Glides | w y |
| Vowels | a i u e o ü ae ir |
The phoneme set is designed to cover Mandarin Chinese (pinyin initial/final decomposition), English (ARPAbet mapping), and a subset of Japanese phonemes.
**Note – Training coverage:** Although the vocabulary is multilingual by design (zh / en / ja), all training data in both stages is Mandarin-only (AISHELL-3 + M4Singer). The English and Japanese phonemes (`v`, `th`, `dh`, `hy`, `jh`, `dz`, `ü`, `ae`, `ir`, …) exist in the vocabulary but have received no supervised signal; their behaviour at inference time is untested and likely unreliable.
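To illustrate the pinyin initial/final decomposition the phoneme set is built around, here is a minimal sketch. The syllable table below is hypothetical, written for this example only; the model's actual grapheme-to-phoneme mapping is not published in this card:

```python
# Hypothetical pinyin -> POLYGLOT decomposition for a few syllables.
# This is NOT the training pipeline's real lexicon; it only shows
# the initial/final split idea behind the vocabulary design.
DEMO_LEXICON = {
    "ta":   ["t", "a"],        # aspirated stop + vowel
    "shi":  ["sh", "ir"],      # retroflex fricative + apical vowel
    "ming": ["m", "i", "ng"],  # nasal coda uses the "ng" token
    "lü":   ["l", "ü"],
}

def to_phonemes(syllables):
    """Map pinyin syllables to POLYGLOT tokens, inserting the
    '|' word-boundary token between syllables."""
    out = []
    for i, syl in enumerate(syllables):
        if i > 0:
            out.append("|")
        out.extend(DEMO_LEXICON[syl])
    return out

print(to_phonemes(["ming", "ta"]))  # ['m', 'i', 'ng', '|', 't', 'a']
```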
## Training Procedure

### Stage 1 – Frozen Encoder, New CTC Head (AISHELL-3)
| Setting | Value |
|---|---|
| Training data | AISHELL-3 (Mandarin speech, ~85 speakers) |
| Encoder | Fully frozen (including CNN feature extractor) |
| Trainable parameters | lm_head only (new CTC head) |
| Input | Pre-extracted encoder hidden states cached to HDF5 |
| Batch size | 8 (gradient accumulation × 4 = effective batch 32) |
| Learning rate | 5e-4 (AdamW, weight decay 0.01) |
| Schedule | Cosine decay with linear warmup (3 epochs) |
| Regularization | Time masking (2× per sample, max 30 frames) |
| Early stopping | patience = 15 epochs (PER-based, min_delta = 0.001) |
| Evaluation metric | Phoneme Error Rate (PER), CTC greedy decoding |
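The "cosine decay with linear warmup" schedule in the table can be sketched as follows. The warmup fraction and peak learning rate match the table; the total step count here is illustrative:

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr=5e-4):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Peak LR exactly at the end of warmup, ~0 at the end of training.
print(lr_at(100, 1000, 100))   # 0.0005
print(lr_at(1000, 1000, 100))  # ~0.0
```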
### Stage 2 – LoRA Fine-Tuning (M4Singer)
| Setting | Value |
|---|---|
| Training data | M4Singer (Mandarin singing, ~700 songs, 20 singers) |
| Base model | Stage 1 best checkpoint |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, out_proj, intermediate_dense, output_dense |
| `modules_to_save` | `lm_head` (full weights, not LoRA) |
| CNN feature extractor | Frozen |
| Batch size | 8 (gradient accumulation × 4 = effective batch 32) |
| Learning rate | LoRA adapter: 1e-5 / lm_head: 1e-6 (separate param groups, AdamW) |
| Schedule | Cosine decay with linear warmup (10 epochs) |
| Regularization | Wav2Vec2 built-in SpecAugment (mask_time_prob=0.05) |
| Early stopping | patience = 100 epochs (PER-based, min_delta = 0.001) |
| Training precision | bfloat16 AMP (CTC loss computed in float32) |
| Evaluation metric | Phoneme Error Rate (PER), CTC greedy decoding |
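Assuming Stage 2 used the standard `peft` API (the card confirms PEFT 0.18.1 was used for training), the adapter settings in the table correspond roughly to this configuration fragment:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj",
                    "intermediate_dense", "output_dense"],
    modules_to_save=["lm_head"],  # CTC head trained with full weights, not LoRA
)
# stage1_model is the Stage 1 best checkpoint loaded as Wav2Vec2ForCTC:
# model = get_peft_model(stage1_model, lora_config)
```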
## Usage

### Load the Model

This checkpoint is already merged – load it directly with `transformers`, no PEFT required.

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch

processor = Wav2Vec2Processor.from_pretrained("path/to/final")
model = Wav2Vec2ForCTC.from_pretrained("path/to/final")
model.eval()
```
### Inference (CTC Greedy Decode)

```python
import torchaudio

waveform, sr = torchaudio.load("singing.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(0)  # convert to mono

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, T, 39)

# CTC greedy decode: argmax -> collapse repeats -> remove blank (id = 0)
pred_ids = logits.argmax(dim=-1)[0].tolist()
decoded = []
prev = -1
for idx in pred_ids:
    if idx != prev:
        if idx != 0:
            decoded.append(idx)
    prev = idx

id2phoneme = {v: k for k, v in processor.tokenizer.get_vocab().items()}
phonemes = [id2phoneme[i] for i in decoded]
print(phonemes)
```
### Merge LoRA Weights

Already done: this checkpoint was saved after calling `merge_and_unload()`. The code below is provided for reference only, in case you are working from an unmerged PEFT checkpoint.

```python
# Only needed if starting from a raw PEFT checkpoint (not this one)
from peft import PeftModel
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

base_model = Wav2Vec2ForCTC.from_pretrained("path/to/stage1_best")
peft_model = PeftModel.from_pretrained(base_model, "path/to/stage2_peft")
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("path/to/final")

# Save the processor alongside the merged weights
processor = Wav2Vec2Processor.from_pretrained("path/to/stage1_best")
processor.save_pretrained("path/to/final")
```
## Intended Use

- Singing vocal alignment: force-aligning phoneme sequences to audio timestamps (CTC forced alignment) for generating per-character / per-phoneme timing (`.lrc`, `.srt`, `.ass`)
- Singing acoustics research: phoneme-level feature extraction and articulation analysis for Mandarin singing
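For the alignment use case, decoded frame indices must be turned into timestamps. Wav2Vec2 emits one logit frame per 20 ms of 16 kHz audio (320-sample convolutional stride), so a minimal conversion sketch (the `.lrc` tag formatter is illustrative):

```python
FRAME_STRIDE_S = 320 / 16000  # 0.02 s of audio per encoder frame

def frame_to_time(frame_idx):
    """Convert an encoder frame index to seconds of audio."""
    return frame_idx * FRAME_STRIDE_S

def to_lrc_tag(seconds):
    """Format seconds as an .lrc timestamp tag, e.g. [00:01.50]."""
    m, s = divmod(seconds, 60.0)
    return f"[{int(m):02d}:{s:05.2f}]"

print(to_lrc_tag(frame_to_time(75)))  # [00:01.50]
```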
## Out-of-Scope Use

- General Mandarin ASR: use the original `jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn` instead
- Non-Mandarin singing: the phoneme set supports multilingual phonemes by design, but training data is Mandarin-only; generalization to other languages is unverified
- Real-time / low-latency applications: the model is not optimized for streaming inference
## Training Datasets

- AISHELL-3 – Mandarin speech corpus (Stage 1)
- M4Singer – Mandarin singing corpus, ~700 songs, 20 singers (Stage 2)
## Limitations
- The model outputs a phoneme sequence, not Chinese characters; it cannot be used for text transcription without a phoneme-to-pinyin/hanzi conversion step
- Trained primarily on Mandarin singing (M4Singer); generalization to other styles (e.g., Cantonese, English pop, rap) may be limited
- The custom phoneme vocabulary requires the accompanying tokenizer (`vocab.json`) for decoding
## Framework Versions
- Transformers (compatible with PEFT 0.18.1)
- PyTorch (with bfloat16 AMP support)
- PEFT 0.18.1 (training only; not required at inference)