Wav2Vec2-Large-XLSR-53 — Singing Phoneme CTC Model (Two-Stage Fine-Tuning, LoRA Merged)

Model Summary

This model is fine-tuned from jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn via a two-stage training pipeline. The original character-level CTC head (~3,500 Chinese characters) is replaced with a custom multilingual phoneme head (POLYGLOT, 39 phonemes), targeting phoneme-level temporal alignment of singing vocals (e.g., karaoke lyrics synchronization).

This checkpoint is a fully merged model. The Stage 2 LoRA adapter weights have been merged into the base model via merge_and_unload(). No PEFT dependency is required at inference time — load directly with transformers.


Model Architecture

  • Backbone: Wav2Vec2-Large (XLSR-53), encoder hidden size 1024
  • CTC Head: Dropout(0.1) + Linear(1024 β†’ 39) (replaces the original Chinese character CTC head)
  • Stage 2 LoRA Adapter (merged): Originally applied to all attention and FFN layers in the encoder (see hyperparameters below); weights have been merged into the backbone via merge_and_unload()

Phoneme Set (POLYGLOT — 39 tokens)

| Category | Phonemes |
|----------|----------|
| Special tokens | `<pad>` (also CTC blank), `<unk>`, `\|` (word boundary), `<sil>` (silence) |
| Stops — aspirated | p t k |
| Stops — voiced / unaspirated | b d g |
| Nasals | m n ng |
| Fricatives | f v s z sh zh h th dh hy |
| Affricates | ch jh ts dz |
| Liquids | l r |
| Glides | w y |
| Vowels | a i u e o ü ae ir |

The phoneme set is designed to cover Mandarin Chinese (pinyin initial/final decomposition), English (ARPAbet mapping), and a subset of Japanese phonemes.

Note — Training coverage: Although the vocabulary is multilingual by design (zh / en / ja), all training data in both stages is Mandarin-only (AISHELL-3 + M4Singer). The English and Japanese phonemes (v, th, dh, hy, jh, dz, ü, ae, ir, …) exist in the vocabulary but have received no supervised signal; their behaviour at inference time is untested and likely unreliable.


Training Procedure

Stage 1 — Frozen Encoder, New CTC Head (AISHELL-3)

| Setting | Value |
|---------|-------|
| Training data | AISHELL-3 (Mandarin speech, ~85 speakers) |
| Encoder | Fully frozen (including CNN feature extractor) |
| Trainable parameters | lm_head only (new CTC head) |
| Input | Pre-extracted encoder hidden states cached to HDF5 |
| Batch size | 8 (gradient accumulation × 4 = effective batch 32) |
| Learning rate | 5e-4 (AdamW, weight decay 0.01) |
| Schedule | Cosine decay with linear warmup (3 epochs) |
| Regularization | Time masking (2× per sample, max 30 frames) |
| Early stopping | patience = 15 epochs (PER-based, min_delta = 0.001) |
| Evaluation metric | Phoneme Error Rate (PER), CTC greedy decoding |
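The Stage 1 freezing scheme (everything frozen except the new CTC head) can be sketched as below. A small random config stands in for the real checkpoint; in the actual pipeline the frozen encoder's hidden states are additionally precomputed and cached to HDF5 rather than recomputed each step:

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2ForCTC

# Small stand-in config; the real model is the head-swapped XLSR-53 checkpoint.
config = Wav2Vec2Config(vocab_size=39, hidden_size=64, num_hidden_layers=2,
                        num_attention_heads=2, intermediate_size=128,
                        conv_dim=(32, 32), conv_kernel=(3, 3), conv_stride=(2, 2))
model = Wav2Vec2ForCTC(config)

# Freeze everything (encoder + CNN feature extractor), then unfreeze lm_head.
for p in model.parameters():
    p.requires_grad = False
for p in model.lm_head.parameters():
    p.requires_grad = True

trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
# Only lm_head.weight and lm_head.bias remain trainable.

# Optimizer settings from the table above.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=5e-4, weight_decay=0.01)
```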

Stage 2 — LoRA Fine-Tuning (M4Singer)

| Setting | Value |
|---------|-------|
| Training data | M4Singer (Mandarin singing, ~700 songs, 20 singers) |
| Base model | Stage 1 best checkpoint |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, out_proj, intermediate_dense, output_dense |
| modules_to_save | lm_head (full weights, not LoRA) |
| CNN feature extractor | Frozen |
| Batch size | 8 (gradient accumulation × 4 = effective batch 32) |
| Learning rate | LoRA adapter: 1e-5 / lm_head: 1e-6 (separate param groups, AdamW) |
| Schedule | Cosine decay with linear warmup (10 epochs) |
| Regularization | Wav2Vec2 built-in SpecAugment (mask_time_prob=0.05) |
| Early stopping | patience = 100 epochs (PER-based, min_delta = 0.001) |
| Training precision | bfloat16 AMP (CTC loss computed in float32) |
| Evaluation metric | Phoneme Error Rate (PER), CTC greedy decoding |

Usage

Load the Model

This checkpoint is already merged — load it directly with transformers, no PEFT required.

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch

processor = Wav2Vec2Processor.from_pretrained("path/to/final")
model = Wav2Vec2ForCTC.from_pretrained("path/to/final")
model.eval()

Inference (CTC Greedy Decode)

import torchaudio

waveform, sr = torchaudio.load("singing.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(0)  # convert to mono

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, T, 39)

# CTC greedy decode: argmax → collapse repeats → remove blank
blank_id = processor.tokenizer.pad_token_id  # <pad> doubles as the CTC blank
pred_ids = logits.argmax(dim=-1)[0].tolist()
decoded = []
prev = -1
for idx in pred_ids:
    if idx != prev:
        if idx != blank_id:
            decoded.append(idx)
        prev = idx

phonemes = processor.tokenizer.convert_ids_to_tokens(decoded)
print(phonemes)

Merge LoRA Weights

Already done. This checkpoint was saved after calling merge_and_unload(). The code below is provided for reference only, in case you are working from an unmerged PEFT checkpoint.

# Only needed if starting from a raw PEFT checkpoint (not this one)
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from peft import PeftModel

base_model = Wav2Vec2ForCTC.from_pretrained("path/to/stage1_best")
peft_model = PeftModel.from_pretrained(base_model, "path/to/stage2_peft")
merged_model = peft_model.merge_and_unload()

merged_model.save_pretrained("path/to/final")
processor = Wav2Vec2Processor.from_pretrained("path/to/stage1_best")
processor.save_pretrained("path/to/final")

Intended Use

  • Singing vocal alignment: Force-aligning phoneme sequences to audio timestamps (CTC forced alignment) for generating per-character / per-phoneme timing (.lrc, .srt, .ass)
  • Singing acoustics research: Phoneme-level feature extraction and articulation analysis for Mandarin singing

Out-of-Scope Use

  • General Mandarin ASR: Use the original jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn instead
  • Non-Mandarin singing: The phoneme set supports multilingual phonemes by design, but training data is Mandarin-only; generalization to other languages is unverified
  • Real-time / low-latency applications: The model is not optimized for streaming inference

Training Datasets

| Dataset | Language | Domain | Stage |
|---------|----------|--------|-------|
| AISHELL-3 | Mandarin | Speech | Stage 1 |
| M4Singer | Mandarin | Singing | Stage 2 |

Limitations

  • The model outputs a phoneme sequence, not Chinese characters; it cannot be used for text transcription without a phoneme-to-pinyin/hanzi conversion step
  • Trained primarily on Mandarin singing (M4Singer); generalization to other styles (e.g., Cantonese, English pop, rap) may be limited
  • The custom phoneme vocabulary requires the accompanying tokenizer (vocab.json) for decoding

Framework Versions

  • Transformers (compatible with PEFT 0.18.1)
  • PyTorch (with bfloat16 AMP support)
  • PEFT 0.18.1 (training only; not required at inference)
Model: lyonlu13/wav2vec2-large-zh-singing-phoneme-ctc (~0.3B parameters, Safetensors)