Wav2Vec2-Large-XLSR-53 — Singing Phoneme CTC Model (Two-Stage Fine-Tuning, LoRA Merged)

Model Summary

This model is fine-tuned from jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn via a two-stage training pipeline. The original character-level CTC head (~3,500 Chinese characters) is replaced with a custom multilingual phoneme head (POLYGLOT, 39 phonemes), targeting phoneme-level temporal alignment of singing vocals (e.g., karaoke lyrics synchronization).

This checkpoint is a fully merged model. The Stage 2 LoRA adapter weights have been merged into the base model via merge_and_unload(). No PEFT dependency is required at inference time — load directly with transformers.


Model Architecture

  • Backbone: Wav2Vec2-Large (XLSR-53), encoder hidden size 1024
  • CTC Head: Dropout(0.1) + Linear(1024 β†’ 39) (replaces the original Chinese character CTC head)
  • Stage 2 LoRA Adapter (merged): Originally applied to all attention and FFN layers in the encoder (see hyperparameters below); weights have been merged into the backbone via merge_and_unload()

Phoneme Set (POLYGLOT — 39 tokens)

| Category | Phonemes |
|----------|----------|
| Special tokens | `<pad>` (also CTC blank), `<unk>`, `\|` (word boundary), `<sil>` (silence) |
| Stops — aspirated | p t k |
| Stops — voiced / unaspirated | b d g |
| Nasals | m n ng |
| Fricatives | f v s z sh zh h th dh hy |
| Affricates | ch jh ts dz |
| Liquids | l r |
| Glides | w y |
| Vowels | a i u e o ü ae ir |

The phoneme set is designed to cover Mandarin Chinese (pinyin initial/final decomposition), English (ARPAbet mapping), and a subset of Japanese phonemes.

Note — Training coverage: Although the vocabulary is multilingual by design (zh / en / ja), all training data in both stages is Mandarin-only (AISHELL-3 + M4Singer). The English and Japanese phonemes (v, th, dh, hy, jh, dz, ü, ae, ir, …) exist in the vocabulary but have received no supervised signal; their behaviour at inference time is untested and likely unreliable.


Training Procedure

Stage 1 — Frozen Encoder, New CTC Head (AISHELL-3)

| Setting | Value |
|---------|-------|
| Training data | AISHELL-3 (Mandarin speech, ~85 speakers) |
| Encoder | Fully frozen (including CNN feature extractor) |
| Trainable parameters | lm_head only (new CTC head) |
| Input | Pre-extracted encoder hidden states cached to HDF5 |
| Batch size | 8 (gradient accumulation × 4 = effective batch 32) |
| Learning rate | 5e-4 (AdamW, weight decay 0.01) |
| Schedule | Cosine decay with linear warmup (3 epochs) |
| Regularization | Time masking (2× per sample, max 30 frames) |
| Early stopping | patience = 15 epochs (PER-based, min_delta = 0.001) |
| Evaluation metric | Phoneme Error Rate (PER), CTC greedy decoding |
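The Stage 1 freezing scheme (everything frozen except the new CTC head) can be sketched as below. A small random config stands in for the real checkpoint; in the actual pipeline the frozen encoder's hidden states are additionally precomputed and cached to HDF5 rather than recomputed each step:

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2ForCTC

# Small stand-in config; the real model is the head-swapped XLSR-53 checkpoint.
config = Wav2Vec2Config(vocab_size=39, hidden_size=64, num_hidden_layers=2,
                        num_attention_heads=2, intermediate_size=128,
                        conv_dim=(32, 32), conv_kernel=(3, 3), conv_stride=(2, 2))
model = Wav2Vec2ForCTC(config)

# Freeze everything (encoder + CNN feature extractor), then unfreeze lm_head.
for p in model.parameters():
    p.requires_grad = False
for p in model.lm_head.parameters():
    p.requires_grad = True

trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
# Only lm_head.weight and lm_head.bias remain trainable.

# Optimizer settings from the table above.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=5e-4, weight_decay=0.01)
```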

Stage 2 — LoRA Fine-Tuning (M4Singer)

| Setting | Value |
|---------|-------|
| Training data | M4Singer (Mandarin singing, ~700 songs, 20 singers) |
| Base model | Stage 1 best checkpoint |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, out_proj, intermediate_dense, output_dense |
| modules_to_save | lm_head (full weights, not LoRA) |
| CNN feature extractor | Frozen |
| Batch size | 8 (gradient accumulation × 4 = effective batch 32) |
| Learning rate | LoRA adapter: 1e-5 / lm_head: 1e-6 (separate param groups, AdamW) |
| Schedule | Cosine decay with linear warmup (10 epochs) |
| Regularization | Wav2Vec2 built-in SpecAugment (mask_time_prob=0.05) |
| Early stopping | patience = 100 epochs (PER-based, min_delta = 0.001) |
| Training precision | bfloat16 AMP (CTC loss computed in float32) |
| Evaluation metric | Phoneme Error Rate (PER), CTC greedy decoding |

Usage

Load the Model

This checkpoint is already merged — load it directly with transformers, no PEFT required.

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch

processor = Wav2Vec2Processor.from_pretrained("path/to/final")
model = Wav2Vec2ForCTC.from_pretrained("path/to/final")
model.eval()

Inference (CTC Greedy Decode)

import torchaudio

waveform, sr = torchaudio.load("singing.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(0)  # convert to mono

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, T, 39)

# CTC greedy decode: argmax → collapse repeats → remove blank
blank_id = processor.tokenizer.pad_token_id  # <pad> doubles as the CTC blank
pred_ids = logits.argmax(dim=-1)[0].tolist()
decoded = []
prev = -1
for idx in pred_ids:
    if idx != prev:
        if idx != blank_id:
            decoded.append(idx)
        prev = idx

phonemes = processor.tokenizer.convert_ids_to_tokens(decoded)
print(phonemes)

Merge LoRA Weights

Already done. This checkpoint was saved after calling merge_and_unload(). The code below is provided for reference only, in case you are working from an unmerged PEFT checkpoint.

# Only needed if starting from a raw PEFT checkpoint (not this one)
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from peft import PeftModel

base_model = Wav2Vec2ForCTC.from_pretrained("path/to/stage1_best")
peft_model = PeftModel.from_pretrained(base_model, "path/to/stage2_peft")
merged_model = peft_model.merge_and_unload()

merged_model.save_pretrained("path/to/final")
processor = Wav2Vec2Processor.from_pretrained("path/to/stage1_best")
processor.save_pretrained("path/to/final")

Intended Use

  • Singing vocal alignment: Force-aligning phoneme sequences to audio timestamps (CTC forced alignment) for generating per-character / per-phoneme timing (.lrc, .srt, .ass)
  • Singing acoustics research: Phoneme-level feature extraction and articulation analysis for Mandarin singing

Out-of-Scope Use

  • General Mandarin ASR: Use the original jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn instead
  • Non-Mandarin singing: The phoneme set supports multilingual phonemes by design, but training data is Mandarin-only; generalization to other languages is unverified
  • Real-time / low-latency applications: The model is not optimized for streaming inference

Training Datasets

| Dataset | Language | Domain | Stage |
|---------|----------|--------|-------|
| AISHELL-3 | Mandarin | Speech | Stage 1 |
| M4Singer | Mandarin | Singing | Stage 2 |

Limitations

  • The model outputs a phoneme sequence, not Chinese characters; it cannot be used for text transcription without a phoneme-to-pinyin/hanzi conversion step
  • Trained primarily on Mandarin singing (M4Singer); generalization to other styles (e.g., Cantonese, English pop, rap) may be limited
  • The custom phoneme vocabulary requires the accompanying tokenizer (vocab.json) for decoding

Framework Versions

  • Transformers (compatible with PEFT 0.18.1)
  • PyTorch (with bfloat16 AMP support)
  • PEFT 0.18.1 (training only; not required at inference)
Model: lyonlu13/wav2vec2-large-zh-singing-phoneme-ctc (~0.3B parameters, Safetensors)