
Pathumma Large V3 - DoRA Robust (Thai ASR) 🎙️🇹🇭

This repository contains the fine-tuned PEFT adapter weights for Thai Automatic Speech Recognition (ASR). The model is built on top of nectec/Pathumma-whisper-th-large-v3 and optimized using DoRA (Weight-Decomposed Low-Rank Adaptation) to handle highly challenging audio environments.

This model was specifically fine-tuned using the LOTUSDIS dataset provided by NECTEC, which features a wide variety of difficult acoustic conditions and microphone types.

🚀 Key Features & Optimizations

  • Training Data (LOTUSDIS by NECTEC): Fine-tuned on the LOTUSDIS dataset, whose recordings span a wide variety of difficult acoustic conditions and microphone types.
  • Data Curation via WER Filtering (Golden Dataset): The training data was curated by first measuring the Word Error Rate (WER) of each individual audio file's transcription. Heavily corrupted or unlearnable samples from LOTUSDIS were filtered out to construct a high-quality "Golden Dataset".
  • Audio Preprocessing & Noise Reduction: Applied audio denoising techniques to clean the raw audio files, effectively stripping away background noise before feeding them into the training pipeline.
  • Balanced Microphone Distribution: Handled imbalanced data by applying Stratified Sampling across 6 different microphone types, ensuring the model doesn't overfit to a specific audio profile.
  • Advanced Fine-Tuning: Applied DoRA (targeting all linear layers) instead of standard LoRA to achieve better magnitude and directional updates, pushing WER down to 35.8% on a highly difficult evaluation set.
  • Text Post-Processing Pipeline: Utilized PyThaiNLP for rigorous text normalization, resolving Thai floating-vowel issues and converting Arabic numerals to Thai words to match competition standards.
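The WER-filtering step above can be sketched as follows. This is a minimal, dependency-free illustration: the `WER_CUTOFF` value and the `build_golden_dataset` helper are hypothetical (not the actual values or code used for this model), and in practice a library such as jiwer would typically compute WER.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)

# Illustrative cutoff only; samples whose baseline transcription scores
# worse than this are treated as unlearnable and dropped.
WER_CUTOFF = 0.8

def build_golden_dataset(samples):
    """samples: iterable of (reference_text, model_hypothesis, audio_path)."""
    return [path for ref, hyp, path in samples if wer(ref, hyp) <= WER_CUTOFF]
```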

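The balanced-microphone step can be sketched like this. The record layout, `mic_type` field name, and sample counts are assumptions for illustration; the actual pipeline's sampler may differ.

```python
import random
from collections import defaultdict

def stratified_sample(records, per_group, key="mic_type", seed=42):
    """Draw up to per_group samples from each microphone group so that
    no single audio profile dominates the training mix."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    balanced = []
    for mic, recs in groups.items():
        # Downsample over-represented microphones; keep all of rare ones
        k = min(per_group, len(recs))
        balanced.extend(rng.sample(recs, k))
    rng.shuffle(balanced)
    return balanced
```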
💻 How to Use

You can load this model directly using the transformers and peft libraries.

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel

device = "cuda:0" if torch.cuda.is_available() else "cpu"
base_model_id = "nectec/Pathumma-whisper-th-large-v3"
peft_model_id = "pmootr/pathumma-large-v3-dora-robust"

# 1. Load Base Model and Processor
processor = WhisperProcessor.from_pretrained(base_model_id)
base_model = WhisperForConditionalGeneration.from_pretrained(base_model_id, device_map=device)

# 2. Attach DoRA Adapter and Merge
model = PeftModel.from_pretrained(base_model, peft_model_id).merge_and_unload()

# 3. Transcribe Audio
def transcribe(audio_path):
    # Note: Ensure the audio is preprocessed (noise reduction) for best results
    audio_array, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(audio_array, sampling_rate=sr, return_tensors="pt")
    # Force Thai transcription; note that newer transformers versions prefer
    # passing language/task kwargs to generate() over forced_decoder_ids
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="thai", task="transcribe")
    
    with torch.no_grad():
        predicted_ids = model.generate(
            inputs.input_features.to(device, dtype=model.dtype),
            forced_decoder_ids=forced_decoder_ids,
            max_new_tokens=255,
            num_beams=5,
            repetition_penalty=1.2
        )
    text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return text.strip()

# Example:
# print(transcribe("sample_audio.wav"))
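The text post-processing mentioned above relies on PyThaiNLP; a dependency-free, digit-by-digit sketch of the numeral-conversion idea looks like the following. This is illustrative only: full Thai number reading also handles place values (tens, hundreds, ...), which PyThaiNLP's utilities cover.

```python
# Digit-by-digit mapping only; a full converter also reads place values.
THAI_DIGITS = {
    "0": "ศูนย์", "1": "หนึ่ง", "2": "สอง", "3": "สาม", "4": "สี่",
    "5": "ห้า", "6": "หก", "7": "เจ็ด", "8": "แปด", "9": "เก้า",
}

def digits_to_thai(text: str) -> str:
    """Replace each Arabic digit with its Thai word; other text is untouched."""
    return "".join(THAI_DIGITS.get(ch, ch) for ch in text)
```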
