SARC-Taigi-LLM-12b

This model is a specialized version of google/gemma-3-12b-it, fine-tuned with QLoRA on IMA's "Taiwan Tongues" Taigi datasets by the Speech AI Research Center (SARC). It is optimized for Taiwanese (Taigi), and its multi-stage fine-tuning process enhances the model's linguistic richness and cultural grounding in Taiwanese.

1. Main Capabilities

  • Taigi Dialogue and Consultation: Capable of understanding and responding to daily and professional inquiries in Taigi (using Taiwanese Chinese characters (Tâi-bûn Hàn-jī) or Romanization (Tâi-lô)).
  • Linguistic Knowledge Retrieval: Supports queries regarding the meaning, usage, and cultural background of Taigi vocabulary.
  • Logical Reasoning: Performs logical judgment and problem-solving specifically within a Taigi linguistic context.

2. Demonstration

3. Training Pipeline

The model underwent a two-stage training process designed to build a robust linguistic foundation, followed by instruction alignment:

  • Phase 1: Continual Pre-Training (CPT)
    • Ministry of Education Dictionary of Frequently-Used Taiwanese (Taigi).
    • Taigi Literature Collection (taigi-literature): A diverse corpus of classical and modern Taigi literary works.
  • Phase 2: Supervised Fine-Tuning (SFT)
    • Taigi-version Alpaca Dataset: Instruction-following data optimized for Taigi dialogue.
    • Grand Challenge Training Set: Multiple-choice questions from the training text of the 1st "Grand Challenge" (科技大擂台) competition.
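As a rough illustration of how Alpaca-style instruction records are typically rendered into SFT training text, the sketch below uses the standard Alpaca field names (instruction/input/output); the prompt template itself is an assumption, not necessarily the exact one used in this project:

```python
def format_alpaca_example(record: dict) -> str:
    """Render one Alpaca-style record into a single SFT training string.

    Field names follow the standard Alpaca schema; the template is
    illustrative only.
    """
    if record.get("input"):
        prompt = (
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n### Response:\n"
        )
    else:
        prompt = f"### Instruction:\n{record['instruction']}\n\n### Response:\n"
    return prompt + record["output"]

# Hypothetical Taigi (Tâi-lô) record for demonstration purposes.
example = {
    "instruction": "Chhiáⁿ kā chit kù ōe kái-soeh chi̍t-ē.",
    "input": "",
    "output": "Che sī chi̍t ê hôe-tap ê lē.",
}
print(format_alpaca_example(example))
```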

4. Evaluation on the 2020 Grand Challenge "Talk to AI" Final-Test Dataset

We evaluated the models on the 2020 Grand Challenge "Talk to AI" (科技大擂台,與AI對話) Final-Test Dataset, which consists of 1,000 multiple-choice reading comprehension questions and serves as a benchmark for Taigi language understanding:

  • Question Example
  • Experimental Results
Stage      Gemma-3-12b-it  Gemma-3-27b-it  Note
Original   0.80320         0.86214         Baseline performance
After CPT  0.88312         0.92296         Knowledge internalization
After SFT  0.89610         0.92582         Instruction alignment
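The accuracy figures above are the fraction of questions answered correctly. A minimal scoring sketch (function and variable names are illustrative, not taken from the project's evaluation code):

```python
def multiple_choice_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of questions where the predicted choice matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Toy example: 3 of 4 answers correct.
print(multiple_choice_accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"]))  # 0.75
```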

5. Model Usage (QLoRA Adapter)

This model is released as a QLoRA Adapter (covering both CPT and SFT stages). To use it, you must load the base Gemma-3 model and apply the adapter as shown below:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# 1. 4-bit Quantization Config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# 2. Load Base Model
model_id = "google/gemma-3-12b-it"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config, # Optional; depends on available GPU memory (VRAM)
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 3. Load SARC-Taigi-LLM-12b Adapter and Merge
adapter_id = "Speech-AI-Research-Center/SARC-Taigi-LLM-12b"
model = PeftModel.from_pretrained(model, adapter_id)
# Note: merging into a 4-bit quantized base can reduce fidelity; if you loaded
# the base with quantization_config, consider keeping the adapter unmerged.
model = model.merge_and_unload()

# 4. Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    padding_side="left",
)
if tokenizer.pad_token is None: 
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

model = model.eval()

6. Roadmap: Beyond SFT

While the current release is the result of CPT and SFT, this is only the beginning. Our multi-stage alignment strategy includes:

  • Phase I (CPT): Building linguistic foundation (Completed).
  • Phase II (SFT): Instruction and dialogue alignment (Current Release).
  • Phase III (GRPO): Future reinforcement learning using Group Relative Policy Optimization (GRPO) to further enhance self-correction and complex reasoning chains.
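The core idea of GRPO is to score each of several sampled responses relative to its group's mean reward rather than against a learned value model. A minimal sketch of the group-relative advantage computation (a simplified illustration, not the planned training code):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize each sampled response's reward within its group:
    A_i = (r_i - mean(r)) / (std(r) + eps), as used in GRPO."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt: two rewarded, two not.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Responses above the group mean receive positive advantages and are reinforced; those below are suppressed.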

7. Training Resources

Learn how to perform this multi-stage fine-tuning (CPT + SFT), including our custom callbacks for loss minimization and train/eval gap stability, on GitHub:

[GitHub: SARC-Taigi-LLM Training Pipeline]

Citation

If you find this project useful, please cite IMA's Taiwan Tongues resource page and the Speech AI Research Center organization pages on Hugging Face and GitHub:

@misc{ima_taiwan_2026,
  title        = {IMA-Taiwan},
  author       = {Information Management Association of R.O.C. (IMA)},
  year         = {2026},
  howpublished = {https://huggingface.co/IMA-Taiwan},
  note         = {Hugging Face organization page for Taiwan Tongues resources}
}
@misc{sarc_hf_2026,
  title        = {Speech-AI-Research-Center},
  author       = {Speech AI Research Center (SARC)},
  year         = {2026},
  howpublished = {https://huggingface.co/Speech-AI-Research-Center},
  note         = {Hugging Face organization page for released Taigi model adapters}
}
@misc{sarctaigillm_repo_2026,
  title        = {Speech-AI-Research-Center},
  author       = {Speech AI Research Center (SARC)},
  year         = {2026},
  howpublished = {https://github.com/Speech-AI-Research-Center},
  note         = {GitHub organization page for released Taigi-LLM training project}
}

License

This model is subject to the Gemma Terms of Use. By using this model, you agree to comply with Google’s licensing requirements.
