MonSub - Mongol Editor LLM v2

A reasoning LLM tailored to Mongolian: it corrects Whisper ASR output and handles general Mongolian-language reasoning. V2 is a LoRA fine-tune on top of the Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled base.

Model Summary

| Field | Value |
|---|---|
| Base model | Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Architecture | Qwen3.5 causal LM, 4B parameters, bfloat16 |
| Adapter type | LoRA (PEFT), r=64, alpha=128 |
| Adapter size | ~340 MB (adapter_model.safetensors) |
| Trainable params | ~84M (2.1% of base) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Context length | 768 tokens (training); up to 32k (inference) |
| Language | Mongolian (Cyrillic) + English transliteration |
| License | Apache 2.0 (inherits from base model terms) |

Intended Use

Primary production use cases:

  1. ASR post-editing: fix raw Whisper output (punctuation restoration, capitalization, grammar normalization, brand name correction).
  2. Subtitle cleanup: turn messy chunked transcripts into readable subtitle lines, with a chain-of-thought explanation of what was fixed.
  3. Short Mongolian summarization: condense articles, interviews, or transcripts into 2-3 sentence summaries.
  4. Mongolian grammar Q&A: explain Mongolian language rules, suffix agreement, conjunction usage.
  5. General Mongolian reasoning: step-by-step analysis of questions with structured chain-of-thought responses (inherited from the Claude-distilled base).

Out-of-scope / not recommended:

  • Real-time conversational chat (model is optimized for edit tasks)
  • Long-form creative writing (generation is typically capped at 512 new tokens)
  • Languages other than Mongolian + English transliterated terms
  • Factual questions outside the training-time knowledge cutoff (2024)

Training Details

Data composition

V2 extends V1 with augmented Mongolian data targeting known weaknesses:

| Bucket | Examples | Purpose |
|---|---|---|
| Original news CPT + SFT | 95,480 | Base Mongolian fluency + punctuation |
| Brand name corrections | 1,038 × 3 weight | "чита 5" → "GTA 5", "аифон" → "iPhone" |
| Anti-hallucination pairs | 282 × 3 weight | Clean text → unchanged output |
| Comma placement rules | 100 × 3 weight | Conjunction commas (бөгөөд, харин, мөн) |
| Mongolian proper nouns | 100 × 3 weight | "улаанбаатар" → "Улаанбаатар" |
| Knowledge Q&A | 26 × 8 weight | History, culture, geography, literature |
| Summarization | 7 × 8 weight | Long text → short summary |
| Reasoning chains | 5 × 8 weight | Multi-perspective analysis |
| Content rewriting | 6 × 8 weight | Formal↔casual, long↔short |
| Language grammar rules | 5 × 8 weight | "гэж" vs "гэдэг", suffix agreement |
| Total (weighted) | 97,673 | |
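The "× N weight" column indicates that the small buckets were oversampled relative to the news corpus. A minimal sketch of one way to realize such weighting, assuming plain duplication at dataset-build time (the actual MonSub data pipeline is not published):

```python
from datasets import Dataset, concatenate_datasets

def weight_bucket(ds: Dataset, weight: int) -> Dataset:
    """Oversample a bucket by repeating it `weight` times."""
    return concatenate_datasets([ds] * weight)

# e.g. brand-correction pairs counted at 3x weight (bucket names are illustrative):
# train = concatenate_datasets(
#     [news_ds, weight_bucket(brand_ds, 3), weight_bucket(qa_ds, 8)]
# ).shuffle(seed=42)
```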

Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 5e-5 (V1 used 1e-4) |
| Scheduler | Cosine with warmup |
| Warmup ratio | 0.03 |
| Batch size (effective) | 16 (4 × 4 grad accum) |
| Epochs | 3 |
| Max sequence length | 768 |
| Precision | bfloat16 |
| Gradient checkpointing | enabled (use_reentrant=False) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Optimizer | AdamW (torch) |
| Total steps | 18,315 |
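For reference, the table maps onto a standard PEFT + Transformers setup roughly as follows. This is a sketch under the stated hyperparameters, not the actual training script; output_dir is an illustrative placeholder:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA configuration matching the model summary (r=64, alpha=128).
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Training arguments matching the hyperparameter table above.
args = TrainingArguments(
    output_dir="monsub-v2",            # hypothetical path
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch size 16
    num_train_epochs=3,
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    weight_decay=0.01,
    max_grad_norm=1.0,
    optim="adamw_torch",
)
```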

Training run

  • Hardware: NVIDIA A40 46 GB
  • Duration: ~9.5 hours (34,160 sec wall clock)
  • Resume strategy: initialized from V1 adapter, fine-tuned further
  • Final metrics:
    • train_loss: 0.7875 (down from V1's 1.105, a 29% reduction)
    • train_samples_per_second: 8.58
    • train_steps_per_second: 0.536
    • Perplexity: exp(0.7875) ≈ 2.20 (see the quick check below)
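Since perplexity is just the exponential of the mean cross-entropy loss, the quoted figures can be checked directly:

```python
import math

# Perplexity = exp(mean cross-entropy loss).
ppl_v2 = math.exp(0.7875)  # ≈ 2.20
ppl_v1 = math.exp(1.105)   # ≈ 3.02
print(f"V2 {ppl_v2:.2f} vs V1 {ppl_v1:.2f}")
```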

Evaluation

Heuristic eval (10 real-world cases)

| Test case | V1 quality | V2 quality |
|---|---|---|
| Whisper brand-error chunk | 3/10 | 5/10 |
| News segment | 9/10 | 9/10 |
| Interview (commas + caps) | 8/10 | 9/10 |
| Short punctuation restore | 10/10 | 10/10 |
| Long punct + comma | 9/10 | 9/10 |
| Casual YouTube vlog intro | 10/10 | 10/10 |
| Tech review (iPhone) | 8/10 | 8.5/10 |
| Clean text (hallucination test) | 5/10 ⚠️ | 10/10 ✅ |
| Podcast discussion | 8/10 | 8/10 |
| Dates and numbers | 9/10 | 9/10 |
| Average | ~79% | ~87.5% |

Key improvements in V2:

  • ✅ Hallucination guard: the biggest win. On clean-text inputs, V1 sometimes invented new facts (e.g. "...hot бөгөөд 1924 онд байгуулагдсан", "...and was founded in 1924"). V2 correctly outputs "Засах шаардлагагүй" ("no correction needed"); see the sketch after this list.
  • ✅ Brand detection: V2 explicitly flags "Брэндийн нэр буруу бичигдсэн" ("the brand name is misspelled") in its chain-of-thought and produces "IPhone" (V1 produced "Аифон").
  • ✅ Cleaner CoT format: more consistent structure, "Дараах зүйл засах хэрэгтэй → Засварласан хувилбар" ("The following needs fixing → Corrected version").
  • ✅ Lower perplexity: 2.20 vs V1's ≈3.02 (27% lower).
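Downstream code can act on the hallucination guard by checking for the "no correction needed" sentinel before replacing the original text. A minimal sketch; the sentinel string follows the training data described above, while the apply_edit() helper name is ours:

```python
# "Засах шаардлагагүй" = "no correction needed" (sentinel from the anti-hallucination pairs).
NO_CHANGE_SENTINEL = "Засах шаардлагагүй"

def apply_edit(original: str, model_output: str) -> str:
    """Keep the original text untouched when the model reports nothing to fix."""
    if NO_CHANGE_SENTINEL in model_output:
        return original
    return model_output
```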

Residual weaknesses (for V3 roadmap):

  • Internal commas before "бөгөөд/харин" still inconsistent
  • "iPhone" vs "IPhone" capitalization
  • "Чита 5" → "GTA 5" is handled by the backend post-processing dictionary, not the LoRA itself
  • Slower at inference (16 s/case vs V1's 12 s) due to more verbose CoT

Real-world production quality

When combined with the MonSub backend's brand-dictionary post-processor (apply_brand_fixes()), end-to-end subtitle quality reaches roughly 90%+ on typical Mongolian YouTube/podcast content.
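Conceptually, the post-processor is a dictionary lookup-and-replace pass over the model output. A minimal sketch of what apply_brand_fixes() might look like; the real implementation lives in the backend's editor_postprocess.py and the dictionary entries here are an illustrative subset:

```python
import re

# Illustrative subset of the brand dictionary (ASR error -> canonical form).
BRAND_FIXES = {
    "чита 5": "GTA 5",
    "аифон": "iPhone",
}

def apply_brand_fixes(text: str) -> str:
    """Replace known ASR brand-name errors, case-insensitively."""
    for wrong, right in BRAND_FIXES.items():
        text = re.sub(re.escape(wrong), right, text, flags=re.IGNORECASE)
    return text
```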

Usage

Direct inference (Python)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

BASE = "Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled"
ADAPTER = "Tsedee/mongol-editor-llm-v2"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    dtype=torch.bfloat16,  # "torch_dtype" on older transformers versions
    device_map="auto",
    trust_remote_code=True,
)
# Attach the LoRA adapter on top of the frozen base model.
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

def edit(instruction: str, text: str) -> str:
    """Run one instruction + text pair through the model and return the new tokens."""
    msgs = [{"role": "user", "content": f"{instruction}\n\n{text}"}]
    prompt = tokenizer.apply_chat_template(
        msgs, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,          # greedy decoding for deterministic edits
            repetition_penalty=1.05,
            pad_token_id=tokenizer.pad_token_id,
        )
    # Decode only the generated continuation, not the prompt.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Example: Whisper ASR fix
result = edit(
    # "Please correct the following ASR output text and turn it into a proper subtitle."
    instruction="Дараах ASR-ээс гарсан текстийг засварлаж, зөв subtitle болгоно уу.",
    # "today we will discuss the new version of the chita 5 game"
    text="өнөөдөр бид чита 5 тоглоомын шинэ хувилбарын тухай ярилцах болно",
)
print(result)
```

Example outputs

Input (instruction): Дараах ASR-ээс гарсан текстийг засварлаж, зөв subtitle болгоно уу.
Input (text): би 10 жилийн өмнө энэ компанид орж ирсэн тэр үед гуравхан хүн ажилладаг байсан одоо гурван зуу гаруй ажилтантай
(English gist: "I joined this company 10 years ago; back then only three people worked here, now there are over three hundred employees.")

V2 output:

Би 10 жилийн өмнө энэ компанид орж ирсэн тэр үед гуравхан хүн
ажилладаг байсан одоо гурван зуу гаруй ажилтантай.

Input (instruction): Дараах асуудлыг шинжилж, дүгнэлт гаргана уу. ("Analyze the following question and draw a conclusion.")
Input (text): Залуус яагаад гадаад руу явж суралцдаг вэ? ("Why do young people go abroad to study?")

V2 output: (chain-of-thought)

Энэ асуултыг хэд хэдэн өнцгөөс авч үзье:

1. Боловсролын чанар: Олон оронд эрэлттэй байдаг шинжлэх ухаан,
   инженерчлэл, анагаах ухааны чиглэлээр дэлхийн зэрэглэлтэй их
   сургуульд суралцах боломж байдаг.
2. Ажлын зах зээл: Гадаадад сурсан дипломтой хүн Монголын зах зээл дээр
   өндөр үнэтэй мэргэжилтэн болж ажлын байр олоход илүү хялбар байдаг.
3. Хэл сурах: Англи, Япон, Солонгос хэлийг бүрэн эзэмших нь гадаад орон
   хамгийн үр дүнтэй газар.
4. Олон улсын туршлага: Өөр соёл, амьдралын хэв маяг мэдрэх нь хувь
   хүний хөгжилд чухал.

Дүгнэлт: Залуус гадаад руу очих шалтгаан нь олон талын (боловсрол,
карьер, хувь хүний хөгжил) гэсэн 3 гол бүрэлдэхүүн хэсгийг агуулж байна.

(English gist: the model weighs education quality, the job market, language acquisition, and international experience, then concludes the motivations combine education, career, and personal development.)

Limitations

  • At 4B parameters the model is ~100× smaller than Claude Opus / GPT-4. For broad general knowledge, a larger frontier model will outperform MonSub.
  • Mongolian-specific proper nouns not in the training data (e.g. newer celebrities, 2025+ events) may not be recognized.
  • Brand correction relies on a hybrid: the LoRA detects brand errors, but the backend dictionary (editor_postprocess.py) does the actual replacement. Running the LoRA alone, without the post-processor, will miss some brand fixes.
  • Output verbosity: the model likes to emit chain-of-thought even for simple fixes. For minimum-latency subtitle editing, strip the reasoning block before rendering (see the example code below).
  • Not safety-tuned: inherits the base model's alignment (distilled from Claude, so relatively well-behaved, but not explicitly RLHF'd by us).

Stripping chain-of-thought from output

Model output typically follows the pattern below ("Энэ өгүүлбэрт дараах зүйлс засах хэрэгтэй" = "the following needs fixing in this sentence"; "Засварласан хувилбар" = "corrected version"):

```text
Энэ өгүүлбэрт дараах зүйлс засах хэрэгтэй:
1. ...
2. ...

Засварласан хувилбар:
<FINAL TEXT HERE>
</think>

<FINAL TEXT DUPLICATE>
```

To extract only the final corrected text:

```python
def extract_final(raw: str) -> str:
    """Strip the chain-of-thought and return only the corrected text."""
    text = raw
    # Take the content after the "corrected version" marker, if present.
    for marker in ("Засварласан хувилбар:", "Засварласан өгүүлбэр:"):
        if marker in text:
            text = text.split(marker, 1)[1]
            break
    # Cut at </think> if present (the final text is duplicated after it).
    if "</think>" in text:
        text = text.split("</think>", 1)[0]
    return text.strip()
```
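Putting the pieces together, a typical subtitle path chains the edit() helper from the usage section, extract_final() above, and a brand post-processor (the apply_brand_fixes() sketch earlier, or the MonSub backend's own implementation):

```python
raw = edit(
    instruction="Дараах ASR-ээс гарсан текстийг засварлаж, зөв subtitle болгоно уу.",
    text="өнөөдөр бид чита 5 тоглоомын шинэ хувилбарын тухай ярилцах болно",
)
subtitle = apply_brand_fixes(extract_final(raw))  # CoT stripped, brand names fixed
print(subtitle)
```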

Citation

If you use this model in your research or product, please cite:

```bibtex
@misc{monsub-editor-v2-2026,
  author       = {Tsedee},
  title        = {MonSub Mongol Editor v2: Mongolian Subtitle Correction LLM},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Tsedee/mongol-editor-llm-v2}},
}
```

Framework versions

  • PEFT 0.18.1
  • Transformers 5.5.0
  • PyTorch 2.6.0+cu124
  • Datasets 4.8.4
  • Tokenizers 0.22.2