MonSub – Mongol Editor LLM v2
A reasoning LLM tuned specifically for Mongolian: it corrects raw Whisper ASR output and handles general Mongolian-language editing tasks. This v2 release is a LoRA fine-tune on top of Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled.
Model Summary
| Field | Value |
|---|---|
| Base model | Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Architecture | Qwen3.5 causal LM, 4B parameters, bfloat16 |
| Adapter type | LoRA (PEFT), r=64, alpha=128 |
| Adapter size | ~340 MB (adapter_model.safetensors) |
| Trainable params | ~84M (2.1% of base) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Context length | 768 tokens (training); up to 32k (inference) |
| Language | Mongolian (Cyrillic) + English transliteration |
| License | Apache 2.0 (inherits from base model terms) |
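For reference, the adapter settings above map onto a PEFT LoraConfig roughly as follows. This is a sketch, not the published training config; lora_dropout is an assumed placeholder, since the card does not report it.

```python
from peft import LoraConfig

# Sketch of the adapter config implied by the summary table.
# lora_dropout is assumed -- it is not reported in this card.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_dropout=0.05,  # assumed placeholder
    bias="none",
    task_type="CAUSAL_LM",
)
```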
Intended Use
Primary production use cases:
- ASR post-editing – fix raw Whisper output: punctuation restoration, capitalization, grammar normalization, brand name correction.
- Subtitle cleanup – turn messy chunked transcripts into readable subtitle lines with a chain-of-thought explanation of what was fixed.
- Short Mongolian summarization โ condense articles, interviews, or transcripts into 2-3 sentence summaries.
- Mongolian grammar Q&A โ explain Mongolian language rules, suffix agreement, conjunction usage.
- General Mongolian reasoning – step-by-step analysis of questions with structured chain-of-thought responses (inherited from the Claude-distilled base).
Out-of-scope / not recommended:
- Real-time conversational chat (model is optimized for edit tasks)
- Long-form creative writing (limited to 512 output tokens typically)
- Languages other than Mongolian + English transliterated terms
- Factual questions outside the training-time knowledge cutoff (2024)
Training Details
Data composition
V2 extends V1 with augmented Mongolian data targeting known weaknesses:
| Bucket | Examples | Purpose |
|---|---|---|
| Original news CPT + SFT | 95,480 | Base Mongolian fluency + punctuation |
| Brand name corrections | 1,038 × 3 weight | "чита 5" → "GTA 5", "аифон" → "iPhone" |
| Anti-hallucination pairs | 282 × 3 weight | Clean text → unchanged output |
| Comma placement rules | 100 × 3 weight | Conjunction commas (бөгөөд, харин, мөн) |
| Mongolian proper nouns | 100 × 3 weight | "улаанбаатар" → "Улаанбаатар" |
| Knowledge Q&A | 26 × 8 weight | History, culture, geography, literature |
| Summarization | 7 × 8 weight | Long text → short summary |
| Reasoning chains | 5 × 8 weight | Multi-perspective analysis |
| Content rewriting | 6 × 8 weight | Formal→casual, long→short |
| Language grammar rules | 5 × 8 weight | "гэж" vs "гэдэг", suffix agreement |
| Total (weighted) | 97,673 | |
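The ×3 / ×8 factors are oversampling weights. A minimal sketch of one way to realize them, by repeating each bucket before shuffling; the bucket names and contents here are illustrative, since the actual dataset layout is not published with this card:

```python
import random

# Illustrative buckets only -- the real training data is not published.
# Each bucket maps to (examples, oversampling weight).
buckets = {
    "base_news_sft":     (["ex_a", "ex_b"], 1),  # original CPT + SFT
    "brand_corrections": (["ex_c"],         3),  # x3 weight
    "knowledge_qa":      (["ex_d"],         8),  # x8 weight
}

weighted = []
for _name, (examples, weight) in buckets.items():
    weighted.extend(examples * weight)  # repeat the bucket `weight` times

random.shuffle(weighted)  # interleave buckets before training
```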
Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 5e-5 (v1 used 1e-4) |
| Scheduler | Cosine with warmup |
| Warmup ratio | 0.03 |
| Batch size (effective) | 16 (4 × 4 grad accum) |
| Epochs | 3 |
| Max sequence length | 768 |
| Precision | bfloat16 |
| Gradient checkpointing | enabled (use_reentrant=False) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Optimizer | AdamW (torch) |
| Total steps | 18,315 |
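As a rough reconstruction, the table above corresponds to a Hugging Face TrainingArguments along these lines; this is a sketch, not the actual training script, and output_dir is a placeholder (logging/save settings are omitted):

```python
from transformers import TrainingArguments

# Reconstructed from the hyperparameter table above; output_dir is a
# placeholder, and logging/save settings are not from the real run.
args = TrainingArguments(
    output_dir="mongol-editor-llm-v2",   # placeholder path
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch size 16
    num_train_epochs=3,
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    weight_decay=0.01,
    max_grad_norm=1.0,
    optim="adamw_torch",
)
```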
Training run
- Hardware: NVIDIA A40 46 GB
- Duration: ~9.5 hours (34,160 sec wall clock)
- Resume strategy: initialized from V1 adapter, fine-tuned further
- Final metrics:
  - train_loss: 0.7875 (down from V1's 1.105, a ~29% reduction)
  - train_samples_per_second: 8.58
  - train_steps_per_second: 0.536
- Perplexity: exp(0.7875) ≈ 2.20
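The perplexity figures are just the exponentiated mean cross-entropy losses, which you can verify directly:

```python
import math

# Perplexity = exp(cross-entropy loss)
print(math.exp(0.7875))  # ~2.20 (V2 training perplexity)
print(math.exp(1.105))   # ~3.02 (V1 training perplexity)
```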
Evaluation
Heuristic eval (10 real-world cases)
| Test case | V1 quality | V2 quality |
|---|---|---|
| Whisper brand-error chunk | 3/10 | 5/10 |
| News segment | 9/10 | 9/10 |
| Interview (commas + caps) | 8/10 | 9/10 |
| Short punctuation restore | 10/10 | 10/10 |
| Long punct + comma | 9/10 | 9/10 |
| Casual YouTube vlog intro | 10/10 | 10/10 |
| Tech review (iPhone) | 8/10 | 8.5/10 |
| Clean text (hallucination test) | 5/10 ⚠️ | 10/10 ✅ |
| Podcast discussion | 8/10 | 8/10 |
| Dates and numbers | 9/10 | 9/10 |
| Average | ~79% | ~86.5% |
Key improvements in V2:
- ✅ Hallucination guard – the biggest win. On clean-text inputs, V1 sometimes invented new facts (e.g. "...хот бөгөөд 1924 онд байгуулагдсан", "...is a city and was founded in 1924"). V2 correctly outputs "Засах шаардлагагүй" ("no change needed").
- ✅ Brand detection – V2 explicitly flags "Брэндийн нэр буруу бичигдсэн" ("the brand name is misspelled") in its chain-of-thought and produces "IPhone" (V1 produced "Аифон").
- ✅ Cleaner CoT format – more consistent structure: "Дараах зүйл засах хэрэгтэй → Засварласан хувилбар" ("The following needs fixing → Corrected version").
- ✅ Lower perplexity – 2.18 vs V1's 3.00 (~27% lower).
Residual weaknesses (for v3 roadmap):
- Internal commas before "бөгөөд/харин" are still inconsistent
- "iPhone" vs "IPhone" capitalization is still unstable
- "Чита 5" → "GTA 5" is handled by the backend post-processing dictionary, not the LoRA itself
- Slower at inference (16 s/case vs V1's 12 s) due to more verbose CoT
Real-world production quality
When combined with the MonSub backend's brand-dictionary post-processor (apply_brand_fixes()), end-to-end subtitle quality reaches ~90%+ on typical Mongolian YouTube/podcast content.
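apply_brand_fixes() itself lives in the MonSub backend (editor_postprocess.py) and is not published with this card. A minimal sketch of the same dictionary idea might look like this, with entries taken from the examples above and the matching strategy assumed:

```python
import re

# Hypothetical sketch -- the real apply_brand_fixes() is not public.
# Entries are the brand errors mentioned elsewhere in this card.
BRAND_FIXES = {
    "чита 5": "GTA 5",
    "аифон": "iPhone",
    "IPhone": "iPhone",  # normalize the model's capitalization slip
}

def apply_brand_fixes(text: str) -> str:
    for wrong, right in BRAND_FIXES.items():
        # Case-insensitive literal replacement of each known error.
        text = re.sub(re.escape(wrong), right, text, flags=re.IGNORECASE)
    return text
```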
Usage
Direct inference (Python)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
BASE = "Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled"
ADAPTER = "Tsedee/mongol-editor-llm-v2"
tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

def edit(instruction: str, text: str) -> str:
    msgs = [{"role": "user", "content": f"{instruction}\n\n{text}"}]
    prompt = tokenizer.apply_chat_template(
        msgs, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,
            repetition_penalty=1.05,
            pad_token_id=tokenizer.pad_token_id,
        )
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
# Example: Whisper ASR fix
result = edit(
    # "Correct the following raw ASR text into a proper subtitle."
    instruction="Дараах ASR-ээс гарсан текстийг засварлаж, зөв subtitle болгоно уу.",
    # "today we will talk about the new version of the game GTA 5"
    # ("чита 5" is the Whisper mis-hearing of "GTA 5")
    text="өнөөдөр бид чита 5 тоглоомын шинэ хувилбарын тухай ярилцах болно",
)
print(result)
```
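If adapter-loading overhead matters in deployment, the LoRA weights can be merged into the base model once with the standard PEFT call; the save path below is a placeholder:

```python
# Optional: merge the adapter into the base weights for serving.
merged = model.merge_and_unload()
merged.save_pretrained("mongol-editor-llm-v2-merged")  # placeholder path
tokenizer.save_pretrained("mongol-editor-llm-v2-merged")
```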
Example outputs
Input (instruction): Дараах ASR-ээс гарсан текстийг засварлаж, зөв subtitle болгоно уу. ("Correct the following raw ASR text into a proper subtitle.")
Input (text): би 10 жилийн өмнө энэ компанид орж ирсэн тэр үед гуравхан хүн ажилладаг байсан одоо гурван зуу гаруй ажилтантай ("ten years ago when I joined this company only three people worked here now it has over three hundred employees", raw and unpunctuated)
V2 output:
Би 10 жилийн өмнө энэ компанид орж ирсэн тэр үед гуравхан хүн ажилладаг байсан одоо гурван зуу гаруй ажилтантай.
Input (instruction): Дараах асуудлыг шинжилж, дүгнэлт гаргана уу. ("Analyze the following problem and draw a conclusion.")
Input (text): Залуус яагаад гадаад руу явж суралцдаг вэ? ("Why do young people go abroad to study?")
V2 output (chain-of-thought):
Энэ асуултыг хэд хэдэн өнцгөөс авч үзье:
1. Боловсролын чанар: Олон оронд эрэлттэй байдаг шинжлэх ухаан, инженерчлэл, анагаах ухааны чиглэлээр дэлхийн зэрэглэлтэй их сургуульд суралцах боломж байдаг.
2. Ажлын зах зээл: Гадаадад сурсан дипломтой хүн Монголын зах зээл дээр өндөр үнэтэй мэргэжилтэн болж ажлын байр олоход илүү хялбар байдаг.
3. Хэл сурах: Англи, Япон, Солонгос хэлийг бүрэн эзэмших нь гадаад орон хамгийн үр дүнтэй газар.
4. Олон улсын туршлага: Өөр соёл, амьдралын хэв маяг мэдрэх нь хувь хүний хөгжилд чухал.
Дүгнэлт: Залуус гадаад руу очих шалтгаан нь олон талын – боловсрол, карьер, хувь хүний хөгжил гэсэн 3 гол бүрэлдэхүүн хэсгийг агуулж байна.
(In English: the model examines the question from four angles, education quality, the job market, language learning, and international experience, then concludes the reasons center on three components: education, career, and personal development.)
Limitations
- At 4B parameters, the model is roughly 100× smaller than Claude Opus / GPT-4. For broad general knowledge, a larger frontier model will outperform MonSub.
- Mongolian-specific proper nouns not in training data (e.g. newer celebrities, 2025+ events) may not be recognized.
- Brand correction relies on a hybrid: the LoRA detects brand errors, but the backend dictionary (editor_postprocess.py) does the actual replacement. Running the LoRA alone without the post-processor will miss some brand fixes.
- Output verbosity – the model tends to emit chain-of-thought even for simple fixes. For minimum-latency subtitle editing, strip the reasoning block before rendering (see the example code below).
- Not safety-tuned – inherits the base model's alignment (distilled from Claude, so relatively well aligned, but not explicitly RLHF'd by us).
Stripping chain-of-thought from output
Model output typically follows this pattern (the Mongolian markers mean "This sentence needs the following fixes:" and "Corrected version:"):

```
Энэ өгүүлбэрт дараах зүйлс засах хэрэгтэй:
1. ...
2. ...
Засварласан хувилбар:
<FINAL TEXT HERE>
</think>
<FINAL TEXT DUPLICATE>
```
To extract only the final corrected text:
```python
def extract_final(raw: str) -> str:
    text = raw
    # Take content after "Засварласан хувилбар:" ("Corrected version:")
    for marker in ("Засварласан хувилбар:", "Засварласан өгүүлбэр:"):
        if marker in text:
            text = text.split(marker, 1)[1]
            break
    # Cut at </think> if present (the final text is duplicated after it)
    if "</think>" in text:
        text = text.split("</think>", 1)[0]
    return text.strip()
```
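Combined with the edit() helper from the usage section above, a minimal end-to-end call then looks like:

```python
# End-to-end: generate with CoT, then keep only the corrected line.
raw = edit(
    "Дараах ASR-ээс гарсан текстийг засварлаж, зөв subtitle болгоно уу.",
    "өнөөдөр бид чита 5 тоглоомын шинэ хувилбарын тухай ярилцах болно",
)
print(extract_final(raw))  # final subtitle text, reasoning stripped
```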
Related
- Production site: monsub.vip – full subtitle editor product using this model
- Interactive demo: HuggingFace Space
- V1 model: Tsedee/mongol-editor-llm-v1 (baseline before augmented data)
- Whisper ASR: Tsedee/whisper-large-v3-turbo-mn-2 (the upstream ASR this model corrects)
- Subtitle LoRA: Tsedee/monsub-subtitle-v3 (Whisper LoRA adapter used in the same pipeline)
Citation
If you use this model in your research or product, please cite:
@misc{monsub-editor-v2-2026,
author = {Tsedee},
title = {MonSub Mongol Editor v2: Mongolian Subtitle Correction LLM},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Tsedee/mongol-editor-llm-v2}},
}
Framework versions
- PEFT 0.18.1
- Transformers 5.5.0
- PyTorch 2.6.0+cu124
- Datasets 4.8.4
- Tokenizers 0.22.2