naxi-qwen3-14b-v5
A fine-tuned large language model for Naxi ↔ Chinese ↔ English translation
Naxi (纳西语) is a Sino-Tibetan language spoken by approximately 300,000 people in Lijiang, Yunnan Province, China. This model is the first publicly available large language model fine-tuned specifically on the Naxi language.
Quick Facts
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-14B |
| Fine-tuning method | LoRA (r=256, α=256, 7 modules) |
| Training precision | bf16 throughout |
| Training corpus | ~62,000 weighted examples |
| NTQS score | 82.3 / 100 |
| Best task | Phonological recognition (char_f1 = 0.962) |
| Languages | Naxi, Mandarin Chinese, English |
| License | CC BY-NC 4.0 |
| HuggingFace | Apeters247/naxi-qwen3-14b-v5 |
1. Background: The Naxi Language
Naxi (also written Nakhi, Nahi, Na, 纳西) is a Sino-Tibetan language of the Naxi–Bai branch, spoken primarily in Lijiang Prefecture, Yunnan Province, People's Republic of China, with smaller communities in Sichuan and Tibetan borderlands. The prestige variety — Lijiang Naxi or the Common dialect (COM in ISO 639-3: nxq) — is the focus of this model.
Typological profile
Naxi is a tonal, verb-final, agglutinative language with a six-tone system encoded in its Latin romanization through tone-final consonants (see §5). Unlike neighboring Tibetan or Chinese, Naxi encodes evidentiality and aspect morphologically, making it typologically unusual among languages of the Tibetan plateau and Yunnan corridor. It coexists with Mandarin, creating extensive contact-induced change, particularly in loanword phonology.
The Naxi people are also known for the Dongba script (东巴文), one of the last pictographic writing systems still in active religious use, employed by Dongba priests (tomba) for ritual texts. However, this model is trained exclusively on Latin romanization (Naxi Pinyin), the modern standardized orthography used in contemporary education and linguistics.
Endangerment status
UNESCO classifies Naxi as "vulnerable" (2010 assessment), with intergenerational transmission weakening in urban Lijiang due to Mandarin dominance in education and commerce. Younger speakers often code-switch extensively or acquire only passive competence. Digital infrastructure for Naxi is near-zero: no keyboard input method exists in mainstream operating systems, no online machine translation service supports it, and pre-existing NLP tooling is absent. This model represents an effort to begin closing that gap.
2. Academic Literature Review
The academic study of Naxi phonology, grammar, and lexicography spans roughly a century, with the most rigorous computational work emerging only in the past two decades.
2.1 Phonology and Romanization Standards
Michaud, Alexis (2008). Tones. Cambridge: Cambridge University Press. Chapter on Naxi tone system. The definitive modern reference for Naxi phonology. Michaud establishes the six-tone inventory (High ˥, Low ˩, Low-Rising ˩˥, Mid ˧, Mid-Rising ˧˥, Checked ˥ʔ) and their Chao numeral representations. His IPA transcriptions differ from naive Pinyin-based assumptions in important ways: the initial j- represents [c] (not [tɕ] as in Mandarin), and e represents [ɤ] (not [ə]). This model's training pipeline uses Michaud's standard throughout.
Michaud, Alexis et al. (2020+). Pangloss Collection. CNRS/LACITO Endangered Languages Archive. The Pangloss Collection contains Michaud's fieldwork recordings with native Naxi speakers — transcribed utterances with IPA annotations and interlinear glosses — constituting the largest open-access scholarly corpus of spoken Naxi. A subset of 1,952 items (converted from IPA to Naxi Pinyin) forms part of this model's training data. Available at: https://pangloss.cnrs.fr/
He Jiren 和即仁 & Jiang Zhuyi 姜竹仪 (1985). 《纳西语简志》 [Brief Description of the Naxi Language]. Beijing: Nationalities Press. The canonical Chinese-language grammatical description of Naxi, documenting morphosyntax, tone sandhi, and dialectal variation between the Lijiang (COM) and Yongning-adjacent varieties. Essential reference for training prompt templates.
He Jiren 和即仁 (2012). Naxi Language Reference Grammar. The most comprehensive modern treatment of Naxi morphosyntax, covering aspect marking, evidentiality, and nominal case. Informed the design of the task prompts.
Pinson, Thomas (2012). Naxi-English Dictionary. Summer Institute of Linguistics. Approximately 10,000 headwords with etymological notes on Chinese loanwords, IPA transcriptions, and example sentences. A core lexicographic reference; expansion of coverage into this model is ongoing.
2.2 Ethnographic and Cultural Sources
Rock, Joseph F. (1937). "The Origin of the Tso-la Books, or Books of Divination of the Na-khi or Chiang." Bulletin de l'École française d'Extrême-Orient (BEFEO), Vol. 37. Rock's BEFEO writings document Naxi ceremonial culture, place names, and ethnobotany in the Lijiang-Tibet corridor. Though written before modern Naxi romanization was standardized, his phonemic transcriptions provide early attestation of lexical items.
Rock, Joseph F. (1947). The Ancient Na-khi Kingdom of Southwest China. 2 vols. Cambridge: Harvard University Press. Rock's magnum opus, containing extensive Naxi vocabulary and cultural description.
Roosevelt, Theodore III (1941). A Preliminary Study of the Nashi People. Senior thesis, Harvard University. An early English-language ethnographic overview of the Naxi, notable for systematic observations on language use and social organization. Contributed 150 parallel cultural text segments to the training corpus.
Yang Fuquan 杨福泉 (2006+). Multiple monographs on Naxi culture, religion, and social history, including studies on Naxi-Tibetan exchange and the role of the Dongba tradition in contemporary identity. Contributed 177 parallel text segments to the training corpus.
2.3 Low-Resource NLP and Fine-Tuning
Hu, Edward J. et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685. The foundational method employed for fine-tuning: decomposing weight updates into low-rank matrices to dramatically reduce trainable parameters while preserving base model knowledge. This model uses rank r=256 — unusually high — motivated by the need to inject a complete new vocabulary (Naxi Pinyin) that has zero overlap with the base model's training distribution.
Dettmers, Tim et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314. Evaluated in v1 of this series (see §4). QLoRA's 4-bit quantization proved incompatible with bf16 serving, causing a 30-point NTQS regression. Not used in v5.
Zhao, Wayne Xin et al. (2023). "A Survey of Large Language Models." arXiv:2303.18223. Provides context for the trade-offs between model scale and fine-tuning depth for low-resource languages.
Adelani, David Ifeoluwa et al. (2022). "A Few Thousand Translations Go a Long Way! Leveraging Pre-Trained Models for African Language Machine Translation." Proceedings of NAACL 2022. Demonstrates that LoRA fine-tuning on 10,000–60,000 parallel sentences can yield substantial translation quality improvements for extremely low-resource languages, validating the overall approach taken here.
Costa-jussà, Marta R. et al. (2022). "No Language Left Behind: Scaling Human-Centered Machine Translation." arXiv:2207.04672 (Meta AI). NLLB demonstrates that multilingual pre-training provides useful cross-lingual transfer even for languages with minimal training data. The Qwen3 base model, pre-trained on hundreds of languages, provides analogous transfer — particularly from the Sinitic branch — to Naxi.
Han, Xu et al. (2021). "Pre-Trained Models: Past, Present and Future." AI Open 2:225–250. Motivates using a large multilingual pre-trained model as base rather than training from scratch, which would require orders of magnitude more data.
2.4 Chinese Linguistic Context and Loanword Phenomena
Naxi has absorbed extensive Mandarin loanwords across centuries of contact (Han immigration, Mu dynasty administration, 20th-century nationalization). The model includes a loanword detector trained to flag Chinese borrowings (借词 / 外来词) and score them appropriately in evaluation. Key patterns documented in the literature include:
- Phonological integration of Mandarin syllable-final consonants into Naxi's tone system
- Calques (structural borrowings) for abstract and administrative vocabulary
- Code-switching at the sentence level in contemporary urban speech
3. Training Corpus
The training data comprises 62,021 weighted examples (37,418 unique) assembled from multiple scholarly and community sources, spanning dictionary entries, parallel text, folk literature, and academic ethnographic writing. An 80/20 source-grouped train/eval split ensures zero leakage between sets.
3.1 Sources
| Category | Description | Items |
|---|---|---|
| Lexicographic database | Community Naxi–Chinese–English dictionary; 3,866 headwords with definitions, IPA transcriptions, POS tags, and example sentences | 3,866 entries + 1,307 examples |
| LACITO/Pangloss fieldwork | Michaud et al. scholarly recordings; IPA converted to Naxi Pinyin via custom greedy-matching algorithm (98.9% conversion rate) | 1,952 items |
| Parallel literary corpus | Digitized 20th-century trilingual parallel texts (Naxi / Chinese / English) covering narrative, poetry, and didactic prose | ~7,758 segments |
| Folk wisdom & humor | Contemporary Naxi proverbs, witticisms, and folk sayings collected 2024–2025; bilingual Naxi–Chinese | 186 items |
| Cultural monographs | Yang Fuquan, Rock 1937 BEFEO, Ayidan folk tales, Roosevelt 1941 thesis | ~463 items |
| Place names | Naxi toponyms from the Yamada collection, 445 entries (273 with IPA), covering Lijiang prefecture and surrounding areas | 445 items |
3.2 Corpus Construction
Training examples are formatted using the ChatML template (<|im_start|>system / user / assistant<|im_end|>) with task-specific system prompts. Five task types were defined:
- `sentence`: Naxi → Chinese or English sentence translation
- `reverse_zh`: Chinese → Naxi (generation task; 3× upsampled)
- `reverse_en`: English → Naxi (generation task; 3× upsampled)
- `dictionary`: Chinese concept → Naxi headword + gloss
- `phonological`: IPA/phoneme identification and Naxi Pinyin rendering
Generation tasks (reverse_zh + reverse_en) were upsampled 3× based on the finding in v2 that generation capacity is the most commercially useful and hardest skill to acquire — a "generation-heavy" (78% of weighted corpus) approach that proved decisive in achieving 71.7+ NTQS.
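As a concrete illustration, a single supervised example rendered in the ChatML template looks like the following. This is a minimal sketch: the exact task-specific system prompts are not reproduced here, so the wording below is illustrative.

```python
def to_chatml(system: str, user: str, assistant: str) -> str:
    """Render one supervised example in the ChatML template
    (<|im_start|>role ... <|im_end|>) used by Qwen-family models."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>"
    )

# Hypothetical reverse_zh example (system prompt wording is illustrative)
example = to_chatml(
    "You are a Naxi language expert. Translate the following Chinese text into Naxi.",
    "我爱丽江",
    "Nge Liqjiangq gol pvl seiq.",
)
assert example.count("<|im_start|>") == example.count("<|im_end|>") == 3
```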
3.3 Data Quality Measures
- IPA leakage: Zero IPA symbols in Naxi-language outputs (converted via `ipa_to_pinyin()` with greedy initial matching and tone-boundary splitting)
- Tone alignment: Chao numerals (⁵⁵, ²¹, etc.) converted to Naxi Pinyin tone letters
- Polysemy disambiguation: 411 polysemous dictionary entries given part-of-speech tags to reduce hallucination
- Invalid rate: 0.0% invalid examples at corpus build time
- Leakage: ZERO train/eval contamination via source-grouped 80/20 split (previously 309 leaked items, now 0)
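The leakage guarantee comes from splitting by source rather than by individual example. A minimal sketch of such a split (the hash-bucketing scheme here is an illustrative choice, not the project's actual grouping code):

```python
import hashlib

def source_grouped_split(examples, eval_frac=0.2):
    """Assign every example from the same source to the same split,
    so near-duplicate items can never straddle train and eval."""
    def bucket(source: str) -> float:
        # Deterministic hash of the source name into [0, 1)
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        return int(digest[:8], 16) / 0x100000000

    train, evals = [], []
    for ex in examples:
        (evals if bucket(ex["source"]) < eval_frac else train).append(ex)
    return train, evals

examples = [{"source": f"doc-{i % 25}", "text": f"item {i}"} for i in range(500)]
train, evals = source_grouped_split(examples)
# No source contributes to both sides of the split
assert not ({e["source"] for e in train} & {e["source"] for e in evals})
```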
4. Model Training
4.1 Architecture
Base: Qwen/Qwen3-14B (14.8B parameters, 64-layer transformer, RoPE, GQA)
Method: LoRA (Low-Rank Adaptation)
Rank: r=256, α=256
Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (all 7)
Trainable: ~2.1% of total parameters
Precision: bf16 throughout (training + serving — never mixed)
4.2 Training Configuration
Optimizer: adamw_torch_fused (paged_adamw_32bit unavailable — requires cuMemAllocManaged)
Learning rate: 3e-4 (cosine schedule with warmup)
Epochs: 3
Batch size: 2 (per GPU)
Grad accum: 16 steps (effective batch = 32)
Max seq len: 2048 tokens (packing enabled)
Packed seqs: 15,180 packed from 61,652 examples at seq_len=2048
Hardware: Thunder Compute H100 (80GB VRAM)
VRAM usage: ~55–59% (46–50GB / 85GB) — stable throughout
Steps: 1,425
Total time: ~22 hours
4.3 Sequence Packing
Sequence packing (concatenating short examples end-to-end up to the context window) increased effective token throughput by approximately 4× compared to padded batching. This is critical for a corpus where most examples are 50–300 tokens — without packing, >80% of compute would be wasted on padding.
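A first-fit sketch of the packing idea (illustrative only; production pipelines typically also insert attention-mask boundaries between packed examples, which is omitted here):

```python
def pack_sequences(lengths, max_len=2048):
    """Greedily concatenate example lengths into bins of at most max_len
    tokens; short examples share a sequence instead of being padded."""
    bins = []
    for n in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)
                break
        else:
            bins.append([n])  # no existing bin fits; open a new one
    return bins

# Most corpus examples are 50-300 tokens; packing shrinks the step count
lengths = [50 + (i * 37) % 251 for i in range(2000)]
bins = pack_sequences(lengths)
assert all(sum(b) <= 2048 for b in bins)
assert sum(len(b) for b in bins) == len(lengths)
print(f"{len(lengths)} examples packed into {len(bins)} sequences")
```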
5. The Naxi Tone System and Pinyin Orthography
Understanding Naxi orthography is essential for interpreting model outputs. The romanization system uses a tone-final consonant convention:
| Final letter | Tone | Chao | IPA | Example |
|---|---|---|---|---|
| `l` | High | 55 | ˥ | jel [cjɤ˥] "to boil/cook" |
| `q` | Low | 21 | ˩ | laq [lɑ˩] "hand" |
| `f` | Low-rising | 13 | ˩˥ | baf [pɑ˩˥] "eight" (loanword) |
| `g` | Mid-rising | 35 | ˧˥ | juqg [tɕy˧˥] (rare) |
| `b` | Checked | 55ʔ | ˥ʔ | dab [tɑ˥ʔ] (rare) |
| (none) | Mid | 33 | ˧ | ha [hɑ˧] "food" |
Key phonological notes (Michaud 2008):
- `j` → [c], not [tɕ] (a palatal stop, not an affricate)
- `e` → [ɤ], not [ə] (back unrounded mid vowel)
- `z` → voiceless [ts]; `zz` → voiced [dz]
- `ng` → [ŋ] (velar nasal, can be syllable-initial)
The model is trained to produce valid Naxi Pinyin with correct tone markers. Tone completeness — the proportion of syllables bearing an explicit tone marker — is a core evaluation metric, rising from 42.3% in v2 to 69.2% in v5.
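Given the table above, reading a syllable's tone reduces to inspecting its final letter. A simplified sketch follows; it deliberately ignores the rime-final ambiguity of `-ng`, which a real parser such as `naxi_phonology.py` must resolve.

```python
TONE_FINALS = {
    "l": "High (55)",
    "q": "Low (21)",
    "f": "Low-rising (13)",
    "g": "Mid-rising (35)",
    "b": "Checked (55ʔ)",
}

def syllable_tone(syllable: str) -> str:
    """Map a Naxi Pinyin syllable to its tone via the tone-final
    convention; no final marker means the unmarked Mid tone."""
    return TONE_FINALS.get(syllable[-1], "Mid (33, unmarked)")

def tone_completeness(syllables) -> float:
    """Share of syllables carrying an explicit tone marker (an NTQS metric)."""
    marked = sum(1 for s in syllables if s[-1] in TONE_FINALS)
    return marked / len(syllables)

assert syllable_tone("laq") == "Low (21)"
assert syllable_tone("ha") == "Mid (33, unmarked)"
assert tone_completeness(["laq", "ha", "jel", "baf"]) == 0.75
```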
6. Evaluation: NTQS (Naxi Translation Quality Score)
Standard MT metrics (BLEU, chrF) are inadequate for Naxi because:
- No off-the-shelf tokenizer handles Naxi Pinyin correctly
- Tone markers (l/q/f/g/b) appear as trailing consonants and require phonology-aware parsing
- No reference translations exist in standard evaluation datasets
We developed NTQS (Naxi Translation Quality Score, 0–100) with five components:
| Component | Max | Measures |
|---|---|---|
| Tone Accuracy | 25 | % syllables with valid tone markers; marker correctness vs. reference |
| Semantic Fidelity | 30 | Token/character Jaccard overlap with reference (CJK bigrams or word tokens) |
| Loanword Handling | 15 | Correct flagging of Chinese borrowings when present |
| Dialect Appropriateness | 15 | Grammar score (0.6 × syllable_validity + 0.4 × tone_completeness) × 100 |
| Completeness | 15 | Output length ratio relative to source (neither truncated nor hallucinated) |
All evaluation is fully local — no LLM-as-judge, no external API calls. The phonology engine (naxi_phonology.py) validates syllable structure against the complete Naxi phoneme inventory documented in config/naxi_pinyin.yaml.
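The components combine additively, and the Dialect Appropriateness row can be reconstructed from the reported v5 aggregates. A sketch (scaling the 0–100 grammar score down to its 15-point component maximum is an assumption, though it is consistent with the results table; the loanword component value is inferred by subtraction):

```python
def grammar_score(syllable_validity: float, tone_completeness: float) -> float:
    """Grammar score on a 0-100 scale, per the NTQS component table."""
    return (0.6 * syllable_validity + 0.4 * tone_completeness) * 100

def ntqs_total(tone, semantic, loanword, dialect, completeness):
    """Sum of the five components (maxima 25 + 30 + 15 + 15 + 15 = 100)."""
    return tone + semantic + loanword + dialect + completeness

# v5 aggregates: syllable_validity 0.876, tone_completeness 0.692
assert round(grammar_score(0.876, 0.692), 1) == 80.2   # reported: 80.21
# Dialect component = grammar/100 * 15 ≈ 12.0, matching the results table
assert round(grammar_score(0.876, 0.692) / 100 * 15, 1) == 12.0
# Loanword component (13.3) inferred by subtraction from the 82.3 total
assert abs(ntqs_total(19.2, 23.2, 13.3, 12.0, 14.6) - 82.3) < 1e-6
```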
7. Results
7.1 Overall Performance
| Model | NTQS | Tone/25 | Semantic/30 | Dialect/15 | Complete/15 | char_f1 |
|---|---|---|---|---|---|---|
| naxi-qwen3-14b-v5 | 82.3 | 19.2 | 23.2 | 12.0 | 14.6 | 0.775 |
| naxi-qwen2.5-3b-v2 | 71.7 | 13.9 | 23.9 | 9.0 | 9.9 | 0.807 |
| naxi-qwen35-9b-v4 (Q4_K_M) | 37.98 | — | — | — | — | 0.342 |
| Zero-shot frontier LLMs (avg) | ~41.3 | ~20.0 | ~0.4 | ~8.0 | ~7.5 | ~0.1 |
Zero-shot frontier models (Qwen-2.5-72B, Claude-3.5-Sonnet, Gemini-2.0-Flash, DeepSeek-v3) achieve ~41–52 NTQS on the Naxi→zh direction only (they can comprehend but cannot generate Naxi). On generation tasks (→Naxi), zero-shot performance collapses to near-zero.
7.2 Per-Task Breakdown (v5)
| Task | char_f1 | exact_match | tone_completeness | Notes |
|---|---|---|---|---|
| phonological | 0.962 | 0.640 | — | Strongest task; IPA↔Pinyin conversion |
| reverse_en (EN→Naxi) | 0.811 | — | 0.687 | Strong generation from English |
| reverse_zh (ZH→Naxi) | 0.811 | — | 0.696 | Strong generation from Chinese |
| sentence (Naxi→ZH/EN) | 0.691 | — | — | Understanding task |
| dictionary | 0.609 | 0.000 | — | Weakest; no exact dictionary matches |
7.3 Key Improvements over v2
| Metric | v2 (3B) | v5 (14B) | Delta |
|---|---|---|---|
| NTQS total | 71.74 | 82.3 | +10.6 |
| tone_completeness | 0.423 | 0.692 | +63.6% |
| grammar_score | 69.09 | 80.21 | +16.1% |
| syllable_validity | 0.869 | 0.876 | +0.8% |
Tone completeness — the most critical weakness in v2 — improved dramatically. This is attributed to: (1) the 14B model's larger representational capacity, (2) sequence packing exposing more tone-marked syllables per training step, and (3) all 7 LoRA modules being adapted (vs. 4 in v2).
8. Learnings from Previous Fine-Tuning Iterations
v1 — Qwen2.5-14B QLoRA 4-bit (FAILED, NTQS 42.1)
Configuration: LoRA r=48, α=96, 4-bit QLoRA quantization, two-stage LAFT curriculum.
Failure mode: Precision mismatch. The model was trained in 4-bit quantized format but served in bf16. LoRA adapters trained under quantization learn compensation terms specific to the quantization error. When served in full precision, those compensation terms become noise. The result was NTQS 42.1 — statistically indistinguishable from zero-shot frontier performance.
Lesson: Never mix precision between training and serving. If training in 4-bit QLoRA, all inference must also use 4-bit quantization. If serving in bf16, train in bf16.
v2 — Qwen2.5-3B LoRA bf16 (SUCCESS, NTQS 71.7) ✓
Configuration: LoRA r=256, α=512, bf16 throughout, 4 attention modules only (q/k/v/o), single-stage training, generation-heavy corpus (78%).
Outcome: +30.4 NTQS over baseline. First successful model. Key validated findings:
- r=256 is necessary: High LoRA rank provides the capacity to inject a completely foreign orthographic system (Naxi Pinyin) with zero overlap to the base model's training vocabulary.
- Generation-heavy data (reverse_zh + reverse_en upsampled 3×) is the decisive factor for useful performance.
- bf16 precision match (train = serve) is non-negotiable.
- Single-stage training outperforms two-stage LAFT curriculum at this data scale.
Remaining weaknesses: tone_completeness only 42% (syllables frequently generated without tone markers); dictionary task weak (char_f1=0.56); 13% invalid syllables (hallucinations).
v3 — Qwen3.5-9B LoRA bf16 (ABANDONED)
Superseded before evaluation. Architecture investigation revealed Qwen3.5-9B is a vision-language model (VLM) with Mamba-Attention hybrid architecture (Qwen3_5ForConditionalGeneration), not a text-only transformer. Text-only LoRA fine-tuning on a VLM backbone is suboptimal; the model expects multimodal inputs that the Naxi corpus cannot provide.
Lesson: Verify model architecture before committing compute. AutoModelForCausalLM will load VLMs without error but with degraded text performance.
v4 — Qwen3.5-9B LoRA bf16 (REGRESSED, NTQS 37.98)
Configuration: LoRA r=64, α=128, RSLora, 7 modules. Evaluated via Q4_K_M GGUF (Ollama) and bf16 direct inference.
Failure modes (multiple compounding):
- LoRA rank r=64 is insufficient. Cutting rank from 256 to 64 reduces adapter capacity by 4×. The model never learned Naxi Pinyin orthography adequately — generating IPA notation instead (leaking from the base model's exposure to phonetics literature).
- VLM architecture issue (same as v3): Qwen3.5-9B's hybrid attention/Mamba structure is not optimized for pure text generation.
- Q4_K_M quantization further degrades tone marker accuracy — single-character tone markers (l/q/f/g/b) are semantically loaded but look like common English/Chinese characters; quantization conflates them.
- Unsloth tokenizer patches: Unsloth modifies the tokenizer class at load time. Direct Python inference (without Unsloth) produces different tokenization, causing format violations.
Lesson: Never drop LoRA rank below r=128 for Naxi. The orthographic injection problem is severe enough that high rank is load-bearing, not optional.
v5 — Qwen3-14B LoRA bf16 (CURRENT CHAMPION, NTQS 82.3) ✓
Addressed all prior failure modes:
- Correct architecture: Qwen3-14B is a text-only transformer (not VLM)
- High rank maintained: r=256 (same as v2 success)
- All 7 modules: Adds gate/up/down MLP projections not present in v2 — captures syntactic patterns more deeply
- bf16 throughout: Training and serving in identical precision
- Sequence packing: Dramatically more token-efficient training
- Correct optimizer: `adamw_torch_fused` instead of `paged_adamw_32bit` (which requires CUDA managed memory unavailable on Thunder Compute prototyping instances)
9. Limitations and Known Issues
Dictionary task: char_f1=0.609, exact_match=0.000. The model paraphrases rather than recalling specific lexical items. A retrieval-augmented generation (RAG) grounding layer using the Pinson dictionary is planned to address this.
Tone completeness at 69%: Approximately 31% of generated syllables lack explicit tone markers. For mid-tone (unmarked) syllables this is often correct, but the model sometimes omits markers on toned syllables. Tone-focused data augmentation (synthetic examples emphasizing tone marking) is planned.
Syllable validity at 87.6%: About 12.4% of generated syllables are not valid Naxi Pinyin (hallucinated combinations). Post-processing filters using `naxi_phonology.py` can catch these at inference time.

Dialect coverage: Training data is predominantly Common (COM) dialect (Lijiang variety). Lashi (LQ) and Yongning (YON) dialects are minimally represented. The Yongning dialect (spoken by ~5,000 people) has additional vowel contrasts and is treated as a separate language in some classifications.
Dongba script: Not supported. The model generates exclusively in Latin romanization (Naxi Pinyin). Dongba characters require a separate rendering pipeline.
Context length: Maximum 2,048 tokens. Long documents require chunking.
Contemporary colloquial Naxi: Training data skews toward formal literary register. Informal spoken Naxi (with high code-switching density) may be handled less well.
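The syllable-validity filter mentioned above can be sketched as a simple post-processing pass. The mini-inventory below is hypothetical, for illustration only; the real validator is `naxi_phonology.py`, which checks against the full inventory in `config/naxi_pinyin.yaml`.

```python
# Hypothetical mini-inventory for illustration; the real validator is
# naxi_phonology.py, checking against config/naxi_pinyin.yaml.
VALID_SYLLABLES = {"nge", "laq", "ha", "jel", "baf", "seiq", "pvl", "gol"}

def flag_invalid_syllables(output: str):
    """Return generated syllables that fail the validity check, so a
    post-processing layer can drop or replace them at inference time."""
    return [s for s in output.split() if s.lower() not in VALID_SYLLABLES]

assert flag_invalid_syllables("nge laq xyzq") == ["xyzq"]
```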
10. Usage
With Transformers (bf16, recommended)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE = "Qwen/Qwen3-14B"
ADAPTER = "Apeters247/naxi-qwen3-14b-v5"  # or a local path

tokenizer = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
# LoRA weights live in the adapter/ subfolder of the repo
model = PeftModel.from_pretrained(model, ADAPTER, subfolder="adapter")
model.eval()

def translate(text: str, direction: str = "zh_to_naxi") -> str:
    directions = {
        "zh_to_naxi": ("Chinese", "Naxi"),
        "en_to_naxi": ("English", "Naxi"),
        "naxi_to_zh": ("Naxi", "Chinese"),
        "naxi_to_en": ("Naxi", "English"),
    }
    src, tgt = directions[direction]
    system = f"You are a Naxi language expert. Translate the following {src} text into {tgt}."
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": text},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(input_ids, max_new_tokens=512, temperature=0.1, do_sample=True)
    return tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Examples
print(translate("我爱丽江", "zh_to_naxi"))
# → "Nge Liqjiangq gol pvl seiq." (approximate)
print(translate("ngeq pvl seiq", "naxi_to_en"))
# → "I like it." / "I love it."
```
With GGUF / llama.cpp (Q4_K_M)
```bash
# Download the GGUF from HuggingFace
huggingface-cli download Apeters247/naxi-qwen3-14b-v5 \
  naxi-qwen3-14b-v5-Q4_K_M.gguf --local-dir ./

# Run with llama.cpp
./llama-cli -m naxi-qwen3-14b-v5-Q4_K_M.gguf \
  --chat-template chatml \
  -sys "You are a Naxi language expert. Translate the following Chinese text into Naxi." \
  -p "我爱丽江" \
  -n 200
```
Warning: Q4_K_M quantization reduces tone accuracy. For production use, prefer bf16 inference if VRAM permits (requires ~30GB).
11. Project Infrastructure
This model is part of the Naxi Language Explorer project, an open-source effort to build translation and language documentation infrastructure for Naxi.
- API: FastAPI endpoint at naxiai.com (rate-limited public access)
- Leaderboard: Live NTQS benchmark at naxiai.com (compares fine-tuned models vs. zero-shot frontier LLMs)
- Corpus: Maintained in PostgreSQL with SQLAlchemy ORM; 15 tables including lexical database, parallel corpus, and evaluation records
- Evaluation: Fully local phonology-aware scoring via `ntqs_scorer.py` and `naxi_phonology.py`
- GPU training: Thunder Compute H100 instances via SSH (`tnr-0`)
12. Planned Improvements
- RAG dictionary grounding: At inference time, retrieve relevant entries from the Pinson dictionary and inject into the prompt context. Expected: +5–15 NTQS on dictionary task.
- Tone-focused augmentation: Generate 5–10K synthetic examples emphasizing tone marking in varied phonological environments. Target: tone_completeness > 85%.
- Pinson 2012 OCR ingestion: ~10,000 additional dictionary entries pending OCR processing. This is the highest-value remaining linguistic resource.
- v6 planning: After RAG and tone augmentation, retrain on expanded corpus. Candidate base models: Qwen3-32B or a specialist Sino-Tibetan LM if available.
- Syllable validity filter: Post-processing layer using `is_valid_naxi_syllable()` to catch and replace hallucinated phoneme combinations.
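The planned RAG dictionary grounding can be sketched as prompt injection at inference time. The character-overlap retriever below is a stand-in for whatever retriever is actually adopted, and the toy dictionary is hypothetical:

```python
def build_grounded_prompt(query_zh: str, dictionary: dict, k: int = 3) -> str:
    """Retrieve dictionary entries sharing characters with the query and
    prepend them as grounding context before the translation instruction."""
    def overlap(zh: str) -> int:
        return len(set(query_zh) & set(zh))

    hits = sorted((e for e in dictionary.items() if overlap(e[0]) > 0),
                  key=lambda e: -overlap(e[0]))[:k]
    context = "\n".join(f"{zh} = {naxi}" for zh, naxi in hits)
    return ("You are a Naxi language expert.\n"
            f"Relevant dictionary entries:\n{context}\n"
            f"Translate the following Chinese text into Naxi: {query_zh}")

# Toy dictionary for illustration (the real source would be Pinson 2012)
toy_dictionary = {"手": "laq", "饭": "ha", "八": "baf"}
prompt = build_grounded_prompt("我吃饭", toy_dictionary)
assert "饭 = ha" in prompt        # relevant entry retrieved
assert "八 = baf" not in prompt   # irrelevant entry excluded
```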
13. Citation
If you use this model in research, please cite:
```bibtex
@misc{naxi-qwen3-14b-v5-2026,
  title        = {naxi-qwen3-14b-v5: A Fine-Tuned LLM for Naxi--Chinese--English Translation},
  author       = {Peters, Andrew},
  year         = {2026},
  month        = {March},
  howpublished = {HuggingFace Model Hub},
  url          = {https://huggingface.co/Apeters247/naxi-qwen3-14b-v5},
  note         = {NTQS 82.3/100; Qwen3-14B + LoRA r=256, bf16; 62,021 training examples}
}
```
Please also cite the key scholarly sources that made this possible:
```bibtex
@book{michaud2008tones,
  author    = {Michaud, Alexis},
  title     = {Tones},
  publisher = {Cambridge University Press},
  year      = {2008}
}

@misc{pangloss2020,
  author       = {Michaud, Alexis and others},
  title        = {Pangloss Collection},
  howpublished = {CNRS/LACITO Endangered Languages Archive},
  url          = {https://pangloss.cnrs.fr/},
  year         = {2020}
}

@inproceedings{hu2021lora,
  title     = {LoRA: Low-Rank Adaptation of Large Language Models},
  author    = {Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Chen, Weizhu},
  booktitle = {International Conference on Learning Representations},
  year      = {2022}
}
```
14. Acknowledgments
This work builds on the scholarly foundations laid by Alexis Michaud (CNRS/LACITO) and Thomas Pinson (SIL), whose fieldwork recordings and phonological analysis make computational work on Naxi possible. The Naxi community of Lijiang, Yunnan — particularly speakers who participated in the Pangloss fieldwork sessions — are the ultimate source of this data, and its preservation and accessibility is the motivation for this project.
Critical config facts for any future training run:
- Always use `torch_dtype=torch.bfloat16` — never `load_in_4bit=True`
- LoRA rank must be r≥256 for Naxi (r=64 failed catastrophically in v4)
- Use `adamw_torch_fused` on Thunder Compute (not `paged_adamw_32bit`)
- Max seq_len=2048 for 14B model with packing + batch=2 on H100 80GB
- DB cannot be reached from host OS — run DB scripts inside the `naxi-api` Docker container or via `docker exec`
- API leaderboard prefix is `/api/evaluate/scores` (not `/evaluate/scores`)
HuggingFace repo: Apeters247/naxi-qwen3-14b-v5 (private)
- `adapter/` — LoRA weights (PeftModel format)
- `naxi-qwen3-14b-v5-Q4_K_M.gguf` — 8.4GB quantized
- `naxi-qwen3-14b-v5-Q8_0.gguf` — 15GB quantized
Last updated: 2026-03-08
Model status: Production (current champion, NTQS 82.3)
Previous champion: naxi-qwen2.5-3b-v2 (NTQS 71.74)
Evaluation results (self-reported):
- NTQS (Naxi Translation Quality Score): 82.3
- chrF: 56.08
- Character F1: 0.775