LLiMba-3B-Instruct

LLiMba-3B-Instruct is a Sardinian-capable extension of Qwen2.5-3B-Instruct. It speaks fluent Sardinian (LSC, the standardized written form, with Logudorese and Campidanese accepted as input) and retains the multilingual capabilities of the base model across the languages Qwen2.5 supports. The full adaptation pipeline runs on a single 24GB consumer GPU.

Sardinian is a Romance language with roughly one million speakers, classified as definitely endangered by UNESCO. Commercial translation support is scarce, and major LLMs do not produce it reliably. LLiMba is, to our knowledge, the first openly released LLM that can hold a Sardinian conversation, translate to and from Sardinian, and analyze Sardinian text.

This is the deployable model. For the post-CPT intermediate checkpoint (a research artifact useful only for re-running supervised fine-tuning with alternative recipes), see lballore/llimba-3b-instruct-cpt.

🎮 Try it live: lballore-llimba-demo.hf.space - interactive Gradio chat with both conversational and translation modes. No installation required.

📖 Read the paper: LLiMba: Sardinian on a Single GPU

Not to be confused with the 2024 University of Cagliari paper LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models (arXiv:2411.13453), which uses Sardinian as one of several case studies for a broader framework. Different acronym, independent project.

Quick start

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="lballore/llimba-3b-instruct",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    # System prompt: "You are an assistant who speaks Sardinian."
    {"role": "system", "content": "Ses unu assistente chi chistionat in sardu."},
    # User: "Hello! How are you?"
    {"role": "user", "content": "Salude! Comente ìstas?"},
]

out = pipe(messages, max_new_tokens=200, do_sample=False)
print(out[0]["generated_text"][-1]["content"])
# Bene, gràtzias. E tue comente ìstas?  ("Fine, thanks. And how are you?")

For translation, change the system prompt:

messages = [
    # System prompt: "You are an expert translator into LSC Sardinian."
    {"role": "system", "content": "Tue ses unu tradutore espertu in limba sarda LSC."},
    {"role": "user", "content": "Translate to Sardinian: «The weather is rough today.»"},
]
out = pipe(messages, max_new_tokens=200, do_sample=False)
print(out[0]["generated_text"][-1]["content"])

Recommended inference parameters

LLiMba ships with do_sample=False (greedy decoding) as the default. This produces deterministic, high-quality output for translation, factual Q&A, and short conversations, and is what the published FLORES benchmark numbers were measured with.

For other use cases, override the defaults at call time:

| Use case | Settings |
|---|---|
| Translation, factual Q&A | do_sample=False (default) |
| Conversational chat | temperature=0.3, top_p=0.9, top_k=40, repetition_penalty=1.05 |
| Creative or long-form generation | temperature=0.7, top_p=0.9, top_k=40, repetition_penalty=1.1 |
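
For example, the conversational settings can be passed straight to the pipeline call as generation kwargs (a minimal sketch reusing the pipe and messages objects from Quick start):

out = pipe(
    messages,
    max_new_tokens=400,
    do_sample=True,             # the model's default is greedy decoding
    temperature=0.3,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.05,
)
print(out[0]["generated_text"][-1]["content"])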

Temperatures above 0.7 can produce occasional language-boundary drift (Sardinian to Italian) on long open-ended prompts and amplify the morphological hallucination described in Limitations. The model was trained with Romance replay data specifically to mitigate this, but for production deployments the safe upper bound is around 0.7.

Languages

LLiMba inherits Qwen2.5's multilingual coverage and adds Sardinian. The Qwen2.5 documentation explicitly names Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic as supported, with broader coverage beyond that list. Continued pretraining on Sardinian was paired with roughly 2.4M tokens of Italian, Spanish, Portuguese, and Catalan replay data to limit forgetting on the Romance branch closest to Sardinian.

For Sardinian, the primary output target is LSC (Limba Sarda Comuna), the standardized written form codified by the Regione Autonoma della Sardegna in 2006. The training corpus includes Logudorese and Campidanese material, so the model accepts dialectal input gracefully but tends to produce LSC in its outputs.

We did not run multilingual benchmarks on non-Romance languages after adaptation. Users relying on the model for languages such as Japanese or Arabic should validate on their own task before deploying.

Translation results

Evaluated on 997 parallel sentences from FLORES-200 using lm-evaluation-harness 0.4.11 with greedy decoding.

| Direction | Base BLEU | LLiMba BLEU | Base chrF | LLiMba chrF |
|---|---|---|---|---|
| EN to SC | 2.75 | 28.47 | 27.41 | 56.80 |
| IT to SC | 2.16 | 21.25 | 27.52 | 52.08 |
| ES to SC | 1.99 | 18.57 | 26.39 | 49.41 |
| SC to EN | 11.73 | 41.28 | 44.55 | 64.64 |
| SC to IT | 2.90 | 17.61 | 33.38 | 47.25 |
| SC to ES | 5.67 | 18.57 | 36.98 | 46.27 |

The strong SC to EN baseline (BLEU 41.28) is itself evidence that English generation survives adaptation: the base model already knew English, continued pretraining added Sardinian comprehension, and the combination yields high-quality translation out of Sardinian without specifically training for it.
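
To score your own outputs with the same metrics, the sacrebleu package computes corpus-level BLEU and chrF directly (a minimal sketch with placeholder data; the published numbers come from lm-evaluation-harness, not this snippet):

import sacrebleu

# One system output per source sentence (placeholder strings).
hyps = ["Su tempus est malu oe."]
# One reference stream; refs[0][i] is the reference for hyps[i].
refs = [["Su tempus est malu oe."]]

bleu = sacrebleu.corpus_bleu(hyps, refs)
chrf = sacrebleu.corpus_chrf(hyps, refs)
print(f"BLEU {bleu.score:.2f}  chrF {chrf.score:.2f}")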

Qualitative behavior

On focused factual queries about Sardinian topics, LLiMba produces verifiable answers when the underlying facts are present in the training data. Asked "Chie fiat Gigi Riva?" (Who was Gigi Riva?), it correctly identifies him as the Italian footballer born in Leggiuno in 1944, who joined Cagliari in 1963, won the 1969-70 Serie A title, scored 35 goals in 42 appearances for the national team, was nicknamed "Rombo de Tronu" by Gianni Brera, and died in 2024.

On cantu a tenore, the polyphonic vocal tradition of central Sardinia, it correctly names the four voices (boghe, bassu, contra, mesu boghe), the 2005 UNESCO recognition, and the Barbagia origin.

Conversational greetings, short translations, factual recall on canonical Sardinian topics, and grammatical analysis all work well. Long open-ended generation is more variable; see Limitations.

Training procedure

Base model. Qwen2.5-3B-Instruct (3.09B parameters, transformer decoder, 36 layers, 16 query heads with 2 KV heads (GQA), RoPE positional embeddings, 32K context window).

Stage 1, continued pretraining. Full fine-tuning in bfloat16 for 2 epochs on approximately 13.9M tokens (11.5M Sardinian plus 2.4M Romance replay drawn from Italian, Spanish, Portuguese, and Catalan Wikipedias). Sequence length 4096. Effective batch 16 (1 per device with 16 gradient accumulation steps). Learning rate 5e-5 with cosine schedule and 50-step warmup. Paged AdamW 8-bit optimizer. Flash Attention 2. Gradient checkpointing enabled. Sequence packing disabled (packing leaks attention across document boundaries within a packed sequence and degraded model quality in our preliminary runs). Wall-clock time: 5.5 hours on one RTX 4090.
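
In Hugging Face Trainer terms, the Stage 1 settings map onto roughly the following (a sketch, not the exact script; output_dir is hypothetical and dataset plumbing is omitted; the real pipeline is in the GitHub repo):

from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",
)

args = TrainingArguments(
    output_dir="llimba-cpt",            # hypothetical
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,     # effective batch of 16
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    optim="paged_adamw_8bit",
    bf16=True,
    gradient_checkpointing=True,
)
# Sequences are truncated at 4096 tokens and NOT packed (see the note above).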

Stage 2, supervised fine-tuning. rsLoRA adapter at rank 256, alpha 256, dropout 0.05, targeting q, k, v, o, gate, up, and down projection matrices. 2 epochs on 14,404 instruction pairs (~12.8M tokens) with completion-only loss. Learning rate 2e-5 with cosine schedule and 50-step warmup. Other hyperparameters match Stage 1. The rsLoRA scaling correction (alpha/sqrt(r)) is what makes rank 256 trainable in practice; conventional LoRA scaling at this rank causes gradient collapse.
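
The adapter configuration corresponds to the following peft setup (a sketch assuming Qwen2.5's standard module names; the authoritative script is in the repo):

from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=256,
    lora_alpha=256,
    lora_dropout=0.05,
    use_rslora=True,  # scales updates by alpha/sqrt(r) instead of alpha/r
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)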

The released weights have the rsLoRA adapter merged into the base. Full training scripts, data preparation pipeline, and evaluation harness are at github.com/lballore/LLiMba.
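
For anyone re-running SFT from the CPT checkpoint, merging a trained adapter back into the base is straightforward with peft (adapter path and output directory are hypothetical; this release already ships merged):

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "lballore/llimba-3b-instruct-cpt", torch_dtype="bfloat16"
)
merged = PeftModel.from_pretrained(base, "path/to/adapter").merge_and_unload()
merged.save_pretrained("merged-model")  # hypothetical output directory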

Intended use

Research, education, language preservation, and personal use by speakers and learners of Sardinian. Specific use cases include conversational practice, translation between Sardinian and other Romance languages or English, language learning, text analysis, and as a starting point for further Sardinian NLP research.

Limitations

Hallucination on out-of-training facts. Like all 3B-class models, LLiMba fabricates when queried on content not present in training. The pattern is strongest for biographical specifics about partially-known figures: confident wrong dates, invented nicknames, plausible-sounding but false claims. Treat factual outputs with appropriate skepticism.

Morphological hallucination on long open-ended prompts. On extended unconstrained generation about Sardinian culture, the model occasionally produces phonotactically valid but non-attested Sardinian words (for example, cungafròngias, mojgas). The lexical resources for clean generation are present; the same model produces attested vocabulary on focused, structured queries. Mitigation: prefer structured prompts ("List the three main causes, one short sentence each") over open-ended ones ("Tell me about X") for production deployments.

Dialect skew. The model targets LSC and was reviewed by a single native speaker of the Nuorese variant. Logudorese and Campidanese input is handled, but speakers of those variants may find the model's output skews toward the standardized form rather than their local register.

Multilingual capability not benchmarked end-to-end. Continued pretraining can degrade non-target language capabilities. Romance replay data mitigates this for Italian, Spanish, Portuguese, and Catalan, and the strong SC to EN scores suggest English is well preserved, but other languages were not benchmarked after adaptation. Validate on your own task before relying on these.

Training data caveats. The pretraining corpus includes Sardinian translations of literary works whose copyright status was not exhaustively verified, so the corpus is not redistributed in raw form (the data collection pipeline and source pointers are released instead). The supervised fine-tuning data includes machine-translated Capybara entries (NLLB-200 3.3B as translator) which contain residual Italian-shaped grammatical structures rendered with Sardinian vocabulary.

Feedback and contributions

Bug reports, feature requests, native-speaker feedback on outputs, and any other discussion happens on GitHub. The Hugging Face Community tab on this repository is intentionally disabled to keep all conversation in one place.

If you're a Sardinian speaker and you spot bad output, the wrong dialect, or vocabulary the model is missing, an issue containing the prompt and the model's response is genuinely useful. Contributions of any size are welcome.

Out-of-scope use

The model is not suitable for high-stakes factual queries, medical or legal advice, or any application where hallucination would cause material harm. It should not be used as a sole authoritative source for Sardinian language standardization or pedagogy without human review.

License

Model weights are released under the Apache 2.0 license. See LICENSE for full terms.

The training and evaluation code at github.com/lballore/LLiMba is released separately, also under Apache 2.0.

Acknowledgements

Native-speaker review of the corpus and supervised fine-tuning data was contributed by the author. Source web texts come from salimbasarda.net, istorias.it, sardumatica.net, limbasardasudsardigna.it, and lacanas.it. The editors of the Sardinian Wikipedia and the wider Sardinian community of writers and translators made this project possible.

Citation

@misc{llimba2026,
  title         = {LLiMba: Sardinian on a Single GPU - Adapting a 3B Language Model to a Vanishing Romance Language},
  author        = {Luca Ballore},
  year          = {2026},
  eprint        = {2605.09015},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.09015}
}

@misc{llimba-3b-instruct,
  title     = {LLiMba-3B-Instruct},
  author    = {Luca Ballore},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/lballore/llimba-3b-instruct}
}