---
license: apache-2.0
language:
- srd
- en
- zh
- fr
- es
- pt
- de
- it
- ru
- ja
- ko
- vi
- th
- ar
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-3B-Instruct
base_model_relation: finetune
datasets:
- lballore/llimba-corpus
- lballore/llimba-flores-srd-eval
- facebook/flores
tags:
- sardinian
- limba-sarda-comuna
- lsc
- logudorese
- campidanese
- low-resource
- endangered-language
- romance
- multilingual
- qwen2.5
- continued-pretraining
- cpt
- intermediate-checkpoint
- research
model-index:
- name: llimba-3b-instruct-cpt
results:
- task:
type: translation
dataset:
name: FLORES-200 (Sardinian subset, 997 sentences)
type: facebook/flores
metrics:
- type: bleu
value: 17.26
name: EN-SC BLEU
- type: chrf
value: 47.81
name: EN-SC chrF
- type: bleu
value: 12.71
name: IT-SC BLEU
- type: chrf
value: 44.83
name: IT-SC chrF
- type: bleu
value: 11.36
name: ES-SC BLEU
- type: chrf
value: 43.35
name: ES-SC chrF
- type: bleu
value: 33.52
name: SC-EN BLEU
- type: chrf
value: 62.78
name: SC-EN chrF
- type: bleu
value: 16.53
name: SC-IT BLEU
- type: chrf
value: 48.83
name: SC-IT chrF
- type: bleu
value: 19.31
name: SC-ES BLEU
- type: chrf
value: 47.76
name: SC-ES chrF
co2_eq_emissions:
emissions: 0.15
source: estimated from RTX 4090 TDP and wall-clock training time
training_type: continued-pretraining
geographical_location: Switzerland
hardware_used: 1x NVIDIA RTX 4090 (24GB)
---
# LLiMba-3B-Instruct-CPT
⚠️ **This is a research artifact, not the deployable model.** For end users, use [lballore/llimba-3b-instruct](https://huggingface.co/lballore/llimba-3b-instruct) instead. That repo contains the supervised-fine-tuned model with instruction-following restored.
This is the post-continued-pretraining intermediate checkpoint from the LLiMba project. It is the model after Stage 1 (continued pretraining on approximately 13.9M tokens of Sardinian and Romance replay) but before Stage 2 (supervised fine-tuning). It exists for users who want to apply their own SFT recipe on a Sardinian-fluent base, without re-running the 5.5-hour CPT step.
## What this checkpoint is, and isn't
After Stage 1, the model has learned Sardinian grammar, vocabulary, and orthography, and produces fluent Sardinian prose. However, continued pretraining partially erases instruction-following behavior, a well-documented phenomenon known as catastrophic forgetting. The model will respond to prompts, but reliability on structured instruction-following, chat behavior, and translation prompts is degraded compared to either the original Qwen2.5-3B-Instruct base or the final llimba-3b-instruct.
The "instruct" in the name reflects lineage from Qwen2.5-3B-Instruct, not a claim that this checkpoint robustly follows instructions. Stage 2 (rsLoRA r=256 SFT) is what restores instruction-following on top of CPT-acquired Sardinian knowledge.
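The rank-stabilization in rsLoRA matters at r=256: standard LoRA scales the adapter update by alpha/r, which shrinks updates linearly as rank grows, while rsLoRA scales by alpha/sqrt(r). A small arithmetic sketch of the difference (the alpha value here is an illustrative placeholder, not the project's setting):

```python
import math

def lora_scaling(alpha: float, r: int) -> float:
    """Standard LoRA scaling: update magnitude shrinks linearly with rank."""
    return alpha / r

def rslora_scaling(alpha: float, r: int) -> float:
    """Rank-stabilized LoRA scaling: shrinks only with the square root of rank."""
    return alpha / math.sqrt(r)

# alpha=16 is a placeholder for illustration only.
for r in (16, 64, 256):
    print(f"r={r:3d}  lora={lora_scaling(16, r):.4f}  rslora={rslora_scaling(16, r):.4f}")
```

At r=256, the standard scaling would multiply the update by 0.0625, while rsLoRA keeps it at 1.0, which is why high-rank adapters remain trainable.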
## Intended use
Two audiences:
1. **Researchers running their own SFT.** Continued pretraining is the expensive stage. Starting from this checkpoint lets you experiment with alternative SFT recipes (different adapter methods, different instruction data, DPO, and so on) without redoing CPT. Apply your SFT, measure on the [FLORES-200 Sardinian eval set](https://huggingface.co/datasets/lballore/llimba-flores-srd-eval), and compare against the published llimba-3b-instruct numbers.
2. **Researchers reproducing the paper.** This is the Stage 1 artifact behind the paper's CPT-only translation results.
If you want to use Sardinian rather than research it, stop reading and download [llimba-3b-instruct](https://huggingface.co/lballore/llimba-3b-instruct).
## Quick start
```python
from transformers import pipeline
pipe = pipeline(
    "text-generation",
    model="lballore/llimba-3b-instruct-cpt",
    torch_dtype="auto",
    device_map="auto",
)
out = pipe(
    [{"role": "user", "content": "Cale est sa capitale de sa Sardigna?"}],
    max_new_tokens=200,
    do_sample=False,
)
print(out[0]["generated_text"][-1]["content"])
```
Behavior on this checkpoint is more "completion-like" than "chat-like". Short prompts may get short or unfocused responses. For applying an SFT adapter on top, follow the recipe in [github.com/lballore/LLiMba](https://github.com/lballore/LLiMba).
## Translation results (CPT only)
Evaluated on 997 parallel sentences from FLORES-200 using lm-evaluation-harness 0.4.11 with greedy decoding.
| Direction | Base BLEU | CPT BLEU | Base chrF | CPT chrF |
|---|---:|---:|---:|---:|
| EN to SC | 2.75 | **17.26** | 27.41 | **47.81** |
| IT to SC | 2.16 | **12.71** | 27.52 | **44.83** |
| ES to SC | 1.99 | **11.36** | 26.39 | **43.35** |
| SC to EN | 11.73 | **33.52** | 44.55 | **62.78** |
| SC to IT | 2.90 | **16.53** | 33.38 | **48.83** |
| SC to ES | 5.67 | **19.31** | 36.98 | **47.76** |
CPT delivers most of the translation gain in the LLiMba pipeline (roughly 3 to 6 times BLEU improvement, depending on direction). Stage 2 SFT adds smaller increments on top. See [llimba-3b-instruct](https://huggingface.co/lballore/llimba-3b-instruct) for post-SFT numbers.
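The chrF numbers above come from lm-evaluation-harness; for intuition, chrF is an F-score over character n-grams. A simplified self-contained version (character n-grams up to order 6, beta=2, ignoring the harness's whitespace and word-n-gram handling — an illustration, not the exact scoring code used for the table):

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average n-gram precision/recall, F-beta, scaled to 0-100."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

An exact match scores 100; disjoint strings score 0. For the published numbers, use sacrebleu or the harness rather than this sketch.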
## Training procedure
Full fine-tuning in bfloat16. Flash Attention 2. Paged AdamW 8-bit optimizer. 2 epochs on approximately 13.9M tokens (11.5M Sardinian plus 2.4M Romance replay drawn from Italian, Spanish, Portuguese, and Catalan Wikipedias). Sequence length 4096. Effective batch 16 (1 per device with 16 gradient accumulation steps). Learning rate 5e-5 with cosine schedule and 50-step warmup. Sequence packing disabled. Gradient checkpointing enabled. Wall-clock time approximately 5.5 hours on a single RTX 4090.
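A back-of-envelope check on what these hyperparameters imply for the optimizer step count (a lower bound: with sequence packing disabled, batches shorter than 4096 tokens are padded, so the real run takes more steps than this):

```python
# Figures from the training procedure above.
tokens_total = 13_900_000 * 2     # ~13.9M tokens, 2 epochs
tokens_per_step = 4096 * 1 * 16   # seq len * per-device batch * grad accum steps

min_steps = tokens_total / tokens_per_step
print(f"lower bound on optimizer steps: ~{min_steps:.0f}")  # ~424
```

With a 50-step warmup, roughly the first tenth of training (at minimum) is spent ramping the learning rate.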
The Romance replay component is critical: without it, the model representationally blurs Sardinian and Italian and mode-switches to Italian at sampling temperatures above 0.3. The replay text carries no language tag; the model learns language identity from the text itself.
Sequence packing is disabled despite its throughput benefit because packing allows attention to leak across document boundaries within a packed sequence, which on a heterogeneous corpus produced markedly degraded model quality in our preliminary runs.
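The leak is easy to visualize: a plain causal mask over a packed sequence lets every token attend to all earlier tokens, including those of a different document, whereas leak-free packing needs a block-diagonal causal mask. A minimal sketch with two packed documents (boolean lists standing in for attention masks):

```python
def causal_mask(total_len: int) -> list:
    """Plain causal mask: position i attends to every j <= i."""
    return [[j <= i for j in range(total_len)] for i in range(total_len)]

def packed_mask(doc_lens: list) -> list:
    """Block-diagonal causal mask: attention never crosses a document boundary."""
    doc_id = []
    for d, length in enumerate(doc_lens):
        doc_id.extend([d] * length)
    total = sum(doc_lens)
    return [
        [j <= i and doc_id[i] == doc_id[j] for j in range(total)]
        for i in range(total)
    ]

# Two documents of 3 tokens each, packed into one sequence of 6.
naive = causal_mask(6)
correct = packed_mask([3, 3])

# Token 3 is the first token of document 2:
print(naive[3][:3])    # [True, True, True]  <- naive mask leaks into document 1
print(correct[3][:3])  # [False, False, False]
```

In a heterogeneous corpus those leaked prefixes come from unrelated texts in different languages, which is consistent with the quality degradation observed above.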
See the [paper](https://arxiv.org/abs/2605.09015), Section 4.1, and [github.com/lballore/LLiMba](https://github.com/lballore/LLiMba) for the full training script and configuration files.
## Limitations
All limitations of the deployable model apply, plus the partial loss of instruction-following described above. See the [llimba-3b-instruct model card](https://huggingface.co/lballore/llimba-3b-instruct) for the full Limitations section: hallucination on out-of-training facts, morphological hallucination on long open-ended prompts, dialect skew toward LSC, and unbenchmarked multilingual capability on non-Romance languages.
## License
Model weights are released under the **Apache 2.0** license. See [LICENSE](./LICENSE) for full terms.
The training and evaluation code at [github.com/lballore/LLiMba](https://github.com/lballore/LLiMba) is released separately, also under Apache 2.0.
## Citation
```bibtex
@misc{llimba2026,
  title         = {LLiMba: Sardinian on a Single GPU - Adapting a 3B Language Model to a Vanishing Romance Language},
  author        = {<YOUR_NAME>},
  year          = {2026},
  eprint        = {2605.09015},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.09015}
}
```