Klovis-144M — French Language Model

A 144M-parameter French language model fully designed, implemented, and trained by Eric Houzelle. Every component — architecture, training pipeline, and inference engine — was written in PyTorch without relying on any pre-trained weights or third-party model code.

Klovis demonstrates that a single engineer can deliver a complete, modern Transformer with state-of-the-art architectural components, trained end-to-end on a single NVIDIA L40S GPU for a total compute budget of approximately €50.


Key Facts

| Fact | Value |
|---|---|
| Parameters | 144M |
| Architecture | Decoder-only Transformer |
| Language | French |
| Tokenizer | CamemBERT (`camembert-base`, 32k vocab) |
| Context window | 256 tokens |
| Chat format | ChatML |
| Training hardware | 1× NVIDIA L40S |
| Total training cost | ~€50 |
| Pre-training data | ~26M French texts (Wikipedia FR + FineWeb-2) |
| SFT data | 6 curated French conversational datasets |
| SFT epochs | 15 |
| License | Apache 2.0 |
| Author | Eric Houzelle |

Quick Start

Installation

pip install transformers torch safetensors sentencepiece

Text Generation

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Klovis-ai/Klovis-144M-french"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "La France est un pays"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs["input_ids"],
    max_new_tokens=100,
    do_sample=True,  # required: temperature and top_p are ignored with greedy decoding
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Conversational Mode (ChatML)

The model was fine-tuned with ChatML formatting for assistant-style interactions:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Klovis-ai/Klovis-144M-french"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = (
    "<|system|>\n"
    "Tu es un assistant utile et concis. Réponds en français.<|end|>\n"
    "<|user|>\n"
    "Quelle est la capitale de la France ?<|end|>\n"
    "<|assistant|>\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs["input_ids"],
    max_new_tokens=150,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Special Tokens

| Token | Role |
|---|---|
| `<\|system\|>` | Start of system message |
| `<\|user\|>` | Start of user message |
| `<\|assistant\|>` | Start of assistant response |
| `<\|end\|>` | End of turn |
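Assembling these tokens by hand is error-prone, so a small helper can build the prompt from a list of turns. The sketch below is illustrative (not part of the released code); it ends the prompt with an open assistant turn for the model to complete:

```python
def build_chatml_prompt(messages):
    """Assemble a ChatML prompt from (role, content) pairs.

    Roles are expected to be "system", "user", or "assistant". The
    returned string ends with an open assistant turn.
    """
    parts = [f"<|{role}|>\n{content}<|end|>\n" for role, content in messages]
    parts.append("<|assistant|>\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    ("system", "Tu es un assistant utile et concis. Réponds en français."),
    ("user", "Quelle est la capitale de la France ?"),
])
```

The resulting string matches the hand-written prompt in the conversational example above.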

Architecture

Klovis implements a decoder-only Transformer using the same building blocks found in LLaMA, Mistral, and Gemma — scaled down to a compact 144M-parameter footprint:

| Component | Detail |
|---|---|
| Embedding dim | 768 |
| Transformer layers | 14 |
| Query heads | 12 |
| KV heads (GQA) | 4 |
| FFN hidden dim | 3072 |
| FFN activation | SwiGLU |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | RoPE (Rotary Position Embedding) |
| Weight tying | Input embeddings ↔ output projection |
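A back-of-envelope parameter count from these dimensions lands close to the stated 144M. The arithmetic below is a rough sketch: it assumes a ~32k vocabulary, head dim 64 (768 / 12), no biases, and omits the small RMSNorm weights; the exact figure depends on the real vocab size.

```python
# Rough parameter count from the architecture table (norms and biases omitted).
vocab, d_model, layers = 32_000, 768, 14
q_heads, kv_heads = 12, 4
head_dim = d_model // q_heads        # 64
ffn_dim = 3072

embed = vocab * d_model                       # tied with the output projection
attn  = d_model * (q_heads * head_dim)        # Q projection
attn += 2 * d_model * (kv_heads * head_dim)   # K and V projections (GQA: 4 heads each)
attn += (q_heads * head_dim) * d_model        # attention output projection
ffn   = 3 * d_model * ffn_dim                 # SwiGLU: gate, up, and down matrices
total = embed + layers * (attn + ffn)
print(f"{total / 1e6:.1f}M")  # -> 145.7M, in the right ballpark for "144M"
```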

Grouped-Query Attention (GQA): 12 query heads share 4 KV heads, reducing KV-cache memory by 3× while preserving attention capacity.
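The 3× saving follows directly from the head counts, since KV-cache size scales with the number of KV heads rather than query heads. A quick illustrative calculation (fp16 cache, batch size 1, head dim 64):

```python
# KV-cache size for Klovis-144M-style dimensions (fp16, batch size 1).
layers, seq_len, head_dim, bytes_per_val = 14, 256, 64, 2

def kv_cache_bytes(num_kv_heads):
    # Two cached tensors per layer (K and V), each [seq_len, num_kv_heads, head_dim].
    return 2 * layers * seq_len * num_kv_heads * head_dim * bytes_per_val

mha = kv_cache_bytes(12)  # full multi-head attention: one KV head per query head
gqa = kv_cache_bytes(4)   # grouped-query attention: 4 shared KV heads
print(mha // gqa)  # -> 3
```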

Advanced Feature: Recurrent-Depth Transformer (RDT)

The codebase also implements an experimental Recurrent-Depth Transformer mode (inspired by OpenMythos/Parcae, Prairie et al. 2026), where a single Transformer block is applied iteratively:

Input → [Prelude Layers] → [Shared Block × T steps] → [Coda Layers] → Output

RDT components:

  • LTI Injection — Linear Time-Invariant state coupling with guaranteed spectral stability
  • Adaptive Computation Time (ACT) — learned per-position halting for dynamic compute allocation
  • Depth LoRA — low-rank adapters per recurrent step for step-wise specialization
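The control flow of the recurrent-depth forward pass can be sketched in a few lines. This is a toy functional version with plain callables standing in for Transformer layers; LTI injection, ACT halting, and Depth LoRA are omitted for clarity, and it is not the repository's actual implementation.

```python
def recurrent_depth_forward(x, prelude, shared_block, coda, num_steps):
    """Toy recurrent-depth forward pass.

    prelude / coda are lists of layer functions; shared_block is a single
    function (one weight-shared Transformer block) applied num_steps times.
    """
    for layer in prelude:
        x = layer(x)
    for _ in range(num_steps):  # same weights reused at every recurrent step
        x = shared_block(x)
    for layer in coda:
        x = layer(x)
    return x

# Scalar stand-ins for layers: two prelude layers, 3 recurrent steps, one coda layer.
out = recurrent_depth_forward(
    1.0,
    prelude=[lambda v: v + 1, lambda v: v * 2],
    shared_block=lambda v: v + 10,
    coda=[lambda v: v - 4],
    num_steps=3,
)
print(out)  # -> 30.0
```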

Training

Phase 1 — Pre-training

| Setting | Value |
|---|---|
| Data | ~26M French texts |
| Sources | CATIE-AQ/wikipedia_fr_2022, HuggingFaceFW/fineweb-2 (fra_Latn) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, weight decay 0.1) |
| Scheduler | Linear warmup (2000 steps) → cosine decay |
| Precision | Mixed precision (AMP + GradScaler) |
| Effective batch size | 128 (32 × 4 gradient accumulation) |
| Progressive training | Block size 256 → 512 at step 2000 |
| Compilation | torch.compile |
| Label smoothing | 0.02 |
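The warmup-then-cosine schedule can be written as a small closed-form function. The sketch below uses the 2000-step warmup from the table; the peak learning rate and total step count are illustrative placeholders, not the values used to train Klovis.

```python
import math

def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Halfway through warmup the LR is half the (hypothetical) peak.
half = lr_at(1000, peak_lr=3e-4, warmup_steps=2000, total_steps=50_000)
print(half)  # -> 0.00015
```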

Phase 2 — Supervised Fine-Tuning (SFT)

The pre-trained model was fine-tuned on 6 curated French conversational datasets across 15 epochs, with prompt masking so that only assistant tokens contribute to the loss.

| Setting | Value |
|---|---|
| Epochs | 15 |
| Learning rate | 2e-5 |
| Effective batch size | 128 (32 × 4 gradient accumulation) |
| Max grad norm | 1.0 |
| Chat format | ChatML |
| Loss | Cross-entropy on assistant tokens only |
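Prompt masking is typically done by setting the labels of non-assistant positions to -100, the index that PyTorch's `CrossEntropyLoss` ignores by default. A minimal illustration with made-up token IDs (not the repository's actual masking code):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by CrossEntropyLoss

def mask_non_assistant(token_ids, is_assistant):
    """Return labels where only assistant-turn tokens contribute to the loss."""
    return [t if a else IGNORE_INDEX for t, a in zip(token_ids, is_assistant)]

# Toy example: three user-prompt tokens masked out, three assistant tokens kept.
token_ids    = [11, 12, 13, 21, 22, 23]
is_assistant = [False, False, False, True, True, True]
labels = mask_non_assistant(token_ids, is_assistant)
print(labels)  # -> [-100, -100, -100, 21, 22, 23]
```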

SFT Datasets

| Dataset | Format | Details |
|---|---|---|
| Houzeric/everyday-conversations | Flat (user/assistant) | Custom-built |
| CATIE-AQ/facebook-community-alignment-dataset_french_conversation | Multi-turn conversations | Community alignment |
| angeluriot/french_instruct | Multi-turn conversations | 40k samples, filtered |
| jpacifico/French-Alpaca-dataset-Instruct-55K | Alpaca format | 20k samples |
| Houzeric/french-prompts-and-questions | Flat (prompt/answer) | 15k samples, filtered |
| Houzeric/physics-FR-qa-dataset | Flat (question/answer) | 5k samples, scientific QA |

What to Expect

Klovis is a technical demonstration: it shows that a single engineer can design, train, and deploy a modern Transformer on a single GPU for under €50.

With 144M parameters, the model is capable of:

  • Generating grammatically correct French text
  • Following the ChatML conversational format
  • Producing coherent responses on simple topics

Known limitations:

  • Factual responses are frequently incorrect or fabricated (hallucinations)
  • Logical reasoning is limited
  • Responses can be repetitive or drift off-topic
  • Context limited to 256 tokens
  • French only

This model is a demonstration of what a single developer can achieve with a modern architecture at small scale. It is not intended to replace larger models for production use.


Technical Details

Implementation Highlights

  • Custom implementation: every component (attention, RoPE, RMSNorm, SwiGLU, GQA, training loop, generation) is implemented in PyTorch — no external model code
  • Hugging Face compatible: inherits from PreTrainedModel and GenerationMixin, works with AutoModelForCausalLM
  • KV-cache inference: supports incremental decoding for efficient generation
  • Multiple weight-sharing modes: standard, shared FFN, full sharing, and Recurrent-Depth
  • Streaming chat: interactive CLI with real-time token-by-token output
  • Monitoring: integrated with Trackio for live training dashboards

Source Code

The full source code is available at: github.com/eric-houzelle/mini-gpt


License

This model is released under the Apache 2.0 License.


Designed and trained by Eric Houzelle.
