Julian: Efficient Training of a Bilingual 600M Parameter Language Model on TPU with JAX
Paper by Julian Kerignard | February 2026
Abstract
We present Julian, a family of decoder-only language models ranging from 100M to 600M parameters, trained entirely from scratch on up to 39 billion tokens of bilingual English-French data (70%/30%) using JAX/Flax on Google Cloud TPU v4-32. Our largest model, Julian-600M, employs a modern transformer architecture with Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm, following the design principles of LLaMA.
Despite being trained on significantly fewer tokens than comparable models, Julian-600M achieves 53.5% on HellaSwag, outperforming OPT-1.3B (41.5%) which has over twice the parameters and was trained on 8x more data. We further analyze supervised fine-tuning (SFT) dynamics, revealing a critical disconnect between training loss reduction and downstream task performance.
Files
| File | Description |
|---|---|
| JulianKrg_600M_paper.pdf | Full paper (English) |
| julian_paper_fr.pdf | Full paper (French) |
| julian_paper.tex | LaTeX source (English) |
| julian_paper_fr.tex | LaTeX source (French) |
Key Results
Pretraining Performance
| Model | Params | Tokens | HellaSwag | PIQA | LAMBADA |
|---|---|---|---|---|---|
| Julian-600M | 600M | 39B | 53.5% | 66.8% | 37.3% |
| OPT-1.3B | 1.3B | 300B | 41.5% | 71.7% | 58.0% |
| GPT-2 XL | 1.5B | ~40B | 50.9% | 70.8% | 51.2% |
| Pythia-1B | 1B | 300B | 37.6% | 69.2% | 56.6% |
| BLOOM-560M | 560M | 350B | 37.1% | 64.5% | 36.5% |
Julian-600M outperforms OPT-1.3B on HellaSwag with less than half the parameters and 7.7x fewer training tokens.
Critical SFT Finding
Our paper provides a detailed analysis of supervised fine-tuning dynamics on 2.47M instruction-response pairs:
| Configuration | Steps | Epochs | Loss | HellaSwag | PIQA | WinoGrande |
|---|---|---|---|---|---|---|
| Base model | - | - | - | 53.5% | 66.8% | 53.8% |
| SFT-30K | 30K | 0.66 | 1.86 | 53.2% | 66.5% | 53.8% |
| SFT-100K | 100K | 2.2 | 1.69 | 53.2% | 66.5% | 52.8% |
Key insight: Training loss decreases 9% between SFT-30K and SFT-100K, but benchmark performance stagnates or degrades. This reveals that training loss is not a reliable proxy for SFT quality: the model memorizes instruction patterns rather than improving generalization. We recommend limiting SFT to <1 epoch for datasets >1M examples and using held-out benchmarks as stopping criteria.
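The recommendation above (stop on held-out benchmarks, not training loss) can be sketched as a simple patience-based stopping rule. This is an illustrative helper, not the paper's actual training code; the function name and thresholds are assumptions.

```python
def should_stop(benchmark_history, patience=2, min_delta=0.001):
    """Return True once the held-out benchmark score has failed to
    improve by at least `min_delta` for `patience` consecutive evals,
    regardless of whether training loss is still decreasing."""
    best = float("-inf")
    stale = 0
    for score in benchmark_history:
        if score > best + min_delta:
            best = score
            stale = 0
        else:
            stale += 1
    return stale >= patience

# Mirrors the HellaSwag trajectory reported above (53.5% -> 53.2% -> 53.2%):
# the flat scores trigger a stop early, long before 100K steps.
print(should_stop([0.535, 0.532, 0.532]))  # True
```

In the paper's setting this rule would have halted fine-tuning near SFT-30K, avoiding the extra 70K steps whose lower loss bought no benchmark improvement.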
Architecture
```
Decoder-only Transformer (600M parameters)
├── Layers: 18
├── Hidden: 1280
├── Heads: 20 (head_dim = 64)
├── FFN: 5120 (SwiGLU)
├── Vocab: 50,000 (SentencePiece)
├── Context: 2048 tokens
├── Position: RoPE (θ = 10000)
└── Norm: RMSNorm (pre-norm)
```
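As a sanity check, the spec above roughly accounts for the 600M figure, assuming a LLaMA-style layout: no bias terms, a three-matrix SwiGLU FFN, and untied input/output embeddings (the tying choice is an assumption, not stated in the source).

```python
# Back-of-the-envelope parameter count for the architecture above.
n_layers, d_model, d_ffn, vocab = 18, 1280, 5120, 50_000

attn = 4 * d_model * d_model   # Q, K, V, O projections (no biases)
ffn = 3 * d_model * d_ffn      # gate, up, down matrices (SwiGLU)
norms = 2 * d_model            # two RMSNorm scale vectors per layer
per_layer = attn + ffn + norms

total = (n_layers * per_layer
         + vocab * d_model     # input embedding
         + vocab * d_model     # output head (assumed untied)
         + d_model)            # final RMSNorm

print(f"{total:,}")  # 599,906,560
```

Under these assumptions the count lands within 0.02% of 600M, consistent with the model name.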
Paper Contents
- Introduction – Motivation and contributions
- Related Work – Comparison with Pythia, OPT, LLaMA, TinyLlama
- Model Architecture – Detailed design choices (RoPE, SwiGLU, RMSNorm)
- Training Infrastructure – Multi-host TPU training with JAX, data pipeline, checkpointing
- Data – Collection, cleaning, tokenization (Wikipedia, FineWeb-Edu, OSCAR, The Stack)
- Pretraining – Hyperparameters, loss curves, scaling analysis
- Supervised Fine-Tuning – SFT methodology, ChatML format, training dynamics
- Evaluation – Benchmark results across 7 tasks with detailed comparisons
- SFT Analysis – Critical findings on loss vs. benchmark divergence
- Conclusion – Practical recommendations for efficient LLM training
Models
All model weights are openly available:
| Model | Link |
|---|---|
| Julian-600M Base (39B tokens) | JulianKrgd/julian-600m-40b |
| Julian-600M Instruct SFT-30K | JulianKrgd/julian-600m-40b-instruct-sft30k |
| Julian-600M Instruct SFT-100K | JulianKrgd/julian-600m-40b-instruct-sft100k |
Citation
```bibtex
@misc{kerignard2026julian,
  author = {Julian Kerignard},
  title  = {Julian: Efficient Training of a Bilingual 600M Parameter Language Model on TPU with JAX},
  year   = {2026},
  url    = {https://huggingface.co/JulianKrgd/julian-600m-paper}
}
```
License
Apache 2.0 – all paper content, LaTeX sources, and associated materials.
Acknowledgments
- Google TPU Research Cloud for compute access
- Hugging Face for model hosting and open-source tools
- JAX/Flax teams for the ML framework