Julian: Efficient Training of a Bilingual 600M Parameter Language Model on TPU with JAX
Paper by Julian Kerignard | February 2026
Abstract
We present Julian, a family of decoder-only language models ranging from 100M to 600M parameters, trained entirely from scratch on up to 39 billion tokens of bilingual English-French data (70%/30%) using JAX/Flax on Google Cloud TPU v4-32. Our largest model, Julian-600M, employs a modern transformer architecture with Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm, following the design principles of LLaMA.
Despite being trained on significantly fewer tokens than comparable models, Julian-600M achieves 53.5% on HellaSwag, outperforming OPT-1.3B (41.5%) which has over twice the parameters and was trained on 8x more data. We further analyze supervised fine-tuning (SFT) dynamics, revealing a critical disconnect between training loss reduction and downstream task performance.
Files
| File | Description |
|---|---|
| JulianKrg_600M_paper.pdf | Full paper (English) |
| julian_paper_fr.pdf | Full paper (French) |
| julian_paper.tex | LaTeX source (English) |
| julian_paper_fr.tex | LaTeX source (French) |
Key Results
Pretraining Performance
| Model | Params | Tokens | HellaSwag | PIQA | LAMBADA |
|---|---|---|---|---|---|
| Julian-600M | 600M | 39B | 53.5% | 66.8% | 37.3% |
| OPT-1.3B | 1.3B | 300B | 41.5% | 71.7% | 58.0% |
| GPT-2 XL | 1.5B | ~40B | 50.9% | 70.8% | 51.2% |
| Pythia-1B | 1B | 300B | 37.6% | 69.2% | 56.6% |
| BLOOM-560M | 560M | 350B | 37.1% | 64.5% | 36.5% |
Julian-600M outperforms OPT-1.3B on HellaSwag with less than half the parameters and 7.7x fewer training tokens.
Critical SFT Finding
Our paper provides a detailed analysis of supervised fine-tuning dynamics on 2.47M instruction-response pairs:
| Configuration | Steps | Epochs | Loss | HellaSwag | PIQA | WinoGrande |
|---|---|---|---|---|---|---|
| Base model | - | - | - | 53.5% | 66.8% | 53.8% |
| SFT-30K | 30K | 0.66 | 1.86 | 53.2% | 66.5% | 53.8% |
| SFT-100K | 100K | 2.2 | 1.69 | 53.2% | 66.5% | 52.8% |
Key insight: Training loss decreases 9% between SFT-30K and SFT-100K, but benchmark performance stagnates or degrades. This reveals that training loss is not a reliable proxy for SFT quality: the model memorizes instruction patterns rather than improving generalization. We recommend limiting SFT to <1 epoch for datasets >1M examples and using held-out benchmarks as stopping criteria.
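The recommendation above (stop on held-out benchmarks, not training loss) can be sketched as a simple patience-based stopping rule. This is an illustrative helper, not the paper's actual training code; the function name and thresholds are assumptions.

```python
def should_stop(benchmark_history, patience=2, min_delta=0.001):
    """Return True once the held-out benchmark score has failed to
    improve by at least `min_delta` for `patience` consecutive evals,
    regardless of whether training loss is still decreasing."""
    best = float("-inf")
    stale = 0
    for score in benchmark_history:
        if score > best + min_delta:
            best = score
            stale = 0
        else:
            stale += 1
    return stale >= patience

# Mirrors the HellaSwag trajectory reported above (53.5% -> 53.2% -> 53.2%):
# the flat scores trigger a stop early, long before 100K steps.
print(should_stop([0.535, 0.532, 0.532]))  # True
```

In the paper's setting this rule would have halted fine-tuning near SFT-30K, avoiding the extra 70K steps whose lower loss bought no benchmark improvement.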
Architecture
```
Decoder-only Transformer (600M parameters)
├── Layers: 18
├── Hidden: 1280
├── Heads: 20 (head_dim = 64)
├── FFN: 5120 (SwiGLU)
├── Vocab: 50,000 (SentencePiece)
├── Context: 2048 tokens
├── Position: RoPE (θ = 10000)
└── Norm: RMSNorm (pre-norm)
```
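As a sanity check, the spec above roughly accounts for the 600M figure, assuming a LLaMA-style layout: no bias terms, a three-matrix SwiGLU FFN, and untied input/output embeddings (the tying choice is an assumption, not stated in the source).

```python
# Back-of-the-envelope parameter count for the architecture above.
n_layers, d_model, d_ffn, vocab = 18, 1280, 5120, 50_000

attn = 4 * d_model * d_model   # Q, K, V, O projections (no biases)
ffn = 3 * d_model * d_ffn      # gate, up, down matrices (SwiGLU)
norms = 2 * d_model            # two RMSNorm scale vectors per layer
per_layer = attn + ffn + norms

total = (n_layers * per_layer
         + vocab * d_model     # input embedding
         + vocab * d_model     # output head (assumed untied)
         + d_model)            # final RMSNorm

print(f"{total:,}")  # 599,906,560
```

Under these assumptions the count lands within 0.02% of 600M, consistent with the model name.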
Paper Contents
- Introduction – Motivation and contributions
- Related Work – Comparison with Pythia, OPT, LLaMA, TinyLlama
- Model Architecture – Detailed design choices (RoPE, SwiGLU, RMSNorm)
- Training Infrastructure – Multi-host TPU training with JAX, data pipeline, checkpointing
- Data – Collection, cleaning, tokenization (Wikipedia, FineWeb-Edu, OSCAR, The Stack)
- Pretraining – Hyperparameters, loss curves, scaling analysis
- Supervised Fine-Tuning – SFT methodology, ChatML format, training dynamics
- Evaluation – Benchmark results across 7 tasks with detailed comparisons
- SFT Analysis – Critical findings on loss vs. benchmark divergence
- Conclusion – Practical recommendations for efficient LLM training
Models
All model weights are openly available:
| Model | Link |
|---|---|
| Julian-600M Base (39B tokens) | JulianKrgd/julian-600m-40b |
| Julian-600M Instruct SFT-30K | JulianKrgd/julian-600m-40b-instruct-sft30k |
| Julian-600M Instruct SFT-100K | JulianKrgd/julian-600m-40b-instruct-sft100k |
Citation
```bibtex
@misc{kerignard2026julian,
  author = {Julian Kerignard},
  title  = {Julian: Efficient Training of a Bilingual 600M Parameter Language Model on TPU with JAX},
  year   = {2026},
  url    = {https://huggingface.co/JulianKrgd/julian-600m-paper}
}
```
License
Apache 2.0 – all paper content, LaTeX sources, and associated materials.
Acknowledgments
- Google TPU Research Cloud for compute access
- Hugging Face for model hosting and open-source tools
- JAX/Flax teams for the ML framework