Julian: Efficient Training of a Bilingual 600M Parameter Language Model on TPU with JAX

Paper by Julian Kerignard | February 2026

Abstract

We present Julian, a family of decoder-only language models ranging from 100M to 600M parameters, trained entirely from scratch on up to 39 billion tokens of bilingual English-French data (70%/30%) using JAX/Flax on Google Cloud TPU v4-32. Our largest model, Julian-600M, employs a modern transformer architecture with Rotary Position Embeddings (RoPE), SwiGLU activations, and RMSNorm, following the design principles of LLaMA.

Despite being trained on far fewer tokens than comparable models, Julian-600M achieves 53.5% on HellaSwag, outperforming OPT-1.3B (41.5%), which has more than twice as many parameters and was trained on roughly 8x more data. We further analyze supervised fine-tuning (SFT) dynamics, revealing a critical disconnect between reductions in training loss and downstream task performance.

Files

| File | Description |
|------|-------------|
| JulianKrg_600M_paper.pdf | Full paper (English) |
| julian_paper_fr.pdf | Full paper (French) |
| julian_paper.tex | LaTeX source (English) |
| julian_paper_fr.tex | LaTeX source (French) |

Key Results

Pretraining Performance

| Model | Params | Tokens | HellaSwag | PIQA | LAMBADA |
|-------|--------|--------|-----------|------|---------|
| Julian-600M | 600M | 39B | 53.5% | 66.8% | 37.3% |
| OPT-1.3B | 1.3B | 300B | 41.5% | 71.7% | 58.0% |
| GPT-2 XL | 1.5B | ~40B | 50.9% | 70.8% | 51.2% |
| Pythia-1B | 1B | 300B | 37.6% | 69.2% | 56.6% |
| BLOOM-560M | 560M | 350B | 37.1% | 64.5% | 36.5% |

Julian-600M outperforms OPT-1.3B on HellaSwag with roughly 2.2x fewer parameters and 7.7x fewer training tokens.
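A quick back-of-envelope check of the ratios quoted above, computed directly from the table's parameter and token counts:

```python
# Parameter and token counts from the pretraining table above.
julian_params, julian_tokens = 600e6, 39e9
opt_params, opt_tokens = 1.3e9, 300e9

param_ratio = opt_params / julian_params   # ~2.2x
token_ratio = opt_tokens / julian_tokens   # ~7.7x
print(f"{param_ratio:.1f}x fewer parameters, {token_ratio:.1f}x fewer tokens")
```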

Critical SFT Finding

Our paper provides a detailed analysis of supervised fine-tuning dynamics on 2.47M instruction-response pairs:

| Configuration | Steps | Epochs | Loss | HellaSwag | PIQA | WinoGrande |
|---------------|-------|--------|------|-----------|------|------------|
| Base model | - | - | - | 53.5% | 66.8% | 53.8% |
| SFT-30K | 30K | 0.66 | 1.86 | 53.2% | 66.5% | 53.8% |
| SFT-100K | 100K | 2.2 | 1.69 | 53.2% | 66.5% | 52.8% |

Key insight: Training loss decreases by 9% between SFT-30K and SFT-100K, yet benchmark performance stagnates or degrades. This reveals that training loss is not a reliable proxy for SFT quality: the model memorizes instruction patterns rather than improving generalization. We recommend limiting SFT to <1 epoch for datasets >1M examples and using held-out benchmarks as stopping criteria.
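One way the benchmark-gated stopping rule recommended above could be sketched (a hypothetical helper, not code from the paper; `history` is a list of held-out benchmark accuracies recorded at each evaluation checkpoint):

```python
def should_stop(history, patience=2, min_delta=0.001):
    """Stop SFT when held-out benchmark accuracy fails to improve.

    Returns True if none of the last `patience` evaluations beat the
    best earlier score by more than `min_delta`.
    """
    if len(history) <= patience:
        return False  # not enough evaluations yet
    best_earlier = max(history[:-patience])
    recent = history[-patience:]
    return all(acc <= best_earlier + min_delta for acc in recent)
```

With the HellaSwag trajectory from the table (53.5% → 53.2% → 53.2%), such a rule would halt training well before the second epoch.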

Architecture

Decoder-only Transformer (600M parameters)
├── Layers: 18
├── Hidden: 1280
├── Heads: 20 (head_dim = 64)
├── FFN: 5120 (SwiGLU)
├── Vocab: 50,000 (SentencePiece)
├── Context: 2048 tokens
├── Position: RoPE (θ = 10000)
└── Norm: RMSNorm (pre-norm)
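The three non-standard components above (RMSNorm, SwiGLU, RoPE) can be sketched in a few lines of NumPy. This is an illustrative reimplementation with placeholder weight shapes taken from the table (hidden=1280, FFN=5120, head_dim=64, θ=10000), not the model's actual JAX/Flax code:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Pre-norm RMSNorm: divide by the root-mean-square, no mean subtraction.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * gamma

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU FFN: SiLU(x @ W_gate) gates (x @ W_up), then project back down.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def rope(x, theta=10000.0):
    # Rotary position embeddings over (seq, heads, head_dim):
    # rotate each feature pair by a position-dependent angle.
    seq, _, d = x.shape
    inv_freq = theta ** (-np.arange(0, d, 2) / d)       # (d/2,)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, d/2)
    cos = np.cos(ang)[:, None, :]
    sin = np.sin(ang)[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Note that RoPE is a pure rotation: position 0 is left unchanged and vector norms are preserved, which is why it composes cleanly with pre-norm attention.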

Paper Contents

  1. Introduction – Motivation and contributions
  2. Related Work – Comparison with Pythia, OPT, LLaMA, TinyLlama
  3. Model Architecture – Detailed design choices (RoPE, SwiGLU, RMSNorm)
  4. Training Infrastructure – Multi-host TPU training with JAX, data pipeline, checkpointing
  5. Data – Collection, cleaning, tokenization (Wikipedia, FineWeb-Edu, OSCAR, The Stack)
  6. Pretraining – Hyperparameters, loss curves, scaling analysis
  7. Supervised Fine-Tuning – SFT methodology, ChatML format, training dynamics
  8. Evaluation – Benchmark results across 7 tasks with detailed comparisons
  9. SFT Analysis – Critical findings on loss vs. benchmark divergence
  10. Conclusion – Practical recommendations for efficient LLM training

Models

All model weights are openly available:

| Model | Link |
|-------|------|
| Julian-600M Base (39B tokens) | JulianKrgd/julian-600m-40b |
| Julian-600M Instruct SFT-30K | JulianKrgd/julian-600m-40b-instruct-sft30k |
| Julian-600M Instruct SFT-100K | JulianKrgd/julian-600m-40b-instruct-sft100k |

Citation

@misc{kerignard2026julian,
  author = {Julian Kerignard},
  title = {Julian: Efficient Training of a Bilingual 600M Parameter Language Model on TPU with JAX},
  year = {2026},
  url = {https://huggingface.co/JulianKrgd/julian-600m-paper}
}

License

Apache 2.0 – All paper content, LaTeX sources, and associated materials.

Acknowledgments

  • Google TPU Research Cloud for compute access
  • Hugging Face for model hosting and open-source tools
  • JAX/Flax teams for the ML framework