Julian 600M - 10B Tokens (Early Checkpoint)

A 600M parameter decoder-only language model trained from scratch using JAX/Flax on Google Cloud TPUs.

⚠️ Early Checkpoint: This is an intermediate checkpoint at 10B tokens (~25% of the planned 39B-token run). See julian-600m-40b for the fully trained model.

Model Description

Julian is a causal language model designed for text generation, trained on a mix of English (70%) and French (30%) data. The architecture follows modern best practices with RoPE positional embeddings, SwiGLU activations, and RMSNorm.

Architecture

| Component | Configuration |
|---|---|
| Parameters | 599.9M |
| Layers | 18 |
| Hidden Size | 1280 |
| Attention Heads | 16 |
| Head Dimension | 80 |
| Intermediate Size | 5120 (SwiGLU) |
| Vocabulary | 50,000 (SentencePiece) |
| Context Length | 2048 |
| Positional Encoding | RoPE (θ=10000) |
| Normalization | RMSNorm (pre-norm) |
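
For illustration, here is a minimal NumPy sketch of the two less-standard components above, RMSNorm and a SwiGLU MLP, using the dimensions from the table. This is a shape-level illustration, not the model's actual Flax code; the random weights are placeholders.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square of activations (no mean subtraction).
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU MLP: SiLU-gated up-projection, as in LLaMA-style blocks.
    silu = lambda z: z / (1.0 + np.exp(-z))  # SiLU (swish) activation
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

hidden, intermediate = 1280, 5120  # dimensions from the table above
x = np.random.randn(1, hidden).astype(np.float32)
y = swiglu_mlp(rms_norm(x, np.ones(hidden)),
               np.random.randn(hidden, intermediate) * 0.02,
               np.random.randn(hidden, intermediate) * 0.02,
               np.random.randn(intermediate, hidden) * 0.02)
print(y.shape)  # (1, 1280)
```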

Benchmarks (at 10B tokens / ~25% training)

Evaluated using lm-evaluation-harness (0-shot).

| Benchmark | Score |
|---|---|
| HellaSwag | 45.8% |
| PIQA | 67.6% |
| LAMBADA | 35.0% |

Comparison with Open-Source Models

| Model | Params | Tokens | HellaSwag | PIQA |
|---|---|---|---|---|
| OPT-350M | 350M | 300B | 36.7% | 64.6% |
| Pythia-410M | 410M | 300B | 40.9% | 66.8% |
| BLOOM-560M | 560M | 350B | 37.1% | 64.5% |
| **Julian 600M** | 600M | 10B | **45.8%** | **67.6%** |
| GPT-2 Large | 774M | ~40B | 45.6% | 72.1% |
| Pythia-1B | 1B | 300B | 49.7% | 70.7% |
| TinyLlama-1.1B | 1.1B | 3T | 59.2% | 73.3% |
| GPT-Neo-1.3B | 1.3B | 380B | 38.7% | 71.1% |

💡 Key insight: Julian 600M matches GPT-2 Large (774M) on HellaSwag with only 10B tokens (vs ~40B) and 22% fewer parameters.

⚠️ Note: These results are at ~25% training. See julian-600m-40b for improved final scores.

Training Details (at this checkpoint)

| Metric | Value |
|---|---|
| Tokens Trained | 10B |
| Target Tokens | 39B |
| Training Steps | ~76,000 |
| Batch Size | 256 (global) |
| Learning Rate | 3e-4 → 3e-5 (cosine decay) |
| Hardware | TPU v4-32 |
| Framework | JAX + Flax |
| Precision | bfloat16 |
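
The learning-rate row corresponds to a warmup-plus-cosine schedule; in optax it could be written roughly as below. The warmup length is an assumption, since it is not stated in this card.

```python
import optax

# Hypothetical reconstruction of the schedule in the table above.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-4,     # peak LR from the table
    warmup_steps=2_000,  # assumption: warmup length is not stated in the card
    decay_steps=76_000,  # total steps, per the table
    end_value=3e-5,      # final LR from the table
)
print(schedule(0), schedule(76_000))
```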

Training Data

| Source | Proportion |
|---|---|
| Wikipedia EN | ~25% |
| Wikipedia FR | ~10% |
| OSCAR (EN/FR) | ~40% |
| The Stack (Code) | ~15% |
| Gutenberg Books | ~10% |

Language ratio: 70% English, 30% French.
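One simple way to realize such a mixture is to pick the source of each training document by weighted sampling. The sketch below is illustrative only; the source names and mechanism are placeholders, and the actual data pipeline is not described in this card.

```python
import random

# Illustrative source-mixing weights from the table above.
sources = {
    "wikipedia_en": 0.25,
    "wikipedia_fr": 0.10,
    "oscar_en_fr": 0.40,
    "the_stack": 0.15,
    "gutenberg": 0.10,
}

def sample_source(rng=random):
    # Draw the source of the next training document according to the mix.
    names, weights = zip(*sources.items())
    return rng.choices(names, weights=weights, k=1)[0]

print(sample_source())
```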

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("JulianKrgd/julian-600m-10b")
tokenizer = AutoTokenizer.from_pretrained("JulianKrgd/julian-600m-10b")

prompt = "La France est"  # French: "France is"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample up to 100 new tokens with moderate-temperature sampling.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
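
Equivalently, the transformers pipeline wrapper handles tokenization and decoding for you:

```python
from transformers import pipeline

# Convenience wrapper around the same model and tokenizer.
generator = pipeline("text-generation", model="JulianKrgd/julian-600m-10b")
out = generator("La France est", max_new_tokens=100, do_sample=True, temperature=0.8)
print(out[0]["generated_text"])
```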

💡 Recommendation: Use the fully trained julian-600m-40b for better results.

Model Family

| Model | Parameters | Tokens | Status |
|---|---|---|---|
| julian-600m-10b | 600M | 10B | ✅ Early checkpoint (this model) |
| julian-600m-40b | 600M | 39.3B | ✅ Released |
| julian-1b | 1B | 80B | 📋 Planned |

Limitations

  • Context Length: Limited to 2048 tokens
  • Languages: Primarily English and French
  • Training: Only ~25% complete at this checkpoint
  • Safety: Not instruction-tuned or safety-aligned

Why does Julian outperform GPT-2?

  1. Modern architecture: RoPE + SwiGLU + RMSNorm (like LLaMA); see the RoPE sketch after this list
  2. Better data: Curated mix with quality filtering
  3. Efficient training: Modern optimizations (bfloat16, gradient checkpointing)
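
For the first point, here is a minimal NumPy sketch of RoPE at θ=10000 with the card's context length and head dimension. The interleaved even/odd pairing used here is one common convention; implementations differ, so treat this as an illustration rather than the model's exact code.

```python
import numpy as np

def rope(x, base=10000.0):
    # Rotary positional embedding over a (seq_len, head_dim) slice:
    # each even/odd feature pair is rotated by a position-dependent angle.
    seq_len, head_dim = x.shape
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)  # (head_dim/2,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]    # (seq_len, head_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(2048, 80)  # context length and head dimension from the card
print(rope(q).shape)           # (2048, 80)
```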

Acknowledgments

  • Google Cloud TPU Research Program for compute resources
  • JAX/Flax team for the excellent ML framework
  • Hugging Face for model hosting

License

Apache 2.0

Citation

```bibtex
@misc{julian2025,
  author = {Julian Kerignard},
  title = {Julian: A 600M Parameter Language Model (10B Tokens Checkpoint)},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/JulianKrgd/julian-600m-10b}
}
```