# Julian 600M - 10B Tokens (Early Checkpoint)
A 600M parameter decoder-only language model trained from scratch using JAX/Flax on Google Cloud TPUs.
> ⚠️ **Early Checkpoint:** This is an intermediate checkpoint at 10B tokens (~25% of training). See **julian-600m-40b** for the fully trained model.
## Model Description
Julian is a causal language model designed for text generation, trained on a mix of English (70%) and French (30%) data. The architecture follows modern best practices with RoPE positional embeddings, SwiGLU activations, and RMSNorm.
### Architecture
| Component | Configuration |
|---|---|
| Parameters | 599.9M |
| Layers | 18 |
| Hidden Size | 1280 |
| Attention Heads | 16 |
| Head Dimension | 80 |
| Intermediate Size | 5120 (SwiGLU) |
| Vocabulary | 50,000 (SentencePiece) |
| Context Length | 2048 |
| Positional Encoding | RoPE (θ=10000) |
| Normalization | RMSNorm (pre-norm) |
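As a sanity check, the parameter count in the table can be reproduced from the listed dimensions. This is a rough estimate that assumes untied input/output embeddings and ignores the (negligible) RMSNorm weights; the actual breakdown is not stated in the card:

```python
# Rough parameter-count estimate from the architecture table above.
# Assumes untied input/output embeddings; RMSNorm weights are negligible.
n_layers, d_model, d_ff, vocab = 18, 1280, 5120, 50_000

attn = 4 * d_model * d_model      # Q, K, V, and output projections per layer
mlp = 3 * d_model * d_ff          # SwiGLU uses three matrices (gate, up, down)
per_layer = attn + mlp
embeddings = 2 * vocab * d_model  # input embedding + untied LM head

total = n_layers * per_layer + embeddings
print(f"{total / 1e6:.1f}M parameters")  # ≈ 599.9M, matching the table
```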
## Benchmarks (at 10B tokens / ~25% training)
Evaluated using lm-evaluation-harness (0-shot).
| Benchmark | Score |
|---|---|
| HellaSwag | 45.8% |
| PIQA | 67.6% |
| LAMBADA | 35.0% |
### Comparison with Open-Source Models
| Model | Params | Tokens | HellaSwag | PIQA |
|---|---|---|---|---|
| OPT-350M | 350M | 300B | 36.7% | 64.6% |
| Pythia-410M | 410M | 300B | 40.9% | 66.8% |
| BLOOM-560M | 560M | 350B | 37.1% | 64.5% |
| Julian 600M | 600M | 10B | 45.8% | 67.6% |
| GPT-2 Large | 774M | ~40B | 45.6% | 72.1% |
| Pythia-1B | 1B | 300B | 49.7% | 70.7% |
| TinyLlama-1.1B | 1.1B | 3T | 59.2% | 73.3% |
| GPT-Neo-1.3B | 1.3B | 380B | 38.7% | 71.1% |
> 💡 **Key insight:** Julian 600M matches GPT-2 Large (774M) on HellaSwag with only 10B training tokens (vs. ~40B) and ~22% fewer parameters.

> ⚠️ **Note:** These results are at ~25% of training. See julian-600m-40b for improved final scores.
## Training Details (at this checkpoint)
| Metric | Value |
|---|---|
| Tokens Trained | 10B |
| Target Tokens | 39B |
| Training Steps | ~76,000 (planned full run) |
| Batch Size | 256 (global) |
| Learning Rate | 3e-4 → 3e-5 (cosine decay) |
| Hardware | TPU v4-32 |
| Framework | JAX + Flax |
| Precision | bfloat16 |
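The learning-rate schedule from the table (3e-4 decaying to 3e-5 with cosine decay) can be sketched as follows. The warmup length is an assumption for illustration; the card does not specify one:

```python
import math

MAX_LR, MIN_LR = 3e-4, 3e-5
TOTAL_STEPS = 76_000
WARMUP_STEPS = 1_000  # assumed; not stated in the card

def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay from MAX_LR down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(learning_rate(WARMUP_STEPS))  # peak LR: 3e-4
print(learning_rate(TOTAL_STEPS))   # final LR: 3e-5
```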
## Training Data
| Source | Proportion |
|---|---|
| Wikipedia EN | ~25% |
| Wikipedia FR | ~10% |
| OSCAR (EN/FR) | ~40% |
| The Stack (Code) | ~15% |
| Gutenberg Books | ~10% |
**Language ratio:** 70% English, 30% French.
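To illustrate how a mixture like the one above might be sampled during training, here is a minimal weighted-sampling sketch. The proportions come from the table, but the sampling mechanism itself is an assumption; the card does not describe the actual data pipeline:

```python
import random

# Mixture weights from the training-data table (must sum to 1.0).
MIXTURE = {
    "wikipedia_en": 0.25,
    "wikipedia_fr": 0.10,
    "oscar_en_fr": 0.40,
    "the_stack": 0.15,
    "gutenberg": 0.10,
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document by mixture weight."""
    return rng.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly proportional to MIXTURE
```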
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("JulianKrgd/julian-600m-10b")
tokenizer = AutoTokenizer.from_pretrained("JulianKrgd/julian-600m-10b")

# French prompt; the model is trained on 70% English / 30% French data.
prompt = "La France est"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampled generation; adjust temperature and max_new_tokens to taste.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
> 💡 **Recommendation:** Use the fully trained julian-600m-40b for better results.
## Model Family
| Model | Parameters | Tokens | Status |
|---|---|---|---|
| julian-600m-10b | 600M | 10B | ✅ Early checkpoint |
| julian-600m-40b | 600M | 39.3B | ✅ Released |
| julian-1b (planned) | 1B | 80B | 📋 Planned |
## Limitations
- **Context Length:** Limited to 2048 tokens
- **Languages:** Primarily English and French
- **Training:** Only ~25% complete at this checkpoint
- **Safety:** Not instruction-tuned or safety-aligned
## Why Does Julian Outperform GPT-2?
- Modern architecture: RoPE + SwiGLU + RMSNorm (like LLaMA)
- Better data: Curated mix with quality filtering
- Efficient training: Modern optimizations (bfloat16, gradient checkpointing)
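For reference, two of the LLaMA-style components listed above (RMSNorm and SwiGLU) can be sketched in a few lines of NumPy. This is a minimal illustration, not the model's actual Flax code:

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm: rescale by the root-mean-square; no mean subtraction, no bias."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU MLP: SiLU-gated projection, as in LLaMA-style blocks."""
    silu = lambda z: z / (1 + np.exp(-z))  # SiLU / swish activation
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d_model, d_ff = 1280, 5120  # dimensions from the architecture table
x = np.random.randn(4, d_model)  # a batch of 4 hidden states
out = swiglu(
    rms_norm(x, np.ones(d_model)),
    np.random.randn(d_model, d_ff) * 0.02,
    np.random.randn(d_model, d_ff) * 0.02,
    np.random.randn(d_ff, d_model) * 0.02,
)
print(out.shape)  # (4, 1280)
```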
## Acknowledgments
- Google Cloud TPU Research Program for compute resources
- JAX/Flax team for the excellent ML framework
- Hugging Face for model hosting
## License
Apache 2.0
## Citation
```bibtex
@misc{julian2025,
  author    = {Julian Kerignard},
  title     = {Julian: A 600M Parameter Language Model (10B Tokens Checkpoint)},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/JulianKrgd/julian-600m-10b}
}
```