Oren Data Distillation Experiment
Two identical d10 models (100M params) trained to validate the hypothesis that quality-filtered data enables more efficient training.
d10 model trained on 700M tokens of raw Common Crawl data
This is a d10 model (~100M parameters) trained as part of a research project investigating the impact of training data quality on LLM performance.
- Architecture: 10-layer transformer with 640 hidden dimensions
- Training framework: nanochat
- Base tokenizer: BPE with 65K vocab
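As a sanity check on the ~100M figure, here is a back-of-the-envelope parameter count from the config above. The 12·d² per-layer approximation is a standard transformer estimate, not taken from the nanochat source, and whether the input and output embeddings are tied is an assumption in either direction:

```python
# Rough parameter-count estimate for the d10 config.
# Per transformer layer: ~4*d^2 (attention) + ~8*d^2 (MLP) = 12*d^2,
# ignoring norms and biases. Embedding table: vocab * d.
n_layer, n_embd, vocab = 10, 640, 65536

per_layer = 12 * n_embd ** 2    # ~4.9M per layer
blocks = n_layer * per_layer    # ~49M across the stack
embed = vocab * n_embd          # ~42M embedding table

tied = blocks + embed           # input/output embeddings shared
untied = blocks + 2 * embed     # separate lm_head

print(f"{tied / 1e6:.0f}M - {untied / 1e6:.0f}M")  # → 91M - 133M
```

Either way the estimate brackets the stated ~100M parameters.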
This model is part of Phase 2 of the Oren project, which validates the hypothesis:
"Quality-filtered training data enables smaller, more efficient models with comparable performance."
I trained two identical models: Model A on raw Common Crawl data and Model B on quality-filtered data.
Key Finding: Model B reached a similar loss (4.44 vs. 4.38 for Model A) with 29% less training data and 29% less training time.
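The headline numbers can be verified with quick arithmetic (reading 4.44 as Model B's final loss, 4.38 as Model A's, and applying the 29% reduction to the 700M-token budget):

```python
# Relative loss gap implied by the reported final losses.
loss_a, loss_b = 4.38, 4.44
gap = (loss_b - loss_a) / loss_a
print(f"Model B's loss is {gap:.1%} higher")  # → 1.4% higher

# Tokens saved, assuming Model A's 700M-token budget and a 29% reduction.
tokens_saved = 0.29 * 700e6
print(f"~{tokens_saved / 1e6:.0f}M fewer tokens")  # → ~203M fewer tokens
```

So the trade is roughly a 1.4% loss penalty for about 200M fewer training tokens.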
```python
import torch
from nanochat.gpt import GPT, GPTConfig
from nanochat.tokenizer import get_tokenizer
from nanochat.engine import Engine

# Load model
checkpoint = torch.load("pytorch_model.bin")
config = GPTConfig(
    sequence_len=512,
    vocab_size=65536,
    n_layer=10,
    n_head=10,
    n_kv_head=10,
    n_embd=640,
)
model = GPT(config)
model.load_state_dict(checkpoint)
model.eval()

# Generate text
tokenizer = get_tokenizer()
engine = Engine(model, tokenizer)
prompt_tokens = tokenizer("The capital of France is", prepend="<|bos|>")
output, _ = engine.generate_batch(prompt_tokens, max_tokens=50, temperature=0.7)
print(tokenizer.decode(output[0]))
```
If you use this model, please cite:
```bibtex
@software{oren2025,
  title={Oren: Quality Auditing for LLM Training Data},
  author={Amir Valizadeh},
  year={2025},
  url={https://github.com/vitalune/Oren}
}
```
MIT License - See LICENSE file for details