Metadata Conditioned LLMs (collection, 92 items)
Pretraining data: English NOW corpus (english-corpora.org/now). Paper: arxiv.org/abs/2601.15236. Code: github.com/iamshnoo/metadata_localization
This repo contains the leave-out-America 1B model, exported from the step-4k checkpoint of the metadata localization project. It was trained from scratch on the project corpus, using the Llama 3.2 tokenizer and vocabulary.
pretrain | leave_one_out | 1b | without_metadata | 4k
Trained from scratch; tokenizer/vocabulary from meta-llama/Llama-3.2-1B
20/12/2025_15:54:51_combined_no_america_without_metadata_1b
https://wandb.ai/iamshnoo/nanotron/runs/112kjd1x
finished | 114h 24m 56s

KPI/train_lm_loss: 2.0795
KPI/train_perplexity: 8.0008
KPI/val_loss: 2.1485
KPI/val_perplexity: 8.5716
KPI/consumed_tokens/train: 41,943,040,000
_step: 10,000
train_steps: 10,000
sequence_length: 2,048
micro_batch_size: 8
batch_accumulation_per_replica: 64
learning_rate: 0.003
min_decay_lr: 0.0003
checkpoint_interval: 1,000

Static plots below were exported from the private Weights & Biases run and embedded here for public access.
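The run summary above lists batch settings and a total token count, and the two can be cross-checked with a little arithmetic. The sketch below is a sanity check, not part of the project code. It assumes that tokens per step equal sequence_length × micro_batch_size × batch_accumulation_per_replica × (number of data-parallel replicas), that the replica count (not stated in the summary) is whatever factor is left over, and that "4k" means step 4,000 with a constant token count per step. The variable names mirror the summary keys; `implied_replicas`, `tokens_at_step_4k`, and `perplexity` are illustrative names.

```python
import math

# Values copied from the run summary.
sequence_length = 2048
micro_batch_size = 8
batch_accumulation_per_replica = 64
train_steps = 10_000
consumed_tokens = 41_943_040_000

# Tokens processed by one replica in one optimizer step.
tokens_per_replica_step = (
    sequence_length * micro_batch_size * batch_accumulation_per_replica
)

# Tokens per optimizer step across the whole run, from the reported total.
tokens_per_step = consumed_tokens // train_steps

# Assumption: the leftover factor is the data-parallel replica count,
# which the summary does not state explicitly.
implied_replicas = tokens_per_step // tokens_per_replica_step

# Assumption: the "4k" checkpoint is step 4,000 and every step consumes
# the same number of tokens.
tokens_at_step_4k = 4_000 * tokens_per_step


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(loss)


print("tokens per replica per step:", tokens_per_replica_step)
print("tokens per step:", tokens_per_step)
print("implied data-parallel replicas:", implied_replicas)
print("estimated tokens seen at step 4,000:", tokens_at_step_4k)
print("train perplexity from loss:", round(perplexity(2.0795), 4))
print("val perplexity from loss:", round(perplexity(2.1485), 4))
```

Under these assumptions the reported total is consistent with 4 replicas, and the step-4k checkpoint would have seen roughly 16.8B tokens. The reported metrics appear to describe the full 10,000-step run, so they should not be read as the 4k checkpoint's own loss. The recomputed perplexities match the reported ones only to about three decimal places because the losses are rounded.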
This model is part of the metadata localization release. Related checkpoints and variants are grouped in the public Hugging Face collection Metadata Conditioned LLMs.
Last synced: 2026-04-02 14:47:06 UTC