BabyLM 2025 GPT-2 with BPE Tokenizer (Strict Small Track)
Model Description
This is a GPT-2 language model trained by adapting the baseline model built for the BabyLM 2025 Challenge.
- Developed by: NeTS Lab
- Model type: Autoregressive Language Model (GPT-2 architecture)
- Language(s): Italian
- License: MIT
- Parent Model: GPT-2
- Tokenizer: BPE
Key Features
- Trained under strict data constraints: a 3M-word child-directed speech corpus
- Optimized for data efficiency using the default BabyLM 2025 baseline hyperparameters
- 768-dimensional embeddings with 12 attention heads and 12 layers
Model Details
Architecture
- Base Architecture: GPT-2 (12 layers, 12 attention heads)
- Hidden Size: 768
- Vocabulary Size: ~16K
- Context Length: 1,024 tokens
- Parameters: ~104M (estimated)
Training Configuration
- Training Type: Strict (BabyLM 2025 guidelines)
- Dataset Size: 3M words maximum
- Sequence Length: 512 tokens
- Batch Size: 16
- Learning Rate: 5e-5
- Training Steps: 200,000
- Warmup Steps: 2,000
- Epochs: 10
- Weight Decay: 0.0
- Gradient Clipping: 1.0
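As a rough sketch, the learning-rate schedule implied by these settings (linear warmup to 5e-5 over 2,000 steps, then linear decay over the remaining steps) can be written out directly. This assumes the linear warmup/decay schedule that the Hugging Face Trainer uses by default; the actual training run may have used a different decay shape.

```python
def learning_rate(step, base_lr=5e-5, warmup_steps=2_000, total_steps=200_000):
    """Linear warmup to base_lr, then linear decay to zero (Trainer default)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(learning_rate(1_000))    # halfway through warmup: 2.5e-5
print(learning_rate(2_000))    # peak: 5e-5
print(learning_rate(101_000))  # halfway through decay: 2.5e-5
```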
Training Data
The model was trained on a small Italian dataset (Fusco et al., 2024), which includes:
- Size: 3M words maximum
- Sources: Child-directed speech and age-appropriate text
- Language: Italian
Intended Uses
Primary Use Cases
- Research into data-efficient language modeling
- Comparative studies of tokenization methods in low-resource settings
- Baseline model for BabyLM 2025 Challenge participants
Out-of-Scope Uses
- Production deployments requiring robust, general-purpose language understanding
- Safety-critical applications
- Tasks requiring knowledge beyond the training data scope
Performance
The model was trained following BabyLM 2025 Challenge protocols:
- Training loss: 2.51947
- Convergence: Achieved after 200,000 training steps
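For reference, assuming the reported loss is the standard per-token cross-entropy in nats, a training loss of 2.51947 corresponds to a training perplexity of exp(2.51947):

```python
import math

train_loss = 2.51947                   # final training loss (nats per token)
perplexity = math.exp(train_loss)
print(f"training perplexity: {perplexity:.2f}")  # 12.42
```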
Usage
Loading the Model
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")

# Generate text
input_text = "Il bambino gioca con"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=50, do_sample=True, temperature=0.8)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
Text Generation Parameters
- Max Length: 50 tokens (default)
- Sampling: Enabled by default
- Temperature: Adjustable (0.8 recommended)
Limitations and Biases
Known Limitations
- Limited training data (3M words) may result in knowledge gaps
- Domain specificity due to child-directed speech focus
- Context window limited to 1,024 tokens
Potential Biases
- Age-appropriate content bias from training data selection
- Italian language bias (monolingual training)
- Morphological bias toward Indo-European language patterns
Technical Specifications
Training Infrastructure
- Framework: PyTorch + Transformers
- Precision: float32
- Gradient Accumulation: configured to reach the target effective batch size
- Monitoring: Weights & Biases integration
Model Configuration
```json
{
  "activation_function": "gelu_new",
  "architectures": ["GPT2LMHeadModel"],
  "attn_pdrop": 0.1,
  "embd_pdrop": 0.1,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "vocab_size": 16384
}
```
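As a sanity check, the configuration above can be turned into a rough parameter count. This is a sketch assuming tied input/output embeddings and the standard GPT-2 block layout; the ~104M estimate quoted earlier may additionally count untied or auxiliary weights, so the figures need not match exactly.

```python
n_embd, n_layer, n_ctx, vocab = 768, 12, 1024, 16384

embeddings = vocab * n_embd + n_ctx * n_embd          # token + position embeddings
per_block = (
    n_embd * 3 * n_embd + 3 * n_embd    # c_attn (fused QKV projection)
    + n_embd * n_embd + n_embd          # attention output projection
    + n_embd * 4 * n_embd + 4 * n_embd  # MLP up-projection
    + 4 * n_embd * n_embd + n_embd      # MLP down-projection
    + 4 * n_embd                        # two layer norms (weight + bias each)
)
total = embeddings + n_layer * per_block + 2 * n_embd  # + final layer norm
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")   # 98,425,344 (~98M)
```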
Citation
If you use this model in your research, please cite:
@inproceedings{fusco-etal-2024-recurrent,
title = "Recurrent Networks Are (Linguistically) Better? An (Ongoing) Experiment on Small-{LM} Training on Child-Directed Speech in {I}talian",
author = "Fusco, Achille and
Barbini, Matilde and
Piccini Bianchessi, Maria Letizia and
Bressan, Veronica and
Neri, Sofia and
Rossi, Sarah and
Sgrizzi, Tommaso and
Chesi, Cristiano",
editor = "Dell'Orletta, Felice and
Lenci, Alessandro and
Montemagni, Simonetta and
Sprugnoli, Rachele",
booktitle = "Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)",
month = dec,
year = "2024",
address = "Pisa, Italy",
publisher = "CEUR Workshop Proceedings",
url = "https://aclanthology.org/2024.clicit-1.46/",
pages = "382--389",
ISBN = "979-12-210-7060-6"
}
Acknowledgments
- BabyLM 2025 Challenge organizers for providing the framework
- Hugging Face Transformers team for the modeling infrastructure
Contact
For questions about this model or the training process, please contact cristiano.chesi@iusspavia.it.
This model was developed as part of research into data-efficient language modeling and morphologically-aware tokenization techniques.