BabyLM 2025 GPT-2 with BPE Tokenizer (Strict Small Track)

Model Description

This is a GPT-2 language model trained by adapting the baseline model built for the BabyLM 2025 Challenge.

  • Developed by: NeTS Lab
  • Model type: Autoregressive Language Model (GPT-2 architecture)
  • Language(s): Italian
  • License: MIT
  • Parent Model: GPT-2
  • Tokenizer: BPE

Key Features

  • Strict data constraints: a 3M-word child-directed speech corpus
  • Optimized for data efficiency, using the default BabyLM 2025 baseline hyperparameters
  • 768-dimensional embeddings with 12 attention heads and 12 layers

Model Details

Architecture

  • Base Architecture: GPT-2 (12 layers, 12 attention heads)
  • Hidden Size: 768
  • Vocabulary Size: ~16K
  • Context Length: 1,024 tokens
  • Parameters: ~104M (estimated)

Training Configuration

  • Training Type: Strict (BabyLM 2025 guidelines)
  • Dataset Size: 3M words maximum
  • Sequence Length: 512 tokens
  • Batch Size: 16
  • Learning Rate: 5e-5
  • Training Steps: 200,000
  • Warmup Steps: 2,000
  • Epochs: 10
  • Weight Decay: 0.0
  • Gradient Clipping: 1.0
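The learning-rate schedule implied by these values (5e-5 peak, 2,000 warmup steps, 200,000 total steps) can be sketched as linear warmup followed by linear decay; note that the exact scheduler used in training is an assumption here.

```python
# Sketch of a linear warmup + linear decay schedule matching the
# hyperparameters above (peak lr 5e-5, 2,000 warmup steps, 200,000 total).
# The actual scheduler used for this model is an assumption.

PEAK_LR = 5e-5
WARMUP_STEPS = 2_000
TOTAL_STEPS = 200_000

def lr_at(step: int) -> float:
    """Learning rate at a given training step."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS  # linear warmup from 0 to peak
    remaining = TOTAL_STEPS - step
    # linear decay from peak back to 0 over the remaining steps
    return PEAK_LR * max(0.0, remaining / (TOTAL_STEPS - WARMUP_STEPS))

print(lr_at(0))        # 0.0
print(lr_at(2_000))    # 5e-05 (peak)
print(lr_at(200_000))  # 0.0
```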

Training Data

The model was trained on a small Italian dataset (Fusco et al. 2024), which includes:

  • Size: 3M words maximum
  • Sources: Child-directed speech and age-appropriate text
  • Language: Italian

Intended Uses

Primary Use Cases

  • Research into data-efficient language modeling
  • Comparative studies of tokenization methods in low-resource settings
  • Baseline model for BabyLM 2025 Challenge participants

Out-of-Scope Uses

  • Production deployments requiring robust, general-purpose language understanding
  • Safety-critical applications
  • Tasks requiring knowledge beyond the training data scope

Performance

The model was trained following BabyLM 2025 Challenge protocols:

  • Training loss: 2.51947
  • Convergence: Achieved after 200,000 training steps

Usage

Loading the Model

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")

# Generate text
input_text = "Il bambino gioca con"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=50, do_sample=True, temperature=0.8)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Text Generation Parameters

  • Max Length: 50 tokens (default)
  • Sampling: Enabled by default
  • Temperature: Adjustable (0.8 recommended)
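Temperature rescales the model's logits before sampling: values below 1.0 sharpen the distribution toward the most likely tokens, while values above 1.0 flatten it. A minimal, model-free illustration (the logits here are made up):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                      # hypothetical next-token logits
sharp = softmax_with_temperature(logits, 0.8)  # the recommended setting above
flat = softmax_with_temperature(logits, 2.0)
print(sharp[0] > flat[0])  # True: lower temperature concentrates probability
```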

Limitations and Biases

Known Limitations

  • Limited training data (3M words) may result in knowledge gaps
  • Domain specificity due to child-directed speech focus
  • Context window limited to 1,024 tokens

Potential Biases

  • Age-appropriate content bias from training data selection
  • Italian language bias (monolingual training)
  • Morphological bias toward Indo-European language patterns

Technical Specifications

Training Infrastructure

  • Framework: PyTorch + Transformers
  • Precision: float32
  • Gradient Accumulation: configured to reach the target effective batch size
  • Monitoring: Weights & Biases integration

Model Configuration

{
  "activation_function": "gelu_new",
  "architectures": ["GPT2LMHeadModel"],
  "attn_pdrop": 0.1,
  "embd_pdrop": 0.1,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "vocab_size": 16384
}
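The ~104M parameter estimate can be sanity-checked arithmetically from this configuration. The breakdown below follows the standard GPT-2 layout (a back-of-the-envelope count, not read from the checkpoint): with tied input/output embeddings it lands just under 100M, and around 111M if the LM head is untied, which brackets the estimate above.

```python
# Back-of-the-envelope GPT-2 parameter count from the config above.
# Estimated from the standard GPT-2 layout, not read from the checkpoint.
n_embd, n_layer = 768, 12
vocab_size, n_ctx = 16_384, 1_024

wte = vocab_size * n_embd                 # token embeddings
wpe = n_ctx * n_embd                      # position embeddings
per_block = (
    n_embd * 3 * n_embd + 3 * n_embd      # c_attn (fused QKV projection)
    + n_embd * n_embd + n_embd            # attention output projection
    + n_embd * 4 * n_embd + 4 * n_embd    # MLP up-projection
    + 4 * n_embd * n_embd + n_embd        # MLP down-projection
    + 4 * n_embd                          # two LayerNorms (weight + bias each)
)
total_tied = wte + wpe + n_layer * per_block + 2 * n_embd  # final LayerNorm
print(f"{total_tied / 1e6:.1f}M parameters (tied embeddings)")   # 98.4M
print(f"{(total_tied + wte) / 1e6:.1f}M parameters (untied head)")  # 111.0M
```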

Citation

If you use this model in your research, please cite:

@inproceedings{fusco-etal-2024-recurrent,
    title = "Recurrent Networks Are (Linguistically) Better? An (Ongoing) Experiment on Small-{LM} Training on Child-Directed Speech in {I}talian",
    author = "Fusco, Achille  and
      Barbini, Matilde  and
      Piccini Bianchessi, Maria Letizia  and
      Bressan, Veronica  and
      Neri, Sofia  and
      Rossi, Sarah  and
      Sgrizzi, Tommaso  and
      Chesi, Cristiano",
    editor = "Dell'Orletta, Felice  and
      Lenci, Alessandro  and
      Montemagni, Simonetta  and
      Sprugnoli, Rachele",
    booktitle = "Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)",
    month = dec,
    year = "2024",
    address = "Pisa, Italy",
    publisher = "CEUR Workshop Proceedings",
    url = "https://aclanthology.org/2024.clicit-1.46/",
    pages = "382--389",
    ISBN = "979-12-210-7060-6"
}

Acknowledgments

  • BabyLM 2025 Challenge organizers for providing the framework
  • Hugging Face Transformers team for the modeling infrastructure

Contact

For questions about this model or the training process, please contact cristiano.chesi@iusspavia.it.


This model was developed as part of research into data-efficient language modeling and morphologically-aware tokenization techniques.
