BabyLM 2025 GPT-2 with BPE Tokenizer (Strict Small Track)

Model Description

This is a GPT-2 language model trained by adapting the baseline model built for the BabyLM 2025 Challenge.

  • Developed by: NeTS Lab
  • Model type: Autoregressive Language Model (GPT-2 architecture)
  • Language(s): Italian
  • License: MIT
  • Parent Model: GPT-2
  • Tokenizer: BPE

Key Features

  • Strict data constraints: a 3M-word child-directed speech corpus
  • Optimized for data efficiency, using the default BabyLM 2025 baseline hyperparameters
  • 768-dimensional embeddings with 12 attention heads and 12 layers

Model Details

Architecture

  • Base Architecture: GPT-2 (12 layers, 12 attention heads)
  • Hidden Size: 768
  • Vocabulary Size: ~16K
  • Context Length: 1,024 tokens
  • Parameters: ~104M (estimated)

Training Configuration

  • Training Type: Strict (BabyLM 2025 guidelines)
  • Dataset Size: 3M words maximum
  • Sequence Length: 512 tokens
  • Batch Size: 16
  • Learning Rate: 5e-5
  • Training Steps: 200,000
  • Warmup Steps: 2,000
  • Epochs: 10
  • Weight Decay: 0.0
  • Gradient Clipping: 1.0
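The learning-rate schedule implied by these values (5e-5 peak, 2,000 warmup steps, 200,000 total steps) can be sketched as linear warmup followed by linear decay; note that the exact scheduler used in training is an assumption here.

```python
# Sketch of a linear warmup + linear decay schedule matching the
# hyperparameters above (peak lr 5e-5, 2,000 warmup steps, 200,000 total).
# The actual scheduler used for this model is an assumption.

PEAK_LR = 5e-5
WARMUP_STEPS = 2_000
TOTAL_STEPS = 200_000

def lr_at(step: int) -> float:
    """Learning rate at a given training step."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS  # linear warmup from 0 to peak
    remaining = TOTAL_STEPS - step
    # linear decay from peak back to 0 over the remaining steps
    return PEAK_LR * max(0.0, remaining / (TOTAL_STEPS - WARMUP_STEPS))

print(lr_at(0))        # 0.0
print(lr_at(2_000))    # 5e-05 (peak)
print(lr_at(200_000))  # 0.0
```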

Training Data

The model was trained on a small Italian dataset (Fusco et al. 2024), which includes:

  • Size: 3M words maximum
  • Sources: Child-directed speech and age-appropriate text
  • Language: Italian

Intended Uses

Primary Use Cases

  • Research into data-efficient language modeling
  • Comparative studies of tokenization methods in low-resource settings
  • Baseline model for BabyLM 2025 Challenge participants

Out-of-Scope Uses

  • Production deployments requiring robust, general-purpose language understanding
  • Safety-critical applications
  • Tasks requiring knowledge beyond the training data scope

Performance

The model was trained following BabyLM 2025 Challenge protocols:

  • Training loss: 2.51947
  • Convergence: Achieved after 200,000 training steps

Usage

Loading the Model

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("NeTS-lab/babylm_ita-bpe-3m-gpt2")

# Generate text
input_text = "Il bambino gioca con"
inputs = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=50, do_sample=True, temperature=0.8)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Text Generation Parameters

  • Max Length: 50 tokens (default)
  • Sampling: Enabled by default
  • Temperature: Adjustable (0.8 recommended)
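Temperature rescales the model's logits before sampling: values below 1.0 sharpen the distribution toward the most likely tokens, while values above 1.0 flatten it. A minimal, model-free illustration (the logits here are made up):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                      # hypothetical next-token logits
sharp = softmax_with_temperature(logits, 0.8)  # the recommended setting above
flat = softmax_with_temperature(logits, 2.0)
print(sharp[0] > flat[0])  # True: lower temperature concentrates probability
```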

Limitations and Biases

Known Limitations

  • Limited training data (3M words) may result in knowledge gaps
  • Domain specificity due to child-directed speech focus
  • Context window limited to 1,024 tokens

Potential Biases

  • Age-appropriate content bias from training data selection
  • Italian language bias (monolingual training)
  • Morphological bias toward Indo-European language patterns

Technical Specifications

Training Infrastructure

  • Framework: PyTorch + Transformers
  • Precision: float32
  • Gradient Accumulation: configured to reach the target effective batch size
  • Monitoring: Weights & Biases integration

Model Configuration

{
  "activation_function": "gelu_new",
  "architectures": ["GPT2LMHeadModel"],
  "attn_pdrop": 0.1,
  "embd_pdrop": 0.1,
  "layer_norm_epsilon": 1e-05,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "vocab_size": 16384
}
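The ~104M parameter estimate can be sanity-checked arithmetically from this configuration. The breakdown below follows the standard GPT-2 layout (a back-of-the-envelope count, not read from the checkpoint): with tied input/output embeddings it lands just under 100M, and around 111M if the LM head is untied, which brackets the estimate above.

```python
# Back-of-the-envelope GPT-2 parameter count from the config above.
# Estimated from the standard GPT-2 layout, not read from the checkpoint.
n_embd, n_layer = 768, 12
vocab_size, n_ctx = 16_384, 1_024

wte = vocab_size * n_embd                 # token embeddings
wpe = n_ctx * n_embd                      # position embeddings
per_block = (
    n_embd * 3 * n_embd + 3 * n_embd      # c_attn (fused QKV projection)
    + n_embd * n_embd + n_embd            # attention output projection
    + n_embd * 4 * n_embd + 4 * n_embd    # MLP up-projection
    + 4 * n_embd * n_embd + n_embd        # MLP down-projection
    + 4 * n_embd                          # two LayerNorms (weight + bias each)
)
total_tied = wte + wpe + n_layer * per_block + 2 * n_embd  # final LayerNorm
print(f"{total_tied / 1e6:.1f}M parameters (tied embeddings)")   # 98.4M
print(f"{(total_tied + wte) / 1e6:.1f}M parameters (untied head)")  # 111.0M
```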

Citation

If you use this model in your research, please cite:

@inproceedings{fusco-etal-2024-recurrent,
    title = "Recurrent Networks Are (Linguistically) Better? An (Ongoing) Experiment on Small-{LM} Training on Child-Directed Speech in {I}talian",
    author = "Fusco, Achille  and
      Barbini, Matilde  and
      Piccini Bianchessi, Maria Letizia  and
      Bressan, Veronica  and
      Neri, Sofia  and
      Rossi, Sarah  and
      Sgrizzi, Tommaso  and
      Chesi, Cristiano",
    editor = "Dell'Orletta, Felice  and
      Lenci, Alessandro  and
      Montemagni, Simonetta  and
      Sprugnoli, Rachele",
    booktitle = "Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)",
    month = dec,
    year = "2024",
    address = "Pisa, Italy",
    publisher = "CEUR Workshop Proceedings",
    url = "https://aclanthology.org/2024.clicit-1.46/",
    pages = "382--389",
    ISBN = "979-12-210-7060-6"
}

Acknowledgments

  • BabyLM 2025 Challenge organizers for providing the framework
  • Hugging Face Transformers team for the modeling infrastructure

Contact

For questions about this model or the training process, please contact cristiano.chesi@iusspavia.it.


This model was developed as part of research into data-efficient language modeling and morphologically-aware tokenization techniques.
