---
license: cc-by-4.0
language:
- de
- en
- fr
- da
- sv
- pl
- es
- nl
- it
base_model:
- jhu-clsp/mmBERT-small
pipeline_tag: text-classification
---

## Model Description

mmbert-cap is a compact multilingual transformer model for classifying text into Comparative Agendas Project (CAP) policy categories. It is designed to deliver strong, consistent performance across multiple languages and document types while remaining computationally efficient. For further details, see the [documentation](https://huggingface.co/Sami92/mmbert-cap/blob/main/documentation.pdf).

- **Model type:** Multilingual transformer (mmbert-small, ~110M parameters)
- **Language(s):** Danish, Dutch, English, German, Norwegian, Spanish, Swedish
- **Finetuned from model:** mmbert-small (Marone et al. 2025)

## Uses

### Direct Use

- Classification of political texts into CAP policy categories
- Applicable to news articles, press releases, and social media posts
- Multilingual political text analysis

### Downstream Use

- Policy agenda research
- Political communication analysis
- Dataset labeling / annotation support

### Out-of-Scope Use

- Non-political text classification
- Tasks outside the CAP taxonomy
- Decision-making without human validation

## Bias, Risks, and Limitations

- CAP annotation is inherently subjective; the model reflects the biases of its annotations
- Lower performance for some categories (e.g., public lands, foreign trade)

### Recommendations

- Evaluate on a small in-domain dataset before deployment
- Combine quantitative metrics with qualitative inspection

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Sami92/mmbert-cap-int8")
model = AutoModelForSequenceClassification.from_pretrained("Sami92/mmbert-cap-int8")

inputs = tokenizer("Your text here", return_tensors="pt", truncation=True)
outputs = model(**inputs)

# Map the highest-scoring logit to its CAP category label
predicted_id = outputs.logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])
```

## Training Details

### Training Data

- ~171k manually labeled documents (cleaned)
- ~442k additional documents with soft labels
- Sources: news, press releases, social media
- Languages: 7 European languages

### Training Procedure

- Data cleaning using confident learning (Cleanlab)
- Ensemble of teacher models (XLM-R, XL, mmbert-base)
- Knowledge distillation to mmbert-small
- Training on soft labels

#### Training Hyperparameters

- **Batch size:** 64
- **Learning rate:** 2e-5

## Evaluation

### Metrics

- Macro F1 score
- Accuracy
- Accuracy@2 (multi-label setting)

### Results

- **Macro F1:** 0.80
- **Accuracy:** 0.81
- **Category range (F1):** 0.65–0.90
- **Languages (F1):** 0.74–0.84
- **Document types (F1):** 0.77–0.79

#### Summary

The model achieves competitive performance while remaining efficient and consistent across languages and document types.

## Technical Specifications

### Model Architecture and Objective

- Transformer-based multilingual classifier
- ~110M parameters
- Single-label CAP classification

### Compute Infrastructure

#### Hardware

- NVIDIA H100 GPU (evaluation)
- AMD EPYC CPU (qint8 inference)

#### Software

- PyTorch
- Hugging Face Transformers

## Acknowledgments

We thank the researchers who shared their datasets with us. Gunnar Thesen and Erik de Vries provided the MaML dataset of news articles, Rens Vliegenthart contributed a dataset of Dutch newspaper articles, and Cornelius Erfort shared a dataset of press releases. High-quality data are essential for reliable machine learning classifiers, and this model could not have been trained without their support. We also thank Thomas Haase for his work annotating the social media test data.
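
## Appendix: Computing Accuracy@2

The Accuracy@2 metric reported above counts a prediction as correct when the gold label appears among the model's two highest-scoring classes. The sketch below illustrates the idea with toy scores over three hypothetical categories; it assumes logits have been collected into a NumPy array and is not the evaluation code used for this model.

```python
import numpy as np

def accuracy_at_2(logits: np.ndarray, gold: np.ndarray) -> float:
    """Fraction of examples whose gold label is among the two top-scored classes."""
    top2 = np.argsort(logits, axis=-1)[:, -2:]   # indices of the two highest scores per row
    hits = (top2 == gold[:, None]).any(axis=-1)  # is the gold label one of them?
    return float(hits.mean())

# Toy scores for 4 documents over 3 hypothetical categories (no tied scores)
logits = np.array([
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.25, 0.55],
    [0.90, 0.06, 0.04],
])
gold = np.array([2, 0, 2, 2])
print(accuracy_at_2(logits, gold))  # → 0.75 (3 of 4 gold labels fall in the top-2)
```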