mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
Paper: [arXiv:2509.06888](https://arxiv.org/abs/2509.06888)
mmBERT-CJKE is an encoder model based on mmBERT, specialized for four languages, Korean (Ko), Chinese (Zh), Japanese (Ja), and English (En), through continued pre-training.
| Property | Value |
|---|---|
| Base Model | jhu-clsp/mmBERT-base |
| Architecture | ModernBERT (RoPE, Flash Attention 2, Unpadding) |
| Parameters | ~307M (base) |
| Max Sequence Length | 8,192 tokens |
| Tokenizer | Gemma 3 (262k vocab) |
| Training Data | FineWeb-2 + Wikipedia + Dolma |
| Training Tokens | ~55B (Phase A: 5B + Phase B: 50B) |
| Languages | Ko (30%), Zh (25%), Ja (25%), En (20%) |
| Precision | bfloat16 |
| License | Apache 2.0 |
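These properties can be sanity-checked locally by inspecting the model configuration. A minimal sketch, assuming the standard ModernBERT config attributes exposed by `transformers`:

```python
from transformers import AutoConfig

# Inspect the advertised configuration (attribute names assume the
# standard ModernBERT config class in transformers).
config = AutoConfig.from_pretrained("CaveduckAI/mmBERT-cjke-base")
print(config.max_position_embeddings)  # expected: 8192
print(config.vocab_size)               # expected: ~262k (Gemma tokenizer)
```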
Training was conducted in two phases: a short Phase A (~5B tokens) followed by a longer Phase B (~50B tokens), matching the token budget in the table above.
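For intuition, the language mix can be thought of as weighted sampling over per-language corpora. The sketch below is illustrative only: the ratios come from the table above, but this is not the actual training pipeline, and any per-phase weight schedule (annealing) is an assumption based on the mmBERT recipe.

```python
import random

# Illustrative language sampling using the ratios from the table above.
# Not the actual training code; annealed per-phase schedules are assumed.
LANG_WEIGHTS = {"ko": 0.30, "zh": 0.25, "ja": 0.25, "en": 0.20}

def sample_language() -> str:
    langs, weights = zip(*LANG_WEIGHTS.items())
    return random.choices(langs, weights=weights, k=1)[0]

print(sample_language())  # e.g. "ko"
```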
Load the tokenizer and model with the `transformers` library:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("CaveduckAI/mmBERT-cjke-base")
model = AutoModel.from_pretrained("CaveduckAI/mmBERT-cjke-base")
```
```python
# Masked Language Modeling
from transformers import pipeline

mlm = pipeline("fill-mask", model="CaveduckAI/mmBERT-cjke-base")
# "Seoul is the [MASK] of South Korea."
mlm("서울은 대한민국의 [MASK]이다.")
```
```python
# Feature Extraction
inputs = tokenizer("한국어 텍스트입니다.", return_tensors="pt")  # "This is Korean text."
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # [batch, seq_len, hidden_size]
```
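To turn these token-level states into a single sentence embedding, a common recipe is attention-mask-aware mean pooling. This pooling choice is an assumption, not one prescribed by the model card; continuing the snippet above:

```python
import torch

# Mean-pool over valid tokens only; `inputs` and `embeddings` come from
# the feature-extraction snippet above. This pooling is a common
# convention, not an official recipe for this model.
mask = inputs["attention_mask"].unsqueeze(-1).to(embeddings.dtype)
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, hidden_size])
```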
Benchmark results will be added after evaluation on the following suites:
| Language | Benchmark | Tasks |
|---|---|---|
| Korean | KLUE | TC, STS, NLI, NER, RE, DP, MRC, DST |
| Chinese | CLUE | TNEWS, OCNLI, AFQMC, CMRC2018, MSRA NER |
| Japanese | JGLUE | MARC-ja, JNLI, JSTS, JSQuAD, JCommonsenseQA |
| English | GLUE | MNLI, QQP, QNLI, SST-2, CoLA, RTE, MRPC |
| Cross-lingual | XNLI | ko, zh, ja, en |
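As a sketch of how one of these suites could be run, the snippet below fine-tunes the model on KLUE topic classification (YNAT) with the standard `transformers` Trainer. The dataset config, label count, and hyperparameters are illustrative assumptions, not reported settings:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Illustrative fine-tuning on KLUE YNAT (7-class topic classification).
# Hyperparameters are assumptions, not settings reported by this card.
ds = load_dataset("klue", "ynat")
tok = AutoTokenizer.from_pretrained("CaveduckAI/mmBERT-cjke-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "CaveduckAI/mmBERT-cjke-base", num_labels=7)

ds = ds.map(lambda b: tok(b["title"], truncation=True, max_length=128),
            batched=True)

args = TrainingArguments(output_dir="mmbert-cjke-ynat",
                         per_device_train_batch_size=32,
                         learning_rate=2e-5, num_train_epochs=3)
trainer = Trainer(model=model, args=args, tokenizer=tok,
                  train_dataset=ds["train"], eval_dataset=ds["validation"])
trainer.train()
```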
```bibtex
@misc{mmbert-cjke-2026,
  title={mmBERT-CJKE: ModernBERT Encoder Optimized for Korean, Chinese, Japanese, and English},
  author={CaveduckAI},
  year={2026},
  url={https://huggingface.co/CaveduckAI/mmBERT-cjke-base}
}
```
Last updated: 2026-04-14