
mmBERT-CJKE (base)

mmBERT-CJKE is an encoder model based on mmBERT, enhanced for Korean (Ko), Chinese (Zh), Japanese (Ja), and English (En) through continued pre-training.

Highlights

  • Built on the ModernBERT architecture (Flash Attention 2, unpadding)
  • 8,192-token context window (16x longer than existing CJK encoders)
  • More efficient CJK tokenization via the Gemma 3 tokenizer (see the sketch after this list)
  • 2-4x faster inference than XLM-R
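
A minimal sketch for checking the tokenization-efficiency claim yourself, assuming both checkpoints are reachable on the Hub; the sample sentence is illustrative, not from the card:

from transformers import AutoTokenizer

# "Natural language processing is a branch of artificial intelligence."
text = "자연어 처리는 인공지능의 한 분야이다."

# Fewer subword tokens per sentence means more content fits into the
# 8,192-token context window.
for name in ["xlm-roberta-base", "CaveduckAI/mmBERT-cjke-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {len(tok.tokenize(text))} tokens")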

Model Details

Property             Value
-------------------  -----------------------------------------------
Base Model           jhu-clsp/mmBERT-base
Architecture         ModernBERT (RoPE, Flash Attention 2, Unpadding)
Parameters           ~307M (base)
Max Sequence Length  8,192 tokens
Tokenizer            Gemma 3 (262k vocab)
Training Data        FineWeb-2 + Wikipedia + Dolma
Training Tokens      ~55B (Phase A: 5B + Phase B: 50B)
Languages            Ko (30%), Zh (25%), Ja (25%), En (20%)
Precision            bfloat16
License              Apache 2.0
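
The headline values can be cross-checked against the model config; this assumes the standard ModernBERT config fields in transformers:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("CaveduckAI/mmBERT-cjke-base")
print(cfg.max_position_embeddings)  # expected: 8192
print(cfg.vocab_size)               # expected: ~262k (Gemma 3 vocabulary)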

Training

Training proceeded in two phases:

Phase A: Tokenizer Adaptation

  • Goal: stabilize the new embeddings introduced by the Gemma 3 tokenizer
  • Data: even multilingual mix (25% per language)
  • Learning rate: 5e-5 (10x higher for the embedding layer; see the sketch below)
  • Mask rate: 15%
  • Tokens: ~5B
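
The training code itself is not part of this card; the following is only a sketch of the two-group optimizer described above, assuming the ModernBERT parameter naming in transformers (every parameter whose name contains "embeddings" gets the 10x rate):

import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmBERT-base")
base_lr = 5e-5

# Embedding parameters train at 10x the base learning rate.
emb = [p for n, p in model.named_parameters() if "embeddings" in n]
rest = [p for n, p in model.named_parameters() if "embeddings" not in n]
optimizer = torch.optim.AdamW([
    {"params": emb, "lr": base_lr * 10},
    {"params": rest, "lr": base_lr},
])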

Phase B: CJK+E Enhancement

  • Goal: maximize CJK performance
  • Data: Ko 30% / Zh 25% / Ja 25% / En 20%
  • Learning rate: 1e-4 with inverse square-root decay
  • Mask rate: 15% -> 5% (gradually decreased; see the schedule sketch below)
  • Tokens: ~50B
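
The exact schedules are not published; this hypothetical sketch only illustrates the two shapes named above (inverse square-root LR decay after warmup, linear mask-rate anneal from 15% to 5%), with the warmup length chosen arbitrarily:

def lr_at(step, peak_lr=1e-4, warmup=1000):
    # Linear warmup, then inverse square-root decay from the peak.
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup / step) ** 0.5

def mask_rate_at(step, total_steps, start=0.15, end=0.05):
    # Linearly anneal the masking probability over training.
    frac = min(step / total_steps, 1.0)
    return start + (end - start) * frac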

Usage

from transformers import AutoTokenizer, AutoModel, pipeline

tokenizer = AutoTokenizer.from_pretrained("CaveduckAI/mmBERT-cjke-base")
model = AutoModel.from_pretrained("CaveduckAI/mmBERT-cjke-base")

# Masked language modeling: fill in the [MASK] token
mlm = pipeline("fill-mask", model="CaveduckAI/mmBERT-cjke-base")
mlm("서울은 대한민국의 [MASK]이다.")  # "Seoul is the [MASK] of South Korea."

# Feature extraction: per-token hidden states
inputs = tokenizer("한국어 텍스트입니다.", return_tensors="pt")  # "This is Korean text."
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # shape: (batch, seq_len, hidden_size)
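
The snippet above stops at token-level features; one common (not card-prescribed) way to pool them into a single sentence vector is masked mean pooling:

import torch

# Average hidden states over non-padding positions only.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # (batch, hidden_size)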

Benchmarks

Benchmark results will be added after evaluation.

Target Benchmarks

Language       Benchmark  Tasks
-------------  ---------  --------------------------------------------
Korean         KLUE       TC, STS, NLI, NER, RE, DP, MRC, DST
Chinese        CLUE       TNEWS, OCNLI, AFQMC, CMRC2018, MSRA NER
Japanese       JGLUE      MARC-ja, JNLI, JSTS, JSQuAD, JCommonsenseQA
English        GLUE       MNLI, QQP, QNLI, SST-2, CoLA, RTE, MRPC
Cross-lingual  XNLI       ko, zh, ja, en
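
Pending results, fine-tuning on any of the classification tasks above follows the standard transformers pattern; as a hedged sketch (num_labels=7 matches KLUE topic classification, an assumption about the target task):

from transformers import AutoModelForSequenceClassification

# Fresh classification head on top of the encoder; set num_labels per task.
clf = AutoModelForSequenceClassification.from_pretrained(
    "CaveduckAI/mmBERT-cjke-base", num_labels=7
)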

Citation

@misc{mmbert-cjke-2026,
  title={mmBERT-CJKE: ModernBERT Encoder Optimized for Korean, Chinese, Japanese, and English},
  author={CaveduckAI},
  year={2026},
  url={https://huggingface.co/CaveduckAI/mmBERT-cjke-base}
}

Last updated: 2026-04-14
