SikuRoBERTa (GN) for Bronze Inscription Restoration and Dating

Model Description

This model is adapted from SIKU-BERT/sikuroberta, with additional domain-adaptive pretraining (DAPT), task-adaptive pretraining (TAPT), and integration of a Glyph Net (GN) module.
It is trained on the BIRD dataset (Bronze Inscription Restoration and Dating), the first fully encoded bronze inscription corpus with chronological labels.

  • Backbone: RoBERTa trained on the Siku Quanshu corpus
  • Enhancements: Glyph Net (GN), glyph-biased sampling, DAPT, TAPT
  • Tasks:
    • Masked language modeling (inscription restoration)
    • Dynasty- and period-level classification (dating)

Intended Use

  • Restoration of damaged or missing characters in bronze inscriptions
  • Chronological classification (dynasty / period dating)
  • Research in digital humanities, ancient Chinese NLP, and paleography

Training Data

The model was trained on the BIRD dataset:

  • 41k tokens of transcribed and chronologically labeled bronze inscriptions
  • Deduplicated, filtered, and corrected against the Complete Collection of Yin and Zhou Bronze Inscriptions (CASS, 2007)
  • [UNK] placeholders for undeciphered glyphs
  • Supplemented with 2M tokens of Pre-Qin texts (Analects, Mencius, Zuo Zhuan, Mozi, Guanzi, etc.) for domain-adaptive pretraining

Dataset repo: wjhuah/BIRD


Case Study: Hu Ding Restoration

We employed the SikuRoBERTa (GN) model with two decoding strategies:
parallel mask filling and greedy iterative decoding.
The table below compares predicted tokens with expert gold restorations.
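The greedy strategy can be sketched as follows (a minimal illustration of the idea, not the paper's implementation; `score_fn` and all names here are hypothetical). At each step, the single mask whose best candidate has the highest probability is filled, and the remaining masks are re-scored with the updated context:

```python
def greedy_iterative_fill(tokens, mask_token, score_fn):
    """tokens: list of characters, with `mask_token` at damaged positions.
    score_fn(tokens, pos): returns {candidate_char: probability} for the
    mask at `pos`, e.g. a wrapper around the MLM softmax output."""
    tokens = list(tokens)
    while mask_token in tokens:
        best = None  # (probability, position, character)
        for pos, tok in enumerate(tokens):
            if tok != mask_token:
                continue
            # Best candidate for this mask under the current context.
            char, prob = max(score_fn(tokens, pos).items(), key=lambda kv: kv[1])
            if best is None or prob > best[0]:
                best = (prob, pos, char)
        _, pos, char = best
        tokens[pos] = char  # commit the most confident prediction
    return tokens
```

Parallel mask filling, by contrast, predicts all masks in a single forward pass; greedy iterative decoding trades extra passes for the ability to condition later fills on earlier ones.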

Top-1 and Top-5 predictions versus gold characters

(excerpt of the first six damaged positions in the Hu Ding inscription)

| Mask Position | Top-5 Predictions (Pred@1 first) |
|---------------|----------------------------------|
| 01 | 廟, 室, 宮, 寢, 廷 |
| 02 | 王, 公, 君, 伯, 尹 |
| 03 | 芾, 純, 衡, 衣, 韍 |
| 05 | 於, 于, 揚, 無, 多 |
| 06 | 于, 揚, 穆, 於, 侑 |
| 07 | 年, 人, 世, 壽, 歲 |

On 22 expert restorations (Huang, 2022), the model achieved:

  • Exact@1: 50.00% (11/22)
  • Exact@5: 59.09% (13/22)
  • Exact@10: 68.18% (15/22)
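Exact@k here follows the standard definition: a position counts as correct when the gold character appears among the model's top-k candidates. A minimal sketch with toy data (illustrative only, not the paper's evaluation code):

```python
def exact_at_k(golds, ranked_candidates, k):
    """golds: list of gold characters.
    ranked_candidates: per-position candidate lists, ordered by model
    confidence. Returns the fraction of positions where the gold
    character appears in the top k candidates."""
    hits = sum(1 for g, cands in zip(golds, ranked_candidates) if g in cands[:k])
    return hits / len(golds)

# Toy example: gold found at rank 1, at rank 3, and not at all.
golds = ["王", "年", "廟"]
cands = [["王", "公"], ["人", "世", "年"], ["室", "宮"]]
print(exact_at_k(golds, cands, 1))  # 1 of 3 positions correct at k=1
print(exact_at_k(golds, cands, 5))  # 2 of 3 positions correct at k=5
```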

Greedy decoding yielded comparable coverage, though with slightly lower accuracy.
In addition to reproducing expert restorations, the system also generated plausible candidates for undeciphered characters, providing potential references for paleographic analysis.

Model completions for undeciphered positions (Top-10 shown)

| Mask Position | Top-10 Predictions |
|---------------|--------------------|
| 04 | 鑾, 旂, 舄, 筆, 㫃, 金, 矢, 黃, 弓, 璋 |
| 08 | 介, 伯, 市, 限, 客, 期, 制, 政, 宰, 人 |
| 15 | 之, 外, 一, 若, 內, 賜, 邑, 大, 下, 又 |
| 16 | 賜, 折, 喬, 杜, 乘, 造, 擇, 柞, 之, 于 |
| 17 | 則, 許, 弗, 不, 人, 亦, 也, 而, 帛, 乃 |
| 18 | 則, 曰, 不, 弗, 許, 告, 厥, 多, 有, 用 |
| 28 | 其, 厥, 若, 越, 乃, 我, 以, 汝, 如, 余 |

How to Use

Quick start

A minimal example for masked language modeling:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "wjhuah/SikuRoBERTa_Bronze"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# BERT-style tokenizers recognize [MASK] (upper case) as the mask token.
text = "唯王元年六月既朢乙亥,王在周穆王太[MASK]"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.logits holds the MLM scores per position
```
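Candidate lists like the Top-5 tables above can be read off the logits at the masked position. A small helper (a sketch; `id_to_token` stands in for the tokenizer's `convert_ids_to_tokens`):

```python
import torch

def rank_candidates(mask_logits, id_to_token, k=5):
    """mask_logits: 1-D tensor of vocabulary scores at the masked position.
    id_to_token: callable mapping a token id to its string form.
    Returns the k most probable (character, probability) pairs."""
    probs = torch.softmax(mask_logits, dim=-1)
    top = torch.topk(probs, k)  # values and indices, sorted descending
    return [(id_to_token(int(i)), float(p)) for p, i in zip(top.values, top.indices)]
```

With the quick-start variables, the masked slot can be located via `(inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()`, and `rank_candidates(outputs.logits[0, pos], tokenizer.convert_ids_to_tokens)` then yields a ranked candidate list.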

Citation

If you use this model, please cite our EMNLP 2025 paper:

@inproceedings{hua2025bird,
  title     = {BIRD: Bronze Inscription Restoration and Dating},
  author    = {Hua, Wenjie and Nguyen, Hoang H. and Ge, Gangyan},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
  publisher = {Association for Computational Linguistics}
}