SikuRoBERTa (GN) for Bronze Inscription Restoration and Dating

Model Description

This model is adapted from SIKU-BERT/sikuroberta, with additional domain-adaptive pretraining (DAPT), task-adaptive pretraining (TAPT), and integration of a Glyph Net (GN) module.
It is trained on the BIRD dataset (Bronze Inscription Restoration and Dating), the first fully encoded bronze inscription corpus with chronological labels.

  • Backbone: RoBERTa trained on the Siku Quanshu corpus
  • Enhancements: Glyph Net (GN), glyph-biased sampling, DAPT, TAPT
  • Tasks:
    • Masked language modeling (inscription restoration)
    • Dynasty- and period-level classification (dating)

Intended Use

  • Restoration of damaged or missing characters in bronze inscriptions
  • Chronological classification (dynasty / period dating)
  • Research in digital humanities, ancient Chinese NLP, and paleography

Training Data

The model was trained on the BIRD dataset:

  • 41k tokens of transcribed and chronologically labeled bronze inscriptions
  • Deduplicated, filtered, and corrected against the Complete Collection of Yin and Zhou Bronze Inscriptions (CASS, 2007)
  • [UNK] placeholders for undeciphered glyphs
  • Supplemented with 2M tokens of Pre-Qin texts (Analects, Mencius, Zuo Zhuan, Mozi, Guanzi, etc.) for domain-adaptive pretraining

Dataset repo: wjhuah/BIRD


Case Study: Hu Ding Restoration

We employed the SikuRoBERTa (GN) model with two decoding strategies:
parallel mask filling and greedy iterative decoding.
The table below compares predicted tokens with expert gold restorations.
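The greedy strategy can be sketched as follows (a minimal illustration of the idea, not the paper's implementation; `score_fn` and all names here are hypothetical). At each step, the single mask whose best candidate has the highest probability is filled, and the remaining masks are re-scored with the updated context:

```python
def greedy_iterative_fill(tokens, mask_token, score_fn):
    """tokens: list of characters, with `mask_token` at damaged positions.
    score_fn(tokens, pos): returns {candidate_char: probability} for the
    mask at `pos`, e.g. a wrapper around the MLM softmax output."""
    tokens = list(tokens)
    while mask_token in tokens:
        best = None  # (probability, position, character)
        for pos, tok in enumerate(tokens):
            if tok != mask_token:
                continue
            # Best candidate for this mask under the current context.
            char, prob = max(score_fn(tokens, pos).items(), key=lambda kv: kv[1])
            if best is None or prob > best[0]:
                best = (prob, pos, char)
        _, pos, char = best
        tokens[pos] = char  # commit the most confident prediction
    return tokens
```

Parallel mask filling, by contrast, predicts all masks in a single forward pass; greedy iterative decoding trades extra passes for the ability to condition later fills on earlier ones.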

Top-1 and Top-5 predictions versus gold characters

(excerpt of the first six damaged positions in the Hu Ding inscription)

| Mask Position | Top-5 Predictions (Pred@1 first) |
|---------------|----------------------------------|
| 01 | 廟, 室, 宮, 寢, 廷 |
| 02 | 王, 公, 君, 伯, 尹 |
| 03 | 芾, 純, 衡, 衣, 韍 |
| 05 | 於, 于, 揚, 無, 多 |
| 06 | 于, 揚, 穆, 於, 侑 |
| 07 | 年, 人, 世, 壽, 歲 |

On 22 expert restorations (Huang, 2022), the model achieved:

  • Exact@1: 50.00% (11/22)
  • Exact@5: 59.09% (13/22)
  • Exact@10: 68.18% (15/22)
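Exact@k here follows the standard definition: a position counts as correct when the gold character appears among the model's top-k candidates. A minimal sketch with toy data (illustrative only, not the paper's evaluation code):

```python
def exact_at_k(golds, ranked_candidates, k):
    """golds: list of gold characters.
    ranked_candidates: per-position candidate lists, ordered by model
    confidence. Returns the fraction of positions where the gold
    character appears in the top k candidates."""
    hits = sum(1 for g, cands in zip(golds, ranked_candidates) if g in cands[:k])
    return hits / len(golds)

# Toy example: gold found at rank 1, at rank 3, and not at all.
golds = ["王", "年", "廟"]
cands = [["王", "公"], ["人", "世", "年"], ["室", "宮"]]
print(exact_at_k(golds, cands, 1))  # 1 of 3 positions correct at k=1
print(exact_at_k(golds, cands, 5))  # 2 of 3 positions correct at k=5
```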

Greedy decoding yielded comparable coverage, though with slightly lower accuracy.
In addition to reproducing expert restorations, the system also generated plausible candidates for undeciphered characters, providing potential references for paleographic analysis.

Model completions for undeciphered positions (Top-10 shown)

| Mask Position | Top-10 Predictions |
|---------------|--------------------|
| 04 | 鑾, 旂, 舄, 筆, 㫃, 金, 矢, 黃, 弓, 璋 |
| 08 | 介, 伯, 市, 限, 客, 期, 制, 政, 宰, 人 |
| 15 | 之, 外, 一, 若, 內, 賜, 邑, 大, 下, 又 |
| 16 | 賜, 折, 喬, 杜, 乘, 造, 擇, 柞, 之, 于 |
| 17 | 則, 許, 弗, 不, 人, 亦, 也, 而, 帛, 乃 |
| 18 | 則, 曰, 不, 弗, 許, 告, 厥, 多, 有, 用 |
| 28 | 其, 厥, 若, 越, 乃, 我, 以, 汝, 如, 余 |

How to Use

Quick start

A minimal example for masked language modeling:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "wjhuah/SikuRoBERTa_Bronze"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# BERT-style tokenizers recognize [MASK] (upper case) as the mask token.
text = "唯王元年六月既朢乙亥,王在周穆王太[MASK]"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.logits holds the MLM scores per position
```
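Candidate lists like the Top-5 tables above can be read off the logits at the masked position. A small helper (a sketch; `id_to_token` stands in for the tokenizer's `convert_ids_to_tokens`):

```python
import torch

def rank_candidates(mask_logits, id_to_token, k=5):
    """mask_logits: 1-D tensor of vocabulary scores at the masked position.
    id_to_token: callable mapping a token id to its string form.
    Returns the k most probable (character, probability) pairs."""
    probs = torch.softmax(mask_logits, dim=-1)
    top = torch.topk(probs, k)  # values and indices, sorted descending
    return [(id_to_token(int(i)), float(p)) for p, i in zip(top.values, top.indices)]
```

With the quick-start variables, the masked slot can be located via `(inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()`, and `rank_candidates(outputs.logits[0, pos], tokenizer.convert_ids_to_tokens)` then yields a ranked candidate list.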

Citation

If you use this model, please cite our EMNLP 2025 paper:

@inproceedings{hua2025bird,
  title     = {BIRD: Bronze Inscription Restoration and Dating},
  author    = {Hua, Wenjie and Nguyen, Hoang H. and Ge, Gangyan},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
  publisher = {Association for Computational Linguistics}
}