# Propositionizer-mT5-Small v2 (Multilingual)
A multilingual atomic fact decomposition model that converts unstructured text into a list of self-contained atomic propositions.
## Overview
| Property | Value |
|---|---|
| Base Model | google/mt5-small (300M params) |
| Training Method | Claude → mT5-small distillation |
| Languages | English, Korean, Japanese, Chinese |
| Training Data | ~9,700 multilingual examples (v1) + ~5,900 Korean examples (v2) |
| Format | ONNX (int8 quantized) |
| License | Apache 2.0 |
Based on the Dense X Retrieval (Propositionizer) approach, extended to a multilingual setting.
## Usage

### Transformers.js (Browser / Node.js)

```javascript
import { pipeline } from '@huggingface/transformers';

const decomposer = await pipeline(
  'text2text-generation',
  'liliplanet/propositionizer-mt5-small'
);

const result = await decomposer(
  'Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.',
  { max_new_tokens: 256, repetition_penalty: 2.0 }
);

console.log(JSON.parse(result[0].generated_text));
// ["The deadline is Friday.", "The rate was reduced."]
```
### Python (Transformers)

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("liliplanet/propositionizer-mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("liliplanet/propositionizer-mt5-small")

# Korean input, roughly: "Mr. Kim lowered the hourly rate, and the deadline is Friday."
input_text = "Title: 회의. Section: . Content: 김 대리가 시급을 낮추고 마감은 금요일이다."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    repetition_penalty=2.0,
    no_repeat_ngram_size=3,
    num_beams=4,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
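The model is trained to emit a JSON array of strings, but a small model can occasionally produce malformed output. A defensive parse is a reasonable precaution; this helper is an illustrative sketch, not part of the model's published API:

```python
import json


def parse_propositions(decoded: str) -> list[str]:
    """Parse model output into a list of propositions.

    Falls back to line splitting when the output is not valid JSON.
    """
    try:
        parsed = json.loads(decoded)
        if isinstance(parsed, list):
            return [str(p) for p in parsed]
    except json.JSONDecodeError:
        pass
    return [line.strip() for line in decoded.splitlines() if line.strip()]


print(parse_propositions('["The deadline is Friday.", "The rate was reduced."]'))
# ['The deadline is Friday.', 'The rate was reduced.']
```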
## Input Format

Follow the Propositionizer input format:

```
Title: {title}. Section: {section}. Content: {content}
```
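As a minimal sketch, the template above can be filled with a small helper (the function name is illustrative, not part of the model's API):

```python
def build_input(content: str, title: str = "", section: str = "") -> str:
    """Render a passage in the Propositionizer input template."""
    return f"Title: {title}. Section: {section}. Content: {content}"


print(build_input("The deadline is Friday and the rate was reduced.", title="Meeting"))
# Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.
```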
## Training

### v1
- Source texts: CNN/DailyMail, XSum, Wikipedia (EN/KO/JA/ZH), KLUE, XLSum
- Labeling: Claude Haiku 4.5 atomic fact decomposition
- Data: EN 4,879 / KO 2,860 / JA 983 / ZH 975 (~9,700 total)
- Training: 5 epochs, Adafactor, lr=1e-3, batch_size=16
### v2 (current)
- Focus: Korean quality improvement
- Additional data: ~5,900 Korean complex sentences (XLSum KO, KLUE NLI/RE/STS, KorQuAD, NSMC)
- Improved prompt: language drift prevention, proper noun preservation
- Training: continued training from v1, 3 epochs, lr=5e-4
- Generation config: repetition_penalty=2.0, no_repeat_ngram_size=3
### v2 improvements over v1
- Korean-to-English language drift resolved
- Repetition loops eliminated
- Proper noun preservation improved (e.g., "CEO", "Q2")
## Comparison with the Original Propositionizer

| | Original (Flan-T5-Large) | This Model (mT5-Small) |
|---|---|---|
| Parameters | 780M | 300M |
| Languages | English only | EN, KO, JA, ZH |
| Teacher | GPT-4 | Claude |
| Training Data | English only | Multilingual |
## Known Limitations

- The small model (300M) has limited capacity for complex decompositions
- May hallucinate facts not present in the source text, especially with uncommon proper nouns
- Best suited for short-to-medium paragraphs (< 500 characters)
- Complex Korean sentences with many IT terms may produce errors
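Given the length limitation, longer documents can be pre-split into short passages before decomposition. A minimal sketch follows; the paragraph-based splitting strategy is an assumption for illustration, not something the model prescribes:

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Split text on blank lines and pack paragraphs into chunks
    under max_chars, since the model works best on short passages.
    (Illustrative helper, not part of the model's API.)
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be wrapped in the `Title: … Section: … Content: …` template and decomposed independently.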
## Citation

```bibtex
@article{chen2023densex,
  title={Dense X Retrieval: What Retrieval Granularity Should We Use?},
  author={Chen, Tong and Wang, Hongwei and Chen, Sihao and Yu, Wenhao and Ma, Kaixin and Zhao, Xinran and Zhang, Hongming and Yu, Dong},
  journal={arXiv preprint arXiv:2312.06648},
  year={2023}
}
```
## Part of MemRosetta
This model is a component of the MemRosetta project for multilingual memory and knowledge extraction.