# Propositionizer-mT5-Small v2 (Multilingual)
A multilingual atomic fact decomposition model that converts unstructured text into a list of self-contained atomic propositions.
## Overview
| Property | Value |
|---|---|
| Base Model | google/mt5-small (300M params) |
| Training Method | Claude → mT5-small distillation |
| Languages | English, Korean, Japanese, Chinese |
| Training Data | ~9,700 multilingual examples (v1) + ~5,900 Korean examples (v2) |
| Format | ONNX (int8 quantized) |
| License | Apache 2.0 |
Based on the Dense X Retrieval (Propositionizer) approach, extended to a multilingual setting.
## Usage

### Transformers.js (Browser / Node.js)

```javascript
import { pipeline } from '@huggingface/transformers';

const decomposer = await pipeline(
  'text2text-generation',
  'liliplanet/propositionizer-mt5-small'
);

const result = await decomposer(
  'Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.',
  { max_new_tokens: 256, repetition_penalty: 2.0 }
);

console.log(JSON.parse(result[0].generated_text));
// ["The deadline is Friday.", "The rate was reduced."]
```
### Python (Transformers)

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("liliplanet/propositionizer-mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("liliplanet/propositionizer-mt5-small")

# Korean input, roughly: "Mr. Kim lowered the hourly rate, and the deadline is Friday."
input_text = "Title: 회의. Section: . Content: 김 대리가 시급을 낮추고 마감은 금요일이다."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    repetition_penalty=2.0,
    no_repeat_ngram_size=3,
    num_beams=4,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
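The model is trained to emit a JSON array of strings, but a small model can occasionally produce malformed output. A defensive parse is a reasonable precaution; this helper is an illustrative sketch, not part of the model's published API:

```python
import json


def parse_propositions(decoded: str) -> list[str]:
    """Parse model output into a list of propositions.

    Falls back to line splitting when the output is not valid JSON.
    """
    try:
        parsed = json.loads(decoded)
        if isinstance(parsed, list):
            return [str(p) for p in parsed]
    except json.JSONDecodeError:
        pass
    return [line.strip() for line in decoded.splitlines() if line.strip()]


print(parse_propositions('["The deadline is Friday.", "The rate was reduced."]'))
# ['The deadline is Friday.', 'The rate was reduced.']
```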
## Input Format

Follow the Propositionizer input format:

```
Title: {title}. Section: {section}. Content: {content}
```
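As a minimal sketch, the template above can be filled with a small helper (the function name is illustrative, not part of the model's API):

```python
def build_input(content: str, title: str = "", section: str = "") -> str:
    """Render a passage in the Propositionizer input template."""
    return f"Title: {title}. Section: {section}. Content: {content}"


print(build_input("The deadline is Friday and the rate was reduced.", title="Meeting"))
# Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.
```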
## Training

### v1
- Source texts: CNN/DailyMail, XSum, Wikipedia (EN/KO/JA/ZH), KLUE, XLSum
- Labeling: Claude Haiku 4.5 atomic fact decomposition
- Data: EN 4,879 / KO 2,860 / JA 983 / ZH 975 (~9,700 total)
- Training: 5 epochs, Adafactor, lr=1e-3, batch_size=16
### v2 (current)
- Focus: Korean quality improvement
- Additional data: ~5,900 Korean complex sentences (XLSum KO, KLUE NLI/RE/STS, KorQuAD, NSMC)
- Improved prompt: language drift prevention, proper noun preservation
- Training: continued training from v1, 3 epochs, lr=5e-4
- Generation config: repetition_penalty=2.0, no_repeat_ngram_size=3
### v2 improvements over v1
- Korean-to-English language drift resolved
- Repetition loops eliminated
- Proper noun preservation improved (e.g., "CEO", "Q2")
## Comparison with the Original Propositionizer

| | Original (Flan-T5-Large) | This Model (mT5-Small) |
|---|---|---|
| Parameters | 780M | 300M |
| Languages | English only | EN, KO, JA, ZH |
| Teacher | GPT-4 | Claude |
| Training Data | English only | Multilingual |
## Known Limitations

- The small model (300M) has limited capacity for complex decompositions
- May hallucinate facts not present in the source text, especially with uncommon proper nouns
- Best suited for short-to-medium paragraphs (< 500 characters)
- Complex Korean sentences with many IT terms may produce errors
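Given the length limitation, longer documents can be pre-split into short passages before decomposition. A minimal sketch follows; the paragraph-based splitting strategy is an assumption for illustration, not something the model prescribes:

```python
def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Split text on blank lines and pack paragraphs into chunks
    under max_chars, since the model works best on short passages.
    (Illustrative helper, not part of the model's API.)
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be wrapped in the `Title: … Section: … Content: …` template and decomposed independently.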
## Citation

```bibtex
@article{chen2023densex,
  title={Dense X Retrieval: What Retrieval Granularity Should We Use?},
  author={Chen, Tong and Wang, Hongwei and Chen, Sihao and Yu, Wenhao and Ma, Kaixin and Zhao, Xinran and Zhang, Hongming and Yu, Dong},
  journal={arXiv preprint arXiv:2312.06648},
  year={2023}
}
```
## Part of MemRosetta
This model is a component of the MemRosetta project for multilingual memory and knowledge extraction.