
Propositionizer-mT5-Small v2 (Multilingual)

A multilingual atomic fact decomposition model that converts unstructured text into a list of self-contained atomic propositions.

Overview

Property          Value
Base Model        google/mt5-small (300M params)
Training Method   Claude → mT5-small distillation
Languages         English, Korean, Japanese, Chinese
Training Data     ~9,700 multilingual examples (v1) + ~5,900 Korean examples (v2)
Format            ONNX (int8 quantized)
License           Apache 2.0

Based on the Dense X Retrieval (Propositionizer) approach, extended to multilingual text.

Usage

Transformers.js (Browser / Node.js)

import { pipeline } from '@huggingface/transformers';

const decomposer = await pipeline(
  'text2text-generation',
  'liliplanet/propositionizer-mt5-small'
);

const result = await decomposer(
  'Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.',
  { max_new_tokens: 256, repetition_penalty: 2.0 }
);
console.log(JSON.parse(result[0].generated_text));
// ["The deadline is Friday.", "The rate was reduced."]

Python (Transformers)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("liliplanet/propositionizer-mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("liliplanet/propositionizer-mt5-small")

# Input: "Title: Meeting. Section: . Content: Assistant manager Kim lowered the hourly rate and the deadline is Friday."
input_text = "Title: 회의. Section: . Content: 김 대리가 시급을 낮추고 마감은 금요일이다."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, repetition_penalty=2.0, no_repeat_ngram_size=3, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
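The model is trained to emit a JSON array of strings, but a 300M model can occasionally produce malformed JSON, so it is worth parsing the decoded text defensively. A minimal sketch (the `parse_propositions` helper is hypothetical, not part of this repo):

```python
import json

def parse_propositions(decoded: str) -> list[str]:
    """Parse the decoded model output into a list of propositions.

    Try JSON first; fall back to splitting on newlines if the
    output is not a valid JSON array.
    """
    try:
        props = json.loads(decoded)
        if isinstance(props, list):
            return [str(p).strip() for p in props if str(p).strip()]
    except json.JSONDecodeError:
        pass
    # Fallback: treat each non-empty line as one proposition.
    return [line.strip() for line in decoded.splitlines() if line.strip()]
```

With well-formed output, `parse_propositions('["The deadline is Friday.", "The rate was reduced."]')` returns the two propositions as a Python list.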

Input Format

Follow the Propositionizer format:

Title: {title}. Section: {section}. Content: {content}
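A small helper can build this prompt string consistently; `make_prop_input` is a hypothetical name for illustration, not an API shipped with the model:

```python
def make_prop_input(content: str, title: str = "", section: str = "") -> str:
    """Build the Propositionizer-style input string:
    'Title: {title}. Section: {section}. Content: {content}'
    Title and section may be left empty, as in the usage examples above."""
    return f"Title: {title}. Section: {section}. Content: {content}"

print(make_prop_input("The deadline is Friday.", title="Meeting"))
# → Title: Meeting. Section: . Content: The deadline is Friday.
```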

Training

v1

  • Source texts: CNN/DailyMail, XSum, Wikipedia (EN/KO/JA/ZH), KLUE, XLSum
  • Labeling: Claude Haiku 4.5 atomic fact decomposition
  • Data: EN 4,879 / KO 2,860 / JA 983 / ZH 975 (~9,700 total)
  • Training: 5 epochs, Adafactor, lr=1e-3, batch_size=16

v2 (current)

  • Focus: Korean quality improvement
  • Additional data: ~5,900 Korean complex sentences (XLSum KO, KLUE NLI/RE/STS, KorQuAD, NSMC)
  • Improved prompt: language drift prevention, proper noun preservation
  • Training: continued training from v1, 3 epochs, lr=5e-4
  • Generation config: repetition_penalty=2.0, no_repeat_ngram_size=3

v2 improvements over v1

  • Korean language drift (mid-output switching from Korean to English) resolved
  • Repetition loops eliminated
  • Proper noun preservation improved (e.g., "CEO", "Q2")

Comparison with Original Propositionizer

                Original (Flan-T5-Large)   This Model (mT5-Small)
Parameters      780M                       300M
Languages       English only               EN, KO, JA, ZH
Teacher         GPT-4                      Claude
Training Data   English only               Multilingual

Known Limitations

  • Small model (300M) has limited capacity for complex decompositions
  • May hallucinate facts not present in the source text, especially with uncommon proper nouns
  • Best suited for short-to-medium length paragraphs (< 500 chars)
  • Korean complex sentences with many IT terms may produce errors
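Given the short-to-medium sweet spot, longer documents can be split into chunks of under 500 characters before decomposition. A minimal sketch, assuming paragraphs are separated by blank lines (`chunk_text` is a hypothetical helper, not part of this repo):

```python
def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Split text into chunks of at most max_chars, preferring
    paragraph boundaries (blank lines) and hard-splitting any
    single paragraph that exceeds the limit on its own."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(para) > max_chars:
            # Flush the current chunk, then hard-split the long paragraph.
            if current:
                chunks.append(current)
                current = ""
            for i in range(0, len(para), max_chars):
                chunks.append(para[i:i + max_chars])
        elif len(current) + len(para) + 2 <= max_chars:
            current = f"{current}\n\n{para}" if current else para
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be wrapped in the `Title/Section/Content` format above and decomposed independently; the resulting proposition lists are simply concatenated.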

Citation

@article{chen2023densex,
  title={Dense X Retrieval: What Retrieval Granularity Should We Use?},
  author={Chen, Tong and Wang, Hongwei and Chen, Sihao and Yu, Wenhao and Ma, Kaixin and Zhao, Xinran and Zhang, Hongming and Yu, Dong},
  journal={arXiv preprint arXiv:2312.06648},
  year={2023}
}

Part of MemRosetta

This model is a component of the MemRosetta project for multilingual memory and knowledge extraction.
