
PLUM-13B Pretrained Checkpoint (0-shot)

This README documents the Hugging Face checkpoint wjdghks950/plum-13b-pretrained for PLUM, the segmentation-enabled LMM introduced in PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding (NeurIPS 2025 Spotlight).

📄 Paper: arXiv:2505.20759
💽 Dataset: PARTONOMY-Core


🧭 Overview

PLUM is a segmentation-enabled LMM designed for part-level visual understanding. Instead of emitting a dedicated <SEG> token, which shifts the output distribution away from natural text, it marks segmentation targets via span tagging, and it conditions each new prediction on previous ones through a feedback loop.
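The span-tagging idea can be illustrated with a toy BIO decoder (a hypothetical sketch with made-up names, not the PLUM implementation): part mentions are recovered from per-token labels rather than from a special vocabulary token.

```python
# Illustrative only: BIO labels mark which generated tokens name a
# segmentable part ("B" begins a mention, "I" continues it, "O" is outside),
# so no dedicated <SEG> token has to be added to the vocabulary.
def decode_spans(tokens, bio_labels):
    """Collect (text, (start, end)) part mentions from BIO labels."""
    spans, start = [], None
    for i, label in enumerate(bio_labels):
        if label == "B":              # a new part mention begins
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "I":            # continue the current mention
            continue
        else:                         # "O": close any open mention
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:             # mention running to end of sequence
        spans.append((start, len(bio_labels)))
    return [(" ".join(tokens[s:e]), (s, e)) for s, e in spans]

tokens = ["the", "airplane", "has", "a", "vertical", "stabilizer", "and", "wings"]
labels = ["O",   "O",        "O",  "O", "B",        "I",          "O",   "B"]
spans = decode_spans(tokens, labels)
print(spans)  # each tagged part mention with its token span
```

Each recovered span can then be routed to the mask decoder, which is what lets the language head keep producing ordinary text.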

This checkpoint corresponds to a pretrained (zero-shot) PLUM-13B model used as the base for later PARTONOMY fine-tuning.


βš™οΈ Setup

From the repo root:

cd src/models/PLUM
conda env create -f environment.yml
conda activate partonomy
pip install flash-attn --no-build-isolation

If you plan to train or re-evaluate on the full datasets, follow the dataset and weight setup in the main README.md (PARTONOMY datasets + SAM ViT-H weights).


💬 Inference (Chat)

Use chat.py with the Hugging Face checkpoint:

cd src/models/PLUM
CUDA_VISIBLE_DEVICES=0 python chat.py \
  --version wjdghks950/plum-13b-pretrained \
  --precision bf16 \
  --use_bidir_bio \
  --use_feedback_loop

You will be prompted for:

  • A text prompt (e.g., "What parts do you see in this airplane?")
  • An image path

Notes:

  • --use_bidir_bio and --use_feedback_loop match the pretrained configuration used in this codebase.
  • You can enable quantized inference with --load_in_8bit or --load_in_4bit if needed.
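For example, a 4-bit quantized session combines the flags above (assuming a single GPU and a working bitsandbytes install; only the quantization flag changes):

```shell
cd src/models/PLUM
CUDA_VISIBLE_DEVICES=0 python chat.py \
  --version wjdghks950/plum-13b-pretrained \
  --precision bf16 \
  --use_bidir_bio \
  --use_feedback_loop \
  --load_in_4bit
```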

✅ Evaluation – PARTONOMY

Edit src/models/PLUM/scripts/run_validate_partonomy.sh and set:

  • MODEL_CKPT_PATH to the HF checkpoint or a local merged model path.
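For instance, to evaluate this checkpoint directly from the Hub, the edit amounts to a single assignment (the exact position of the variable in the script may differ):

```shell
# In src/models/PLUM/scripts/run_validate_partonomy.sh
MODEL_CKPT_PATH="wjdghks950/plum-13b-pretrained"
```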

Then run:

cd src/models/PLUM
chmod +x scripts/run_validate_partonomy.sh
./scripts/run_validate_partonomy.sh

🧪 Training Data (Pretraining)

The pretrained PLUM-13B checkpoint was trained with the same multi-source recipe described in the main repository README, including:

  • Semantic segmentation: ADE20K, COCO-Stuff, Mapillary, PACO-LVIS, PASCAL-Part, COCO Images, PartImageNet
  • Referring segmentation: refCOCO, refCOCO+, refCOCOg, refCLEF
  • VQA: LLaVA-Instruct-150K
  • Reasoning segmentation: ReasonSeg

See README.md for download links and the exact on-disk layout.
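Mixing several task types like this is typically done by drawing each example's source from a fixed categorical distribution. The sketch below is purely illustrative (the weights are made up, not the paper's actual mixing ratios):

```python
# Hypothetical multi-source sampler: each training example's task type is
# drawn with probability proportional to a per-source weight.
import random

SOURCES = {
    "semantic_seg": 4,   # ADE20K, COCO-Stuff, Mapillary, PACO-LVIS, ...
    "referring_seg": 3,  # refCOCO / refCOCO+ / refCOCOg / refCLEF
    "vqa": 2,            # LLaVA-Instruct-150K
    "reason_seg": 1,     # ReasonSeg
}

def sample_source(rng):
    """Pick a data source with probability proportional to its weight."""
    names = list(SOURCES)
    weights = [SOURCES[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Draw 10,000 examples and tally how often each source is chosen.
rng = random.Random(0)
counts = {name: 0 for name in SOURCES}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)
```

With these illustrative weights, semantic segmentation dominates the mix while reasoning segmentation is sampled least often.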


🧠 Citation

If you use this checkpoint, please cite:

@inproceedings{blume-kim-2025-partonomy,
  title={{PARTONOMY}: Large Multimodal Models with Part-Level Visual Understanding},
  author={Ansel Blume and Jeonghwan Kim and Hyeonjeong Ha and Elen Chatikyan and Xiaomeng Jin and Khanh Duy Nguyen and Nanyun Peng and Kai-Wei Chang and Derek Hoiem and Heng Ji},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=yjLew3Nd7z}
}

License

Copyright 2025

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
