Improve VST-7B-SFT model card with metadata, paper link, and usage clarity

by nielsr HF Staff - opened Nov 11, 2025


This PR enhances the model card by adding key metadata and improving its clarity and discoverability:

- `pipeline_tag: image-text-to-text`: This tag reflects what the model does: it takes visual (image/video) and text inputs and generates text. It helps users find the model when they search for multimodal models.
- `library_name: transformers`: Setting `library_name` to `transformers` lets the model page show an automatically generated, working code snippet, which makes the model easier to adopt. Evidence for compatibility is found in `config.json`, `tokenizer_config.json`, and the code snippet already in the card.
- Hugging Face paper link: Added a direct link to the Hugging Face paper page. It complements the existing arXiv link and makes the research easier to discover on the platform.
- Improved title: The model card title is now `# VST-7B-SFT: Visual Spatial Tuning`, for better clarity.
- "Sample Usage" section: The "Quickstart" section has been renamed to "Sample Usage" and now opens with a clearer introduction on installing dependencies. The existing code snippet is retained because it already targets VST-7B-SFT correctly.

These updates will make the model more accessible and easier to use for the community.
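
For reference, the two metadata keys described above would sit in the YAML front matter at the top of the model card's README.md, roughly as sketched below. Only the two keys named in this PR are shown. Any other keys already present in the card are not listed here, and their position relative to these two is an assumption.

```yaml
---
# Keys added by this PR; other existing front-matter keys are omitted from this sketch.
pipeline_tag: image-text-to-text
library_name: transformers
---
```

The Hub reads this block to decide which task filter the model appears under and which library's loading snippet to display on the model page.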

rayruiyang changed pull request status to merged Nov 11, 2025
