---
title: MOSS-VL
emoji: 🌱
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
python_version: '3.10'
app_file: app.py
pinned: false
short_description: 'MOSS-VL: Toward Advanced Video Understanding'
license: apache-2.0
models:
  - OpenMOSS-Team/MOSS-VL-Instruct-0408
tags:
  - vision-language
  - multimodal
  - image-understanding
  - video-understanding
---

# MOSS-VL-Instruct-0408 Demo

An interactive demo for MOSS-VL-Instruct-0408, an 11B-parameter instruction-tuned vision-language model developed by the OpenMOSS Team. Built on MOSS-VL-Base-0408 through supervised fine-tuning, it serves as a high-performance offline multimodal engine with particular strength in video understanding.

## Highlights

- **Outstanding Video Understanding** — Long-form video comprehension, temporal reasoning, action recognition, and second-level event localization. Top-tier results on VideoMME and MLVU, with +8.3 pts on VSI-bench over Qwen3-VL-8B-Instruct.
- **Strong General Multimodal Perception** — Robust image understanding, fine-grained object recognition, OCR, and document parsing (83.9 on document/OCR benchmarks).
- **Reliable Instruction Following** — Enhanced alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.

## Architecture

MOSS-VL adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning:

- **Millisecond-level latency** — visual encoding stays off the reasoning path, enabling near-instant responses
- **Native interleaved modalities** — processes complex sequences of images and videos within a unified pipeline
- **Absolute timestamps** — injected alongside each sampled frame for precise temporal perception
- **Cross-attention RoPE (XRoPE)** — maps text tokens and video patches into a unified 3D coordinate space (time, height, width)
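To make the coordinate scheme above concrete, here is a minimal sketch of how positions might be assigned before the rotary embedding is applied. This is an illustration of the idea, not MOSS-VL's actual implementation: the function names and the choice to reuse the frame's absolute timestamp as the time coordinate are assumptions based on the description above.

```python
# Hypothetical sketch of XRoPE-style position assignment (not the real API):
# every video patch gets a (time, height, width) coordinate, where time is
# the frame's absolute timestamp in seconds; text tokens advance along all
# three axes together, so pure text reduces to ordinary 1D RoPE.

def video_patch_positions(num_frames, grid_h, grid_w, timestamps):
    """Return one (t, h, w) coordinate per patch, frame by frame."""
    positions = []
    for f in range(num_frames):
        t = timestamps[f]  # absolute timestamp injected per sampled frame
        for h in range(grid_h):
            for w in range(grid_w):
                positions.append((t, h, w))
    return positions

def text_token_positions(num_tokens, start=0):
    """Text tokens move along time, height, and width in lockstep."""
    return [(start + i, start + i, start + i) for i in range(num_tokens)]
```

Because text and video share one coordinate space, a text token and a video patch that refer to the same moment can attend to each other with consistent relative positions.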

## Capabilities

- **Image Understanding:** scene description, object recognition, visual reasoning
- **Video Understanding:** temporal reasoning, action recognition, key event localization
- **OCR & Document Parsing:** text extraction and structured document parsing
- **Visual Question Answering:** open-ended questions about any image or video

## Usage

1. Upload an image or video using the input panel, or pick one of the example prompts on the welcome screen
2. Enter your question or prompt in the text box
3. (Optional) Adjust generation parameters in the sidebar's Generation Settings
4. Press Enter or click Send to get the model's response

> **Note:** The model weights (~22 GB) may take a few minutes to load on first use (cold start). Subsequent requests will be faster.


## Citation

```bibtex
@misc{moss_vl_2026,
  title         = {{MOSS-VL Technical Report}},
  author        = {OpenMOSS Team},
  year          = {2026},
  howpublished  = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note          = {GitHub repository}
}
```