---
title: MOSS-VL
emoji: 🌱
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
python_version: '3.10'
app_file: app.py
pinned: false
short_description: 'MOSS-VL: Toward Advanced Video Understanding'
license: apache-2.0
models:
  - OpenMOSS-Team/MOSS-VL-Instruct-0408
tags:
  - vision-language
  - multimodal
  - image-understanding
  - video-understanding
---

# MOSS-VL-Instruct-0408 Demo

An interactive demo for MOSS-VL-Instruct-0408, an 11B-parameter instruction-tuned vision-language model developed by the OpenMOSS Team. Built on MOSS-VL-Base-0408 through supervised fine-tuning, it serves as a high-performance offline multimodal engine with particular strength in video understanding.

## Highlights

- **Outstanding Video Understanding** — Long-form video comprehension, temporal reasoning, action recognition, and second-level event localization. Top-tier results on VideoMME and MLVU, with +8.3 pts on VSI-bench over Qwen3-VL-8B-Instruct.
- **Strong General Multimodal Perception** — Robust image understanding, fine-grained object recognition, OCR, and document parsing (83.9 on document/OCR benchmarks).
- **Reliable Instruction Following** — Enhanced alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.

## Architecture

MOSS-VL adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning:

- **Millisecond-level latency** — visual encoding stays off the reasoning path, enabling near-instant responses
- **Native interleaved modalities** — processes complex sequences of images and videos within a unified pipeline
- **Absolute timestamps** — injected alongside each sampled frame for precise temporal perception
- **Cross-attention RoPE (XRoPE)** — maps text tokens and video patches into a unified 3D coordinate space (time, height, width)
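To make the coordinate scheme above concrete, here is a minimal sketch of how positions might be assigned before the rotary embedding is applied. This is an illustration of the idea, not MOSS-VL's actual implementation: the function names and the choice to reuse the frame's absolute timestamp as the time coordinate are assumptions based on the description above.

```python
# Hypothetical sketch of XRoPE-style position assignment (not the real API):
# every video patch gets a (time, height, width) coordinate, where time is
# the frame's absolute timestamp in seconds; text tokens advance along all
# three axes together, so pure text reduces to ordinary 1D RoPE.

def video_patch_positions(num_frames, grid_h, grid_w, timestamps):
    """Return one (t, h, w) coordinate per patch, frame by frame."""
    positions = []
    for f in range(num_frames):
        t = timestamps[f]  # absolute timestamp injected per sampled frame
        for h in range(grid_h):
            for w in range(grid_w):
                positions.append((t, h, w))
    return positions

def text_token_positions(num_tokens, start=0):
    """Text tokens move along time, height, and width in lockstep."""
    return [(start + i, start + i, start + i) for i in range(num_tokens)]
```

Because text and video share one coordinate space, a text token and a video patch that refer to the same moment can attend to each other with consistent relative positions.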

## Capabilities

- **Image Understanding:** scene description, object recognition, visual reasoning
- **Video Understanding:** temporal reasoning, action recognition, key event localization
- **OCR & Document Parsing:** text extraction and structured document parsing
- **Visual Question Answering:** open-ended questions about any image or video

## Usage

1. Upload an image or video using the input panel, or pick one of the example prompts on the welcome screen
2. Enter your question or prompt in the text box
3. (Optional) Adjust generation parameters in the sidebar's Generation Settings
4. Press Enter or click Send to get the model's response

> **Note:** The model weights (~22 GB) may take a few minutes to load on first use (cold start). Subsequent requests will be faster.


## Citation

```bibtex
@misc{moss_vl_2026,
  title         = {{MOSS-VL Technical Report}},
  author        = {OpenMOSS Team},
  year          = {2026},
  howpublished  = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note          = {GitHub repository}
}
```