---
title: MOSS-VL
emoji: 🌱
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
python_version: '3.10'
app_file: app.py
pinned: false
short_description: 'MOSS-VL: Toward Advanced Video Understanding'
license: apache-2.0
models:
- OpenMOSS-Team/MOSS-VL-Instruct-0408
tags:
- vision-language
- multimodal
- image-understanding
- video-understanding
---
# MOSS-VL-Instruct-0408 Demo
An interactive demo for MOSS-VL-Instruct-0408, an 11B-parameter instruction-tuned vision-language model developed by the OpenMOSS Team. Built on MOSS-VL-Base-0408 through supervised fine-tuning, it serves as a high-performance offline multimodal engine with particular strength in video understanding.
## Highlights
- **Outstanding Video Understanding** — Long-form video comprehension, temporal reasoning, action recognition, and second-level event localization. Top-tier results on VideoMME and MLVU, with +8.3 pts on VSI-bench over Qwen3-VL-8B-Instruct.
- **Strong General Multimodal Perception** — Robust image understanding, fine-grained object recognition, OCR, and document parsing (83.9 on document/OCR benchmarks).
- **Reliable Instruction Following** — Enhanced alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
## Architecture
MOSS-VL adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning:
- **Millisecond-level latency** — visual encoding is lightweight enough for near-instant responses
- **Native interleaved modalities** — processes complex sequences of images and videos within a unified pipeline
- **Absolute timestamps** — injected alongside each sampled frame for precise temporal perception
- **Cross-attention RoPE (XRoPE)** — maps text tokens and video patches into a unified 3D coordinate space (time, height, width)
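The timestamp and XRoPE bullets can be made concrete with a small sketch. This is an illustrative position scheme only, not the actual MOSS-VL implementation: text tokens advance along a shared diagonal (so 3D RoPE degenerates to ordinary 1D RoPE over plain text), while each video patch is placed at (absolute frame timestamp, row, column).

```python
# Illustrative XRoPE-style coordinate assignment (assumed scheme, not the
# model's exact code): every token gets a (time, height, width) position.

def xrope_positions(num_text_tokens, frame_times, grid_h, grid_w):
    """Return one (time, height, width) coordinate per text token / video patch."""
    positions = []
    # Text tokens: the same index on all three axes, so pure-text inputs
    # behave like standard 1-D RoPE.
    for i in range(num_text_tokens):
        positions.append((float(i), float(i), float(i)))
    # Video patches: the time axis carries the frame's absolute timestamp
    # in seconds, which is what enables second-level event localization.
    for t in frame_times:
        for h in range(grid_h):
            for w in range(grid_w):
                positions.append((t, float(h), float(w)))
    return positions

# 3 text tokens, then two frames sampled at 0.0 s and 2.0 s,
# each encoded as a 2x2 patch grid.
coords = xrope_positions(3, [0.0, 2.0], 2, 2)
```

Because the time axis stores real seconds rather than a frame index, two frames sampled far apart in a long video stay far apart in position space even when few frames are sampled.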
## Capabilities
- Image Understanding: scene description, object recognition, visual reasoning
- Video Understanding: temporal reasoning, action recognition, key event localization
- OCR & Document Parsing: text extraction and structured document parsing
- Visual Question Answering: open-ended questions about any image or video
## Usage
1. Upload an image or video using the input panel, or pick one of the example prompts on the welcome screen.
2. Enter your question or prompt in the text box.
3. (Optional) Adjust generation parameters in the sidebar's Generation Settings.
4. Press Enter or click Send to get the model's response.
Note: The model weights (~22 GB) may take a few minutes to load on first use (cold start). Subsequent requests will be faster.
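The same steps can be driven programmatically instead of through the web UI. The sketch below is hedged: the Space id and the `predict()` argument order are assumptions, so check the Space's "Use via API" panel for the real endpoint signature.

```python
# Hedged sketch of querying the demo without the UI. The request shape
# below (prompt + optional media + generation settings) mirrors the UI
# steps; the exact predict() signature is an assumption.

def build_request(prompt, media_path=None, max_new_tokens=512, temperature=0.7):
    """Bundle what the demo collects: media, prompt, and sidebar settings."""
    request = {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }
    if media_path is not None:
        request["media"] = media_path
    return request

req = build_request("What happens between 00:10 and 00:15?", media_path="clip.mp4")

# Actual call (needs `pip install gradio_client` and the live Space;
# "OpenMOSS-Team/MOSS-VL" is an assumed Space id):
# from gradio_client import Client, handle_file
# client = Client("OpenMOSS-Team/MOSS-VL")
# answer = client.predict(handle_file(req["media"]), req["prompt"],
#                         req["max_new_tokens"], req["temperature"])
```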
## Model Details
- Model: OpenMOSS-Team/MOSS-VL-Instruct-0408
- Parameters: 11B (BF16)
- Base Model: MOSS-VL-Base-0408
- License: Apache 2.0
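The ~22 GB cold-start download mentioned in the usage note follows directly from these numbers:

```python
# BF16 stores each parameter in 16 bits = 2 bytes, so an 11B-parameter
# checkpoint weighs roughly 11e9 * 2 bytes.
params = 11_000_000_000
bytes_per_param = 2  # BF16
weights_gb = params * bytes_per_param / 1e9  # ~22 GB
```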
## Citation

```bibtex
@misc{moss_vl_2026,
  title        = {{MOSS-VL Technical Report}},
  author       = {{OpenMOSS Team}},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note         = {GitHub repository}
}
```