huazzeng committed · Commit 6a72916 · 0 parents

Release current version
.gitattributes ADDED
@@ -0,0 +1,38 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.png filter=xet diff=xet merge=xet -text
+ asserts/cleaned_small_logo.png filter=lfs diff=lfs merge=lfs -text
+ asserts/pure_logo.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,4 @@
+ uploads/
+ .sii/
+ *.pyc
+ __pycache__/
README.md ADDED
@@ -0,0 +1,74 @@
+ ---
+ title: MOSS-VL
+ emoji: 🌱
+ colorFrom: green
+ colorTo: blue
+ sdk: gradio
+ sdk_version: 5.50.0
+ python_version: "3.10"
+ app_file: app.py
+ pinned: false
+ short_description: 'MOSS-VL: Toward Advanced Video Understanding'
+ license: apache-2.0
+ models:
+ - OpenMOSS-Team/MOSS-VL-Instruct-0408
+ tags:
+ - vision-language
+ - multimodal
+ - image-understanding
+ - video-understanding
+ ---
+
+ # MOSS-VL-Instruct-0408 Demo
+
+ An interactive demo for **MOSS-VL-Instruct-0408**, an 11B-parameter instruction-tuned vision-language model developed by the [OpenMOSS Team](https://github.com/OpenMOSS). Built on MOSS-VL-Base-0408 through supervised fine-tuning, it serves as a high-performance offline multimodal engine with particular strength in **video understanding**.
+
+ ## Highlights
+
+ - **Outstanding Video Understanding** — Long-form video comprehension, temporal reasoning, action recognition, and second-level event localization. Top-tier results on VideoMME and MLVU, with +8.3 pts on VSI-bench over Qwen3-VL-8B-Instruct.
+ - **Strong General Multimodal Perception** — Robust image understanding, fine-grained object recognition, OCR, and document parsing (83.9 on document/OCR benchmarks).
+ - **Reliable Instruction Following** — Enhanced alignment with user intent through supervised fine-tuning on diverse multimodal instruction data.
+
+ ## Architecture
+
+ MOSS-VL adopts a **cross-attention-based architecture** that decouples visual encoding from cognitive reasoning:
+
+ - Millisecond-level latency for instantaneous responses
+ - Natively supports **interleaved modalities** — processes complex sequences of images and videos within a unified pipeline
+ - **Absolute Timestamps** injected alongside each sampled frame for precise temporal perception
+ - **Cross-attention RoPE (XRoPE)** — maps text tokens and video patches into a unified 3D coordinate space (time, height, width)
+
+ ## Capabilities
+
+ - **Image Understanding**: scene description, object recognition, visual reasoning
+ - **Video Understanding**: temporal reasoning, action recognition, key event localization
+ - **OCR & Document Parsing**: text extraction and structured document parsing
+ - **Visual Question Answering**: open-ended questions about any image or video
+
+ ## Usage
+
+ 1. Upload an **image** or **video** using the input panel, or pick one of the example prompts on the welcome screen
+ 2. Enter your question or prompt in the text box
+ 3. (Optional) Adjust generation parameters in the sidebar's **Generation Settings**
+ 4. Press **Enter** or click **Send** to get the model's response
+
+ > **Note**: The model weights (~22 GB) may take a few minutes to load on first use (cold start). Subsequent requests will be faster.
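Behind the UI, the demo converts each chat turn into a list of typed content parts before calling the model. The sketch below mirrors the dict shapes used in this Space's app.py (`{"type": "image", "image": path}`, `{"type": "text", "text": ...}`); the `make_user_turn` helper itself is ours, for illustration only.

```python
import os

# Video extensions recognized by the demo (mirrors app.py's _VIDEO_EXTENSIONS).
VIDEO_EXTENSIONS = frozenset({".mp4", ".avi", ".mov", ".mkv", ".webm", ".flv", ".wmv", ".m4v"})

def make_user_turn(text, media_path=None):
    """Build one user message in the part-dict format app.py feeds to the model.

    make_user_turn is a hypothetical helper; only the dict shapes follow app.py.
    """
    parts = []
    if media_path:
        # Classify the attachment as video or image by file extension.
        ext = os.path.splitext(media_path)[1].lower()
        kind = "video" if ext in VIDEO_EXTENSIONS else "image"
        parts.append({"type": kind, kind: media_path})
    if text.strip():
        parts.append({"type": "text", "text": text})
    return {"role": "user", "content": parts}

messages = [make_user_turn("Describe this video.", "clip.mp4")]
print(messages[0]["content"][0])  # {'type': 'video', 'video': 'clip.mp4'}
```

Assistant turns, by contrast, are stored as plain strings, so a multi-turn conversation alternates part-lists and strings.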
+
+ ## Model Details
+
+ - **Model**: [OpenMOSS-Team/MOSS-VL-Instruct-0408](https://huggingface.co/OpenMOSS-Team/MOSS-VL-Instruct-0408)
+ - **Parameters**: 11B (BF16)
+ - **Base Model**: MOSS-VL-Base-0408
+ - **License**: Apache 2.0
+
+ ## Citation
+
+ ```bibtex
+ @misc{moss_vl_2026,
+   title = {{MOSS-VL Technical Report}},
+   author = {OpenMOSS Team},
+   year = {2026},
+   howpublished = {\url{https://github.com/OpenMOSS/MOSS-VL}},
+   note = {GitHub repository}
+ }
+ ```
app.py ADDED
@@ -0,0 +1,1091 @@
+ import os
+ import ctypes
+ import site
+
+ # nvidia-npp-cu12 installs libnppicc.so.12 inside site-packages/nvidia/npp/lib/,
+ # which is not on LD_LIBRARY_PATH. Load it globally before torchcodec is imported
+ # so the dynamic linker can resolve it when torchcodec dlopen's its shared libs.
+ def _preload_npp():
+     for _sp in site.getsitepackages():
+         _p = os.path.join(_sp, "nvidia", "npp", "lib", "libnppicc.so.12")
+         if os.path.exists(_p):
+             ctypes.CDLL(_p, mode=ctypes.RTLD_GLOBAL)
+             return
+ _preload_npp()
+
+ import queue
+ import uuid
+ import traceback
+ import threading
+
+ import gradio as gr
+ import torch
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ import modelscope_studio.components.antd as antd
+ import modelscope_studio.components.antdx as antdx
+ import modelscope_studio.components.base as ms
+ import modelscope_studio.components.pro as pro
+
+ try:
+     import spaces
+     HAS_SPACES = True
+ except ImportError:
+     HAS_SPACES = False
+
+ # ---------------------------------------------------------------------------
+ # Model
+ # ---------------------------------------------------------------------------
+ MODEL_ID = "OpenMOSS-Team/MOSS-VL-Instruct-0408"
+
+ print("Loading processor...")
+ processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
+
+ print("Loading model...")
+ try:
+     model = AutoModelForCausalLM.from_pretrained(
+         MODEL_ID,
+         trust_remote_code=True,
+         torch_dtype=torch.bfloat16,
+         device_map="auto",
+         attn_implementation="flash_attention_2",
+     )
+ except Exception:
+     model = AutoModelForCausalLM.from_pretrained(
+         MODEL_ID,
+         trust_remote_code=True,
+         torch_dtype=torch.bfloat16,
+         device_map="auto",
+         attn_implementation="sdpa",
+     )
+
+ model.eval()
+ print("Model ready.")
+
+ # ---------------------------------------------------------------------------
+ # Theme (Ant Design token — matches Qwen style but in MOSS green accent)
+ # ---------------------------------------------------------------------------
+ THEME = {
+     "token": {
+         "colorPrimary": "#4f7c6a",
+     }
+ }
+
+ # ---------------------------------------------------------------------------
+ # Welcome screen config
+ # ---------------------------------------------------------------------------
+ def welcome_config():
+     return {
+         "title": "MOSS-VL",
+         "description": "Multimodal vision-language model. Upload an image or video and ask anything.",
+         "icon": "asserts/cleaned_small_logo.png",
+         "elem_style": {
+             "maxWidth": "960px",
+             "margin": "40px auto 0",
+             "width": "100%",
+             "textAlign": "center",
+         },
+         "prompts": {
+             "title": "What can I help with?",
+             "elem_style": {
+                 "width": "100%",
+                 "display": "flex",
+                 "flexWrap": "wrap",
+                 "gap": "12px",
+                 "justifyContent": "center",
+                 "alignItems": "stretch",
+             },
+             "styles": {
+                 "title": {
+                     "width": "100%",
+                     "textAlign": "center",
+                     "marginBottom": "6px",
+                     "fontSize": "14px",
+                 },
+                 "item": {
+                     "flex": "1 1 0",
+                     "maxWidth": "420px",
+                     "minWidth": "280px",
+                 },
+             },
+             "items": [
+                 {
+                     "label": "🖼️ Image Perception",
+                     "children": [
+                         {
+                             "label": "Image Caption",
+                             "children": [
+                                 {"label": "", "description": "请详细描述这张图片的内容。"},
+                                 {"label": "", "description": "Describe this image in detail."},
+                             ],
+                         },
+                         {
+                             "label": "Multi-Image Caption",
+                             "children": [
+                                 {"label": "", "description": "这几张图片分别是什么?请逐一详细说明。"},
+                                 {"label": "", "description": "What are these pictures? Please explain in detail one by one."},
+                             ],
+                         },
+                     ],
+                 },
+                 {
+                     "label": "📄 OCR / Document",
+                     "children": [
+                         {
+                             "label": "OCR",
+                             "children": [
+                                 {"label": "", "description": "提取图片中的所有文字。"},
+                                 {"label": "", "description": "Extract all text in the image."},
+                             ],
+                         },
+                         {
+                             "label": "Document Parsing",
+                             "children": [
+                                 {"label": "", "description": "将文档转换为 Markdown 格式。"},
+                                 {"label": "", "description": "Convert this document to Markdown."},
+                             ],
+                         },
+                     ],
+                 },
+                 {
+                     "label": "🎬 Video Understanding",
+                     "children": [
+                         {
+                             "label": "Video Caption",
+                             "children": [
+                                 {"label": "", "description": "请描述这个视频的内容。"},
+                                 {"label": "", "description": "Describe this video."},
+                             ],
+                         },
+                         {
+                             "label": "Temporal Grounding",
+                             "children": [
+                                 {"label": "", "description": "观看此视频并确定主要的叙事片段。对于每个不同的时间块,提供时间戳并描述发生了什么。"},
+                                 {"label": "", "description": "Watch this video and identify the main narrative segments. For each distinct time block, provide the timestamps and describe what happens."},
+                             ],
+                         },
+                     ],
+                 },
+             ],
+         },
+     }
+
+
+ def user_config():
+     return {
+         "actions": ["edit", "delete"],
+     }
+
+
+ def bot_config(disabled_actions=None):
+     actions = ["copy", "retry", "delete"]
+     if disabled_actions:
+         actions = [a for a in actions if a not in disabled_actions]
+     return {
+         "avatar": _logo_url,
+         "header": "MOSS-VL",
+         "actions": actions,
+     }
+
+
+ def _file_path(f) -> str:
+     """Extract real filesystem path from either a plain string or a Gradio file dict."""
+     if isinstance(f, str):
+         return f
+     if isinstance(f, dict):
+         return f.get("path") or f.get("name") or ""
+     return ""
+
+
+ # ---------------------------------------------------------------------------
+ # Inference (multi-turn — yields loading placeholder then final reply)
+ # ---------------------------------------------------------------------------
+ _VIDEO_EXTENSIONS = frozenset({".mp4", ".avi", ".mov", ".mkv", ".webm", ".flv", ".wmv", ".m4v"})
+
+
+ def _build_model_messages(history):
+     """Convert pro.Chatbot history to the model's multi-turn message format.
+
+     User turns become ``[{type: image, image: path}, {type: text, text: ...}]``.
+     Assistant turns become plain strings. Loading placeholders are skipped.
+     """
+     model_messages = []
+     for msg in history:
+         if msg.get("loading"):
+             continue
+         role = msg["role"]
+         if role == "user":
+             content_parts = []
+             for part in msg.get("content", []):
+                 if part["type"] == "file":
+                     for f in (part.get("content") or []):
+                         path = _file_path(f)
+                         if path and os.path.exists(path):
+                             ext = os.path.splitext(path)[1].lower()
+                             if ext in _VIDEO_EXTENSIONS:
+                                 content_parts.append({"type": "video", "video": path})
+                             else:
+                                 content_parts.append({"type": "image", "image": path})
+                 elif part["type"] == "text":
+                     t = part.get("content", "")
+                     if t.strip():
+                         content_parts.append({"type": "text", "text": t})
+             if content_parts:
+                 model_messages.append({"role": "user", "content": content_parts})
+         elif role == "assistant":
+             text_parts = []
+             for part in msg.get("content", []):
+                 if isinstance(part, dict) and part.get("type") == "text":
+                     text_parts.append(part.get("content", ""))
+             text = "\n".join(text_parts).strip()
+             if text:
+                 model_messages.append({"role": "assistant", "content": text})
+     return model_messages
+
+
+ # Media defaults matching the official inference reference
+ _IMAGE_MEDIA_DEFAULTS = {
+     "min_pixels": 4096,
+     "max_pixels": 16777216,
+     "multi_image_max_pixels": 201326592,
+     "patch_size": 16,
+     "temporal_patch_size": 1,
+     "merge_size": 2,
+     "image_mean": [0.5, 0.5, 0.5],
+     "image_std": [0.5, 0.5, 0.5],
+ }
+ _VIDEO_MEDIA_DEFAULTS = {
+     "min_pixels": 4096,
+     "max_pixels": 16777216,
+     "video_max_pixels": 201326592,
+     "patch_size": 16,
+     "temporal_patch_size": 1,
+     "merge_size": 2,
+     "video_fps": 1.0,
+     "min_frames": 1,
+     "max_frames": 256,
+     "num_extract_threads": 4,
+     "image_mean": [0.5, 0.5, 0.5],
+     "image_std": [0.5, 0.5, 0.5],
+ }
+
+
+ def _run_generate(messages, enable_thinking, max_new_tokens, temperature, top_p, repetition_penalty, last_image_path=None, video_fps=1.0, max_frames=256):
+     """
+     messages: list of history dicts in pro.Chatbot format.
+     The caller must have already appended an assistant bubble as the last item.
+     Yields: (updated history list, new_last_image_path)
+     """
+     history = list(messages) if messages else []
+
+     # Last item is the pre-created assistant bubble; user message is second-to-last
+     user_msg = None
+     for msg in reversed(history[:-1]):
+         if msg["role"] == "user":
+             user_msg = msg
+             break
+     if user_msg is None:
+         return
+
+     text = ""
+     new_image = None
+     for part in user_msg.get("content", []):
+         if part["type"] == "text":
+             text = part["content"]
+         elif part["type"] == "file":
+             files = part["content"]
+             if files:
+                 new_image = _file_path(files[0])
+
+     if new_image and os.path.exists(new_image):
+         last_image_path = new_image
+
+     if not text.strip():
+         history[-1]["loading"] = False
+         history[-1]["content"] = [{"type": "text", "content": "⚠️ Please enter a prompt."}]
+         yield history, last_image_path
+         return
+
+     # Yield loading bubble immediately before heavy model work
+     yield history, last_image_path
+
+     try:
+         model_messages = _build_model_messages(history[:-1])
+
+         # Detect media types to pick correct defaults
+         has_image = any(
+             p.get("type") == "image"
+             for m in model_messages
+             for p in (m["content"] if isinstance(m["content"], list) else [])
+         )
+         has_video = any(
+             p.get("type") == "video"
+             for m in model_messages
+             for p in (m["content"] if isinstance(m["content"], list) else [])
+         )
+         media_kwargs = {}
+         if has_image:
+             media_kwargs.update(_IMAGE_MEDIA_DEFAULTS)
+         if has_video:
+             media_kwargs.update({**_VIDEO_MEDIA_DEFAULTS, "video_fps": float(video_fps), "max_frames": int(max_frames)})
+
+         do_sample = temperature > 0.0
+         query = {
+             "messages": model_messages,
+             "media_kwargs": media_kwargs,
+             "generate_kwargs": {
+                 "max_new_tokens": int(max_new_tokens),
+                 "temperature": float(temperature),
+                 "top_k": 50,
+                 "top_p": float(top_p),
+                 "repetition_penalty": float(repetition_penalty),
+                 "do_sample": do_sample,
+                 "vision_chunked_length": 64,
+             },
+         }
+
+         # Use the official offline_generate streaming API (queue-based)
+         in_q: "queue.Queue[dict]" = queue.Queue()
+         out_q: "queue.Queue[str]" = queue.Queue()
+
+         worker = threading.Thread(
+             target=model.offline_generate,
+             args=(processor, in_q, out_q),
+             kwargs={"vision_chunked_length": 64},
+             daemon=True,
+         )
+         worker.start()
+         in_q.put(dict(query))
+
+         partial_text = ""
+         try:
+             while True:
+                 token = out_q.get(timeout=300)
+                 if token == "<|round_start|>":
+                     continue
+                 if token == "<|round_end|>":
+                     break
+                 if token.startswith("[ERROR] "):
+                     raise RuntimeError(token)
+                 partial_text += token
+                 history[-1]["loading"] = False
+                 history[-1]["content"] = [{"type": "text", "content": partial_text + "▋"}]
+                 yield history, last_image_path
+         finally:
+             in_q.put({"stop_offline_generate": True})
+             worker.join(timeout=30.0)
+
+         if partial_text:
+             history[-1]["content"] = [{"type": "text", "content": partial_text}]
+
+     except torch.cuda.OutOfMemoryError:
+         history[-1]["loading"] = False
+         history[-1]["content"] = [{"type": "text", "content": "❌ Out of memory — try a smaller image or fewer Max New Tokens."}]
+     except Exception:
+         history[-1]["loading"] = False
+         history[-1]["content"] = [{"type": "text", "content": f"❌ Error:\n```\n{traceback.format_exc()}\n```"}]
+
+     yield history, last_image_path
+
+
+ if HAS_SPACES:
+     @spaces.GPU(duration=120)
+     def run_generate(messages, enable_thinking, max_new_tokens, temperature, top_p, repetition_penalty, last_image_path=None, video_fps=1.0, max_frames=256):
+         yield from _run_generate(messages, enable_thinking, max_new_tokens, temperature, top_p, repetition_penalty, last_image_path, video_fps, max_frames)
+ else:
+     def run_generate(messages, enable_thinking, max_new_tokens, temperature, top_p, repetition_penalty, last_image_path=None, video_fps=1.0, max_frames=256):
+         yield from _run_generate(messages, enable_thinking, max_new_tokens, temperature, top_p, repetition_penalty, last_image_path, video_fps, max_frames)
+
+
+ # ---------------------------------------------------------------------------
+ # CSS
+ # ---------------------------------------------------------------------------
+ CSS = """
+ /* Use 100vh (absolute) so body.offsetHeight = viewport height.
+    iFrameResizer reads offsetHeight — this prevents it from expanding
+    the iframe beyond the viewport and making the outer page scroll. */
+ html {
+     height: 100vh !important;
+     overflow: hidden !important;
+ }
+ body {
+     height: 100vh !important;
+     overflow: hidden !important;
+ }
+ .gradio-container {
+     padding: 0 !important;
+     height: 100vh !important;
+     overflow: hidden !important;
+ }
+ .gradio-container > main.fillable {
+     padding: 0 !important;
+     height: 100vh !important;
+     overflow: hidden !important;
+ }
+ footer {
+     display: none !important;
+ }
+ /* Height locked via JS-set --app-height to avoid iframe 100vh feedback loop */
+ #chatbot {
+     height: var(--app-height, 780px);
+     max-height: var(--app-height, 780px);
+ }
+ /* Propagate fixed height through any wrapper divs down to the ant-col children */
+ #chatbot > *,
+ #chatbot .ant-row,
+ #chatbot .ant-col {
+     height: 100% !important;
+ }
+ /* Gradio injects extra wrapper divs between ant-col and chatbot-chat; propagate height */
+ #chatbot .ant-col > div {
+     height: 100% !important;
+ }
+ /* Sidebar col: full-height gray background, override antd gutter padding */
+ #chatbot .sidebar-col {
+     height: 100% !important;
+     background-color: var(--ms-gr-ant-color-bg-layout) !important;
+     padding-left: 0 !important;
+     padding-right: 0 !important;
+ }
+ #chatbot .chatbot-conversations {
+     height: 100%;
+     background-color: var(--ms-gr-ant-color-bg-layout);
+     padding-left: 4px;
+     padding-right: 4px;
+     overflow-y: auto;
+ }
+ #chatbot .chatbot-conversations .chatbot-conversations-list {
+     padding-left: 0;
+     padding-right: 0;
+ }
+ #chatbot .chatbot-chat {
+     padding: 32px;
+     padding-top: 64px;
+     padding-bottom: 24px;
+     height: 100%;
+     display: flex;
+     flex-direction: column;
+     overflow: hidden;
+ }
+ @media (max-width: 768px) {
+     #chatbot .chatbot-chat {
+         padding: 10px;
+         padding-bottom: 16px;
+     }
+ }
+ #chatbot .chatbot-chat .chatbot-chat-messages {
+     flex: 1;
+     min-height: 0;
+     overflow-y: auto;
+ }
+ #chatbot .chatbot-chat .chatbot-chat-messages > div {
+     height: 100% !important;
+     display: flex !important;
+     flex-direction: column !important;
+ }
+ /* Vertically center welcome content only (safe — won't break scroll when messages exist) */
+ #chatbot .chatbot-chat-messages .ms-gr-pro-chatbot-messages {
+     display: flex;
+     flex-direction: column;
+ }
+ /* Equal-height top-level cards */
+ #chatbot .chatbot-chat-messages .ms-gr-pro-chatbot-messages .ant-prompts-items {
+     display: flex !important;
+     align-items: stretch !important;
+ }
+ #chatbot .chatbot-chat-messages .ms-gr-pro-chatbot-messages .ant-prompts-item {
+     display: flex !important;
+     flex-direction: column !important;
+     height: auto !important;
+     flex: 1 1 0 !important;
+ }
+ #chatbot .chatbot-chat-messages .ms-gr-pro-chatbot-messages .ant-prompts-item > * {
+     flex: 1;
+     display: flex;
+     flex-direction: column;
+     height: 100%;
+ }
+ /* Sub-group rows within each card */
+ #chatbot .chatbot-chat-messages .ms-gr-pro-chatbot-messages .ant-prompts-item .ant-prompts-items {
+     display: flex !important;
+     flex-direction: column !important;
+     align-items: stretch !important;
+     flex: 1;
+     height: 100%;
+ }
+ /* Sub-groups (level 2) */
+ #chatbot .chatbot-chat-messages .ms-gr-pro-chatbot-messages .ant-prompts-item .ant-prompts-item {
+     flex: 1 1 0 !important;
+     display: flex !important;
+     flex-direction: column !important;
+     box-sizing: border-box !important;
+ }
+ /* Leaf prompt buttons (level 3): smaller font and compact height */
+ #chatbot .chatbot-chat-messages .ms-gr-pro-chatbot-messages .ant-prompts-item .ant-prompts-item .ant-prompts-item {
+     flex: 1 1 0 !important;
+     height: auto !important;
+     display: flex !important;
+     align-items: center !important;
+     padding: 4px 8px !important;
+     box-sizing: border-box !important;
+     font-size: 11px !important;
+     line-height: 1.4 !important;
+ }
+ /* Sub-group label — smaller font */
+ #chatbot .chatbot-chat-messages .ms-gr-pro-chatbot-messages .ant-prompts-item .ant-prompts-title {
+     font-size: 11px !important;
+     opacity: 0.65;
+     margin-bottom: 4px !important;
+     padding: 0 !important;
+ }
+ /* Make \n in description render as real line breaks */
+ .ant-prompts-item-description {
+     white-space: pre-wrap !important;
+ }
+ /* Welcome header: icon stacked above title */
+ #chatbot .chatbot-chat-messages .ms-gr-pro-chatbot-messages .ant-welcome {
+     display: flex !important;
+     flex-direction: column !important;
+     align-items: center !important;
+ }
+ #chatbot .chatbot-chat-messages .ms-gr-pro-chatbot-messages .ant-welcome-icon {
+     font-size: 80px !important;
+     margin-bottom: 8px !important;
+     margin-inline-end: 0 !important;
+ }
+ #chatbot .chatbot-chat-messages .ms-gr-pro-chatbot-messages .ant-welcome-icon img {
+     width: 80px !important;
+     height: 80px !important;
+ }
+ #chatbot .chatbot-chat-messages .ms-gr-pro-chatbot-messages .ant-welcome-title {
+     font-size: 36px !important;
+ }
+ /* Bot avatar: no circle crop, transparent-friendly */
+ #chatbot .ant-avatar {
+     border-radius: 0 !important;
+     background: transparent !important;
+     border: none !important;
+     box-shadow: none !important;
+ }
+ #chatbot .ant-avatar img {
+     border-radius: 0 !important;
+     object-fit: contain !important;
+ }
+ """
+
+ # ---------------------------------------------------------------------------
+ # UI
+ # ---------------------------------------------------------------------------
+ _ROOT_PATH = os.environ.get("GRADIO_ROOT_PATH", "").rstrip("/")
+ _ASSETS_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "asserts")
+ _LOGO_PATH = os.path.join(_ASSETS_DIR, "pure_logo.png")
+ _logo_url = "https://huggingface.co/spaces/OpenMOSS-Team/MOSS-VL/resolve/main/asserts/pure_logo.png"
+
+ # One-shot snapshot of window.innerHeight → --app-height.
+ # Reads once after iFrameResizer has set the initial iframe size, then
+ # NEVER updates. This breaks the feedback loop where iFrameResizer grows
+ # the iframe in response to content height and our JS keeps chasing it.
+ _SYNC_HEIGHT_JS = """
+ () => {
+     let attempts = 0;
+     const snapshot = () => {
+         const h = window.innerHeight;
+         // Only accept plausible values (iframe default is 150px).
+         if (h > 500) {
+             document.documentElement.style.setProperty('--app-height', h + 'px');
+             return; // one-shot: stop polling, never listen for resize
+         }
+         // Poll every 50ms up to 2 seconds; after that let CSS fallback (780px) take over.
+         if (attempts++ < 40) {
+             setTimeout(snapshot, 50);
+         }
+     };
+     snapshot();
+ }
+ """
+
+ # Per-row height equalization for the 3-column welcome prompt grid.
+ # Structure assumed: 3 top-level column items, each with 4 leaf items (2 groups × 2 leaves).
+ # Columns are identified as prompts-items that are NOT nested inside another prompts-item.
+ # Then for each row index 0-3, we equalize min-height across the 3 columns.
+ _EQUALIZE_ROWS_JS = """
+ () => {
+     const fix = () => {
+         const all = [...document.querySelectorAll('[class*="prompts-item"]')];
+         if (all.length < 12) { setTimeout(fix, 400); return; }
+
+         // Top-level column items: not contained in any other prompts-item
+         const cols = all.filter(el => !el.parentElement.closest('[class*="prompts-item"]'));
+         if (cols.length !== 3) { setTimeout(fix, 400); return; }
+
+         // For each column collect leaf items (no nested prompts-item) in DOM order
+         const colLeaves = cols.map(col =>
+             [...col.querySelectorAll('[class*="prompts-item"]')]
+                 .filter(el => !el.querySelector('[class*="prompts-item"]'))
+         );
+         if (!colLeaves.every(l => l.length === 4)) { setTimeout(fix, 400); return; }
+
+         // Check all items have rendered height
+         if (colLeaves.flat().some(el => el.getBoundingClientRect().height < 5)) {
+             setTimeout(fix, 400); return;
+         }
+
+         // Equalize row by row
+         for (let r = 0; r < 4; r++) {
+             const row = colLeaves.map(col => col[r]);
+             const maxH = Math.max(...row.map(el => el.getBoundingClientRect().height));
+             row.forEach(el => { el.style.minHeight = maxH + 'px'; });
+         }
+     };
+     setTimeout(fix, 1500);
+ }
+ """
+
644
+ with gr.Blocks(css=CSS, fill_width=True, title="MOSS-VL Demo") as demo:
645
+
646
+ # Generation settings (shared state)
647
+ gen_settings = gr.State({
648
+ "max_new_tokens": 512,
649
+ "temperature": 0.0,
650
+ "top_p": 1.0,
651
+ "repetition_penalty": 1.0,
652
+ })
653
+
654
+ # Conversation state
655
+ state = gr.State({
656
+ "conversation_contexts": {}, # id -> {"history": [...]}
657
+ "conversations": [], # [{key, label}, ...]
658
+ "conversation_id": "",
659
+ })
660
+
661
+     with ms.Application(), antdx.XProvider(theme=THEME):
+         with antd.Row(gutter=[20, 20], wrap=False, elem_id="chatbot"):
+
+             # ── LEFT SIDEBAR ──
+             with antd.Col(
+                 md=dict(flex="0 0 260px", span=24, order=0),
+                 span=0,
+                 order=1,
+                 elem_style=dict(width=0),
+                 elem_classes="sidebar-col",
+             ) as sidebar_col:
+                 with ms.Div(elem_classes="chatbot-conversations"):
+                     with antd.Flex(vertical=True, gap="small", elem_style=dict(height="100%")):
+
+                         # Logo
+                         gr.HTML(
+                             f'<div style="display:flex;align-items:center;justify-content:center;'
+                             f'gap:8px;padding:8px;white-space:nowrap;">'
+                             f'<img src="{_logo_url}" '
+                             f'style="width:40px;height:40px;object-fit:contain;display:block;" />'
+                             f'<span style="font-size:22px;font-weight:600;line-height:1;">MOSS-VL</span>'
+                             f'</div>'
+                         )
+
+                         # New conversation button
+                         with antd.Button(
+                             value=None,
+                             color="primary",
+                             variant="filled",
+                             block=True,
+                         ) as add_conv_btn:
+                             ms.Text("New Conversation")
+                             with ms.Slot("icon"):
+                                 antd.Icon("PlusOutlined")
+
+                         # Conversation list
+                         with antdx.Conversations(
+                             elem_classes="chatbot-conversations-list",
+                         ) as conversations:
+                             with ms.Slot("menu.items"):
+                                 with antd.Menu.Item(
+                                     label="Delete", key="delete", danger=True
+                                 ) as conv_delete_item:
+                                     with ms.Slot("icon"):
+                                         antd.Icon("DeleteOutlined")
+
+                         # Settings accordion at bottom of sidebar
+                         with antd.Collapse(ghost=True):
+                             with antd.Collapse.Item(
+                                 label="⚙ Generation Settings",
+                                 key="settings",
+                             ):
+                                 max_new_tokens = gr.Slider(64, 8192, value=4096, step=64, label="Max New Tokens")
+                                 temperature = gr.Slider(0.0, 1.5, value=0.5, step=0.05, label="Temperature")
+                                 top_p = gr.Slider(0.1, 1.0, value=1.0, step=0.05, label="Top-p")
+                                 repetition_penalty = gr.Slider(1.0, 2.0, value=1.05, step=0.05, label="Repetition Penalty")
+                             with antd.Collapse.Item(
+                                 label="🎬 Video Sampling",
+                                 key="video",
+                             ):
+                                 video_fps = gr.Slider(0.1, 4.0, value=1.0, step=0.1, label="FPS")
+                                 max_frames = gr.Slider(8, 512, value=256, step=8, label="Max Frames")
+
+             # ── MAIN CHAT AREA ──
+             with antd.Col(flex=1, elem_style=dict(height="100%")):
+                 with antd.Flex(
+                     vertical=True,
+                     gap="small",
+                     elem_classes="chatbot-chat",
+                 ):
+                     # Chatbot
+                     chatbot = pro.Chatbot(
+                         elem_classes="chatbot-chat-messages",
+                         height=0,
+                         welcome_config=welcome_config(),
+                         user_config=user_config(),
+                         bot_config=bot_config(),
+                     )
+
+                     # Multimodal input (built-in + button for attachments)
+                     with pro.MultimodalInput(
+                         placeholder="Message MOSS-VL…",
+                         upload_config={
+                             "accept": "image/*,video/*",
+                             "multiple": False,
+                         },
+                     ) as chat_input:
+                         with ms.Slot("prefix"):
+                             with antd.Flex(gap=4, wrap=True):
+                                 with antd.Button(value=None, type="text") as clear_btn:
+                                     with ms.Slot("icon"):
+                                         antd.Icon("ClearOutlined")
+
+     # ── EVENT HANDLERS ──
+
+     def preprocess(state_value, clear_input=True):
+         history = state_value["conversation_contexts"].get(
+             state_value["conversation_id"], {}
+         ).get("history", [])
+         updates = {
+             conversations: gr.update(
+                 active_key=state_value["conversation_id"],
+                 items=[{**c, "disabled": c["key"] != state_value["conversation_id"]}
+                        for c in state_value["conversations"]],
+             ),
+             add_conv_btn: gr.update(disabled=True),
+             clear_btn: gr.update(disabled=True),
+             conv_delete_item: gr.update(disabled=True),
+             chatbot: gr.update(
+                 value=history,
+                 bot_config=bot_config(disabled_actions=["retry", "edit", "delete"]),
+                 user_config={"actions": []},
+             ),
+             state: gr.update(value=state_value),
+         }
+         if clear_input:
+             updates[chat_input] = gr.update(value=None, loading=True)
+         else:
+             updates[chat_input] = gr.update(loading=True)
+         return updates
+
+     def postprocess(state_value):
+         history = state_value["conversation_contexts"].get(
+             state_value["conversation_id"], {}
+         ).get("history", [])
+         return {
+             chat_input: gr.update(loading=False),
+             conv_delete_item: gr.update(disabled=False),
+             clear_btn: gr.update(disabled=False),
+             conversations: gr.update(items=state_value["conversations"]),
+             add_conv_btn: gr.update(disabled=False),
+             chatbot: gr.update(
+                 value=history,
+                 bot_config=bot_config(),
+                 user_config=user_config(),
+             ),
+             state: gr.update(value=state_value),
+         }
+
+     def add_user_message(input_value, state_value):
+         text = input_value.get("text", "") if input_value else ""
+         files = input_value.get("files", []) if input_value else []
+
+         persistent_files = [_file_path(f) for f in files]
+
+         if not state_value["conversation_id"]:
+             conv_id = str(uuid.uuid4())
+             state_value["conversation_id"] = conv_id
+             state_value["conversations"].append({"label": text[:40] or "New Chat", "key": conv_id})
+             state_value["conversation_contexts"][conv_id] = {"history": [], "last_image_path": None}
+
+         ctx = state_value["conversation_contexts"][state_value["conversation_id"]]
+         history = ctx["history"]
+
+         history.append({
+             "key": str(uuid.uuid4()),
+             "role": "user",
+             "content": [
+                 {"type": "file", "content": persistent_files},
+                 {"type": "text", "content": text},
+             ],
+         })
+
+         history.append({
+             "key": str(uuid.uuid4()),
+             "role": "assistant",
+             "header": "MOSS-VL",
+             "loading": True,
+             "content": [{"type": "text", "content": ""}],
+         })
+
+         return preprocess(state_value, clear_input=True)
+
+     def generate_response(state_value, max_tok, temp, top_p_, rep_pen, v_fps, v_max_frames):
+         conv_id = state_value.get("conversation_id", "")
+         if not conv_id or conv_id not in state_value.get("conversation_contexts", {}):
+             return
+         ctx = state_value["conversation_contexts"][conv_id]
+         history = ctx["history"]
+         last_img = ctx.get("last_image_path")
+
+         for updated_history, new_last_img in run_generate(
+             history, False, max_tok, temp, top_p_, rep_pen, last_img, v_fps, v_max_frames
+         ):
+             ctx["history"] = updated_history
+             ctx["last_image_path"] = new_last_img
+             yield updated_history, state_value
+
+     def apply_welcome_prompt(e: gr.EventData, input_value):
+         if input_value is None:
+             input_value = {}
+         input_value["text"] = e._data["payload"][0]["value"]["description"]
+         return gr.update(value=input_value)
+
+     def new_chat(state_value):
+         if not state_value["conversation_id"]:
+             return gr.skip()
+         state_value["conversation_id"] = ""
+         return (
+             gr.update(active_key=""),
+             gr.update(value=None),
+             gr.update(value=state_value),
+         )
+
+     def select_conversation(state_value, e: gr.EventData):
+         key = e._data["payload"][0]
+         if state_value["conversation_id"] == key or key not in state_value["conversation_contexts"]:
+             return gr.skip()
+         state_value["conversation_id"] = key
+         history = state_value["conversation_contexts"][key]["history"]
+         return (
+             gr.update(active_key=key),
+             gr.update(value=history),
+             gr.update(value=state_value),
+         )
+
+     def conversation_menu(state_value, e: gr.EventData):
+         conv_id = e._data["payload"][0]["key"]
+         operation = e._data["payload"][1]["key"]
+         if operation == "delete":
+             del state_value["conversation_contexts"][conv_id]
+             state_value["conversations"] = [
+                 c for c in state_value["conversations"] if c["key"] != conv_id
+             ]
+             if state_value["conversation_id"] == conv_id:
+                 state_value["conversation_id"] = ""
+                 return (
+                     gr.update(items=state_value["conversations"], active_key=""),
+                     gr.update(value=None),
+                     gr.update(value=state_value),
+                 )
+             else:
+                 return (
+                     gr.update(items=state_value["conversations"]),
+                     gr.skip(),
+                     gr.update(value=state_value),
+                 )
+         return gr.skip()
+
+     def clear_history(state_value):
+         if not state_value["conversation_id"]:
+             return gr.skip()
+         state_value["conversation_contexts"][state_value["conversation_id"]]["history"] = []
+         return gr.update(value=None), gr.update(value=state_value)
+
+     def prepare_retry(state_value, e: gr.EventData):
+         index = e._data["payload"][0]["index"]
+         ctx = state_value["conversation_contexts"][state_value["conversation_id"]]
+         ctx["history"] = ctx["history"][:index]
+
+         ctx["history"].append({
+             "key": str(uuid.uuid4()),
+             "role": "assistant",
+             "header": "MOSS-VL",
+             "loading": True,
+             "content": [{"type": "text", "content": ""}],
+         })
+
+         return preprocess(state_value, clear_input=False)
+
+     def delete_message(state_value, e: gr.EventData):
+         index = e._data["payload"][0]["index"]
+         history = state_value["conversation_contexts"][state_value["conversation_id"]]["history"]
+         history.pop(index)
+         return gr.update(value=state_value)
+
+     def handle_edit(state_value, e: gr.EventData):
+         payload = e._data["payload"][0]
+         index = payload["index"]
+
+         ctx = state_value["conversation_contexts"][state_value["conversation_id"]]
+
+         # Extract new text from the edited content
+         new_content = payload.get("value", "")
+         if isinstance(new_content, list):
+             # content is a list of parts; extract the text parts
+             new_text = " ".join(
+                 p.get("content", "") or p.get("text", "")
+                 for p in new_content
+                 if isinstance(p, dict) and p.get("type") == "text"
+             )
+         elif isinstance(new_content, str):
+             new_text = new_content
+         else:
+             new_text = ""
+
+         # Update the user message at index with the new text, keep files intact
+         original_msg = ctx["history"][index]
+         new_parts = []
+         for part in original_msg.get("content", []):
+             if part.get("type") == "file":
+                 new_parts.append(part)
+             elif part.get("type") == "text":
+                 new_parts.append({"type": "text", "content": new_text})
+         if not any(p.get("type") == "text" for p in new_parts):
+             new_parts.append({"type": "text", "content": new_text})
+         ctx["history"][index]["content"] = new_parts
+
+         # Drop everything after the edited message (old assistant reply + later turns)
+         ctx["history"] = ctx["history"][:index + 1]
+
+         # Append loading assistant bubble
+         ctx["history"].append({
+             "key": str(uuid.uuid4()),
+             "role": "assistant",
+             "header": "MOSS-VL",
+             "loading": True,
+             "content": [{"type": "text", "content": ""}],
+         })
+
+         return preprocess(state_value, clear_input=False)
+
+     # Wire events
+     ui_outputs = [
+         chat_input, conv_delete_item, clear_btn,
+         add_conv_btn, conversations, chatbot, state,
+     ]
+     stream_outputs = [chatbot, state]
+     # Named gen_inputs so this list does not shadow the gen_settings gr.State above
+     gen_inputs = [max_new_tokens, temperature, top_p, repetition_penalty, video_fps, max_frames]
+
+     # Submit: add message → stream tokens → restore UI
+     submit_step1 = chat_input.submit(
+         fn=add_user_message,
+         inputs=[chat_input, state],
+         outputs=ui_outputs,
+     )
+     submit_step2 = submit_step1.then(
+         fn=generate_response,
+         inputs=[state] + gen_inputs,
+         outputs=stream_outputs,
+     )
+     submit_step2.then(
+         fn=postprocess,
+         inputs=[state],
+         outputs=ui_outputs,
+     )
+
+     chat_input.cancel(
+         fn=postprocess,
+         inputs=[state],
+         outputs=ui_outputs,
+         cancels=[submit_step1, submit_step2],
+         queue=False,
+     )
+
+     chatbot.welcome_prompt_select(
+         fn=apply_welcome_prompt,
+         inputs=[chat_input],
+         outputs=[chat_input],
+     )
+
+     add_conv_btn.click(
+         fn=new_chat,
+         inputs=[state],
+         outputs=[conversations, chatbot, state],
+     )
+
+     conversations.active_change(
+         fn=select_conversation,
+         inputs=[state],
+         outputs=[conversations, chatbot, state],
+     )
+
+     conversations.menu_click(
+         fn=conversation_menu,
+         inputs=[state],
+         outputs=[conversations, chatbot, state],
+     )
+
+     clear_btn.click(
+         fn=clear_history,
+         inputs=[state],
+         outputs=[chatbot, state],
+     )
+
+     chatbot.delete(
+         fn=delete_message,
+         inputs=[state],
+         outputs=[state],
+     )
+
+     # Edit: update message → stream tokens → restore UI
+     edit_step1 = chatbot.edit(
+         fn=handle_edit,
+         inputs=[state],
+         outputs=ui_outputs,
+     )
+     edit_step2 = edit_step1.then(
+         fn=generate_response,
+         inputs=[state] + gen_inputs,
+         outputs=stream_outputs,
+     )
+     edit_step2.then(
+         fn=postprocess,
+         inputs=[state],
+         outputs=ui_outputs,
+     )
+
+     # Retry: prepare → stream tokens → restore UI
+     retry_step1 = chatbot.retry(
+         fn=prepare_retry,
+         inputs=[state],
+         outputs=ui_outputs,
+     )
+     retry_step2 = retry_step1.then(
+         fn=generate_response,
+         inputs=[state] + gen_inputs,
+         outputs=stream_outputs,
+     )
+     retry_step2.then(
+         fn=postprocess,
+         inputs=[state],
+         outputs=ui_outputs,
+     )
+
+
1076
+ # Lock chatbot height to actual viewport height (avoids iframe 100vh loop)
1077
+ demo.load(fn=None, inputs=None, outputs=None, js=_SYNC_HEIGHT_JS)
1078
+
1079
+ # Per-row height equalization for the welcome prompt grid
1080
+ demo.load(fn=None, inputs=None, outputs=None, js=_EQUALIZE_ROWS_JS)
1081
+
1082
+
1083
+ demo.queue(default_concurrency_limit=1, max_size=20)
1084
+
1085
+ # Mount asserts directory as /assets so logo can be served without going
1086
+ # through gradio's cache validation (which rejects paths not in temp dir)
1087
+ from fastapi.staticfiles import StaticFiles
1088
+ demo.app.mount("/assets", StaticFiles(directory=_ASSETS_DIR), name="assets")
1089
+
1090
+ if __name__ == "__main__":
1091
+ demo.launch(ssr_mode=False, root_path=_ROOT_PATH)
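The conversation bookkeeping in the handlers above reduces to a few plain dict operations on the `state` schema. A minimal standalone sketch of that schema follows; the function names here are illustrative, not part of the app:

```python
import uuid

# State schema used by the demo:
#   {"conversation_contexts": {id: {"history": [...], "last_image_path": ...}},
#    "conversations": [{"key": id, "label": ...}],
#    "conversation_id": current_id}

def make_state():
    return {"conversation_contexts": {}, "conversations": [], "conversation_id": ""}

def start_conversation(state, first_message):
    # Mirrors add_user_message: lazily create a conversation on the first send,
    # labeling it with the first 40 characters of the message.
    conv_id = str(uuid.uuid4())
    state["conversation_id"] = conv_id
    state["conversations"].append({"label": first_message[:40] or "New Chat", "key": conv_id})
    state["conversation_contexts"][conv_id] = {"history": [], "last_image_path": None}
    return conv_id

def delete_conversation(state, conv_id):
    # Mirrors the "delete" branch of conversation_menu.
    del state["conversation_contexts"][conv_id]
    state["conversations"] = [c for c in state["conversations"] if c["key"] != conv_id]
    if state["conversation_id"] == conv_id:
        state["conversation_id"] = ""

s = make_state()
cid = start_conversation(s, "Describe this image")
assert s["conversation_id"] == cid and len(s["conversations"]) == 1
delete_conversation(s, cid)
assert s["conversation_id"] == "" and s["conversations"] == []
```

Because every handler receives and returns this one dict, multiple browser sessions stay isolated: gradio gives each session its own copy of the `gr.State` value.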
asserts/cleaned_small_logo.png ADDED

Git LFS Details

  • SHA256: 37bf0241f5fbab73fc955827f477befc95b88b0d3618c31628099b70ea1dd232
  • Pointer size: 131 Bytes
  • Size of remote file: 941 kB
asserts/pure_logo.png ADDED

Git LFS Details

  • SHA256: f109ce5103134e1ff4f34d52bbea9e6a5d7ee37962757dedab508301b202e0f7
  • Pointer size: 132 Bytes
  • Size of remote file: 1.55 MB
packages.txt ADDED
@@ -0,0 +1 @@
+ ffmpeg
requirements.txt ADDED
@@ -0,0 +1,16 @@
+ --extra-index-url https://pypi.nvidia.com
+
+ torch==2.8.0
+ torchvision==0.23.0
+ transformers==4.57.1
+ accelerate==1.12.0
+ torchcodec==0.7.0
+ numpy
+ pillow==11.3.0
+ joblib==1.5.2
+ einops==0.8.2
+ ninja==1.13.0
+ packaging==26.0
+ spaces
+ modelscope_studio==1.6.1
+ nvidia-npp-cu12
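The requirements file above mixes pinned entries (`name==version`), bare names, and a pip option line. A small hedged sketch of how such lines split apart (pure string handling; the sample entries are taken from the file above):

```python
def parse_requirements(lines):
    """Split requirement lines into {name: version} pins and a list of unpinned names.

    Lines starting with "--" (pip options such as --extra-index-url) and blanks
    are skipped; this is a simplification, not a full PEP 508 parser.
    """
    pinned, unpinned = {}, []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("--"):
            continue
        if "==" in line:
            name, version = line.split("==", 1)
            pinned[name] = version
        else:
            unpinned.append(line)
    return pinned, unpinned

reqs = [
    "--extra-index-url https://pypi.nvidia.com",
    "torch==2.8.0",
    "numpy",
    "transformers==4.57.1",
]
pinned, unpinned = parse_requirements(reqs)
assert pinned["torch"] == "2.8.0"
assert unpinned == ["numpy"]
```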