update readme

- .gitattributes +2 -0
- README.md +338 -3
- assets/data-scale-csr-effect.svg +2734 -0
- assets/table2.png +3 -0
- assets/table3.png +3 -0
- assets/training_record/vica-train_grad_norm.svg +0 -0
- assets/training_record/vica-train_learning_rate.svg +1240 -0
- assets/training_record/vica-train_loss_with_ema.svg +0 -0
- assets/vsi-bench-comparison.svg +1993 -0
- assets/vsi-bench-table.png +3 -0
.gitattributes CHANGED

@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.json filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
---
license: apache-2.0
tags:
- multimodal
- vision-language
- video understanding
- spatial reasoning
- visuospatial cognition
- llava
- qwen
- llava-video
datasets:
- nkkbr/ViCA-322K
- nkkbr/ViCA-thinking-2.68k
language:
- en
library_name: transformers
pipeline_tag: visual-question-answering
model_name: ViCA-7B
base_model: lmms-lab/LLaVA-Video-7B-Qwen2
---

# ViCA-7B: Visuospatial Cognitive Assistant

## Overview

**ViCA-7B** is a vision-language model specifically fine-tuned for *visuospatial reasoning* in indoor video environments. Built upon the LLaVA-Video-7B-Qwen2 architecture, it is trained on our newly proposed **ViCA-322K dataset**, which emphasizes both structured spatial annotations and complex instruction-based reasoning tasks.

ViCA-7B achieves **state-of-the-art performance** on [VSI-Bench](https://github.com/vision-x-nyu/thinking-in-space), outperforming proprietary models such as **GPT-4o** and **Gemini-1.5 Pro** as well as larger open-source baselines.

> **ViCA-7B sets a new standard for open-source multimodal spatial reasoning on indoor videos, making it a strong candidate for embodied AI and robotics use cases.**

<p align="center">
  <img src="assets/vsi-bench-comparison.svg" width="700"/>
</p>

<p align="center"><b>Figure 1:</b> Performance comparison of ViCA-7B and other models on <a href="https://github.com/vision-x-nyu/thinking-in-space">VSI-Bench</a>.</p>

## Model Architecture and Training Strategy

ViCA-7B is built upon the [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) framework, using **Qwen2-7B** as the language backbone and **SigLIP** as the visual encoder.

**Key Training Features**

- **Fixed-Length Visual Tokenization**

  Each video is uniformly sampled into 64 frames, and each frame is encoded into 210 visual tokens, for a total of **13,440 visual tokens per example** (see the sketch after this list). This fixed-length design ensures consistent memory usage and stable optimization across batches.

- **Multimodal Alignment via a Lightweight Projector**

  A simple MLP-based projector maps visual embeddings into the language embedding space, enabling effective fusion between video content and textual prompts during both training and inference.

- **Efficient Distributed Training with DeepSpeed**

  Training is conducted with **DeepSpeed ZeRO-3 Offload** on **8× NVIDIA H100 80GB GPUs**, with full parameter and optimizer state partitioning across devices. This setup supports large batch sizes and minimizes GPU memory overhead.

- **Mixed-Precision Computation (fp16)**

  We adopt **mixed-precision training (fp16)** to accelerate computation and reduce memory usage without compromising accuracy, combined with ZeRO-3 partitioning to further enhance training scalability.
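To make these numbers concrete, here is a minimal sketch (not the training code) of the uniform frame sampling and the per-example token budget, plus a LLaVA-style MLP projector. The projector's 1152/3584 layer sizes are assumptions based on the SigLIP and Qwen2-7B hidden dimensions.

```python
import numpy as np
import torch.nn as nn

FRAMES_PER_VIDEO = 64   # uniformly sampled frames per video
TOKENS_PER_FRAME = 210  # visual tokens per encoded frame

# 64 * 210 = 13,440 visual tokens per example; doubling to 128 frames
# gives 26,880 (see "More Frames" under Additional Probing below).
TOKENS_PER_EXAMPLE = FRAMES_PER_VIDEO * TOKENS_PER_FRAME

def uniform_frame_indices(total_frames: int, num_samples: int = FRAMES_PER_VIDEO) -> list[int]:
    # Evenly spaced indices over the whole clip, matching the np.linspace
    # sampling used in the inference script at the end of this card.
    return np.linspace(0, total_frames - 1, num_samples, dtype=int).tolist()

# Hypothetical two-layer MLP projector in the spirit of LLaVA's
# mm_projector; the exact dimensions and activation are assumed.
projector = nn.Sequential(
    nn.Linear(1152, 3584),  # SigLIP hidden size -> Qwen2-7B hidden size (assumed)
    nn.GELU(),
    nn.Linear(3584, 3584),
)

print(TOKENS_PER_EXAMPLE)              # 13440
print(uniform_frame_indices(900)[:4])  # e.g. [0, 14, 28, 42]
```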
The training was conducted over **55 hours**, covering both the base and complex spatial reasoning subsets.

## Training Dynamics

<p align="center">
  <img src="assets/training_record/vica-train_loss_with_ema.svg" width="30%"/>
  <img src="assets/training_record/vica-train_learning_rate.svg" width="30%"/>
  <img src="assets/training_record/vica-train_grad_norm.svg" width="30%"/>
</p>

<p align="center">
  <b>Figure 2:</b> Training loss, learning rate schedule, and gradient norm curves during ViCA-7B fine-tuning. These curves illustrate a stable optimization process and smooth convergence under the DeepSpeed ZeRO-3 setup.
</p>

## Dataset

ViCA-7B is fine-tuned on two complementary datasets:

- [**ViCA-322K**](https://huggingface.co/datasets/nkkbr/ViCA-322K):
  A large-scale dataset covering both **base spatial reasoning tasks** (e.g., object distance, size, count, appearance order) and **complex spatial reasoning tasks** involving natural-language questions and scene understanding. This dataset forms the core of the model's spatial reasoning capabilities.

- [**ViCA-thinking-2.68k**](https://huggingface.co/datasets/nkkbr/ViCA-thinking-2.68k):
  A focused dataset used for instruction tuning to enhance the model's ability to **generate step-by-step reasoning traces** before outputting final answers. This supports more interpretable and cognitively aligned response generation.

For details, please refer to the individual dataset pages linked above.

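Since both datasets are hosted on the Hugging Face Hub, a quick way to peek at a few training examples is to stream them with the `datasets` library. This is a sketch under assumptions: whether `load_dataset` auto-detects the repo layout, and the split and field names, should be checked against the dataset cards.

```python
from itertools import islice

from datasets import load_dataset

# Stream a couple of examples without downloading the full dataset;
# the "train" split name is an assumption, see the dataset card.
vica = load_dataset("nkkbr/ViCA-322K", split="train", streaming=True)
for example in islice(vica, 2):
    print(example)
```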
## Evaluation: VSI-Bench Benchmark

<p align="center">
  <img src="assets/vsi-bench-table.png" width="800"/>
</p>

<p align="center"><b>Figure 3:</b> Quantitative comparison of ViCA-7B and baseline models on <a href="https://github.com/vision-x-nyu/thinking-in-space">VSI-Bench</a>. ViCA-7B achieves the best overall performance across both numerical and multiple-choice tasks.</p>

### Effect of CSR Data

| Configuration        | Avg Score |
|----------------------|-----------|
| Base-only (281K)     | 55.39     |
| Full with CSR (322K) | **60.14** |

> CSR (Complex Spatial Reasoning) data boosts generalization and **accelerates learning**, with notable performance jumps at intermediate checkpoints (e.g., +2.02 between the 50% and 55% checkpoints).

### Data Scale vs. Performance

Performance improves significantly from **5% to 60%** of the data. Beyond **80%**, improvements plateau, indicating that the dataset is well matched to the model's capacity.

<p align="center">
  <img src="assets/data-scale-csr-effect.svg" width="750"/>
</p>

<p align="center"><b>Figure 4:</b> Performance of ViCA-7B under varying training data sizes (from 5% to 100%). The full dataset (including Complex Spatial Reasoning, CSR) consistently outperforms the base-only configuration. Notably, the CSR-enhanced model shows a +2.02 score jump between 50% and 55%, and a final performance gain of +4.75 at full scale. Performance plateaus beyond 80%, indicating the dataset is well aligned with the model capacity.</p>

## Intermediate Checkpoints and Evaluation Outputs

To support detailed analysis and reproducibility, we provide two sets of intermediate checkpoints, saved at every **5% increment** of the training data. These models are trained for a single epoch and are useful for understanding how performance evolves as training progresses.

We also release the corresponding **raw evaluation outputs** (e.g., `.json` prediction files) for each checkpoint. The evaluation script used to produce these outputs is available in our [GitHub repository](https://github.com/nkkbr/ViCA).

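Because the raw outputs live inside this model repo, they can be fetched selectively with `huggingface_hub`; a minimal sketch, using the `raw_evaluation_outputs/` paths listed below:

```python
from huggingface_hub import snapshot_download

# Download only the VSI-Bench prediction files for the full-data series,
# skipping the model weights.
local_dir = snapshot_download(
    repo_id="nkkbr/ViCA",
    allow_patterns=["raw_evaluation_outputs/vsi-bench_all_data/*"],
)
print(local_dir)
```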
### Full Dataset (ViCA-322K: Base + CSR)

This series corresponds to the full training set, including both base spatial reasoning and complex spatial reasoning (CSR):

| Data Usage | Checkpoint | Data Usage | Checkpoint |
| ---------- | ---------- | ---------- | ---------- |
| 5% | [`nkkbr/ViCA-5p`](https://huggingface.co/nkkbr/ViCA-5p) | 55% | [`nkkbr/ViCA-55p`](https://huggingface.co/nkkbr/ViCA-55p) |
| 10% | [`nkkbr/ViCA-10p`](https://huggingface.co/nkkbr/ViCA-10p) | 60% | [`nkkbr/ViCA-60p`](https://huggingface.co/nkkbr/ViCA-60p) |
| 15% | [`nkkbr/ViCA-15p`](https://huggingface.co/nkkbr/ViCA-15p) | 65% | [`nkkbr/ViCA-65p`](https://huggingface.co/nkkbr/ViCA-65p) |
| 20% | [`nkkbr/ViCA-20p`](https://huggingface.co/nkkbr/ViCA-20p) | 70% | [`nkkbr/ViCA-70p`](https://huggingface.co/nkkbr/ViCA-70p) |
| 25% | [`nkkbr/ViCA-25p`](https://huggingface.co/nkkbr/ViCA-25p) | 75% | [`nkkbr/ViCA-75p`](https://huggingface.co/nkkbr/ViCA-75p) |
| 30% | [`nkkbr/ViCA-30p`](https://huggingface.co/nkkbr/ViCA-30p) | 80% | [`nkkbr/ViCA-80p`](https://huggingface.co/nkkbr/ViCA-80p) |
| 35% | [`nkkbr/ViCA-35p`](https://huggingface.co/nkkbr/ViCA-35p) | 85% | [`nkkbr/ViCA-85p`](https://huggingface.co/nkkbr/ViCA-85p) |
| 40% | [`nkkbr/ViCA-40p`](https://huggingface.co/nkkbr/ViCA-40p) | 90% | [`nkkbr/ViCA-90p`](https://huggingface.co/nkkbr/ViCA-90p) |
| 45% | [`nkkbr/ViCA-45p`](https://huggingface.co/nkkbr/ViCA-45p) | 95% | [`nkkbr/ViCA-95p`](https://huggingface.co/nkkbr/ViCA-95p) |
| 50% | [`nkkbr/ViCA-50p`](https://huggingface.co/nkkbr/ViCA-50p) | 100% (this repo) | [`nkkbr/ViCA`](https://huggingface.co/nkkbr/ViCA) |

Raw evaluation outputs are available [here](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_all_data/).

### Base-only Subset (ViCA-322K: Base)

This series is trained **only** on the base spatial reasoning subset of ViCA-322K, without any CSR examples:

| Data Usage | Checkpoint | Data Usage | Checkpoint |
| ---------- | ---------- | ---------- | ---------- |
| 5% | [`nkkbr/ViCA-base-5p`](https://huggingface.co/nkkbr/ViCA-base-5p) | 55% | [`nkkbr/ViCA-base-55p`](https://huggingface.co/nkkbr/ViCA-base-55p) |
| 10% | [`nkkbr/ViCA-base-10p`](https://huggingface.co/nkkbr/ViCA-base-10p) | 60% | [`nkkbr/ViCA-base-60p`](https://huggingface.co/nkkbr/ViCA-base-60p) |
| 15% | [`nkkbr/ViCA-base-15p`](https://huggingface.co/nkkbr/ViCA-base-15p) | 65% | [`nkkbr/ViCA-base-65p`](https://huggingface.co/nkkbr/ViCA-base-65p) |
| 20% | [`nkkbr/ViCA-base-20p`](https://huggingface.co/nkkbr/ViCA-base-20p) | 70% | [`nkkbr/ViCA-base-70p`](https://huggingface.co/nkkbr/ViCA-base-70p) |
| 25% | [`nkkbr/ViCA-base-25p`](https://huggingface.co/nkkbr/ViCA-base-25p) | 75% | [`nkkbr/ViCA-base-75p`](https://huggingface.co/nkkbr/ViCA-base-75p) |
| 30% | [`nkkbr/ViCA-base-30p`](https://huggingface.co/nkkbr/ViCA-base-30p) | 80% | [`nkkbr/ViCA-base-80p`](https://huggingface.co/nkkbr/ViCA-base-80p) |
| 35% | [`nkkbr/ViCA-base-35p`](https://huggingface.co/nkkbr/ViCA-base-35p) | 85% | [`nkkbr/ViCA-base-85p`](https://huggingface.co/nkkbr/ViCA-base-85p) |
| 40% | [`nkkbr/ViCA-base-40p`](https://huggingface.co/nkkbr/ViCA-base-40p) | 90% | [`nkkbr/ViCA-base-90p`](https://huggingface.co/nkkbr/ViCA-base-90p) |
| 45% | [`nkkbr/ViCA-base-45p`](https://huggingface.co/nkkbr/ViCA-base-45p) | 95% | [`nkkbr/ViCA-base-95p`](https://huggingface.co/nkkbr/ViCA-base-95p) |
| 50% | [`nkkbr/ViCA-base-50p`](https://huggingface.co/nkkbr/ViCA-base-50p) | 100% | [`nkkbr/ViCA-base`](https://huggingface.co/nkkbr/ViCA-base) |

Raw evaluation outputs are available [here](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_only_base/).

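Every checkpoint in these tables is a full model repo, so any of them loads exactly like the final model; a minimal sketch using the 50% full-data checkpoint (swap in any repo id from the tables above):

```python
from llava.model.builder import load_pretrained_model

# Same call as in the inference script below; only the repo id differs.
tokenizer, model, image_processor, max_length = load_pretrained_model(
    "nkkbr/ViCA-50p", None, "llava_qwen",
    torch_dtype="bfloat16", device_map="auto",
)
model.eval()
```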
## Source-wise Checkpoints

While the full **ViCA-322K** dataset was curated by us, the underlying videos and associated metadata are sourced from three distinct indoor video datasets:

* **ARKitScenes**
* **ScanNet**
* **ScanNet++**

To better understand how each source contributes to model performance, we fine-tuned ViCA-7B on subsets of ViCA-322K that exclusively use data from each source. For each subset, we provide checkpoints trained with **10% increments** of the available data, from 10% to 100%.

Corresponding **raw evaluation outputs** (e.g., `.json` predictions) are also provided for all checkpoints.

### ARKitScenes-Only Checkpoints

| Data Usage | Checkpoint | Data Usage | Checkpoint |
| ---------- | ---------- | ---------- | ---------- |
| 10% | [`nkkbr/ViCA-ARKitScenes-10p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-10p) | 60% | [`nkkbr/ViCA-ARKitScenes-60p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-60p) |
| 20% | [`nkkbr/ViCA-ARKitScenes-20p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-20p) | 70% | [`nkkbr/ViCA-ARKitScenes-70p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-70p) |
| 30% | [`nkkbr/ViCA-ARKitScenes-30p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-30p) | 80% | [`nkkbr/ViCA-ARKitScenes-80p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-80p) |
| 40% | [`nkkbr/ViCA-ARKitScenes-40p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-40p) | 90% | [`nkkbr/ViCA-ARKitScenes-90p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-90p) |
| 50% | [`nkkbr/ViCA-ARKitScenes-50p`](https://huggingface.co/nkkbr/ViCA-ARKitScenes-50p) | 100% | [`nkkbr/ViCA-ARKitScenes`](https://huggingface.co/nkkbr/ViCA-ARKitScenes) |

🔗 Raw evaluation outputs: [ARKitScenes results](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_arkitscenes/)

### ScanNet++-Only Checkpoints

| Data Usage | Checkpoint | Data Usage | Checkpoint |
| ---------- | ---------- | ---------- | ---------- |
| 10% | [`nkkbr/ViCA-ScanNetPP-10p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-10p) | 60% | [`nkkbr/ViCA-ScanNetPP-60p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-60p) |
| 20% | [`nkkbr/ViCA-ScanNetPP-20p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-20p) | 70% | [`nkkbr/ViCA-ScanNetPP-70p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-70p) |
| 30% | [`nkkbr/ViCA-ScanNetPP-30p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-30p) | 80% | [`nkkbr/ViCA-ScanNetPP-80p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-80p) |
| 40% | [`nkkbr/ViCA-ScanNetPP-40p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-40p) | 90% | [`nkkbr/ViCA-ScanNetPP-90p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-90p) |
| 50% | [`nkkbr/ViCA-ScanNetPP-50p`](https://huggingface.co/nkkbr/ViCA-ScanNetPP-50p) | 100% | [`nkkbr/ViCA-ScanNetPP`](https://huggingface.co/nkkbr/ViCA-ScanNetPP) |

🔗 Raw evaluation outputs: [ScanNet++ results](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_scannetpp/)

### ScanNet-Only Checkpoints

| Data Usage | Checkpoint | Data Usage | Checkpoint |
| ---------- | ---------- | ---------- | ---------- |
| 10% | [`nkkbr/ViCA-ScanNet-10p`](https://huggingface.co/nkkbr/ViCA-ScanNet-10p) | 60% | [`nkkbr/ViCA-ScanNet-60p`](https://huggingface.co/nkkbr/ViCA-ScanNet-60p) |
| 20% | [`nkkbr/ViCA-ScanNet-20p`](https://huggingface.co/nkkbr/ViCA-ScanNet-20p) | 70% | [`nkkbr/ViCA-ScanNet-70p`](https://huggingface.co/nkkbr/ViCA-ScanNet-70p) |
| 30% | [`nkkbr/ViCA-ScanNet-30p`](https://huggingface.co/nkkbr/ViCA-ScanNet-30p) | 80% | [`nkkbr/ViCA-ScanNet-80p`](https://huggingface.co/nkkbr/ViCA-ScanNet-80p) |
| 40% | [`nkkbr/ViCA-ScanNet-40p`](https://huggingface.co/nkkbr/ViCA-ScanNet-40p) | 90% | [`nkkbr/ViCA-ScanNet-90p`](https://huggingface.co/nkkbr/ViCA-ScanNet-90p) |
| 50% | [`nkkbr/ViCA-ScanNet-50p`](https://huggingface.co/nkkbr/ViCA-ScanNet-50p) | 100% | [`nkkbr/ViCA-ScanNet`](https://huggingface.co/nkkbr/ViCA-ScanNet) |

🔗 Raw evaluation outputs: [ScanNet results](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_scannet/)

## Additional Probing

### Time Instructions

Including 64 frame timestamps in the prompt slightly **hurts** performance, suggesting that the model fails to leverage temporal alignment and is negatively affected by the added instruction verbosity.

<p align="center">
  <img src="assets/table3.png" width="400"/>
</p>

<p align="center"><b>Figure 5:</b> Adding explicit frame timestamps (64 values) degrades model performance on VSI-Bench, indicating an inability to exploit temporal alignment and sensitivity to prompt length.</p>

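For reference, the probed prompt prefix follows the (commented-out) `time_instruction` line in the inference script at the end of this card; the timestamp values below are illustrative, not from an actual clip.

```python
# Reconstruct the time-instruction prefix used in this probe.
video_time = 87.30                                       # illustrative clip length (s)
frame_times = [i * video_time / 63 for i in range(64)]   # 64 uniform timestamps
frame_time = ",".join(f"{t:.2f}s" for t in frame_times)
time_instruction = (
    f"The video lasts for {video_time:.2f} seconds, and 64 frames are "
    f"uniformly sampled from it. These frames are located at {frame_time}. "
    "Please answer the following questions related to this video."
)
print(time_instruction)
```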
---

### More Frames

Increasing the input from 64 to 128 frames doubles the number of visual tokens (13,440 → 26,880) but yields **no performance gain**, highlighting overfitting to a fixed token length and architectural inflexibility.

<p align="center">
  <img src="assets/table2.png" width="400"/>
</p>

<p align="center"><b>Figure 6:</b> Comparison between 64-frame and 128-frame inputs. Despite doubling the visual token count, performance remains unchanged, indicating overfitting to fixed-length input and limited adaptability to variable-length sequences.</p>

## Potential Applications

ViCA-7B supports a broad range of spatially grounded multimodal applications:

- **Indoor navigation assistants**
- **Robotics planning and spatial querying**
- **Smart room arrangement and AR layout analysis**
- **Scene understanding for embodied AI agents**

## Known Limitations

- **Limited temporal reasoning**: time instructions are not effectively utilized
- **Frame-scaling issues**: the model expects a fixed input length
- **No depth or point-cloud input**: only RGB video is supported
- **Zero-shot generalization**: good, but not task-agnostic

## Inference

*Here is a runnable example using ViCA-7B on a VSI-Bench question.*

```python
# This inference script is adapted from:
# https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2
#
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git

import copy
import warnings

import numpy as np
import torch
from datasets import load_dataset
from decord import VideoReader, cpu

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token
from llava.model.builder import load_pretrained_model

warnings.filterwarnings("ignore")


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Decode a video and return (frames, frame-time string, total duration)."""
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3))
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    fps = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, len(vr), fps)]
    frame_time = [i / fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        # Uniformly resample exactly max_frames_num frames across the clip.
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time


pretrained = "nkkbr/ViCA"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
# Pass any additional llava_model_args here if needed.
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map
)
model.eval()

# Pick one VSI-Bench question.
vsi_bench = load_dataset("nyu-visionx/VSI-Bench")["test"]
data_curr = vsi_bench[1000]

video_path = "[VIDEO PATH]"
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16)
video = [video]

conv_template = "qwen_1_5"
# time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}. Please answer the following questions related to this video."
time_instruction = ""

question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n\n"
question += "These are frames of a video.\n\n"
question += f"Question: {data_curr['question']}\n"
if data_curr["options"] is not None:
    question += "\n".join(data_curr["options"]) + "\n"
    question += "Answer with the option's letter from the given choices directly.\n"
else:
    question += "Please answer the question using a single word or phrase.\n"
print(f"Prompt:\n{question}")

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(
    prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(device)

cont = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=1024,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()

print(repr(text_outputs))
```

---
assets/data-scale-csr-effect.svg ADDED
assets/table2.png ADDED (Git LFS)
assets/table3.png ADDED (Git LFS)
assets/training_record/vica-train_grad_norm.svg ADDED
assets/training_record/vica-train_learning_rate.svg ADDED
assets/training_record/vica-train_loss_with_ema.svg ADDED
assets/vsi-bench-comparison.svg ADDED
assets/vsi-bench-table.png ADDED (Git LFS)