---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3.5-2B
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video
- multimodal
- video-captioning
- temporal-grounding
- qwen
- text-generation
- VLM
extra_gated_heading: "Access Marlin 2B"
extra_gated_description: "Marlin 2B is free to use. Please share a few details so we can keep you posted on new releases and gather feedback."
extra_gated_fields:
  Full name: text
  Affiliation or company: text
  What do you want to use Marlin for?: text
extra_gated_button_content: "Get access to Marlin 2B"
---
<img src="https://huggingface.co/datasets/NemoStation/marlin-assets/resolve/main/marlin.svg" width="40" align="left" alt="Marlin" />
<h1> <font size="5.5">Marlin: a tiny VLM to extract structured information from videos</font></h1>
<br clear="left"/>

Marlin is a 2B video VLM tuned for the two questions developers actually ask of their videos: **what** is happening, and **when?** It produces structured Scene + Event captions with second-precise timestamps, and resolves natural-language queries to grounded (start, end) spans in the video. At 2B params, it is the strongest open model in its weight class on dense captioning (DREAM-1K, CaReBench) and natural-language temporal grounding (TimeLens-Bench), and competitive with Gemini-2.5 at a fraction of the cost.

## Key features

- **State-of-the-art dense captioning at 2B.** Tops the CaReBench leaderboard and sits between Tarsier-34B and Gemini-1.5-Pro on DREAM-1K, two of the most rigorous fine-grained video-captioning benchmarks in the community.
- **Best-in-class temporal grounding at 2B.** On Tencent's TimeLens-Bench (Charades / ActivityNet / QVHighlights), Marlin beats Qwen2.5-VL-7B by +6.4 mIoU and matches Gemini-2.0-Flash.
- **Built to deploy.** 2B params, vLLM- and swift-deploy-compatible, runs on a single consumer GPU. Same canonical training prompt at inference time, no special wrappers required.
- **Developer-friendly.** Standard HF `transformers` API, two convenience methods (`.caption`, `.find`) that return parsed dicts, raw `.generate()` access for custom prompts, Gradio demo ready out of the box.

<p>
<a href="https://vlm.nemostation.com/">
<img src="https://img.shields.io/badge/Try_it_live-Gradio_demo-FF6B35?style=for-the-badge" alt="Try it live"/>
</a>
<a href="https://nemostation.com/">
<img src="https://img.shields.io/badge/Developed_by-NemoStation_team-7DD3FC?style=for-the-badge" alt="Developed by NemoStation team"/>
</a>
</p>

Need Marlin tailored to your specific video processing needs? Our team can help with custom fine-tuning and integrations: [**contact us**](mailto:aryan@letsnemo.com?subject=Interested%20in%20fine-tuning%20Marlin%202B%20for%20my%20use%20case&body=Hi%20guys%2C%0A%0AI%27d%20love%20to%20chat%20about%20using%20Marlin%202B%20for%20%5Bbriefly%20describe%20your%20use%20case%5D.%0A%0AQuick%20context%3A%0A%E2%80%A2%20Use%20case%3A%0A%E2%80%A2%20Type%20of%20videos%20%2F%20volume%3A%0A%E2%80%A2%20What%20I%27d%20want%20fine-tuned%20or%20integrated%3A%0A%0ADo%20you%20have%20a%20few%20minutes%20for%20a%20call%20this%20week%3F%0A%0AThanks%21)

## Examples

<img src="https://huggingface.co/datasets/NemoStation/marlin-assets/resolve/main/caption_example.jpg" alt="Marlin caption mode example" width="100%"/>
<img src="https://huggingface.co/datasets/NemoStation/marlin-assets/resolve/main/find_example.jpg" alt="Marlin find mode example" width="100%"/>

## Model & training

**Architecture.** Marlin is a fine-tune of Qwen3.5-2B with the video-capable visual tower kept intact. The model exposes two modes (`caption` and `find`) through custom modeling code in `modeling_marlin.py`, which wraps a single canonical training prompt per mode and parses the structured output into typed Python dicts.

**Training data.** We assembled a high-quality training corpus by combining sparse public annotations (ActivityNet, LSMDC, Charades, Charades-Ego, TREC-VTT, WebVid-10M, HC-STVG, VidSTG, TimeLens) with dense re-annotations from **Gemini-3-Flash in thinking mode**, followed by targeted human review on the highest-impact splits. The teacher pipeline was tuned specifically to produce *temporally grounded atomic events and actions*, with explicit `<start-end>` boundaries per claim rather than free-form prose. The final mix is **~400K high-quality clip-level annotations** for caption mode and a separate grounding-tuned split for find mode.

**Training technique.** Two-stage post-training on a single H100. Stage 1 is supervised fine-tuning (SFT) on the curated dataset above, with a fixed canonical prompt per mode and Tarsier-schema output formatting. Stage 2 is preference optimization via **SimPO** (Simple Preference Optimization) on a teacher-distilled preference set. For each clip, candidate completions from the SFT checkpoint are scored by a stronger Gemini-3-Flash judge using a rich rubric (factual accuracy, completeness, temporal alignment). The resulting win/lose pairs align Marlin without a reference model, which makes SimPO cheaper and more stable than DPO at this scale. A recipe paper is coming soon.

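
For readers unfamiliar with SimPO, the per-pair objective can be sketched as follows. This is a minimal illustration of the published SimPO loss, not Marlin's training code: the function name `simpo_loss` is hypothetical, and the `beta` and `gamma` values are placeholders, since the hyperparameters used for Marlin are not stated here.

```python
import math


def simpo_loss(logp_win, len_win, logp_lose, len_lose, beta=2.0, gamma=0.5):
    """SimPO loss for a single preference pair.

    logp_*: summed token log-probabilities of each completion under the policy.
    len_*:  completion lengths in tokens, used for length normalisation.
    beta, gamma: illustrative values only, not Marlin's actual settings.
    """
    # Length-normalised implicit rewards; no reference model is involved.
    reward_win = beta * logp_win / len_win
    reward_lose = beta * logp_lose / len_lose
    margin = reward_win - reward_lose - gamma
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the preferred completion's average per-token log-probability pulls ahead of the rejected one by more than the target margin `gamma`.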
## Evaluation

Marlin is, to our knowledge, the **strongest open video VLM in its weight class** on both axes that matter for video analysis in production: fine-grained dense captioning and natural-language temporal grounding. The three-panel figure below summarises the trajectory from the Qwen3.5-2B base, through Marlin-SFT, to Marlin-SimPO (the release checkpoint) across:

- **CaReBench**: [CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval](https://arxiv.org/abs/2501.00513)
- **DREAM-1K**: [Tarsier: Recipes for Training and Evaluating Large Video Description Models](https://arxiv.org/abs/2407.00634)
- **TimeLens-Bench**: [TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs](https://arxiv.org/abs/2512.14698)

<img src="https://huggingface.co/datasets/NemoStation/marlin-assets/resolve/main/release_marlin_3up.png" alt="Marlin 2B trajectory across CaReBench, DREAM-1K, and TimeLens-Charades" width="100%"/>

Same training pipeline on every panel; same evaluation harness across all rows. On captioning, Marlin closes the gap to its Gemini-2.5-Flash teacher to within 0.21 / 0.43 points out of 10. On temporal grounding, Marlin sits on the Pareto frontier in the 2B band and matches Gemini-2.5-Flash (non-thinking). Specialised 7B+ models on these benchmarks (TimeLens-7B/8B, MiMo-VL, Time-R1) still carry the upper frontier because they are trained on task-specific data; Marlin is the strongest *general-purpose* model on these tasks at 2B.

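
Temporal grounding scores of the mIoU kind are built on the intersection-over-union between a predicted span and the annotated one. The sketch below shows that metric in its usual form. The function names are illustrative, and the actual TimeLens-Bench harness may treat edge cases (for example unparseable outputs) differently.

```python
def temporal_iou(pred, gold):
    """IoU between two (start, end) spans given in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, golds):
    """Average IoU over paired spans; a missing prediction (None) scores 0."""
    scores = [
        temporal_iou(p, g) if p is not None else 0.0
        for p, g in zip(preds, golds)
    ]
    return sum(scores) / len(scores)
```

Here a prediction of `None` stands for an output that could not be parsed into a span, which this sketch counts as a miss.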
## Quickstart

The model ships with custom modeling code that adds two convenience methods (`caption` and `find`) directly to the model object. Loading with `trust_remote_code=True` returns a ready-to-use instance:

```python
import torch
from transformers import AutoModelForCausalLM

marlin = AutoModelForCausalLM.from_pretrained(
    "NemoStation/Marlin-2B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
marlin.compile()  # optional: wraps torch.compile, faster after the first call
```

### Caption mode: `marlin.caption()`

```python
result = marlin.caption("video.mp4")
print(result["caption"])  # full raw caption text (Scene: ... Events: ...)
print(result["scene"])    # parsed Scene paragraph
for ev in result["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")
```

Optional kwargs:

- `max_new_tokens=2048` (default): generation token cap.
- `prompt=None`: override the canonical training prompt (almost always leave as `None`).
- `do_sample=False`, `temperature=1.0`, `top_p=1.0`: sampling controls.

The model was trained on dense captions of variable length and will produce as much detail as it sees fit within `max_new_tokens`.

### Find mode: `marlin.find()`

```python
result = marlin.find("video.mp4", event="a person enters the room")
print(result["raw"])        # "From 14.3 to 18.2." raw model output
print(result["span"])       # (14.3, 18.2) tuple in seconds, or None on parse failure
print(result["format_ok"])  # True if output matched the trained format
```

## System requirements

- `transformers >= 5.7.0` (for native `qwen3_5` architecture)
- `torch >= 2.11.0`
- `torchcodec` (video decoding)
- `qwen-vl-utils >= 0.0.14`
- `av` (torchcodec system dep)
- `pillow`

Install:

```bash
pip install "transformers>=5.7.0" "torch>=2.11.0" torchcodec "qwen-vl-utils>=0.0.14" av pillow
```

## Video preprocessing

The custom modeling code sets these env vars internally (matches the training-time setup). If you want to override them, set them in your shell **before** importing transformers:

| Env var | Default | What it does |
|---|---|---|
| `FORCE_QWENVL_VIDEO_READER` | `torchcodec` | Video decoder backend |
| `VIDEO_MAX_PIXELS` | `200704` | Max pixels per frame (~448×448) |
| `FPS` | `2.0` | Frame sampling rate |
| `FPS_MAX_FRAMES` | `240` | Cap on total frames (covers ~2 min videos) |
| `FPS_MIN_FRAMES` | `4` | Floor for very short videos |

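
To make the interaction between these defaults concrete, here is a small sketch. It assumes that setting the variables through `os.environ` at the very top of a script is equivalent to exporting them in the shell, and that the sampler takes roughly `FPS × duration` frames clamped between the floor and the cap. The override values and the helper `frame_budget` are hypothetical, and the real sampler in `qwen-vl-utils` may also round the count.

```python
import os

# Assumption: overrides must be in the environment before the model code is
# imported, so they go at the very top of the script.
os.environ["FPS"] = "1.0"             # hypothetical override: one frame per second
os.environ["FPS_MAX_FRAMES"] = "120"  # hypothetical tighter frame cap


def frame_budget(duration_s, fps=2.0, min_frames=4, max_frames=240):
    """Approximate sampled-frame count: fps * duration, clamped to [min, max]."""
    return int(min(max(duration_s * fps, min_frames), max_frames))
```

With the documented defaults, a 120-second clip lands exactly on the 240-frame cap, which is where the "~2 min" figure in the table comes from; longer videos are sampled more sparsely.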
## Capabilities

- **Caption** (Mode 1): produces output in the `Scene: <paragraph>` + `Events: <X.X - Y.Y> <description>` format.
- **Find** (Mode 2): given a natural-language event query, returns `From X.X to Y.Y.`.
- **Multichunk reasoning** (limited in this checkpoint): `<think>`-style chunked-video reasoning with explicit chunk-time to source-time arithmetic. Not directly exposed via `.caption()` / `.find()`; use a raw prompt if needed.

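
If you work with raw text instead of the helper methods, the two output formats above can be parsed with a couple of regular expressions. The sketch below is an illustration only, not the parser shipped in `modeling_marlin.py`: the function names are hypothetical, and it assumes event descriptions never contain a `<` character.

```python
import re

# "<X.X - Y.Y> description" entries inside the Events section.
EVENT_RE = re.compile(r"<\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*>\s*([^<]*)")
# "From X.X to Y.Y." answers produced in find mode.
SPAN_RE = re.compile(r"From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)")


def parse_caption(text):
    """Split raw caption-mode text into a scene paragraph and timed events."""
    scene_part, _, events_part = text.partition("Events:")
    scene = scene_part.replace("Scene:", "", 1).strip()
    events = [
        {"start": float(s), "end": float(e), "description": d.strip()}
        for s, e, d in EVENT_RE.findall(events_part)
    ]
    return {"scene": scene, "events": events}


def parse_find(text):
    """Return (start, end) in seconds from find-mode text, or None."""
    m = SPAN_RE.search(text)
    return (float(m.group(1)), float(m.group(2))) if m else None
```

Returning `None` for unmatched find-mode text mirrors the documented behaviour of `result["span"]` on a parse failure.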
## Training data

- **Caption mode**: ANet, LSMDC, YC2, COIN, GOT-10k/LaSOT (Gemini-generated dense captions).
- **Find mode**: HC-STVG, VidSTG, TimeLens (ground-truth spans + multichunk variants).

## Advanced: raw inference

If you want to bypass the helper methods and call `generate()` directly (e.g., for custom prompts), the standard transformers pattern works:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "NemoStation/Marlin-2B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
processor = AutoProcessor.from_pretrained("NemoStation/Marlin-2B", trust_remote_code=True)

# One user turn containing the video and a free-form text prompt.
messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": "Your custom prompt here"},
]}]

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Drop the prompt tokens so only the newly generated text is decoded.
out = out[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(text)
```

## Notes on output

The model emits a `<think>` token at the start of every response (an artifact of training with `add_non_thinking_prefix=True`). The `.caption()` and `.find()` methods strip this automatically. If you're using `generate()` directly, strip `<think>...</think>` (with or without closing tag) from the start of the output.
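
For raw `generate()` output, the stripping step can be done with a single regular expression. This is a minimal sketch, not the logic used inside `.caption()` or `.find()`; the helper name `strip_think` is hypothetical.

```python
import re

# A leading "<think>...</think>" block, or a bare "<think>" with no closing tag.
_THINK_RE = re.compile(r"^\s*<think>(?:.*?</think>)?\s*", re.DOTALL)


def strip_think(text):
    """Remove a leading <think> prefix from raw generate() output."""
    return _THINK_RE.sub("", text, count=1)
```

Text that does not start with `<think>` is returned unchanged, so the helper is safe to apply unconditionally.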