Jonna Marie Matthiesen, Claude Opus 4.6 (1M context) committed about 1 month ago

Commit

fcac823

1 Parent(s): e824fce

Update README: migrate workflow from embedl-models to flash-head

Replace embedl-models / Docker container instructions with the
flash-head vLLM plugin workflow (pip install flash-head). Update
code examples to use standard vLLM imports and add GitHub badge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (1)

README.md +119 -0

README.md ADDED

	@@ -0,0 +1,119 @@

+---
+base_model:
+- nvidia/Cosmos-Reason2-8B
+tags:
+- nvidia
+- cosmos
+- cosmos-reason2
+- multimodal
+- vlm
+- quantized
+- flashhead
+- qwen3_vl
+pipeline_tag: image-text-to-text
+license: other
+license_name: embedl-models-community-licence-1.0
+license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
+---
+# Cosmos-Reason2-8B-W4A16-FlashHead
+[![GitHub](https://img.shields.io/badge/GitHub-flash--head-black?logo=github)](https://github.com/embedl/flash-head)
+**Optimized version of [nvidia/Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B) using quantization and FlashHead, Embedl's efficient replacement for the language model head.**
+Designed for **low-latency inference** on **NVIDIA GPUs**, leveraging:
+- FlashHead
+- Quantization (W4A16)
+- vLLM plugin via [`flash-head`](https://github.com/embedl/flash-head)
+---
+## Model Details
+| **Field** | **Value** |
+|---|---|
+| **Base Model** | [nvidia/Cosmos-Reason2-8B](https://huggingface.co/nvidia/Cosmos-Reason2-8B) |
+| **Input / Output** | Text + Image / Video -> Text |
+| **Optimizations** | FlashHead LM Head + Quantization (W4A16) |
+| **Developers** | Embedl |
+| **Licenses** | Upstream: [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license). <br>Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
+---
+## Installation
+```bash
+pip install flash-head
+```
+The [`flash-head`](https://github.com/embedl/flash-head) vLLM plugin is required. Once installed, it activates automatically when vLLM starts; no extra flags or imports are needed.
+---
+## Usage Examples
+### vLLM Serve
+```bash
+vllm serve embedl/Cosmos-Reason2-8B-W4A16-FlashHead \
+    --max-model-len 8192 \
+    --gpu-memory-utilization 0.75
+```
+### vLLM Video Inference
+```python
+from vllm import LLM, SamplingParams
+if __name__ == "__main__":
+    model = "embedl/Cosmos-Reason2-8B-W4A16-FlashHead"
+    video_url = "https://nvidia-cosmos.github.io/cosmos-cookbook/gallery/vs_assets/clip_1_short.mp4"
+    messages = [
+        {
+            "role": "system",
+            "content": [{"type": "text", "text": "You are a helpful assistant."}],
+        },
+        {
+            "role": "user",
+            "content": [
+                {"type": "video_url", "video_url": {"url": video_url, "fps": 4}},
+                {"type": "text", "text": "Describe this video in detail."},
+            ],
+        },
+    ]
+    llm = LLM(
+        model=model,
+        limit_mm_per_prompt={
+            "video": {"count": 1, "num_frames": 12, "width": 1280, "height": 720},
+            "image": 0,
+            "audio": 0,
+        },
+        media_io_kwargs={"video": {"num_frames": -1}},
+        max_model_len=8192,
+        mm_processor_kwargs={"truncation": False},
+        gpu_memory_utilization=0.75,
+        trust_remote_code=True,
+    )
+    output = llm.chat(messages, sampling_params=SamplingParams(temperature=0.0, max_tokens=256))
+    print(output[0].outputs[0].text)
+```
+---
+## License
+- **Upstream:** [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license)
+- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*
+---
+## Contact
+- Enterprise and Commercial Inquiries: `models@embedl.com`
+- Technical Issues and Early Access: [`https://github.com/embedl/flash-head`](https://github.com/embedl/flash-head)
+- More Information and Model Releases: `https://embedl.com`
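
Reviewer note on the `vLLM Serve` example in the README above: the README shows how to start the server but not how to query it. The sketch below is one way to do that against vLLM's OpenAI-compatible `/v1/chat/completions` endpoint, using only the Python standard library. It is not part of this commit. It assumes the server runs locally on vLLM's default port 8000; the helper names `build_chat_payload`, `build_request`, and `ask` are hypothetical and chosen for illustration.

```python
import json
import urllib.request

# Assumption: `vllm serve` is running on this machine with its default port.
BASE_URL = "http://localhost:8000/v1"
MODEL = "embedl/Cosmos-Reason2-8B-W4A16-FlashHead"


def build_chat_payload(video_url, prompt, max_tokens=256):
    """Build a chat-completions request body with one video and one text part."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "system",
                "content": [{"type": "text", "text": "You are a helpful assistant."}],
            },
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            },
        ],
        "temperature": 0.0,  # greedy decoding, matching the README example
        "max_tokens": max_tokens,
    }


def build_request(payload, base_url=BASE_URL):
    """Wrap the payload in a POST request for the chat-completions endpoint."""
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(video_url, prompt):
    """Send the request and return the generated text (requires a running server)."""
    request = build_request(build_chat_payload(video_url, prompt))
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server from the README running, `print(ask(video_url, "Describe this video in detail."))` would print the model's description, where `video_url` is the clip URL used in the README's video inference example. Splitting payload construction from the network call keeps the request shape easy to inspect without a GPU.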