Update README.md
README.md
@@ -1,3 +1,114 @@
---
license: apache-2.0
library_name: onnx
tags:
- vision-language
- vlm
- image-captioning
- vqa
- moondream
- onnx
base_model: vikhyatk/moondream2
pipeline_tag: image-to-text
language:
- en
---

# Moondream2: Compact VLM (ONNX)

ONNX export of [vikhyatk/moondream2](https://huggingface.co/vikhyatk/moondream2), a 1.86B-parameter vision-language model from Vikhyat Korrapati. Surprisingly capable for its size on captioning, visual question answering, and basic spatial grounding. Apache-2.0 throughout.

Re-hosted under Heliosoph for distribution stability: `vikhyatk/moondream2` is the authoritative upstream, but it doesn't publish its own ONNX exports, and the upstream file layout drifts across revisions. This repo is the canonical ONNX form for the model.

Credit: Vikhyat Korrapati, [Moondream](https://github.com/vikhyat/moondream).

## What this repo contains

Moondream2 is multi-component: Optimum's ONNX exporter splits it into separate files for the vision encoder, text decoder, and token embeddings. **All files must be present in the same directory** for the model to load.

```
vision_encoder.onnx          # SigLIP-based image encoder
decoder_model_merged.onnx    # Phi-1.5-based text decoder (cache and no-cache branches merged into one graph)
embed_tokens.onnx            # Token embedding layer (separated for inference efficiency)
config.json                  # HuggingFace model config
generation_config.json       # Decoder generation defaults (max_length, EOS, etc.)
preprocessor_config.json     # Image preprocessing (resize, normalize)
tokenizer.json               # Tokenizer vocab + merges
tokenizer_config.json
special_tokens_map.json
```

If a component's weights exceed the 2 GB protobuf limit, Optimum emits a sibling `.onnx.data` external-data file alongside the `.onnx`. Keep them together; the `.onnx` references the `.data` by relative filename.
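
Because a missing component only shows up as a load failure, a quick pre-flight check can help. The sketch below is illustrative only: the `REQUIRED_FILES` list is copied from the listing above, and the helper names `missing_files` and `external_data_files` are hypothetical. It checks file names only and does not validate the ONNX graphs themselves.

```python
from pathlib import Path

# File names copied from the listing above; adjust if your export differs.
REQUIRED_FILES = [
    "vision_encoder.onnx",
    "decoder_model_merged.onnx",
    "embed_tokens.onnx",
    "config.json",
    "generation_config.json",
    "preprocessor_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
]


def missing_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def external_data_files(model_dir):
    """Return sorted names of .onnx.data files that must travel with their .onnx."""
    return sorted(p.name for p in Path(model_dir).glob("*.onnx.data"))
```

Calling `missing_files(".")` from the repo directory should return an empty list when everything is in place; anything reported by `external_data_files(".")` needs to be copied along with its `.onnx` file.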

## Input / output shape

| Stage | Input | Output |
|---|---|---|
| Vision encoder | RGB image, NCHW float32, preprocessor-normalized | Image feature tokens |
| Text decoder | Image features + input token ids + KV cache | Next-token logits + updated KV cache |
| Embed tokens | Token ids | Token embeddings (fed back into decoder) |

Exact tensor shapes and names depend on the Optimum version used at export; verify in Netron before wiring.
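
As an alternative to opening each file in Netron, the names and shapes can be printed from Python. This is a minimal sketch, assuming `onnxruntime` is installed. `describe_io` and `describe_file` are hypothetical helper names; `get_inputs()`, `get_outputs()`, and the `name`, `type`, and `shape` fields are the standard ONNX Runtime session metadata.

```python
def describe_io(session):
    """Return 'kind name type shape' lines for every graph input and output."""
    lines = []
    for kind, nodes in (("input", session.get_inputs()), ("output", session.get_outputs())):
        for node in nodes:
            lines.append(f"{kind:<6} {node.name}  {node.type}  {node.shape}")
    return lines


def describe_file(path):
    """Load one .onnx file on CPU and describe its inputs and outputs."""
    import onnxruntime as ort  # imported lazily so describe_io stays dependency-free

    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    return describe_io(session)


# Example (needs the model files from this repo):
# for name in ("vision_encoder.onnx", "embed_tokens.onnx", "decoder_model_merged.onnx"):
#     print(name, *describe_file(name), sep="\n  ")
```

Symbolic dimensions show up as strings in `shape`, which is a quick way to see which axes are dynamic.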

## How to use

The runtime pattern is **greedy decoding orchestrated outside the ONNX graph**, similar to the standard encoder-decoder pattern for ONNX-exported LLMs:

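```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# NOTE: every tensor name below ("pixel_values", "inputs_embeds",
# "image_features", "past_key_values*") is a placeholder. List the real
# names with session.get_inputs() or Netron and adjust before running.
vision_enc = ort.InferenceSession("vision_encoder.onnx")
text_dec = ort.InferenceSession("decoder_model_merged.onnx")
embed = ort.InferenceSession("embed_tokens.onnx")
tokenizer = AutoTokenizer.from_pretrained(".")  # assumes this repo's tokenizer files are in the CWD


def generate(preprocessed_image, max_new_tokens=64):
    """Greedy-decode text for one image (float32 NCHW, already normalized)."""
    # 1. Encode the image
    image_features = vision_enc.run(None, {"pixel_values": preprocessed_image})[0]

    # 2. Greedy decode loop with KV cache.
    # ONNX Runtime cannot take None as an input, so step 0 feeds a
    # zero-length cache. Assumes float32 cache tensors shaped
    # (batch, heads, past_len, head_dim) with static heads and head_dim.
    kv_inputs = [i for i in text_dec.get_inputs() if i.name.startswith("past_key_values")]
    past_kv = {i.name: np.zeros((1, i.shape[1], 0, i.shape[3]), dtype=np.float32)
               for i in kv_inputs}

    input_ids = np.array([[tokenizer.bos_token_id]], dtype=np.int64)
    generated = []
    for _ in range(max_new_tokens):
        embeds = embed.run(None, {"input_ids": input_ids})[0]
        outputs = text_dec.run(None, {
            "inputs_embeds": embeds,
            "image_features": image_features,
            **past_kv,
        })
        next_token = int(outputs[0][0, -1].argmax())  # greedy pick for batch item 0
        if next_token == tokenizer.eos_token_id:
            break
        generated.append(next_token)
        input_ids = np.array([[next_token]], dtype=np.int64)
        # Assumes cache outputs follow the logits in the same order as the cache inputs.
        past_kv = {i.name: out for i, out in zip(kv_inputs, outputs[1:])}

    return tokenizer.decode(generated, skip_special_tokens=True)


# text = generate(preprocessed_image)  # preprocessed_image is supplied by the caller
```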

The `onnxruntime-genai` model builder doesn't currently accept Moondream2's architecture, so raw onnxruntime sessions plus a hand-rolled decode loop are the way to go (same shape as the TrOCR / Florence-2 patterns).

## When to pick Moondream2

- **Compact VLM use cases**: 1.86B params, ~2 GB on disk; runs on CPU at usable latency.
- **Captioning + VQA**: short-form image-to-text. Punches above its size class.
- **Side-by-side VLM comparison**: pairs well with Florence-2 (similar size, different architecture) and Phi-3.5 Vision (~2× larger, different training) for "small VLM" evals.

For larger / higher-quality VLM tasks, reach for [Phi-3.5 Vision](https://huggingface.co/Heliosoph/phi-3.5-vision-instruct-onnx) or upstream Qwen2-VL / Llama-3.2 Vision. For OCR-specific use, [Florence-2](https://huggingface.co/Heliosoph/florence-2-base-ft-fp16-onnx) has dedicated task tokens that often beat generalist VLMs on document text.

## Provenance + reproducibility caveat

The ONNX export in this repo was done locally via Optimum from the upstream PyTorch weights as a one-off, with no checked-in reproducible script. If a clean re-export is ever needed (for example, because Optimum / transformers / Moondream2 version churn breaks an inference path), the rough recipe is:

```bash
optimum-cli export onnx \
  --model vikhyatk/moondream2 \
  --task image-to-text \
  --trust-remote-code \
  ./moondream2-onnx-staging/
```

Verify the produced file list matches what's shipped here; Optimum's exact output filenames depend on its version. Watch for any new component files (Moondream2 has had architecture tweaks across versions that could add or remove split points).
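
One way to run that check is to diff the two directory listings. This is a minimal sketch with a hypothetical helper name, `compare_file_lists`; the `./moondream2-onnx` path in the usage comment is a placeholder for wherever this repo is checked out. It compares file names only, not contents.

```python
from pathlib import Path


def compare_file_lists(shipped_dir, staging_dir):
    """Return (only_in_shipped, only_in_staging) as sorted lists of file names."""
    shipped = {p.name for p in Path(shipped_dir).iterdir() if p.is_file()}
    staging = {p.name for p in Path(staging_dir).iterdir() if p.is_file()}
    return sorted(shipped - staging), sorted(staging - shipped)


# Example with placeholder paths:
# removed, added = compare_file_lists("./moondream2-onnx", "./moondream2-onnx-staging")
# print("missing from re-export:", removed)
# print("new in re-export:", added)
```

Two empty lists mean the re-export produced the same set of file names; any entry in either list points at a renamed, added, or dropped component.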

## License

**Apache-2.0**, same as upstream `vikhyatk/moondream2`. `LICENSE` file included. The ONNX-export step doesn't change licensing: same model, different serialization format.