---
license: apache-2.0
library_name: onnx
tags:
- vision-language
- vlm
- image-captioning
- vqa
- moondream
- onnx
base_model: vikhyatk/moondream2
pipeline_tag: image-to-text
language:
- en
---

# Moondream2: Compact VLM (ONNX)

ONNX export of [vikhyatk/moondream2](https://huggingface.co/vikhyatk/moondream2), a 1.86B-param vision-language model from Vikhyat Korrapati. Surprisingly capable for its size on captioning, visual question answering, and basic spatial grounding. Apache-2.0 throughout.

Re-hosted under Heliosoph for distribution stability: `vikhyatk/moondream2` is the authoritative upstream, but it doesn't publish its own ONNX exports, and the upstream file layout drifts across revisions. This repo is the canonical ONNX form for the model.

Credit: Vikhyat Korrapati, [Moondream](https://github.com/vikhyat/moondream).

## What this repo contains

Moondream2 is a multi-component model: Optimum's ONNX exporter splits it into separate files for the vision encoder, text decoder, and token embeddings. **All files must be present in the same directory** for the model to load.

```
vision_encoder.onnx              # SigLIP-based image encoder
decoder_model_merged.onnx        # Phi-1.5-based text decoder (with KV cache merged in)
embed_tokens.onnx                # Token embedding layer (separated for inference efficiency)
config.json                      # HuggingFace model config
generation_config.json           # Decoder generation defaults (max_length, EOS, etc.)
preprocessor_config.json         # Image preprocessing (resize, normalize)
tokenizer.json                   # Tokenizer vocab + merges
tokenizer_config.json
special_tokens_map.json
```

If a component's weights exceed the 2GB protobuf limit, Optimum emits a sibling `.onnx.data` external-data file alongside the `.onnx`. Keep them together; the `.onnx` references the `.data` by relative filename.
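
A quick pre-flight check can catch a missing component before onnxruntime fails with a less obvious error. The sketch below assumes the file names from the listing above (they may differ for exports made with another Optimum version); the helper name `check_model_dir` is hypothetical.

```python
from pathlib import Path

# File names copied from the listing above; adjust if your export differs.
REQUIRED_FILES = [
    "vision_encoder.onnx",
    "decoder_model_merged.onnx",
    "embed_tokens.onnx",
    "config.json",
    "generation_config.json",
    "preprocessor_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
]


def check_model_dir(model_dir):
    """Return (missing, external_data) for a local copy of this repo.

    missing: required file names not found in model_dir.
    external_data: names of any *.onnx.data files sitting next to the graphs.
    """
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    external_data = sorted(p.name for p in root.glob("*.onnx.data"))
    return missing, external_data
```

An empty `missing` list means every listed file is in place; `external_data` only reports which `.onnx.data` files exist, it cannot tell whether a graph needs one.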

## Input / output shape

| Stage | Input | Output |
|---|---|---|
| Vision encoder | RGB image, NCHW float32, preprocessor-normalized | Image feature tokens |
| Text decoder | Image features + input token ids + KV cache | Next-token logits + updated KV cache |
| Embed tokens | Token ids | Token embeddings (fed back into decoder) |

Exact tensor shapes and names depend on the Optimum version used at export; verify in Netron before wiring.
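
If Netron isn't handy, the same information can be read from an onnxruntime session. This is a minimal sketch: `describe_io` and `print_io` are hypothetical helper names, and the only onnxruntime calls used are `InferenceSession`, `get_inputs()` and `get_outputs()`.

```python
def describe_io(session):
    """Collect (name, shape, dtype) for every input and output of a session."""
    def rows(nodes):
        return [(node.name, list(node.shape), node.type) for node in nodes]

    return {"inputs": rows(session.get_inputs()),
            "outputs": rows(session.get_outputs())}


def print_io(onnx_path):
    # Imported here so describe_io can be reused without onnxruntime installed.
    import onnxruntime as ort

    info = describe_io(ort.InferenceSession(onnx_path))
    for kind, rows in info.items():
        for name, shape, dtype in rows:
            print(f"{kind[:-1]:6s} {name:40s} {shape} {dtype}")
```

Calling `print_io("decoder_model_merged.onnx")` lists the decoder's real input names, including however the KV cache is exposed in this export. Symbolic dimensions show up as strings rather than integers.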

## How to use

The runtime pattern is **greedy decoding orchestrated outside the ONNX graph**, similar to the standard encoder-decoder pattern for ONNX-exported LLMs:

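```python
import json

import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer  # assumption: the `tokenizers` package is installed

vision_enc = ort.InferenceSession("vision_encoder.onnx")
text_dec   = ort.InferenceSession("decoder_model_merged.onnx")
embed      = ort.InferenceSession("embed_tokens.onnx")
tokenizer  = Tokenizer.from_file("tokenizer.json")

# Assumption: generation_config.json uses the usual Hugging Face keys.
with open("generation_config.json") as f:
    gen_cfg = json.load(f)
BOS_TOKEN, EOS_TOKEN = gen_cfg["bos_token_id"], gen_cfg["eos_token_id"]
max_new_tokens = 64

# 1. Encode the image
# preprocessed_image: float32 NCHW array prepared per preprocessor_config.json.
image_features = vision_enc.run(None, {"pixel_values": preprocessed_image})[0]

# 2. Greedy decode loop with KV cache
# All input names here are illustrative; confirm them in Netron. Merged Optimum
# decoders may also expect extras such as `use_cache_branch` or `attention_mask`.
# onnxruntime rejects None inputs, so step 0 feeds zero-length cache tensors.
# Assumes float32 caches laid out as (batch, heads, past_len, head_dim).
past_kv = {
    i.name: np.zeros([1, i.shape[1], 0, i.shape[3]], dtype=np.float32)
    for i in text_dec.get_inputs() if i.name.startswith("past_key_values")
}
present_names = [o.name for o in text_dec.get_outputs()][1:]
input_ids = np.array([[BOS_TOKEN]], dtype=np.int64)
generated = []
for step in range(max_new_tokens):
    embeds = embed.run(None, {"input_ids": input_ids})[0]
    outputs = text_dec.run(None, {
        "inputs_embeds": embeds,
        "image_features": image_features,
        **past_kv,
    })
    next_token = int(outputs[0][:, -1, :].argmax(-1)[0])
    if next_token == EOS_TOKEN:
        break
    generated.append(next_token)
    input_ids = np.array([[next_token]], dtype=np.int64)
    # Cache outputs ("present.*" by Optimum convention) become the next inputs.
    past_kv = {name.replace("present", "past_key_values", 1): value
               for name, value in zip(present_names, outputs[1:])}

text = tokenizer.decode(generated)
```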
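
The `preprocessed_image` fed to the vision encoder is whatever `preprocessor_config.json` prescribes. As a rough sketch of the normalize-and-transpose step only (resizing is left to an image library), the hypothetical helper `to_nchw` below takes per-channel `mean` and `std`; the 0.5 defaults are placeholders, not values confirmed for this export.

```python
import numpy as np


def to_nchw(image_hwc_uint8, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    """Convert an already-resized HxWx3 uint8 RGB image to a 1x3xHxW float32 tensor.

    mean/std are placeholders; take the real values from preprocessor_config.json.
    """
    x = image_hwc_uint8.astype(np.float32) / 255.0    # scale to [0, 1]
    x = (x - np.asarray(mean, dtype=np.float32)) / np.asarray(std, dtype=np.float32)
    x = np.transpose(x, (2, 0, 1))                    # HWC -> CHW
    return np.ascontiguousarray(x[np.newaxis, ...])   # add batch dim -> NCHW
```

The result has the NCHW float32 layout listed in the input/output table; the target height and width also come from the preprocessor config.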

The `onnxruntime-genai` model builder doesn't currently accept Moondream2's architecture, so raw onnxruntime sessions plus a hand-rolled decode loop are the way to go (same shape as the TrOCR / Florence-2 patterns).

## When to pick Moondream2

- **Compact VLM use cases**: 1.86B params, ~2 GB on disk; runs on CPU at usable latency.
- **Captioning + VQA**: short-form image-to-text. Punches above its size class.
- **Side-by-side VLM comparison**: pairs well with Florence-2 (similar size, different architecture) and Phi-3.5 Vision (~2× larger, different training) for "small VLM" evals.

For larger / higher-quality VLM tasks, reach for [Phi-3.5 Vision](https://huggingface.co/Heliosoph/phi-3.5-vision-instruct-onnx) or upstream Qwen2-VL / Llama-3.2 Vision. For OCR-specific use, [Florence-2](https://huggingface.co/Heliosoph/florence-2-base-ft-fp16-onnx) has dedicated task-tokens that often beat generalist VLMs on document text.

## Provenance + reproducibility caveat

The ONNX export in this repo was done locally via Optimum from the upstream PyTorch weights as a one-off, with no checked-in reproducible script. If a clean re-export is ever needed (Optimum / transformers / Moondream2 version churn breaks an inference path), the rough recipe is:

```bash
optimum-cli export onnx \
  --model vikhyatk/moondream2 \
  --task image-to-text \
  --trust-remote-code \
  ./moondream2-onnx-staging/
```

Verify the produced file list matches what's shipped here; Optimum's exact output filenames depend on its version. Watch for any new component files (Moondream2 has had architecture tweaks across versions that could add or remove split points).
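
One way to do that comparison is a plain set difference over file names. The sketch below assumes a local checkout of this repo and the staging directory from the command above; `compare_exports` is a hypothetical helper name.

```python
from pathlib import Path


def compare_exports(shipped_dir, staging_dir):
    """Return (missing, unexpected) file names in staging relative to shipped."""
    shipped = {p.name for p in Path(shipped_dir).iterdir() if p.is_file()}
    staged = {p.name for p in Path(staging_dir).iterdir() if p.is_file()}
    return sorted(shipped - staged), sorted(staged - shipped)
```

Names in `unexpected` are the ones to look at first: a new `.onnx` there suggests the exporter introduced a split point that the decode loop doesn't know about. Non-model files in the shipped checkout (the README, `LICENSE`) will show up as `missing` and can be ignored.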

## License

**Apache-2.0**, same as upstream `vikhyatk/moondream2`. `LICENSE` file included. The ONNX-export step doesn't change licensing: same model, different serialization format.