flyingbertman committed
Commit 13536c4 · verified · 1 Parent(s): e31e689

Update README.md

Files changed (1)
  1. README.md +111 -0
README.md CHANGED
@@ -1,3 +1,114 @@
1
  ---
2
  license: apache-2.0
3
+ library_name: onnx
4
+ tags:
5
+ - vision-language
6
+ - vlm
7
+ - image-captioning
8
+ - vqa
9
+ - moondream
10
+ - onnx
11
+ base_model: vikhyatk/moondream2
12
+ pipeline_tag: image-to-text
13
+ language:
14
+ - en
15
  ---
16
+
17
+ # Moondream2: Compact VLM (ONNX)
18
+
19
+ ONNX export of [vikhyatk/moondream2](https://huggingface.co/vikhyatk/moondream2), a 1.86B-parameter vision-language model from Vikhyat Korrapati. It is surprisingly capable for its size on captioning, visual question answering, and basic spatial grounding. Apache-2.0 throughout.
20
+
21
+ Re-hosted under Heliosoph for distribution stability: `vikhyatk/moondream2` is the authoritative upstream, but it doesn't publish its own ONNX exports, and the upstream file layout drifts across revisions. This repo is the canonical ONNX form for the model.
22
+
23
+ Credit: Vikhyat Korrapati, [Moondream](https://github.com/vikhyat/moondream).
24
+
25
+ ## What this repo contains
26
+
27
+ Moondream2 is a multi-component model: Optimum's ONNX exporter splits it into separate files for the vision encoder, text decoder, and token embeddings. **All files must be present in the same directory** for the model to load.
28
+
29
+ ```
30
+ vision_encoder.onnx # SigLIP-based image encoder
31
+ decoder_model_merged.onnx # Phi-1.5-based text decoder (cache and no-cache paths merged into one graph)
32
+ embed_tokens.onnx # Token embedding layer (separated for inference efficiency)
33
+ config.json # HuggingFace model config
34
+ generation_config.json # Decoder generation defaults (max_length, EOS, etc.)
35
+ preprocessor_config.json # Image preprocessing (resize, normalize)
36
+ tokenizer.json # Tokenizer vocab + merges
37
+ tokenizer_config.json
38
+ special_tokens_map.json
39
+ ```
40
+
41
+ If a component's weights exceed the 2 GB protobuf limit, Optimum emits a sibling `.onnx.data` external-data file alongside the `.onnx`. Keep the two together: the `.onnx` references the `.data` file by relative filename.
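Since a missing file only shows up as a load failure, a quick pre-flight check can help. The following is a minimal sketch, assuming the file names match the listing above; the function name `check_model_dir` is a hypothetical helper, not part of any library.

```python
from pathlib import Path

# File names copied from the listing above; adjust if your export differs.
REQUIRED_FILES = [
    "vision_encoder.onnx",
    "decoder_model_merged.onnx",
    "embed_tokens.onnx",
    "config.json",
    "generation_config.json",
    "preprocessor_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
]


def check_model_dir(model_dir):
    """Return (missing, external_data) for a local copy of this repo.

    missing: required file names that are not present in model_dir.
    external_data: names of any `.onnx.data` files found next to the graphs.
    """
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    external_data = sorted(p.name for p in root.glob("*.onnx.data"))
    return missing, external_data
```

An empty `missing` list means every file from the listing is present; any names in `external_data` must stay next to their `.onnx` graphs.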
42
+
43
+ ## Input / output shape
44
+
45
+ | Stage | Input | Output |
46
+ |---|---|---|
47
+ | Vision encoder | RGB image, NCHW float32, preprocessor-normalized | Image feature tokens |
48
+ | Text decoder | Image features + input token ids + KV cache | Next-token logits + updated KV cache |
49
+ | Embed tokens | Token ids | Token embeddings (fed back into decoder) |
50
+
51
+ Exact tensor shapes and names depend on the Optimum version used at export, so verify them in Netron before wiring anything up.
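The same names and shapes can also be read programmatically. This is a rough sketch that relies on `get_inputs()` / `get_outputs()` of `onnxruntime.InferenceSession`; the helper names `describe_io` and `describe_model` are hypothetical.

```python
def describe_io(session):
    """Return {"inputs": [...], "outputs": [...]} of (name, type, shape) tuples.

    `session` is anything exposing get_inputs() / get_outputs() the way
    onnxruntime.InferenceSession does.
    """
    def rows(nodes):
        return [(n.name, n.type, list(n.shape)) for n in nodes]

    return {"inputs": rows(session.get_inputs()),
            "outputs": rows(session.get_outputs())}


def describe_model(path):
    """Load one .onnx file on CPU and describe its inputs and outputs."""
    import onnxruntime as ort  # imported lazily; only needed for real models

    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    return describe_io(session)
```

For example, `describe_model("decoder_model_merged.onnx")["inputs"]` lists the decoder's actual input names, including however the KV-cache tensors are named in this particular export.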
52
+
53
+ ## How to use
54
+
55
+ The runtime pattern is **greedy decoding orchestrated outside the ONNX graph**, similar to the standard encoder-decoder pattern for ONNX-exported LLMs:
56
+
57
+ ```python
+ import onnxruntime as ort
+ import numpy as np
+
+ # Assumed to exist already (not defined in this snippet): preprocessed_image
+ # (float32, NCHW, built per preprocessor_config.json), tokenizer, BOS_TOKEN,
+ # EOS_TOKEN, and max_new_tokens. The tensor names used below are illustrative;
+ # read the real ones from each session's get_inputs() before wiring.
+
+ vision_enc = ort.InferenceSession("vision_encoder.onnx")
+ text_dec = ort.InferenceSession("decoder_model_merged.onnx")
+ embed = ort.InferenceSession("embed_tokens.onnx")
+
+ # 1. Encode the image
+ image_features = vision_enc.run(None, {"pixel_values": preprocessed_image})[0]
+
+ # 2. Greedy decode loop with KV cache
+ kv_names = [i.name for i in text_dec.get_inputs() if i.name.startswith("past_key_values")]
+ input_ids = np.array([[BOS_TOKEN]], dtype=np.int64)
+ generated = []
+ past_kv = {}  # no cache on the first step; some exports want zero-length tensors instead
+ for step in range(max_new_tokens):
+     embeds = embed.run(None, {"input_ids": input_ids})[0]
+     outputs = text_dec.run(None, {
+         "inputs_embeds": embeds,
+         "image_features": image_features,
+         **past_kv,
+     })
+     next_token = outputs[0][:, -1, :].argmax(-1)
+     if next_token.item() == EOS_TOKEN:
+         break
+     generated.append(next_token.item())
+     input_ids = next_token.reshape(1, 1)
+     # Feed the updated cache back in, assuming outputs[1:] follow kv_names order
+     past_kv = dict(zip(kv_names, outputs[1:]))
+
+ text = tokenizer.decode(generated)
86
+ ```
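The loop above relies on `BOS_TOKEN`, `EOS_TOKEN`, and a generation budget that the snippet does not define. One plausible way to fill them in is to read `generation_config.json`. This is a sketch only: the key names follow the usual Hugging Face convention and are an assumption about this export, and `load_generation_defaults` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_generation_defaults(model_dir, fallback_max_length=64):
    """Read BOS/EOS ids and max_length from generation_config.json.

    Key names are assumed, not confirmed for this export; missing keys come
    back as None (or the fallback for max_length).
    """
    path = Path(model_dir) / "generation_config.json"
    cfg = json.loads(path.read_text(encoding="utf-8"))
    eos = cfg.get("eos_token_id")
    if isinstance(eos, list):
        # Some configs list several EOS ids; this sketch keeps only the first.
        eos = eos[0] if eos else None
    return {
        "bos_token_id": cfg.get("bos_token_id"),
        "eos_token_id": eos,
        "max_length": cfg.get("max_length", fallback_max_length),
    }
```

If a value comes back as `None`, check `config.json` or the tokenizer files instead of guessing an id.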
87
+
88
+ The `onnxruntime-genai` model builder doesn't currently accept Moondream2's architecture, so raw onnxruntime sessions plus a hand-rolled decode loop are the way to go (the same shape as the TrOCR / Florence-2 patterns).
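The hand-rolled loop itself does not depend on onnxruntime, which makes it easy to test in isolation. Below is a minimal sketch of greedy decoding against a stand-in step function; `greedy_decode`, `step_fn`, and `toy_step` are hypothetical names, and in a real pipeline `step_fn` would wrap the embedding and decoder sessions.

```python
def greedy_decode(step_fn, bos_token, eos_token, max_new_tokens):
    """Greedy decoding driven from Python.

    step_fn(token, cache) -> (logits, new_cache), where logits is a sequence
    of scores over the vocabulary for the next token.
    """
    generated = []
    token, cache = bos_token, None
    for _ in range(max_new_tokens):
        logits, cache = step_fn(token, cache)
        # Greedy choice: index of the highest-scoring vocabulary entry
        token = max(range(len(logits)), key=logits.__getitem__)
        if token == eos_token:
            break
        generated.append(token)
    return generated


# Toy stand-in for the model: always prefers (previous token + 1) mod 5.
def toy_step(token, cache):
    logits = [0.0] * 5
    logits[(token + 1) % 5] = 1.0
    steps_so_far = (cache or 0) + 1  # the "cache" here just counts steps
    return logits, steps_so_far


print(greedy_decode(toy_step, bos_token=0, eos_token=4, max_new_tokens=10))
# prints [1, 2, 3]
```

Keeping the token-selection logic separate from the session calls also makes it simpler to swap greedy selection for sampling later.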
89
+
90
+ ## When to pick Moondream2
91
+
92
+ - **Compact VLM use cases**: 1.86B params, ~2 GB on disk; runs on CPU at usable latency.
93
+ - **Captioning + VQA**: short-form image-to-text. Punches above its size class.
94
+ - **Side-by-side VLM comparison**: pairs well with Florence-2 (similar size, different architecture) and Phi-3.5 Vision (~2× larger, different training) for "small VLM" evals.
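As a rough sanity check on the numbers in the first bullet, the sketch below estimates raw weight storage for 1.86B parameters at a few common precisions. The precision labels and the helper name are illustrative assumptions; this README does not state which precision the shipped files use, and real files carry some extra overhead.

```python
PARAMS = 1.86e9  # parameter count quoted above

# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_size_gb(params, bytes_per_param):
    """Approximate weight storage in GB (10**9 bytes), ignoring file overhead."""
    return params * bytes_per_param / 1e9


sizes = {name: round(weight_size_gb(PARAMS, b), 2)
         for name, b in BYTES_PER_PARAM.items()}
```

Comparing these estimates with the actual file sizes in a local copy gives a quick hint about how the weights are stored.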
95
+
96
+ For larger / higher-quality VLM tasks, reach for [Phi-3.5 Vision](https://huggingface.co/Heliosoph/phi-3.5-vision-instruct-onnx) or upstream Qwen2-VL / Llama-3.2 Vision. For OCR-specific use, [Florence-2](https://huggingface.co/Heliosoph/florence-2-base-ft-fp16-onnx) has dedicated task-tokens that often beat generalist VLMs on document text.
97
+
98
+ ## Provenance + reproducibility caveat
99
+
100
+ The ONNX export in this repo was done locally via Optimum from the upstream PyTorch weights. It was a one-off, with no checked-in reproducible script. If a clean re-export is ever needed (for example, because Optimum / transformers / Moondream2 version churn breaks an inference path), the rough recipe is:
101
+
102
+ ```bash
103
+ optimum-cli export onnx \
104
+ --model vikhyatk/moondream2 \
105
+ --task image-to-text \
106
+ --trust-remote-code \
107
+ ./moondream2-onnx-staging/
108
+ ```
109
+
110
+ Verify the produced file list matches what's shipped here; Optimum's exact output filenames depend on its version. Watch for any new component files (Moondream2 has had architecture tweaks across versions that could add or remove split points).
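The file-list comparison can be scripted. This is a minimal sketch, assuming both the shipped repo and the fresh export sit in local directories; `compare_exports` is a hypothetical helper, and it only compares file names, not contents.

```python
from pathlib import Path


def compare_exports(shipped_dir, staging_dir):
    """Compare the file names in two export directories (top level only).

    Returns (only_in_shipped, only_in_staging) as sorted lists of names.
    """
    shipped = {p.name for p in Path(shipped_dir).iterdir() if p.is_file()}
    staging = {p.name for p in Path(staging_dir).iterdir() if p.is_file()}
    return sorted(shipped - staging), sorted(staging - shipped)
```

Names that appear only in the staging directory are the likely new component files to watch for; names that appear only in the shipped directory suggest a split point was removed or renamed.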
111
+
112
+ ## License
113
+
114
+ **Apache-2.0**, same as upstream `vikhyatk/moondream2`. `LICENSE` file included. The ONNX-export step doesn't change licensing: same model, different serialization format.