florianleibert committed · Commit 8bbd4c5 · verified · 1 Parent(s): 805eed0

add model card with github backlink

Files changed (1):

README.md ADDED (+72 -0)
---
license: apache-2.0
library_name: mlx
tags:
- mlx
- apple-silicon
- speculative-decoding
- qwen3
- dflash
- offline-bundle
base_model:
- mlx-community/Qwen3-4B-bf16
- z-lab/Qwen3-4B-DFlash-b16
---

# micromodel-ship

Offline Apple Silicon inference bundle for Qwen3-4B with DFlash exact speculative decoding.

**Source code:** [github.com/florianleibert/micromodels](https://github.com/florianleibert/micromodels)

This repo hosts the shippable offline tarball (`micromodel-ship-offline.tar.gz`) that is too large to live in the GitHub repo. The tarball contains the full runnable runtime, both model payloads (target + draft), and the helper scripts needed to serve a local OpenAI-compatible API.

## Contents

`micromodel-ship-offline.tar.gz` bundles:

- target model: `mlx-community/Qwen3-4B-bf16`
- DFlash draft: `z-lab/Qwen3-4B-DFlash-b16`
- MLX-based runtime with exact speculative decoding
- minimal OpenAI-compatible API server (`POST /v1/chat/completions`)
- run, chat, serve, and benchmark scripts

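Once downloaded (see Quick start below), you can verify what the archive ships before unpacking it, using standard `tar` flags:

```bash
# List the first entries of the bundle without extracting it
tar -tzf micromodel-ship-offline.tar.gz | head -n 20
```
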
## Quick start

```bash
curl -L -o micromodel-ship-offline.tar.gz \
  https://huggingface.co/florianleibert/micromodel-ship/resolve/main/micromodel-ship-offline.tar.gz
tar -xzf micromodel-ship-offline.tar.gz
cd micromodel-ship
uv sync
./scripts/serve.sh
```

Health check:

```bash
curl http://127.0.0.1:8051/healthz
```

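Once the server reports healthy, you can send a chat request. The sketch below assumes the server follows the standard OpenAI chat-completions request schema on the same port as the health check; the `model` value is a placeholder, so substitute whatever name the bundled server expects:

```bash
# Minimal chat-completions request (JSON fields follow the standard OpenAI schema).
# "qwen3-4b" is a placeholder model name -- use the one the server expects.
curl http://127.0.0.1:8051/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-4b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```
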
## Performance

Measured on Apple M5 Max, macOS 26.4, parallel-replay verifier, 101-token prompt:

| Max new tokens | Runtime | Generation tok/s | End-to-end tok/s |
|---:|---|---:|---:|
| 512 | Plain MLX-LM BF16 | 55.13 | 48.67 |
| 512 | DFlash BF16 | 190.73 | 186.89 |
| 1024 | Plain MLX-LM BF16 | 48.18 | 44.05 |
| 1024 | DFlash BF16 | 159.35 | 157.98 |

Observed at 512 new tokens: `3.46x` decode speedup (190.73 / 55.13 generation tok/s) and `3.84x` end-to-end speedup (186.89 / 48.67 tok/s). Full numbers are in the GitHub repo's `PERFORMANCE.md`.

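The bundled benchmark scripts can be used to reproduce these numbers locally. Their exact entry point isn't named in this card, so inspect the scripts directory after extracting:

```bash
# The benchmark script name is not specified in this card -- list the bundled
# scripts to find it (scripts/serve.sh is known to live there)
ls scripts/
```
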
## Requirements

- Apple Silicon (M-series)
- macOS
- Python with `uv`

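If `uv` isn't installed yet, Astral's standalone installer is one way to get it:

```bash
# Install uv via the official standalone installer from Astral
curl -LsSf https://astral.sh/uv/install.sh | sh
```
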
## Links

- GitHub repo (source, issues, releases): [florianleibert/micromodels](https://github.com/florianleibert/micromodels)