hmahadik committed on
Commit 3947888 · verified · 1 Parent(s): 960c2dd

docs: add ONNX section + fp16/ORT caveat

Files changed (1)
  1. README.md +86 -14
README.md CHANGED
@@ -147,10 +147,26 @@ print(parse_call(raw)) # ('turn_on_lights', '')
 
  ## Training data
 
- - **Size**: 367 train / 100 eval examples.
- - **Mix**: paraphrase expansion + multi-tool sequences + `respond()`
- fallbacks for ambiguous / out-of-scope prompts (so the model has a
- clean exit when no tool fits, rather than hallucinating one).
  - **Buzzer schema**: pattern-only (binary GPIO on the reference HAT, no
  PWM). Old `frequency_hz` / `duration_seconds` prompts are routed
  through `respond()` as out-of-scope negatives.
@@ -201,19 +217,19 @@ for a smaller dataset:
 
  ## Smoke-test results
 
- 10-prompt Ollama smoke against the registered model:
 
  | Smoke pass-rate |
  |-----------------|
- | **8 / 10 (80 %)** |
 
- The model handles the simple control prompts cleanly (`turn on the
- lights`, `blink red 3 times`, `play a beep`, `take a picture`, `good
- morning` → respond). Known weak prompts at 367-example scale: `set led
- red brightness 50` (hallucinated `acceptor(...)`, likely Q4_K_M
- quantization artifact on `<tool_2>`) and `set alarm 5 minutes`
- (misroutes). Plan: paraphrase-expand the dataset to 2–3k examples for the
- next checkpoint.
 
  ## Latency
 
@@ -223,13 +239,69 @@ Measured against a local Ollama using the standalone client above:
  - Target on SL2619 (2× Cortex-A55 @ 2 GHz): **0.5 – 1.2 s** with the CPU
  governor pinned to `performance`. On-device measurement pending.
 
  ## Files
 
  ```
- functiongemma-physical-ai-Q4_K_M.gguf # 253 MB, weights
  Modelfile # Ollama Modelfile (function-token format)
  tools.json # 13-tool schema (mobile-actions format)
  token_map.json # function-token <-> tool-name map
  README.md # this file
  ```
 
 
 
  ## Training data
 
+ ### v5 (current; use this for training)
+
+ - **Size**: 1,400 train / 150 eval (v5 dataset, `coral_v5_compact.jsonl`).
+ - **Multi-tool**: 292 multi-tool examples in train (20.9%), 50 in eval (33.3%). Google
+ mobile-actions target is 33.4%; train is capped by pool size: the ~450 Haiku-generated
+ multi-tool examples deduplicated to 343 unique. Future: spawn more agents.
+ - **Generation**: base hand-written examples + `paraphrases_cache.json` (generated by parallel
+ Claude Haiku agents). 971 new single-tool + 450 new multi-tool paraphrases before dedup.
+ - **Coverage fixes**: explicit brightness form ("set led red brightness 50"), 46 examples.
+ Bare alarm form ("set alarm 5 minutes", no preposition), 36 examples. Both were zero in v4
+ and caused the two known smoke-test failures.
+ - **Non-determinism fix**: `set_led_color_examples()` previously used unseeded `random.sample`;
+ now iterates all 18 templates × 12 colors deterministically (216 examples vs ~60).
+ - **Eval harness**: `scripts/eval_harness.py` (greedy decode against eval JSONL, per-tool F1,
+ arg-match rate, multi-tool sequence accuracy). Run on GPU host post-training.
+
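
The non-determinism fix listed above (enumerating every template × color pair instead of drawing an unseeded `random.sample`) can be sketched as follows. This is only an illustration: the templates, colors, output record shape, and the `set_led_color` tool name are assumptions made for the example, not the contents of the real `set_led_color_examples()`.

```python
from itertools import product

# Hypothetical stand-ins; the commit describes 18 templates and 12 colors.
TEMPLATES = ["set the led to {color}", "make the light {color}", "led {color} please"]
COLORS = ["red", "green", "blue", "white"]


def led_color_examples(templates=TEMPLATES, colors=COLORS):
    """Enumerate every template x color pair in a fixed order (no RNG involved)."""
    return [
        # Record layout and tool name are assumptions for this sketch.
        {"prompt": t.format(color=c), "tool": "set_led_color", "args": {"color": c}}
        for t, c in product(templates, colors)
    ]


examples = led_color_examples()
print(len(examples))  # 3 templates x 4 colors = 12
```

Because the output is a full Cartesian product in a fixed order, two runs produce identical datasets, and the example count is simply `len(templates) * len(colors)` (18 × 12 = 216 in the commit's case).
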
+ ### v4 (previous)
+
+ - **Size**: 367 train / 100 eval.
+ - **Multi-tool**: 13% (vs Google mobile-actions 33.4%).
  - **Buzzer schema**: pattern-only (binary GPIO on the reference HAT, no
  PWM). Old `frequency_hz` / `duration_seconds` prompts are routed
  through `respond()` as out-of-scope negatives.
 
 
  ## Smoke-test results
 
+ **v4 checkpoint (367-example training):**
 
  | Smoke pass-rate |
  |-----------------|
+ | 8 / 10 (80 %) |
 
+ Note: 21/22 smoke prompts are NOT in the held-out eval set, so 80% measures training
+ memorization, not generalization. The two failures, `set led red brightness 50`
+ (hallucinated `acceptor(...)`) and `set alarm 5 minutes` (misrouted), were caused by
+ absent phrasing patterns, now fixed in v5.
+
+ **v5 checkpoint: pending GPU training run.** Use `scripts/eval_harness.py` for
+ proper per-tool precision/recall/F1 against the 150-example held-out eval set.
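
For readers unfamiliar with per-tool precision/recall/F1, here is a minimal sketch of how such scores could be computed from expected and predicted tool names. It is not the code in `scripts/eval_harness.py`, which is not shown here. `per_tool_f1` is a hypothetical helper, it assumes exactly one tool call per example (multi-tool sequences are not handled), and `set_alarm` is an assumed tool name.

```python
from collections import Counter


def per_tool_f1(expected, predicted):
    """Per-tool precision/recall/F1 from parallel lists of tool names."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            tp[exp] += 1
        else:
            fp[pred] += 1  # predicted tool was wrong for this example
            fn[exp] += 1   # expected tool was missed
    scores = {}
    for tool in sorted(set(expected) | set(predicted)):
        p_den, r_den = tp[tool] + fp[tool], tp[tool] + fn[tool]
        p = tp[tool] / p_den if p_den else 0.0
        r = tp[tool] / r_den if r_den else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[tool] = {"precision": p, "recall": r, "f1": f1}
    return scores


expected = ["turn_on_lights", "turn_on_lights", "respond", "set_alarm"]
predicted = ["turn_on_lights", "respond", "respond", "respond"]
for tool, s in per_tool_f1(expected, predicted).items():
    print(tool, s)
```

In this toy run `respond` is over-predicted, so its recall is perfect but its precision is 1/3, which is the kind of misrouting a single pass-rate number hides.
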
 
  ## Latency
 
 
  - Target on SL2619 (2× Cortex-A55 @ 2 GHz): **0.5 – 1.2 s** with the CPU
  governor pinned to `performance`. On-device measurement pending.
 
+ ## ONNX exports (for compiler toolchains)
+
+ For compiler-targeted backends (ONNX Runtime, IREE/MLIR, OpenVINO, TensorRT,
+ Synaptics Torq), the model is also published as ONNX with KV-cache support
+ (`text-generation-with-past`). Both exports are derived from the same
+ `coral-functiongemma-v4c-compact` checkpoint as the GGUF above.
+
+ | Path | Precision | Weight init dtype | Size | ORT runnable |
+ |------|-----------|-------------------|------|--------------|
+ | `onnx/compact-fp32/model.onnx` | fp32 | 237 / 237 FLOAT | 1.7 GB | yes |
+ | `onnx/compact-fp16/model.onnx` | fp16 | 237 / 237 FLOAT16 | 833 MB | no (see note) |
+
+ Both files are structurally valid (`onnx.checker.check_model(..., full_check=True)`
+ passes). Each export ships with the matching tokenizer and `config.json` so it
+ can be loaded directly:
+
+ ```python
+ from transformers import AutoTokenizer
+ import onnxruntime as ort
+ import numpy as np, json
+
+ MODEL = "onnx/compact-fp32" # or downloaded local path
+ tok = AutoTokenizer.from_pretrained(MODEL)
+ sess = ort.InferenceSession(f"{MODEL}/model.onnx", providers=["CPUExecutionProvider"])
+
+ tools = json.load(open("tools.json"))["tools"]
+ prompt = tok.apply_chat_template(
+     [{"role": "developer",
+       "content": "You are a model that can do function calling with the following functions\n",
+       "tool_calls": None},
+      {"role": "user", "content": "Turn on the lights", "tool_calls": None}],
+     tools=tools, tokenize=False, add_generation_prompt=True,
+ )
+ # Then feed input_ids + empty past_key_values.* (shape (1, num_kv_heads, 0, head_dim))
+ # greedy-decode in a loop, stop on <end>. See repo for full snippet.
+ ```
+
+ Smoke decode of "Turn on the lights" against the fp32 ONNX returns
+ `<tool_0>()<end>` (= `turn_on_lights()`), matching the GGUF output.
+
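
The closing comment of the snippet above describes a greedy decode loop that starts from an empty KV cache. A minimal sketch of such a loop is given below; it is not the repo's full snippet. It assumes Optimum-style tensor names (`input_ids`, `attention_mask`, optional `position_ids`, and `past_key_values.{i}.key` / `.value` as inputs; `logits` and `present.{i}.key` / `.value` as outputs), which this README does not confirm, so check `sess.get_inputs()` and `sess.get_outputs()` on the actual export. `greedy_decode` and its parameters are hypothetical names, and the KV-head count and head dimension would have to be read from the export's `config.json`.

```python
import numpy as np


def greedy_decode(sess, prompt_ids, num_kv_heads, head_dim, eos_id, max_new_tokens=32):
    """Greedy decoding with a KV cache; tensor names are assumed, not confirmed."""
    in_names = [i.name for i in sess.get_inputs()]
    out_names = [o.name for o in sess.get_outputs()]
    # Start with an empty cache: zero-length sequence axis for every cache input.
    past = {
        name: np.zeros((1, num_kv_heads, 0, head_dim), dtype=np.float32)
        for name in in_names if name.startswith("past_key_values.")
    }
    step_ids = np.asarray(prompt_ids, dtype=np.int64).reshape(1, -1)
    seen = 0  # number of tokens already stored in the cache
    generated = []
    for _ in range(max_new_tokens):
        total = seen + step_ids.shape[1]
        feeds = {
            "input_ids": step_ids,
            "attention_mask": np.ones((1, total), dtype=np.int64),
            "position_ids": np.arange(seen, total, dtype=np.int64)[None, :],
            **past,
        }
        # Only pass tensors the graph declares (position_ids may be absent).
        feeds = {k: v for k, v in feeds.items() if k in in_names}
        outputs = dict(zip(out_names, sess.run(None, feeds)))
        next_id = int(np.argmax(outputs["logits"][0, -1]))
        generated.append(next_id)
        if next_id == eos_id:
            break
        # Feed this step's present.* tensors back in as the next step's cache.
        past = {
            name.replace("present.", "past_key_values.", 1): value
            for name, value in outputs.items() if name.startswith("present.")
        }
        seen = total
        step_ids = np.array([[next_id]], dtype=np.int64)
    return generated
```

With the `sess` and `tok` objects from the snippet above, `prompt_ids` would be the token ids of `prompt` and `eos_id` the id of the `<end>` token (for example via `tok.convert_tokens_to_ids("<end>")`). How the prompt is tokenized, including whether special tokens get added a second time, depends on the tokenizer and is left out of this sketch.
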
+ ### fp16 + ONNX Runtime caveat
+
+ The fp16 ONNX file is structurally valid but **does not currently load in
+ ONNX Runtime ≥ 1.20** for this model: ORT's `SimplifiedLayerNormFusion` pass
+ chokes on the `InsertedPrecisionFreeCast_*` nodes that the fp16 conversion
+ inserts around Gemma3's RMSNorm layers. The error is graph-optimizer-internal
+ and reproduces even with `ORT_DISABLE_ALL`. This is an ORT bug, not an ONNX-spec
+ issue: the file passes `onnx.checker` and the graph is well-formed.
+
+ For compiler frontends that consume ONNX directly (IREE / MLIR, TensorRT,
+ OpenVINO, Synaptics Torq), the fp16 file should ingest fine. For runtime
+ inference via `onnxruntime` itself, use the fp32 export and let your compiler
+ or runtime do its own dtype conversion / quantization downstream.
+
  ## Files
 
  ```
+ functiongemma-physical-ai-Q4_K_M.gguf # 253 MB, GGUF Q4_K_M weights (Ollama / llama.cpp)
  Modelfile # Ollama Modelfile (function-token format)
  tools.json # 13-tool schema (mobile-actions format)
  token_map.json # function-token <-> tool-name map
+ onnx/compact-fp32/ # ONNX export, fp32, with KV cache (1.7 GB)
+ onnx/compact-fp16/ # ONNX export, fp16, with KV cache (833 MB); see ORT caveat above
  README.md # this file
  ```