Commit 35645d8 by hmahadik · verified · 1 Parent(s): 408c7e9

Update model card for v7 (10 tools, list_alarms removed)

Files changed (1): README.md (+75 -51)
@@ -25,16 +25,17 @@ SL2619 "Coral" edge board (Google IO 2026 demo).
25
 
26
  | Revision | File | Tool count | Notes |
27
  |----------|------|-----------:|-------|
28
- | **v6 (current)** | [`functiongemma-physical-ai-v6-Q5_K_M.gguf`](./functiongemma-physical-ai-v6-Q5_K_M.gguf) | 11 | Camera + vision dropped. Single-tool routing **95.5%**, multi-tool exact-match 23.9%. |
 
29
  | v4c (legacy) | [`functiongemma-physical-ai-Q4_K_M.gguf`](./functiongemma-physical-ai-Q4_K_M.gguf) | 13 | Earlier checkpoint, includes camera/scene tools. |
30
 
31
- Schema ships as [`tools.json`](./tools.json) (11 tools, current). Token-to-tool
32
  mapping is in [`token_map.json`](./token_map.json).
33
 
34
  ## Output format: function tokens
35
 
36
  Tool calls emit as **function tokens**: each tool name compiles to a single
37
- special-vocabulary token (`<tool_0>` … `<tool_10>` for v6) and a single
38
  `<end>` terminator. A complete call decodes in roughly 8–15 output tokens,
39
  vs ~30–80 for native FunctionGemma's
40
  `<start_function_call>call:NAME{...}<end_function_call>` syntax. On a
@@ -44,8 +45,9 @@ voice-UX latency.
44
  Sample output: `<tool_3>(3,"red")<end>` for `blink_lights(count=3, color="red")`.
45
 
46
  `<tool_0>` → `turn_on_lights`, `<tool_3>` → `blink_lights`,
47
- `<tool_9>` → `get_system_status`, `<tool_10>` → `respond` (v6 numbering).
48
- Full mapping in [`token_map.json`](./token_map.json).
 
49
 
50
  > ⚠️ Inference servers MUST stop generation on `<end_of_turn>` (or `<eos>`),
51
  > NOT on `<end>`. Multi-tool sequences emit `<tool_A>(args)<end><tool_B>(args)<end>`,
@@ -150,23 +152,32 @@ print(parse_call(raw)) # ('turn_on_lights', '')
150
 
151
  ## Training data
152
 
153
- ### v5 (current; use this for training)
154
-
155
- - **Size**: 1,400 train / 150 eval (v5 dataset, `coral_v5_compact.jsonl`).
156
- - **Multi-tool**: 292 multi-tool examples in train (20.9%), 50 in eval (33.3%). Google
157
- mobile-actions target is 33.4%; train is capped by pool size: the ~450 Haiku-generated
158
- multi-tool examples deduplicated to 343 unique. Future: spawn more agents.
159
- - **Generation**: base hand-written examples + `paraphrases_cache.json` (generated by parallel
160
- Claude Haiku agents). 971 new single-tool + 450 new multi-tool paraphrases before dedup.
161
- - **Coverage fixes**: explicit brightness form ("set led red brightness 50"), 46 examples.
162
- Bare alarm form ("set alarm 5 minutes", no preposition), 36 examples. Both were zero in v4
163
- and caused the two known smoke-test failures.
164
- - **Non-determinism fix**: `set_led_color_examples()` previously used unseeded `random.sample`;
165
- now iterates all 18 templates × 12 colors deterministically (216 examples vs ~60).
166
- - **Eval harness**: `scripts/eval_harness.py` runs greedy decode against eval JSONL and reports per-tool F1,
167
- arg-match rate, multi-tool sequence accuracy. Run on GPU host post-training.
168
-
169
- ### v4 (previous)
 
 
 
 
 
 
 
 
 
170
 
171
  - **Size**: 367 train / 100 eval.
172
  - **Multi-tool**: 13% (vs Google mobile-actions 33.4%).
@@ -188,16 +199,18 @@ The training recipe is a direct port of Brinq's SmartPanel v14 trainer
188
  for a smaller dataset:
189
 
190
  - **Full bf16 fine-tune (no LoRA)**.
191
- - **Mean-init** for new `<tool_0>..<tool_12>` and `<end>` special tokens
192
  (init = mean of existing input embeddings; random init under-converges
193
  for tiny models on small datasets).
194
  - **Completion-only loss mask**: hand-rolled, masking everything before
195
  `<start_of_turn>model\n`. TRL 0.25's `completion_only_loss=True` is a
196
  no-op on flat-text data and FunctionGemma's chat template lacks
197
  `{% generation %}` markers required for `assistant_only_loss`.
198
- - **15 epochs**, lr `3e-5`, cosine schedule, 0.1 warmup. (367 examples here
199
- vs SmartPanel v14's ~21k; the higher epoch count compensates for the
200
- smaller dataset.)
 
 
201
  - **Effective batch 16** = `per_device_train_batch_size=2 ×
202
  gradient_accumulation_steps=8` (kept this way to avoid the 8 GiB
203
  cross-entropy logit allocation OOM that bites Gemma3's 262k vocab).
@@ -218,29 +231,36 @@ for a smaller dataset:
218
  }
219
  ```
220
 
221
- ## Smoke-test results
222
 
223
- **v4 checkpoint (367-example training):**
224
 
225
- | Smoke pass-rate |
226
- |-----------------|
227
- | 8 / 10 (80 %) |
 
 
 
228
 
229
- Note: 21/22 smoke prompts are NOT in the held-out eval set, so 80% measures training
230
- memorization, not generalization. The two failures, `set led red brightness 50`
231
- (hallucinated `acceptor(...)`) and `set alarm 5 minutes` (misrouted), were caused by
232
- absent phrasing patterns, now fixed in v5.
 
 
 
233
 
234
- **v5 checkpoint: pending GPU training run.** Use `scripts/eval_harness.py` for
235
- proper per-tool precision/recall/F1 against the 150-example held-out eval set.
 
236
 
237
  ## Latency
238
 
239
- Measured against a local Ollama using the standalone client above:
240
-
241
- - **~1.1 – 1.3 s** per call on a laptop CPU.
242
- - Target on SL2619 (2× Cortex-A55 @ 2 GHz): **0.5 – 1.2 s** with the CPU
243
- governor pinned to `performance`. On-device measurement pending.
244
 
245
  ## ONNX exports (for compiler toolchains)
246
 
@@ -299,13 +319,15 @@ or runtime do its own dtype conversion / quantization downstream.
299
  ## Files
300
 
301
  ```
302
- functiongemma-physical-ai-Q4_K_M.gguf # 253 MB, GGUF Q4_K_M weights (Ollama / llama.cpp)
303
- Modelfile # Ollama Modelfile (function-token format)
304
- tools.json # 13-tool schema (mobile-actions format)
305
- token_map.json # function-token <-> tool-name map
306
- onnx/compact-fp32/ # ONNX export, fp32, with KV cache (1.7 GB)
307
- onnx/compact-fp16/ # ONNX export, fp16, with KV cache (833 MB); see ORT caveat above
308
- README.md # this file
 
 
309
  ```
310
 
311
  ## License
@@ -319,5 +341,7 @@ By using this model you agree to those terms. Base model:
319
  - Base model: <https://huggingface.co/google/functiongemma-270m-it>
320
  - Octopus v2 paper: <https://arxiv.org/abs/2404.01744>
321
  - Hardware demo (Coralboard, Google IO 2026: full physical setup,
322
- WLED-over-USB-CDC, Grinn HAT, etc.):
323
- <https://github.com/BrinqAI/coral-functiongemma-demo>
 
 
 
25
 
26
  | Revision | File | Tool count | Notes |
27
  |----------|------|-----------:|-------|
28
+ | **v7 (current)** | [`functiongemma-physical-ai-v7-Q5_K_M.gguf`](./functiongemma-physical-ai-v7-Q5_K_M.gguf) | 10 | `list_alarms` removed; alarm-query prompts route via `respond()`. 250-row eval: **86.8%** overall, **92.8%** single-tool, **75.0%** multi-tool exact-match, **0.0%** parse failure. |
29
+ | v6 (previous) | [`functiongemma-physical-ai-v6-Q5_K_M.gguf`](./functiongemma-physical-ai-v6-Q5_K_M.gguf) | 11 | Camera + vision dropped. Single-tool routing 95.5%, multi-tool exact-match 23.9%. |
30
  | v4c (legacy) | [`functiongemma-physical-ai-Q4_K_M.gguf`](./functiongemma-physical-ai-Q4_K_M.gguf) | 13 | Earlier checkpoint, includes camera/scene tools. |
31
 
32
+ Schema ships as [`tools.json`](./tools.json) (10 tools, current). Token-to-tool
33
  mapping is in [`token_map.json`](./token_map.json).
34
 
35
  ## Output format: function tokens
36
 
37
  Tool calls emit as **function tokens**: each tool name compiles to a single
38
+ special-vocabulary token (`<tool_0>` … `<tool_9>` for v7) and a single
39
  `<end>` terminator. A complete call decodes in roughly 8–15 output tokens,
40
  vs ~30–80 for native FunctionGemma's
41
  `<start_function_call>call:NAME{...}<end_function_call>` syntax. On a
 
45
  Sample output: `<tool_3>(3,"red")<end>` for `blink_lights(count=3, color="red")`.
46
 
47
  `<tool_0>` → `turn_on_lights`, `<tool_3>` → `blink_lights`,
48
+ `<tool_8>` → `get_system_status`, `<tool_9>` → `respond` (v7 numbering;
49
+ v6 used `<tool_9>` and `<tool_10>` for those; both shifted down by one when
50
+ `list_alarms` was removed). Full mapping in [`token_map.json`](./token_map.json).
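A minimal sketch of decoding a function-token string into `(tool_name, raw_args)` pairs, including a multi-tool sequence. It is not the README's own `parse_call`; the `parse_calls` name, the regex, and the partial `TOKEN_MAP_V7` dictionary are assumptions made for illustration, and only the four v7 indices quoted in this card are filled in. A real client would load the full mapping from `token_map.json`.

```python
import re

# Only these four v7 entries are stated in the model card; the remaining
# six live in token_map.json and are deliberately left out of this sketch.
TOKEN_MAP_V7 = {
    0: "turn_on_lights",
    3: "blink_lights",
    8: "get_system_status",
    9: "respond",
}

# One call has the shape <tool_N>(args)<end>. The non-greedy match assumes
# the argument text never contains the literal sequence ")<end>".
CALL_RE = re.compile(r"<tool_(\d+)>\((.*?)\)<end>", re.DOTALL)


def parse_calls(raw, token_map=TOKEN_MAP_V7):
    """Return [(tool_name, raw_args), ...] for every call found in raw."""
    calls = []
    for match in CALL_RE.finditer(raw):
        index = int(match.group(1))
        # Unknown indices are surfaced rather than silently dropped.
        name = token_map.get(index, f"<unknown tool_{index}>")
        calls.append((name, match.group(2)))
    return calls


if __name__ == "__main__":
    print(parse_calls('<tool_3>(3,"red")<end><tool_0>()<end>'))
```

Because the pattern is applied with `finditer`, a two-call output yields two entries in order, which is why generation must not be cut at the first `<end>`.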
51
 
52
  > ⚠️ Inference servers MUST stop generation on `<end_of_turn>` (or `<eos>`),
53
  > NOT on `<end>`. Multi-tool sequences emit `<tool_A>(args)<end><tool_B>(args)<end>`,
 
152
 
153
  ## Training data
154
 
155
+ ### v7 (current)
156
+
157
+ - **Size**: 2,000 train / 250 eval (`coral_v7_compact.jsonl`).
158
+ - **Schema change**: `list_alarms` removed. Out-of-scope alarm-query prompts
159
+ ("what alarms do I have?") are deliberately routed through `respond()`
160
+ rather than answered by a query tool. Compact token map shifted accordingly:
161
+ `get_system_status` is now `<tool_8>` (was `<tool_9>`), `respond` is
162
+ `<tool_9>` (was `<tool_10>`).
163
+ - **Multi-tool**: 84 of 250 eval rows (33.6%) are multi-tool sequences,
164
+ matching the Google mobile-actions distribution.
165
+ - **GGUF eval (Q5_K_M, greedy)**: overall **86.8%** (217/250), single-tool
166
+ **92.8%** (154/166), multi-tool exact-match **75.0%** (63/84), parse
167
+ failure **0.0%** (0/250). Per-tool F1 ranges from 0.74 (`respond`) to
168
+ 1.00 (`cancel_alarm`).
169
+ - **Known weak spots** (informal on-device REPL): "tell me a joke" / "what
170
+ alarms do I have" tend to misroute to `play_buzzer` instead of `respond`;
171
+ more `respond()` negatives sharing keywords with physical-action tools
172
+ would help in v8.
173
+
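The index shift described in the schema-change bullet can be illustrated with a short sketch. The v6 ordering below is hypothetical: only indices 0, 3, 9 and 10 are confirmed by this card, and the other positions are placeholders chosen so that `list_alarms` sits before `get_system_status`. The helper names `build_token_map` and `remove_tool` are also assumptions; `token_map.json` remains the authoritative mapping.

```python
# Hypothetical v6 ordering: only indices 0, 3, 9 and 10 are confirmed by the
# model card; every other position is a placeholder for illustration.
V6_TOOLS = [
    "turn_on_lights", "turn_off_lights", "set_led_color", "blink_lights",
    "set_neopixel_pattern", "play_buzzer", "set_alarm", "cancel_alarm",
    "list_alarms", "get_system_status", "respond",
]


def build_token_map(tools):
    """Assign <tool_i> tokens in list order."""
    return {f"<tool_{i}>": name for i, name in enumerate(tools)}


def remove_tool(tools, name):
    """Drop one tool; every later tool moves down by one index."""
    return [t for t in tools if t != name]


v6_map = build_token_map(V6_TOOLS)
v7_map = build_token_map(remove_tool(V6_TOOLS, "list_alarms"))

if __name__ == "__main__":
    print(v6_map["<tool_9>"], v6_map["<tool_10>"])
    print(v7_map["<tool_8>"], v7_map["<tool_9>"])
```

Under this assumed ordering, a client that keeps the v6 map while loading the v7 GGUF would resolve `<tool_8>` and `<tool_9>` to the wrong tools, so the map and the weights should be updated together.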
174
+ ### v6 (previous)
175
+
176
+ - **Size**: 1,400 train / 150 eval (v5/v6 dataset lineage, `coral_v5_compact.jsonl`).
177
+ - **Tool count**: 11. Camera / vision tools dropped from earlier
178
+ checkpoints; alarm-list tool kept.
179
+
180
+ ### v4 (legacy)
181
 
182
  - **Size**: 367 train / 100 eval.
183
  - **Multi-tool**: 13% (vs Google mobile-actions 33.4%).
 
199
  for a smaller dataset:
200
 
201
  - **Full bf16 fine-tune (no LoRA)**.
202
+ - **Mean-init** for new `<tool_0>..<tool_9>` and `<end>` special tokens
203
  (init = mean of existing input embeddings; random init under-converges
204
  for tiny models on small datasets).
205
  - **Completion-only loss mask**: hand-rolled, masking everything before
206
  `<start_of_turn>model\n`. TRL 0.25's `completion_only_loss=True` is a
207
  no-op on flat-text data and FunctionGemma's chat template lacks
208
  `{% generation %}` markers required for `assistant_only_loss`.
209
+ - **8 epochs**, lr `3e-5`, cosine schedule, 0.1 warmup. (2,000 examples in
210
+ v7; fewer epochs than v4's 15 because the training set grew about 5×, from 367.)
211
+ - **Tool-token loss weight 4.0** to keep the new function tokens learning
212
+ faster than the rest of the vocabulary (Gemma3's 262k vocab dilutes the
213
+ signal otherwise).
214
  - **Effective batch 16** = `per_device_train_batch_size=2 ×
215
  gradient_accumulation_steps=8` (kept this way to avoid the 8 GiB
216
  cross-entropy logit allocation OOM that bites Gemma3's 262k vocab).
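Two of the recipe items, the completion-only mask and the tool-token loss weight, can be sketched in plain Python. This is not the trainer's actual code: the function names, the string-piece representation of tokens, the use of the last marker occurrence, and the weighted-mean normalization are all assumptions made for illustration.

```python
MARKER = "<start_of_turn>model\n"


def completion_mask(pieces):
    """Return 1 for pieces after the last MARKER, else 0.

    `pieces` is a list of token strings whose concatenation is the flat
    training text. Using the last occurrence is an assumption.
    """
    text = "".join(pieces)
    cut = text.rfind(MARKER)
    if cut < 0:
        return [0] * len(pieces)  # no model turn: nothing contributes
    cut += len(MARKER)
    mask, pos = [], 0
    for piece in pieces:
        mask.append(1 if pos >= cut else 0)
        pos += len(piece)
    return mask


def weighted_loss(token_losses, mask, is_tool_token, tool_weight=4.0):
    """Weighted mean of per-token losses over unmasked positions.

    Tool tokens get `tool_weight`, other completion tokens 1.0. Dividing
    by the weight sum is an assumed normalization.
    """
    weights = [
        m * (tool_weight if tool else 1.0)
        for m, tool in zip(mask, is_tool_token)
    ]
    total = sum(w * loss for w, loss in zip(weights, token_losses))
    denom = sum(weights)
    return total / denom if denom else 0.0


if __name__ == "__main__":
    pieces = ["<start_of_turn>user\n", "lights on", "<end_of_turn>\n",
              "<start_of_turn>model\n", "<tool_0>", "()", "<end>"]
    mask = completion_mask(pieces)
    is_tool = [p.startswith("<tool_") for p in pieces]
    print(mask, weighted_loss([1.0] * 7, mask, is_tool))
```

With a weight of 4.0, a single tool token contributes as much to the loss as four ordinary completion tokens, which is the stated intent of keeping function tokens from being diluted.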
 
231
  }
232
  ```
233
 
234
+ ## Eval results
235
 
236
+ **v7 checkpoint (2,000 train / 250 eval), Q5_K_M GGUF, greedy decode:**
237
 
238
+ | Metric | Result |
239
+ |--------|--------|
240
+ | Overall accuracy | 217 / 250 = **86.8%** |
241
+ | Single-tool accuracy | 154 / 166 = **92.8%** |
242
+ | Multi-tool exact-match | 63 / 84 = **75.0%** |
243
+ | Parse failure rate | 0 / 250 = **0.0%** |
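The percentages in the table follow directly from the raw counts. A small sketch of the arithmetic, where the `rate` helper is a hypothetical name and the counts are the ones reported above:

```python
def rate(correct, total):
    """Percentage rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


# Counts as reported in the eval table.
SINGLE = (154, 166)
MULTI = (63, 84)
PARSE_FAILURES = (0, 250)

# Overall accuracy is the pooled result of the two subsets.
OVERALL = (SINGLE[0] + MULTI[0], SINGLE[1] + MULTI[1])

if __name__ == "__main__":
    print("overall", OVERALL, rate(*OVERALL))
    print("single-tool", rate(*SINGLE))
    print("multi-tool", rate(*MULTI))
    print("parse failure", rate(*PARSE_FAILURES))
    print("multi-tool share of eval", rate(MULTI[1], OVERALL[1]))
```

The single-tool and multi-tool rows sum to the 250-row total, so the overall figure is a pooled accuracy rather than an average of the two percentages.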
244
 
245
+ Per-tool F1: `cancel_alarm` 1.00, `get_system_status` 0.96, `set_alarm` 0.93,
246
+ `set_neopixel_pattern` 0.92, `turn_on_lights` 0.90, `blink_lights` 0.89,
247
+ `turn_off_lights` 0.89, `set_led_color` 0.88, `play_buzzer` 0.83,
248
+ `respond` 0.74. (`respond` is the lowest because the model occasionally
249
+ chooses a physical-action tool with a hallucinated text argument when the
250
+ prompt shares keywords with one, an issue the dispatcher's enum validation
251
+ catches at runtime.)
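The card does not show the dispatcher, so the following is only a sketch of what enum validation at dispatch time might look like. The `ENUM_ARGS` table, its argument names and allowed values, and the `validate_call` helper are all hypothetical; real constraints would be read from `tools.json`.

```python
# Hypothetical enum constraints for illustration only; the real argument
# names and allowed values would come from tools.json.
ENUM_ARGS = {
    "play_buzzer": {"pattern": {"beep", "alarm", "chime"}},
    "blink_lights": {"color": {"red", "green", "blue"}},
}


def validate_call(tool, args):
    """Return (ok, reason); reject enum arguments outside the allowed set."""
    for arg_name, allowed in ENUM_ARGS.get(tool, {}).items():
        value = args.get(arg_name)
        if value not in allowed:
            return False, f"{tool}.{arg_name}={value!r} is not an allowed value"
    return True, "ok"


if __name__ == "__main__":
    # A misrouted chat prompt carried as a free-text argument is rejected.
    print(validate_call("play_buzzer", {"pattern": "tell me a joke"}))
    print(validate_call("blink_lights", {"color": "red"}))
```

A rejected call could then fall back to `respond`, although whether the actual dispatcher does so is not stated in this card.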
252
 
253
+ **On-device latency** (SL2619 / 2× Cortex-A55 @ 2 GHz, `performance` governor):
254
+ ~42 s cold prefill (one-time), ~1.6 s / turn warm, measured across a 33-prompt
255
+ exhaustive REPL run on the actual Coralboard.
256
 
257
  ## Latency
258
 
259
+ - **~1.1 – 1.3 s** per call on a laptop CPU (Ollama / standalone client above).
260
+ - **~1.6 s / turn warm**, ~42 s cold prefill on SL2619 (2× Cortex-A55 @ 2 GHz)
261
+ with the CPU governor pinned to `performance`. Measured 2026-05-05 on the
262
+ Grinn Coralboard with the v7 GGUF + the `Function_calling/` demo from
263
+ [BrinqAI/sl2610-examples](https://github.com/BrinqAI/sl2610-examples/tree/coralboard/functiongemma/Function_calling).
264
 
265
  ## ONNX exports (for compiler toolchains)
266
 
 
319
  ## Files
320
 
321
  ```
322
+ functiongemma-physical-ai-v7-Q5_K_M.gguf # 248 MB, GGUF Q5_K_M, 10-tool v7 schema (current)
323
+ functiongemma-physical-ai-v6-Q5_K_M.gguf # 248 MB, GGUF Q5_K_M, 11-tool v6 schema (previous)
324
+ functiongemma-physical-ai-Q4_K_M.gguf # 253 MB, GGUF Q4_K_M, v4c (legacy)
325
+ Modelfile # Ollama Modelfile (function-token format)
326
+ tools.json # 10-tool schema (mobile-actions format, current)
327
+ token_map.json # function-token <-> tool-name map
328
+ onnx/compact-fp32/ # ONNX export, fp32, with KV cache (1.7 GB)
329
+ onnx/compact-fp16/ # ONNX export, fp16, with KV cache (833 MB); see ORT caveat above
330
+ README.md # this file
331
  ```
332
 
333
  ## License
 
341
  - Base model: <https://huggingface.co/google/functiongemma-270m-it>
342
  - Octopus v2 paper: <https://arxiv.org/abs/2404.01744>
343
  - Hardware demo (Coralboard, Google IO 2026: full physical setup,
344
+ WLED-over-USB-CDC, Grinn HAT, end-to-end voice + text REPL):
345
+ <https://github.com/BrinqAI/sl2610-examples/tree/coralboard/functiongemma/Function_calling>
346
+ (BrinqAI fork of the upstream Synaptics demo repo,
347
+ [synaptics-astra-demos/sl2610-examples](https://github.com/synaptics-astra-demos/sl2610-examples)).