---
license: gemma
license_link: https://ai.google.dev/gemma/terms
base_model: google/functiongemma-270m-it
language:
- en
tags:
- function-calling
- edge
- on-device
- physical-ai
- iot
- octopus-v2
- synaptics-sl2619
- gemma3
pipeline_tag: text-generation
inference: false
---
# FunctionGemma 270M: Physical AI (v10, Octopus v2)
Fine-tuned [`google/functiongemma-270m-it`](https://huggingface.co/google/functiongemma-270m-it)
for voice-controlled physical-AI / household-IoT actions on a Synaptics
SL2619 "Coral" edge board (Google IO 2026 demo).
**Current revision:** [`functiongemma-physical-ai-v10-Q5_K_M.gguf`](./functiongemma-physical-ai-v10-Q5_K_M.gguf):
6 tools, ~248 MB Q5_K_M, ~0.48 s cold prefill on the 2-core
Cortex-A55, 97.9 % mean token accuracy on eval.
Schema ships as [`tools.json`](./tools.json). Token-to-tool mapping is
in [`token_map.json`](./token_map.json).
## Tool surface (6 tools)
| Token | Name | Args | Purpose |
|---|---|---|---|
| `<tool_0>` | `set_lights` | `color?`, `effect?`, `state?` | Drive whatever lights are connected: HAT 3-LED indicators or a WLED-driven addressable strip / ring. All three args optional; the model emits only what the user implied. |
| `<tool_1>` | `play_buzzer` | `pattern` | Named pattern on the piezo buzzer: `beep`, `double_beep`, `chirp`, `siren`, `alarm`, `success`, `error`. |
| `<tool_2>` | `set_alarm` | `duration` or `time`, `label?` | Schedule an alarm. Fires the buzzer plus a visible flash. |
| `<tool_3>` | `cancel_alarm` | `label?` | Cancel one alarm by label, or all if no label given. |
| `<tool_4>` | `get_system_status` | `metric` | `cpu`, `memory`, `temperature`, `npu`, or `all`. |
| `<tool_5>` | `respond` | `message` | Natural-language reply when no physical-action tool fits, or when the request is ambiguous and the model needs to ask for clarification. |
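A dispatcher can check a parsed call against this table before touching any hardware. The sketch below is one way to do that; `TOOL_SPECS` and `validate_call` are hypothetical names, the dictionary is a hand-written mirror of the table rather than the shipped `tools.json` format, and treating `duration` / `time` as "at least one must be present" is an assumption.

```python
# Hand-written mirror of the tool table (hypothetical structure; the
# shipped tools.json uses its own schema format).
TOOL_SPECS = {
    "set_lights": {"optional": {"color", "effect", "state"}},
    "play_buzzer": {
        "required": {"pattern"},
        "enums": {"pattern": {"beep", "double_beep", "chirp", "siren",
                              "alarm", "success", "error"}},
    },
    # Assumption: at least one of duration / time must be given.
    "set_alarm": {"one_of": {"duration", "time"}, "optional": {"label"}},
    "cancel_alarm": {"optional": {"label"}},
    "get_system_status": {
        "required": {"metric"},
        "enums": {"metric": {"cpu", "memory", "temperature", "npu", "all"}},
    },
    "respond": {"required": {"message"}},
}


def validate_call(name: str, kwargs: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the call is acceptable."""
    spec = TOOL_SPECS.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]
    given = set(kwargs)
    required = spec.get("required", set())
    one_of = spec.get("one_of", set())
    allowed = required | one_of | spec.get("optional", set())
    problems = [f"missing arg: {a}" for a in sorted(required - given)]
    problems += [f"unexpected arg: {a}" for a in sorted(given - allowed)]
    if one_of and not (one_of & given):
        problems.append("need one of: " + ", ".join(sorted(one_of)))
    for arg, values in spec.get("enums", {}).items():
        if arg in kwargs and kwargs[arg] not in values:
            problems.append(f"bad value for {arg}: {kwargs[arg]}")
    return problems


print(validate_call("play_buzzer", {"pattern": "siren"}))      # []
print(validate_call("get_system_status", {"metric": "gpu"}))   # ['bad value for metric: gpu']
```

Returning a list of problems instead of raising lets the caller decide whether to drop the call, log it, or fall back to a spoken clarification.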
The model is **hardware-agnostic** for lighting: it parses user intent
into semantic args (`color`, `effect`, `state`) and leaves the dispatcher
to map those onto whatever LED hardware is detected at launch: the
HAT's three indicator LEDs, a WLED-driven strip, or a Neopixel ring. The
user vocabulary is hardware-agnostic too: "lights", "LEDs", "strip",
"indicators" all refer to whatever is wired up.
## Prompt format
The v10 model is trained in the
[Octopus v2](https://arxiv.org/abs/2404.01744) style: no schema, no
tools list, just a bare user turn.
```
<start_of_turn>user
{user_text}<end_of_turn>
<start_of_turn>model
```
Tool semantics live in the model weights (via the special functional
tokens `<tool_0>` … `<tool_5>` plus `<end>`), not in the prompt. The
`tools.json` schema in this repo is the dispatcher's arg-validation
contract and is embedded in the GGUF metadata for schema-drift checks,
but it is **not** loaded into the inference prompt. Typical prompts are
~13 tokens.
## Output format: functional tokens, named args
Tool calls emit as **functional tokens with named arguments**, per the
Mercedes-Benz Octopus v2 convention
([arXiv 2501.02342](https://arxiv.org/abs/2501.02342)). Each tool name
compiles to a single special-vocabulary token (`<tool_0>` … `<tool_5>`);
arguments are written as `name="value"` pairs; a single `<end>` token
terminates the call. The model emits **only the args the user implied**;
absent args are simply not present.
Examples:
| User says | Model emits | Resolves to |
|---|---|---|
| `turn the lights red` | `<tool_0>(color="red")<end>` | `set_lights(color="red")` |
| `rainbow on the strip` | `<tool_0>(effect="rainbow")<end>` | `set_lights(effect="rainbow")` |
| `lights off` | `<tool_0>(state="off")<end>` | `set_lights(state="off")` |
| `red sparkle` | `<tool_0>(color="red", effect="sparkle")<end>` | `set_lights(color="red", effect="sparkle")` |
| `set an alarm in 5 minutes` | `<tool_2>(duration="5 minutes")<end>` | `set_alarm(duration="5 minutes")` |
| `cancel all alarms` | `<tool_3>()<end>` | `cancel_alarm()` |
| `what's the cpu` | `<tool_4>(metric="cpu")<end>` | `get_system_status(metric="cpu")` |
| `good morning` | `<tool_5>(message="Good morning. ...")<end>` | `respond(message="...")` |
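A typical tool call decodes in only a few output tokens (3–8 in the
on-device benchmark, or 0.3–0.8 s at ~9.7 tok/s), which fits a
sub-second voice-UX budget on a 2-core Cortex-A55. Longer `respond()`
messages (~25 tokens) take ~2.6 s.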
> ⚠️ Inference servers MUST stop generation on `<end_of_turn>` (or
> `<eos>`), NOT on `<end>`. The model can emit multi-tool sequences
> `<tool_A>(args)<end><tool_B>(args)<end>`, so stopping at the first
> `<end>` truncates legitimate multi-tool output.
## Quick start (Ollama)
```bash
hf download BrinqAI/functiongemma-270m-physical-ai \
  functiongemma-physical-ai-v10-Q5_K_M.gguf Modelfile tools.json token_map.json \
  --local-dir ./fg-physical-ai
cd fg-physical-ai
ollama create functiongemma-physical-ai -f Modelfile
```
The shipped `Modelfile` bakes in the stop tokens (`<end_of_turn>`,
`<eos>`) and decode parameters (`temperature=0`, `num_ctx=1024`,
`num_predict=80`).
## Calling the model
Send a **bare user turn**: no schema, no tools list. With Ollama, use
`raw=true`:
```python
import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434"
MODEL = "functiongemma-physical-ai"

# "reverse" maps a functional token such as "<tool_0>" to its tool name.
with open("token_map.json") as f:
    reverse_token_map = json.load(f)["reverse"]

# Matches name="value" pairs; the value may contain backslash escapes.
NAMED_ARG_RE = re.compile(r'(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"')


def build_prompt(user_text: str) -> str:
    # Bare user turn: no schema and no tools list in the prompt.
    return (
        f"<start_of_turn>user\n{user_text}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )


def call_model(user_text: str) -> str:
    body = json.dumps({
        "model": MODEL,
        "prompt": build_prompt(user_text),
        "raw": True,  # send the prompt as-is, without a chat template
        "stream": False,
        "options": {
            "temperature": 0.0,
            "top_p": 1.0,
            "num_predict": 80,
            # Stop on end of turn, not on <end>, so multi-tool output survives.
            "stop": ["<end_of_turn>", "<eos>"],
        },
    }).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"]


def parse_call(raw: str) -> tuple[str | None, dict[str, str]]:
    """Return (tool_name, kwargs) for the first call. tool_name is None on parse fail."""
    m = re.match(r"\s*(<tool_\d+>)\((.*?)\)<end>", raw)
    if not m:
        return None, {}
    tok, body = m.group(1), m.group(2)
    kwargs = dict(NAMED_ARG_RE.findall(body))
    return reverse_token_map.get(tok), kwargs


raw = call_model("turn the lights red")
print(raw)              # e.g. '<tool_0>(color="red")<end>'
print(parse_call(raw))  # ('set_lights', {'color': 'red'})
```
For `llama-cpp-python` directly, use `detokenize(..., special=True)` so
the `<tool_N>` and `<end>` tokens render in the output instead of being
stripped.
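The `parse_call` helper in the Ollama example reads only the first call in the output. Since the model can emit `<tool_A>(args)<end><tool_B>(args)<end>` sequences, a dispatcher may want every call. The sketch below is one way to collect them; `parse_calls` is a hypothetical helper, and the token map is inlined from the tool table for self-containment where a real deployment would load `token_map.json`.

```python
import re

# Token-to-name mapping copied from the tool table; a real deployment
# would load it from token_map.json instead.
TOKEN_TO_TOOL = {
    "<tool_0>": "set_lights",
    "<tool_1>": "play_buzzer",
    "<tool_2>": "set_alarm",
    "<tool_3>": "cancel_alarm",
    "<tool_4>": "get_system_status",
    "<tool_5>": "respond",
}

# One functional token, a parenthesised arg list, then the <end> terminator.
CALL_RE = re.compile(r"(<tool_\d+>)\((.*?)\)<end>", re.DOTALL)
ARG_RE = re.compile(r'(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"')


def parse_calls(raw: str) -> list[tuple[str, dict[str, str]]]:
    """Return every (tool_name, kwargs) pair found in the raw output, in order."""
    calls = []
    for tok, body in CALL_RE.findall(raw):
        name = TOKEN_TO_TOOL.get(tok)
        if name is None:
            continue  # unknown functional token: skip rather than guess
        calls.append((name, dict(ARG_RE.findall(body))))
    return calls


raw = '<tool_0>(color="red")<end><tool_1>(pattern="beep")<end>'
print(parse_calls(raw))
# [('set_lights', {'color': 'red'}), ('play_buzzer', {'pattern': 'beep'})]
```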
## Training data
Training data was generated from Haiku-authored phrasing templates
crossed with deterministic entity pools, then lightly augmented with
Moonshine-flavored ASR noise (dropped function words, lowercased traces,
filler-word prepends). Each record is a flat `{input, output}` pair,
with no tools / messages array and no chat template.
| | |
|---|---|
| Train rows | 5,222 |
| Eval rows | 920 |
| Tools | 6 |
| Per-template entity expansion | color × effect × state pools for `set_lights`; pattern pool for `play_buzzer`; duration / time pools for `set_alarm`; metric pool for `get_system_status` |
| ASR-style augmentation | Moonshine-sim noise on a fraction of records (dropped articles, lowercased traces, filler prepends) |
| Multi-tool fraction | None (single-tool emphasis; multi-tool routines are composed at dispatch time) |
The `set_lights` tool also gets explicit **failure-mode rows** that
route bare ambiguous prompts to `respond()`, e.g. "rainbow" alone
("Did you mean the lights? Try 'rainbow on the lights'."), "siren" alone
(prompts the user toward `play_buzzer`), and bare "on" / "off"
(asks what the user wants to act on).
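The template-times-pool generation described in this section can be sketched roughly as follows. The templates, pools, filler words, and the 30 % noise rate are all invented for illustration (the real ones are not part of this card), and the noise function is a simple stand-in for the Moonshine-style augmentation, not the actual pipeline.

```python
import itertools
import random

# Hypothetical templates and pools, for illustration only.
TEMPLATES = ["turn the lights {color}", "make the lights {color}"]
COLORS = ["red", "green", "blue"]
FILLERS = ["um", "uh", "okay"]
ARTICLES = {"the", "a", "an"}


def expand(templates, colors):
    """Cross every template with every color into flat {input, output} rows."""
    rows = []
    for template, color in itertools.product(templates, colors):
        rows.append({
            "input": template.format(color=color),
            "output": f'<tool_0>(color="{color}")<end>',
        })
    return rows


def add_asr_noise(text: str, rng: random.Random) -> str:
    """Apply one noise type: dropped articles, lowercasing, or a filler prepend."""
    kind = rng.choice(["drop_articles", "lowercase", "filler"])
    if kind == "drop_articles":
        return " ".join(w for w in text.split() if w.lower() not in ARTICLES)
    if kind == "lowercase":
        return text.lower()
    return f"{rng.choice(FILLERS)} {text}"


rng = random.Random(0)
rows = expand(TEMPLATES, COLORS)
# Noise only a fraction of records (30 % here is an arbitrary choice);
# the output side is left untouched.
noisy = [
    {**r, "input": add_asr_noise(r["input"], rng)} if rng.random() < 0.3 else r
    for r in rows
]
print(len(rows))  # 6 rows: 2 templates x 3 colors
print(rows[0])    # {'input': 'turn the lights red', 'output': '<tool_0>(color="red")<end>'}
```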
## Methodology
- **Full bf16 fine-tune** (no LoRA).
- **Functional tokens**: `<tool_0>` … `<tool_5>` + `<end>` added as
`additional_special_tokens`; new embeddings **mean-initialized** from
the existing input-embedding matrix (random init under-converges on
small datasets at this scale).
- **Completion-only loss mask**: hand-rolled; labels before
`<start_of_turn>model\n` are masked to `-100`. The model learns only
from the assistant turn, not the user prompt.
- **5 epochs**, lr `3e-5`, cosine schedule, 0.1 warmup, weight decay
0.01.
- **Effective batch = 16**
(`per_device_train_batch_size=8 × gradient_accumulation_steps=2`).
- **`max_length=256`**: the trained prompt format is ~13 tokens and
the assistant turn fits comfortably under 64 tokens, including
`respond()` messages.
- bf16, gradient checkpointing, `adamw_torch_fused`,
`metric_for_best_model="eval_loss"` + `load_best_model_at_end=True`.
- Training wallclock: **5 min on a single H100** (~15–20 min on a 4090).
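The completion-only loss mask from the list above can be illustrated on plain token-id lists. This is a minimal sketch rather than the training code: the helper names and toy ids are made up, and masking the marker tokens themselves (not just what precedes them) is an assumption. `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def find_subsequence(haystack: list[int], needle: list[int]) -> int:
    """Return the start index of the first occurrence of needle, or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def mask_prompt_labels(input_ids: list[int], marker_ids: list[int]) -> list[int]:
    """Copy input_ids to labels, masking everything up to and including the marker.

    Assumption: the marker tokens are masked too, so loss is computed only
    on the assistant tokens that follow them.
    """
    start = find_subsequence(input_ids, marker_ids)
    if start < 0:
        # No assistant turn found: mask the whole row so it adds no loss.
        return [IGNORE_INDEX] * len(input_ids)
    cut = start + len(marker_ids)
    return [IGNORE_INDEX] * cut + input_ids[cut:]


# Toy ids: 1..4 stand in for the user turn, [7, 8, 9] for the tokenized
# "<start_of_turn>model\n" marker, and 20..22 for the assistant tokens.
input_ids = [1, 2, 3, 4, 7, 8, 9, 20, 21, 22]
labels = mask_prompt_labels(input_ids, marker_ids=[7, 8, 9])
print(labels)  # [-100, -100, -100, -100, -100, -100, -100, 20, 21, 22]
```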
### Citation
```bibtex
@article{chen2024octopusv2,
title = {Octopus v2: On-device language model for super agent},
author = {Chen, Wei and Li, Zhiyuan},
journal = {arXiv preprint arXiv:2404.01744},
year = {2024},
url = {https://arxiv.org/abs/2404.01744}
}
@article{merc2025octopusv2,
title = {Octopus v2 named-arg function calling},
journal = {arXiv preprint arXiv:2501.02342},
year = {2025},
url = {https://arxiv.org/abs/2501.02342}
}
```
## Results
### Training metrics (final epoch)
| | |
|---|---|
| Final train loss | 0.493 |
| Final eval loss | **0.046** |
| Mean token accuracy (eval) | **97.9 %** |
### Held-out smoke test (post-train, 36 prompts spanning all 6 tools)
| | |
|---|---|
| Smoke-test routing accuracy | **35 / 36 (97.2 %)** |
The 36-prompt suite covers single-tool happy paths for every tool plus
failure modes the model is expected to deflect: ambiguous color words
without a target ("make it red"), effect names without a target
("rainbow"), unsupported features ("play a tone at 2000 hz"), and
out-of-scope appliances. Failure-mode prompts all route to `respond()`
with a helpful clarification message.
### On-device benchmark (Coralboard, 2-core Cortex-A55 @ 2 GHz, Q5_K_M GGUF)
Measured with `llama-cpp-python` 0.3.16, `n_ctx=1024`, `n_threads=2`,
CPU governor `performance`, 8 representative prompts spanning all 6
tools.
| | |
|---|---|
| Model load | 2.23 s |
| Prompt tokens | 11–16 (mean ~13) |
| **Cold prefill (turn 1)** | **0.48 s** |
| Warm prefill (turn 2+, avg) | 0.47 s |
| Decode rate | **~9.7 tok/s** |
| Decode time, typical tool call (3–8 output tokens) | 0.3–0.8 s |
| Decode time, `respond()` (~25 output tokens) | ~2.6 s |
| End-to-end first turn (model load + prefill + decode) | ~3.4 s |
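The decode rows follow from the decode rate, so latency for other output lengths can be estimated the same way. The short calculation below uses only figures from the table; the function names are illustrative, and it assumes decode time scales linearly with token count.

```python
# Figures copied from the benchmark table.
MODEL_LOAD_S = 2.23
COLD_PREFILL_S = 0.48
DECODE_TOK_PER_S = 9.7


def decode_seconds(n_tokens: int) -> float:
    """Estimated decode time, assuming a constant decode rate."""
    return n_tokens / DECODE_TOK_PER_S


def first_turn_seconds(n_tokens: int) -> float:
    """Model load + cold prefill + decode for the first turn after startup."""
    return MODEL_LOAD_S + COLD_PREFILL_S + decode_seconds(n_tokens)


for n in (3, 8, 25):
    print(f"{n:>2} tokens: decode {decode_seconds(n):.2f} s, "
          f"first turn {first_turn_seconds(n):.2f} s")
```

For 3, 8, and 25 output tokens the decode estimates come out at about 0.31 s, 0.82 s, and 2.58 s, in line with the 0.3–0.8 s and ~2.6 s rows.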
## Files
```
functiongemma-physical-ai-v10-Q5_K_M.gguf # ~248 MB, Q5_K_M weights (Ollama / llama.cpp)
Modelfile # Ollama Modelfile (functional-token format)
tools.json # 6-tool schema, canonical mobile-actions format
token_map.json # functional-token <-> tool-name map
README.md # this file
```
Earlier checkpoint GGUFs from the project's development history
(`functiongemma-physical-ai-v9-Q5_K_M.gguf`,
`functiongemma-physical-ai-v7-Q5_K_M.gguf`,
`functiongemma-physical-ai-v6-Q5_K_M.gguf`,
`functiongemma-physical-ai-Q4_K_M.gguf`) remain in the repo for
reproducibility. They use different tool surfaces and (for v7 and
earlier) a different inference-prompt format; new deployments should use
the v10 file above.
## License
Released under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
By using this model you agree to those terms. Base model:
[`google/functiongemma-270m-it`](https://huggingface.co/google/functiongemma-270m-it).
## Links
- Base model: <https://huggingface.co/google/functiongemma-270m-it>
- Octopus v2 paper: <https://arxiv.org/abs/2404.01744>
- Mercedes-Benz Octopus v2 (named-arg variant): <https://arxiv.org/abs/2501.02342>
- Hardware demo + integration code (Synaptics Coralboard, Grinn HAT,
WLED-over-USB-CDC, full PyQt UI):
<https://github.com/synaptics-astra-demos/sl2610-examples> →
`Function_calling/`