functiongemma-270m-physical-ai / README.md

hmahadik

v10: unified set_lights, named-args output, 6 tools

2a22670 verified 7 days ago


	---
	license: gemma
	license_link: https://ai.google.dev/gemma/terms
	base_model: google/functiongemma-270m-it
	language:
	- en
	tags:
	- function-calling
	- edge
	- on-device
	- physical-ai
	- iot
	- octopus-v2
	- synaptics-sl2619
	- gemma3
	pipeline_tag: text-generation
	inference: false
	---

	# FunctionGemma 270M — Physical AI (v10, Octopus v2)

	A fine-tune of [`google/functiongemma-270m-it`](https://huggingface.co/google/functiongemma-270m-it)
	for voice-controlled physical-AI / household-IoT actions on a Synaptics
	SL2619 "Coral" edge board (Google I/O 2026 demo).

	Current revision: [`functiongemma-physical-ai-v10-Q5_K_M.gguf`](./functiongemma-physical-ai-v10-Q5_K_M.gguf)
	— 6 tools, ~248 MB Q5_K_M, ~0.48 s cold prefill on the 2-core
	Cortex-A55, 97.9 % mean token accuracy on eval.

	Schema ships as [`tools.json`](./tools.json). Token-to-tool mapping is
	in [`token_map.json`](./token_map.json).

	## Tool surface (6 tools)

	| Token | Name | Args | Purpose |
	|---|---|---|---|
	| `<tool_0>` | `set_lights` | `color?`, `effect?`, `state?` | Drive whatever lights are connected: HAT 3-LED indicators or a WLED-driven addressable strip / ring. All three args are optional; the model emits only what the user implied. |
	| `<tool_1>` | `play_buzzer` | `pattern` | Named pattern on the piezo buzzer: `beep`, `double_beep`, `chirp`, `siren`, `alarm`, `success`, `error`. |
	| `<tool_2>` | `set_alarm` | `duration` or `time`, `label?` | Schedule an alarm. Fires the buzzer plus a visible flash. |
	| `<tool_3>` | `cancel_alarm` | `label?` | Cancel one alarm by label, or all alarms if no label is given. |
	| `<tool_4>` | `get_system_status` | `metric` | `cpu`, `memory`, `temperature`, `npu`, or `all`. |
	| `<tool_5>` | `respond` | `message` | Natural-language reply when no physical-action tool fits, or when the request is ambiguous and the model needs to ask for clarification. |

	The model is hardware-agnostic for lighting: it parses user intent
	into semantic args (`color`, `effect`, `state`) and leaves the dispatcher
	to map those onto whatever LED hardware is detected at launch — the
	HAT's three indicator LEDs, a WLED-driven strip, or a Neopixel ring. The
	user vocabulary is hardware-agnostic too: "lights", "LEDs", "strip",
	"indicators" all refer to whatever is wired up.
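The split between semantic args and hardware can be pictured with a minimal dispatcher sketch. Everything here is hypothetical: the class names `HatLeds` and `WledStrip`, the `apply` method, and the returned strings are illustrative stand-ins, not the integration code from the demo repository.

```python
from typing import Optional, Protocol


class LightBackend(Protocol):
    """Anything that can act on the model's semantic lighting args."""

    def apply(self, color: Optional[str], effect: Optional[str],
              state: Optional[str]) -> str: ...


class HatLeds:
    """Hypothetical backend for the HAT's three indicator LEDs."""

    def apply(self, color, effect, state):
        if state == "off":
            return "hat: all LEDs off"
        # Assumption: fixed indicators cannot animate, so the effect is dropped.
        return f"hat: LEDs on, color={color or 'default'}"


class WledStrip:
    """Hypothetical backend for a WLED-driven addressable strip."""

    def apply(self, color, effect, state):
        if state == "off":
            return "wled: strip off"
        return f"wled: color={color or 'unchanged'}, effect={effect or 'solid'}"


def set_lights(backend: LightBackend, color=None, effect=None, state=None) -> str:
    """Forward the semantic args to whichever backend was detected at launch."""
    return backend.apply(color, effect, state)


# The same model output drives different hardware without retraining.
print(set_lights(WledStrip(), color="red", effect="sparkle"))
print(set_lights(HatLeds(), color="red", effect="sparkle"))
```

The model output stays identical in both calls; only the backend chosen at launch decides what the args mean physically.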

	## Prompt format

	The v10 model is trained in the
	[Octopus v2](https://arxiv.org/abs/2404.01744) style: the prompt carries
	no schema and no tools list, just a bare user turn.

	```
	<start_of_turn>user
	{user_text}<end_of_turn>
	<start_of_turn>model

	```

	Tool semantics live in the model weights (via the special functional
	tokens `<tool_0>` … `<tool_5>` plus `<end>`), not in the prompt. The
	`tools.json` schema in this repo is the dispatcher's arg-validation
	contract and is embedded in the GGUF metadata for schema-drift checks,
	but it is not loaded into the inference prompt. Typical prompts are
	~13 tokens.
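Because the schema never enters the prompt, argument checking has to happen on the dispatcher side. The sketch below shows one way such a check could look. The `CONTRACT` dictionary is a hypothetical, simplified structure derived from the tool table in this README; it is not the actual layout of `tools.json`, and it leaves out `set_alarm` (whose "`duration` or `time`" rule needs an extra either/or check).

```python
from __future__ import annotations

# Hypothetical contract; the real tools.json may be organized differently.
CONTRACT = {
    "set_lights": {"allowed": {"color", "effect", "state"}, "required": set()},
    "play_buzzer": {
        "allowed": {"pattern"},
        "required": {"pattern"},
        "enums": {"pattern": {"beep", "double_beep", "chirp", "siren",
                              "alarm", "success", "error"}},
    },
    "cancel_alarm": {"allowed": {"label"}, "required": set()},
    "get_system_status": {
        "allowed": {"metric"},
        "required": {"metric"},
        "enums": {"metric": {"cpu", "memory", "temperature", "npu", "all"}},
    },
    "respond": {"allowed": {"message"}, "required": {"message"}},
}


def validate_call(tool: str, kwargs: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the call is acceptable."""
    spec = CONTRACT.get(tool)
    if spec is None:
        return [f"unknown tool: {tool}"]
    errors = []
    for name in sorted(set(kwargs) - spec["allowed"]):
        errors.append(f"unexpected arg: {name}")
    for name in sorted(spec["required"] - set(kwargs)):
        errors.append(f"missing arg: {name}")
    for name, choices in spec.get("enums", {}).items():
        if name in kwargs and kwargs[name] not in choices:
            errors.append(f"invalid {name}: {kwargs[name]}")
    return errors


print(validate_call("play_buzzer", {"pattern": "siren"}))
print(validate_call("play_buzzer", {"pattern": "honk"}))
```

Returning a list of problems rather than raising lets the dispatcher decide whether to reject the call or fall back to a clarification message.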

	## Output format — functional tokens, named args

	Tool calls are emitted as functional tokens with named arguments, per the
	Mercedes-Benz Octopus v2 convention
	([arXiv 2501.02342](https://arxiv.org/abs/2501.02342)). Each tool name
	compiles to a single special-vocabulary token (`<tool_0>` … `<tool_5>`);
	arguments are written as `name="value"` pairs; a single `<end>` token
	terminates the call. The model emits only the args the user implied;
	absent args are simply omitted.

	Examples:

	| User says | Model emits | Resolves to |
	|---|---|---|
	| `turn the lights red` | `<tool_0>(color="red")<end>` | `set_lights(color="red")` |
	| `rainbow on the strip` | `<tool_0>(effect="rainbow")<end>` | `set_lights(effect="rainbow")` |
	| `lights off` | `<tool_0>(state="off")<end>` | `set_lights(state="off")` |
	| `red sparkle` | `<tool_0>(color="red", effect="sparkle")<end>` | `set_lights(color="red", effect="sparkle")` |
	| `set an alarm in 5 minutes` | `<tool_2>(duration="5 minutes")<end>` | `set_alarm(duration="5 minutes")` |
	| `cancel all alarms` | `<tool_3>()<end>` | `cancel_alarm()` |
	| `what's the cpu` | `<tool_4>(metric="cpu")<end>` | `get_system_status(metric="cpu")` |
	| `good morning` | `<tool_5>(message="Good morning. ...")<end>` | `respond(message="...")` |

	A complete call decodes in roughly 8–20 output tokens, well inside the
	sub-second voice-UX budget on a 2-core Cortex-A55.

	> ⚠️ Inference servers MUST stop generation on `<end_of_turn>` (or
	> `<eos>`), NOT on `<end>`. The model can emit multi-tool sequences
	> `<tool_A>(args)<end><tool_B>(args)<end>`, so stopping at the first
	> `<end>` truncates legitimate multi-tool output.
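A client that lets generation run to `<end_of_turn>` then needs to split the output into individual calls. The following is a minimal sketch of that step, assuming the output follows the `<tool_N>(name="value", ...)<end>` shape described above. The function name `parse_calls` and both regexes are illustrative; the function returns raw tokens, which can then be resolved to tool names through `token_map.json`.

```python
import re

# One call: functional token, parenthesized args, then the <end> terminator.
CALL_RE = re.compile(r"(<tool_\d+>)\((.*?)\)<end>", re.DOTALL)
# One argument: name="value", allowing backslash-escaped characters in the value.
ARG_RE = re.compile(r'(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"')


def parse_calls(raw: str) -> list:
    """Return every (token, kwargs) pair found in the raw model output, in order."""
    calls = []
    for match in CALL_RE.finditer(raw):
        token, body = match.group(1), match.group(2)
        calls.append((token, dict(ARG_RE.findall(body))))
    return calls


raw = '<tool_0>(color="red")<end><tool_1>(pattern="beep")<end>'
for token, kwargs in parse_calls(raw):
    print(token, kwargs)
```

Had the server stopped at the first `<end>`, only the `<tool_0>` call would have reached this parser.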

	## Quick start (Ollama)

	```bash
	hf download BrinqAI/functiongemma-270m-physical-ai \
	functiongemma-physical-ai-v10-Q5_K_M.gguf Modelfile tools.json token_map.json \
	--local-dir ./fg-physical-ai

	cd fg-physical-ai
	ollama create functiongemma-physical-ai -f Modelfile
	```

	The shipped `Modelfile` bakes in the stop tokens (`<end_of_turn>`,
	`<eos>`) and decode parameters (`temperature=0`, `num_ctx=1024`,
	`num_predict=80`).

	## Calling the model

	Send a bare user turn — no schema, no tools list. With Ollama, use
	`raw=true`:

	```python
	import json
	import re
	import urllib.request

	OLLAMA_URL = "http://localhost:11434"
	MODEL = "functiongemma-physical-ai"

	# token_map.json ships with the repo; its "reverse" table is keyed by
	# functional token ("<tool_N>") and yields the tool name.
	with open("token_map.json") as f:
	    reverse_token_map = json.load(f)["reverse"]

	# Matches name="value" pairs, allowing backslash-escaped characters in the value.
	NAMED_ARG_RE = re.compile(r'(\w+)\s*=\s*"((?:[^"\\]|\\.)*)"')


	def build_prompt(user_text: str) -> str:
	    """Wrap the user text in the bare Gemma turn markers the model was trained on."""
	    return (
	        f"<start_of_turn>user\n{user_text}<end_of_turn>\n"
	        f"<start_of_turn>model\n"
	    )


	def call_model(user_text: str) -> str:
	    """Send a raw (template-free) generate request to Ollama and return the text."""
	    body = json.dumps({
	        "model": MODEL,
	        "prompt": build_prompt(user_text),
	        "raw": True,
	        "stream": False,
	        "options": {
	            "temperature": 0.0,
	            "top_p": 1.0,
	            "num_predict": 80,
	            # Stop on end of turn, not on <end>, so multi-tool output survives.
	            "stop": ["<end_of_turn>", "<eos>"],
	        },
	    }).encode()
	    req = urllib.request.Request(
	        f"{OLLAMA_URL}/api/generate",
	        data=body,
	        headers={"Content-Type": "application/json"},
	    )
	    with urllib.request.urlopen(req, timeout=60) as resp:
	        return json.loads(resp.read())["response"]


	def parse_call(raw: str) -> tuple[str | None, dict[str, str]]:
	    """Return (tool_name, kwargs) for the first call. tool_name is None on parse fail."""
	    m = re.match(r"\s*(<tool_\d+>)\((.*?)\)<end>", raw)
	    if not m:
	        return None, {}
	    tok, body = m.group(1), m.group(2)
	    kwargs = {k: v for k, v in NAMED_ARG_RE.findall(body)}
	    return reverse_token_map.get(tok), kwargs


	raw = call_model("turn the lights red")
	print(raw)  # e.g. '<tool_0>(color="red")<end>'
	print(parse_call(raw))  # ('set_lights', {'color': 'red'})
	```

	For `llama-cpp-python` directly, use `detokenize(..., special=True)` so
	the `<tool_N>` and `<end>` tokens render in the output instead of being
	stripped.
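A minimal sketch of that path is shown below. It assumes the low-level `tokenize`, `generate`, `token_eos`, and `detokenize` methods of `llama_cpp.Llama` as found in recent `llama-cpp-python` releases; the helper name `generate_raw`, the `PROMPT` constant, and the choice to add a BOS token are assumptions rather than the project's own code. The `llm` object is passed in, so the block itself does not import the library.

```python
PROMPT = "<start_of_turn>user\n{user_text}<end_of_turn>\n<start_of_turn>model\n"


def generate_raw(llm, user_text: str, max_tokens: int = 80) -> str:
    """Greedy-decode one model turn and keep special tokens in the returned text."""
    prompt = PROMPT.format(user_text=user_text).encode("utf-8")
    # special=True so the turn markers are parsed as single special tokens.
    prompt_ids = llm.tokenize(prompt, add_bos=True, special=True)

    # Stop on EOS or <end_of_turn>, never on <end> (multi-tool output is legal).
    stop_ids = {llm.token_eos()}
    end_of_turn = llm.tokenize(b"<end_of_turn>", add_bos=False, special=True)
    if len(end_of_turn) == 1:
        stop_ids.add(end_of_turn[0])

    output_ids = []
    for token_id in llm.generate(prompt_ids, temp=0.0):
        if token_id in stop_ids or len(output_ids) >= max_tokens:
            break
        output_ids.append(token_id)

    # special=True renders <tool_N> and <end> instead of dropping them.
    return llm.detokenize(output_ids, special=True).decode("utf-8", errors="replace")


# Usage (needs llama-cpp-python installed and the v10 GGUF on disk):
# from llama_cpp import Llama
# llm = Llama(model_path="functiongemma-physical-ai-v10-Q5_K_M.gguf",
#             n_ctx=1024, n_threads=2, verbose=False)
# print(generate_raw(llm, "turn the lights red"))
```

The higher-level completion helpers may decode without rendering special tokens, which is why this sketch drives the token loop by hand.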

	## Training data

	Training data was generated from Haiku-authored phrasing templates
	crossed with deterministic entity pools, then lightly augmented with
	Moonshine-flavored ASR noise (dropped function words, lowercased traces,
	filler-word prepends). Each record is a flat `{input, output}` pair —
	no tools / messages array, no chat template.

	| Property | Value |
	|---|---|
	| Train rows | 5,222 |
	| Eval rows | 920 |
	| Tools | 6 |
	| Per-template entity expansion | color × effect × state pools for `set_lights`; pattern pool for `play_buzzer`; duration / time pools for `set_alarm`; metric pool for `get_system_status` |
	| ASR-style augmentation | Moonshine-sim noise on a fraction of records (dropped articles, lowercased traces, filler prepends) |
	| Multi-tool fraction | None (single-tool emphasis; multi-tool routines are composed at dispatch time) |

	The `set_lights` tool also gets explicit failure-mode rows that
	route bare ambiguous prompts to `respond()` — e.g. "rainbow" alone
	("Did you mean the lights? Try 'rainbow on the lights'."), "siren" alone
	(prompts the user toward `play_buzzer`), and bare "on" / "off"
	(asks what the user wants to act on).
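The generation recipe (templates crossed with entity pools, then light ASR-style noise) can be sketched as follows. The templates, pools, filler words, and noise probabilities are invented for illustration and are not the project's actual data; only the flat `{input, output}` record shape and the kinds of noise come from the description above.

```python
import itertools
import random

# Hypothetical templates and pools; the real ones are not published here.
TEMPLATES = ["turn the lights {color}", "make the strip {color} please"]
COLORS = ["red", "green", "blue"]
FILLERS = ["um", "uh", "okay"]
ARTICLES = {"the", "a", "an"}


def expand() -> list:
    """Cross every template with every entity to get flat {input, output} rows."""
    return [
        {"input": template.format(color=color),
         "output": f'<tool_0>(color="{color}")<end>'}
        for template, color in itertools.product(TEMPLATES, COLORS)
    ]


def add_asr_noise(text: str, rng: random.Random) -> str:
    """Lowercase, drop articles, and sometimes prepend a filler word."""
    words = [w for w in text.lower().split() if w not in ARTICLES]
    if rng.random() < 0.5:
        words.insert(0, rng.choice(FILLERS))
    return " ".join(words)


def build_dataset(noise_fraction: float = 0.3, seed: int = 0) -> list:
    """Apply noise to a fraction of inputs; outputs are never altered."""
    rng = random.Random(seed)
    rows = expand()
    for row in rows:
        if rng.random() < noise_fraction:
            row["input"] = add_asr_noise(row["input"], rng)
    return rows


for row in build_dataset():
    print(row)
```

Seeding the generator keeps the dataset reproducible, and leaving the `output` field untouched means noisy and clean phrasings of the same request share one target.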

	## Methodology

	- Full bf16 fine-tune (no LoRA).
	- Functional tokens: `<tool_0>` … `<tool_5>` + `<end>` added as
	`additional_special_tokens`; new embeddings mean-initialized from
	the existing input-embedding matrix (random init under-converges on
	small datasets at this scale).
	- Completion-only loss mask: hand-rolled — labels before
	`<start_of_turn>model\n` are masked to `-100`. The model learns only
	from the assistant turn, not the user prompt.
	- 5 epochs, lr `3e-5`, cosine schedule, 0.1 warmup, weight decay
	0.01.
	- Effective batch = 16
	(`per_device_train_batch_size=8 × gradient_accumulation_steps=2`).
	- `max_length=256` — the trained prompt format is ~13 tokens and
	the assistant turn fits comfortably under 64 tokens, including
	`respond()` messages.
	- bf16, gradient checkpointing, `adamw_torch_fused`,
	`metric_for_best_model="eval_loss"` + `load_best_model_at_end=True`.
	- Training wallclock: 5 min on a single H100 (~15–20 min on a 4090).
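The completion-only loss mask from the list above can be illustrated on plain token-id lists. This is a sketch under stated assumptions, not the project's training code: the token ids are toy values, the helper names are hypothetical, and the marker tokens themselves are masked along with the user turn.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_subsequence(seq: list, sub: list) -> int:
    """Return the index of the first occurrence of sub in seq, or -1."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_prompt_labels(input_ids: list, response_marker_ids: list) -> list:
    """Copy input_ids into labels, masking everything up to the end of the marker."""
    start = find_subsequence(input_ids, response_marker_ids)
    if start < 0:
        # No assistant turn found: contribute nothing to the loss.
        return [IGNORE_INDEX] * len(input_ids)
    boundary = start + len(response_marker_ids)
    return [IGNORE_INDEX] * boundary + list(input_ids[boundary:])


# Toy ids: [7, 8, 9] stands in for the tokens of "<start_of_turn>model\n".
marker = [7, 8, 9]
input_ids = [1, 5, 6, 7, 8, 9, 20, 21, 22]
print(mask_prompt_labels(input_ids, marker))
```

Only the last three positions keep real labels, so gradient flows from the assistant turn alone.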

	### Citation

	```bibtex
	@article{chen2024octopusv2,
	title = {Octopus v2: On-device language model for super agent},
	author = {Chen, Wei and Li, Zhiyuan},
	journal = {arXiv preprint arXiv:2404.01744},
	year = {2024},
	url = {https://arxiv.org/abs/2404.01744}
	}

	@article{merc2025octopusv2,
	title = {Octopus v2 named-arg function calling},
	journal = {arXiv preprint arXiv:2501.02342},
	year = {2025},
	url = {https://arxiv.org/abs/2501.02342}
	}
	```

	## Results

	### Training metrics (final epoch)

	| Metric | Value |
	|---|---|
	| Final train loss | 0.493 |
	| Final eval loss | 0.046 |
	| Mean token accuracy (eval) | 97.9 % |

	### Held-out smoke test (post-train, 36 prompts spanning all 6 tools)

	| Metric | Value |
	|---|---|
	| Smoke-test routing accuracy | 35 / 36 (97.2 %) |

	The 36-prompt suite covers single-tool happy paths for every tool plus
	failure modes the model is expected to deflect: ambiguous color words
	without a target ("make it red"), effect names without a target
	("rainbow"), unsupported features ("play a tone at 2000 hz"), and
	out-of-scope appliances. Failure-mode prompts all route to `respond()`
	with a helpful clarification message.

	### On-device benchmark (Coralboard, 2-core Cortex-A55 @ 2 GHz, Q5_K_M GGUF)

	Measured with `llama-cpp-python` 0.3.16, `n_ctx=1024`, `n_threads=2`,
	CPU governor `performance`, 8 representative prompts spanning all 6
	tools.

	| Metric | Value |
	|---|---|
	| Model load | 2.23 s |
	| Prompt tokens | 11–16 (mean ~13) |
	| Cold prefill (turn 1) | 0.48 s |
	| Warm prefill (turn 2+, avg) | 0.47 s |
	| Decode rate | ~9.7 tok/s |
	| Decode time, typical tool call (3–8 output tokens) | 0.3–0.8 s |
	| Decode time, `respond()` (~25 output tokens) | ~2.6 s |
	| End-to-end first turn (model load + prefill + decode) | ~3.4 s |
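The decode and end-to-end rows follow from the load time, prefill time, and decode rate in the same table. The short calculation below reproduces that arithmetic. It assumes latency is simply the sum of the three phases with a constant decode rate, which is a simplification; the function names are illustrative.

```python
# Figures taken from the benchmark table above.
MODEL_LOAD_S = 2.23
COLD_PREFILL_S = 0.48
DECODE_TOK_PER_S = 9.7


def decode_seconds(n_tokens: int) -> float:
    """Decode time for n output tokens at the measured rate."""
    return n_tokens / DECODE_TOK_PER_S


def first_turn_seconds(n_tokens: int) -> float:
    """Model load + cold prefill + decode for the first turn."""
    return MODEL_LOAD_S + COLD_PREFILL_S + decode_seconds(n_tokens)


for n in (3, 8, 25):
    print(f"{n:>2} tokens: decode {decode_seconds(n):.2f} s, "
          f"first turn {first_turn_seconds(n):.2f} s")
```

For 3, 8, and 25 output tokens this gives decode times of about 0.31 s, 0.82 s, and 2.58 s, in line with the 0.3–0.8 s and ~2.6 s rows. Under the same assumptions, the ~3.4 s end-to-end figure corresponds to a call of roughly 7 output tokens.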

	## Files

	```
	functiongemma-physical-ai-v10-Q5_K_M.gguf # ~248 MB, Q5_K_M weights (Ollama / llama.cpp)
	Modelfile # Ollama Modelfile (functional-token format)
	tools.json # 6-tool schema, canonical mobile-actions format
	token_map.json # functional-token <-> tool-name map
	README.md # this file
	```

	Earlier checkpoint GGUFs from the project's development history
	(`functiongemma-physical-ai-v9-Q5_K_M.gguf`,
	`functiongemma-physical-ai-v7-Q5_K_M.gguf`,
	`functiongemma-physical-ai-v6-Q5_K_M.gguf`,
	`functiongemma-physical-ai-Q4_K_M.gguf`) remain in the repo for
	reproducibility. They use different tool surfaces and (for v7 and
	earlier) a different inference-prompt format; new deployments should use
	the v10 file above.

	## License

	Released under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
	By using this model you agree to those terms. Base model:
	[`google/functiongemma-270m-it`](https://huggingface.co/google/functiongemma-270m-it).

	## Links

	- Base model: <https://huggingface.co/google/functiongemma-270m-it>
	- Octopus v2 paper: <https://arxiv.org/abs/2404.01744>
	- Mercedes-Benz Octopus v2 (named-arg variant): <https://arxiv.org/abs/2501.02342>
	- Hardware demo + integration code (Synaptics Coralboard, Grinn HAT,
	WLED-over-USB-CDC, full PyQt UI):
	<https://github.com/synaptics-astra-demos/sl2610-examples> →
	`Function_calling/`