hmahadik committed 18 days ago

Commit 3947888 (verified) · 1 parent: 960c2dd

docs: add ONNX section + fp16/ORT caveat

Files changed (1)

README.md +86 -14

@@ -147,10 +147,26 @@ print(parse_call(raw))      # ('turn_on_lights', '')
 ## Training data
-- **Size**: 367 train / 100 eval examples.
-- **Mix**: paraphrase expansion + multi-tool sequences + `respond()`
-  fallbacks for ambiguous / out-of-scope prompts (so the model has a
-  clean exit when no tool fits, rather than hallucinating one).
 - **Buzzer schema**: pattern-only (binary GPIO on the reference HAT — no
   PWM). Old `frequency_hz` / `duration_seconds` prompts are routed
   through `respond()` as out-of-scope negatives.
@@ -201,19 +217,19 @@ for a smaller dataset:
 ## Smoke-test results
-10-prompt Ollama smoke against the registered model:
 | Smoke pass-rate |
 |-----------------|
-| **8 / 10 (80 %)** |
-The model handles the simple control prompts cleanly (`turn on the
-lights`, `blink red 3 times`, `play a beep`, `take a picture`, `good
-morning` → respond). Known weak prompts at 367-example scale: `set led
-red brightness 50` (hallucinated `acceptor(...)` — likely Q4_K_M
-quantization artifact on `<tool_2>`) and `set alarm 5 minutes`
-(misroutes). Plan: paraphrase-expand the dataset to 2–3k examples for the
-next checkpoint.
 ## Latency
@@ -223,13 +239,69 @@ Measured against a local Ollama using the standalone client above:
 - Target on SL2619 (2× Cortex-A55 @ 2 GHz): **0.5 – 1.2 s** with the CPU
   governor pinned to `performance`. On-device measurement pending.
 ## Files
 ```
-functiongemma-physical-ai-Q4_K_M.gguf   # 253 MB, weights
 Modelfile                                # Ollama Modelfile (function-token format)
 tools.json                               # 13-tool schema (mobile-actions format)
 token_map.json                           # function-token <-> tool-name map
 README.md                                # this file
 ```

 ## Training data
+### v5 (current — use this for training)
+- **Size**: 1,400 train / 150 eval (v5 dataset, `coral_v5_compact.jsonl`).
+- **Multi-tool**: 292 multi-tool examples in train (20.9%), 50 in eval (33.3%). The Google
+  mobile-actions target is 33.4%; the train share is capped by pool size, because the ~450
+  Haiku-generated multi-tool examples deduplicated to 343 unique. Future work: generate more
+  multi-tool paraphrases with additional agents.
+- **Generation**: base hand-written examples + `paraphrases_cache.json` (generated by parallel
+  Claude Haiku agents). 971 new single-tool + 450 new multi-tool paraphrases before dedup.
+- **Coverage fixes**: explicit brightness form ("set led red brightness 50") — 46 examples.
+  Bare alarm form ("set alarm 5 minutes", no preposition) — 36 examples. Both were zero in v4
+  and caused the two known smoke-test failures.
+- **Non-determinism fix**: `set_led_color_examples()` previously used unseeded `random.sample`;
+  now iterates all 18 templates × 12 colors deterministically (216 examples vs ~60).
+- **Eval harness**: `scripts/eval_harness.py` — greedy decode against eval JSONL, per-tool F1,
+  arg-match rate, multi-tool sequence accuracy. Run on GPU host post-training.
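The non-determinism fix described in the list above can be illustrated with a minimal sketch. Only the function name `set_led_color_examples()` and the idea of replacing an unseeded `random.sample` with a full template × color product come from the notes; the template strings, the color list, the `set_led_color` tool name, and the record layout are hypothetical placeholders, not the repository's data.

```python
from itertools import product

# Hypothetical stand-ins: the real generator reportedly uses 18 templates and 12 colors.
TEMPLATES = ["set the led to {color}", "make the light {color}", "turn the led {color}"]
COLORS = ["red", "green", "blue", "white"]


def set_led_color_examples(templates=TEMPLATES, colors=COLORS):
    """Enumerate every template/color pair in a fixed order (no randomness)."""
    return [
        # "set_led_color" and the record layout are assumptions for illustration.
        {"prompt": t.format(color=c), "tool": "set_led_color", "args": {"color": c}}
        for t, c in product(templates, colors)
    ]


examples = set_led_color_examples()
print(len(examples))  # 3 templates x 4 colors = 12
```

Because the output depends only on the two input lists, repeated runs yield the same dataset, and the example count is simply `len(templates) * len(colors)` (18 × 12 = 216 in the case described above).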
+### v4 (previous)
+- **Size**: 367 train / 100 eval.
+- **Multi-tool**: 13% (vs Google mobile-actions 33.4%).
 - **Buzzer schema**: pattern-only (binary GPIO on the reference HAT — no
   PWM). Old `frequency_hz` / `duration_seconds` prompts are routed
   through `respond()` as out-of-scope negatives.
 ## Smoke-test results
+**v4 checkpoint (367-example training):**
 | Smoke pass-rate |
 |-----------------|
+| 8 / 10 (80 %) |
+Note: 21/22 smoke prompts are NOT in the held-out eval set, so 80% measures training
+memorization, not generalization. The two failures — `set led red brightness 50`
+(hallucinated `acceptor(...)`) and `set alarm 5 minutes` (misrouted) — were caused by
+absent phrasing patterns, now fixed in v5.
+**v5 checkpoint: pending GPU training run.** Use `scripts/eval_harness.py` for
+proper per-tool precision/recall/F1 against the 150-example held-out eval set.
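For readers unfamiliar with the per-tool metrics mentioned above, here is a minimal sketch of how precision, recall, and F1 per tool could be computed for single-tool predictions. It is not the code in `scripts/eval_harness.py`; the function name `per_tool_f1`, the input layout, and the `set_alarm` tool name are assumptions made for illustration.

```python
from collections import Counter


def per_tool_f1(gold, pred):
    """gold, pred: equal-length lists of tool names, one per eval example."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tool was wrong
            fn[g] += 1  # expected tool was missed
    scores = {}
    for tool in sorted(set(gold) | set(pred)):
        predicted = tp[tool] + fp[tool]
        expected = tp[tool] + fn[tool]
        precision = tp[tool] / predicted if predicted else 0.0
        recall = tp[tool] / expected if expected else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tool] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical labels; "set_alarm" is an assumed tool name.
gold = ["turn_on_lights", "set_alarm", "respond", "set_alarm"]
pred = ["turn_on_lights", "respond", "respond", "set_alarm"]
scores = per_tool_f1(gold, pred)
print(scores["set_alarm"])  # precision 1.0, recall 0.5
```

Reporting the metric per tool, rather than a single pass-rate, makes a misrouted prompt visible twice: as a recall drop for the expected tool and as a precision drop for the tool that was chosen instead.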
 ## Latency
 - Target on SL2619 (2× Cortex-A55 @ 2 GHz): **0.5 – 1.2 s** with the CPU
   governor pinned to `performance`. On-device measurement pending.
+## ONNX exports (for compiler toolchains)
+For compiler-targeted backends (ONNX Runtime, IREE/MLIR, OpenVINO, TensorRT,
+Synaptics Torq), the model is also published as ONNX with KV-cache support
+(`text-generation-with-past`). Both exports are derived from the same
+`coral-functiongemma-v4c-compact` checkpoint as the GGUF above.
+| Path | Precision | Weight init dtype | Size | ORT runnable |
+|------|-----------|-------------------|------|--------------|
+| `onnx/compact-fp32/model.onnx` | fp32 | 237 / 237 FLOAT | 1.7 GB | yes |
+| `onnx/compact-fp16/model.onnx` | fp16 | 237 / 237 FLOAT16 | 833 MB | no — see note |
+Both files are structurally valid (`onnx.checker.check_model(..., full_check=True)`
+passes). Each export ships with the matching tokenizer and `config.json` so it
+can be loaded directly:
+```python
+from transformers import AutoTokenizer
+import onnxruntime as ort
+import json
+import numpy as np
+MODEL = "onnx/compact-fp32"  # or downloaded local path
+tok = AutoTokenizer.from_pretrained(MODEL)
+sess = ort.InferenceSession(f"{MODEL}/model.onnx", providers=["CPUExecutionProvider"])
+with open("tools.json") as f:  # close the file handle once the schema is loaded
+    tools = json.load(f)["tools"]
+prompt = tok.apply_chat_template(
+    [{"role": "developer",
+      "content": "You are a model that can do function calling with the following functions\n",
+      "tool_calls": None},
+     {"role": "user", "content": "Turn on the lights", "tool_calls": None}],
+    tools=tools, tokenize=False, add_generation_prompt=True,
+)
+# Then feed input_ids plus empty past_key_values.* tensors (shape (1, num_kv_heads, 0, head_dim)),
+# greedy-decode in a loop, and stop on <end>. See repo for full snippet.
+```
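The decode loop that the closing comments of the snippet describe can be sketched as follows. This is not the repository's full snippet: `run_step` is a hypothetical wrapper around one forward pass (for example one `sess.run` call that maps `input_ids` and the cache tensors to logits and the updated cache), and `toy_step` is a stand-in model so the loop can be exercised without ONNX Runtime. The actual input and output names of the exported graph are not assumed here.

```python
import numpy as np


def greedy_decode(run_step, prompt_ids, num_layers, num_kv_heads, head_dim,
                  end_id, max_new_tokens=32):
    """Greedy decoding with a KV cache.

    run_step(ids, past) -> (logits, new_past) is a hypothetical wrapper around
    one forward pass; logits has shape (1, seq_len, vocab_size).
    """
    # Empty cache: the sequence axis has length 0.
    empty = np.zeros((1, num_kv_heads, 0, head_dim), dtype=np.float32)
    past = [(empty.copy(), empty.copy()) for _ in range(num_layers)]
    ids = np.asarray(prompt_ids, dtype=np.int64).reshape(1, -1)
    generated = []
    for _ in range(max_new_tokens):
        logits, past = run_step(ids, past)
        next_id = int(np.argmax(logits[0, -1]))  # greedy: most likely next token
        generated.append(next_id)
        if next_id == end_id:
            break
        # Feed only the new token; earlier context lives in `past`.
        ids = np.array([[next_id]], dtype=np.int64)
    return generated


def toy_step(ids, past, vocab_size=8):
    """Toy model: always predicts (last token + 1) and grows the cache."""
    logits = np.zeros((1, ids.shape[1], vocab_size), dtype=np.float32)
    logits[0, -1, (int(ids[0, -1]) + 1) % vocab_size] = 1.0
    new_past = []
    for k, v in past:
        pad = np.zeros((1, k.shape[1], ids.shape[1], k.shape[3]), dtype=np.float32)
        new_past.append((np.concatenate([k, pad], axis=2),
                         np.concatenate([v, pad], axis=2)))
    return logits, new_past


tokens = greedy_decode(toy_step, [1, 2], num_layers=2, num_kv_heads=1,
                       head_dim=4, end_id=5)
print(tokens)  # [3, 4, 5]
```

After the first step only the newly generated token is fed back, which is the point of exporting with `text-generation-with-past`: the prompt is processed once and each later step costs a single-token forward pass.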
+Smoke decode of "Turn on the lights" against the fp32 ONNX returns
+`<tool_0>()<end>` (= `turn_on_lights()`), matching the GGUF output.
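As an illustration of how a raw output such as `<tool_0>()<end>` maps back to a tool name, here is a minimal parsing sketch. It is not the repository's `parse_call`; the function name `parse_call_sketch`, the regular expression, and the one-entry `TOKEN_MAP` layout are assumptions (the real mapping lives in `token_map.json`, whose format is not shown here).

```python
import re

# Hypothetical excerpt; the real function-token map is stored in token_map.json.
TOKEN_MAP = {"<tool_0>": "turn_on_lights"}

# A function token followed by a parenthesised argument string.
CALL_RE = re.compile(r"(<tool_\d+>)\((.*?)\)")


def parse_call_sketch(raw, token_map=TOKEN_MAP):
    """Return (tool_name, args_string), or None if no known call is found."""
    body = raw.split("<end>", 1)[0]  # ignore anything after the stop token
    match = CALL_RE.search(body)
    if match is None or match.group(1) not in token_map:
        return None
    return token_map[match.group(1)], match.group(2)


print(parse_call_sketch("<tool_0>()<end>"))  # ('turn_on_lights', '')
```

The non-greedy `(.*?)` stops at the first closing parenthesis, so this sketch would need a real argument parser if argument values could themselves contain parentheses.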
+### fp16 + ONNX Runtime caveat
+The fp16 ONNX file is structurally valid but **does not currently load in
+ONNX Runtime ≥ 1.20** for this model: ORT's `SimplifiedLayerNormFusion` pass
+chokes on the `InsertedPrecisionFreeCast_*` nodes that the fp16 conversion
+inserts around Gemma3's RMSNorm layers. The error is internal to the graph optimizer
+and still reproduces with the optimization level set to `ORT_DISABLE_ALL`. This is an ORT bug, not an ONNX-spec
+issue — the file passes `onnx.checker` and the graph is well-formed.
+For compiler frontends that consume ONNX directly (IREE / MLIR, TensorRT,
+OpenVINO, Synaptics Torq), the fp16 file should ingest fine. For runtime
+inference via `onnxruntime` itself, use the fp32 export and let your compiler
+or runtime do its own dtype conversion / quantization downstream.
 ## Files
 ```
+functiongemma-physical-ai-Q4_K_M.gguf   # 253 MB, GGUF Q4_K_M weights (Ollama / llama.cpp)
 Modelfile                                # Ollama Modelfile (function-token format)
 tools.json                               # 13-tool schema (mobile-actions format)
 token_map.json                           # function-token <-> tool-name map
+onnx/compact-fp32/                       # ONNX export, fp32, with KV cache (1.7 GB)
+onnx/compact-fp16/                       # ONNX export, fp16, with KV cache (833 MB) — see ORT caveat above
 README.md                                # this file
 ```