Nexus-TinyFunction-1.2B-v2.0
A fast, tiny function-calling model fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct. Built on LFM2.5's hybrid recurrent-attention architecture for significantly faster inference than transformer-only models of similar or even smaller size — fast enough to run on the CPU of a mobile phone.
No thinking trace or chain-of-thought required. As an instruct-tuned model, it produces accurate tool calls directly without verbose reasoning overhead, keeping latency low and token usage minimal.
Highlights
- Blazing fast inference — hybrid recurrent-attention architecture runs faster than similarly-sized (and even smaller) pure transformer models on both GPU and CPU
- No thinking trace needed — direct tool calls without chain-of-thought overhead, unlike reasoning-based models
- Runs anywhere — Q4_K_M quantization fits in ~700 MB, fast enough for Android phones, Raspberry Pi, edge servers
- Strong irrelevance detection (80.42%) — reliably refuses to call tools when no tool matches the query, avoiding hallucinated function calls
- 94.25% simple function calling — accurate single-tool selection and argument extraction
- JSON Syntax Reliability: 99.3% — near-perfect structured output
- Parallel & Multiple tool calling — handles complex multi-tool scenarios
Benchmark Results
BFCL V4 Benchmark: All Models (Q8_0 GGUF)
The following charts compare models we tested locally in Q8_0 GGUF quantization on the same hardware under identical conditions.
Average BFCL Score Ranking
Inference Speed Comparison
Head-to-Head: vs LFM2.5 Nova (Same Base Model)
Direct comparison with NovachronoAI/LFM2.5-1.2B-Nova-Function-Calling, the other LFM2.5-based function-calling fine-tune. Both models share the same base (LiquidAI/LFM2.5-1.2B-Instruct, BFCL V4 non-live avg: 24.8%).
All scores from BFCL V4, Q8_0 GGUF quantization via llama-server on a single NVIDIA RTX 5090.
JSON Syntax Reliability
| Model | JSON Validity | Invalid | Tool Calls* |
|---|---|---|---|
| Nexus-TinyFunction-1.2B-v2.0 | 99.3% | 18 | 2458 |
| xLAM-2 3B | 99.8% | 4 | 2485 |
| xLAM-2 1B | 99.6% | 10 | 2480 |
| Qwen3.5 4B | 99.1% | 21 | 2423 |
| Qwen3.5 2B | 99.0% | 24 | 2296 |
| Qwen3.5 0.8B | 98.9% | 25 | 2220 |
| LFM2.5 Nova 1.2B | 98.3% | 23 | 1334 |
| LFM2.5 Base 1.2B | 96.8% | 13 | 407 |
*Tool Calls = samples where the model attempted a tool call (out of 2,501 total per model). Models with fewer tool calls responded with plain text more often — the base model only attempted 407/2,501 calls.
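As a concrete illustration of how the JSON Validity column is computed (valid calls over attempted calls), a minimal sketch; the sample strings below are made up for illustration, not drawn from the benchmark:

```python
import json


def json_validity(attempted_calls):
    """Fraction of attempted tool-call payloads that parse as valid JSON."""
    valid = 0
    for raw in attempted_calls:
        try:
            json.loads(raw)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(attempted_calls)


samples = [
    '{"name": "get_weather", "arguments": {"city": "Tokyo"}}',  # valid JSON
    '{"name": "get_weather", "arguments": {city: Tokyo}}',      # invalid: unquoted keys
]
print(json_validity(samples))  # → 0.5
```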
BFCL V4 Official Leaderboard Comparison
How does a 1.2B model compare to frontier API models? We evaluated on 5 of 8 BFCL V4 non-live categories (Python Simple, Multiple, Parallel, Parallel Multiple, Irrelevance Detection). Java Simple, JavaScript Simple, and the combined Simple AST average are excluded — we did not train on or evaluate these categories.
Transparency: Official leaderboard scores use API inference at full precision. Our scores are from Q8_0 GGUF quantization via llama-server. The #rank shown is each model's official rank across all 8 non-live categories.
Why LFM2.5?
We chose LiquidAI/LFM2.5-1.2B-Instruct as the base model for several reasons:
- Faster than transformers at any size — LFM2.5's hybrid recurrent-attention architecture achieves faster inference than pure transformer models of similar size, and even outpaces many smaller transformer models. The sub-quadratic scaling on sequence length is especially beneficial for function-calling workloads where tool definitions consume significant context.
- Built for edge and mobile — At 1.2B parameters, the model runs on consumer hardware, Android phones, Raspberry Pi, and edge servers. The Q4_K_M quantization fits in under 700 MB of RAM.
- Instruct-tuned, not reasoning-dependent — The base model is already instruct-tuned, so function calls are produced directly without chain-of-thought or thinking traces. This keeps latency low and avoids wasting tokens on reasoning overhead.
- Massive improvement over base — The base LFM2.5-1.2B-Instruct averages just 24.8% on BFCL V4 non-live categories. Our fine-tune brings that to 85.2%, a gain of roughly 60 percentage points.
Model Details
| Developed by | Nexus-Syntegra |
| Base model | LiquidAI/LFM2.5-1.2B-Instruct |
| Architecture | Hybrid recurrent-attention (Lfm2ForCausalLM), 1.2B parameters |
| Context length | 32,768 tokens |
| License | Apache 2.0 (fine-tune); base model weights subject to LFM Open License v1.0 |
| Language | English |
| Fine-tune method | QLoRA SFT + 3-stage curriculum learning + DPO |
| Hardware | Single NVIDIA RTX 5090 (32 GB) — training, quantization, and all benchmarks |
| Format | ChatML with <tools> / <tool_call> XML tags |
Prompt Format
This model uses ChatML format with XML-tagged tool definitions and tool calls.
Important: Do not use `apply_chat_template(tools=...)` — the base model's chat template formats tools differently than our fine-tune expects. Instead, include the tools directly in the system message as shown below.
System Prompt with Tools
```
<|im_start|>system
You are a function calling AI assistant. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
<tools>
[{"name": "get_weather", "description": "Get current weather for a location", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "City name"}}, "required": ["city"]}}]
</tools><|im_end|>
```
Single Tool Call
```
<|im_start|>user
What's the weather in Tokyo?<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call><|im_end|>
```
Parallel Tool Calls
When multiple tools should be called, the model outputs them as a JSON array:
```
<|im_start|>user
What's the weather in Tokyo and London?<|im_end|>
<|im_start|>assistant
<tool_call>
[{"name": "get_weather", "arguments": {"city": "Tokyo"}}, {"name": "get_weather", "arguments": {"city": "London"}}]
</tool_call><|im_end|>
```
Irrelevance (No Tool Match)
When no tool matches the user's query, the model responds in plain text without any tool call tags:
```
<|im_start|>user
Tell me a joke<|im_end|>
<|im_start|>assistant
Sure! Why do programmers prefer dark mode? Because light attracts bugs!<|im_end|>
```
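The three output shapes above (a single JSON object, a JSON array for parallel calls, or plain text when no tool matches) can all be handled by one small parser. This is an illustrative helper, not part of the model's tooling:

```python
import json
import re


def parse_model_output(text):
    """Classify a raw completion as tool call(s) or plain text.

    Assumes the output follows the prompt format above: tool calls are
    wrapped in <tool_call>...</tool_call>, plain-text answers are not.
    """
    match = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL)
    if match is None:
        # Irrelevance case: no tool matched, model answered in plain text
        return {"type": "text", "content": text.replace("<|im_end|>", "").strip()}
    calls = json.loads(match.group(1))
    if isinstance(calls, dict):
        # Single call comes as one JSON object; normalize to a list
        calls = [calls]
    return {"type": "tool_calls", "calls": calls}
```

Parallel calls come back as a list of call dicts, so downstream dispatch code only ever has to handle one shape.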
How to Use
With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import json

model_id = "nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto", trust_remote_code=True
)

tools = [{"name": "get_weather", "description": "Get weather for a city",
          "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                         "required": ["city"]}}]

system_prompt = (
    "You are a function calling AI assistant. You are provided with function "
    "signatures within <tools></tools> XML tags. You may call one or more functions "
    "to assist with the user query. Don't make assumptions about what values to "
    "plug into functions.\n\n<tools>\n" + json.dumps(tools) + "\n</tools>"
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What's the weather in Tokyo and London?"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
input_ids = input_ids.to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=False))
```
With llama.cpp
GGUF quantizations are available at nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF.
```bash
# Download the Q4_K_M quantization
huggingface-cli download nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF \
  Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf --local-dir .

# Run server with function calling support
./llama-server \
  --model Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf \
  --jinja \
  --ctx-size 4096 \
  --port 8080
```
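llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so any HTTP client works. A minimal stdlib-only sketch that embeds the tools in the system message per the prompt format above; `build_request` is a hypothetical helper, and the URL assumes the default port from the command above:

```python
import json
import urllib.request

SYSTEM_TEMPLATE = (
    "You are a function calling AI assistant. You are provided with function "
    "signatures within <tools></tools> XML tags. You may call one or more functions "
    "to assist with the user query. Don't make assumptions about what values to "
    "plug into functions.\n\n<tools>\n{tools}\n</tools>"
)


def build_request(tools, user_message,
                  url="http://localhost:8080/v1/chat/completions"):
    """Build an OpenAI-style chat request with tools in the system message."""
    payload = {
        "temperature": 0,  # greedy decoding for deterministic tool calls
        "messages": [
            {"role": "system",
             "content": SYSTEM_TEMPLATE.format(tools=json.dumps(tools))},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


# To actually send the request against a running server:
# response = urllib.request.urlopen(build_request(my_tools, "Weather in Tokyo?"))
```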
With Ollama
```bash
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf
PARAMETER temperature 0
PARAMETER num_ctx 4096
EOF

ollama create nexus-tinyfunction-1.2b-v2.0 -f Modelfile
ollama run nexus-tinyfunction-1.2b-v2.0
```
Training Details
Method
QLoRA Supervised Fine-Tuning (SFT) with 3-stage curriculum learning, followed by Direct Preference Optimization (DPO). Trained using Unsloth + TRL SFTTrainer.
Training Data
~38,500 curated examples from public datasets and synthetic augmentation:
| Dataset | Examples | Purpose |
|---|---|---|
| Public function-calling datasets | ~16,500 | General function calling and irrelevance detection |
| Synthetic (BFCL-derived + augmented) | ~22,000 | Edge cases, curriculum labels |
| Total | ~38,500 | |
Hyperparameters
| Parameter | Value |
|---|---|
| LoRA rank (r) | 128 |
| LoRA alpha | 128 |
| Target modules | q_proj, k_proj, v_proj, out_proj, in_proj, w1, w2, w3 |
| Effective batch size | 32 (micro-batch 2 × gradient accumulation 16) |
| Learning rate (SFT) | 2e-5 (cosine with 10% warmup) |
| Curriculum stages | 3 (foundation / disambiguation / adversarial) |
| DPO beta | 0.1 |
| DPO learning rate | 1e-6 |
| Precision | bf16 |
| Packing | Enabled |
| Hardware | Single NVIDIA RTX 5090 (32 GB) — all training, quantization, and benchmarks |
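The table above maps onto common peft/TRL-style settings roughly as follows. This is an illustrative configuration sketch using plain dicts, not the authors' training script; the key names mirror typical `LoraConfig`/`SFTConfig` fields:

```python
# Illustrative sketch of the hyperparameter table; not the original script.
lora_config = {
    "r": 128,
    "lora_alpha": 128,
    "target_modules": ["q_proj", "k_proj", "v_proj", "out_proj",
                       "in_proj", "w1", "w2", "w3"],
}

sft_config = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 16,   # 2 x 16 = effective batch 32
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.10,
    "bf16": True,
    "packing": True,
}

dpo_config = {
    "beta": 0.1,
    "learning_rate": 1e-6,
}

effective_batch = (sft_config["per_device_train_batch_size"]
                   * sft_config["gradient_accumulation_steps"])
print(effective_batch)  # → 32
```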
iMatrix-Enhanced GGUF Quantizations
These quantizations use importance matrix (iMatrix) data computed from domain-specific calibration data to improve quality at lower bit widths. The iMatrix tells the quantizer which weights are most important for the model's actual use case, resulting in better quality at the same file size compared to standard quantization.
For standard (non-iMatrix) quantizations including Q8_0, see Nexus-TinyFunction-1.2B-v2.0-GGUF. For full-precision weights, see Nexus-TinyFunction-1.2B-v2.0.
| File | Quant | Size | Description |
|---|---|---|---|
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q2_K.gguf | Q2_K | ~461 MB | Smallest; largest quality loss |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q3_K_L.gguf | Q3_K_L | ~606 MB | Largest 3-bit variant |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q3_K_M.gguf | Q3_K_M | ~573 MB | Balanced 3-bit variant |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q3_K_S.gguf | Q3_K_S | ~532 MB | Smallest 3-bit variant |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q4_K_M.gguf | Q4_K_M | ~697 MB | Recommended balance of size and quality |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q4_K_S.gguf | Q4_K_S | ~668 MB | Slightly smaller 4-bit variant |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q5_K_M.gguf | Q5_K_M | ~804 MB | Higher quality, larger file |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q5_K_S.gguf | Q5_K_S | ~787 MB | Slightly smaller 5-bit variant |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q6_K.gguf | Q6_K | ~918 MB | Near-original quality |
Limitations
- Parallel function calling is the weakest dimension — the model sometimes drops or merges parallel calls
- Argument extraction for complex nested objects and optional parameters can be imprecise
- English only — trained exclusively on English data
- Context length — quality may degrade with very long tool lists near the 32K limit
- Not suitable for safety-critical, medical, legal, or financial applications
- Fine-tuning contributions licensed under Apache 2.0; base model weights remain subject to the LFM Open License v1.0 ($10M annual revenue commercial use threshold)
Acknowledgements
Citation
```bibtex
@misc{Nexus_TinyFunction_1_2B_v2_0,
  title  = {Nexus-TinyFunction-1.2B-v2.0: Function Calling Fine-Tune of LFM2.5-1.2B},
  author = {Nexus-Syntegra},
  year   = {2026},
  url    = {https://huggingface.co/nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0},
  note   = {Fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct for function calling}
}
```
Evaluation results (BFCL V4, self-reported)

| Category | Score (%) |
|---|---|
| Simple Function Calling | 94.25 |
| Multiple Function Calling | 91.50 |
| Parallel Function Calling | 81.50 |
| Parallel Multiple | 78.50 |
| Irrelevance Detection | 80.42 |