
Nexus-TinyFunction-1.2B-v2.0

A fast, tiny function-calling model fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct. Built on LFM2.5's hybrid recurrent-attention architecture for significantly faster inference than transformer-only models of similar or even smaller size — fast enough to run on the CPU of a mobile phone.

No thinking trace or chain-of-thought required. As an instruct-tuned model, it produces accurate tool calls directly without verbose reasoning overhead, keeping latency low and token usage minimal.

Highlights

  • Blazing fast inference — hybrid recurrent-attention architecture runs faster than similarly-sized (and even smaller) pure transformer models on both GPU and CPU
  • No thinking trace needed — direct tool calls without chain-of-thought overhead, unlike reasoning-based models
  • Runs anywhere — Q4_K_M quantization fits in ~700 MB, fast enough for Android phones, Raspberry Pi, edge servers
  • Strong irrelevance detection (80.42%) — reliably refuses to call tools when no tool matches the query, avoiding hallucinated function calls
  • 94.25% simple function calling — accurate single-tool selection and argument extraction
  • JSON Syntax Reliability: 99.3% — near-perfect structured output
  • Parallel & Multiple tool calling — handles complex multi-tool scenarios

Benchmark Results

BFCL V4 Benchmark: All Models (Q8_0 GGUF)

The following charts compare models we tested locally in Q8_0 GGUF quantization on the same hardware under identical conditions.

BFCL Benchmark Comparison

Average BFCL Score Ranking

BFCL Ranking

Inference Speed Comparison

Speed Comparison

Head-to-Head: vs LFM2.5 Nova (Same Base Model)

Direct comparison with NovachronoAI/LFM2.5-1.2B-Nova-Function-Calling, the other LFM2.5-based function-calling fine-tune. Both models share the same base (LiquidAI/LFM2.5-1.2B-Instruct, BFCL V4 non-live avg: 24.8%).

Head-to-Head vs Nova

All scores from BFCL V4, Q8_0 GGUF quantization via llama-server on a single NVIDIA RTX 5090.

JSON Syntax Reliability


Model                          JSON Validity   Invalid   Tool Calls*
Nexus-TinyFunction-1.2B-v2.0   99.3%           18        2458
xLAM-2 3B                      99.8%           4         2485
xLAM-2 1B                      99.6%           10        2480
Qwen3.5 4B                     99.1%           21        2423
Qwen3.5 2B                     99.0%           24        2296
Qwen3.5 0.8B                   98.9%           25        2220
LFM2.5 Nova 1.2B               98.3%           23        1334
LFM2.5 Base 1.2B               96.8%           13        407

*Tool Calls = samples where the model attempted a tool call (out of 2,501 total per model). Models with fewer tool calls responded with plain text more often — the base model only attempted 407/2,501 calls.

BFCL V4 Official Leaderboard Comparison

How does a 1.2B model compare to frontier API models? We evaluated on 5 of 8 BFCL V4 non-live categories (Python Simple, Multiple, Parallel, Parallel Multiple, Irrelevance Detection). Java Simple, JavaScript Simple, and the combined Simple AST average are excluded — we did not train on or evaluate these categories.

Transparency: Official leaderboard scores use API inference at full precision. Our scores are from Q8_0 GGUF quantization via llama-server. The # rank shown is each model's official rank across all 8 non-live categories.

BFCL V4 Leaderboard Ranking

BFCL V4 Per-Category Comparison

Why LFM2.5?

We chose LiquidAI/LFM2.5-1.2B-Instruct as the base model for several reasons:

  • Faster than similarly-sized transformers — LFM2.5's hybrid recurrent-attention architecture achieves faster inference than pure transformer models of similar size, and even outpaces many smaller transformer models. The sub-quadratic scaling on sequence length is especially beneficial for function-calling workloads, where tool definitions consume significant context.
  • Built for edge and mobile — At 1.2B parameters, the model runs on consumer hardware, Android phones, Raspberry Pi, and edge servers. The Q4_K_M quantization fits in under 700 MB of RAM.
  • Instruct-tuned, not reasoning-dependent — The base model is already instruct-tuned, so function calls are produced directly without chain-of-thought or thinking traces. This keeps latency low and avoids wasting tokens on reasoning overhead.
  • Massive improvement over base — The base LFM2.5-1.2B-Instruct averages just 24.8% on BFCL V4 non-live categories. Our fine-tune brings that to 85.2% — a 60pp gain.

Model Details

Developed by       Nexus-Syntegra
Base model         LiquidAI/LFM2.5-1.2B-Instruct
Architecture       Hybrid recurrent-attention (Lfm2ForCausalLM), 1.2B parameters
Context length     32,768 tokens
License            Apache 2.0 (fine-tune); base model weights subject to LFM Open License v1.0
Language           English
Fine-tune method   QLoRA SFT + 3-stage curriculum learning + DPO
Hardware           Single NVIDIA RTX 5090 (32 GB) — training, quantization, and all benchmarks
Format             ChatML with <tools> / <tool_call> XML tags

Prompt Format

This model uses ChatML format with XML-tagged tool definitions and tool calls.

Important: Do not use apply_chat_template(tools=...) — the base model's chat template formats tools differently than our fine-tune expects. Instead, include the tools directly in the system message as shown below.

System Prompt with Tools

<|im_start|>system
You are a function calling AI assistant. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.

<tools>
[{"name": "get_weather", "description": "Get current weather for a location", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "City name"}}, "required": ["city"]}}]
</tools><|im_end|>

Single Tool Call

<|im_start|>user
What's the weather in Tokyo?<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call><|im_end|>

Parallel Tool Calls

When multiple tools should be called, the model outputs them as a JSON array:

<|im_start|>user
What's the weather in Tokyo and London?<|im_end|>
<|im_start|>assistant
<tool_call>
[{"name": "get_weather", "arguments": {"city": "Tokyo"}}, {"name": "get_weather", "arguments": {"city": "London"}}]
</tool_call><|im_end|>

Irrelevance (No Tool Match)

When no tool matches the user's query, the model responds in plain text without any tool call tags:

<|im_start|>user
Tell me a joke<|im_end|>
<|im_start|>assistant
Sure! Why do programmers prefer dark mode? Because light attracts bugs!<|im_end|>
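
The reply therefore takes one of three shapes: a single JSON object inside <tool_call> tags, a JSON array inside <tool_call> tags, or plain text with no tags at all. A minimal Python sketch for turning a reply into structured calls (the helper name and regex are illustrative, not part of the model or any released package):

import json
import re

def parse_tool_calls(reply: str):
    """Extract tool calls from a model reply.

    Returns a list of {"name": ..., "arguments": ...} dicts,
    or an empty list when the model answered in plain text.
    """
    match = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", reply, re.DOTALL)
    if match is None:
        return []  # irrelevance case: no tool call emitted
    payload = json.loads(match.group(1))
    # Single calls are a JSON object; parallel calls are a JSON array.
    return payload if isinstance(payload, list) else [payload]

# Example with the parallel-call reply shown above
reply = (
    "<tool_call>\n"
    '[{"name": "get_weather", "arguments": {"city": "Tokyo"}}, '
    '{"name": "get_weather", "arguments": {"city": "London"}}]\n'
    "</tool_call><|im_end|>"
)
print(parse_tool_calls(reply))  # -> two get_weather calls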

How to Use

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import json

model_id = "nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto", trust_remote_code=True
)

tools = [{"name": "get_weather", "description": "Get weather for a city",
          "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                         "required": ["city"]}}]

system_prompt = (
    "You are a function calling AI assistant. You are provided with function "
    "signatures within <tools></tools> XML tags. You may call one or more functions "
    "to assist with the user query. Don't make assumptions about what values to "
    "plug into functions.\n\n<tools>\n" + json.dumps(tools) + "\n</tools>"
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What's the weather in Tokyo and London?"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
input_ids = input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=False))

With llama.cpp

GGUF quantizations are available at nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF.

# Download the Q4_K_M quantization
huggingface-cli download nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF \
  Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf --local-dir .

# Run server with function calling support
./llama-server \
  --model Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf \
  --jinja \
  --ctx-size 4096 \
  --port 8080
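
llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, so tool calling works the same way as in the Transformers example: put the <tools> block in the system message. A minimal sketch, assuming the server started above is listening on port 8080 (the tool schema is the illustrative one used throughout this card):

import json
import requests

tools = [{"name": "get_weather", "description": "Get weather for a city",
          "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                         "required": ["city"]}}]

system_prompt = (
    "You are a function calling AI assistant. You are provided with function "
    "signatures within <tools></tools> XML tags. You may call one or more functions "
    "to assist with the user query. Don't make assumptions about what values to "
    "plug into functions.\n\n<tools>\n" + json.dumps(tools) + "\n</tools>"
)

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What's the weather in Tokyo?"},
        ],
        "temperature": 0,
        "max_tokens": 256,
    },
    timeout=60,
)
# The reply should contain a <tool_call> block (or plain text for irrelevant queries)
print(response.json()["choices"][0]["message"]["content"])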

With Ollama

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf
PARAMETER temperature 0
PARAMETER num_ctx 4096
EOF

ollama create nexus-tinyfunction-1.2b-v2.0 -f Modelfile
ollama run nexus-tinyfunction-1.2b-v2.0
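
Ollama also serves a local REST API (default port 11434), and the same system-prompt convention applies. A minimal sketch, assuming the model was created from the Modelfile above (the tool schema is the illustrative one used throughout this card):

import json
import requests

tools = [{"name": "get_weather", "description": "Get weather for a city",
          "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                         "required": ["city"]}}]

system_prompt = (
    "You are a function calling AI assistant. You are provided with function "
    "signatures within <tools></tools> XML tags. You may call one or more functions "
    "to assist with the user query. Don't make assumptions about what values to "
    "plug into functions.\n\n<tools>\n" + json.dumps(tools) + "\n</tools>"
)

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "nexus-tinyfunction-1.2b-v2.0",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What's the weather in Tokyo?"},
        ],
        "stream": False,
    },
    timeout=60,
)
# The reply should contain a <tool_call> block (or plain text for irrelevant queries)
print(response.json()["message"]["content"])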

Training Details

Method

QLoRA Supervised Fine-Tuning (SFT) with 3-stage curriculum learning, followed by Direct Preference Optimization (DPO). Trained using Unsloth + TRL SFTTrainer.

Training Data

~38,500 curated examples from public datasets and synthetic augmentation:

Dataset                                Examples   Purpose
Public function-calling datasets       ~16,500    General function calling and irrelevance detection
Synthetic (BFCL-derived + augmented)   ~22,000    Edge cases, curriculum labels
Total                                  ~38,500

Hyperparameters

Parameter              Value
LoRA rank (r)          128
LoRA alpha             128
Target modules         q_proj, k_proj, v_proj, out_proj, in_proj, w1, w2, w3
Effective batch size   32 (2 x 16 gradient accumulation)
Learning rate (SFT)    2e-5 (cosine with 10% warmup)
Curriculum stages      3 (foundation / disambiguation / adversarial)
DPO beta               0.1
DPO learning rate      1e-6
Precision              bf16
Packing                Enabled
Hardware               Single NVIDIA RTX 5090 (32 GB) — all training, quantization, and benchmarks
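
For reference, the adapter settings in this table map onto a standard peft LoraConfig roughly as follows (a sketch only; options not listed above, such as dropout and bias, are assumptions rather than the values actually used):

from peft import LoraConfig

# Approximate adapter configuration implied by the hyperparameter table above.
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj",
                    "in_proj", "w1", "w2", "w3"],
    lora_dropout=0.0,   # assumption: not stated in the table
    bias="none",        # assumption: not stated in the table
    task_type="CAUSAL_LM",
)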

iMatrix-Enhanced GGUF Quantizations

These quantizations use importance matrix (iMatrix) data computed from domain-specific calibration data to improve quality at lower bit widths. The iMatrix tells the quantizer which weights are most important for the model's actual use case, resulting in better quality at the same file size compared to standard quantization.
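
For reference, producing an iMatrix quantization with llama.cpp involves two steps: compute the importance matrix against a calibration file, then pass it to the quantizer. A sketch (file names are illustrative; the actual calibration data is not distributed with this repo):

# 1. Compute the importance matrix from a calibration text file
./llama-imatrix -m Nexus-TinyFunction-1.2B-v2.0-f16.gguf \
  -f calibration.txt -o imatrix.dat

# 2. Quantize with the importance matrix
./llama-quantize --imatrix imatrix.dat \
  Nexus-TinyFunction-1.2B-v2.0-f16.gguf \
  Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf Q4_K_M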

For standard (non-iMatrix) quantizations including Q8_0, see Nexus-TinyFunction-1.2B-v2.0-GGUF. For full-precision weights, see Nexus-TinyFunction-1.2B-v2.0.

Limitations

  • Parallel function calling is the weakest dimension — the model sometimes drops or merges parallel calls
  • Argument extraction for complex nested objects and optional parameters can be imprecise
  • English only — trained exclusively on English data
  • Context length — quality may degrade with very long tool lists near the 32K limit
  • Not suitable for safety-critical, medical, legal, or financial applications
  • Fine-tuning contributions licensed under Apache 2.0; base model weights remain subject to the LFM Open License v1.0 ($10M annual revenue commercial use threshold)

Acknowledgements

  • Liquid AI for the LFM2.5-1.2B-Instruct base model
  • Unsloth for efficient LoRA training

Citation

@misc{Nexus_TinyFunction_1_2B_v2_0,
  title = {Nexus-TinyFunction-1.2B-v2.0: Function Calling Fine-Tune of LFM2.5-1.2B},
  author = {Nexus-Syntegra},
  year = {2026},
  url = {https://huggingface.co/nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0},
  note = {Fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct for function calling}
}