Nexus-TinyFunction-1.2B-v2.0
A fast, tiny function-calling model fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct. Built on LFM2.5's hybrid recurrent-attention architecture for significantly faster inference than transformer-only models of similar or even smaller size — fast enough to run on the CPU of a mobile phone.
No thinking trace or chain-of-thought required. As an instruct-tuned model, it produces accurate tool calls directly without verbose reasoning overhead, keeping latency low and token usage minimal.
Highlights
- Blazing fast inference — hybrid recurrent-attention architecture runs faster than similarly-sized (and even smaller) pure transformer models on both GPU and CPU
- No thinking trace needed — direct tool calls without chain-of-thought overhead, unlike reasoning-based models
- Runs anywhere — Q4_K_M quantization fits in ~700 MB, fast enough for Android phones, Raspberry Pi, edge servers
- Strong irrelevance detection (80.42%) — reliably refuses to call tools when no tool matches the query, avoiding hallucinated function calls
- 94.25% simple function calling — accurate single-tool selection and argument extraction
- JSON Syntax Reliability: 99.3% — near-perfect structured output
- Parallel & Multiple tool calling — handles complex multi-tool scenarios
Benchmark Results
BFCL V4 Benchmark: All Models (Q8_0 GGUF)
The following charts compare models we tested locally in Q8_0 GGUF quantization on the same hardware under identical conditions.
Average BFCL Score Ranking
Inference Speed Comparison
Head-to-Head: vs LFM2.5 Nova (Same Base Model)
Direct comparison with NovachronoAI/LFM2.5-1.2B-Nova-Function-Calling, the other LFM2.5-based function-calling fine-tune. Both models share the same base (LiquidAI/LFM2.5-1.2B-Instruct, BFCL V4 non-live avg: 24.8%).
All scores from BFCL V4, Q8_0 GGUF quantization via llama-server on a single NVIDIA RTX 5090.
JSON Syntax Reliability
| Model | JSON Validity | Invalid | Tool Calls* |
|---|---|---|---|
| Nexus-TinyFunction-1.2B-v2.0 | 99.3% | 18 | 2458 |
| xLAM-2 3B | 99.8% | 4 | 2485 |
| xLAM-2 1B | 99.6% | 10 | 2480 |
| Qwen3.5 4B | 99.1% | 21 | 2423 |
| Qwen3.5 2B | 99.0% | 24 | 2296 |
| Qwen3.5 0.8B | 98.9% | 25 | 2220 |
| LFM2.5 Nova 1.2B | 98.3% | 23 | 1334 |
| LFM2.5 Base 1.2B | 96.8% | 13 | 407 |
*Tool Calls = samples where the model attempted a tool call (out of 2,501 total per model). Models with fewer tool calls responded with plain text more often — the base model only attempted 407/2,501 calls.
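As a concrete illustration of how the JSON Validity column is computed (valid calls over attempted calls), a minimal sketch; the sample strings below are made up for illustration, not drawn from the benchmark:

```python
import json


def json_validity(attempted_calls):
    """Fraction of attempted tool-call payloads that parse as valid JSON."""
    valid = 0
    for raw in attempted_calls:
        try:
            json.loads(raw)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(attempted_calls)


samples = [
    '{"name": "get_weather", "arguments": {"city": "Tokyo"}}',  # valid JSON
    '{"name": "get_weather", "arguments": {city: Tokyo}}',      # invalid: unquoted keys
]
print(json_validity(samples))  # → 0.5
```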
BFCL V4 Official Leaderboard Comparison
How does a 1.2B model compare to frontier API models? We evaluated on 5 of 8 BFCL V4 non-live categories (Python Simple, Multiple, Parallel, Parallel Multiple, Irrelevance Detection). Java Simple, JavaScript Simple, and the combined Simple AST average are excluded — we did not train on or evaluate these categories.
Transparency: Official leaderboard scores use API inference at full precision. Our scores are from Q8_0 GGUF quantization via llama-server. The #rank shown is each model's official rank across all 8 non-live categories.
Why LFM2.5?
We chose LiquidAI/LFM2.5-1.2B-Instruct as the base model for several reasons:
- Faster than transformers at any size — LFM2.5's hybrid recurrent-attention architecture achieves faster inference than pure transformer models of similar size, and even outpaces many smaller transformer models. The sub-quadratic scaling on sequence length is especially beneficial for function-calling workloads where tool definitions consume significant context.
- Built for edge and mobile — At 1.2B parameters, the model runs on consumer hardware, Android phones, Raspberry Pi, and edge servers. The Q4_K_M quantization fits in under 700 MB of RAM.
- Instruct-tuned, not reasoning-dependent — The base model is already instruct-tuned, so function calls are produced directly without chain-of-thought or thinking traces. This keeps latency low and avoids wasting tokens on reasoning overhead.
- Massive improvement over base — The base LFM2.5-1.2B-Instruct averages just 24.8% on BFCL V4 non-live categories. Our fine-tune brings that to 85.2%, a gain of roughly 60 percentage points.
Model Details
| Developed by | Nexus-Syntegra |
| Base model | LiquidAI/LFM2.5-1.2B-Instruct |
| Architecture | Hybrid recurrent-attention (Lfm2ForCausalLM), 1.2B parameters |
| Context length | 32,768 tokens |
| License | Apache 2.0 (fine-tune); base model weights subject to LFM Open License v1.0 |
| Language | English |
| Fine-tune method | QLoRA SFT + 3-stage curriculum learning + DPO |
| Hardware | Single NVIDIA RTX 5090 (32 GB) — training, quantization, and all benchmarks |
| Format | ChatML with <tools> / <tool_call> XML tags |
Prompt Format
This model uses ChatML format with XML-tagged tool definitions and tool calls.
Important: Do not use `apply_chat_template(tools=...)` — the base model's chat template formats tools differently than our fine-tune expects. Instead, include the tools directly in the system message as shown below.
System Prompt with Tools
```
<|im_start|>system
You are a function calling AI assistant. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
<tools>
[{"name": "get_weather", "description": "Get current weather for a location", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "City name"}}, "required": ["city"]}}]
</tools><|im_end|>
```
Single Tool Call
```
<|im_start|>user
What's the weather in Tokyo?<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call><|im_end|>
```
Parallel Tool Calls
When multiple tools should be called, the model outputs them as a JSON array:
```
<|im_start|>user
What's the weather in Tokyo and London?<|im_end|>
<|im_start|>assistant
<tool_call>
[{"name": "get_weather", "arguments": {"city": "Tokyo"}}, {"name": "get_weather", "arguments": {"city": "London"}}]
</tool_call><|im_end|>
```
Irrelevance (No Tool Match)
When no tool matches the user's query, the model responds in plain text without any tool call tags:
```
<|im_start|>user
Tell me a joke<|im_end|>
<|im_start|>assistant
Sure! Why do programmers prefer dark mode? Because light attracts bugs!<|im_end|>
```
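The three output shapes above (a single JSON object, a JSON array for parallel calls, or plain text when no tool matches) can all be handled by one small parser. This is an illustrative helper, not part of the model's tooling:

```python
import json
import re


def parse_model_output(text):
    """Classify a raw completion as tool call(s) or plain text.

    Assumes the output follows the prompt format above: tool calls are
    wrapped in <tool_call>...</tool_call>, plain-text answers are not.
    """
    match = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL)
    if match is None:
        # Irrelevance case: no tool matched, model answered in plain text
        return {"type": "text", "content": text.replace("<|im_end|>", "").strip()}
    calls = json.loads(match.group(1))
    if isinstance(calls, dict):
        # Single call comes as one JSON object; normalize to a list
        calls = [calls]
    return {"type": "tool_calls", "calls": calls}
```

Parallel calls come back as a list of call dicts, so downstream dispatch code only ever has to handle one shape.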
How to Use
With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import json

model_id = "nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto", trust_remote_code=True
)

tools = [{"name": "get_weather", "description": "Get weather for a city",
          "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                         "required": ["city"]}}]

system_prompt = (
    "You are a function calling AI assistant. You are provided with function "
    "signatures within <tools></tools> XML tags. You may call one or more functions "
    "to assist with the user query. Don't make assumptions about what values to "
    "plug into functions.\n\n<tools>\n" + json.dumps(tools) + "\n</tools>"
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What's the weather in Tokyo and London?"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
input_ids = input_ids.to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=False))
```
With llama.cpp
GGUF quantizations are available at nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF.
```bash
# Download the Q4_K_M quantization
huggingface-cli download nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0-GGUF \
  Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf --local-dir .

# Run server with function calling support
./llama-server \
  --model Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf \
  --jinja \
  --ctx-size 4096 \
  --port 8080
```
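llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so any HTTP client works. A minimal stdlib-only sketch that embeds the tools in the system message per the prompt format above; `build_request` is a hypothetical helper, and the URL assumes the default port from the command above:

```python
import json
import urllib.request

SYSTEM_TEMPLATE = (
    "You are a function calling AI assistant. You are provided with function "
    "signatures within <tools></tools> XML tags. You may call one or more functions "
    "to assist with the user query. Don't make assumptions about what values to "
    "plug into functions.\n\n<tools>\n{tools}\n</tools>"
)


def build_request(tools, user_message,
                  url="http://localhost:8080/v1/chat/completions"):
    """Build an OpenAI-style chat request with tools in the system message."""
    payload = {
        "temperature": 0,  # greedy decoding for deterministic tool calls
        "messages": [
            {"role": "system",
             "content": SYSTEM_TEMPLATE.format(tools=json.dumps(tools))},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


# To actually send the request against a running server:
# response = urllib.request.urlopen(build_request(my_tools, "Weather in Tokyo?"))
```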
With Ollama
```bash
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./Nexus-TinyFunction-1.2B-v2.0-q4_k_m.gguf
PARAMETER temperature 0
PARAMETER num_ctx 4096
EOF

ollama create nexus-tinyfunction-1.2b-v2.0 -f Modelfile
ollama run nexus-tinyfunction-1.2b-v2.0
```
Training Details
Method
QLoRA Supervised Fine-Tuning (SFT) with 3-stage curriculum learning, followed by Direct Preference Optimization (DPO). Trained using Unsloth + TRL SFTTrainer.
Training Data
~38,500 curated examples from public datasets and synthetic augmentation:
| Dataset | Examples | Purpose |
|---|---|---|
| Public function-calling datasets | ~16,500 | General function calling and irrelevance detection |
| Synthetic (BFCL-derived + augmented) | ~22,000 | Edge cases, curriculum labels |
| Total | ~38,500 | |
Hyperparameters
| Parameter | Value |
|---|---|
| LoRA rank (r) | 128 |
| LoRA alpha | 128 |
| Target modules | q_proj, k_proj, v_proj, out_proj, in_proj, w1, w2, w3 |
| Effective batch size | 32 (micro-batch 2 × gradient accumulation 16) |
| Learning rate (SFT) | 2e-5 (cosine with 10% warmup) |
| Curriculum stages | 3 (foundation / disambiguation / adversarial) |
| DPO beta | 0.1 |
| DPO learning rate | 1e-6 |
| Precision | bf16 |
| Packing | Enabled |
| Hardware | Single NVIDIA RTX 5090 (32 GB) — all training, quantization, and benchmarks |
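The table above maps onto common peft/TRL-style settings roughly as follows. This is an illustrative configuration sketch using plain dicts, not the authors' training script; the key names mirror typical `LoraConfig`/`SFTConfig` fields:

```python
# Illustrative sketch of the hyperparameter table; not the original script.
lora_config = {
    "r": 128,
    "lora_alpha": 128,
    "target_modules": ["q_proj", "k_proj", "v_proj", "out_proj",
                       "in_proj", "w1", "w2", "w3"],
}

sft_config = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 16,   # 2 x 16 = effective batch 32
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.10,
    "bf16": True,
    "packing": True,
}

dpo_config = {
    "beta": 0.1,
    "learning_rate": 1e-6,
}

effective_batch = (sft_config["per_device_train_batch_size"]
                   * sft_config["gradient_accumulation_steps"])
print(effective_batch)  # → 32
```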
iMatrix-Enhanced GGUF Quantizations
These quantizations use importance matrix (iMatrix) data computed from domain-specific calibration data to improve quality at lower bit widths. The iMatrix tells the quantizer which weights are most important for the model's actual use case, resulting in better quality at the same file size compared to standard quantization.
For standard (non-iMatrix) quantizations including Q8_0, see Nexus-TinyFunction-1.2B-v2.0-GGUF. For full-precision weights, see Nexus-TinyFunction-1.2B-v2.0.
| File | Quant | Size | Description |
|---|---|---|---|
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q2_K.gguf | Q2_K | ~461 MB | Smallest; largest quality loss |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q3_K_L.gguf | Q3_K_L | ~606 MB | Largest 3-bit variant |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q3_K_M.gguf | Q3_K_M | ~573 MB | Balanced 3-bit variant |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q3_K_S.gguf | Q3_K_S | ~532 MB | Smallest 3-bit variant |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q4_K_M.gguf | Q4_K_M | ~697 MB | Recommended balance of size and quality |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q4_K_S.gguf | Q4_K_S | ~668 MB | Slightly smaller 4-bit variant |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q5_K_M.gguf | Q5_K_M | ~804 MB | Higher quality, larger file |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q5_K_S.gguf | Q5_K_S | ~787 MB | Slightly smaller 5-bit variant |
| Nexus-TinyFunction-1.2B-v2.0-IMatrix-Q6_K.gguf | Q6_K | ~918 MB | Near-original quality |
Limitations
- Parallel function calling is the weakest dimension — the model sometimes drops or merges parallel calls
- Argument extraction for complex nested objects and optional parameters can be imprecise
- English only — trained exclusively on English data
- Context length — quality may degrade with very long tool lists near the 32K limit
- Not suitable for safety-critical, medical, legal, or financial applications
- Fine-tuning contributions licensed under Apache 2.0; base model weights remain subject to the LFM Open License v1.0 ($10M annual revenue commercial use threshold)
Acknowledgements
Citation
```bibtex
@misc{Nexus_TinyFunction_1_2B_v2_0,
  title  = {Nexus-TinyFunction-1.2B-v2.0: Function Calling Fine-Tune of LFM2.5-1.2B},
  author = {Nexus-Syntegra},
  year   = {2026},
  url    = {https://huggingface.co/nexus-syntegra/Nexus-TinyFunction-1.2B-v2.0},
  note   = {Fine-tuned from LiquidAI/LFM2.5-1.2B-Instruct for function calling}
}
```
Evaluation results (BFCL V4, self-reported)

| Category | Score (%) |
|---|---|
| Simple Function Calling | 94.25 |
| Multiple Function Calling | 91.50 |
| Parallel Function Calling | 81.50 |
| Parallel Multiple | 78.50 |
| Irrelevance Detection | 80.42 |