A Guide to Agent Tool Calling Evaluation

Community Article Published January 22, 2026

As large language models evolve from simple "conversationalists" into "autonomous agents," function calling (tool use) becomes the core capability.

In practice, however, developers often hit the same pain point: a model claims to "support tool calling" but falls apart in real business scenarios. It either calls APIs at random during casual chat, or generates JSON parameters that are missing critical fields, so backend services fail frequently.

So how do you issue a qualified "tool driving license" to your model? What are the testing standards? How should the evaluation process be designed?

This article shows how to use the EvalScope evaluation framework together with the K2 verification standard open-sourced by MoonshotAI (Dark Side of the Moon), following a set of out-of-the-box best practices, to quantitatively assess whether a model "knows how to use tools and can use them well."

1. Why is Tool Calling Evaluation So Important?

In traditional evaluations, we often focus on model performance in knowledge Q&A, logical reasoning, or text generation quality. But in the Agent era, evaluation standards have fundamentally changed: from focusing on "making sense" to focusing on "flawless execution." Models must not only understand human language but also accurately speak "machine language."

According to the findings of MoonshotAI's open-source K2-Vendor-Verifier project, many models show two fatal flaws when handling tool calls:

Blurred Decision Boundaries (Call vs. Not Call): Models can't distinguish between "user is chatting" and "user is giving commands." For example, when a user says "nice weather today," the model shouldn't attempt to call a weather API, but many models will incorrectly trigger it.
Parameter Hallucination (Schema Violation): Parameters generated by models often miss required fields or have type errors (e.g., passing a string to a numeric field), causing API calls to fail directly.
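
To make the second flaw concrete, here is a minimal sketch of a parameter check. It is not the K2 or EvalScope implementation: the function name `check_arguments` is hypothetical, and it covers only a small subset of JSON Schema (required fields, basic types, and `additionalProperties: false`).

```python
from typing import Any, Dict, List

# Map JSON Schema type names to Python types (a deliberately small subset).
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_arguments(arguments: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of violations; an empty list means the call passes.

    Hypothetical helper for illustration, not part of any library.
    """
    errors: List[str] = []
    properties = schema.get("properties", {})

    # 1. Every required field must be present.
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing required field: {name}")

    for name, value in arguments.items():
        if name not in properties:
            # 2. Unknown fields are rejected when additionalProperties is false.
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected field: {name}")
            continue
        expected = properties[name].get("type")
        py_type = _JSON_TYPES.get(expected)
        # bool is a subclass of int in Python, so reject it for numeric fields.
        bool_as_number = isinstance(value, bool) and expected in ("number", "integer")
        # 3. The value must match the declared type.
        if py_type is not None and (not isinstance(value, py_type) or bool_as_number):
            errors.append(f"wrong type for {name}: expected {expected}")
    return errors


schema = {
    "type": "object",
    "properties": {"celsius": {"type": "number"}},
    "required": ["celsius"],
    "additionalProperties": False,
}
print(check_arguments({"celsius": 37}, schema))    # []
print(check_arguments({"celsius": "37"}, schema))  # ['wrong type for celsius: expected number']
print(check_arguments({}, schema))                 # ['missing required field: celsius']
```

A production check would normally use a full JSON Schema validator. The hand-rolled version only shows what kind of error counts as a schema violation.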

Until these two problems are solved, an Agent built on the model is not ready to go live. We therefore need a deterministic, regression-testable evaluation process that tests not only "can it call successfully" but also "should it call" and "are the parameters correct."

2. Our Solution

To address these pain points, the EvalScope framework integrates a standardized GeneralFunctionCall evaluation flow. It follows MoonshotAI K2's verification logic and packages it as an easy-to-use evaluation tool.

The core value of this solution lies in:

Dual Verification Mechanism: Verifies both decision accuracy (whether the model correctly judged if a tool should be called) and parameter validity (whether the generated JSON perfectly conforms to the Schema definition).
High-Quality Benchmark Data: Supports using synthetic data generated by the K2-Thinking model, covering numerous edge cases, which is better at exposing model weaknesses than ordinary conversational data.
Automated Metric Calculation: Directly outputs F1 Score and Schema pass rate without manual intervention.
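
As a rough illustration of the decision metric, the sketch below computes an F1 score over the call / no-call decision. It assumes the standard F1 definition with "call a tool" as the positive class. The exact formula used inside EvalScope is not shown in this article, and the function name is hypothetical.

```python
from typing import Sequence


def tool_call_f1(should_call: Sequence[bool], did_call: Sequence[bool]) -> float:
    """F1 over the call/no-call decision, treating "call" as the positive class.

    Assumption: standard precision/recall F1; not taken from EvalScope's source.
    """
    if len(should_call) != len(did_call):
        raise ValueError("label and prediction lists must have the same length")

    pairs = list(zip(should_call, did_call))
    tp = sum(1 for label, pred in pairs if label and pred)      # correct calls
    fp = sum(1 for label, pred in pairs if not label and pred)  # false triggers
    fn = sum(1 for label, pred in pairs if label and not pred)  # missed calls

    if tp == 0:
        return 0.0  # no correct call at all: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels = [True, True, False, False]       # should_call_tool from the dataset
predictions = [True, False, True, False]  # whether the model actually called a tool
print(tool_call_f1(labels, predictions))  # 0.5
```

In this toy run the model makes one correct call, one false trigger, and one missed call, so precision and recall are both 0.5.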

3. Quick Start Guide

This section shows how to evaluate your model using either the preset dataset or your own business data.

1. Prepare Data

To make the evaluation tool "targeted," we need to prepare data in standard JSONL format.

Each line represents a test sample, containing three core fields:

messages: Conversation context.
tools: Tool definitions available to the model (toolbox).
should_call_tool: The key label. It tells the evaluator whether the expected behavior in this context is to call a tool (true) or to reply directly (false).

Standard Data Sample Example:

{
  "messages": [
    { "role": "system", "content": "You are an assistant" },
    { "role": "user", "content": "Convert 37 degrees Celsius to Fahrenheit" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "convert_temperature",
        "description": "Temperature conversion",
        "parameters": {
          "type": "object",
          "properties": {
             "celsius": { "type": "number", "description": "Celsius temperature" }
          },
          "required": ["celsius"],
          "additionalProperties": false
        }
      }
    }
  ],
  "should_call_tool": true
}

💡 Tip:

To test the model's "interference resistance," be sure to include negative samples with should_call_tool: false (e.g., user is just greeting).

If you don't have your own data, you can directly use EvalScope's preset evalscope/GeneralFunctionCall-Test dataset.
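
Before handing custom data to the evaluator, it can help to sanity-check it. The sketch below is a hypothetical pre-flight helper, not part of EvalScope. It verifies that every JSONL line carries the three fields described above and counts positive and negative samples, so a test set with no negatives is easy to spot.

```python
import json
from typing import Dict, Iterable

REQUIRED_KEYS = ("messages", "tools", "should_call_tool")


def validate_samples(lines: Iterable[str]) -> Dict[str, int]:
    """Check each JSONL line and count positive/negative samples.

    Hypothetical helper; raises ValueError on the first malformed sample.
    """
    counts = {"positive": 0, "negative": 0}
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        sample = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in sample]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        if not isinstance(sample["should_call_tool"], bool):
            raise ValueError(f"line {line_no}: should_call_tool must be true or false")
        counts["positive" if sample["should_call_tool"] else "negative"] += 1
    return counts
```

Since an open file is an iterable of lines, it can be passed in directly, for example `validate_samples(open(path, encoding="utf-8"))`, where `path` points to your own file.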

2. Run Evaluation

The setup code is deliberately short. We recommend evaluating models through an OpenAI-compatible API endpoint, since that is closest to a real production environment.

Scenario A: Using Official Preset Dataset (Recommended for First Experience)

The data comes from MoonshotAI's open-source samples, with the core "should trigger tool" labels generated by the K2-Thinking model. This keeps the evaluation standard consistent and makes experiment reproduction and automated regression testing easier.

from evalscope import TaskConfig, run_task
import os

# Configure evaluation task
task_cfg = TaskConfig(
    model='qwen-plus',  # Model name under test
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1', # OpenAI-compatible API endpoint
    api_key=os.getenv('DASHSCOPE_API_KEY'), # Get Key from environment variable
    datasets=['general_fc'],  # Core parameter: specify 'tool calling (general_fc)' evaluation mode
)

# Start task, automatically print results
run_task(task_cfg=task_cfg)
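
Note that `os.getenv` returns `None` when the variable is not set, so the configuration above would then be built without a key. A small guard makes that failure explicit before any request is sent. This is a minimal sketch with a hypothetical helper name. How EvalScope itself reacts to a missing key is not covered here.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this helper, the key would be passed as `api_key=require_env('DASHSCOPE_API_KEY')`.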

Scenario B: Using Custom Business Data

If your business scenario is specialized (e.g., finance, healthcare), it's recommended to build your own test set.

First, create a file named example.jsonl with the following content (including positive and negative samples):

{"messages":[{"role":"system","content":"You are an assistant"},{"role":"user","content":"Please add 2 and 3"}],"tools":[{"type":"function","function":{"name":"add","description":"Add two numbers","parameters":{"type":"object","properties":{"a":{"type":"number","description":"First number"},"b":{"type":"number","description":"Second number"}},"required":["a","b"],"additionalProperties":false}}}],"should_call_tool":true}
{"messages":[{"role":"system","content":"You are an assistant"},{"role":"user","content":"Nice weather today, let's chat"}],"tools":[{"type":"function","function":{"name":"add","description":"Add two numbers","parameters":{"type":"object","properties":{"a":{"type":"number","description":"First number"},"b":{"type":"number","description":"Second number"}},"required":["a","b"],"additionalProperties":false}}}],"should_call_tool":false}
{"messages":[{"role":"system","content":"You are an assistant"},{"role":"user","content":"Convert 37 degrees Celsius to Fahrenheit"}],"tools":[{"type":"function","function":{"name":"convert_temperature","description":"Convert Celsius to Fahrenheit","parameters":{"type":"object","properties":{"celsius":{"type":"number","description":"Celsius temperature value"}},"required":["celsius"],"additionalProperties":false}}}],"should_call_tool":true}

Save the above file to custom_eval/text/fc/example.jsonl, then run the following code:

from evalscope import TaskConfig, run_task
import os

task_cfg = TaskConfig(
    model='qwen-plus',
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    datasets=['general_fc'], 
    # Inject local data path through dataset_args
    dataset_args={
        'general_fc': {
            "local_path": "custom_eval/text/fc",  # Folder path where your JSONL file is located
            "subset_list": ["example"]            # Corresponding filename (without extension)
        }
    },
)

run_task(task_cfg=task_cfg)
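
Hand-editing long single-line JSON is error-prone. As an alternative, a file like example.jsonl can be generated from Python dictionaries. The sketch below uses only the standard library. The helper names `make_sample` and `write_jsonl` are hypothetical, and the field layout simply mirrors the samples shown above.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable, Union


def make_sample(user_text: str, tool: Dict[str, Any], should_call_tool: bool) -> Dict[str, Any]:
    """Build one test sample with the three fields the evaluator expects."""
    return {
        "messages": [
            {"role": "system", "content": "You are an assistant"},
            {"role": "user", "content": user_text},
        ],
        "tools": [tool],
        "should_call_tool": should_call_tool,
    }


def write_jsonl(samples: Iterable[Dict[str, Any]], path: Union[str, Path]) -> int:
    """Write one JSON object per line and return the number of samples written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
            count += 1
    return count


ADD_TOOL = {
    "type": "function",
    "function": {
        "name": "add",
        "description": "Add two numbers",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "number", "description": "First number"},
                "b": {"type": "number", "description": "Second number"},
            },
            "required": ["a", "b"],
            "additionalProperties": False,
        },
    },
}

SAMPLES = [
    make_sample("Please add 2 and 3", ADD_TOOL, True),               # positive sample
    make_sample("Nice weather today, let's chat", ADD_TOOL, False),  # negative sample
]
```

Calling `write_jsonl(SAMPLES, "custom_eval/text/fc/example.jsonl")` would then produce a file in the location that `local_path` and `subset_list` point to.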

3. Interpret the Report

After evaluation completes, the console will output a metric report. You need to focus on the following four core metrics:

Metric Name	Meaning	Why is it Important?
`tool_call_f1`	Decision F1 Score	Measures whether the model is "clear-headed." Low scores indicate the model often calls tools when unnecessary, or doesn't respond when it should. This is the first threshold of Agent intelligence.
`schema_accuracy`	Parameter Format Accuracy	Measures whether the JSON parameters generated by the model conform to the Schema definition. If this is low, it means the model often produces "hallucinatory parameters," which will cause code crashes.
`count_finish_reason_tool_call`	Trigger Call Count	Total number of times the model predicted a tool call was needed.
`count_successful_tool_call`	Perfect Call Count	Number of samples where the decision was correct and the parameters were completely correct.

Result Analysis Example:

+-----------+------------+-------------------------------+----------+-------+---------+
| Model     | Dataset    | Metric                        | Subset   |   Num |   Score |
+===========+============+===============================+==========+=======+=========+
| modeltest | general_fc | count_finish_reason_tool_call | default  |    10 |  3      |
+-----------+------------+-------------------------------+----------+-------+---------+
| modeltest | general_fc | count_successful_tool_call    | default  |    10 |  2      |
+-----------+------------+-------------------------------+----------+-------+---------+
| modeltest | general_fc | schema_accuracy               | default  |    10 |  0.6667 |
+-----------+------------+-------------------------------+----------+-------+---------+
| modeltest | general_fc | tool_call_f1                  | default  |    10 |  0.5    |
+-----------+------------+-------------------------------+----------+-------+---------+

Brief Commentary: In this example, although the model triggered 3 tool calls (count_finish_reason_tool_call), only 2 were completely successful (count_successful_tool_call).

Problem Analysis: schema_accuracy of 0.6667 indicates that one call was triggered but had incorrect parameter format (possibly missing fields or wrong types); tool_call_f1 of only 0.5 indicates serious misjudgment in the "should it call" decision (missed calls or false triggers).
Improvement Suggestions: This model may need fine-tuning on more samples with complex Schemas, or a revised system prompt that clarifies when tools should be called.
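
The arithmetic behind the schema figure can be checked by hand. The sketch below assumes that schema_accuracy is the ratio of successful calls to triggered calls. That reading matches the numbers in the sample report, but it is an assumption rather than a definition quoted from EvalScope, and the function name is hypothetical.

```python
def schema_pass_rate(successful_calls: int, triggered_calls: int) -> float:
    """Share of triggered tool calls whose parameters passed validation.

    Assumed relationship between the two count metrics; returns 0.0 when
    the model never triggered a tool call.
    """
    if triggered_calls == 0:
        return 0.0
    return successful_calls / triggered_calls


triggered = 3   # count_finish_reason_tool_call in the sample report
successful = 2  # count_successful_tool_call in the sample report
print(round(schema_pass_rate(successful, triggered), 4))  # 0.6667
```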

4. Summary

Tool calling capability is the bridge connecting LLMs to the real world. By introducing EvalScope combined with MoonshotAI K2's verification standard, we no longer rely on subjective feelings to evaluate models, but let the data speak.

This set of best practices not only helps you select the most suitable model for your business but can also serve as a "ruler" for later fine-tuning or Prompt optimization, so you can verify that your Agent's capabilities are actually improving.

Now, use this toolkit to give your model a comprehensive checkup!
