# BioTool-finetuned-Qwen3-4B

Qwen3-4B-Instruct-2507 fully fine-tuned on the BioTool training split (5,632 samples). The resulting model emits structured tool calls against 127 biomedical APIs spanning NCBI E-utilities and BLAST, UniProt, and the Ensembl REST services.
## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gxx27/BioTool-finetuned-Qwen3-4B")
mdl = AutoModelForCausalLM.from_pretrained(
    "gxx27/BioTool-finetuned-Qwen3-4B",
    torch_dtype="auto",
    device_map="auto",
)

system = (
    "You are a biomedicine function-calling assistant. Always respond by calling "
    "exactly one function from the provided tools with a single tool call. Do not "
    "answer with natural language."
)
question = "What is the genomic location of the BRCA1 gene in humans?"

prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

# Tokenize once so the prompt length can be reused to strip it from the output.
inputs = tok(prompt, return_tensors="pt").to(mdl.device)
out = mdl.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```
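The model's reply should contain a single tool call. Below is a minimal parsing sketch, assuming the Qwen3 chat template wraps the call in `<tool_call>...</tool_call>` tags; adjust the pattern if your decoded output differs:

```python
import json
import re

def parse_tool_call(text: str) -> dict:
    """Extract the first tool call from decoded model output."""
    # Assumption: Qwen-style <tool_call>...</tool_call> wrapping; fall back to
    # treating the whole string as JSON if no tags are present.
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    payload = match.group(1) if match else text.strip()
    return json.loads(payload)  # {"name": <tool_name>, "arguments": {...}}
```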
To execute the resulting tool call, use the Python wrappers in the BioTool repository:

```python
import sys

# Make the BioTool wrappers importable (point this at your local checkout).
sys.path.insert(0, "/path/to/BioTool")

from ensembl.lookup.api import lookup_by_symbol

print(lookup_by_symbol(species="human", symbol="BRCA1"))
```
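Gluing the two steps together, a hypothetical dispatch for this example, assuming `text` holds the decoded output from the quick start and that tool names map one-to-one to wrapper function names:

```python
# Hypothetical glue code, not part of the BioTool repo.
call = parse_tool_call(text)  # e.g. {"name": "lookup_by_symbol",
                              #       "arguments": {"species": "human", "symbol": "BRCA1"}}
if call["name"] == "lookup_by_symbol":
    print(lookup_by_symbol(**call["arguments"]))
```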
## Training data
- Source: the BioTool training split (`data/BioTool_train.json`).
- Format: ShareGPT-style conversations of the form system → user → function_call, where `function_call.value` is a JSON object `{"name": <tool_name>, "arguments": <dict>}` (see the example below).
- Coverage: 127 tools across 3 databases (NCBI / UniProt / Ensembl).
- Size: 5,632 samples (an additional 1,408 samples form the held-out test set used to evaluate the model).
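For illustration, a single training sample in this format might look like the following; the wrapper keys follow the ShareGPT convention, but the exact role tags and field values here are assumptions rather than an excerpt from the dataset:

```json
{
  "conversations": [
    {
      "from": "system",
      "value": "You are a biomedicine function-calling assistant. ..."
    },
    {
      "from": "human",
      "value": "What is the genomic location of the BRCA1 gene in humans?"
    },
    {
      "from": "function_call",
      "value": "{\"name\": \"lookup_by_symbol\", \"arguments\": {\"species\": \"human\", \"symbol\": \"BRCA1\"}}"
    }
  ]
}
```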
## Training setup
- Base model: `Qwen/Qwen3-4B-Instruct-2507`
- Tuning method: full fine-tuning (no LoRA)
- Framework: LLaMA-Factory
- Template: `qwen3_nothink`
- Cutoff length: 2,048
- Optimizer: AdamW (fused), `lr=2e-5`, cosine schedule, `warmup_ratio=0.1`
- Epochs: 3
- Effective batch size: 16 (per-device 1 × grad-accum 16)
- Precision: bf16
- Hyperparameter file: `qwen3_4b.yaml` in the BioTool repo's `llamafactory_cfgs/` (a sketch of these settings as a config appears below)
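A minimal sketch of what such a config might contain, reconstructed from the hyperparameters above; key names follow LLaMA-Factory's SFT examples, the dataset name is a placeholder, and the actual `qwen3_4b.yaml` in the repo is authoritative:

```yaml
# Reconstructed sketch, not the repo's actual qwen3_4b.yaml.
model_name_or_path: Qwen/Qwen3-4B-Instruct-2507
stage: sft
do_train: true
finetuning_type: full
dataset: biotool_train          # placeholder dataset name
template: qwen3_nothink
cutoff_len: 2048
learning_rate: 2.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
num_train_epochs: 3
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
bf16: true
optim: adamw_torch_fused
```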
## Evaluation
On the BioTool test split (1,408 samples), this model achieves the highest BioTool Score among the open-source models we evaluated, while also being the smallest model in that comparison (4B parameters). Per-database breakdowns and head-to-head numbers against GPT-5.1, GPT-5.1-Codex, Claude Sonnet 4.5, and Gemini 3 Pro are reported in the paper.
## Citation
```bibtex
@misc{gao2026biotoolcomprehensivetoolcallingdataset,
  title={BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models},
  author={Xin Gao and Ruiyi Zhang and Meixi Du and Peijia Qin and Pengtao Xie},
  year={2026},
  eprint={2605.05758},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.05758},
}
```
## License

Released under the Apache 2.0 license, inherited from the base model. The underlying API responses used during training are subject to the licenses of the respective NCBI, UniProt, and Ensembl services.