BioTool-finetuned-Qwen3-4B

Qwen3-4B-Instruct-2507, fully fine-tuned on the BioTool training split (5,632 samples). The resulting model produces structured tool calls against 127 biomedical APIs spanning the NCBI E-utilities and BLAST, UniProt, and the Ensembl REST API.

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("gxx27/BioTool-finetuned-Qwen3-4B")
mdl = AutoModelForCausalLM.from_pretrained(
    "gxx27/BioTool-finetuned-Qwen3-4B",
    torch_dtype="auto",
    device_map="auto",
)

system = (
    "You are a biomedicine function-calling assistant. Always respond by calling "
    "exactly one function from the provided tools with a single tool call. Do not "
    "answer with natural language."
)
question = "What is the genomic location of the BRCA1 gene in humans?"

# Render the conversation with the chat template and append the generation prompt.
prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": system},
        {"role": "user",   "content": question},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(mdl.device)
out = mdl.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, i.e. the tool call.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))

To execute the resulting tool call, use the Python wrappers in the BioTool repository:

import sys

# Make the BioTool checkout importable (adjust the path to your clone).
sys.path.insert(0, "/path/to/BioTool")
from ensembl.lookup.api import lookup_by_symbol

print(lookup_by_symbol(species="human", symbol="BRCA1"))
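To bridge generation and execution, you can parse the completion and dispatch by tool name. A minimal, hypothetical glue sketch; it assumes the decoded string is exactly the JSON object shown above and that each wrapper accepts the arguments dict as keyword arguments:

import json

# Hypothetical dispatch table: map tool names to the wrappers you have imported.
TOOLS = {"lookup_by_symbol": lookup_by_symbol}

def run_tool_call(completion: str):
    # Assumes the completion is the bare JSON object {"name": ..., "arguments": ...}.
    call = json.loads(completion)
    return TOOLS[call["name"]](**call["arguments"])

print(run_tool_call('{"name": "lookup_by_symbol", "arguments": {"species": "human", "symbol": "BRCA1"}}'))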

Training data

  • Source: the BioTool training split (data/BioTool_train.json).
  • Format: ShareGPT-style conversations of the form system → user → function_call, where function_call.value is a JSON object {"name": <tool_name>, "arguments": <dict>} (a minimal sample is sketched after this list).
  • Coverage: 127 tools across 3 databases (NCBI / UniProt / Ensembl).
  • Size: 5,632 samples (an additional 1,408 samples form the held-out test set used to evaluate the model).
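For concreteness, a sketch of one sample under the schema above; the field names follow common ShareGPT conventions and the values are illustrative, not copied from the dataset:

{
  "conversations": [
    {"from": "system", "value": "You are a biomedicine function-calling assistant. ..."},
    {"from": "human", "value": "What is the genomic location of the BRCA1 gene in humans?"},
    {"from": "function_call", "value": {"name": "lookup_by_symbol", "arguments": {"species": "human", "symbol": "BRCA1"}}}
  ]
}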

Training setup

  • Base model: Qwen/Qwen3-4B-Instruct-2507
  • Tuning method: full fine-tuning (no LoRA)
  • Framework: LLaMA-Factory
  • Template: qwen3_nothink
  • Cutoff length: 2,048
  • Optimizer: AdamW (fused), lr=2e-5, cosine schedule, warmup_ratio=0.1
  • Epochs: 3
  • Effective batch size: 16 (per-device 1 × grad-accum 16)
  • Precision: bf16
  • Hyperparameter file: qwen3_4b.yaml in the BioTool repo's llamafactory_cfgs/ (a sketch follows below)
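A sketch of what qwen3_4b.yaml plausibly contains, assembled from the settings above; key names follow LLaMA-Factory conventions, the dataset name is a placeholder, and the actual file in llamafactory_cfgs/ is authoritative:

model_name_or_path: Qwen/Qwen3-4B-Instruct-2507
stage: sft
do_train: true
finetuning_type: full
dataset: biotool_train          # placeholder; use the name registered in dataset_info.json
template: qwen3_nothink
cutoff_len: 2048
optim: adamw_torch_fused
learning_rate: 2.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
num_train_epochs: 3.0
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
bf16: true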

Figure: training loss curve.

Evaluation

On the BioTool test split (1,408 samples), this model achieves the highest BioTool Score among the open-source models we evaluated, while also being the smallest (4B parameters). Per-database breakdowns and head-to-head numbers against GPT-5.1, GPT-5.1-Codex, Claude Sonnet 4.5 and Gemini 3 Pro are reported in the paper.

Citation

@misc{gao2026biotoolcomprehensivetoolcallingdataset,
      title={BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models},
      author={Xin Gao and Ruiyi Zhang and Meixi Du and Peijia Qin and Pengtao Xie},
      year={2026},
      eprint={2605.05758},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.05758},
}

License

Released under the Apache 2.0 license, inheriting from the base model. The underlying API responses used during training are subject to the licenses of the respective NCBI, UniProt and Ensembl services.
