Qwen3-4B-EnvTuning


Overview

Qwen3-4B-EnvTuning is a continued-training checkpoint built on top of Qwen/Qwen3-4B-Instruct-2507.

This model follows the training idea in the paper Don't Just Fine-tune the Agent, Tune the Environment, which shifts agent learning from static trajectory imitation to environment-based exploration. The core idea is to improve agent capability by tuning the learning environment itself, instead of relying only on fine-tuning the policy with pre-collected demonstrations.

  • Base model: Qwen/Qwen3-4B-Instruct-2507
  • Released model: IcyFish/Qwen3-4B-EnvTuning
  • Model type: Causal Language Model
  • Training style: continued training based on the Environment Tuning paradigm

Introduction

The paper studies agent training under extreme data scarcity. In multi-turn tool-use settings, plain SFT on synthetic trajectories often overfits, while direct RL tends to suffer from cold-start and unstable optimization. Environment Tuning addresses this by redesigning the interaction loop between agent and environment so that exploration becomes more learnable.

The method centers on three ingredients:

  • Structured curriculum: train the agent from easy skills to harder multi-turn tool-use behaviors.
  • Actionable environment augmentation: replace vague failures with corrective hints that reveal tool dependencies and constraints.
  • Fine-grained progress rewards: provide denser turn-level learning signals instead of only sparse episode-level success.
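The third ingredient can be illustrated with a toy sketch. This is not the paper's actual reward implementation; it only shows the shape of a turn-level progress reward, assuming each task decomposes into countable subgoals (e.g. required tool calls):

```python
# Toy sketch of a turn-level progress reward (illustrative only; the exact
# reward shaping in the paper may differ). Each turn is scored by how many
# subgoals it newly completes, instead of a single sparse end-of-episode signal.

def progress_rewards(completed_per_turn, total_subgoals, success_bonus=1.0):
    """completed_per_turn[i] = cumulative number of subgoals done after turn i."""
    rewards = []
    prev = 0
    for done in completed_per_turn:
        # dense signal: fraction of subgoals newly completed this turn
        rewards.append((done - prev) / total_subgoals)
        prev = done
    if completed_per_turn and completed_per_turn[-1] == total_subgoals:
        rewards[-1] += success_bonus  # sparse episode-level success on top
    return rewards

print(progress_rewards([1, 1, 3, 4], total_subgoals=4))
# → [0.25, 0.0, 0.5, 1.25]
```

Each turn receives credit proportional to its incremental progress, so long multi-turn episodes still provide learning signal even when the final goal is not reached.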

The paper reports that this paradigm can train competitive agents from only a small number of problem instances, with better out-of-distribution generalization than pure SFT baselines.

The original paper includes an introduction figure illustrating the difference between static SFT, standard RL, and Environment Tuning. To keep this Hugging Face repository lightweight and push-friendly, the figure is not embedded as a local binary asset here.

Training Pipeline

This checkpoint is a Qwen3-4B-based release inspired by the training pipeline proposed in the paper. At a high level, the recipe consists of:

  1. Start from a strong instruction-tuned base model.
  2. Train with a staged curriculum rather than optimizing the full task from the beginning.
  3. Use augmented environment feedback in the middle stages to turn failed tool interactions into useful supervision.
  4. Use fine-grained progress rewards to stabilize long-horizon learning.
  5. Remove the extra environment assistance in the final stage to better match real evaluation conditions.
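The staged recipe above can be sketched as a simple curriculum loop. Stage names and configuration keys here are hypothetical, not taken from the paper's code:

```python
# Minimal sketch of the staged recipe (hypothetical stage names and fields;
# not the paper's actual training code). Middle stages enable augmented
# environment feedback; the final stage removes it to match real evaluation.

STAGES = [
    {"name": "warmup",      "tasks": "single-turn", "augmented_feedback": False},
    {"name": "exploration", "tasks": "multi-turn",  "augmented_feedback": True},
    {"name": "final",       "tasks": "multi-turn",  "augmented_feedback": False},
]

def run_curriculum(train_step, stages=STAGES, steps_per_stage=1000):
    """Run optimization stage by stage under each stage's environment config."""
    for stage in stages:
        for _ in range(steps_per_stage):
            train_step(stage)  # one RL/optimization step with this env config
```

The key design choice is that the environment configuration, not just the data, changes across stages: assistance is present while exploration is hardest and withdrawn before final evaluation.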

The paper also provides a pipeline figure showing the curriculum stages, augmented feedback, and the agent learning loop. This repository keeps the README text-only for compatibility with the current Hugging Face push restrictions on binary assets.

Training Setup and Evaluation

This checkpoint was not evaluated in the original paper. It is a follow-up model release that keeps the same training philosophy and core method, but uses a different concrete training setup.

  • Training data used for this checkpoint: 400 BFCL V3 training instances
  • Evaluation setting: tested on 400 unseen BFCL V3 instances that were not used for training

| Category | Correct | Total | Accuracy |
|---|---|---|---|
| multi_turn_base | 69 | 100 | 69.00% |
| multi_turn_long_context | 65 | 100 | 65.00% |
| multi_turn_miss_func | 64 | 100 | 64.00% |
| multi_turn_miss_param | 56 | 100 | 56.00% |
| **OVERALL** | **254** | **400** | **63.50%** |
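The overall score is the micro-average over the four categories, which can be checked directly from the per-category counts:

```python
# Sanity-check the overall accuracy as the micro-average of the four
# BFCL V3 categories reported above.
results = {
    "multi_turn_base": (69, 100),
    "multi_turn_long_context": (65, 100),
    "multi_turn_miss_func": (64, 100),
    "multi_turn_miss_param": (56, 100),
}
correct = sum(c for c, _ in results.values())  # 254
total = sum(t for _, t in results.values())    # 400
print(f"OVERALL: {correct}/{total} = {correct / total:.2%}")
# → OVERALL: 254/400 = 63.50%
```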

These numbers should be understood as the evaluation results of this released checkpoint, rather than results reported in the original paper.

Model Details

Unless otherwise noted, this checkpoint keeps the same underlying architecture as Qwen3-4B-Instruct-2507:

  • Architecture: Qwen3ForCausalLM
  • Parameters: 4.0B
  • Non-embedding parameters: 3.6B
  • Layers: 36
  • Attention heads: 32 for Q and 8 for KV
  • Native context length: 262,144
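The ~0.4B gap between total and non-embedding parameters is roughly the embedding matrix. Assuming the upstream config values `vocab_size=151936` and `hidden_size=2560` (these should be verified against the base model's `config.json`):

```python
# Rough consistency check of the parameter breakdown above, assuming the
# upstream Qwen3-4B config values (verify against config.json).
vocab_size, hidden_size = 151936, 2560
embedding_params = vocab_size * hidden_size
print(f"{embedding_params / 1e9:.2f}B")  # → 0.39B, consistent with 4.0B - 3.6B
```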

For the original architecture and upstream model information, please refer to the Qwen/Qwen3-4B-Instruct-2507 model card.

Quick Start

Use the model with the latest version of transformers. With transformers<4.51.0, you may encounter:

```
KeyError: 'qwen3'
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "IcyFish/Qwen3-4B-EnvTuning"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
)

output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print(content)
```

For serving, you can use SGLang or vLLM:

```shell
# SGLang
python -m sglang.launch_server \
  --model-path IcyFish/Qwen3-4B-EnvTuning \
  --context-length 262144

# vLLM
vllm serve IcyFish/Qwen3-4B-EnvTuning --max-model-len 262144
```

If you encounter out-of-memory issues, consider reducing the effective context length, for example to 32768.

Notes

  • This repository releases a derived checkpoint, not the original upstream Qwen release.
  • This checkpoint follows the same method family as the paper, but it is not one of the exact models reported in the paper's main experiments.
  • The figures from the paper are referenced conceptually in the README, but local binary image assets are intentionally omitted to keep the repository easy to publish on Hugging Face.
  • The BFCL V3 results reported above are model-specific numbers for this checkpoint and should not be confused with either upstream Qwen3 results or the original paper's reported models.

License

This model is released under the same license as the upstream Qwen checkpoint, Qwen/Qwen3-4B-Instruct-2507. Please review the upstream license terms before downstream use.

Citation

If you use this model, please consider citing both the Environment Tuning paper and the original Qwen3 technical report.

@article{lu2025dont,
  title={Don't Just Fine-tune the Agent, Tune the Environment},
  author={Lu, Siyuan and Wang, Zechuan and Zhang, Hongxuan and Wu, Qintong and Gan, Leilei and Zhuang, Chenyi and Gu, Jinjie and Lin, Tao},
  journal={arXiv preprint arXiv:2510.10197},
  year={2025},
  url={https://arxiv.org/abs/2510.10197}
}
@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}