Qwen3-4B-EnvTuning


Overview

Qwen3-4B-EnvTuning is a continued-training checkpoint built on top of Qwen/Qwen3-4B-Instruct-2507.

This model follows the training idea in the paper Don't Just Fine-tune the Agent, Tune the Environment, which shifts agent learning from static trajectory imitation to environment-based exploration. The core idea is to improve agent capability by tuning the learning environment itself, instead of relying only on fine-tuning the policy with pre-collected demonstrations.

  • Base model: Qwen/Qwen3-4B-Instruct-2507
  • Released model: IcyFish/Qwen3-4B-EnvTuning
  • Model type: Causal Language Model
  • Training style: continued training based on the Environment Tuning paradigm

Introduction

The paper studies agent training under extreme data scarcity. In multi-turn tool-use settings, plain SFT on synthetic trajectories often overfits, while direct RL tends to suffer from cold-start and unstable optimization. Environment Tuning addresses this by redesigning the interaction loop between agent and environment so that exploration becomes more learnable.

The method centers on three ingredients:

  • Structured curriculum: train the agent from easy skills to harder multi-turn tool-use behaviors.
  • Actionable environment augmentation: replace vague failures with corrective hints that reveal tool dependencies and constraints.
  • Fine-grained progress rewards: provide denser turn-level learning signals instead of only sparse episode-level success.
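The third ingredient can be illustrated with a toy sketch. This is not the paper's actual reward implementation; it only shows the shape of a turn-level progress reward, assuming each task decomposes into countable subgoals (e.g. required tool calls):

```python
# Toy sketch of a turn-level progress reward (illustrative only; the exact
# reward shaping in the paper may differ). Each turn is scored by how many
# subgoals it newly completes, instead of a single sparse end-of-episode signal.

def progress_rewards(completed_per_turn, total_subgoals, success_bonus=1.0):
    """completed_per_turn[i] = cumulative number of subgoals done after turn i."""
    rewards = []
    prev = 0
    for done in completed_per_turn:
        # dense signal: fraction of subgoals newly completed this turn
        rewards.append((done - prev) / total_subgoals)
        prev = done
    if completed_per_turn and completed_per_turn[-1] == total_subgoals:
        rewards[-1] += success_bonus  # sparse episode-level success on top
    return rewards

print(progress_rewards([1, 1, 3, 4], total_subgoals=4))
# → [0.25, 0.0, 0.5, 1.25]
```

Each turn receives credit proportional to its incremental progress, so long multi-turn episodes still provide learning signal even when the final goal is not reached.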

The paper reports that this paradigm can train competitive agents from only a small number of problem instances, with better out-of-distribution generalization than pure SFT baselines.

The original paper includes an introduction figure illustrating the difference between static SFT, standard RL, and Environment Tuning. To keep this Hugging Face repository lightweight and push-friendly, the figure is not embedded as a local binary asset here.

Training Pipeline

This checkpoint is a Qwen3-4B-based release inspired by the training pipeline proposed in the paper. At a high level, the recipe consists of:

  1. Start from a strong instruction-tuned base model.
  2. Train with a staged curriculum rather than optimizing the full task from the beginning.
  3. Use augmented environment feedback in the middle stages to turn failed tool interactions into useful supervision.
  4. Use fine-grained progress rewards to stabilize long-horizon learning.
  5. Remove the extra environment assistance in the final stage to better match real evaluation conditions.
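The staged recipe above can be sketched as a simple curriculum loop. Stage names and configuration keys here are hypothetical, not taken from the paper's code:

```python
# Minimal sketch of the staged recipe (hypothetical stage names and fields;
# not the paper's actual training code). Middle stages enable augmented
# environment feedback; the final stage removes it to match real evaluation.

STAGES = [
    {"name": "warmup",      "tasks": "single-turn", "augmented_feedback": False},
    {"name": "exploration", "tasks": "multi-turn",  "augmented_feedback": True},
    {"name": "final",       "tasks": "multi-turn",  "augmented_feedback": False},
]

def run_curriculum(train_step, stages=STAGES, steps_per_stage=1000):
    """Run optimization stage by stage under each stage's environment config."""
    for stage in stages:
        for _ in range(steps_per_stage):
            train_step(stage)  # one RL/optimization step with this env config
```

The key design choice is that the environment configuration, not just the data, changes across stages: assistance is present while exploration is hardest and withdrawn before final evaluation.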

The paper also provides a pipeline figure showing the curriculum stages, augmented feedback, and the agent learning loop. This repository keeps the README text-only for compatibility with the current Hugging Face push restrictions on binary assets.

Training Setup and Evaluation

This checkpoint was not evaluated in the original paper. It is a follow-up model release that keeps the same training philosophy and core method, but uses a different concrete training setup.

  • Training data used for this checkpoint: 400 BFCL V3 training instances
  • Evaluation setting: tested on 400 unseen BFCL V3 instances that were not used for training

| Category | Correct | Total | Accuracy |
|---|---|---|---|
| multi_turn_base | 69 | 100 | 69.00% |
| multi_turn_long_context | 65 | 100 | 65.00% |
| multi_turn_miss_func | 64 | 100 | 64.00% |
| multi_turn_miss_param | 56 | 100 | 56.00% |
| **OVERALL** | **254** | **400** | **63.50%** |
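The overall score is the micro-average over the four categories, which can be checked directly from the per-category counts:

```python
# Sanity-check the overall accuracy as the micro-average of the four
# BFCL V3 categories reported above.
results = {
    "multi_turn_base": (69, 100),
    "multi_turn_long_context": (65, 100),
    "multi_turn_miss_func": (64, 100),
    "multi_turn_miss_param": (56, 100),
}
correct = sum(c for c, _ in results.values())  # 254
total = sum(t for _, t in results.values())    # 400
print(f"OVERALL: {correct}/{total} = {correct / total:.2%}")
# → OVERALL: 254/400 = 63.50%
```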

These numbers should be understood as the evaluation results of this released checkpoint, rather than results reported in the original paper.

Model Details

Unless otherwise noted, this checkpoint keeps the same underlying architecture as Qwen3-4B-Instruct-2507:

  • Architecture: Qwen3ForCausalLM
  • Parameters: 4.0B
  • Non-embedding parameters: 3.6B
  • Layers: 36
  • Attention heads: 32 for Q and 8 for KV
  • Native context length: 262,144
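The ~0.4B gap between total and non-embedding parameters is roughly the embedding matrix. Assuming the upstream config values `vocab_size=151936` and `hidden_size=2560` (these should be verified against the base model's `config.json`):

```python
# Rough consistency check of the parameter breakdown above, assuming the
# upstream Qwen3-4B config values (verify against config.json).
vocab_size, hidden_size = 151936, 2560
embedding_params = vocab_size * hidden_size
print(f"{embedding_params / 1e9:.2f}B")  # → 0.39B, consistent with 4.0B - 3.6B
```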

For the original architecture and upstream model information, please refer to the Qwen/Qwen3-4B-Instruct-2507 model card.

Quick Start

Use the model with the latest version of transformers. With transformers<4.51.0, you may encounter:

```
KeyError: 'qwen3'
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "IcyFish/Qwen3-4B-EnvTuning"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
)

output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print(content)
```

For serving, you can use SGLang or vLLM:

```shell
# SGLang
python -m sglang.launch_server \
  --model-path IcyFish/Qwen3-4B-EnvTuning \
  --context-length 262144

# vLLM
vllm serve IcyFish/Qwen3-4B-EnvTuning --max-model-len 262144
```

If you encounter out-of-memory issues, consider reducing the effective context length, for example to 32768.

Notes

  • This repository releases a derived checkpoint, not the original upstream Qwen release.
  • This checkpoint follows the same method family as the paper, but it is not one of the exact models reported in the paper's main experiments.
  • The figures from the paper are referenced conceptually in the README, but local binary image assets are intentionally omitted to keep the repository easy to publish on Hugging Face.
  • The BFCL V3 results reported above are model-specific numbers for this checkpoint and should not be confused with either upstream Qwen3 results or the original paper's reported models.

License

This model is released under the same license as the upstream Qwen checkpoint, Qwen/Qwen3-4B-Instruct-2507. Please review the upstream license terms before downstream use.

Citation

If you use this model, please consider citing both the Environment Tuning paper and the original Qwen3 technical report.

@article{lu2025dont,
  title={Don't Just Fine-tune the Agent, Tune the Environment},
  author={Lu, Siyuan and Wang, Zechuan and Zhang, Hongxuan and Wu, Qintong and Gan, Leilei and Zhuang, Chenyi and Gu, Jinjie and Lin, Tao},
  journal={arXiv preprint arXiv:2510.10197},
  year={2025},
  url={https://arxiv.org/abs/2510.10197}
}
@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388}
}