Instruction Data Selection via Answer Divergence
ACL 2026 Main Conference
Bo Li, Mingda Wang, Shikun Zhang, Wei Ye
This repository releases the core pipeline of Answer Divergence-Guided Selection (ADG) for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines dispersion magnitude and shape anisotropy, then performs bin-wise selection for semantic coverage.
๐ Overview
Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.
For each instruction, ADG:
- samples multiple answers with relatively high-temperature decoding,
- maps answers into a representation space,
- computes geometry-aware scores from the sampled answers,
- ranks examples by the combined score,
- performs proportional selection within semantic bins.
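As a rough illustration of the geometry-aware scoring step (the exact score definitions are in the paper; this sketch assumes the sampled answers have already been embedded as vectors), the dispersion magnitude and shape anisotropy can be read off the covariance spectrum of the answer embeddings:

```python
import numpy as np

def adg_scores(answer_embeddings: np.ndarray) -> tuple[float, float]:
    """Illustrative geometry scores for one instruction.

    answer_embeddings: (n_samples, dim) array of sampled-answer embeddings.
    Returns (dispersion magnitude, shape anisotropy); not the paper's
    exact formulas, just the underlying idea.
    """
    centered = answer_embeddings - answer_embeddings.mean(axis=0, keepdims=True)
    # Eigenvalues of the sample covariance describe the answer cloud's geometry.
    cov = centered.T @ centered / max(len(answer_embeddings) - 1, 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    dispersion = float(eigvals.sum())  # total variance: how spread out the answers are
    anisotropy = float(eigvals.max() / (eigvals.sum() + 1e-12))  # top-direction share of variance
    return dispersion, anisotropy
```

Identical answers give zero dispersion, while answers varying along a single direction give anisotropy close to 1.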
This repository provides the practical pipeline for:
- multi-sample answer generation,
- instruction embedding and clustering,
- ADG scoring and subset selection,
- model training,
- benchmark evaluation,
- optional task-type analysis.
โ๏ธ Installation
We recommend Python 3.10 or above.
Example:
```bash
git clone https://github.com/WisdomShell/ADG.git
cd ADG
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt
```
Depending on your environment, you may also need to install GPU-specific packages separately.
๐งพ Data Format
ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:
```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "input": "",
  "output": "Transformers are neural networks based on self-attention..."
}
```
Notes:
- `id` should uniquely identify each example.
- `instruction` is required.
- `input` is optional and can be empty or omitted.
- `output` is the reference response in the original instruction dataset.
- Other instruction datasets can be used as long as they are converted into this format.
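A minimal loader that accepts either a JSON array or a JSONL file and checks the schema above could look like this (an illustrative helper, not part of the released scripts):

```python
import json

REQUIRED = {"id", "instruction", "output"}

def validate_example(ex: dict) -> bool:
    """Check one record against the schema ("input" may be empty or omitted)."""
    return (
        REQUIRED.issubset(ex)
        and isinstance(ex["instruction"], str)
        and ex["instruction"].strip() != ""
    )

def load_pool(path: str) -> list[dict]:
    """Load a JSON array or a JSONL file of examples, validating each record."""
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):
        data = json.loads(text)
    else:
        data = [json.loads(line) for line in text.splitlines() if line.strip()]
    bad = [ex.get("id") for ex in data if not validate_example(ex)]
    if bad:
        raise ValueError(f"invalid examples: {bad[:5]}")
    return data
```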
After answer generation, the intermediate JSONL file contains records like:
```json
{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "output": "Transformers are neural networks based on self-attention...",
  "generated_answers": [
    "...",
    "...",
    "...",
    "...",
    "..."
  ]
}
```
๐ Quick Start
Step 1. Prepare the instruction pool
Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.
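Alpaca-style datasets already carry `instruction`/`input`/`output` fields, so conversion mostly means adding a unique `id`. A hypothetical converter (the function name is ours, not from the repo) might be:

```python
def to_adg_format(raw_examples: list[dict]) -> list[dict]:
    """Convert Alpaca-style records ({instruction, input, output}) to the
    required schema by adding a sequential integer id. Illustrative helper."""
    return [
        {
            "id": i,
            "instruction": ex["instruction"],
            "input": ex.get("input", ""),  # default to empty when absent
            "output": ex["output"],
        }
        for i, ex in enumerate(raw_examples)
    ]
```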
Step 2. Generate multiple answers per instruction
Before running, update the following variables in generation/generation.py:
- `MODEL_NAME`
- `OUTPUT_DIR`
- `OUTPUT_FILE`
Then run:
```bash
cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py \
  --input_file /path/to/your/instruction_data.json \
  --batch_size 32
```
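Conceptually, this step attaches several stochastic completions to each record. In the sketch below, `generate_fn` is a placeholder for your actual decoding call (e.g., an HF `model.generate` with `do_sample=True` and a relatively high temperature); the rest is plain bookkeeping:

```python
import json

def sample_answers(examples: list[dict], generate_fn, n_samples: int = 5) -> list[dict]:
    """Attach n_samples stochastic answers per instruction.

    generate_fn(prompt) -> str stands in for the real decoding call.
    """
    records = []
    for ex in examples:
        prompt = ex["instruction"] + ("\n" + ex["input"] if ex.get("input") else "")
        records.append({
            "id": ex["id"],
            "instruction": ex["instruction"],
            "output": ex["output"],
            "generated_answers": [generate_fn(prompt) for _ in range(n_samples)],
        })
    return records

def write_jsonl(records: list[dict], path: str) -> None:
    """Serialize one record per line, matching the intermediate format above."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```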
Step 3. Build instruction embeddings and clustering results
Before running, update the following variables in generation/embedding/embed.py:
- `MODEL_NAME`
- `INPUT_JSONL`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
Then run:
```bash
torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
```
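The clustering side of this step amounts to partitioning instruction embeddings into `K_CLUSTERS` semantic bins. A minimal sketch (assuming embeddings already produced by a sentence encoder, and using scikit-learn's k-means rather than the repository's exact code):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_instructions(embeddings: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Assign each instruction embedding to one of k semantic bins.

    embeddings: (n, dim) array from a sentence encoder (the script's MODEL_NAME).
    Returns an (n,) array of cluster ids.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return km.fit_predict(embeddings)
```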
Step 4. Run ADG scoring and selection
Choose the scoring script that matches your backbone.
For LLaMA, configure these variables in ADG/ADG_llama.py:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `K_CLUSTERS`
- `FINAL_SELECT_COUNT`
Then run:
```bash
python ADG/ADG_llama.py
```
For Qwen, configure these variables in ADG/ADG_qwen.py:
- `model_name`
- `INPUT_JSONL`
- `OUTPUT_DIR`
- `EMBEDDINGS_PATH`
- `CLUSTERS_PATH`
- `CHECKPOINT_DIR`
- `FINAL_SELECT_COUNT`
Then run:
```bash
python ADG/ADG_qwen.py
```
The selector saves `top.json`, `middle.json`, and `bottom.json` under the configured `OUTPUT_DIR`.
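The bin-wise proportional selection described in the overview can be sketched as follows (an illustration of the idea, not the released selector: each cluster contributes a quota proportional to its size, filled with its highest-scoring members):

```python
from collections import defaultdict

def binwise_select(scores: list[float], cluster_ids: list[int], budget: int) -> list[int]:
    """Pick roughly `budget` example indices with per-bin proportional quotas."""
    bins = defaultdict(list)
    for idx, c in enumerate(cluster_ids):
        bins[c].append(idx)
    n = len(scores)
    selected = []
    for members in bins.values():
        # Each bin's quota is proportional to its share of the pool.
        quota = max(1, round(budget * len(members) / n))
        members.sort(key=lambda i: scores[i], reverse=True)
        selected.extend(members[:quota])
    # Trim any rounding overshoot, keeping the globally best-scored picks.
    return sorted(selected, key=lambda i: scores[i], reverse=True)[:budget]
```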
Step 5. Train the backbone model
Use the selected subset, typically top.json, for instruction tuning.
For LLaMA:
```bash
cd train
bash train_llama.sh
```
For Qwen:
```bash
cd train
bash train_qwen.sh
```
Before running, update paths such as:
- `--model_name_or_path`
- `--data_path`
- `--output_dir`
Step 6. Evaluate the trained checkpoint
This repository uses lm-evaluation-harness for benchmark evaluation.
Install it first if needed:
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```
Then configure MODEL_PATH and output paths in eval/eval.sh, and run:
```bash
cd eval
bash eval.sh
```
The evaluation script currently includes:
- BBH
- GSM8K
- MMLU
- TruthfulQA
- MBPP
- HumanEval
๐ Citation
```bibtex
@article{li2026instruction,
  title={Instruction Data Selection via Answer Divergence},
  author={Li, Bo and Wang, Mingda and Zhang, Shikun and Ye, Wei},
  journal={arXiv preprint arXiv:2604.10448},
  year={2026}
}
```