Instruction Data Selection via Answer Divergence


ACL 2026 Main Conference

Bo Li, Mingda Wang, Shikun Zhang, Wei Ye

This repository releases the core pipeline of Answer Divergence-Guided Selection (ADG) for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines dispersion magnitude and shape anisotropy, then performs bin-wise selection for semantic coverage.


๐ŸŒŸ Overview

Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.

For each instruction, ADG:

  1. samples multiple answers with relatively high-temperature decoding,
  2. maps answers into a representation space,
  3. computes geometry-aware scores from the sampled answers,
  4. ranks examples by the combined score,
  5. performs proportional selection within semantic bins.
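Conceptually, the geometry-aware score in step 3 combines how far the sampled answers spread (dispersion magnitude) with how directional that spread is (shape anisotropy). The sketch below is a minimal illustration under assumed estimators, not the paper's exact formulation: the mixing weight `alpha` and the specific choice of mean-distance dispersion and top-eigenvalue anisotropy are assumptions.

```python
import numpy as np

def adg_score(answer_embeddings: np.ndarray, alpha: float = 0.5) -> float:
    """Score one instruction from the embeddings of its sampled answers.

    answer_embeddings: (n_samples, dim) array.
    Dispersion magnitude: mean distance of the answers to their centroid.
    Shape anisotropy: fraction of covariance variance along the top direction.
    alpha is a hypothetical mixing weight; the paper's combination may differ.
    """
    centroid = answer_embeddings.mean(axis=0)
    dispersion = np.linalg.norm(answer_embeddings - centroid, axis=1).mean()
    cov = np.cov(answer_embeddings, rowvar=False)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    total = eigvals.sum()
    anisotropy = eigvals.max() / total if total > 0 else 0.0
    return alpha * dispersion + (1 - alpha) * anisotropy
```

Identical answers yield a score of zero (no dispersion, no anisotropy); answers that spread out, especially along one dominant direction, score higher.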

This repository provides the practical pipeline for:

  • multi-sample answer generation,
  • instruction embedding and clustering,
  • ADG scoring and subset selection,
  • model training,
  • benchmark evaluation,
  • optional task-type analysis.

โš™๏ธ Installation

We recommend Python 3.10 or above.

Example:

git clone https://github.com/WisdomShell/ADG.git
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt

Depending on your environment, you may also need to install GPU-specific packages (for example, a PyTorch build matching your CUDA version) separately.


๐Ÿงพ Data Format

ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:

{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "input": "",
  "output": "Transformers are neural networks based on self-attention..."
}

Notes:

  • id should uniquely identify each example.
  • instruction is required.
  • input is optional and can be empty or omitted.
  • output is the reference response in the original instruction dataset.
  • Other instruction datasets can be used as long as they are converted into this format.
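When converting another dataset into this format, a small validator catches schema problems early. The helper below is illustrative (its name and error handling are not part of the repo); it accepts either a JSON list or JSONL, enforces the notes above, and fills in the optional `input` field.

```python
import json

REQUIRED_KEYS = {"id", "instruction", "output"}

def load_instruction_pool(path: str) -> list[dict]:
    """Load a JSON or JSONL instruction file and validate the expected schema."""
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):  # plain JSON list of examples
        examples = json.loads(text)
    else:  # JSONL: one JSON object per line
        examples = [json.loads(line) for line in text.splitlines() if line.strip()]
    seen_ids = set()
    for ex in examples:
        missing = REQUIRED_KEYS - ex.keys()
        if missing:
            raise ValueError(f"example {ex.get('id')} missing keys: {missing}")
        if ex["id"] in seen_ids:  # id must uniquely identify each example
            raise ValueError(f"duplicate id: {ex['id']}")
        seen_ids.add(ex["id"])
        ex.setdefault("input", "")  # input is optional and may be omitted
    return examples
```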

After answer generation, the intermediate JSONL file contains records like:

{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "output": "Transformers are neural networks based on self-attention...",
  "generated_answers": [
    "...",
    "...",
    "...",
    "...",
    "..."
  ]
}

๐Ÿš€ Quick Start

Step 1. Prepare the instruction pool

Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.

Step 2. Generate multiple answers per instruction

Before running, update the following variables in generation/generation.py:

  • MODEL_NAME
  • OUTPUT_DIR
  • OUTPUT_FILE

Then run:

cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py \
  --input_file /path/to/your/instruction_data.json \
  --batch_size 32
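Conceptually, this step turns each input record into the intermediate format shown in the Data Format section. The sketch below abstracts the actual model call behind a `generate_one` callable (a stand-in for stochastic, high-temperature decoding; it is not the repo's real API), which is invoked once per sample so each call can return a different answer.

```python
from typing import Callable

def add_sampled_answers(
    examples: list[dict],
    generate_one: Callable[[str], str],
    n_samples: int = 5,
) -> list[dict]:
    """Attach a `generated_answers` list to each example.

    `generate_one` stands in for a stochastic model call (e.g. sampling with a
    relatively high temperature); calling it n_samples times per instruction
    yields the multiple answers that ADG scores later.
    """
    records = []
    for ex in examples:
        # Fold the optional input field into the prompt when present.
        prompt = ex["instruction"] if not ex.get("input") else f'{ex["instruction"]}\n{ex["input"]}'
        answers = [generate_one(prompt) for _ in range(n_samples)]
        records.append({
            "id": ex["id"],
            "instruction": ex["instruction"],
            "output": ex["output"],
            "generated_answers": answers,
        })
    return records
```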

Step 3. Build instruction embeddings and clustering results

Before running, update the following variables in generation/embedding/embed.py:

  • MODEL_NAME
  • INPUT_JSONL
  • EMBEDDINGS_PATH
  • CLUSTERS_PATH
  • K_CLUSTERS

Then run:

torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
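Conceptually, this step reduces to k-means over the instruction embeddings: the labels define the semantic bins that selection later draws from proportionally. A minimal sketch using scikit-learn (the actual embed.py may use a different implementation; the function name is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_instructions(embeddings: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Assign each instruction embedding to one of k semantic bins.

    `embeddings` corresponds to the (n, dim) array saved at EMBEDDINGS_PATH,
    `k` to K_CLUSTERS, and the returned labels to CLUSTERS_PATH.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return km.fit_predict(embeddings)
```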

Step 4. Run ADG scoring and selection

Choose the scoring script that matches your backbone.

For LLaMA, configure these variables in ADG/ADG_llama.py:

  • model_name
  • INPUT_JSONL
  • OUTPUT_DIR
  • EMBEDDINGS_PATH
  • CLUSTERS_PATH
  • K_CLUSTERS
  • FINAL_SELECT_COUNT

Then run:

python ADG/ADG_llama.py

For Qwen, configure these variables in ADG/ADG_qwen.py:

  • model_name
  • INPUT_JSONL
  • OUTPUT_DIR
  • EMBEDDINGS_PATH
  • CLUSTERS_PATH
  • CHECKPOINT_DIR
  • FINAL_SELECT_COUNT

Then run:

python ADG/ADG_qwen.py

The selector saves:

  • top.json
  • middle.json
  • bottom.json

under the configured OUTPUT_DIR.
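The bin-wise selection behind these outputs can be sketched as follows: each cluster receives a slot quota proportional to its size, and the highest-scoring examples fill that quota (this corresponds to top.json; middle/bottom splits are omitted). The function and its rounding policy are illustrative assumptions, not the repo's exact code.

```python
import numpy as np

def binwise_select(scores: np.ndarray, labels: np.ndarray, budget: int) -> list[int]:
    """Pick `budget` example indices, allocating slots to each cluster in
    proportion to its size and taking the top-scoring examples per bin."""
    n = len(scores)
    selected: list[int] = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        quota = max(1, round(budget * len(idx) / n))
        top = idx[np.argsort(scores[idx])[::-1][:quota]]
        selected.extend(top.tolist())
    # Rounding can overshoot the budget; trim by global score order.
    # (Padding on undershoot is omitted in this sketch.)
    if len(selected) > budget:
        selected = sorted(selected, key=lambda i: -scores[i])[:budget]
    return selected
```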

Step 5. Train the backbone model

Use the selected subset, typically top.json, for instruction tuning.

For LLaMA:

cd train
bash train_llama.sh

For Qwen:

cd train
bash train_qwen.sh

Before running, update paths such as:

  • --model_name_or_path
  • --data_path
  • --output_dir

Step 6. Evaluate the trained checkpoint

This repository uses lm-evaluation-harness for benchmark evaluation.

Install it first if needed:

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .

Then configure MODEL_PATH and output paths in eval/eval.sh, and run:

cd eval
bash eval.sh

The evaluation script currently includes:

  • BBH
  • GSM8K
  • MMLU
  • TruthfulQA
  • MBPP
  • HumanEval

๐Ÿ“– Citation

@article{li2026instruction,
  title={Instruction Data Selection via Answer Divergence},
  author={Li, Bo and Wang, Mingda and Zhang, Shikun and Ye, Wei},
  journal={arXiv preprint arXiv:2604.10448},
  year={2026}
}
