Instruction Data Selection via Answer Divergence


ACL 2026 Main Conference

Bo Li, Mingda Wang, Shikun Zhang, Wei Ye

This repository releases the core pipeline of Answer Divergence-Guided Selection (ADG) for instruction data selection. ADG scores each instruction by the geometric structure of multiple sampled answers, rather than relying on a single reference response. In the paper, ADG consistently improves instruction tuning under a fixed 10K budget across two backbones, three public instruction pools, and six benchmarks spanning reasoning, knowledge, and coding. The method combines dispersion magnitude and shape anisotropy, then performs bin-wise selection for semantic coverage.


๐ŸŒŸ Overview

Instruction tuning quality depends heavily on which examples are selected under a fixed data budget. ADG addresses this by examining how a base model responds to the same instruction under stochastic decoding.

For each instruction, ADG:

  1. samples multiple answers with relatively high-temperature decoding,
  2. maps answers into a representation space,
  3. computes geometry-aware scores from the sampled answers,
  4. ranks examples by the combined score,
  5. performs proportional selection within semantic bins.
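Conceptually, the geometry-aware score in step 3 combines how far the sampled answers spread (dispersion magnitude) with how directional that spread is (shape anisotropy). The sketch below is a minimal illustration under assumed estimators, not the paper's exact formulation: the mixing weight `alpha` and the specific choice of mean-distance dispersion and top-eigenvalue anisotropy are assumptions.

```python
import numpy as np

def adg_score(answer_embeddings: np.ndarray, alpha: float = 0.5) -> float:
    """Score one instruction from the embeddings of its sampled answers.

    answer_embeddings: (n_samples, dim) array.
    Dispersion magnitude: mean distance of the answers to their centroid.
    Shape anisotropy: fraction of covariance variance along the top direction.
    alpha is a hypothetical mixing weight; the paper's combination may differ.
    """
    centroid = answer_embeddings.mean(axis=0)
    dispersion = np.linalg.norm(answer_embeddings - centroid, axis=1).mean()
    cov = np.cov(answer_embeddings, rowvar=False)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    total = eigvals.sum()
    anisotropy = eigvals.max() / total if total > 0 else 0.0
    return alpha * dispersion + (1 - alpha) * anisotropy
```

Identical answers yield a score of zero (no dispersion, no anisotropy); answers that spread out, especially along one dominant direction, score higher.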

This repository provides the practical pipeline for:

  • multi-sample answer generation,
  • instruction embedding and clustering,
  • ADG scoring and subset selection,
  • model training,
  • benchmark evaluation,
  • optional task-type analysis.

โš™๏ธ Installation

We recommend Python 3.10 or above.

Example:

git clone https://github.com/WisdomShell/ADG.git
conda create -n adg python=3.12.9
conda activate adg
pip install -r requirements.txt

Depending on your environment, you may also need to install GPU-specific packages (for example, a PyTorch build matching your CUDA version) separately.


๐Ÿงพ Data Format

ADG expects instruction datasets in JSON or JSONL format. Each example should follow the schema below:

{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "input": "",
  "output": "Transformers are neural networks based on self-attention..."
}

Notes:

  • id should uniquely identify each example.
  • instruction is required.
  • input is optional and can be empty or omitted.
  • output is the reference response in the original instruction dataset.
  • Other instruction datasets can be used as long as they are converted into this format.
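When converting another dataset into this format, a small validator catches schema problems early. The helper below is illustrative (its name and error handling are not part of the repo); it accepts either a JSON list or JSONL, enforces the notes above, and fills in the optional `input` field.

```python
import json

REQUIRED_KEYS = {"id", "instruction", "output"}

def load_instruction_pool(path: str) -> list[dict]:
    """Load a JSON or JSONL instruction file and validate the expected schema."""
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):  # plain JSON list of examples
        examples = json.loads(text)
    else:  # JSONL: one JSON object per line
        examples = [json.loads(line) for line in text.splitlines() if line.strip()]
    seen_ids = set()
    for ex in examples:
        missing = REQUIRED_KEYS - ex.keys()
        if missing:
            raise ValueError(f"example {ex.get('id')} missing keys: {missing}")
        if ex["id"] in seen_ids:  # id must uniquely identify each example
            raise ValueError(f"duplicate id: {ex['id']}")
        seen_ids.add(ex["id"])
        ex.setdefault("input", "")  # input is optional and may be omitted
    return examples
```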

After answer generation, the intermediate JSONL file contains records like:

{
  "id": 0,
  "instruction": "Write a short explanation of transformers.",
  "output": "Transformers are neural networks based on self-attention...",
  "generated_answers": [
    "...",
    "...",
    "...",
    "...",
    "..."
  ]
}

๐Ÿš€ Quick Start

Step 1. Prepare the instruction pool

Download and preprocess your instruction dataset, such as Alpaca-GPT4, WizardLM, or CoT, into the required format.

Step 2. Generate multiple answers per instruction

Before running, update the following variables in generation/generation.py:

  • MODEL_NAME
  • OUTPUT_DIR
  • OUTPUT_FILE

Then run:

cd generation
torchrun --nproc_per_node=4 --master_port=29500 generation.py \
  --input_file /path/to/your/instruction_data.json \
  --batch_size 32
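Conceptually, this step turns each input record into the intermediate format shown in the Data Format section. The sketch below abstracts the actual model call behind a `generate_one` callable (a stand-in for stochastic, high-temperature decoding; it is not the repo's real API), which is invoked once per sample so each call can return a different answer.

```python
from typing import Callable

def add_sampled_answers(
    examples: list[dict],
    generate_one: Callable[[str], str],
    n_samples: int = 5,
) -> list[dict]:
    """Attach a `generated_answers` list to each example.

    `generate_one` stands in for a stochastic model call (e.g. sampling with a
    relatively high temperature); calling it n_samples times per instruction
    yields the multiple answers that ADG scores later.
    """
    records = []
    for ex in examples:
        # Fold the optional input field into the prompt when present.
        prompt = ex["instruction"] if not ex.get("input") else f'{ex["instruction"]}\n{ex["input"]}'
        answers = [generate_one(prompt) for _ in range(n_samples)]
        records.append({
            "id": ex["id"],
            "instruction": ex["instruction"],
            "output": ex["output"],
            "generated_answers": answers,
        })
    return records
```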

Step 3. Build instruction embeddings and clustering results

Before running, update the following variables in generation/embedding/embed.py:

  • MODEL_NAME
  • INPUT_JSONL
  • EMBEDDINGS_PATH
  • CLUSTERS_PATH
  • K_CLUSTERS

Then run:

torchrun --nproc_per_node=4 --master_port=29501 generation/embedding/embed.py
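Conceptually, this step reduces to k-means over the instruction embeddings: the labels define the semantic bins that selection later draws from proportionally. A minimal sketch using scikit-learn (the actual embed.py may use a different implementation; the function name is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_instructions(embeddings: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Assign each instruction embedding to one of k semantic bins.

    `embeddings` corresponds to the (n, dim) array saved at EMBEDDINGS_PATH,
    `k` to K_CLUSTERS, and the returned labels to CLUSTERS_PATH.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return km.fit_predict(embeddings)
```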

Step 4. Run ADG scoring and selection

Choose the scoring script that matches your backbone.

For LLaMA, configure these variables in ADG/ADG_llama.py:

  • model_name
  • INPUT_JSONL
  • OUTPUT_DIR
  • EMBEDDINGS_PATH
  • CLUSTERS_PATH
  • K_CLUSTERS
  • FINAL_SELECT_COUNT

Then run:

python ADG/ADG_llama.py

For Qwen, configure these variables in ADG/ADG_qwen.py:

  • model_name
  • INPUT_JSONL
  • OUTPUT_DIR
  • EMBEDDINGS_PATH
  • CLUSTERS_PATH
  • CHECKPOINT_DIR
  • FINAL_SELECT_COUNT

Then run:

python ADG/ADG_qwen.py

The selector saves:

  • top.json
  • middle.json
  • bottom.json

under the configured OUTPUT_DIR.
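The bin-wise selection behind these outputs can be sketched as follows: each cluster receives a slot quota proportional to its size, and the highest-scoring examples fill that quota (this corresponds to top.json; middle/bottom splits are omitted). The function and its rounding policy are illustrative assumptions, not the repo's exact code.

```python
import numpy as np

def binwise_select(scores: np.ndarray, labels: np.ndarray, budget: int) -> list[int]:
    """Pick `budget` example indices, allocating slots to each cluster in
    proportion to its size and taking the top-scoring examples per bin."""
    n = len(scores)
    selected: list[int] = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        quota = max(1, round(budget * len(idx) / n))
        top = idx[np.argsort(scores[idx])[::-1][:quota]]
        selected.extend(top.tolist())
    # Rounding can overshoot the budget; trim by global score order.
    # (Padding on undershoot is omitted in this sketch.)
    if len(selected) > budget:
        selected = sorted(selected, key=lambda i: -scores[i])[:budget]
    return selected
```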

Step 5. Train the backbone model

Use the selected subset, typically top.json, for instruction tuning.

For LLaMA:

cd train
bash train_llama.sh

For Qwen:

cd train
bash train_qwen.sh

Before running, update paths such as:

  • --model_name_or_path
  • --data_path
  • --output_dir

Step 6. Evaluate the trained checkpoint

This repository uses lm-evaluation-harness for benchmark evaluation.

Install it first if needed:

git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .

Then configure MODEL_PATH and output paths in eval/eval.sh, and run:

cd eval
bash eval.sh

The evaluation script currently includes:

  • BBH
  • GSM8K
  • MMLU
  • TruthfulQA
  • MBPP
  • HumanEval

๐Ÿ“– Citation

@article{li2026instruction,
  title={Instruction Data Selection via Answer Divergence},
  author={Li, Bo and Wang, Mingda and Zhang, Shikun and Ye, Wei},
  journal={arXiv preprint arXiv:2604.10448},
  year={2026}
}
