Continual Internalization
This repository contains a LoRA adapter trained with the DiSC pipeline for continual internalization experiments in Knowledge Horizon.
The adapter was trained on a 200-token-budget version of the easy QA training set and evaluated on easy train/test and hard questions.
Artifacts in this repository:

- adapter_model.safetensors (LoRA weights)
- adapter_config.json
- Tokenizer files (tokenizer.json, tokenizer_config.json, chat_template.jinja)
- Checkpoints: checkpoint-400, checkpoint-493 (end of epoch)

Run details:

- Base model: Qwen/Qwen3-30B-A3B-Instruct-2507
- Job: 6537205 (kh_disc_30b_a2)
- Date: 2026-04-04 (America/New_York)
- Hardware: ailab1 node, 2x NVIDIA H200
- Wall time: 00:25:52
- Output directory: checkpoints/disc_lora__qwen3-30b-a3b__knowledge-horizon__6537205

The stage-3 training input came from the following chain:
- data/easy_qa_200tok_train.jsonl: 1098 rows
- `python prepare_training_data.py --input data/easy_qa_200tok_train.jsonl --output data/training_data.parquet`
- training_data.parquet: 1098 rows (197 unique articles)
- stage1_splits.parquet: 985 rows
- stage2_scored.parquet: 985 rows

Hard QA files are used for evaluation, not for training.
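prepare_training_data.py is not reproduced in this card; as a rough sketch (the field names are assumptions), the JSONL side of that conversion amounts to reading one JSON object per line:

```python
import json

def load_jsonl(path):
    """Load QA training rows from a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# rows = load_jsonl("data/easy_qa_200tok_train.jsonl")  # expected: 1098 rows
```

The real script additionally writes the rows out as Parquet (data/training_data.parquet).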
Stage 1 (split preparation) was executed with:
```shell
python disc_stage1_prepare.py \
  --input data/training_data.parquet \
  --output runs/disc_qwen3_30b_ailab2_6537205/stage1_splits.parquet \
  --k_splits 5 \
  --min_sentences 3 \
  --dedupe_by_article_text
```
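The stage-1 split choice (k-1 random interior split points plus the final sentence endpoint, per the defaults below) can be sketched as follows; this is a hypothetical reimplementation, not the code in disc_stage1_prepare.py:

```python
import random

def choose_split_points(num_sentences, k_splits, seed=42):
    """Pick k-1 random interior sentence boundaries plus the final endpoint.

    Returns sorted 1-based boundary indices. Articles with fewer than
    --min_sentences sentences would be filtered out before this step.
    """
    rng = random.Random(seed)
    interior = list(range(1, num_sentences))                   # boundaries 1 .. n-1
    picked = rng.sample(interior, min(k_splits - 1, len(interior)))
    return sorted(set(picked) | {num_sentences})               # always include the end

points = choose_split_points(8, 5)  # an 8-sentence article, k_splits=5
```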
Important defaults:
- seed: 42
- split points: k-1 random interior split points plus the final sentence endpoint

Stage 2 (teacher scoring) was executed with:
```shell
python disc_stage2_score.py \
  --model models/Qwen3-30B-A3B-Instruct-2507 \
  --input runs/disc_qwen3_30b_ailab2_6537205/stage1_splits.parquet \
  --output runs/disc_qwen3_30b_ailab2_6537205/stage2_scored.parquet \
  --tp 2 \
  --max_model_len 4096 \
  --max_num_batched_tokens 2048 \
  --max_num_seqs 2 \
  --gpu_memory_utilization 0.88 \
  --disable_custom_all_reduce true \
  --enforce_eager true \
  --top_k 128 \
  --max_suffix_tokens 256 \
  --batch_size 1
```
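Conceptually, stage 2 stores a truncated teacher distribution per suffix position (cf. `--top_k 128`). A numpy sketch of that truncation, not the actual vLLM scoring path used by disc_stage2_score.py:

```python
import numpy as np

def topk_teacher_logprobs(logits, k=128):
    """Reduce full-vocab logits [seq, vocab] to the top-k log-probs per position.

    Returns (indices, logprobs), each [seq, k], sorted by descending probability,
    which is all the student needs for distillation in stage 3.
    """
    z = logits - logits.max(axis=-1, keepdims=True)            # stable log-softmax
    logprobs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    idx = np.argpartition(-logprobs, k - 1, axis=-1)[..., :k]  # unordered top-k
    vals = np.take_along_axis(logprobs, idx, axis=-1)
    order = np.argsort(-vals, axis=-1)                         # sort descending
    return np.take_along_axis(idx, order, axis=-1), np.take_along_axis(vals, order, axis=-1)
```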
Stage 3 (LoRA distillation training) was executed with:
```shell
torchrun \
  --nproc-per-node 2 \
  --master_port <job_specific_port> \
  disc_stage3_train.py \
  --model_name models/Qwen3-30B-A3B-Instruct-2507 \
  --train_file runs/disc_qwen3_30b_ailab2_6537205/stage2_scored.parquet \
  --output_dir checkpoints/disc_lora__qwen3-30b-a3b__knowledge-horizon__6537205 \
  --fsdp_config configs/fsdp_config_qwen3_moe.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.1 \
  --lora_target_modules all-linear \
  --learning_rate 1.5e-5 \
  --weight_decay 0.01 \
  --adam_beta1 0.9 \
  --adam_beta2 0.999 \
  --adam_epsilon 1e-8 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 1 \
  --warmup_ratio 0.0 \
  --lr_scheduler_type linear \
  --precision bf16 \
  --temperature 2.0 \
  --save_steps 200 \
  --report_to none \
  --resume_from_checkpoint latest
```
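The stage-3 objective is distillation against the stored teacher distributions. A minimal numpy sketch of a temperature-scaled soft cross-entropy, assuming that is what `--temperature 2.0` controls (disc_stage3_train.py is authoritative):

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft cross-entropy between temperature-softened teacher and student
    distributions, scaled by T^2 to keep gradient magnitudes comparable."""
    T = temperature
    p_teacher = np.exp(log_softmax(teacher_logits / T))
    return float(-(p_teacher * log_softmax(student_logits / T)).sum(axis=-1).mean() * T * T)
```

The loss is minimized when the student matches the teacher, in which case it reduces to T² times the teacher's entropy.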
FSDP config (configs/fsdp_config_qwen3_moe.json):
```json
{
  "transformer_layer_cls_to_wrap": "Qwen3MoeDecoderLayer",
  "use_orig_params": true,
  "sync_module_states": true,
  "activation_checkpointing": false,
  "limit_all_gathers": true
}
```
Training results:

- Training examples: 985
- Trainable parameters: 13,369,344 / 30,545,491,968 (0.0438%)
- train_runtime: 816.4s
- train_steps: 493
- train_steps_per_second: 0.604
- train_loss: 0.5464
- epoch: 1.0

Evaluation used:
- Model: models/Qwen3-30B-A3B-Instruct-2507 (with and without the adapter)
- Easy train: 1098 questions
- Easy test: 1098 questions
- Hard: 248 questions
- Scoring script: evaluate.py
| Split | N | No-training baseline | DiSC adapter |
|---|---|---|---|
| Easy Train | 1098 | S 217 (19.8%), IDK 724 (65.9%), O 157 (14.3%) | S 267 (24.3%), IDK 616 (56.1%), O 215 (19.6%) |
| Easy Test | 1098 | S 215 (19.6%), IDK 758 (69.0%), O 125 (11.4%) | S 277 (25.2%), IDK 648 (59.0%), O 173 (15.8%) |
| Hard (v2 aggregate) | 248 | S 2 (0.8%), IDK 245 (98.8%), O 1 (0.4%) | S 7 (2.8%), IDK 236 (95.2%), O 5 (2.0%) |
S = strong match, IDK = explicit "I don't know", O = other.
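The S/IDK/O buckets come from evaluate.py, which is not reproduced here; a hypothetical string-matching sketch of the bucketing (the function name and matching rules are assumptions):

```python
def classify_answer(answer, gold):
    """Bucket a model answer: S (strong match), IDK (explicit refusal), O (other)."""
    a = answer.strip().lower()
    if "i don't know" in a or "i do not know" in a:
        return "IDK"
    if gold.strip().lower() in a:  # gold answer appears verbatim in the response
        return "S"
    return "O"
```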
Pipeline files:

- prepare_training_data.py
- disc_stage1_prepare.py
- disc_stage2_score.py
- disc_stage3_train.py
- configs/fsdp_config_qwen3_moe.json
- slurm/train_disc_lora_qwen3_30b_ailab_2gpu.sh
- git rev-parse HEAD = c19506c82f4aed88daba20fabe21d8f0f75b25d6

Observed environment in this workspace:
- Python 3.10
- torch==2.9.0+cu128
- transformers==5.5.0.dev0
- peft==0.17.1
- datasets==4.3.0
- vllm==0.12.0

To load the adapter with PEFT:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen3-30B-A3B-Instruct-2507"
adapter_id = "<this-repo>"

tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()
```
Use vLLM with LoRA enabled and this adapter as the LoRA path.
Scoring is done by evaluate.py; treat scores as directional.

If you use this adapter, please cite:
```bibtex
@article{padmanabhan2026updating,
  title={Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities},
  year={2026},
  eprint={2602.16093},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@article{shenfeld2026self,
  title={Self-Distillation Enables Continual Learning},
  year={2026},
  eprint={2601.19897},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```