Okay, so. Apparently training a single adapter across the full set of domains (targeting the style-fitted SmolLM models) is still helping the lambada perplexity.

Some diversity loss with this one, though it's not catastrophic. Still might be an interesting contrast with the merge approach (though I'm not really set up to study that properly).

**Benchmark Results**

| Task | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| arc_easy | 1 | none | 0 | acc ↑ | 0.7883 | ± 0.0084 |
| arc_easy | 1 | none | 0 | acc_norm ↑ | 0.7601 | ± 0.0088 |
| lambada_openai | 1 | none | 0 | acc ↑ | 0.7044 | ± 0.0064 |
| lambada_openai | 1 | none | 0 | perplexity ↓ | 3.7642 | ± 0.0852 |
| openbookqa | 1 | none | 0 | acc ↑ | 0.3160 | ± 0.0208 |
| openbookqa | 1 | none | 0 | acc_norm ↑ | 0.4060 | ± 0.0220 |
| piqa | 1 | none | 0 | acc ↑ | 0.7807 | ± 0.0097 |
| piqa | 1 | none | 0 | acc_norm ↑ | 0.7791 | ± 0.0097 |

**Prefix Entropy (lower = more confident predictions)**

| Domain | Qwen3-4B-Base | Karcher | Karcher+Adapter |
|---|---|---|---|
| ao3_english | 3.309 | 3.238 | 2.988 |
| github_python | 1.514 | 1.456 | 1.407 |
| wikipedia_english | 1.974 | 1.892 | 1.807 |
| bbc_news | 2.252 | 2.186 | — |
| arxiv_cs | 2.455 | 2.346 | — |
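For reference, a minimal sketch of the kind of quantity this metric reports, assuming "prefix entropy" means the mean Shannon entropy of the model's next-token distribution over the sampled prefix positions (the exact definition used for the numbers above isn't stated here):

```python
import math

def mean_next_token_entropy(logit_rows):
    """Mean Shannon entropy (in nats) of next-token distributions.

    logit_rows: a list of per-position logit lists, shape (positions, vocab).
    Lower values mean the model concentrates probability mass on fewer
    tokens, i.e. more confident (and typically less diverse) predictions.
    """
    entropies = []
    for logits in logit_rows:
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
        total = sum(exps)
        probs = [e / total for e in exps]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return sum(entropies) / len(entropies)
```

A uniform distribution over V tokens gives the maximum value, log V; a one-hot distribution gives 0.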

**Generation Diversity (higher = more diverse)**

| Domain | Metric | Qwen3-4B-Base | Karcher | Karcher+Adapter |
|---|---|---|---|---|
| ao3_english | Distinct-1 | 0.547 | 0.612 | 0.575 |
| ao3_english | Distinct-2 | 0.947 | 0.963 | 0.940 |
| ao3_english | Pairwise div | 0.905 | 0.900 | 0.892 |
| github_python | Distinct-1 | 0.556 | 0.595 | 0.596 |
| github_python | Distinct-2 | 0.839 | 0.895 | 0.889 |
| github_python | Pairwise div | 0.930 | 0.933 | 0.940 |
| wikipedia_english | Distinct-1 | 0.567 | 0.585 | 0.540 |
| wikipedia_english | Distinct-2 | 0.913 | 0.929 | 0.914 |
| wikipedia_english | Pairwise div | 0.904 | 0.906 | 0.898 |

**Benchmark Comparison vs. Qwen3-4B-Base (Δ = relative change)**

| Task | Metric | Qwen3-4B-Base | GRPO-Merge | Δ Base | GRPO-Wave | Δ Base | Δ Merge | Style-Karcher | Δ Base | Δ Wave | Full-Adapter | Δ Base | Δ Karcher |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| arc_easy | acc | 0.7891 | 0.7870 | -0.27% | 0.7912 | +0.27% | +0.53% | 0.7883 | -0.10% | -0.37% | 0.7883 | -0.10% | ±0.00% |
| arc_easy | acc_norm | 0.7609 | 0.7605 | -0.05% | 0.7643 | +0.45% | +0.50% | 0.7576 | -0.43% | -1.04% | 0.7601 | -0.11% | +0.33% |
| lambada_openai | acc | 0.6912 | 0.6984 | +1.04% | 0.7006 | +1.36% | +0.31% | 0.7087 | +2.53% | +1.16% | 0.7044 | +1.91% | -0.61% |
| lambada_openai | perplexity ↓ | 4.2433 | 4.0490 | -4.58% | 3.9616 | -6.64% | -2.16% | 3.8343 | -9.63% | -3.21% | 3.7642 | -11.29% | -1.83% |
| openbookqa | acc | 0.3160 | 0.3180 | +0.63% | 0.3180 | +0.63% | ±0.00% | 0.3160 | ±0.00% | -0.63% | 0.3160 | ±0.00% | ±0.00% |
| openbookqa | acc_norm | 0.4100 | 0.4120 | +0.49% | 0.4100 | ±0.00% | -0.49% | 0.4080 | -0.49% | -0.49% | 0.4060 | -0.98% | -0.49% |
| piqa | acc | 0.7797 | 0.7807 | +0.13% | 0.7813 | +0.21% | +0.08% | 0.7786 | -0.14% | -0.35% | 0.7807 | +0.13% | +0.27% |
| piqa | acc_norm | 0.7807 | 0.7807 | ±0.00% | 0.7813 | +0.08% | +0.08% | 0.7807 | ±0.00% | -0.08% | 0.7791 | -0.20% | -0.20% |
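The Δ columns are plain relative changes against the column they name. A one-liner to reproduce them (function name is mine, not from any library):

```python
def pct_delta(new, base):
    """Relative change vs. a baseline value, in percent.

    Negative is an improvement for perplexity (lower is better)
    and a regression for accuracy (higher is better).
    """
    return (new - base) / base * 100.0
```

For example, the Full-Adapter lambada perplexity: `pct_delta(3.7642, 4.2433)` gives about -11.29%, matching the table.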

**Diversity Metrics (temperature=1.0, 8 completions per prompt)**

| Domain | Metric | Base | Karcher | Δ Base | Full-Adapter | Δ Base | Δ Karcher |
|---|---|---|---|---|---|---|---|
| ao3_english | Prefix entropy | 3.309 | 3.238 | -2.1% | 2.988 | -9.7% | -7.7% |
| ao3_english | Distinct-1 | 0.618 | 0.683 | +10.5% | 0.575 | -7.0% | -15.8% |
| ao3_english | Distinct-2 | 0.962 | 0.984 | +2.3% | 0.940 | -2.3% | -4.5% |
| ao3_english | Pairwise div | 0.919 | 0.932 | +1.4% | 0.892 | -2.9% | -4.3% |
| github_python | Prefix entropy | 1.514 | 1.456 | -3.8% | 1.407 | -7.1% | -3.4% |
| github_python | Distinct-1 | 0.610 | 0.624 | +2.3% | 0.596 | -2.3% | -4.5% |
| github_python | Distinct-2 | 0.890 | 0.876 | -1.6% | 0.889 | -0.1% | +1.5% |
| github_python | Pairwise div | 0.933 | 0.933 | ±0.0% | 0.940 | +0.8% | +0.8% |
| wikipedia_english | Prefix entropy | 1.974 | 1.892 | -4.2% | 1.807 | -8.5% | -4.5% |
| wikipedia_english | Distinct-1 | 0.599 | 0.559 | -6.7% | 0.540 | -9.8% | -3.4% |
| wikipedia_english | Distinct-2 | 0.932 | 0.898 | -3.6% | 0.914 | -1.9% | +1.8% |
| wikipedia_english | Pairwise div | 0.907 | 0.900 | -0.8% | 0.898 | -1.0% | -0.2% |
| bbc_news | Prefix entropy | 2.252 | 2.186 | -2.9% | — | — | — |
| bbc_news | Distinct-1 | 0.557 | 0.577 | +3.6% | — | — | — |
| bbc_news | Distinct-2 | 0.949 | 0.951 | +0.3% | — | — | — |
| bbc_news | Pairwise div | 0.901 | 0.908 | +0.8% | — | — | — |
| arxiv_cs | Prefix entropy | 2.455 | 2.346 | -4.4% | — | — | — |
| arxiv_cs | Distinct-1 | 0.555 | 0.567 | +2.3% | — | — | — |
| arxiv_cs | Distinct-2 | 0.905 | 0.906 | +0.2% | — | — | — |
| arxiv_cs | Pairwise div | 0.895 | 0.901 | +0.7% | — | — | — |
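For concreteness, a sketch of how these diversity metrics can be computed, assuming Distinct-n is the unique-to-total n-gram ratio pooled over all completions and pairwise diversity is the mean (1 − Jaccard similarity of unigram sets) over completion pairs. The tokenization and similarity measure behind the numbers above may differ:

```python
from itertools import combinations

def distinct_n(completions, n):
    """Unique-to-total n-gram ratio, pooled over all completions.

    Higher means fewer repeated n-grams across the sampled generations.
    """
    ngrams = []
    for text in completions:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def pairwise_diversity(completions):
    """Mean (1 - Jaccard similarity of unigram sets) over all pairs.

    Higher means the completions for a prompt differ more from each other.
    """
    dissims = []
    for a, b in combinations(completions, 2):
        sa, sb = set(a.split()), set(b.split())
        union = sa | sb
        dissims.append(1.0 - len(sa & sb) / len(union) if union else 0.0)
    return sum(dissims) / len(dissims) if dissims else 0.0
```

For example, `distinct_n(["a b c", "a b d"], 1)` pools six unigrams with four unique, giving 4/6.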

## Training procedure


This model was trained with GRPO, a method introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
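The core of GRPO, as introduced in that paper, is replacing a learned value-function baseline with group-relative reward normalization: each of the G completions sampled for a prompt gets an advantage equal to its reward standardized against the group's mean and standard deviation. A minimal sketch of that step (variable names mine):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO's group-relative advantage estimate.

    A_i = (r_i - mean(r)) / (std(r) + eps), where r contains the rewards
    of all completions sampled for the same prompt. eps guards against
    division by zero when all rewards in the group are identical.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Completions better than the group average get positive advantages and are reinforced; worse-than-average ones are pushed down, with no critic network needed.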

## Framework versions

- PEFT: 0.18.0
- TRL: 0.24.0
- Transformers: 4.57.3
- PyTorch: 2.9.1
- Datasets: 4.3.0
- Tokenizers: 0.22.1

## Citations

Cite GRPO as:

@article{shao2024deepseekmath,
    title        = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author       = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year         = 2024,
    eprint       = {arXiv:2402.03300},
}

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}