Okay, so. Apparently training a single adapter across the full set of domains (targeting the style-fitted SmolLM models) is still helping the lambada perplexity.

Some diversity loss with this one, though it's not catastrophic. Still might be an interesting contrast with the merge approach (though I'm not really set up to study that properly).

**Benchmark Results**

| Task | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| arc_easy | 1 | none | 0 | acc ↑ | 0.7883 | ± 0.0084 |
| arc_easy | 1 | none | 0 | acc_norm ↑ | 0.7601 | ± 0.0088 |
| lambada_openai | 1 | none | 0 | acc ↑ | 0.7044 | ± 0.0064 |
| lambada_openai | 1 | none | 0 | perplexity ↓ | 3.7642 | ± 0.0852 |
| openbookqa | 1 | none | 0 | acc ↑ | 0.3160 | ± 0.0208 |
| openbookqa | 1 | none | 0 | acc_norm ↑ | 0.4060 | ± 0.0220 |
| piqa | 1 | none | 0 | acc ↑ | 0.7807 | ± 0.0097 |
| piqa | 1 | none | 0 | acc_norm ↑ | 0.7791 | ± 0.0097 |

**Prefix Entropy (lower = more confident predictions)**

| Domain | Qwen3-4B-Base | Karcher | Karcher+Adapter |
|---|---|---|---|
| ao3_english | 3.309 | 3.238 | 2.988 |
| github_python | 1.514 | 1.456 | 1.407 |
| wikipedia_english | 1.974 | 1.892 | 1.807 |
| bbc_news | 2.252 | 2.186 | — |
| arxiv_cs | 2.455 | 2.346 | — |
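For reference, a minimal sketch of the kind of quantity this metric reports, assuming "prefix entropy" means the mean Shannon entropy of the model's next-token distribution over the sampled prefix positions (the exact definition used for the numbers above isn't stated here):

```python
import math

def mean_next_token_entropy(logit_rows):
    """Mean Shannon entropy (in nats) of next-token distributions.

    logit_rows: a list of per-position logit lists, shape (positions, vocab).
    Lower values mean the model concentrates probability mass on fewer
    tokens, i.e. more confident (and typically less diverse) predictions.
    """
    entropies = []
    for logits in logit_rows:
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
        total = sum(exps)
        probs = [e / total for e in exps]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return sum(entropies) / len(entropies)
```

A uniform distribution over V tokens gives the maximum value, log V; a one-hot distribution gives 0.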

**Generation Diversity (higher = more diverse)**

| Domain | Metric | Qwen3-4B-Base | Karcher | Karcher+Adapter |
|---|---|---|---|---|
| ao3_english | Distinct-1 | 0.547 | 0.612 | 0.575 |
| ao3_english | Distinct-2 | 0.947 | 0.963 | 0.940 |
| ao3_english | Pairwise div | 0.905 | 0.900 | 0.892 |
| github_python | Distinct-1 | 0.556 | 0.595 | 0.596 |
| github_python | Distinct-2 | 0.839 | 0.895 | 0.889 |
| github_python | Pairwise div | 0.930 | 0.933 | 0.940 |
| wikipedia_english | Distinct-1 | 0.567 | 0.585 | 0.540 |
| wikipedia_english | Distinct-2 | 0.913 | 0.929 | 0.914 |
| wikipedia_english | Pairwise div | 0.904 | 0.906 | 0.898 |

**Benchmark Comparison vs. Qwen3-4B-Base (Δ = relative change)**

| Task | Metric | Qwen3-4B-Base | GRPO-Merge | Δ Base | GRPO-Wave | Δ Base | Δ Merge | Style-Karcher | Δ Base | Δ Wave | Full-Adapter | Δ Base | Δ Karcher |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| arc_easy | acc | 0.7891 | 0.7870 | -0.27% | 0.7912 | +0.27% | +0.53% | 0.7883 | -0.10% | -0.37% | 0.7883 | -0.10% | ±0.00% |
| arc_easy | acc_norm | 0.7609 | 0.7605 | -0.05% | 0.7643 | +0.45% | +0.50% | 0.7576 | -0.43% | -1.04% | 0.7601 | -0.11% | +0.33% |
| lambada_openai | acc | 0.6912 | 0.6984 | +1.04% | 0.7006 | +1.36% | +0.31% | 0.7087 | +2.53% | +1.16% | 0.7044 | +1.91% | -0.61% |
| lambada_openai | perplexity ↓ | 4.2433 | 4.0490 | -4.58% | 3.9616 | -6.64% | -2.16% | 3.8343 | -9.63% | -3.21% | 3.7642 | -11.29% | -1.83% |
| openbookqa | acc | 0.3160 | 0.3180 | +0.63% | 0.3180 | +0.63% | ±0.00% | 0.3160 | ±0.00% | -0.63% | 0.3160 | ±0.00% | ±0.00% |
| openbookqa | acc_norm | 0.4100 | 0.4120 | +0.49% | 0.4100 | ±0.00% | -0.49% | 0.4080 | -0.49% | -0.49% | 0.4060 | -0.98% | -0.49% |
| piqa | acc | 0.7797 | 0.7807 | +0.13% | 0.7813 | +0.21% | +0.08% | 0.7786 | -0.14% | -0.35% | 0.7807 | +0.13% | +0.27% |
| piqa | acc_norm | 0.7807 | 0.7807 | ±0.00% | 0.7813 | +0.08% | +0.08% | 0.7807 | ±0.00% | -0.08% | 0.7791 | -0.20% | -0.20% |
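The Δ columns are plain relative changes against the column they name. A one-liner to reproduce them (function name is mine, not from any library):

```python
def pct_delta(new, base):
    """Relative change vs. a baseline value, in percent.

    Negative is an improvement for perplexity (lower is better)
    and a regression for accuracy (higher is better).
    """
    return (new - base) / base * 100.0
```

For example, the Full-Adapter lambada perplexity: `pct_delta(3.7642, 4.2433)` gives about -11.29%, matching the table.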

**Diversity Metrics (temperature=1.0, 8 completions per prompt)**

| Domain | Metric | Base | Karcher | Δ Base | Full-Adapter | Δ Base | Δ Karcher |
|---|---|---|---|---|---|---|---|
| ao3_english | Prefix entropy | 3.309 | 3.238 | -2.1% | 2.988 | -9.7% | -7.7% |
| ao3_english | Distinct-1 | 0.618 | 0.683 | +10.5% | 0.575 | -7.0% | -15.8% |
| ao3_english | Distinct-2 | 0.962 | 0.984 | +2.3% | 0.940 | -2.3% | -4.5% |
| ao3_english | Pairwise div | 0.919 | 0.932 | +1.4% | 0.892 | -2.9% | -4.3% |
| github_python | Prefix entropy | 1.514 | 1.456 | -3.8% | 1.407 | -7.1% | -3.4% |
| github_python | Distinct-1 | 0.610 | 0.624 | +2.3% | 0.596 | -2.3% | -4.5% |
| github_python | Distinct-2 | 0.890 | 0.876 | -1.6% | 0.889 | -0.1% | +1.5% |
| github_python | Pairwise div | 0.933 | 0.933 | ±0.0% | 0.940 | +0.8% | +0.8% |
| wikipedia_english | Prefix entropy | 1.974 | 1.892 | -4.2% | 1.807 | -8.5% | -4.5% |
| wikipedia_english | Distinct-1 | 0.599 | 0.559 | -6.7% | 0.540 | -9.8% | -3.4% |
| wikipedia_english | Distinct-2 | 0.932 | 0.898 | -3.6% | 0.914 | -1.9% | +1.8% |
| wikipedia_english | Pairwise div | 0.907 | 0.900 | -0.8% | 0.898 | -1.0% | -0.2% |
| bbc_news | Prefix entropy | 2.252 | 2.186 | -2.9% | — | — | — |
| bbc_news | Distinct-1 | 0.557 | 0.577 | +3.6% | — | — | — |
| bbc_news | Distinct-2 | 0.949 | 0.951 | +0.3% | — | — | — |
| bbc_news | Pairwise div | 0.901 | 0.908 | +0.8% | — | — | — |
| arxiv_cs | Prefix entropy | 2.455 | 2.346 | -4.4% | — | — | — |
| arxiv_cs | Distinct-1 | 0.555 | 0.567 | +2.3% | — | — | — |
| arxiv_cs | Distinct-2 | 0.905 | 0.906 | +0.2% | — | — | — |
| arxiv_cs | Pairwise div | 0.895 | 0.901 | +0.7% | — | — | — |
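For concreteness, a sketch of how these diversity metrics can be computed, assuming Distinct-n is the unique-to-total n-gram ratio pooled over all completions and pairwise diversity is the mean (1 − Jaccard similarity of unigram sets) over completion pairs. The tokenization and similarity measure behind the numbers above may differ:

```python
from itertools import combinations

def distinct_n(completions, n):
    """Unique-to-total n-gram ratio, pooled over all completions.

    Higher means fewer repeated n-grams across the sampled generations.
    """
    ngrams = []
    for text in completions:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def pairwise_diversity(completions):
    """Mean (1 - Jaccard similarity of unigram sets) over all pairs.

    Higher means the completions for a prompt differ more from each other.
    """
    dissims = []
    for a, b in combinations(completions, 2):
        sa, sb = set(a.split()), set(b.split())
        union = sa | sb
        dissims.append(1.0 - len(sa & sb) / len(union) if union else 0.0)
    return sum(dissims) / len(dissims) if dissims else 0.0
```

For example, `distinct_n(["a b c", "a b d"], 1)` pools six unigrams with four unique, giving 4/6.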

## Training procedure


This model was trained with GRPO, a method introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
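The core of GRPO, as introduced in that paper, is replacing a learned value-function baseline with group-relative reward normalization: each of the G completions sampled for a prompt gets an advantage equal to its reward standardized against the group's mean and standard deviation. A minimal sketch of that step (variable names mine):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO's group-relative advantage estimate.

    A_i = (r_i - mean(r)) / (std(r) + eps), where r contains the rewards
    of all completions sampled for the same prompt. eps guards against
    division by zero when all rewards in the group are identical.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Completions better than the group average get positive advantages and are reinforced; worse-than-average ones are pushed down, with no critic network needed.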

## Framework versions

- PEFT: 0.18.0
- TRL: 0.24.0
- Transformers: 4.57.3
- PyTorch: 2.9.1
- Datasets: 4.3.0
- Tokenizers: 0.22.1

## Citations

Cite GRPO as:

@article{shao2024deepseekmath,
    title        = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author       = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year         = 2024,
    eprint       = {arXiv:2402.03300},
}

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}