Okay, so. Apparently training a single adapter across the full set of domains, targeting the style-fitted SmolLM models, still helps LAMBADA perplexity.
Some diversity loss with this one, though it's not catastrophic. Still might be an interesting contrast with the merge approach (I'm not really set up to study this properly).
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| arc_easy | 1 | none | 0 | acc | ↑ | 0.7883 | ± | 0.0084 |
| | | none | 0 | acc_norm | ↑ | 0.7601 | ± | 0.0088 |
| lambada_openai | 1 | none | 0 | acc | ↑ | 0.7044 | ± | 0.0064 |
| | | none | 0 | perplexity | ↓ | 3.7642 | ± | 0.0852 |
| openbookqa | 1 | none | 0 | acc | ↑ | 0.3160 | ± | 0.0208 |
| | | none | 0 | acc_norm | ↑ | 0.4060 | ± | 0.0220 |
| piqa | 1 | none | 0 | acc | ↑ | 0.7807 | ± | 0.0097 |
| | | none | 0 | acc_norm | ↑ | 0.7791 | ± | 0.0097 |
**Prefix Entropy (lower = more confident predictions)**

| Domain | Qwen3-4B-Base | Karcher | Karcher+Adapter |
|---|---|---|---|
| ao3_english | 3.309 | 3.238 | 2.988 |
| github_python | 1.514 | 1.456 | 1.407 |
| wikipedia_english | 1.974 | 1.892 | 1.807 |
| bbc_news | 2.252 | 2.186 | – |
| arxiv_cs | 2.455 | 2.346 | – |
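For reference, prefix entropy here is a mean per-token Shannon entropy of the model's next-token distribution over the generated prefix. A minimal sketch of that computation, assuming per-step logits are already in hand (the function names and plain-list representation are mine, not taken from the eval script):

```python
import math

def token_entropy(logits):
    """Shannon entropy (nats) of the softmax distribution for one step's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def prefix_entropy(step_logits):
    """Mean per-token entropy over the generated prefix (list of per-step logits)."""
    return sum(token_entropy(l) for l in step_logits) / len(step_logits)
```

A uniform distribution over V tokens gives the maximum entropy log(V); a sharply peaked distribution (a confident prediction) drives the value toward 0, which is why lower numbers in the table read as more confident models.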
**Generation Diversity (higher = more diverse)**

| Domain | Metric | Qwen3-4B-Base | Karcher | Karcher+Adapter |
|---|---|---|---|---|
| ao3_english | Distinct-1 | 0.547 | 0.612 | 0.575 |
| | Distinct-2 | 0.947 | 0.963 | 0.940 |
| | Pairwise div | 0.905 | 0.900 | 0.892 |
| github_python | Distinct-1 | 0.556 | 0.595 | 0.596 |
| | Distinct-2 | 0.839 | 0.895 | 0.889 |
| | Pairwise div | 0.930 | 0.933 | 0.940 |
| wikipedia_english | Distinct-1 | 0.567 | 0.585 | 0.540 |
| | Distinct-2 | 0.913 | 0.929 | 0.914 |
| | Pairwise div | 0.904 | 0.906 | 0.898 |
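Distinct-n is the standard ratio of unique n-grams to total n-grams in a generation, and pairwise diversity is one minus the mean pairwise overlap across the sampled completions. A hedged sketch of both (whitespace tokenization and Jaccard overlap are my assumptions; the actual eval may tokenize and measure overlap differently):

```python
from itertools import combinations

def distinct_n(text, n):
    """Unique n-grams / total n-grams over a whitespace tokenization."""
    toks = text.split()
    ngrams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def pairwise_diversity(completions):
    """1 - mean Jaccard similarity of unigram sets over all completion pairs."""
    sims = []
    for a, b in combinations(completions, 2):
        sa, sb = set(a.split()), set(b.split())
        sims.append(len(sa & sb) / len(sa | sb) if sa | sb else 1.0)
    return 1.0 - sum(sims) / len(sims)
```

Repetitive generations drag Distinct-n down, while near-identical completions across the group drag pairwise diversity down, so the two catch different failure modes (within-sample vs. across-sample collapse).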
| Task | Metric | Qwen3-4B-Base | GRPO-Merge | Δ Base | GRPO-Wave | Δ Base | Δ Merge | Style-Karcher | Δ Base | Δ Wave | Full-Adapter | Δ Base | Δ Karcher |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| arc_easy | acc | 0.7891 | 0.7870 | -0.27% | 0.7912 | +0.27% | +0.53% | 0.7883 | -0.10% | -0.37% | 0.7883 | -0.10% | ±0.00% |
| arc_easy | acc_norm | 0.7609 | 0.7605 | -0.05% | 0.7643 | +0.45% | +0.50% | 0.7576 | -0.43% | -1.04% | 0.7601 | -0.11% | +0.33% |
| lambada_openai | acc | 0.6912 | 0.6984 | +1.04% | 0.7006 | +1.36% | +0.31% | 0.7087 | +2.53% | +1.16% | 0.7044 | +1.91% | -0.61% |
| lambada_openai | perplexity ↓ | 4.2433 | 4.0490 | -4.58% | 3.9616 | -6.64% | -2.16% | 3.8343 | -9.63% | -3.21% | 3.7642 | -11.29% | -1.83% |
| openbookqa | acc | 0.3160 | 0.3180 | +0.63% | 0.3180 | +0.63% | ±0.00% | 0.3160 | ±0.00% | -0.63% | 0.3160 | ±0.00% | ±0.00% |
| openbookqa | acc_norm | 0.4100 | 0.4120 | +0.49% | 0.4100 | ±0.00% | -0.49% | 0.4080 | -0.49% | -0.49% | 0.4060 | -0.98% | -0.49% |
| piqa | acc | 0.7797 | 0.7807 | +0.13% | 0.7813 | +0.21% | +0.08% | 0.7786 | -0.14% | -0.35% | 0.7807 | +0.13% | +0.27% |
| piqa | acc_norm | 0.7807 | 0.7807 | ±0.00% | 0.7813 | +0.08% | +0.08% | 0.7807 | ±0.00% | -0.08% | 0.7791 | -0.20% | -0.20% |
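The Δ columns are plain relative changes against the named reference column, i.e. (new − ref) / ref as a percentage; for perplexity a negative Δ is the improvement. A quick sketch of how those percentages fall out, using values from the table above:

```python
def rel_delta(new, ref):
    """Relative change vs. a reference value, as a signed percentage."""
    return 100.0 * (new - ref) / ref

# lambada_openai perplexity, Full-Adapter vs. Qwen3-4B-Base
print(f"{rel_delta(3.7642, 4.2433):+.2f}%")  # -11.29% (lower perplexity => negative delta)
```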
**Diversity Metrics (temperature=1.0, 8 completions per prompt)**

| Domain | Metric | Base | Karcher | Δ Base | Full-Adapter | Δ Base | Δ Karcher |
|---|---|---|---|---|---|---|---|
| ao3_english | Prefix entropy | 3.309 | 3.238 | -2.1% | 2.988 | -9.7% | -7.7% |
| ao3_english | Distinct-1 | 0.618 | 0.683 | +10.5% | 0.575 | -7.0% | -15.8% |
| ao3_english | Distinct-2 | 0.962 | 0.984 | +2.3% | 0.940 | -2.3% | -4.5% |
| ao3_english | Pairwise div | 0.919 | 0.932 | +1.4% | 0.892 | -2.9% | -4.3% |
| github_python | Prefix entropy | 1.514 | 1.456 | -3.8% | 1.407 | -7.1% | -3.4% |
| github_python | Distinct-1 | 0.610 | 0.624 | +2.3% | 0.596 | -2.3% | -4.5% |
| github_python | Distinct-2 | 0.890 | 0.876 | -1.6% | 0.889 | -0.1% | +1.5% |
| github_python | Pairwise div | 0.933 | 0.933 | ±0.0% | 0.940 | +0.8% | +0.8% |
| wikipedia_english | Prefix entropy | 1.974 | 1.892 | -4.2% | 1.807 | -8.5% | -4.5% |
| wikipedia_english | Distinct-1 | 0.599 | 0.559 | -6.7% | 0.540 | -9.8% | -3.4% |
| wikipedia_english | Distinct-2 | 0.932 | 0.898 | -3.6% | 0.914 | -1.9% | +1.8% |
| wikipedia_english | Pairwise div | 0.907 | 0.900 | -0.8% | 0.898 | -1.0% | -0.2% |
| bbc_news | Prefix entropy | 2.252 | 2.186 | -2.9% | – | – | – |
| bbc_news | Distinct-1 | 0.557 | 0.577 | +3.6% | – | – | – |
| bbc_news | Distinct-2 | 0.949 | 0.951 | +0.3% | – | – | – |
| bbc_news | Pairwise div | 0.901 | 0.908 | +0.8% | – | – | – |
| arxiv_cs | Prefix entropy | 2.455 | 2.346 | -4.4% | – | – | – |
| arxiv_cs | Distinct-1 | 0.555 | 0.567 | +2.3% | – | – | – |
| arxiv_cs | Distinct-2 | 0.905 | 0.906 | +0.2% | – | – | – |
| arxiv_cs | Pairwise div | 0.895 | 0.901 | +0.7% | – | – | – |
## Training procedure

This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300).
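The distinguishing step in GRPO is that advantages are computed group-relative: several completions are sampled per prompt, each is scored by the reward function, and rewards are normalized within the group rather than against a learned value network. A minimal sketch of that normalization, following the DeepSeekMath formulation (the reward values in the usage line are placeholders):

```python
def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: (r - mean) / std within one prompt's group."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

# e.g. two sampled completions, rewards 1.0 and 0.0:
advs = grpo_advantages([1.0, 0.0])  # the better sample gets +1, the worse -1
```

In TRL this is handled inside `GRPOTrainer`; the sketch is only to make the "group-relative" part of the name concrete.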
### Framework versions

- PEFT: 0.18.0
- TRL: 0.24.0
- Transformers: 4.57.3
- Pytorch: 2.9.1
- Datasets: 4.3.0
- Tokenizers: 0.22.1
## Citations
Cite GRPO as:

```bibtex
@article{shao2024deepseekmath,
    title        = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author       = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year         = 2024,
    eprint       = {arXiv:2402.03300},
}
```
Cite TRL as:

```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```
Model tree for Lambent/Qwen3-4B-Base-Continued-GRPO-Style-Full-Adapter:

- Base model: Lambent/Qwen3-4B-Base-Continued-GRPO-Wave