LLucass committed (verified)
Commit 439dfa7 · Parent(s): 453a69b

Model save
README.md ADDED
@@ -0,0 +1,68 @@
+ ---
+ base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
+ library_name: transformers
+ model_name: GRPO
+ tags:
+ - generated_from_trainer
+ - trl
+ - grpo
+ licence: license
+ ---
+
+ # Model Card for GRPO
+
+ This model is a fine-tuned version of [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).
+ It has been trained using [TRL](https://github.com/huggingface/trl).
+
+ ## Quick start
+
+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ # device="cuda" assumes a CUDA-capable GPU; use device="cpu" (or omit it) otherwise
+ generator = pipeline("text-generation", model="LLucass/GRPO", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
+
+ ## Training procedure
+
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/lavatorywang-nus/openr1/runs/h7vq56gq)
+
+ This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
+
+ ### Framework versions
+
+ - TRL: 0.18.0
+ - Transformers: 4.50.0
+ - Pytorch: 2.5.1
+ - Datasets: 3.6.0
+ - Tokenizers: 0.21.1
+
+ ## Citations
+
+ Cite GRPO as:
+
+ ```bibtex
+ @article{zhihong2024deepseekmath,
+ title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
+ author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
+ year = 2024,
+ eprint = {arXiv:2402.03300},
+ }
+ ```
+
+ Cite TRL as:
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+ title = {{TRL: Transformer Reinforcement Learning}},
+ author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
+ year = 2020,
+ journal = {GitHub repository},
+ publisher = {GitHub},
+ howpublished = {\url{https://github.com/huggingface/trl}}
+ }
+ ```
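GRPO scores each prompt's sampled completions relative to the rest of their group: the group's rewards are normalized to zero mean and unit standard deviation, and those group-relative scores serve as advantages in place of a learned value baseline. A minimal pure-Python sketch of that normalization step (illustrative only, not the TRL implementation; the `eps` guard is an assumption to handle zero-variance groups):

```python
def group_relative_advantages(rewards, eps=1e-4):
    """Normalize one group of completion rewards to zero mean / unit std.

    GRPO uses these group-relative scores as advantages, so completions
    that beat their group's mean are reinforced and the rest are penalized.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 samples where one completion earned both the accuracy
# and format rewards and the others earned only the format reward:
print(group_relative_advantages([1.5, 0.5, 0.5, 0.5]))
```

Only the first completion gets a positive advantage; the advantages sum to (approximately) zero by construction.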
all_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "total_flos": 0.0,
+ "train_loss": 0.00011944215420650531,
+ "train_runtime": 12092.6435,
+ "train_samples": 7000,
+ "train_samples_per_second": 1.191,
+ "train_steps_per_second": 0.012
+ }
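The reported throughput fields can be cross-checked against each other. Using the `global_step` of 150 recorded in `trainer_state.json` below, the steps-per-second figure follows directly from the runtime (the sample rate of 1.191/s over the same runtime suggests roughly 96 completions per optimizer step, though the exact batch layout is not stated in these files):

```python
# Values taken from all_results.json and trainer_state.json above
train_runtime = 12092.6435  # seconds
global_step = 150           # optimizer steps completed

steps_per_second = global_step / train_runtime
print(round(steps_per_second, 3))  # → 0.012, matching train_steps_per_second
```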
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "_from_model_config": true,
+ "bos_token_id": 151646,
+ "do_sample": true,
+ "eos_token_id": 151643,
+ "temperature": 0.6,
+ "top_p": 0.95,
+ "transformers_version": "4.50.0"
+ }
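This config enables sampling with temperature 0.6 and nucleus (top-p) filtering at 0.95: at each decoding step, token probabilities are sorted and only the smallest set whose cumulative mass reaches `top_p` is kept and renormalized before sampling. A minimal sketch of that filtering step (illustrative, not the `transformers` implementation; the token names are made up):

```python
def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize. `probs` maps token -> probability."""
    kept = {}
    cum = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cum += p
        if cum >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

dist = {"past": 0.5, "future": 0.3, "both": 0.15, "neither": 0.05}
print(top_p_filter(dist, top_p=0.95))  # the low-mass "neither" tail is dropped
```

A lower `top_p` truncates more of the tail; combined with the sub-1.0 temperature it keeps generations focused while still sampling.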
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "total_flos": 0.0,
+ "train_loss": 0.00011944215420650531,
+ "train_runtime": 12092.6435,
+ "train_samples": 7000,
+ "train_samples_per_second": 1.191,
+ "train_steps_per_second": 0.012
+ }
trainer_state.json ADDED
@@ -0,0 +1,2143 @@
+ {
+ "best_global_step": null,
+ "best_metric": null,
+ "best_model_checkpoint": null,
+ "epoch": 0.17142857142857143,
+ "eval_steps": 500,
+ "global_step": 150,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "completion_length": 2693.6875610351562,
+ "entropy": 0.3662109375,
+ "epoch": 0.001142857142857143,
+ "grad_norm": 0.12395373731851578,
+ "kl": 0.0,
+ "learning_rate": 6.666666666666667e-08,
+ "loss": 0.0,
+ "reward": 0.7708333535119891,
+ "reward_std": 0.4629540964961052,
+ "rewards/accuracy_reward": 0.25000001303851604,
+ "rewards/format_reward": 0.5208333386108279,
+ "step": 1
+ },
+ {
+ "completion_length": 3127.3958435058594,
+ "entropy": 0.353515625,
+ "epoch": 0.002285714285714286,
+ "grad_norm": 0.14846429228782654,
+ "kl": 0.0,
+ "learning_rate": 1.3333333333333334e-07,
+ "loss": 0.0,
+ "reward": 0.6458333637565374,
+ "reward_std": 0.4249730706214905,
+ "rewards/accuracy_reward": 0.2812500102445483,
+ "rewards/format_reward": 0.3645833386108279,
+ "step": 2
+ },
+ {
+ "completion_length": 3685.041748046875,
+ "entropy": 0.4443359375,
+ "epoch": 0.0034285714285714284,
+ "grad_norm": 0.10399040579795837,
+ "kl": 4.1425228118896484e-05,
+ "learning_rate": 2e-07,
+ "loss": 0.0,
+ "reward": 0.23958333674818277,
+ "reward_std": 0.3668827787041664,
+ "rewards/accuracy_reward": 0.0729166679084301,
+ "rewards/format_reward": 0.16666667256504297,
+ "step": 3
+ },
+ {
+ "completion_length": 2380.291778564453,
+ "entropy": 0.40478515625,
+ "epoch": 0.004571428571428572,
+ "grad_norm": 0.16352659463882446,
+ "kl": 3.409385681152344e-05,
+ "learning_rate": 2.6666666666666667e-07,
+ "loss": 0.0,
+ "reward": 0.8229166865348816,
+ "reward_std": 0.507609948515892,
+ "rewards/accuracy_reward": 0.19791667722165585,
+ "rewards/format_reward": 0.6250000223517418,
+ "step": 4
+ },
+ {
+ "completion_length": 3441.2188720703125,
+ "entropy": 0.45458984375,
+ "epoch": 0.005714285714285714,
+ "grad_norm": 0.15812984108924866,
+ "kl": 4.1961669921875e-05,
+ "learning_rate": 3.333333333333333e-07,
+ "loss": 0.0,
+ "reward": 0.42708334885537624,
+ "reward_std": 0.5058739930391312,
+ "rewards/accuracy_reward": 0.07291666697710752,
+ "rewards/format_reward": 0.35416667722165585,
+ "step": 5
+ },
+ {
+ "completion_length": 3382.3438110351562,
+ "entropy": 0.45166015625,
+ "epoch": 0.006857142857142857,
+ "grad_norm": 0.15454305708408356,
+ "kl": 4.26173210144043e-05,
+ "learning_rate": 4e-07,
+ "loss": 0.0,
+ "reward": 0.40625000558793545,
+ "reward_std": 0.5202516540884972,
+ "rewards/accuracy_reward": 0.0833333358168602,
+ "rewards/format_reward": 0.3229166744276881,
+ "step": 6
+ },
+ {
+ "completion_length": 3277.291748046875,
+ "entropy": 0.39404296875,
+ "epoch": 0.008,
+ "grad_norm": 0.13690507411956787,
+ "kl": 2.562999725341797e-05,
+ "learning_rate": 4.6666666666666666e-07,
+ "loss": 0.0,
+ "reward": 0.8854166865348816,
+ "reward_std": 0.6845719665288925,
+ "rewards/accuracy_reward": 0.2708333432674408,
+ "rewards/format_reward": 0.6145833432674408,
+ "step": 7
+ },
+ {
+ "completion_length": 2841.916748046875,
+ "entropy": 0.36083984375,
+ "epoch": 0.009142857142857144,
+ "grad_norm": 0.1767321527004242,
+ "kl": 2.4050474166870117e-05,
+ "learning_rate": 5.333333333333333e-07,
+ "loss": 0.0,
+ "reward": 0.8854166967794299,
+ "reward_std": 0.3672378845512867,
+ "rewards/accuracy_reward": 0.3958333535119891,
+ "rewards/format_reward": 0.4895833460614085,
+ "step": 8
+ },
+ {
+ "completion_length": 3480.6563110351562,
+ "entropy": 0.4384765625,
+ "epoch": 0.010285714285714285,
+ "grad_norm": 0.15406936407089233,
+ "kl": 3.796815872192383e-05,
+ "learning_rate": 6e-07,
+ "loss": 0.0,
+ "reward": 0.5520833432674408,
+ "reward_std": 0.6496799141168594,
+ "rewards/accuracy_reward": 0.17708333488553762,
+ "rewards/format_reward": 0.3750000074505806,
+ "step": 9
+ },
+ {
+ "completion_length": 2963.572967529297,
+ "entropy": 0.3544921875,
+ "epoch": 0.011428571428571429,
+ "grad_norm": 0.15688633918762207,
+ "kl": 2.5287270545959473e-05,
+ "learning_rate": 6.666666666666666e-07,
+ "loss": 0.0,
+ "reward": 0.5937500223517418,
+ "reward_std": 0.5099271312355995,
+ "rewards/accuracy_reward": 0.17708333861082792,
+ "rewards/format_reward": 0.4166666753590107,
+ "step": 10
+ },
+ {
+ "completion_length": 3573.7500610351562,
+ "entropy": 0.37890625,
+ "epoch": 0.012571428571428572,
+ "grad_norm": 0.12983083724975586,
+ "kl": 2.5391578674316406e-05,
+ "learning_rate": 7.333333333333332e-07,
+ "loss": 0.0,
+ "reward": 0.3125000111758709,
+ "reward_std": 0.5802810192108154,
+ "rewards/accuracy_reward": 0.1041666679084301,
+ "rewards/format_reward": 0.20833334140479565,
+ "step": 11
+ },
+ {
+ "completion_length": 2520.8958740234375,
+ "entropy": 0.39111328125,
+ "epoch": 0.013714285714285714,
+ "grad_norm": 0.20449091494083405,
+ "kl": 3.743171691894531e-05,
+ "learning_rate": 8e-07,
+ "loss": 0.0,
+ "reward": 0.8020833656191826,
+ "reward_std": 0.4411254972219467,
+ "rewards/accuracy_reward": 0.14583333395421505,
+ "rewards/format_reward": 0.6562500223517418,
+ "step": 12
+ },
+ {
+ "completion_length": 3038.041748046875,
+ "entropy": 0.3828125,
+ "epoch": 0.014857142857142857,
+ "grad_norm": 0.14574433863162994,
+ "kl": 2.5153160095214844e-05,
+ "learning_rate": 8.666666666666667e-07,
+ "loss": 0.0,
+ "reward": 0.6875000298023224,
+ "reward_std": 0.3254704251885414,
+ "rewards/accuracy_reward": 0.22916666697710752,
+ "rewards/format_reward": 0.4583333432674408,
+ "step": 13
+ },
+ {
+ "completion_length": 3116.3125610351562,
+ "entropy": 0.37109375,
+ "epoch": 0.016,
+ "grad_norm": 0.202586367726326,
+ "kl": 1.9026920199394226e-05,
+ "learning_rate": 9.333333333333333e-07,
+ "loss": 0.0,
+ "reward": 0.5833333507180214,
+ "reward_std": 0.4630111753940582,
+ "rewards/accuracy_reward": 0.21875001024454832,
+ "rewards/format_reward": 0.3645833395421505,
+ "step": 14
+ },
+ {
+ "completion_length": 2924.0521240234375,
+ "entropy": 0.36328125,
+ "epoch": 0.017142857142857144,
+ "grad_norm": 0.09130721539258957,
+ "kl": 1.4469027519226074e-05,
+ "learning_rate": 1e-06,
+ "loss": 0.0,
+ "reward": 0.604166679084301,
+ "reward_std": 0.22134994342923164,
+ "rewards/accuracy_reward": 0.1979166716337204,
+ "rewards/format_reward": 0.4062500074505806,
+ "step": 15
+ },
+ {
+ "completion_length": 3887.5521850585938,
+ "entropy": 0.4755859375,
+ "epoch": 0.018285714285714287,
+ "grad_norm": 0.11806491017341614,
+ "kl": 2.9832124710083008e-05,
+ "learning_rate": 9.998781585307575e-07,
+ "loss": 0.0,
+ "reward": 0.11458333674818277,
+ "reward_std": 0.26997610181570053,
+ "rewards/accuracy_reward": 0.041666666977107525,
+ "rewards/format_reward": 0.07291666977107525,
+ "step": 16
+ },
+ {
+ "completion_length": 2579.625030517578,
+ "entropy": 0.44091796875,
+ "epoch": 0.019428571428571427,
+ "grad_norm": 0.2032601535320282,
+ "kl": 3.93986701965332e-05,
+ "learning_rate": 9.99512700102336e-07,
+ "loss": 0.0,
+ "reward": 0.7083333507180214,
+ "reward_std": 0.39187028259038925,
+ "rewards/accuracy_reward": 0.19791667442768812,
+ "rewards/format_reward": 0.5104166753590107,
+ "step": 17
+ },
+ {
+ "completion_length": 3089.104248046875,
+ "entropy": 0.3671875,
+ "epoch": 0.02057142857142857,
+ "grad_norm": 0.11376938223838806,
+ "kl": 1.2531876564025879e-05,
+ "learning_rate": 9.989038226169207e-07,
+ "loss": 0.0,
+ "reward": 0.5625000251457095,
+ "reward_std": 0.35285963863134384,
+ "rewards/accuracy_reward": 0.1666666679084301,
+ "rewards/format_reward": 0.3958333386108279,
+ "step": 18
+ },
+ {
+ "completion_length": 3130.760498046875,
+ "entropy": 0.39111328125,
+ "epoch": 0.021714285714285714,
+ "grad_norm": 0.09636794775724411,
+ "kl": 2.8267502784729004e-05,
+ "learning_rate": 9.98051855792412e-07,
+ "loss": 0.0,
+ "reward": 0.8125000111758709,
+ "reward_std": 0.3496965616941452,
+ "rewards/accuracy_reward": 0.36458333395421505,
+ "rewards/format_reward": 0.44791667722165585,
+ "step": 19
+ },
+ {
+ "completion_length": 2585.9896545410156,
+ "entropy": 0.329833984375,
+ "epoch": 0.022857142857142857,
+ "grad_norm": 0.15105831623077393,
+ "kl": 6.628036499023438e-05,
+ "learning_rate": 9.969572609838744e-07,
+ "loss": 0.0,
+ "reward": 0.9791666716337204,
+ "reward_std": 0.3452813923358917,
+ "rewards/accuracy_reward": 0.2812500037252903,
+ "rewards/format_reward": 0.6979166716337204,
+ "step": 20
+ },
+ {
+ "completion_length": 2804.229248046875,
+ "entropy": 0.42578125,
+ "epoch": 0.024,
+ "grad_norm": 0.2109123021364212,
+ "kl": 0.00016552209854125977,
+ "learning_rate": 9.956206309337066e-07,
+ "loss": 0.0,
+ "reward": 0.6145833432674408,
+ "reward_std": 0.4177238382399082,
+ "rewards/accuracy_reward": 0.1458333395421505,
+ "rewards/format_reward": 0.46875002048909664,
+ "step": 21
+ },
+ {
+ "completion_length": 1903.1459045410156,
+ "entropy": 0.419921875,
+ "epoch": 0.025142857142857144,
+ "grad_norm": 0.20996998250484467,
+ "kl": 0.00026351213455200195,
+ "learning_rate": 9.940426894506606e-07,
+ "loss": 0.0,
+ "reward": 1.1041667014360428,
+ "reward_std": 0.4033822976052761,
+ "rewards/accuracy_reward": 0.29166667722165585,
+ "rewards/format_reward": 0.8125000149011612,
+ "step": 22
+ },
+ {
+ "completion_length": 2714.0729370117188,
+ "entropy": 0.36865234375,
+ "epoch": 0.026285714285714287,
+ "grad_norm": 0.16544093191623688,
+ "kl": 0.00011658668518066406,
+ "learning_rate": 9.922242910178859e-07,
+ "loss": 0.0,
+ "reward": 0.6770833507180214,
+ "reward_std": 0.6271640285849571,
+ "rewards/accuracy_reward": 0.1770833432674408,
+ "rewards/format_reward": 0.5000000223517418,
+ "step": 23
+ },
+ {
+ "completion_length": 2834.2396850585938,
+ "entropy": 0.373046875,
+ "epoch": 0.027428571428571427,
+ "grad_norm": 0.10939397662878036,
+ "kl": 0.0001084059476852417,
+ "learning_rate": 9.901664203302124e-07,
+ "loss": 0.0,
+ "reward": 0.7916666865348816,
+ "reward_std": 0.5711240321397781,
+ "rewards/accuracy_reward": 0.2187500074505806,
+ "rewards/format_reward": 0.572916679084301,
+ "step": 24
+ },
+ {
+ "completion_length": 2877.1354370117188,
+ "entropy": 0.4296875,
+ "epoch": 0.02857142857142857,
+ "grad_norm": 0.10193013399839401,
+ "kl": 0.00018364191055297852,
+ "learning_rate": 9.878701917609207e-07,
+ "loss": 0.0,
+ "reward": 0.677083358168602,
+ "reward_std": 0.2898401468992233,
+ "rewards/accuracy_reward": 0.2395833432674408,
+ "rewards/format_reward": 0.4375,
+ "step": 25
+ },
+ {
+ "completion_length": 3221.2396850585938,
+ "entropy": 0.4248046875,
+ "epoch": 0.029714285714285714,
+ "grad_norm": 0.07458896934986115,
+ "kl": 3.0487775802612305e-05,
+ "learning_rate": 9.853368487582886e-07,
+ "loss": 0.0,
+ "reward": 0.6562500149011612,
+ "reward_std": 0.25371449440717697,
+ "rewards/accuracy_reward": 0.19791666977107525,
+ "rewards/format_reward": 0.4583333358168602,
+ "step": 26
+ },
+ {
+ "completion_length": 3297.3959350585938,
+ "entropy": 0.45703125,
+ "epoch": 0.030857142857142857,
+ "grad_norm": 0.0925775095820427,
+ "kl": 0.00012201815843582153,
+ "learning_rate": 9.825677631722435e-07,
+ "loss": 0.0,
+ "reward": 0.541666679084301,
+ "reward_std": 0.4426998719573021,
+ "rewards/accuracy_reward": 0.15625000279396772,
+ "rewards/format_reward": 0.385416679084301,
+ "step": 27
+ },
+ {
+ "completion_length": 2984.2188110351562,
+ "entropy": 0.3994140625,
+ "epoch": 0.032,
+ "grad_norm": 0.12723609805107117,
+ "kl": 0.00015980005264282227,
+ "learning_rate": 9.795644345114794e-07,
+ "loss": 0.0,
+ "reward": 0.8437500447034836,
+ "reward_std": 0.48607436567544937,
+ "rewards/accuracy_reward": 0.3333333460614085,
+ "rewards/format_reward": 0.5104166865348816,
+ "step": 28
+ },
+ {
+ "completion_length": 3707.2501220703125,
+ "entropy": 0.43408203125,
+ "epoch": 0.03314285714285714,
+ "grad_norm": 0.20130394399166107,
+ "kl": 0.0003858804702758789,
+ "learning_rate": 9.76328489131448e-07,
+ "loss": 0.0,
+ "reward": 0.2500000027939677,
+ "reward_std": 0.37919554859399796,
+ "rewards/accuracy_reward": 0.06250000186264515,
+ "rewards/format_reward": 0.18750001024454832,
+ "step": 29
+ },
+ {
+ "completion_length": 3099.9063110351562,
+ "entropy": 0.384765625,
+ "epoch": 0.03428571428571429,
+ "grad_norm": 0.13564985990524292,
+ "kl": 0.0005750656127929688,
+ "learning_rate": 9.728616793536587e-07,
+ "loss": 0.0,
+ "reward": 0.8854166828095913,
+ "reward_std": 0.5532158613204956,
+ "rewards/accuracy_reward": 0.3125000009313226,
+ "rewards/format_reward": 0.572916679084301,
+ "step": 30
+ },
+ {
+ "completion_length": 3310.8125610351562,
+ "entropy": 0.40087890625,
+ "epoch": 0.03542857142857143,
+ "grad_norm": 0.14703762531280518,
+ "kl": 0.0007152557373046875,
+ "learning_rate": 9.69165882516764e-07,
+ "loss": 0.0,
+ "reward": 0.4583333469927311,
+ "reward_std": 0.4937985762953758,
+ "rewards/accuracy_reward": 0.16666667070239782,
+ "rewards/format_reward": 0.2916666716337204,
+ "step": 31
+ },
+ {
+ "completion_length": 3543.666748046875,
+ "entropy": 0.4521484375,
+ "epoch": 0.036571428571428574,
+ "grad_norm": 0.11301636695861816,
+ "kl": 0.00032842159271240234,
+ "learning_rate": 9.65243099959949e-07,
+ "loss": 0.0,
+ "reward": 0.6250000223517418,
+ "reward_std": 0.46596524864435196,
+ "rewards/accuracy_reward": 0.28125000558793545,
+ "rewards/format_reward": 0.34375,
+ "step": 32
+ },
+ {
+ "completion_length": 3395.947998046875,
+ "entropy": 0.384765625,
+ "epoch": 0.037714285714285714,
+ "grad_norm": 0.12107253074645996,
+ "kl": 0.00042450428009033203,
+ "learning_rate": 9.610954559391704e-07,
+ "loss": 0.0,
+ "reward": 0.604166679084301,
+ "reward_std": 0.5497709587216377,
+ "rewards/accuracy_reward": 0.18750000093132257,
+ "rewards/format_reward": 0.41666667722165585,
+ "step": 33
+ },
+ {
+ "completion_length": 2621.218780517578,
+ "entropy": 0.45263671875,
+ "epoch": 0.038857142857142854,
+ "grad_norm": 0.15317150950431824,
+ "kl": 0.0013637542724609375,
+ "learning_rate": 9.567251964768342e-07,
+ "loss": 0.0001,
+ "reward": 0.8541666865348816,
+ "reward_std": 0.4670567326247692,
+ "rewards/accuracy_reward": 0.31250001303851604,
+ "rewards/format_reward": 0.5416666828095913,
+ "step": 34
+ },
+ {
+ "completion_length": 3166.3958740234375,
+ "entropy": 0.43115234375,
+ "epoch": 0.04,
+ "grad_norm": 0.1469903290271759,
+ "kl": 0.0011509060859680176,
+ "learning_rate": 9.521346881455354e-07,
+ "loss": 0.0,
+ "reward": 0.6458333656191826,
+ "reward_std": 0.6130613833665848,
+ "rewards/accuracy_reward": 0.23958333767950535,
+ "rewards/format_reward": 0.4062500149011612,
+ "step": 35
+ },
+ {
+ "completion_length": 3509.697998046875,
+ "entropy": 0.513671875,
+ "epoch": 0.04114285714285714,
+ "grad_norm": 0.11033376306295395,
+ "kl": 0.0011191368103027344,
+ "learning_rate": 9.473264167865171e-07,
+ "loss": 0.0,
+ "reward": 0.23958333395421505,
+ "reward_std": 0.24118434637784958,
+ "rewards/accuracy_reward": 0.031250000931322575,
+ "rewards/format_reward": 0.20833334140479565,
+ "step": 36
+ },
+ {
+ "completion_length": 3363.8333740234375,
+ "entropy": 0.42138671875,
+ "epoch": 0.04228571428571429,
+ "grad_norm": 0.11778294295072556,
+ "kl": 0.0008115768432617188,
+ "learning_rate": 9.42302986163543e-07,
+ "loss": 0.0,
+ "reward": 0.2812500149011612,
+ "reward_std": 0.13804075866937637,
+ "rewards/accuracy_reward": 0.031250000931322575,
+ "rewards/format_reward": 0.25,
+ "step": 37
+ },
+ {
+ "completion_length": 3610.8438110351562,
+ "entropy": 0.44677734375,
+ "epoch": 0.04342857142857143,
+ "grad_norm": 0.061884235590696335,
+ "kl": 0.0005736351013183594,
+ "learning_rate": 9.370671165529144e-07,
+ "loss": 0.0,
+ "reward": 0.21875000558793545,
+ "reward_std": 0.17128896713256836,
+ "rewards/accuracy_reward": 0.10416666697710752,
+ "rewards/format_reward": 0.11458333861082792,
+ "step": 38
+ },
+ {
+ "completion_length": 2926.4063110351562,
+ "entropy": 0.36669921875,
+ "epoch": 0.044571428571428574,
+ "grad_norm": 0.1068028062582016,
+ "kl": 0.0011527538299560547,
+ "learning_rate": 9.316216432703916e-07,
+ "loss": 0.0,
+ "reward": 0.7708333656191826,
+ "reward_std": 0.1930682435631752,
+ "rewards/accuracy_reward": 0.25,
+ "rewards/format_reward": 0.5208333507180214,
+ "step": 39
+ },
+ {
+ "completion_length": 2785.1146545410156,
+ "entropy": 0.388671875,
+ "epoch": 0.045714285714285714,
+ "grad_norm": 0.17572174966335297,
+ "kl": 0.0032024383544921875,
+ "learning_rate": 9.259695151358214e-07,
+ "loss": 0.0001,
+ "reward": 0.7291666902601719,
+ "reward_std": 0.3721684589982033,
+ "rewards/accuracy_reward": 0.1979166716337204,
+ "rewards/format_reward": 0.5312500186264515,
+ "step": 40
+ },
+ {
+ "completion_length": 3123.947998046875,
+ "entropy": 0.35791015625,
+ "epoch": 0.046857142857142854,
+ "grad_norm": 0.144905224442482,
+ "kl": 0.0007574558258056641,
+ "learning_rate": 9.20113792876298e-07,
+ "loss": 0.0,
+ "reward": 0.5416666865348816,
+ "reward_std": 0.4893290549516678,
+ "rewards/accuracy_reward": 0.12500000651925802,
+ "rewards/format_reward": 0.416666679084301,
+ "step": 41
+ },
+ {
+ "completion_length": 3056.197998046875,
+ "entropy": 0.48193359375,
+ "epoch": 0.048,
+ "grad_norm": 0.07001210004091263,
+ "kl": 0.000598907470703125,
+ "learning_rate": 9.140576474687263e-07,
+ "loss": 0.0,
+ "reward": 0.30208333395421505,
+ "reward_std": 0.15690935403108597,
+ "rewards/accuracy_reward": 0.02083333395421505,
+ "rewards/format_reward": 0.2812500009313226,
+ "step": 42
+ },
+ {
+ "completion_length": 3097.479248046875,
+ "entropy": 0.4033203125,
+ "epoch": 0.04914285714285714,
+ "grad_norm": 0.09348543733358383,
+ "kl": 0.001148223876953125,
+ "learning_rate": 9.078043584226815e-07,
+ "loss": 0.0,
+ "reward": 0.4895833358168602,
+ "reward_std": 0.3020758181810379,
+ "rewards/accuracy_reward": 0.1666666679084301,
+ "rewards/format_reward": 0.32291666977107525,
+ "step": 43
+ },
+ {
+ "completion_length": 2797.7084197998047,
+ "entropy": 0.39013671875,
+ "epoch": 0.05028571428571429,
+ "grad_norm": 0.15556485950946808,
+ "kl": 0.0014767646789550781,
+ "learning_rate": 9.013573120044966e-07,
+ "loss": 0.0001,
+ "reward": 0.8020833386108279,
+ "reward_std": 0.3607782945036888,
+ "rewards/accuracy_reward": 0.2708333395421505,
+ "rewards/format_reward": 0.5312500102445483,
+ "step": 44
+ },
+ {
+ "completion_length": 3618.791748046875,
+ "entropy": 0.423828125,
+ "epoch": 0.05142857142857143,
+ "grad_norm": 0.09923144429922104,
+ "kl": 0.0026645660400390625,
+ "learning_rate": 8.9471999940354e-07,
+ "loss": 0.0001,
+ "reward": 0.5833333348855376,
+ "reward_std": 0.4189528524875641,
+ "rewards/accuracy_reward": 0.2604166716337204,
+ "rewards/format_reward": 0.3229166781529784,
+ "step": 45
+ },
+ {
+ "completion_length": 3482.479248046875,
+ "entropy": 0.50634765625,
+ "epoch": 0.052571428571428575,
+ "grad_norm": 0.12269324064254761,
+ "kl": 0.0013875961303710938,
+ "learning_rate": 8.878960148416747e-07,
+ "loss": 0.0001,
+ "reward": 0.22916667722165585,
+ "reward_std": 0.26679350435733795,
+ "rewards/accuracy_reward": 0.0416666679084301,
+ "rewards/format_reward": 0.18750000186264515,
+ "step": 46
+ },
+ {
+ "completion_length": 2958.291748046875,
+ "entropy": 0.390625,
+ "epoch": 0.053714285714285714,
+ "grad_norm": 0.16963143646717072,
+ "kl": 0.0011917352676391602,
+ "learning_rate": 8.808890536269229e-07,
+ "loss": 0.0,
+ "reward": 0.8854166967794299,
+ "reward_std": 0.5451135858893394,
+ "rewards/accuracy_reward": 0.3437500149011612,
+ "rewards/format_reward": 0.5416666669771075,
+ "step": 47
+ },
+ {
+ "completion_length": 2956.416717529297,
+ "entropy": 0.396484375,
+ "epoch": 0.054857142857142854,
+ "grad_norm": 0.14105089008808136,
+ "kl": 0.0033426284790039062,
+ "learning_rate": 8.737029101523929e-07,
+ "loss": 0.0001,
+ "reward": 0.7395833507180214,
+ "reward_std": 0.5457281768321991,
+ "rewards/accuracy_reward": 0.29166666977107525,
+ "rewards/format_reward": 0.4479166679084301,
+ "step": 48
+ },
+ {
+ "completion_length": 2448.1146850585938,
+ "entropy": 0.36865234375,
+ "epoch": 0.056,
+ "grad_norm": 0.15970121324062347,
+ "kl": 0.006764888763427734,
+ "learning_rate": 8.663414758415478e-07,
+ "loss": 0.0003,
+ "reward": 0.895833395421505,
+ "reward_std": 0.464010052382946,
+ "rewards/accuracy_reward": 0.25000000558793545,
+ "rewards/format_reward": 0.6458333507180214,
+ "step": 49
+ },
+ {
+ "completion_length": 3050.1041870117188,
+ "entropy": 0.34521484375,
+ "epoch": 0.05714285714285714,
+ "grad_norm": 0.12386268377304077,
+ "kl": 0.0011911392211914062,
+ "learning_rate": 8.588087370409302e-07,
+ "loss": 0.0,
+ "reward": 0.6562500325962901,
+ "reward_std": 0.4570996016263962,
+ "rewards/accuracy_reward": 0.2916666818782687,
+ "rewards/format_reward": 0.3645833497866988,
+ "step": 50
+ },
+ {
+ "completion_length": 2495.2708740234375,
+ "entropy": 0.44873046875,
+ "epoch": 0.05828571428571429,
+ "grad_norm": 0.12384030222892761,
+ "kl": 0.005596160888671875,
+ "learning_rate": 8.511087728614862e-07,
+ "loss": 0.0002,
+ "reward": 0.6875000149011612,
+ "reward_std": 0.3033446706831455,
+ "rewards/accuracy_reward": 0.1458333395421505,
+ "rewards/format_reward": 0.5416666716337204,
+ "step": 51
+ },
+ {
+ "completion_length": 3027.406280517578,
+ "entropy": 0.384765625,
+ "epoch": 0.05942857142857143,
+ "grad_norm": 0.0901699811220169,
+ "kl": 0.0021982192993164062,
+ "learning_rate": 8.432457529696548e-07,
+ "loss": 0.0001,
+ "reward": 0.8750000298023224,
+ "reward_std": 0.5351639539003372,
+ "rewards/accuracy_reward": 0.3958333507180214,
+ "rewards/format_reward": 0.4791666865348816,
+ "step": 52
+ },
+ {
+ "completion_length": 2952.8646850585938,
+ "entropy": 0.41845703125,
+ "epoch": 0.060571428571428575,
+ "grad_norm": 0.09236861765384674,
+ "kl": 0.0012669563293457031,
+ "learning_rate": 8.352239353294194e-07,
+ "loss": 0.0001,
+ "reward": 0.8541666865348816,
+ "reward_std": 0.5214647725224495,
+ "rewards/accuracy_reward": 0.260416679084301,
+ "rewards/format_reward": 0.5937500074505806,
+ "step": 53
+ },
+ {
+ "completion_length": 2996.1250610351562,
+ "entropy": 0.3837890625,
+ "epoch": 0.061714285714285715,
+ "grad_norm": 0.15135987102985382,
+ "kl": 0.0015659332275390625,
+ "learning_rate": 8.270476638965461e-07,
+ "loss": 0.0001,
+ "reward": 0.9479166939854622,
+ "reward_std": 0.7639089524745941,
+ "rewards/accuracy_reward": 0.4062500111758709,
+ "rewards/format_reward": 0.5416666828095913,
+ "step": 54
+ },
+ {
+ "completion_length": 3076.2500610351562,
+ "entropy": 0.4130859375,
+ "epoch": 0.06285714285714286,
+ "grad_norm": 0.12680813670158386,
+ "kl": 0.0023276805877685547,
+ "learning_rate": 8.187213662662538e-07,
+ "loss": 0.0001,
+ "reward": 0.6979166865348816,
+ "reward_std": 0.5675121322274208,
+ "rewards/accuracy_reward": 0.23958333861082792,
+ "rewards/format_reward": 0.458333358168602,
+ "step": 55
+ },
+ {
+ "completion_length": 3058.104248046875,
+ "entropy": 0.4072265625,
+ "epoch": 0.064,
+ "grad_norm": 0.10628776252269745,
+ "kl": 0.0009794235229492188,
+ "learning_rate": 8.102495512755938e-07,
+ "loss": 0.0,
+ "reward": 0.6562500298023224,
+ "reward_std": 0.3362164571881294,
+ "rewards/accuracy_reward": 0.19791666697710752,
793
+ "rewards/format_reward": 0.4583333469927311,
794
+ "step": 56
795
+ },
796
+ {
797
+ "completion_length": 3532.2813110351562,
798
+ "entropy": 0.3369140625,
799
+ "epoch": 0.06514285714285714,
800
+ "grad_norm": 0.09391733258962631,
801
+ "kl": 0.0005993843078613281,
802
+ "learning_rate": 8.01636806561836e-07,
803
+ "loss": 0.0,
804
+ "reward": 0.3854166865348816,
805
+ "reward_std": 0.3325711265206337,
806
+ "rewards/accuracy_reward": 0.08333333395421505,
807
+ "rewards/format_reward": 0.3020833432674408,
808
+ "step": 57
809
+ },
810
+ {
811
+ "completion_length": 2239.1145935058594,
812
+ "entropy": 0.322998046875,
813
+ "epoch": 0.06628571428571428,
814
+ "grad_norm": 0.11698172241449356,
815
+ "kl": 0.0037631988525390625,
816
+ "learning_rate": 7.928877960781808e-07,
817
+ "loss": 0.0002,
818
+ "reward": 1.0000000223517418,
819
+ "reward_std": 0.39449498802423477,
820
+ "rewards/accuracy_reward": 0.2604166669771075,
821
+ "rewards/format_reward": 0.7395833358168602,
822
+ "step": 58
823
+ },
824
+ {
825
+ "completion_length": 3092.3438110351562,
826
+ "entropy": 0.3662109375,
827
+ "epoch": 0.06742857142857143,
828
+ "grad_norm": 0.10882271081209183,
829
+ "kl": 0.001026153564453125,
830
+ "learning_rate": 7.840072575681468e-07,
831
+ "loss": 0.0,
832
+ "reward": 0.5625000298023224,
833
+ "reward_std": 0.39667778089642525,
834
+ "rewards/accuracy_reward": 0.19791667442768812,
835
+ "rewards/format_reward": 0.36458333395421505,
836
+ "step": 59
837
+ },
838
+ {
839
+ "completion_length": 3120.5209350585938,
840
+ "entropy": 0.38037109375,
841
+ "epoch": 0.06857142857142857,
842
+ "grad_norm": 0.1319083720445633,
843
+ "kl": 0.0019273757934570312,
844
+ "learning_rate": 7.75e-07,
845
+ "loss": 0.0001,
846
+ "reward": 0.583333358168602,
847
+ "reward_std": 0.49578939378261566,
848
+ "rewards/accuracy_reward": 0.13541666977107525,
849
+ "rewards/format_reward": 0.4479166902601719,
850
+ "step": 60
851
+ },
852
+ {
853
+ "completion_length": 2971.9166870117188,
854
+ "entropy": 0.36669921875,
855
+ "epoch": 0.06971428571428571,
856
+ "grad_norm": 0.17734739184379578,
857
+ "kl": 0.0010924339294433594,
858
+ "learning_rate": 7.658709009626109e-07,
859
+ "loss": 0.0,
860
+ "reward": 0.8020833730697632,
861
+ "reward_std": 0.5149242952466011,
862
+ "rewards/accuracy_reward": 0.2083333358168602,
863
+ "rewards/format_reward": 0.5937500149011612,
864
+ "step": 61
865
+ },
866
+ {
867
+ "completion_length": 2529.7188720703125,
868
+ "entropy": 0.329833984375,
869
+ "epoch": 0.07085714285714285,
870
+ "grad_norm": 0.26185768842697144,
871
+ "kl": 0.016622543334960938,
872
+ "learning_rate": 7.566249040241553e-07,
873
+ "loss": 0.0007,
874
+ "reward": 0.9270833656191826,
875
+ "reward_std": 0.4443442225456238,
876
+ "rewards/accuracy_reward": 0.27083334885537624,
877
+ "rewards/format_reward": 0.6562500149011612,
878
+ "step": 62
879
+ },
880
+ {
881
+ "completion_length": 2196.1250610351562,
882
+ "entropy": 0.36279296875,
883
+ "epoch": 0.072,
884
+ "grad_norm": 0.1202726885676384,
885
+ "kl": 0.0028791427612304688,
886
+ "learning_rate": 7.472670160550848e-07,
887
+ "loss": 0.0001,
888
+ "reward": 1.1562500596046448,
889
+ "reward_std": 0.43789636343717575,
890
+ "rewards/accuracy_reward": 0.385416679084301,
891
+ "rewards/format_reward": 0.7708333432674408,
892
+ "step": 63
893
+ },
894
+ {
895
+ "completion_length": 3074.041717529297,
896
+ "entropy": 0.42041015625,
897
+ "epoch": 0.07314285714285715,
898
+ "grad_norm": 0.10591074079275131,
899
+ "kl": 0.0019989013671875,
900
+ "learning_rate": 7.37802304516818e-07,
901
+ "loss": 0.0001,
902
+ "reward": 0.6250000149011612,
903
+ "reward_std": 0.4529266282916069,
904
+ "rewards/accuracy_reward": 0.18750000651925802,
905
+ "rewards/format_reward": 0.4375000074505806,
906
+ "step": 64
907
+ },
908
+ {
909
+ "completion_length": 2871.416748046875,
910
+ "entropy": 0.365478515625,
911
+ "epoch": 0.07428571428571429,
912
+ "grad_norm": 0.16383042931556702,
913
+ "kl": 0.001850128173828125,
914
+ "learning_rate": 7.282358947176205e-07,
915
+ "loss": 0.0001,
916
+ "reward": 0.6666666865348816,
917
+ "reward_std": 0.38071464747190475,
918
+ "rewards/accuracy_reward": 0.19791667722165585,
919
+ "rewards/format_reward": 0.4687500149011612,
920
+ "step": 65
921
+ },
922
+ {
923
+ "completion_length": 2014.6354370117188,
924
+ "entropy": 0.33642578125,
925
+ "epoch": 0.07542857142857143,
926
+ "grad_norm": 0.26954010128974915,
927
+ "kl": 0.008558273315429688,
928
+ "learning_rate": 7.185729670371604e-07,
929
+ "loss": 0.0003,
930
+ "reward": 0.9375000447034836,
931
+ "reward_std": 0.35192636400461197,
932
+ "rewards/accuracy_reward": 0.34375000838190317,
933
+ "rewards/format_reward": 0.59375,
934
+ "step": 66
935
+ },
936
+ {
937
+ "completion_length": 3587.479248046875,
938
+ "entropy": 0.36572265625,
939
+ "epoch": 0.07657142857142857,
940
+ "grad_norm": 0.08370436728000641,
941
+ "kl": 0.001728057861328125,
942
+ "learning_rate": 7.08818754121241e-07,
943
+ "loss": 0.0001,
944
+ "reward": 0.22916667442768812,
945
+ "reward_std": 0.2259194478392601,
946
+ "rewards/accuracy_reward": 0.010416666977107525,
947
+ "rewards/format_reward": 0.21875000279396772,
948
+ "step": 67
949
+ },
950
+ {
951
+ "completion_length": 2018.3750457763672,
952
+ "entropy": 0.36474609375,
953
+ "epoch": 0.07771428571428571,
954
+ "grad_norm": 0.18221524357795715,
955
+ "kl": 0.0046844482421875,
956
+ "learning_rate": 6.989785380482312e-07,
957
+ "loss": 0.0002,
958
+ "reward": 0.916666716337204,
959
+ "reward_std": 0.35510556399822235,
960
+ "rewards/accuracy_reward": 0.2604166716337204,
961
+ "rewards/format_reward": 0.6562500149011612,
962
+ "step": 68
963
+ },
964
+ {
965
+ "completion_length": 2231.1979370117188,
966
+ "entropy": 0.41796875,
967
+ "epoch": 0.07885714285714286,
968
+ "grad_norm": 0.19485469162464142,
969
+ "kl": 0.0045928955078125,
970
+ "learning_rate": 6.890576474687263e-07,
971
+ "loss": 0.0002,
972
+ "reward": 0.6666666939854622,
973
+ "reward_std": 0.3663952201604843,
974
+ "rewards/accuracy_reward": 0.06250000186264515,
975
+ "rewards/format_reward": 0.6041666865348816,
976
+ "step": 69
977
+ },
978
+ {
979
+ "completion_length": 3180.8854370117188,
980
+ "entropy": 0.38232421875,
981
+ "epoch": 0.08,
982
+ "grad_norm": 0.13257598876953125,
983
+ "kl": 0.00274658203125,
984
+ "learning_rate": 6.790614547199906e-07,
985
+ "loss": 0.0001,
986
+ "reward": 0.45833334885537624,
987
+ "reward_std": 0.4246904104948044,
988
+ "rewards/accuracy_reward": 0.07291666977107525,
989
+ "rewards/format_reward": 0.38541666977107525,
990
+ "step": 70
991
+ },
992
+ {
993
+ "completion_length": 2643.1771240234375,
994
+ "entropy": 0.43798828125,
995
+ "epoch": 0.08114285714285714,
996
+ "grad_norm": 0.13770414888858795,
997
+ "kl": 0.0029726028442382812,
998
+ "learning_rate": 6.68995372916741e-07,
999
+ "loss": 0.0001,
1000
+ "reward": 0.7500000149011612,
1001
+ "reward_std": 0.30269280821084976,
1002
+ "rewards/accuracy_reward": 0.2500000074505806,
1003
+ "rewards/format_reward": 0.5000000149011612,
1004
+ "step": 71
1005
+ },
1006
+ {
1007
+ "completion_length": 2940.6146545410156,
1008
+ "entropy": 0.4892578125,
1009
+ "epoch": 0.08228571428571428,
1010
+ "grad_norm": 0.20104120671749115,
1011
+ "kl": 0.0030574798583984375,
1012
+ "learning_rate": 6.588648530198504e-07,
1013
+ "loss": 0.0001,
1014
+ "reward": 0.5000000149011612,
1015
+ "reward_std": 0.47806398570537567,
1016
+ "rewards/accuracy_reward": 0.06250000186264515,
1017
+ "rewards/format_reward": 0.4375000111758709,
1018
+ "step": 72
1019
+ },
1020
+ {
1021
+ "completion_length": 3779.4583740234375,
1022
+ "entropy": 0.513671875,
1023
+ "epoch": 0.08342857142857144,
1024
+ "grad_norm": 0.09181614220142365,
1025
+ "kl": 0.0015926361083984375,
1026
+ "learning_rate": 6.486753808845564e-07,
1027
+ "loss": 0.0001,
1028
+ "reward": 0.3229166744276881,
1029
+ "reward_std": 0.4466712549328804,
1030
+ "rewards/accuracy_reward": 0.125,
1031
+ "rewards/format_reward": 0.19791667070239782,
1032
+ "step": 73
1033
+ },
1034
+ {
1035
+ "completion_length": 3236.4688720703125,
1036
+ "entropy": 0.4130859375,
1037
+ "epoch": 0.08457142857142858,
1038
+ "grad_norm": 0.1495177149772644,
1039
+ "kl": 0.0028803348541259766,
1040
+ "learning_rate": 6.384324742897735e-07,
1041
+ "loss": 0.0001,
1042
+ "reward": 0.645833358168602,
1043
+ "reward_std": 0.4744175747036934,
1044
+ "rewards/accuracy_reward": 0.26041666977107525,
1045
+ "rewards/format_reward": 0.385416679084301,
1046
+ "step": 74
1047
+ },
1048
+ {
1049
+ "completion_length": 3159.635498046875,
1050
+ "entropy": 0.404296875,
1051
+ "epoch": 0.08571428571428572,
1052
+ "grad_norm": 0.11836569011211395,
1053
+ "kl": 0.0026121139526367188,
1054
+ "learning_rate": 6.281416799501187e-07,
1055
+ "loss": 0.0001,
1056
+ "reward": 0.697916679084301,
1057
+ "reward_std": 0.46560388058423996,
1058
+ "rewards/accuracy_reward": 0.22916666697710752,
1059
+ "rewards/format_reward": 0.4687500111758709,
1060
+ "step": 75
1061
+ },
1062
+ {
1063
+ "completion_length": 2450.041748046875,
1064
+ "entropy": 0.412353515625,
1065
+ "epoch": 0.08685714285714285,
1066
+ "grad_norm": 0.1628679782152176,
1067
+ "kl": 0.0018434524536132812,
1068
+ "learning_rate": 6.178085705122674e-07,
1069
+ "loss": 0.0001,
1070
+ "reward": 0.6875,
1071
+ "reward_std": 0.31900282949209213,
1072
+ "rewards/accuracy_reward": 0.08333333395421505,
1073
+ "rewards/format_reward": 0.6041666716337204,
1074
+ "step": 76
1075
+ },
1076
+ {
1077
+ "completion_length": 3208.3751220703125,
1078
+ "entropy": 0.453125,
1079
+ "epoch": 0.088,
1080
+ "grad_norm": 0.1415146142244339,
1081
+ "kl": 0.002140045166015625,
1082
+ "learning_rate": 6.074387415372676e-07,
1083
+ "loss": 0.0001,
1084
+ "reward": 0.5104166818782687,
1085
+ "reward_std": 0.44949568808078766,
1086
+ "rewards/accuracy_reward": 0.09375000558793545,
1087
+ "rewards/format_reward": 0.4166666744276881,
1088
+ "step": 77
1089
+ },
1090
+ {
1091
+ "completion_length": 2896.6771240234375,
1092
+ "entropy": 0.3759765625,
1093
+ "epoch": 0.08914285714285715,
1094
+ "grad_norm": 0.13142429292201996,
1095
+ "kl": 0.0013580322265625,
1096
+ "learning_rate": 5.97037808470444e-07,
1097
+ "loss": 0.0001,
1098
+ "reward": 0.7500000298023224,
1099
+ "reward_std": 0.5779594928026199,
1100
+ "rewards/accuracy_reward": 0.250000006519258,
1101
+ "rewards/format_reward": 0.5000000223517418,
1102
+ "step": 78
1103
+ },
1104
+ {
1105
+ "completion_length": 2312.0938110351562,
1106
+ "entropy": 0.3544921875,
1107
+ "epoch": 0.09028571428571429,
1108
+ "grad_norm": 0.1474093347787857,
1109
+ "kl": 0.0019969940185546875,
1110
+ "learning_rate": 5.866114036005362e-07,
1111
+ "loss": 0.0001,
1112
+ "reward": 0.8125000149011612,
1113
+ "reward_std": 0.36936958134174347,
1114
+ "rewards/accuracy_reward": 0.20833333674818277,
1115
+ "rewards/format_reward": 0.604166679084301,
1116
+ "step": 79
1117
+ },
1118
+ {
1119
+ "completion_length": 3418.0833740234375,
1120
+ "entropy": 0.49755859375,
1121
+ "epoch": 0.09142857142857143,
1122
+ "grad_norm": 0.14068344235420227,
1123
+ "kl": 0.002742767333984375,
1124
+ "learning_rate": 5.761651730097142e-07,
1125
+ "loss": 0.0001,
1126
+ "reward": 0.5208333544433117,
1127
+ "reward_std": 0.4246201291680336,
1128
+ "rewards/accuracy_reward": 0.16666666697710752,
1129
+ "rewards/format_reward": 0.3541666716337204,
1130
+ "step": 80
1131
+ },
1132
+ {
1133
+ "completion_length": 2992.0729370117188,
1134
+ "entropy": 0.56787109375,
1135
+ "epoch": 0.09257142857142857,
1136
+ "grad_norm": 0.1596606820821762,
1137
+ "kl": 0.005756378173828125,
1138
+ "learning_rate": 5.657047735161255e-07,
1139
+ "loss": 0.0002,
1140
+ "reward": 0.4895833432674408,
1141
+ "reward_std": 0.3185732662677765,
1142
+ "rewards/accuracy_reward": 0.11458333395421505,
1143
+ "rewards/format_reward": 0.3750000074505806,
1144
+ "step": 81
1145
+ },
1146
+ {
1147
+ "completion_length": 2483.9584350585938,
1148
+ "entropy": 0.39306640625,
1149
+ "epoch": 0.09371428571428571,
1150
+ "grad_norm": 0.13652034103870392,
1151
+ "kl": 0.002727508544921875,
1152
+ "learning_rate": 5.552358696106288e-07,
1153
+ "loss": 0.0001,
1154
+ "reward": 0.8020833432674408,
1155
+ "reward_std": 0.24248424544930458,
1156
+ "rewards/accuracy_reward": 0.3020833395421505,
1157
+ "rewards/format_reward": 0.5000000074505806,
1158
+ "step": 82
1159
+ },
1160
+ {
1161
+ "completion_length": 2964.8646850585938,
1162
+ "entropy": 0.48486328125,
1163
+ "epoch": 0.09485714285714286,
1164
+ "grad_norm": 0.10738710314035416,
1165
+ "kl": 0.00264739990234375,
1166
+ "learning_rate": 5.447641303893714e-07,
1167
+ "loss": 0.0001,
1168
+ "reward": 0.5312500074505806,
1169
+ "reward_std": 0.3985592797398567,
1170
+ "rewards/accuracy_reward": 0.1770833395421505,
1171
+ "rewards/format_reward": 0.3541666716337204,
1172
+ "step": 83
1173
+ },
1174
+ {
1175
+ "completion_length": 3069.7396240234375,
1176
+ "entropy": 0.45263671875,
1177
+ "epoch": 0.096,
1178
+ "grad_norm": 0.14286305010318756,
1179
+ "kl": 0.0017604827880859375,
1180
+ "learning_rate": 5.342952264838747e-07,
1181
+ "loss": 0.0001,
1182
+ "reward": 0.739583358168602,
1183
+ "reward_std": 0.44136959314346313,
1184
+ "rewards/accuracy_reward": 0.25,
1185
+ "rewards/format_reward": 0.4895833432674408,
1186
+ "step": 84
1187
+ },
1188
+ {
1189
+ "completion_length": 2664.2188110351562,
1190
+ "entropy": 0.324951171875,
1191
+ "epoch": 0.09714285714285714,
1192
+ "grad_norm": 0.13478516042232513,
1193
+ "kl": 0.002002716064453125,
1194
+ "learning_rate": 5.238348269902859e-07,
1195
+ "loss": 0.0001,
1196
+ "reward": 0.7604166716337204,
1197
+ "reward_std": 0.5178688690066338,
1198
+ "rewards/accuracy_reward": 0.15625000186264515,
1199
+ "rewards/format_reward": 0.6041666679084301,
1200
+ "step": 85
1201
+ },
1202
+ {
1203
+ "completion_length": 2774.291748046875,
1204
+ "entropy": 0.465576171875,
1205
+ "epoch": 0.09828571428571428,
1206
+ "grad_norm": 0.16704270243644714,
1207
+ "kl": 0.00365447998046875,
1208
+ "learning_rate": 5.133885963994639e-07,
1209
+ "loss": 0.0001,
1210
+ "reward": 0.6250000102445483,
1211
+ "reward_std": 0.2606133744120598,
1212
+ "rewards/accuracy_reward": 0.16666666977107525,
1213
+ "rewards/format_reward": 0.4583333386108279,
1214
+ "step": 86
1215
+ },
1216
+ {
1217
+ "completion_length": 2400.302215576172,
1218
+ "entropy": 0.4453125,
1219
+ "epoch": 0.09942857142857142,
1220
+ "grad_norm": 0.23015545308589935,
1221
+ "kl": 0.0040435791015625,
1222
+ "learning_rate": 5.02962191529556e-07,
1223
+ "loss": 0.0002,
1224
+ "reward": 0.8125000447034836,
1225
+ "reward_std": 0.5173326060175896,
1226
+ "rewards/accuracy_reward": 0.18750001024454832,
1227
+ "rewards/format_reward": 0.625,
1228
+ "step": 87
1229
+ },
1230
+ {
1231
+ "completion_length": 2469.7396697998047,
1232
+ "entropy": 0.41943359375,
1233
+ "epoch": 0.10057142857142858,
1234
+ "grad_norm": 0.16470564901828766,
1235
+ "kl": 0.0041961669921875,
1236
+ "learning_rate": 4.925612584627324e-07,
1237
+ "loss": 0.0002,
1238
+ "reward": 1.0208333730697632,
1239
+ "reward_std": 0.693773627281189,
1240
+ "rewards/accuracy_reward": 0.3750000186264515,
1241
+ "rewards/format_reward": 0.6458333507180214,
1242
+ "step": 88
1243
+ },
1244
+ {
1245
+ "completion_length": 2834.635498046875,
1246
+ "entropy": 0.37939453125,
1247
+ "epoch": 0.10171428571428572,
1248
+ "grad_norm": 0.1899234652519226,
1249
+ "kl": 0.003200531005859375,
1250
+ "learning_rate": 4.821914294877326e-07,
1251
+ "loss": 0.0001,
1252
+ "reward": 0.6562500149011612,
1253
+ "reward_std": 0.5321320816874504,
1254
+ "rewards/accuracy_reward": 0.17708333395421505,
1255
+ "rewards/format_reward": 0.479166679084301,
1256
+ "step": 89
1257
+ },
1258
+ {
1259
+ "completion_length": 2226.3959045410156,
1260
+ "entropy": 0.59619140625,
1261
+ "epoch": 0.10285714285714286,
1262
+ "grad_norm": 0.15134188532829285,
1263
+ "kl": 0.0078887939453125,
1264
+ "learning_rate": 4.7185832004988133e-07,
1265
+ "loss": 0.0003,
1266
+ "reward": 0.6458333563059568,
1267
+ "reward_std": 0.21650634706020355,
1268
+ "rewards/accuracy_reward": 0.031250000931322575,
1269
+ "rewards/format_reward": 0.6145833460614085,
1270
+ "step": 90
1271
+ },
1272
+ {
1273
+ "completion_length": 2510.8334045410156,
1274
+ "entropy": 0.423828125,
1275
+ "epoch": 0.104,
1276
+ "grad_norm": 0.15344281494617462,
1277
+ "kl": 0.0041046142578125,
1278
+ "learning_rate": 4.6156752571022637e-07,
1279
+ "loss": 0.0002,
1280
+ "reward": 0.8958333637565374,
1281
+ "reward_std": 0.40820014476776123,
1282
+ "rewards/accuracy_reward": 0.2604166669771075,
1283
+ "rewards/format_reward": 0.6354166818782687,
1284
+ "step": 91
1285
+ },
1286
+ {
1287
+ "completion_length": 2551.062530517578,
1288
+ "entropy": 0.390869140625,
1289
+ "epoch": 0.10514285714285715,
1290
+ "grad_norm": 0.12282641232013702,
1291
+ "kl": 0.00506591796875,
1292
+ "learning_rate": 4.513246191154434e-07,
1293
+ "loss": 0.0002,
1294
+ "reward": 0.6770833432674408,
1295
+ "reward_std": 0.3805246874690056,
1296
+ "rewards/accuracy_reward": 0.09375000093132257,
1297
+ "rewards/format_reward": 0.5833333432674408,
1298
+ "step": 92
1299
+ },
1300
+ {
1301
+ "completion_length": 3784.7188110351562,
1302
+ "entropy": 0.638671875,
1303
+ "epoch": 0.10628571428571429,
1304
+ "grad_norm": 0.2033790946006775,
1305
+ "kl": 0.0058460235595703125,
1306
+ "learning_rate": 4.4113514698014953e-07,
1307
+ "loss": 0.0002,
1308
+ "reward": 0.0729166679084301,
1309
+ "reward_std": 0.18205293267965317,
1310
+ "rewards/accuracy_reward": 0.0,
1311
+ "rewards/format_reward": 0.0729166679084301,
1312
+ "step": 93
1313
+ },
1314
+ {
1315
+ "completion_length": 2915.9375,
1316
+ "entropy": 0.5439453125,
1317
+ "epoch": 0.10742857142857143,
1318
+ "grad_norm": 0.19546350836753845,
1319
+ "kl": 0.004344940185546875,
1320
+ "learning_rate": 4.3100462708325914e-07,
1321
+ "loss": 0.0002,
1322
+ "reward": 0.5833333395421505,
1323
+ "reward_std": 0.427902989089489,
1324
+ "rewards/accuracy_reward": 0.18750000279396772,
1325
+ "rewards/format_reward": 0.3958333358168602,
1326
+ "step": 94
1327
+ },
1328
+ {
1329
+ "completion_length": 3644.4688110351562,
1330
+ "entropy": 0.47607421875,
1331
+ "epoch": 0.10857142857142857,
1332
+ "grad_norm": 0.09118141978979111,
1333
+ "kl": 0.0022993087768554688,
1334
+ "learning_rate": 4.209385452800095e-07,
1335
+ "loss": 0.0001,
1336
+ "reward": 0.3958333358168602,
1337
+ "reward_std": 0.4476298391819,
1338
+ "rewards/accuracy_reward": 0.1041666679084301,
1339
+ "rewards/format_reward": 0.2916666679084301,
1340
+ "step": 95
1341
+ },
1342
+ {
1343
+ "completion_length": 2458.0833740234375,
1344
+ "entropy": 0.37353515625,
1345
+ "epoch": 0.10971428571428571,
1346
+ "grad_norm": 0.16451668739318848,
1347
+ "kl": 0.0038547515869140625,
1348
+ "learning_rate": 4.1094235253127374e-07,
1349
+ "loss": 0.0002,
1350
+ "reward": 0.8750000447034836,
1351
+ "reward_std": 0.4621882885694504,
1352
+ "rewards/accuracy_reward": 0.2708333348855376,
1353
+ "rewards/format_reward": 0.6041666865348816,
1354
+ "step": 96
1355
+ },
1356
+ {
1357
+ "completion_length": 2523.0833740234375,
1358
+ "entropy": 0.421875,
1359
+ "epoch": 0.11085714285714286,
1360
+ "grad_norm": 0.1885625571012497,
1361
+ "kl": 0.003082275390625,
1362
+ "learning_rate": 4.0102146195176887e-07,
1363
+ "loss": 0.0001,
1364
+ "reward": 0.927083358168602,
1365
+ "reward_std": 0.5721743106842041,
1366
+ "rewards/accuracy_reward": 0.2708333348855376,
1367
+ "rewards/format_reward": 0.6562500223517418,
1368
+ "step": 97
1369
+ },
1370
+ {
1371
+ "completion_length": 2336.5625915527344,
1372
+ "entropy": 0.384033203125,
1373
+ "epoch": 0.112,
1374
+ "grad_norm": 0.15700186789035797,
1375
+ "kl": 0.0025310516357421875,
1376
+ "learning_rate": 3.911812458787591e-07,
1377
+ "loss": 0.0001,
1378
+ "reward": 0.7916666865348816,
1379
+ "reward_std": 0.2110944464802742,
1380
+ "rewards/accuracy_reward": 0.13541666697710752,
1381
+ "rewards/format_reward": 0.6562500223517418,
1382
+ "step": 98
1383
+ },
1384
+ {
1385
+ "completion_length": 2524.8958740234375,
1386
+ "entropy": 0.379150390625,
1387
+ "epoch": 0.11314285714285714,
1388
+ "grad_norm": 0.1743498593568802,
1389
+ "kl": 0.003711700439453125,
1390
+ "learning_rate": 3.8142703296283953e-07,
1391
+ "loss": 0.0001,
1392
+ "reward": 0.8125000204890966,
1393
+ "reward_std": 0.4803639128804207,
1394
+ "rewards/accuracy_reward": 0.2708333432674408,
1395
+ "rewards/format_reward": 0.5416666697710752,
1396
+ "step": 99
1397
+ },
1398
+ {
1399
+ "completion_length": 2306.0521545410156,
1400
+ "entropy": 0.35009765625,
1401
+ "epoch": 0.11428571428571428,
1402
+ "grad_norm": 0.12883873283863068,
1403
+ "kl": 0.003726959228515625,
1404
+ "learning_rate": 3.7176410528237945e-07,
1405
+ "loss": 0.0001,
1406
+ "reward": 1.0416666865348816,
1407
+ "reward_std": 0.44550345838069916,
1408
+ "rewards/accuracy_reward": 0.3437500074505806,
1409
+ "rewards/format_reward": 0.6979166716337204,
1410
+ "step": 100
1411
+ },
1412
+ {
1413
+ "completion_length": 2115.5313110351562,
1414
+ "entropy": 0.437255859375,
1415
+ "epoch": 0.11542857142857142,
1416
+ "grad_norm": 0.1613183170557022,
1417
+ "kl": 0.00330352783203125,
1418
+ "learning_rate": 3.62197695483182e-07,
1419
+ "loss": 0.0001,
1420
+ "reward": 0.833333358168602,
1421
+ "reward_std": 0.2652370296418667,
1422
+ "rewards/accuracy_reward": 0.15625000558793545,
1423
+ "rewards/format_reward": 0.6770833358168602,
1424
+ "step": 101
1425
+ },
1426
+ {
1427
+ "completion_length": 1823.4167175292969,
1428
+ "entropy": 0.369384765625,
1429
+ "epoch": 0.11657142857142858,
1430
+ "grad_norm": 0.11039572954177856,
1431
+ "kl": 0.0045318603515625,
1432
+ "learning_rate": 3.5273298394491515e-07,
1433
+ "loss": 0.0002,
1434
+ "reward": 0.9375000298023224,
1435
+ "reward_std": 0.2463684342801571,
1436
+ "rewards/accuracy_reward": 0.12500000279396772,
1437
+ "rewards/format_reward": 0.8125000298023224,
1438
+ "step": 102
1439
+ },
1440
+ {
1441
+ "completion_length": 2301.593780517578,
1442
+ "entropy": 0.376953125,
1443
+ "epoch": 0.11771428571428572,
1444
+ "grad_norm": 0.27249184250831604,
1445
+ "kl": 0.00434112548828125,
1446
+ "learning_rate": 3.433750959758446e-07,
1447
+ "loss": 0.0002,
1448
+ "reward": 0.895833358168602,
1449
+ "reward_std": 0.5925451144576073,
1450
+ "rewards/accuracy_reward": 0.19791667815297842,
1451
+ "rewards/format_reward": 0.6979166865348816,
1452
+ "step": 103
1453
+ },
1454
+ {
1455
+ "completion_length": 2650.4063720703125,
1456
+ "entropy": 0.45458984375,
1457
+ "epoch": 0.11885714285714286,
1458
+ "grad_norm": 0.14005857706069946,
1459
+ "kl": 0.00519561767578125,
1460
+ "learning_rate": 3.3412909903738936e-07,
1461
+ "loss": 0.0002,
1462
+ "reward": 0.572916679084301,
1463
+ "reward_std": 0.42935075983405113,
1464
+ "rewards/accuracy_reward": 0.09375000186264515,
1465
+ "rewards/format_reward": 0.479166679084301,
1466
+ "step": 104
1467
+ },
1468
+ {
1469
+ "completion_length": 2248.6666870117188,
1470
+ "entropy": 0.35205078125,
1471
+ "epoch": 0.12,
1472
+ "grad_norm": 0.18584585189819336,
1473
+ "kl": 0.0032196044921875,
1474
+ "learning_rate": 3.250000000000001e-07,
1475
+ "loss": 0.0001,
1476
+ "reward": 0.885416716337204,
1477
+ "reward_std": 0.5438356846570969,
1478
+ "rewards/accuracy_reward": 0.25000000838190317,
1479
+ "rewards/format_reward": 0.6354166865348816,
1480
+ "step": 105
1481
+ },
1482
+ {
1483
+ "completion_length": 2213.1250915527344,
1484
+ "entropy": 0.30859375,
1485
+ "epoch": 0.12114285714285715,
1486
+ "grad_norm": 0.13285432755947113,
1487
+ "kl": 0.003154754638671875,
1488
+ "learning_rate": 3.159927424318531e-07,
1489
+ "loss": 0.0001,
1490
+ "reward": 1.0625000204890966,
1491
+ "reward_std": 0.4201487675309181,
1492
+ "rewards/accuracy_reward": 0.40625,
1493
+ "rewards/format_reward": 0.6562500055879354,
1494
+ "step": 106
1495
+ },
1496
+ {
1497
+ "completion_length": 2495.4166870117188,
1498
+ "entropy": 0.54296875,
1499
+ "epoch": 0.12228571428571429,
1500
+ "grad_norm": 0.2279452681541443,
1501
+ "kl": 0.005664825439453125,
1502
+ "learning_rate": 3.0711220392181934e-07,
1503
+ "loss": 0.0002,
1504
+ "reward": 0.8437500298023224,
1505
+ "reward_std": 0.39580530673265457,
1506
+ "rewards/accuracy_reward": 0.19791666697710752,
1507
+ "rewards/format_reward": 0.645833358168602,
1508
+ "step": 107
1509
+ },
1510
+ {
1511
+ "completion_length": 2470.9896545410156,
1512
+ "entropy": 0.41259765625,
1513
+ "epoch": 0.12342857142857143,
1514
+ "grad_norm": 0.1852155178785324,
1515
+ "kl": 0.005123138427734375,
1516
+ "learning_rate": 2.9836319343816397e-07,
1517
+ "loss": 0.0002,
1518
+ "reward": 0.7604167014360428,
1519
+ "reward_std": 0.44793232530355453,
1520
+ "rewards/accuracy_reward": 0.1770833358168602,
1521
+ "rewards/format_reward": 0.583333358168602,
1522
+ "step": 108
1523
+ },
1524
+ {
1525
+ "completion_length": 2746.9166870117188,
1526
+ "entropy": 0.41796875,
1527
+ "epoch": 0.12457142857142857,
1528
+ "grad_norm": 0.15881556272506714,
1529
+ "kl": 0.00360107421875,
1530
+ "learning_rate": 2.897504487244061e-07,
1531
+ "loss": 0.0001,
1532
+ "reward": 0.6458333656191826,
1533
+ "reward_std": 0.3334706202149391,
1534
+ "rewards/accuracy_reward": 0.15625,
1535
+ "rewards/format_reward": 0.4895833507180214,
1536
+ "step": 109
1537
+ },
1538
+ {
1539
+ "completion_length": 2482.5000610351562,
1540
+ "entropy": 0.400390625,
1541
+ "epoch": 0.12571428571428572,
1542
+ "grad_norm": 0.1606108397245407,
1543
+ "kl": 0.0030384063720703125,
1544
+ "learning_rate": 2.812786337337463e-07,
1545
+ "loss": 0.0001,
1546
+ "reward": 0.8750000298023224,
1547
+ "reward_std": 0.560508705675602,
1548
+ "rewards/accuracy_reward": 0.22916667349636555,
1549
+ "rewards/format_reward": 0.645833358168602,
1550
+ "step": 110
1551
+ },
1552
+ {
1553
+ "completion_length": 2674.791748046875,
1554
+ "entropy": 0.486572265625,
1555
+ "epoch": 0.12685714285714286,
1556
+ "grad_norm": 0.14915555715560913,
1557
+ "kl": 0.00421905517578125,
1558
+ "learning_rate": 2.729523361034538e-07,
1559
+ "loss": 0.0002,
1560
+ "reward": 0.614583358168602,
1561
+ "reward_std": 0.38519187271595,
1562
+ "rewards/accuracy_reward": 0.12500000186264515,
1563
+ "rewards/format_reward": 0.4895833507180214,
1564
+ "step": 111
1565
+ },
1566
+ {
1567
+ "completion_length": 2968.5834350585938,
1568
+ "entropy": 0.470703125,
1569
+ "epoch": 0.128,
1570
+ "grad_norm": 0.2177121490240097,
1571
+ "kl": 0.0030193328857421875,
1572
+ "learning_rate": 2.6477606467058035e-07,
1573
+ "loss": 0.0001,
1574
+ "reward": 0.8229166865348816,
1575
+ "reward_std": 0.5380749329924583,
1576
+ "rewards/accuracy_reward": 0.2708333386108279,
1577
+ "rewards/format_reward": 0.5520833507180214,
1578
+ "step": 112
1579
+ },
1580
+ {
1581
+ "completion_length": 1850.3125457763672,
1582
+ "entropy": 0.37451171875,
1583
+ "epoch": 0.12914285714285714,
1584
+ "grad_norm": 0.18240460753440857,
1585
+ "kl": 0.004482269287109375,
1586
+ "learning_rate": 2.567542470303452e-07,
1587
+ "loss": 0.0002,
1588
+ "reward": 0.9375000149011612,
1589
+ "reward_std": 0.39232632517814636,
1590
+ "rewards/accuracy_reward": 0.1979166716337204,
1591
+ "rewards/format_reward": 0.7395833432674408,
1592
+ "step": 113
1593
+ },
1594
+ {
1595
+ "completion_length": 1991.9479675292969,
1596
+ "entropy": 0.34228515625,
1597
+ "epoch": 0.13028571428571428,
1598
+ "grad_norm": 0.12856441736221313,
1599
+ "kl": 0.00384521484375,
1600
+ "learning_rate": 2.488912271385139e-07,
1601
+ "loss": 0.0002,
1602
+ "reward": 0.9687500298023224,
1603
+ "reward_std": 0.41063307225704193,
1604
+ "rewards/accuracy_reward": 0.16666666977107525,
1605
+ "rewards/format_reward": 0.802083358168602,
1606
+ "step": 114
1607
+ },
1608
+ {
1609
+ "completion_length": 2642.2813110351562,
1610
+ "entropy": 0.470703125,
1611
+ "epoch": 0.13142857142857142,
1612
+ "grad_norm": 0.1557956039905548,
1613
+ "kl": 0.005847930908203125,
1614
+ "learning_rate": 2.411912629590699e-07,
1615
+ "loss": 0.0002,
1616
+ "reward": 0.7500000149011612,
+ "reward_std": 0.3538191542029381,
+ "rewards/accuracy_reward": 0.2500000074505806,
+ "rewards/format_reward": 0.5000000149011612,
+ "step": 115
+ },
+ {
+ "completion_length": 3452.9271240234375,
+ "entropy": 0.5498046875,
+ "epoch": 0.13257142857142856,
+ "grad_norm": 0.12163397669792175,
+ "kl": 0.00417327880859375,
+ "learning_rate": 2.336585241584522e-07,
+ "loss": 0.0002,
+ "reward": 0.3125000009313226,
+ "reward_std": 0.3494237996637821,
+ "rewards/accuracy_reward": 0.09375000279396772,
+ "rewards/format_reward": 0.21875000558793545,
+ "step": 116
+ },
+ {
+ "completion_length": 2859.000030517578,
+ "entropy": 0.5322265625,
+ "epoch": 0.1337142857142857,
+ "grad_norm": 0.23236258327960968,
+ "kl": 0.0061187744140625,
+ "learning_rate": 2.2629708984760706e-07,
+ "loss": 0.0002,
+ "reward": 0.48958336375653744,
+ "reward_std": 0.34661681205034256,
+ "rewards/accuracy_reward": 0.052083334885537624,
+ "rewards/format_reward": 0.4375000027939677,
+ "step": 117
+ },
+ {
+ "completion_length": 2824.354248046875,
+ "entropy": 0.3935546875,
+ "epoch": 0.13485714285714287,
+ "grad_norm": 0.11868440359830856,
+ "kl": 0.002780914306640625,
+ "learning_rate": 2.1911094637307714e-07,
+ "loss": 0.0001,
+ "reward": 1.0416666865348816,
+ "reward_std": 0.6117755249142647,
+ "rewards/accuracy_reward": 0.4270833358168602,
+ "rewards/format_reward": 0.6145833432674408,
+ "step": 118
+ },
+ {
+ "completion_length": 2026.5937805175781,
+ "entropy": 0.446533203125,
+ "epoch": 0.136,
+ "grad_norm": 0.1918383240699768,
+ "kl": 0.00518035888671875,
+ "learning_rate": 2.1210398515832536e-07,
+ "loss": 0.0002,
+ "reward": 0.989583358168602,
+ "reward_std": 0.3108450919389725,
+ "rewards/accuracy_reward": 0.2604166716337204,
+ "rewards/format_reward": 0.7291666865348816,
+ "step": 119
+ },
+ {
+ "completion_length": 1974.4270935058594,
+ "entropy": 0.43359375,
+ "epoch": 0.13714285714285715,
+ "grad_norm": 0.19848716259002686,
+ "kl": 0.006320953369140625,
+ "learning_rate": 2.0528000059645995e-07,
+ "loss": 0.0003,
+ "reward": 0.864583358168602,
+ "reward_std": 0.4301687255501747,
+ "rewards/accuracy_reward": 0.16666666697710752,
+ "rewards/format_reward": 0.6979166865348816,
+ "step": 120
+ },
+ {
+ "completion_length": 1066.2396240234375,
+ "entropy": 0.30810546875,
+ "epoch": 0.1382857142857143,
+ "grad_norm": 0.15486152470111847,
+ "kl": 0.00482940673828125,
+ "learning_rate": 1.986426879955034e-07,
+ "loss": 0.0002,
+ "reward": 1.2500000298023224,
+ "reward_std": 0.25760992616415024,
+ "rewards/accuracy_reward": 0.3020833386108279,
+ "rewards/format_reward": 0.9479166865348816,
+ "step": 121
+ },
+ {
+ "completion_length": 2534.8958740234375,
+ "entropy": 0.4521484375,
+ "epoch": 0.13942857142857143,
+ "grad_norm": 0.12797077000141144,
+ "kl": 0.00380706787109375,
+ "learning_rate": 1.9219564157731844e-07,
+ "loss": 0.0002,
+ "reward": 0.8229167014360428,
+ "reward_std": 0.3391122668981552,
+ "rewards/accuracy_reward": 0.20833333861082792,
+ "rewards/format_reward": 0.6145833432674408,
+ "step": 122
+ },
+ {
+ "completion_length": 2572.791748046875,
+ "entropy": 0.451416015625,
+ "epoch": 0.14057142857142857,
+ "grad_norm": 0.1227995902299881,
+ "kl": 0.0032405853271484375,
+ "learning_rate": 1.8594235253127372e-07,
+ "loss": 0.0001,
+ "reward": 0.7083333544433117,
+ "reward_std": 0.3739900141954422,
+ "rewards/accuracy_reward": 0.1354166679084301,
+ "rewards/format_reward": 0.5729166828095913,
+ "step": 123
+ },
+ {
+ "completion_length": 2198.6563110351562,
+ "entropy": 0.33984375,
+ "epoch": 0.1417142857142857,
+ "grad_norm": 0.2465568333864212,
+ "kl": 0.015537261962890625,
+ "learning_rate": 1.7988620712370195e-07,
+ "loss": 0.0006,
+ "reward": 0.9687500298023224,
+ "reward_std": 0.5658619552850723,
+ "rewards/accuracy_reward": 0.25,
+ "rewards/format_reward": 0.7187500298023224,
+ "step": 124
+ },
+ {
+ "completion_length": 2865.250030517578,
+ "entropy": 0.431640625,
+ "epoch": 0.14285714285714285,
+ "grad_norm": 0.10473211109638214,
+ "kl": 0.00345611572265625,
+ "learning_rate": 1.7403048486417868e-07,
+ "loss": 0.0001,
+ "reward": 0.6979166697710752,
+ "reward_std": 0.3267679661512375,
+ "rewards/accuracy_reward": 0.30208333395421505,
+ "rewards/format_reward": 0.3958333386108279,
+ "step": 125
+ },
+ {
+ "completion_length": 2886.2188720703125,
+ "entropy": 0.4560546875,
+ "epoch": 0.144,
+ "grad_norm": 0.09086798876523972,
+ "kl": 0.0029506683349609375,
+ "learning_rate": 1.6837835672960831e-07,
+ "loss": 0.0001,
+ "reward": 0.7395833358168602,
+ "reward_std": 0.3677559196949005,
+ "rewards/accuracy_reward": 0.2083333432674408,
+ "rewards/format_reward": 0.5312500149011612,
+ "step": 126
+ },
+ {
+ "completion_length": 2771.3438415527344,
+ "entropy": 0.41796875,
+ "epoch": 0.14514285714285713,
+ "grad_norm": 0.15809084475040436,
+ "kl": 0.004436492919921875,
+ "learning_rate": 1.6293288344708566e-07,
+ "loss": 0.0002,
+ "reward": 0.635416679084301,
+ "reward_std": 0.49130160734057426,
+ "rewards/accuracy_reward": 0.0937500037252903,
+ "rewards/format_reward": 0.541666679084301,
+ "step": 127
+ },
+ {
+ "completion_length": 2852.291748046875,
+ "entropy": 0.5322265625,
+ "epoch": 0.1462857142857143,
+ "grad_norm": 0.17982318997383118,
+ "kl": 0.004756927490234375,
+ "learning_rate": 1.5769701383645698e-07,
+ "loss": 0.0002,
+ "reward": 0.9166666967794299,
+ "reward_std": 0.5036755502223969,
+ "rewards/accuracy_reward": 0.3645833358168602,
+ "rewards/format_reward": 0.5520833535119891,
+ "step": 128
+ },
+ {
+ "completion_length": 3367.9583740234375,
+ "entropy": 0.49658203125,
+ "epoch": 0.14742857142857144,
+ "grad_norm": 0.1593668907880783,
+ "kl": 0.00467681884765625,
+ "learning_rate": 1.5267358321348285e-07,
+ "loss": 0.0002,
+ "reward": 0.4583333507180214,
+ "reward_std": 0.4737073630094528,
+ "rewards/accuracy_reward": 0.1666666753590107,
+ "rewards/format_reward": 0.2916666679084301,
+ "step": 129
+ },
+ {
+ "completion_length": 2817.7709350585938,
+ "entropy": 0.5,
+ "epoch": 0.14857142857142858,
+ "grad_norm": 0.16384749114513397,
+ "kl": 0.00354766845703125,
+ "learning_rate": 1.4786531185446452e-07,
+ "loss": 0.0001,
+ "reward": 0.479166679084301,
+ "reward_std": 0.40873220562934875,
+ "rewards/accuracy_reward": 0.06250000186264515,
+ "rewards/format_reward": 0.4166666716337204,
+ "step": 130
+ },
+ {
+ "completion_length": 2720.8021850585938,
+ "entropy": 0.49658203125,
+ "epoch": 0.14971428571428572,
+ "grad_norm": 0.24187295138835907,
+ "kl": 0.004924774169921875,
+ "learning_rate": 1.432748035231658e-07,
+ "loss": 0.0002,
+ "reward": 0.9062500298023224,
+ "reward_std": 0.504379153251648,
+ "rewards/accuracy_reward": 0.375,
+ "rewards/format_reward": 0.5312500223517418,
+ "step": 131
+ },
+ {
+ "completion_length": 2590.0521545410156,
+ "entropy": 0.4248046875,
+ "epoch": 0.15085714285714286,
+ "grad_norm": 0.13521018624305725,
+ "kl": 0.003265380859375,
+ "learning_rate": 1.3890454406082956e-07,
+ "loss": 0.0001,
+ "reward": 0.833333358168602,
+ "reward_std": 0.5318443104624748,
+ "rewards/accuracy_reward": 0.2916666744276881,
+ "rewards/format_reward": 0.541666679084301,
+ "step": 132
+ },
+ {
+ "completion_length": 3037.8854370117188,
+ "entropy": 0.49072265625,
+ "epoch": 0.152,
+ "grad_norm": 0.17249611020088196,
+ "kl": 0.004932403564453125,
+ "learning_rate": 1.3475690004005097e-07,
+ "loss": 0.0002,
+ "reward": 0.5104166865348816,
+ "reward_std": 0.2723224312067032,
+ "rewards/accuracy_reward": 0.1041666716337204,
+ "rewards/format_reward": 0.4062500149011612,
+ "step": 133
+ },
+ {
+ "completion_length": 2485.322998046875,
+ "entropy": 0.5224609375,
+ "epoch": 0.15314285714285714,
+ "grad_norm": 0.16694706678390503,
+ "kl": 0.00672149658203125,
+ "learning_rate": 1.308341174832359e-07,
+ "loss": 0.0003,
+ "reward": 0.8750000149011612,
+ "reward_std": 0.49518734961748123,
+ "rewards/accuracy_reward": 0.2500000102445483,
+ "rewards/format_reward": 0.6250000149011612,
+ "step": 134
+ },
+ {
+ "completion_length": 1746.3958740234375,
+ "entropy": 0.373046875,
+ "epoch": 0.15428571428571428,
+ "grad_norm": 0.1858878880739212,
+ "kl": 0.007049560546875,
+ "learning_rate": 1.2713832064634125e-07,
+ "loss": 0.0003,
+ "reward": 1.1145833432674408,
+ "reward_std": 0.3352552205324173,
+ "rewards/accuracy_reward": 0.43750000558793545,
+ "rewards/format_reward": 0.6770833432674408,
+ "step": 135
+ },
+ {
+ "completion_length": 2145.4584045410156,
+ "entropy": 0.35986328125,
+ "epoch": 0.15542857142857142,
+ "grad_norm": 0.196150541305542,
+ "kl": 0.00507354736328125,
+ "learning_rate": 1.2367151086855187e-07,
+ "loss": 0.0002,
+ "reward": 0.9895833358168602,
+ "reward_std": 0.6333772391080856,
+ "rewards/accuracy_reward": 0.3125000074505806,
+ "rewards/format_reward": 0.6770833507180214,
+ "step": 136
+ },
+ {
+ "completion_length": 2813.510467529297,
+ "entropy": 0.387939453125,
+ "epoch": 0.15657142857142858,
+ "grad_norm": 0.13014180958271027,
+ "kl": 0.00402069091796875,
+ "learning_rate": 1.2043556548852063e-07,
+ "loss": 0.0002,
+ "reward": 0.677083358168602,
+ "reward_std": 0.5024930611252785,
+ "rewards/accuracy_reward": 0.1458333395421505,
+ "rewards/format_reward": 0.5312500260770321,
+ "step": 137
+ },
+ {
+ "completion_length": 2061.0000610351562,
+ "entropy": 0.35107421875,
+ "epoch": 0.15771428571428572,
+ "grad_norm": 0.10559725016355515,
+ "kl": 0.0037689208984375,
+ "learning_rate": 1.1743223682775649e-07,
+ "loss": 0.0002,
+ "reward": 0.9270833730697632,
+ "reward_std": 0.30403000861406326,
+ "rewards/accuracy_reward": 0.18750000558793545,
+ "rewards/format_reward": 0.739583358168602,
+ "step": 138
+ },
+ {
+ "completion_length": 3106.291748046875,
+ "entropy": 0.55615234375,
+ "epoch": 0.15885714285714286,
+ "grad_norm": 0.15882417559623718,
+ "kl": 0.00519561767578125,
+ "learning_rate": 1.1466315124171128e-07,
+ "loss": 0.0002,
+ "reward": 0.708333358168602,
+ "reward_std": 0.5456142984330654,
+ "rewards/accuracy_reward": 0.1666666753590107,
+ "rewards/format_reward": 0.541666679084301,
+ "step": 139
+ },
+ {
+ "completion_length": 2395.885498046875,
+ "entropy": 0.4853515625,
+ "epoch": 0.16,
+ "grad_norm": 0.2862064242362976,
+ "kl": 0.006809234619140625,
+ "learning_rate": 1.1212980823907929e-07,
+ "loss": 0.0003,
+ "reward": 0.7500000298023224,
+ "reward_std": 0.38956041634082794,
+ "rewards/accuracy_reward": 0.1666666679084301,
+ "rewards/format_reward": 0.5833333507180214,
+ "step": 140
+ },
+ {
+ "completion_length": 1969.2292175292969,
+ "entropy": 0.33935546875,
+ "epoch": 0.16114285714285714,
+ "grad_norm": 0.17285719513893127,
+ "kl": 0.0047760009765625,
+ "learning_rate": 1.0983357966978745e-07,
+ "loss": 0.0002,
+ "reward": 0.9895833730697632,
+ "reward_std": 0.520443569868803,
+ "rewards/accuracy_reward": 0.2187500074505806,
+ "rewards/format_reward": 0.770833358168602,
+ "step": 141
+ },
+ {
+ "completion_length": 2584.9688110351562,
+ "entropy": 0.447998046875,
+ "epoch": 0.16228571428571428,
+ "grad_norm": 0.13187262415885925,
+ "kl": 0.0047740936279296875,
+ "learning_rate": 1.0777570898211405e-07,
+ "loss": 0.0002,
+ "reward": 0.9166667014360428,
+ "reward_std": 0.4435262605547905,
+ "rewards/accuracy_reward": 0.2187500111758709,
+ "rewards/format_reward": 0.6979166865348816,
+ "step": 142
+ },
+ {
+ "completion_length": 2300.7500610351562,
+ "entropy": 0.4326171875,
+ "epoch": 0.16342857142857142,
+ "grad_norm": 0.25766721367836,
+ "kl": 0.00977325439453125,
+ "learning_rate": 1.0595731054933934e-07,
+ "loss": 0.0004,
+ "reward": 0.6875000074505806,
+ "reward_std": 0.3443669453263283,
+ "rewards/accuracy_reward": 0.0833333358168602,
+ "rewards/format_reward": 0.6041666716337204,
+ "step": 143
+ },
+ {
+ "completion_length": 2849.4688110351562,
+ "entropy": 0.45556640625,
+ "epoch": 0.16457142857142856,
+ "grad_norm": 0.14796483516693115,
+ "kl": 0.00493621826171875,
+ "learning_rate": 1.0437936906629334e-07,
+ "loss": 0.0002,
+ "reward": 0.677083358168602,
+ "reward_std": 0.4717573896050453,
+ "rewards/accuracy_reward": 0.2187500037252903,
+ "rewards/format_reward": 0.4583333432674408,
+ "step": 144
+ },
+ {
+ "completion_length": 1885.4479522705078,
+ "entropy": 0.353515625,
+ "epoch": 0.1657142857142857,
+ "grad_norm": 0.1617114096879959,
+ "kl": 0.005008697509765625,
+ "learning_rate": 1.0304273901612565e-07,
+ "loss": 0.0002,
+ "reward": 1.0208333730697632,
+ "reward_std": 0.3080247640609741,
+ "rewards/accuracy_reward": 0.3020833460614085,
+ "rewards/format_reward": 0.7187500149011612,
+ "step": 145
+ },
+ {
+ "completion_length": 1947.0312805175781,
+ "entropy": 0.376953125,
+ "epoch": 0.16685714285714287,
+ "grad_norm": 0.11680302768945694,
+ "kl": 0.0033721923828125,
+ "learning_rate": 1.0194814420758804e-07,
+ "loss": 0.0001,
+ "reward": 0.8750000149011612,
+ "reward_std": 0.22604453563690186,
+ "rewards/accuracy_reward": 0.07291666977107525,
+ "rewards/format_reward": 0.8020833432674408,
+ "step": 146
+ },
+ {
+ "completion_length": 2215.7084045410156,
+ "entropy": 0.392578125,
+ "epoch": 0.168,
+ "grad_norm": 0.21416479349136353,
+ "kl": 0.00583648681640625,
+ "learning_rate": 1.0109617738307911e-07,
+ "loss": 0.0002,
+ "reward": 0.8437500149011612,
+ "reward_std": 0.5214347615838051,
+ "rewards/accuracy_reward": 0.1770833395421505,
+ "rewards/format_reward": 0.6666666865348816,
+ "step": 147
+ },
+ {
+ "completion_length": 1631.229248046875,
+ "entropy": 0.302734375,
+ "epoch": 0.16914285714285715,
+ "grad_norm": 0.17106008529663086,
+ "kl": 0.00476837158203125,
+ "learning_rate": 1.0048729989766394e-07,
+ "loss": 0.0002,
+ "reward": 0.9895833730697632,
+ "reward_std": 0.29220427572727203,
+ "rewards/accuracy_reward": 0.13541666977107525,
+ "rewards/format_reward": 0.8541666865348816,
+ "step": 148
+ },
+ {
+ "completion_length": 2407.8229370117188,
+ "entropy": 0.3408203125,
+ "epoch": 0.1702857142857143,
+ "grad_norm": 0.15983973443508148,
+ "kl": 0.008087158203125,
+ "learning_rate": 1.0012184146924223e-07,
+ "loss": 0.0003,
+ "reward": 0.9270833432674408,
+ "reward_std": 0.48707588016986847,
+ "rewards/accuracy_reward": 0.2500000074505806,
+ "rewards/format_reward": 0.6770833432674408,
+ "step": 149
+ },
+ {
+ "completion_length": 2178.729217529297,
+ "entropy": 0.36962890625,
+ "epoch": 0.17142857142857143,
+ "grad_norm": 0.19081099331378937,
+ "kl": 0.004444122314453125,
+ "learning_rate": 1e-07,
+ "loss": 0.0002,
+ "reward": 1.0312500298023224,
+ "reward_std": 0.5531396120786667,
+ "rewards/accuracy_reward": 0.291666679084301,
+ "rewards/format_reward": 0.7395833432674408,
+ "step": 150
+ },
+ {
+ "epoch": 0.17142857142857143,
+ "step": 150,
+ "total_flos": 0.0,
+ "train_loss": 0.00011944215420650531,
+ "train_runtime": 12092.6435,
+ "train_samples_per_second": 1.191,
+ "train_steps_per_second": 0.012
+ }
+ ],
+ "logging_steps": 1,
+ "max_steps": 150,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 1,
+ "save_steps": 50,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": true
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 0.0,
+ "train_batch_size": 6,
+ "trial_name": null,
+ "trial_params": null
+ }