Siddeshwar1625 committed
Commit 6cb9f23 · verified · 1 parent: f413b50

Delete self_play_hf_a10g_train
Files changed (30)
  1. self_play_hf_a10g_train/round_001/generator_train/README.md +0 -67
  2. self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/chat_template.jinja +0 -54
  3. self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/config.json +0 -57
  4. self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/generation_config.json +0 -13
  5. self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/model.safetensors +0 -3
  6. self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/optimizer.pt +0 -3
  7. self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/rng_state.pth +0 -3
  8. self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/scheduler.pt +0 -3
  9. self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/tokenizer.json +0 -3
  10. self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/tokenizer_config.json +0 -32
  11. self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/trainer_state.json +0 -764
  12. self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/training_args.bin +0 -3
  13. self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/chat_template.jinja +0 -54
  14. self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/config.json +0 -57
  15. self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/generation_config.json +0 -13
  16. self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/model.safetensors +0 -3
  17. self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/optimizer.pt +0 -3
  18. self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/rng_state.pth +0 -3
  19. self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/scheduler.pt +0 -3
  20. self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/tokenizer.json +0 -3
  21. self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/tokenizer_config.json +0 -32
  22. self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/trainer_state.json +0 -953
  23. self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/training_args.bin +0 -3
  24. self_play_hf_a10g_train/round_001/generator_train/final_model/chat_template.jinja +0 -54
  25. self_play_hf_a10g_train/round_001/generator_train/final_model/config.json +0 -57
  26. self_play_hf_a10g_train/round_001/generator_train/final_model/generation_config.json +0 -13
  27. self_play_hf_a10g_train/round_001/generator_train/final_model/model.safetensors +0 -3
  28. self_play_hf_a10g_train/round_001/generator_train/final_model/tokenizer.json +0 -3
  29. self_play_hf_a10g_train/round_001/generator_train/final_model/tokenizer_config.json +0 -32
  30. self_play_hf_a10g_train/round_001/generator_train/final_model/training_args.bin +0 -3
self_play_hf_a10g_train/round_001/generator_train/README.md DELETED
@@ -1,67 +0,0 @@
- ---
- base_model: Qwen/Qwen2.5-0.5B-Instruct
- library_name: transformers
- model_name: generator_train
- tags:
- - generated_from_trainer
- - grpo
- - trl
- licence: license
- ---
-
- # Model Card for generator_train
-
- This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct).
- It has been trained using [TRL](https://github.com/huggingface/trl).
-
- ## Quick start
-
- ```python
- from transformers import pipeline
-
- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="None", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
- ```
-
- ## Training procedure
-
- [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/siddeshwar2004-international-institute-of-information-te/osint-self-play-train/runs/d6lveb1e)
-
-
-
- This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
-
- ### Framework versions
-
- - TRL: 1.2.0
- - Transformers: 5.6.2
- - Pytorch: 2.11.0
- - Datasets: 4.8.4
- - Tokenizers: 0.22.2
-
- ## Citations
-
- Cite GRPO as:
-
- ```bibtex
- @article{shao2024deepseekmath,
- title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
- author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
- year = 2024,
- eprint = {arXiv:2402.03300},
- }
- ```
-
- Cite TRL as:
-
- ```bibtex
- @software{vonwerra2020trl,
- title = {{TRL: Transformers Reinforcement Learning}},
- author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
- license = {Apache-2.0},
- url = {https://github.com/huggingface/trl},
- year = {2020}
- }
- ```
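The deleted model card notes the model was trained with GRPO. The core of GRPO is sampling a group of completions per prompt and standardizing each completion's reward within its group to get an advantage. The sketch below is our own illustration of that group-relative step only, not TRL's implementation; the function name and the 1e-4 standard-deviation floor are arbitrary choices here.

```python
def grpo_advantages(rewards, group_size, eps=1e-4):
    """Group-relative advantages: standardize rewards within each group
    of `group_size` completions sampled for the same prompt."""
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = sum(group) / len(group)
        var = sum((r - mean) ** 2 for r in group) / len(group)
        std = var ** 0.5
        # eps keeps the division finite when every reward in a group is
        # identical (the "frac_reward_zero_std" case in the trainer logs)
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages
```

A group with zero reward spread yields zero advantage for all its members, which is why the trainer state below tracks `frac_reward_zero_std`.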
self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/chat_template.jinja DELETED
@@ -1,54 +0,0 @@
- {%- if tools %}
- {{- '<|im_start|>system\n' }}
- {%- if messages[0]['role'] == 'system' %}
- {{- messages[0]['content'] }}
- {%- else %}
- {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
- {%- endif %}
- {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
- {%- for tool in tools %}
- {{- "\n" }}
- {{- tool | tojson }}
- {%- endfor %}
- {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
- {%- else %}
- {%- if messages[0]['role'] == 'system' %}
- {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
- {%- else %}
- {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
- {%- endif %}
- {%- endif %}
- {%- for message in messages %}
- {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
- {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
- {%- elif message.role == "assistant" %}
- {{- '<|im_start|>' + message.role }}
- {%- if message.content %}
- {{- '\n' + message.content }}
- {%- endif %}
- {%- for tool_call in message.tool_calls %}
- {%- if tool_call.function is defined %}
- {%- set tool_call = tool_call.function %}
- {%- endif %}
- {{- '\n<tool_call>\n{"name": "' }}
- {{- tool_call.name }}
- {{- '", "arguments": ' }}
- {{- tool_call.arguments | tojson }}
- {{- '}\n</tool_call>' }}
- {%- endfor %}
- {{- '<|im_end|>\n' }}
- {%- elif message.role == "tool" %}
- {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
- {{- '<|im_start|>user' }}
- {%- endif %}
- {{- '\n<tool_response>\n' }}
- {{- message.content }}
- {{- '\n</tool_response>' }}
- {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
- {{- '<|im_end|>\n' }}
- {%- endif %}
- {%- endif %}
- {%- endfor %}
- {%- if add_generation_prompt %}
- {{- '<|im_start|>assistant\n' }}
- {%- endif %}
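The deleted template is Qwen's ChatML layout. As a readable cross-check of what it renders, here is a minimal Python re-rendering of the no-tools path only (default or explicit system message, then plain user/assistant turns); tool calls and tool responses are omitted, and the function name is our own.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Simplified re-rendering of the no-tools branch of the Qwen chat
    template above. Each turn becomes <|im_start|>role\ncontent<|im_end|>."""
    if messages and messages[0]["role"] == "system":
        system = messages[0]["content"]
        messages = messages[1:]
    else:
        system = ("You are Qwen, created by Alibaba Cloud. "
                  "You are a helpful assistant.")
    out = f"<|im_start|>system\n{system}<|im_end|>\n"
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # open an assistant turn for the model to complete
        out += "<|im_start|>assistant\n"
    return out
```

In practice one would call `tokenizer.apply_chat_template(...)`, which executes the Jinja template directly; this sketch is only to make the output format concrete.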
self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/config.json DELETED
@@ -1,57 +0,0 @@
- {
- "architectures": [
- "Qwen2ForCausalLM"
- ],
- "attention_dropout": 0.0,
- "bos_token_id": null,
- "dtype": "float32",
- "eos_token_id": 151645,
- "hidden_act": "silu",
- "hidden_size": 896,
- "initializer_range": 0.02,
- "intermediate_size": 4864,
- "layer_types": [
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention"
- ],
- "max_position_embeddings": 32768,
- "max_window_layers": 21,
- "model_type": "qwen2",
- "num_attention_heads": 14,
- "num_hidden_layers": 24,
- "num_key_value_heads": 2,
- "pad_token_id": 151643,
- "rms_norm_eps": 1e-06,
- "rope_parameters": {
- "rope_theta": 1000000.0,
- "rope_type": "default"
- },
- "sliding_window": null,
- "tie_word_embeddings": true,
- "transformers_version": "5.6.2",
- "use_cache": false,
- "use_sliding_window": false,
- "vocab_size": 151936
- }
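The shape fields in this config imply the base model's roughly 0.5B parameter count. The tally below is a sketch under an assumption about the standard qwen2 layout (Q/K/V projections with biases, bias-free o_proj, SwiGLU MLP with gate/up/down, two RMSNorms per layer, tied input/output embeddings); every number is taken from the config above.

```python
# Shape hyper-parameters from the deleted config.json
hidden, inter, layers = 896, 4864, 24
heads, kv_heads, vocab = 14, 2, 151936

head_dim = hidden // heads        # 64
kv_dim = kv_heads * head_dim      # 128 (grouped-query attention)

embed = vocab * hidden            # lm_head shares this (tie_word_embeddings)
attn = (hidden * hidden + hidden)           # q_proj (+ bias)
attn += 2 * (hidden * kv_dim + kv_dim)      # k_proj, v_proj (+ biases)
attn += hidden * hidden                     # o_proj (no bias)
mlp = 3 * hidden * inter                    # gate_proj + up_proj + down_proj
norms = 2 * hidden                          # two RMSNorms per layer

total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
print(total)
```

At `"dtype": "float32"` this comes to `total * 4` ≈ 1.98 GB, which matches the `model.safetensors` size recorded later in this diff (1,976,163,472 bytes) up to the safetensors header.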
self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/generation_config.json DELETED
@@ -1,13 +0,0 @@
- {
- "do_sample": true,
- "eos_token_id": [
- 151645,
- 151643
- ],
- "pad_token_id": 151643,
- "repetition_penalty": 1.1,
- "temperature": 0.7,
- "top_k": 20,
- "top_p": 0.8,
- "transformers_version": "5.6.2"
- }
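This generation config pins sampled decoding with temperature 0.7, top_k 20, and top_p 0.8. As an illustration of how those three knobs compose (a from-scratch sketch of the usual order of operations, not the transformers implementation), the function below builds the filtered, renormalized distribution a sampler would draw from.

```python
import math

def filter_logits(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature scaling,
    top-k filtering, then top-p (nucleus) filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # stable softmax
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = sorted(((i, e / z) for i, e in enumerate(exps)),
                   key=lambda p: p[1], reverse=True)
    probs = probs[:top_k]                    # keep the k likeliest tokens
    kept, mass = [], 0.0
    for i, p in probs:                       # smallest prefix reaching top_p
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    z = sum(p for _, p in kept)              # renormalize survivors
    return {i: p / z for i, p in kept}
```

Lower temperature sharpens the distribution before filtering, so with these settings the model samples from a small, high-confidence nucleus.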
self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:0827989c90bcaad483dee84b62c1ba69fdf377e659087667ae2a28e1992a2fc6
- size 1976163472
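This three-line stub is a Git LFS pointer file: the repository stores only the spec version, a SHA-256 object id, and the byte size, while the actual ~1.9 GB weights live in LFS storage. A minimal parser for the pointer format (key-value pairs separated by a single space):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])     # size is always bytes
    return fields

pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:0827989c90bcaad483dee84b62c1ba69fdf377e659087667ae2a28e1992a2fc6
size 1976163472
"""
```

The same format appears for every binary in this commit (optimizer.pt, rng_state.pth, scheduler.pt, tokenizer.json), which is why each shows as a 3-line deletion.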
self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/optimizer.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:ceec2f7113291001d1654c27484e896c530e9cfd6710e3ffcfdcc4fb0eeee677
- size 3952509771
self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/rng_state.pth DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:b5725c96886e63d92a4804b9e1b509a94ed72b0ac9da7f6955ce8a319b258d37
- size 14645
self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/scheduler.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:44b7842b78ac3e944fa674f4961bef93e595fe53f0c4495643408e144ea2a074
- size 1465
self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/tokenizer.json DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
- size 11421892
self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/tokenizer_config.json DELETED
@@ -1,32 +0,0 @@
- {
- "add_prefix_space": false,
- "backend": "tokenizers",
- "bos_token": null,
- "clean_up_tokenization_spaces": false,
- "eos_token": "<|im_end|>",
- "errors": "replace",
- "extra_special_tokens": [
- "<|im_start|>",
- "<|im_end|>",
- "<|object_ref_start|>",
- "<|object_ref_end|>",
- "<|box_start|>",
- "<|box_end|>",
- "<|quad_start|>",
- "<|quad_end|>",
- "<|vision_start|>",
- "<|vision_end|>",
- "<|vision_pad|>",
- "<|image_pad|>",
- "<|video_pad|>"
- ],
- "is_local": false,
- "local_files_only": false,
- "model_max_length": 131072,
- "pad_token": "<|endoftext|>",
- "padding_side": "left",
- "split_special_tokens": false,
- "tokenizer_class": "Qwen2Tokenizer",
- "truncation_side": "left",
- "unk_token": null
- }
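This tokenizer config sets `"padding_side": "left"`, which matters for decoder-only generation: padding on the left keeps every sequence's most recent tokens adjacent to the position being generated. A minimal sketch of what left padding does to a batch (pad id 151643, `<|endoftext|>`, per the config.json above):

```python
def pad_left(batch, pad_id=151643):
    """Left-pad variable-length token-id sequences to a common length,
    as padding_side='left' does when batching prompts for generation."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + list(seq) for seq in batch]
```

With right padding instead, shorter prompts would end in pad tokens and the model would be asked to continue from padding, which is why generation pipelines conventionally pad on the left.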
self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/trainer_state.json DELETED
@@ -1,764 +0,0 @@
- {
- "best_global_step": null,
- "best_metric": null,
- "best_model_checkpoint": null,
- "epoch": 1.6666666666666665,
- "eval_steps": 500,
- "global_step": 40,
- "is_hyper_param_search": false,
- "is_local_process_zero": true,
- "is_world_process_zero": true,
- "log_history": [
- {
- "clip_ratio/high_max": 0.001953125,
- "clip_ratio/high_mean": 0.001953125,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.001953125,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 1.9362258911132812,
- "epoch": 0.041666666666666664,
- "frac_reward_zero_std": 0.0,
- "grad_norm": 3.3426833152770996,
- "kl": 0.0005498419050127268,
- "learning_rate": 5e-06,
- "loss": 0.13995857536792755,
- "num_tokens": 25244.0,
- "reward": -0.4352343678474426,
- "reward_std": 0.306624174118042,
- "rewards/GeneratorRewardFunction/mean": -0.4352343678474426,
- "rewards/GeneratorRewardFunction/std": 0.306624174118042,
- "step": 1,
- "step_time": 12.578062469000088
- },
- {
- "clip_ratio/high_max": 0.00390625,
- "clip_ratio/high_mean": 0.00390625,
- "clip_ratio/low_mean": 0.0026041667442768812,
- "clip_ratio/low_min": 0.0026041667442768812,
- "clip_ratio/region_mean": 0.0065104165114462376,
- "entropy": 1.2686206102371216,
- "epoch": 0.08333333333333333,
- "grad_norm": 2.8547239303588867,
- "kl": 0.001546451705507934,
- "learning_rate": 4.9000000000000005e-06,
- "loss": -0.06681232899427414,
- "step": 2,
- "step_time": 0.22036709600001814
- },
- {
- "clip_ratio/high_max": 0.012369791977107525,
- "clip_ratio/high_mean": 0.012369791977107525,
- "clip_ratio/low_mean": 0.015625,
- "clip_ratio/low_min": 0.015625,
- "clip_ratio/region_mean": 0.02799479104578495,
- "entropy": 1.8668650388717651,
- "epoch": 0.125,
- "grad_norm": 2.4686105251312256,
- "kl": 0.005345983896404505,
- "learning_rate": 4.800000000000001e-06,
- "loss": 0.010777520947158337,
- "step": 3,
- "step_time": 0.2199646679999887
- },
- {
- "clip_ratio/high_max": 0.02083333395421505,
- "clip_ratio/high_mean": 0.02083333395421505,
- "clip_ratio/low_mean": 0.010416666977107525,
- "clip_ratio/low_min": 0.010416666977107525,
- "clip_ratio/region_mean": 0.03125,
- "entropy": 1.1842881441116333,
- "epoch": 0.16666666666666666,
- "grad_norm": 1.569398045539856,
- "kl": 0.0072342646308243275,
- "learning_rate": 4.7e-06,
- "loss": -0.08198019117116928,
- "step": 4,
- "step_time": 0.2201611520000597
- },
- {
- "clip_ratio/high_max": 0.001953125,
- "clip_ratio/high_mean": 0.001953125,
- "clip_ratio/low_mean": 0.0006510416860692203,
- "clip_ratio/low_min": 0.0006510416860692203,
- "clip_ratio/region_mean": 0.0026041667442768812,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 1.3128995895385742,
- "epoch": 0.20833333333333334,
- "frac_reward_zero_std": 0.0,
- "grad_norm": 2.202421188354492,
- "kl": 0.0129135362803936,
- "learning_rate": 4.600000000000001e-06,
- "loss": -0.0841849148273468,
- "num_tokens": 50440.0,
- "reward": -0.3341406285762787,
- "reward_std": 0.3155691623687744,
- "rewards/GeneratorRewardFunction/mean": -0.3341406285762787,
- "rewards/GeneratorRewardFunction/std": 0.3155691623687744,
- "step": 5,
- "step_time": 12.076087126999937
- },
- {
- "clip_ratio/high_max": 0.0026041667442768812,
- "clip_ratio/high_mean": 0.0026041667442768812,
- "clip_ratio/low_mean": 0.0052083334885537624,
- "clip_ratio/low_min": 0.0052083334885537624,
- "clip_ratio/region_mean": 0.0078125,
- "entropy": 1.3001914024353027,
- "epoch": 0.25,
- "grad_norm": 2.854139804840088,
- "kl": 0.01436698716133833,
- "learning_rate": 4.5e-06,
- "loss": 0.02869725041091442,
- "step": 6,
- "step_time": 0.22715311399997518
- },
- {
- "clip_ratio/high_max": 0.0071614584885537624,
- "clip_ratio/high_mean": 0.0071614584885537624,
- "clip_ratio/low_mean": 0.0071614584885537624,
- "clip_ratio/low_min": 0.0071614584885537624,
- "clip_ratio/region_mean": 0.014322916977107525,
- "entropy": 1.0331100225448608,
- "epoch": 0.2916666666666667,
- "grad_norm": 1.9297211170196533,
- "kl": 0.01791433058679104,
- "learning_rate": 4.4e-06,
- "loss": -0.028683962300419807,
- "step": 7,
- "step_time": 0.22586069900000894
- },
- {
- "clip_ratio/high_max": 0.029296875,
- "clip_ratio/high_mean": 0.029296875,
- "clip_ratio/low_mean": 0.01171875,
- "clip_ratio/low_min": 0.01171875,
- "clip_ratio/region_mean": 0.041015625,
- "entropy": 1.1462408304214478,
- "epoch": 0.3333333333333333,
- "grad_norm": 2.57124924659729,
- "kl": 0.0388585664331913,
- "learning_rate": 4.3e-06,
- "loss": 0.08592668920755386,
- "step": 8,
- "step_time": 0.22552726200001416
- },
- {
- "clip_ratio/high_max": 0.0,
- "clip_ratio/high_mean": 0.0,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.0,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 0.999983549118042,
- "epoch": 0.375,
- "frac_reward_zero_std": 0.25,
- "grad_norm": 2.0052192211151123,
- "kl": 0.030047910287976265,
- "learning_rate": 4.2000000000000004e-06,
- "loss": -0.05663062259554863,
- "num_tokens": 75884.0,
- "reward": -0.3902343511581421,
- "reward_std": 0.31722894310951233,
- "rewards/GeneratorRewardFunction/mean": -0.3902343511581421,
- "rewards/GeneratorRewardFunction/std": 0.3172289729118347,
- "step": 9,
- "step_time": 12.047747722000054
- },
- {
- "clip_ratio/high_max": 0.005859375,
- "clip_ratio/high_mean": 0.005859375,
- "clip_ratio/low_mean": 0.0006510416860692203,
- "clip_ratio/low_min": 0.0006510416860692203,
- "clip_ratio/region_mean": 0.0065104165114462376,
- "entropy": 1.6177984476089478,
- "epoch": 0.4166666666666667,
- "grad_norm": 2.137237071990967,
- "kl": 0.04101690649986267,
- "learning_rate": 4.1e-06,
- "loss": -0.02161034755408764,
- "step": 10,
- "step_time": 0.2252395229999138
- },
- {
- "clip_ratio/high_max": 0.01692708395421505,
- "clip_ratio/high_mean": 0.01692708395421505,
- "clip_ratio/low_mean": 0.0065104165114462376,
- "clip_ratio/low_min": 0.0065104165114462376,
- "clip_ratio/region_mean": 0.0234375,
- "entropy": 1.038699746131897,
- "epoch": 0.4583333333333333,
- "grad_norm": 2.672621965408325,
- "kl": 0.031740155071020126,
- "learning_rate": 4.000000000000001e-06,
- "loss": 0.056199509650468826,
- "step": 11,
- "step_time": 0.22556489999999485
- },
- {
- "clip_ratio/high_max": 0.01822916604578495,
- "clip_ratio/high_mean": 0.01822916604578495,
- "clip_ratio/low_mean": 0.010416666977107525,
- "clip_ratio/low_min": 0.010416666977107525,
- "clip_ratio/region_mean": 0.02864583395421505,
- "entropy": 1.296442985534668,
- "epoch": 0.5,
- "grad_norm": 1.4488099813461304,
- "kl": 0.04755128547549248,
- "learning_rate": 3.900000000000001e-06,
- "loss": 0.02385079860687256,
- "step": 12,
- "step_time": 0.22498397900005784
- },
- {
- "clip_ratio/high_max": 0.0013020833721384406,
- "clip_ratio/high_mean": 0.0013020833721384406,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.0013020833721384406,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 1.3575109243392944,
- "epoch": 0.5416666666666666,
- "frac_reward_zero_std": 0.0,
- "grad_norm": 2.2914443016052246,
- "kl": 0.06257984787225723,
- "learning_rate": 3.8000000000000005e-06,
- "loss": -0.10538653284311295,
- "num_tokens": 101112.0,
- "reward": -0.22843749821186066,
- "reward_std": 0.294514924287796,
- "rewards/GeneratorRewardFunction/mean": -0.22843749821186066,
- "rewards/GeneratorRewardFunction/std": 0.2945149540901184,
- "step": 13,
- "step_time": 12.01501083200003
- },
- {
- "clip_ratio/high_max": 0.001953125,
- "clip_ratio/high_mean": 0.001953125,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.001953125,
- "entropy": 1.2918612957000732,
- "epoch": 0.5833333333333334,
- "grad_norm": 3.0368542671203613,
- "kl": 0.04979195073246956,
- "learning_rate": 3.7e-06,
- "loss": -0.003113487036898732,
- "step": 14,
- "step_time": 0.21806825399994523
- },
- {
- "clip_ratio/high_max": 0.0026041667442768812,
- "clip_ratio/high_mean": 0.0026041667442768812,
- "clip_ratio/low_mean": 0.005859375,
- "clip_ratio/low_min": 0.005859375,
- "clip_ratio/region_mean": 0.008463541977107525,
- "entropy": 1.1081053018569946,
- "epoch": 0.625,
- "grad_norm": 3.5923683643341064,
- "kl": 0.06817911565303802,
- "learning_rate": 3.6000000000000003e-06,
- "loss": 0.15118412673473358,
- "step": 15,
- "step_time": 0.217887520999966
- },
- {
- "clip_ratio/high_max": 0.02018229104578495,
- "clip_ratio/high_mean": 0.02018229104578495,
- "clip_ratio/low_mean": 0.0026041667442768812,
- "clip_ratio/low_min": 0.0026041667442768812,
- "clip_ratio/region_mean": 0.02278645895421505,
- "entropy": 1.0803831815719604,
- "epoch": 0.6666666666666666,
- "grad_norm": 1.789110541343689,
- "kl": 0.056480005383491516,
- "learning_rate": 3.5e-06,
- "loss": -0.03890883922576904,
- "step": 16,
- "step_time": 0.21781940799996846
- },
- {
- "clip_ratio/high_max": 0.0,
- "clip_ratio/high_mean": 0.0,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.0,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 0.8709045052528381,
- "epoch": 0.7083333333333334,
- "frac_reward_zero_std": 0.0,
- "grad_norm": 1.329393982887268,
- "kl": 0.06073950603604317,
- "learning_rate": 3.4000000000000005e-06,
- "loss": -0.11920400708913803,
- "num_tokens": 126436.0,
- "reward": -0.2240625023841858,
- "reward_std": 0.2881968021392822,
- "rewards/GeneratorRewardFunction/mean": -0.2240625023841858,
- "rewards/GeneratorRewardFunction/std": 0.2881968021392822,
- "step": 17,
- "step_time": 12.08798373600007
- },
- {
- "clip_ratio/high_max": 0.0026041667442768812,
- "clip_ratio/high_mean": 0.0026041667442768812,
- "clip_ratio/low_mean": 0.0006510416860692203,
- "clip_ratio/low_min": 0.0006510416860692203,
- "clip_ratio/region_mean": 0.0032552082557231188,
- "entropy": 1.083386778831482,
- "epoch": 0.75,
- "grad_norm": 1.343295931816101,
- "kl": 0.0919194221496582,
- "learning_rate": 3.3000000000000006e-06,
- "loss": -0.007308408617973328,
- "step": 18,
- "step_time": 0.2253850289998809
- },
- {
- "clip_ratio/high_max": 0.005859375,
- "clip_ratio/high_mean": 0.005859375,
- "clip_ratio/low_mean": 0.0026041667442768812,
- "clip_ratio/low_min": 0.0026041667442768812,
- "clip_ratio/region_mean": 0.008463541977107525,
- "entropy": 1.406662940979004,
- "epoch": 0.7916666666666666,
- "grad_norm": 3.3420534133911133,
- "kl": 0.06450249999761581,
- "learning_rate": 3.2000000000000003e-06,
- "loss": 0.12472107261419296,
- "step": 19,
- "step_time": 0.22467486499999723
- },
- {
- "clip_ratio/high_max": 0.011067708022892475,
- "clip_ratio/high_mean": 0.011067708022892475,
- "clip_ratio/low_mean": 0.0032552082557231188,
- "clip_ratio/low_min": 0.0032552082557231188,
- "clip_ratio/region_mean": 0.014322916977107525,
- "entropy": 1.6491953134536743,
- "epoch": 0.8333333333333334,
- "grad_norm": 3.3672103881835938,
- "kl": 0.0773777961730957,
- "learning_rate": 3.1000000000000004e-06,
- "loss": 0.0035695277620106936,
- "step": 20,
- "step_time": 0.22389788999998927
- },
- {
- "clip_ratio/high_max": 0.0006510416860692203,
- "clip_ratio/high_mean": 0.0006510416860692203,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.0006510416860692203,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 1.4178005456924438,
- "epoch": 0.875,
- "frac_reward_zero_std": 0.0,
- "grad_norm": 2.6333799362182617,
- "kl": 0.07392226904630661,
- "learning_rate": 3e-06,
- "loss": -0.09101495891809464,
- "num_tokens": 151596.0,
- "reward": -0.15257811546325684,
- "reward_std": 0.24854345619678497,
- "rewards/GeneratorRewardFunction/mean": -0.15257811546325684,
- "rewards/GeneratorRewardFunction/std": 0.24854345619678497,
- "step": 21,
- "step_time": 12.020297105999816
- },
- {
- "clip_ratio/high_max": 0.0006510416860692203,
- "clip_ratio/high_mean": 0.0006510416860692203,
- "clip_ratio/low_mean": 0.0013020833721384406,
- "clip_ratio/low_min": 0.0013020833721384406,
- "clip_ratio/region_mean": 0.001953125,
- "entropy": 1.2036248445510864,
- "epoch": 0.9166666666666666,
- "grad_norm": 2.1499149799346924,
- "kl": 0.0772874653339386,
- "learning_rate": 2.9e-06,
- "loss": 0.08120749890804291,
- "step": 22,
- "step_time": 0.21995178900010615
- },
- {
- "clip_ratio/high_max": 0.0032552082557231188,
- "clip_ratio/high_mean": 0.0032552082557231188,
- "clip_ratio/low_mean": 0.0032552082557231188,
- "clip_ratio/low_min": 0.0032552082557231188,
- "clip_ratio/region_mean": 0.0065104165114462376,
- "entropy": 1.1966055631637573,
- "epoch": 0.9583333333333334,
- "grad_norm": 2.0064616203308105,
- "kl": 0.07331382483243942,
- "learning_rate": 2.8000000000000003e-06,
- "loss": 0.03140506148338318,
- "step": 23,
- "step_time": 0.21996421700009705
- },
- {
- "clip_ratio/high_max": 0.011067708022892475,
- "clip_ratio/high_mean": 0.011067708022892475,
- "clip_ratio/low_mean": 0.0006510416860692203,
- "clip_ratio/low_min": 0.0006510416860692203,
- "clip_ratio/region_mean": 0.01171875,
- "entropy": 0.9102082252502441,
- "epoch": 1.0,
- "grad_norm": 1.7175334692001343,
- "kl": 0.14611481130123138,
- "learning_rate": 2.7000000000000004e-06,
- "loss": -0.021010393276810646,
- "step": 24,
- "step_time": 0.21931950600014716
- },
- {
- "clip_ratio/high_max": 0.0,
- "clip_ratio/high_mean": 0.0,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.0,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 1.9153881072998047,
- "epoch": 1.0416666666666667,
- "frac_reward_zero_std": 0.0,
- "grad_norm": 2.3460445404052734,
- "kl": 0.09710023552179337,
- "learning_rate": 2.6e-06,
- "loss": 0.015220091678202152,
- "num_tokens": 177212.0,
- "reward": -0.15656250715255737,
- "reward_std": 0.22349287569522858,
- "rewards/GeneratorRewardFunction/mean": -0.15656250715255737,
- "rewards/GeneratorRewardFunction/std": 0.22349286079406738,
- "step": 25,
- "step_time": 12.153388549000056
- },
- {
- "clip_ratio/high_max": 0.00390625,
- "clip_ratio/high_mean": 0.00390625,
- "clip_ratio/low_mean": 0.0013020833721384406,
- "clip_ratio/low_min": 0.0013020833721384406,
- "clip_ratio/region_mean": 0.0052083334885537624,
- "entropy": 1.365325927734375,
- "epoch": 1.0833333333333333,
- "grad_norm": 1.8710312843322754,
- "kl": 0.0985046848654747,
- "learning_rate": 2.5e-06,
- "loss": -0.02838735282421112,
- "step": 26,
- "step_time": 0.22659933299996737
- },
- {
- "clip_ratio/high_max": 0.008463541977107525,
- "clip_ratio/high_mean": 0.008463541977107525,
- "clip_ratio/low_mean": 0.0013020833721384406,
- "clip_ratio/low_min": 0.0013020833721384406,
- "clip_ratio/region_mean": 0.009765625,
- "entropy": 1.2517439126968384,
- "epoch": 1.125,
- "grad_norm": 2.821958303451538,
- "kl": 0.09274079650640488,
- "learning_rate": 2.4000000000000003e-06,
- "loss": -0.007298170123249292,
- "step": 27,
- "step_time": 0.22647249999999985
- },
- {
- "clip_ratio/high_max": 0.0052083334885537624,
- "clip_ratio/high_mean": 0.0052083334885537624,
- "clip_ratio/low_mean": 0.009114583022892475,
- "clip_ratio/low_min": 0.009114583022892475,
- "clip_ratio/region_mean": 0.014322916977107525,
- "entropy": 2.0579044818878174,
- "epoch": 1.1666666666666667,
- "grad_norm": 3.259742259979248,
- "kl": 0.10746321082115173,
- "learning_rate": 2.3000000000000004e-06,
- "loss": 0.021702758967876434,
- "step": 28,
- "step_time": 0.22640677999993386
- },
- {
- "clip_ratio/high_max": 0.0,
- "clip_ratio/high_mean": 0.0,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.0,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 1.3725861310958862,
- "epoch": 1.2083333333333333,
- "frac_reward_zero_std": 0.0,
- "grad_norm": 1.8806989192962646,
- "kl": 0.11961983889341354,
- "learning_rate": 2.2e-06,
- "loss": 0.07187109440565109,
- "num_tokens": 202200.0,
- "reward": -0.0561029389500618,
- "reward_std": 0.314301997423172,
- "rewards/GeneratorRewardFunction/mean": -0.0561029389500618,
- "rewards/GeneratorRewardFunction/std": 0.314301997423172,
- "step": 29,
- "step_time": 13.662849896999887
- },
- {
- "clip_ratio/high_max": 0.001953125,
- "clip_ratio/high_mean": 0.001953125,
- "clip_ratio/low_mean": 0.0013020833721384406,
- "clip_ratio/low_min": 0.0013020833721384406,
- "clip_ratio/region_mean": 0.0032552082557231188,
- "entropy": 1.2213298082351685,
- "epoch": 1.25,
- "grad_norm": 2.1918396949768066,
- "kl": 0.12398240715265274,
- "learning_rate": 2.1000000000000002e-06,
- "loss": -0.052896980196237564,
- "step": 30,
- "step_time": 0.2210835590001352
- },
- {
- "clip_ratio/high_max": 0.001953125,
- "clip_ratio/high_mean": 0.001953125,
- "clip_ratio/low_mean": 0.001953125,
- "clip_ratio/low_min": 0.001953125,
- "clip_ratio/region_mean": 0.00390625,
- "entropy": 1.2683231830596924,
- "epoch": 1.2916666666666667,
- "grad_norm": 2.524726390838623,
- "kl": 0.14297537505626678,
- "learning_rate": 2.0000000000000003e-06,
- "loss": 0.12745414674282074,
- "step": 31,
- "step_time": 0.22096665699996265
- },
- {
- "clip_ratio/high_max": 0.0032552082557231188,
- "clip_ratio/high_mean": 0.0032552082557231188,
- "clip_ratio/low_mean": 0.0006510416860692203,
- "clip_ratio/low_min": 0.0006510416860692203,
- "clip_ratio/region_mean": 0.00390625,
- "entropy": 1.0583091974258423,
- "epoch": 1.3333333333333333,
- "grad_norm": 2.408073902130127,
- "kl": 0.0881701335310936,
- "learning_rate": 1.9000000000000002e-06,
592
- "loss": -0.14430458843708038,
593
- "step": 32,
594
- "step_time": 0.21999255600007928
595
- },
596
- {
597
- "clip_ratio/high_max": 0.0,
598
- "clip_ratio/high_mean": 0.0,
599
- "clip_ratio/low_mean": 0.0,
600
- "clip_ratio/low_min": 0.0,
601
- "clip_ratio/region_mean": 0.0,
602
- "completions/clipped_ratio": 1.0,
603
- "completions/max_length": 384.0,
604
- "completions/max_terminated_length": 0.0,
605
- "completions/mean_length": 384.0,
606
- "completions/mean_terminated_length": 0.0,
607
- "completions/min_length": 384.0,
608
- "completions/min_terminated_length": 0.0,
609
- "entropy": 1.3751106262207031,
610
- "epoch": 1.375,
611
- "frac_reward_zero_std": 0.0,
612
- "grad_norm": 2.4002864360809326,
613
- "kl": 0.0900505781173706,
614
- "learning_rate": 1.8000000000000001e-06,
615
- "loss": 0.06270528584718704,
616
- "num_tokens": 227556.0,
617
- "reward": -0.11414062231779099,
618
- "reward_std": 0.21683935821056366,
619
- "rewards/GeneratorRewardFunction/mean": -0.11414062231779099,
620
- "rewards/GeneratorRewardFunction/std": 0.21683938801288605,
621
- "step": 33,
622
- "step_time": 12.070902493999938
623
- },
624
- {
625
- "clip_ratio/high_max": 0.0045572915114462376,
626
- "clip_ratio/high_mean": 0.0045572915114462376,
627
- "clip_ratio/low_mean": 0.0,
628
- "clip_ratio/low_min": 0.0,
629
- "clip_ratio/region_mean": 0.0045572915114462376,
630
- "entropy": 1.2606719732284546,
631
- "epoch": 1.4166666666666667,
632
- "grad_norm": 1.671729326248169,
633
- "kl": 0.1210540160536766,
634
- "learning_rate": 1.7000000000000002e-06,
635
- "loss": -0.04401962831616402,
636
- "step": 34,
637
- "step_time": 0.2276347459999215
638
- },
639
- {
640
- "clip_ratio/high_max": 0.0013020833721384406,
641
- "clip_ratio/high_mean": 0.0013020833721384406,
642
- "clip_ratio/low_mean": 0.001953125,
643
- "clip_ratio/low_min": 0.001953125,
644
- "clip_ratio/region_mean": 0.0032552082557231188,
645
- "entropy": 1.2780500650405884,
646
- "epoch": 1.4583333333333333,
647
- "grad_norm": 2.278010845184326,
648
- "kl": 0.11484409123659134,
649
- "learning_rate": 1.6000000000000001e-06,
650
- "loss": -0.08475238084793091,
651
- "step": 35,
652
- "step_time": 0.22882699100000536
653
- },
654
- {
655
- "clip_ratio/high_max": 0.0052083334885537624,
656
- "clip_ratio/high_mean": 0.0052083334885537624,
657
- "clip_ratio/low_mean": 0.0032552082557231188,
658
- "clip_ratio/low_min": 0.0032552082557231188,
659
- "clip_ratio/region_mean": 0.008463541977107525,
660
- "entropy": 1.0553101301193237,
661
- "epoch": 1.5,
662
- "grad_norm": 1.582037091255188,
663
- "kl": 0.12029703706502914,
664
- "learning_rate": 1.5e-06,
665
- "loss": 0.06627888232469559,
666
- "step": 36,
667
- "step_time": 0.22751581399984389
668
- },
669
- {
670
- "clip_ratio/high_max": 0.0,
671
- "clip_ratio/high_mean": 0.0,
672
- "clip_ratio/low_mean": 0.0,
673
- "clip_ratio/low_min": 0.0,
674
- "clip_ratio/region_mean": 0.0,
675
- "completions/clipped_ratio": 1.0,
676
- "completions/max_length": 384.0,
677
- "completions/max_terminated_length": 0.0,
678
- "completions/mean_length": 384.0,
679
- "completions/mean_terminated_length": 0.0,
680
- "completions/min_length": 384.0,
681
- "completions/min_terminated_length": 0.0,
682
- "entropy": 1.0647958517074585,
683
- "epoch": 1.5416666666666665,
684
- "frac_reward_zero_std": 0.0,
685
- "grad_norm": 2.1763558387756348,
686
- "kl": 0.08708903193473816,
687
- "learning_rate": 1.4000000000000001e-06,
688
- "loss": -0.00017260713502764702,
689
- "num_tokens": 252640.0,
690
- "reward": -0.10210937261581421,
691
- "reward_std": 0.19573244452476501,
692
- "rewards/GeneratorRewardFunction/mean": -0.10210937261581421,
693
- "rewards/GeneratorRewardFunction/std": 0.1957324594259262,
694
- "step": 37,
695
- "step_time": 12.015305628000078
696
- },
697
- {
698
- "clip_ratio/high_max": 0.0026041667442768812,
699
- "clip_ratio/high_mean": 0.0026041667442768812,
700
- "clip_ratio/low_mean": 0.0,
701
- "clip_ratio/low_min": 0.0,
702
- "clip_ratio/region_mean": 0.0026041667442768812,
703
- "entropy": 1.0041130781173706,
704
- "epoch": 1.5833333333333335,
705
- "grad_norm": 1.6093602180480957,
706
- "kl": 0.11537543684244156,
707
- "learning_rate": 1.3e-06,
708
- "loss": -0.12453166395425797,
709
- "step": 38,
710
- "step_time": 0.22048816200003785
711
- },
712
- {
713
- "clip_ratio/high_max": 0.0045572915114462376,
714
- "clip_ratio/high_mean": 0.0045572915114462376,
715
- "clip_ratio/low_mean": 0.0013020833721384406,
716
- "clip_ratio/low_min": 0.0013020833721384406,
717
- "clip_ratio/region_mean": 0.005859375,
718
- "entropy": 1.500306487083435,
719
- "epoch": 1.625,
720
- "grad_norm": 3.409069299697876,
721
- "kl": 0.10904627293348312,
722
- "learning_rate": 1.2000000000000002e-06,
723
- "loss": 0.12661518156528473,
724
- "step": 39,
725
- "step_time": 0.22087437000004684
726
- },
727
- {
728
- "clip_ratio/high_max": 0.0078125,
729
- "clip_ratio/high_mean": 0.0078125,
730
- "clip_ratio/low_mean": 0.0013020833721384406,
731
- "clip_ratio/low_min": 0.0013020833721384406,
732
- "clip_ratio/region_mean": 0.009114583022892475,
733
- "entropy": 1.0560635328292847,
734
- "epoch": 1.6666666666666665,
735
- "grad_norm": 2.0718417167663574,
736
- "kl": 0.11926760524511337,
737
- "learning_rate": 1.1e-06,
738
- "loss": -0.0004449083062354475,
739
- "step": 40,
740
- "step_time": 0.2202887500000088
741
- }
742
- ],
743
- "logging_steps": 1,
744
- "max_steps": 50,
745
- "num_input_tokens_seen": 252640,
746
- "num_train_epochs": 3,
747
- "save_steps": 10,
748
- "stateful_callbacks": {
749
- "TrainerControl": {
750
- "args": {
751
- "should_epoch_stop": false,
752
- "should_evaluate": false,
753
- "should_log": false,
754
- "should_save": true,
755
- "should_training_stop": false
756
- },
757
- "attributes": {}
758
- }
759
- },
760
- "total_flos": 0.0,
761
- "train_batch_size": 4,
762
- "trial_name": null,
763
- "trial_params": null
764
- }
self_play_hf_a10g_train/round_001/generator_train/checkpoint-40/training_args.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:31ec66b64f432daf7616434296713e432d134face96e308f2ebc175e2e26f025
- size 7249
self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/chat_template.jinja DELETED
@@ -1,54 +0,0 @@
- {%- if tools %}
- {{- '<|im_start|>system\n' }}
- {%- if messages[0]['role'] == 'system' %}
- {{- messages[0]['content'] }}
- {%- else %}
- {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
- {%- endif %}
- {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
- {%- for tool in tools %}
- {{- "\n" }}
- {{- tool | tojson }}
- {%- endfor %}
- {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
- {%- else %}
- {%- if messages[0]['role'] == 'system' %}
- {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
- {%- else %}
- {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
- {%- endif %}
- {%- endif %}
- {%- for message in messages %}
- {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
- {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
- {%- elif message.role == "assistant" %}
- {{- '<|im_start|>' + message.role }}
- {%- if message.content %}
- {{- '\n' + message.content }}
- {%- endif %}
- {%- for tool_call in message.tool_calls %}
- {%- if tool_call.function is defined %}
- {%- set tool_call = tool_call.function %}
- {%- endif %}
- {{- '\n<tool_call>\n{"name": "' }}
- {{- tool_call.name }}
- {{- '", "arguments": ' }}
- {{- tool_call.arguments | tojson }}
- {{- '}\n</tool_call>' }}
- {%- endfor %}
- {{- '<|im_end|>\n' }}
- {%- elif message.role == "tool" %}
- {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
- {{- '<|im_start|>user' }}
- {%- endif %}
- {{- '\n<tool_response>\n' }}
- {{- message.content }}
- {{- '\n</tool_response>' }}
- {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
- {{- '<|im_end|>\n' }}
- {%- endif %}
- {%- endif %}
- {%- endfor %}
- {%- if add_generation_prompt %}
- {{- '<|im_start|>assistant\n' }}
- {%- endif %}
self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/config.json DELETED
@@ -1,57 +0,0 @@
- {
- "architectures": [
- "Qwen2ForCausalLM"
- ],
- "attention_dropout": 0.0,
- "bos_token_id": null,
- "dtype": "float32",
- "eos_token_id": 151645,
- "hidden_act": "silu",
- "hidden_size": 896,
- "initializer_range": 0.02,
- "intermediate_size": 4864,
- "layer_types": [
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention",
- "full_attention"
- ],
- "max_position_embeddings": 32768,
- "max_window_layers": 21,
- "model_type": "qwen2",
- "num_attention_heads": 14,
- "num_hidden_layers": 24,
- "num_key_value_heads": 2,
- "pad_token_id": 151643,
- "rms_norm_eps": 1e-06,
- "rope_parameters": {
- "rope_theta": 1000000.0,
- "rope_type": "default"
- },
- "sliding_window": null,
- "tie_word_embeddings": true,
- "transformers_version": "5.6.2",
- "use_cache": false,
- "use_sliding_window": false,
- "vocab_size": 151936
- }
self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/generation_config.json DELETED
@@ -1,13 +0,0 @@
- {
- "do_sample": true,
- "eos_token_id": [
- 151645,
- 151643
- ],
- "pad_token_id": 151643,
- "repetition_penalty": 1.1,
- "temperature": 0.7,
- "top_k": 20,
- "top_p": 0.8,
- "transformers_version": "5.6.2"
- }
self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:c6fa4eed67a84ce4076ba3848a078496971cd34ba048c794e52cc3b4aab54a27
- size 1976163472
self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/optimizer.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:b145cc09cf03081708247bd99e0dd46e23f798d922e5e7df9e75880345e1d969
- size 3952509771
self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/rng_state.pth DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:7876773bbbd765b5b540a43519e4809b559e65cdfa3f4e9508024dc7702f2f6e
- size 14645
self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/scheduler.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:3cf220ca534359fdb729d8e74ed3b0c609c54e6d591e1d2478f5521fc51fba05
- size 1465
self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/tokenizer.json DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
- size 11421892
self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/tokenizer_config.json DELETED
@@ -1,32 +0,0 @@
- {
- "add_prefix_space": false,
- "backend": "tokenizers",
- "bos_token": null,
- "clean_up_tokenization_spaces": false,
- "eos_token": "<|im_end|>",
- "errors": "replace",
- "extra_special_tokens": [
- "<|im_start|>",
- "<|im_end|>",
- "<|object_ref_start|>",
- "<|object_ref_end|>",
- "<|box_start|>",
- "<|box_end|>",
- "<|quad_start|>",
- "<|quad_end|>",
- "<|vision_start|>",
- "<|vision_end|>",
- "<|vision_pad|>",
- "<|image_pad|>",
- "<|video_pad|>"
- ],
- "is_local": false,
- "local_files_only": false,
- "model_max_length": 131072,
- "pad_token": "<|endoftext|>",
- "padding_side": "left",
- "split_special_tokens": false,
- "tokenizer_class": "Qwen2Tokenizer",
- "truncation_side": "left",
- "unk_token": null
- }
self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/trainer_state.json DELETED
@@ -1,953 +0,0 @@
- {
- "best_global_step": null,
- "best_metric": null,
- "best_model_checkpoint": null,
- "epoch": 2.0833333333333335,
- "eval_steps": 500,
- "global_step": 50,
- "is_hyper_param_search": false,
- "is_local_process_zero": true,
- "is_world_process_zero": true,
- "log_history": [
- {
- "clip_ratio/high_max": 0.001953125,
- "clip_ratio/high_mean": 0.001953125,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.001953125,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 1.9362258911132812,
- "epoch": 0.041666666666666664,
- "frac_reward_zero_std": 0.0,
- "grad_norm": 3.3426833152770996,
- "kl": 0.0005498419050127268,
- "learning_rate": 5e-06,
- "loss": 0.13995857536792755,
- "num_tokens": 25244.0,
- "reward": -0.4352343678474426,
- "reward_std": 0.306624174118042,
- "rewards/GeneratorRewardFunction/mean": -0.4352343678474426,
- "rewards/GeneratorRewardFunction/std": 0.306624174118042,
- "step": 1,
- "step_time": 12.578062469000088
- },
- {
- "clip_ratio/high_max": 0.00390625,
- "clip_ratio/high_mean": 0.00390625,
- "clip_ratio/low_mean": 0.0026041667442768812,
- "clip_ratio/low_min": 0.0026041667442768812,
- "clip_ratio/region_mean": 0.0065104165114462376,
- "entropy": 1.2686206102371216,
- "epoch": 0.08333333333333333,
- "grad_norm": 2.8547239303588867,
- "kl": 0.001546451705507934,
- "learning_rate": 4.9000000000000005e-06,
- "loss": -0.06681232899427414,
- "step": 2,
- "step_time": 0.22036709600001814
- },
- {
- "clip_ratio/high_max": 0.012369791977107525,
- "clip_ratio/high_mean": 0.012369791977107525,
- "clip_ratio/low_mean": 0.015625,
- "clip_ratio/low_min": 0.015625,
- "clip_ratio/region_mean": 0.02799479104578495,
- "entropy": 1.8668650388717651,
- "epoch": 0.125,
- "grad_norm": 2.4686105251312256,
- "kl": 0.005345983896404505,
- "learning_rate": 4.800000000000001e-06,
- "loss": 0.010777520947158337,
- "step": 3,
- "step_time": 0.2199646679999887
- },
- {
- "clip_ratio/high_max": 0.02083333395421505,
- "clip_ratio/high_mean": 0.02083333395421505,
- "clip_ratio/low_mean": 0.010416666977107525,
- "clip_ratio/low_min": 0.010416666977107525,
- "clip_ratio/region_mean": 0.03125,
- "entropy": 1.1842881441116333,
- "epoch": 0.16666666666666666,
- "grad_norm": 1.569398045539856,
- "kl": 0.0072342646308243275,
- "learning_rate": 4.7e-06,
- "loss": -0.08198019117116928,
- "step": 4,
- "step_time": 0.2201611520000597
- },
- {
- "clip_ratio/high_max": 0.001953125,
- "clip_ratio/high_mean": 0.001953125,
- "clip_ratio/low_mean": 0.0006510416860692203,
- "clip_ratio/low_min": 0.0006510416860692203,
- "clip_ratio/region_mean": 0.0026041667442768812,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 1.3128995895385742,
- "epoch": 0.20833333333333334,
- "frac_reward_zero_std": 0.0,
- "grad_norm": 2.202421188354492,
- "kl": 0.0129135362803936,
- "learning_rate": 4.600000000000001e-06,
- "loss": -0.0841849148273468,
- "num_tokens": 50440.0,
- "reward": -0.3341406285762787,
- "reward_std": 0.3155691623687744,
- "rewards/GeneratorRewardFunction/mean": -0.3341406285762787,
- "rewards/GeneratorRewardFunction/std": 0.3155691623687744,
- "step": 5,
- "step_time": 12.076087126999937
- },
- {
- "clip_ratio/high_max": 0.0026041667442768812,
- "clip_ratio/high_mean": 0.0026041667442768812,
- "clip_ratio/low_mean": 0.0052083334885537624,
- "clip_ratio/low_min": 0.0052083334885537624,
- "clip_ratio/region_mean": 0.0078125,
- "entropy": 1.3001914024353027,
- "epoch": 0.25,
- "grad_norm": 2.854139804840088,
- "kl": 0.01436698716133833,
- "learning_rate": 4.5e-06,
- "loss": 0.02869725041091442,
- "step": 6,
- "step_time": 0.22715311399997518
- },
- {
- "clip_ratio/high_max": 0.0071614584885537624,
- "clip_ratio/high_mean": 0.0071614584885537624,
- "clip_ratio/low_mean": 0.0071614584885537624,
- "clip_ratio/low_min": 0.0071614584885537624,
- "clip_ratio/region_mean": 0.014322916977107525,
- "entropy": 1.0331100225448608,
- "epoch": 0.2916666666666667,
- "grad_norm": 1.9297211170196533,
- "kl": 0.01791433058679104,
- "learning_rate": 4.4e-06,
- "loss": -0.028683962300419807,
- "step": 7,
- "step_time": 0.22586069900000894
- },
- {
- "clip_ratio/high_max": 0.029296875,
- "clip_ratio/high_mean": 0.029296875,
- "clip_ratio/low_mean": 0.01171875,
- "clip_ratio/low_min": 0.01171875,
- "clip_ratio/region_mean": 0.041015625,
- "entropy": 1.1462408304214478,
- "epoch": 0.3333333333333333,
- "grad_norm": 2.57124924659729,
- "kl": 0.0388585664331913,
- "learning_rate": 4.3e-06,
- "loss": 0.08592668920755386,
- "step": 8,
- "step_time": 0.22552726200001416
- },
- {
- "clip_ratio/high_max": 0.0,
- "clip_ratio/high_mean": 0.0,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.0,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 0.999983549118042,
- "epoch": 0.375,
- "frac_reward_zero_std": 0.25,
- "grad_norm": 2.0052192211151123,
- "kl": 0.030047910287976265,
- "learning_rate": 4.2000000000000004e-06,
- "loss": -0.05663062259554863,
- "num_tokens": 75884.0,
- "reward": -0.3902343511581421,
- "reward_std": 0.31722894310951233,
- "rewards/GeneratorRewardFunction/mean": -0.3902343511581421,
- "rewards/GeneratorRewardFunction/std": 0.3172289729118347,
- "step": 9,
- "step_time": 12.047747722000054
- },
- {
- "clip_ratio/high_max": 0.005859375,
- "clip_ratio/high_mean": 0.005859375,
- "clip_ratio/low_mean": 0.0006510416860692203,
- "clip_ratio/low_min": 0.0006510416860692203,
- "clip_ratio/region_mean": 0.0065104165114462376,
- "entropy": 1.6177984476089478,
- "epoch": 0.4166666666666667,
- "grad_norm": 2.137237071990967,
- "kl": 0.04101690649986267,
- "learning_rate": 4.1e-06,
- "loss": -0.02161034755408764,
- "step": 10,
- "step_time": 0.2252395229999138
- },
- {
- "clip_ratio/high_max": 0.01692708395421505,
- "clip_ratio/high_mean": 0.01692708395421505,
- "clip_ratio/low_mean": 0.0065104165114462376,
- "clip_ratio/low_min": 0.0065104165114462376,
- "clip_ratio/region_mean": 0.0234375,
- "entropy": 1.038699746131897,
- "epoch": 0.4583333333333333,
- "grad_norm": 2.672621965408325,
- "kl": 0.031740155071020126,
- "learning_rate": 4.000000000000001e-06,
- "loss": 0.056199509650468826,
- "step": 11,
- "step_time": 0.22556489999999485
- },
- {
- "clip_ratio/high_max": 0.01822916604578495,
- "clip_ratio/high_mean": 0.01822916604578495,
- "clip_ratio/low_mean": 0.010416666977107525,
- "clip_ratio/low_min": 0.010416666977107525,
- "clip_ratio/region_mean": 0.02864583395421505,
- "entropy": 1.296442985534668,
- "epoch": 0.5,
- "grad_norm": 1.4488099813461304,
- "kl": 0.04755128547549248,
- "learning_rate": 3.900000000000001e-06,
- "loss": 0.02385079860687256,
- "step": 12,
- "step_time": 0.22498397900005784
- },
- {
- "clip_ratio/high_max": 0.0013020833721384406,
- "clip_ratio/high_mean": 0.0013020833721384406,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.0013020833721384406,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 1.3575109243392944,
- "epoch": 0.5416666666666666,
- "frac_reward_zero_std": 0.0,
- "grad_norm": 2.2914443016052246,
- "kl": 0.06257984787225723,
- "learning_rate": 3.8000000000000005e-06,
- "loss": -0.10538653284311295,
- "num_tokens": 101112.0,
- "reward": -0.22843749821186066,
- "reward_std": 0.294514924287796,
- "rewards/GeneratorRewardFunction/mean": -0.22843749821186066,
- "rewards/GeneratorRewardFunction/std": 0.2945149540901184,
- "step": 13,
- "step_time": 12.01501083200003
- },
- {
- "clip_ratio/high_max": 0.001953125,
- "clip_ratio/high_mean": 0.001953125,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.001953125,
- "entropy": 1.2918612957000732,
- "epoch": 0.5833333333333334,
- "grad_norm": 3.0368542671203613,
- "kl": 0.04979195073246956,
- "learning_rate": 3.7e-06,
- "loss": -0.003113487036898732,
- "step": 14,
- "step_time": 0.21806825399994523
- },
- {
- "clip_ratio/high_max": 0.0026041667442768812,
- "clip_ratio/high_mean": 0.0026041667442768812,
- "clip_ratio/low_mean": 0.005859375,
- "clip_ratio/low_min": 0.005859375,
- "clip_ratio/region_mean": 0.008463541977107525,
- "entropy": 1.1081053018569946,
- "epoch": 0.625,
- "grad_norm": 3.5923683643341064,
- "kl": 0.06817911565303802,
- "learning_rate": 3.6000000000000003e-06,
- "loss": 0.15118412673473358,
- "step": 15,
- "step_time": 0.217887520999966
- },
- {
- "clip_ratio/high_max": 0.02018229104578495,
- "clip_ratio/high_mean": 0.02018229104578495,
- "clip_ratio/low_mean": 0.0026041667442768812,
- "clip_ratio/low_min": 0.0026041667442768812,
- "clip_ratio/region_mean": 0.02278645895421505,
- "entropy": 1.0803831815719604,
- "epoch": 0.6666666666666666,
- "grad_norm": 1.789110541343689,
- "kl": 0.056480005383491516,
- "learning_rate": 3.5e-06,
- "loss": -0.03890883922576904,
- "step": 16,
- "step_time": 0.21781940799996846
- },
- {
- "clip_ratio/high_max": 0.0,
- "clip_ratio/high_mean": 0.0,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.0,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 0.8709045052528381,
- "epoch": 0.7083333333333334,
- "frac_reward_zero_std": 0.0,
- "grad_norm": 1.329393982887268,
- "kl": 0.06073950603604317,
- "learning_rate": 3.4000000000000005e-06,
- "loss": -0.11920400708913803,
- "num_tokens": 126436.0,
- "reward": -0.2240625023841858,
- "reward_std": 0.2881968021392822,
- "rewards/GeneratorRewardFunction/mean": -0.2240625023841858,
- "rewards/GeneratorRewardFunction/std": 0.2881968021392822,
- "step": 17,
- "step_time": 12.08798373600007
- },
- {
- "clip_ratio/high_max": 0.0026041667442768812,
- "clip_ratio/high_mean": 0.0026041667442768812,
- "clip_ratio/low_mean": 0.0006510416860692203,
- "clip_ratio/low_min": 0.0006510416860692203,
- "clip_ratio/region_mean": 0.0032552082557231188,
- "entropy": 1.083386778831482,
- "epoch": 0.75,
- "grad_norm": 1.343295931816101,
- "kl": 0.0919194221496582,
- "learning_rate": 3.3000000000000006e-06,
- "loss": -0.007308408617973328,
- "step": 18,
- "step_time": 0.2253850289998809
- },
- {
- "clip_ratio/high_max": 0.005859375,
- "clip_ratio/high_mean": 0.005859375,
- "clip_ratio/low_mean": 0.0026041667442768812,
- "clip_ratio/low_min": 0.0026041667442768812,
- "clip_ratio/region_mean": 0.008463541977107525,
- "entropy": 1.406662940979004,
- "epoch": 0.7916666666666666,
- "grad_norm": 3.3420534133911133,
- "kl": 0.06450249999761581,
- "learning_rate": 3.2000000000000003e-06,
- "loss": 0.12472107261419296,
- "step": 19,
- "step_time": 0.22467486499999723
- },
- {
- "clip_ratio/high_max": 0.011067708022892475,
- "clip_ratio/high_mean": 0.011067708022892475,
- "clip_ratio/low_mean": 0.0032552082557231188,
- "clip_ratio/low_min": 0.0032552082557231188,
- "clip_ratio/region_mean": 0.014322916977107525,
- "entropy": 1.6491953134536743,
- "epoch": 0.8333333333333334,
- "grad_norm": 3.3672103881835938,
- "kl": 0.0773777961730957,
- "learning_rate": 3.1000000000000004e-06,
- "loss": 0.0035695277620106936,
- "step": 20,
- "step_time": 0.22389788999998927
- },
- {
- "clip_ratio/high_max": 0.0006510416860692203,
- "clip_ratio/high_mean": 0.0006510416860692203,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.0006510416860692203,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 1.4178005456924438,
- "epoch": 0.875,
- "frac_reward_zero_std": 0.0,
- "grad_norm": 2.6333799362182617,
- "kl": 0.07392226904630661,
- "learning_rate": 3e-06,
- "loss": -0.09101495891809464,
- "num_tokens": 151596.0,
- "reward": -0.15257811546325684,
- "reward_std": 0.24854345619678497,
- "rewards/GeneratorRewardFunction/mean": -0.15257811546325684,
- "rewards/GeneratorRewardFunction/std": 0.24854345619678497,
- "step": 21,
- "step_time": 12.020297105999816
- },
- {
- "clip_ratio/high_max": 0.0006510416860692203,
- "clip_ratio/high_mean": 0.0006510416860692203,
- "clip_ratio/low_mean": 0.0013020833721384406,
- "clip_ratio/low_min": 0.0013020833721384406,
- "clip_ratio/region_mean": 0.001953125,
- "entropy": 1.2036248445510864,
- "epoch": 0.9166666666666666,
- "grad_norm": 2.1499149799346924,
- "kl": 0.0772874653339386,
- "learning_rate": 2.9e-06,
- "loss": 0.08120749890804291,
- "step": 22,
- "step_time": 0.21995178900010615
- },
- {
- "clip_ratio/high_max": 0.0032552082557231188,
- "clip_ratio/high_mean": 0.0032552082557231188,
- "clip_ratio/low_mean": 0.0032552082557231188,
- "clip_ratio/low_min": 0.0032552082557231188,
- "clip_ratio/region_mean": 0.0065104165114462376,
- "entropy": 1.1966055631637573,
- "epoch": 0.9583333333333334,
- "grad_norm": 2.0064616203308105,
- "kl": 0.07331382483243942,
- "learning_rate": 2.8000000000000003e-06,
- "loss": 0.03140506148338318,
- "step": 23,
- "step_time": 0.21996421700009705
- },
- {
- "clip_ratio/high_max": 0.011067708022892475,
- "clip_ratio/high_mean": 0.011067708022892475,
- "clip_ratio/low_mean": 0.0006510416860692203,
- "clip_ratio/low_min": 0.0006510416860692203,
- "clip_ratio/region_mean": 0.01171875,
- "entropy": 0.9102082252502441,
- "epoch": 1.0,
- "grad_norm": 1.7175334692001343,
- "kl": 0.14611481130123138,
- "learning_rate": 2.7000000000000004e-06,
- "loss": -0.021010393276810646,
- "step": 24,
- "step_time": 0.21931950600014716
- },
- {
- "clip_ratio/high_max": 0.0,
- "clip_ratio/high_mean": 0.0,
- "clip_ratio/low_mean": 0.0,
- "clip_ratio/low_min": 0.0,
- "clip_ratio/region_mean": 0.0,
- "completions/clipped_ratio": 1.0,
- "completions/max_length": 384.0,
- "completions/max_terminated_length": 0.0,
- "completions/mean_length": 384.0,
- "completions/mean_terminated_length": 0.0,
- "completions/min_length": 384.0,
- "completions/min_terminated_length": 0.0,
- "entropy": 1.9153881072998047,
- "epoch": 1.0416666666666667,
- "frac_reward_zero_std": 0.0,
- "grad_norm": 2.3460445404052734,
- "kl": 0.09710023552179337,
- "learning_rate": 2.6e-06,
- "loss": 0.015220091678202152,
- "num_tokens": 177212.0,
- "reward": -0.15656250715255737,
- "reward_std": 0.22349287569522858,
- "rewards/GeneratorRewardFunction/mean": -0.15656250715255737,
- "rewards/GeneratorRewardFunction/std": 0.22349286079406738,
- "step": 25,
- "step_time": 12.153388549000056
- },
- {
- "clip_ratio/high_max": 0.00390625,
480
- "clip_ratio/high_mean": 0.00390625,
481
- "clip_ratio/low_mean": 0.0013020833721384406,
482
- "clip_ratio/low_min": 0.0013020833721384406,
483
- "clip_ratio/region_mean": 0.0052083334885537624,
484
- "entropy": 1.365325927734375,
485
- "epoch": 1.0833333333333333,
486
- "grad_norm": 1.8710312843322754,
487
- "kl": 0.0985046848654747,
488
- "learning_rate": 2.5e-06,
489
- "loss": -0.02838735282421112,
490
- "step": 26,
491
- "step_time": 0.22659933299996737
492
- },
493
- {
494
- "clip_ratio/high_max": 0.008463541977107525,
495
- "clip_ratio/high_mean": 0.008463541977107525,
496
- "clip_ratio/low_mean": 0.0013020833721384406,
497
- "clip_ratio/low_min": 0.0013020833721384406,
498
- "clip_ratio/region_mean": 0.009765625,
499
- "entropy": 1.2517439126968384,
500
- "epoch": 1.125,
501
- "grad_norm": 2.821958303451538,
502
- "kl": 0.09274079650640488,
503
- "learning_rate": 2.4000000000000003e-06,
504
- "loss": -0.007298170123249292,
505
- "step": 27,
506
- "step_time": 0.22647249999999985
507
- },
508
- {
509
- "clip_ratio/high_max": 0.0052083334885537624,
510
- "clip_ratio/high_mean": 0.0052083334885537624,
511
- "clip_ratio/low_mean": 0.009114583022892475,
512
- "clip_ratio/low_min": 0.009114583022892475,
513
- "clip_ratio/region_mean": 0.014322916977107525,
514
- "entropy": 2.0579044818878174,
515
- "epoch": 1.1666666666666667,
516
- "grad_norm": 3.259742259979248,
517
- "kl": 0.10746321082115173,
518
- "learning_rate": 2.3000000000000004e-06,
519
- "loss": 0.021702758967876434,
520
- "step": 28,
521
- "step_time": 0.22640677999993386
522
- },
523
- {
524
- "clip_ratio/high_max": 0.0,
525
- "clip_ratio/high_mean": 0.0,
526
- "clip_ratio/low_mean": 0.0,
527
- "clip_ratio/low_min": 0.0,
528
- "clip_ratio/region_mean": 0.0,
529
- "completions/clipped_ratio": 1.0,
530
- "completions/max_length": 384.0,
531
- "completions/max_terminated_length": 0.0,
532
- "completions/mean_length": 384.0,
533
- "completions/mean_terminated_length": 0.0,
534
- "completions/min_length": 384.0,
535
- "completions/min_terminated_length": 0.0,
536
- "entropy": 1.3725861310958862,
537
- "epoch": 1.2083333333333333,
538
- "frac_reward_zero_std": 0.0,
539
- "grad_norm": 1.8806989192962646,
540
- "kl": 0.11961983889341354,
541
- "learning_rate": 2.2e-06,
542
- "loss": 0.07187109440565109,
543
- "num_tokens": 202200.0,
544
- "reward": -0.0561029389500618,
545
- "reward_std": 0.314301997423172,
546
- "rewards/GeneratorRewardFunction/mean": -0.0561029389500618,
547
- "rewards/GeneratorRewardFunction/std": 0.314301997423172,
548
- "step": 29,
549
- "step_time": 13.662849896999887
550
- },
551
- {
552
- "clip_ratio/high_max": 0.001953125,
553
- "clip_ratio/high_mean": 0.001953125,
554
- "clip_ratio/low_mean": 0.0013020833721384406,
555
- "clip_ratio/low_min": 0.0013020833721384406,
556
- "clip_ratio/region_mean": 0.0032552082557231188,
557
- "entropy": 1.2213298082351685,
558
- "epoch": 1.25,
559
- "grad_norm": 2.1918396949768066,
560
- "kl": 0.12398240715265274,
561
- "learning_rate": 2.1000000000000002e-06,
562
- "loss": -0.052896980196237564,
563
- "step": 30,
564
- "step_time": 0.2210835590001352
565
- },
566
- {
567
- "clip_ratio/high_max": 0.001953125,
568
- "clip_ratio/high_mean": 0.001953125,
569
- "clip_ratio/low_mean": 0.001953125,
570
- "clip_ratio/low_min": 0.001953125,
571
- "clip_ratio/region_mean": 0.00390625,
572
- "entropy": 1.2683231830596924,
573
- "epoch": 1.2916666666666667,
574
- "grad_norm": 2.524726390838623,
575
- "kl": 0.14297537505626678,
576
- "learning_rate": 2.0000000000000003e-06,
577
- "loss": 0.12745414674282074,
578
- "step": 31,
579
- "step_time": 0.22096665699996265
580
- },
581
- {
582
- "clip_ratio/high_max": 0.0032552082557231188,
583
- "clip_ratio/high_mean": 0.0032552082557231188,
584
- "clip_ratio/low_mean": 0.0006510416860692203,
585
- "clip_ratio/low_min": 0.0006510416860692203,
586
- "clip_ratio/region_mean": 0.00390625,
587
- "entropy": 1.0583091974258423,
588
- "epoch": 1.3333333333333333,
589
- "grad_norm": 2.408073902130127,
590
- "kl": 0.0881701335310936,
591
- "learning_rate": 1.9000000000000002e-06,
592
- "loss": -0.14430458843708038,
593
- "step": 32,
594
- "step_time": 0.21999255600007928
595
- },
596
- {
597
- "clip_ratio/high_max": 0.0,
598
- "clip_ratio/high_mean": 0.0,
599
- "clip_ratio/low_mean": 0.0,
600
- "clip_ratio/low_min": 0.0,
601
- "clip_ratio/region_mean": 0.0,
602
- "completions/clipped_ratio": 1.0,
603
- "completions/max_length": 384.0,
604
- "completions/max_terminated_length": 0.0,
605
- "completions/mean_length": 384.0,
606
- "completions/mean_terminated_length": 0.0,
607
- "completions/min_length": 384.0,
608
- "completions/min_terminated_length": 0.0,
609
- "entropy": 1.3751106262207031,
610
- "epoch": 1.375,
611
- "frac_reward_zero_std": 0.0,
612
- "grad_norm": 2.4002864360809326,
613
- "kl": 0.0900505781173706,
614
- "learning_rate": 1.8000000000000001e-06,
615
- "loss": 0.06270528584718704,
616
- "num_tokens": 227556.0,
617
- "reward": -0.11414062231779099,
618
- "reward_std": 0.21683935821056366,
619
- "rewards/GeneratorRewardFunction/mean": -0.11414062231779099,
620
- "rewards/GeneratorRewardFunction/std": 0.21683938801288605,
621
- "step": 33,
622
- "step_time": 12.070902493999938
623
- },
624
- {
625
- "clip_ratio/high_max": 0.0045572915114462376,
626
- "clip_ratio/high_mean": 0.0045572915114462376,
627
- "clip_ratio/low_mean": 0.0,
628
- "clip_ratio/low_min": 0.0,
629
- "clip_ratio/region_mean": 0.0045572915114462376,
630
- "entropy": 1.2606719732284546,
631
- "epoch": 1.4166666666666667,
632
- "grad_norm": 1.671729326248169,
633
- "kl": 0.1210540160536766,
634
- "learning_rate": 1.7000000000000002e-06,
635
- "loss": -0.04401962831616402,
636
- "step": 34,
637
- "step_time": 0.2276347459999215
638
- },
639
- {
640
- "clip_ratio/high_max": 0.0013020833721384406,
641
- "clip_ratio/high_mean": 0.0013020833721384406,
642
- "clip_ratio/low_mean": 0.001953125,
643
- "clip_ratio/low_min": 0.001953125,
644
- "clip_ratio/region_mean": 0.0032552082557231188,
645
- "entropy": 1.2780500650405884,
646
- "epoch": 1.4583333333333333,
647
- "grad_norm": 2.278010845184326,
648
- "kl": 0.11484409123659134,
649
- "learning_rate": 1.6000000000000001e-06,
650
- "loss": -0.08475238084793091,
651
- "step": 35,
652
- "step_time": 0.22882699100000536
653
- },
654
- {
655
- "clip_ratio/high_max": 0.0052083334885537624,
656
- "clip_ratio/high_mean": 0.0052083334885537624,
657
- "clip_ratio/low_mean": 0.0032552082557231188,
658
- "clip_ratio/low_min": 0.0032552082557231188,
659
- "clip_ratio/region_mean": 0.008463541977107525,
660
- "entropy": 1.0553101301193237,
661
- "epoch": 1.5,
662
- "grad_norm": 1.582037091255188,
663
- "kl": 0.12029703706502914,
664
- "learning_rate": 1.5e-06,
665
- "loss": 0.06627888232469559,
666
- "step": 36,
667
- "step_time": 0.22751581399984389
668
- },
669
- {
670
- "clip_ratio/high_max": 0.0,
671
- "clip_ratio/high_mean": 0.0,
672
- "clip_ratio/low_mean": 0.0,
673
- "clip_ratio/low_min": 0.0,
674
- "clip_ratio/region_mean": 0.0,
675
- "completions/clipped_ratio": 1.0,
676
- "completions/max_length": 384.0,
677
- "completions/max_terminated_length": 0.0,
678
- "completions/mean_length": 384.0,
679
- "completions/mean_terminated_length": 0.0,
680
- "completions/min_length": 384.0,
681
- "completions/min_terminated_length": 0.0,
682
- "entropy": 1.0647958517074585,
683
- "epoch": 1.5416666666666665,
684
- "frac_reward_zero_std": 0.0,
685
- "grad_norm": 2.1763558387756348,
686
- "kl": 0.08708903193473816,
687
- "learning_rate": 1.4000000000000001e-06,
688
- "loss": -0.00017260713502764702,
689
- "num_tokens": 252640.0,
690
- "reward": -0.10210937261581421,
691
- "reward_std": 0.19573244452476501,
692
- "rewards/GeneratorRewardFunction/mean": -0.10210937261581421,
693
- "rewards/GeneratorRewardFunction/std": 0.1957324594259262,
694
- "step": 37,
695
- "step_time": 12.015305628000078
696
- },
697
- {
698
- "clip_ratio/high_max": 0.0026041667442768812,
699
- "clip_ratio/high_mean": 0.0026041667442768812,
700
- "clip_ratio/low_mean": 0.0,
701
- "clip_ratio/low_min": 0.0,
702
- "clip_ratio/region_mean": 0.0026041667442768812,
703
- "entropy": 1.0041130781173706,
704
- "epoch": 1.5833333333333335,
705
- "grad_norm": 1.6093602180480957,
706
- "kl": 0.11537543684244156,
707
- "learning_rate": 1.3e-06,
708
- "loss": -0.12453166395425797,
709
- "step": 38,
710
- "step_time": 0.22048816200003785
711
- },
712
- {
713
- "clip_ratio/high_max": 0.0045572915114462376,
714
- "clip_ratio/high_mean": 0.0045572915114462376,
715
- "clip_ratio/low_mean": 0.0013020833721384406,
716
- "clip_ratio/low_min": 0.0013020833721384406,
717
- "clip_ratio/region_mean": 0.005859375,
718
- "entropy": 1.500306487083435,
719
- "epoch": 1.625,
720
- "grad_norm": 3.409069299697876,
721
- "kl": 0.10904627293348312,
722
- "learning_rate": 1.2000000000000002e-06,
723
- "loss": 0.12661518156528473,
724
- "step": 39,
725
- "step_time": 0.22087437000004684
726
- },
727
- {
728
- "clip_ratio/high_max": 0.0078125,
729
- "clip_ratio/high_mean": 0.0078125,
730
- "clip_ratio/low_mean": 0.0013020833721384406,
731
- "clip_ratio/low_min": 0.0013020833721384406,
732
- "clip_ratio/region_mean": 0.009114583022892475,
733
- "entropy": 1.0560635328292847,
734
- "epoch": 1.6666666666666665,
735
- "grad_norm": 2.0718417167663574,
736
- "kl": 0.11926760524511337,
737
- "learning_rate": 1.1e-06,
738
- "loss": -0.0004449083062354475,
739
- "step": 40,
740
- "step_time": 0.2202887500000088
741
- },
742
- {
743
- "clip_ratio/high_max": 0.0,
744
- "clip_ratio/high_mean": 0.0,
745
- "clip_ratio/low_mean": 0.0013020833721384406,
746
- "clip_ratio/low_min": 0.0013020833721384406,
747
- "clip_ratio/region_mean": 0.0013020833721384406,
748
- "completions/clipped_ratio": 1.0,
749
- "completions/max_length": 384.0,
750
- "completions/max_terminated_length": 0.0,
751
- "completions/mean_length": 384.0,
752
- "completions/mean_terminated_length": 0.0,
753
- "completions/min_length": 384.0,
754
- "completions/min_terminated_length": 0.0,
755
- "entropy": 1.0184931755065918,
756
- "epoch": 1.7083333333333335,
757
- "frac_reward_zero_std": 0.0,
758
- "grad_norm": 1.9755194187164307,
759
- "kl": 0.1180298700928688,
760
- "learning_rate": 1.0000000000000002e-06,
761
- "loss": 0.03202051296830177,
762
- "num_tokens": 277896.0,
763
- "reward": -0.06937499344348907,
764
- "reward_std": 0.1560969203710556,
765
- "rewards/GeneratorRewardFunction/mean": -0.06937499344348907,
766
- "rewards/GeneratorRewardFunction/std": 0.1560969203710556,
767
- "step": 41,
768
- "step_time": 12.091393732999904
769
- },
770
- {
771
- "clip_ratio/high_max": 0.0032552082557231188,
772
- "clip_ratio/high_mean": 0.0032552082557231188,
773
- "clip_ratio/low_mean": 0.0006510416860692203,
774
- "clip_ratio/low_min": 0.0006510416860692203,
775
- "clip_ratio/region_mean": 0.00390625,
776
- "entropy": 0.8101570010185242,
777
- "epoch": 1.75,
778
- "grad_norm": 2.101008653640747,
779
- "kl": 0.13180766999721527,
780
- "learning_rate": 9.000000000000001e-07,
781
- "loss": -0.03199642524123192,
782
- "step": 42,
783
- "step_time": 0.22810021400005098
784
- },
785
- {
786
- "clip_ratio/high_max": 0.001953125,
787
- "clip_ratio/high_mean": 0.001953125,
788
- "clip_ratio/low_mean": 0.0006510416860692203,
789
- "clip_ratio/low_min": 0.0006510416860692203,
790
- "clip_ratio/region_mean": 0.0026041667442768812,
791
- "entropy": 0.9268913269042969,
792
- "epoch": 1.7916666666666665,
793
- "grad_norm": 2.1574151515960693,
794
- "kl": 0.11732880026102066,
795
- "learning_rate": 8.000000000000001e-07,
796
- "loss": 0.0002514577645342797,
797
- "step": 43,
798
- "step_time": 0.22811048399989886
799
- },
800
- {
801
- "clip_ratio/high_max": 0.00390625,
802
- "clip_ratio/high_mean": 0.00390625,
803
- "clip_ratio/low_mean": 0.0,
804
- "clip_ratio/low_min": 0.0,
805
- "clip_ratio/region_mean": 0.00390625,
806
- "entropy": 1.145074486732483,
807
- "epoch": 1.8333333333333335,
808
- "grad_norm": 2.5536458492279053,
809
- "kl": 0.12928128242492676,
810
- "learning_rate": 7.000000000000001e-07,
811
- "loss": 0.0016053098952397704,
812
- "step": 44,
813
- "step_time": 0.22788419799985604
814
- },
815
- {
816
- "clip_ratio/high_max": 0.0006510416860692203,
817
- "clip_ratio/high_mean": 0.0006510416860692203,
818
- "clip_ratio/low_mean": 0.0,
819
- "clip_ratio/low_min": 0.0,
820
- "clip_ratio/region_mean": 0.0006510416860692203,
821
- "completions/clipped_ratio": 1.0,
822
- "completions/max_length": 384.0,
823
- "completions/max_terminated_length": 0.0,
824
- "completions/mean_length": 384.0,
825
- "completions/mean_terminated_length": 0.0,
826
- "completions/min_length": 384.0,
827
- "completions/min_terminated_length": 0.0,
828
- "entropy": 1.1803818941116333,
829
- "epoch": 1.875,
830
- "frac_reward_zero_std": 0.0,
831
- "grad_norm": 3.2263009548187256,
832
- "kl": 0.13209813833236694,
833
- "learning_rate": 6.000000000000001e-07,
834
- "loss": 0.1281612068414688,
835
- "num_tokens": 303192.0,
836
- "reward": -0.11374999582767487,
837
- "reward_std": 0.18029142916202545,
838
- "rewards/GeneratorRewardFunction/mean": -0.11374999582767487,
839
- "rewards/GeneratorRewardFunction/std": 0.18029142916202545,
840
- "step": 45,
841
- "step_time": 12.014625670999976
842
- },
843
- {
844
- "clip_ratio/high_max": 0.0026041667442768812,
845
- "clip_ratio/high_mean": 0.0026041667442768812,
846
- "clip_ratio/low_mean": 0.0,
847
- "clip_ratio/low_min": 0.0,
848
- "clip_ratio/region_mean": 0.0026041667442768812,
849
- "entropy": 1.6430233716964722,
850
- "epoch": 1.9166666666666665,
851
- "grad_norm": 2.463127851486206,
852
- "kl": 0.11944004148244858,
853
- "learning_rate": 5.000000000000001e-07,
854
- "loss": -0.01078779250383377,
855
- "step": 46,
856
- "step_time": 0.22117237800011935
857
- },
858
- {
859
- "clip_ratio/high_max": 0.0013020833721384406,
860
- "clip_ratio/high_mean": 0.0013020833721384406,
861
- "clip_ratio/low_mean": 0.0,
862
- "clip_ratio/low_min": 0.0,
863
- "clip_ratio/region_mean": 0.0013020833721384406,
864
- "entropy": 1.1240859031677246,
865
- "epoch": 1.9583333333333335,
866
- "grad_norm": 2.1054372787475586,
867
- "kl": 0.13911886513233185,
868
- "learning_rate": 4.0000000000000003e-07,
869
- "loss": 0.001417159684933722,
870
- "step": 47,
871
- "step_time": 0.2201927370001613
872
- },
873
- {
874
- "clip_ratio/high_max": 0.0032552082557231188,
875
- "clip_ratio/high_mean": 0.0032552082557231188,
876
- "clip_ratio/low_mean": 0.0,
877
- "clip_ratio/low_min": 0.0,
878
- "clip_ratio/region_mean": 0.0032552082557231188,
879
- "entropy": 1.3605166673660278,
880
- "epoch": 2.0,
881
- "grad_norm": 1.7440528869628906,
882
- "kl": 0.14588220417499542,
883
- "learning_rate": 3.0000000000000004e-07,
884
- "loss": -0.11717051267623901,
885
- "step": 48,
886
- "step_time": 0.21969574700005978
887
- },
888
- {
889
- "clip_ratio/high_max": 0.0,
890
- "clip_ratio/high_mean": 0.0,
891
- "clip_ratio/low_mean": 0.0,
892
- "clip_ratio/low_min": 0.0,
893
- "clip_ratio/region_mean": 0.0,
894
- "completions/clipped_ratio": 1.0,
895
- "completions/max_length": 384.0,
896
- "completions/max_terminated_length": 0.0,
897
- "completions/mean_length": 384.0,
898
- "completions/mean_terminated_length": 0.0,
899
- "completions/min_length": 384.0,
900
- "completions/min_terminated_length": 0.0,
901
- "entropy": 0.9781540036201477,
902
- "epoch": 2.0416666666666665,
903
- "frac_reward_zero_std": 0.0,
904
- "grad_norm": 2.4057631492614746,
905
- "kl": 0.14009785652160645,
906
- "learning_rate": 2.0000000000000002e-07,
907
- "loss": 0.06281977146863937,
908
- "num_tokens": 328804.0,
909
- "reward": -0.07187499850988388,
910
- "reward_std": 0.11617336422204971,
911
- "rewards/GeneratorRewardFunction/mean": -0.07187499850988388,
912
- "rewards/GeneratorRewardFunction/std": 0.11617336422204971,
913
- "step": 49,
914
- "step_time": 12.04901073699989
915
- },
916
- {
917
- "clip_ratio/high_max": 0.0013020833721384406,
918
- "clip_ratio/high_mean": 0.0013020833721384406,
919
- "clip_ratio/low_mean": 0.0,
920
- "clip_ratio/low_min": 0.0,
921
- "clip_ratio/region_mean": 0.0013020833721384406,
922
- "entropy": 1.6572185754776,
923
- "epoch": 2.0833333333333335,
924
- "grad_norm": 2.6693296432495117,
925
- "kl": 0.13599954545497894,
926
- "learning_rate": 1.0000000000000001e-07,
927
- "loss": -0.16521048545837402,
928
- "step": 50,
929
- "step_time": 0.23019724899995708
930
- }
931
- ],
932
- "logging_steps": 1,
933
- "max_steps": 50,
934
- "num_input_tokens_seen": 328804,
935
- "num_train_epochs": 3,
936
- "save_steps": 10,
937
- "stateful_callbacks": {
938
- "TrainerControl": {
939
- "args": {
940
- "should_epoch_stop": false,
941
- "should_evaluate": false,
942
- "should_log": false,
943
- "should_save": true,
944
- "should_training_stop": true
945
- },
946
- "attributes": {}
947
- }
948
- },
949
- "total_flos": 0.0,
950
- "train_batch_size": 4,
951
- "trial_name": null,
952
- "trial_params": null
953
- }
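The deleted `trainer_state.json` interleaves two kinds of entries: rollout steps that log `reward`, `reward_std`, and `num_tokens`, and gradient-only steps that log just `loss`/`kl`. A minimal sketch of pulling the reward trajectory back out of such a file (the `reward_trajectory` helper is written here for illustration, not part of any library):

```python
# Sketch: extract (step, reward) pairs from a trainer_state.json like the
# one deleted above. Only rollout steps carry a "reward" key.
def reward_trajectory(state: dict) -> list:
    """Return (step, reward) pairs from the log history, in logged order."""
    return [(e["step"], e["reward"]) for e in state["log_history"] if "reward" in e]

# Minimal in-memory fixture mirroring the deleted log (values copied from
# steps 49-50 above); in practice you would json.load() the checkpoint file.
state = {"log_history": [
    {"step": 49, "reward": -0.07187499850988388, "loss": 0.06281977146863937},
    {"step": 50, "loss": -0.16521048545837402},  # gradient-only step, no reward
]}
print(reward_trajectory(state))  # [(49, -0.07187499850988388)]
```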
self_play_hf_a10g_train/round_001/generator_train/checkpoint-50/training_args.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:31ec66b64f432daf7616434296713e432d134face96e308f2ebc175e2e26f025
- size 7249
self_play_hf_a10g_train/round_001/generator_train/final_model/chat_template.jinja DELETED
@@ -1,54 +0,0 @@
- {%- if tools %}
-     {{- '<|im_start|>system\n' }}
-     {%- if messages[0]['role'] == 'system' %}
-         {{- messages[0]['content'] }}
-     {%- else %}
-         {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
-     {%- endif %}
-     {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
-     {%- for tool in tools %}
-         {{- "\n" }}
-         {{- tool | tojson }}
-     {%- endfor %}
-     {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
- {%- else %}
-     {%- if messages[0]['role'] == 'system' %}
-         {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
-     {%- else %}
-         {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
-     {%- endif %}
- {%- endif %}
- {%- for message in messages %}
-     {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
-         {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
-     {%- elif message.role == "assistant" %}
-         {{- '<|im_start|>' + message.role }}
-         {%- if message.content %}
-             {{- '\n' + message.content }}
-         {%- endif %}
-         {%- for tool_call in message.tool_calls %}
-             {%- if tool_call.function is defined %}
-                 {%- set tool_call = tool_call.function %}
-             {%- endif %}
-             {{- '\n<tool_call>\n{"name": "' }}
-             {{- tool_call.name }}
-             {{- '", "arguments": ' }}
-             {{- tool_call.arguments | tojson }}
-             {{- '}\n</tool_call>' }}
-         {%- endfor %}
-         {{- '<|im_end|>\n' }}
-     {%- elif message.role == "tool" %}
-         {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
-             {{- '<|im_start|>user' }}
-         {%- endif %}
-         {{- '\n<tool_response>\n' }}
-         {{- message.content }}
-         {{- '\n</tool_response>' }}
-         {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
-             {{- '<|im_end|>\n' }}
-         {%- endif %}
-     {%- endif %}
- {%- endfor %}
- {%- if add_generation_prompt %}
-     {{- '<|im_start|>assistant\n' }}
- {%- endif %}
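On its plain path (no `tools`, no `tool_calls`), the deleted `chat_template.jinja` emits standard ChatML: a default Qwen system message if none is supplied, one `<|im_start|>role\n...<|im_end|>\n` block per message, and a trailing `<|im_start|>assistant\n` when `add_generation_prompt` is set. A minimal sketch of that path in Python (`render_chatml` is a hypothetical helper for illustration, not a transformers API):

```python
# Sketch of the no-tools path of the deleted Qwen chat template.
DEFAULT_SYSTEM = ("You are Qwen, created by Alibaba Cloud. "
                  "You are a helpful assistant.")

def render_chatml(messages, add_generation_prompt=True):
    # Prepend the default system message if the caller did not supply one.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": DEFAULT_SYSTEM}] + messages
    # One ChatML block per message.
    text = "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
                   for m in messages)
    # Open the assistant turn for generation.
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text

print(render_chatml([{"role": "user", "content": "hi"}]))
```

In practice the real template is applied via the tokenizer's `apply_chat_template`; this sketch only mirrors the output layout.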
self_play_hf_a10g_train/round_001/generator_train/final_model/config.json DELETED
@@ -1,57 +0,0 @@
- {
-   "architectures": [
-     "Qwen2ForCausalLM"
-   ],
-   "attention_dropout": 0.0,
-   "bos_token_id": null,
-   "dtype": "float32",
-   "eos_token_id": 151645,
-   "hidden_act": "silu",
-   "hidden_size": 896,
-   "initializer_range": 0.02,
-   "intermediate_size": 4864,
-   "layer_types": [
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention",
-     "full_attention"
-   ],
-   "max_position_embeddings": 32768,
-   "max_window_layers": 21,
-   "model_type": "qwen2",
-   "num_attention_heads": 14,
-   "num_hidden_layers": 24,
-   "num_key_value_heads": 2,
-   "pad_token_id": 151643,
-   "rms_norm_eps": 1e-06,
-   "rope_parameters": {
-     "rope_theta": 1000000.0,
-     "rope_type": "default"
-   },
-   "sliding_window": null,
-   "tie_word_embeddings": true,
-   "transformers_version": "5.6.2",
-   "use_cache": false,
-   "use_sliding_window": false,
-   "vocab_size": 151936
- }
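The config above identifies the model as Qwen2-0.5B-class, and its shape fields are enough to sanity-check the ~1.98 GB float32 `model.safetensors` deleted below. A rough parameter-count sketch, assuming the standard Qwen2 layout (biases on q/k/v projections but not o_proj, tied input/output embeddings):

```python
# Estimate parameters of the deleted config and compare against the stored
# float32 safetensors size (1,976,163,472 bytes; the small gap is metadata).
hidden, inter, vocab = 896, 4864, 151936
layers, n_heads, n_kv = 24, 14, 2
head_dim = hidden // n_heads  # 64

embed = vocab * hidden  # tied with the LM head, so counted once
attn = (hidden * hidden + hidden                         # q_proj + bias
        + 2 * (hidden * n_kv * head_dim + n_kv * head_dim)  # k/v_proj + biases
        + hidden * hidden)                               # o_proj (no bias)
mlp = 3 * hidden * inter                                 # gate, up, down proj
norms = 2 * hidden                                       # two RMSNorms per layer

total = embed + layers * (attn + mlp + norms) + hidden   # + final RMSNorm
print(total)      # ~4.94e8 parameters
print(4 * total)  # float32 bytes, within ~32 kB of the stored file size
```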
self_play_hf_a10g_train/round_001/generator_train/final_model/generation_config.json DELETED
@@ -1,13 +0,0 @@
- {
-   "do_sample": true,
-   "eos_token_id": [
-     151645,
-     151643
-   ],
-   "pad_token_id": 151643,
-   "repetition_penalty": 1.1,
-   "temperature": 0.7,
-   "top_k": 20,
-   "top_p": 0.8,
-   "transformers_version": "5.6.2"
- }
self_play_hf_a10g_train/round_001/generator_train/final_model/model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:c6fa4eed67a84ce4076ba3848a078496971cd34ba048c794e52cc3b4aab54a27
- size 1976163472
self_play_hf_a10g_train/round_001/generator_train/final_model/tokenizer.json DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
- size 11421892
self_play_hf_a10g_train/round_001/generator_train/final_model/tokenizer_config.json DELETED
@@ -1,32 +0,0 @@
- {
-   "add_prefix_space": false,
-   "backend": "tokenizers",
-   "bos_token": null,
-   "clean_up_tokenization_spaces": false,
-   "eos_token": "<|im_end|>",
-   "errors": "replace",
-   "extra_special_tokens": [
-     "<|im_start|>",
-     "<|im_end|>",
-     "<|object_ref_start|>",
-     "<|object_ref_end|>",
-     "<|box_start|>",
-     "<|box_end|>",
-     "<|quad_start|>",
-     "<|quad_end|>",
-     "<|vision_start|>",
-     "<|vision_end|>",
-     "<|vision_pad|>",
-     "<|image_pad|>",
-     "<|video_pad|>"
-   ],
-   "is_local": false,
-   "local_files_only": false,
-   "model_max_length": 131072,
-   "pad_token": "<|endoftext|>",
-   "padding_side": "left",
-   "split_special_tokens": false,
-   "tokenizer_class": "Qwen2Tokenizer",
-   "truncation_side": "left",
-   "unk_token": null
- }
self_play_hf_a10g_train/round_001/generator_train/final_model/training_args.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:31ec66b64f432daf7616434296713e432d134face96e308f2ebc175e2e26f025
- size 7249