# KoHRM-Text Methodology and Architecture Notes

Written: 2026-05-24

This document summarizes where `KoHRM-Text-1.4B` matches the approach of the HRM-Text paper and where it differs operationally.

References:

- HRM-Text paper: https://arxiv.org/html/2605.20613
- Upstream code: https://github.com/sapientinc/HRM-Text
- KoHRM-Text code: https://github.com/LLM-OS-Models/KoHRM-text

## Conclusion

From a methodology standpoint, our current training follows the single-stage instruction pretraining of the HRM-Text paper.

Operationally, however, it is not identical to the paper. The paper states that it trained on 40B unique tokens in a single continuous run, without intermediate checkpointing or crash recovery. We split execution into stage-0, stage0b, and stage-1 because of data preparation, HF uploads, OOM avoidance, and checkpoint preservation.

The key differences are as follows.

| Item | HRM-Text paper | KoHRM-Text (current) |
|---|---|---|
| Training objective | instruction-response task-completion objective | Same |
| loss | response-only NLL | Same |
| attention | PrefixLM, instruction bidirectional + response causal | Uses the same code path |
| SFT after raw LM pretraining | Not done | Not done |
| SFT candidate data | Included in instruction pretraining | Included |
| Execution | Single continuous run | staged resume run |
| checkpoint | No intermediate checkpointing | Saved every 5,000 steps for operational reasons |
| tokenizer | 65,536 BPE | 131,072 Korean/terminal BPE |
| hardware | 16 x H100 | 8 x H200 |

The accurate answer to "Is this single-stage instruction pretraining as in the paper?" is therefore:

> The training objective and data format are single-stage instruction pretraining.  
> Execution, however, is not a single continuous run in one process. It is staged pretraining that keeps the same objective and continues through checkpoint resume.

## Paper Methodology Summary

The HRM-Text paper regards the conventional multi-stage LLM recipe (large-scale raw-text causal LM pretraining followed by mid-training/SFT) as inefficient, and instead trains on instruction-response pairs only, from the start.

Key points of the paper:

1. It does not predict every token of raw text.
2. No loss is applied to instruction tokens.
3. Negative log-likelihood loss is applied to response tokens only.
4. The instruction segment is allowed bidirectional attention through the PrefixLM mask.
5. The response segment keeps autoregressive causal attention.
6. Condition tags such as direct, cot, synth, and noisy are prepended to the instruction to control the response style.
7. Explicit long-CoT traces such as `<think>...</think>` are removed so that the internal recurrent computation handles the problem solving.
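
Point 7 can be illustrated with a small preprocessing helper. This is a minimal sketch, assuming traces are delimited by literal `<think>` and `</think>` tags; the function name `strip_think` is hypothetical and not taken from the repository.

```python
import re

# Matches one explicit reasoning trace, including any newlines inside it.
THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(response: str) -> str:
    """Drop every <think>...</think> span and trim surrounding whitespace."""
    return THINK_RE.sub("", response).strip()


if __name__ == "__main__":
    raw = "<think>2 + 2 is 4</think>\nThe answer is 4."
    print(strip_think(raw))
```

The non-greedy `.*?` keeps two separate traces in one response from being merged into a single match.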

The paper states that it trained a 1B HRM-Text model from scratch on about 40B unique tokens, using 16 x H100 for about 46 hours. The EMA checkpoint is used for the published evaluation.

## Current KoHRM-Text Training Method

The current `KoHRM-Text-1.4B` is likewise trained in the HRM-Text V1Dataset format, not as a raw causal LM.

Each sample has the following basic structure.

```text
instruction span -> response span
```

At the token level, it is laid out as follows.

```text
<|im_start|> condition_tokens instruction <|im_end|> response <|box_end|>
```

Condition tokens use the following mapping.

| condition | token |
|---|---|
| direct | `<\|object_ref_start\|>` |
| cot | `<\|object_ref_end\|>` |
| noisy | `<\|quad_start\|>` |
| synth | `<\|quad_end\|>` |

In `dataset_new.py`, the instruction span is included in `inputs`, but when `target_only=True` its labels become `IGNORE_LABEL_ID`. Only the response span receives actual cross entropy loss.

In other words, the current training is not a "raw LM that predicts a document from beginning to end" but "task-completion pretraining that reads the given instruction/context and completes the response".
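
The labeling rule above can be sketched as follows. This is an illustration only, not the code in `dataset_new.py`: the special-token ids, the value of `IGNORE_LABEL_ID`, the one-position label shift, and the choice to count `<|im_end|>` as part of the instruction span are all assumptions.

```python
IGNORE_LABEL_ID = -100               # assumed sentinel value
IM_START, IM_END, BOX_END = 1, 2, 3  # hypothetical special-token ids
CONDITION_IDS = {"direct": 4, "cot": 5, "noisy": 6, "synth": 7}  # hypothetical


def build_example(condition, instruction_ids, response_ids, target_only=True):
    """Return (inputs, labels) for next-token prediction on one sample."""
    prefix = [IM_START, CONDITION_IDS[condition], *instruction_ids, IM_END]
    full = prefix + list(response_ids) + [BOX_END]
    inputs, targets = full[:-1], full[1:]
    if not target_only:
        return inputs, targets
    # targets[i] is full[i + 1], so the first response token sits at
    # index len(prefix) - 1; every target before it belongs to the prefix.
    first_response = len(prefix) - 1
    labels = [IGNORE_LABEL_ID] * first_response + targets[first_response:]
    return inputs, labels


if __name__ == "__main__":
    inputs, labels = build_example("cot", [10, 11], [20, 21])
    print(inputs)
    print(labels)
```

Under these assumptions, only the response tokens and the closing `<|box_end|>` carry a real label; every instruction position is ignored by the loss.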

## PrefixLM Implementation Check

In the current code, the PrefixLM path can be verified in the following files.

| File | Role |
|---|---|
| `dataset_new.py` | Splits instruction/response spans, builds response-only labels |
| `models/flash_attention_prefixlm_v2.py` | Implements prefix bidirectional attention + response causal attention |
| `models/layers.py` | Uses attention type `prefixlm` |
| `models/lm_head.py` | Computes CE loss on response labels only, excluding `IGNORE_LABEL_ID` |

For each sample, `dataset_new.py` passes the instruction length as `prefix_lens` and the response length as `causal_lens`.

`flash_attention_prefixlm_v2.py` handles attention in two parts.

1. prefix region: bidirectional attention among instruction tokens
2. causal region: causal attention in which response tokens see the entire prefix and the preceding response tokens

This structure is the core of the match with the paper's PrefixLM.
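
The two-part attention pattern can be written out as an explicit boolean mask. This is a minimal sketch for a single sample. The real kernel in `flash_attention_prefixlm_v2.py` presumably never materializes such a mask, and the function name `prefixlm_mask` is hypothetical.

```python
def prefixlm_mask(prefix_len: int, causal_len: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    n = prefix_len + causal_len
    return [
        # Every query sees the whole prefix; beyond it, only keys at or before q.
        [(k < prefix_len) or (k <= q) for k in range(n)]
        for q in range(n)
    ]


if __name__ == "__main__":
    for row in prefixlm_mask(prefix_len=2, causal_len=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Instruction rows never see response columns, so the instruction representation cannot leak information from the answer it is supposed to condition.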

## Model Architecture

The current standard model name is `KoHRM-Text-1.4B`, and it uses the configuration corresponding to `arch/size@arch=XL`.

| Item | Value |
|---|---:|
| params | 1,384,120,320 |
| hidden size | 1,536 |
| total configured layers | 32 |
| half layers | true |
| H module layers | 16 |
| L module layers | 16 |
| heads | 12 |
| head dim | 128 |
| expansion | 4 |
| intermediate size | 4,096 |
| context | 4,096 |
| RoPE theta | 10,000 |
| norm | RMSNorm-style parameterless norm |
| init | LeCun normal |
| dtype | bfloat16 |
| tokenizer vocab | 131,072 |
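
The table values can be sanity-checked with a little arithmetic. The sketch below is only one decomposition that happens to reproduce the listed parameter count. The untied LM head, the three-matrix gated MLP, and especially the fifth hidden x hidden attention matrix are assumptions made to balance the arithmetic, not details confirmed by this document.

```python
hidden, layers, vocab = 1536, 32, 131072
heads, head_dim, intermediate = 12, 128, 4096

# Confirmed by the table: the heads exactly tile the hidden size.
assert heads * head_dim == hidden

embedding = vocab * hidden
lm_head = vocab * hidden             # ASSUMPTION: output head not tied to embedding
attention = 5 * hidden * hidden      # ASSUMPTION: Q, K, V, O plus one extra square matrix
mlp = 3 * hidden * intermediate      # ASSUMPTION: gated MLP with three matrices
# The table lists a parameterless norm, so norms add nothing here.

total = embedding + lm_head + layers * (attention + mlp)
print(f"{total:,}")
```

If the real layer layout differs, the same total could come from a different split, so treat this as a consistency check on the table rather than a description of the model.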

HRM recurrent schedule:

| Item | Value |
|---|---:|
| H cycles | 2 |
| L cycles per H cycle | 3 |
| effective H/L recurrence | H2L3 |
| bp min steps | 2 |
| bp max steps | 5 |
| bp warmup ratio | 0.2 |

The flow in the code is as follows.

1. The hidden state that starts from the token embedding is set as `z_H`.
2. `z_L` starts from the learned/fixed low-level initial state `z_L_init`.
3. In each H cycle, the L module is updated 3 times.
4. The H module is then updated once.
5. A total of 2 H cycles are performed.
6. The LM head is applied to the final `z_H` to produce vocabulary logits.

In the paper's terms, this is a dual-timescale recurrent design with the H module as a slow-evolving strategic layer and the L module as a fast-evolving execution layer.
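
The six steps above can be sketched as a nested loop. This is a schematic under stated assumptions: `l_module` and `h_module` are hypothetical callables standing in for the Transformer stacks, and how each module actually combines `z_L` and `z_H` is not specified in this document.

```python
def hrm_forward(x_embed, z_l_init, l_module, h_module, h_cycles=2, l_cycles=3):
    """Run the H2L3 schedule and return the final high-level state z_H."""
    z_h, z_l = x_embed, z_l_init
    for _ in range(h_cycles):
        for _ in range(l_cycles):
            z_l = l_module(z_l, z_h)   # fast updates, conditioned on z_H
        z_h = h_module(z_h, z_l)       # one slow update per H cycle
    return z_h                         # the LM head would be applied to this


if __name__ == "__main__":
    trace = []

    def toy_l(z_l, z_h):
        trace.append("L")
        return z_l + z_h

    def toy_h(z_h, z_l):
        trace.append("H")
        return z_h + z_l

    hrm_forward(1.0, 0.0, toy_l, toy_h)
    print("".join(trace))  # LLLHLLLH
```

With H2L3 this gives 6 L-module updates and 2 H-module updates per forward pass.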

## MagicNorm / Stabilization

Because activation variance and gradient instability grow as recurrent depth increases, the paper uses MagicNorm and warmup deep credit assignment.

The current code uses `norm_type: pre` but applies a final RMSNorm at the end of each Transformer module. In other words, the inner blocks are PreNorm-style and the module output is capped by a norm. This corresponds to the MagicNorm-style stabilization described in the paper.

The backward pass does not go through every recurrent step from the start. Instead, `bp_steps` is warmed up.

Current settings:

```yaml
bp_warmup_ratio: 0.2
bp_min_steps: 2
bp_max_steps: 5
```

Early in training, gradients flow mainly through the last 2 recurrent steps, and this grows to a maximum of 5 steps as training progresses. This also matches the paper's approach.
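
One way to picture this warmup is a linear ramp of the backpropagated step count. The exact schedule is not given in this document, so the linear shape and the reading of `bp_warmup_ratio` as a fraction of total training steps are assumptions, and `bp_steps_at` is a hypothetical name.

```python
def bp_steps_at(step, total_steps, warmup_ratio=0.2, min_steps=2, max_steps=5):
    """Number of trailing recurrent steps that receive gradient at `step`."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    progress = min(1.0, step / warmup_steps)
    # Floor the ramp so the count grows in whole steps and saturates at max_steps.
    return min_steps + int(progress * (max_steps - min_steps))


if __name__ == "__main__":
    for step in (0, 100, 200, 999):
        print(step, bp_steps_at(step, total_steps=1000))
```

Under these assumptions, a 1,000-step run would backpropagate through 2 steps at the start and reach all 5 by step 200.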

## Optimizer / Schedule

Current pretraining config:

| Item | Value |
|---|---:|
| optimizer | Adam-atan2 |
| beta1 | 0.9 |
| beta2 | 0.95 |
| weight decay | 0.1 |
| lr | 2.2e-4 |
| lr warmup | 2,000 steps |
| lr min ratio | 1.0 |
| EMA | 0.9999 |
| gradient clipping | None |

The paper also uses Adam-atan2, 2,000 warmup steps, weight decay 0.1, EMA 0.9999, and bf16. Release and evaluation are based on the EMA checkpoint.
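
The schedule values in the table can be made concrete with a short sketch. The linear warmup and cosine decay shapes are assumptions. With `lr min ratio` set to 1.0, any decay toward `lr * min_ratio` collapses to a constant learning rate after warmup. The function names are hypothetical.

```python
import math


def lr_at(step, total_steps, base_lr=2.2e-4, warmup_steps=2000, min_ratio=1.0):
    """Learning rate at `step`: linear warmup, then cosine decay to base_lr * min_ratio."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return base_lr * (min_ratio + (1.0 - min_ratio) * cosine)


def ema_update(ema, params, decay=0.9999):
    """One EMA step over flat lists of parameter values."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


if __name__ == "__main__":
    print(lr_at(0, 100_000), lr_at(50_000, 100_000))
    print(ema_update([0.0], [1.0]))
```

A decay of 0.9999 gives the average a time constant of roughly 1 / (1 - 0.9999) = 10,000 steps, which is one reason the EMA weights lag the raw weights early in a run.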

## Why the Current Staged Run Differs from the Paper

The paper describes a "single continuous run". We operate a staged run for the following practical reasons.

1. Korean/terminal/tokenizer preprocessing is finishing sequentially.
2. To avoid idle GPUs, we started training on the data that was already prepared.
3. On 8 x H200, the 131K vocab meant we had to measure the safe batch-size margin against OOM empirically.
4. HF uploads and raw checkpoint preservation are required.
5. The full HRM 328G no-cap retokenization is still in progress.

However, the objective function does not change from stage to stage.

| Stage | Objective | Description |
|---|---|---|
| stage-0 | PrefixLM response-only | The 711.3M mix that was already prepared |
| stage0b | PrefixLM response-only | Additional pass over the same mix |
| stage-1 | PrefixLM response-only | HRM fast-cap 14.55B |
| later stage | PrefixLM response-only | full HRM 328G retokenized + Korean/terminal/tool-call mix |
| final SFT | PrefixLM response-only or SFT-style response-only | Post-processing on a high-quality subset |

The important point is that the transition from stage-0 to stage-1 inherits the model/optimizer/EMA/carry, and the code was modified so that `resume_step_offset` and `total_steps_override` continue the global step and LR schedule.
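
The effect of the two settings can be sketched as follows. Only the names `resume_step_offset` and `total_steps_override` come from this document. The helper functions, the linear warmup shape, and the step numbers are hypothetical illustrations of the idea that a resumed stage keeps counting on the global schedule.

```python
def to_global_step(local_step: int, resume_step_offset: int = 0) -> int:
    """Map a stage-local step onto the schedule shared by all stages."""
    return resume_step_offset + local_step


def schedule_length(stage_steps: int, total_steps_override=None) -> int:
    """Length the LR schedule is computed against for this stage."""
    return stage_steps if total_steps_override is None else total_steps_override


def warmup_scale(global_step: int, warmup_steps: int = 2000) -> float:
    """Linear warmup factor in (0, 1]; assumed shape, for illustration only."""
    return min(1.0, (global_step + 1) / warmup_steps)


if __name__ == "__main__":
    offset = 3000  # hypothetical: the previous stage stopped at step 3,000
    print(warmup_scale(to_global_step(0)))          # without offset: warmup restarts
    print(warmup_scale(to_global_step(0, offset)))  # with offset: warmup already done
```

Without the offset, each stage would replay the 2,000-step warmup from a near-zero learning rate while the restored optimizer state still reflects a fully warmed-up run.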

In short, the "training methodology" is single-stage instruction pretraining, and the "mode of operation" is staged continuation.

## Whether the Current Data Fits the Single-Stage Principle

All currently prepared datasets were converted to instruction-response form wherever possible.

| Data | Fit with the single-stage principle |
|---|---|
| HRM cleaned data | Fits, because it already has the HRM instruction/response/condition structure |
| ToolBench | Fits, because it has a tool instruction -> tool-call/answer response structure |
| SWE-ZERO/local terminal | Fits, because it has a terminal context -> next action/answer structure |
| GLM/Claude reasoning | Fits once cleaned up to center on the final answer |
| Korean legal/wiki source text | Fits, because it is fed in after conversion to chunked instruction/response tasks |

Points to watch:

- Feeding Korean wiki/legal raw chunks in as plain next-text prediction moves away from the paper's task-completion setup.
- It is therefore a better fit to use the title/context as the instruction and the chunk/summary/extraction as the response.
- The local terminal dataset fits the objective well, but if its overall share grows too large, the balance of general knowledge and Korean may break down.
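
The second point can be illustrated by wrapping a raw chunk as an instruction-response record. This is a minimal sketch: the field names, the prompt wording, and the function `wiki_chunk_to_example` are hypothetical and do not reflect the actual preprocessing scripts.

```python
def wiki_chunk_to_example(title, context, chunk, condition="direct"):
    """Turn a raw document chunk into one instruction-response record."""
    instruction = (
        f"Title: {title}\n"
        f"Context: {context}\n"
        "Write the passage that follows this context."
    )
    # Only `response` would receive loss; `instruction` is the PrefixLM prefix.
    return {"condition": condition, "instruction": instruction, "response": chunk}


if __name__ == "__main__":
    example = wiki_chunk_to_example(
        title="Example article",
        context="The first paragraph of the article.",
        chunk="The second paragraph of the article.",
    )
    print(example["instruction"])
    print(example["response"])
```

The same wrapper shape would work for summary or extraction tasks by changing the final prompt line and putting the summary or extracted fields in `response`.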

## Assessment of the Current Approach

The current approach matches the core of the paper well.

What matches:

- scratch training
- HRM H2L3 recurrent architecture
- PrefixLM attention
- response-only loss
- condition token ์‚ฌ์šฉ
- Adam-atan2
- bf16
- EMA 0.9999
- 4,096 context
- LeCun normal init

What differs:

- The vocab is 131,072, not 65,536.
- The hardware is 8 x H200, not 16 x H100.
- The paper uses a single continuous run; we use a staged resume run.
- The paper reports 40B unique tokens, whereas the current public checkpoint is an intermediate artifact of the stage-1 fast-cap run.
- The share of Korean/terminal/tool-call data is much larger than in the paper.

These differences are intentional. In particular, since the target is Korean/terminal/tool-call use, changing the tokenizer and the data distribution is the right call. To reproduce the paper's efficiency, however, training must eventually continue to the 40B+ token level on the full HRM cleaned data combined with a balanced Korean/terminal/tool mix.

## Operational Conclusion

For now, it is right to continue on the following basis.

1. Keep the current stage-1 running.
2. Upload fully preprocessed data to the HF dataset repo so that it is reproducible on other machines.
3. When the full HRM 328G no-cap retokenization finishes, continue training in the next stage.
4. Include SFT candidate data in pretraining first.
5. Run a separate final SFT again at the end on a high-quality subset.
6. Keep the model repo centered on the latest safetensors, and keep a separate raw checkpoint repo for resume.