gyung committed on
Commit 77ff990 · verified · 1 Parent(s): 0756b71

Add methodology and VRAM notes

METHODOLOGY_ARCHITECTURE_NOTES_2026-05-24.md ADDED
@@ -0,0 +1,263 @@
# KoHRM-Text Methodology and Architecture Notes

Written: 2026-05-24

This document summarizes where `KoHRM-Text-1.4B` matches the approach of the HRM-Text paper and where it differs operationally.

References:

- HRM-Text paper: https://arxiv.org/html/2605.20613
- Upstream code: https://github.com/sapientinc/HRM-Text
- KoHRM-Text code: https://github.com/LLM-OS-Models/KoHRM-text

## Conclusion

From a methodology standpoint, our current training follows the single-stage instruction pretraining described in the HRM-Text paper.

Operationally, however, it is not identical to the paper. The paper states that it trained on 40B unique tokens in a single continuous run, without intermediate checkpointing or crash recovery. We split execution into stage-0, stage0b, and stage-1 because of data preparation, HF uploads, OOM avoidance, and checkpoint preservation.

The key differences are as follows.

| Item | HRM-Text paper | KoHRM-Text (current) |
|---|---|---|
| Training objective | instruction-response task-completion objective | Same |
| loss | response-only NLL | Same |
| attention | PrefixLM, instruction bidirectional + response causal | Same code path |
| SFT after raw LM pretraining | Not done | Not done |
| SFT candidate data | Included in instruction pretraining | Included |
| Execution mode | Single continuous run | Staged resume run |
| checkpoint | No intermediate checkpointing | Saved every 5,000 steps for operational reasons |
| tokenizer | 65,536 BPE | 131,072 Korean/terminal BPE |
| hardware | 16 x H100 | 8 x H200 |

So the accurate answer to "Is this single-stage instruction pretraining, as in the paper?" is:

> The training objective and data format are single-stage instruction pretraining.
> Execution, however, is not a single continuous run in one process; it is staged pretraining that keeps the same objective and continues via checkpoint resume.

## Summary of the paper's methodology

The HRM-Text paper regards the conventional multi-stage LLM recipe (large-scale raw-text causal LM pretraining followed by mid-training and SFT) as inefficient, and instead trains on instruction-response pairs alone from the start.

Key points of the paper:

1. It does not predict every token of raw text.
2. Instruction tokens receive no loss.
3. Only response tokens receive the negative log-likelihood loss.
4. The instruction segment is allowed bidirectional attention through a PrefixLM mask.
5. The response segment keeps autoregressive causal attention.
6. A condition tag such as direct, cot, synth, or noisy is prepended to the instruction to control the response style.
7. Explicit long-CoT traces such as `<think>...</think>` are removed so that the internal recurrent computation handles the problem solving.

The paper states that it trained a 1B HRM-Text from scratch on about 40B unique tokens, using 16 x H100 for about 46 hours. Public evaluation uses the EMA checkpoint.

## Current KoHRM-Text training approach

`KoHRM-Text-1.4B` is likewise trained in the HRM-Text V1Dataset format, not as a raw causal LM.

Each sample basically has the following structure.

```text
instruction span -> response span
```

At the token level it is laid out as follows.

```text
<|im_start|> condition_tokens instruction <|im_end|> response <|box_end|>
```

Condition tokens use the following mapping.

| condition | token |
|---|---|
| direct | `<|object_ref_start|>` |
| cot | `<|object_ref_end|>` |
| noisy | `<|quad_start|>` |
| synth | `<|quad_end|>` |

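To make the token layout concrete, the following is a minimal sketch that assembles one sample as a string using the mapping in the table above. The function name `format_sample` and the plain string concatenation are illustrative assumptions; the real pipeline works on token IDs, and its exact whitespace handling is not described in these notes.

```python
# Illustrative only: mirrors the layout described above, not the actual dataset code.
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synth": "<|quad_end|>",
}


def format_sample(condition: str, instruction: str, response: str) -> str:
    """Lay out one sample as: im_start, condition token, instruction, im_end, response, box_end."""
    if condition not in CONDITION_TOKENS:
        raise ValueError(f"unknown condition: {condition!r}")
    # Instruction span: everything up to and including <|im_end|>.
    prefix = "<|im_start|>" + CONDITION_TOKENS[condition] + instruction + "<|im_end|>"
    # Response span: the answer followed by the end marker.
    return prefix + response + "<|box_end|>"
```

For example, `format_sample("cot", "What is 2 + 3?", "5")` places `<|object_ref_end|>` directly after `<|im_start|>`, so the condition is part of the instruction span.
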
Per `dataset_new.py`, the instruction span is included in `inputs`, but when `target_only=True` its labels are set to `IGNORE_LABEL_ID`. Only the response span receives the actual cross-entropy loss.

In other words, the current training is not "a raw LM that predicts the document from start to finish" but "task-completion pretraining that reads the given instruction/context and completes the response".

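The label masking described above can be sketched as follows. This is a simplified illustration, not the code in `dataset_new.py`: the value `-100` for `IGNORE_LABEL_ID`, the function names, and the omission of next-token shifting are all assumptions made to keep the example short.

```python
from typing import List, Tuple

IGNORE_LABEL_ID = -100  # assumed value; the real constant is defined in the codebase


def build_inputs_and_labels(
    instruction_ids: List[int], response_ids: List[int], target_only: bool = True
) -> Tuple[List[int], List[int]]:
    """Concatenate both spans as inputs; mask the instruction labels when target_only."""
    inputs = instruction_ids + response_ids
    if target_only:
        # Instruction positions are visible to the model but excluded from the loss.
        labels = [IGNORE_LABEL_ID] * len(instruction_ids) + list(response_ids)
    else:
        labels = list(inputs)
    return inputs, labels


def count_supervised(labels: List[int]) -> int:
    """Number of positions that would contribute to the cross-entropy loss."""
    return sum(1 for label in labels if label != IGNORE_LABEL_ID)
```

With a 3-token instruction and a 2-token response, all 5 tokens appear in `inputs` but only 2 positions are supervised.
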
## Verifying the PrefixLM implementation

In the current code, the PrefixLM path can be confirmed in the following files.

| File | Role |
|---|---|
| `dataset_new.py` | Splits instruction/response spans and builds response-only labels |
| `models/flash_attention_prefixlm_v2.py` | Implements prefix bidirectional attention + response causal attention |
| `models/layers.py` | Uses attention type `prefixlm` |
| `models/lm_head.py` | Computes CE loss only on response labels, skipping `IGNORE_LABEL_ID` |

For each sample, `dataset_new.py` passes the instruction length as `prefix_lens` and the response length as `causal_lens`.

`flash_attention_prefixlm_v2.py` handles attention in two parts.

1. Prefix region: bidirectional attention among the instruction tokens
2. Causal region: causal attention in which each response token sees the entire prefix and the preceding response tokens

This structure is the key point of agreement with the paper's PrefixLM.

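The two-part attention pattern can be illustrated with a dense boolean mask for a single sample. This is only a reference sketch under the assumption that one prefix length and one causal length describe the sample; the function name `prefixlm_mask` is hypothetical, and the actual kernel in `flash_attention_prefixlm_v2.py` may never materialize such a mask.

```python
from typing import List


def prefixlm_mask(prefix_len: int, causal_len: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    total = prefix_len + causal_len
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            if q < prefix_len:
                # Instruction tokens see the whole instruction, in both directions.
                row.append(k < prefix_len)
            else:
                # Response tokens see the full prefix plus response positions up to themselves.
                row.append(k <= q)
        mask.append(row)
    return mask
```

For `prefix_len=2, causal_len=2`, the first two rows allow keys 0 and 1 only, the third row allows keys 0 to 2, and the last row allows all four keys.
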
## Model architecture

The current standard model name is `KoHRM-Text-1.4B`, and it uses the configuration corresponding to `arch/size@arch=XL`.

| Item | Value |
|---|---:|
| params | 1,384,120,320 |
| hidden size | 1,536 |
| total configured layers | 32 |
| half layers | true |
| H module layers | 16 |
| L module layers | 16 |
| heads | 12 |
| head dim | 128 |
| expansion | 4 |
| intermediate size | 4,096 |
| context | 4,096 |
| RoPE theta | 10,000 |
| norm | RMSNorm-style parameterless norm |
| init | LeCun normal |
| dtype | bfloat16 |
| tokenizer vocab | 131,072 |

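A few of the figures in the table above can be cross-checked with simple arithmetic, which also hints at why the 131,072-entry vocabulary matters for memory. The sketch below uses only numbers from the table; treating the logits as one bf16 tensor over a full 4,096-token sequence is an assumption for illustration, and real memory use depends on batch size, any fp32 upcasting in the loss, and implementation details not covered here.

```python
HIDDEN_SIZE = 1536
NUM_HEADS = 12
HEAD_DIM = 128
VOCAB_SIZE = 131_072
CONTEXT = 4096
BYTES_PER_BF16 = 2

# Attention width: heads x head dim should equal the hidden size.
attention_width = NUM_HEADS * HEAD_DIM

# Parameters in one vocab x hidden matrix (an embedding table or an output projection).
vocab_matrix_params = VOCAB_SIZE * HIDDEN_SIZE

# bf16 logits for a single full-context sequence, in GiB.
logits_gib = CONTEXT * VOCAB_SIZE * BYTES_PER_BF16 / 2**30

# The same quantity for a 65,536-entry vocabulary, for comparison.
logits_gib_64k = CONTEXT * 65_536 * BYTES_PER_BF16 / 2**30
```

Under these assumptions, one vocab-sized matrix holds 201,326,592 parameters, and the logits for one full-length sequence take 1 GiB in bf16, twice what a 65,536-entry vocabulary would need.
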
HRM recurrent schedule:

| Item | Value |
|---|---:|
| H cycles | 2 |
| L cycles per H cycle | 3 |
| effective H/L recurrence | H2L3 |
| bp min steps | 2 |
| bp max steps | 5 |
| bp warmup ratio | 0.2 |

In the code, the flow is as follows.

1. The hidden state that starts from the token embedding is taken as `z_H`.
2. `z_L` starts from the learned/fixed low-level initial state `z_L_init`.
3. In each H cycle, the L module is updated 3 times in a row.
4. The H module is then updated once.
5. A total of 2 H cycles are performed.
6. The LM head is applied to the final `z_H` to produce vocabulary logits.

In the paper's terms, this is a dual-timescale recurrent design with the H module as a slow-evolving strategic layer and the L module as a fast-evolving execution layer.

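The six steps above can be sketched as a nested loop. This is a schematic under stated assumptions, not the model code: `hrm_forward` is a hypothetical name, the modules are passed in as plain callables, and the choice of which states each module receives as input is a guess made for illustration.

```python
from typing import Callable, List

State = List[float]


def hrm_forward(
    embedded: State,
    z_l_init: State,
    l_module: Callable[[State, State], State],
    h_module: Callable[[State, State], State],
    h_cycles: int = 2,
    l_cycles: int = 3,
) -> State:
    """Run the nested H/L recurrence and return the final high-level state."""
    z_h = list(embedded)   # step 1: z_H starts from the token embeddings
    z_l = list(z_l_init)   # step 2: z_L starts from z_L_init
    for _ in range(h_cycles):          # step 5: two H cycles in total
        for _ in range(l_cycles):      # step 3: three L updates per H cycle
            z_l = l_module(z_l, z_h)
        z_h = h_module(z_h, z_l)       # step 4: one H update per H cycle
    return z_h                         # step 6: the LM head would consume this state
```

With the default H2L3 schedule, the L module is called 6 times and the H module 2 times per forward pass.
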
## MagicNorm / stabilization

Because activation variance and gradient instability grow as the recurrent depth increases, the paper uses MagicNorm and warmup deep credit assignment.

The current code uses `norm_type: pre` but applies a final RMSNorm at the end of the Transformer module. That is, the inner blocks are PreNorm style and the module output is capped by a norm. This corresponds to the MagicNorm-family stabilization described in the paper.

The backward pass does not go through every recurrent step from the start; instead, `bp_steps` is warmed up.

Current settings:

```yaml
bp_warmup_ratio: 0.2
bp_min_steps: 2
bp_max_steps: 5
```

Early in training, gradients flow mainly through the last 2 recurrent steps, and this grows to at most 5 steps as training progresses. This also matches the paper's approach.

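One way to picture the `bp_steps` warmup is a ramp from the minimum to the maximum over the first fraction of training. The sketch below assumes a linear ramp over the first `bp_warmup_ratio` of the total steps, with flooring to an integer; the notes do not specify the actual ramp shape, so both the function `bp_steps_at` and that shape are hypothetical.

```python
def bp_steps_at(
    step: int,
    total_steps: int,
    bp_min_steps: int = 2,
    bp_max_steps: int = 5,
    bp_warmup_ratio: float = 0.2,
) -> int:
    """Number of trailing recurrent steps that receive gradients at a given training step."""
    warmup_steps = int(total_steps * bp_warmup_ratio)
    if warmup_steps <= 0 or step >= warmup_steps:
        # After the warmup window, backpropagate through the maximum depth.
        return bp_max_steps
    progress = step / warmup_steps  # in [0, 1) during warmup
    # Linear ramp (assumed), floored to an integer number of steps.
    return bp_min_steps + int(progress * (bp_max_steps - bp_min_steps))
```

Under this assumed schedule with 1,000 total steps, training starts at 2 steps, reaches 3 by step 100, and uses 5 from step 200 onward.
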
## Optimizer / schedule

Current pretraining config:

| Item | Value |
|---|---:|
| optimizer | Adam-atan2 |
| beta1 | 0.9 |
| beta2 | 0.95 |
| weight decay | 0.1 |
| lr | 2.2e-4 |
| lr warmup | 2,000 steps |
| lr min ratio | 1.0 |
| EMA | 0.9999 |
| gradient clipping | None |

The paper also uses Adam-atan2, 2,000 warmup steps, weight decay 0.1, EMA 0.9999, and bf16. Release and evaluation are based on the EMA checkpoint.

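The schedule values in the table can be read as follows, if `lr min ratio` is the floor of the learning rate as a fraction of its peak: with a ratio of 1.0 there is no decay, so the rate stays at 2.2e-4 once warmup ends. The sketch below assumes linear warmup and a cosine decay shape (irrelevant at ratio 1.0, shown only for completeness) plus a standard exponential moving average; the function names are hypothetical and the Adam-atan2 update itself is not reproduced.

```python
import math


def lr_at(
    step: int,
    total_steps: int,
    base_lr: float = 2.2e-4,
    warmup_steps: int = 2000,
    min_ratio: float = 1.0,
) -> float:
    """Learning rate with linear warmup, then cosine decay down to min_ratio * base_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    decay_span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # With min_ratio == 1.0 the cosine term is multiplied by zero: constant LR.
    return base_lr * (min_ratio + (1.0 - min_ratio) * cosine)


def ema_update(ema: float, param: float, decay: float = 0.9999) -> float:
    """One EMA step for a single scalar weight."""
    return decay * ema + (1.0 - decay) * param
```

At decay 0.9999, each step moves the EMA weight only 0.01% of the way toward the current weight, which is why the EMA checkpoint lags and smooths the raw weights.
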
## Why the current staged run differs from the paper

The paper describes its training as a "single continuous run". We operate a staged run for the following practical reasons.

1. Korean/terminal/tokenizer preprocessing is finishing sequentially.
2. To avoid leaving GPUs idle, training started with the data that was already prepared.
3. On 8 x H200, the 131K vocab meant the safe batch-size margin against OOM had to be measured empirically.
4. HF uploads and raw checkpoint preservation are required.
5. The full HRM 328G no-cap retokenization is still in progress.

However, the objective function does not change from stage to stage.

| Stage | Objective | Description |
|---|---|---|
| stage-0 | PrefixLM response-only | 711.3M mix that was already prepared |
| stage0b | PrefixLM response-only | Additional pass over the same mix |
| stage-1 | PrefixLM response-only | HRM fast-cap 14.55B |
| later stage | PrefixLM response-only | full HRM 328G retokenized + Korean/terminal/tool-call mix |
| final SFT | PrefixLM response-only or SFT-style response-only | Post-processing on a high-quality subset |

The important point is that the transition from stage-0 to stage-1 carries over model/optimizer/EMA/carry, and the code was modified so that `resume_step_offset` and `total_steps_override` continue the global step and LR schedule.

In short, the "training methodology" is single-stage instruction pretraining, and the "mode of operation" is staged continuation.

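The role of `resume_step_offset` and `total_steps_override` can be pictured with a small sketch. Only the two option names come from these notes; their exact semantics, the `StageConfig` class, the helper functions, and the step counts in the example are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StageConfig:
    stage_steps: int                            # steps this stage will run
    resume_step_offset: int = 0                 # global steps already completed by earlier stages
    total_steps_override: Optional[int] = None  # schedule length across all stages


def global_step(cfg: StageConfig, local_step: int) -> int:
    """Step index used for schedules, continuing from earlier stages."""
    return cfg.resume_step_offset + local_step


def schedule_total_steps(cfg: StageConfig) -> int:
    """Total schedule length: the override if given, else offset plus this stage's steps."""
    if cfg.total_steps_override is not None:
        return cfg.total_steps_override
    return cfg.resume_step_offset + cfg.stage_steps
```

With a hypothetical offset of 10,000 steps, the first step of the resumed stage is treated as global step 10,000, so a 2,000-step LR warmup is not repeated.
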
## Whether the current data fits the single-stage principle

All currently prepared datasets have been converted to instruction-response form wherever possible.

| Data | Fit with the single-stage principle |
|---|---|
| HRM cleaned data | Fits: already in the HRM instruction/response/condition structure |
| ToolBench | Fits: tool instruction -> tool-call/answer response structure |
| SWE-ZERO/local terminal | Fits: terminal context -> next action/answer structure |
| GLM/Claude reasoning | Fits if cleaned up to center on the final answer |
| Korean legal/wiki source text | Fits: converted into chunked instruction/response tasks before use |

Caveats:

- If Korean wiki/legal raw chunks are fed in as plain next-text prediction, training drifts away from the paper's task-completion setup.
- It is therefore more appropriate to put the title/context in the instruction and the chunk/summary/extraction in the response.
- The local terminal dataset fits the objective well, but if its overall share grows too large, the balance of general knowledge and Korean may break down.

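The second caveat, putting the title/context in the instruction and the chunk in the response, might look like the sketch below. The function `chunk_to_sample`, the prompt wording, the dictionary keys, and the default `direct` condition are all hypothetical; the actual conversion scripts are not shown in these notes.

```python
from typing import Dict


def chunk_to_sample(
    title: str, context: str, chunk: str, condition: str = "direct"
) -> Dict[str, str]:
    """Turn a raw document chunk into an instruction/response pair."""
    # The title and preceding context go into the instruction (no loss on these tokens).
    instruction = f"Title: {title}\nContext: {context}\nContinue the passage."
    # The chunk itself becomes the response (the only supervised span).
    return {
        "condition": condition,
        "instruction": instruction.strip(),
        "response": chunk.strip(),
    }
```

The same shape works for summary or extraction tasks by swapping the final instruction sentence and using the summary or extracted fields as the response.
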
## Assessment of the current approach

The current approach matches the core of the paper well.

What matches:

- scratch training
- HRM H2L3 recurrent architecture
- PrefixLM attention
- response-only loss
- use of condition tokens
- Adam-atan2
- bf16
- EMA 0.9999
- 4,096 context
- LeCun normal init

What differs:

- The vocab is 131,072, not 65,536.
- The hardware is 8 x H200, not 16 x H100.
- The paper used a single continuous run; we use a staged resume run.
- The paper reported 40B unique tokens, while the current public checkpoint is an intermediate artifact of the stage-1 fast-cap run.
- The share of Korean/terminal/tool-call data is much larger than in the paper.

These differences are intentional. In particular, since the targets are Korean, terminal, and tool-call use, changing the tokenizer and the data distribution is appropriate. To reproduce the paper's efficiency, however, training must eventually continue to the 40B+ token level on the full HRM cleaned data combined with a balanced Korean/terminal/tool mix.

## Operational conclusion

For now, the right course is to continue on the following basis.

1. Keep the current stage-1 running.
2. Upload preprocessed data to the HF dataset repo so that it is reproducible on other machines.
3. Once the full HRM 328G no-cap retokenization finishes, continue training as the next stage.
4. Include SFT candidate data in pretraining first as well.
5. Run a separate final SFT at the end on a high-quality subset.
6. Keep the model repo centered on the latest safetensors, and keep the raw checkpoint repo separate for resume.