Add methodology and VRAM notes
METHODOLOGY_ARCHITECTURE_NOTES_2026-05-24.md
ADDED
@@ -0,0 +1,263 @@
# KoHRM-Text Methodology and Architecture Notes

Date: 2026-05-24

This document summarizes where `KoHRM-Text-1.4B` matches the approach of the HRM-Text paper and where it differs operationally.

References:

- HRM-Text paper: https://arxiv.org/html/2605.20613
- Upstream code: https://github.com/sapientinc/HRM-Text
- KoHRM-Text code: https://github.com/LLM-OS-Models/KoHRM-text

## Conclusion

From a methodology standpoint, our current training follows the single-stage instruction pretraining of the HRM-Text paper.

From an execution and operations standpoint, however, it is not identical to the paper. The paper describes training on 40B unique tokens in a single continuous run, without intermediate checkpointing or crash recovery. We split the run into stage-0, stage0b, and stage-1 because of data preparation, HF uploads, OOM avoidance, and checkpoint preservation.

The key differences are as follows.

| Item | HRM-Text paper | KoHRM-Text current approach |
|---|---|---|
| Training objective | instruction-response task-completion objective | Same |
| Loss | response-only NLL | Same |
| Attention | PrefixLM, instruction bidirectional + response causal | Uses the same code path |
| SFT after raw LM pretraining | Not done | Not done |
| SFT candidate data | Included in instruction pretraining | Included |
| Execution form | Single continuous run | Staged resume run |
| Checkpoint | No intermediate checkpointing | Saved every 5,000 steps for operational reasons |
| Tokenizer | 65,536 BPE | 131,072 Korean/terminal BPE |
| Hardware | 16 x H100 | 8 x H200 |

The accurate answer to "Is this single-stage instruction pretraining like the paper?" is therefore:

> The training objective and data format are single-stage instruction pretraining.
> The execution, however, is not a single continuous run in one process. It is staged pretraining that keeps the same objective and continues via checkpoint resume.

## Paper Methodology Summary

The HRM-Text paper regards the conventional multi-stage LLM recipe (large-scale raw-text causal LM pretraining followed by mid-training/SFT) as inefficient, and trains on instruction-response pairs only from the start.

Key points of the paper:

1. It does not predict every token of raw text.
2. No loss is applied to instruction tokens.
3. Negative log-likelihood loss is applied to response tokens only.
4. The instruction segment gets bidirectional attention through a PrefixLM mask.
5. The response segment keeps autoregressive causal attention.
6. The data controls response style by prepending a condition tag such as direct, cot, synth, or noisy to the instruction.
7. Explicit long-CoT traces such as `<think>...</think>` are removed, so that the internal recurrent computation handles the problem solving.

The paper describes training a 1B HRM-Text from scratch using about 40B unique tokens and about 46 hours on 16 x H100. The public evaluation uses the EMA checkpoint.

## Current KoHRM-Text Training Approach

`KoHRM-Text-1.4B` is currently trained in the HRM-Text V1Dataset format, not as a raw causal LM.

Each sample basically has the following structure.

```text
instruction span -> response span
```

At the token level, it is laid out as follows.

```text
<|im_start|> condition_tokens instruction <|im_end|> response <|box_end|>
```

The condition tokens use the following mapping.

| condition | token |
|---|---|
| direct | `<|object_ref_start|>` |
| cot | `<|object_ref_end|>` |
| noisy | `<|quad_start|>` |
| synth | `<|quad_end|>` |

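The token layout and condition mapping above can be sketched as a small string-rendering helper. This is only an illustration: `render_sample` is a hypothetical name, and concatenating the pieces without separators is an assumption, since the notes do not say whether whitespace sits between the special tokens and the text.

```python
# Condition-to-token mapping copied from the table above.
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synth": "<|quad_end|>",
}


def render_sample(condition: str, instruction: str, response: str) -> str:
    """Hypothetical helper: lay out one sample in the documented token order.

    Assumption: pieces are joined with no separator characters.
    """
    tag = CONDITION_TOKENS[condition]  # raises KeyError for an unknown condition
    return f"<|im_start|>{tag}{instruction}<|im_end|>{response}<|box_end|>"


if __name__ == "__main__":
    print(render_sample("cot", "What is 2 + 3?", "5"))
```

Everything up to and including `<|im_end|>` forms the instruction span, and the rest forms the response span.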
In `dataset_new.py`, the instruction span is included in `inputs`, but when `target_only=True` its labels become `IGNORE_LABEL_ID`. Only the response span receives actual cross-entropy loss.

In other words, the current training is not "a raw LM that predicts a document from start to finish" but "task-completion pretraining that generates a response given an instruction/context".

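A minimal sketch of response-only labeling, under stated assumptions: the value `-100` for `IGNORE_LABEL_ID` is a guess (the real constant lives in the codebase), `build_labels` and `response_only_nll` are hypothetical names, and the labels are shown per token without the next-token shift that the real pipeline may apply.

```python
IGNORE_LABEL_ID = -100  # assumed value, for illustration only


def build_labels(instruction_ids, response_ids, target_only=True):
    """Return (inputs, labels); instruction positions are masked when target_only."""
    inputs = list(instruction_ids) + list(response_ids)
    if target_only:
        labels = [IGNORE_LABEL_ID] * len(instruction_ids) + list(response_ids)
    else:
        labels = list(inputs)
    return inputs, labels


def response_only_nll(token_log_probs, labels):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    kept = [-lp for lp, y in zip(token_log_probs, labels) if y != IGNORE_LABEL_ID]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    inputs, labels = build_labels([11, 12, 13], [21, 22])
    print(inputs, labels)
    print(response_only_nll([-9.0, -9.0, -9.0, -1.0, -3.0], labels))
```

In this toy run the three instruction positions contribute nothing to the loss, however poor their log-probabilities are.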
## PrefixLM Implementation Check

In the current code, the PrefixLM path can be confirmed in the following files.

| File | Role |
|---|---|
| `dataset_new.py` | Splits instruction/response spans, builds response-only labels |
| `models/flash_attention_prefixlm_v2.py` | Implements prefix bidirectional attention + response causal attention |
| `models/layers.py` | Uses attention type `prefixlm` |
| `models/lm_head.py` | Computes CE loss on response labels only, excluding `IGNORE_LABEL_ID` |

For each sample, `dataset_new.py` passes the instruction length as `prefix_lens` and the response length as `causal_lens`.

`flash_attention_prefixlm_v2.py` handles attention in two parts.

1. Prefix region: bidirectional attention among instruction tokens
2. Causal region: causal attention in which response tokens see the entire prefix and the preceding response tokens

This structure is the core of what matches the paper's PrefixLM.

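The two-part attention pattern can be made concrete with a dense boolean mask for a single sample. This is a reference sketch only: `prefixlm_mask` is a hypothetical function, and the FlashAttention-based file named above presumably does not materialize a mask like this.

```python
def prefixlm_mask(prefix_len: int, causal_len: int):
    """Return mask[q][k] == True where query position q may attend to key position k."""
    n = prefix_len + causal_len
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < prefix_len:
                # Instruction token: sees every instruction token, no response token.
                mask[q][k] = k < prefix_len
            else:
                # Response token: sees the whole prefix and responses up to itself.
                mask[q][k] = k <= q
    return mask


if __name__ == "__main__":
    for row in prefixlm_mask(prefix_len=2, causal_len=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With `prefix_len=2` and `causal_len=2`, the first two rows are identical (full visibility inside the prefix) and the last two rows form the usual causal staircase.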
## Model Architecture

The current standard model name is `KoHRM-Text-1.4B`, and it uses the configuration corresponding to `arch/size@arch=XL`.

| Item | Value |
|---|---:|
| params | 1,384,120,320 |
| hidden size | 1,536 |
| total configured layers | 32 |
| half layers | true |
| H module layers | 16 |
| L module layers | 16 |
| heads | 12 |
| head dim | 128 |
| expansion | 4 |
| intermediate size | 4,096 |
| context | 4,096 |
| RoPE theta | 10,000 |
| norm | RMSNorm-style parameterless norm |
| init | LeCun normal |
| dtype | bfloat16 |
| tokenizer vocab | 131,072 |

HRM recurrent schedule:

| Item | Value |
|---|---:|
| H cycles | 2 |
| L cycles per H cycle | 3 |
| effective H/L recurrence | H2L3 |
| bp min steps | 2 |
| bp max steps | 5 |
| bp warmup ratio | 0.2 |

In the code, the flow is as follows.

1. The hidden state that starts from the token embedding is set as `z_H`.
2. `z_L` starts from the learned/fixed low-level initial state `z_L_init`.
3. In each H cycle, the L module is updated 3 times.
4. The H module is then updated once.
5. A total of 2 H cycles are performed.
6. The LM head is applied to the final `z_H` to produce vocabulary logits.

In the paper's terms, this is a dual-timescale recurrent design with the H module as a slow-evolving strategic layer and the L module as a fast-evolving execution layer.

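The H2L3 flow described above can be sketched as a nested loop. The control flow follows the numbered steps; what each module receives as input is an assumption here (the L module is given the current `z_H`, the H module the updated `z_L`), and `hrm_forward` with its callable arguments is a hypothetical stand-in for the real modules.

```python
def hrm_forward(embedding, z_l_init, l_module, h_module, lm_head,
                h_cycles=2, l_cycles=3):
    """Run the nested H/L recurrence and return the LM head output."""
    z_h = embedding          # step 1: hidden state from the token embedding
    z_l = z_l_init           # step 2: low-level initial state
    for _ in range(h_cycles):            # step 5: 2 H cycles in total
        for _ in range(l_cycles):        # step 3: 3 L updates per H cycle
            z_l = l_module(z_l, z_h)
        z_h = h_module(z_h, z_l)         # step 4: one H update
    return lm_head(z_h)                  # step 6: logits from the final z_H


if __name__ == "__main__":
    # Toy scalar "modules", used only to trace the schedule.
    out = hrm_forward(
        embedding=1,
        z_l_init=0,
        l_module=lambda z_l, z_h: z_l + z_h,
        h_module=lambda z_h, z_l: z_h + z_l,
        lm_head=lambda z_h: z_h,
    )
    print(out)
```

With the default schedule, one forward pass applies the L module 6 times and the H module 2 times.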
## MagicNorm / Stabilization

Because activation variance and gradient instability grow as recurrent depth increases, the paper uses MagicNorm and warmup deep credit assignment.

The current code uses `norm_type: pre` but applies a final RMSNorm at the end of each Transformer module. That is, the internal blocks are PreNorm style and the module output is capped by a norm. This corresponds to the MagicNorm-style stabilization described in the paper.

The backward pass does not go through every recurrent step from the start. Instead, `bp_steps` is warmed up.

Current settings:

```yaml
bp_warmup_ratio: 0.2
bp_min_steps: 2
bp_max_steps: 5
```

Early in training, gradients flow mainly through the last 2 recurrent steps, and this grows to at most 5 steps as training progresses. This also matches the paper's approach.

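One way to read these three settings is as a ramp from `bp_min_steps` to `bp_max_steps` over the first 20% of training. The sketch below assumes a linear ramp with floor rounding; the notes do not state the actual ramp shape, and `bp_steps_at` is a hypothetical function name.

```python
def bp_steps_at(step: int, total_steps: int, warmup_ratio: float = 0.2,
                min_steps: int = 2, max_steps: int = 5) -> int:
    """Number of trailing recurrent steps that receive gradients at `step`.

    Assumption: linear ramp during warmup, then held at max_steps.
    """
    warmup_steps = round(total_steps * warmup_ratio)
    if warmup_steps <= 0 or step >= warmup_steps:
        return max_steps
    fraction = step / warmup_steps
    return min_steps + int(fraction * (max_steps - min_steps))


if __name__ == "__main__":
    for s in (0, 100, 199, 200, 999):
        print(s, bp_steps_at(s, total_steps=1000))
```

With a hypothetical 1,000-step run, the warmup covers the first 200 steps and the backward depth stays at 5 afterwards.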
## Optimizer / Schedule

Current pretraining config:

| Item | Value |
|---|---:|
| optimizer | Adam-atan2 |
| beta1 | 0.9 |
| beta2 | 0.95 |
| weight decay | 0.1 |
| lr | 2.2e-4 |
| lr warmup | 2,000 steps |
| lr min ratio | 1.0 |
| EMA | 0.9999 |
| gradient clipping | None |

The paper also uses Adam-atan2, 2,000 warmup steps, weight decay 0.1, EMA 0.9999, and bf16. Release and evaluation are based on the EMA checkpoint.

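To illustrate how the schedule and EMA values interact, here is a small sketch. It assumes linear warmup, a cosine decay toward `min_ratio * lr`, and a standard exponential moving average; none of these shapes are confirmed by the notes, and `lr_at`, `ema_update`, and `total_steps=100_000` are hypothetical. If `lr min ratio` is read as floor divided by peak, a ratio of 1.0 makes the decay term vanish, so the LR stays constant after warmup.

```python
import math


def lr_at(step, base_lr=2.2e-4, warmup_steps=2000, total_steps=100_000,
          min_ratio=1.0):
    """Learning rate at `step` (assumed linear warmup + cosine decay to a floor)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    # With min_ratio == 1.0 the second term is zero and the LR is constant.
    return base_lr * (min_ratio + (1.0 - min_ratio) * cosine)


def ema_update(ema_params, params, decay=0.9999):
    """One EMA step over flat lists of floats."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


if __name__ == "__main__":
    print(lr_at(0), lr_at(2000), lr_at(50_000))
    print(ema_update([0.0], [1.0]))
```

A decay of 0.9999 corresponds to an averaging horizon of roughly 1 / (1 - 0.9999) = 10,000 steps.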
## Why the Current Staged Run Differs from the Paper

The paper describes a "single continuous run". We operate a staged run for the following practical reasons.

1. Korean/terminal/tokenizer preprocessing is finishing sequentially.
2. To avoid leaving the GPUs idle, training started with the data that was already prepared.
3. On the 8 x H200 setup, the batch OOM safety margin had to be measured empirically because of the 131K vocab.
4. HF uploads and raw checkpoint preservation are needed.
5. The full HRM 328G no-cap retokenization is still in progress.

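Reason 3 can be illustrated with back-of-the-envelope arithmetic on the logits tensor, whose size scales linearly with vocabulary size. The micro-batch shape below (8 sequences of 4,096 tokens per GPU) is a hypothetical example, not a value from the actual run, and real peak memory also depends on how the loss is computed.

```python
def logits_bytes(num_tokens: int, vocab_size: int, bytes_per_element: int) -> int:
    """Size in bytes of one dense [num_tokens, vocab_size] logits tensor."""
    return num_tokens * vocab_size * bytes_per_element


GIB = 2**30
# Hypothetical per-GPU micro-batch: 8 sequences x 4,096 tokens (assumption).
TOKENS = 8 * 4096

if __name__ == "__main__":
    for vocab in (65_536, 131_072):
        bf16 = logits_bytes(TOKENS, vocab, 2) / GIB
        fp32 = logits_bytes(TOKENS, vocab, 4) / GIB
        print(f"vocab={vocab}: bf16 {bf16:.0f} GiB, fp32 {fp32:.0f} GiB")
```

Under these assumptions, each materialized copy of the logits doubles from 4 GiB to 8 GiB in bf16 (8 GiB to 16 GiB in fp32) when moving from the 65,536 vocab to 131,072.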
However, the stages do not switch to a different objective function.

| Stage | Objective | Nature |
|---|---|---|
| stage-0 | PrefixLM response-only | The already prepared 711.3M mix |
| stage0b | PrefixLM response-only | Additional pass over the same mix |
| stage-1 | PrefixLM response-only | HRM fast-cap 14.55B |
| later stage | PrefixLM response-only | full HRM 328G retokenized + Korean/terminal/tool-call mix |
| final SFT | PrefixLM response-only or response-only for SFT | Post-processing on a high-quality subset |

The important point is that the transition from stage-0 to stage-1 carries over model/optimizer/EMA/carry, and the code was modified so that `resume_step_offset` and `total_steps_override` continue the global step and LR schedule.

In other words, the "training methodology" is single-stage instruction pretraining, and the "operating mode" is staged continuation.

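The effect of the two resume settings can be sketched as follows. The semantics are assumed from the names: `resume_step_offset` is added to the stage-local step, and `total_steps_override` replaces the stage's own length when the schedule is computed. `resumed_schedule` is a hypothetical function, the step counts are made-up examples, and the warmup-only LR is a simplification.

```python
def resumed_schedule(local_step, resume_step_offset, stage_steps,
                     total_steps_override=None, base_lr=2.2e-4,
                     warmup_steps=2000):
    """Return (global_step, schedule_total_steps, lr) for a stage-local step."""
    global_step = resume_step_offset + local_step
    total_steps = (total_steps_override
                   if total_steps_override is not None else stage_steps)
    # Assumed linear warmup keyed on the global step, constant afterwards.
    lr = base_lr * min(1.0, (global_step + 1) / warmup_steps)
    return global_step, total_steps, lr


if __name__ == "__main__":
    # Fresh stage: warmup starts from scratch.
    print(resumed_schedule(0, 0, stage_steps=10_000))
    # Resumed stage with hypothetical numbers: already past warmup at local step 0.
    print(resumed_schedule(0, 10_000, stage_steps=50_000,
                           total_steps_override=60_000))
```

Without the offset, a new stage would restart at global step 0 and repeat the LR warmup, which is the discontinuity the modification is meant to avoid.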
## Whether the Current Data Fits the Single-Stage Principle

All current prepared datasets have been converted to instruction-response form as far as possible.

| Data | Fit with the single-stage principle |
|---|---|
| HRM cleaned data | Fits, since it already has the HRM instruction/response/condition structure |
| ToolBench | Fits, since it has a tool instruction -> tool-call/answer response structure |
| SWE-ZERO/local terminal | Fits, since it has a terminal context -> next action/answer structure |
| GLM/Claude reasoning | Fits if organized around the final answer |
| Korean legal/wiki source text | Fits, since it is converted into chunked instruction/response tasks before being fed in |

Points to note:

- If raw Korean wiki/legal chunks are fed in as plain next-text prediction, training drifts away from the paper's task-completion objective.
- It is therefore a better fit to put the title/context in the instruction and the chunk/summary/extraction in the response.
- The local terminal dataset fits the objective well, but if its overall share grows too large, the balance of general knowledge and Korean may break down.

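The second point above can be illustrated by wrapping a raw chunk as an instruction-response sample. The field names (`condition`, `instruction`, `response`), the prompt wording, and the choice of the `direct` condition are all hypothetical; the actual conversion scripts are not shown in these notes.

```python
def wiki_chunk_to_sample(title: str, context: str, chunk: str,
                         condition: str = "direct") -> dict:
    """Hypothetical converter: title/context go to the instruction, the chunk to the response."""
    instruction = (
        f"Title: {title}\n"
        f"Context: {context}\n"
        "Continue the passage."
    )
    return {"condition": condition, "instruction": instruction, "response": chunk}


if __name__ == "__main__":
    sample = wiki_chunk_to_sample(
        title="Example article",
        context="Opening paragraph of the article.",
        chunk="The next paragraph of the article.",
    )
    print(sample["instruction"])
    print(sample["response"])
```

Only the `response` field would receive loss, so the raw text is still learned, but as a conditioned completion rather than as unconditional next-text prediction.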
## Evaluation of the Current Approach

The current approach matches the core of the paper well.

What matches:

- scratch training
- HRM H2L3 recurrent architecture
- PrefixLM attention
- response-only loss
- use of condition tokens
- Adam-atan2
- bf16
- EMA 0.9999
- 4,096 context
- LeCun normal init

What differs:

- The vocab is 131,072, not 65,536.
- The hardware is 8 x H200, not 16 x H100.
- The paper uses a single continuous run, while we use a staged resume run.
- The paper reports 40B unique tokens, while the current public checkpoint is an intermediate artifact of the stage-1 fast-cap.
- The share of Korean/terminal/tool-call data is much larger than in the paper.

The differences above are intentional changes. In particular, since the target is Korean/terminal/tool calling, changing the tokenizer and data distribution is appropriate. To reproduce the paper's efficiency, however, training ultimately needs to continue to the 40B+ token level by combining the full HRM cleaned data with a balanced Korean/terminal/tool mix.

## Operating Conclusion

For now, it is right to continue on the following basis.

1. Keep the current stage-1 running.
2. Upload preprocessed data to an HF dataset repo so that it is reproducible on other machines.
3. Once the full HRM 328G no-cap retokenization finishes, continue training in the next stage.
4. Include SFT candidate data in pretraining first.
5. Run a separate final SFT again at the end on a high-quality subset.
6. Keep the model repo centered on the latest safetensors, and separate the raw checkpoint repo for resume use.