# KoHRM-Text Methodology and Architecture Notes
Date: 2026-05-24
This document summarizes where `KoHRM-Text-1.4B` matches the method of the HRM-Text paper and where it differs operationally.
References:
- HRM-Text paper: https://arxiv.org/html/2605.20613
- Upstream code: https://github.com/sapientinc/HRM-Text
- KoHRM-Text code: https://github.com/LLM-OS-Models/KoHRM-text
## Conclusion
From a methodology standpoint, our current training follows the single-stage instruction pretraining of the HRM-Text paper.
Operationally, however, it is not identical to the paper. The paper states that it trained on 40B unique tokens in a single continuous run, without intermediate checkpointing or crash recovery. We split the run into stage-0, stage0b, and stage-1 because of data preparation, HF uploads, OOM avoidance, and checkpoint preservation.
The key differences are as follows.
| Item | HRM-Text paper | KoHRM-Text current approach |
|---|---|---|
| training objective | instruction-response task-completion objective | same |
| loss | response-only NLL | same |
| attention | PrefixLM, instruction bidirectional + response causal | uses the same code path |
| SFT after raw LM pretraining | not done | not done |
| SFT candidate data | included in instruction pretraining | included |
| run structure | single continuous run | staged resume run |
| checkpoint | no intermediate checkpointing | saved every 5,000 steps for operational reasons |
| tokenizer | 65,536 BPE | 131,072 Korean/terminal BPE |
| hardware | 16 x H100 | 8 x H200 |
So the accurate answer to "Is this single-stage instruction pretraining like the paper?" is:
> The training objective and data format are single-stage instruction pretraining.
> The execution, however, is not a single continuous run in one process; it is staged pretraining that keeps the same objective and continues via checkpoint resume.
## Paper Methodology Summary
The HRM-Text paper regards the conventional multi-stage LLM recipe (large-scale raw-text causal LM pretraining followed by mid-training/SFT) as inefficient, and instead trains on instruction-response pairs only from the start.
Key points of the paper:
1. It does not predict every token of raw text.
2. No loss is applied to instruction tokens.
3. Negative log-likelihood loss is applied only to response tokens.
4. The instruction segment is allowed bidirectional attention via a PrefixLM mask.
5. The response segment keeps autoregressive causal attention.
6. Condition tags such as direct, cot, synth, and noisy are prepended to the instruction to control response style.
7. Explicit long-CoT traces such as `<think>...</think>` are removed so that the internal recurrent computation handles problem solving.
The paper states that it trained a 1B HRM-Text from scratch, using about 40B unique tokens and about 46 hours on 16 x H100. Public evaluation uses the EMA checkpoint.
## Current KoHRM-Text Training Approach
`KoHRM-Text-1.4B` is likewise trained in the HRM-Text V1Dataset format rather than as a raw causal LM.
Each sample basically has the following structure.
```text
instruction span -> response span
```
At the token level, it is laid out as follows.
```text
<|im_start|> condition_tokens instruction <|im_end|> response <|box_end|>
```
The condition tokens use the following mapping.
| condition | token |
|---|---|
| direct | `<|object_ref_start|>` |
| cot | `<|object_ref_end|>` |
| noisy | `<|quad_start|>` |
| synth | `<|quad_end|>` |
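The layout and mapping above can be sketched as a small string-level helper. This is only an illustration: `render_sample` is a hypothetical name, the real pipeline operates on token IDs rather than strings, and concatenating the pieces with no separators is an assumption.

```python
# Condition-to-token mapping copied from the table above.
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synth": "<|quad_end|>",
}


def render_sample(condition: str, instruction: str, response: str) -> str:
    """Hypothetical helper: lay out one sample in the documented order.

    Assumption: pieces are concatenated without separators; the actual
    tokenizer-level layout may differ.
    """
    cond = CONDITION_TOKENS[condition]
    return f"<|im_start|>{cond}{instruction}<|im_end|>{response}<|box_end|>"


if __name__ == "__main__":
    print(render_sample("direct", "What is 2 + 2?", "4"))
```

Everything before `<|im_end|>` belongs to the instruction span and everything after it to the response span.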
In `dataset_new.py`, the instruction span is included in `inputs`, but when `target_only=True` its labels become `IGNORE_LABEL_ID`. Only the response span receives actual cross-entropy loss.
In other words, the current training is not a raw LM that "predicts a document from start to finish" but task-completion pretraining that "completes a response given an instruction/context".
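A minimal sketch of the response-only labeling described above, not the actual `dataset_new.py` code. The function name and the value `-100` for `IGNORE_LABEL_ID` are assumptions, and the next-token shift is left out for clarity.

```python
IGNORE_LABEL_ID = -100  # assumed value; the repo defines its own constant


def build_inputs_and_labels(instruction_ids, response_ids, target_only=True):
    """Hypothetical helper: concatenate spans and mask instruction labels.

    The next-token shift between inputs and labels is omitted here; it is
    assumed to happen elsewhere (for example in the loss code).
    """
    inputs = list(instruction_ids) + list(response_ids)
    if target_only:
        # Instruction positions are visible as inputs but carry no loss.
        labels = [IGNORE_LABEL_ID] * len(instruction_ids) + list(response_ids)
    else:
        # Raw-LM style: every position is a training target.
        labels = list(inputs)
    return inputs, labels


if __name__ == "__main__":
    inputs, labels = build_inputs_and_labels([11, 12, 13], [21, 22])
    print(inputs)
    print(labels)
```

With `target_only=False` the same sample would be trained like ordinary raw-LM text, which is the setting this project avoids.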
## PrefixLM Implementation Check
In the current code, the PrefixLM path can be confirmed in the following files.
| File | Role |
|---|---|
| `dataset_new.py` | splits instruction/response spans, builds response-only labels |
| `models/flash_attention_prefixlm_v2.py` | implements prefix bidirectional attention + response causal attention |
| `models/layers.py` | uses attention type `prefixlm` |
| `models/lm_head.py` | computes CE loss only on response labels, excluding `IGNORE_LABEL_ID` |
For each sample, `dataset_new.py` passes the instruction length as `prefix_lens` and the response length as `causal_lens`.
`flash_attention_prefixlm_v2.py` handles attention in two parts.
1. prefix segment: bidirectional attention among instruction tokens
2. causal segment: causal attention in which response tokens see the entire prefix and earlier response tokens
This structure is the core of what matches the paper's PrefixLM.
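The two-part attention pattern can be illustrated with a dense boolean mask. This is a sketch only: the repo uses a FlashAttention-based kernel rather than an explicit mask, and `prefixlm_mask` is a hypothetical helper whose arguments mirror a single sample's `prefix_lens` and `causal_lens` entries.

```python
import numpy as np


def prefixlm_mask(prefix_len: int, causal_len: int) -> np.ndarray:
    """Return mask[i, j] == True when query position i may attend to key j."""
    n = prefix_len + causal_len
    # Start from a causal (lower-triangular) mask over the whole sequence.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Instruction tokens additionally see each other in both directions.
    mask[:prefix_len, :prefix_len] = True
    return mask


if __name__ == "__main__":
    # 3 instruction tokens followed by 2 response tokens.
    print(prefixlm_mask(3, 2).astype(int))
```

Instruction rows never attend to response columns, while each response row sees the whole prefix plus the response tokens up to and including itself.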
## Model Architecture
The current canonical model name is `KoHRM-Text-1.4B`, and it uses the configuration corresponding to `arch/size@arch=XL`.
| Item | Value |
|---|---:|
| params | 1,384,120,320 |
| hidden size | 1,536 |
| total configured layers | 32 |
| half layers | true |
| H module layers | 16 |
| L module layers | 16 |
| heads | 12 |
| head dim | 128 |
| expansion | 4 |
| intermediate size | 4,096 |
| context | 4,096 |
| RoPE theta | 10,000 |
| norm | RMSNorm-style parameterless norm |
| init | LeCun normal |
| dtype | bfloat16 |
| tokenizer vocab | 131,072 |
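The numbers in the table can be cross-checked with a little arithmetic. The sketch below assumes the feed-forward block is a gated (SwiGLU-style) MLP sized by the common 2/3 rule and rounded up to a multiple of 256. The text does not state this, but it would explain why expansion 4 yields an intermediate size of 4,096 rather than 6,144.

```python
hidden_size = 1536
heads, head_dim = 12, 128
expansion = 4
vocab = 131_072


def swiglu_intermediate(hidden: int, expansion: int, multiple: int = 256) -> int:
    """Assumed sizing rule: 2/3 of expansion * hidden, rounded up to a multiple."""
    raw = round(expansion * hidden * 2 / 3)
    return -(-raw // multiple) * multiple  # ceiling to the next multiple


# Attention width: heads x head dim should equal the hidden size.
attention_width = heads * head_dim
intermediate = swiglu_intermediate(hidden_size, expansion)
# Size of one vocab x hidden matrix (token embedding or LM head).
embedding_params = vocab * hidden_size

if __name__ == "__main__":
    print(attention_width, intermediate, embedding_params)
```

One vocab x hidden matrix alone is about 201M parameters, which shows how much of the 1.38B total the 131,072-entry vocabulary accounts for.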
HRM recurrent schedule:
| Item | Value |
|---|---:|
| H cycles | 2 |
| L cycles per H cycle | 3 |
| effective H/L recurrence | H2L3 |
| bp min steps | 2 |
| bp max steps | 5 |
| bp warmup ratio | 0.2 |
The flow in the code is as follows.
1. The hidden state that starts from the token embedding is taken as `z_H`.
2. `z_L` starts from the learned/fixed low-level initial state `z_L_init`.
3. In each H cycle, the L module is updated 3 times.
4. The H module is then updated once.
5. A total of 2 H cycles are performed.
6. The LM head is applied to the final `z_H` to produce vocabulary logits.
In the paper's terms, this is a dual-timescale recurrent design with the H module as a slow-evolving strategic layer and the L module as a fast-evolving execution layer.
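The H2L3 schedule above can be sketched as a nested loop. The toy modules below are stand-ins for the real Transformer stacks, and how each module combines `z_H` and `z_L` (simple addition here) is an assumption made only for illustration.

```python
import numpy as np

H_CYCLES, L_CYCLES = 2, 3  # H2L3, as in the schedule table


def hrm_forward(embedding, z_L_init, l_module, h_module):
    """Run the dual-timescale recurrence and return the final z_H."""
    z_H = embedding
    z_L = z_L_init
    for _ in range(H_CYCLES):
        for _ in range(L_CYCLES):
            z_L = l_module(z_L, z_H)  # fast, low-level updates
        z_H = h_module(z_H, z_L)      # one slow, high-level update
    return z_H


calls = {"L": 0, "H": 0}


def toy_l(z_L, z_H):
    calls["L"] += 1
    return np.tanh(z_L + z_H)  # placeholder for the L module


def toy_h(z_H, z_L):
    calls["H"] += 1
    return np.tanh(z_H + z_L)  # placeholder for the H module


out = hrm_forward(np.ones((4, 8)), np.zeros((4, 8)), toy_l, toy_h)

if __name__ == "__main__":
    print(calls, out.shape)
```

One forward pass therefore performs 6 L-module updates and 2 H-module updates before the LM head is applied.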
## MagicNorm / Stabilization
Because activation variance and gradient instability grow as recurrent depth increases, the paper uses MagicNorm and warmup deep credit assignment.
The current code uses `norm_type: pre` but applies a final RMSNorm at the end of each Transformer module. That is, the inner blocks are PreNorm style and the module output is capped by a norm. This corresponds to the MagicNorm-style stabilization described in the paper.
The backward pass does not go through every recurrent step from the start; instead, `bp_steps` is warmed up.
Current settings:
```yaml
bp_warmup_ratio: 0.2
bp_min_steps: 2
bp_max_steps: 5
```
Early in training, gradients flow mainly through the last 2 recurrent steps, and as training progresses this grows to at most 5 steps. This also matches the paper's approach.
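One way to picture the `bp_steps` warmup is as a ramp over the first 20% of training. The linear shape and the function name `bp_steps_at` are assumptions; the text gives only the three config values, not the exact schedule.

```python
def bp_steps_at(step, total_steps, warmup_ratio=0.2, min_steps=2, max_steps=5):
    """Hypothetical schedule: number of recurrent steps that receive gradients."""
    warmup_steps = round(total_steps * warmup_ratio)
    if warmup_steps <= 0 or step >= warmup_steps:
        return max_steps
    frac = step / warmup_steps
    # Assumed linear ramp from min_steps to max_steps, rounded down.
    return min_steps + int(frac * (max_steps - min_steps))


if __name__ == "__main__":
    for s in (0, 10_000, 20_000, 99_999):
        print(s, bp_steps_at(s, 100_000))
```

The hypothetical 100,000-step run above starts at 2 backpropagated steps and reaches the maximum of 5 once the warmup window ends.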
## Optimizer / Schedule
Current pretraining config:
| Item | Value |
|---|---:|
| optimizer | Adam-atan2 |
| beta1 | 0.9 |
| beta2 | 0.95 |
| weight decay | 0.1 |
| lr | 2.2e-4 |
| lr warmup | 2,000 steps |
| lr min ratio | 1.0 |
| EMA | 0.9999 |
| gradient clipping | none |
The paper also uses Adam-atan2, 2,000 warmup steps, weight decay 0.1, EMA 0.9999, and bf16. Release and evaluation are based on the EMA checkpoint.
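The schedule values in the table can be sketched as follows. The linear warmup and cosine decay shape are assumptions, as are the function names. With `lr min ratio` set to 1.0 the decay term vanishes, so under this reading the learning rate stays constant after warmup whatever the decay shape is.

```python
import math


def lr_at(step, total_steps, base_lr=2.2e-4, warmup_steps=2000, min_ratio=1.0):
    """Hypothetical LR schedule: linear warmup, then cosine decay to base_lr * min_ratio."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return base_lr * (min_ratio + (1.0 - min_ratio) * cosine)


def ema_update(ema, weights, decay=0.9999):
    """Exponential moving average of weights, applied once per optimizer step."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


if __name__ == "__main__":
    print(lr_at(0, 100_000), lr_at(2_000, 100_000), lr_at(99_999, 100_000))
    print(ema_update([0.0], [1.0]))
```

The EMA helper shows why the EMA checkpoint changes slowly: each step moves the averaged weights only 0.01% of the way toward the live weights.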
## Why the Current Staged Run Differs from the Paper
The paper describes a "single continuous run." We operate a staged run for the following practical reasons.
1. Korean/terminal/tokenizer preprocessing is finishing sequentially.
2. To avoid idle GPUs, training started with the data that was ready.
3. On 8 x H200, the 131K vocab meant the safe batch margin against OOM had to be measured empirically.
4. HF uploads and raw checkpoint preservation are required.
5. The full HRM 328G no-cap retokenization is still in progress.
However, the objective function does not change from stage to stage.
| Stage | Objective | Description |
|---|---|---|
| stage-0 | PrefixLM response-only | 711.3M mix that was ready |
| stage0b | PrefixLM response-only | additional pass over the same mix |
| stage-1 | PrefixLM response-only | HRM fast-cap 14.55B |
| later stage | PrefixLM response-only | full HRM 328G retokenized + Korean/terminal/tool-call mix |
| final SFT | PrefixLM response-only or SFT-specific response-only | post-processing on a high-quality subset |
The important point is that, when moving from stage-0 to stage-1, the model/optimizer/EMA/carry are inherited, and the code was modified so that `resume_step_offset` and `total_steps_override` continue the global step and LR schedule.
In short, the "training methodology" is single-stage instruction pretraining and the "operational mode" is staged continuation.
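How the two resume settings might keep the schedule continuous can be sketched as below. The option names come from the text, but the semantics shown here (offset added to the local step, override used as the schedule length) are an assumption, and the step counts are made-up illustration values.

```python
def global_step(local_step, resume_step_offset=0):
    """Assumed semantics: steps from earlier stages are added to the local counter."""
    return resume_step_offset + local_step


def schedule_progress(local_step, stage_steps, resume_step_offset=0,
                      total_steps_override=None):
    """Fraction of the schedule consumed, as any step-based schedule would see it."""
    total = total_steps_override if total_steps_override is not None else stage_steps
    return global_step(local_step, resume_step_offset) / total


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 steps done in earlier stages, 100,000 planned overall.
    print(schedule_progress(0, 50_000))
    print(schedule_progress(0, 50_000, resume_step_offset=10_000,
                            total_steps_override=100_000))
```

Without the two settings, a resumed stage would look like step 0 of a fresh run and would repeat the LR warmup; with them it picks up where the previous stage stopped.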
## Whether the Current Data Fits the Single-Stage Principle
All currently prepared datasets have been converted to instruction-response form wherever possible.
| Data | Fit with the single-stage principle |
|---|---|
| HRM cleaned data | fits; already in HRM instruction/response/condition structure |
| ToolBench | fits; tool instruction -> tool-call/answer response structure |
| SWE-ZERO/local terminal | fits; terminal context -> next action/answer structure |
| GLM/Claude reasoning | fits if cleaned up around the final answer |
| Korean legal/wiki source text | fits; converted into chunked instruction/response tasks before use |
Caveats:
- If raw Korean wiki/legal chunks are fed in as plain next-text prediction, training drifts away from the paper's task-completion style.
- It is therefore more appropriate to use title/context as the instruction and chunk/summary/extraction as the response.
- The local terminal dataset fits the objective well, but if its overall share grows too large, the balance with general knowledge and Korean may break down.
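The second caveat can be illustrated with a small conversion helper. The record field names, the prompt wording, and the choice of the `direct` condition are all hypothetical; the actual preprocessing scripts may use a different schema.

```python
def wiki_chunk_to_record(title, context, chunk, condition="direct"):
    """Hypothetical converter: turn a raw wiki/legal chunk into a task-style sample.

    Title and preceding context form the instruction; the chunk is the response.
    """
    instruction = (
        f"Title: {title}\n"
        f"Context: {context}\n"
        "Continue the passage."
    )
    return {"condition": condition, "instruction": instruction, "response": chunk}


if __name__ == "__main__":
    record = wiki_chunk_to_record(
        title="Example article",
        context="First paragraph of the article.",
        chunk="Second paragraph of the article.",
    )
    print(record)
```

Because only the response receives loss, the title and context act purely as conditioning, which keeps such samples within the task-completion format.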
## Assessment of the Current Approach
The current approach matches the core of the paper well.
What matches:
- scratch training
- HRM H2L3 recurrent architecture
- PrefixLM attention
- response-only loss
- use of condition tokens
- Adam-atan2
- bf16
- EMA 0.9999
- 4,096 context
- LeCun normal init
What differs:
- The vocab is 131,072, not 65,536.
- The hardware is 8 x H200, not 16 x H100.
- The paper uses a single continuous run; we use a staged resume run.
- The paper reports 40B unique tokens, whereas the current public checkpoint is an intermediate artifact of the stage-1 fast-cap run.
- The share of Korean/terminal/tool-call data is much larger than in the paper.
These differences are intentional. In particular, since the targets are Korean, terminal, and tool-call use, changing the tokenizer and data distribution is appropriate. To reproduce the paper's efficiency, however, training must ultimately continue to the 40B+ token level on a combination of the full HRM cleaned data and a balanced Korean/terminal/tool mix.
## Operational Conclusion
For now, it is appropriate to continue on the following basis.
1. Keep the current stage-1 running.
2. Upload preprocessed data to an HF dataset repo so that it is reproducible on other machines.
3. When the full HRM 328G no-cap retokenization finishes, continue training as the next stage.
4. Include SFT candidate data in pretraining first.
5. Run a separate final SFT at the end on a high-quality subset.
6. Keep the model repo centered on the latest safetensors, with a separate raw checkpoint repo for resume.