
KoHRM-Text Methodology and Architecture Notes

Date written: 2026-05-24

This document summarizes where KoHRM-Text-1.4B matches the HRM-Text paper's method and where it differs operationally.

Reference documents:

Conclusion

Methodologically, our current training follows the HRM-Text paper's single-stage instruction pretraining.

Operationally, however, it is not identical to the paper. The paper states that it trained on 40B unique tokens in a single continuous run, without intermediate checkpointing or crash recovery. We split execution into stage-0, stage0b, and stage-1 because of data preparation, HF uploads, OOM avoidance, and checkpoint preservation.

The key differences are as follows.

| Item | HRM-Text paper | KoHRM-Text current approach |
| --- | --- | --- |
| Training objective | instruction-response task-completion objective | Same |
| loss | response-only NLL | Same |
| attention | PrefixLM, instruction bidirectional + response causal | Uses the same code path |
| SFT after raw LM pretraining | Not done | Not done |
| SFT candidate data | Included in instruction pretraining | Included |
| Execution form | single continuous run | staged resume run |
| checkpoint | No intermediate checkpointing | Saved every 5,000 steps for operational reasons |
| tokenizer | 65,536 BPE | 131,072 Korean/terminal BPE |
| hardware | 16 x H100 | 8 x H200 |

The accurate answer to "Is this single-stage instruction pretraining, as in the paper?" is therefore the following.

The training objective and data format are single-stage instruction pretraining.
Execution, however, is not a single continuous run in one process. It is staged pretraining that keeps the same objective and continues through checkpoint resumes.

Summary of the paper's methodology

The HRM-Text paper regards the conventional multi-stage LLM recipe (large-scale raw-text causal LM pretraining followed by mid-training/SFT) as inefficient, and trains on instruction-response pairs only, from the start.

Key points of the paper:

  1. It does not predict every token of raw text.
  2. Instruction tokens receive no loss.
  3. Only response tokens receive the negative log-likelihood loss.
  4. The instruction segment is allowed bidirectional attention through a PrefixLM mask.
  5. The response segment keeps autoregressive causal attention.
  6. Condition tags such as direct, cot, synth, and noisy are prepended to the instruction to control the response style.
  7. Explicit long-CoT traces such as <think>...</think> are removed so that the internal recurrent computation handles the problem solving.

The paper states that it trained a 1B HRM-Text model from scratch on about 40B unique tokens, using about 46 hours on 16 x H100. Its published evaluation uses the EMA checkpoint.

Current KoHRM-Text training approach

The current KoHRM-Text-1.4B is likewise trained in the HRM-Text V1Dataset format rather than as a raw causal LM.

Each sample has the following basic structure.

instruction span -> response span

At the token level, it is laid out as follows.

<|im_start|> condition_tokens instruction <|im_end|> response <|box_end|>

The condition tokens use the following mapping.

| condition | token |
| --- | --- |
| direct | `< |
| cot | `< |
| noisy | `< |
| synth | `< |

In dataset_new.py, the instruction span is included in inputs, but when target_only=True its labels become IGNORE_LABEL_ID. Only the response span receives the actual cross-entropy loss.

In other words, the current training is not "a raw LM that has to predict a document from beginning to end" but "task-completion pretraining that completes a response given an instruction and context".
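The response-only labeling described above can be sketched in a few lines. This is a minimal illustration, not the code in dataset_new.py: `build_example` is a hypothetical helper, the value `-100` for `IGNORE_LABEL_ID` is an assumption, and the next-token label shift is left out for brevity.

```python
IGNORE_LABEL_ID = -100  # assumed value; the real constant lives in the training code


def build_example(instruction_ids, response_ids, target_only=True):
    """Return (inputs, labels, prefix_len, causal_len) for one sample."""
    inputs = list(instruction_ids) + list(response_ids)
    if target_only:
        # Instruction positions are visible to the model but carry no loss.
        labels = [IGNORE_LABEL_ID] * len(instruction_ids) + list(response_ids)
    else:
        # Raw-LM style: every position is a training target.
        labels = list(inputs)
    return inputs, labels, len(instruction_ids), len(response_ids)


inputs, labels, prefix_len, causal_len = build_example([11, 12, 13], [21, 22])
print(labels)  # [-100, -100, -100, 21, 22]
```

With `target_only=False` the same helper would produce labels for every position, which is the raw-LM behavior this project avoids.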

Verifying the PrefixLM implementation

In the current code, the PrefixLM path can be confirmed in the following files.

| File | Role |
| --- | --- |
| dataset_new.py | Splits instruction/response spans and builds response-only labels |
| models/flash_attention_prefixlm_v2.py | Implements prefix bidirectional attention + response causal attention |
| models/layers.py | Uses attention type prefixlm |
| models/lm_head.py | Computes CE loss on response labels only, excluding IGNORE_LABEL_ID |

For each sample, dataset_new.py passes the instruction length as prefix_lens and the response length as causal_lens.

flash_attention_prefixlm_v2.py handles attention in two parts.

  1. prefix region: bidirectional attention among the instruction tokens
  2. causal region: causal attention in which response tokens see the entire prefix and the preceding response tokens

This structure is the core of what makes the implementation match the paper's PrefixLM.
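To make the two regions concrete, the following sketch builds the equivalent dense boolean mask from one sample's prefix and causal lengths. It is illustrative only: the real kernel in flash_attention_prefixlm_v2.py presumably never materializes such a matrix, and `prefixlm_mask` is a hypothetical name.

```python
def prefixlm_mask(prefix_len, causal_len):
    """mask[q][k] is True when query position q may attend to key position k."""
    total = prefix_len + causal_len
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < prefix_len:
                # Prefix region: instruction tokens see each other in both directions.
                mask[q][k] = k < prefix_len
            else:
                # Causal region: response tokens see the whole prefix
                # and response tokens up to and including themselves.
                mask[q][k] = k <= q
    return mask


for row in prefixlm_mask(prefix_len=2, causal_len=2):
    print("".join("x" if allowed else "." for allowed in row))
# xx..
# xx..
# xxx.
# xxxx
```

The first two rows show that instruction tokens never attend to the response, which is what keeps the response segment autoregressive.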

Model architecture

The current standard model name is KoHRM-Text-1.4B, and it uses the configuration corresponding to arch/size@arch=XL.

| Item | Value |
| --- | --- |
| params | 1,384,120,320 |
| hidden size | 1,536 |
| total configured layers | 32 |
| half layers | true |
| H module layers | 16 |
| L module layers | 16 |
| heads | 12 |
| head dim | 128 |
| expansion | 4 |
| intermediate size | 4,096 |
| context | 4,096 |
| RoPE theta | 10,000 |
| norm | RMSNorm-style parameterless norm |
| init | LeCun normal |
| dtype | bfloat16 |
| tokenizer vocab | 131,072 |
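The table values can be cross-checked with simple arithmetic. The sketch below is a consistency check under explicit assumptions, not a description of the actual layer internals: it assumes a SwiGLU-style MLP (three matrices, intermediate = expansion x hidden x 2/3), untied input embedding and LM head, and five hidden x hidden attention matrices per layer. The fifth matrix is a guess chosen because it makes the total land on the table's figure; the source does not confirm it.

```python
hidden, heads, head_dim = 1536, 12, 128
layers, expansion, vocab = 32, 4, 131_072

# Attention width must equal the hidden size.
assert heads * head_dim == hidden

# Assumed SwiGLU convention: intermediate = expansion * hidden * 2 / 3.
intermediate = expansion * hidden * 2 // 3

# Assumed per-layer weights: five hidden x hidden attention matrices
# (for example q, k, v, o and one gate) plus three SwiGLU matrices.
per_layer = 5 * hidden * hidden + 3 * hidden * intermediate

# Assumed untied input embedding and LM head; norms are parameterless.
total = layers * per_layer + 2 * vocab * hidden

print(intermediate)  # 4096
print(f"{total:,}")  # 1,384,120,320
```

Under these assumptions the two embedding-sized matrices account for about 403M of the parameters, roughly twice what a 65,536 vocab would cost.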

HRM recurrent schedule:

| Item | Value |
| --- | --- |
| H cycles | 2 |
| L cycles per H cycle | 3 |
| effective H/L recurrence | H2L3 |
| bp min steps | 2 |
| bp max steps | 5 |
| bp warmup ratio | 0.2 |

The flow in the code is as follows.

  1. The hidden state that starts from the token embedding is used as z_H.
  2. z_L starts from the learned/fixed low-level initial state z_L_init.
  3. In each H cycle, the L module is updated 3 times in a row.
  4. The H module is then updated once.
  5. A total of 2 H cycles are performed.
  6. The LM head is applied to the final z_H to produce vocabulary logits.
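The six steps above can be sketched as a nested loop. This is a schematic under assumptions: `l_module`, `h_module`, and `lm_head` are hypothetical callables, and how each module actually consumes the other state (addition, concatenation, or something else) is not specified in these notes.

```python
H_CYCLES = 2
L_CYCLES = 3


def hrm_forward(embeddings, z_l_init, l_module, h_module, lm_head):
    """Run the H2L3 schedule; the three callables are hypothetical stand-ins."""
    z_h = embeddings                   # step 1: z_H starts from the token embeddings
    z_l = z_l_init                     # step 2: z_L starts from its initial state
    for _ in range(H_CYCLES):          # step 5: two H cycles in total
        for _ in range(L_CYCLES):      # step 3: three L updates per H cycle
            z_l = l_module(z_l, z_h)
        z_h = h_module(z_h, z_l)       # step 4: one H update
    return lm_head(z_h)                # step 6: logits from the final z_H


# Toy check with scalars instead of tensors: count how often each module runs.
calls = {"L": 0, "H": 0}


def toy_l(z_l, z_h):
    calls["L"] += 1
    return z_l + z_h


def toy_h(z_h, z_l):
    calls["H"] += 1
    return z_h + z_l


logits = hrm_forward(1.0, 0.0, toy_l, toy_h, lambda z: z)
print(calls)  # {'L': 6, 'H': 2}
```

One forward pass therefore applies the L module six times and the H module twice, for eight module applications per token position.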

In the paper's terms, this is a dual-timescale recurrent design consisting of the H module, a slow-evolving strategic layer, and the L module, a fast-evolving execution layer.

MagicNorm / Stabilization

The paper uses MagicNorm and warmup deep credit assignment because activation variance and gradient instability grow as recurrent depth increases.

The current code uses norm_type: pre but applies a final RMSNorm at the end of each Transformer module. In other words, the inner blocks are PreNorm-style and the module output is capped by a norm. This corresponds to the MagicNorm-style stabilization described in the paper.

The backward pass does not go through every recurrent step from the start. Instead, bp_steps is warmed up.

Current settings:

bp_warmup_ratio: 0.2
bp_min_steps: 2
bp_max_steps: 5

Early in training, gradients flow mainly through the last 2 recurrent steps, and this grows to at most 5 steps as training progresses. This also matches the paper's approach.
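As a rough illustration of this warmup, the sketch below ramps the backpropagated step count from bp_min_steps to bp_max_steps over the first bp_warmup_ratio of training. The linear ramp, the rounding, and the `total_steps` value are assumptions; the notes only give the three settings, not the shape of the schedule.

```python
def bp_steps_at(step, total_steps, bp_min_steps=2, bp_max_steps=5, bp_warmup_ratio=0.2):
    """Number of trailing recurrent steps that receive gradient at a training step."""
    warmup_steps = int(total_steps * bp_warmup_ratio)
    if warmup_steps <= 0 or step >= warmup_steps:
        return bp_max_steps
    # Assumed linear ramp from bp_min_steps up to bp_max_steps.
    progress = step / warmup_steps
    return bp_min_steps + int(progress * (bp_max_steps - bp_min_steps))


# total_steps=100_000 is a hypothetical run length used only for this demo.
for step in (0, 10_000, 20_000, 50_000):
    print(step, bp_steps_at(step, total_steps=100_000))
```

In this sketch the first 20% of steps backpropagate through fewer recurrent steps, which keeps early gradients shallow while the weights are still far from a stable regime.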

Optimizer / Schedule

Current pretraining config:

| Item | Value |
| --- | --- |
| optimizer | Adam-atan2 |
| beta1 | 0.9 |
| beta2 | 0.95 |
| weight decay | 0.1 |
| lr | 2.2e-4 |
| lr warmup | 2,000 steps |
| lr min ratio | 1.0 |
| EMA | 0.9999 |
| gradient clipping | none |

The paper also uses Adam-atan2, 2,000 warmup steps, weight decay 0.1, EMA 0.9999, and bf16. Release and evaluation are based on the EMA checkpoint.
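The schedule values in the table can be read as follows. This sketch assumes linear warmup and a cosine decay toward lr x min ratio; the decay shape is an assumption, but with a min ratio of 1.0 any decay shape collapses to a constant learning rate after warmup. `lr_at`, `ema_update`, and `total_steps` are hypothetical names and values.

```python
import math


def lr_at(step, base_lr=2.2e-4, warmup_steps=2_000, total_steps=100_000, min_ratio=1.0):
    """Assumed linear warmup followed by cosine decay down to base_lr * min_ratio."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return base_lr * (min_ratio + (1.0 - min_ratio) * cosine)


def ema_update(ema_value, value, decay=0.9999):
    """One exponential-moving-average step for a single scalar weight."""
    return decay * ema_value + (1.0 - decay) * value


print(lr_at(0), lr_at(2_000), lr_at(99_999))
```

Because the learning rate never decays in this configuration, the EMA weights do the smoothing that a decay phase would otherwise provide, which fits the choice to publish and evaluate the EMA checkpoint.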

Why the current staged run differs from the paper

The paper describes its training as a "single continuous run". We operate a staged run for the following practical reasons.

  1. Korean/terminal/tokenizer preprocessing is finishing sequentially.
  2. To avoid leaving GPUs idle, we started training on the data that was ready.
  3. On 8 x H200 with a 131K vocab, we had to measure the safe batch-size margin against OOM empirically.
  4. We need HF uploads and preservation of raw checkpoints.
  5. The full HRM 328G no-cap retokenization is still in progress.

However, the objective function does not change from stage to stage.

| Stage | Objective | Nature |
| --- | --- | --- |
| stage-0 | PrefixLM response-only | Already prepared 711.3M mix |
| stage0b | PrefixLM response-only | Additional pass over the same mix |
| stage-1 | PrefixLM response-only | HRM fast-cap 14.55B |
| later stage | PrefixLM response-only | full HRM 328G retokenized + Korean/terminal/tool-call mix |
| final SFT | PrefixLM response-only or SFT-style response-only | Post-processing on a high-quality subset |

The important point is that when moving from stage-0 to stage-1 we carry over the model, optimizer, EMA, and carry state, and we modified the code so that resume_step_offset and total_steps_override continue the global step and LR schedule.
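One way to picture how these two settings keep a staged run on a single schedule is sketched below. Only the names resume_step_offset and total_steps_override come from the notes; their exact semantics in the training code, the helper functions, and all step counts here are assumptions.

```python
def global_step(local_step, resume_step_offset=0):
    """Map a stage-local step to the step used for logging and the LR schedule."""
    return resume_step_offset + local_step


def schedule_length(stage_steps, total_steps_override=None):
    """Total steps the schedule assumes; the override keeps it fixed across stages."""
    return total_steps_override if total_steps_override is not None else stage_steps


# Hypothetical numbers: stage-0 ended at step 10,000 and stage-1 runs 40,000 more.
offset, stage1_steps, planned_total = 10_000, 40_000, 50_000
print(global_step(0, offset))                        # 10000
print(global_step(stage1_steps, offset))             # 50000
print(schedule_length(stage1_steps, planned_total))  # 50000
```

Without the offset, stage-1 would restart at step 0 and repeat the LR and bp_steps warmups, which would make the stages behave like separate runs rather than one continued run.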

In short, the "training methodology" is single-stage instruction pretraining, and the "mode of operation" is staged continuation.

Whether the current data fits the single-stage principle

All currently prepared datasets have been converted to instruction-response form as far as possible.

| Data | Fit with the single-stage principle |
| --- | --- |
| HRM cleaned data | Fits: already in the HRM instruction/response/condition structure |
| ToolBench | Fits: tool instruction -> tool-call/answer response structure |
| SWE-ZERO/local terminal | Fits: terminal context -> next action/answer structure |
| GLM/Claude reasoning | Fits if cleaned up to center on the final answer |
| Korean legal/wiki | Fits: source text is converted into chunked instruction/response tasks before use |

Caveats:

  • If raw Korean wiki/legal chunks are fed in as plain "predict the next text" samples, training drifts away from the paper's task-completion setup.
  • It is therefore a better fit to use the title/context as the instruction and the chunk/summary/extraction as the response.
  • The local terminal dataset fits the objective well, but if its overall share grows too large, the balance of general knowledge and Korean may break down.
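The second caveat can be illustrated with a small conversion helper. This is a sketch only: `wiki_chunk_to_sample`, the prompt wording, and the dictionary field names are hypothetical and do not reflect the project's actual preprocessing code or prompt templates.

```python
def wiki_chunk_to_sample(title, context, chunk, condition="direct"):
    """Wrap a raw chunk as an instruction/response pair instead of plain next-text prediction."""
    instruction = f"Title: {title}\nContext: {context}\nContinue the passage."
    return {"condition": condition, "instruction": instruction, "response": chunk}


sample = wiki_chunk_to_sample(
    title="Example article",
    context="First paragraph of the article.",
    chunk="Second paragraph of the article.",
)
print(sample["instruction"])
print(sample["response"])
```

Only the `response` field would receive loss under the response-only objective, so the title and context act purely as conditioning.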

Assessment of the current approach

The current approach matches the core of the paper well.

What matches:

  • scratch training
  • HRM H2L3 recurrent architecture
  • PrefixLM attention
  • response-only loss
  • use of condition tokens
  • Adam-atan2
  • bf16
  • EMA 0.9999
  • 4,096 context
  • LeCun normal init

What differs:

  • The vocab is 131,072, not 65,536.
  • The hardware is 8 x H200, not 16 x H100.
  • The paper used a single continuous run, while we use a staged resume run.
  • The paper reports 40B unique tokens, while the current public checkpoint is an intermediate artifact of the stage-1 fast-cap run.
  • The share of Korean/terminal/tool-call data is much larger than in the paper.

The differences above are intentional. In particular, because we target Korean, terminal, and tool-call use, changing the tokenizer and data distribution is the right call. To reproduce the paper's efficiency, however, we ultimately need to combine the full HRM cleaned data with a balanced Korean/terminal/tool mix and continue training to the 40B+ token level.

Operational conclusion

For now, it is right to continue on the following basis.

  1. Keep the current stage-1 running.
  2. Upload preprocessed data to an HF dataset repo so that it can be reproduced on other machines.
  3. Once the full HRM 328G no-cap retokenization finishes, continue training in the next stage.
  4. Include SFT candidate data in pretraining first.
  5. Run a separate final SFT at the end on a high-quality subset.
  6. Keep the model repo focused on the latest safetensors, and keep a separate raw checkpoint repo for resuming.