gyung committed on
Commit 0756b71 · verified · 1 Parent(s): 9e78b9e

Update model card and VRAM notes

Files changed (2)
  1. README.md +190 -50
  2. VRAM_OOM_NOTES_2026-05-24.md +141 -0
README.md CHANGED
@@ -15,76 +15,216 @@ pipeline_tag: text-generation

  # KoHRM-Text-1.4B

- `KoHRM-Text-1.4B` is a model pretrained from scratch on the PrefixLM training structure of `sapientinc/HRM-Text`, targeting Korean, English, coding, terminal, and tool-call usability.

- This card is a work-in-progress model card draft as of 2026-05-23. The current main artifact is the stage0b checkpoint converted to the safe `model.safetensors` format. The raw HRM-Text FSDP2 checkpoints are meant for optimizer/EMA resume, so they are removed from the main repo and split out into a separate raw checkpoint repo.

- ## Model Info

- | Item | Value |
  |---|---|
- | model id | `LLM-OS-Models/KoHRM-Text-1.4B` |
- | base code | `sapientinc/HRM-Text` |
- | training from | scratch |
- | architecture | HRM-Text `XL` |
- | params | 1,384,120,320 |
- | context | 4096 tokens |
- | dtype | bfloat16 |
- | tokenizer | byte-level BPE, NFC normalization |
- | vocab | 131,072 |

- ## Tokenizer

- The new tokenizer was trained with Korean, English, code, shell, terminal instructions, and JSON tool calls all taken into account.

- | Sample | chars/token |
  |---|---:|
- | Korean general | 2.60 |
- | Korean legal | 2.36 |
- | Korean terminal instruction | 2.18 |
  | shell command | 2.68 |
- | tool JSON | 3.32 |
  | Python code | 3.37 |
- | English | 4.40 |

- Tokenizer repo: `LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K`

- ## Training Data

- The stage-0/stage0b input is a preprocessed 711.3M-token mix.

- | Data | Tokens |
- |---|---:|
- | HRM cleaned base sample | 250.0M |
- | SWE-ZERO + GLM reasoning mix | 251.2M |
- | Korean law/ordinance/administrative-rule/precedent task | 83.1M |
- | ToolBench train tool-call task | 127.0M |
- | Total | 711.3M |

- Stage-1 is currently training on the HRM cleaned fast-cap V1Dataset (14.55B tokens). Later stages will successively add the local terminal dataset and additional Korean/coding/tool-call data. Evaluation-like data (`tb2_lite`, Terminal Bench 2, ToolBench eval, chi-bench) is excluded from training.

- ## Training Method

- - Objective: PrefixLM style response-only loss
- - Optimizer: HRM-Text upstream Adam-atan2
- - Context: 4096 tokens
- - Hardware: 8 x NVIDIA H200
- - Current stage-1 global batch: 229,376 tokens
- - Checkpoint policy: `safetensors` in the main repo, raw FSDP2 in a separate raw checkpoint repo

- Stage-1 was first attempted with `global_batch_size=262144`, but a later compile graph needed an additional allocation for a `32768 x 131072` bf16 logits buffer, which caused an OOM. The run has been restarted with `global_batch_size=229376` and is in progress; observed VRAM is about 105GB on GPU0 and about 103GB on the other GPUs. The steady speed is about `1.02-1.03 step/sec`.

- In staged pretraining, the checkpoint's model/optimizer/EMA/carry state is carried over, and `resume_step_offset` and `total_steps_override` align the LR schedule with the full pretraining run. In other words, training is restarted whenever new data is ready, without breaking the optimizer state or the schedule.

- ## Current Status

- - stage-0/stage0b training: complete
- - stage0b safetensors HF upload: complete
- - unsafe raw DCP files removed from main HF repo
- - stage-1 HRM fast-cap training: in progress
- - final Transformers conversion: not yet produced
- - public benchmark score: not yet evaluated for this model

- ## Limitations

- The current checkpoint artifact is an intermediate training output. It is not a model that has finished safety alignment, final instruction tuning, final benchmarking, or conversion for release. Korean terminal/tool-call ability is a target area, but stage-0 alone does not guarantee finished performance.


  # KoHRM-Text-1.4B

+ `KoHRM-Text-1.4B` is a scratch-pretrained Korean/English/code/terminal/tool-use model based on the `sapientinc/HRM-Text` PrefixLM training stack.

+ This is not a continued finetune of `sapientinc/HRM-Text-1B`. It uses a new Korean/terminal-oriented 131K byte-level BPE tokenizer and a new scratch training run.

+ ## Links

+ | Item | Link |
  |---|---|
+ | HF model | https://huggingface.co/LLM-OS-Models/KoHRM-Text-1.4B |
+ | Project code | https://github.com/LLM-OS-Models/KoHRM-text |
+ | Upstream HRM-Text code | https://github.com/sapientinc/HRM-Text |
+ | HRM-Text paper | https://arxiv.org/html/2605.20613 |
+ | Tokenizer | https://huggingface.co/LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K |
+ | Raw resume checkpoints | https://huggingface.co/LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints |

+ ## Release Policy

+ The main model repository is intended to expose the latest model-only artifact:

+ - `model.safetensors`
+ - `config.json`
+ - `tokenizer.json`
+ - `tokenizer_config.json`
+ - `README.md`
+
+ It is not intended to keep every training checkpoint as visible model files. Intermediate FSDP2 `.distcp` checkpoints are large resume artifacts and are kept separately in `LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints` when needed. The main repo may still have normal Hugging Face git history, but the current file tree should be treated as the latest public model export.
+
+ Current public artifact: `stage1` HRM fast-cap checkpoint at `step_25000`, converted with EMA weights to `safetensors`. Training is still in progress.
+
+ ## Model Details
+
+ | Field | Value |
+ |---|---|
+ | Model id | `LLM-OS-Models/KoHRM-Text-1.4B` |
+ | Standard name | `KoHRM-Text-1.4B` |
+ | Training origin | scratch |
+ | Architecture family | HRM-Text PrefixLM |
+ | Architecture size | `XL` |
+ | Parameters | 1,384,120,320 |
+ | Context length | 4,096 tokens |
+ | Training dtype | bfloat16 |
+ | Tokenizer | byte-level BPE, NFC normalization |
+ | Vocabulary size | 131,072 |
+ | Objective | PrefixLM response-only loss |
+ | Optimizer | Adam-atan2 from upstream HRM-Text |
+ | EMA | 0.9999 |
+
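The `EMA | 0.9999` row can be read as an exponential moving average of the weights with decay 0.9999. The sketch below assumes the textbook update `ema = decay * ema + (1 - decay) * param`; the model card does not show how upstream HRM-Text implements EMA, and `ema_update` is a hypothetical helper working on plain floats rather than tensors.

```python
def ema_update(ema, params, decay=0.9999):
    """Return new EMA values after one optimizer step (textbook form, assumed)."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]


# Toy values: the EMA moves slowly toward the current parameters.
ema = [0.0, 1.0]
params = [1.0, 1.0]
for _ in range(3):
    ema = ema_update(ema, params)
print(ema)
```

With this form, a decay of 0.9999 corresponds to an averaging horizon on the order of `1 / (1 - decay)`, about 10,000 steps, which is one plausible reason the public export uses EMA weights rather than the raw ones.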
+ The model config uses `model_type: hrm_text` and `architectures: ["HrmTextForCausalLM"]`. At the time of this checkpoint, `HrmTextForCausalLM` is a project-side custom architecture, not a built-in Transformers architecture.
+
+ ## Tokenizer
+
+ The tokenizer was trained for Korean, English, code, shell/terminal text, and JSON/tool-call formats. It intentionally keeps common chat/tool special tokens as stable single tokens where possible.
+
+ | Sample bucket | chars/token |
  |---|---:|
+ | Korean general text | 2.60 |
+ | Korean legal text | 2.36 |
+ | Korean terminal instruction | 2.18 |
  | shell command | 2.68 |
+ | tool-call JSON | 3.32 |
  | Python code | 3.37 |
+ | English | 4.40 |
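
A chars/token figure like the ones in the table can be computed as total characters divided by total tokens over a sample bucket. This is a minimal sketch under that assumption; the card does not state the exact measurement procedure (for example, whether text was NFC-normalized first), and `chars_per_token` and `toy_encode` are hypothetical names.

```python
def chars_per_token(texts, encode):
    """Average characters per token, given encode: str -> list of token ids."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    if total_tokens == 0:
        raise ValueError("no tokens produced")
    return total_chars / total_tokens


def toy_encode(text):
    """Stand-in encoder so the sketch runs offline: one token per word."""
    return text.split()


print(round(chars_per_token(["ls -la /tmp", "du -sh ."], toy_encode), 2))
```

To measure the real tokenizer, `encode` could be replaced with something like `lambda t: tokenizer(t, add_special_tokens=False)["input_ids"]`, using the `AutoTokenizer` call shown in the Usage section.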

+ Important formatting tokens include:
+
+ - `<|im_start|>`
+ - `<|im_end|>`
+ - `<|box_end|>`
+ - `<|object_ref_start|>` for direct condition
+ - `<|object_ref_end|>` for cot condition
+ - `<|quad_start|>` for noisy condition
+ - `<|quad_end|>` for synth condition
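
The condition-to-token mapping in the list can be captured in a small lookup. The prompt layout here (`<|im_start|>`, then the condition token, then the instruction, then `<|im_end|>`) is inferred from the single tokenizer example in this README; it is an assumption, not a documented template, and `build_prompt` is a hypothetical helper. How `<|box_end|>` is used is not shown in the source, so it is left out.

```python
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synth": "<|quad_end|>",
}


def build_prompt(condition, instruction):
    """Wrap an instruction with the condition token (layout is assumed)."""
    try:
        cond_token = CONDITION_TOKENS[condition]
    except KeyError:
        raise ValueError(f"unknown condition: {condition!r}") from None
    return f"<|im_start|>{cond_token}{instruction}<|im_end|>"


print(build_prompt("direct", "List the largest files in the current directory."))
```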

+ ## Usage

+ ### Tokenizer

+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     "LLM-OS-Models/KoHRM-Text-1.4B",
+     use_fast=True,
+ )
+
+ # Korean prompt: "Please tell me, in Korean, the command to find large files in the current directory."
+ prompt = "<|im_start|><|object_ref_start|>한국어로 현재 디렉터리의 큰 파일을 찾는 명령을 알려주세요.<|im_end|>"
+ ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
+ print(len(ids), ids[:20])
+ ```
+
+ ### Model Weights
+
+ The repo currently contains a model-only `safetensors` export. Because the architecture is custom (`hrm_text`), direct `AutoModelForCausalLM.from_pretrained(...)` generation requires an HRM-Text-compatible modeling wrapper or remote-code integration. Until that wrapper is added to the model repo, use the project code and raw FSDP2 checkpoint path for internal inference/resume workflows.
+
+ Raw checkpoint inference pattern:
+
+ ```python
+ from simple_inference_engine import inference_load_checkpoint, inference_generate
+
+ ckpt = inference_load_checkpoint(
+     ckpt_path="/path/to/KoHRM-Text-1.4B-stage1-hrm-fastcap-gbs180",
+     ckpt_epoch=25000,
+     ckpt_use_ema=True,
+     device="cuda",
+ )
+
+ # Korean prompt: "Please explain, in Korean, the difference between `du` and `df`."
+ prompts = iter([
+     (0, ("direct", "한국어로 `du`와 `df`의 차이를 설명해주세요.")),
+ ])
+
+ for _, text in inference_generate(
+     ckpt,
+     prompts,
+     max_tokens=4096,
+     max_generation=512,
+     batch_size=1,
+     temp=0.0,
+ ):
+     print(text)
+ ```
+
+ For code and training scripts, see https://github.com/LLM-OS-Models/KoHRM-text.
+
+ ## Training Data
+
+ All datasets are converted into HRM-Text V1Dataset style records with `instruction`, `response`, and `condition` fields where possible. The training objective is PrefixLM response-only loss, so the model is trained to predict the response span after seeing the instruction/prompt span.
+
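The response-only objective described above can be illustrated with a loss mask that is 0 over the instruction span and 1 over the response span. This is a minimal sketch, not the upstream implementation: the token ids are made up, `build_example` and `response_only_loss` are hypothetical helpers, and label shifting and the PrefixLM attention pattern are deliberately left out.

```python
def build_example(instruction_ids, response_ids):
    """Concatenate prefix and response; mark only response positions for loss."""
    input_ids = list(instruction_ids) + list(response_ids)
    loss_mask = [0] * len(instruction_ids) + [1] * len(response_ids)
    return input_ids, loss_mask


def response_only_loss(per_token_loss, loss_mask):
    """Mean of per-token losses over the masked-in (response) positions."""
    kept = [loss for loss, keep in zip(per_token_loss, loss_mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# A record shaped like the fields named above, with toy token ids.
record = {"instruction": [11, 12, 13], "response": [21, 22], "condition": "direct"}
ids, mask = build_example(record["instruction"], record["response"])
print(ids, mask)
# Losses on the three instruction positions are ignored entirely.
print(response_only_loss([9.0, 9.0, 9.0, 1.0, 3.0], mask))
```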
+ Completed and prepared datasets:
+
+ | Dataset | Tokens | Disk | Use |
+ |---|---:|---:|---|
+ | `koterm_pretrain_mix_v1` | 711.3M | 2.8G | stage-0/stage0b |
+ | HRM cleaned base sample | 250.0M | 994M | included in stage-0 mix |
+ | SWE-ZERO + GLM pilot mix | 251.2M | 990M | included in stage-0 mix |
+ | Korean legal SFT/task data | 83.1M | 336M | included in stage-0 mix |
+ | ToolBench train tool-call data | 127.0M | 500M | included in stage-0 mix |
+ | HRM cleaned fast-cap stage-1 | 14.55B | 148G | current stage-1 |
+ | Korean statutes/local ordinances raw full | 308.9M | 1.2G | prepared for later stages |
+ | Korean administrative rules + precedents raw full | 271.7M | 1.1G | prepared for later stages |
+ | Korean Wikipedia raw full | 462.5M | 1.8G | prepared for later stages |
+ | HF extra reasoning/agent/mm subset | 112.6M | 444M | prepared, limited weight |
+ | Local terminal conversations | 9.39B | 36G | prepared for terminal-heavy later stages |
+ | SWE-ZERO prepared | 182.7M | 720M | pretraining and later SFT |
+ | GLM reasoning prepared | 68.5M | 282M | pretraining and later SFT |
+
+ Major source groups:
+
+ - Upstream HRM-Text cleaned pretraining data from `sapientinc/HRM-Text-data-io-cleaned-20260515`
+ - Korean Wikipedia
+ - Korean statutes, local ordinances, administrative rules, and precedent corpora
+ - ToolBench train trajectories and tool-use instructions
+ - Local terminal/code/math conversations
+ - SWE-ZERO terminal/code trajectories
+ - GLM reasoning samples
+ - Small, reviewed subsets of extra reasoning/agent datasets
+
+ Evaluation-like data is excluded from training where identified, including ToolBench eval, Terminal Bench 2 style data, and benchmark-oriented `chi-bench` data.
+
+ ## Training Run
+
+ The current public checkpoint was produced through staged pretraining:
+
+ 1. Train `stage-0` on `koterm_pretrain_mix_v1` with 711.3M tokens.
+ 2. Continue once more on the same available mix as `stage0b`.
+ 3. Continue to `stage-1` on HRM cleaned fast-cap data with 14.55B tokens.
+ 4. Convert `stage1 step_25000` EMA weights to `safetensors` and upload to the main model repo.
+
+ Current long-running stage-1 settings:
+
+ | Field | Value |
+ |---|---|
+ | Hardware | 8 x NVIDIA H200 |
+ | Data | `koterm_hrm_cleaned_fastcap_stage1_v1` |
+ | Tokens in current stage dataset | 14.55B |
+ | Global batch | 180,224 tokens |
+ | Local token slots/GPU | 22,528 |
+ | Context | 4,096 |
+ | LR | 2.2e-4 |
+ | LR warmup | 2,000 steps |
+ | Checkpoint interval | 5,000 steps |
+ | Current public export | `step_25000`, EMA, safetensors |
+
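The batch figures in the table are mutually consistent, which a few lines of arithmetic confirm. The sketch assumes the global batch is split evenly across the 8 GPUs; the tokens-per-interval figure additionally assumes every step in the interval uses this batch size.

```python
NUM_GPUS = 8
LOCAL_SLOTS = 22_528          # local token slots per GPU, from the table
CONTEXT = 4_096               # context length in tokens
CHECKPOINT_INTERVAL = 5_000   # steps between checkpoints

global_batch = NUM_GPUS * LOCAL_SLOTS
contexts_per_gpu = LOCAL_SLOTS / CONTEXT
tokens_per_interval = global_batch * CHECKPOINT_INTERVAL

print(global_batch)           # 180224
print(contexts_per_gpu)       # 5.5
print(tokens_per_interval)    # 901120000
```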
+ The run uses staged continuation. The checkpoint carries model, optimizer, EMA, and recurrent carry state forward. `resume_step_offset` and `total_steps_override` are used so the learning-rate schedule follows the intended longer pretraining run rather than resetting at every data stage.
+
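The effect of `resume_step_offset` and `total_steps_override` can be illustrated with a toy schedule evaluated on the global step rather than the per-stage step. The linear-warmup-plus-cosine shape below is an assumption chosen for illustration, since the card does not state the actual schedule shape; `lr_at` and the `TOTAL` value are hypothetical, while the peak LR and warmup length come from the table above.

```python
import math


def lr_at(local_step, resume_step_offset, total_steps,
          peak_lr=2.2e-4, warmup_steps=2_000):
    """Hypothetical warmup + cosine schedule evaluated on the global step."""
    step = local_step + resume_step_offset
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


TOTAL = 100_000  # placeholder for total_steps_override, not a value from the source

# End of one stage and start of the next agree when the offset is carried over.
print(lr_at(10_000, 0, TOTAL), lr_at(0, 10_000, TOTAL))
# Without the offset, the new stage would drop back to the start of warmup.
print(lr_at(0, 0, TOTAL))
```

The design point is that a stage boundary becomes invisible to the schedule: only the data source changes, while the learning rate continues along one curve.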
+ The full HRM 328G cleaned corpus is being retokenized with the new 131K tokenizer. That full no-cap retokenization is intended to support a larger 40B+ token training continuation, instead of stopping at the 14.55B fast-cap stage.
+
+ ## Intended Use

+ This checkpoint is intended for:

+ - continued pretraining experiments
+ - Korean tokenizer and HRM-Text architecture experiments
+ - terminal/tool-call/code pretraining research
+ - checkpoint conversion and evaluation work

+ It is not yet intended as a finished assistant model.

+ ## Limitations

+ - This is an intermediate checkpoint, not a final aligned instruct model.
+ - It has not completed the full planned 40B+ token continuation.
+ - It has not completed final SFT or safety tuning.
+ - Public benchmark scores for this new checkpoint are not final.
+ - Direct Transformers generation requires adding the custom `hrm_text` modeling wrapper or remote-code files.
+ - Tool-call JSON validity and terminal action safety must be evaluated before production use.

+ ## Citation

+ This work builds on the HRM-Text architecture and training stack:

+ - Paper: https://arxiv.org/html/2605.20613
+ - Upstream code: https://github.com/sapientinc/HRM-Text

VRAM_OOM_NOTES_2026-05-24.md ADDED
@@ -0,0 +1,141 @@
+ # KoHRM-Text VRAM / OOM Notes
+
+ Written: 2026-05-24
+
+ This document records why VRAM grows over time during `KoHRM-Text-1.4B` stage-1 training, what caused the earlier OOMs, and the current operating guidelines.
+
+ ## Current Observed State
+
+ The current stage-1 run is training normally with the following settings.
+
+ | Item | Value |
+ |---|---:|
+ | GPU | 8 x NVIDIA H200 |
+ | GPU utilization | 99% on all 8 GPUs |
+ | global batch | 180,224 tokens |
+ | local token slots/GPU | 22,528 |
+ | context | 4,096 |
+ | VRAM | about 129.9GB on GPU0, about 127.6GB on the others, out of 143.8GB |
+ | speed | about 1.02 step/sec |
+ | checkpoint interval | 5,000 steps |
+
+ The current setting is fast but leaves little VRAM headroom. Each H200 uses 127-130GB of its roughly 144GB, so the OOM risk can return if transient peaks from NCCL, the allocator, the compiler, caches, and checkpointing coincide.
+
+ ## Why VRAM Keeps Rising During Training
+
+ A VRAM increase does not by itself mean a "memory leak". In large PyTorch/FSDP/compile training runs, it is common for VRAM to end up higher later in the run than at the start, as the following factors stack up.
+
+ ### 1. torch.compile / CUDA graph / kernel cache
+
+ The HRM-Text code compiles several forward/backward paths. In the first few steps, not every shape/path has been compiled yet; additional graphs, Triton kernels, and CUDA kernel caches are created as training proceeds.
+
+ In particular, the HRM architecture has H/L recurrent cycles and a PrefixLM loss, so its compile paths are more complex than those of a plain decoder-only Transformer. If the batch is sized from early VRAM readings alone, a later graph may fail to get the extra memory it needs and trigger an OOM.
+
+ ### 2. Final logits buffer size
+
+ This model has a vocab of 131,072, twice the 65,536 vocab of the upstream HRM-Text paper setup.
+
+ As the number of batch token slots grows, the temporary buffers for the final logits or the loss computation become very large.
+
+ For example, with 32,768 local token slots per GPU, a `32768 x 131072` bf16 logits-type buffer may be needed. In theory a single dense bf16 buffer alone is about 8GB or more, and the total is much larger once the actual backward/temporary/parallel buffers are included.
+
+ This is why `global_batch_size=262144` or `229376` can run for a while at first and then hit an OOM later, at the moment the compile graphs and the logits/loss temporary buffers overlap.
+
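The buffer estimate above can be reproduced for each batch size mentioned in these notes. The sketch assumes the global batch is split evenly across 8 GPUs (consistent with 180,224 / 8 = 22,528) and counts only one dense bf16 copy of the logits; real peaks include gradients and temporaries on top of this. `logits_buffer_gib` is a hypothetical helper.

```python
VOCAB = 131_072
BYTES_BF16 = 2
NUM_GPUS = 8


def logits_buffer_gib(global_batch, vocab=VOCAB, num_gpus=NUM_GPUS):
    """Size in GiB of one dense bf16 [local_slots, vocab] buffer per GPU."""
    local_slots = global_batch // num_gpus
    return local_slots * vocab * BYTES_BF16 / 2**30


for gbs in (262_144, 229_376, 180_224):
    print(gbs, gbs // NUM_GPUS, logits_buffer_gib(gbs))
# 262144 32768 8.0
# 229376 28672 7.0
# 180224 22528 5.5
```

With the upstream 65,536 vocab, each of these figures would be halved, which is the sense in which the memory profile differs from the paper setup.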
+ ### 3. FSDP2 / optimizer / EMA state
+
+ The current run holds more than just the model weights.
+
+ - model parameters
+ - gradients
+ - optimizer state
+ - Adam-atan2 state
+ - EMA state
+ - FSDP shard/all-gather/reduce-scatter buffers
+ - recurrent carry-related state
+
+ These states do not always appear at the same size at the same moment in every step. Peaks can rise at particular backward paths, optimizer steps, and checkpoint saves.
+
+ ### 4. NCCL communication buffers
+
+ Distributed training on 8 GPUs needs NCCL communication buffers. Depending on all-gather/reduce-scatter timing, bucket sizes, and the execution order of compiled graphs, the peak can look different on each GPU.
+
+ It is also generally possible for GPU0 to read higher than the other GPUs, because rank0 sometimes takes on more of the logging, some metadata, checkpoint coordination, and dataloader/host interaction.
+
+ ### 5. CUDA caching allocator
+
+ The used memory reported by `nvidia-smi` does not mean only "memory that tensors are actually using right now". When the PyTorch CUDA allocator keeps blocks it has already obtained in its cache for reuse, `nvidia-smi` continues to show them as in use.
+
+ So it can be normal for used memory to climb as steps proceed and not come back down. What matters is whether reserved memory keeps growing without bound or settles into a stable plateau after a certain step.
+
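The plateau-versus-growth question can be made concrete by comparing recent reserved-memory samples against slightly older ones. This is a sketch with illustrative thresholds, not a rule from these notes; `classify_reserved` is a hypothetical helper, and in a PyTorch process the samples could come from `torch.cuda.memory_reserved()`, as opposed to `torch.cuda.memory_allocated()`, which counts only live tensors.

```python
def classify_reserved(samples_gb, window=3, tolerance_gb=0.5):
    """Label a series of reserved-memory samples as 'plateau' or 'growing'.

    Compares the max of the last `window` samples with the max of the
    `window` samples immediately before them.
    """
    if len(samples_gb) < 2 * window:
        return "insufficient data"
    recent = max(samples_gb[-window:])
    earlier = max(samples_gb[-2 * window:-window])
    return "growing" if recent - earlier > tolerance_gb else "plateau"


# Rises early, then flattens near 127.6 GB.
print(classify_reserved([100, 110, 120, 126, 127, 127.5, 127.6, 127.6, 127.6]))
# Still climbing by several GB between windows.
print(classify_reserved([100, 104, 108, 112, 116, 120]))
```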
+ ### 6. Transient peaks during checkpoint saves
+
+ When an FSDP2 checkpoint is saved, `.distcp` shards, metadata, state_dict materialization, and host/device transfers all overlap. The save itself is mostly CPU/disk work, but accessing model state just before and after the save can produce GPU/CPU memory peaks.
+
+ Saving checkpoints too often therefore causes the following problems.
+
+ - delayed step processing
+ - a sharp rise in disk usage
+ - higher HF upload and scan costs
+ - higher peak memory at save time
+
+ Currently a roughly 21GB FSDP2 checkpoint is produced every 5,000 steps. Saving every 500 steps would multiply the checkpoint count and save load for stage-1 by 10, which is excessive.
+
+ ## Cause of the Earlier OOMs
+
+ The earlier OOMs happened because, with a large batch, the setting was judged "fine" from early VRAM observations alone.
+
+ The key points are:
+
+ 1. With a 131K vocab, the temporary buffers for logits/loss are large.
+ 2. The HRM recurrent compile paths demand additional memory a few steps into the run.
+ 3. With 8 x H200 there is enough compute, but with the combination of 1.4B parameters, 131K vocab, EMA, optimizer state, and FSDP2, an oversized batch runs into late peaks.
+ 4. `global_batch_size=262144` and `229376` looked feasible at first but lacked a safety margin.
+
+ The run has since been lowered to `global_batch_size=180224` and is progressing stably.
+
+ ## Operating Guidelines
+
+ In the current stage-1, keeping the GPUs busy is the priority, but when an OOM kills the run, the cost of restarting, validating, and cleaning up checkpoints is higher.
+
+ Recommended guidelines:
+
+ | Item | Guideline |
+ |---|---|
+ | primary batch | `global_batch_size=180224` |
+ | save interval | `checkpoint_step_interval=5000` |
+ | local retention | keep only the latest 2-3 checkpoints |
+ | HF main repo | centered on the latest safetensors export |
+ | HF raw repo | store separately only the FSDP2 checkpoints needed for resume |
+ | on OOM recurrence | lower the batch by 5-10% and restart from the same resume checkpoint |
+
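The local retention rule in the table (keep only the latest 2-3 checkpoints) amounts to selecting everything older than the newest few for deletion. This is a minimal sketch of that selection only; `checkpoints_to_delete` is a hypothetical helper, and it neither touches the filesystem nor reflects any script in the project repo.

```python
def checkpoints_to_delete(steps, keep_latest=3):
    """Return checkpoint steps that fall outside the newest `keep_latest`."""
    if keep_latest < 1:
        raise ValueError("keep_latest must be at least 1")
    ordered = sorted(set(steps))
    return ordered[:-keep_latest]


print(checkpoints_to_delete([5000, 10000, 15000, 20000, 25000]))
# [5000, 10000]
```

Returning the list instead of deleting directly leaves room for a check that any checkpoint still needed for resume has already been uploaded to the raw repo.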
+ ## Why a 500-Step Checkpoint Interval Is Excessive
+
+ Saving every 500 steps causes the following problems.
+
+ - A single FSDP2 checkpoint is currently about 21GB.
+ - At a 500-step interval, every 10,000 steps produces about 20 checkpoints, i.e. about 420GB.
+ - Over the full 88,522 steps of stage-1, simple arithmetic gives more than 170 checkpoints, totaling several TB.
+ - The saves themselves interrupt the training loop, and HF upload/scan volume grows as well.
+
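The storage figures in the list follow from simple arithmetic, sketched below. It assumes every checkpoint is kept and is about 21GB, and it counts only checkpoints at exact multiples of the interval; `checkpoint_footprint` is a hypothetical helper.

```python
CKPT_GB = 21
STAGE1_STEPS = 88_522


def checkpoint_footprint(total_steps, interval, ckpt_gb=CKPT_GB):
    """Return (checkpoint count, total size in GB) if nothing is pruned."""
    count = total_steps // interval
    return count, count * ckpt_gb


print(checkpoint_footprint(10_000, 500))          # (20, 420)
print(checkpoint_footprint(STAGE1_STEPS, 500))    # (177, 3717)
print(checkpoint_footprint(STAGE1_STEPS, 5_000))  # (17, 357)
```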
+ It is therefore right to keep saving at the current 5,000-step interval and to retain only the latest 2-3 checkpoints locally.
+
+ ## Deciding the Next Batch Adjustment
+
+ Current VRAM usage is high, but training speed is stable.
+
+ To raise the batch in the next stage, do not raise it by a large amount in one go; the following order is preferable.
+
+ 1. Confirm stable completion at `global_batch_size=180224`
+ 2. Test `196608` in the next dataset stage
+ 3. Confirm a VRAM plateau over at least 2,000-3,000 steps
+ 4. Keep the setting if it survives through a checkpoint save
+ 5. On OOM or unstable peaks, immediately revert to `180224` or `172032`
+
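The numbered procedure above can be written as a small decision helper. This is one hypothetical encoding of those steps, not code from the project: the function names are made up, and the choice to fall back to `172032` only when the run is already at or below `180224` is an interpretation of step 5.

```python
STABLE = 180_224
TRIAL = 196_608
LOWER_FALLBACK = 172_032


def next_global_batch(current, stage_completed_stably, oom_or_unstable):
    """Pick the global batch for the next launch (hypothetical encoding)."""
    if oom_or_unstable:
        # Step 5: revert to the stable value, or one notch lower if already there.
        return STABLE if current > STABLE else LOWER_FALLBACK
    if current == STABLE and stage_completed_stably:
        # Steps 1-2: only try the larger batch in the next dataset stage.
        return TRIAL
    return current


def trial_passes(plateau_steps, survived_checkpoint_save, min_plateau_steps=2_000):
    """Steps 3-4: keep the trial batch only after a plateau and a clean save."""
    return plateau_steps >= min_plateau_steps and survived_checkpoint_save


print(next_global_batch(STABLE, True, False))
print(next_global_batch(TRIAL, False, True))
print(trial_passes(2_500, True))
```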
+ Compared with the paper's setup, 8 x H200 is powerful, but this model has a 131K vocab, so its memory profile differs from upstream. Sizing the batch on the premise that "it's H200, so it must exceed the 16 x H100 batch" therefore reduces stability.
+
+ ## Conclusion
+
+ The current VRAM rise is best understood as the combined result of torch compile/cache growth, the 131K-vocab logits buffer, FSDP2/optimizer/EMA/NCCL buffers, and transient checkpoint peaks.
+
+ The current policy of `global_batch_size=180224`, a checkpoint every 5,000 steps, and keeping the latest 2-3 checkpoints is a realistic balance between fast training and OOM avoidance. A small increase will be considered only in the next stage, once training shows a fully stable plateau.
+