@@ -15,76 +15,216 @@ pipeline_tag: text-generation
 # KoHRM-Text-1.4B
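-`KoHRM-Text-1.4B` is a model pretrained from scratch on the PrefixLM training structure of `sapientinc/HRM-Text`, targeting Korean, English, coding, terminal, and tool-call usability.
-This card is a working draft of the model card as of 2026-05-23. The current main artifact is `model.safetensors`, a safe-format conversion of the stage0b checkpoint. The raw HRM-Text FSDP2 checkpoints exist for optimizer/EMA resume, so they are removed from the main repo and split into a separate raw checkpoint repo.
-## Model Info
-| Field | Value |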
 |---|---|
-| model id | `LLM-OS-Models/KoHRM-Text-1.4B` |
-| base code | `sapientinc/HRM-Text` |
-| training from | scratch |
-| architecture | HRM-Text `XL` |
-| params | 1,384,120,320 |
-| context | 4096 tokens |
-| dtype | bfloat16 |
-| tokenizer | byte-level BPE, NFC normalization |
-| vocab | 131,072 |
-## Tokenizer
-The new tokenizer was trained with Korean, English, code, shell, terminal instructions, and JSON tool calls all in mind.
-| Sample | chars/token |
 |---|---:|
-| Korean general | 2.60 |
-| Korean legal | 2.36 |
-| Korean terminal instruction | 2.18 |
 | shell command | 2.68 |
-| tool JSON | 3.32 |
 | Python code | 3.37 |
-| English | 4.40 |
-Tokenizer repo: `LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K`
-## Training Data
-The stage-0/stage0b input is a fully preprocessed 711.3M-token mix.
-| Data | Tokens |
-|---|---:|
-| HRM cleaned base sample | 250.0M |
-| SWE-ZERO + GLM reasoning mix | 251.2M |
-| Korean statutes/local ordinances/administrative rules/precedents task | 83.1M |
-| ToolBench train tool-call task | 127.0M |
-| Total | 711.3M |
-Stage-1 is currently training on the HRM cleaned fast-cap V1Dataset (14.55B tokens). Later stages will successively add the local terminal dataset and additional Korean/coding/tool-call data. Evaluation-type data (`tb2_lite`, Terminal Bench 2, ToolBench eval, chi-bench) is excluded from training.
-## Training Setup
-- Objective: PrefixLM style response-only loss
-- Optimizer: HRM-Text upstream Adam-atan2
-- Context: 4096 tokens
-- Hardware: 8 x NVIDIA H200
-- Current stage-1 global batch: 229,376 tokens
-- Checkpoint policy: `safetensors` in the main repo; raw FSDP2 in a separate raw checkpoint repo
-Stage-1 was first attempted with `global_batch_size=262144`, but a later compile graph needed an additional `32768 x 131072` bf16 logits buffer allocation and ran out of memory. It has been restarted with `global_batch_size=229376` and is in progress; observed VRAM is about 105GB on GPU0 and about 103GB on the others. Steady-state speed is about `1.02-1.03 step/sec`.
-Staged pretraining carries over the checkpoint's model/optimizer/EMA/carry state and uses `resume_step_offset` and `total_steps_override` to align the LR schedule with the full pretraining run. In other words, training restarts whenever new data is ready, but the optimizer and schedule are not interrupted.
-## Current Status
-- stage-0/stage0b training: complete
-- stage0b safetensors HF upload: complete
-- unsafe raw DCP files removed from main HF repo
-- stage-1 HRM fast-cap training: in progress
-- final Transformers conversion: not yet produced
-- public benchmark score: not yet evaluated for this model
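-## Limitations
-The current checkpoint artifact is an intermediate training output. It has not completed safety alignment, final instruction tuning, final benchmarking, or deployment conversion. Korean terminal/tool-call ability is a target area, but stage-0 alone does not guarantee finished performance.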

 # KoHRM-Text-1.4B
+`KoHRM-Text-1.4B` is a scratch-pretrained Korean/English/code/terminal/tool-use model based on the `sapientinc/HRM-Text` PrefixLM training stack.
+This is not a continued finetune of `sapientinc/HRM-Text-1B`. It uses a new Korean/terminal-oriented 131K byte-level BPE tokenizer and a new scratch training run.
+## Links
+| Item | Link |
 |---|---|
+| HF model | https://huggingface.co/LLM-OS-Models/KoHRM-Text-1.4B |
+| Project code | https://github.com/LLM-OS-Models/KoHRM-text |
+| Upstream HRM-Text code | https://github.com/sapientinc/HRM-Text |
+| HRM-Text paper | https://arxiv.org/html/2605.20613 |
+| Tokenizer | https://huggingface.co/LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K |
+| Raw resume checkpoints | https://huggingface.co/LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints |
+## Release Policy
+The main model repository is intended to expose the latest model-only artifact:
+- `model.safetensors`
+- `config.json`
+- `tokenizer.json`
+- `tokenizer_config.json`
+- `README.md`
+It is not intended to keep every training checkpoint as visible model files. Intermediate FSDP2 `.distcp` checkpoints are large resume artifacts and are kept separately in `LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints` when needed. The main repo may still have normal Hugging Face git history, but the current file tree should be treated as the latest public model export.
+Current public artifact: `stage1` HRM fast-cap checkpoint at `step_25000`, converted with EMA weights to `safetensors`. Training is still in progress.
+## Model Details
+| Field | Value |
+|---|---|
+| Model id | `LLM-OS-Models/KoHRM-Text-1.4B` |
+| Standard name | `KoHRM-Text-1.4B` |
+| Training origin | scratch |
+| Architecture family | HRM-Text PrefixLM |
+| Architecture size | `XL` |
+| Parameters | 1,384,120,320 |
+| Context length | 4,096 tokens |
+| Training dtype | bfloat16 |
+| Tokenizer | byte-level BPE, NFC normalization |
+| Vocabulary size | 131,072 |
+| Objective | PrefixLM response-only loss |
+| Optimizer | Adam-atan2 from upstream HRM-Text |
+| EMA | 0.9999 |
+The model config uses `model_type: hrm_text` and `architectures: ["HrmTextForCausalLM"]`. At the time of this checkpoint, `HrmTextForCausalLM` is a project-side custom architecture, not a built-in Transformers architecture.
+## Tokenizer
+The tokenizer was trained for Korean, English, code, shell/terminal text, and JSON/tool-call formats. It intentionally keeps common chat/tool special tokens as stable single tokens where possible.
+| Sample bucket | chars/token |
 |---|---:|
+| Korean general text | 2.60 |
+| Korean legal text | 2.36 |
+| Korean terminal instruction | 2.18 |
 | shell command | 2.68 |
+| tool-call JSON | 3.32 |
 | Python code | 3.37 |
+| English | 4.40 |
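As a rough sketch of how a chars/token figure like those in the table could be measured: the helper name `chars_per_token` and the whitespace stand-in encoder are illustrative assumptions, not the project's evaluation script.

```python
from typing import Callable, Sequence


def chars_per_token(text: str, encode: Callable[[str], Sequence[int]]) -> float:
    """Average number of characters covered by one token of `text`."""
    ids = encode(text)
    if not ids:
        raise ValueError("encoder returned no tokens")
    return len(text) / len(ids)


# Stand-in encoder for illustration only: one "token" per whitespace-separated word.
def toy_encode(text: str) -> list[int]:
    return list(range(len(text.split())))


ratio = chars_per_token("ls -la /var/log", toy_encode)
print(f"{ratio:.2f} chars/token")
```

With the real tokenizer, `encode` could be `lambda s: tokenizer(s, add_special_tokens=False)["input_ids"]`; averaging over many samples per bucket would give a more stable figure than a single string.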
+Important formatting tokens include:
+- `<|im_start|>`
+- `<|im_end|>`
+- `<|box_end|>`
+- `<|object_ref_start|>` for direct condition
+- `<|object_ref_end|>` for cot condition
+- `<|quad_start|>` for noisy condition
+- `<|quad_end|>` for synth condition
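A minimal sketch of how the condition markers listed above might be applied when building a prompt. The mapping is taken from the list; the helper `build_prompt` is hypothetical, and the layout simply mirrors the single prompt string shown in the tokenizer usage example, so the project's actual template may differ.

```python
# Condition name -> marker token, as listed in this card.
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synth": "<|quad_end|>",
}


def build_prompt(instruction: str, condition: str = "direct") -> str:
    """Wrap an instruction with chat delimiters and a condition marker (assumed layout)."""
    try:
        marker = CONDITION_TOKENS[condition]
    except KeyError:
        raise ValueError(f"unknown condition: {condition!r}") from None
    return f"<|im_start|>{marker}{instruction}<|im_end|>"


print(build_prompt("List the largest files in the current directory.", "cot"))
```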
+## Usage
+### Tokenizer
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained(
+    "LLM-OS-Models/KoHRM-Text-1.4B",
+    use_fast=True,
+)
+prompt = "<|im_start|><|object_ref_start|>한국어로 현재 디렉터리의 큰 파일을 찾는 명령을 알려주세요.<|im_end|>"
+ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
+print(len(ids), ids[:20])
+```
+### Model Weights
+The repo currently contains a model-only `safetensors` export. Because the architecture is custom (`hrm_text`), direct `AutoModelForCausalLM.from_pretrained(...)` generation requires an HRM-Text-compatible modeling wrapper or remote-code integration. Until that wrapper is added to the model repo, use the project code and raw FSDP2 checkpoint path for internal inference/resume workflows.
+Raw checkpoint inference pattern:
+```python
+from simple_inference_engine import inference_load_checkpoint, inference_generate
+ckpt = inference_load_checkpoint(
+    ckpt_path="/path/to/KoHRM-Text-1.4B-stage1-hrm-fastcap-gbs180",
+    ckpt_epoch=25000,
+    ckpt_use_ema=True,
+    device="cuda",
+)
+prompts = iter([
+    (0, ("direct", "한국어로 `du`와 `df`의 차이를 설명해주세요.")),
+])
+for _, text in inference_generate(
+    ckpt,
+    prompts,
+    max_tokens=4096,
+    max_generation=512,
+    batch_size=1,
+    temp=0.0,
+):
+    print(text)
+```
+For code and training scripts, see https://github.com/LLM-OS-Models/KoHRM-text.
+## Training Data
+All datasets are converted into HRM-Text V1Dataset style records with `instruction`, `response`, and `condition` fields where possible. The training objective is PrefixLM response-only loss, so the model is trained to predict the response span after seeing the instruction/prompt span.
+Completed and prepared datasets:
+| Dataset | Tokens | Disk | Use |
+|---|---:|---:|---|
+| `koterm_pretrain_mix_v1` | 711.3M | 2.8G | stage-0/stage0b |
+| HRM cleaned base sample | 250.0M | 994M | included in stage-0 mix |
+| SWE-ZERO + GLM pilot mix | 251.2M | 990M | included in stage-0 mix |
+| Korean legal SFT/task data | 83.1M | 336M | included in stage-0 mix |
+| ToolBench train tool-call data | 127.0M | 500M | included in stage-0 mix |
+| HRM cleaned fast-cap stage-1 | 14.55B | 148G | current stage-1 |
+| Korean statutes/local ordinances raw full | 308.9M | 1.2G | prepared for later stages |
+| Korean administrative rules + precedents raw full | 271.7M | 1.1G | prepared for later stages |
+| Korean Wikipedia raw full | 462.5M | 1.8G | prepared for later stages |
+| HF extra reasoning/agent/mm subset | 112.6M | 444M | prepared, limited weight |
+| Local terminal conversations | 9.39B | 36G | prepared for terminal-heavy later stages |
+| SWE-ZERO prepared | 182.7M | 720M | pretraining and later SFT |
+| GLM reasoning prepared | 68.5M | 282M | pretraining and later SFT |
+Major source groups:
+- Upstream HRM-Text cleaned pretraining data from `sapientinc/HRM-Text-data-io-cleaned-20260515`
+- Korean Wikipedia
+- Korean statutes, local ordinances, administrative rules, and precedent corpora
+- ToolBench train trajectories and tool-use instructions
+- Local terminal/code/math conversations
+- SWE-ZERO terminal/code trajectories
+- GLM reasoning samples
+- Small, reviewed subsets of extra reasoning/agent datasets
+Evaluation-like data is excluded from training where identified, including ToolBench eval, Terminal Bench 2 style data, and benchmark-oriented `chi-bench` data.
+## Training Run
+The current public checkpoint was produced through staged pretraining:
+1. Train `stage-0` on `koterm_pretrain_mix_v1` with 711.3M tokens.
+2. Continue for a second pass on the same mix, as `stage0b`.
+3. Continue to `stage-1` on HRM cleaned fast-cap data with 14.55B tokens.
+4. Convert `stage1 step_25000` EMA weights to `safetensors` and upload to the main model repo.
+Current long-running stage-1 settings:
+| Field | Value |
+|---|---|
+| Hardware | 8 x NVIDIA H200 |
+| Data | `koterm_hrm_cleaned_fastcap_stage1_v1` |
+| Tokens in current stage dataset | 14.55B |
+| Global batch | 180,224 tokens |
+| Local token slots/GPU | 22,528 |
+| Context | 4,096 |
+| LR | 2.2e-4 |
+| LR warmup | 2,000 steps |
+| Checkpoint interval | 5,000 steps |
+| Current public export | `step_25000`, EMA, safetensors |
+The run uses staged continuation. The checkpoint carries model, optimizer, EMA, and recurrent carry state forward. `resume_step_offset` and `total_steps_override` are used so the learning-rate schedule follows the intended longer pretraining run rather than resetting at every data stage.
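The effect of the two resume parameters can be illustrated with a small sketch. The schedule shape here (linear warmup followed by cosine decay to zero) and the total of 100,000 steps are assumptions made for illustration; only the parameter names, the peak LR, and the warmup length come from this card, and the upstream implementation may differ.

```python
import math


def lr_at(
    local_step: int,
    *,
    peak_lr: float = 2.2e-4,              # from the settings table
    warmup_steps: int = 2_000,            # from the settings table
    resume_step_offset: int = 0,          # steps already completed before this stage
    total_steps_override: int = 100_000,  # hypothetical length of the whole run
) -> float:
    """Learning rate at a stage-local step, evaluated on the global schedule."""
    step = local_step + resume_step_offset
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    span = max(1, total_steps_override - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


fresh = lr_at(0)                               # a stage that resets the schedule
resumed = lr_at(0, resume_step_offset=25_000)  # a stage that continues it
print(fresh, resumed)
```

Without the offset, a new data stage would start again from warmup; with it, the first step of the new stage picks up the learning rate where the global schedule left off.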
+The full HRM 328G cleaned corpus is being retokenized with the new 131K tokenizer. That full no-cap retokenization is intended to support a larger 40B+ token training continuation, instead of stopping at the 14.55B fast-cap stage.
+## Intended Use
+This checkpoint is intended for:
+- continued pretraining experiments
+- Korean tokenizer and HRM-Text architecture experiments
+- terminal/tool-call/code pretraining research
+- checkpoint conversion and evaluation work
+It is not yet intended as a finished assistant model.
+## Limitations
+- This is an intermediate checkpoint, not a final aligned instruct model.
+- It has not completed the full planned 40B+ token continuation.
+- It has not completed final SFT or safety tuning.
+- Public benchmark scores for this new checkpoint are not final.
+- Direct Transformers generation requires adding the custom `hrm_text` modeling wrapper or remote-code files.
+- Tool-call JSON validity and terminal action safety must be evaluated before production use.
+## Citation
+This work builds on the HRM-Text architecture and training stack:
+- Paper: https://arxiv.org/html/2605.20613
+- Upstream code: https://github.com/sapientinc/HRM-Text

VRAM_OOM_NOTES_2026-05-24.md ADDED

	@@ -0,0 +1,141 @@

+# KoHRM-Text VRAM / OOM Notes
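+Written: 2026-05-24
+This document records why VRAM grows over time during `KoHRM-Text-1.4B` stage-1 training, the causes of the earlier OOMs, and the current operating guidelines.
+## Currently Observed State
+The current stage-1 run is training normally with the following settings.
+| Item | Value |
+|---|---:|
+| GPU | 8 x NVIDIA H200 |
+| GPU utilization | 99% on all 8 GPUs |
+| global batch | 180,224 tokens |
+| local token slots/GPU | 22,528 |
+| context | 4,096 |
+| VRAM | about 129.9GB on GPU0, about 127.6GB on the others, out of 143.8GB |
+| speed | about 1.02 step/sec |
+| checkpoint interval | 5,000 steps |
+The current settings are fast but do not leave a very wide VRAM margin. Each H200 uses 127-130GB of its roughly 144GB, so the OOM risk can return if momentary peaks from NCCL, the allocator, the compiler, caches, and checkpointing overlap.
+## Why VRAM Keeps Rising During Training
+A VRAM increase does not by itself mean a "memory leak". In large PyTorch/FSDP/compile training runs, it is common for VRAM to be higher later than at the start because the following factors overlap.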
+### 1. torch.compile / CUDA graph / kernel cache
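+The HRM-Text code compiles several forward/backward paths. In the first few steps not every shape/path has been compiled yet, and additional graphs, Triton kernels, and CUDA kernel caches are created as training proceeds.
+The HRM structure in particular has H/L recurrent cycles and a PrefixLM loss, so its compile paths are more complex than those of a plain decoder-only Transformer. If the batch is sized from early VRAM readings alone, a later graph may fail to get the extra memory it needs and trigger an OOM.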
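+### 2. Final logits buffer size
+This model has a vocabulary of 131,072, twice the 65,536 vocabulary of the upstream HRM-Text paper setup.
+As the number of batch token slots grows, the temporary buffers for the final logits and the loss computation become very large.
+For example, with 32,768 local token slots per GPU, a `32768 x 131072` bf16 logits-type buffer may be needed. A single dense bf16 buffer of that shape alone is 8 GiB (about 8.6GB), and the total grows much larger once backward, temporary, and parallel buffers are included.
+This is why `global_batch_size=262144` or `229376` can run briefly at first and then hit an OOM later, at the moment compile graphs and logits/loss temporary buffers overlap.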
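The buffer arithmetic above can be checked with a short sketch. It counts only one dense bf16 logits tensor per GPU and assumes the global batch is split evenly across the 8 GPUs; real peak usage also includes gradients and temporaries, which are not modeled here.

```python
VOCAB_SIZE = 131_072
NUM_GPUS = 8
BYTES_PER_BF16 = 2


def logits_buffer_gib(global_batch_tokens: int) -> float:
    """Size in GiB of one dense bf16 [local_slots x vocab] buffer on a single GPU."""
    local_slots = global_batch_tokens // NUM_GPUS
    return local_slots * VOCAB_SIZE * BYTES_PER_BF16 / 2**30


# Batch sizes mentioned in these notes.
for gbs in (262_144, 229_376, 196_608, 180_224):
    print(f"global={gbs:>7}  local={gbs // NUM_GPUS:>6}  logits={logits_buffer_gib(gbs):.1f} GiB")
```

Going from 262,144 to 180,224 tokens shrinks this single buffer from 8.0 GiB to 5.5 GiB per GPU, before counting any of the buffers that scale with it.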
+### 3. FSDP2 / optimizer / EMA state
+Training currently holds much more than the model weights.
+- model parameters
+- gradients
+- optimizer state
+- Adam-atan2 state
+- EMA state
+- FSDP shard/all-gather/reduce-scatter buffers
+- recurrent carry state
+These states do not always appear at the same size at the same moment in every step. Peaks can rise during specific backward paths, the optimizer step, or checkpoint saves.
+### 4. NCCL communication buffers
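+Distributed training on 8 GPUs needs NCCL communication buffers. Per-GPU peaks can look different depending on all-gather/reduce-scatter timing, bucket sizes, and the execution order of the compiled graphs.
+It is also generally possible for GPU0 to show higher usage than the other GPUs, because rank0 sometimes takes on more of the logging, some metadata, checkpoint coordination, and dataloader/host interaction.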
+### 5. CUDA caching allocator
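+The used memory shown by `nvidia-smi` does not mean only "memory that tensors are actually using right now". When the PyTorch CUDA allocator keeps blocks it has already acquired in its cache for reuse, `nvidia-smi` continues to show them as in use.
+So it can be normal for used memory to rise as steps progress and not come back down. What matters is whether reserved memory keeps growing without bound or settles into a stable plateau after a certain step.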
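One simple way to tell a plateau from unbounded growth is to look at the spread of the most recent reserved-memory samples. The sketch below is a hypothetical helper, not part of the training code; the window, the tolerance, and the sample readings are made up for illustration. In practice the samples could come from periodic calls to `torch.cuda.memory_reserved()`.

```python
def has_plateaued(reserved_gb: list[float], window: int = 5, tolerance_gb: float = 0.5) -> bool:
    """True if the last `window` samples stay within `tolerance_gb` of each other."""
    if len(reserved_gb) < window:
        return False  # not enough history to judge
    recent = reserved_gb[-window:]
    return max(recent) - min(recent) <= tolerance_gb


# Made-up readings (GB), one per logging interval: a ramp followed by a flat region.
samples = [103.0, 110.5, 121.0, 127.2, 127.5, 127.6, 127.6, 127.5]

print(has_plateaued(samples[:5]))  # still ramping
print(has_plateaued(samples))      # flat over the last five samples
```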
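+### 6. Momentary peaks during checkpoint saves
+When an FSDP2 checkpoint is saved, `.distcp` shards, metadata, state_dict materialization, and host/device transfers overlap. Saving itself is mostly CPU/disk work, but accessing model state just before and after the save can create GPU/CPU memory peaks.
+Saving checkpoints too often therefore causes the following problems.
+- delayed step processing
+- rapidly growing disk usage
+- higher HF upload and scan cost
+- higher peak memory at save time
+Currently an FSDP2 checkpoint of about 21GB is produced every 5,000 steps. Saving every 500 steps would multiply the checkpoint count and save load by 10 for stage-1, which is excessive.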
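+## Cause of the Earlier OOMs
+The earlier OOMs happened because a large batch was judged "fine" from early VRAM observations alone.
+The key points are:
+1. With a 131K vocabulary, the temporary buffers for logits/loss are large.
+2. The HRM recurrent compile paths demand additional memory a few steps after the start.
+3. Eight H200s provide enough compute, but with the combination of 1.4B parameters, 131K vocabulary, EMA, optimizer state, and FSDP2, too large a batch hits late peaks.
+4. `global_batch_size=262144` and `229376` looked feasible at first but lacked a stable margin.
+The run has since been lowered to `global_batch_size=180224` and is progressing stably.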
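+## Operating Guidelines
+In the current stage-1 the priority is to keep the GPUs busy, but if an OOM kills the run, the cost of restarting, validating, and cleaning up checkpoints is higher.
+Recommended guidelines:
+| Item | Guideline |
+|---|---|
+| primary batch | `global_batch_size=180224` |
+| save interval | `checkpoint_step_interval=5000` |
+| local retention | keep only the latest 2-3 checkpoints |
+| HF main repo | centered on the latest safetensors export |
+| HF raw repo | store separately only the FSDP2 checkpoints needed for resume |
+| if OOM recurs | lower the batch by 5-10% and restart from the same resume checkpoint |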
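+## Why a 500-Step Checkpoint Interval Is Excessive
+Saving every 500 steps causes the following problems.
+- One FSDP2 checkpoint is currently about 21GB.
+- At a 500-step interval, every 10,000 steps produce about 20 checkpoints, or about 420GB.
+- Over the full 88,522 steps of stage-1, a simple calculation gives more than 170 checkpoints, totaling several TB (about 3.7TB at 21GB each).
+- Saving itself disrupts the training loop, and HF uploads/scans grow as well.
+It is therefore better to keep saving every 5,000 steps, as now, and retain only the latest 2-3 checkpoints locally.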
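The storage estimate above is a simple multiplication, sketched here with the figures from these notes (about 21GB per checkpoint, 88,522 stage-1 steps). It assumes every checkpoint is kept and that the size stays constant, which is the worst case the notes argue against.

```python
CHECKPOINT_GB = 21      # approximate size of one FSDP2 checkpoint
STAGE1_STEPS = 88_522   # total stage-1 steps quoted in these notes


def checkpoint_footprint(total_steps: int, interval: int, size_gb: int = CHECKPOINT_GB) -> tuple[int, int]:
    """Return (number of checkpoints, total GB) if every checkpoint is retained."""
    count = total_steps // interval
    return count, count * size_gb


for interval in (500, 5_000):
    count, total_gb = checkpoint_footprint(STAGE1_STEPS, interval)
    print(f"interval={interval:>5}: {count:>3} checkpoints, about {total_gb / 1000:.1f} TB")
```

At a 5,000-step interval the same run produces 17 checkpoints (about 0.4 TB if all were kept), and retaining only the latest 2-3 locally keeps the footprint at roughly 42-63GB.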
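+## Deciding the Next Batch Adjustment
+Current VRAM usage is high, but training speed is stable.
+To raise the batch in the next stage, do not raise it in one large jump; the following order is better.
+1. Confirm that the run completes stably with `global_batch_size=180224`
+2. Test `196608` in the next dataset stage
+3. Confirm a VRAM plateau over at least 2,000-3,000 steps
+4. Keep the setting if it passes through a checkpoint save
+5. On OOM or unstable peaks, revert immediately to `180224` or `172032`
+Compared with the paper setup, 8 H200s are powerful, but this model's 131K vocabulary gives it a different memory profile from upstream. Sizing the batch on the assumption that "it's H200, so it must exceed the batch of 16 H100s" reduces stability.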
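+## Conclusion
+The current VRAM rise is best understood as the combined result of torch compile/cache, the 131K-vocabulary logits buffer, FSDP2/optimizer/EMA/NCCL buffers, and momentary checkpoint peaks.
+The current policy of `global_batch_size=180224`, 5,000-step checkpoints, and retaining the latest 2-3 checkpoints is a realistic balance between fast training and OOM avoidance. A small increase will be considered only in the next stage, once training shows a fully stable plateau.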