---
license: other
language:
- ko
- en
tags:
- hrm-text
- korean
- terminal
- tool-use
- code
- pretraining
pipeline_tag: text-generation
---
# KoHRM-Text-1.4B
`KoHRM-Text-1.4B` is a scratch-pretrained Korean/English/code/terminal/tool-use model based on the `sapientinc/HRM-Text` PrefixLM training stack.
This is not a continued finetune of `sapientinc/HRM-Text-1B`. It uses a new Korean/terminal-oriented 131K byte-level BPE tokenizer and a new scratch training run.
## Links
| Item | Link |
|---|---|
| HF model | https://huggingface.co/LLM-OS-Models/KoHRM-Text-1.4B |
| Project code | https://github.com/LLM-OS-Models/KoHRM-text |
| Prepared training data | https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data |
| Upstream HRM-Text code | https://github.com/sapientinc/HRM-Text |
| HRM-Text paper | https://arxiv.org/html/2605.20613 |
| Tokenizer | https://huggingface.co/LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K |
| Raw resume checkpoints | https://huggingface.co/LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints |
## Release Policy
The main model repository is intended to expose the latest model-only artifact:
- `model.safetensors`
- `config.json`
- `tokenizer.json`
- `tokenizer_config.json`
- `README.md`

The repository is not intended to keep every training checkpoint as a visible model file. Intermediate FSDP2 `.distcp` checkpoints are large resume artifacts and are kept separately in `LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints` when needed. The main repo may still have normal Hugging Face git history, but the current file tree should be treated as the latest public model export.

Current public artifact: `stage1` HRM fast-cap checkpoint at `step_25000`, converted with EMA weights to `safetensors`. Training is still in progress.
## Model Details
| Field | Value |
|---|---|
| Model id | `LLM-OS-Models/KoHRM-Text-1.4B` |
| Standard name | `KoHRM-Text-1.4B` |
| Training origin | scratch |
| Architecture family | HRM-Text PrefixLM |
| Architecture size | `XL` |
| Parameters | 1,384,120,320 |
| Context length | 4,096 tokens |
| Training dtype | bfloat16 |
| Tokenizer | byte-level BPE, NFC normalization |
| Vocabulary size | 131,072 |
| Objective | PrefixLM response-only loss |
| Optimizer | Adam-atan2 from upstream HRM-Text |
| EMA | 0.9999 |

The model config uses `model_type: hrm_text` and `architectures: ["HrmTextForCausalLM"]`. At the time of this checkpoint, `HrmTextForCausalLM` is a project-side custom architecture, not a built-in Transformers architecture.
## Tokenizer
The tokenizer was trained for Korean, English, code, shell/terminal text, and JSON/tool-call formats. It intentionally keeps common chat/tool special tokens as stable single tokens where possible.
| Sample bucket | chars/token |
|---|---:|
| Korean general text | 2.60 |
| Korean legal text | 2.36 |
| Korean terminal instruction | 2.18 |
| shell command | 2.68 |
| tool-call JSON | 3.32 |
| Python code | 3.37 |
| English | 4.40 |

Important formatting tokens include:
- `<|im_start|>`
- `<|im_end|>`
- `<|box_end|>`
- `<|object_ref_start|>` for the `direct` condition
- `<|object_ref_end|>` for the `cot` condition
- `<|quad_start|>` for the `noisy` condition
- `<|quad_end|>` for the `synth` condition
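As a minimal sketch, the condition markers listed above can be wrapped in a small prompt builder. The layout (`<|im_start|>`, condition token, instruction, `<|im_end|>`) is copied from the tokenizer example in the Usage section. `CONDITION_TOKENS` and `build_prompt` are hypothetical helper names, not part of the project code, and the upstream training stack may format prompts differently.

```python
# Hypothetical helper: maps the condition names from this card to their marker tokens.
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synth": "<|quad_end|>",
}


def build_prompt(instruction: str, condition: str = "direct") -> str:
    """Wrap an instruction as <|im_start|> + condition token + text + <|im_end|>.

    Assumption: this mirrors the single prompt shown in the Usage section.
    """
    if condition not in CONDITION_TOKENS:
        raise ValueError(
            f"unknown condition {condition!r}; expected one of {sorted(CONDITION_TOKENS)}"
        )
    return f"<|im_start|>{CONDITION_TOKENS[condition]}{instruction}<|im_end|>"


print(build_prompt("List large files in the current directory.", "direct"))
```

Keeping the mapping in one dictionary makes it harder to pair an instruction with the wrong marker when switching between conditions.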
## Usage
### Tokenizer
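```python
from transformers import AutoTokenizer

# Load the fast tokenizer published with the model repo.
tokenizer = AutoTokenizer.from_pretrained(
    "LLM-OS-Models/KoHRM-Text-1.4B",
    use_fast=True,
)

# Korean prompt: "Tell me, in Korean, a command for finding large files in the current directory."
prompt = "<|im_start|><|object_ref_start|>한국어로 현재 디렉터리의 큰 파일을 찾는 명령을 알려주세요.<|im_end|>"

# The special tokens are already written into the prompt string.
ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
print(len(ids), ids[:20])
```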
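The chars/token figures in the Tokenizer section can be approximated on your own text with a helper like the one below. This is a sketch, not the script used for that table: `chars_per_token` and `hf_encoder` are hypothetical names, characters are counted as Unicode code points via `len(str)`, and the sample texts behind the published buckets are not specified in this card.

```python
from typing import Callable, Iterable, List


def chars_per_token(texts: Iterable[str], encode: Callable[[str], List[int]]) -> float:
    """Return total characters divided by total tokens over all texts."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)            # Unicode code points, not bytes
        total_tokens += len(encode(text))
    if total_tokens == 0:
        raise ValueError("the encoder produced no tokens")
    return total_chars / total_tokens


def hf_encoder(repo_id: str = "LLM-OS-Models/KoHRM-Text-1.4B") -> Callable[[str], List[int]]:
    """Wrap the published tokenizer as an `encode` callable (needs network access)."""
    from transformers import AutoTokenizer  # imported lazily so the helper above stays dependency-free

    tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=True)
    return lambda text: tokenizer(text, add_special_tokens=False)["input_ids"]
```

For example, `chars_per_token(my_korean_samples, hf_encoder())` gives the average for a list of strings. The result depends on the sample text, so it will not necessarily match the table.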
### Model Weights
The repo currently contains a model-only `safetensors` export. Because the architecture is custom (`hrm_text`), direct `AutoModelForCausalLM.from_pretrained(...)` generation requires an HRM-Text-compatible modeling wrapper or remote-code integration. Until that wrapper is added to the model repo, use the project code and raw FSDP2 checkpoint path for internal inference/resume workflows.
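Until a modeling wrapper exists, the export can still be inspected without loading it into a model. The sketch below reads only the header of a `.safetensors` file (an 8-byte little-endian length followed by a JSON tensor index) and sums the tensor shapes. `count_safetensors_parameters` is a hypothetical helper. Whether its total equals the parameter count in the Model Details table depends on how tied or auxiliary tensors were exported, which should be checked rather than assumed.

```python
import json
import math
import struct


def count_safetensors_parameters(path: str) -> int:
    """Sum tensor element counts from a .safetensors header without reading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))

    total = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional string metadata, not a tensor
            continue
        total += math.prod(info["shape"])  # a scalar has shape [] and counts as 1
    return total


# Usage, assuming model.safetensors has been downloaded locally:
# print(count_safetensors_parameters("model.safetensors"))
```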
Raw checkpoint inference pattern:
```python
from simple_inference_engine import inference_load_checkpoint, inference_generate

# Load the raw FSDP2 checkpoint at step 25000 with EMA weights.
ckpt = inference_load_checkpoint(
    ckpt_path="/path/to/KoHRM-Text-1.4B-stage1-hrm-fastcap-gbs180",
    ckpt_epoch=25000,
    ckpt_use_ema=True,
    device="cuda",
)

# Each item pairs an index with a (condition, instruction) tuple.
# Korean instruction: "Explain, in Korean, the difference between `du` and `df`."
prompts = iter([
    (0, ("direct", "한국어로 `du`와 `df`의 차이를 설명해주세요.")),
])

for _, text in inference_generate(
    ckpt,
    prompts,
    max_tokens=4096,
    max_generation=512,
    batch_size=1,
    temp=0.0,
):
    print(text)
```
For code and training scripts, see https://github.com/LLM-OS-Models/KoHRM-text.
## Training Data
Prepared data artifacts are uploaded to the Hugging Face dataset repository, not to the model repository:

https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data

All datasets are converted into HRM-Text `V1Dataset`-style records with `instruction`, `response`, and `condition` fields where possible. The training objective is PrefixLM response-only loss, so the model is trained to predict the response span after seeing the instruction/prompt span.
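To make the record shape and the response-only objective concrete, here is a small illustrative sketch. The field names come from this card; the `Record` class, `response_only_loss_mask`, and `masked_mean_loss` are hypothetical names and do not reproduce the upstream HRM-Text implementation, which may handle token shifting, packing, and prefix attention differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """One V1Dataset-style example; field names follow this model card."""
    instruction: str
    response: str
    condition: str  # e.g. "direct", "cot", "noisy", "synth"


def response_only_loss_mask(num_prompt_tokens: int, num_response_tokens: int) -> List[int]:
    """Return 0 for prompt positions and 1 for response positions."""
    return [0] * num_prompt_tokens + [1] * num_response_tokens


def masked_mean_loss(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the per-token loss over response positions only."""
    if len(per_token_loss) != len(mask):
        raise ValueError("loss and mask must have the same length")
    kept = [loss for loss, keep in zip(per_token_loss, mask) if keep]
    if not kept:
        raise ValueError("mask selects no response tokens")
    return sum(kept) / len(kept)


# Toy example: 3 prompt tokens, 2 response tokens.
mask = response_only_loss_mask(3, 2)
loss = masked_mean_loss([5.0, 5.0, 5.0, 1.0, 3.0], mask)
print(mask, loss)  # prints: [0, 0, 0, 1, 1] 2.0
```

The prompt positions contribute nothing to the average, so a high loss on instruction tokens does not affect the training signal in this toy setup.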
Completed and prepared datasets:
| Dataset | Tokens | Disk | Use |
|---|---:|---:|---|
| `koterm_pretrain_mix_v1` | 711.3M | 2.8G | stage-0/stage0b |
| HRM cleaned base sample | 250.0M | 994M | included in stage-0 mix |
| SWE-ZERO + GLM pilot mix | 251.2M | 990M | included in stage-0 mix |
| Korean legal SFT/task data | 83.1M | 336M | included in stage-0 mix |
| ToolBench train tool-call data | 127.0M | 500M | included in stage-0 mix |
| HRM cleaned fast-cap stage-1 | 14.55B | 148G | current stage-1 |
| Korean statutes/local ordinances raw full | 308.9M | 1.2G | prepared for later stages |
| Korean administrative rules + precedents raw full | 271.7M | 1.1G | prepared for later stages |
| Korean legal/admin full task data | 629.0M | 2.5G | uploaded to prepared dataset repo |
| Korean Wikipedia raw full | 462.5M | 1.8G | prepared for later stages |
| HF extra reasoning/agent/mm subset | 112.6M | 444M | prepared, limited weight |
| Local terminal conversations | 9.39B | 36G | prepared for terminal-heavy later stages |
| BCAI Finance Korean | 857.7M | 3.3G | prepared and uploaded for later Korean finance/domain stages |
| SWE-ZERO prepared | 182.7M | 720M | pretraining and later SFT |
| GLM reasoning prepared | 68.5M | 282M | pretraining and later SFT |

Major source groups and provenance:
| Source group | Origin | Prepared dataset usage |
|---|---|---|
| HRM-Text cleaned pretraining data | https://huggingface.co/datasets/sapientinc/HRM-Text-data-io-cleaned-20260515 | `hrm_cleaned_base_sample_v1`, `koterm_hrm_cleaned_fastcap_stage1_v1`; full no-cap retokenization is still running |
| Korean Wikipedia | https://dumps.wikimedia.org/kowiki/20260501/ | `kowiki_raw_full_v1` |
| Korean statutes | https://github.com/legalize-kr/legalize-kr | `korean_legal_raw_full_v1`, `sft_korean_legal_v1`, `korean_legal_tasks_full_v1` |
| Korean local ordinances | https://github.com/legalize-kr/ordinance-kr | `korean_legal_raw_full_v1`, `sft_korean_legal_v1`, `korean_legal_tasks_full_v1` |
| Korean administrative rules | local Markdown snapshot at `admrule-kr/` | `korean_admrule_precedent_raw_full_v1`, `korean_legal_tasks_full_v1` |
| Korean precedents | local Markdown snapshot at `precedent-kr/` | `korean_admrule_precedent_raw_full_v1`, `korean_legal_tasks_full_v1` |
| ToolBench train data | local extraction under `data_toolbench/data/`; eval split excluded | `sft_toolbench_v1` |
| SWE-ZERO trajectories | https://huggingface.co/datasets/AlienKevin/SWE-ZERO-12M-trajectories | `sft_swe_zero_v1`, `sft_swe_glm_mix_v1` |
| GLM reasoning | https://huggingface.co/datasets/Jackrong/GLM-5.1-Reasoning-1M-Cleaned | `sft_glm_reasoning_v1`, `sft_swe_glm_mix_v1` |
| Claude reasoning sample | https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k | small reviewed reasoning subset inside `hf_extra_reasoning_agent_mm_v1` |
| Open-MM-RL text subset | https://huggingface.co/datasets/TuringEnterprises/Open-MM-RL | text-only reviewed subset inside `hf_extra_reasoning_agent_mm_v1` |
| DeepSeek agent traces | https://huggingface.co/datasets/TeichAI/DeepSeek-v4-Pro-Agent | limited agent/tool-use subset; license-sensitive |
| structured Wikipedia | https://huggingface.co/datasets/wikimedia/structured-wikipedia | tokenizer/general text support |
| Local terminal/code/math conversations | local `swe`, `code`, and `math` parquet conversations | `local_terminal_conversations_ctx9k_resp6k_v1` |
| BCAI Finance Kor | https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-1862K | `sft_bcai_finance_kor_v1` |

The full Korean legal/admin task upload and the BCAI Finance Korean upload are present in the dataset repository at:
- `korean_legal_tasks_full_v1/`
- `raw_jsonl/korean_legal_tasks_full_20260524.jsonl`
- `LEGAL_FULL_TASKS_README.md`
- `sft_bcai_finance_kor_v1/`
- `raw_jsonl/bcai_finance_kor_hrm_20260524.jsonl`
- `FINANCE_BCAI_README.md`

Evaluation-like data is excluded from training where identified, including ToolBench eval, Terminal Bench 2-style data, and benchmark-oriented `chi-bench` data.

Licenses and terms remain those of the original sources. The prepared dataset upload does not relicense upstream content.
## Training Run
The current public checkpoint was produced through staged pretraining:
1. Train `stage-0` on `koterm_pretrain_mix_v1` with 711.3M tokens.
2. Continue once more on the same available mix as `stage0b`.
3. Continue to `stage-1` on HRM cleaned fast-cap data with 14.55B tokens.
4. Convert `stage1 step_25000` EMA weights to `safetensors` and upload to the main model repo.

Current long-running stage-1 settings:
| Field | Value |
|---|---|
| Hardware | 8 x NVIDIA H200 |
| Data | `koterm_hrm_cleaned_fastcap_stage1_v1` |
| Tokens in current stage dataset | 14.55B |
| Global batch | 180,224 tokens |
| Local token slots/GPU | 22,528 |
| Context | 4,096 |
| LR | 2.2e-4 |
| LR warmup | 2,000 steps |
| Checkpoint interval | 5,000 steps |
| Current public export | `step_25000`, EMA, safetensors |

The run uses staged continuation. The checkpoint carries model, optimizer, EMA, and recurrent carry state forward. `resume_step_offset` and `total_steps_override` are used so the learning-rate schedule follows the intended longer pretraining run rather than resetting at every data stage.
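The effect of a step offset on the schedule can be illustrated with a short sketch. Only the peak LR (2.2e-4) and the 2,000-step warmup come from the settings table. The cosine decay, the 10% floor, and the reading of `resume_step_offset` as "steps already completed in earlier stages" are assumptions made for illustration; `lr_at` is a hypothetical function whose `total_steps` argument stands in for `total_steps_override`, and the upstream schedule may differ.

```python
import math


def lr_at(
    local_step: int,
    *,
    resume_step_offset: int,
    total_steps: int,
    peak_lr: float = 2.2e-4,
    warmup_steps: int = 2000,
    min_lr_ratio: float = 0.1,  # assumed floor, not from the model card
) -> float:
    """Learning rate at a stage-local step, scheduled on the global step."""
    step = local_step + resume_step_offset  # position in the full planned run
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup from zero
    # Assumed cosine decay from peak_lr down to min_lr_ratio * peak_lr.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


# A new stage that starts at local step 0 does not re-enter warmup
# when the offset already exceeds the warmup length.
print(lr_at(0, resume_step_offset=5000, total_steps=200_000))
```

With an offset of zero the same call would return 0.0, which is the schedule reset that carrying the offset forward avoids.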
The full HRM 328G cleaned corpus is being retokenized with the new 131K tokenizer. That full no-cap retokenization is intended to support a larger 40B+ token training continuation, instead of stopping at the 14.55B fast-cap stage.
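The batch figures in the settings table can be checked with simple arithmetic. The sketch below assumes every optimizer step consumes one full global batch of token slots and that `step_25000` counts stage-1 steps only; packing or padding may make the number of real tokens smaller. The constant and function names are hypothetical.

```python
GPUS = 8
LOCAL_TOKEN_SLOTS_PER_GPU = 22_528
GLOBAL_BATCH_TOKENS = GPUS * LOCAL_TOKEN_SLOTS_PER_GPU  # 180,224, as in the table


def token_slots_processed(steps: int) -> int:
    """Token slots consumed after a given number of optimizer steps."""
    return steps * GLOBAL_BATCH_TOKENS


def steps_for_tokens(tokens: int) -> int:
    """Optimizer steps needed to cover a token budget, rounded up."""
    return -(-tokens // GLOBAL_BATCH_TOKENS)  # integer ceiling division


print(token_slots_processed(25_000))        # slots covered at step_25000
print(steps_for_tokens(14_550_000_000))     # steps for one pass over the stage-1 data
print(steps_for_tokens(40_000_000_000))     # steps for a 40B-token continuation
```

Under these assumptions, `step_25000` corresponds to 25,000 × 180,224 = 4,505,600,000 token slots, roughly 31% of the 14.55B-token stage-1 dataset.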
## Intended Use
This checkpoint is intended for:
- continued pretraining experiments
- Korean tokenizer and HRM-Text architecture experiments
- terminal/tool-call/code pretraining research
- checkpoint conversion and evaluation work

It is not yet intended as a finished assistant model.
## Limitations
- This is an intermediate checkpoint, not a final aligned instruct model.
- It has not completed the full planned 40B+ token continuation.
- It has not completed final SFT or safety tuning.
- Public benchmark scores for this new checkpoint are not final.
- Direct Transformers generation requires adding the custom `hrm_text` modeling wrapper or remote-code files.
- Tool-call JSON validity and terminal action safety must be evaluated before production use.
## Citation
This work builds on the HRM-Text architecture and training stack:
- Paper: https://arxiv.org/html/2605.20613
- Upstream code: https://github.com/sapientinc/HRM-Text