---
license: other
language:
- ko
- en
tags:
- hrm-text
- korean
- terminal
- tool-use
- code
- pretraining
pipeline_tag: text-generation
---

# KoHRM-Text-1.4B

`KoHRM-Text-1.4B` is a scratch-pretrained Korean/English/code/terminal/tool-use model based on the `sapientinc/HRM-Text` PrefixLM training stack.

This is not a continued finetune of `sapientinc/HRM-Text-1B`. It uses a new Korean/terminal-oriented 131K byte-level BPE tokenizer and a new scratch training run.

## Links

| Item | Link |
|---|---|
| HF model | https://huggingface.co/LLM-OS-Models/KoHRM-Text-1.4B |
| Project code | https://github.com/LLM-OS-Models/KoHRM-text |
| Prepared training data | https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data |
| Upstream HRM-Text code | https://github.com/sapientinc/HRM-Text |
| HRM-Text paper | https://arxiv.org/html/2605.20613 |
| Tokenizer | https://huggingface.co/LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K |
| Raw resume checkpoints | https://huggingface.co/LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints |

## Release Policy

The main model repository is intended to expose the latest model-only artifact:

- `model.safetensors`
- `config.json`
- `tokenizer.json`
- `tokenizer_config.json`
- `README.md`

The main repository is not meant to hold every training checkpoint as visible model files. Intermediate FSDP2 `.distcp` checkpoints are large resume artifacts and are kept separately in `LLM-OS-Models/KoHRM-Text-1.4B-raw-checkpoints` when needed. The main repo may still carry normal Hugging Face git history, but the current file tree should be treated as the latest public model export.

Current public artifact: `stage1` HRM fast-cap checkpoint at `step_25000`, converted with EMA weights to `safetensors`. Training is still in progress.

## Model Details

| Field | Value |
|---|---|
| Model id | `LLM-OS-Models/KoHRM-Text-1.4B` |
| Standard name | `KoHRM-Text-1.4B` |
| Training origin | scratch |
| Architecture family | HRM-Text PrefixLM |
| Architecture size | `XL` |
| Parameters | 1,384,120,320 |
| Context length | 4,096 tokens |
| Training dtype | bfloat16 |
| Tokenizer | byte-level BPE, NFC normalization |
| Vocabulary size | 131,072 |
| Objective | PrefixLM response-only loss |
| Optimizer | Adam-atan2 from upstream HRM-Text |
| EMA | 0.9999 |

The model config uses `model_type: hrm_text` and `architectures: ["HrmTextForCausalLM"]`. At the time of this checkpoint, `HrmTextForCausalLM` is a project-side custom architecture, not a built-in Transformers architecture.
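
The EMA row above refers to an exponential moving average of the weights, which is what the public export is converted from. The snippet below is a minimal sketch of the standard update rule with the listed decay of 0.9999; `ema_update` is a hypothetical helper, and whether the upstream trainer adds bias correction or a decay warmup is not stated here.

```python
def ema_update(ema_weights, weights, decay=0.9999):
    """Standard EMA step: new_ema = decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy example with two scalar "parameters" held constant at 1.0.
ema = [0.0, 1.0]
current = [1.0, 1.0]
for _ in range(10_000):
    ema = ema_update(ema, current)

# After 10,000 steps the first entry has covered roughly 63% of the gap,
# because 0.9999 ** 10_000 is close to exp(-1).
print(ema)
```

With a decay of 0.9999 the average has an effective horizon of about 1 / (1 - 0.9999) = 10,000 steps, so the EMA weights lag the raw weights by roughly that many updates.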

## Tokenizer

The tokenizer was trained for Korean, English, code, shell/terminal text, and JSON/tool-call formats. It intentionally keeps common chat/tool special tokens as stable single tokens where possible.

| Sample bucket | chars/token |
|---|---:|
| Korean general text | 2.60 |
| Korean legal text | 2.36 |
| Korean terminal instruction | 2.18 |
| shell command | 2.68 |
| tool-call JSON | 3.32 |
| Python code | 3.37 |
| English | 4.40 |
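
A chars/token figure like the ones above can be estimated with a few lines of Python. This is a sketch under assumptions: the card does not say which sample texts were used or how characters were counted, so the helper below (a hypothetical `chars_per_token`) counts Python code points and uses a byte-level stand-in encoder so that it runs without downloading anything.

```python
def chars_per_token(texts, encode):
    """Average number of characters (code points) per token over a list of texts."""
    total_chars = sum(len(text) for text in texts)
    total_tokens = sum(len(encode(text)) for text in texts)
    return total_chars / total_tokens


def byte_encode(text):
    """Stand-in encoder for illustration only: one 'token' per UTF-8 byte."""
    return list(text.encode("utf-8"))


print(chars_per_token(["ls -la /var/log"], byte_encode))  # ASCII: 1 char per byte
print(chars_per_token(["한국어"], byte_encode))            # Hangul: 3 bytes per char
```

To measure the real tokenizer, pass an `encode` callable that returns `tokenizer(text, add_special_tokens=False)["input_ids"]`. The gap between the stand-in's Korean ratio and the table's 2.60 illustrates how much the learned merges compress Hangul relative to raw bytes.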

Important formatting tokens include:

- `<|im_start|>`
- `<|im_end|>`
- `<|box_end|>`
- `<|object_ref_start|>` for the `direct` condition
- `<|object_ref_end|>` for the `cot` condition
- `<|quad_start|>` for the `noisy` condition
- `<|quad_end|>` for the `synth` condition
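
The condition tokens listed above can be combined into a prompt string as sketched below. The layout (`<|im_start|>`, then the condition token, then the instruction, then `<|im_end|>`) follows the tokenizer usage example in this card; `CONDITION_TOKENS` and `build_prompt` are hypothetical names, and the role of `<|box_end|>` is not covered because it is not described here.

```python
# Condition name -> marker token, as listed in the Tokenizer section.
CONDITION_TOKENS = {
    "direct": "<|object_ref_start|>",
    "cot": "<|object_ref_end|>",
    "noisy": "<|quad_start|>",
    "synth": "<|quad_end|>",
}


def build_prompt(instruction, condition="direct"):
    """Wrap an instruction in the chat markers with the chosen condition token."""
    if condition not in CONDITION_TOKENS:
        raise ValueError(f"unknown condition: {condition!r}")
    marker = CONDITION_TOKENS[condition]
    return f"<|im_start|>{marker}{instruction}<|im_end|>"


print(build_prompt("List the five largest files in the current directory."))
print(build_prompt("Explain the difference between du and df.", condition="cot"))
```

Keeping the mapping in one dictionary makes it harder to pick the wrong marker, since the token names themselves (`object_ref`, `quad`) say nothing about the condition they stand for.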

## Usage

### Tokenizer

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "LLM-OS-Models/KoHRM-Text-1.4B",
    use_fast=True,
)

# Korean prompt: "Please tell me, in Korean, the command for finding large files in the current directory."
prompt = "<|im_start|><|object_ref_start|>한국어로 현재 디렉터리의 큰 파일을 찾는 명령을 알려주세요.<|im_end|>"
ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
print(len(ids), ids[:20])
```

### Model Weights

The repo currently contains a model-only `safetensors` export. Because the architecture is custom (`hrm_text`), direct `AutoModelForCausalLM.from_pretrained(...)` generation requires an HRM-Text-compatible modeling wrapper or remote-code integration. Until that wrapper is added to the model repo, use the project code and raw FSDP2 checkpoint path for internal inference/resume workflows.
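
Even without a modeling wrapper, the `safetensors` export can be inspected with the standard library, because the file format starts with an 8-byte little-endian header length followed by a JSON header that lists each tensor's dtype and shape. The sketch below reads that header and sums the element counts; the helper names are hypothetical, and the demo runs on a tiny hand-written file rather than on the real `model.safetensors`.

```python
import json
import os
import struct
import tempfile
from math import prod


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # unsigned 64-bit, little-endian
        return json.loads(f.read(header_len).decode("utf-8"))


def count_parameters(header):
    """Sum element counts over all tensors, skipping the optional metadata entry."""
    return sum(
        prod(entry["shape"])
        for name, entry in header.items()
        if name != "__metadata__"
    )


# Demo: write a minimal file holding one 2x3 float32 tensor, then read it back.
demo = {
    "__metadata__": {"format": "pt"},
    "w": {"dtype": "F32", "shape": [2, 3], "data_offsets": [0, 24]},
}
blob = json.dumps(demo).encode("utf-8")
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tiny.safetensors")
    with open(demo_path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)) + blob + bytes(24))
    demo_header = read_safetensors_header(demo_path)

print(count_parameters(demo_header))  # 6 elements in the 2x3 demo tensor
```

Pointing `read_safetensors_header` at a downloaded `model.safetensors` gives a quick check against the parameter count in Model Details. The totals may differ if the export ties weights or stores extra buffers.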

Raw checkpoint inference pattern:

```python
from simple_inference_engine import inference_load_checkpoint, inference_generate

ckpt = inference_load_checkpoint(
    ckpt_path="/path/to/KoHRM-Text-1.4B-stage1-hrm-fastcap-gbs180",
    ckpt_epoch=25000,
    ckpt_use_ema=True,
    device="cuda",
)

prompts = iter([
    # Korean prompt: "Please explain the difference between `du` and `df` in Korean."
    (0, ("direct", "한국어로 `du`와 `df`의 차이를 설명해주세요.")),
])

for _, text in inference_generate(
    ckpt,
    prompts,
    max_tokens=4096,
    max_generation=512,
    batch_size=1,
    temp=0.0,
):
    print(text)
```

For code and training scripts, see https://github.com/LLM-OS-Models/KoHRM-text.

## Training Data

Prepared data artifacts are uploaded to the Hugging Face dataset repository, not to the model repository:

https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data

All datasets are converted into HRM-Text V1Dataset style records with `instruction`, `response`, and `condition` fields where possible. The training objective is PrefixLM response-only loss, so the model is trained to predict the response span after seeing the instruction/prompt span.
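
The response-only objective described above can be pictured as a per-token loss mask that is 0 over the instruction span and 1 over the response span. The following is a minimal sketch of that idea, not the upstream HRM-Text implementation; both helper names are hypothetical, and the attention pattern over the prompt span is not modeled.

```python
def build_prefixlm_example(prompt_ids, response_ids):
    """Concatenate prompt and response; mark only response tokens as loss targets."""
    input_ids = list(prompt_ids) + list(response_ids)
    loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
    return input_ids, loss_mask


def masked_mean_loss(per_token_losses, loss_mask):
    """Average the per-token losses over positions where the mask is 1."""
    kept = [loss for loss, keep in zip(per_token_losses, loss_mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


input_ids, loss_mask = build_prefixlm_example([11, 12, 13], [21, 22])
print(input_ids)   # prompt tokens followed by response tokens
print(loss_mask)   # zeros over the prompt, ones over the response

# Large losses on prompt positions do not affect the result.
print(masked_mean_loss([9.0, 9.0, 9.0, 1.0, 3.0], loss_mask))
```

Because prompt positions are masked out, long instructions cost compute but contribute no gradient signal of their own, which is why the token counts in the table below are not the same as the number of supervised tokens.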

Completed and prepared datasets:

| Dataset | Tokens | Disk | Use |
|---|---:|---:|---|
| `koterm_pretrain_mix_v1` | 711.3M | 2.8G | stage-0/stage0b |
| HRM cleaned base sample | 250.0M | 994M | included in stage-0 mix |
| SWE-ZERO + GLM pilot mix | 251.2M | 990M | included in stage-0 mix |
| Korean legal SFT/task data | 83.1M | 336M | included in stage-0 mix |
| ToolBench train tool-call data | 127.0M | 500M | included in stage-0 mix |
| HRM cleaned fast-cap stage-1 | 14.55B | 148G | current stage-1 |
| Korean statutes/local ordinances raw full | 308.9M | 1.2G | prepared for later stages |
| Korean administrative rules + precedents raw full | 271.7M | 1.1G | prepared for later stages |
| Korean legal/admin full task data | 629.0M | 2.5G | uploaded to prepared dataset repo |
| Korean Wikipedia raw full | 462.5M | 1.8G | prepared for later stages |
| HF extra reasoning/agent/mm subset | 112.6M | 444M | prepared, limited weight |
| Local terminal conversations | 9.39B | 36G | prepared for terminal-heavy later stages |
| BCAI Finance Korean | 857.7M | 3.3G | prepared and uploaded for later Korean finance/domain stages |
| SWE-ZERO prepared | 182.7M | 720M | pretraining and later SFT |
| GLM reasoning prepared | 68.5M | 282M | pretraining and later SFT |

Major source groups and provenance:

| Source group | Origin | Prepared dataset usage |
|---|---|---|
| HRM-Text cleaned pretraining data | https://huggingface.co/datasets/sapientinc/HRM-Text-data-io-cleaned-20260515 | `hrm_cleaned_base_sample_v1`, `koterm_hrm_cleaned_fastcap_stage1_v1`; full no-cap retokenization is still running |
| Korean Wikipedia | https://dumps.wikimedia.org/kowiki/20260501/ | `kowiki_raw_full_v1` |
| Korean statutes | https://github.com/legalize-kr/legalize-kr | `korean_legal_raw_full_v1`, `sft_korean_legal_v1`, `korean_legal_tasks_full_v1` |
| Korean local ordinances | https://github.com/legalize-kr/ordinance-kr | `korean_legal_raw_full_v1`, `sft_korean_legal_v1`, `korean_legal_tasks_full_v1` |
| Korean administrative rules | local Markdown snapshot at `admrule-kr/` | `korean_admrule_precedent_raw_full_v1`, `korean_legal_tasks_full_v1` |
| Korean precedents | local Markdown snapshot at `precedent-kr/` | `korean_admrule_precedent_raw_full_v1`, `korean_legal_tasks_full_v1` |
| ToolBench train data | local extraction under `data_toolbench/data/`; eval split excluded | `sft_toolbench_v1` |
| SWE-ZERO trajectories | https://huggingface.co/datasets/AlienKevin/SWE-ZERO-12M-trajectories | `sft_swe_zero_v1`, `sft_swe_glm_mix_v1` |
| GLM reasoning | https://huggingface.co/datasets/Jackrong/GLM-5.1-Reasoning-1M-Cleaned | `sft_glm_reasoning_v1`, `sft_swe_glm_mix_v1` |
| Claude reasoning sample | https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k | small reviewed reasoning subset inside `hf_extra_reasoning_agent_mm_v1` |
| Open-MM-RL text subset | https://huggingface.co/datasets/TuringEnterprises/Open-MM-RL | text-only reviewed subset inside `hf_extra_reasoning_agent_mm_v1` |
| DeepSeek agent traces | https://huggingface.co/datasets/TeichAI/DeepSeek-v4-Pro-Agent | limited agent/tool-use subset; license-sensitive |
| structured Wikipedia | https://huggingface.co/datasets/wikimedia/structured-wikipedia | tokenizer/general text support |
| Local terminal/code/math conversations | local `swe`, `code`, and `math` parquet conversations | `local_terminal_conversations_ctx9k_resp6k_v1` |
| BCAI Finance Kor | https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-1862K | `sft_bcai_finance_kor_v1` |

The full Korean legal/admin task upload and the BCAI Finance Korean upload are present in the dataset repository at:

- `korean_legal_tasks_full_v1/`
- `raw_jsonl/korean_legal_tasks_full_20260524.jsonl`
- `LEGAL_FULL_TASKS_README.md`
- `sft_bcai_finance_kor_v1/`
- `raw_jsonl/bcai_finance_kor_hrm_20260524.jsonl`
- `FINANCE_BCAI_README.md`

Evaluation-like data is excluded from training where identified, including ToolBench eval, Terminal Bench 2 style data, and benchmark-oriented `chi-bench` data.

Licenses and terms remain those of the original sources. The prepared dataset upload does not relicense upstream content.

## Training Run

The current public checkpoint was produced through staged pretraining:

1. Train `stage-0` on `koterm_pretrain_mix_v1` with 711.3M tokens.
2. Continue training on the same mix as `stage0b`.
3. Continue to `stage-1` on HRM cleaned fast-cap data with 14.55B tokens.
4. Convert `stage1 step_25000` EMA weights to `safetensors` and upload to the main model repo.

Current long-running stage-1 settings:

| Field | Value |
|---|---|
| Hardware | 8 x NVIDIA H200 |
| Data | `koterm_hrm_cleaned_fastcap_stage1_v1` |
| Tokens in current stage dataset | 14.55B |
| Global batch | 180,224 tokens |
| Local token slots/GPU | 22,528 |
| Context | 4,096 |
| LR | 2.2e-4 |
| LR warmup | 2,000 steps |
| Checkpoint interval | 5,000 steps |
| Current public export | `step_25000`, EMA, safetensors |
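
The batch figures in the table are consistent with each other, and they give a rough sense of how far the `step_25000` export is through the stage-1 data. The arithmetic below is a sketch under two assumptions: every token slot is filled at every step (padding would lower the totals), and `step_25000` counts optimizer steps taken at this batch size.

```python
gpus = 8
slots_per_gpu = 22_528
context = 4_096
stage1_tokens = 14.55e9
export_step = 25_000

global_batch = gpus * slots_per_gpu             # tokens per optimizer step
full_contexts_per_step = global_batch / context  # equivalent full-length sequences
tokens_at_export = export_step * global_batch    # upper bound on tokens seen
steps_per_pass = stage1_tokens / global_batch    # steps for one pass over stage 1
fraction_of_pass = tokens_at_export / stage1_tokens

print(f"global batch:          {global_batch:,} tokens")
print(f"full contexts/step:    {full_contexts_per_step:.0f}")
print(f"tokens at step 25,000: {tokens_at_export:,}")
print(f"steps per pass:        {steps_per_pass:,.0f}")
print(f"fraction of one pass:  {fraction_of_pass:.1%}")
```

Under these assumptions the export has seen at most about 4.5B tokens, or a little under a third of one pass over the 14.55B-token stage-1 dataset.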

The run uses staged continuation. The checkpoint carries model, optimizer, EMA, and recurrent carry state forward. `resume_step_offset` and `total_steps_override` are used so the learning-rate schedule follows the intended longer pretraining run rather than resetting at every data stage.
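
The effect of carrying a step offset across stages can be sketched as follows. This is an illustration only: the card does not state the schedule shape or how the upstream code consumes `resume_step_offset` and `total_steps_override`, so the example assumes linear warmup followed by cosine decay to zero, and the default total of 100,000 steps is a made-up placeholder.

```python
import math


def lr_at(local_step, *, peak_lr=2.2e-4, warmup_steps=2_000,
          resume_step_offset=0, total_steps_override=100_000):
    """Learning rate at a stage-local step, positioned on one global schedule.

    Assumed shape (not confirmed by the source): linear warmup, then cosine decay.
    """
    step = local_step + resume_step_offset  # position in the overall run
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    span = max(1, total_steps_override - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# A new stage without an offset would start again from the warmup ramp.
print(lr_at(0))
# With an offset of 30,000 steps, the same stage-local step 0 continues the decay.
print(lr_at(0, resume_step_offset=30_000))
```

The design point is that the schedule is a function of the global position in the run, so switching datasets between stages changes the data but not where the learning rate sits on its curve.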

The full 328G HRM cleaned corpus is being retokenized with the new 131K tokenizer. This no-cap retokenization is intended to support a longer training continuation of 40B+ tokens, rather than stopping at the 14.55B-token fast-cap stage.

## Intended Use

This checkpoint is intended for:

- continued pretraining experiments
- Korean tokenizer and HRM-Text architecture experiments
- terminal/tool-call/code pretraining research
- checkpoint conversion and evaluation work

It is not yet intended as a finished assistant model.

## Limitations

- This is an intermediate checkpoint, not a final aligned instruct model.
- It has not completed the full planned 40B+ token continuation.
- It has not completed final SFT or safety tuning.
- Public benchmark scores for this new checkpoint are not final.
- Direct Transformers generation requires adding the custom `hrm_text` modeling wrapper or remote-code files.
- Tool-call JSON validity and terminal action safety must be evaluated before production use.

## Citation

This work builds on the HRM-Text architecture and training stack:

- Paper: https://arxiv.org/html/2605.20613
- Upstream code: https://github.com/sapientinc/HRM-Text