Clarify KoHRM data sources
README.md
CHANGED
@@ -25,6 +25,7 @@ This is not a continued finetune of `sapientinc/HRM-Text-1B`. It uses a new Kore
|---|---|
| HF model | https://huggingface.co/LLM-OS-Models/KoHRM-Text-1.4B |
| Project code | https://github.com/LLM-OS-Models/KoHRM-text |
| Upstream HRM-Text code | https://github.com/sapientinc/HRM-Text |
| HRM-Text paper | https://arxiv.org/html/2605.20613 |
| Tokenizer | https://huggingface.co/LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K |
@@ -140,6 +141,10 @@ For code and training scripts, see https://github.com/LLM-OS-Models/KoHRM-text.
## Training Data

All datasets are converted into HRM-Text V1Dataset style records with `instruction`, `response`, and `condition` fields where possible. The training objective is PrefixLM response-only loss, so the model is trained to predict the response span after seeing the instruction/prompt span.

Completed and prepared datasets:
@@ -154,25 +159,42 @@ Completed and prepared datasets:
| HRM cleaned fast-cap stage-1 | 14.55B | 148G | current stage-1 |
| Korean statutes/local ordinances raw full | 308.9M | 1.2G | prepared for later stages |
| Korean administrative rules + precedents raw full | 271.7M | 1.1G | prepared for later stages |
| Korean Wikipedia raw full | 462.5M | 1.8G | prepared for later stages |
| HF extra reasoning/agent/mm subset | 112.6M | 444M | prepared, limited weight |
| Local terminal conversations | 9.39B | 36G | prepared for terminal-heavy later stages |
| SWE-ZERO prepared | 182.7M | 720M | pretraining and later SFT |
| GLM reasoning prepared | 68.5M | 282M | pretraining and later SFT |
-Major source groups:
-
-
--
--
-
-
-
--
-

Evaluation-like data is excluded from training where identified, including ToolBench eval, Terminal Bench 2 style data, and benchmark-oriented `chi-bench` data.

## Training Run

The current public checkpoint was produced through staged pretraining:
@@ -227,4 +249,3 @@ This work builds on the HRM-Text architecture and training stack:
- Paper: https://arxiv.org/html/2605.20613
- Upstream code: https://github.com/sapientinc/HRM-Text
-

|---|---|
| HF model | https://huggingface.co/LLM-OS-Models/KoHRM-Text-1.4B |
| Project code | https://github.com/LLM-OS-Models/KoHRM-text |
+| Prepared training data | https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data |
| Upstream HRM-Text code | https://github.com/sapientinc/HRM-Text |
| HRM-Text paper | https://arxiv.org/html/2605.20613 |
| Tokenizer | https://huggingface.co/LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K |

## Training Data

+Prepared data artifacts are uploaded to the Hugging Face dataset repository, not to the model repository:
+
+https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data
+
All datasets are converted into HRM-Text V1Dataset style records with `instruction`, `response`, and `condition` fields where possible. The training objective is PrefixLM response-only loss, so the model is trained to predict the response span after seeing the instruction/prompt span.

Completed and prepared datasets:

| HRM cleaned fast-cap stage-1 | 14.55B | 148G | current stage-1 |
| Korean statutes/local ordinances raw full | 308.9M | 1.2G | prepared for later stages |
| Korean administrative rules + precedents raw full | 271.7M | 1.1G | prepared for later stages |
+| Korean legal/admin full task data | 629.0M | 2.5G | uploaded to prepared dataset repo |
| Korean Wikipedia raw full | 462.5M | 1.8G | prepared for later stages |
| HF extra reasoning/agent/mm subset | 112.6M | 444M | prepared, limited weight |
| Local terminal conversations | 9.39B | 36G | prepared for terminal-heavy later stages |
| SWE-ZERO prepared | 182.7M | 720M | pretraining and later SFT |
| GLM reasoning prepared | 68.5M | 282M | pretraining and later SFT |

+Major source groups and provenance:
+
+| Source group | Origin | Prepared dataset usage |
+|---|---|---|
+| HRM-Text cleaned pretraining data | https://huggingface.co/datasets/sapientinc/HRM-Text-data-io-cleaned-20260515 | `hrm_cleaned_base_sample_v1`, `koterm_hrm_cleaned_fastcap_stage1_v1`; full no-cap retokenization is still running |
+| Korean Wikipedia | https://dumps.wikimedia.org/kowiki/20260501/ | `kowiki_raw_full_v1` |
+| Korean statutes | https://github.com/legalize-kr/legalize-kr | `korean_legal_raw_full_v1`, `sft_korean_legal_v1`, `korean_legal_tasks_full_v1` |
+| Korean local ordinances | https://github.com/legalize-kr/ordinance-kr | `korean_legal_raw_full_v1`, `sft_korean_legal_v1`, `korean_legal_tasks_full_v1` |
+| Korean administrative rules | local Markdown snapshot at `admrule-kr/` | `korean_admrule_precedent_raw_full_v1`, `korean_legal_tasks_full_v1` |
+| Korean precedents | local Markdown snapshot at `precedent-kr/` | `korean_admrule_precedent_raw_full_v1`, `korean_legal_tasks_full_v1` |
+| ToolBench train data | local extraction under `data_toolbench/data/`; eval split excluded | `sft_toolbench_v1` |
+| SWE-ZERO trajectories | https://huggingface.co/datasets/AlienKevin/SWE-ZERO-12M-trajectories | `sft_swe_zero_v1`, `sft_swe_glm_mix_v1` |
+| GLM reasoning | https://huggingface.co/datasets/Jackrong/GLM-5.1-Reasoning-1M-Cleaned | `sft_glm_reasoning_v1`, `sft_swe_glm_mix_v1` |
+| Claude reasoning sample | https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k | small reviewed reasoning subset inside `hf_extra_reasoning_agent_mm_v1` |
+| Open-MM-RL text subset | https://huggingface.co/datasets/TuringEnterprises/Open-MM-RL | text-only reviewed subset inside `hf_extra_reasoning_agent_mm_v1` |
+| DeepSeek agent traces | https://huggingface.co/datasets/TeichAI/DeepSeek-v4-Pro-Agent | limited agent/tool-use subset; license-sensitive |
+| structured Wikipedia | https://huggingface.co/datasets/wikimedia/structured-wikipedia | tokenizer/general text support |
+| Local terminal/code/math conversations | local `swe`, `code`, and `math` parquet conversations | `local_terminal_conversations_ctx9k_resp6k_v1` |
+
+The full Korean legal/admin task upload is present in the dataset repository at:
+
+- `korean_legal_tasks_full_v1/`
+- `raw_jsonl/korean_legal_tasks_full_20260524.jsonl`
+- `LEGAL_FULL_TASKS_README.md`

Evaluation-like data is excluded from training where identified, including ToolBench eval, Terminal Bench 2 style data, and benchmark-oriented `chi-bench` data.

+Licenses and terms remain those of the original sources. The prepared dataset upload does not relicense upstream content.
+
## Training Run

The current public checkpoint was produced through staged pretraining:

- Paper: https://arxiv.org/html/2605.20613
- Upstream code: https://github.com/sapientinc/HRM-Text
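
The README text in this diff describes records with `instruction`, `response`, and `condition` fields trained with a response-only loss. The sketch below illustrates one way such a loss mask could be built; it is not taken from the KoHRM or HRM-Text code. The `toy_tokenize` helper, the `build_example` function, the sample record, and the choice to place `condition` before `instruction` are all hypothetical. The label value `-100` is an assumption borrowed from the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The sketch covers only the loss mask, not the PrefixLM attention pattern, and leaves any next-token shift to the training loop.

```python
from typing import Dict, List, Tuple

# Assumed label value for positions excluded from the loss.
IGNORE_INDEX = -100


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one integer id per whitespace-separated word."""
    return [sum(ord(c) for c in word) % 1000 for word in text.split()]


def build_example(record: Dict[str, str]) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) where only response positions carry labels.

    The prefix (condition followed by instruction, an assumed ordering) is
    visible to the model but masked out of the loss with IGNORE_INDEX.
    """
    prefix_ids = toy_tokenize(record.get("condition", "")) + toy_tokenize(record["instruction"])
    response_ids = toy_tokenize(record["response"])
    input_ids = prefix_ids + response_ids
    labels = [IGNORE_INDEX] * len(prefix_ids) + response_ids
    return input_ids, labels


# Hypothetical record using the three field names mentioned in the README.
record = {
    "instruction": "List files in the current directory",
    "response": "ls -la",
    "condition": "terminal",
}
input_ids, labels = build_example(record)
```

Here the first seven positions (one condition token plus six instruction tokens) are masked, and only the two response tokens contribute to the loss.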