gyung committed
Commit f131b4d · verified · 1 Parent(s): 77ff990

Clarify KoHRM data sources

Files changed (1)
  1. README.md +32 -11
README.md CHANGED
@@ -25,6 +25,7 @@ This is not a continued finetune of `sapientinc/HRM-Text-1B`. It uses a new Kore
25
  |---|---|
26
  | HF model | https://huggingface.co/LLM-OS-Models/KoHRM-Text-1.4B |
27
  | Project code | https://github.com/LLM-OS-Models/KoHRM-text |
 
28
  | Upstream HRM-Text code | https://github.com/sapientinc/HRM-Text |
29
  | HRM-Text paper | https://arxiv.org/html/2605.20613 |
30
  | Tokenizer | https://huggingface.co/LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K |
@@ -140,6 +141,10 @@ For code and training scripts, see https://github.com/LLM-OS-Models/KoHRM-text.
140
 
141
  ## Training Data
142
 
143
  All datasets are converted into HRM-Text V1Dataset style records with `instruction`, `response`, and `condition` fields where possible. The training objective is PrefixLM response-only loss, so the model is trained to predict the response span after seeing the instruction/prompt span.
144
 
145
  Completed and prepared datasets:
@@ -154,25 +159,42 @@ Completed and prepared datasets:
154
  | HRM cleaned fast-cap stage-1 | 14.55B | 148G | current stage-1 |
155
  | Korean statutes/local ordinances raw full | 308.9M | 1.2G | prepared for later stages |
156
  | Korean administrative rules + precedents raw full | 271.7M | 1.1G | prepared for later stages |
 
157
  | Korean Wikipedia raw full | 462.5M | 1.8G | prepared for later stages |
158
  | HF extra reasoning/agent/mm subset | 112.6M | 444M | prepared, limited weight |
159
  | Local terminal conversations | 9.39B | 36G | prepared for terminal-heavy later stages |
160
  | SWE-ZERO prepared | 182.7M | 720M | pretraining and later SFT |
161
  | GLM reasoning prepared | 68.5M | 282M | pretraining and later SFT |
162
 
163
- Major source groups:
164
-
165
- - Upstream HRM-Text cleaned pretraining data from `sapientinc/HRM-Text-data-io-cleaned-20260515`
166
- - Korean Wikipedia
167
- - Korean statutes, local ordinances, administrative rules, and precedent corpora
168
- - ToolBench train trajectories and tool-use instructions
169
- - Local terminal/code/math conversations
170
- - SWE-ZERO terminal/code trajectories
171
- - GLM reasoning samples
172
- - Small, reviewed subsets of extra reasoning/agent datasets
173
 
174
  Evaluation-like data is excluded from training where identified, including ToolBench eval, Terminal Bench 2 style data, and benchmark-oriented `chi-bench` data.
175
 
176
  ## Training Run
177
 
178
  The current public checkpoint was produced through staged pretraining:
@@ -227,4 +249,3 @@ This work builds on the HRM-Text architecture and training stack:
227
 
228
  - Paper: https://arxiv.org/html/2605.20613
229
  - Upstream code: https://github.com/sapientinc/HRM-Text
230
-
 
25
  |---|---|
26
  | HF model | https://huggingface.co/LLM-OS-Models/KoHRM-Text-1.4B |
27
  | Project code | https://github.com/LLM-OS-Models/KoHRM-text |
28
+ | Prepared training data | https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data |
29
  | Upstream HRM-Text code | https://github.com/sapientinc/HRM-Text |
30
  | HRM-Text paper | https://arxiv.org/html/2605.20613 |
31
  | Tokenizer | https://huggingface.co/LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K |
 
141
 
142
  ## Training Data
143
 
144
+ Prepared data artifacts are uploaded to the Hugging Face dataset repository, not to the model repository:
145
+
146
+ https://huggingface.co/datasets/LLM-OS-Models/KoHRM-Text-1.4B-prepared-data
147
+
148
  All datasets are converted into HRM-Text V1Dataset style records with `instruction`, `response`, and `condition` fields where possible. The training objective is PrefixLM response-only loss, so the model is trained to predict the response span after seeing the instruction/prompt span.
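
As a rough illustration of the response-only objective described above, the sketch below builds labels that ignore the instruction span and a prefix-LM attention mask that lets the prefix attend bidirectionally. This is a minimal sketch, not the project's training code: the function names, the `-100` ignore sentinel (PyTorch's default `ignore_index` for cross-entropy), and the exact mask layout are assumptions.

```python
from typing import List, Sequence, Tuple

# Assumed sentinel: -100 is the default ignore_index of PyTorch's cross-entropy loss.
IGNORE_INDEX = -100


def build_response_only_example(
    prompt_ids: Sequence[int], response_ids: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and response; only response tokens keep a label."""
    input_ids = list(prompt_ids) + list(response_ids)
    # Prompt positions are masked out so they contribute nothing to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def prefix_lm_attention_mask(prefix_len: int, total_len: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Every position sees the whole prefix; positions after the prefix are causal.
    """
    return [
        [j < prefix_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]


input_ids, labels = build_response_only_example([5, 6, 7], [8, 9])
mask = prefix_lm_attention_mask(prefix_len=3, total_len=len(input_ids))
```

Labels here are aligned with `input_ids`; the usual next-token shift is assumed to happen inside the loss computation.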
149
 
150
  Completed and prepared datasets:
 
159
  | HRM cleaned fast-cap stage-1 | 14.55B | 148G | current stage-1 |
160
  | Korean statutes/local ordinances raw full | 308.9M | 1.2G | prepared for later stages |
161
  | Korean administrative rules + precedents raw full | 271.7M | 1.1G | prepared for later stages |
162
+ | Korean legal/admin full task data | 629.0M | 2.5G | uploaded to prepared dataset repo |
163
  | Korean Wikipedia raw full | 462.5M | 1.8G | prepared for later stages |
164
  | HF extra reasoning/agent/mm subset | 112.6M | 444M | prepared, limited weight |
165
  | Local terminal conversations | 9.39B | 36G | prepared for terminal-heavy later stages |
166
  | SWE-ZERO prepared | 182.7M | 720M | pretraining and later SFT |
167
  | GLM reasoning prepared | 68.5M | 282M | pretraining and later SFT |
168
 
169
+ Major source groups and provenance:
170
+
171
+ | Source group | Origin | Prepared dataset usage |
172
+ |---|---|---|
173
+ | HRM-Text cleaned pretraining data | https://huggingface.co/datasets/sapientinc/HRM-Text-data-io-cleaned-20260515 | `hrm_cleaned_base_sample_v1`, `koterm_hrm_cleaned_fastcap_stage1_v1`; full no-cap retokenization is still running |
174
+ | Korean Wikipedia | https://dumps.wikimedia.org/kowiki/20260501/ | `kowiki_raw_full_v1` |
175
+ | Korean statutes | https://github.com/legalize-kr/legalize-kr | `korean_legal_raw_full_v1`, `sft_korean_legal_v1`, `korean_legal_tasks_full_v1` |
176
+ | Korean local ordinances | https://github.com/legalize-kr/ordinance-kr | `korean_legal_raw_full_v1`, `sft_korean_legal_v1`, `korean_legal_tasks_full_v1` |
177
+ | Korean administrative rules | local Markdown snapshot at `admrule-kr/` | `korean_admrule_precedent_raw_full_v1`, `korean_legal_tasks_full_v1` |
178
+ | Korean precedents | local Markdown snapshot at `precedent-kr/` | `korean_admrule_precedent_raw_full_v1`, `korean_legal_tasks_full_v1` |
179
+ | ToolBench train data | local extraction under `data_toolbench/data/`; eval split excluded | `sft_toolbench_v1` |
180
+ | SWE-ZERO trajectories | https://huggingface.co/datasets/AlienKevin/SWE-ZERO-12M-trajectories | `sft_swe_zero_v1`, `sft_swe_glm_mix_v1` |
181
+ | GLM reasoning | https://huggingface.co/datasets/Jackrong/GLM-5.1-Reasoning-1M-Cleaned | `sft_glm_reasoning_v1`, `sft_swe_glm_mix_v1` |
182
+ | Claude reasoning sample | https://huggingface.co/datasets/angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k | small reviewed reasoning subset inside `hf_extra_reasoning_agent_mm_v1` |
183
+ | Open-MM-RL text subset | https://huggingface.co/datasets/TuringEnterprises/Open-MM-RL | text-only reviewed subset inside `hf_extra_reasoning_agent_mm_v1` |
184
+ | DeepSeek agent traces | https://huggingface.co/datasets/TeichAI/DeepSeek-v4-Pro-Agent | limited agent/tool-use subset; license-sensitive |
185
+ | structured Wikipedia | https://huggingface.co/datasets/wikimedia/structured-wikipedia | tokenizer/general text support |
186
+ | Local terminal/code/math conversations | local `swe`, `code`, and `math` parquet conversations | `local_terminal_conversations_ctx9k_resp6k_v1` |
187
+
188
+ The full Korean legal/admin task upload is present in the dataset repository at:
189
+
190
+ - `korean_legal_tasks_full_v1/`
191
+ - `raw_jsonl/korean_legal_tasks_full_20260524.jsonl`
192
+ - `LEGAL_FULL_TASKS_README.md`
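
A minimal sketch of reading such a JSONL file once it has been downloaded locally, assuming one JSON object per line with the `instruction`, `response`, and optional `condition` fields mentioned earlier. The helper name `iter_records` and the strictness of the validation are assumptions; the actual schema is documented in the dataset repository, not here.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Optional, Union


def iter_records(path: Union[str, Path]) -> Iterator[Dict[str, Optional[str]]]:
    """Yield instruction/response/condition records from a local JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            if "instruction" not in record or "response" not in record:
                raise ValueError(f"line {line_number}: missing instruction or response")
            yield {
                "instruction": record["instruction"],
                "response": record["response"],
                # `condition` is treated as optional, matching "where possible".
                "condition": record.get("condition"),
            }
```

For example, `next(iter_records("korean_legal_tasks_full_20260524.jsonl"))` would return the first record, assuming the file sits in the working directory.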
193
 
194
  Evaluation-like data is excluded from training where identified, including ToolBench eval, Terminal Bench 2 style data, and benchmark-oriented `chi-bench` data.
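
One way such an exclusion could be applied is a simple tag filter over records, sketched below. Everything specific here is hypothetical: the `source` field, the marker strings, and the helper name are illustrative and are not taken from the project's preparation scripts.

```python
from typing import Dict, Iterable, List, Sequence

# Hypothetical source tags standing in for the eval-like data named above.
EXCLUDED_MARKERS = ("toolbench_eval", "terminal_bench_2", "chi-bench")


def drop_eval_like(
    records: Iterable[Dict[str, str]],
    markers: Sequence[str] = EXCLUDED_MARKERS,
) -> List[Dict[str, str]]:
    """Keep only records whose (assumed) `source` tag matches no excluded marker."""
    kept = []
    for record in records:
        source = str(record.get("source", "")).lower()
        if any(marker in source for marker in markers):
            continue  # identified as evaluation-like, so it is skipped
        kept.append(record)
    return kept
```

A substring match like this only catches data that is already tagged, which is consistent with the "where identified" caveat.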
195
 
196
+ Licenses and terms remain those of the original sources. The prepared dataset upload does not relicense upstream content.
197
+
198
  ## Training Run
199
 
200
  The current public checkpoint was produced through staged pretraining:
 
249
 
250
  - Paper: https://arxiv.org/html/2605.20613
251
  - Upstream code: https://github.com/sapientinc/HRM-Text