gyung committed on
Commit ecbdf52 · verified · 1 Parent(s): db5a5b3

Update model card with pending TB2-lite evaluation status

Files changed (1)
  1. README.md +160 -81
README.md CHANGED
@@ -1,88 +1,167 @@
  ---
- license: other
  language:
- - ko
  - en
  tags:
- - hrm-text
- - korean
  - terminal
- - tool-use
- - code
- - pretraining
- pipeline_tag: text-generation
  ---

- # KoHRM-Text-1.4B
-
- `KoHRM-Text-1.4B` is a model pretrained from scratch on top of the PrefixLM training structure of `sapientinc/HRM-Text`, targeting usability in Korean, English, coding, terminal work, and tool calls.
-
- This card is a work-in-progress model card draft as of 2026-05-23. The epoch artifacts currently being uploaded are raw HRM-Text FSDP2 checkpoints, not the final distribution format that loads directly in Transformers.
-
- ## Model Information
-
- | Item | Value |
- |---|---|
- | model id | `LLM-OS-Models/KoHRM-Text-1.4B` |
- | base code | `sapientinc/HRM-Text` |
- | training from | scratch |
- | architecture | HRM-Text `XL` |
- | params | 1,384,120,320 |
- | context | 4096 tokens |
- | dtype | bfloat16 |
- | tokenizer | byte-level BPE, NFC normalization |
- | vocab | 131,072 |
-
- ## Tokenizer
-
- The new tokenizer was trained with Korean, English, code, shell, terminal instructions, and JSON tool calls all taken into account.
-
- | Sample | chars/token |
- |---|---:|
- | Korean (general) | 2.60 |
- | Korean (legal) | 2.36 |
- | Korean terminal instructions | 2.18 |
- | shell command | 2.68 |
- | tool JSON | 3.32 |
- | Python code | 3.37 |
- | English | 4.40 |
-
- Tokenizer repo: `LLM-OS-Models/HRM-Text-Ko-Terminal-Tokenizer-131K`
-
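The chars/token column in the tokenizer table above is an average of characters covered per token. A minimal sketch of how such a figure could be computed follows. The `chars_per_token` helper is hypothetical and not part of the repository, the whitespace splitter only stands in for a real tokenizer's encode function, and whether the original numbers count characters before or after NFC normalization is an assumption left open here.

```python
from typing import Callable, Sequence


def chars_per_token(text: str, encode: Callable[[str], Sequence]) -> float:
    """Average number of characters (Unicode code points) per token."""
    tokens = encode(text)
    if not tokens:
        raise ValueError("encoder returned no tokens")
    return len(text) / len(tokens)


# Stand-in encoder (hypothetical): whitespace split, only to keep the sketch runnable.
sample = "ls -la /var/log"
print(chars_per_token(sample, str.split))
```

With a real tokenizer, the `encode` argument would be that tokenizer's own encode method, and the sample text would come from each category in the table.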
- ## Training Data
-
- The stage-0 input is a preprocessed 711.3M-token mix.
-
- | Data | Tokens |
- |---|---:|
- | HRM cleaned base sample | 250.0M |
- | SWE-ZERO + GLM reasoning mix | 251.2M |
- | Korean law / ordinance / administrative rule / case-law tasks | 83.1M |
- | ToolBench train tool-call task | 127.0M |
- | Total | 711.3M |
-
- Later stages sequentially add the retokenized original HRM cleaned dataset, a local terminal dataset, and additional Korean, coding, and tool-call data. `tb2_lite`, Terminal Bench 2, ToolBench eval, and chi-bench serve as evaluation data and are excluded from training.
-
- ## Training Method
-
- - Objective: PrefixLM style response-only loss
- - Optimizer: HRM-Text upstream Adam-atan2
- - Context: 4096 tokens
- - Hardware: 8 x NVIDIA H200
- - Current stable global batch: 172,032 tokens
- - Checkpoint policy: epoch-level raw FSDP2 checkpoint upload
-
- The paper's default global batch is 196,608 tokens, but this model's vocabulary is larger at 131,072, so the final logits take more memory. For long runs, 172,032 tokens is used as the default to leave OOM headroom.
-
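The batch-size reasoning above can be checked with simple arithmetic. The sketch below assumes the worst case in which bfloat16 logits for every token of the global batch are materialized at once and split evenly across the 8 GPUs. The actual training code may chunk the output projection, compute in another dtype, or only score response tokens, so the figures are an upper-bound illustration rather than measured usage.

```python
CONTEXT = 4096      # tokens per sequence (from the card)
VOCAB = 131_072     # vocabulary size (from the card)
NUM_GPUS = 8        # 8 x H200 (from the card)
BYTES_BF16 = 2      # assumption: logits held in bfloat16


def logits_bytes_per_gpu(global_batch_tokens: int) -> int:
    """Bytes for a full [tokens, vocab] logits tensor on one GPU."""
    tokens_per_gpu = global_batch_tokens // NUM_GPUS
    return tokens_per_gpu * VOCAB * BYTES_BF16


for batch in (196_608, 172_032):
    sequences = batch // CONTEXT
    gib = logits_bytes_per_gpu(batch) / 2**30
    print(f"{batch:,} tokens = {sequences} sequences, {gib:.2f} GiB logits per GPU")
```

Under these assumptions the two batch sizes correspond to 48 and 42 full-length sequences, and the smaller batch trims the per-GPU logits tensor from 6.00 GiB to 5.25 GiB before gradients and fp32 copies are counted.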
- In staged pretraining, each stage inherits the model/optimizer/EMA/carry state from the previous checkpoint and uses `resume_step_offset` and `total_steps_override` to align the LR schedule with the full pretraining run. In other words, training restarts whenever new data is ready, but the optimizer state and schedule are carried through without a break.
-
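One way to picture the role of `resume_step_offset` and `total_steps_override` is a schedule that is evaluated on a global step rather than a per-stage step. The sketch below is illustrative only: the linear-warmup cosine shape, the `lr_at` function, and every parameter other than the two names quoted above are assumptions, not the HRM-Text implementation.

```python
import math


def lr_at(local_step, *, base_lr, warmup_steps, stage_steps,
          resume_step_offset=0, total_steps_override=None):
    """Learning rate for a stage-local step, placed on the global schedule."""
    # Total length of the schedule: the whole pretraining run if overridden.
    total = stage_steps if total_steps_override is None else total_steps_override
    # Position on the global schedule, not within the current stage.
    step = local_step + resume_step_offset
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total - warmup_steps))
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Stage 2 starts at global step 50 of a 200-step run (made-up numbers).
resumed = lr_at(0, base_lr=1.0, warmup_steps=10, stage_steps=150,
                resume_step_offset=50, total_steps_override=200)
restarted = lr_at(0, base_lr=1.0, warmup_steps=10, stage_steps=150)
print(resumed > restarted)  # a naive restart would fall back into warmup
```

The point of the two parameters in this sketch is that the first step of a new stage continues the decay from where the previous stage stopped instead of re-entering warmup.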
- ## Current Status
-
- - stage-0 training: in progress
- - HF upload: epoch checkpoint watcher active
- - final Transformers conversion: not yet produced
- - public benchmark score: not yet evaluated for this model
-
- ## Limitations
-
- The current checkpoint artifacts are intermediate training outputs. This model has not completed safety alignment, final instruction tuning, final benchmarking, or conversion for distribution. Korean terminal and tool-call capability is the target domain, but stage-0 alone does not guarantee finished performance.
  ---
  language:
  - en
+ - ko
+ library_name: transformers
+ pipeline_tag: text-generation
  tags:
  - terminal
+ - sft
+ - vllm
+ - tb2-lite
+ - evaluation-pending
+ base_model: unknown
  ---

+ # LLM-OS-Models/KoHRM-Text-1.4B
+
+ A Terminal SFT model for terminal task automation. It was trained to read the given task and the previous terminal state and to generate the next commands to run as JSON.
+
+ ## Model Summary
+
+ - Base model: `unknown`
+ - Training setup: `Terminal SFT`
+ - Model card snapshot: `2026-05-23 09:04:40 UTC`
+ - Corrected TB2-lite evaluated results currently indexed: `56`
+ - Corrected TB2-lite score: `pending / not matched in current result directory`
+
+ ## Quickstart
+
+ Install and log in:
+
+ ```bash
+ pip install -U vllm transformers huggingface_hub
+ huggingface-cli login
+ ```
+
+ Related code:
+
+ - GitHub: https://github.com/LLM-OS-Models/Terminal
+ - vLLM evaluation runner: `tb2_lite/scripts/replay_eval.py`
+ - chat template / fallback generation: `tb2_lite/scripts/prompt_builder.py`
+ - JSON/command scoring: `tb2_lite/scripts/replay_metrics.py`
+
+ Example of running vLLM directly. As in the evaluation code, the tokenizer's chat template is used first, and the ChatML/Gemma fallback is used when no template is available.
+
+ ```python
+ from transformers import AutoTokenizer
+ from vllm import LLM, SamplingParams
+
+ model_id = "LLM-OS-Models/KoHRM-Text-1.4B"
+ tp = 1  # tensor parallel size; raise to 2/4/8 on OOM
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ llm = LLM(
+     model=model_id,
+     tokenizer=model_id,
+     trust_remote_code=True,
+     dtype="bfloat16",
+     tensor_parallel_size=tp,
+     max_model_len=49152,
+     gpu_memory_utilization=0.92,
+ )
+
+ messages = [
+     {"role": "system", "content": "You are a terminal automation assistant. Return JSON only."},
+     {"role": "user", "content": "Inspect the current directory and list Python files."},
+ ]
+
+ def render_chatml(messages):
+     """Fallback: render messages as ChatML and open an assistant turn."""
+     parts = []
+     for message in messages:
+         role = "assistant" if message["role"] == "assistant" else message["role"]
+         if role == "tool":
+             role = "user"  # tool output is fed back as a user turn
+         parts.append(f"<|im_start|>{role}\n{message['content']}<|im_end|>\n")
+     parts.append("<|im_start|>assistant\n")
+     return "".join(parts)
+
+ def render_gemma4_turn(messages, empty_thought_channel=False):
+     """Fallback: render messages in the Gemma 4 turn format."""
+     parts = ["<bos>"]
+     for message in messages:
+         role = "model" if message["role"] == "assistant" else message["role"]
+         if role == "tool":
+             role = "user"  # tool output is fed back as a user turn
+         parts.append(f"<|turn>{role}\n{message['content'].strip()}<turn|>\n")
+     parts.append("<|turn>model\n")
+     if empty_thought_channel:
+         parts.append("<|channel>thought\n<channel|>")
+     return "".join(parts)
+
+ def render_prompt(model_id, tokenizer, messages):
+     """Prefer the tokenizer's chat template; fall back to manual rendering."""
+     model_key = model_id.lower()
+     if "gemma-4" in model_key:
+         try:
+             return tokenizer.apply_chat_template(
+                 messages,
+                 tokenize=False,
+                 add_generation_prompt=True,
+                 enable_thinking=False,
+             )
+         except Exception:
+             return render_gemma4_turn(
+                 messages,
+                 empty_thought_channel=("26b" in model_key or "31b" in model_key),
+             )
+     try:
+         return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+     except Exception:
+         return render_chatml(messages)
+
+ prompt = render_prompt(model_id, tokenizer, messages)
+ # Greedy decoding, matching the fixed evaluation settings.
+ sampling = SamplingParams(
+     temperature=0.0,
+     top_p=1.0,
+     max_tokens=1024,
+     repetition_penalty=1.0,
+ )
+ outputs = llm.generate([prompt], sampling_params=sampling)
+ print(outputs[0].outputs[0].text)
+ ```
+
+ Recommended output format:
+
+ ```json
+ {
+   "analysis": "brief reasoning about the next terminal action",
+   "plan": "short execution plan",
+   "commands": [
+     {"keystrokes": "ls -la\n", "duration": 0.1}
+   ],
+   "task_complete": false
+ }
+ ```
+
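Before acting on a generation, the JSON shape shown above can be checked with the standard library. This is a minimal sketch: the `parse_action` helper is hypothetical, and treating all four top-level keys as required is an assumption drawn from the recommended format, not from the scoring rules in `tb2_lite/scripts/replay_metrics.py`.

```python
import json

REQUIRED_KEYS = {"analysis", "plan", "commands", "task_complete"}


def parse_action(text: str) -> dict:
    """Parse model output and check it against the recommended shape."""
    action = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(action, dict):
        raise ValueError("top-level JSON value must be an object")
    missing = REQUIRED_KEYS - action.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(action["commands"], list):
        raise ValueError("'commands' must be a list")
    for command in action["commands"]:
        if not isinstance(command, dict) or not isinstance(command.get("keystrokes"), str):
            raise ValueError("each command needs a string 'keystrokes'")
        duration = command.get("duration")
        if isinstance(duration, bool) or not isinstance(duration, (int, float)):
            raise ValueError("each command needs a numeric 'duration'")
    if not isinstance(action["task_complete"], bool):
        raise ValueError("'task_complete' must be a boolean")
    return action


raw = ('{"analysis": "a", "plan": "p", '
       '"commands": [{"keystrokes": "ls -la\\n", "duration": 0.1}], '
       '"task_complete": false}')
print(parse_action(raw)["commands"][0]["keystrokes"])
```

Rejecting malformed output early keeps a downstream executor from ever seeing a partially parsed command list.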
+ Replay command identical to the one used for evaluation:
+
+ ```bash
+ python tb2_lite/scripts/replay_eval.py \
+   --model LLM-OS-Models/KoHRM-Text-1.4B \
+   --model-short LLM-OS-Models__KoHRM-Text-1.4B \
+   --eval-path tb2_lite/data/replay_full.jsonl \
+   --output-dir /home/work/.data/tb2_lite_eval/corrected_readme_models_vllm \
+   --dtype bfloat16 \
+   --tp 1 \
+   --max-model-len 49152 \
+   --max-tokens 1024 \
+   --temperature 0.0 \
+   --top-p 1.0 \
+   --gpu-memory-utilization 0.92 \
+   --language-model-only
+ ```
+
+ - Default recommended tensor parallel: `1`. On OOM, raise `--tp` and `tensor_parallel_size` to 2/4/8.
+ - The corrected TB2-lite evaluation is fixed at `temperature=0.0`, `top_p=1.0`, `max_tokens=1024`.
+ - Gemma 4 uses `enable_thinking=False` for JSON output, and for the 26B/31B family the evaluation code automatically applies empty thought channel handling.
+
+ ## Evaluation Status
+
+ - Current corrected TB2-lite score: `pending`
+ - Reason: the aggregated results currently in `/home/work/.data/tb2_lite_eval/corrected_readme_models_vllm` do not directly match this HF repo name.
+ - Next step: run the evaluation through the same `tb2_lite/scripts/replay_eval.py` path, then automatically replace this section with a score card.
+
+ ## Model Family Interpretation
+
+ - This repo does not yet have a score that directly matches the current corrected TB2-lite aggregate JSON.
+ - The TB2-lite score is not a general intelligence benchmark; it measures the ability to reproduce terminal next-action JSON.
+ - Generated commands must pass through safeguards such as a sandbox, an allowlist, and human review before they are actually executed.
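
As an illustration of the allowlist safeguard mentioned above, the sketch below gates a generated `keystrokes` string before execution. The `is_allowed` helper, the set of permitted programs, and the rejected metacharacters are all hypothetical choices for this example. A program-name allowlist is only one layer and does not replace a sandbox or human review.

```python
import shlex

# Hypothetical policy: read-only programs only, no shell composition.
ALLOWED_PROGRAMS = {"ls", "cat", "pwd", "grep"}
SHELL_METACHARS = set(";&|<>`$(){}\n")


def is_allowed(keystrokes: str, allowed=ALLOWED_PROGRAMS) -> bool:
    """Return True only for a single simple command whose program is allowlisted."""
    line = keystrokes.rstrip("\n")  # drop the trailing Enter keystroke
    if not line.strip() or any(ch in SHELL_METACHARS for ch in line):
        return False  # empty input, chained commands, redirects, substitutions
    try:
        argv = shlex.split(line)
    except ValueError:
        return False  # unbalanced quotes
    return argv[0] in allowed


for candidate in ("ls -la\n", "rm -rf /\n", "ls; rm -rf /\n"):
    print(repr(candidate), is_allowed(candidate))
```

Rejecting every metacharacter is deliberately strict: it blocks legitimate pipelines too, which is usually the safer default when the command text comes from a model.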