---
language:
- en
- ko
library_name: transformers
pipeline_tag: text-generation
tags:
- terminal
- sft
- vllm
- tb2-lite
base_model: LiquidAI/LFM2.5-1.2B-Instruct
---

# LLM-OS-Models/LFM2.5-1.2B-Terminal-SFT-2Epoch-Unsloth

A Terminal SFT model for automating terminal tasks. It was trained to read the given task and the previous terminal state, then generate the next command(s) to run as JSON.

## Model Summary

- Base model: `LiquidAI/LFM2.5-1.2B-Instruct`
- Training setup: `2 epochs, Unsloth SFT`
- Evaluation snapshot: `2026-05-07 22:44:35 UTC`
- Evaluation result id: `lfm25_1p2b_sft_unsloth_e2`

## Usage

Transformers example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "LLM-OS-Models/LFM2.5-1.2B-Terminal-SFT-2Epoch-Unsloth"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a terminal automation assistant. Return JSON only."},
    {"role": "user", "content": "List the current directory and identify Python files."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
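# Decode only the newly generated tokens; special tokens are skipped so the output can be parsed as JSON.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))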
```

vLLM example:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "LLM-OS-Models/LFM2.5-1.2B-Terminal-SFT-2Epoch-Unsloth"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
llm = LLM(model=model_id, dtype="bfloat16", trust_remote_code=True)
messages = [{"role": "user", "content": "Show disk usage for the current folder."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
result = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=512))
print(result[0].outputs[0].text)
```

Recommended output format:

```json
{
  "analysis": "brief reasoning about the next terminal action",
  "plan": "short execution plan",
  "commands": [
    {"keystrokes": "ls -la\n", "duration": 0.1}
  ],
  "task_complete": false
}
```
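
Because the model does not always return well-formed JSON, it can help to validate each response against the format above before acting on it. The following is a minimal sketch, not part of the released tooling: the helper name `parse_action` and the rule of taking the outermost `{...}` span are assumptions made for illustration.

```python
import json

# Top-level keys and types taken from the recommended output format above.
REQUIRED_KEYS = {"analysis": str, "plan": str, "commands": list, "task_complete": bool}


def parse_action(text):
    """Return the parsed action dict, or None if `text` is not a valid action.

    Hypothetical helper: it assumes the JSON object is the outermost {...}
    span in the raw model output (extra tokens around it are ignored).
    """
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        action = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    # Every top-level key must be present with the expected type.
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(action.get(key), expected_type):
            return None
    # Every command needs string keystrokes and a numeric duration.
    for cmd in action["commands"]:
        if not isinstance(cmd, dict) or not isinstance(cmd.get("keystrokes"), str):
            return None
        duration = cmd.get("duration")
        if isinstance(duration, bool) or not isinstance(duration, (int, float)):
            return None
    return action
```

A caller could retry generation, or stop and ask for human input, whenever `parse_action` returns `None`.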

## Evaluation Results

Evaluation was run with vLLM on the corrected TB2-lite replay set. The ranking score uses only `100 * avg_command_f1`; `first_cmd_exact_pct` is treated as an auxiliary metric only.

- Rank: `36 / 44`
- Score: `22.45`
- Command F1: `0.2245`
- Command precision: `0.3097`
- Command recall: `0.2314`
- First command exact: `18.8%`
- Valid JSON: `47.2%`
- Steps / tasks: `303 / 50`
- Template status: `chat_template`
- Rank eligible: `True`
- Eval timestamp: `2026-05-07T21:50:36.580647`
- Evaluation results aggregated so far: `44`
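
To make the `100 * avg_command_f1` score concrete, here is a rough sketch of how a per-step command F1 could be computed and averaged. The exact matching rule used by the evaluation script is not stated in this card, so the exact-match comparison of whitespace-stripped command strings below is an assumption, and `command_prf` and `ranking_score` are hypothetical names.

```python
from collections import Counter


def command_prf(predicted, reference):
    """Precision, recall, and F1 for one step.

    Assumption: commands match only if their stripped strings are identical,
    and duplicates are counted (multiset overlap).
    """
    pred = Counter(c.strip() for c in predicted)
    ref = Counter(c.strip() for c in reference)
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def ranking_score(per_step_f1):
    """100 * mean per-step F1, mirroring the `100 * avg_command_f1` formula."""
    if not per_step_f1:
        return 0.0
    return 100 * sum(per_step_f1) / len(per_step_f1)
```

If F1 is averaged per step in this way, the mean F1 can fall below both the mean precision and the mean recall, which would be consistent with the figures listed above.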

Example reproduction command:

```bash
python tb2_lite/scripts/replay_eval.py \
  --model LLM-OS-Models/LFM2.5-1.2B-Terminal-SFT-2Epoch-Unsloth \
  --model-short lfm25_1p2b_sft_unsloth_e2 \
  --eval-path tb2_lite/data/replay_full.jsonl \
  --output-dir /home/work/.data/tb2_lite_eval/corrected_readme_models_vllm \
  --dtype bfloat16 \
  --max-model-len 49152 \
  --max-tokens 1024 \
  --temperature 0.0 \
  --top-p 1.0 \
  --gpu-memory-utilization 0.94 \
  --language-model-only
```

Prompt/template audit:

```json
{
  "template_status": "chat_template",
  "rank_eligible": true,
  "steps": 303,
  "tasks": 50
}
```

## Strengths

- For certain model sizes and acceleration paths, fast inference relative to cost can be expected.
- Tends to emit a small number of correct commands conservatively rather than many incorrect ones (precision `0.3097` vs. recall `0.2314`).
- The LFM family is well suited to lightweight, efficiency-focused experiments that align the Liquid chat template with the terminal SFT format.

## Limitations and Caveats

- Recall is relatively low, so the model may omit some of the required commands.
- JSON formatting failures occur (valid JSON: `47.2%`), so outputs need parse validation and retries before execution.
- The lower command F1 compared with the top-ranked Qwen models reflects a mix of capability differences and differences in format, tokenizer, and training path.
- This is an SFT model for assisting automated terminal operation; it does not guarantee general chat or general-purpose reasoning performance.
- Generated commands must pass through safeguards such as a sandbox, an allowlist, and human review before they are actually executed.
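
As one illustration of the allowlist safeguard mentioned above, the sketch below checks a command's `keystrokes` before it is sent to a terminal. It is deliberately conservative and is not a complete security boundary. The allowlist contents and the name `is_allowed` are hypothetical, and a sandbox and human review would still be needed.

```python
import shlex

# Hypothetical allowlist; replace with the executables your environment permits.
ALLOWED_EXECUTABLES = {"ls", "pwd", "cat", "du", "grep"}

# Characters that enable chaining, substitution, redirection, or multi-line input.
FORBIDDEN_CHARS = set(";&|`$<>()\n\r")


def is_allowed(keystrokes, allowed=ALLOWED_EXECUTABLES):
    """Return True only for a single, simple command whose executable is allowlisted."""
    line = keystrokes.strip()  # drop the trailing newline that submits the command
    if not line or any(ch in FORBIDDEN_CHARS for ch in line):
        return False
    try:
        tokens = shlex.split(line)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in allowed
```

For example, the commands in a parsed response could be filtered with `[c for c in action["commands"] if is_allowed(c["keystrokes"])]`, with anything rejected routed to human review instead of being executed.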

## Interpretation Notes

The TB2-lite score is not a general intelligence benchmark; it measures the ability to reproduce terminal next-action JSON. Model size, chat template match, assistant-only masking, tokenizer, and whether the training data was held out therefore all affect the score.

If the values in README.md and MODEL_EVALUATION_REPORT.md are more recent, check those first. This model card is a snapshot that was quickly propagated to the individual repository from the completed evaluation JSON.