gyung committed on
Commit d8a5ee2 · verified · 1 Parent(s): fb528be

Update model card with corrected TB2-lite evaluation

Files changed (1)
  1. README.md +12 -75
README.md CHANGED
@@ -20,69 +20,14 @@ base_model: LiquidAI/LFM2.5-1.2B-Instruct
 
  - Base model: `LiquidAI/LFM2.5-1.2B-Instruct`
  - Training setup: `2 epochs, Unsloth SFT`
- - Evaluation snapshot: `2026-05-07 22:44:35 UTC`
  - Evaluation result id: `lfm25_1p2b_sft_unsloth_e2`
 
- ## Usage
-
- Transformers example:
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- import torch
-
- model_id = "LLM-OS-Models/LFM2.5-1.2B-Terminal-SFT-2Epoch-Unsloth"
- tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
- model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     torch_dtype=torch.bfloat16,
-     device_map="auto",
-     trust_remote_code=True,
- )
-
- messages = [
-     {"role": "system", "content": "You are a terminal automation assistant. Return JSON only."},
-     {"role": "user", "content": "List the current directory and identify Python files."},
- ]
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
- outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
- print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=False))
- ```
-
- vLLM example:
-
- ```python
- from vllm import LLM, SamplingParams
- from transformers import AutoTokenizer
-
- model_id = "LLM-OS-Models/LFM2.5-1.2B-Terminal-SFT-2Epoch-Unsloth"
- tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
- llm = LLM(model=model_id, dtype="bfloat16", trust_remote_code=True)
- messages = [{"role": "user", "content": "Show disk usage for the current folder."}]
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- result = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=512))
- print(result[0].outputs[0].text)
- ```
-
- Recommended output format:
-
- ```json
- {
-   "analysis": "brief reasoning about the next terminal action",
-   "plan": "short execution plan",
-   "commands": [
-     {"keystrokes": "ls -la\n", "duration": 0.1}
-   ],
-   "task_complete": false
- }
- ```
-
  ## Evaluation results
 
  The evaluation was run with vLLM on the corrected TB2-lite replay set. The ranking score uses only `100 * avg_command_f1`; `first_cmd_exact_pct` is reported as a secondary metric only.
 
- - Rank: `36 / 44`
  - Score: `22.45`
  - Command F1: `0.2245`
  - Command precision: `0.3097`
@@ -90,27 +35,12 @@ print(result[0].outputs[0].text)
  - First command exact: `18.8%`
  - Valid JSON: `47.2%`
  - Steps / tasks: `303 / 50`
  - Template status: `chat_template`
  - Rank eligible: `True`
  - Eval timestamp: `2026-05-07T21:50:36.580647`
- - Number of evaluation results aggregated so far: `44`
-
- Example reproduction command:
-
- ```bash
- python tb2_lite/scripts/replay_eval.py \
-   --model LLM-OS-Models/LFM2.5-1.2B-Terminal-SFT-2Epoch-Unsloth \
-   --model-short lfm25_1p2b_sft_unsloth_e2 \
-   --eval-path tb2_lite/data/replay_full.jsonl \
-   --output-dir /home/work/.data/tb2_lite_eval/corrected_readme_models_vllm \
-   --dtype bfloat16 \
-   --max-model-len 49152 \
-   --max-tokens 1024 \
-   --temperature 0.0 \
-   --top-p 1.0 \
-   --gpu-memory-utilization 0.94 \
-   --language-model-only
- ```
 
  Prompt/template audit:
 
@@ -129,6 +59,13 @@ Prompt/template audit:
  - The model tends to emit a small number of correct commands conservatively rather than many incorrect ones.
  - The LFM family is well suited to lightweight, efficiency-focused experiments that align the Liquid chat template with the terminal SFT format.
 
  ## Limitations and caveats
 
  - Recall is relatively low, so some required commands may be omitted.
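
The ranking score named in the evaluation section, `100 * avg_command_f1`, can be illustrated with a minimal sketch. The diff does not show how `avg_command_f1` is computed, so this assumes a per-step F1 over the multiset of predicted versus reference command strings, averaged over steps. The helper names `command_f1` and `ranking_score` are hypothetical and are not part of the TB2-lite scripts.

```python
from collections import Counter


def command_f1(predicted, reference):
    """F1 between two lists of command strings for one step (assumed definition)."""
    if not predicted and not reference:
        return 1.0  # assumption: nothing expected and nothing emitted counts as a match
    # Multiset intersection counts each command at most as often as it appears in both lists.
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def ranking_score(steps):
    """Average the per-step F1 and scale to 0-100, as in `100 * avg_command_f1`."""
    if not steps:
        return 0.0
    avg_f1 = sum(command_f1(pred, ref) for pred, ref in steps) / len(steps)
    return 100 * avg_f1


# Hypothetical replay steps: (predicted commands, reference commands)
steps = [
    (["ls -la\n"], ["ls -la\n"]),      # exact match
    (["pwd\n", "ls\n"], ["ls\n"]),     # one of two predictions is correct
    ([], ["cat README.md\n"]),         # nothing predicted
]
print(round(ranking_score(steps), 2))
```

With the reported Command F1 of `0.2245`, the same scaling gives the listed score of `22.45`.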
 
 
  - Base model: `LiquidAI/LFM2.5-1.2B-Instruct`
  - Training setup: `2 epochs, Unsloth SFT`
+ - Evaluation snapshot: `2026-05-08 16:04:07 UTC`
  - Evaluation result id: `lfm25_1p2b_sft_unsloth_e2`
 
  ## Evaluation results
 
  The evaluation was run with vLLM on the corrected TB2-lite replay set. The ranking score uses only `100 * avg_command_f1`; `first_cmd_exact_pct` is reported as a secondary metric only.
 
+ - Rank: `43 / 56`
  - Score: `22.45`
  - Command F1: `0.2245`
  - Command precision: `0.3097`

  - First command exact: `18.8%`
  - Valid JSON: `47.2%`
  - Steps / tasks: `303 / 50`
+ - Sec/step: `0.083`
+ - Load time: `57.2s`
  - Template status: `chat_template`
  - Rank eligible: `True`
  - Eval timestamp: `2026-05-07T21:50:36.580647`
+ - Number of evaluation results aggregated so far: `56`
 
  Prompt/template audit:
 

  - The model tends to emit a small number of correct commands conservatively rather than many incorrect ones.
  - The LFM family is well suited to lightweight, efficiency-focused experiments that align the Liquid chat template with the terminal SFT format.
 
+ ## Model family interpretation
+
+ - The LFM family shows a large SFT gain over its base score and a low sec/step, making it an efficient candidate for repeated evaluation and RL experiments.
+ - As a next step, the most practical option is RL that targets valid JSON, command precision, and premature completion directly through rewards and penalties.
+ - Speed is on the fast side at roughly `0.083` sec/step.
+ - RL candidacy: on the current score alone, this model is closer to a secondary/comparison candidate than a primary one.
+
  ## Limitations and caveats
 
  - Recall is relatively low, so some required commands may be omitted.
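
The RL direction suggested in the model family interpretation (rewarding valid JSON and command precision, penalizing premature completion) could be prototyped with a reward function like the one below. This is only a sketch: the weights, the function name `terminal_reward`, and the arguments `reference_commands` and `reference_complete` are assumptions made for illustration. The JSON keys follow the recommended output format that this commit removes from the README.

```python
import json


def terminal_reward(output_text, reference_commands, reference_complete):
    """Toy reward: valid-JSON bonus + command precision - premature-complete penalty."""
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return -1.0  # assumed penalty for output that is not valid JSON
    if not isinstance(parsed, dict):
        return -1.0  # valid JSON, but not the expected object shape

    reward = 0.2  # assumed small bonus for a parseable JSON object

    commands = parsed.get("commands", [])
    if not isinstance(commands, list):
        commands = []
    predicted = [c.get("keystrokes", "") for c in commands if isinstance(c, dict)]
    if predicted:
        correct = sum(1 for cmd in predicted if cmd in reference_commands)
        reward += correct / len(predicted)  # command precision in [0, 1]

    # Penalize declaring the task finished when the reference says it is not.
    if parsed.get("task_complete") is True and not reference_complete:
        reward -= 0.5  # assumed penalty weight

    return reward


# Hypothetical model outputs for a step whose reference command is "ls -la\n"
good = (
    '{"analysis": "list files", "plan": "run ls", '
    '"commands": [{"keystrokes": "ls -la\\n", "duration": 0.1}], '
    '"task_complete": false}'
)
premature = '{"commands": [], "task_complete": true}'

print(terminal_reward(good, ["ls -la\n"], False))
print(terminal_reward(premature, ["ls -la\n"], False))
print(terminal_reward("not json", ["ls -la\n"], False))
```

Keeping the three terms separate makes it easy to tune each weight against the reported metrics (Valid JSON `47.2%`, Command precision `0.3097`) before committing to a full RL run.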