Commit 2155b0b by gyung · verified · 1 Parent(s): 3f858f2

Update model card with corrected TB2-lite evaluation

Files changed (1)
  1. README.md +12 -75
README.md CHANGED
@@ -20,69 +20,14 @@ base_model: Qwen/Qwen3.5-2B
 
  - Base model: `Qwen/Qwen3.5-2B`
  - Training setup: `2 epochs, full fine-tuning, same-count data setting`
- - Evaluation snapshot: `2026-05-07 22:44:37 UTC`
  - Evaluation result id: `qwen35_2b_sft_samecount_e2`
 
- ## Usage
-
- Transformers example:
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- import torch
-
- model_id = "LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount"
- tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
- model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     torch_dtype=torch.bfloat16,
-     device_map="auto",
-     trust_remote_code=True,
- )
-
- messages = [
-     {"role": "system", "content": "You are a terminal automation assistant. Return JSON only."},
-     {"role": "user", "content": "List the current directory and identify Python files."},
- ]
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
- outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
- print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=False))
- ```
-
- vLLM example:
-
- ```python
- from vllm import LLM, SamplingParams
- from transformers import AutoTokenizer
-
- model_id = "LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount"
- tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
- llm = LLM(model=model_id, dtype="bfloat16", trust_remote_code=True)
- messages = [{"role": "user", "content": "Show disk usage for the current folder."}]
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- result = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=512))
- print(result[0].outputs[0].text)
- ```
-
- Recommended output format:
-
- ```json
- {
-   "analysis": "brief reasoning about the next terminal action",
-   "plan": "short execution plan",
-   "commands": [
-     {"keystrokes": "ls -la\n", "duration": 0.1}
-   ],
-   "task_complete": false
- }
- ```
-
  ## Evaluation results
 
  Evaluation was run with vLLM on the corrected TB2-lite replay set. The ranking score uses only `100 * avg_command_f1`; `first_cmd_exact_pct` is treated as a secondary metric only.
 
- - Rank: `1 / 44`
  - Score: `39.52`
  - Command F1: `0.3952`
  - Command precision: `0.5082`
@@ -90,27 +35,12 @@ print(result[0].outputs[0].text)
  - First command exact: `33.0%`
  - Valid JSON: `82.2%`
  - Steps / tasks: `303 / 50`
  - Template status: `chat_template`
  - Rank eligible: `True`
  - Eval timestamp: `2026-05-07T22:06:25.457045`
- - Number of evaluation results aggregated so far: `44`
-
- Example reproduction command:
-
- ```bash
- python tb2_lite/scripts/replay_eval.py \
-   --model LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount \
-   --model-short qwen35_2b_sft_samecount_e2 \
-   --eval-path tb2_lite/data/replay_full.jsonl \
-   --output-dir /home/work/.data/tb2_lite_eval/corrected_readme_models_vllm \
-   --dtype bfloat16 \
-   --max-model-len 49152 \
-   --max-tokens 1024 \
-   --temperature 0.0 \
-   --top-p 1.0 \
-   --gpu-memory-utilization 0.94 \
-   --language-model-only
- ```
 
  Prompt/template audit:
 
@@ -129,6 +59,13 @@ Prompt/template audit:
  - Tends to issue a small number of correct commands conservatively rather than many incorrect ones.
  - In this evaluation, the Qwen family was generally strong in command JSON stability and command F1.
 
  ## Limitations and caveats
 
  - Recall is relatively low, so some required commands may be omitted.
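
Reviewer note: the usage section removed above documented a recommended output format (`analysis`, `plan`, `commands`, `task_complete`), and the card still reports a `Valid JSON` rate. The card does not say how that rate is computed, so the following is only a minimal sketch of one plausible check. The function name `parse_action` and the choice of which fields are mandatory are assumptions made for illustration, not the evaluation code.

```python
import json


def parse_action(text):
    """Return the parsed action dict, or None if `text` does not fit the assumed shape."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    # Assumption: `commands` must be a list of objects with a string `keystrokes`.
    commands = obj.get("commands")
    if not isinstance(commands, list):
        return None
    for cmd in commands:
        if not isinstance(cmd, dict) or not isinstance(cmd.get("keystrokes"), str):
            return None
        # Assumption: `duration` is optional but must be numeric when present.
        if not isinstance(cmd.get("duration", 0), (int, float)):
            return None
    # Assumption: `task_complete` must be an explicit boolean.
    if not isinstance(obj.get("task_complete"), bool):
        return None
    return obj


sample = (
    '{"analysis": "inspect files", "plan": "list directory", '
    '"commands": [{"keystrokes": "ls -la\\n", "duration": 0.1}], '
    '"task_complete": false}'
)
action = parse_action(sample)
print("valid" if action else "invalid")
```

A stricter check could also require `analysis` and `plan` to be strings; they are left optional here because the card only calls the format "recommended".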
 
 
  - Base model: `Qwen/Qwen3.5-2B`
  - Training setup: `2 epochs, full fine-tuning, same-count data setting`
+ - Evaluation snapshot: `2026-05-08 16:03:10 UTC`
  - Evaluation result id: `qwen35_2b_sft_samecount_e2`
 
  ## Evaluation results
 
  Evaluation was run with vLLM on the corrected TB2-lite replay set. The ranking score uses only `100 * avg_command_f1`; `first_cmd_exact_pct` is treated as a secondary metric only.
 
+ - Rank: `1 / 56`
  - Score: `39.52`
  - Command F1: `0.3952`
  - Command precision: `0.5082`
 
  - First command exact: `33.0%`
  - Valid JSON: `82.2%`
  - Steps / tasks: `303 / 50`
+ - Sec/step: `0.081`
+ - Load time: `97.1s`
  - Template status: `chat_template`
  - Rank eligible: `True`
  - Eval timestamp: `2026-05-07T22:06:25.457045`
+ - Number of evaluation results aggregated so far: `56`
 
  Prompt/template audit:
 
 
  - Tends to issue a small number of correct commands conservatively rather than many incorrect ones.
  - In this evaluation, the Qwen family was generally strong in command JSON stability and command F1.
 
+ ## Model family interpretation
+
+ - The Qwen family has a strong base prior, and in this corrected evaluation it again scored in the top tier with the chat template path applied correctly.
+ - The evaluation code gives no bonus based on the model name and uses only `100 * avg_command_f1` as the ranking score. The high score is interpreted as a good fit between the terminal next-action format and the base/SFT combination, not as the product of Qwen-specific code.
+ - Speed is on the fast side, at about `0.081` sec/step.
+ - RL candidacy: as a top-tier SFT model, it is a candidate baseline for reward tuning/GRPO comparisons.
+
  ## Limitations and caveats
 
  - Recall is relatively low, so some required commands may be omitted.
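
Reviewer note: the card ranks models by `100 * avg_command_f1` but does not show how per-step command precision, recall, and F1 are derived. The sketch below illustrates one common way to do it, assuming commands are compared as exact strings after whitespace stripping and that F1 is averaged over steps. The helper names `command_prf` and `ranking_score` are hypothetical, and the actual matching rules in `replay_eval.py` may differ.

```python
from collections import Counter


def command_prf(predicted, reference):
    """Return (precision, recall, f1) for one step, comparing commands as multisets."""
    pred = Counter(c.strip() for c in predicted if c.strip())
    ref = Counter(c.strip() for c in reference if c.strip())
    overlap = sum((pred & ref).values())  # commands present in both, with multiplicity
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def ranking_score(steps):
    """Average the per-step F1 and scale by 100, mirroring `100 * avg_command_f1`."""
    if not steps:
        return 0.0
    f1_values = [command_prf(pred, ref)[2] for pred, ref in steps]
    return 100 * sum(f1_values) / len(f1_values)


# Toy data: (predicted commands, reference commands) per step.
steps = [
    (["ls -la\n"], ["ls -la\n"]),    # exact match
    (["pwd\n"], ["pwd\n", "ls\n"]),  # one of two reference commands found
    ([], ["du -sh .\n"]),            # nothing predicted
]
print(round(ranking_score(steps), 2))  # 55.56
```

Because the score averages F1 per step, it cannot in general be recovered from the reported average precision and recall alone. The second toy step also shows the pattern described in the card: high precision with lower recall when the model emits fewer commands than the reference.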