gyung committed
Commit 39e0a55 · verified · 1 Parent(s): e24432a

Update model card with corrected TB2-lite evaluation

Files changed (1)
  1. README.md +143 -0
README.md ADDED
@@ -0,0 +1,143 @@
+ ---
+ language:
+ - en
+ - ko
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - terminal
+ - sft
+ - vllm
+ - tb2-lite
+ base_model: ByteDance/Ouro-1.4B-Thinking
+ ---
+
+ # LLM-OS-Models/Ouro-1.4B-Thinking-Terminal-SFT
+
+ A Terminal SFT model for automating terminal tasks. It is trained to read the given task and the previous terminal state, then generate the next commands to run as JSON.
+
+ ## Model Summary
+
+ - Base model: `ByteDance/Ouro-1.4B-Thinking`
+ - Training setup: `Terminal SFT`
+ - Evaluation snapshot: `2026-05-07 22:48:08 UTC`
+ - Evaluation result id: `ouro_1p4b_thinking_terminal_sft`
+
+ ## Usage
+
+ Transformers example:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_id = "LLM-OS-Models/Ouro-1.4B-Thinking-Terminal-SFT"
+ # trust_remote_code=True allows the custom model code shipped with the checkpoint to run.
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ messages = [
+     {"role": "system", "content": "You are a terminal automation assistant. Return JSON only."},
+     {"role": "user", "content": "List the current directory and identify Python files."},
+ ]
+ # Render the chat template to text, then tokenize it.
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ # Greedy decoding; print only the newly generated tokens, special tokens included.
+ outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
+ print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=False))
+ ```
+
+ vLLM example:
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "LLM-OS-Models/Ouro-1.4B-Thinking-Terminal-SFT"
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ llm = LLM(model=model_id, dtype="bfloat16", trust_remote_code=True)
+ messages = [{"role": "user", "content": "Show disk usage for the current folder."}]
+ # Apply the chat template manually so the prompt matches the training format.
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ # temperature=0.0 selects greedy decoding.
+ result = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=512))
+ print(result[0].outputs[0].text)
+ ```
+
+ Recommended output format:
+
+ ```json
+ {
+   "analysis": "brief reasoning about the next terminal action",
+   "plan": "short execution plan",
+   "commands": [
+     {"keystrokes": "ls -la\n", "duration": 0.1}
+   ],
+   "task_complete": false
+ }
+ ```
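Since not every response is valid JSON, a caller may want to check each response against the recommended format before acting on it. The following is a minimal sketch and not part of this repository: `parse_action` is a hypothetical helper, and treating all four top-level fields as required is an assumption based only on the example above.

```python
import json


def parse_action(raw):
    """Parse one model response and check it against the recommended format.

    Hypothetical helper: raises ValueError when the response is not usable.
    """
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(action, dict):
        raise ValueError("top-level value must be a JSON object")

    # Assumption: all four fields from the recommended format are required.
    required = (("analysis", str), ("plan", str),
                ("commands", list), ("task_complete", bool))
    for key, expected in required:
        if not isinstance(action.get(key), expected):
            raise ValueError(f"field {key!r} missing or not {expected.__name__}")

    for cmd in action["commands"]:
        if not isinstance(cmd, dict) or not isinstance(cmd.get("keystrokes"), str):
            raise ValueError("each command needs a string 'keystrokes'")
        duration = cmd.get("duration")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(duration, bool) or not isinstance(duration, (int, float)):
            raise ValueError("each command needs a numeric 'duration'")
    return action


raw = ('{"analysis": "a", "plan": "p", '
       '"commands": [{"keystrokes": "ls -la\\n", "duration": 0.1}], '
       '"task_complete": false}')
action = parse_action(raw)
print(repr(action["commands"][0]["keystrokes"]))
```

A caller could catch `ValueError` and regenerate the response a bounded number of times instead of executing a malformed action.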
+
+ ## Evaluation Results
+
+ Evaluation was run with vLLM on the corrected TB2-lite replay set. The ranking score uses only `100 * avg_command_f1`; `first_cmd_exact_pct` is treated as a secondary metric.
+
+ - Rank: `22 / 47`
+ - Score: `31.74`
+ - Command F1: `0.3174`
+ - Command precision: `0.4062`
+ - Command recall: `0.3410`
+ - First command exact: `24.8%`
+ - Valid JSON: `63.7%`
+ - Steps / tasks: `303 / 50`
+ - Template status: `chat_template`
+ - Rank eligible: `True`
+ - Eval timestamp: `2026-05-07T22:48:02.585588`
+ - Evaluation results aggregated so far: `47`
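To make the ranking score concrete, here is a hedged sketch of how a per-step command F1 could be computed and averaged. The matching rule (exact match on whitespace-normalized command strings, compared as sets) and the names `command_prf` and `rank_score` are assumptions for illustration; the actual TB2-lite evaluation logic may differ.

```python
def command_prf(predicted, reference):
    """Set-based precision, recall and F1 for one step.

    Assumption: commands match when equal after whitespace normalization.
    """
    pred = {" ".join(c.split()) for c in predicted}
    ref = {" ".join(c.split()) for c in reference}
    pred.discard("")
    ref.discard("")
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def rank_score(steps):
    """Return 100 * mean per-step F1 over (predicted, reference) pairs."""
    f1_values = [command_prf(pred, ref)[2] for pred, ref in steps]
    return 100 * sum(f1_values) / len(f1_values) if f1_values else 0.0


# Two illustrative steps (made-up data, not taken from the replay set).
steps = [
    (["ls -la\n"], ["ls  -la\n", "pwd\n"]),  # one of two reference commands found
    (["cat a.txt\n"], ["cat b.txt\n"]),      # no overlap
]
print(round(rank_score(steps), 2))
```

If F1 is averaged per step as sketched here, the reported F1 does not have to equal the harmonic mean of the averaged precision and recall.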
+
+ Example reproduction command:
+
+ ```bash
+ python tb2_lite/scripts/replay_eval.py \
+   --model LLM-OS-Models/Ouro-1.4B-Thinking-Terminal-SFT \
+   --model-short ouro_1p4b_thinking_terminal_sft \
+   --eval-path tb2_lite/data/replay_full.jsonl \
+   --output-dir /home/work/.data/tb2_lite_eval/corrected_readme_models_vllm \
+   --dtype bfloat16 \
+   --max-model-len 49152 \
+   --max-tokens 1024 \
+   --temperature 0.0 \
+   --top-p 1.0 \
+   --gpu-memory-utilization 0.94 \
+   --language-model-only
+ ```
+
+ Prompt/template audit:
+
+ ```json
+ {
+   "template_status": "chat_template",
+   "rank_eligible": true,
+   "steps": 303,
+   "tasks": 50
+ }
+ ```
+
+ ## Strengths
+
+ - For certain model sizes and acceleration paths, fast inference relative to cost can be expected.
+ - Tends to emit fewer but correct commands rather than many incorrect ones.
+
+ ## Limitations and Cautions
+
+ - Recall is relatively low (`0.3410`), so some required commands may be omitted.
+ - JSON formatting fails on some steps (valid JSON: `63.7%`), so responses need parse validation and retries before execution.
+ - For the Ouro family, assistant-only masking and whether the prompt template matches strongly affect how the results should be interpreted.
+ - This is an SFT model for assisting automated terminal operation; it does not guarantee general chat or general-purpose reasoning performance.
+ - Generated commands must pass safeguards such as a sandbox, an allowlist, or human review before they are actually executed.
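As one illustration of the allowlist safeguard mentioned above, the sketch below rejects any generated command whose program is not explicitly permitted. `ALLOWED_PROGRAMS`, `FORBIDDEN_CHARS`, and `is_allowed` are hypothetical names chosen for this example, the allowlist contents are an assumption, and a check like this complements a sandbox and human review rather than replacing them.

```python
import shlex

# Hypothetical allowlist of read-only programs; adjust to your environment.
ALLOWED_PROGRAMS = {"ls", "pwd", "cat", "du", "grep", "wc"}
# Shell metacharacters that could chain, redirect, or substitute commands.
FORBIDDEN_CHARS = set(";&|><`$(){}")


def is_allowed(keystrokes):
    """Return True only for a single, simple command from the allowlist."""
    line = keystrokes.strip()
    # Reject empty input and multi-line keystrokes.
    if not line or "\n" in line:
        return False
    if any(ch in FORBIDDEN_CHARS for ch in line):
        return False
    try:
        argv = shlex.split(line)
    except ValueError:
        # Unbalanced quotes and similar parse errors are rejected.
        return False
    return bool(argv) and argv[0] in ALLOWED_PROGRAMS


for candidate in ["ls -la\n", "rm -rf /\n", "ls; rm -rf /\n"]:
    print(repr(candidate), is_allowed(candidate))
```

The check is deliberately conservative: it refuses pipes and redirections outright, which also blocks some harmless commands, so the rule set would need tuning for real workloads.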
+
+ ## Interpretation Notes
+
+ The TB2-lite score is not a general intelligence benchmark; it measures how well a model reproduces terminal next-action JSON. Model size, chat template match, assistant-only masking, the tokenizer, and whether the training data was held out all therefore affect the score.
+
+ If the values in README.md and MODEL_EVALUATION_REPORT.md are more recent, check those first. This model card is a snapshot, copied quickly into the individual repository from the completed evaluation JSON.