gyung committed on
Commit 581e231 · verified · 1 Parent(s): 39e0a55

Update model card with corrected TB2-lite evaluation

Files changed (1)
  1. README.md +12 -75
README.md CHANGED
@@ -20,69 +20,14 @@ base_model: ByteDance/Ouro-1.4B-Thinking
 
  - Base model: `ByteDance/Ouro-1.4B-Thinking`
  - Training setup: `Terminal SFT`
- - Evaluation snapshot: `2026-05-07 22:48:08 UTC`
  - Evaluation result id: `ouro_1p4b_thinking_terminal_sft`
 
- ## Usage
-
- Transformers example:
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- import torch
-
- model_id = "LLM-OS-Models/Ouro-1.4B-Thinking-Terminal-SFT"
- tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
- model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     torch_dtype=torch.bfloat16,
-     device_map="auto",
-     trust_remote_code=True,
- )
-
- messages = [
-     {"role": "system", "content": "You are a terminal automation assistant. Return JSON only."},
-     {"role": "user", "content": "List the current directory and identify Python files."},
- ]
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
- outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
- print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=False))
- ```
-
- vLLM example:
-
- ```python
- from vllm import LLM, SamplingParams
- from transformers import AutoTokenizer
-
- model_id = "LLM-OS-Models/Ouro-1.4B-Thinking-Terminal-SFT"
- tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
- llm = LLM(model=model_id, dtype="bfloat16", trust_remote_code=True)
- messages = [{"role": "user", "content": "Show disk usage for the current folder."}]
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- result = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=512))
- print(result[0].outputs[0].text)
- ```
-
- Recommended output format:
-
- ```json
- {
-   "analysis": "brief reasoning about the next terminal action",
-   "plan": "short execution plan",
-   "commands": [
-     {"keystrokes": "ls -la\n", "duration": 0.1}
-   ],
-   "task_complete": false
- }
- ```
-
  ## Evaluation results
 
  Evaluation was run with vLLM on the corrected TB2-lite replay set. The ranking score uses only `100 * avg_command_f1`; `first_cmd_exact_pct` is treated as a secondary metric only.
 
- - Rank: `22 / 47`
  - Score: `31.74`
  - Command F1: `0.3174`
  - Command precision: `0.4062`
@@ -90,27 +35,12 @@ print(result[0].outputs[0].text)
  - First command exact: `24.8%`
  - Valid JSON: `63.7%`
  - Steps / tasks: `303 / 50`
  - Template status: `chat_template`
  - Rank eligible: `True`
  - Eval timestamp: `2026-05-07T22:48:02.585588`
- - Number of evaluation results aggregated so far: `47`
-
- Example reproduction command:
-
- ```bash
- python tb2_lite/scripts/replay_eval.py \
-   --model LLM-OS-Models/Ouro-1.4B-Thinking-Terminal-SFT \
-   --model-short ouro_1p4b_thinking_terminal_sft \
-   --eval-path tb2_lite/data/replay_full.jsonl \
-   --output-dir /home/work/.data/tb2_lite_eval/corrected_readme_models_vllm \
-   --dtype bfloat16 \
-   --max-model-len 49152 \
-   --max-tokens 1024 \
-   --temperature 0.0 \
-   --top-p 1.0 \
-   --gpu-memory-utilization 0.94 \
-   --language-model-only
- ```
 
  Prompt/template audit:
 
@@ -128,6 +58,13 @@ Prompt/template audit:
  - Fast inference relative to cost can be expected on certain size/acceleration paths.
  - It tends to issue correct commands conservatively rather than emit many incorrect ones.
 
  ## Limitations and caveats
 
  - Recall is relatively low, so some required commands may be omitted.
 
 
  - Base model: `ByteDance/Ouro-1.4B-Thinking`
  - Training setup: `Terminal SFT`
+ - Evaluation snapshot: `2026-05-08 16:03:42 UTC`
  - Evaluation result id: `ouro_1p4b_thinking_terminal_sft`
 
  ## Evaluation results
 
  Evaluation was run with vLLM on the corrected TB2-lite replay set. The ranking score uses only `100 * avg_command_f1`; `first_cmd_exact_pct` is treated as a secondary metric only.
 
+ - Rank: `25 / 56`
  - Score: `31.74`
  - Command F1: `0.3174`
  - Command precision: `0.4062`
 
  - First command exact: `24.8%`
  - Valid JSON: `63.7%`
  - Steps / tasks: `303 / 50`
+ - Sec/step: `1.698`
+ - Load time: `92.4s`
  - Template status: `chat_template`
  - Rank eligible: `True`
  - Eval timestamp: `2026-05-07T22:48:02.585588`
+ - Number of evaluation results aggregated so far: `56`
 
  Prompt/template audit:
 
 
  - Fast inference relative to cost can be expected on certain size/acceleration paths.
  - It tends to issue correct commands conservatively rather than emit many incorrect ones.
 
+ ## Model family interpretation
+
+ - The Ouro family gains score well with Thinking SFT, but under the same evaluation criteria its sec/step is higher than LFM/Qwen, so large-scale RL iteration is costly.
+ - The score is meaningful, but the speed bottleneck makes it better suited to ablations that confirm top candidates than to a primary role.
+ - Speed is about `1.698` sec/step, which is on the fast side.
+ - RL candidacy: on the current score alone, it is closer to a secondary/comparison model than a primary candidate.
+
  ## Limitations and caveats
 
  - Recall is relatively low, so some required commands may be omitted.
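
The model card states that the ranking score is `100 * avg_command_f1`, which matches the reported `Score: 31.74` and `Command F1: 0.3174`. The following is a minimal sketch of how such a score could be computed. The matching rule (exact-string multiset overlap per step), the treatment of steps with no commands, and the names `command_f1` and `rank_score` are assumptions made for illustration; the diff does not show how `replay_eval.py` actually matches commands.

```python
from collections import Counter


def command_f1(predicted, reference):
    """Return (precision, recall, f1) for one replay step.

    ASSUMPTION: commands are compared as exact strings using multiset
    overlap. The real TB2-lite matching rule is not shown in the source.
    """
    if not predicted and not reference:
        # ASSUMPTION: a step with nothing expected and nothing emitted is perfect.
        return 1.0, 1.0, 1.0
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def rank_score(step_f1s):
    """Ranking score as described in the model card: 100 * average command F1."""
    return 100.0 * sum(step_f1s) / len(step_f1s)


# Hypothetical (predicted, reference) command lists for three steps.
steps = [
    (["ls -la\n"], ["ls -la\n", "cat README.md\n"]),  # one of two commands found
    (["pwd\n"], ["pwd\n"]),                            # exact match
    (["rm x\n"], ["ls\n"]),                            # no overlap
]
f1s = [command_f1(pred, ref)[2] for pred, ref in steps]
score = rank_score(f1s)
print(f"score = {score:.2f}")
```

Averaging per-step F1 (rather than pooling all commands first) means the aggregate F1 cannot be recovered from the aggregate precision alone, which is one reason the card reports these figures separately.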