gyung committed on
Commit 2b0c216 · verified · 1 Parent(s): d3ef7ce

Update model card with corrected TB2-lite evaluation

Files changed (1)
  1. README.md +120 -36
README.md CHANGED
@@ -1,59 +1,143 @@
  ---
- base_model: Qwen/Qwen3.5-2B
  library_name: transformers
- model_name: Qwen__Qwen3.5-2B__terminal_sft_2epoch_fullft_samecount
  tags:
- - generated_from_trainer
- - unsloth
- - trl
  - sft
- licence: license
  ---

- # Model Card for Qwen__Qwen3.5-2B__terminal_sft_2epoch_fullft_samecount

- This model is a fine-tuned version of [Qwen/Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B).
- It has been trained using [TRL](https://github.com/huggingface/trl).

- ## Quick start

  ```python
- from transformers import pipeline

- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="None", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
  ```

- ## Training procedure

-


- This model was trained with SFT.

- ### Framework versions

- - TRL: 0.24.0
- - Transformers: 5.5.0
- - Pytorch: 2.10.0
- - Datasets: 4.3.0
- - Tokenizers: 0.22.2

- ## Citations



- Cite TRL as:
-
- ```bibtex
- @misc{vonwerra2022trl,
- title = {{TRL: Transformer Reinforcement Learning}},
- author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
- year = 2020,
- journal = {GitHub repository},
- publisher = {GitHub},
- howpublished = {\url{https://github.com/huggingface/trl}}
  }
- ```

  ---
+ language:
+ - en
+ - ko
  library_name: transformers
+ pipeline_tag: text-generation
  tags:
+ - terminal
  - sft
+ - vllm
+ - tb2-lite
+ base_model: Qwen/Qwen3.5-2B
  ---
 
+ # LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount
+
+ A Terminal SFT model for terminal task automation. It was trained to read the given task and the previous terminal state, then generate the next command to run as JSON.

+ ## Model summary

+ - Base model: `Qwen/Qwen3.5-2B`
+ - Training setup: `2 epochs, full fine-tuning, same-count data setting`
+ - Evaluation snapshot: `2026-05-07 22:36:47 UTC`
+ - Evaluation result id: `qwen35_2b_sft_samecount_e2`
+
+ ## Usage
+
+ Transformers example:

  ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_id = "LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount"
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )

+ messages = [
+     {"role": "system", "content": "You are a terminal automation assistant. Return JSON only."},
+     {"role": "user", "content": "List the current directory and identify Python files."},
+ ]
+ # Render the chat template to plain text, then tokenize it.
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ # Greedy decoding (do_sample=False) keeps the output deterministic.
+ outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
+ # Decode only the newly generated tokens, not the prompt.
+ print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=False))
  ```

+ vLLM example:

+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer

+ model_id = "LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount"
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ llm = LLM(model=model_id, dtype="bfloat16", trust_remote_code=True)
+ messages = [{"role": "user", "content": "Show disk usage for the current folder."}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ result = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=512))
+ print(result[0].outputs[0].text)
+ ```
 
+ Recommended output format:

+ ```json
+ {
+   "analysis": "brief reasoning about the next terminal action",
+   "plan": "short execution plan",
+   "commands": [
+     {"keystrokes": "ls -la\n", "duration": 0.1}
+   ],
+   "task_complete": false
+ }
+ ```
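
The recommended output format can be checked before any command is run. The following is a minimal sketch and not part of the model card: `parse_action` and `REQUIRED_KEYS` are hypothetical names, and the key and type checks are assumptions inferred only from the example JSON, not from a published schema.

```python
import json
from typing import Optional

# Keys and types inferred from the example output; an assumption, not a published schema.
REQUIRED_KEYS = {"analysis": str, "plan": str, "commands": list, "task_complete": bool}


def parse_action(text: str) -> Optional[dict]:
    """Return the parsed action, or None if the text does not match the assumed format."""
    try:
        action = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(action.get(key), expected_type):
            return None
    for command in action["commands"]:
        if not isinstance(command, dict):
            return None
        if not isinstance(command.get("keystrokes"), str):
            return None
        duration = command.get("duration")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(duration, bool) or not isinstance(duration, (int, float)):
            return None
    return action


good = '{"analysis": "a", "plan": "p", "commands": [{"keystrokes": "ls -la\\n", "duration": 0.1}], "task_complete": false}'
bad = '{"analysis": "a", "commands": "ls"}'
print(parse_action(good) is not None)  # True
print(parse_action(bad))  # None
```

Returning `None` rather than raising lets a caller decide whether to retry generation or stop, which matters given that the card reports a valid-JSON rate below 100%.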
 
+ ## Evaluation results

+ Evaluation was run with vLLM on the corrected TB2-lite replay set. The ranking score uses only `100 * avg_command_f1`; `first_cmd_exact_pct` is treated as a secondary metric only.

+ - Rank: `1 / 41`
+ - Score: `39.52`
+ - Command F1: `0.3952`
+ - Command precision: `0.5082`
+ - Command recall: `0.4101`
+ - First command exact: `33.0%`
+ - Valid JSON: `82.2%`
+ - Steps / tasks: `303 / 50`
+ - Template status: `chat_template`
+ - Rank eligible: `True`
+ - Eval timestamp: `2026-05-07T22:06:25.457045`
+ - Number of evaluation results aggregated so far: `41`
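
The reported Command F1 (`0.3952`) is lower than the harmonic mean of the reported precision and recall (about `0.454`), which is consistent with F1 being averaged per step, as the name `avg_command_f1` suggests. The sketch below illustrates that kind of averaging. It is an assumption-laden illustration: the card does not state how TB2-lite matches commands, so exact matching on stripped command strings, and the helper names `command_prf` and `macro_average`, are hypothetical.

```python
def command_prf(predicted: list[str], reference: list[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one step (assumed exact string match)."""
    pred = {c.strip() for c in predicted}
    ref = {c.strip() for c in reference}
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def macro_average(steps: list[tuple[list[str], list[str]]]) -> tuple[float, float, float]:
    """Average per-step precision, recall, and F1 across all steps."""
    scores = [command_prf(pred, ref) for pred, ref in steps]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Toy data, not taken from the evaluation set.
steps = [
    (["ls -la\n"], ["ls -la\n", "pwd\n"]),   # P=1.0, R=0.5, F1=2/3
    (["cat a.txt\n", "pwd\n"], ["pwd\n"]),   # P=0.5, R=1.0, F1=2/3
    (["rm x\n"], ["ls\n"]),                  # P=0.0, R=0.0, F1=0.0
]
avg_p, avg_r, avg_f1 = macro_average(steps)
score = 100 * avg_f1
print(round(avg_p, 4), round(avg_r, 4), round(avg_f1, 4), round(score, 2))
```

In this toy case the averaged precision and recall are both 0.5, yet the averaged F1 is 4/9, so the ranking score is about 44.44 rather than 50.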
 
+ Example reproduction command:

+ ```bash
+ python tb2_lite/scripts/replay_eval.py \
+   --model LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount \
+   --model-short qwen35_2b_sft_samecount_e2 \
+   --eval-path tb2_lite/data/replay_full.jsonl \
+   --output-dir /home/work/.data/tb2_lite_eval/corrected_readme_models_vllm \
+   --dtype bfloat16 \
+   --max-model-len 49152 \
+   --max-tokens 1024 \
+   --temperature 0.0 \
+   --top-p 1.0 \
+   --gpu-memory-utilization 0.94 \
+   --language-model-only
+ ```
+
+ Prompt/template audit:
+
+ ```json
+ {
+   "template_status": "chat_template",
+   "rank_eligible": true,
+   "steps": 303,
+   "tasks": 50
  }
+ ```
+
+ ## Strengths
+
+ - Currently scores in the top tier on corrected TB2-lite, with stable reproduction of terminal commands.
+ - Tends to emit correct commands conservatively rather than many incorrect ones.
+ - In this evaluation, the Qwen family was strong overall in command JSON stability and command F1.
+
+ ## Limitations and cautions
+
+ - Recall is relatively low (`0.4101` against a precision of `0.5082`), so some required commands may be omitted.
+ - JSON formatting failures occur (valid JSON: `82.2%`), so outputs need parse validation and retries before execution.
+ - This is an SFT model for assisting automated terminal operation; general chat and general-purpose reasoning performance are not guaranteed.
+ - Generated commands should pass through safeguards such as a sandbox, an allowlist, and human review before they are actually executed.
+
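
One of the safeguards named above, an allowlist, can be sketched as a gate on the program name of each generated command. This is a deliberately strict illustration, not the authors' tooling: `ALLOWED_PROGRAMS`, `SHELL_METACHARACTERS`, and `is_allowed` are hypothetical names, and the set of permitted programs is an arbitrary example.

```python
import shlex

# Hypothetical allowlist; a real deployment would choose its own set.
ALLOWED_PROGRAMS = {"ls", "pwd", "cat", "du", "grep"}
# Reject anything that could chain, redirect, or substitute commands.
SHELL_METACHARACTERS = set(";&|`$<>()\n")


def is_allowed(keystrokes: str) -> bool:
    """Return True only for a single, simple command whose program is allowlisted."""
    command = keystrokes.strip()  # drops the trailing newline the model emits
    if not command or any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in ALLOWED_PROGRAMS


for candidate in ["ls -la\n", "du -sh .\n", "rm -rf /\n", "ls; rm -rf /\n"]:
    print(repr(candidate), is_allowed(candidate))
```

A check like this complements a sandbox and human review rather than replacing them: it only inspects the program name, so an allowlisted program can still be given harmful arguments.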
+ ## Interpretation notes
+
+ The TB2-lite score is not a general intelligence benchmark; it measures how well a model reproduces terminal next-action JSON. Model size, chat template match, assistant-only masking, the tokenizer, and whether the training data was held out therefore all affect the score.
+
+ If the values in README.md and MODEL_EVALUATION_REPORT.md are more recent, check those first. This model card is a snapshot, quickly propagated to the individual repository from the completed evaluation JSON.