---
language:
- en
- ko
library_name: transformers
pipeline_tag: text-generation
tags:
- terminal
- sft
- vllm
- tb2-lite
base_model: ByteDance/Ouro-2.6B-Thinking
---
# LLM-OS-Models/Ouro-2.6B-Thinking-Terminal-SFT
A Terminal SFT model for automating terminal tasks. It is trained to read the given task and the previous terminal state, then generate the next commands to run as JSON.
## Model Summary
- Base model: `ByteDance/Ouro-2.6B-Thinking`
- Training setup: `Terminal SFT`
- Evaluation snapshot: `2026-05-09 00:57:17 UTC`
- Evaluation result id: `ouro_2p6b_thinking_terminal_sft`
## Quickstart
Install and log in:
```bash
pip install -U vllm transformers huggingface_hub
huggingface-cli login
```
Related code:
- GitHub: https://github.com/LLM-OS-Models/Terminal
- vLLM evaluation runner: `tb2_lite/scripts/replay_eval.py`
- Chat template/fallback rendering: `tb2_lite/scripts/prompt_builder.py`
- JSON/command scoring: `tb2_lite/scripts/replay_metrics.py`
Example of running the model directly with vLLM. As in the evaluation code, the tokenizer's chat template is tried first; if no template is available, a ChatML/Gemma fallback is used.
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "LLM-OS-Models/Ouro-2.6B-Thinking-Terminal-SFT"
tp = 1  # tensor parallel size; raise it if the model does not fit on one GPU

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
llm = LLM(
    model=model_id,
    tokenizer=model_id,
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=tp,
    max_model_len=49152,
    gpu_memory_utilization=0.92,
)

messages = [
    {"role": "system", "content": "You are a terminal automation assistant. Return JSON only."},
    {"role": "user", "content": "Inspect the current directory and list Python files."},
]


def render_chatml(messages):
    """ChatML fallback for tokenizers without a chat template."""
    parts = []
    for message in messages:
        role = message["role"]
        if role == "tool":
            role = "user"  # tool output is replayed as a user turn
        parts.append(f"<|im_start|>{role}\n{message['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def render_gemma4_turn(messages, empty_thought_channel=False):
    """Gemma 4 turn-format fallback; not used for this Ouro model."""
    parts = ["<bos>"]
    for message in messages:
        role = "model" if message["role"] == "assistant" else message["role"]
        if role == "tool":
            role = "user"
        parts.append(f"<|turn>{role}\n{message['content'].strip()}<turn|>\n")
    parts.append("<|turn>model\n")
    if empty_thought_channel:
        parts.append("<|channel>thought\n<channel|>")
    return "".join(parts)


def render_prompt(model_id, tokenizer, messages):
    """Prefer the tokenizer's chat template; fall back to a manual renderer."""
    model_key = model_id.lower()
    if "gemma-4" in model_key:
        try:
            return tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True,
                enable_thinking=False,
            )
        except Exception:
            return render_gemma4_turn(
                messages,
                empty_thought_channel=("26b" in model_key or "31b" in model_key),
            )
    try:
        return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    except Exception:
        return render_chatml(messages)


prompt = render_prompt(model_id, tokenizer, messages)

# Greedy decoding, matching the evaluation settings.
sampling = SamplingParams(
    temperature=0.0,
    top_p=1.0,
    max_tokens=1024,
    repetition_penalty=1.0,
)
outputs = llm.generate([prompt], sampling_params=sampling)
print(outputs[0].outputs[0].text)
```
Recommended output format:
```json
{
  "analysis": "brief reasoning about the next terminal action",
  "plan": "short execution plan",
  "commands": [
    {"keystrokes": "ls -la\n", "duration": 0.1}
  ],
  "task_complete": false
}
```
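Because downstream code consumes this JSON, it can help to check the shape of a response before acting on it. The following is a minimal sketch, not the repository's scorer: `parse_action` and `REQUIRED_KEYS` are hypothetical names, and treating all four top-level keys as required is an assumption based only on the format shown above.

```python
import json

# Assumption: all four top-level keys of the recommended format are required.
REQUIRED_KEYS = {"analysis", "plan", "commands", "task_complete"}


def parse_action(text):
    """Return the parsed action dict, or None if the text is not a usable action."""
    # Assumption: the JSON object may be wrapped in extra text, so take the
    # span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        action = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict) or not REQUIRED_KEYS.issubset(action):
        return None
    if not isinstance(action["task_complete"], bool):
        return None
    commands = action["commands"]
    if not isinstance(commands, list):
        return None
    for command in commands:
        if not isinstance(command, dict):
            return None
        if not isinstance(command.get("keystrokes"), str):
            return None
        duration = command.get("duration")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(duration, bool) or not isinstance(duration, (int, float)):
            return None
    return action
```

A caller could treat a `None` result as a signal to re-prompt the model or skip the step.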
Replay command identical to the one used for evaluation:
```bash
python tb2_lite/scripts/replay_eval.py \
  --model LLM-OS-Models/Ouro-2.6B-Thinking-Terminal-SFT \
  --model-short ouro_2p6b_thinking_terminal_sft \
  --eval-path tb2_lite/data/replay_full.jsonl \
  --output-dir /home/work/.data/tb2_lite_eval/corrected_readme_models_vllm \
  --dtype bfloat16 \
  --tp 1 \
  --max-model-len 49152 \
  --max-tokens 1024 \
  --temperature 0.0 \
  --top-p 1.0 \
  --gpu-memory-utilization 0.92 \
  --language-model-only
```
- Recommended default tensor parallel size: `1`. On OOM, raise `--tp` and `tensor_parallel_size` to 2/4/8.
- The corrected TB2-lite evaluation is fixed at `temperature=0.0`, `top_p=1.0`, `max_tokens=1024`.
- Gemma 4 uses `enable_thinking=False` to get JSON output, and the evaluation code automatically applies the empty thought channel handling for the 26B/31B variants.
## Evaluation Results
The evaluation was run with vLLM on the corrected TB2-lite replay set. The ranking score uses only `100 * avg_command_f1`; `first_cmd_exact_pct` is treated as a secondary metric only.
- Rank: `14 / 56`
- Score: `35.61`
- Command F1: `0.3561`
- Command precision: `0.4586`
- Command recall: `0.3647`
- First command exact: `25.1%`
- Valid JSON: `61.1%`
- Steps / tasks: `303 / 50`
- Sec/step: `3.358`
- Load time: `135.3s`
- Template status: `chat_template`
- Rank eligible: `True`
- Eval timestamp: `2026-05-07T22:57:10.191295`
- Evaluation results aggregated so far: `56`
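To make the ranking score `100 * avg_command_f1` concrete, here is a rough sketch of how a per-step command F1 could be averaged into a score. It is an illustration under stated assumptions, not the logic of `tb2_lite/scripts/replay_metrics.py`: the set-based exact matching of command strings and the helper names `command_prf` and `rank_score` are hypothetical. The reported Command F1 (`0.3561`) is lower than the harmonic mean of the reported precision and recall (about 0.406), which would be consistent with averaging F1 per step, although the card does not state this.

```python
def command_prf(predicted, reference):
    """Precision, recall, and F1 for one step.

    Assumption: commands are compared as a set of exact strings.
    """
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    if overlap == 0:
        # Covers empty predictions, empty references, and no matches.
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def rank_score(steps):
    """Average the per-step F1 over all steps and scale to 0..100."""
    f1_values = [command_prf(pred, ref)[2] for pred, ref in steps]
    return 100 * sum(f1_values) / len(f1_values)


# Three toy steps: exact match, one extra command, complete miss.
toy_steps = [
    (["ls"], ["ls"]),
    (["ls", "pwd"], ["ls"]),
    (["cat a"], ["cat b"]),
]
print(round(rank_score(toy_steps), 2))
```

With this definition, the card's Score of `35.61` corresponds to an average command F1 of `0.3561`.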
Prompt/template audit:
```json
{
  "template_status": "chat_template",
  "rank_eligible": true,
  "steps": 303,
  "tasks": 50
}
```
## Strengths
- The score sits in the upper-middle tier (`14 / 56`), and basic terminal next-action imitation is relatively stable.
- The model tends to emit fewer but correct commands rather than many wrong ones (precision `0.4586` vs. recall `0.3647`).
## Model Family Interpretation
- The Ouro family's scores improve well with Thinking SFT, but under the same evaluation criteria its sec/step is higher than LFM/Qwen, so large-scale repeated RL runs are expensive.
- The score is meaningful, but the speed bottleneck makes the model better suited to ablations that confirm top candidates than to a primary role.
- Speed is around `3.358` sec/step.
- RL candidacy: the score is sufficient, but given the speed cost it is safer to keep this model as a small-scale comparison candidate.
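As a back-of-the-envelope view of the speed cost, the reported numbers can be combined into a wall-clock estimate for one full replay pass. This sketch assumes that sec/step is a mean wall-clock time per step and that the load time is paid once per pass; neither is stated explicitly in the card, and `replay_pass_seconds` is a hypothetical helper.

```python
# Values copied from the evaluation results in this card.
steps = 303
sec_per_step = 3.358
load_time_s = 135.3


def replay_pass_seconds(n_steps, sec_per_step, load_time_s=0.0):
    """Estimated wall-clock seconds for one pass (assumes a constant per-step cost)."""
    return load_time_s + n_steps * sec_per_step


one_pass = replay_pass_seconds(steps, sec_per_step, load_time_s)
print(f"{one_pass / 60:.1f} min")  # prints "19.2 min"
```

Under these assumptions, a single pass over the 303 steps takes roughly 19 minutes, so any RL loop that repeats such passes many times scales accordingly.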
## Limitations and Cautions
- Recall is relatively low, so some required commands may be missed.
- JSON formatting failures occur (Valid JSON: `61.1%`), so parse validation and retries are needed before execution.
- For the Ouro family, assistant-only masking and whether the prompt template matches strongly affect how performance should be interpreted.
- This is an SFT model for assisting automated terminal operation; it does not guarantee general chat or general-purpose reasoning performance.
- Generated commands must pass safeguards such as a sandbox, an allowlist, and human review before actual execution.
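One way an allowlist gate might look is sketched below. This is an illustrative assumption rather than part of the evaluation code: `ALLOWED_PROGRAMS`, `SHELL_METACHARS`, `is_allowed`, and `filter_commands` are hypothetical names, and the allowed programs are only examples. An allowlist alone is not sufficient; it is meant to complement a sandbox and human review, not replace them.

```python
import shlex

# Hypothetical allowlist of read-only programs; adjust for your environment.
ALLOWED_PROGRAMS = {"ls", "pwd", "cat", "grep", "head", "tail"}
# Reject anything that could chain, redirect, or substitute commands.
SHELL_METACHARS = (";", "&", "|", ">", "<", "`", "$(")


def is_allowed(keystrokes, allowed=ALLOWED_PROGRAMS):
    """Return True only if every non-empty line runs a single allowlisted program."""
    lines = [line for line in keystrokes.splitlines() if line.strip()]
    if not lines:
        return False
    for line in lines:
        if any(mark in line for mark in SHELL_METACHARS):
            return False
        try:
            argv = shlex.split(line)
        except ValueError:
            # Unbalanced quotes and similar parse errors are rejected.
            return False
        if not argv or argv[0] not in allowed:
            return False
    return True


def filter_commands(action):
    """Split an action's commands into approved ones and ones held for review."""
    approved, held = [], []
    for command in action.get("commands", []):
        target = approved if is_allowed(command["keystrokes"]) else held
        target.append(command)
    return approved, held
```

Commands in `held` would then go to a human reviewer or be dropped, depending on the deployment.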
## Interpretation Notes
The TB2-lite score is not a general intelligence benchmark; it measures the ability to reproduce terminal next-action JSON. Model size, chat template match, assistant-only masking, the tokenizer, and whether training data was held out therefore all affect the score.
If the values in README.md and MODEL_EVALUATION_REPORT.md are more recent, check those first. This model card is a snapshot that was quickly propagated to the individual repository from the completed evaluation JSON.