language:
  - en
  - ko
library_name: transformers
pipeline_tag: text-generation
tags:
  - terminal
  - sft
  - vllm
  - tb2-lite
base_model: LiquidAI/LFM2.5-1.2B-Instruct

LLM-OS-Models/LFM2.5-1.2B-Terminal-SFT-2Epoch-Unsloth

A Terminal SFT model for automating terminal tasks. It was trained to read the given task and the previous terminal state, then generate the next commands to run as JSON.

Model summary

  • Base model: LiquidAI/LFM2.5-1.2B-Instruct
  • Training setup: 2 epochs, Unsloth SFT
  • Evaluation snapshot: 2026-05-07 22:44:35 UTC
  • Evaluation result id: lfm25_1p2b_sft_unsloth_e2

Usage

Transformers example:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "LLM-OS-Models/LFM2.5-1.2B-Terminal-SFT-2Epoch-Unsloth"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a terminal automation assistant. Return JSON only."},
    {"role": "user", "content": "List the current directory and identify Python files."},
]
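# Render the chat template to a string, then tokenize it. Chat templates normally
# include their own special tokens (e.g. BOS), so do not add them a second time.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
# Greedy decoding; print only the newly generated tokens.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=False))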

vLLM example:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "LLM-OS-Models/LFM2.5-1.2B-Terminal-SFT-2Epoch-Unsloth"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
llm = LLM(model=model_id, dtype="bfloat16", trust_remote_code=True)
messages = [{"role": "user", "content": "Show disk usage for the current folder."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
result = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=512))
print(result[0].outputs[0].text)

Recommended output format:

{
  "analysis": "brief reasoning about the next terminal action",
  "plan": "short execution plan",
  "commands": [
    {"keystrokes": "ls -la\n", "duration": 0.1}
  ],
  "task_complete": false
}
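
Because the output is meant to be consumed by a program, it helps to check its shape before acting on it. The following is a minimal sketch, not part of the model's tooling: `parse_action` and `REQUIRED_KEYS` are hypothetical names, and treating all four top-level keys as mandatory is an assumption based on the format shown above.

```python
import json
from typing import Optional

# Assumption: every action must carry the four top-level keys shown above.
REQUIRED_KEYS = {"analysis", "plan", "commands", "task_complete"}

def parse_action(raw: str) -> Optional[dict]:
    """Return the parsed action dict, or None if the text is not a usable action."""
    # The model may emit extra text around the JSON, so keep the outermost braces only.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        action = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict) or not REQUIRED_KEYS <= action.keys():
        return None
    commands = action["commands"]
    if not isinstance(commands, list):
        return None
    # Each command needs a string "keystrokes" and a numeric "duration".
    for cmd in commands:
        if not isinstance(cmd, dict) or not isinstance(cmd.get("keystrokes"), str):
            return None
        if not isinstance(cmd.get("duration"), (int, float)):
            return None
    return action
```

A caller can treat a `None` result as a signal to regenerate or to skip the step instead of executing anything.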

Evaluation results

The evaluation was run with vLLM on the corrected TB2-lite replay set. The ranking score uses only 100 * avg_command_f1; first_cmd_exact_pct is treated as a secondary metric only.

  • Rank: 36 / 44
  • Score: 22.45
  • Command F1: 0.2245
  • Command precision: 0.3097
  • Command recall: 0.2314
  • First command exact: 18.8%
  • Valid JSON: 47.2%
  • Steps / tasks: 303 / 50
  • Template status: chat_template
  • Rank eligible: True
  • Eval timestamp: 2026-05-07T21:50:36.580647
  • Evaluation results aggregated so far: 44
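
The relationship between the numbers above can be checked with a little arithmetic. The sketch below is illustrative only: `command_f1` is a hypothetical set-overlap definition of per-step F1, and the actual TB2-lite scorer may match commands differently.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def command_f1(predicted: list, reference: list) -> float:
    """Per-step F1 over exact command strings (assumed definition, not the official scorer)."""
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0
    hits = len(pred & ref)
    return f1(hits / len(pred), hits / len(ref))

# Ranking score as described above: 100 * avg_command_f1.
avg_command_f1 = 0.2245
score = round(100 * avg_command_f1, 2)   # 22.45

# F1 computed from the reported average precision and recall, for comparison.
pooled = round(f1(0.3097, 0.2314), 4)    # about 0.2649
```

The pooled value (about 0.2649) is higher than the reported Command F1 (0.2245). That is what one would see if F1 is computed per step and then averaged, as the name avg_command_f1 suggests, since the two calculations need not agree.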

Example reproduction command:

python tb2_lite/scripts/replay_eval.py \
  --model LLM-OS-Models/LFM2.5-1.2B-Terminal-SFT-2Epoch-Unsloth \
  --model-short lfm25_1p2b_sft_unsloth_e2 \
  --eval-path tb2_lite/data/replay_full.jsonl \
  --output-dir /home/work/.data/tb2_lite_eval/corrected_readme_models_vllm \
  --dtype bfloat16 \
  --max-model-len 49152 \
  --max-tokens 1024 \
  --temperature 0.0 \
  --top-p 1.0 \
  --gpu-memory-utilization 0.94 \
  --language-model-only

Prompt/template audit:

{
  "template_status": "chat_template",
  "rank_eligible": true,
  "steps": 303,
  "tasks": 50
}

Strengths

  • For certain model sizes and acceleration paths, you can expect fast inference relative to cost.
  • It tends to emit a conservative set of correct commands rather than many incorrect ones (precision 0.3097 vs. recall 0.2314).
  • The LFM family is well suited to lightweight, efficiency-focused experiments that align the Liquid chat template with the terminal SFT format.

Limitations and caveats

  • Recall is relatively low, so the model may omit some of the required commands.
  • JSON formatting failures occur (Valid JSON: 47.2%), so parse validation and retries are needed before execution.
  • The lower command F1 relative to the top-ranked Qwen models reflects differences in format, tokenizer, and training path mixed together with differences in model intelligence.
  • This is an SFT model for assisting automated terminal operation; it does not guarantee general chat or general-purpose reasoning performance.
  • Generated commands must pass through safeguards such as a sandbox, an allowlist, and human review before they are actually executed.
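
As one illustration of the allowlist safeguard mentioned above, here is a minimal sketch of a gate applied to each `keystrokes` string before execution. The names `ALLOWED_PROGRAMS`, `SHELL_METACHARS`, and `is_allowed` are hypothetical, and the allowlist itself is only an example; a real deployment would still combine this with a sandbox and human review.

```python
import shlex

# Hypothetical allowlist of read-only programs; adjust to your environment.
ALLOWED_PROGRAMS = {"ls", "pwd", "cat", "du", "df"}
# Characters that enable chaining, redirection, or substitution in a shell.
SHELL_METACHARS = set(";|&><`$(){}")

def is_allowed(keystrokes: str) -> bool:
    """Return True only for a single simple command whose program is allowlisted."""
    line = keystrokes.strip()
    # Reject empty input and multi-line input.
    if not line or "\n" in line:
        return False
    # Reject anything that could chain or redirect commands.
    if any(ch in SHELL_METACHARS for ch in line):
        return False
    try:
        tokens = shlex.split(line)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in ALLOWED_PROGRAMS
```

For example, `is_allowed("ls -la\n")` passes, while `is_allowed("ls; rm -rf /\n")` is rejected because of the `;`. The check is deliberately strict: it refuses pipes and redirection entirely rather than trying to reason about them.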

Interpretation notes

The TB2-lite score is not a general-intelligence benchmark; it measures how well a model reproduces terminal next-action JSON. Model size, chat template match, assistant-only masking, tokenizer, and whether the training data was held out therefore all affect the score.

If README.md and MODEL_EVALUATION_REPORT.md contain more recent values, check those first. This model card is a snapshot, quickly propagated to the individual repository from the completed evaluation JSON.