Instructions to use LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount with Transformers:

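# Use a pipeline as a high-level helper.
# The messages below include an image and are passed as pipe(text=...),
# which is the call style of the "image-text-to-text" pipeline.
# Assumption: the checkpoint keeps the base model's vision components.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount")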
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount")
model = AutoModelForImageTextToText.from_pretrained("LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
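
The same request can be issued from Python. The following is a minimal sketch using only the standard library, assuming the server from the command above is listening on localhost:8000. The helper names `build_chat_request` and `chat` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Assumed defaults taken from the curl example; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount"


def build_chat_request(user_text, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(user_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("What is the capital of France?")` requires a running server; `build_chat_request` can be inspected without one.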

Use Docker

docker model run hf.co/LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount

SGLang

How to use LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount with Docker Model Runner:
```
docker model run hf.co/LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount
```

Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount / README.md

gyung

Update model card with corrected TB2-lite evaluation

2b9a76d verified 15 days ago

metadata

language:
  - en
  - ko
library_name: transformers
pipeline_tag: text-generation
tags:
  - terminal
  - sft
  - vllm
  - tb2-lite
base_model: Qwen/Qwen3.5-2B

LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount

A Terminal SFT model for automating terminal tasks. It was trained to read the task and the previous terminal state, then generate the next commands to run as JSON.

Model summary

Base model: Qwen/Qwen3.5-2B
Training setup: 2 epochs, full fine-tuning, same-count data setting
Evaluation snapshot: 2026-05-09 00:57:12 UTC
Evaluation result id: qwen35_2b_sft_samecount_e2

Quickstart

Install and log in:

pip install -U vllm transformers huggingface_hub
huggingface-cli login

Related code:

GitHub: https://github.com/LLM-OS-Models/Terminal
vLLM evaluation runner: tb2_lite/scripts/replay_eval.py
Chat template/fallback rendering: tb2_lite/scripts/prompt_builder.py
JSON/command scoring: tb2_lite/scripts/replay_metrics.py

Example of running the model directly with vLLM. As in the evaluation code, the tokenizer's chat template is used first; if no template is available, a ChatML/Gemma fallback is used.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount"
tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
llm = LLM(
    model=model_id,
    tokenizer=model_id,
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=tp,
    max_model_len=49152,
    gpu_memory_utilization=0.92,
)

messages = [
    {"role": "system", "content": "You are a terminal automation assistant. Return JSON only."},
    {"role": "user", "content": "Inspect the current directory and list Python files."},
]

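def render_chatml(messages):
    """Fallback: render messages as ChatML when no chat template is available."""
    parts = []
    for message in messages:
        role = message["role"]
        # Tool output is fed back as a user turn in this fallback.
        if role == "tool":
            role = "user"
        parts.append(f"<|im_start|>{role}\n{message['content']}<|im_end|>\n")
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)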

def render_gemma4_turn(messages, empty_thought_channel=False):
    parts = ["<bos>"]
    for message in messages:
        role = "model" if message["role"] == "assistant" else message["role"]
        if role == "tool":
            role = "user"
        parts.append(f"<|turn>{role}\n{message['content'].strip()}<turn|>\n")
    parts.append("<|turn>model\n")
    if empty_thought_channel:
        parts.append("<|channel>thought\n<channel|>")
    return "".join(parts)

def render_prompt(model_id, tokenizer, messages):
    model_key = model_id.lower()
    if "gemma-4" in model_key:
        try:
            return tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True,
                enable_thinking=False,
            )
        except Exception:
            return render_gemma4_turn(
                messages,
                empty_thought_channel=("26b" in model_key or "31b" in model_key),
            )
    try:
        return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    except Exception:
        return render_chatml(messages)

prompt = render_prompt(model_id, tokenizer, messages)
sampling = SamplingParams(
    temperature=0.0,
    top_p=1.0,
    max_tokens=1024,
    repetition_penalty=1.0,
)
outputs = llm.generate([prompt], sampling_params=sampling)
print(outputs[0].outputs[0].text)

Recommended output format:

{
  "analysis": "brief reasoning about the next terminal action",
  "plan": "short execution plan",
  "commands": [
    {"keystrokes": "ls -la\n", "duration": 0.1}
  ],
  "task_complete": false
}
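
Before acting on a generation, it helps to check that it matches this shape. The sketch below is one way to do that; it is not taken from the evaluation code. The function name `parse_action` is hypothetical, and treating `commands` and `task_complete` as required fields is an assumption made for this example.

```python
import json


def parse_action(raw_text):
    """Return the parsed action dict, or None if the text is not usable."""
    try:
        action = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    # Assumption: "commands" must be a list of {"keystrokes": str, "duration": number}.
    commands = action.get("commands")
    if not isinstance(commands, list):
        return None
    for command in commands:
        if not isinstance(command, dict):
            return None
        if not isinstance(command.get("keystrokes"), str):
            return None
        duration = command.get("duration", 0.0)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(duration, bool) or not isinstance(duration, (int, float)):
            return None
    # Assumption: "task_complete" must be present and boolean.
    if not isinstance(action.get("task_complete"), bool):
        return None
    return action
```

A caller can treat a `None` result as a signal to retry generation rather than execute anything.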

Replay command identical to the one used for evaluation:

python tb2_lite/scripts/replay_eval.py \
  --model LLM-OS-Models/Qwen3.5-2B-Terminal-SFT-2Epoch-FullFT-SameCount \
  --model-short qwen35_2b_sft_samecount_e2 \
  --eval-path tb2_lite/data/replay_full.jsonl \
  --output-dir /home/work/.data/tb2_lite_eval/corrected_readme_models_vllm \
  --dtype bfloat16 \
  --tp 1 \
  --max-model-len 49152 \
  --max-tokens 1024 \
  --temperature 0.0 \
  --top-p 1.0 \
  --gpu-memory-utilization 0.92 \
  --language-model-only

Recommended default tensor parallel size: 1. On OOM, raise --tp (CLI) and tensor_parallel_size (Python) to 2, 4, or 8.
The corrected TB2-lite evaluation fixes temperature=0.0, top_p=1.0, and max_tokens=1024.
For Gemma 4, enable_thinking=False is used to get JSON output, and for the 26B/31B variants the evaluation code automatically applies the empty thought channel handling.

Evaluation results

Evaluation was run with vLLM on the corrected TB2-lite replay set. The rank score uses only 100 * avg_command_f1; first_cmd_exact_pct is treated as a secondary metric only.

Rank: 1 / 56
Score: 39.52
Command F1: 0.3952
Command precision: 0.5082
Command recall: 0.4101
First command exact: 33.0%
Valid JSON: 82.2%
Steps / tasks: 303 / 50
Sec/step: 0.081
Load time: 97.1s
Template status: chat_template
Rank eligible: True
Eval timestamp: 2026-05-07T22:06:25.457045
Number of evaluation results aggregated so far: 56
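
To make the relationship between the precision, recall, F1, and score figures concrete, here is a rough sketch of a command-level F1 and the 100 * average score. The actual scoring lives in tb2_lite/scripts/replay_metrics.py and may differ; the multiset-overlap matching and the names `command_f1` and `rank_score` are assumptions made for this illustration only.

```python
from collections import Counter


def command_f1(predicted, reference):
    """Multiset-overlap precision, recall and F1 between two command lists."""
    pred_counts = Counter(cmd.strip() for cmd in predicted)
    ref_counts = Counter(cmd.strip() for cmd in reference)
    # Counter intersection keeps the minimum count of each shared command.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def rank_score(per_step_f1):
    """Rank score in the form described above: 100 * mean per-step command F1."""
    if not per_step_f1:
        return 0.0
    return 100.0 * sum(per_step_f1) / len(per_step_f1)
```

If F1 is computed per step and then averaged, as the name avg_command_f1 suggests, the mean F1 can fall below both the mean precision and the mean recall, which is consistent with the figures listed above.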

Prompt/template audit:

{
  "template_status": "chat_template",
  "rank_eligible": true,
  "steps": 303,
  "tasks": 50
}

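Strengths

It currently scores in the top tier on corrected TB2-lite (rank 1 of 56) and reproduces terminal commands stably.
It tends to emit fewer, correct commands rather than many wrong ones (precision 0.5082 vs. recall 0.4101).
In this evaluation, Qwen-family models were generally strong on command JSON stability and command F1.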

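Model family interpretation

The Qwen family has a strong base prior, and in this corrected evaluation it scored in the top tier with the chat template path applied correctly.
The evaluation code gives no bonus based on the model name and uses only 100 * avg_command_f1 as the rank score. The high score is read as the result of a good fit between the terminal next-action format and the base/SFT combination, not of Qwen-specific code.
Speed is on the fast side, at about 0.081 sec/step.
RL candidacy: as a top-tier SFT model, it is a baseline candidate for reward tuning/GRPO comparisons.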

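Limitations and cautions

Recall is relatively low, so some required commands may be omitted.
JSON formatting failures occur (valid JSON: 82.2%), so parse validation and retries are needed before execution.
This is an SFT model for assisting automated terminal operation; it does not guarantee general chat or general-purpose reasoning performance.
Generated commands must pass through safeguards such as a sandbox, an allowlist, and human review before they are actually executed.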
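
As one illustration of the allowlist safeguard mentioned above, the sketch below filters an action's commands before anything is executed. It is a simplified example, not part of the model's repository: `ALLOWED_PROGRAMS`, `FORBIDDEN_TOKENS`, `is_allowed`, and `filter_commands` are all hypothetical names, and the allowlist contents are placeholders.

```python
import shlex

# Hypothetical allowlist for illustration; choose your own for real use.
ALLOWED_PROGRAMS = {"ls", "pwd", "cat", "grep", "find", "python"}
# Shell metacharacters that chain or redirect commands; rejected outright here.
FORBIDDEN_TOKENS = (";", "&", "|", ">", "<", "`", "$(")


def is_allowed(keystrokes):
    """Return True only for a single simple command whose program is allowlisted."""
    line = keystrokes.strip()
    if not line or "\n" in line:
        return False
    if any(token in line for token in FORBIDDEN_TOKENS):
        return False
    try:
        argv = shlex.split(line)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return bool(argv) and argv[0] in ALLOWED_PROGRAMS


def filter_commands(action):
    """Split an action's commands into (allowed, rejected) keystroke lists."""
    allowed, rejected = [], []
    for command in action.get("commands", []):
        keystrokes = command.get("keystrokes", "")
        (allowed if is_allowed(keystrokes) else rejected).append(keystrokes)
    return allowed, rejected
```

A program-name allowlist is not a sandbox: allowed programs such as `find` or `python` can still run arbitrary code through their own arguments, so this kind of check complements isolation and human review rather than replacing them.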

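Interpretation notes

The TB2-lite score is not a general intelligence benchmark; it measures the ability to reproduce terminal next-action JSON. Model size, chat template match, assistant-only masking, the tokenizer, and whether the training data was held out therefore all affect the score.

If the values in README.md and MODEL_EVALUATION_REPORT.md are more recent, check those first. This model card is a snapshot, quickly propagated to the individual repository from the completed evaluation JSON.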