Commit d30c2b8 (verified), committed by gyung
1 Parent(s): 256ea65

Update model card with pending TB2-lite evaluation status

Files changed (1)
  1. README.md +152 -24
README.md CHANGED
@@ -1,42 +1,170 @@
  ---
- license: apache-2.0
  library_name: transformers
- base_model: google/gemma-4-E2B-it
  tags:
- - gemma-4
- - terminal-agent
- - full-finetuning
  - tb2-lite
- - gemma4-native-template
  ---

  # LLM-OS-Models/gemma-4-E2B-it-Terminal-SFT-Native-Liquid-2Epoch

- ## Summary

  - Base model: `google/gemma-4-E2B-it`
- - Source dataset/cache: `/home/work/.data/gemma4_native_sft/datasets/google__gemma-4-E2B-it__liquid_raw_json_masked_8192`
- - Training format: Gemma 4 native chat template
- - Labels: assistant JSON command response only
- - Prompt/history labels are masked with `-100`
- - Previous assistant thinking blocks are stripped from history

- ## TB2-lite

- - Result: `pending`

- ## Notes

- - Source checkpoint: `/home/work/.data/gemma4_native_sft/models/google__gemma-4-E2B-it__terminal_sft_native_liquid_2epoch/checkpoint-2042`
- - Checkpoint step: `2042`
- - Trainer epoch: `2.0000`
- - TB2-lite score: pending GPU evaluation
- - Upload policy: checkpoint uploaded immediately after save; score card updates after evaluation.

- ## Loading

  ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- tokenizer = AutoTokenizer.from_pretrained("LLM-OS-Models/gemma-4-E2B-it-Terminal-SFT-Native-Liquid-2Epoch")
- model = AutoModelForCausalLM.from_pretrained("LLM-OS-Models/gemma-4-E2B-it-Terminal-SFT-Native-Liquid-2Epoch", torch_dtype="auto")
  ```

  ---
+ language:
+ - en
+ - ko
  library_name: transformers
+ pipeline_tag: text-generation
  tags:
+ - terminal
+ - sft
+ - vllm
  - tb2-lite
+ - evaluation-pending
+ base_model: google/gemma-4-E2B-it
  ---

  # LLM-OS-Models/gemma-4-E2B-it-Terminal-SFT-Native-Liquid-2Epoch

+ A Terminal SFT model for automating terminal tasks. It was trained to read the given task and the previous terminal state and generate the next command to run as JSON.
+
+ ## Model Summary

  - Base model: `google/gemma-4-E2B-it`
+ - Training setup: `2 epochs`
+ - Model card snapshot: `2026-05-08 22:57:08 UTC`
+ - Corrected TB2-lite evaluated results currently indexed: `56`
+ - Corrected TB2-lite score: `pending / not matched in current result directory`

+ ## Quickstart

+ Install and log in:

+ ```bash
+ pip install -U vllm transformers huggingface_hub
+ huggingface-cli login
+ ```
+
+ Related code:

+ - GitHub: https://github.com/LLM-OS-Models/Terminal
+ - vLLM evaluation runner: `tb2_lite/scripts/replay_eval.py`
+ - Chat template/fallback generation: `tb2_lite/scripts/prompt_builder.py`
+ - JSON/command scoring: `tb2_lite/scripts/replay_metrics.py`

+ Example of running the model directly with vLLM. As in the evaluation code, the tokenizer's chat template is tried first; if no template is available, the ChatML/Gemma fallback is used.

  ```python
+ from transformers import AutoTokenizer
+ from vllm import LLM, SamplingParams
+
+ model_id = "LLM-OS-Models/gemma-4-E2B-it-Terminal-SFT-Native-Liquid-2Epoch"
+ tp = 1  # tensor parallel size
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ llm = LLM(
+     model=model_id,
+     tokenizer=model_id,
+     trust_remote_code=True,
+     dtype="bfloat16",
+     tensor_parallel_size=tp,
+     max_model_len=49152,
+     gpu_memory_utilization=0.92,
+ )
+
+ messages = [
+     {"role": "system", "content": "You are a terminal automation assistant. Return JSON only."},
+     {"role": "user", "content": "Inspect the current directory and list Python files."},
+ ]
+
+ def render_chatml(messages):
+     # Generic ChatML fallback; tool messages are folded into the user role.
+     parts = []
+     for message in messages:
+         role = "assistant" if message["role"] == "assistant" else message["role"]
+         if role == "tool":
+             role = "user"
+         parts.append(f"<|im_start|>{role}\n{message['content']}<|im_end|>\n")
+     parts.append("<|im_start|>assistant\n")
+     return "".join(parts)
+
+ def render_gemma4_turn(messages, empty_thought_channel=False):
+     # Gemma 4 turn-format fallback; the assistant role is renamed to "model".
+     parts = ["<bos>"]
+     for message in messages:
+         role = "model" if message["role"] == "assistant" else message["role"]
+         if role == "tool":
+             role = "user"
+         parts.append(f"<|turn>{role}\n{message['content'].strip()}<turn|>\n")
+     parts.append("<|turn>model\n")
+     if empty_thought_channel:
+         parts.append("<|channel>thought\n<channel|>")
+     return "".join(parts)
+
+ def render_prompt(model_id, tokenizer, messages):
+     # Prefer the tokenizer's chat template; fall back to manual rendering on failure.
+     model_key = model_id.lower()
+     if "gemma-4" in model_key:
+         try:
+             return tokenizer.apply_chat_template(
+                 messages,
+                 tokenize=False,
+                 add_generation_prompt=True,
+                 enable_thinking=False,
+             )
+         except Exception:
+             return render_gemma4_turn(
+                 messages,
+                 empty_thought_channel=("26b" in model_key or "31b" in model_key),
+             )
+     try:
+         return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+     except Exception:
+         return render_chatml(messages)
+
+ prompt = render_prompt(model_id, tokenizer, messages)
+ # Greedy decoding, matching the evaluation settings.
+ sampling = SamplingParams(
+     temperature=0.0,
+     top_p=1.0,
+     max_tokens=1024,
+     repetition_penalty=1.0,
+ )
+ outputs = llm.generate([prompt], sampling_params=sampling)
+ print(outputs[0].outputs[0].text)
  ```
+
+ Recommended output format:
+
+ ```json
+ {
+   "analysis": "brief reasoning about the next terminal action",
+   "plan": "short execution plan",
+   "commands": [
+     {"keystrokes": "ls -la\n", "duration": 0.1}
+   ],
+   "task_complete": false
+ }
+ ```
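
The recommended output format above can be checked before anything is executed. The following is a minimal sketch, not part of the repository: `parse_action` and `REQUIRED_KEYS` are hypothetical names, and treating all four keys as mandatory is an assumption made for illustration rather than something the model card states.

```python
import json

# Assumption: all four keys from the recommended format are treated as required.
REQUIRED_KEYS = {"analysis", "plan", "commands", "task_complete"}


def parse_action(text):
    """Parse one model response and raise ValueError if it does not fit the format."""
    try:
        action = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(action, dict):
        raise ValueError("top-level JSON value must be an object")
    missing = REQUIRED_KEYS - action.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(action["commands"], list):
        raise ValueError("'commands' must be a list")
    for command in action["commands"]:
        if not isinstance(command, dict):
            raise ValueError("each command must be an object")
        if not isinstance(command.get("keystrokes"), str):
            raise ValueError("each command needs a string 'keystrokes' field")
        if not isinstance(command.get("duration"), (int, float)):
            raise ValueError("each command needs a numeric 'duration' field")
    if not isinstance(action["task_complete"], bool):
        raise ValueError("'task_complete' must be a boolean")
    return action


# Sample response shaped like the recommended format.
sample = (
    '{"analysis": "list files", "plan": "run ls", '
    '"commands": [{"keystrokes": "ls -la\\n", "duration": 0.1}], '
    '"task_complete": false}'
)
action = parse_action(sample)
print(action["commands"][0]["keystrokes"], end="")
```

Raising `ValueError` on any mismatch keeps malformed responses from reaching a shell; a caller could instead log the error and count the response as a format failure.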
+
+ Replay command matching the evaluation:
+
+ ```bash
+ python tb2_lite/scripts/replay_eval.py \
+   --model LLM-OS-Models/gemma-4-E2B-it-Terminal-SFT-Native-Liquid-2Epoch \
+   --model-short LLM-OS-Models__gemma-4-E2B-it-Terminal-SFT-Native-Liquid-2Epoch \
+   --eval-path tb2_lite/data/replay_full.jsonl \
+   --output-dir /home/work/.data/tb2_lite_eval/corrected_readme_models_vllm \
+   --dtype bfloat16 \
+   --tp 1 \
+   --max-model-len 49152 \
+   --max-tokens 1024 \
+   --temperature 0.0 \
+   --top-p 1.0 \
+   --gpu-memory-utilization 0.92 \
+   --thinking-mode off \
+   --strip-thinking-history auto \
+   --gemma4-empty-thought-channel auto \
+   --language-model-only
+ ```
+
+ - Recommended default tensor parallel size: `1`. If you run out of memory, raise `--tp` and `tensor_parallel_size` to 2, 4, or 8.
+ - The corrected TB2-lite evaluation fixes `temperature=0.0`, `top_p=1.0`, and `max_tokens=1024`.
+ - Gemma 4 uses `enable_thinking=False` for JSON output, and for the 26B/31B variants the evaluation code automatically applies empty thought channel handling.
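+
+ ## Evaluation Status
+
+ - Current corrected TB2-lite score: `pending`
+ - Reason: the aggregated results currently in `/home/work/.data/tb2_lite_eval/corrected_readme_models_vllm` did not directly match this HF repo name.
+ - Next step: run the evaluation through the same `tb2_lite/scripts/replay_eval.py` path, then automatically replace this section with the score card.
+
+ ## Interpreting the Model Family
+
+ - For the Gemma family, native Gemma/Liquid preprocessing and chat template handling are important. This repo will be replaced with the score card once the corrected evaluation finishes.
+ - The TB2-lite score is not a general intelligence benchmark; it measures the ability to reproduce the terminal next-action JSON.
+ - Generated commands must pass through safeguards such as a sandbox, an allowlist, and human review before they are actually executed.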
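
As one illustration of the allowlist safeguard mentioned in the last bullet, the sketch below filters a `keystrokes` string before it would be sent to a terminal. It is not taken from the repository: `is_allowed`, `ALLOWED_PROGRAMS`, and `FORBIDDEN_CHARS` are hypothetical names, and the specific programs and characters are assumptions chosen for the example. An allowlist like this complements a sandbox and human review and does not replace them.

```python
import shlex

# Hypothetical allowlist of read-only programs; adjust to the deployment.
ALLOWED_PROGRAMS = {"ls", "pwd", "cat", "grep"}
# Characters that enable chaining, substitution, or redirection in a shell.
FORBIDDEN_CHARS = set(";|&`$<>(){}")


def is_allowed(keystrokes):
    """Return True only for a single, simple command whose program is allowlisted."""
    line = keystrokes.rstrip("\n")
    if not line or "\n" in line:
        return False  # empty input or more than one line of keystrokes
    if any(ch in FORBIDDEN_CHARS for ch in line):
        return False  # shell metacharacters are rejected outright
    try:
        tokens = shlex.split(line)
    except ValueError:
        return False  # unbalanced quotes or similar parse errors
    return bool(tokens) and tokens[0] in ALLOWED_PROGRAMS


for candidate in ["ls -la\n", "rm -rf /\n", "ls; rm -rf /\n"]:
    print(repr(candidate), is_allowed(candidate))
```

Rejecting metacharacters wholesale is deliberately conservative: it blocks legitimate pipelines too, which is usually an acceptable trade-off when the commands come from a model rather than a person.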