---
base_model: google/gemma-4-e4b-it
library_name: peft
pipeline_tag: image-text-to-text
tags:
- gemma
- gemma4
- peft
- lora
- video-understanding
- action-recognition
- image-sequence
---

# bear7011/gemma4-e4b-kinetic3K_FT

This repository contains a LoRA adapter fine-tuned from `google/gemma-4-e4b-it` for action recognition on a Kinetics-3K style dataset.

The training code supports both image and video inputs, but this specific checkpoint was trained on 4-frame image sequences extracted from Kinetics clips, not on raw videos.

## What Was Trained

- Base model: `google/gemma-4-e4b-it`
- Adapter type: LoRA
- Output artifact: adapter-only checkpoint (`adapter_model.safetensors`)
- Task: action recognition / short event description from a short frame sequence

The saved adapter applies LoRA to all 42 text transformer layers of Gemma 4 E4B with:

- `r=16`
- `lora_alpha=32`
- `lora_dropout=0.05`
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`

The vision tower and projector were kept frozen in this run, so this is effectively a pure LoRA fine-tune on the language backbone conditioned on visual inputs.
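
As a rough illustration of how `r` and `lora_alpha` interact, the sketch below computes the standard LoRA update, where a frozen weight is augmented by `(lora_alpha / r) * B @ A`. It uses plain Python lists and a hypothetical 4x4 layer size rather than the real Gemma dimensions, and `lora_delta` is an illustrative helper, not part of `peft` or this repository's training code.

```python
R = 16            # LoRA rank used for this checkpoint
LORA_ALPHA = 32   # LoRA alpha used for this checkpoint
SCALE = LORA_ALPHA / R  # standard LoRA scaling factor applied to B @ A


def lora_delta(A, B, scale=SCALE):
    """Return scale * (B @ A) for nested-list matrices.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    r, d_in, d_out = len(A), len(A[0]), len(B)
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d_in)]
        for i in range(d_out)
    ]


# Hypothetical toy layer size (assumption): d_in = d_out = 4.
D = 4
A = [[0.01] * D for _ in range(R)]
B_init = [[0.0] * R for _ in range(D)]     # B starts at zero, so the adapter is a no-op
B_trained = [[1.0] * R for _ in range(D)]  # stand-in values, not real trained weights

delta_init = lora_delta(A, B_init)
delta_trained = lora_delta(A, B_trained)
```

With `r=16` and `lora_alpha=32` the scaling factor is 2, so the low-rank product is doubled before it is added to the frozen projection weight.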

## Training Data

This model was trained with the dataset at:

- `./dataset/kinetics_3k/kinetic_3K.json`
- Image root: `./dataset/kinetics_3k`

Dataset summary:

- 3,115 training samples
- Each sample contains 4 sequential frames from a video clip
- The user prompt asks the model to identify the action or event in the frame sequence
- The assistant target is a short natural-language action description

Example prompt format:

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "frames/<clip_id>/frame_1.jpg"},
        {"type": "image", "image": "frames/<clip_id>/frame_2.jpg"},
        {"type": "image", "image": "frames/<clip_id>/frame_3.jpg"},
        {"type": "image", "image": "frames/<clip_id>/frame_4.jpg"},
        {
          "type": "text",
          "text": "Please analyze the sequence of frames from this video. What specific action or event is happening?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "<action description>"}
      ]
    }
  ]
}
```
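
A record in this format could be assembled with a small helper like the one below. This is a minimal sketch, not the script used to build the dataset: `build_sample`, the clip ID `clip_0001`, and the example label are hypothetical, and the frame paths simply follow the `frames/<clip_id>/frame_N.jpg` pattern shown above.

```python
import json

PROMPT = (
    "Please analyze the sequence of frames from this video. "
    "What specific action or event is happening?"
)


def build_sample(clip_id, action, num_frames=4):
    """Build one training record: N image entries, the text prompt, then the target."""
    frames = [
        {"type": "image", "image": f"frames/{clip_id}/frame_{i}.jpg"}
        for i in range(1, num_frames + 1)
    ]
    user = {"role": "user", "content": frames + [{"type": "text", "text": PROMPT}]}
    assistant = {"role": "assistant", "content": [{"type": "text", "text": action}]}
    return {"messages": [user, assistant]}


# Hypothetical clip ID and label, for illustration only.
sample = build_sample("clip_0001", "a person is juggling balls")
print(json.dumps(sample, indent=2))
```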

## How It Was Trained

Training was performed with a custom supervised fine-tuning pipeline built around:

- `transformers`
- `peft`
- `deepspeed`
- `bitsandbytes` optimizer (`paged_adamw_8bit`)

Core training setup used for this checkpoint:

- Precision: `bf16`
- DeepSpeed: ZeRO Stage 2
- Epochs: `3`
- Total training steps: `1170`
- Per-device batch size: `1`
- Gradient accumulation: `8`
- Optimizer: `paged_adamw_8bit`
- Learning rate: `2e-4`
- Weight decay: `0.0`
- Warmup ratio: `0.03`
- LR scheduler: `cosine`
- Gradient checkpointing: enabled
- Save every `200` steps
- Keep last `2` checkpoints
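
To give a feel for the learning-rate settings above, the following sketch implements a generic linear-warmup plus cosine-decay schedule using the listed values. It is an approximation for illustration, not the trainer's own scheduler: rounding the warmup length up with `ceil` and decaying all the way to zero are assumptions.

```python
import math

TOTAL_STEPS = 1170
WARMUP_RATIO = 0.03
PEAK_LR = 2e-4

# Assumption: the warmup length is the ratio times total steps, rounded up.
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the warmup covers only the first few dozen steps, so almost the entire run is spent on the cosine decay from `2e-4`.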

Final trainer summary:

- Train loss: `14.4465`
- Train runtime: `5026.56` seconds
- Train samples/sec: `1.859`
- Train steps/sec: `0.233`
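
The reported step count and throughput are consistent with the batch settings and dataset size listed earlier. The arithmetic below is a sanity check rather than trainer output; it assumes a single GPU (as in the launch command) and that a partial final batch in each epoch still counts as one optimizer step.

```python
import math

num_samples = 3115
epochs = 3
per_device_batch = 1
grad_accum = 8
num_gpus = 1          # matches --num_gpus 1 in the launch command
runtime_s = 5026.56

# Samples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_gpus   # 8

# Assumption: the last partial batch of each epoch is rounded up to a full step.
steps_per_epoch = math.ceil(num_samples / effective_batch)   # 390
total_steps = steps_per_epoch * epochs                       # 1170

samples_per_sec = round(num_samples * epochs / runtime_s, 3)  # 1.859
steps_per_sec = round(total_steps / runtime_s, 3)             # 0.233

print(effective_batch, steps_per_epoch, total_steps, samples_per_sec, steps_per_sec)
```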

## Training Command

The project launcher was based on:

```bash
MODEL_NAME=google/gemma-4-e4b-it \
DATA_PATH=./dataset/kinetics_3k/kinetic_3K.json \
IMAGE_FOLDER=./dataset/kinetics_3k \
OUTPUT_DIR=./output/gemma4_e4b_lora_only \
RUN_NAME=gemma4-e4b-lora-only \
uv run deepspeed \
  --num_gpus 1 \
  --master_port 29500 \
  stage1/train.py \
  --deepspeed deepspeed_config/stage1.json \
  --model_id google/gemma-4-e4b-it \
  --data_path ./dataset/kinetics_3k/kinetic_3K.json \
  --image_folder ./dataset/kinetics_3k \
  --output_dir ./output/gemma4_e4b_lora_only \
  --run_name gemma4-e4b-lora-only \
  --bf16 True \
  --use_lora True \
  --lora_r 16 \
  --lora_alpha 32 \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --optim paged_adamw_8bit \
  --learning_rate 2e-4 \
  --image_encoder_lr 0.0 \
  --projector_lr 0.0 \
  --weight_decay 0.0 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type cosine \
  --save_strategy steps \
  --save_steps 200 \
  --save_total_limit 2 \
  --gradient_checkpointing True \
  --logging_steps 10 \
  --dataloader_num_workers 4 \
  --report_to none
```

DeepSpeed config:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e7
  },
  "bf16": {
    "enabled": true
  }
}
```

## Usage

This repo contains adapter weights only. Load the base model first, then attach the LoRA adapter.

```python
from transformers import AutoProcessor, Gemma4ForConditionalGeneration
from peft import PeftModel

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    "google/gemma-4-e4b-it",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base_model,
    "bear7011/gemma4-e4b-kinetic3K_FT",
)
processor = AutoProcessor.from_pretrained("google/gemma-4-e4b-it")
```

## Notes

- This checkpoint is an adapter, not a merged full model.
- The repo currently stores final adapter artifacts only; intermediate training checkpoints are intentionally excluded.
- No separate benchmark or held-out evaluation report is included in this repository.