---
base_model: google/gemma-4-e4b-it
library_name: peft
pipeline_tag: image-text-to-text
tags:
- gemma
- gemma4
- peft
- lora
- video-understanding
- action-recognition
- image-sequence
---
# bear7011/gemma4-e4b-kinetic3K_FT
This repository contains a LoRA adapter fine-tuned from `google/gemma-4-e4b-it` for action recognition on a Kinetics-3K-style dataset.
The training code supports both image and video inputs, but this specific checkpoint was trained on 4-frame image sequences extracted from Kinetics clips, not on raw videos.
## What Was Trained
- Base model: `google/gemma-4-e4b-it`
- Adapter type: LoRA
- Output artifact: adapter-only checkpoint (`adapter_model.safetensors`)
- Task: action recognition / short event description from a short frame sequence
The saved adapter applies LoRA to all 42 text transformer layers of Gemma 4 E4B with:
- `r=16`
- `lora_alpha=32`
- `lora_dropout=0.05`
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
The vision tower and projector were kept frozen in this run, so this is effectively a pure LoRA fine-tune on the language backbone conditioned on visual inputs.
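The hyperparameters listed above map onto PEFT `LoraConfig` fields of the same names. As a minimal sketch (the name `lora_kwargs` is illustrative, and this card does not show how the training script actually builds its config), the settings and the resulting LoRA scaling factor can be written as:

```python
# Hyperparameters as listed in this card; `lora_kwargs` is an illustrative name,
# not something taken from the training code.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Standard LoRA scales the low-rank update by lora_alpha / r
# (PEFT's default when rank-stabilized LoRA is not enabled).
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {lora_scaling}")
```

With `peft` installed, `LoraConfig(**lora_kwargs)` should accept these keys. Whether the original run set any further options is not stated here.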
## Training Data
This model was trained with the dataset at:
- `./dataset/kinetics_3k/kinetic_3K.json`
- Image root: `./dataset/kinetics_3k`
Dataset summary:
- 3,115 training samples
- Each sample contains 4 sequential frames from a video clip
- The user prompt asks the model to identify the action or event in the frame sequence
- The assistant target is a short natural-language action description
Example prompt format:
```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "frames/<clip_id>/frame_1.jpg"},
        {"type": "image", "image": "frames/<clip_id>/frame_2.jpg"},
        {"type": "image", "image": "frames/<clip_id>/frame_3.jpg"},
        {"type": "image", "image": "frames/<clip_id>/frame_4.jpg"},
        {
          "type": "text",
          "text": "Please analyze the sequence of frames from this video. What specific action or event is happening?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": "<action description>"}
      ]
    }
  ]
}
```
```
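A sample in this format can be sanity-checked before training. The sketch below assumes that image paths are relative to the image root, which this card implies but does not state outright. `resolve_frames` and `check_sample` are hypothetical helpers, not part of the training code.

```python
from pathlib import Path


def resolve_frames(sample: dict, image_root: str) -> list[str]:
    """Return the user-turn image paths joined onto image_root."""
    user_turn = next(m for m in sample["messages"] if m["role"] == "user")
    return [
        (Path(image_root) / part["image"]).as_posix()
        for part in user_turn["content"]
        if part["type"] == "image"
    ]


def check_sample(sample: dict, expected_frames: int = 4) -> None:
    """Raise ValueError if the sample does not match the format shown above."""
    roles = [m["role"] for m in sample["messages"]]
    if roles != ["user", "assistant"]:
        raise ValueError(f"unexpected roles: {roles}")
    user_parts = sample["messages"][0]["content"]
    n_images = sum(part["type"] == "image" for part in user_parts)
    if n_images != expected_frames:
        raise ValueError(f"expected {expected_frames} frames, got {n_images}")
    if user_parts[-1]["type"] != "text":
        raise ValueError("the prompt text should follow the frames")
```

Running `check_sample` over every entry before launching a run is a cheap way to catch clips with missing frames.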
## How It Was Trained
Training was performed with a custom supervised fine-tuning pipeline built around:
- `transformers`
- `peft`
- `deepspeed`
- `bitsandbytes` optimizer (`paged_adamw_8bit`)
Core training setup used for this checkpoint:
- Precision: `bf16`
- DeepSpeed: ZeRO Stage 2
- Epochs: `3`
- Total training steps: `1170`
- Per-device batch size: `1`
- Gradient accumulation: `8`
- Optimizer: `paged_adamw_8bit`
- Learning rate: `2e-4`
- Weight decay: `0.0`
- Warmup ratio: `0.03`
- LR scheduler: `cosine`
- Gradient checkpointing: enabled
- Save every `200` steps
- Keep last `2` checkpoints
Final trainer summary:
- Train loss: `14.4465`
- Train runtime: `5026.56` seconds
- Train samples/sec: `1.859`
- Train steps/sec: `0.233`
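The reported step count and throughput are consistent with the dataset size and batch settings. The arithmetic below is a cross-check, not trainer output. It assumes a single GPU (as in the launch command), that a partial final accumulation window counts as one optimizer step, and that warmup steps are rounded up.

```python
import math

# Values taken from this card.
num_samples = 3115
per_device_batch = 1
grad_accum = 8
num_gpus = 1
epochs = 3
warmup_ratio = 0.03
runtime_s = 5026.56

# Samples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_gpus

# Assumption: the last, partially filled accumulation window still triggers a step.
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs

# Assumption: warmup length is rounded up to a whole step.
warmup_steps = math.ceil(total_steps * warmup_ratio)

steps_per_sec = round(total_steps / runtime_s, 3)
samples_per_sec = round(num_samples * epochs / runtime_s, 3)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
print(steps_per_sec, samples_per_sec)
```

Under these assumptions the totals come out to 1170 steps at roughly 0.233 steps/sec and 1.859 samples/sec, matching the summary above.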
## Training Command
The project launcher was based on:
```bash
MODEL_NAME=google/gemma-4-e4b-it \
DATA_PATH=./dataset/kinetics_3k/kinetic_3K.json \
IMAGE_FOLDER=./dataset/kinetics_3k \
OUTPUT_DIR=./output/gemma4_e4b_lora_only \
RUN_NAME=gemma4-e4b-lora-only \
uv run deepspeed \
--num_gpus 1 \
--master_port 29500 \
stage1/train.py \
--deepspeed deepspeed_config/stage1.json \
--model_id google/gemma-4-e4b-it \
--data_path ./dataset/kinetics_3k/kinetic_3K.json \
--image_folder ./dataset/kinetics_3k \
--output_dir ./output/gemma4_e4b_lora_only \
--run_name gemma4-e4b-lora-only \
--bf16 True \
--use_lora True \
--lora_r 16 \
--lora_alpha 32 \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--optim paged_adamw_8bit \
--learning_rate 2e-4 \
--image_encoder_lr 0.0 \
--projector_lr 0.0 \
--weight_decay 0.0 \
--warmup_ratio 0.03 \
--lr_scheduler_type cosine \
--save_strategy steps \
--save_steps 200 \
--save_total_limit 2 \
--gradient_checkpointing True \
--logging_steps 10 \
--dataloader_num_workers 4 \
--report_to none
```
DeepSpeed config:
```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e7
  },
  "bf16": {
    "enabled": true
  }
}
```
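The `"auto"` entries are placeholders that the Hugging Face Trainer integration fills in from the command-line arguments. The sketch below only illustrates what they would resolve to for this run. `resolve_auto` is a hypothetical helper, not a DeepSpeed or `transformers` API.

```python
import json

# The batch-related part of the config shown above.
ds_config = json.loads("""
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0
}
""")


def resolve_auto(config: dict, micro_batch: int, grad_accum: int, world_size: int) -> dict:
    """Return a copy of config with the batch-related "auto" values filled in."""
    fills = {
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum,
        # DeepSpeed requires: global batch = micro batch * accumulation * world size.
        "train_batch_size": micro_batch * grad_accum * world_size,
    }
    resolved = dict(config)
    for key, value in fills.items():
        if resolved.get(key) == "auto":
            resolved[key] = value
    return resolved


# Values from the launch command: batch size 1, accumulation 8, one GPU.
resolved = resolve_auto(ds_config, micro_batch=1, grad_accum=8, world_size=1)
print(resolved)
```

Explicit values such as `gradient_clipping` are left untouched, so only entries marked `"auto"` follow the launcher flags.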
## Usage
This repo contains adapter weights only. Load the base model first, then attach the LoRA adapter.
```python
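from transformers import AutoProcessor, Gemma4ForConditionalGeneration
from peft import PeftModel

# Load the base model; its weights are not included in this repo.
base_model = Gemma4ForConditionalGeneration.from_pretrained(
    "google/gemma-4-e4b-it",
    torch_dtype="auto",
    device_map="auto",
)

# Attach the LoRA adapter weights on top of the base model.
model = PeftModel.from_pretrained(
    base_model,
    "bear7011/gemma4-e4b-kinetic3K_FT",
)

# The processor (tokenizer and image preprocessing) comes from the base model.
processor = AutoProcessor.from_pretrained("google/gemma-4-e4b-it")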
```
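Once `model` and `processor` are loaded, inference on a 4-frame sequence might look like the sketch below. It reuses the training prompt and assumes the Gemma 4 processor accepts local file paths under the `"image"` key in `apply_chat_template`, as other recent `transformers` multimodal processors do. `build_messages`, `describe_action`, and the `max_new_tokens` value are illustrative and have not been verified against this checkpoint.

```python
DEFAULT_PROMPT = (
    "Please analyze the sequence of frames from this video. "
    "What specific action or event is happening?"
)


def build_messages(frame_paths, prompt=DEFAULT_PROMPT):
    """Build a single user turn: the frames in order, then the text prompt."""
    content = [{"type": "image", "image": path} for path in frame_paths]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


def describe_action(model, processor, frame_paths, max_new_tokens=64):
    """Generate an action description for a list of frame image paths."""
    inputs = processor.apply_chat_template(
        build_messages(frame_paths),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    prompt_len = inputs["input_ids"].shape[-1]
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so that only the newly generated text is decoded.
    return processor.decode(
        output_ids[0][prompt_len:], skip_special_tokens=True
    ).strip()
```

Because the checkpoint was trained on exactly four sequential frames per clip, passing a different number of frames is outside what it saw during training.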
## Notes
- This checkpoint is an adapter, not a merged full model.
- The repo currently stores final adapter artifacts only; intermediate training checkpoints are intentionally excluded.
- No separate benchmark or held-out evaluation report is included in this repository.
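If a standalone merged model is needed, the adapter can be folded into the base weights with PEFT's `merge_and_unload()`. The helper below is a sketch: `merge_and_save` and the output directory are hypothetical names, and the merged result has not been validated for this checkpoint.

```python
def merge_and_save(peft_model, output_dir: str):
    """Fold the LoRA deltas into the base weights and save the merged model."""
    # merge_and_unload() returns the underlying base model with the
    # adapter weights merged in and the LoRA wrappers removed.
    merged = peft_model.merge_and_unload()
    merged.save_pretrained(output_dir)
    return merged
```

For example, `merge_and_save(model, "./gemma4-e4b-kinetic3K-merged")` would write a full-size checkpoint, so expect it to take roughly as much disk space as the base model.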