	---
	base_model: google/gemma-4-e4b-it
	library_name: peft
	pipeline_tag: image-text-to-text
	tags:
	- gemma
	- gemma4
	- peft
	- lora
	- video-understanding
	- action-recognition
	- image-sequence
	---

	# bear7011/gemma4-e4b-kinetic3K_FT

	This repository contains a LoRA adapter fine-tuned from `google/gemma-4-e4b-it` for action recognition on a Kinetics-3K-style dataset.

	The training code supports both image and video inputs, but this specific checkpoint was trained on 4-frame image sequences extracted from Kinetics clips, not on raw videos.

	## What Was Trained

	- Base model: `google/gemma-4-e4b-it`
	- Adapter type: LoRA
	- Output artifact: adapter-only checkpoint (`adapter_model.safetensors`)
	- Task: action recognition / short event description from a short frame sequence

	The saved adapter applies LoRA to all 42 text transformer layers of Gemma 4 E4B with:

	- `r=16`
	- `lora_alpha=32`
	- `lora_dropout=0.05`
	- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`

	The vision tower and projector were kept frozen in this run, so only the LoRA weights in the language backbone were updated. Visual features still condition the model, but they come from the unchanged base encoder.
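
To make the hyperparameters above concrete, the sketch below computes the LoRA scaling factor (`lora_alpha / r`) and the number of parameters a rank-`r` adapter adds to a single linear layer. It is only an illustration: `LoraSettings` is a made-up container rather than a `peft` class, and the layer width in the example is a hypothetical placeholder, not an actual Gemma 4 E4B shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hyperparameters reported in this card (illustrative, not a peft class)."""

    r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update B @ A by lora_alpha / r.
        return self.lora_alpha / self.r


def lora_params_per_linear(d_in: int, d_out: int, r: int) -> int:
    """Parameters added to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


settings = LoraSettings()
print(settings.scaling)  # 2.0

# Hypothetical square projection of width 2048 (NOT a real Gemma 4 E4B shape).
print(lora_params_per_linear(2048, 2048, settings.r))  # 65536
```

With `r=16` and `lora_alpha=32`, the adapter update is scaled by 2.0 before being added to the frozen weight's output.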

	## Training Data

	This model was trained with the dataset at:

	- `./dataset/kinetics_3k/kinetic_3K.json`
	- Image root: `./dataset/kinetics_3k`

	Dataset summary:

	- 3,115 training samples
	- Each sample contains 4 sequential frames from a video clip
	- The user prompt asks the model to identify the action or event in the frame sequence
	- The assistant target is a short natural-language action description
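
The card does not say how the 4 frames were chosen from each clip. Purely as an assumption, the sketch below shows one common approach: split the clip into `num_frames` equal segments and take the middle frame of each. The function name `sample_frame_indices` and the 300-frame clip length are hypothetical.

```python
def sample_frame_indices(total_frames: int, num_frames: int = 4) -> list[int]:
    """Return evenly spaced, increasing frame indices (segment midpoints)."""
    if total_frames < num_frames:
        raise ValueError("clip has fewer frames than requested")
    segment = total_frames / num_frames
    # Midpoint of segment i is segment * (i + 0.5); truncate to an integer index.
    return [int(segment * (i + 0.5)) for i in range(num_frames)]


# Hypothetical 300-frame clip (e.g. 10 seconds at 30 fps).
print(sample_frame_indices(300))  # [37, 112, 187, 262]
```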

	Example prompt format:

	```json
	{
	  "messages": [
	    {
	      "role": "user",
	      "content": [
	        {"type": "image", "image": "frames/<clip_id>/frame_1.jpg"},
	        {"type": "image", "image": "frames/<clip_id>/frame_2.jpg"},
	        {"type": "image", "image": "frames/<clip_id>/frame_3.jpg"},
	        {"type": "image", "image": "frames/<clip_id>/frame_4.jpg"},
	        {
	          "type": "text",
	          "text": "Please analyze the sequence of frames from this video. What specific action or event is happening?"
	        }
	      ]
	    },
	    {
	      "role": "assistant",
	      "content": [
	        {"type": "text", "text": "<action description>"}
	      ]
	    }
	  ]
	}
	```
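
A record in this format can be assembled and sanity-checked with a few lines of standard-library Python. This is a minimal sketch: `build_record` and `validate_record` are hypothetical helpers that are not part of the training code, and the clip ID and description are made up.

```python
import json

PROMPT = (
    "Please analyze the sequence of frames from this video. "
    "What specific action or event is happening?"
)
NUM_FRAMES = 4


def build_record(clip_id: str, description: str) -> dict:
    """Build one training record: 4 frame images, the prompt, and the target text."""
    images = [
        {"type": "image", "image": f"frames/{clip_id}/frame_{i}.jpg"}
        for i in range(1, NUM_FRAMES + 1)
    ]
    return {
        "messages": [
            {"role": "user", "content": images + [{"type": "text", "text": PROMPT}]},
            {"role": "assistant", "content": [{"type": "text", "text": description}]},
        ]
    }


def validate_record(record: dict) -> None:
    """Raise ValueError if the record does not match the layout shown above."""
    user, assistant = record["messages"]
    if user["role"] != "user" or assistant["role"] != "assistant":
        raise ValueError("expected a user turn followed by an assistant turn")
    image_items = [c for c in user["content"] if c["type"] == "image"]
    if len(image_items) != NUM_FRAMES:
        raise ValueError(f"expected {NUM_FRAMES} images, got {len(image_items)}")
    if user["content"][-1]["type"] != "text":
        raise ValueError("the text prompt should come after the images")


# Made-up clip ID and description, for illustration only.
record = build_record("example_clip", "a person is juggling balls")
validate_record(record)
print(json.dumps(record, indent=2))
```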

	## How It Was Trained

	Training was performed with a custom supervised fine-tuning pipeline built around:

	- `transformers`
	- `peft`
	- `deepspeed`
	- `bitsandbytes` optimizer (`paged_adamw_8bit`)

	Core training setup used for this checkpoint:

	- Precision: `bf16`
	- DeepSpeed: ZeRO Stage 2
	- Epochs: `3`
	- Total training steps: `1170`
	- Per-device batch size: `1`
	- Gradient accumulation: `8`
	- Effective batch size: `8` (1 GPU × per-device batch size 1 × 8 accumulation steps)
	- Optimizer: `paged_adamw_8bit`
	- Learning rate: `2e-4`
	- Weight decay: `0.0`
	- Warmup ratio: `0.03`
	- LR scheduler: `cosine`
	- Gradient checkpointing: enabled
	- Save every `200` steps
	- Keep last `2` checkpoints
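
The step count follows from the other settings, and the sketch below reproduces it together with a warmup-then-cosine learning-rate curve of the usual form. Two things here are assumptions rather than facts from the training code: that the last partial batch of each epoch is rounded up to a full optimizer step, and that the warmup step count is rounded up. `lr_at` is an illustrative helper, not a function from the training pipeline.

```python
import math

NUM_SAMPLES = 3115
NUM_GPUS = 1
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 8
EPOCHS = 3
PEAK_LR = 2e-4
WARMUP_RATIO = 0.03

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
# Assumption: the final partial batch of an epoch still counts as one step.
steps_per_epoch = math.ceil(NUM_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
# Assumption: the warmup step count is rounded up.
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps, warmup_steps)  # 8 390 1170 36
```

Under these assumptions the total comes out to the 1170 steps reported above.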

	Final trainer summary:

	- Train loss: `14.4465`
	- Train runtime: `5026.56` seconds
	- Train samples/sec: `1.859`
	- Train steps/sec: `0.233`
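
The throughput figures can be cross-checked against the run configuration. The short calculation below assumes every training sample is counted once per epoch, and uses only numbers reported in this card.

```python
NUM_SAMPLES = 3115
EPOCHS = 3
TOTAL_STEPS = 1170
RUNTIME_S = 5026.56

samples_per_sec = NUM_SAMPLES * EPOCHS / RUNTIME_S
steps_per_sec = TOTAL_STEPS / RUNTIME_S

print(round(samples_per_sec, 3), round(steps_per_sec, 3))  # 1.859 0.233
print(round(RUNTIME_S / 60, 1))  # 83.8 (minutes of wall-clock training)
```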

	## Training Command

	The project launcher was based on:

	```bash
	MODEL_NAME=google/gemma-4-e4b-it \
	DATA_PATH=./dataset/kinetics_3k/kinetic_3K.json \
	IMAGE_FOLDER=./dataset/kinetics_3k \
	OUTPUT_DIR=./output/gemma4_e4b_lora_only \
	RUN_NAME=gemma4-e4b-lora-only \
	uv run deepspeed \
	  --num_gpus 1 \
	  --master_port 29500 \
	  stage1/train.py \
	  --deepspeed deepspeed_config/stage1.json \
	  --model_id google/gemma-4-e4b-it \
	  --data_path ./dataset/kinetics_3k/kinetic_3K.json \
	  --image_folder ./dataset/kinetics_3k \
	  --output_dir ./output/gemma4_e4b_lora_only \
	  --run_name gemma4-e4b-lora-only \
	  --bf16 True \
	  --use_lora True \
	  --lora_r 16 \
	  --lora_alpha 32 \
	  --num_train_epochs 3 \
	  --per_device_train_batch_size 1 \
	  --gradient_accumulation_steps 8 \
	  --optim paged_adamw_8bit \
	  --learning_rate 2e-4 \
	  --image_encoder_lr 0.0 \
	  --projector_lr 0.0 \
	  --weight_decay 0.0 \
	  --warmup_ratio 0.03 \
	  --lr_scheduler_type cosine \
	  --save_strategy steps \
	  --save_steps 200 \
	  --save_total_limit 2 \
	  --gradient_checkpointing True \
	  --logging_steps 10 \
	  --dataloader_num_workers 4 \
	  --report_to none
	```

	DeepSpeed config:

	```json
	{
	  "train_batch_size": "auto",
	  "train_micro_batch_size_per_gpu": "auto",
	  "gradient_accumulation_steps": "auto",
	  "gradient_clipping": 1.0,
	  "zero_optimization": {
	    "stage": 2,
	    "overlap_comm": true,
	    "contiguous_gradients": true,
	    "reduce_bucket_size": 5e7
	  },
	  "bf16": {
	    "enabled": true
	  }
	}
	```
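
The `"auto"` entries are placeholders that the Hugging Face Trainer's DeepSpeed integration fills in from its own arguments. The sketch below imitates that substitution for the three batch-related keys using the values from the training command. `resolve_auto` is a hypothetical helper written for illustration; it is not how the integration is actually implemented.

```python
import json

DS_CONFIG = """
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e7
  },
  "bf16": {"enabled": true}
}
"""


def resolve_auto(config: dict, per_device_batch: int, grad_accum: int, world_size: int) -> dict:
    """Replace the batch-related "auto" placeholders with concrete values."""
    fills = {
        "train_micro_batch_size_per_gpu": per_device_batch,
        "gradient_accumulation_steps": grad_accum,
        # Global batch = micro batch x accumulation steps x number of processes.
        "train_batch_size": per_device_batch * grad_accum * world_size,
    }
    resolved = dict(config)
    for key, value in fills.items():
        if resolved.get(key) == "auto":
            resolved[key] = value
    return resolved


config = json.loads(DS_CONFIG)
resolved = resolve_auto(config, per_device_batch=1, grad_accum=8, world_size=1)
print(resolved["train_batch_size"])  # 8
print(resolved["zero_optimization"]["stage"])  # 2
```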

	## Usage

	This repo contains adapter weights only. Load the base model first, then attach the LoRA adapter.

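	```python
	from transformers import AutoProcessor, Gemma4ForConditionalGeneration
	from peft import PeftModel

	# Load the base model; this repository does not include its weights.
	base_model = Gemma4ForConditionalGeneration.from_pretrained(
	    "google/gemma-4-e4b-it",
	    torch_dtype="auto",
	    device_map="auto",
	)
	# Attach the LoRA adapter from this repository on top of the base model.
	model = PeftModel.from_pretrained(
	    base_model,
	    "bear7011/gemma4-e4b-kinetic3K_FT",
	)
	# The processor (tokenizer and image preprocessing) comes from the base model.
	processor = AutoProcessor.from_pretrained("google/gemma-4-e4b-it")
	```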
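
For inference, the input should presumably mirror the training format: four frames followed by the same text prompt. The sketch below only builds that message list with standard-library Python. `build_inference_messages` is a hypothetical helper, and the frame paths are placeholders.

```python
PROMPT = (
    "Please analyze the sequence of frames from this video. "
    "What specific action or event is happening?"
)


def build_inference_messages(frame_paths: list[str]) -> list[dict]:
    """Build a single user turn: 4 frames in order, then the training prompt."""
    if len(frame_paths) != 4:
        raise ValueError("this checkpoint was trained on exactly 4 frames per sample")
    content = [{"type": "image", "image": str(path)} for path in frame_paths]
    content.append({"type": "text", "text": PROMPT})
    return [{"role": "user", "content": content}]


# Placeholder paths; replace them with frames extracted from your own clip.
frames = [f"frames/example_clip/frame_{i}.jpg" for i in range(1, 5)]
messages = build_inference_messages(frames)
print(len(messages[0]["content"]))  # 5
```

With other Gemma multimodal checkpoints, a list like this is typically passed to `processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")` and the result to `model.generate(...)`. Whether the Gemma 4 processor accepts exactly this call and this `image` key is an assumption that has not been verified against this checkpoint.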

	## Notes

	- This checkpoint is an adapter, not a merged full model.
	- The repo currently stores final adapter artifacts only; intermediate training checkpoints are intentionally excluded.
	- No separate benchmark or held-out evaluation report is included in this repository.