| base_model: google/gemma-4-e4b-it | |
| library_name: peft | |
| pipeline_tag: image-text-to-text | |
| tags: | |
| - gemma | |
| - gemma4 | |
| - peft | |
| - lora | |
| - video-understanding | |
| - action-recognition | |
| - image-sequence | |
| # bear7011/gemma4-e4b-kinetic3K_FT | |
| This repository contains a LoRA adapter fine-tuned from `google/gemma-4-e4b-it` for action recognition on a Kinetics-3K-style dataset. | |
| The training code supports both image and video inputs, but this checkpoint was trained on 4-frame image sequences extracted from Kinetics clips, not on raw video. | |
| ## What Was Trained | |
| - Base model: `google/gemma-4-e4b-it` | |
| - Adapter type: LoRA | |
| - Output artifact: adapter-only checkpoint (`adapter_model.safetensors`) | |
| - Task: action recognition / short event description from a short frame sequence | |
| The saved adapter applies LoRA to all 42 text transformer layers of Gemma 4 E4B with: | |
| - `r=16` | |
| - `lora_alpha=32` | |
| - `lora_dropout=0.05` | |
| - Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` | |
| The vision tower and projector were kept frozen in this run (their learning rates were set to `0.0`), so only the LoRA weights in the language backbone were updated. The visual features the backbone conditions on are those of the base model. | |
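
To make the LoRA settings above concrete, here is a minimal sketch of what they imply. It assumes the default LoRA scaling of `lora_alpha / r` (no rank-stabilized variant) and exactly one of each target module per layer. The hidden sizes in the last call are hypothetical placeholders chosen for illustration, not the real Gemma 4 E4B dimensions.

```python
# LoRA hyperparameters as listed in this card.
LORA_R = 16
LORA_ALPHA = 32
NUM_LAYERS = 42
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]


def lora_scaling(alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A (assumes default scaling)."""
    return alpha / r


def lora_params_per_module(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


scaling = lora_scaling(LORA_ALPHA, LORA_R)
# Assumption: every layer holds exactly one of each target module.
adapted_modules = NUM_LAYERS * len(TARGET_MODULES)
# Hypothetical 2048 x 2048 projection, for illustration only.
example_params = lora_params_per_module(d_in=2048, d_out=2048, r=LORA_R)

print(f"scaling={scaling}, adapted modules={adapted_modules}, "
      f"params for a 2048x2048 layer={example_params}")
```
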
| ## Training Data | |
| The adapter was trained on a local dataset at the following paths: | |
| - Data file: `./dataset/kinetics_3k/kinetic_3K.json` | |
| - Image root: `./dataset/kinetics_3k` | |
| Dataset summary: | |
| - 3,115 training samples | |
| - Each sample contains 4 sequential frames from a video clip | |
| - The user prompt asks the model to identify the action or event in the frame sequence | |
| - The assistant target is a short natural-language action description | |
| Example prompt format: | |
| ```json | |
| { | |
| "messages": [ | |
| { | |
| "role": "user", | |
| "content": [ | |
| {"type": "image", "image": "frames/<clip_id>/frame_1.jpg"}, | |
| {"type": "image", "image": "frames/<clip_id>/frame_2.jpg"}, | |
| {"type": "image", "image": "frames/<clip_id>/frame_3.jpg"}, | |
| {"type": "image", "image": "frames/<clip_id>/frame_4.jpg"}, | |
| { | |
| "type": "text", | |
| "text": "Please analyze the sequence of frames from this video. What specific action or event is happening?" | |
| } | |
| ] | |
| }, | |
| { | |
| "role": "assistant", | |
| "content": [ | |
| {"type": "text", "text": "<action description>"} | |
| ] | |
| } | |
| ] | |
| } | |
| ``` | |
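
A quick sanity check for records in this format could look like the sketch below. `validate_record` and the specific checks are illustrative and not part of the training code. The expected frame count of 4 comes from the dataset summary above.

```python
from typing import Any

EXPECTED_FRAMES = 4  # frames per sample, as stated in this card


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if it looks fine)."""
    problems: list[str] = []
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    if roles != ["user", "assistant"]:
        return [f"expected roles ['user', 'assistant'], got {roles}"]

    user_content = messages[0].get("content", [])
    assistant_content = messages[1].get("content", [])

    images = [c for c in user_content if c.get("type") == "image"]
    if len(images) != EXPECTED_FRAMES:
        problems.append(f"expected {EXPECTED_FRAMES} images, got {len(images)}")

    prompts = [c for c in user_content if c.get("type") == "text"]
    if len(prompts) != 1 or not prompts[0].get("text", "").strip():
        problems.append("user turn needs exactly one non-empty text prompt")

    targets = [
        c for c in assistant_content
        if c.get("type") == "text" and c.get("text", "").strip()
    ]
    if len(targets) != 1:
        problems.append("assistant turn needs exactly one non-empty text target")
    return problems


# Placeholder record mirroring the example above.
example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": f"frames/<clip_id>/frame_{i}.jpg"}
                for i in range(1, EXPECTED_FRAMES + 1)
            ] + [{"type": "text", "text": "What specific action or event is happening?"}],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "<action description>"}],
        },
    ]
}
print(validate_record(example))
```
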
| ## How It Was Trained | |
| Training was performed with a custom supervised fine-tuning pipeline built around: | |
| - `transformers` | |
| - `peft` | |
| - `deepspeed` | |
| - `bitsandbytes` optimizer (`paged_adamw_8bit`) | |
| Core training setup used for this checkpoint: | |
| - Precision: `bf16` | |
| - DeepSpeed: ZeRO Stage 2 | |
| - Epochs: `3` | |
| - Total training steps: `1170` | |
| - Per-device batch size: `1` | |
| - Gradient accumulation: `8` | |
| - Optimizer: `paged_adamw_8bit` | |
| - Learning rate: `2e-4` | |
| - Weight decay: `0.0` | |
| - Warmup ratio: `0.03` | |
| - LR scheduler: `cosine` | |
| - Gradient checkpointing: enabled | |
| - Save every `200` steps | |
| - Keep last `2` checkpoints | |
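
The reported step count follows from the settings above. The sketch below reproduces it, assuming a single GPU (as in the launch command) and that a trailing partial accumulation window still counts as one optimizer step. The warmup rounding is likewise an assumption.

```python
import math

num_samples = 3115
per_device_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 1  # assumption taken from the launch command (--num_gpus 1)
num_epochs = 3
warmup_ratio = 0.03

# Samples consumed per optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
# Assumption: the last, partially filled accumulation window counts as a step.
steps_per_epoch = math.ceil(num_samples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs
# Assumption: warmup steps are rounded up.
warmup_steps = math.ceil(warmup_ratio * total_steps)

print(f"effective batch size: {effective_batch_size}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
print(f"warmup steps: {warmup_steps}")
```
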
| Final trainer summary: | |
| - Train loss: `14.4465` | |
| - Train runtime: `5026.56` seconds | |
| - Train samples/sec: `1.859` | |
| - Train steps/sec: `0.233` | |
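
As a consistency check, the throughput figures can be recomputed from the runtime, the sample count, and the step count reported in this card. This is plain arithmetic on those numbers, not output from the trainer.

```python
train_runtime_s = 5026.56
total_steps = 1170
samples_seen = 3115 * 3  # training samples x epochs

samples_per_sec = samples_seen / train_runtime_s
steps_per_sec = total_steps / train_runtime_s
runtime_min = train_runtime_s / 60

print(f"samples/sec: {samples_per_sec:.3f}")
print(f"steps/sec:   {steps_per_sec:.3f}")
print(f"runtime:     {runtime_min:.1f} minutes")
```
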
| ## Training Command | |
| The run was launched on a single GPU with a command based on the following: | |
| ```bash | |
| MODEL_NAME=google/gemma-4-e4b-it \ | |
| DATA_PATH=./dataset/kinetics_3k/kinetic_3K.json \ | |
| IMAGE_FOLDER=./dataset/kinetics_3k \ | |
| OUTPUT_DIR=./output/gemma4_e4b_lora_only \ | |
| RUN_NAME=gemma4-e4b-lora-only \ | |
| uv run deepspeed \ | |
| --num_gpus 1 \ | |
| --master_port 29500 \ | |
| stage1/train.py \ | |
| --deepspeed deepspeed_config/stage1.json \ | |
| --model_id google/gemma-4-e4b-it \ | |
| --data_path ./dataset/kinetics_3k/kinetic_3K.json \ | |
| --image_folder ./dataset/kinetics_3k \ | |
| --output_dir ./output/gemma4_e4b_lora_only \ | |
| --run_name gemma4-e4b-lora-only \ | |
| --bf16 True \ | |
| --use_lora True \ | |
| --lora_r 16 \ | |
| --lora_alpha 32 \ | |
| --num_train_epochs 3 \ | |
| --per_device_train_batch_size 1 \ | |
| --gradient_accumulation_steps 8 \ | |
| --optim paged_adamw_8bit \ | |
| --learning_rate 2e-4 \ | |
| --image_encoder_lr 0.0 \ | |
| --projector_lr 0.0 \ | |
| --weight_decay 0.0 \ | |
| --warmup_ratio 0.03 \ | |
| --lr_scheduler_type cosine \ | |
| --save_strategy steps \ | |
| --save_steps 200 \ | |
| --save_total_limit 2 \ | |
| --gradient_checkpointing True \ | |
| --logging_steps 10 \ | |
| --dataloader_num_workers 4 \ | |
| --report_to none | |
| ``` | |
| DeepSpeed config: | |
| ```json | |
| { | |
| "train_batch_size": "auto", | |
| "train_micro_batch_size_per_gpu": "auto", | |
| "gradient_accumulation_steps": "auto", | |
| "gradient_clipping": 1.0, | |
| "zero_optimization": { | |
| "stage": 2, | |
| "overlap_comm": true, | |
| "contiguous_gradients": true, | |
| "reduce_bucket_size": 5e7 | |
| }, | |
| "bf16": { | |
| "enabled": true | |
| } | |
| } | |
| ``` | |
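
With the Hugging Face Trainer integration, the `"auto"` entries in this config are filled in from the training arguments. The sketch below mimics that substitution for the values used in this run. `resolve_auto` is a hypothetical helper written for illustration, not a DeepSpeed or `transformers` function.

```python
import json

DS_CONFIG = json.loads("""
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e7
  },
  "bf16": {"enabled": true}
}
""")


def resolve_auto(config: dict, micro_batch: int, grad_accum: int, world_size: int) -> dict:
    """Return a copy of the config with batch-related "auto" values filled in."""
    resolved = dict(config)
    values = {
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum,
        "train_batch_size": micro_batch * grad_accum * world_size,
    }
    for key, value in values.items():
        if resolved.get(key) == "auto":
            resolved[key] = value
    return resolved


# Values from the launch command: batch size 1, accumulation 8, one GPU.
resolved = resolve_auto(DS_CONFIG, micro_batch=1, grad_accum=8, world_size=1)
print(json.dumps(resolved, indent=2))
```
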
| ## Usage | |
| This repo contains adapter weights only. Load the base model first, then attach the LoRA adapter. | |
| ```python | |
| from transformers import AutoProcessor, Gemma4ForConditionalGeneration | |
| from peft import PeftModel | |
| # Load the base model that the adapter was trained against. | |
| base_model = Gemma4ForConditionalGeneration.from_pretrained( | |
|     "google/gemma-4-e4b-it", | |
|     torch_dtype="auto", | |
|     device_map="auto", | |
| ) | |
| # Attach the LoRA adapter weights from this repository. | |
| model = PeftModel.from_pretrained( | |
|     base_model, | |
|     "bear7011/gemma4-e4b-kinetic3K_FT", | |
| ) | |
| # The processor (tokenizer and image preprocessing) comes from the base model. | |
| processor = AutoProcessor.from_pretrained("google/gemma-4-e4b-it") | |
| ``` | |
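
Once `model` and `processor` are loaded as above, inference on a 4-frame sequence might look like the sketch below. It is untested against this checkpoint. It assumes the Gemma 4 processor follows the multimodal `apply_chat_template` call pattern used by other `transformers` vision-language processors, and that frame paths resolve from the current working directory. `build_inference_messages` and `describe_action` are hypothetical helper names.

```python
FRAME_PROMPT = (
    "Please analyze the sequence of frames from this video. "
    "What specific action or event is happening?"
)


def build_inference_messages(frame_paths, prompt=FRAME_PROMPT):
    """User turn only: the frames in order, followed by the training prompt."""
    content = [{"type": "image", "image": str(path)} for path in frame_paths]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


def describe_action(model, processor, frame_paths, max_new_tokens=64):
    """Generate a short action description for a sequence of frames."""
    # Assumption: the processor loads the images and tokenizes in one call.
    inputs = processor.apply_chat_template(
        build_inference_messages(frame_paths),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    prompt_len = inputs["input_ids"].shape[-1]
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    return processor.decode(
        output_ids[0][prompt_len:], skip_special_tokens=True
    ).strip()
```

A call would then be `describe_action(model, processor, ["frame_1.jpg", "frame_2.jpg", "frame_3.jpg", "frame_4.jpg"])`, where the file names are placeholders. Keeping to four frames and the same prompt wording matches the training setup described above.
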
| ## Notes | |
| - This checkpoint is an adapter, not a merged full model. | |
| - The repo currently stores final adapter artifacts only; intermediate training checkpoints are intentionally excluded. | |
| - No separate benchmark or held-out evaluation report is included in this repository. | |