---
base_model: google/gemma-4-e4b-it
library_name: peft
pipeline_tag: image-text-to-text
tags:
- gemma
- gemma4
- peft
- lora
- video-understanding
- action-recognition
- image-sequence
---

# bear7011/gemma4-e4b-kinetic3K_FT

This repository contains a LoRA adapter fine-tuned from `google/gemma-4-e4b-it` for action recognition on a Kinetics-3K style dataset.

The training code supports both image and video inputs, but this specific checkpoint was trained on 4-frame image sequences extracted from Kinetics clips, not on raw videos.

## What Was Trained

- Base model: `google/gemma-4-e4b-it`
- Adapter type: LoRA
- Output artifact: adapter-only checkpoint (`adapter_model.safetensors`)
- Task: action recognition / short event description from a short frame sequence

The saved adapter applies LoRA to all 42 text transformer layers of Gemma 4 E4B with:

- `r=16`
- `lora_alpha=32`
- `lora_dropout=0.05`
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`

The vision tower and projector were kept frozen in this run, so this is effectively a pure LoRA fine-tune on the language backbone conditioned on visual inputs.

## Training Data

This model was trained with the dataset at:

- `./dataset/kinetics_3k/kinetic_3K.json`
- Image root: `./dataset/kinetics_3k`

Dataset summary:

- 3,115 training samples
- Each sample contains 4 sequential frames from a video clip
- The user prompt asks the model to identify the action or event in the frame sequence
- The assistant target is a short natural-language action description

Example prompt format:

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "frames//frame_1.jpg"},
        {"type": "image", "image": "frames//frame_2.jpg"},
        {"type": "image", "image": "frames//frame_3.jpg"},
        {"type": "image", "image": "frames//frame_4.jpg"},
        {
          "type": "text",
          "text": "Please analyze the sequence of frames from this video. What specific action or event is happening?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"type": "text", "text": ""}
      ]
    }
  ]
}
```

## How It Was Trained

Training was performed with a custom supervised fine-tuning pipeline built around:

- `transformers`
- `peft`
- `deepspeed`
- `bitsandbytes` optimizer (`paged_adamw_8bit`)

Core training setup used for this checkpoint:

- Precision: `bf16`
- DeepSpeed: ZeRO Stage 2
- Epochs: `3`
- Total training steps: `1170`
- Per-device batch size: `1`
- Gradient accumulation: `8`
- Effective optimizer: `paged_adamw_8bit`
- Learning rate: `2e-4`
- Weight decay: `0.0`
- Warmup ratio: `0.03`
- LR scheduler: `cosine`
- Gradient checkpointing: enabled
- Save every `200` steps
- Keep last `2` checkpoints

Final trainer summary:

- Train loss: `14.4465`
- Train runtime: `5026.56` seconds
- Train samples/sec: `1.859`
- Train steps/sec: `0.233`

## Training Command

The project launcher was based on:

```bash
MODEL_NAME=google/gemma-4-e4b-it \
DATA_PATH=./dataset/kinetics_3k/kinetic_3K.json \
IMAGE_FOLDER=./dataset/kinetics_3k \
OUTPUT_DIR=./output/gemma4_e4b_lora_only \
RUN_NAME=gemma4-e4b-lora-only \
uv run deepspeed \
  --num_gpus 1 \
  --master_port 29500 \
  stage1/train.py \
  --deepspeed deepspeed_config/stage1.json \
  --model_id google/gemma-4-e4b-it \
  --data_path ./dataset/kinetics_3k/kinetic_3K.json \
  --image_folder ./dataset/kinetics_3k \
  --output_dir ./output/gemma4_e4b_lora_only \
  --run_name gemma4-e4b-lora-only \
  --bf16 True \
  --use_lora True \
  --lora_r 16 \
  --lora_alpha 32 \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --optim paged_adamw_8bit \
  --learning_rate 2e-4 \
  --image_encoder_lr 0.0 \
  --projector_lr 0.0 \
  --weight_decay 0.0 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type cosine \
  --save_strategy steps \
  --save_steps 200 \
  --save_total_limit 2 \
  --gradient_checkpointing True \
  --logging_steps 10 \
  --dataloader_num_workers 4 \
  --report_to none
```

DeepSpeed config:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e7
  },
  "bf16": {
    "enabled": true
  }
}
```

## Usage

This repo contains adapter weights only. Load the base model first, then attach the LoRA adapter.

```python
from transformers import AutoProcessor, Gemma4ForConditionalGeneration
from peft import PeftModel

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    "google/gemma-4-e4b-it",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base_model,
    "bear7011/gemma4-e4b-kinetic3K_FT",
)
processor = AutoProcessor.from_pretrained("google/gemma-4-e4b-it")
```

## Notes

- This checkpoint is an adapter, not a merged full model.
- The repo currently stores final adapter artifacts only; intermediate training checkpoints are intentionally excluded.
- No separate benchmark or held-out evaluation report is included in this repository.