---
library_name: transformers
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: any-to-any
base_model: google/gemma-4-E4B-it
datasets:
- EVA-UNIT-01/Lilith-v0.3
- zerofata/Gemini-3.1-Pro-GLM5-Characters
- zerofata/Instruct-Anime
- zerofata/Anime-AMA-Prose
- allura-forge/mimo-v2-pro-claude-distill-hs3
- allura-forge/doubao-seed2.0-distill-multiturn-expr-rp
- Delta-Vector/Orion-Deepseek-V3-RP-Filtered
- Delta-Vector/Orion-Deepseek-R1-RP-Filtered
- Gryphe/ChatGPT-4o-Writing-Prompts
- Gryphe/Sonnet3.5-Charcard-Roleplay
- ToastyPigeon/kimi-stories-instruct
- ToastyPigeon/kimi-rp-v3
- ToastyPigeon/fujin-filtered-instruct
- Dxniz/Novelist-CoT
language:
- en
---

**Gemma 4 E4B Musica v1**

An RP/storygen/writing/conversational tune of Gemma-4-E4B-it, and the fourth model in the Musica series.

Quite nice and stable, and still fairly smart for its size, while having better vibes. Stabler than 26B-A4B, even, which is a bit odd, but okay.

Both reasoning and non-reasoning modes work, though the reasoning style this time seems to be a mix of Gemma and GLM, without any option to change that. I'd honestly suggest always keeping reasoning on: with how fast this model is to run, it just makes the output better.

Instruction following seems to be quite good, and so is swipe diversity. No refusals detected.

This model was made in collaboration with [ArliAI](https://www.arliai.com/)

**Training notes**

It was straight up impossible to tune E4B for some time, but it now seems to just work on latest axolotl. I also used the new hybrid FA2/SDPA attention to make training faster.

This model uses an oddly large amount of VRAM for training, ~55.5 GB active per card with FSDP2 full sharding for a bf16 LoRA. Training seems to be decently fast, though: 9 hours for 2 epochs with ~55M trainable tokens is not too bad.

I decided to change my tuning strategy for this one: 2 epochs with reflected exponential (REX) LR decay, to give it a proper anneal. Seems to have worked quite well (a sketch of the schedule shape follows the links below).

r64a64 LoRA, rex 1e-5, 2 epochs, 9 hours on 2xRTX Pro 6000 Blackwell.

- [allura-forge/musica-sft-v1-gemma4-pretok](https://huggingface.co/datasets/allura-forge/musica-sft-v1-gemma4-pretok) - pretokenized dataset.
- [CometML Project](https://www.comet.com/aetherwiing/musica-e4b/view/MG0CxdwODRAGylGSywdR6YLeH/panels) - training graphs and stats.
- [AuriAetherwiing/G4-E4B-Musica-v1-lora](https://huggingface.co/AuriAetherwiing/G4-E4B-Musica-v1-lora) - LoRA adapter.
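For intuition, here's a small sketch of the LR curve this setup should produce. It assumes axolotl's `rex` scheduler follows the reflected-exponential shape from the REX paper (Chen et al.); check the axolotl source for the exact implementation.

```python
# Sketch of the assumed REX ("reflected exponential") decay shape, with the
# linear warmup from warmup_ratio. This mirrors the schedule as described in
# the REX paper; axolotl's exact implementation may differ.

def rex_lr(step: int, total_steps: int, max_lr: float = 1e-5,
           warmup_ratio: float = 0.05) -> float:
    warmup_steps = max(int(total_steps * warmup_ratio), 1)
    if step < warmup_steps:
        return max_lr * step / warmup_steps  # linear warmup
    # progress through the decay phase, z in [0, 1]
    z = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    # REX stays above linear decay for most of training, then anneals hard
    return max_lr * (1 - z) / (1 - z / 2)
```

At the halfway point of the decay this gives ~0.67x the max LR (vs. 0.5x for linear), which is the "hold high, anneal late" behavior that gives the run its proper anneal.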
**Recommended Samplers**

- Temperature: 1
- Min-P: 0.02
- NSigma: 2

Don't use repetition penalties of any kind; they do more harm than good. (A minimal generation sketch using these settings follows the config below.)

**Axolotl config**

<details><summary>See Axolotl config</summary>

```yaml
# =============================================================================
# BASE MODEL
# =============================================================================
base_model: /home/arli/models/gemma-4-E4B-it

# =============================================================================
# PLUGINS & KERNEL OPTIMIZATIONS
# =============================================================================
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true

# =============================================================================
# QUANTIZATION
# =============================================================================
load_in_8bit: false
load_in_4bit: false

# =============================================================================
# DATASET
# =============================================================================
shuffle_merged_datasets: true
datasets:
  - path: allura-forge/musica-sft-v1-gemma4-pretok
    ds_type: parquet
    type:
dataset_prepared_path: ./last_run_prepared
val_set_size: 0

# =============================================================================
# OUTPUT & ADAPTER
# =============================================================================
output_dir: ./outputs/v1
adapter: lora
save_safetensors: true

# =============================================================================
# SEQUENCE & SAMPLE PACKING
# =============================================================================
sequence_len: 8192
sample_packing: true # DOES in fact work with SDPA
pad_to_sequence_len: false

# =============================================================================
# LORA
# =============================================================================
lora_r: 64
lora_alpha: 64
lora_dropout: 0.0
lora_target_modules: 'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj'

lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false

# =============================================================================
# TRAINING HYPERPARAMETERS
# =============================================================================
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch_fused
lr_scheduler: rex
learning_rate: 1e-5
warmup_ratio: 0.05
max_grad_norm: 1.0
weight_decay: 0.05

# =============================================================================
# PRECISION
# =============================================================================
bf16: auto

# =============================================================================
# ATTENTION
# =============================================================================
#sdp_attention: true
#flash_attention: true
#flex_attention: true
#torch_compile: true
gemma4_hybrid_attn_impl: true

# =============================================================================
# LOGGING & MONITORING
# =============================================================================
use_comet: true # install comet-ml with pip and do comet login before starting
comet_project_name: musica-e4b
logging_steps: 1

# =============================================================================
# CHECKPOINTING & SAVING
# =============================================================================
auto_resume_from_checkpoints: false
evals_per_epoch: 0
saves_per_epoch: 4
save_total_limit: 4

gradient_checkpointing: false
gradient_checkpointing_kwargs:
  use_reentrant: false
# =============================================================================
# FSDP
# =============================================================================
fsdp_config:
  fsdp_version: 2
  offload_params: false
  cpu_ram_efficient_loading: false
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer
  state_dict_type: FULL_STATE_DICT
  sharding_strategy: FULL_SHARD
  reshard_after_forward: false
  activation_checkpointing: true
```

</details>
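For quick testing, here's a minimal, hedged sketch of text-only generation with the recommended samplers via transformers. The repo id below is hypothetical (substitute this model's actual id), and the Auto class is an assumption: adjust it to whatever this architecture actually requires. NSigma (top-n-sigma) is not passed to `generate` here; set it in an inference backend that supports it, such as llama.cpp (`--top-n-sigma`).

```python
# Minimal sketch, not a verified snippet. Assumptions: the repo id is
# hypothetical, AutoModelForCausalLM works for this architecture, and your
# transformers version supports `min_p`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AuriAetherwiing/G4-E4B-Musica-v1"  # hypothetical id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write the opening scene of a heist story."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,  # recommended
    min_p=0.02,       # recommended
    # deliberately no repetition penalty, per the note above
)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```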