---
library_name: transformers
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: any-to-any
base_model: google/gemma-4-E4B-it
datasets:
- EVA-UNIT-01/Lilith-v0.3
- zerofata/Gemini-3.1-Pro-GLM5-Characters
- zerofata/Instruct-Anime
- zerofata/Anime-AMA-Prose
- allura-forge/mimo-v2-pro-claude-distill-hs3
- allura-forge/doubao-seed2.0-distill-multiturn-expr-rp
- Delta-Vector/Orion-Deepseek-V3-RP-Filtered
- Delta-Vector/Orion-Deepseek-R1-RP-Filtered
- Gryphe/ChatGPT-4o-Writing-Prompts
- Gryphe/Sonnet3.5-Charcard-Roleplay
- ToastyPigeon/kimi-stories-instruct
- ToastyPigeon/kimi-rp-v3
- ToastyPigeon/fujin-filtered-instruct
- Dxniz/Novelist-CoT
language:
- en
---

**Gemma 4 E4B Musica v1**

An RP/storygen/writing/conversational tune of Gemma-4-E4B-it, the fourth model in the Musica series. Quite nice and stable, and still fairly smart for its size, while having better vibes. It's even stabler than 26B-A4B, which is a bit odd, but okay.

Both reasoning and non-reasoning modes work, though the reasoning style this time seems to be a mix of Gemma and GLM, with no option to change that. I'd honestly suggest always keeping reasoning on; with how fast this model is to run, it just makes the output better.
Instruction following seems to be quite good, and so is swipe diversity. No refusals detected.

This model was made in collaboration with [ArliAI](https://www.arliai.com/).
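
For a quick sanity check outside of a frontend, here's a minimal generation sketch. It assumes the model loads through the standard transformers causal-LM path; `MODEL_ID` is a placeholder for wherever you have the weights. Temperature and Min-P are native `generate()` options; NSigma isn't in transformers as far as I know, so it's left out here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/G4-E4B-Musica-v1"  # placeholder: point at this repo or a local copy

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write the opening scene of a heist gone sideways."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Recommended samplers from below: temperature 1, min-p 0.02
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, min_p=0.02)
print(tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```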

**Training notes**

It was straight up impossible to tune E4B for a while, but now it seems to just work on the latest Axolotl. I also used the new hybrid FA2/SDPA attention implementation to make training faster.

This model uses an oddly large amount of VRAM for training: ~55.5 GB active per card with fully sharded FSDP2 for a bf16 LoRA. Training is decently fast, though; 9 hours for 2 epochs over ~55M trainable tokens is not too bad.
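
Back-of-envelope throughput, assuming the ~55M token figure is per epoch (halve it if it's the two-epoch total):

```python
tokens = 55e6 * 2          # ~55M trainable tokens per epoch, 2 epochs
seconds = 9 * 3600         # 9 hours wall clock
total = tokens / seconds   # aggregate across both GPUs
print(f"~{total:,.0f} tok/s total, ~{total / 2:,.0f} tok/s per GPU")
# -> ~3,395 tok/s total, ~1,698 tok/s per GPU
```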

I decided to change my tuning strategy for this one: 2 epochs with reflected exponential (REX) LR decay, to give it a proper anneal. That seems to have worked quite well.
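
For reference, this is roughly the decay shape REX gives you, using the formula from the REX paper and ignoring the warmup ramp (Axolotl's exact implementation may differ in details):

```python
def rex_lr(step: int, total_steps: int, lr_max: float = 1e-5) -> float:
    """Reflected exponential (REX) decay: gentle at first, steep at the end."""
    p = step / total_steps                 # training progress in [0, 1]
    return lr_max * (1 - p) / (1 - p / 2)

for pct in (0, 25, 50, 75, 90, 100):
    print(f"{pct:3d}% done: lr = {rex_lr(pct, 100):.2e}")
```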

r=64 / alpha=64 LoRA, REX @ 1e-5, 2 epochs, 9 hours on 2x RTX Pro 6000 Blackwell.

[allura-forge/musica-sft-v1-gemma4-pretok](https://huggingface.co/datasets/allura-forge/musica-sft-v1-gemma4-pretok) - pretokenized dataset.

[CometML Project](https://www.comet.com/aetherwiing/musica-e4b/view/MG0CxdwODRAGylGSywdR6YLeH/panels) - training graphs and stats.

[AuriAetherwiing/G4-E4B-Musica-v1-lora](https://huggingface.co/AuriAetherwiing/G4-E4B-Musica-v1-lora) - LoRA adapter.
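
If you'd rather apply the adapter to the base model yourself, something like this should work (a sketch, assuming standard PEFT/transformers loading for this architecture):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E4B-it", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "AuriAetherwiing/G4-E4B-Musica-v1-lora")
model = model.merge_and_unload()  # optional: bake the LoRA in for faster inference
```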

**Recommended Samplers**

- Temperature: 1
- Min-P: 0.02
- NSigma: 2

Don't use repetition penalties of any kind; they do more harm than good.
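
If you're curious what those samplers actually do, here's an illustrative single-step filter over raw logits, a sketch of Min-P and NSigma (top-nσ) as described in their papers; real backends differ in ordering and details:

```python
import torch

def filter_logits(logits: torch.Tensor, temperature: float = 1.0,
                  min_p: float = 0.02, n_sigma: float = 2.0) -> torch.Tensor:
    logits = logits / temperature

    # NSigma: keep only tokens whose logit sits within n_sigma standard
    # deviations of the max logit; everything below that is noise floor.
    cutoff = logits.max() - n_sigma * logits.std()
    logits = logits.masked_fill(logits < cutoff, float("-inf"))

    # Min-P: drop tokens less likely than min_p times the top token's probability.
    probs = torch.softmax(logits, dim=-1)
    logits = logits.masked_fill(probs < min_p * probs.max(), float("-inf"))
    return logits

logits = torch.randn(32_000)  # stand-in logits for one decoding step
next_token = torch.multinomial(torch.softmax(filter_logits(logits), dim=-1), 1)
```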

**Axolotl config**

<details><summary>See Axolotl config</summary>

```yaml
# =============================================================================
# BASE MODEL
# =============================================================================
base_model: /home/arli/models/gemma-4-E4B-it

# =============================================================================
# PLUGINS & KERNEL OPTIMIZATIONS
# =============================================================================
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true

# =============================================================================
# QUANTIZATION
# =============================================================================
load_in_8bit: false
load_in_4bit: false

# =============================================================================
# DATASET
# =============================================================================
shuffle_merged_datasets: true
datasets:
  - path: allura-forge/musica-sft-v1-gemma4-pretok
    ds_type: parquet
    type:

dataset_prepared_path: ./last_run_prepared
val_set_size: 0

# =============================================================================
# OUTPUT & ADAPTER
# =============================================================================
output_dir: ./outputs/v1
adapter: lora
save_safetensors: true

# =============================================================================
# SEQUENCE & SAMPLE PACKING
# =============================================================================
sequence_len: 8192
sample_packing: true # DOES in fact work with SDPA
pad_to_sequence_len: false

# =============================================================================
# LORA
# =============================================================================
lora_r: 64
lora_alpha: 64
lora_dropout: 0.0
lora_target_modules: 'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj'

lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false

# =============================================================================
# TRAINING HYPERPARAMETERS
# =============================================================================
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch_fused
lr_scheduler: rex
learning_rate: 1e-5
warmup_ratio: 0.05
max_grad_norm: 1.0
weight_decay: 0.05

# =============================================================================
# PRECISION
# =============================================================================
bf16: auto

# =============================================================================
# ATTENTION
# =============================================================================
#sdp_attention: true
#flash_attention: true
#flex_attention: true
#torch_compile: true
gemma4_hybrid_attn_impl: true

# =============================================================================
# LOGGING & MONITORING
# =============================================================================
use_comet: true # install comet-ml with pip and do comet login before starting
comet_project_name: musica-e4b
logging_steps: 1

# =============================================================================
# CHECKPOINTING & SAVING
# =============================================================================
auto_resume_from_checkpoints: false
evals_per_epoch: 0
saves_per_epoch: 4
save_total_limit: 4

gradient_checkpointing: false
gradient_checkpointing_kwargs:
  use_reentrant: false

fsdp_config:
  fsdp_version: 2
  offload_params: false
  cpu_ram_efficient_loading: false
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer
  state_dict_type: FULL_STATE_DICT
  sharding_strategy: FULL_SHARD
  reshard_after_forward: false
  activation_checkpointing: true
```
</details>