---
library_name: transformers
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: any-to-any
base_model: google/gemma-4-E4B-it
datasets:
- EVA-UNIT-01/Lilith-v0.3
- zerofata/Gemini-3.1-Pro-GLM5-Characters
- zerofata/Instruct-Anime
- zerofata/Anime-AMA-Prose
- allura-forge/mimo-v2-pro-claude-distill-hs3
- allura-forge/doubao-seed2.0-distill-multiturn-expr-rp
- Delta-Vector/Orion-Deepseek-V3-RP-Filtered
- Delta-Vector/Orion-Deepseek-R1-RP-Filtered
- Gryphe/ChatGPT-4o-Writing-Prompts
- Gryphe/Sonnet3.5-Charcard-Roleplay
- ToastyPigeon/kimi-stories-instruct
- ToastyPigeon/kimi-rp-v3
- ToastyPigeon/fujin-filtered-instruct
- Dxniz/Novelist-CoT
language:
- en
---

**Gemma 4 E4B Musica v1**

An RP/storygen/writing/conversational tune of Gemma-4-E4B-it, and the fourth model in the Musica series.

Quite nice and stable, and still fairly smart for its size, while having better vibes. Stabler than 26B-A4B, even, which is a bit odd, but okay.

Both reasoning and non-reasoning modes work, though the reasoning style this time seems to be a mix of Gemma and GLM, without any option to change that. I'd honestly suggest always keeping reasoning on: with how fast this model is to run, it just makes the output better.

Instruction following seems to be quite good, and so is swipe diversity. No refusals detected.

This model was made in collaboration with [ArliAI](https://www.arliai.com/)

**Training notes**

It was straight up impossible to tune E4B for some time, but it now seems to just work on latest axolotl. I also used the new hybrid FA2/SDPA attention to make training faster.

This model uses an oddly large amount of VRAM for training, ~55.5 GB active per card with FSDP2 full sharding for a bf16 LoRA. Training seems to be decently fast, though: 9 hours for 2 epochs with ~55M trainable tokens is not too bad.

I decided to change my tuning strategy for this one: 2 epochs with reflected exponential (REX) LR decay, to give it a proper anneal. Seems to have worked quite well (a sketch of the schedule shape follows the links below).

r64a64 LoRA, rex 1e-5, 2 epochs, 9 hours on 2xRTX Pro 6000 Blackwell.

- [allura-forge/musica-sft-v1-gemma4-pretok](https://huggingface.co/datasets/allura-forge/musica-sft-v1-gemma4-pretok) - pretokenized dataset.
- [CometML Project](https://www.comet.com/aetherwiing/musica-e4b/view/MG0CxdwODRAGylGSywdR6YLeH/panels) - training graphs and stats.
- [AuriAetherwiing/G4-E4B-Musica-v1-lora](https://huggingface.co/AuriAetherwiing/G4-E4B-Musica-v1-lora) - LoRA adapter.
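For intuition, here's a small sketch of the LR curve this setup should produce. It assumes axolotl's `rex` scheduler follows the reflected-exponential shape from the REX paper (Chen et al.); check the axolotl source for the exact implementation.

```python
# Sketch of the assumed REX ("reflected exponential") decay shape, with the
# linear warmup from warmup_ratio. This mirrors the schedule as described in
# the REX paper; axolotl's exact implementation may differ.

def rex_lr(step: int, total_steps: int, max_lr: float = 1e-5,
           warmup_ratio: float = 0.05) -> float:
    warmup_steps = max(int(total_steps * warmup_ratio), 1)
    if step < warmup_steps:
        return max_lr * step / warmup_steps  # linear warmup
    # progress through the decay phase, z in [0, 1]
    z = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    # REX stays above linear decay for most of training, then anneals hard
    return max_lr * (1 - z) / (1 - z / 2)
```

At the halfway point of the decay this gives ~0.67x the max LR (vs. 0.5x for linear), which is the "hold high, anneal late" behavior that gives the run its proper anneal.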
**Recommended Samplers**

- Temperature: 1
- Min-P: 0.02
- NSigma: 2

Don't use repetition penalties of any kind; they do more harm than good. (A minimal generation sketch using these settings follows the config below.)

**Axolotl config**

<details><summary>See Axolotl config</summary>

```yaml
# =============================================================================
# BASE MODEL
# =============================================================================
base_model: /home/arli/models/gemma-4-E4B-it

# =============================================================================
# PLUGINS & KERNEL OPTIMIZATIONS
# =============================================================================
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true

# =============================================================================
# QUANTIZATION
# =============================================================================
load_in_8bit: false
load_in_4bit: false

# =============================================================================
# DATASET
# =============================================================================
shuffle_merged_datasets: true
datasets:
  - path: allura-forge/musica-sft-v1-gemma4-pretok
    ds_type: parquet
    type:
dataset_prepared_path: ./last_run_prepared
val_set_size: 0

# =============================================================================
# OUTPUT & ADAPTER
# =============================================================================
output_dir: ./outputs/v1
adapter: lora
save_safetensors: true

# =============================================================================
# SEQUENCE & SAMPLE PACKING
# =============================================================================
sequence_len: 8192
sample_packing: true # DOES in fact work with SDPA
pad_to_sequence_len: false

# =============================================================================
# LORA
# =============================================================================
lora_r: 64
lora_alpha: 64
lora_dropout: 0.0
lora_target_modules: 'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj'

lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false

# =============================================================================
# TRAINING HYPERPARAMETERS
# =============================================================================
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch_fused
lr_scheduler: rex
learning_rate: 1e-5
warmup_ratio: 0.05
max_grad_norm: 1.0
weight_decay: 0.05

# =============================================================================
# PRECISION
# =============================================================================
bf16: auto

# =============================================================================
# ATTENTION
# =============================================================================
#sdp_attention: true
#flash_attention: true
#flex_attention: true
#torch_compile: true
gemma4_hybrid_attn_impl: true

# =============================================================================
# LOGGING & MONITORING
# =============================================================================
use_comet: true # install comet-ml with pip and do comet login before starting
comet_project_name: musica-e4b
logging_steps: 1

# =============================================================================
# CHECKPOINTING & SAVING
# =============================================================================
auto_resume_from_checkpoints: false
evals_per_epoch: 0
saves_per_epoch: 4
save_total_limit: 4

gradient_checkpointing: false
gradient_checkpointing_kwargs:
  use_reentrant: false
# =============================================================================
# FSDP
# =============================================================================
fsdp_config:
  fsdp_version: 2
  offload_params: false
  cpu_ram_efficient_loading: false
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer
  state_dict_type: FULL_STATE_DICT
  sharding_strategy: FULL_SHARD
  reshard_after_forward: false
  activation_checkpointing: true
```

</details>
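For quick testing, here's a minimal, hedged sketch of text-only generation with the recommended samplers via transformers. The repo id below is hypothetical (substitute this model's actual id), and the Auto class is an assumption: adjust it to whatever this architecture actually requires. NSigma (top-n-sigma) is not passed to `generate` here; set it in an inference backend that supports it, such as llama.cpp (`--top-n-sigma`).

```python
# Minimal sketch, not a verified snippet. Assumptions: the repo id is
# hypothetical, AutoModelForCausalLM works for this architecture, and your
# transformers version supports `min_p`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AuriAetherwiing/G4-E4B-Musica-v1"  # hypothetical id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write the opening scene of a heist story."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,  # recommended
    min_p=0.02,       # recommended
    # deliberately no repetition penalty, per the note above
)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```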