---
library_name: transformers
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
pipeline_tag: any-to-any
base_model: google/gemma-4-E4B-it
datasets:
- EVA-UNIT-01/Lilith-v0.3
- zerofata/Gemini-3.1-Pro-GLM5-Characters
- zerofata/Instruct-Anime
- zerofata/Anime-AMA-Prose
- allura-forge/mimo-v2-pro-claude-distill-hs3
- allura-forge/doubao-seed2.0-distill-multiturn-expr-rp
- Delta-Vector/Orion-Deepseek-V3-RP-Filtered
- Delta-Vector/Orion-Deepseek-R1-RP-Filtered
- Gryphe/ChatGPT-4o-Writing-Prompts
- Gryphe/Sonnet3.5-Charcard-Roleplay
- ToastyPigeon/kimi-stories-instruct
- ToastyPigeon/kimi-rp-v3
- ToastyPigeon/fujin-filtered-instruct
- Dxniz/Novelist-CoT
language:
- en
---

**Gemma 4 E4B Musica v1**

An RP/storygen/writing/conversational tune of Gemma-4-E4B-it, the fourth model in the Musica series. Quite nice and stable, and still fairly smart for its size, while having better vibes. It's even stabler than 26B-A4B, which is a bit odd, but okay.

Both reasoning and non-reasoning modes work, though the reasoning style this time seems to be a mix of Gemma and GLM, with no option to change that. I'd honestly suggest always keeping reasoning on; with how fast this model is to run, it just makes the output better.
Instruction following seems to be quite good, and so is swipe diversity. No refusals detected.

This model was made in collaboration with [ArliAI](https://www.arliai.com/).
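
For a quick sanity check outside of a frontend, here's a minimal generation sketch. It assumes the model loads through the standard transformers causal-LM path; `MODEL_ID` is a placeholder for wherever you have the weights. Temperature and Min-P are native `generate()` options; NSigma isn't in transformers as far as I know, so it's left out here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/G4-E4B-Musica-v1"  # placeholder: point at this repo or a local copy

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write the opening scene of a heist gone sideways."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Recommended samplers from below: temperature 1, min-p 0.02
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, min_p=0.02)
print(tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```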

**Training notes**

It was straight up impossible to tune E4B for a while, but now it seems to just work on the latest Axolotl. I also used the new hybrid FA2/SDPA attention implementation to make training faster.

This model uses an oddly large amount of VRAM for training: ~55.5 GB active per card with fully sharded FSDP2 for a bf16 LoRA. Training is decently fast, though; 9 hours for 2 epochs over ~55M trainable tokens is not too bad.
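
Back-of-envelope throughput, assuming the ~55M token figure is per epoch (halve it if it's the two-epoch total):

```python
tokens = 55e6 * 2          # ~55M trainable tokens per epoch, 2 epochs
seconds = 9 * 3600         # 9 hours wall clock
total = tokens / seconds   # aggregate across both GPUs
print(f"~{total:,.0f} tok/s total, ~{total / 2:,.0f} tok/s per GPU")
# -> ~3,395 tok/s total, ~1,698 tok/s per GPU
```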

I decided to change my tuning strategy for this one: 2 epochs with reflected exponential (REX) LR decay, to give it a proper anneal. That seems to have worked quite well.
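
For reference, this is roughly the decay shape REX gives you, using the formula from the REX paper and ignoring the warmup ramp (Axolotl's exact implementation may differ in details):

```python
def rex_lr(step: int, total_steps: int, lr_max: float = 1e-5) -> float:
    """Reflected exponential (REX) decay: gentle at first, steep at the end."""
    p = step / total_steps                 # training progress in [0, 1]
    return lr_max * (1 - p) / (1 - p / 2)

for pct in (0, 25, 50, 75, 90, 100):
    print(f"{pct:3d}% done: lr = {rex_lr(pct, 100):.2e}")
```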

r=64 / alpha=64 LoRA, REX @ 1e-5, 2 epochs, 9 hours on 2x RTX Pro 6000 Blackwell.

[allura-forge/musica-sft-v1-gemma4-pretok](https://huggingface.co/datasets/allura-forge/musica-sft-v1-gemma4-pretok) - pretokenized dataset.

[CometML Project](https://www.comet.com/aetherwiing/musica-e4b/view/MG0CxdwODRAGylGSywdR6YLeH/panels) - training graphs and stats.

[AuriAetherwiing/G4-E4B-Musica-v1-lora](https://huggingface.co/AuriAetherwiing/G4-E4B-Musica-v1-lora) - LoRA adapter.
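
If you'd rather apply the adapter to the base model yourself, something like this should work (a sketch, assuming standard PEFT/transformers loading for this architecture):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E4B-it", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "AuriAetherwiing/G4-E4B-Musica-v1-lora")
model = model.merge_and_unload()  # optional: bake the LoRA in for faster inference
```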

**Recommended Samplers**

- Temperature: 1
- Min-P: 0.02
- NSigma: 2

Don't use repetition penalties of any kind; they do more harm than good.
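
If you're curious what those samplers actually do, here's an illustrative single-step filter over raw logits, a sketch of Min-P and NSigma (top-nσ) as described in their papers; real backends differ in ordering and details:

```python
import torch

def filter_logits(logits: torch.Tensor, temperature: float = 1.0,
                  min_p: float = 0.02, n_sigma: float = 2.0) -> torch.Tensor:
    logits = logits / temperature

    # NSigma: keep only tokens whose logit sits within n_sigma standard
    # deviations of the max logit; everything below that is noise floor.
    cutoff = logits.max() - n_sigma * logits.std()
    logits = logits.masked_fill(logits < cutoff, float("-inf"))

    # Min-P: drop tokens less likely than min_p times the top token's probability.
    probs = torch.softmax(logits, dim=-1)
    logits = logits.masked_fill(probs < min_p * probs.max(), float("-inf"))
    return logits

logits = torch.randn(32_000)  # stand-in logits for one decoding step
next_token = torch.multinomial(torch.softmax(filter_logits(logits), dim=-1), 1)
```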

**Axolotl config**

<details><summary>See Axolotl config</summary>

```yaml
# =============================================================================
# BASE MODEL
# =============================================================================
base_model: /home/arli/models/gemma-4-E4B-it

# =============================================================================
# PLUGINS & KERNEL OPTIMIZATIONS
# =============================================================================
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
cut_cross_entropy: true

# =============================================================================
# QUANTIZATION
# =============================================================================
load_in_8bit: false
load_in_4bit: false

# =============================================================================
# DATASET
# =============================================================================
shuffle_merged_datasets: true
datasets:
  - path: allura-forge/musica-sft-v1-gemma4-pretok
    ds_type: parquet
    type:

dataset_prepared_path: ./last_run_prepared
val_set_size: 0

# =============================================================================
# OUTPUT & ADAPTER
# =============================================================================
output_dir: ./outputs/v1
adapter: lora
save_safetensors: true

# =============================================================================
# SEQUENCE & SAMPLE PACKING
# =============================================================================
sequence_len: 8192
sample_packing: true # DOES in fact work with SDPA
pad_to_sequence_len: false

# =============================================================================
# LORA
# =============================================================================
lora_r: 64
lora_alpha: 64
lora_dropout: 0.0
lora_target_modules: 'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj'

lora_mlp_kernel: false
lora_qkv_kernel: false
lora_o_kernel: false

# =============================================================================
# TRAINING HYPERPARAMETERS
# =============================================================================
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch_fused
lr_scheduler: rex
learning_rate: 1e-5
warmup_ratio: 0.05
max_grad_norm: 1.0
weight_decay: 0.05

# =============================================================================
# PRECISION
# =============================================================================
bf16: auto

# =============================================================================
# ATTENTION
# =============================================================================
#sdp_attention: true
#flash_attention: true
#flex_attention: true
#torch_compile: true
gemma4_hybrid_attn_impl: true

# =============================================================================
# LOGGING & MONITORING
# =============================================================================
use_comet: true # install comet-ml with pip and do comet login before starting
comet_project_name: musica-e4b
logging_steps: 1

# =============================================================================
# CHECKPOINTING & SAVING
# =============================================================================
auto_resume_from_checkpoints: false
evals_per_epoch: 0
saves_per_epoch: 4
save_total_limit: 4

gradient_checkpointing: false
gradient_checkpointing_kwargs:
  use_reentrant: false

fsdp_config:
  fsdp_version: 2
  offload_params: false
  cpu_ram_efficient_loading: false
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer
  state_dict_type: FULL_STATE_DICT
  sharding_strategy: FULL_SHARD
  reshard_after_forward: false
  activation_checkpointing: true
```
</details>