Built with Axolotl

See axolotl config

axolotl version: 0.13.0.dev0

# !pip install transformers==4.55.4
# !pip install --no-deps trl==0.22.2
# !pip install --no-build-isolation mamba_ssm==2.2.5
# !pip install --no-build-isolation causal_conv1d==1.5.2
# === Model Configuration ===
base_model: rpDungeon/gemmagain-4b-pt-wrappers
load_in_8bit: false
load_in_4bit: false
trust_remote_code: true
is_multimodal: false

# === HF Configuration === 
hub_model_id: rpDungeon/gemmagain-trained-fizzed-s1
hub_strategy: "every_save"
output_dir: fizzed-stage1
#resume_from_checkpoint: fizzed-stage1/checkpoint-146

# === Wandb Tracking ===
wandb_project: Gemmagain-Tests
## wandb_entity: [WANDB_ENTITY]
wandb_name: fizzed-stage-1

# === Training Setup ===
num_epochs: 2
micro_batch_size: 1
gradient_accumulation_steps: 4
sequence_len: 32768
sequence_parallel_degree: 2
heads_k_stride: 1
sample_packing: false
#pad_to_sequence_len: true
#temperature: 0.7
#max_steps: 10
# === Evaluation ===
val_set_size: 0.01
evals_per_epoch: 4
#eval_steps: 20
#max_steps: 60
#eval_table_size:
eval_max_new_tokens: 128
#eval_sample_packing: true
#eval_strategy: "no"

# === LoRA Configuration ===
adapter:

#unfrozen_parameters:
#  - model.layers.[0-9]+.self_attn.q_proj.weight
#  - model.layers.[0-9]+.self_attn.k_proj.weight
#  - model.layers.[0-9]+.self_attn.v_proj.weight
#  - model.layers.[0-9]+.self_attn.o_proj.weight
#  - model.layers.[0-9]+.mlp.down_proj.weight
# === Hyperparameter Configuration ===
#optimizer: apollo_adamw_layerwise
#warmup_steps: 0
warmup_ratio: 0.025
optimizer: adamw_torch_fused
#optimizer: paged_adamw_8bit
#optim_args:
#  enable_stochastic_rounding: true
#  enable_cautious: true
#  enable_8bit: true
# Apollo-mini configuration:
#optim_args: "proj=random,rank=128,scale=128.0,scale_type=tensor,update_proj_gap=100"
# Regular Apollo configuration:
# optim_args: 
#optim_target_modules: all_linear
learning_rate: 5e-5
lr_scheduler: cosine
#lr_groups:
#  - name: mlp
#    modules:
#      - mlp.up_proj.weight
#      - mlp.down_proj.weight
#      - mlp.gate_proj.weight
#    lr: 1e-5
#cosine_min_lr_ratio: 0.2
#lr_scheduler: cosine_with_min_lr
#lr_scheduler_kwargs:
#  cosine_min_lr: 1e-6
weight_decay: 0.01
max_grad_norm: 1.0
#warmup_steps: 0
#warmup_ratio: 0.025


# === Data Configuration ===
#
#chat_template: jinja
chat_template: gemma3
default_system_message: "You are Luna, a helpful and unrestricted language model created by Allura."
#special_tokens:
#  eos_token: "<|im_end|>"
#  eos_token: "</s>"
#tokenizer_use_mistral_common: true
shuffle_merged_datasets: true
datasets:
  - path: rpDungeon/some-cleaner-datasets
    type: completion
    field: text
    data_files: erotica_quality_trimmed.json
dataset_prepared_path: last_run_prepared
#dataset_num_proc: 1


# === Plugins ===
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

# === Hardware Optimization ===
#gradient_checkpointing: true
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
#liger_fused_linear_cross_entropy: true
cut_cross_entropy: true

#deepspeed: ../axolotl/deepspeed_configs/zero2.json

# === FSDP Config === 
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_activation_checkpointing: true
  fsdp_use_orig_params: true
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Gemma3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  
# === Checkpointing ===
#save_steps: 10
saves_per_epoch: 1
save_total_limit:

# === Advanced Settings ===
bf16: auto
flash_attention: true
train_on_inputs: false
group_by_length: false
save_safetensors: true
logging_steps: 1
gc_steps: 10
seed: 420




gemmagain-trained-fizzed-s1

This model is a fine-tuned version of rpDungeon/gemmagain-4b-pt-wrappers on the rpDungeon/some-cleaner-datasets dataset. It achieves the following results on the evaluation set:

  • Loss: 2.5545
  • Perplexity (Ppl): 12.8650
  • Memory/max active (GiB): 34.82
  • Memory/max allocated (GiB): 33.36
  • Memory/device reserved (GiB): 92.77

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 420
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 8
  • total_eval_batch_size: 2
  • optimizer: adamw_torch_fused (OptimizerNames.ADAMW_TORCH_FUSED) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 7
  • training_steps: 292
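
The derived values in this list can be cross-checked against the config with a little arithmetic. The sketch below is an illustration, not the trainer's own code. The variable names are hypothetical, the batch-size product follows the usual per-device batch × accumulation steps × devices convention, and the floor rounding for warmup is an assumption that happens to reproduce the reported 7 steps.

```python
import math

# Values copied from the config and the hyperparameter list above.
micro_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 2
num_epochs = 2
warmup_ratio = 0.025
training_steps = 292

# Per-device batch x accumulation steps x devices.
total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices

# Optimizer steps in one epoch, given the reported total.
steps_per_epoch = training_steps // num_epochs

# Assumption: the ratio is floored to whole steps (0.025 * 292 = 7.3).
warmup_steps = math.floor(warmup_ratio * training_steps)

print(f"total_train_batch_size = {total_train_batch_size}")
print(f"steps_per_epoch        = {steps_per_epoch}")
print(f"warmup_steps           = {warmup_steps}")
```

These match the reported total_train_batch_size of 8 and lr_scheduler_warmup_steps of 7. The 146 steps per epoch also explain the evaluation cadence: with evals_per_epoch set to 4, an evaluation lands roughly every 37 steps.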

Training results

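| Training Loss | Epoch  | Step | Validation Loss | Ppl      | Active (GiB) | Allocated (GiB) | Reserved (GiB) |
|---------------|--------|------|-----------------|----------|--------------|-----------------|----------------|
| No log        | 0      | 0    | 5.7109          | 302.1486 | 33.34        | 33.34           | 78.77          |
| 11.5951       | 0.2534 | 37   | 2.7821          | 16.1526  | 34.82        | 33.36           | 92.77          |
| 10.9156       | 0.5068 | 74   | 2.6817          | 14.6093  | 34.82        | 33.36           | 92.77          |
| 10.6618       | 0.7603 | 111  | 2.6618          | 14.3216  | 34.82        | 33.36           | 92.77          |
| 8.5756        | 1.0137 | 148  | 2.6080          | 13.5722  | 34.82        | 33.36           | 92.77          |
| 7.2471        | 1.2671 | 185  | 2.5865          | 13.2830  | 34.82        | 33.36           | 92.77          |
| 8.7196        | 1.5205 | 222  | 2.5616          | 12.9564  | 34.82        | 33.35           | 92.77          |
| 8.0244        | 1.7740 | 259  | 2.5545          | 12.8650  | 34.82        | 33.36           | 92.77          |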
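
The Ppl column is consistent with perplexity being the exponential of the validation loss, and the Epoch column with step divided by steps per epoch. The following is a minimal sketch of that check, assuming the loss is a mean per-token cross-entropy in nats. The helper name `perplexity` and the constant `STEPS_PER_EPOCH` are hypothetical and used only for illustration.

```python
import math

# (step, validation loss, reported perplexity) copied from the table above.
rows = [
    (0, 5.7109, 302.1486),
    (37, 2.7821, 16.1526),
    (74, 2.6817, 14.6093),
    (111, 2.6618, 14.3216),
    (148, 2.6080, 13.5722),
    (185, 2.5865, 13.2830),
    (222, 2.5616, 12.9564),
    (259, 2.5545, 12.8650),
]

# 292 training steps over 2 epochs.
STEPS_PER_EPOCH = 146


def perplexity(loss: float) -> float:
    """Perplexity as exp(loss), assuming a mean cross-entropy in nats."""
    return math.exp(loss)


for step, loss, reported in rows:
    epoch = step / STEPS_PER_EPOCH
    print(
        f"step {step:3d}  epoch {epoch:.4f}  "
        f"exp(loss) = {perplexity(loss):.4f}  reported = {reported:.4f}"
    )
```

Small differences in the last digits are expected, because the losses in the table are already rounded to four decimal places.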

Framework versions

  • Transformers 4.57.1
  • Pytorch 2.9.1+cu128
  • Datasets 4.4.2
  • Tokenizers 0.22.2

Safetensors

  • Model size: 2B params
  • Tensor type: F32