Built with Axolotl

See axolotl config

axolotl version: 0.9.2

base_model: mlabonne/gemma-3-27b-it-abliterated-v2
base_model_config: mlabonne/gemma-3-27b-it-abliterated-v2
model_type: Gemma3ForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
bnb_4bit_compute_dtype: bfloat16
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: true
strict: false

datasets:
#  - path: data/rp_below_50_train_a_1.parquet
#    type: chat_template
#    field_messages: messages
#    split: train
#  - path: data/rp_below_50_train_a_2.parquet
#    type: chat_template
#    field_messages: messages
#    split: train
#  - path: data/rp_below_50_train_a_3.parquet
#    type: chat_template
#    field_messages: messages
#    split: train
#  - path: data/rp_below_50_train_a_4.parquet
#    type: chat_template
#    field_messages: messages
#    split: train
  - path: data/train_chats.parquet
    type: chat_template
    field_messages: messages
    split: train
#  - path: data/bdsm_train.parquet
#    type: chat_template
#    field_messages: messages
#    split: train
val_set_size: 0.0  # or use one of your test datasets with 'validation' split
test_datasets:
  - path: data/val_chats.parquet
    type: chat_template
    field_messages: messages
    split: train
#  - path: data/bdsm_test.parquet
#    type: chat_template
#    field_messages: messages
#    split: train
#  - path: data/rp_below_50_test.parquet
#    type: chat_template
#    field_messages: messages
#    split: train

chat_template: chatml

#val_set_size: 0.05 # This will now be ignored if you have a separate 'validation' split defined above
save_safetensors: true
output_dir: ./outputs/libry_j_ai # <-- OPTIONAL: Consider renaming for clarity with the new model

sequence_len: 4096 # Keep this for now, but be mindful of memory for 27B without Flash Attention

wandb_project: sdg-finetune
wandb_watch: all
wandb_name: Gemma3ForCausalLM

# Batch & Training Config
micro_batch_size: 2
eval_batch_size: 2 # Consistent with micro_batch_size
gradient_accumulation_steps: 4
num_epochs: 2

evals_per_epoch: 5
eval_sample_packing: true # <-- OPTIONAL: Changed back to true for efficiency and consistency
eval_table_size: 0

# Optimizer
optimizer: paged_adamw_32bit # DeepSpeed will manage the optimizer with its settings
adam_beta2: 0.95
adam_epsilon: 0.00001
lr_scheduler: cosine
learning_rate: 1e-4
weight_decay: 0.1
warmup_ratio: 0.03
gradient_clipping: 1.0
group_by_length: true

# Memory & Precision
train_on_inputs: true
train_on_eos: turn
bf16: true
fp16: false
tf32: false
sample_packing: false
pad_to_sequence_len: true # recommended when sample_packing is enabled (disabled above)

roles_to_train: ["assistant", "model", "user", "system"]

gradient_checkpointing: true
save_steps: 500
#gradient_checkpointing_kwargs:
#  use_reentrant: false
#resume_from_checkpoint: ./outputs/libry_run2/checkpoint-379 # <-- CHANGE THIS: Set to 'true' to continue from the last checkpoint
#auto_resume_from_checkpoint: false

logging_steps: 1
flash_attention: false        # DISABLED Flash Attention
flash_attention_v2: false     # DISABLED Flash Attention V2
attn_implementation: eager    # Set attention implementation back to eager

# LoRA Settings
adapter: lora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules: ["q_proj", "v_proj", "o_proj"]
lora_bias: none
lora_task_type: CAUSAL_LM

# DeepSpeed Configuration (Optimized)
deepspeed: deepspeed_configs/zero2_cpu_offload.json # Point to your DeepSpeed config file

# Hugging Face Upload (Disabled)
push_to_hub: false

special_tokens:
  eos_token: "<eos>"
  bos_token: "<bos>"
  unk_token: "<unk>"
  pad_token: "<pad>"
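With chat_template: chatml, each conversation in the `messages` field is rendered into ChatML turns before tokenization. A minimal sketch of that format, for illustration only (this is not Axolotl's actual templating code):

```python
# Illustrative ChatML rendering, matching chat_template: chatml above.
# Not Axolotl's implementation -- just the shape of the rendered text.
def render_chatml(messages):
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    return "\n".join(parts) + "\n"

example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(render_chatml(example))
```

Note that ChatML introduces its own turn delimiters (`<|im_start|>` / `<|im_end|>`); the special_tokens block above only pins the model's eos/bos/unk/pad tokens.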

outputs/libry_j_ai

This model is a fine-tuned version of mlabonne/gemma-3-27b-it-abliterated-v2 on the data/train_chats.parquet dataset. It achieves the following results on the evaluation set:

  • Loss: nan

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 8
  • optimizer: paged_adamw_32bit with betas=(0.9, 0.95) and epsilon=1e-05; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 224
  • num_epochs: 2.0
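The reported numbers above are mutually consistent; a quick sanity check (the total step count is not logged directly, so it is estimated from the results table below and the world size is implied by the reported totals):

```python
# Sanity-check the derived training hyperparameters.
micro_batch_size = 2
gradient_accumulation_steps = 4
world_size = 1  # implied by 8 / (2 * 4), despite the multi-GPU flag

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * world_size
print(total_train_batch_size)  # 8, matching the reported total_train_batch_size

# warmup_ratio: 0.03 over the whole run reproduces the reported 224 warmup
# steps if the run had roughly 7,494 optimizer steps (2 epochs at ~3,747
# steps per epoch, an estimate read off the results table).
total_steps = 7494  # assumption, not logged directly
warmup_steps = int(0.03 * total_steps)
print(warmup_steps)  # 224
```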

Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 2.8256        | 0.0003 | 1    | nan             |
| 1.9027        | 0.2002 | 750  | nan             |
| 2.0386        | 0.4004 | 1500 | nan             |
| 1.5538        | 0.6006 | 2250 | nan             |
| 2.107         | 0.8008 | 3000 | nan             |
| 0.8711        | 1.0008 | 3750 | nan             |
| 0.7524        | 1.2010 | 4500 | nan             |
| 0.8172        | 1.4012 | 5250 | nan             |
| 0.6572        | 1.6014 | 6000 | nan             |
| 0.8024        | 1.8016 | 6750 | nan             |
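Although validation loss is nan at every checkpoint, the epoch/step columns still allow a rough read of the run's scale: step 3750 lands at epoch 1.0008, which pins down the steps per epoch and, with the effective batch size of 8, the approximate number of training examples (an estimate, since steps per epoch is read off the table):

```python
# Back-of-the-envelope dataset-size estimate from the results table.
step, epoch = 3750, 1.0008          # row from the table above
steps_per_epoch = step / epoch      # ~3747 optimizer steps per epoch
total_train_batch_size = 8          # from the hyperparameters section
approx_examples = steps_per_epoch * total_train_batch_size
print(round(steps_per_epoch))   # 3747
print(round(approx_examples))   # 29976
```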

Framework versions

  • PEFT 0.15.2
  • Transformers 4.51.3
  • Pytorch 2.6.0+cu124
  • Datasets 3.5.1
  • Tokenizers 0.21.1