BlueStar v3
Qwen3.5 27B
Designed for RP and writing tasks.
Dunno if it's better than v2 but I like it. Main difference is just the addition of some RP reasoning data from GLM5 & K2.5.
Non thinking and thinking are both supported. If you want to use thinking, it is required to prefill the <think>\n as that is how it was trained.
Creation Process: SFT
SFT on approx 56 million tokens.
Same as v2 for the most part with one big difference. Chub dataset was replaced with another version that has reasoning that was trained on the last turn only. This explodes the dataset out to 56 million tokens, but means the multi-turn reasoning gets trained correctly.
Also added a subset of 200 Gryphe RP samples that were shown as having a high lexical difference from my current dataset.
Trained using Axolotl.
Axolotl Config
base_model: Qwen/Qwen3.5-27B
plugins:
- axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
strict: false
datasets:
- path: ./data/bluestar_v4_sft_2_masked_20260402_120553.jsonl
val_set_size: 0.03
output_dir: ./Qwen3.5-27B-v3-SFT-2
sequence_len: 10756
sample_packing: true
load_in_8bit: true
adapter: lora
lora_r: 128
lora_alpha: 128
peft_use_rslora: true
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- down_proj
- up_proj
# Uncomment below to also target the linear attention projections.
# These use separate in_proj_qkv / in_proj_z / out_proj (Qwen3.5-specific).
- linear_attn.in_proj_qkv
- linear_attn.in_proj_z
- linear_attn.out_proj
wandb_project: Qwen3.5-27B-SFT
wandb_name: Qwen3.5-27B-v3-SFT-2
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 2
optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 1.2e-5
weight_decay: 0.01
warmup_ratio: 0.05
bf16: auto
tf32: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true
evals_per_epoch: 4
saves_per_epoch: 4
special_tokens:
fsdp_config:
fsdp_version: 2
offload_params: false
cpu_ram_efficient_loading: false
auto_wrap_policy: TRANSFORMER_BASED_WRAP
transformer_layer_cls_to_wrap: Qwen3_5DecoderLayer
state_dict_type: FULL_STATE_DICT
sharding_strategy: FULL_SHARD
reshard_after_forward: true
activation_checkpointing: true
- Downloads last month
- 92
Model tree for ApocalypseParty/Qwen3.5-27B-v3-SFT-2
Base model
Qwen/Qwen3.5-27B