# Qwen 3.6 27B — Opus CoT Stage 1 (BF16 merged)
Stage 1 of a two-stage SFT recipe on Qwen 3.6 27B, focused on Claude-Opus-4.6-style chain-of-thought reasoning. This repo holds the BF16 merged checkpoint — the stage-1 LoRA has already been merged back into the base, so this is a drop-in replacement for `Qwen/Qwen3.6-27B` with reasoning-tuned weights.
**Lineage:** `Qwen/Qwen3.6-27B` → stage-1 LoRA (reasoning SFT) → merge → this checkpoint.
## Where this fits in the release
| Artifact | Repo |
|---|---|
| Stage-1 BF16 merged (this repo) | samscrack/Qwen3.6-27B-Opus-CoT-Stage1 |
| Stage-2 LoRA adapter (Hermes tool-calling, applies to this base) | samscrack/Qwen3.6-27B-Hermes-S2-LoRA |
| Stage-1 + Stage-2 merged + FP8 quantized (final release) | samscrack/Qwen3.6-27B-Opus-CoT-S1-Hermes-S2-SFT |
If you want the production model, use the FP8 repo above. This repo is intended for users who want to:
- apply the stage-2 LoRA themselves (e.g. with different hyperparameters), or
- finetune further on top of a reasoning-tuned base, or
- run the reasoning-only variant in BF16 without the tool-calling stage.
## Intended use
- Local serving in BF16 (~52 GB) on a single ≥80 GB GPU, or sharded across two ≥48 GB GPUs.
- Base for further LoRA / SFT / DPO / RLHF.
- Chain-of-thought-style chat without tool calling.
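Rough arithmetic behind the BF16 footprint (a sketch; the nominal 27B parameter count is an approximation, and activation/KV-cache memory comes on top of the weights):

```python
# BF16 stores 2 bytes per parameter, so the weights alone of a nominally
# 27B-parameter model occupy roughly:
params = 27e9          # nominal count; the exact count is slightly lower
bytes_per_param = 2    # bfloat16
weights_gb = params * bytes_per_param / 1e9
print(round(weights_gb))  # 54 — same ballpark as the ~52 GB quoted above
```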
## Quick start (transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("samscrack/Qwen3.6-27B-Opus-CoT-Stage1")
model = AutoModelForCausalLM.from_pretrained(
    "samscrack/Qwen3.6-27B-Opus-CoT-Stage1",
    torch_dtype="auto",
    device_map="auto",
)

msgs = [{"role": "user", "content": "Why does ice float on water?"}]
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt").to(model.device)
# do_sample=True so the temperature setting actually takes effect
out = model.generate(inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```
## Apply the stage-2 LoRA (Hermes tool-calling)

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "samscrack/Qwen3.6-27B-Opus-CoT-Stage1", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "samscrack/Qwen3.6-27B-Hermes-S2-LoRA")
# Optional: bake the adapter into the weights for adapter-free serving
# merged = model.merge_and_unload()
```
## Training — Stage 1 only

| Setting | Value |
|---|---|
| Method | Supervised fine-tuning, LoRA via Unsloth + TRL `SFTTrainer`, then `merge_and_unload` to BF16 |
| Base | `Qwen/Qwen3.6-27B` (text-only causal LM; `*ForConditionalGeneration` rewritten to `*ForCausalLM` for SFT) |
| LoRA | r=64, α=64, dropout=0, targets: q_proj, k_proj, v_proj, o_proj, out_proj, gate_proj, up_proj, down_proj |
| Optimizer / LR | AdamW, 2e-4, cosine schedule with warmup, weight decay 0.01 |
| Schedule | 2 epochs, per-device batch 4 × grad_accum 9 × 2 GPUs → effective batch 72, ctx 8192 |
| Steps / final loss | 346 / 0.250 |
| Wall clock | ~4 h on 2× RTX PRO 6000 Blackwell, DDP via `torchrun --standalone --nproc-per-node=2` |
### Datasets (concatenated then shuffled)

| Dataset | Rows | Provenance |
|---|---|---|
| nohurry/Opus-4.6-Reasoning-3000x-filtered | 3,900 | Claude Opus 4.6 CoT distillations |
| khazarai/qwen3.6-plus-high-reasoning-500x | 500 | Qwen 3.6 reasoning samples |
| Roman1111111/claude-opus-4.6-10000x | 9,633 | Claude Opus 4.6 CoT distillations |
## Software
PyTorch 2.8.0+cu128, Transformers 5.2.0, TRL 0.22.2, PEFT 0.19.1, Unsloth 2026.4.7, datasets 4.3.0.
## Limitations

- Inherits all limitations of `Qwen/Qwen3.6-27B` — refusal patterns, knowledge cutoff, tokenizer biases.
- Reasoning teacher is largely Claude Opus 4.6, so chain-of-thought style and refusal calibration partly reflect Claude's, not Qwen's.
- No tool-calling tuning here — that's stage 2. Out of the box this checkpoint produces plain prose CoT, not Hermes-format `<tool_call>{...}</tool_call>` outputs.
- No RLHF / DPO step — supervised only.
## Acknowledgements

The two-stage recipe and helper code are adapted from Jackrong's `Jackrong-llm-finetuning-guide` (notebook `Qwopus3-5-27b-Colab.ipynb`), ported to a local dual-GPU setup with no other changes to the data pipeline.
Dataset authors: nohurry, khazarai, Roman1111111.
Tooling: Unsloth, TRL, PEFT, Qwen team.
## License

Apache 2.0, inherited from `Qwen/Qwen3.6-27B`. Dataset licenses apply to derived behavior — see each dataset card.