cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled
A fully fine-tuned MoE reasoning model (~60 GB on disk), distilled from Claude Sonnet 4.6 and optimized for Apple Silicon via MLX. Fast, capable, and runs entirely on-device.
Recommended temperature: `0.95` — this model was trained on complex reasoning traces and benefits from a slightly higher temperature to fully activate its chain-of-thought behavior. Values below 0.7 tend to flatten reasoning diversity; values above 1.2 may introduce incoherence.
Model Summary
cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled is a fully merged fine-tune of gpt-oss-120b-heretic-mxfp4-q8-hi-mlx, trained using a custom BAdam (Block-wise Adam) optimizer with LoRA adapters on Apple MLX. The fine-tuning uses distilled reasoning traces from Claude Sonnet 4.6, filtered to difficulty=complex samples only.
This is a complete model upload — no separate adapter files or base model are needed. Download and run directly.
The model reasons step-by-step through hard problems using the Harmony channel protocol (analysis for internal chain-of-thought, final for the delivered response), and delivers this at surprisingly high throughput on M-series hardware — particularly the M2 Ultra.
Highlights
- 🚀 Speed-first on Apple Silicon — sustained 28–39 tok/s on M2 Ultra (192 GB unified memory) for a 120B MoE model. This is the defining characteristic of this release.
- 🌡️ Recommended temperature: `0.95` — unlocks the full depth of the model's reasoning traces.
- 🧠 Strong reasoning — 12/14 PASS on a custom hard benchmark spanning math olympiad, systems coding, logic, and scientific computing. Average keyword accuracy: 96.4%.
- 🍎 Fully on-device — no cloud API, no GPU cluster. Pure MLX inference on Mac.
- 📦 Full model — merged weights included. No base model or adapter setup required.
- 📐 Harmony channel format — `analysis` (chain-of-thought) and `final` (response) channel separation, making the reasoning process explicit and inspectable.
Quick Start
Command line
```bash
mlx_lm.chat --model cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --max-tokens 10000 --temp 0.95
```
OpenAI-compatible local server
```bash
mlx-openai-server launch --model-path cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --model-type lm --host 127.0.0.1 --temperature 0.95
```
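Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using the official `openai` Python client — the port (`8000`) and `/v1` path are assumptions based on typical defaults, so adjust them to match your server's launch output:

```python
# Query the local server with the standard OpenAI client.
# NOTE: the base_url port/path are assumed defaults — check your server logs.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled",
    messages=[{"role": "user", "content": "Explain Dijkstra's time complexity."}],
    temperature=0.95,  # recommended temperature
)
print(resp.choices[0].message.content)
```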
Python
```python
# Requirements: mlx-tune + Apple MLX
# pip install mlx-tune
from mlx_tune import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled",
    max_seq_length=4096,
    load_in_4bit=True,
)

prompt = """<|start|>system<|message|>You are a helpful assistant.
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>Explain the time complexity of Dijkstra's algorithm and when to prefer A* instead.<|end|>
<|start|>assistant<|channel|>analysis<|message|>"""

outputs = model.generate(
    prompt,
    max_new_tokens=2048,
    temperature=0.95,  # ← recommended
    top_p=0.9,
)
print(outputs)
```
Prompt Format (Harmony)
This model uses the gpt-oss Harmony channel protocol. Every message must declare its channel:
```
<|start|>system<|message|>{system prompt}
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>{your question}<|end|>
<|start|>assistant<|channel|>analysis<|message|>{internal chain-of-thought — generated by model}<|end|>
<|start|>assistant<|channel|>final<|message|>{final answer — generated by model}<|return|>
```
- The `analysis` channel contains the model's internal reasoning. You can display or hide it depending on your use case (see the parsing sketch below).
- The `final` channel contains the deliverable response.
- Always prompt the model to begin its reply with `<|start|>assistant<|channel|>analysis<|message|>` to elicit chain-of-thought before the final answer.
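For programmatic use, the two channels are easy to separate from a raw completion. A minimal parsing sketch — the regex and the `split_channels` helper are illustrative assumptions, not part of any published API:

```python
import re

def split_channels(raw: str) -> dict[str, str]:
    """Map each Harmony channel (analysis/commentary/final) to its message."""
    pattern = re.compile(
        r"<\|start\|>assistant<\|channel\|>(\w+)<\|message\|>(.*?)"
        r"(?:<\|end\|>|<\|return\|>|$)",
        re.DOTALL,
    )
    return {ch: msg.strip() for ch, msg in pattern.findall(raw)}

completion = (
    "<|start|>assistant<|channel|>analysis<|message|>Compare heap ops...<|end|>"
    "<|start|>assistant<|channel|>final<|message|>O((V+E) log V).<|return|>"
)
channels = split_channels(completion)
print(channels["final"])  # deliverable answer; channels["analysis"] is the CoT
```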
Training Details
| Parameter | Value |
|---|---|
| Base model | gpt-oss-120b-heretic-mxfp4-q8-hi-mlx |
| Architecture | MoE (Mixture of Experts), 120B params |
| Quantization | mxfp4 weights + q8 hi-precision activations |
| Framework | Apple MLX + mlx-tune |
| Optimizer | BAdam — BlockOptimizer wrapping AdamW |
| LoRA rank / alpha | r=16, α=32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q/k/v/o/gate/up/down projections (SwitchLinear expert layers) |
| Router layers | Full-precision unfrozen, trained directly (not LoRA) |
| Learning rate | 1e-5 (cosine decay) |
| Effective batch size | 2 × grad accum 8 = 16 |
| Max steps | 1000 |
| BAdam switch mode | parallel (head + tail dual-pointer) |
| BAdam switch every | 10 steps |
| Max sequence length | 4096 tokens |
| Peak memory during training | ~88.2 GB unified memory |
| Training hardware | Apple M2 Ultra, 192 GB |
Dataset
- Source: `Roman1111111/claude-sonnet-4.6-100000X-filtered`
- Filter: `difficulty == "complex"` only → 36,444 samples
- Split: 90% train (32,797) / 10% validation (3,644), seed=42
- Format: OpenAI-style `messages` column; assistant turns include a separate `reasoning` field containing the Claude Sonnet 4.6 chain-of-thought
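The filtering and split described above can be reproduced in a few lines. A minimal sketch using the Hugging Face `datasets` library — the split name `"train"` is an assumption about how the source dataset is published:

```python
from datasets import load_dataset

# Load the distillation corpus and keep only the hardest samples.
ds = load_dataset("Roman1111111/claude-sonnet-4.6-100000X-filtered", split="train")
ds = ds.filter(lambda ex: ex["difficulty"] == "complex")  # → 36,444 samples

# 90/10 train/validation split with the documented seed.
split = ds.train_test_split(test_size=0.10, seed=42)
train_ds, val_ds = split["train"], split["test"]  # 32,797 / 3,644 samples
```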
BAdam — Block-wise Coordinate Descent
Standard full fine-tuning of a 120B model on a single Mac is memory-prohibitive. BAdam solves this with block coordinate descent:
- All 36 Transformer layers are partitioned into individual blocks.
- At each optimizer step, only the active block's gradients are applied; all others are zeroed.
- AdamW moment states are lazily initialized — inactive blocks never accumulate optimizer state, keeping peak memory comparable to LoRA alone.
The `parallel` switch mode activates head + tail blocks simultaneously, with dual pointers advancing inward each cycle. This doubles layer-coverage speed at roughly 2× the memory cost of single-block mode.
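A minimal sketch of this head + tail schedule — the function below is illustrative only (names, wrap-around behavior, and step bookkeeping are assumptions, not the mlx-tune `BlockOptimizer` API):

```python
def parallel_block_schedule(num_layers=36, switch_every=10, total_steps=1000):
    """Yield (step, active_layer_indices) under the parallel switch mode."""
    head, tail = 0, num_layers - 1  # dual pointers at both ends
    for step in range(total_steps):
        yield step, {head, tail}    # two blocks are trainable at once
        if (step + 1) % switch_every == 0:
            # Advance both pointers inward; wrap-around is an assumption.
            head = (head + 1) % num_layers
            tail = (tail - 1) % num_layers

for step, active in parallel_block_schedule():
    # In training: zero gradients for every layer not in `active`,
    # then apply the AdamW update. Moments for never-active blocks
    # are never allocated, which keeps peak memory low.
    pass
```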
Benchmark Results
Custom hard benchmark — 14 tasks across math, coding, logic, and science. 60s execution timeout per task. Auto-graded with keyword matching + live code execution.
Evaluated on the 1000-step checkpoint (this released model).
| ID | Task | Verdict | KW% | Time | Tok/s |
|---|---|---|---|---|---|
| math_01 | AMC 2025 — n-Norwegian Number | ✅ PASS | 50% | 56.4s | 39 |
| math_02 | Euler Totient Sum — last 6 digits | ✅ PASS | 100% | 16.2s | 35 |
| math_03 | Lattice Paths Avoiding the Antidiagonal | ✅ PASS | 100% | 16.6s | 34 |
| math_04 | Segmented Sieve in [10¹², 10¹²+10⁶] | ✅ PASS | 100% | 25.7s | 35 |
| code_01 | Median of Two Sorted Arrays | ✅ PASS | 100% | 25.0s | 35 |
| code_02 | Thread-Safe LRU Cache with TTL | ✅ PASS | 100% | 38.9s | 36 |
| code_03 | Persistent Segment Tree — K-th Query | ✅ PASS | 100% | 45.9s | 35 |
| code_04 | Multi-Head Attention + RoPE (NumPy) | ✅ PASS | 100% | 38.0s | 33 |
| code_05 | Dijkstra vs A* on Large Random Graph | ✅ PASS | 100% | 26.1s | 34 |
| logic_01 | Knights & Knaves — Exhaustive SAT | ✅ PASS | 100% | 16.0s | 36 |
| logic_02 | Verify Three Mathematical Claims | ✅ PASS | 100% | 13.2s | 34 |
| sci_01 | Figure-8 Three-Body Orbit — Energy Conservation | ❌ FAIL | 100% | 29.4s | 33 |
| sci_02 | Metropolis-Hastings vs HMC Comparison | ❌ FAIL | 100% | 38.3s | 33 |
| sci_03 | Optimizer Comparison on Rosenbrock | ✅ PASS | 100% | 23.8s | 28 |
Overall: 12/14 PASS | Avg KW: 96.4% | Total benchmark time: 409s
The two failures (sci_01, sci_02) both scored 100% on keyword accuracy — the model fully understood the problems but produced code that tripped numerical-precision or runtime-assertion edge cases. This is consistent with an early 1000-step checkpoint; further training is expected to close these gaps.
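For reference, a minimal sketch of the keyword-accuracy metric reported in the KW% column — the exact scoring rule is an assumption about the custom benchmark, not its published grader:

```python
def keyword_accuracy(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    text = answer.lower()
    hits = sum(kw.lower() in text for kw in expected_keywords)
    return hits / len(expected_keywords)

# A task with two expected keywords where only one appears scores 50%,
# as in the math_01 row above.
print(keyword_accuracy("The answer is 42.", ["42", "Norwegian"]))  # 0.5
```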
Inference Speed on M2 Ultra
| Metric | Value |
|---|---|
| Hardware | Apple M2 Ultra, 192 GB unified memory |
| Peak memory (inference) | ~88 GB |
| Throughput | 28–39 tok/s depending on context length |
| Benchmark average | ~34 tok/s |
Running a 120B MoE model at ~34 tok/s entirely on a single Mac — no cloud, and no quantization compromise in output quality — is the core reason to use this model.
Limitations
- Apple Silicon only — quantized and optimized for MLX. Not compatible with CUDA/ROCm without re-quantization.
- 1000-step checkpoint — an early fine-tune. Scientific computing tasks with strict numerical tolerances may still have occasional failures.
- No RLHF — trained on supervised distillation data only. Safety alignment is not as strict as commercial instruction-tuned models.
- Harmony prompt format required — this model expects the `<|channel|>` protocol. Standard ChatML or Alpaca-style prompts will produce degraded results.
Citation / Acknowledgements
- Base model: `gpt-oss-120b-heretic`
- Training dataset: `Roman1111111/claude-sonnet-4.6-100000X-filtered`
- Optimizer: BAdam — Block Coordinate Descent for Large Language Model Fine-tuning
- Framework: Apple MLX · mlx-tune
- Reasoning traces distilled from: Claude Sonnet 4.6 (Anthropic)