cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled

A fully fine-tuned 120B MoE reasoning model (~60 GB on disk), distilled from Claude Sonnet 4.6 and optimized for Apple Silicon via MLX. Fast, capable, and runs entirely on-device.

Recommended temperature: 0.95 — this model was trained on complex reasoning traces and benefits from slightly higher temperature to fully activate its chain-of-thought behavior. Values below 0.7 tend to flatten reasoning diversity; values above 1.2 may introduce incoherence.


Model Summary

cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled is a fully merged fine-tune of gpt-oss-120b-heretic-mxfp4-q8-hi-mlx, trained using a custom BAdam (Block-wise Adam) optimizer with LoRA adapters on Apple MLX. The fine-tuning uses distilled reasoning traces from Claude Sonnet 4.6, filtered to difficulty=complex samples only.

This is a complete model upload — no separate adapter files or base model are needed. Download and run directly.

The model reasons step-by-step through hard problems using the Harmony channel protocol (analysis for internal chain-of-thought, final for the delivered response), and delivers this at surprisingly high throughput on M-series hardware — particularly the M2 Ultra.


Highlights

  • 🚀 Speed-first on Apple Silicon — sustained 28–39 tok/s on M2 Ultra (192 GB unified memory) for a 120B MoE model. This is the defining characteristic of this release.
  • 🌡️ Recommended temperature: 0.95 — unlocks the full depth of the model's reasoning traces.
  • 🧠 Strong reasoning — 12/14 PASS on a custom hard benchmark spanning math olympiad, systems coding, logic, and scientific computing. Average keyword accuracy: 96.4%.
  • 🍎 Fully on-device — no cloud API, no GPU cluster. Pure MLX inference on Mac.
  • 📦 Full model — merged weights included. No base model or adapter setup required.
  • 📐 Harmony channel format — analysis (chain-of-thought) and final (response) channel separation, making the reasoning process explicit and inspectable.

Quick Start

Command Line

mlx_lm.chat --model cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --max-tokens 10000 --temp 0.95

OpenAI Local Service

mlx-openai-server launch --model-path cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --model-type lm --host 127.0.0.1 --temperature 0.95
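
Once the server is up, any OpenAI-compatible client can query it. Below is a minimal client sketch using the official openai Python package; the port (8000) and the /v1 base path are assumptions, so check the server's startup log for the actual address.

# Hedged example: call the local OpenAI-compatible endpoint launched above.
# Port 8000 and the /v1 path are assumptions; verify against the server log.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled",
    messages=[{"role": "user", "content": "Explain Dijkstra vs A* search."}],
    temperature=0.95,  # recommended setting for this model
    max_tokens=2048,
)
print(resp.choices[0].message.content)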

Python

# Requirements: mlx-tune + Apple MLX
# pip install mlx-tune

from mlx_tune import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled",
    max_seq_length=4096,
    load_in_4bit=True,
)

prompt = """<|start|>system<|message|>You are a helpful assistant.
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>Explain the time complexity of Dijkstra's algorithm and when to prefer A* instead.<|end|>
<|start|>assistant<|channel|>analysis<|message|>"""

outputs = model.generate(
    prompt,
    max_new_tokens=2048,
    temperature=0.95,   # ← recommended
    top_p=0.9,
)
print(outputs)

Prompt Format (Harmony)

This model uses the gpt-oss Harmony channel protocol. Every message must declare its channel:

<|start|>system<|message|>{system prompt}
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>{your question}<|end|>
<|start|>assistant<|channel|>analysis<|message|>{internal chain-of-thought — generated by model}<|end|>
<|start|>assistant<|channel|>final<|message|>{final answer — generated by model}<|return|>
  • The analysis channel contains the model's internal reasoning. You can display or hide it depending on your use case.
  • The final channel contains the deliverable response.
  • Always prompt the model to begin its reply with <|start|>assistant<|channel|>analysis<|message|> to elicit chain-of-thought before the final answer.
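
When consuming raw completions programmatically, you need to separate the channels yourself. A minimal sketch, assuming the tag layout shown above; the helper name split_channels is illustrative, not part of any library:

import re

def split_channels(raw: str) -> dict:
    """Map each channel name (analysis, commentary, final) to its text."""
    pattern = re.compile(
        r"<\|channel\|>(\w+)<\|message\|>(.*?)(?:<\|end\|>|<\|return\|>|$)",
        re.DOTALL,
    )
    return {channel: text.strip() for channel, text in pattern.findall(raw)}

completion = (
    "<|start|>assistant<|channel|>analysis<|message|>Compare the two...<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Dijkstra runs in "
    "O((V+E) log V)...<|return|>"
)
channels = split_channels(completion)
print(channels["final"])  # the deliverable answer
# channels["analysis"] holds the chain-of-thought; display or hide as needed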

Training Details

| Parameter | Value |
|---|---|
| Base model | gpt-oss-120b-heretic-mxfp4-q8-hi-mlx |
| Architecture | MoE (Mixture of Experts), 120B params |
| Quantization | mxfp4 weights + q8 high-precision activations |
| Framework | Apple MLX + mlx-tune |
| Optimizer | BAdam — BlockOptimizer wrapping AdamW |
| LoRA rank / alpha | r=16, α=32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q/k/v/o/gate/up/down projections (SwitchLinear expert layers) |
| Router layers | Full precision, unfrozen, trained directly (not via LoRA) |
| Learning rate | 1e-5 (cosine decay) |
| Effective batch size | 2 × grad accum 8 = 16 |
| Max steps | 1000 |
| Max sequence length | 4096 tokens |
| BAdam switch mode | parallel (head + tail dual-pointer) |
| BAdam switch interval | every 10 steps |
| Peak training memory | ~88.2 GB unified memory |
| Training hardware | Apple M2 Ultra, 192 GB |
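
For reproduction scripts, the hyperparameters above can be captured as a plain config. The key names below are illustrative, not mlx-tune's actual schema:

# The training table as a plain Python dict. Key names are illustrative;
# map them onto your trainer's real config schema.
TRAIN_CONFIG = {
    "base_model": "gpt-oss-120b-heretic-mxfp4-q8-hi-mlx",
    "lora": {
        "r": 16,
        "alpha": 32,
        "dropout": 0.05,
        "target_modules": ["q", "k", "v", "o", "gate", "up", "down"],
    },
    "train_router_layers": True,   # full precision, no LoRA
    "learning_rate": 1e-5,
    "lr_schedule": "cosine",
    "batch_size": 2,
    "grad_accum_steps": 8,         # effective batch size 16
    "max_steps": 1000,
    "max_seq_length": 4096,
    "badam": {"switch_mode": "parallel", "switch_every": 10},
}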

Dataset

  • Source: Roman1111111/claude-sonnet-4.6-100000X-filtered
  • Filter: difficulty == "complex" only → 36,444 samples
  • Split: 90% train (32,797) / 10% validation (3,644), seed=42
  • Format: OpenAI-style messages column; assistant turns include a separate reasoning field containing the Claude Sonnet 4.6 chain-of-thought
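
The filter and split described above are straightforward to reproduce with the Hugging Face datasets library. A minimal sketch; the "train" split name is an assumption about the dataset's layout:

from datasets import load_dataset

# Keep only the hardest samples, then do the 90/10 split with the stated seed.
ds = load_dataset("Roman1111111/claude-sonnet-4.6-100000X-filtered", split="train")
ds = ds.filter(lambda ex: ex["difficulty"] == "complex")  # 36,444 samples
splits = ds.train_test_split(test_size=0.1, seed=42)      # 32,797 / 3,644
train_ds, val_ds = splits["train"], splits["test"]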

BAdam — Block-wise Coordinate Descent

Standard full fine-tuning of a 120B model on a single Mac is memory-prohibitive. BAdam solves this with block coordinate descent:

  • All 36 Transformer layers are partitioned into individual blocks.
  • At each optimizer step, only the active block's gradients are applied; all others are zeroed.
  • AdamW moment states are lazily initialized — inactive blocks never accumulate optimizer state, keeping peak memory comparable to LoRA alone.
  • parallel switch mode activates head + tail blocks simultaneously, with dual pointers advancing inward each cycle. This doubles layer coverage speed at ~2× the memory cost of single-block mode.
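
The mechanism can be illustrated with a small, framework-agnostic NumPy sketch. The class name BlockAdamW and its interface are hypothetical; the repo's actual BlockOptimizer may differ:

import numpy as np

class BlockAdamW:
    """Toy BAdam: AdamW applied only to two active blocks (head + tail),
    with lazily created moment state and pointers that advance inward."""

    def __init__(self, n_blocks, lr=1e-5, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.01, switch_every=10):
        self.n_blocks, self.lr = n_blocks, lr
        self.b1, self.b2 = betas
        self.eps, self.wd, self.switch_every = eps, weight_decay, switch_every
        self.head, self.tail = 0, n_blocks - 1  # "parallel" dual pointers
        self.step_count = 0
        self.state = {}                          # (block, name) -> moments

    def step(self, params, grads):
        self.step_count += 1
        for i in {self.head, self.tail}:         # inactive blocks are skipped
            for name, g in grads[i].items():
                key = (i, name)
                if key not in self.state:        # lazy init: idle blocks hold no state
                    self.state[key] = {"m": np.zeros_like(g),
                                       "v": np.zeros_like(g), "t": 0}
                s = self.state[key]
                s["t"] += 1
                s["m"] = self.b1 * s["m"] + (1 - self.b1) * g
                s["v"] = self.b2 * s["v"] + (1 - self.b2) * g * g
                m_hat = s["m"] / (1 - self.b1 ** s["t"])
                v_hat = s["v"] / (1 - self.b2 ** s["t"])
                p = params[i][name]
                p -= self.lr * (m_hat / (np.sqrt(v_hat) + self.eps) + self.wd * p)
        if self.step_count % self.switch_every == 0:  # move both pointers inward
            self.head = (self.head + 1) % self.n_blocks
            self.tail = (self.tail - 1) % self.n_blocks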

Benchmark Results

Custom hard benchmark — 14 tasks across math, coding, logic, and science. 60s execution timeout per task. Auto-graded with keyword matching + live code execution.

Evaluated on the 1000-step checkpoint (this released model).

| ID | Task | Verdict | KW % | Time | Tok/s |
|---|---|---|---|---|---|
| math_01 | AMC 2025 — n-Norwegian Number | ✅ PASS | 50% | 56.4s | 39 |
| math_02 | Euler Totient Sum — last 6 digits | ✅ PASS | 100% | 16.2s | 35 |
| math_03 | Lattice Paths Avoiding the Antidiagonal | ✅ PASS | 100% | 16.6s | 34 |
| math_04 | Segmented Sieve in [10¹², 10¹²+10⁶] | ✅ PASS | 100% | 25.7s | 35 |
| code_01 | Median of Two Sorted Arrays | ✅ PASS | 100% | 25.0s | 35 |
| code_02 | Thread-Safe LRU Cache with TTL | ✅ PASS | 100% | 38.9s | 36 |
| code_03 | Persistent Segment Tree — K-th Query | ✅ PASS | 100% | 45.9s | 35 |
| code_04 | Multi-Head Attention + RoPE (NumPy) | ✅ PASS | 100% | 38.0s | 33 |
| code_05 | Dijkstra vs A* on Large Random Graph | ✅ PASS | 100% | 26.1s | 34 |
| logic_01 | Knights & Knaves — Exhaustive SAT | ✅ PASS | 100% | 16.0s | 36 |
| logic_02 | Verify Three Mathematical Claims | ✅ PASS | 100% | 13.2s | 34 |
| sci_01 | Figure-8 Three-Body Orbit — Energy Conservation | ❌ FAIL | 100% | 29.4s | 33 |
| sci_02 | Metropolis-Hastings vs HMC Comparison | ❌ FAIL | 100% | 38.3s | 33 |
| sci_03 | Optimizer Comparison on Rosenbrock | ✅ PASS | 100% | 23.8s | 28 |

Overall: 12/14 PASS  |  Avg KW: 96.4%  |  Total benchmark time: 409s

The two failures (sci_01, sci_02) both scored 100% on keyword accuracy — the model fully understood the problems but generated code that failed on numerical-precision or runtime-assertion edge cases. This is consistent with an early 1000-step checkpoint; further training is expected to close these gaps.

Inference Speed on M2 Ultra

| Metric | Value |
|---|---|
| Hardware | Apple M2 Ultra, 192 GB unified memory |
| Peak memory (inference) | ~88 GB |
| Throughput | 28–39 tok/s, depending on context length |
| Benchmark average | ~34 tok/s |

Running a 120B MoE model at 34 tok/s entirely on a single Mac — no cloud, no quantization compromise in output quality — is the core reason to use this model.

More details about the tests.

Limitations

  • Apple Silicon only — quantized and optimized for MLX. Not compatible with CUDA/ROCm without re-quantization.
  • 1000-step checkpoint — an early fine-tune. Scientific computing tasks with strict numerical tolerances may still have occasional failures.
  • No RLHF — trained on supervised distillation data only. Safety alignment is not as strict as commercial instruction-tuned models.
  • Harmony prompt format required — this model expects the <|channel|> protocol. Standard ChatML or Alpaca-style prompts will produce degraded results.


