cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled
A fully fine-tuned MoE reasoning model (~60 GB on disk), distilled from Claude Sonnet 4.6 and optimized for Apple Silicon via MLX. Fast, capable, and runs entirely on-device.
Recommended temperature: `0.95` — this model was trained on complex reasoning traces and benefits from a slightly higher temperature to fully activate its chain-of-thought behavior. Values below 0.7 tend to flatten reasoning diversity; values above 1.2 may introduce incoherence.
Model Summary
cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled is a fully merged fine-tune of gpt-oss-120b-heretic-mxfp4-q8-hi-mlx, trained using a custom BAdam (Block-wise Adam) optimizer with LoRA adapters on Apple MLX. The fine-tuning uses distilled reasoning traces from Claude Sonnet 4.6, filtered to difficulty=complex samples only.
This is a complete model upload — no separate adapter files or base model are needed. Download and run directly.
The model reasons step-by-step through hard problems using the Harmony channel protocol (analysis for internal chain-of-thought, final for the delivered response), and delivers this at surprisingly high throughput on M-series hardware — particularly the M2 Ultra.
Highlights
- 🚀 Speed-first on Apple Silicon — sustained 28–39 tok/s on M2 Ultra (192 GB unified memory) for a 120B MoE model. This is the defining characteristic of this release.
- 🌡️ Recommended temperature: `0.95` — unlocks the full depth of the model's reasoning traces.
- 🧠 Strong reasoning — 12/14 PASS on a custom hard benchmark spanning math olympiad, systems coding, logic, and scientific computing. Average keyword accuracy: 96.4%.
- 🍎 Fully on-device — no cloud API, no GPU cluster. Pure MLX inference on Mac.
- 📦 Full model — merged weights included. No base model or adapter setup required.
- 📐 Harmony channel format — `analysis` (chain-of-thought) and `final` (response) channel separation, making the reasoning process explicit and inspectable.
Quick Start
Command line
```bash
mlx_lm.chat --model cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --max-tokens 10000 --temp 0.95
```
OpenAI-compatible local server
```bash
mlx-openai-server launch --model-path cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled --model-type lm --host 127.0.0.1 --temperature 0.95
```
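Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using the official `openai` Python client — the port (`8000`) and `/v1` path are assumptions based on typical defaults, so adjust them to match your server's launch output:

```python
# Query the local server with the standard OpenAI client.
# NOTE: the base_url port/path are assumed defaults — check your server logs.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled",
    messages=[{"role": "user", "content": "Explain Dijkstra's time complexity."}],
    temperature=0.95,  # recommended temperature
)
print(resp.choices[0].message.content)
```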
Python
```python
# Requirements: mlx-tune + Apple MLX
# pip install mlx-tune
from mlx_tune import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled",
    max_seq_length=4096,
    load_in_4bit=True,
)

prompt = """<|start|>system<|message|>You are a helpful assistant.
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>Explain the time complexity of Dijkstra's algorithm and when to prefer A* instead.<|end|>
<|start|>assistant<|channel|>analysis<|message|>"""

outputs = model.generate(
    prompt,
    max_new_tokens=2048,
    temperature=0.95,  # ← recommended
    top_p=0.9,
)
print(outputs)
```
Prompt Format (Harmony)
This model uses the gpt-oss Harmony channel protocol. Every message must declare its channel:
```
<|start|>system<|message|>{system prompt}
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
<|start|>user<|message|>{your question}<|end|>
<|start|>assistant<|channel|>analysis<|message|>{internal chain-of-thought — generated by model}<|end|>
<|start|>assistant<|channel|>final<|message|>{final answer — generated by model}<|return|>
```
- The `analysis` channel contains the model's internal reasoning. You can display or hide it depending on your use case (see the parsing sketch below).
- The `final` channel contains the deliverable response.
- Always prompt the model to begin its reply with `<|start|>assistant<|channel|>analysis<|message|>` to elicit chain-of-thought before the final answer.
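For programmatic use, the two channels are easy to separate from a raw completion. A minimal parsing sketch — the regex and the `split_channels` helper are illustrative assumptions, not part of any published API:

```python
import re

def split_channels(raw: str) -> dict[str, str]:
    """Map each Harmony channel (analysis/commentary/final) to its message."""
    pattern = re.compile(
        r"<\|start\|>assistant<\|channel\|>(\w+)<\|message\|>(.*?)"
        r"(?:<\|end\|>|<\|return\|>|$)",
        re.DOTALL,
    )
    return {ch: msg.strip() for ch, msg in pattern.findall(raw)}

completion = (
    "<|start|>assistant<|channel|>analysis<|message|>Compare heap ops...<|end|>"
    "<|start|>assistant<|channel|>final<|message|>O((V+E) log V).<|return|>"
)
channels = split_channels(completion)
print(channels["final"])  # deliverable answer; channels["analysis"] is the CoT
```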
Training Details
| Parameter | Value |
|---|---|
| Base model | gpt-oss-120b-heretic-mxfp4-q8-hi-mlx |
| Architecture | MoE (Mixture of Experts), 120B params |
| Quantization | mxfp4 weights + q8 hi-precision activations |
| Framework | Apple MLX + mlx-tune |
| Optimizer | BAdam — BlockOptimizer wrapping AdamW |
| LoRA rank / alpha | r=16, α=32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q/k/v/o/gate/up/down projections (SwitchLinear expert layers) |
| Router layers | Full-precision unfrozen, trained directly (not LoRA) |
| Learning rate | 1e-5 (cosine decay) |
| Effective batch size | 2 × grad accum 8 = 16 |
| Max steps | 1000 |
| BAdam switch mode | parallel (head + tail dual-pointer) |
| BAdam switch every | 10 steps |
| Max sequence length | 4096 tokens |
| Peak memory during training | ~88.2 GB unified memory |
| Training hardware | Apple M2 Ultra, 192 GB |
Dataset
- Source: `Roman1111111/claude-sonnet-4.6-100000X-filtered`
- Filter: `difficulty == "complex"` only → 36,444 samples
- Split: 90% train (32,797) / 10% validation (3,644), seed=42
- Format: OpenAI-style `messages` column; assistant turns include a separate `reasoning` field containing the Claude Sonnet 4.6 chain-of-thought
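The filtering and split described above can be reproduced in a few lines. A minimal sketch using the Hugging Face `datasets` library — the split name `"train"` is an assumption about how the source dataset is published:

```python
from datasets import load_dataset

# Load the distillation corpus and keep only the hardest samples.
ds = load_dataset("Roman1111111/claude-sonnet-4.6-100000X-filtered", split="train")
ds = ds.filter(lambda ex: ex["difficulty"] == "complex")  # → 36,444 samples

# 90/10 train/validation split with the documented seed.
split = ds.train_test_split(test_size=0.10, seed=42)
train_ds, val_ds = split["train"], split["test"]  # 32,797 / 3,644 samples
```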
BAdam — Block-wise Coordinate Descent
Standard full fine-tuning of a 120B model on a single Mac is memory-prohibitive. BAdam solves this with block coordinate descent:
- All 36 Transformer layers are partitioned into individual blocks.
- At each optimizer step, only the active block's gradients are applied; all others are zeroed.
- AdamW moment states are lazily initialized — inactive blocks never accumulate optimizer state, keeping peak memory comparable to LoRA alone.
The `parallel` switch mode activates head + tail blocks simultaneously, with dual pointers advancing inward each cycle. This doubles layer-coverage speed at roughly 2× the memory cost of single-block mode.
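A minimal sketch of this head + tail schedule — the function below is illustrative only (names, wrap-around behavior, and step bookkeeping are assumptions, not the mlx-tune `BlockOptimizer` API):

```python
def parallel_block_schedule(num_layers=36, switch_every=10, total_steps=1000):
    """Yield (step, active_layer_indices) under the parallel switch mode."""
    head, tail = 0, num_layers - 1  # dual pointers at both ends
    for step in range(total_steps):
        yield step, {head, tail}    # two blocks are trainable at once
        if (step + 1) % switch_every == 0:
            # Advance both pointers inward; wrap-around is an assumption.
            head = (head + 1) % num_layers
            tail = (tail - 1) % num_layers

for step, active in parallel_block_schedule():
    # In training: zero gradients for every layer not in `active`,
    # then apply the AdamW update. Moments for never-active blocks
    # are never allocated, which keeps peak memory low.
    pass
```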
Benchmark Results
Custom hard benchmark — 14 tasks across math, coding, logic, and science. 60s execution timeout per task. Auto-graded with keyword matching + live code execution.
Evaluated on the 1000-step checkpoint (this released model).
| ID | Task | Verdict | KW% | Time | Tok/s |
|---|---|---|---|---|---|
| math_01 | AMC 2025 — n-Norwegian Number | ✅ PASS | 50% | 56.4s | 39 |
| math_02 | Euler Totient Sum — last 6 digits | ✅ PASS | 100% | 16.2s | 35 |
| math_03 | Lattice Paths Avoiding the Antidiagonal | ✅ PASS | 100% | 16.6s | 34 |
| math_04 | Segmented Sieve in [10¹², 10¹²+10⁶] | ✅ PASS | 100% | 25.7s | 35 |
| code_01 | Median of Two Sorted Arrays | ✅ PASS | 100% | 25.0s | 35 |
| code_02 | Thread-Safe LRU Cache with TTL | ✅ PASS | 100% | 38.9s | 36 |
| code_03 | Persistent Segment Tree — K-th Query | ✅ PASS | 100% | 45.9s | 35 |
| code_04 | Multi-Head Attention + RoPE (NumPy) | ✅ PASS | 100% | 38.0s | 33 |
| code_05 | Dijkstra vs A* on Large Random Graph | ✅ PASS | 100% | 26.1s | 34 |
| logic_01 | Knights & Knaves — Exhaustive SAT | ✅ PASS | 100% | 16.0s | 36 |
| logic_02 | Verify Three Mathematical Claims | ✅ PASS | 100% | 13.2s | 34 |
| sci_01 | Figure-8 Three-Body Orbit — Energy Conservation | ❌ FAIL | 100% | 29.4s | 33 |
| sci_02 | Metropolis-Hastings vs HMC Comparison | ❌ FAIL | 100% | 38.3s | 33 |
| sci_03 | Optimizer Comparison on Rosenbrock | ✅ PASS | 100% | 23.8s | 28 |
Overall: 12/14 PASS | Avg KW: 96.4% | Total benchmark time: 409s
The two failures (sci_01, sci_02) both scored 100% on keyword accuracy — the model fully understood the problems but produced code that tripped numerical-precision or runtime-assertion edge cases. This is consistent with an early 1000-step checkpoint; further training is expected to close these gaps.
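For reference, a minimal sketch of the keyword-accuracy metric reported in the KW% column — the exact scoring rule is an assumption about the custom benchmark, not its published grader:

```python
def keyword_accuracy(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    text = answer.lower()
    hits = sum(kw.lower() in text for kw in expected_keywords)
    return hits / len(expected_keywords)

# A task with two expected keywords where only one appears scores 50%,
# as in the math_01 row above.
print(keyword_accuracy("The answer is 42.", ["42", "Norwegian"]))  # 0.5
```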
Inference Speed on M2 Ultra
| Metric | Value |
|---|---|
| Hardware | Apple M2 Ultra, 192 GB unified memory |
| Peak memory (inference) | ~88 GB |
| Throughput | 28–39 tok/s depending on context length |
| Benchmark average | ~34 tok/s |
Running a 120B MoE model at ~34 tok/s entirely on a single Mac — no cloud, and no quantization compromise in output quality — is the core reason to use this model.
Limitations
- Apple Silicon only — quantized and optimized for MLX. Not compatible with CUDA/ROCm without re-quantization.
- 1000-step checkpoint — an early fine-tune. Scientific computing tasks with strict numerical tolerances may still have occasional failures.
- No RLHF — trained on supervised distillation data only. Safety alignment is not as strict as commercial instruction-tuned models.
- Harmony prompt format required — this model expects the `<|channel|>` protocol. Standard ChatML or Alpaca-style prompts will produce degraded results.
Citation / Acknowledgements
- Base model: `gpt-oss-120b-heretic`
- Training dataset: `Roman1111111/claude-sonnet-4.6-100000X-filtered`
- Optimizer: BAdam — Block Coordinate Descent for Large Language Model Fine-tuning
- Framework: Apple MLX · mlx-tune
- Reasoning traces distilled from: Claude Sonnet 4.6 (Anthropic)