whaletech.ai / W1-4B-dLLM-Base

Whaletech banner

Technical usage and inference docs: TECHNICAL_README.md

✨ 概述 | Overview

W1-dLLM 是一个扩散式语言模型。
W1-dLLM is a diffusion language model.

它建立在一个简单、但如今越来越清楚的想法之上:
It is built on a simple idea that is now becoming increasingly clear:

强大的语言建模,不一定只能依赖自回归。
Powerful language modeling does not have to be autoregressive.

而今天,这个判断第一次不再只是一个猜想。
And today, for the first time, that idea feels like more than a hypothesis.

我们看到了一组非常有分量的结果同时出现:
We are seeing a remarkably meaningful set of results emerge together:

  • 稳定的优化过程
    stable optimization
  • 在灵活长度文本上的真实扩展能力
    genuine scaling on variable-length text
  • 强烈的自我修改与纠错能力
    strong self-revision and error-correction behavior
  • 小参数模型上的明显泛化
    clear generalization in a relatively small model
  • 以及越来越明确的信号:扩散式语言模型正在走出过去“半自回归”的过渡状态
    and an increasingly strong signal that diffusion language models are moving beyond their old “semi-autoregressive” transitional phase

它开始展现的,是一种更本质的能力:
What is beginning to emerge is something more fundamental:

真正的并行生成能力。
True parallel generation.

这个模型不只是会生成文本。
This model does not merely generate text.

它更像是在隐空间里先形成、再修改、再细化自己的表达,然后才把它稳定地落实到语言之中。
It behaves more like a system that first forms, then revises, then refines its expression in latent space before finally settling it into language.

在某个中期 checkpoint,当模型第一次展现出这种能力的时候,全公司的人聚集在同一个屏幕前,仿佛在看新生儿的降临。
At one mid-training checkpoint, when the model first showed this behavior, the entire company gathered around the same screen as if witnessing the birth of a newborn.

而这件事,非常重要。
And that moment mattered.


🧠 模型结构 | Model Architecture

W1-dLLM 采用基于扩散 Transformer 的语言建模结构,而不是标准的自回归解码器。
W1-dLLM uses a diffusion Transformer architecture for language modeling rather than a standard autoregressive decoder.

核心配置 | Core Configuration

  • 48 个扩散 Transformer 模块
    48 diffusion Transformer blocks
  • 若结合输入嵌入层与最终输出层,可视为 50 层模型
    50 layers total if counting the input embedding layer and final output layer
  • 词表大小:64,512
    Vocabulary size: 64,512
  • 隐藏维度:2,048
    Hidden size: 2,048
  • 注意力内部维度:3,072
    Attention inner dimension: 3,072
  • 前馈网络维度:7,168
    FFN dimension: 7,168

关键组件 | Architectural Ingredients

  • 时间步嵌入
    timestep embeddings
  • 旋转位置编码
    rotary positional encoding (RoPE)
  • 自适应层归一化调制
    adaptive LayerNorm modulation
  • 均方根归一化
    RMSNorm
  • SwiGLU 前馈网络
    SwiGLU feed-forward networks
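The ingredients above can be sketched as a single diffusion Transformer block. This is a hypothetical illustration, not W1-dLLM's actual implementation: the class names, head count, and exact wiring are our assumptions (RoPE is omitted for brevity); only the dimensions follow the configuration listed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)
    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class DiTBlock(nn.Module):
    """One diffusion Transformer block with adaptive LayerNorm modulation.
    Hypothetical sketch; defaults mirror the card's numbers (hidden 2048,
    attention inner dim 3072, FFN dim 7168). RoPE is omitted for brevity."""
    def __init__(self, dim=2048, attn_dim=3072, ffn_dim=7168, heads=16):
        super().__init__()
        self.heads, self.head_dim = heads, attn_dim // heads
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * attn_dim, bias=False)
        self.proj = nn.Linear(attn_dim, dim, bias=False)
        self.ffn = SwiGLU(dim, ffn_dim)
        # adaLN modulation: timestep embedding -> shift/scale/gate pairs
        # for the attention and FFN sub-layers of this block
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, t_emb):
        # x: (B, L, dim) noisy token states; t_emb: (B, dim) timestep embedding
        s1, b1, g1, s2, b2, g2 = self.ada(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        B, L, _ = q.shape
        q, k, v = (z.view(B, L, self.heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        # bidirectional (non-causal) attention: no mask, every position
        # attends to every other -- designed in from pretraining
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(B, L, -1)
        x = x + g1 * self.proj(attn)
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.ffn(h)
```

Stacking 48 of these blocks between an input embedding and a final output head gives the 50-layer view described above.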

我们认为最关键的几点 | What We Believe Matters Most

1. 扩散 Transformer 结构本身非常重要 | The diffusion Transformer structure itself is essential

这不是“换一个外壳”,而是能力成立的基础。
This is not a cosmetic wrapper change. It is the foundation that makes these capabilities possible.

2. 自适应 LayerNorm 调制非常关键 | Adaptive LayerNorm modulation is critical

没有它,扩散过程中的条件控制和训练稳定性都会明显受影响。
Without it, conditional control in the diffusion process and training stability both degrade significantly.

3. 一些经典自回归 Transformer 优化可以迁移 | Some classic autoregressive Transformer optimizations transfer well

例如,扩大注意力内部维度依然有效。
For example, increasing the attention inner dimension remains beneficial.

4. 双向注意力必须从预训练阶段就一起设计 | Bidirectional attention must be designed in from pretraining

双向能力不能只在推理阶段“补”出来。
Bidirectionality cannot simply be patched in at inference time.

如果希望模型真正并行地思考、修订、收敛,就必须在结构和预训练阶段一起设计。
If one wishes the model to truly think, revise, and converge in parallel, that capability must be baked into both the architecture and the pretraining setup from the start.


⚙️ 训练特征 | Training Characteristics

这次训练里,一个非常重要的工程结论是:
One of the clearest engineering conclusions from this training run is:

扩散式语言模型需要足够大的 batch size。
Diffusion language models require sufficiently large batch sizes.

在我们的训练中,当 batch size 超过 500 之后,模型的训练状态、稳定性和效率都会明显更好。
In our training, once batch size exceeded 500, the model’s training behavior, stability, and efficiency all improved noticeably.

这和很多自回归训练经验并不完全一致,也说明扩散式训练在规模化时有自己独特的最优区间。
This differs from many familiar autoregressive training heuristics and suggests that diffusion training has its own scaling regime.

硬件利用 | Hardware Utilization

我们同时观察到,训练过程中模型浮点运算利用率(MFU)稳定在约 45%。
We also observed model FLOP utilization (MFU) remaining around 45% throughout training.

这个数字非常重要。
That number matters.

它说明这条路线并不只是“理论上并行”,而是已经在真实硬件利用上展现出实际价值。
It shows that this path is not merely theoretically parallel — it is already demonstrating real engineering value on actual hardware.

对于一个大批次、强并行的扩散语言模型来说,45% MFU 已经进入很有工程意义的区间。
For a large-batch, strongly parallel diffusion language model, 45% MFU is already meaningful.

而这显然还不是上限。
And it is clearly not the upper bound.
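As a rough sanity check, MFU can be estimated from parameter count, sustained token throughput, and hardware peak FLOP/s. The ~6·N FLOPs-per-token rule of thumb and every number below are illustrative assumptions for a 4B-parameter model, not figures from the actual W1 run.

```python
def model_flop_utilization(params, tokens_per_sec, peak_flops_per_sec,
                           flops_per_token_factor=6):
    """Rough MFU estimate: achieved FLOP/s divided by hardware peak FLOP/s.
    Uses the common ~6 * N FLOPs-per-token approximation for a dense
    Transformer forward+backward pass (ignores attention-length terms)."""
    achieved = flops_per_token_factor * params * tokens_per_sec
    return achieved / peak_flops_per_sec

# Illustrative numbers only: a 4B-parameter model sustaining
# 1.8M tokens/s on a cluster with 96 PFLOP/s aggregate peak.
mfu = model_flop_utilization(4e9, 1.8e6, 96e15)
print(f"MFU ≈ {mfu:.1%}")  # → MFU ≈ 45.0%
```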

优化过程 | Optimization Behavior

训练损失在整个过程中持续下降,而且异常稳定:
Training loss declined continuously throughout the run, and did so with unusual stability:

  • 没有坍塌
    no collapse
  • 没有回滚
    no rollback
  • 没有明显的不稳定阶段
    no obvious instability event

更重要的是,即使在同一批数据上继续训练,损失仍然能继续往下走。
More importantly, even when continuing to train on the same data, the loss still kept going down.

这给了我们一个很强的信号:
That gives us a strong signal:

扩散式训练可能显著缓解数据枯竭问题。
Diffusion-style training may significantly alleviate data exhaustion.

高质量数据依然极其重要,甚至对小参数扩散语言模型来说可能更重要。
High-quality data remains extremely important — perhaps even more so for smaller diffusion language models.

但这次我们看到的是:
But what we are seeing now is this:

高质量数据不只值钱,而且可以被更充分、更反复地利用。
High-quality data is not only valuable — it may also be reusable more fully and more repeatedly.


🔍 我们发现了什么 | What We Found

1️⃣ 隐空间思考不是表象,它真的存在

1️⃣ Latent-space thinking is not an illusion — it is real

从预训练开始,我们就有意识地让模型学习一种更偏全局观的隐空间思考范式。
From the beginning of pretraining, we intentionally encouraged the model to learn a more global style of latent-space reasoning.

最终出现的,不只是更像样的表面思维链。
What emerged was not merely a more convincing surface-level chain of thought.

模型似乎真的会在表达之前,先组织、回看、重构自己的想法。
Before expressing anything, the model appears to genuinely organize, revisit, and reconstruct its own ideas.

更让我们兴奋的是,我们观察到模型在这一过程中会在不同语言之间跳转:
Even more excitingly, we observed the model briefly switching between different languages during this process:

  • 英文 / English
  • 韩文 / Korean
  • 日文 / Japanese
  • 中文 / Chinese
  • 以及一些符号化表达 / and sometimes symbolic fragments

这些内容会短暂出现,然后再逐渐稳定到某一种最终语言形态。
These appear transiently and then gradually settle into a final language form.

这并不像简单的多语言混杂。
This does not look like simple multilingual mixing.

它更像是模型在潜在表示中先寻找更合适的内部表达,再把它压缩到最终可见的语言表面。
It looks more like the model is searching for a better internal representation before compressing it into final visible language.

模型的思考,并不等于它最终输出的文本。
The model’s thinking is not the same thing as its final output text.


2️⃣ 它不是只会续写,而是真的会一边并行一边修改

2️⃣ It does not just continue text — it revises while generating in parallel

这次训练里最令人兴奋的一点,是模型展现出了很强的:
One of the most exciting observations from this run was the model’s strong ability to:

  • 修改
    revise
  • 纠错
    correct itself
  • 反思
    reflect
  • 收敛到更好的答案
    converge toward better answers

但真正关键的是,这种修改发生在并行生成过程本身。
But the important part is that this revision happens inside the parallel generation process itself.

这不是先线性生成,再回头补救。
This is not linear generation followed by a repair pass.

模型更像是在整体展开的同时,不断修订并收敛自己的当前预测。
Instead, the model seems to expand globally while continuously revising and converging on its current predictions.

在刷榜的时候,我们看到模型:
During benchmarking, we often saw the model:

  • 最后才提交最终答案
    delay final answer commitment until late
  • 反复修复答案候选
    repeatedly refine answer candidates
  • 在结束前修补最终选择
    repair its final answer choice near the end

这和很多“披着扩散外壳”的半自回归系统很不一样。
This feels fundamentally different from many earlier systems that were still semi-autoregressive underneath a diffusion shell.

我们的结论越来越明确:
Our conclusion is becoming increasingly clear:

双向注意力的能力,不能只在推理时补出来。
The power of bidirectional attention cannot be recovered only at inference time.

它必须从预训练开始就和模型结构一起被设计进去。
It has to be designed jointly with the model architecture from the pretraining stage onward.

一旦这件事做对了,模型表现出来的就不再是半自回归。
Once this is done correctly, what emerges is no longer semi-autoregression.

而是更接近:
It is much closer to:

真正的并行扩散生成。
True parallel diffusion generation.
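What "generating, revising, and converging in parallel" can look like mechanically is sketched below as a mask-predict-style iterative parallel decoder. The `model` interface, the cosine re-masking schedule, and all names here are assumptions for illustration; W1's real inference procedure is documented in TECHNICAL_README.md.

```python
import math

MASK = -1  # sentinel id for a not-yet-committed position

def parallel_diffusion_decode(model, length, steps=8):
    """Iterative parallel decoding sketch: at every step the model predicts
    ALL positions at once (bidirectional attention); the lowest-confidence
    predictions are then re-masked and revised on the next step, so the
    sequence is generated, revised, and converged simultaneously."""
    tokens = [MASK] * length
    for step in range(steps):
        # assumed interface: model sees the whole (partially masked)
        # sequence and returns (token_id, confidence) for every position
        preds = model(tokens)
        tokens = [tok for tok, _ in preds]
        # cosine schedule: commit more positions as decoding progresses
        n_remask = int(length * math.cos(math.pi / 2 * (step + 1) / steps))
        if n_remask == 0:
            break  # everything committed -- decoding has converged
        # re-mask (i.e. revise next step) the least confident positions
        order = sorted(range(length), key=lambda i: preds[i][1])
        for i in order[:n_remask]:
            tokens[i] = MASK
    return tokens
```

Note how a low-confidence "answer" position can stay masked until late in the schedule, matching the delayed final-answer commitment described above.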


3️⃣ 灵活长度文本,这次真的被做出来了

3️⃣ We really made flexible-length text work

训练高质量扩散式语言模型本身就很难。
Training a high-quality diffusion language model is already difficult.

而能处理灵活长度文本,则更难。
Handling flexible-length text well is even harder.

不同提示词天然就应该对应不同长度的回答。
Different prompts should naturally lead to responses of different lengths.

所以长度不是边缘变量。
So length is not a peripheral variable.

它本身就是质量的核心变量之一。
It is one of the core variables of quality itself.

当我们真正把“长度”当成一等公民去处理之后,模型在灵活文本上的扩展能力才明显出现。
Once we started treating length as a first-class modeling problem, the model’s flexible-length generation capability improved significantly.

长度不是边角细节,而是建模质量的核心问题。
Length is not a side detail — it is a core modeling problem.


4️⃣ 小模型也能泛化,这一点非常关键

4️⃣ Small models can generalize too — and that matters a lot

即使在相对不大的参数规模下,模型依然展现出清晰的泛化能力。
Even at relatively modest scale, the model still demonstrated clear generalization ability.

而且只需要少量监督微调数据,就可以把模型调到我们想要的方向上。
And with only a small amount of supervised fine-tuning data, it could be steered toward the behaviors we wanted.

这会直接改变部署叙事。
This changes the deployment story directly.

它意味着扩散式语言模型不一定只适合做研究。
It suggests that diffusion language models may not be useful only for research.

它也很可能适合一条非常实用的路线:
They may also fit a very practical path:

  • 快 / fast
  • 便宜 / cheap
  • 好用 / usable

5️⃣ 并行能力终于不再只是口号

5️⃣ Parallelism is finally more than a slogan

过去很多扩散式语言模型系统,本质上还是半自回归模型。
In the past, many diffusion language model systems were still semi-autoregressive in essence.

但这次不一样。
This time feels different.

我们看到的,是更强烈、更明确的真实并行生成能力。
What we see now is a much stronger and clearer signal of real parallel generation.

而且这种并行,并不只是“同时吐出多个词元”。
And this parallelism is not simply about producing multiple tokens at once.

它意味着三件事同时发生:
It means three things happening at the same time:

  • 同时生成
    generate simultaneously
  • 同时修改
    revise simultaneously
  • 同时收敛
    converge simultaneously

同时生成,同时修改,同时收敛。
Simultaneously generate. Simultaneously revise. Simultaneously converge.

这直接关系到:
This directly affects:

  • 吞吐 / throughput
  • 时延 / latency
  • 成本 / cost
  • 产品可行性 / product viability

在词元消耗越来越大的今天,这件事尤其有吸引力。
In a world of exploding token consumption, that becomes especially compelling.


6️⃣ 自蒸馏不是想象

6️⃣ Self-distillation is not imaginary

我们越来越相信,自蒸馏会成为这条路线继续释放并行潜力的关键一步。
We increasingly believe that self-distillation will be a key next step in unlocking even more of this paradigm’s parallel potential.

今天看到的并行水平,还远远没有达到上限。
The level of parallelism we see today is still far from the ceiling.

通过自蒸馏,我们有机会:
Through self-distillation, we may be able to:

  • 进一步减少扩散步数
    reduce the number of diffusion steps
  • 让每一步处理更多词元
    process more tokens per step
  • 降低完成生成所需的总步数
    lower the total number of steps needed for generation
  • 在保持质量的同时提升吞吐
    improve throughput while preserving quality

换句话说,就是:
In practical terms, that means:

每一步更大、总步数更少、速度更快、词元更便宜。
Bigger steps, fewer total steps, faster generation, cheaper tokens.
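The "bigger steps, fewer total steps" arithmetic can be made concrete with a toy cost model. The linear-latency assumption and all numbers below are illustrative only, not Whaletech's measured figures.

```python
import math

def generation_cost(total_tokens, tokens_per_step, step_latency_s):
    """Toy cost model: a parallel diffusion decoder commits
    `tokens_per_step` tokens per denoising step, and total latency
    is assumed linear in step count."""
    steps = math.ceil(total_tokens / tokens_per_step)
    return steps, steps * step_latency_s

# Illustrative: a 1024-token response.
# Before distillation: 32 tokens committed per step.
# After a hypothetical self-distillation round: 64 per step.
before = generation_cost(1024, 32, 0.05)  # (32 steps, 1.6 s)
after = generation_cost(1024, 64, 0.05)   # (16 steps, 0.8 s)
```

Halving the step count halves latency in this toy model, which is exactly the lever self-distillation is meant to pull.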

而这件事的意义,不只是推理更便宜。
And the significance goes beyond inference cost alone.

它意味着扩散式语言模型的效率曲线,还远远没有被走完。
It suggests that the efficiency curve of diffusion language models is still far from fully explored.

今天看到的速度,还不是这条路线真正的速度。
The speed we see today is not yet the true speed of this paradigm.


🌍 为什么这件事重要 | Why This Matters

过去很多人会把扩散式语言模型看成:
For a long time, diffusion language models were often treated as:

  • 一种有趣的替代路线
    an interesting alternative
  • 一种研究方向
    a research curiosity
  • 或者一种不同的解码方法
    or merely a different decoding strategy

但这次的结果指向的是更深的一件事:
But these results point to something deeper:

它可能真的是一种新的语言模型范式。
This may truly be a new language modeling paradigm.

最重要的不是某一个单点指标。
What matters is not one isolated metric.

而是一组很罕见的优势,可能正在同时成立:
It is the rare possibility that many advantages may hold at the same time:

  • 更便宜、更稳定的训练动态
    cheaper, more stable training dynamics
  • 更高效的 GPU batch 利用
    more efficient GPU batch utilization
  • 更强的吞吐能力
    stronger throughput
  • 更低的首词元返回时延
    lower time-to-first-token latency
  • 更便宜的词元单价
    cheaper per-token pricing
  • 通往隐空间思考的自然路径
    a natural path toward latent-space thinking
  • 通往自我修改的自然路径
    a natural path toward self-revision
  • 更灵活的建模方式
    more flexible modeling
  • 更广阔的后训练优化空间
    a large post-training optimization space still waiting to be explored

说得更直接一点:
To put it plainly:

扩散式语言模型的价值,现在已经不是哲学问题。
The value of diffusion language models is no longer philosophical.

它正在成为工程现实。
It is becoming an engineering reality.


🖼️ 走向更大的融合 | Toward Broader Fusion

我们也正在积极探索文字、图像等更大规模融合的扩散模型。
We are also actively exploring larger-scale fused diffusion models across text, images, and beyond.

我们相信,扩散式建模不只适用于语言。
We do not believe diffusion-based modeling is limited to language alone.

它同样可能为今天以视觉—语言—动作模型为基础的具身智能系统,尤其是机器人模型,提供另一条替代性的思路。
It may also provide an alternative path for embodied intelligence systems built on vision-language-action foundations, especially in robotics.

所以,今天看到的并不是终点。
So what we are seeing today is not the end.

甚至可能还不是真正的开始。
It may not even be the real beginning yet.


⚠️ 局限性 | Limitations

我们也希望对这件事保持清醒。
We also want to stay clear-eyed about where things stand.

这套系统并不意味着扩散模型已经在所有维度上替代了自回归模型。
This system does not mean diffusion models have already replaced autoregressive models across every dimension.

眼下最令人兴奋的能力仍然很早期:
The most exciting capabilities are still early:

  • 隐空间推理
    latent-space reasoning
  • 迭代式自我修改
    iterative self-revision
  • 跨模态统一扩散建模
    unified cross-modal diffusion modeling

这些方向都还缺少足够成熟、足够标准化的评估体系。
All of these still lack sufficiently mature and standardized evaluation frameworks.

所以我们应该保持野心,也保持诚实。
So we should remain ambitious, but also honest.

即便如此,眼下的信号已经足够强,让我们可以明确说出这句话:
Even so, the signal is already strong enough for us to say this clearly:

扩散式语言模型的上限,远远还没有被真正看到。
The upper bound of diffusion language models is still far from truly visible.


💥 我们相信什么 | What We Believe

我们相信,自己正在做一件真正重要的事情。
We believe we are working on something genuinely important.

我们希望把词元的价格打下来,把快、便宜、好用刻进模型的基因里。
We want to drive token prices down dramatically and bake fast, cheap, and useful into the model’s DNA.

价格屠夫,一定有市场。
There is a market for being a ruthless price disruptor.

而这条路,还有太多空间没有被探索。
And there is still so much left unexplored.


🤝 加入我们 | Join Us

如果你也拥有:
If you also value:

  • 诚实可靠的品格
    honesty and reliability
  • 强烈的好奇心
    deep curiosity
  • 愿意探索未知的勇气
    the courage to explore the unknown

欢迎考虑加入我们。
we would love for you to consider joining us.

在蓝色鲸鱼,我们拥有:
At Whaletech, we have:

  • 充足的 GPU 资源
    abundant GPU resources
  • 高效的迭代节奏
    fast iteration speed
  • 优秀而投入的同事
    exceptional and deeply committed teammates
  • 始终热烈的技术氛围
    an intensely energetic technical culture

期待和你一起,在更广阔的世界里,继续探索那些尚未被定义的可能。
We look forward to exploring a much larger frontier together — one that has not yet been fully defined.

