TokenWang committed 27 days ago

Commit

7223bcf

verified ·

1 Parent(s): 11da020

Upload folder using huggingface_hub

Files changed (19)

.gitattributes +9 -0
docs/assets/benchmarks/generation.webp +2 -2
docs/assets/benchmarks/interleaved.webp +2 -2
docs/assets/perform_vs_speed_avg3.png +3 -0
docs/assets/perform_vs_speed_avg8.png +3 -0
docs/assets/showcases/interleave/case_0001_makeup_three_looks.webp +3 -0
docs/assets/showcases/interleave/case_0003_beachfront_villa.webp +3 -0
docs/assets/showcases/interleave/case_0004_scented_candle_promo.webp +3 -0
docs/assets/showcases/interleave/case_0005_matchgirl_warm_au.webp +3 -0
docs/assets/showcases/interleave/case_0006_orange_cat_travel.webp +3 -0
docs/assets/showcases/interleave/case_0007_bowie_slide_design.webp +3 -0
docs/assets/showcases/vqa/agentic_case.webp +2 -2
docs/assets/showcases/vqa/agentic_case_2.webp +2 -2
docs/assets/teaser_1.png +2 -2
docs/assets/teaser_2.png +3 -0
docs/inference_infra.md +45 -31
docs/inference_infra_CN.md +46 -32
docs/showcases.md +7 -7
docs/showcases_CN.md +8 -6

.gitattributes CHANGED

@@ -164,3 +164,12 @@ docs/assets/showcases/vla/2.png filter=lfs diff=lfs merge=lfs -text
 docs/assets/showcases/vla/3.mp4 filter=lfs diff=lfs merge=lfs -text
 docs/assets/showcases/vla/3.png filter=lfs diff=lfs merge=lfs -text
 docs/assets/teaser_1.png filter=lfs diff=lfs merge=lfs -text

+docs/assets/perform_vs_speed_avg3.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/perform_vs_speed_avg8.png filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/interleave/case_0001_makeup_three_looks.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/interleave/case_0003_beachfront_villa.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/interleave/case_0004_scented_candle_promo.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/interleave/case_0005_matchgirl_warm_au.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/interleave/case_0006_orange_cat_travel.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/showcases/interleave/case_0007_bowie_slide_design.webp filter=lfs diff=lfs merge=lfs -text
+docs/assets/teaser_2.png filter=lfs diff=lfs merge=lfs -text
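
The added `.gitattributes` entries all follow the pattern that `git lfs track` writes. As a minimal sketch (the helper name `lfs_attribute_line` is hypothetical, and paths containing spaces are not handled), the same lines could be produced like this:

```python
# Attribute string that routes a path through Git LFS and disables text handling.
LFS_ATTRS = "filter=lfs diff=lfs merge=lfs -text"


def lfs_attribute_line(path: str) -> str:
    """Return one .gitattributes line for a path (hypothetical helper)."""
    return f"{path} {LFS_ATTRS}"


# A few of the assets added in this commit.
new_assets = [
    "docs/assets/perform_vs_speed_avg3.png",
    "docs/assets/perform_vs_speed_avg8.png",
    "docs/assets/teaser_2.png",
]
lines = [lfs_attribute_line(p) for p in new_assets]
print("\n".join(lines))
```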

docs/assets/benchmarks/generation.webp CHANGED

Git LFS Details

SHA256: cb93af198b32d7acac46d9eb33c2f2094c856fc37aff8caf34d1be0eeaafa74f
Pointer size: 131 Bytes
Size of remote file: 379 kB

Git LFS Details

SHA256: ae516a2b2c190212f77f94a9f07bf1efeaad13912fc18e249221aaff57d1a554
Pointer size: 131 Bytes
Size of remote file: 361 kB

docs/assets/benchmarks/interleaved.webp CHANGED

Git LFS Details

SHA256: 513e559e23326782188c3b3a6e3ed4eb56af560e14e81a14f44230db55d593d1
Pointer size: 131 Bytes
Size of remote file: 137 kB

Git LFS Details

SHA256: b32ea4a7982467f378fb77c10c2b8e9493e7670fa43d9c50a471aab77b93b6ed
Pointer size: 131 Bytes
Size of remote file: 213 kB

docs/assets/perform_vs_speed_avg3.png ADDED

Git LFS Details

SHA256: 92ad8409ce54f122c8f8aa42772d0da45ac388cda729f204082df9b87993d192
Pointer size: 131 Bytes
Size of remote file: 154 kB

docs/assets/perform_vs_speed_avg8.png ADDED

Git LFS Details

SHA256: 4e7387290f0218f717d3bec77ad042ff04584ee3ebd707c1f0260f892ffed7c7
Pointer size: 131 Bytes
Size of remote file: 151 kB

docs/assets/showcases/interleave/case_0001_makeup_three_looks.webp ADDED

Git LFS Details

SHA256: 979863dae87263f4cfd0e4a247edd3e205a6e634336b91047494ea142295a5b1
Pointer size: 131 Bytes
Size of remote file: 773 kB

docs/assets/showcases/interleave/case_0003_beachfront_villa.webp ADDED

Git LFS Details

SHA256: d858a2115edd9e83f7acf5aaa022ba8b84d710834f37fdf302fba045d809c19e
Pointer size: 131 Bytes
Size of remote file: 295 kB

docs/assets/showcases/interleave/case_0004_scented_candle_promo.webp ADDED

Git LFS Details

SHA256: 29286325fd88357ea43b2c3520c4762916d7b4f6c975d2fd86bad2acc716aa95
Pointer size: 131 Bytes
Size of remote file: 276 kB

docs/assets/showcases/interleave/case_0005_matchgirl_warm_au.webp ADDED

Git LFS Details

SHA256: eb1e495a0fded91dfccbd8a55bbc124a09ebbed75befc2bf0460925623995ac7
Pointer size: 131 Bytes
Size of remote file: 754 kB

docs/assets/showcases/interleave/case_0006_orange_cat_travel.webp ADDED

Git LFS Details

SHA256: fd41c65cb3053fcd34786aaf40cdc60193644434f22c4b5cc952b241ac5dae1c
Pointer size: 131 Bytes
Size of remote file: 556 kB

docs/assets/showcases/interleave/case_0007_bowie_slide_design.webp ADDED

Git LFS Details

SHA256: caf8ed8c8ea986cb9ebd0a2017c25ca53fa49f0aee8253fa95746e704f7ecdb8
Pointer size: 131 Bytes
Size of remote file: 989 kB

docs/assets/showcases/vqa/agentic_case.webp CHANGED

Git LFS Details

SHA256: 036d2dd2c36c388487179654351268a7d7aecd37313023d54006413038220c31
Pointer size: 131 Bytes
Size of remote file: 321 kB

Git LFS Details

SHA256: 8ce0276b70a7f85473c01272a546fa98cf9fc9c7ce0b56ef7eb7cdc35a9ec781
Pointer size: 131 Bytes
Size of remote file: 517 kB

docs/assets/showcases/vqa/agentic_case_2.webp CHANGED

Git LFS Details

SHA256: 5b3eab6d774b1499e3c57adbd028aaa5da7aee5432d2e30774f2464b62834acf
Pointer size: 131 Bytes
Size of remote file: 378 kB

Git LFS Details

SHA256: 578348b8030e07d1dc30e999c3e6040e87a5d74db61303d1e17ab949574b8a94
Pointer size: 131 Bytes
Size of remote file: 677 kB

docs/assets/teaser_1.png CHANGED

Git LFS Details

SHA256: 41c9bad2f89cf92d6ff2e454a65821ff0f53033de55dda3b10117ea519d89cfe
Pointer size: 132 Bytes
Size of remote file: 2.99 MB

Git LFS Details

SHA256: 3c4bcdc467525309fe6f5d7df377c8ef92af69fb2791218dee1cfdd2d0d75af6
Pointer size: 132 Bytes
Size of remote file: 2.97 MB

docs/assets/teaser_2.png ADDED

Git LFS Details

SHA256: 8e2d2bb4a32a663a03651d086ef96c7883eca56b660d4b8503afbb620cef08b5
Pointer size: 132 Bytes
Size of remote file: 8.45 MB
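
The "Pointer size" values above (131 or 132 bytes) follow from the Git LFS pointer format: a version line, an `oid sha256:` line, and a `size` line. A minimal sketch, assuming hypothetical byte counts (the page only reports rounded sizes such as "379 kB" and "2.99 MB"), shows why a six-digit size gives 131 bytes and a seven-digit size gives 132:

```python
def lfs_pointer(sha256_hex: str, size_bytes: int) -> bytes:
    """Build the contents of a Git LFS pointer file (spec v1)."""
    if len(sha256_hex) != 64:
        raise ValueError("expected a 64-character hex SHA-256 digest")
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha256_hex}\n"
        f"size {size_bytes}\n"
    ).encode("ascii")


# Digest copied from the first LFS entry above; the byte counts are
# illustrative assumptions, not the real file sizes.
oid = "cb93af198b32d7acac46d9eb33c2f2094c856fc37aff8caf34d1be0eeaafa74f"
pointer_small = lfs_pointer(oid, 379_000)    # six-digit size
pointer_large = lfs_pointer(oid, 2_990_000)  # seven-digit size
print(len(pointer_small), len(pointer_large))
```

Only the number of digits in the `size` field changes the pointer length, which is why every asset under 1 MB here shows 131 bytes and the multi-megabyte teasers show 132.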

docs/inference_infra.md CHANGED

@@ -17,8 +17,8 @@ These two engines exchange generation state through pinned shared memory and hig
 This design provides practical benefits in production:
-- Independent parallelism (for example, understanding with `TP=2`, generation
-  with `CFG=2` or `SP=2`).
 - Independent resource allocation (different GPU counts and memory budgets).
 - Independent scaling for text-heavy vs. image-heavy traffic.
 - Better operational isolation and simpler performance tuning.
@@ -32,7 +32,7 @@ In most production setups, `Separate` is the default choice because it gives cle
 ### Attention for Multimodal Prefill of NEO-Unify
-NEO-Unify's prefill attention is not standard causal attention. Text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid masking pattern, we modified both attention implementations in our stack: the Triton kernel and the official FA3 codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
 Concretely, we introduced an optional image_token_tag argument that adjusts the mask row by row. Text rows keep the standard causal mask. Image rows, instead of using plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within the image span.
@@ -47,23 +47,27 @@ The benchmark below compares two implementations for Neo-style multimodal prefil
   integration cost and faster iteration.
 - **FA3 implementation**: higher absolute performance on supported hardware.
-| batch | max_seq_len | image_token_num | triton (ms) | fa3 (ms) | speedup (×) |
-| ----: | ----------: | --------------: | -------------: | ----------: | ----------: |
-|     8 |        4096 |              88 |          1.95 |       0.81 |       **2.41×** |
-|     8 |        8192 |             171 |          6.55 |       2.68 |       **2.45×** |
-|     8 |       65536 |             150 |         43.30 |      14.95 |       **2.90×** |
-|    16 |        4096 |             379 |          4.12 |       1.68 |       **2.46×** |
-|    16 |        8192 |             246 |         17.76 |       7.40 |       **2.40×** |
-|    16 |       65536 |             206 |        107.74 |      33.66 |       **3.20×** |
-|    32 |        4096 |             726 |          8.46 |       3.46 |       **2.44×** |
-|    32 |        8192 |             536 |         31.74 |      13.24 |       **2.40×** |
-|    32 |       65536 |             417 |        171.00 |      58.26 |       **2.94×** |
-|    64 |        4096 |            1170 |         16.08 |       6.88 |       **2.34×** |
-|    64 |        8192 |            1177 |         55.48 |      22.91 |       **2.42×** |
-|    64 |       65536 |            1291 |        348.89 |     124.82 |       **2.80×** |
-|   128 |        4096 |            2057 |         30.89 |      12.53 |       **2.47×** |
-|   128 |        8192 |            2196 |        104.73 |      43.22 |       **2.42×** |
-|   128 |       65536 |            2205 |        706.60 |     241.67 |       **2.92×** |
 ### Deployment
@@ -76,14 +80,19 @@ see [`deployment.md`](./deployment.md).
 The table below is the benchmark template for **2048x2048** image generation.
 Fill in measured numbers for each machine and deployment profile.
-| GPU | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |
-| ---------- | ----------------- | --------------------------: | ---------------------: |
 | H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
 | H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
 | 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
 | L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
 In NEO-Unify, the KV cache for the generation stage is provided by the understanding module, so T2I (generation) and I2I (editing) have very similar runtime characteristics. For brevity, we report only T2I latency here.
@@ -93,13 +102,18 @@ The table below compares the latency of a single diffusion step for
 **2048x2048** image generation with **CFG enabled**. Unless otherwise noted,
 all measurements are taken on **H100**; the `NEO-Unify (TP2+CFG2)` result uses
 `2x H100`.
-| Model | Understanding | Generation | Per-step latency (s/step) |
-| ----- | ------------: | ---------: | -------------------------: |
-| Qwen-Image-2512 | 7B | 20B | 1.478 |
-| Z-Image | 4B | 6B | 1.110 |
-| GLM-Image | 9B | 7B | 1.394 |
-| ERNIE-Image | 8B | 8B | 1.565 |
-| LongCat-Image | 8B | 6B | 0.796 |
-| NEO-Unify (1x, no TP/CFG parallel) | 8B | 8B | 0.312 |
 | NEO-Unify (TP2+CFG2) | 8B | 8B | 0.158 |

 This design provides practical benefits in production:
+- Independent parallelism (for example, understanding with `TP=2` (Tensor Parallel=2), generation
+  with `CFG=2` (CFG Parallel=2) or `SP=2` (Sequence Parallel=2)).
 - Independent resource allocation (different GPU counts and memory budgets).
 - Independent scaling for text-heavy vs. image-heavy traffic.
 - Better operational isolation and simpler performance tuning.
 ### Attention for Multimodal Prefill of NEO-Unify
+NEO-Unify's prefill attention is not standard causal attention. Text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid masking pattern, we modified both attention implementations in our stack: the Triton kernel and the official FlashAttention3 (FA3) codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
 Concretely, we introduced an optional image_token_tag argument that adjusts the mask row by row. Text rows keep the standard causal mask. Image rows, instead of using plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within the image span.
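
The row-by-row masking rule described above can be sketched in plain Python. This is only an illustration under stated assumptions: `image_token_tag` is taken to be one boolean per token, an image span is taken to be a contiguous run of tagged tokens, and the function name `hybrid_prefill_mask` is hypothetical. It is not the Triton or FA3 kernel code.

```python
def hybrid_prefill_mask(image_token_tag):
    """Return mask[i][j] == True when query row i may attend to key j.

    Text rows stay causal (j <= i). Image rows see every earlier token
    plus all tokens up to the end of their own image span.
    """
    n = len(image_token_tag)
    # span_end[i] is the last index of the contiguous image run containing i,
    # or i itself for a text token.
    span_end = list(range(n))
    for i in range(n - 2, -1, -1):
        if image_token_tag[i] and image_token_tag[i + 1]:
            span_end[i] = span_end[i + 1]
    return [[j <= span_end[i] for j in range(n)] for i in range(n)]


# Two text tokens, a three-token image span, then one text token.
tags = [False, False, True, True, True, False]
mask = hybrid_prefill_mask(tags)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this sketch two back-to-back images would merge into one span; a real implementation would need per-image boundaries to keep them apart.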
   integration cost and faster iteration.
 - **FA3 implementation**: higher absolute performance on supported hardware.
+<div align="center">
+|  batch  | max_seq_len | image_token_num | triton (ms) | FA3 (ms) | speedup (×) |
+|:-------:|:-----------:|:---------------:|:-----------:|:--------:|:-----------:|
+|    8    |     4096    |       88        |    1.95     |   0.81   |  **2.41×**  |
+|    8    |     8192    |       171       |    6.55     |   2.68   |  **2.45×**  |
+|    8    |    65536    |       150       |   43.30     |  14.95   |  **2.90×**  |
+|   16    |     4096    |       379       |    4.12     |   1.68   |  **2.46×**  |
+|   16    |     8192    |       246       |   17.76     |   7.40   |  **2.40×**  |
+|   16    |    65536    |       206       |  107.74     |  33.66   |  **3.20×**  |
+|   32    |     4096    |       726       |    8.46     |   3.46   |  **2.44×**  |
+|   32    |     8192    |       536       |   31.74     |  13.24   |  **2.40×**  |
+|   32    |    65536    |       417       |  171.00     |  58.26   |  **2.94×**  |
+|   64    |     4096    |       1170      |   16.08     |   6.88   |  **2.34×**  |
+|   64    |     8192    |       1177      |   55.48     |  22.91   |  **2.42×**  |
+|   64    |    65536    |       1291      |  348.89     | 124.82   |  **2.80×**  |
+|  128    |     4096    |       2057      |   30.89     |  12.53   |  **2.47×**  |
+|  128    |     8192    |       2196      |  104.73     |  43.22   |  **2.42×**  |
+|  128    |    65536    |       2205      |  706.60     | 241.67   |  **2.92×**  |
+</div>
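
As a quick consistency check on the table above, the speedup column can be recomputed as `triton (ms) / FA3 (ms)`. The sketch below copies the published timings; small differences from the reported values are expected because the timings are already rounded to two decimals.

```python
# (batch, max_seq_len, triton_ms, fa3_ms, reported_speedup) from the table.
rows = [
    (8, 4096, 1.95, 0.81, 2.41), (8, 8192, 6.55, 2.68, 2.45),
    (8, 65536, 43.30, 14.95, 2.90), (16, 4096, 4.12, 1.68, 2.46),
    (16, 8192, 17.76, 7.40, 2.40), (16, 65536, 107.74, 33.66, 3.20),
    (32, 4096, 8.46, 3.46, 2.44), (32, 8192, 31.74, 13.24, 2.40),
    (32, 65536, 171.00, 58.26, 2.94), (64, 4096, 16.08, 6.88, 2.34),
    (64, 8192, 55.48, 22.91, 2.42), (64, 65536, 348.89, 124.82, 2.80),
    (128, 4096, 30.89, 12.53, 2.47), (128, 8192, 104.73, 43.22, 2.42),
    (128, 65536, 706.60, 241.67, 2.92),
]

# Largest absolute gap between the recomputed and reported speedups.
max_gap = max(abs(triton / fa3 - reported) for _, _, triton, fa3, reported in rows)
print(f"largest gap between recomputed and reported speedup: {max_gap:.3f}")
```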
 ### Deployment
 The table below is the benchmark template for **2048x2048** image generation.
 Fill in measured numbers for each machine and deployment profile.
+Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.
+<div align="center">
+| GPU  | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |
+|:----:|:-----------------:|:-------------------------:|:----------------------:|
 | H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
 | H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
 | 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
 | L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
+</div>
 In NEO-Unify, the KV cache for the generation stage is provided by the understanding module, so T2I (generation) and I2I (editing) have very similar runtime characteristics. For brevity, we report only T2I latency here.
 **2048x2048** image generation with **CFG enabled**. Unless otherwise noted,
 all measurements are taken on **H100**; the `NEO-Unify (TP2+CFG2)` result uses
 `2x H100`.
+Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.
+<div align="center">
+|       Model       | Understanding | Generation | Per-step latency (s/step) |
+|:-----------------:|:-------------:|:----------:|:-------------------------:|
+| Qwen-Image-2512   |      7B       |     20B    |           1.478           |
+| Z-Image           |      4B       |     6B     |           1.110           |
+| GLM-Image         |      9B       |     7B     |           1.394           |
+| ERNIE-Image       |      8B       |     8B     |           1.565           |
+| LongCat-Image     |      8B       |     6B     |           0.796           |
+| NEO-Unify (1x, no TP/CFG parallelism) | 8B | 8B | 0.312 |
 | NEO-Unify (TP2+CFG2) | 8B | 8B | 0.158 |
+</div>
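
To read the cross-model table as relative numbers, each per-step latency can be divided by the `NEO-Unify (TP2+CFG2)` figure. This is a rough sketch using only the values in the table; note that the TP2+CFG2 result uses `2x H100`, so these ratios are not a per-GPU comparison.

```python
# Per-step latency (s/step) copied from the table above.
per_step_s = {
    "Qwen-Image-2512": 1.478,
    "Z-Image": 1.110,
    "GLM-Image": 1.394,
    "ERNIE-Image": 1.565,
    "LongCat-Image": 0.796,
    "NEO-Unify (1x, no TP/CFG parallelism)": 0.312,
    "NEO-Unify (TP2+CFG2)": 0.158,
}

baseline = per_step_s["NEO-Unify (TP2+CFG2)"]
ratios = {name: latency / baseline for name, latency in per_step_s.items()}
for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name:40s} {ratio:5.2f}x")
```

The ratio for the single-GPU NEO-Unify row also gives the scaling obtained from TP2+CFG2 on two GPUs, which comes out just under 2x.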

docs/inference_infra_CN.md CHANGED

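@@ -17,7 +17,7 @@ SenseNova-U1 is exposed externally as a single unified multimodal model, but in actual production
 This design provides the following practical benefits in production:
-- Independent parallelism strategies (for example, understanding with `TP=2`, generation with `CFG=2` or `SP=2`);
 - Independent resource quotas (different GPU counts and memory budgets can be allocated);
 - Independent elastic scaling for text-heavy vs. image-heavy traffic;
 - Clearer operational isolation and simpler performance tuning.
@@ -31,7 +31,7 @@ SenseNova-U1 is exposed externally as a single unified multimodal model, but in actual production
 ### Attention for Multimodal Prefill of NEO-Unify
-NEO-Unify's prefill attention is not standard causal attention: text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid mask, we modified both attention implementations in our stack: the Triton kernel and the official FA3 codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
 Concretely, we added an optional image_token_tag argument that adjusts the mask row by row: text rows keep the standard causal mask; image rows, instead of plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within their image span.
@@ -46,23 +46,27 @@ NEO-Unify's prefill attention is not standard causal attention: text tokens still
 - **Triton implementation**: easier to port to existing codebases, with lower integration cost and faster iteration;
 - **FA3 implementation**: higher absolute performance on supported hardware.
-| batch | max_seq_len | image_token_num | triton (ms) | fa3 (ms) | speedup (×) |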
-| ----: | ----------: | --------------: | -------------: | ----------: | ----------: |
-|     8 |        4096 |              88 |          1.95 |       0.81 |       **2.41×** |
-|     8 |        8192 |             171 |          6.55 |       2.68 |       **2.45×** |
-|     8 |       65536 |             150 |         43.30 |      14.95 |       **2.90×** |
-|    16 |        4096 |             379 |          4.12 |       1.68 |       **2.46×** |
-|    16 |        8192 |             246 |         17.76 |       7.40 |       **2.40×** |
-|    16 |       65536 |             206 |        107.74 |      33.66 |       **3.20×** |
-|    32 |        4096 |             726 |          8.46 |       3.46 |       **2.44×** |
-|    32 |        8192 |             536 |         31.74 |      13.24 |       **2.40×** |
-|    32 |       65536 |             417 |        171.00 |      58.26 |       **2.94×** |
-|    64 |        4096 |            1170 |         16.08 |       6.88 |       **2.34×** |
-|    64 |        8192 |            1177 |         55.48 |      22.91 |       **2.42×** |
-|    64 |       65536 |            1291 |        348.89 |     124.82 |       **2.80×** |
-|   128 |        4096 |            2057 |         30.89 |      12.53 |       **2.47×** |
-|   128 |        8192 |            2196 |        104.73 |      43.22 |       **2.42×** |
-|   128 |       65536 |            2205 |        706.60 |     241.67 |       **2.92×** |
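 ### Deployment
@@ -73,26 +77,36 @@ For a concise runbook covering Docker images, launch commands, and API testing, see [`de
 ### Generation Performance
 The table below is the benchmark template for **2048x2048** image generation, listing measured numbers for each machine and deployment configuration.
-| GPU | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |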
-| ---------- | ----------------- | --------------------------: | ---------------------: |
 | H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
 | H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
 | 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
 | L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
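 In NEO-Unify, the KV cache used by the generation stage is provided by the understanding module, so T2I (text-to-image) and I2I (image editing) have nearly identical runtime characteristics. For brevity, only T2I latency is reported here.
 ### Cross-Model Speed Comparison
-The table below compares the latency of a single diffusion step for **2048x2048** image generation with **CFG** enabled. Unless otherwise noted, all measurements are taken on **H100**; the `NEO-Unify (TP2+CFG2)` result uses `2x H100`.
-| Model | Understanding | Generation | Per-step latency (s/step) |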
-| ----- | -------: | -------: | ----------------: |
-| Qwen-Image-2512 | 7B | 20B | 1.478 |
-| Z-Image | 4B | 6B | 1.110 |
-| GLM-Image | 9B | 7B | 1.394 |
-| ERNIE-Image | 8B | 8B | 1.565 |
-| LongCat-Image | 8B | 6B | 0.796 |
-| NEO-Unify (1x, no TP/CFG parallelism) | 8B | 8B | 0.312 |
-| NEO-Unify (TP2+CFG2) | 8B | 8B | 0.158 |

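 This design provides the following practical benefits in production:
+- Independent parallelism strategies (for example, understanding with `TP=2` (Tensor Parallel=2), generation with `CFG=2` (CFG Parallel=2) or `SP=2` (Sequence Parallel=2));
 - Independent resource quotas (different GPU counts and memory budgets can be allocated);
 - Independent elastic scaling for text-heavy vs. image-heavy traffic;
 - Clearer operational isolation and simpler performance tuning.
 ### Attention for Multimodal Prefill of NEO-Unify
+NEO-Unify's prefill attention is not standard causal attention: text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid mask, we modified both attention implementations in our stack: the Triton kernel and the official FlashAttention3 (FA3) codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
 Concretely, we added an optional image_token_tag argument that adjusts the mask row by row: text rows keep the standard causal mask; image rows, instead of plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within their image span.
 - **Triton implementation**: easier to port to existing codebases, with lower integration cost and faster iteration;
 - **FA3 implementation**: higher absolute performance on supported hardware.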
+<div align="center">
+|  batch  | max_seq_len | image_token_num | triton (ms) | FA3 (ms) | speedup (×) |
+|:-------:|:-----------:|:---------------:|:-----------:|:--------:|:----------:|
+|    8    |     4096    |       88        |    1.95     |   0.81   |  **2.41×** |
+|    8    |     8192    |       171       |    6.55     |   2.68   |  **2.45×** |
+|    8    |    65536    |       150       |   43.30     |  14.95   |  **2.90×** |
+|   16    |     4096    |       379       |    4.12     |   1.68   |  **2.46×** |
+|   16    |     8192    |       246       |   17.76     |   7.40   |  **2.40×** |
+|   16    |    65536    |       206       |  107.74     |  33.66   |  **3.20×** |
+|   32    |     4096    |       726       |    8.46     |   3.46   |  **2.44×** |
+|   32    |     8192    |       536       |   31.74     |  13.24   |  **2.40×** |
+|   32    |    65536    |       417       |  171.00     |  58.26   |  **2.94×** |
+|   64    |     4096    |       1170      |   16.08     |   6.88   |  **2.34×** |
+|   64    |     8192    |       1177      |   55.48     |  22.91   |  **2.42×** |
+|   64    |    65536    |       1291      |  348.89     | 124.82   |  **2.80×** |
+|  128    |     4096    |       2057      |   30.89     |  12.53   |  **2.47×** |
+|  128    |     8192    |       2196      |  104.73     |  43.22   |  **2.42×** |
+|  128    |    65536    |       2205      |  706.60     | 241.67   |  **2.92×** |
+</div>
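 ### Deployment
 ### Generation Performance
 The table below is the benchmark template for **2048x2048** image generation, listing measured numbers for each machine and deployment configuration.
+Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.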
+<div align="center">
+| GPU  | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |
+|:----:|:--------:|:-----------------:|:--------------:|
 | H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
 | H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
 | 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
 | L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
+</div>
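 In NEO-Unify, the KV cache used by the generation stage is provided by the understanding module, so T2I (text-to-image) and I2I (image editing) have nearly identical runtime characteristics. For brevity, only T2I latency is reported here.
 ### Cross-Model Speed Comparison
+The table below compares the latency of a single diffusion step for **2048x2048** image generation with **CFG** enabled. Unless otherwise noted, all measurements are taken on **H100**; the `NEO-Unify (TP2+CFG2)` result uses `2x H100`.
+Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.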
+<div align="center">
+|           Model           | Understanding | Generation | Per-step latency (s/step) |
+|:-------------------------:|:--------:|:--------:|:-----------------:|
+| Qwen-Image-2512           |    7B    |   20B    |       1.478       |
+| Z-Image                   |    4B    |    6B    |       1.110       |
+| GLM-Image                 |    9B    |    7B    |       1.394       |
+| ERNIE-Image               |    8B    |    8B    |       1.565       |
+| LongCat-Image             |    8B    |    6B    |       0.796       |
+| NEO-Unify (1x, no TP/CFG parallelism) |   8B    |    8B    |       0.312       |
+| NEO-Unify (TP2+CFG2) |    8B    |    8B    |       0.158       |
+</div>

docs/showcases.md CHANGED

@@ -215,15 +215,17 @@ answer.
 Reproducible prompts are in
 [`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl).
 | |
 | :---: |
-| [<img alt="interleave case 01" src="./assets/showcases/interleave/case_01.webp">](./assets/showcases/interleave/case_01.webp) |
-| [<img alt="interleave case 02" src="./assets/showcases/interleave/case_02.webp">](./assets/showcases/interleave/case_02.webp) |
-| [<img alt="interleave case 03" src="./assets/showcases/interleave/case_03.webp">](./assets/showcases/interleave/case_03.webp) |
-| [<img alt="interleave case 04" src="./assets/showcases/interleave/case_04.webp">](./assets/showcases/interleave/case_04.webp) |
-| [<img alt="interleave case 05" src="./assets/showcases/interleave/case_05.webp">](./assets/showcases/interleave/case_05.webp) |
 #### ♻️ *Interleaved Generation (Reasoning)*
@@ -252,8 +254,6 @@ Reproducible prompts are in [`examples/vqa/data/samples.jsonl`](../examples/vqa/
 #### 📝 *Visual Understanding (Agentic)*
-Reproducible prompts are in [`examples/vqa/data/samples_agentic.jsonl`](../examples/vqa/data/samples_agentic.jsonl).
 | |
 | :---: |
 | [<img alt="vqa agentic case" src="./assets/showcases/vqa/agentic_case.webp">](./assets/showcases/vqa/agentic_case.webp) |

 Reproducible prompts are in
 [`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl).
+All examples are generated with think-mode reasoning; the chain-of-thought is omitted in some cases for cleaner visualization.
 | |
 | :---: |
+| [<img alt="interleave case 03" src="./assets/showcases/interleave/case_0003_beachfront_villa.webp">](./assets/showcases/interleave/case_0003_beachfront_villa.webp) |
+| [<img alt="interleave case 04" src="./assets/showcases/interleave/case_0004_scented_candle_promo.webp">](./assets/showcases/interleave/case_0004_scented_candle_promo.webp) |
+| [<img alt="interleave case 05" src="./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp">](./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp) |
+| [<img alt="interleave case 06" src="./assets/showcases/interleave/case_0006_orange_cat_travel.webp">](./assets/showcases/interleave/case_0006_orange_cat_travel.webp) |
+| [<img alt="interleave case 01" src="./assets/showcases/interleave/case_0001_makeup_three_looks.webp">](./assets/showcases/interleave/case_0001_makeup_three_looks.webp) |
+| [<img alt="interleave case 07" src="./assets/showcases/interleave/case_0007_bowie_slide_design.webp">](./assets/showcases/interleave/case_0007_bowie_slide_design.webp) |
 #### ♻️ *Interleaved Generation (Reasoning)*
 #### 📝 *Visual Understanding (Agentic)*
 | |
 | :---: |
 | [<img alt="vqa agentic case" src="./assets/showcases/vqa/agentic_case.webp">](./assets/showcases/vqa/agentic_case.webp) |

docs/showcases_CN.md CHANGED

@@ -197,16 +197,18 @@
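 Each case below is one complete response from `model.interleave_gen`: the model first generates several intermediate images inside a `<think>...</think>` reasoning block, then outputs the final interleaved image-text answer.
-Reproducible prompts are in [`examples/interleave/data/samples.jsonl`](../examples/interleave/data/sample.jsonl).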
 | |
 | :---: |
-| [<img alt="interleave case 01" src="./assets/showcases/interleave/case_01.webp">](./assets/showcases/interleave/case_01.webp) |
-| [<img alt="interleave case 02" src="./assets/showcases/interleave/case_02.webp">](./assets/showcases/interleave/case_02.webp) |
-| [<img alt="interleave case 03" src="./assets/showcases/interleave/case_03.webp">](./assets/showcases/interleave/case_03.webp) |
-| [<img alt="interleave case 04" src="./assets/showcases/interleave/case_04.webp">](./assets/showcases/interleave/case_04.webp) |
-| [<img alt="interleave case 05" src="./assets/showcases/interleave/case_05.webp">](./assets/showcases/interleave/case_05.webp) |
 ---

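 Each case below is one complete response from `model.interleave_gen`: the model first generates several intermediate images inside a `<think>...</think>` reasoning block, then outputs the final interleaved image-text answer.
+Reproducible prompts are in [`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl).
+All examples are generated with think-mode reasoning; the chain-of-thought is omitted in some cases for cleaner visualization.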
 | |
 | :---: |
+| [<img alt="interleave case 03" src="./assets/showcases/interleave/case_0003_beachfront_villa.webp">](./assets/showcases/interleave/case_0003_beachfront_villa.webp) |
+| [<img alt="interleave case 04" src="./assets/showcases/interleave/case_0004_scented_candle_promo.webp">](./assets/showcases/interleave/case_0004_scented_candle_promo.webp) |
+| [<img alt="interleave case 05" src="./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp">](./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp) |
+| [<img alt="interleave case 06" src="./assets/showcases/interleave/case_0006_orange_cat_travel.webp">](./assets/showcases/interleave/case_0006_orange_cat_travel.webp) |
+| [<img alt="interleave case 01" src="./assets/showcases/interleave/case_0001_makeup_three_looks.webp">](./assets/showcases/interleave/case_0001_makeup_three_looks.webp) |
+| [<img alt="interleave case 07" src="./assets/showcases/interleave/case_0007_bowie_slide_design.webp">](./assets/showcases/interleave/case_0007_bowie_slide_design.webp) |
 ---