TokenWang committed on
Commit
7223bcf
·
verified ·
1 Parent(s): 11da020

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -164,3 +164,12 @@ docs/assets/showcases/vla/2.png filter=lfs diff=lfs merge=lfs -text
164
  docs/assets/showcases/vla/3.mp4 filter=lfs diff=lfs merge=lfs -text
165
  docs/assets/showcases/vla/3.png filter=lfs diff=lfs merge=lfs -text
166
  docs/assets/teaser_1.png filter=lfs diff=lfs merge=lfs -text
 
164
  docs/assets/showcases/vla/3.mp4 filter=lfs diff=lfs merge=lfs -text
165
  docs/assets/showcases/vla/3.png filter=lfs diff=lfs merge=lfs -text
166
  docs/assets/teaser_1.png filter=lfs diff=lfs merge=lfs -text
167
+ docs/assets/perform_vs_speed_avg3.png filter=lfs diff=lfs merge=lfs -text
168
+ docs/assets/perform_vs_speed_avg8.png filter=lfs diff=lfs merge=lfs -text
169
+ docs/assets/showcases/interleave/case_0001_makeup_three_looks.webp filter=lfs diff=lfs merge=lfs -text
170
+ docs/assets/showcases/interleave/case_0003_beachfront_villa.webp filter=lfs diff=lfs merge=lfs -text
171
+ docs/assets/showcases/interleave/case_0004_scented_candle_promo.webp filter=lfs diff=lfs merge=lfs -text
172
+ docs/assets/showcases/interleave/case_0005_matchgirl_warm_au.webp filter=lfs diff=lfs merge=lfs -text
173
+ docs/assets/showcases/interleave/case_0006_orange_cat_travel.webp filter=lfs diff=lfs merge=lfs -text
174
+ docs/assets/showcases/interleave/case_0007_bowie_slide_design.webp filter=lfs diff=lfs merge=lfs -text
175
+ docs/assets/teaser_2.png filter=lfs diff=lfs merge=lfs -text
docs/assets/benchmarks/generation.webp CHANGED

Git LFS Details

  • SHA256: cb93af198b32d7acac46d9eb33c2f2094c856fc37aff8caf34d1be0eeaafa74f
  • Pointer size: 131 Bytes
  • Size of remote file: 379 kB

Git LFS Details

  • SHA256: ae516a2b2c190212f77f94a9f07bf1efeaad13912fc18e249221aaff57d1a554
  • Pointer size: 131 Bytes
  • Size of remote file: 361 kB
docs/assets/benchmarks/interleaved.webp CHANGED

Git LFS Details

  • SHA256: 513e559e23326782188c3b3a6e3ed4eb56af560e14e81a14f44230db55d593d1
  • Pointer size: 131 Bytes
  • Size of remote file: 137 kB

Git LFS Details

  • SHA256: b32ea4a7982467f378fb77c10c2b8e9493e7670fa43d9c50a471aab77b93b6ed
  • Pointer size: 131 Bytes
  • Size of remote file: 213 kB
docs/assets/perform_vs_speed_avg3.png ADDED

Git LFS Details

  • SHA256: 92ad8409ce54f122c8f8aa42772d0da45ac388cda729f204082df9b87993d192
  • Pointer size: 131 Bytes
  • Size of remote file: 154 kB
docs/assets/perform_vs_speed_avg8.png ADDED

Git LFS Details

  • SHA256: 4e7387290f0218f717d3bec77ad042ff04584ee3ebd707c1f0260f892ffed7c7
  • Pointer size: 131 Bytes
  • Size of remote file: 151 kB
docs/assets/showcases/interleave/case_0001_makeup_three_looks.webp ADDED

Git LFS Details

  • SHA256: 979863dae87263f4cfd0e4a247edd3e205a6e634336b91047494ea142295a5b1
  • Pointer size: 131 Bytes
  • Size of remote file: 773 kB
docs/assets/showcases/interleave/case_0003_beachfront_villa.webp ADDED

Git LFS Details

  • SHA256: d858a2115edd9e83f7acf5aaa022ba8b84d710834f37fdf302fba045d809c19e
  • Pointer size: 131 Bytes
  • Size of remote file: 295 kB
docs/assets/showcases/interleave/case_0004_scented_candle_promo.webp ADDED

Git LFS Details

  • SHA256: 29286325fd88357ea43b2c3520c4762916d7b4f6c975d2fd86bad2acc716aa95
  • Pointer size: 131 Bytes
  • Size of remote file: 276 kB
docs/assets/showcases/interleave/case_0005_matchgirl_warm_au.webp ADDED

Git LFS Details

  • SHA256: eb1e495a0fded91dfccbd8a55bbc124a09ebbed75befc2bf0460925623995ac7
  • Pointer size: 131 Bytes
  • Size of remote file: 754 kB
docs/assets/showcases/interleave/case_0006_orange_cat_travel.webp ADDED

Git LFS Details

  • SHA256: fd41c65cb3053fcd34786aaf40cdc60193644434f22c4b5cc952b241ac5dae1c
  • Pointer size: 131 Bytes
  • Size of remote file: 556 kB
docs/assets/showcases/interleave/case_0007_bowie_slide_design.webp ADDED

Git LFS Details

  • SHA256: caf8ed8c8ea986cb9ebd0a2017c25ca53fa49f0aee8253fa95746e704f7ecdb8
  • Pointer size: 131 Bytes
  • Size of remote file: 989 kB
docs/assets/showcases/vqa/agentic_case.webp CHANGED

Git LFS Details

  • SHA256: 036d2dd2c36c388487179654351268a7d7aecd37313023d54006413038220c31
  • Pointer size: 131 Bytes
  • Size of remote file: 321 kB

Git LFS Details

  • SHA256: 8ce0276b70a7f85473c01272a546fa98cf9fc9c7ce0b56ef7eb7cdc35a9ec781
  • Pointer size: 131 Bytes
  • Size of remote file: 517 kB
docs/assets/showcases/vqa/agentic_case_2.webp CHANGED

Git LFS Details

  • SHA256: 5b3eab6d774b1499e3c57adbd028aaa5da7aee5432d2e30774f2464b62834acf
  • Pointer size: 131 Bytes
  • Size of remote file: 378 kB

Git LFS Details

  • SHA256: 578348b8030e07d1dc30e999c3e6040e87a5d74db61303d1e17ab949574b8a94
  • Pointer size: 131 Bytes
  • Size of remote file: 677 kB
docs/assets/teaser_1.png CHANGED

Git LFS Details

  • SHA256: 41c9bad2f89cf92d6ff2e454a65821ff0f53033de55dda3b10117ea519d89cfe
  • Pointer size: 132 Bytes
  • Size of remote file: 2.99 MB

Git LFS Details

  • SHA256: 3c4bcdc467525309fe6f5d7df377c8ef92af69fb2791218dee1cfdd2d0d75af6
  • Pointer size: 132 Bytes
  • Size of remote file: 2.97 MB
docs/assets/teaser_2.png ADDED

Git LFS Details

  • SHA256: 8e2d2bb4a32a663a03651d086ef96c7883eca56b660d4b8503afbb620cef08b5
  • Pointer size: 132 Bytes
  • Size of remote file: 8.45 MB
docs/inference_infra.md CHANGED
@@ -17,8 +17,8 @@ These two engines exchange generation state through pinned shared memory and hig
17
 
18
  This design provides practical benefits in production:
19
 
20
- - Independent parallelism (for example, understanding with `TP=2`, generation
21
- with `CFG=2` or `SP=2`).
22
  - Independent resource allocation (different GPU counts and memory budgets).
23
  - Independent scaling for text-heavy vs. image-heavy traffic.
24
  - Better operational isolation and simpler performance tuning.
@@ -32,7 +32,7 @@ In most production setups, `Separate` is the default choice because it gives cle
32
 
33
  ### Attention for Multimodal Prefill of NEO-Unify
34
 
35
- NEO-Unify's prefill attention is not standard causal attention. Text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid masking pattern, we modified both attention implementations in our stack: the Triton kernel and the official FA3 codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
36
 
37
  Concretely, we introduced an optional image_token_tag argument that adjusts the mask row by row. Text rows keep the standard causal mask. Image rows, instead of using plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within the image span.
38
 
@@ -47,23 +47,27 @@ The benchmark below compares two implementations for Neo-style multimodal prefil
47
  integration cost and faster iteration.
48
  - **FA3 implementation**: higher absolute performance on supported hardware.
49
 
50
- | batch | max_seq_len | image_token_num | triton (ms) | fa3 (ms) | speedup (×) |
51
- | ----: | ----------: | --------------: | -------------: | ----------: | ----------: |
52
- | 8 | 4096 | 88 | 1.95 | 0.81 | **2.41×** |
53
- | 8 | 8192 | 171 | 6.55 | 2.68 | **2.45×** |
54
- | 8 | 65536 | 150 | 43.30 | 14.95 | **2.90×** |
55
- | 16 | 4096 | 379 | 4.12 | 1.68 | **2.46×** |
56
- | 16 | 8192 | 246 | 17.76 | 7.40 | **2.40×** |
57
- | 16 | 65536 | 206 | 107.74 | 33.66 | **3.20×** |
58
- | 32 | 4096 | 726 | 8.46 | 3.46 | **2.44×** |
59
- | 32 | 8192 | 536 | 31.74 | 13.24 | **2.40×** |
60
- | 32 | 65536 | 417 | 171.00 | 58.26 | **2.94×** |
61
- | 64 | 4096 | 1170 | 16.08 | 6.88 | **2.34×** |
62
- | 64 | 8192 | 1177 | 55.48 | 22.91 | **2.42×** |
63
- | 64 | 65536 | 1291 | 348.89 | 124.82 | **2.80×** |
64
- | 128 | 4096 | 2057 | 30.89 | 12.53 | **2.47×** |
65
- | 128 | 8192 | 2196 | 104.73 | 43.22 | **2.42×** |
66
- | 128 | 65536 | 2205 | 706.60 | 241.67 | **2.92×** |
 
 
 
 
67
 
68
 
69
  ### Deployment
@@ -76,14 +80,19 @@ see [`deployment.md`](./deployment.md).
76
 
77
  The table below is the benchmark template for **2048x2048** image generation.
78
  Fill in measured numbers for each machine and deployment profile.
 
79
 
80
- | GPU | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |
81
- | ---------- | ----------------- | --------------------------: | ---------------------: |
 
 
82
  | H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
83
  | H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
84
  | 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
85
  | L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
86
 
 
 
87
  In NEO-Unify, the KV cache for the generation stage is provided by the understanding module, so T2I (generation) and I2I (editing) have very similar runtime characteristics. For brevity, we report only T2I latency here.
88
 
89
 
@@ -93,13 +102,18 @@ The table below compares the latency of a single diffusion step for
93
  **2048x2048** image generation with **CFG enabled**. Unless otherwise noted,
94
  all measurements are taken on **H100**; the `NEO-Unify (TP2+CFG2)` result uses
95
  `2x H100`.
96
-
97
- | Model | Understanding | Generation | Per-step latency (s/step) |
98
- | ----- | ------------: | ---------: | -------------------------: |
99
- | Qwen-Image-2512 | 7B | 20B | 1.478 |
100
- | Z-Image | 4B | 6B | 1.110 |
101
- | GLM-Image | 9B | 7B | 1.394 |
102
- | ERNIE-Image | 8B | 8B | 1.565 |
103
- | LongCat-Image | 8B | 6B | 0.796 |
104
- | NEO-Unify (1x, no TP/CFG parallel) | 8B | 8B | 0.312 |
 
 
 
105
  | NEO-Unify (TP2+CFG2) | 8B | 8B | 0.158 |
 
 
 
17
 
18
  This design provides practical benefits in production:
19
 
20
+ - Independent parallelism (for example, understanding with `TP=2` (Tensor Parallel=2), generation
21
+ with `CFG=2` (CFG Parallel=2) or `SP=2` (Sequence Parallel=2)).
22
  - Independent resource allocation (different GPU counts and memory budgets).
23
  - Independent scaling for text-heavy vs. image-heavy traffic.
24
  - Better operational isolation and simpler performance tuning.
 
32
 
33
  ### Attention for Multimodal Prefill of NEO-Unify
34
 
35
+ NEO-Unify's prefill attention is not standard causal attention. Text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid masking pattern, we modified both attention implementations in our stack: the Triton kernel and the official FlashAttention3 (FA3) codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
36
 
37
  Concretely, we introduced an optional image_token_tag argument that adjusts the mask row by row. Text rows keep the standard causal mask. Image rows, instead of using plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within the image span.
38
 
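The row-by-row masking rule described above can be sketched in plain Python. This is only an illustration, not the Triton or FA3 kernel code: the function name `hybrid_prefill_mask` and the `span_id` encoding (0 for text tokens, a positive id shared by all tokens of one contiguous image span) are assumptions made for this example, and the real `image_token_tag` layout may differ.

```python
from typing import List


def hybrid_prefill_mask(span_id: List[int]) -> List[List[bool]]:
    """Build a dense boolean mask where mask[q][k] is True if query q may attend to key k.

    span_id is a hypothetical per-token tag: 0 marks a text token, and a
    positive integer marks membership in one contiguous image span.
    """
    n = len(span_id)
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            # Every row keeps the standard causal part: keys at or before the query.
            causal = k <= q
            # Image rows additionally see every token of their own image span,
            # including tokens that come later in the sequence.
            same_image_span = span_id[q] > 0 and span_id[q] == span_id[k]
            row.append(causal or same_image_span)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two text tokens, a three-token image span, then one text token.
    tags = [0, 0, 1, 1, 1, 0]
    for row in hybrid_prefill_mask(tags):
        print("".join("x" if allowed else "." for allowed in row))
```

A dense mask like this is only practical for checking small cases; a fused kernel would apply the same predicate per tile rather than materialize the full matrix.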
 
47
  integration cost and faster iteration.
48
  - **FA3 implementation**: higher absolute performance on supported hardware.
49
 
50
+ <div align="center">
51
+
52
+ | batch | max_seq_len | image_token_num | triton (ms) | FA3 (ms) | speedup (×) |
53
+ |:-------:|:-----------:|:---------------:|:-----------:|:--------:|:-----------:|
54
+ | 8 | 4096 | 88 | 1.95 | 0.81 | **2.41×** |
55
+ | 8 | 8192 | 171 | 6.55 | 2.68 | **2.45×** |
56
+ | 8 | 65536 | 150 | 43.30 | 14.95 | **2.90×** |
57
+ | 16 | 4096 | 379 | 4.12 | 1.68 | **2.46×** |
58
+ | 16 | 8192 | 246 | 17.76 | 7.40 | **2.40×** |
59
+ | 16 | 65536 | 206 | 107.74 | 33.66 | **3.20×** |
60
+ | 32 | 4096 | 726 | 8.46 | 3.46 | **2.44×** |
61
+ | 32 | 8192 | 536 | 31.74 | 13.24 | **2.40×** |
62
+ | 32 | 65536 | 417 | 171.00 | 58.26 | **2.94×** |
63
+ | 64 | 4096 | 1170 | 16.08 | 6.88 | **2.34×** |
64
+ | 64 | 8192 | 1177 | 55.48 | 22.91 | **2.42×** |
65
+ | 64 | 65536 | 1291 | 348.89 | 124.82 | **2.80×** |
66
+ | 128 | 4096 | 2057 | 30.89 | 12.53 | **2.47×** |
67
+ | 128 | 8192 | 2196 | 104.73 | 43.22 | **2.42×** |
68
+ | 128 | 65536 | 2205 | 706.60 | 241.67 | **2.92×** |
69
+
70
+ </div>
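As a sanity check on the table above, the speedup column can be recomputed as `triton (ms) / FA3 (ms)`. The sketch below copies the rows from the table; the helper names are illustrative. Because the latencies are already rounded to two decimals, the recomputed ratios match the reported ones only to within about 0.01.

```python
# Rows copied from the benchmark table:
# (batch, max_seq_len, triton_ms, fa3_ms, reported_speedup)
ROWS = [
    (8, 4096, 1.95, 0.81, 2.41),
    (8, 8192, 6.55, 2.68, 2.45),
    (8, 65536, 43.30, 14.95, 2.90),
    (16, 4096, 4.12, 1.68, 2.46),
    (16, 8192, 17.76, 7.40, 2.40),
    (16, 65536, 107.74, 33.66, 3.20),
    (32, 4096, 8.46, 3.46, 2.44),
    (32, 8192, 31.74, 13.24, 2.40),
    (32, 65536, 171.00, 58.26, 2.94),
    (64, 4096, 16.08, 6.88, 2.34),
    (64, 8192, 55.48, 22.91, 2.42),
    (64, 65536, 348.89, 124.82, 2.80),
    (128, 4096, 30.89, 12.53, 2.47),
    (128, 8192, 104.73, 43.22, 2.42),
    (128, 65536, 706.60, 241.67, 2.92),
]


def speedup(triton_ms: float, fa3_ms: float) -> float:
    """Speedup of FA3 over Triton: how many times faster FA3 finishes."""
    return triton_ms / fa3_ms


def max_deviation(rows) -> float:
    """Largest absolute gap between the recomputed and the reported speedup."""
    return max(abs(speedup(t, f) - reported) for _, _, t, f, reported in rows)


if __name__ == "__main__":
    for batch, seq_len, t, f, reported in ROWS:
        print(f"batch={batch:<3d} seq={seq_len:<5d} "
              f"recomputed={speedup(t, f):.2f}x reported={reported:.2f}x")
    print(f"max deviation: {max_deviation(ROWS):.4f}")
```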
71
 
72
 
73
  ### Deployment
 
80
 
81
  The table below is the benchmark template for **2048x2048** image generation.
82
  Fill in measured numbers for each machine and deployment profile.
83
+ Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.
84
 
85
+ <div align="center">
86
+
87
+ | GPU | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |
88
+ |:----:|:-----------------:|:-------------------------:|:----------------------:|
89
  | H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
90
  | H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
91
  | 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
92
  | L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
93
 
94
+ </div>
95
+
96
  In NEO-Unify, the KV cache for the generation stage is provided by the understanding module, so T2I (generation) and I2I (editing) have very similar runtime characteristics. For brevity, we report only T2I latency here.
97
 
98
 
 
102
  **2048x2048** image generation with **CFG enabled**. Unless otherwise noted,
103
  all measurements are taken on **H100**; the `NEO-Unify (TP2+CFG2)` result uses
104
  `2x H100`.
105
+ Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.
106
+
107
+ <div align="center">
108
+
109
+ | Model | Understanding | Generation | Per-step latency (s/step) |
110
+ |:-----------------:|:-------------:|:----------:|:-------------------------:|
111
+ | Qwen-Image-2512 | 7B | 20B | 1.478 |
112
+ | Z-Image | 4B | 6B | 1.110 |
113
+ | GLM-Image | 9B | 7B | 1.394 |
114
+ | ERNIE-Image | 8B | 8B | 1.565 |
115
+ | LongCat-Image | 8B | 6B | 0.796 |
116
+ | NEO-Unify (1x, no TP/CFG parallelism) | 8B | 8B | 0.312 |
117
  | NEO-Unify (TP2+CFG2) | 8B | 8B | 0.158 |
118
+
119
+ </div>
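The per-step latencies above are easier to compare as ratios. The sketch below copies the numbers from the table and derives two quantities: each model's latency relative to single-GPU NEO-Unify, and the parallel efficiency of the `TP2+CFG2` run, assuming the `2x H100` device count stated above. The dictionary keys and helper names are illustrative only.

```python
# Per-step latency in seconds, copied from the comparison table.
PER_STEP_S = {
    "Qwen-Image-2512": 1.478,
    "Z-Image": 1.110,
    "GLM-Image": 1.394,
    "ERNIE-Image": 1.565,
    "LongCat-Image": 0.796,
    "NEO-Unify (1x)": 0.312,
    "NEO-Unify (TP2+CFG2)": 0.158,
}


def relative_latency(table: dict, baseline: str) -> dict:
    """Latency of every entry divided by the baseline entry's latency."""
    base = table[baseline]
    return {name: latency / base for name, latency in table.items()}


def parallel_efficiency(single_s: float, parallel_s: float, num_gpus: int) -> float:
    """Achieved speedup divided by the ideal (linear) speedup."""
    return (single_s / parallel_s) / num_gpus


if __name__ == "__main__":
    for name, ratio in relative_latency(PER_STEP_S, "NEO-Unify (1x)").items():
        print(f"{name:<24s} {ratio:.2f}x the single-GPU NEO-Unify latency")
    eff = parallel_efficiency(
        PER_STEP_S["NEO-Unify (1x)"], PER_STEP_S["NEO-Unify (TP2+CFG2)"], num_gpus=2
    )
    print(f"TP2+CFG2 parallel efficiency on 2 GPUs: {eff:.1%}")
```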
docs/inference_infra_CN.md CHANGED
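@@ -17,7 +17,7 @@ SenseNova-U1 is presented externally as a single unified multimodal model, but in actual production
17
 
18
  This design provides the following practical benefits in production:
19
 
20
- - Independent parallelism strategies (for example, `TP=2` on the understanding side, `CFG=2` or `SP=2` on the generation side);
21
  - Independent resource quotas (different GPU counts and GPU memory budgets can be allocated);
22
  - Text-heavy and image-heavy traffic can be scaled elastically and independently;
23
  - Clearer operational isolation and simpler performance tuning.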
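@@ -31,7 +31,7 @@ SenseNova-U1 is presented externally as a single unified multimodal model, but in actual production
31
 
32
  ### Attention for Multimodal Prefill of NEO-Unify
33
 
34
- NEO-Unify's prefill attention is not standard causal attention: text tokens remain causal, while image tokens attend to both the entire text prefix and the full image span. To support this hybrid mask, we modified both attention implementations in our stack: the Triton kernel and the official FA3 codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
35
 
36
  Concretely, we added an optional image_token_tag argument that adjusts the mask row by row: text rows keep the standard causal mask; image rows no longer use plain causal truncation and are instead allowed to attend to all preceding text tokens as well as all image tokens within their image span.
37
 
@@ -46,23 +46,27 @@ NEO-Unify's prefill attention is not standard causal attention: text tokens remain
46
  - **Triton implementation**: easier to port to an existing codebase, with low integration cost and faster iteration;
47
  - **FA3 implementation**: higher absolute performance on supported hardware.
48
 
49
- | batch | max_seq_len | image_token_num | triton (ms) | fa3 (ms) | speedup (×) |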
50
- | ----: | ----------: | --------------: | -------------: | ----------: | ----------: |
51
- | 8 | 4096 | 88 | 1.95 | 0.81 | **2.41×** |
52
- | 8 | 8192 | 171 | 6.55 | 2.68 | **2.45×** |
53
- | 8 | 65536 | 150 | 43.30 | 14.95 | **2.90×** |
54
- | 16 | 4096 | 379 | 4.12 | 1.68 | **2.46×** |
55
- | 16 | 8192 | 246 | 17.76 | 7.40 | **2.40×** |
56
- | 16 | 65536 | 206 | 107.74 | 33.66 | **3.20×** |
57
- | 32 | 4096 | 726 | 8.46 | 3.46 | **2.44×** |
58
- | 32 | 8192 | 536 | 31.74 | 13.24 | **2.40×** |
59
- | 32 | 65536 | 417 | 171.00 | 58.26 | **2.94×** |
60
- | 64 | 4096 | 1170 | 16.08 | 6.88 | **2.34×** |
61
- | 64 | 8192 | 1177 | 55.48 | 22.91 | **2.42×** |
62
- | 64 | 65536 | 1291 | 348.89 | 124.82 | **2.80×** |
63
- | 128 | 4096 | 2057 | 30.89 | 12.53 | **2.47×** |
64
- | 128 | 8192 | 2196 | 104.73 | 43.22 | **2.42×** |
65
- | 128 | 65536 | 2205 | 706.60 | 241.67 | **2.92×** |
 
 
 
 
66
 
67
 
68
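  ### Deployment
@@ -73,26 +77,36 @@ For a concise runbook covering Docker images, launch commands, and API tests, see [`de
73
  ### Generation Performance
74
 
75
  The table below is the benchmark template for **2048x2048** image generation, listing measured numbers for different machines and deployment configurations.
 
76
 
77
- | GPU | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |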
78
- | ---------- | ----------------- | --------------------------: | ---------------------: |
 
 
79
  | H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
80
  | H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
81
  | 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
82
  | L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
83
 
 
 
84
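  In NEO-Unify, the KV cache used by the generation stage is provided by the understanding module, so T2I (text-to-image) and I2I (image editing) have nearly identical runtime characteristics. For brevity, only T2I latency is reported here.
85
 
86
  ### Cross-model Speed Comparison
87
 
88
- The table below compares the latency of a single diffusion step when generating **2048x2048** images with **CFG** enabled. Unless otherwise noted, all numbers were measured on **H100**; `NEO-Unify (TP2+CFG2)` uses `2x H100`.
89
-
90
- | Model | Understanding | Generation | Per-step latency (s/step) |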
91
- | ----- | -------: | -------: | ----------------: |
92
- | Qwen-Image-2512 | 7B | 20B | 1.478 |
93
- | Z-Image | 4B | 6B | 1.110 |
94
- | GLM-Image | 9B | 7B | 1.394 |
95
- | ERNIE-Image | 8B | 8B | 1.565 |
96
- | LongCat-Image | 8B | 6B | 0.796 |
97
- | NEO-Unify (1x, no TP/CFG parallelism) | 8B | 8B | 0.312 |
98
- | NEO-Unify (TP2+CFG2) | 8B | 8B | 0.158 |
 
 
 
 
 
 
17
 
18
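  This design provides the following practical benefits in production:
19
 
20
+ - Independent parallelism strategies (for example, `TP=2` (Tensor Parallel=2) on the understanding side, `CFG=2` (CFG Parallel=2) or `SP=2` (Sequence Parallel=2
21
  - Independent resource quotas (different GPU counts and GPU memory budgets can be allocated);
22
  - Text-heavy and image-heavy traffic can be scaled elastically and independently;
23
  - Clearer operational isolation and simpler performance tuning.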
 
31
 
32
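  ### Attention for Multimodal Prefill of NEO-Unify
33
 
34
+ NEO-Unify's prefill attention is not standard causal attention: text tokens remain causal, while image tokens attend to both the entire text prefix and the full image span. To support this hybrid mask, we modified both attention implementations in our stack: the Triton kernel and the official FlashAttention3 (FA3) codebase. Our FA3 branch is available at [WANDY666/flash-attention](https://github.com/WANDY666/flash-attention).
35
 
36
  Concretely, we added an optional image_token_tag argument that adjusts the mask row by row: text rows keep the standard causal mask; image rows no longer use plain causal truncation and are instead allowed to attend to all preceding text tokens as well as all image tokens within their image span.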
37
 
 
46
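  - **Triton implementation**: easier to port to an existing codebase, with low integration cost and faster iteration;
47
  - **FA3 implementation**: higher absolute performance on supported hardware.
48
 
49
+ <div align="center">
50
+
51
+ | batch | max_seq_len | image_token_num | triton (ms) | FA3 (ms) | speedup (×) |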
52
+ |:-------:|:-----------:|:---------------:|:-----------:|:--------:|:----------:|
53
+ | 8 | 4096 | 88 | 1.95 | 0.81 | **2.41×** |
54
+ | 8 | 8192 | 171 | 6.55 | 2.68 | **2.45×** |
55
+ | 8 | 65536 | 150 | 43.30 | 14.95 | **2.90×** |
56
+ | 16 | 4096 | 379 | 4.12 | 1.68 | **2.46×** |
57
+ | 16 | 8192 | 246 | 17.76 | 7.40 | **2.40×** |
58
+ | 16 | 65536 | 206 | 107.74 | 33.66 | **3.20×** |
59
+ | 32 | 4096 | 726 | 8.46 | 3.46 | **2.44×** |
60
+ | 32 | 8192 | 536 | 31.74 | 13.24 | **2.40×** |
61
+ | 32 | 65536 | 417 | 171.00 | 58.26 | **2.94×** |
62
+ | 64 | 4096 | 1170 | 16.08 | 6.88 | **2.34×** |
63
+ | 64 | 8192 | 1177 | 55.48 | 22.91 | **2.42×** |
64
+ | 64 | 65536 | 1291 | 348.89 | 124.82 | **2.80×** |
65
+ | 128 | 4096 | 2057 | 30.89 | 12.53 | **2.47×** |
66
+ | 128 | 8192 | 2196 | 104.73 | 43.22 | **2.42×** |
67
+ | 128 | 65536 | 2205 | 706.60 | 241.67 | **2.92×** |
68
+
69
+ </div>
70
 
71
 
72
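  ### Deployment
 
77
  ### Generation Performance
78
 
79
  The table below is the benchmark template for **2048x2048** image generation, listing measured numbers for different machines and deployment configurations.
80
+ Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.
81
 
82
+ <div align="center">
83
+
84
+ | GPU | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |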
85
+ |:----:|:--------:|:-----------------:|:--------------:|
86
  | H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
87
  | H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
88
  | 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
89
  | L40S | TP2+CFG2 / separate | 0.443 | 25.62 |
90
 
91
+ </div>
92
+
93
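  In NEO-Unify, the KV cache used by the generation stage is provided by the understanding module, so T2I (text-to-image) and I2I (image editing) have nearly identical runtime characteristics. For brevity, only T2I latency is reported here.
94
 
95
  ### Cross-model Speed Comparison
96
 
97
+ The table below compares the latency of a single diffusion step when generating **2048x2048** images with **CFG** enabled. Unless otherwise noted, all numbers were measured on **H100**; `NEO-Unify (TP2+CFG2)` uses `2x H100`.
98
+ Note: TP2+CFG2 means Tensor Parallel=2 + CFG Parallel=2.
99
+
100
+ <div align="center">
101
+
102
+ | Model | Understanding | Generation | Per-step latency (s/step) |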
103
+ |:-------------------------:|:--------:|:--------:|:-----------------:|
104
+ | Qwen-Image-2512 | 7B | 20B | 1.478 |
105
+ | Z-Image | 4B | 6B | 1.110 |
106
+ | GLM-Image | 9B | 7B | 1.394 |
107
+ | ERNIE-Image | 8B | 8B | 1.565 |
108
+ | LongCat-Image | 8B | 6B | 0.796 |
109
+ | NEO-Unify (1x, no TP/CFG parallelism) | 8B | 8B | 0.312 |
110
+ | NEO-Unify (TP2+CFG2) | 8B | 8B | 0.158 |
111
+
112
+ </div>
docs/showcases.md CHANGED
@@ -215,15 +215,17 @@ answer.
215
 
216
  Reproducible prompts are in
217
  [`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl).
 
218
 
219
 
220
  | |
221
  | :---: |
222
- | [<img alt="interleave case 01" src="./assets/showcases/interleave/case_01.webp">](./assets/showcases/interleave/case_01.webp) |
223
- | [<img alt="interleave case 02" src="./assets/showcases/interleave/case_02.webp">](./assets/showcases/interleave/case_02.webp) |
224
- | [<img alt="interleave case 03" src="./assets/showcases/interleave/case_03.webp">](./assets/showcases/interleave/case_03.webp) |
225
- | [<img alt="interleave case 04" src="./assets/showcases/interleave/case_04.webp">](./assets/showcases/interleave/case_04.webp) |
226
- | [<img alt="interleave case 05" src="./assets/showcases/interleave/case_05.webp">](./assets/showcases/interleave/case_05.webp) |
 
227
 
228
 
229
  #### ♻️ *Interleaved Generation (Reasoning)*
@@ -252,8 +254,6 @@ Reproducible prompts are in [`examples/vqa/data/samples.jsonl`](../examples/vqa/
252
 
253
  #### 📝 *Visual Understanding (Agentic)*
254
 
255
- Reproducible prompts are in [`examples/vqa/data/samples_agentic.jsonl`](../examples/vqa/data/samples_agentic.jsonl).
256
-
257
  | |
258
  | :---: |
259
  | [<img alt="vqa agentic case" src="./assets/showcases/vqa/agentic_case.webp">](./assets/showcases/vqa/agentic_case.webp) |
 
215
 
216
  Reproducible prompts are in
217
  [`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl).
218
+ All examples are generated with think-mode reasoning; the chain-of-thought is omitted in some cases for cleaner visualization.
219
 
220
 
221
  | |
222
  | :---: |
223
+ | [<img alt="interleave case 03" src="./assets/showcases/interleave/case_0003_beachfront_villa.webp">](./assets/showcases/interleave/case_0003_beachfront_villa.webp) |
224
+ | [<img alt="interleave case 04" src="./assets/showcases/interleave/case_0004_scented_candle_promo.webp">](./assets/showcases/interleave/case_0004_scented_candle_promo.webp) |
225
+ | [<img alt="interleave case 05" src="./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp">](./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp) |
226
+ | [<img alt="interleave case 06" src="./assets/showcases/interleave/case_0006_orange_cat_travel.webp">](./assets/showcases/interleave/case_0006_orange_cat_travel.webp) |
227
+ | [<img alt="interleave case 01" src="./assets/showcases/interleave/case_0001_makeup_three_looks.webp">](./assets/showcases/interleave/case_0001_makeup_three_looks.webp) |
228
+ | [<img alt="interleave case 07" src="./assets/showcases/interleave/case_0007_bowie_slide_design.webp">](./assets/showcases/interleave/case_0007_bowie_slide_design.webp) |
229
 
230
 
231
  #### ♻️ *Interleaved Generation (Reasoning)*
 
254
 
255
  #### 📝 *Visual Understanding (Agentic)*
256
 
 
 
257
  | |
258
  | :---: |
259
  | [<img alt="vqa agentic case" src="./assets/showcases/vqa/agentic_case.webp">](./assets/showcases/vqa/agentic_case.webp) |
docs/showcases_CN.md CHANGED
@@ -197,16 +197,18 @@
197
 
198
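  Each case below is one complete response from `model.interleave_gen`: the model first generates several intermediate images inside the `<think>...</think>` reasoning block, then outputs the final interleaved text-and-image answer.
199
 
200
- Reproducible prompts are in [`examples/interleave/data/samples.jsonl`](../examples/interleave/data/sample.jsonl).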
 
201
 
202
 
203
  | |
204
  | :---: |
205
- | [<img alt="interleave case 01" src="./assets/showcases/interleave/case_01.webp">](./assets/showcases/interleave/case_01.webp) |
206
- | [<img alt="interleave case 02" src="./assets/showcases/interleave/case_02.webp">](./assets/showcases/interleave/case_02.webp) |
207
- | [<img alt="interleave case 03" src="./assets/showcases/interleave/case_03.webp">](./assets/showcases/interleave/case_03.webp) |
208
- | [<img alt="interleave case 04" src="./assets/showcases/interleave/case_04.webp">](./assets/showcases/interleave/case_04.webp) |
209
- | [<img alt="interleave case 05" src="./assets/showcases/interleave/case_05.webp">](./assets/showcases/interleave/case_05.webp) |
 
210
 
211
  ---
212
 
 
197
 
198
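  Each case below is one complete response from `model.interleave_gen`: the model first generates several intermediate images inside the `<think>...</think>` reasoning block, then outputs the final interleaved text-and-image answer.
199
 
200
+ Reproducible prompts are in [`examples/interleave/data/samples.jsonl`](../examples/interleave/data/samples.jsonl).
201
+ All examples are generated with think-mode reasoning; the chain-of-thought is omitted in some cases for cleaner visualization.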
202
 
203
 
204
  | |
205
  | :---: |
206
+ | [<img alt="interleave case 03" src="./assets/showcases/interleave/case_0003_beachfront_villa.webp">](./assets/showcases/interleave/case_0003_beachfront_villa.webp) |
207
+ | [<img alt="interleave case 04" src="./assets/showcases/interleave/case_0004_scented_candle_promo.webp">](./assets/showcases/interleave/case_0004_scented_candle_promo.webp) |
208
+ | [<img alt="interleave case 05" src="./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp">](./assets/showcases/interleave/case_0005_matchgirl_warm_au.webp) |
209
+ | [<img alt="interleave case 06" src="./assets/showcases/interleave/case_0006_orange_cat_travel.webp">](./assets/showcases/interleave/case_0006_orange_cat_travel.webp) |
210
+ | [<img alt="interleave case 01" src="./assets/showcases/interleave/case_0001_makeup_three_looks.webp">](./assets/showcases/interleave/case_0001_makeup_three_looks.webp) |
211
+ | [<img alt="interleave case 07" src="./assets/showcases/interleave/case_0007_bowie_slide_design.webp">](./assets/showcases/interleave/case_0007_bowie_slide_design.webp) |
212
 
213
  ---
214