
Inference Infrastructure

← Back to main README

This document describes the inference infrastructure behind SenseNova-U1, built on top of LightLLM and LightX2V.

Overview

SenseNova-U1 is exposed as a single unified multimodal model, but in production the understanding and generation paths have very different execution shapes: they favor different scheduling policies, parallelization strategies, and resource ratios rather than a single shared serving configuration. Coupling both paths inside one monolithic runtime ties these choices together unnecessarily, which can leave each path operating away from its optimal point.

To avoid this coupling, SenseNova-U1 adopts a disaggregated architecture:

  • LightLLM for understanding, text streaming, and control flow
  • LightX2V for image generation

These two engines exchange generation state through pinned shared memory and high-performance transfer kernels. The handoff is lightweight, while each side can still run with its own optimal execution policy.
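As a rough illustration of the staging idea, the sketch below moves a tensor from one GPU to another through a pinned host buffer using plain PyTorch. This is a minimal single-process approximation only: the production path shares memory between two engine processes and uses dedicated transfer kernels, and the tensor shape here is arbitrary.

```python
import torch

# Minimal single-process sketch of staging a tensor through pinned host
# memory. The real LightLLM/LightX2V handoff is cross-process and uses
# custom high-performance transfer kernels; this only shows the
# pinned-staging idea. Assumes at least two visible GPUs.
src = torch.randn(1024, 4096, device="cuda:0")             # produced by engine A
staging = torch.empty_like(src, device="cpu").pin_memory() # pinned host buffer
staging.copy_(src, non_blocking=True)                      # device -> pinned host
torch.cuda.synchronize(0)                                  # ensure D2H finished
dst = staging.to("cuda:1", non_blocking=True)              # pinned host -> device
torch.cuda.synchronize(1)
assert torch.equal(dst.cpu(), src.cpu())
```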

*Figure: LightLLM + LightX2V decoupled architecture*

This design provides practical benefits in production:

  • Independent parallelism (for example, understanding with TP=2, generation with CFG=2 or SP=2).
  • Independent resource allocation (different GPU counts and memory budgets).
  • Independent scaling for text-heavy vs. image-heavy traffic.
  • Better operational isolation and simpler performance tuning.

The same architecture can be deployed in two modes, depending on your hardware budget and traffic pattern:

  • Separate: LightLLM and LightX2V run on different GPU groups.
  • Colocate: LightLLM and LightX2V run as separate processes on the same GPU.

In most production setups, Separate is the default choice because it gives clearer bottleneck control and independent scaling. Colocate is useful for quick validation, generation-heavy workloads, or smaller GPU setups. A sketch of the two layouts follows below.
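The snippet below sketches the two modes as a plain data structure. The GPU indices and parallelism labels are illustrative only and do not correspond to actual LightLLM or LightX2V launch flags.

```python
# Hypothetical GPU layouts for the two deployment modes (illustrative
# only; not actual LightLLM/LightX2V launch configuration).
SEPARATE = {
    "lightllm": {"gpus": [0, 1], "parallelism": "TP=2"},   # understanding
    "lightx2v": {"gpus": [2, 3], "parallelism": "CFG=2"},  # generation
}
COLOCATE = {
    "lightllm": {"gpus": [0, 1], "parallelism": "TP=2"},
    "lightx2v": {"gpus": [0, 1], "parallelism": "CFG=2"},  # same GPUs, separate processes
}
```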

Attention for Neo's Multimodal Prefill

Neo's prefill attention is not standard causal attention. Text tokens remain causal, while image tokens attend to the full text prefix together with the entire image span. To support this hybrid masking pattern, we modified both attention implementations in our stack: the Triton kernel and the official FA3 codebase. Our FA3 branch is available at WANDY666/flash-attention.

Concretely, we introduced an optional image_token_tag argument that adjusts the mask row by row. Text rows keep the standard causal mask. Image rows, instead of using plain causal truncation, are allowed to attend to all preceding text tokens and all image tokens within the image span.
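For reference, the sketch below builds the dense version of this hybrid mask in PyTorch from a boolean image_token_tag vector. It restates the masking rule only, not the fused kernel; build_hybrid_mask and its span-labeling logic are illustrative names introduced here, not part of the actual codebase.

```python
import torch

def build_hybrid_mask(image_token_tag: torch.Tensor) -> torch.Tensor:
    """Dense [L, L] boolean attention mask (True = may attend).

    Illustrative sketch of the hybrid rule described above: text rows
    keep the standard causal mask; image rows attend to all preceding
    text tokens plus every token inside their own contiguous image span.
    """
    L = image_token_tag.numel()
    tag = image_token_tag.bool()
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))
    # Label contiguous image spans (0 = text, 1..n = span id).
    starts = tag & ~torch.cat([tag.new_zeros(1), tag[:-1]])
    span_id = torch.cumsum(starts.long(), dim=0) * tag.long()
    # Image rows attend bidirectionally within their own span.
    same_span = (span_id[:, None] == span_id[None, :]) & tag[:, None] & tag[None, :]
    # Keys visible to image rows: preceding text tokens + own image span.
    text_keys = causal & ~tag[None, :]
    return torch.where(tag[:, None], text_keys | same_span, causal)
```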

To preserve the causal-triangle speedup whenever possible, the kernel makes the decision per M-block. It OR-reduces the image_token_tag values inside the current block: if the block contains no image token, it keeps the standard causal K-range; if the block contains image tokens, it extends the K-range to cover the required image span. As a result, pure-text blocks still follow the normal causal path, while only the relevant blocks pay the extra work needed by the hybrid mask.
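The host-side sketch below mirrors that per-block decision. has_image reproduces the OR-reduction over the block, while span_end (the exclusive end index of the image span containing each row) is a hypothetical precomputed array introduced here purely for illustration.

```python
def k_range_for_m_block(m_start: int, m_end: int,
                        image_token_tag: list[int],
                        span_end: list[int]) -> int:
    """Exclusive K upper bound for query rows [m_start, m_end).

    Illustrative restatement of the kernel's per-block rule; span_end[i]
    is assumed to hold the exclusive end of the image span containing
    row i (unused for text rows).
    """
    causal_k_end = m_end                              # standard causal bound
    has_image = any(image_token_tag[m_start:m_end])   # OR-reduction over the block
    if not has_image:
        return causal_k_end                           # pure-text block: causal path
    # Extend K to cover the furthest image span touched by this block.
    span_k_end = max(span_end[i] for i in range(m_start, m_end)
                     if image_token_tag[i])
    return max(causal_k_end, span_k_end)
```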

*Figure: Neo multimodal attention behavior*

The overhead therefore does not depend on a fixed image-token ratio, but on how image tokens are distributed across the sequence and across M-block boundaries. When image rows are concentrated in only part of the sequence, the extra work is correspondingly localized. For text-only requests, image_token_tag is empty, and the kernel falls back to vanilla FA3 with no additional overhead. The benchmark below compares two implementations for Neo-style multimodal prefill:

  • Triton implementation: easier to migrate into existing codebases, with lower integration cost and faster iteration.
  • FA3 implementation: higher absolute performance on supported hardware.
| batch | max_seq_len | image_token_num | triton (ms) | fa3 (ms) | speedup (×) |
|------:|------------:|----------------:|------------:|---------:|------------:|
| 8 | 4096 | 88 | 1.95 | 0.81 | 2.41× |
| 8 | 8192 | 171 | 6.55 | 2.68 | 2.45× |
| 8 | 65536 | 150 | 43.30 | 14.95 | 2.90× |
| 16 | 4096 | 379 | 4.12 | 1.68 | 2.46× |
| 16 | 8192 | 246 | 17.76 | 7.40 | 2.40× |
| 16 | 65536 | 206 | 107.74 | 33.66 | 3.20× |
| 32 | 4096 | 726 | 8.46 | 3.46 | 2.44× |
| 32 | 8192 | 536 | 31.74 | 13.24 | 2.40× |
| 32 | 65536 | 417 | 171.00 | 58.26 | 2.94× |
| 64 | 4096 | 1170 | 16.08 | 6.88 | 2.34× |
| 64 | 8192 | 1177 | 55.48 | 22.91 | 2.42× |
| 64 | 65536 | 1291 | 348.89 | 124.82 | 2.80× |
| 128 | 4096 | 2057 | 30.89 | 12.53 | 2.47× |
| 128 | 8192 | 2196 | 104.73 | 43.22 | 2.42× |
| 128 | 65536 | 2205 | 706.60 | 241.67 | 2.92× |

Deployment

For a concise deployment runbook (Docker image, startup command, and API tests), see deployment.md.

Generation Performance

The table below reports measured numbers for 2048x2048 image generation across machine types and deployment profiles.

| Machine Type | Deployment Config | Per-step Latency (s/step) | End-to-end Latency (s) |
|--------------|-------------------|--------------------------:|-----------------------:|
| H100 | TP2+CFG2 / colocate | 0.158 | 9.23 |
| H200 | TP2+CFG2 / colocate | 0.152 | 9.54 |
| 5090 | TP2+CFG2 / separate | 0.415 | 23.04 |
| L40S | TP2+CFG2 / separate | 0.443 | 25.62 |

In Neo, the KV cache for the generation stage is provided by the understanding module, so T2I (generation) and I2I (editing) have very similar runtime characteristics. For brevity, we report only T2I latency here.