
# LightLLM + LightX2V Deployment

This guide provides a practical deployment flow for serving SenseNova-U1 with LightLLM + LightX2V using the Docker image `lightx2v/lightllm_lightx2v:20260407`.

## 1) Pull and enter the Docker image

```bash
docker pull lightx2v/lightllm_lightx2v:20260407
docker run --gpus all --ipc=host --network host -it lightx2v/lightllm_lightx2v:20260407 /bin/bash
```

## 2) Clone runtime dependencies inside the container

The image may not include the latest source trees. Clone both repositories and pin LightLLM to the validated branch:

```bash
git clone https://github.com/ModelTC/LightX2V.git
git clone https://github.com/ModelTC/LightLLM.git
cd LightLLM
git checkout neo_plus_clean
```

## 3) X2I-related arguments

When enabling image generation in the same API server, use the following flags:

- `--enable_multimodal_x2i`
  Enable image generation capability.
- `--x2i_server_used_gpus`
  Number of GPUs reserved for the X2I generation server.
- `--x2i_server_deploy_mode {colocate,separate}`
  - `colocate`: understanding and generation share the same visible GPU pool.
  - `separate`: understanding and generation are deployed as separate services and can use different GPU sets.
- `--x2i_use_naive_impl`
  Use the naive PyTorch backend for X2I (debugging/testing only; not for production throughput).

## 4) Deployment modes

### Mode A: colocate (single service, shared GPU pool)

Use this mode for quick validation and simpler operations. The LLM understanding path (`--tp`) and the X2I generation path (`--x2i_server_used_gpus`) consume resources from the same visible GPUs.

Example (2 GPUs total):

- understanding path: `tp=2`
- generation path: `cfg=2` (configured in `neopp_dense_parallel_cfg.json`)

```bash
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
  --model_dir $MODEL_DIR \
  --enable_multimodal_x2i \
  --x2i_server_deploy_mode colocate \
  --x2i_server_used_gpus 2 \
  --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json \
  --host 0.0.0.0 \
  --port 8000 \
  --max_req_total_len 65536 \
  --mem_fraction 0.75 \
  --tp 2
```

### Mode B: separate (understanding and generation decoupled)

`separate` mode is conceptually similar to PD-style (prefill/decode) disaggregation in LLM serving: different stages run on different GPU groups, so a long stage does not block a short one.

For multimodal serving, image generation is usually the long stage, while understanding is short and lightweight. Separating them allows understanding requests to keep flowing even when generation workers are busy.

Recommended deployment profiles:

1. **Default profile (continuity-first): understanding `tp=1` + generation 1 GPU**
   - Understanding: `--tp 1`
   - Generation: `--x2i_server_used_gpus 1`
   - Use as the baseline profile for mixed workloads. It keeps the pipeline simple while avoiding head-of-line blocking between understanding and generation.
2. **Understanding-expanded profile: understanding `tp=2` + generation 1 GPU**
   - Understanding: `--tp 2`
   - Generation: `--x2i_server_used_gpus 1`
   - Use when complex prompts or high understanding QPS become the bottleneck.
3. **Generation-expanded profile: understanding `tp=1/2` + parallel generation**
   - Understanding: `--tp 1` or `--tp 2`
   - Generation option A (2 GPUs): `--x2i_server_used_gpus 2` + `/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json`
   - Generation option B (4 GPUs): `--x2i_server_used_gpus 4` + `/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg_seq.json`
   - Use when generation latency/throughput dominates (the most common scaling path).

Example launch (separate mode in API server):

```bash
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
  --model_dir $MODEL_DIR \
  --enable_multimodal_x2i \
  --x2i_server_deploy_mode separate \
  --x2i_server_used_gpus 1 \
  --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense.json \
  --host 0.0.0.0 \
  --port 8000 \
  --max_req_total_len 65536 \
  --mem_fraction 0.75 \
  --tp 2
```
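As a sketch, the generation-expanded profile (option B above) uses the same launch shape, swapping in 4 generation GPUs and the sequence-parallel config; adjust `$MODEL_DIR` and paths to your environment:

```bash
# Sketch: generation-expanded profile (option B) — tp=2 understanding, 4-GPU generation
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
  --model_dir $MODEL_DIR \
  --enable_multimodal_x2i \
  --x2i_server_deploy_mode separate \
  --x2i_server_used_gpus 4 \
  --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg_seq.json \
  --host 0.0.0.0 \
  --port 8000 \
  --max_req_total_len 65536 \
  --mem_fraction 0.75 \
  --tp 2
```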

## 5) Quantization

`separate` mode also enables independent quantization strategies for understanding and generation.

Because understanding and generation are decoupled, you can tune quality/latency for each path independently:

1. **Understanding FP16/BF16 + generation FP8**
   - Understanding: no quantization flag (keep default precision)
   - Generation: use an FP8 generation config, for example `/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
   - Recommended as the default quantized profile for production.
2. **Understanding FP8 + generation FP8**
   - Understanding: add `--quant_type fp8w8a8`
   - Generation: use the FP8 generation config `/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
   - Use when GPU memory/throughput is the primary constraint.

Notes:

- `--quant_type fp8w8a8` controls quantization on the understanding path.
- Generation-side precision is controlled by `--x2v_gen_model_config`.
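Putting the flags together, a fully quantized profile (option 2 above) could be launched as in this sketch, assuming the separate-mode layout from section 4; adjust `$MODEL_DIR` and paths to your environment:

```bash
# Sketch: understanding FP8 (--quant_type fp8w8a8) + generation FP8 config
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
  --model_dir $MODEL_DIR \
  --enable_multimodal_x2i \
  --x2i_server_deploy_mode separate \
  --x2i_server_used_gpus 1 \
  --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_fp8.json \
  --quant_type fp8w8a8 \
  --host 0.0.0.0 \
  --port 8000 \
  --max_req_total_len 65536 \
  --mem_fraction 0.75 \
  --tp 1
```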

## 6) OpenAI-compatible API

Once the API server is up, you can send requests through the OpenAI-compatible endpoint exposed by LightLLM. A minimal text-to-image example:

```bash
python examples/serving/client.py \
  --mode t2i \
  --prompt "A cozy coffee shop storefront with infographic style."
```

See examples/serving/client.py for more modes (VQA, editing, interleaved) and request formats.
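If you prefer to hit the endpoint directly instead of using the bundled client, the sketch below builds an OpenAI-style chat-completions payload. The endpoint path (`/v1/chat/completions`) and the model name are assumptions based on the server being OpenAI-compatible; check `examples/serving/client.py` for the exact request format your deployment expects.

```python
import json

# Assumptions: model identifier and endpoint path depend on your server setup.
MODEL = "sensenova-u1"
BASE_URL = "http://0.0.0.0:8000"  # host/port from the launch examples above


def build_chat_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


payload = build_chat_request("Describe the attached image in one sentence.")
print(json.dumps(payload, indent=2))
# Send it with, e.g.:
#   import requests
#   resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload)
```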