# LightLLM + LightX2V Deployment


This guide provides a practical deployment flow for serving SenseNova-U1 with
LightLLM + LightX2V using the Docker image
`lightx2v/lightllm_lightx2v:20260407`.


## 1) Pull and enter the Docker image


```bash
docker pull lightx2v/lightllm_lightx2v:20260407
docker run --gpus all --ipc=host --network host -it lightx2v/lightllm_lightx2v:20260407 /bin/bash
```
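
If your model weights live on the host, mount them into the container so that `--model_dir` in the later launch commands can point at them. A minimal sketch, assuming a host checkpoint directory of `/data/models/SenseNova-U1` and a container mount point of `/models` (both placeholder paths, not paths required by the image):

```bash
# Same launch as above, plus a bind mount for the model checkpoint.
# /data/models/SenseNova-U1 (host) and /models (container) are example paths.
docker run --gpus all --ipc=host --network host \
  -v /data/models/SenseNova-U1:/models \
  -it lightx2v/lightllm_lightx2v:20260407 /bin/bash
```

Inside the container, setting `MODEL_DIR=/models` then matches the `$MODEL_DIR` used in the launch commands below.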


## 2) Clone runtime dependencies inside the container


The image may not include the latest source trees. Clone both repositories and
pin LightLLM to the validated branch:


```bash
git clone https://github.com/ModelTC/LightX2V.git
git clone https://github.com/ModelTC/LightLLM.git
cd LightLLM
git checkout neo_plus_clean
```
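
The launch commands below prefix every invocation with `PYTHONPATH=/workspace/LightX2V/`, which assumes the repositories were cloned under `/workspace`. If you prefer, you can export the variable once per shell instead (a small sketch; adjust the path to wherever you cloned LightX2V):

```bash
# Optional: set PYTHONPATH once instead of prefixing every launch command.
# Assumes LightX2V was cloned to /workspace/LightX2V; adjust if it lives elsewhere.
export PYTHONPATH=/workspace/LightX2V/:$PYTHONPATH
```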


## 3) X2I-related arguments


When enabling image generation in the same API server, use the following flags:


- `--enable_multimodal_x2i`
  Enable image generation capability.
- `--x2i_server_used_gpus`
  Number of GPUs reserved for the X2I generation server.
- `--x2i_server_deploy_mode {colocate,separate}`
  - `colocate`: understanding and generation share the same visible GPU pool.
  - `separate`: understanding and generation are deployed as separate services and
    can use different GPU sets.
- `--x2i_use_naive_impl`
  Use the naive PyTorch backend for X2I (for debugging/testing only, not for
  production throughput).
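
To check how these options are wired up in your image, you can grep the server's argument help (a quick sanity check; assumes the X2I flags appear in this build's `--help` output):

```bash
# List the X2I-related options registered by the API server in this image.
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server --help | grep -i x2i
```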

## 4) Deployment modes


### Mode A: `colocate` (single service, shared GPU pool)


Use this mode for quick validation and simpler operations. The LLM understanding
path (`--tp`) and X2I generation path (`--x2i_server_used_gpus`) consume resources
from the same visible GPUs.


Example (2 GPUs total):
- understanding path: `tp=2`
- generation path: `cfg=2` (configured in `neopp_dense_parallel_cfg.json`)


```bash
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
    --model_dir $MODEL_DIR \
    --enable_multimodal_x2i \
    --x2i_server_deploy_mode colocate \
    --x2i_server_used_gpus 2 \
    --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json \
    --host 0.0.0.0 \
    --port 8000 \
    --max_req_total_len 65536 \
    --mem_fraction 0.75 \
    --tp 2
```
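
In `colocate` mode both paths draw from whatever GPUs are visible to the process, so the simplest way to pin the shared pool is to restrict visibility before launching. A minimal sketch, assuming GPUs 0 and 1 are the two devices to share (`CUDA_VISIBLE_DEVICES` is standard CUDA behavior, not a LightLLM-specific flag):

```bash
# Expose only GPUs 0 and 1; in colocate mode the understanding path (tp=2) and
# the generation path (x2i_server_used_gpus=2) share this visible pool.
export CUDA_VISIBLE_DEVICES=0,1
```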


### Mode B: `separate` (understanding and generation decoupled)


`separate` is conceptually similar to PD-style decoupling in LLM serving: different
stages are split onto different GPU groups so that a long stage does not block a
short one.


For multimodal serving, image generation is usually the long stage, while
understanding is short and lightweight. Separating them allows understanding
requests to keep flowing even when generation workers are busy.


Recommended deployment profiles:

1. **Default profile (continuity-first): Understanding `tp=1` + Generation 1 GPU**
   - Understanding: `--tp 1`
   - Generation: `--x2i_server_used_gpus 1`
   - Use as the baseline profile for mixed workloads. It keeps the pipeline simple
     while avoiding head-of-line blocking between understanding and generation.

2. **Understanding-expanded profile: Understanding `tp=2` + Generation 1 GPU**
   - Understanding: `--tp 2`
   - Generation: `--x2i_server_used_gpus 1`
   - Use when complex prompts or high understanding QPS become the bottleneck.

3. **Generation-expanded profile: Understanding `tp=1/2` + Generation parallel**
   - Understanding: `--tp 1` or `--tp 2`
   - Generation option A (2 GPUs): `--x2i_server_used_gpus 2` +
     `/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json`
   - Generation option B (4 GPUs): `--x2i_server_used_gpus 4` +
     `/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg_seq.json`
   - Use when generation latency/throughput dominates (most common scaling path).

Example launch (`separate` mode via the API server; this command matches the
understanding-expanded profile: understanding `tp=2`, generation on 1 GPU):


```bash
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
    --model_dir $MODEL_DIR \
    --enable_multimodal_x2i \
    --x2i_server_deploy_mode separate \
    --x2i_server_used_gpus 1 \
    --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense.json \
    --host 0.0.0.0 \
    --port 8000 \
    --max_req_total_len 65536 \
    --mem_fraction 0.75 \
    --tp 2
```
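
For the generation-expanded profile, only the generation-side arguments change relative to the command above. A sketch of option B from the profile list (4 generation GPUs with the `neopp_dense_parallel_cfg_seq.json` config, understanding kept at `tp=1`):

```bash
# Generation-expanded variant (profile 3, option B): 4 GPUs for X2I generation,
# understanding on tp=1. Remaining flags mirror the command above.
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
    --model_dir $MODEL_DIR \
    --enable_multimodal_x2i \
    --x2i_server_deploy_mode separate \
    --x2i_server_used_gpus 4 \
    --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg_seq.json \
    --host 0.0.0.0 \
    --port 8000 \
    --max_req_total_len 65536 \
    --mem_fraction 0.75 \
    --tp 1
```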


## 5) Quantization


`separate` mode also enables independent quantization strategies for understanding
and generation. Because the two paths are decoupled, you can tune the
quality/latency trade-off for each one independently:

1. **Understanding FP16/BF16 + Generation FP8**
   - Understanding: no quantization flag (keep default precision)
   - Generation: use an FP8 generation config, for example
     `/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
   - Recommended as the default quantized profile for production.

2. **Understanding FP8 + Generation FP8**
   - Understanding: add `--quant_type fp8w8a8`
   - Generation: use the FP8 generation config
     `/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
   - Use when GPU memory/throughput is the primary constraint.

Notes:
- `--quant_type fp8w8a8` controls quantization on the understanding path.
- Generation-side precision is controlled by `--x2v_gen_model_config`.
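
Putting the two knobs together, a sketch of the fully quantized profile in `separate` mode (config path and `--quant_type` value as listed above; the remaining arguments follow the earlier launch examples):

```bash
# Understanding quantized with fp8w8a8, generation driven by the FP8 config.
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
    --model_dir $MODEL_DIR \
    --enable_multimodal_x2i \
    --x2i_server_deploy_mode separate \
    --x2i_server_used_gpus 1 \
    --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_fp8.json \
    --quant_type fp8w8a8 \
    --host 0.0.0.0 \
    --port 8000 \
    --max_req_total_len 65536 \
    --mem_fraction 0.75 \
    --tp 1
```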


## 6) OpenAI-compatible API


Once the API server is up, you can send requests through the OpenAI-compatible
endpoint exposed by LightLLM. A minimal text-to-image example:


```bash
python examples/serving/client.py \
    --mode t2i \
    --prompt "A cozy coffee shop storefront with infographic style."
```
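
For quick manual checks of the understanding path, you can also hit the chat route directly with `curl`. This is only a sketch: it assumes the server exposes the standard OpenAI `/v1/chat/completions` route and accepts an arbitrary `model` value, neither of which is confirmed by this guide; `client.py` remains the reference client.

```bash
# Hypothetical smoke test of the understanding path via the OpenAI-style chat route.
# The route and the "model" value are assumptions, not confirmed by this guide.
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sensenova-u1",
    "messages": [{"role": "user", "content": "Give a one-sentence summary of what this server does."}]
  }'
```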


See [`examples/serving/client.py`](../examples/serving/client.py) for more modes
(VQA, editing, interleaved) and request formats.