# LightLLM + LightX2V Deployment


This guide provides a practical deployment flow for serving SenseNova-U1 with
LightLLM + LightX2V using the Docker image
`lightx2v/lightllm_lightx2v:20260407`.


## 1) Pull and enter the Docker image


```bash
docker pull lightx2v/lightllm_lightx2v:20260407
docker run --gpus all --ipc=host --network host -it lightx2v/lightllm_lightx2v:20260407 /bin/bash
```
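
If your model weights live on the host, mount them into the container so that `--model_dir` in the later launch commands can point at them. A minimal sketch, assuming a host checkpoint directory of `/data/models/SenseNova-U1` and a container mount point of `/models` (both placeholder paths, not paths required by the image):

```bash
# Same launch as above, plus a bind mount for the model checkpoint.
# /data/models/SenseNova-U1 (host) and /models (container) are example paths.
docker run --gpus all --ipc=host --network host \
  -v /data/models/SenseNova-U1:/models \
  -it lightx2v/lightllm_lightx2v:20260407 /bin/bash
```

Inside the container, setting `MODEL_DIR=/models` then matches the `$MODEL_DIR` used in the launch commands below.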


## 2) Clone runtime dependencies inside the container


The image may not include the latest source trees. Clone both repositories and
pin LightLLM to the validated branch:


```bash
git clone https://github.com/ModelTC/LightX2V.git
git clone https://github.com/ModelTC/LightLLM.git
cd LightLLM
git checkout neo_plus_clean
```
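
The launch commands below prefix every invocation with `PYTHONPATH=/workspace/LightX2V/`, which assumes the repositories were cloned under `/workspace`. If you prefer, you can export the variable once per shell instead (a small sketch; adjust the path to wherever you cloned LightX2V):

```bash
# Optional: set PYTHONPATH once instead of prefixing every launch command.
# Assumes LightX2V was cloned to /workspace/LightX2V; adjust if it lives elsewhere.
export PYTHONPATH=/workspace/LightX2V/:$PYTHONPATH
```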


## 3) X2I-related arguments


When enabling image generation in the same API server, use the following flags:


- `--enable_multimodal_x2i`
  Enable image generation capability.
- `--x2i_server_used_gpus`
  Number of GPUs reserved for the X2I generation server.
- `--x2i_server_deploy_mode {colocate,separate}`
  - `colocate`: understanding and generation share the same visible GPU pool.
  - `separate`: understanding and generation are deployed as separate services and
    can use different GPU sets.
- `--x2i_use_naive_impl`
  Use the naive PyTorch backend for X2I (for debugging/testing only, not for
  production throughput).
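
To check how these options are wired up in your image, you can grep the server's argument help (a quick sanity check; assumes the X2I flags appear in this build's `--help` output):

```bash
# List the X2I-related options registered by the API server in this image.
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server --help | grep -i x2i
```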

## 4) Deployment modes


### Mode A: `colocate` (single service, shared GPU pool)


Use this mode for quick validation and simpler operations. The LLM understanding
path (`--tp`) and X2I generation path (`--x2i_server_used_gpus`) consume resources
from the same visible GPUs.


Example (2 GPUs total):
- understanding path: `tp=2`
- generation path: `cfg=2` (configured in `neopp_dense_parallel_cfg.json`)


```bash
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
    --model_dir $MODEL_DIR \
    --enable_multimodal_x2i \
    --x2i_server_deploy_mode colocate \
    --x2i_server_used_gpus 2 \
    --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json \
    --host 0.0.0.0 \
    --port 8000 \
    --max_req_total_len 65536 \
    --mem_fraction 0.75 \
    --tp 2
```
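
In `colocate` mode both paths draw from whatever GPUs are visible to the process, so the simplest way to pin the shared pool is to restrict visibility before launching. A minimal sketch, assuming GPUs 0 and 1 are the two devices to share (`CUDA_VISIBLE_DEVICES` is standard CUDA behavior, not a LightLLM-specific flag):

```bash
# Expose only GPUs 0 and 1; in colocate mode the understanding path (tp=2) and
# the generation path (x2i_server_used_gpus=2) share this visible pool.
export CUDA_VISIBLE_DEVICES=0,1
```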


### Mode B: `separate` (understanding and generation decoupled)


`separate` is conceptually similar to PD-style decoupling in LLM serving: different
stages are split onto different GPU groups so that a long stage does not block a
short one.


For multimodal serving, image generation is usually the long stage, while
understanding is short and lightweight. Separating them allows understanding
requests to keep flowing even when generation workers are busy.


Recommended deployment profiles:

1. **Default profile (continuity-first): Understanding `tp=1` + Generation 1 GPU**
   - Understanding: `--tp 1`
   - Generation: `--x2i_server_used_gpus 1`
   - Use as the baseline profile for mixed workloads. It keeps the pipeline simple
     while avoiding head-of-line blocking between understanding and generation.

2. **Understanding-expanded profile: Understanding `tp=2` + Generation 1 GPU**
   - Understanding: `--tp 2`
   - Generation: `--x2i_server_used_gpus 1`
   - Use when complex prompts or high understanding QPS become the bottleneck.

3. **Generation-expanded profile: Understanding `tp=1/2` + Generation parallel**
   - Understanding: `--tp 1` or `--tp 2`
   - Generation option A (2 GPUs): `--x2i_server_used_gpus 2` +
     `/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg.json`
   - Generation option B (4 GPUs): `--x2i_server_used_gpus 4` +
     `/workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg_seq.json`
   - Use when generation latency/throughput dominates (most common scaling path).

Example launch (`separate` mode via the API server; this command matches the
understanding-expanded profile: understanding `tp=2`, generation on 1 GPU):


```bash
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
    --model_dir $MODEL_DIR \
    --enable_multimodal_x2i \
    --x2i_server_deploy_mode separate \
    --x2i_server_used_gpus 1 \
    --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense.json \
    --host 0.0.0.0 \
    --port 8000 \
    --max_req_total_len 65536 \
    --mem_fraction 0.75 \
    --tp 2
```
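
For the generation-expanded profile, only the generation-side arguments change relative to the command above. A sketch of option B from the profile list (4 generation GPUs with the `neopp_dense_parallel_cfg_seq.json` config, understanding kept at `tp=1`):

```bash
# Generation-expanded variant (profile 3, option B): 4 GPUs for X2I generation,
# understanding on tp=1. Remaining flags mirror the command above.
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
    --model_dir $MODEL_DIR \
    --enable_multimodal_x2i \
    --x2i_server_deploy_mode separate \
    --x2i_server_used_gpus 4 \
    --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_parallel_cfg_seq.json \
    --host 0.0.0.0 \
    --port 8000 \
    --max_req_total_len 65536 \
    --mem_fraction 0.75 \
    --tp 1
```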


## 5) Quantization


`separate` mode also enables independent quantization strategies for understanding
and generation. Because the two paths are decoupled, you can tune the
quality/latency trade-off for each one independently:

1. **Understanding FP16/BF16 + Generation FP8**
   - Understanding: no quantization flag (keep default precision)
   - Generation: use an FP8 generation config, for example
     `/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
   - Recommended as the default quantized profile for production.

2. **Understanding FP8 + Generation FP8**
   - Understanding: add `--quant_type fp8w8a8`
   - Generation: use the FP8 generation config
     `/workspace/LightX2V/configs/neopp/neopp_dense_fp8.json`
   - Use when GPU memory/throughput is the primary constraint.

Notes:
- `--quant_type fp8w8a8` controls quantization on the understanding path.
- Generation-side precision is controlled by `--x2v_gen_model_config`.
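
Putting the two knobs together, a sketch of the fully quantized profile in `separate` mode (config path and `--quant_type` value as listed above; the remaining arguments follow the earlier launch examples):

```bash
# Understanding quantized with fp8w8a8, generation driven by the FP8 config.
PYTHONPATH=/workspace/LightX2V/ \
python -m lightllm.server.api_server \
    --model_dir $MODEL_DIR \
    --enable_multimodal_x2i \
    --x2i_server_deploy_mode separate \
    --x2i_server_used_gpus 1 \
    --x2v_gen_model_config /workspace/LightX2V/configs/neopp/neopp_dense_fp8.json \
    --quant_type fp8w8a8 \
    --host 0.0.0.0 \
    --port 8000 \
    --max_req_total_len 65536 \
    --mem_fraction 0.75 \
    --tp 1
```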


## 6) OpenAI-compatible API


Once the API server is up, you can send requests through the OpenAI-compatible
endpoint exposed by LightLLM. A minimal text-to-image example:


```bash
python examples/serving/client.py \
    --mode t2i \
    --prompt "A cozy coffee shop storefront with infographic style."
```
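
For quick manual checks of the understanding path, you can also hit the chat route directly with `curl`. This is only a sketch: it assumes the server exposes the standard OpenAI `/v1/chat/completions` route and accepts an arbitrary `model` value, neither of which is confirmed by this guide; `client.py` remains the reference client.

```bash
# Hypothetical smoke test of the understanding path via the OpenAI-style chat route.
# The route and the "model" value are assumptions, not confirmed by this guide.
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sensenova-u1",
    "messages": [{"role": "user", "content": "Give a one-sentence summary of what this server does."}]
  }'
```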


See [`examples/serving/client.py`](../examples/serving/client.py) for more modes
(VQA, editing, interleaved) and request formats.