Tested on B300 and got roughly the same output as FP8
#4, opened by O-delicious
# Build and install flashinfer from source with fp4 quantization fix.
# Keep this aligned with the v0.5.9 base image's flashinfer_python version.
RUN --mount=type=cache,target=/root/.cache/pip \
    --mount=type=cache,target=/sgl-workspace/flashinfer-build \
    git config --global http.proxy "${http_proxy:-$HTTP_PROXY}" && \
    git config --global https.proxy "${https_proxy:-$HTTPS_PROXY}" && \
    if [ ! -d /sgl-workspace/flashinfer-build/flashinfer/.git ]; then \
        rm -rf /sgl-workspace/flashinfer-build/flashinfer && \
        git clone https://github.com/flashinfer-ai/flashinfer.git /sgl-workspace/flashinfer-build/flashinfer; \
    fi && \
    cd /sgl-workspace/flashinfer-build/flashinfer && \
    git fetch --tags origin && \
    git checkout -f v0.6.3 && \
    git submodule sync --recursive && \
    git submodule update --init --recursive --force && \
    git config user.email "build@example.com" && \
    git config user.name "Build" && \
    git remote add nvjullin https://github.com/nvjullin/flashinfer 2>/dev/null || true && \
    git fetch nvjullin fix-fp4-quant-padding && \
    git cherry-pick --skip || true && \
    git cherry-pick a022c4d4 72d6572b && \
    git submodule sync --recursive && \
    git submodule update --init --recursive --force && \
    test -f 3rdparty/spdlog/include/spdlog/sinks/stdout_color_sinks.h && \
    cd flashinfer-jit-cache && \
    MAX_JOBS=32 FLASHINFER_NVCC_THREADS=2 FLASHINFER_CUDA_ARCH_LIST="10.0a 10.3a" python -m build --no-isolation --skip-dependency-check --wheel && \
    python -m pip install dist/*.whl
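To catch a broken build at image-build time rather than at server startup, one could append a quick import check after the wheel install (a sketch; it assumes the jit-cache wheel puts `flashinfer` on the default Python path and that the package exposes `__version__`):

```dockerfile
# Hypothetical sanity check: fail the image build immediately if the
# freshly installed flashinfer wheel cannot be imported.
RUN python -c "import flashinfer; print('flashinfer', flashinfer.__version__)"
```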
I changed the cherry-pick commit IDs slightly.

Hosted with:
python3 -m sglang.launch_server \
    --model /data/GLM-5-NVFP4 \
    --port 8000 \
    --tensor-parallel-size 8 \
    --ep-size 8 \
    --quantization modelopt_fp4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --trust-remote-code \
    --chunked-prefill-size 16384 \
    --max-prefill-tokens 4096 \
    --mem-fraction-static 0.80 \
    --max-running-requests 32 \
    --disable-custom-all-reduce \
    --served-model-name glm-5-fp8
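Before benchmarking, the server can be smoke-tested with a minimal native `/generate` payload. A sketch of the request body is below; the field names follow SGLang's native generate API, and the prompt text and sampling values are placeholders:

```python
import json

# Minimal request body for SGLang's native /generate endpoint
# (the same endpoint the benchmark below probes).
# Prompt and sampling values are illustrative only.
payload = {
    "text": "Hello, my name is",
    "sampling_params": {"max_new_tokens": 32, "temperature": 0.0},
}
print(json.dumps(payload))
```

POSTing this body as JSON to http://127.0.0.1:8000/generate (e.g. with curl) is a quick liveness check before launching a long run.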
And I got this benchmark result on B300:
#Input tokens: 1048576
#Output tokens: 16384
Starting initial single prompt test run...
Backend: sglang
API URL: http://127.0.0.1:8000/generate
Base URL: http://127.0.0.1:8000
Model ID: glm-5-fp8
Host: 0.0.0.0, Port: 30000
Testing base URL connectivity...
Base URL http://127.0.0.1:8000 responded with status: 200
Testing API endpoint http://127.0.0.1:8000/generate...
API URL http://127.0.0.1:8000/generate responded with status: 400
Test prompt length: 65536
Test output length: 32
Initial test run completed. Starting main benchmark run...
100%|██████████| 16/16 [01:55<00:00, 7.20s/it]
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 2.0
Max request concurrency: 16
Successful requests: 16
Benchmark duration (s): 115.13
Total input tokens: 1048576
Total generated tokens: 16384
Total generated tokens (retokenized): 16384
Request throughput (req/s): 0.14
Input token throughput (tok/s): 9107.59
Output token throughput (tok/s): 142.31
Total token throughput (tok/s): 9249.90
Concurrency: 15.74
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 113289.81
Median E2E Latency (ms): 113700.03
---------------Time to First Token----------------
Mean TTFT (ms): 41681.64
Median TTFT (ms): 42160.25
P99 TTFT (ms): 75511.16
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 70.00
Median TPOT (ms): 69.93
P99 TPOT (ms): 105.97
---------------Inter-token Latency----------------
Mean ITL (ms): 70.00
Median ITL (ms): 33.94
P99 ITL (ms): 34.55
==================================================
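As a sanity check, the headline throughputs are just the raw token counts divided by the wall-clock duration, so they can be recomputed from the table:

```python
# Recompute the headline throughputs from the raw counts in the table:
# each is simply tokens divided by the wall-clock benchmark duration.
total_input_tokens = 1_048_576
total_output_tokens = 16_384
duration_s = 115.13

input_tps = total_input_tokens / duration_s    # matches ~9107 tok/s reported
output_tps = total_output_tokens / duration_s  # matches ~142 tok/s reported
total_tps = (total_input_tokens + total_output_tokens) / duration_s
print(f"{input_tps:.2f} {output_tps:.2f} {total_tps:.2f}")
```

The tiny discrepancies versus the report come from the duration being rounded to two decimals.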
It turns out the throughput is roughly the same as GLM-FP8 hosted with vLLM (16 concurrency, 64k/1k in/out).
Is this a normal result?
With speculative decoding we can achieve:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 2.0
Max request concurrency: 16
Successful requests: 16
Benchmark duration (s): 96.43
Total input tokens: 1048576
Total generated tokens: 16384
Total generated tokens (retokenized): 16384
Request throughput (req/s): 0.17
Input token throughput (tok/s): 10873.56
Output token throughput (tok/s): 169.90
Total token throughput (tok/s): 11043.46
Concurrency: 15.67
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 94425.83
Median E2E Latency (ms): 94824.34
---------------Time to First Token----------------
Mean TTFT (ms): 42753.15
Median TTFT (ms): 43100.63
P99 TTFT (ms): 78462.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 50.51
Median TPOT (ms): 50.56
P99 TPOT (ms): 87.76
---------------Inter-token Latency----------------
Mean ITL (ms): 201.26
Median ITL (ms): 49.49
P99 ITL (ms): 55.94
==================================================
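Comparing the two runs: speculative decoding cuts mean TPOT from 70.00 ms to 50.51 ms, but the end-to-end gain is smaller because TTFT is prefill-bound and essentially unchanged:

```python
# Decode-phase vs end-to-end gain from speculative decoding,
# using the mean TPOT and benchmark durations reported in the two runs.
tpot_base_ms = 70.00      # mean TPOT, no speculative decoding
tpot_spec_ms = 50.51      # mean TPOT, with speculative decoding
duration_base_s = 115.13
duration_spec_s = 96.43

decode_speedup = tpot_base_ms / tpot_spec_ms      # ~1.39x per output token
e2e_speedup = duration_base_s / duration_spec_s   # ~1.19x for the whole run
print(f"decode {decode_speedup:.2f}x, end-to-end {e2e_speedup:.2f}x")
```

The mean ITL jumping to ~201 ms while the median stays ~49 ms is also expected here: verified draft tokens are emitted in bursts, so a few long verification gaps pull the mean up.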
But it seems unstable: it never survived 8 hours before crashing with this error:
[WARNING] batch_size=11711 exceeds shared mem limit, falling back to low-smem kernel
(message repeated 16 times)
[2026-03-24 07:17:20 TP0] Prefill batch, #new-seq: 1, #new-token: 11712, #cached-token: 0, token usage: 0.04, #running-req: 1, #queue-req: 0, input throughput (token/s): 15732.26, cuda graph: False
[2026-03-24 07:17:21 TP0] Decode batch, #running-req: 1, #token: 47104, token usage: 0.02, accept len: 3.44, accept rate: 0.86, cuda graph: True, gen throughput (token/s): 53.02, #queue-req: 0
[2026-03-24 07:17:22 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3171, in run_scheduler_process
scheduler.event_loop_normal()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1127, in event_loop_normal
self.self_check_during_idle()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_runtime_checker_mixin.py", line 332, in self_check_during_idle
self.check_memory()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_runtime_checker_mixin.py", line 244, in check_memory
raise_error_or_warn(
File "/sgl-workspace/sglang/python/sglang/srt/utils/common.py", line 4082, in raise_error_or_warn
raise ValueError(message)
ValueError: token_to_kv_pool_allocator memory leak detected! self.max_total_num_tokens=2954112, available_size=18048, evictable_size=2935808, protected_size=0
[2026-03-24 07:17:22 TP1-TP7] (identical "token_to_kv_pool_allocator memory leak detected" tracebacks from the remaining seven scheduler ranks)
[2026-03-24 07:17:22] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
(SIGQUIT message repeated on each remaining rank)
[2026-03-24 07:17:24] ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
File "uvloop/loop.pyx", line 476, in uvloop.loop.Loop._on_idle
File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
File "uvloop/cbhandles.pyx", line 61, in uvloop.loop.Handle._run
File "/sgl-workspace/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 2511, in running_phase_sigquit_handler
kill_process_tree(os.getpid())
File "/sgl-workspace/sglang/python/sglang/srt/utils/common.py", line 1153, in kill_process_tree
sys.exit(0)
SystemExit: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/h11_impl.py", line 410, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1134, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 107, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 87, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.12/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 716, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 736, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 290, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 119, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 106, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.12/dist-packages/starlette/responses.py", line 280, in __call__
await self.background()
File "/usr/local/lib/python3.12/dist-packages/starlette/background.py", line 36, in __call__
await task()
File "/usr/local/lib/python3.12/dist-packages/starlette/background.py", line 21, in __call__
await self.func(*self.args, **self.kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 1431, in abort_request
await asyncio.sleep(2)
File "/usr/lib/python3.12/asyncio/tasks.py", line 665, in sleep
return await future
^^^^^^^^^^^^
asyncio.exceptions.CancelledError
[2026-03-24 07:17:24] ERROR: Traceback (most recent call last):
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
File "uvloop/loop.pyx", line 476, in uvloop.loop.Loop._on_idle
File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
File "uvloop/cbhandles.pyx", line 61, in uvloop.loop.Handle._run
File "/sgl-workspace/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 2511, in running_phase_sigquit_handler
kill_process_tree(os.getpid())
File "/sgl-workspace/sglang/python/sglang/srt/utils/common.py", line 1153, in kill_process_tree
sys.exit(0)
SystemExit: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 701, in lifespan
await receive()
File "/usr/local/lib/python3.12/dist-packages/uvicorn/lifespan/on.py", line 137, in receive
return await self.receive_queue.get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/queues.py", line 158, in get
await getter
asyncio.exceptions.CancelledError
[W324 07:17:41.738975876 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
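For what it's worth, the invariant behind the "memory leak detected" error is that an idle scheduler's KV pool should fully account for its capacity (available + evictable + protected = max total). Plugging in the values from the error message shows exactly 256 KV-cache tokens unaccounted for, which points at a leaked allocation rather than rounding:

```python
# The scheduler's idle-time memory check: when no requests are running,
# available + evictable + protected KV-cache tokens should add back up
# to the pool capacity. Values taken from the error message above.
max_total_num_tokens = 2_954_112
available_size = 18_048
evictable_size = 2_935_808
protected_size = 0

accounted = available_size + evictable_size + protected_size
leaked = max_total_num_tokens - accounted
print(leaked)  # 256 tokens unaccounted for
```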