Does the A100 work?
I think so. I only tested the 35B; both vLLM and transformers work.
It's not working on a 4 x NVIDIA RTX 5000 Ada Generation (32 GB) machine.
@todiadiyatmo Since my environment changes from time to time, I double-checked in my current environment on an A100, and it works well.
Package Version
---------------------------------------- --------------------------------
accelerate 1.13.0
aiohappyeyeballs 2.6.1
aiohttp 3.13.3
aiosignal 1.4.0
annotated-doc 0.0.4
annotated-types 0.7.0
anthropic 0.84.0
anyio 4.12.1
apache-tvm-ffi 0.1.9
astor 0.8.1
attrs 25.4.0
auto-round 0.10.2
blake3 1.0.8
cachetools 7.0.5
cbor2 5.8.0
certifi 2026.2.25
cffi 2.0.0
charset-normalizer 3.4.5
click 8.3.1
cloudpickle 3.1.2
compressed-tensors 0.13.0
cryptography 46.0.5
cuda-bindings 12.9.4
cuda-pathfinder 1.4.1
cuda-python 12.9.4
datasets 4.6.1
depyf 0.20.0
dill 0.4.0
diskcache 5.6.3
distro 1.9.0
dnspython 2.8.0
docstring_parser 0.17.0
einops 0.8.2
email-validator 2.3.0
fastapi 0.135.1
fastapi-cli 0.0.24
fastapi-cloud-cli 0.14.1
fastar 0.8.0
filelock 3.25.0
flashinfer-python 0.6.4
frozenlist 1.8.0
fsspec 2026.2.0
gguf 0.18.0
gitdb 4.0.12
GitPython 3.1.46
googleapis-common-protos 1.73.0
grpcio 1.78.0
h11 0.16.0
hf-xet 1.3.2
httpcore 1.0.9
httptools 0.7.1
httpx 0.28.1
httpx-sse 0.4.3
huggingface_hub 1.6.0
idna 3.11
ijson 3.5.0
importlib_metadata 8.7.1
interegular 0.3.3
Jinja2 3.1.6
jiter 0.13.0
jmespath 1.1.0
jsonschema 4.26.0
jsonschema-specifications 2025.9.1
lark 1.2.2
llguidance 1.3.0
llvmlite 0.44.0
lm-format-enforcer 0.11.3
loguru 0.7.3
markdown-it-py 4.0.0
MarkupSafe 3.0.3
mcp 1.26.0
mdurl 0.1.2
mistral_common 1.9.1
model-hosting-container-standards 0.1.13
mpmath 1.3.0
msgspec 0.20.0
multidict 6.7.1
multiprocess 0.70.18
networkx 3.6.1
ninja 1.13.0
numba 0.61.2
numpy 2.2.6
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cudnn-frontend 1.18.0
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-cutlass-dsl 4.4.1
nvidia-cutlass-dsl-libs-base 4.4.1
nvidia-ml-py 13.590.48
nvidia-nccl-cu12 2.27.5
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvshmem-cu12 3.4.5
nvidia-nvtx-cu12 12.8.90
openai 2.24.0
openai-harmony 0.0.8
opencv-python-headless 4.13.0.92
opentelemetry-api 1.40.0
opentelemetry-exporter-otlp 1.40.0
opentelemetry-exporter-otlp-proto-common 1.40.0
opentelemetry-exporter-otlp-proto-grpc 1.40.0
opentelemetry-exporter-otlp-proto-http 1.40.0
opentelemetry-proto 1.40.0
opentelemetry-sdk 1.40.0
opentelemetry-semantic-conventions 0.61b0
opentelemetry-semantic-conventions-ai 0.4.15
outlines_core 0.2.11
packaging 26.0
pandas 3.0.1
partial-json-parser 0.2.1.1.post7
pillow 12.1.1
pip 26.0.1
platformdirs 4.9.4
prometheus_client 0.24.1
prometheus-fastapi-instrumentator 7.1.0
propcache 0.4.1
protobuf 6.33.5
psutil 7.2.2
py-cpuinfo 9.0.0
pyarrow 23.0.1
pybase64 1.4.3
pycountry 26.2.16
pycparser 3.0
pydantic 2.12.5
pydantic_core 2.41.5
pydantic-extra-types 2.11.0
pydantic-settings 2.13.1
Pygments 2.19.2
PyJWT 2.11.0
python-dateutil 2.9.0.post0
python-dotenv 1.2.2
python-json-logger 4.0.0
python-multipart 0.0.22
PyYAML 6.0.3
pyzmq 27.1.0
quack-kernels 0.3.2
referencing 0.37.0
regex 2026.2.28
requests 2.32.5
rich 14.3.3
rich-toolkit 0.19.7
rignore 0.7.6
rpds-py 0.30.0
rustbpe 0.1.0
safetensors 0.7.0
sentencepiece 0.2.1
sentry-sdk 2.54.0
setproctitle 1.3.7
setuptools 80.10.2
shellingham 1.5.4
six 1.17.0
smmap 5.0.3
sniffio 1.3.1
sse-starlette 3.3.2
starlette 0.52.1
supervisor 4.3.0
sympy 1.14.0
tabulate 0.10.0
threadpoolctl 3.6.0
tiktoken 0.12.0
tokenizers 0.22.2
torch 2.10.0
torch_c_dlpack_ext 0.1.5
torchaudio 2.10.0
torchvision 0.25.0
tqdm 4.67.3
transformers 5.3.0
triton 3.6.0
typer 0.24.1
typing_extensions 4.15.0
typing-inspection 0.4.2
urllib3 2.6.3
uvicorn 0.41.0
uvloop 0.22.1
vllm 0.17.1rc1.dev28+g098d84473.cu128
wandb 0.25.0
watchfiles 1.1.1
websockets 16.0
xgrammar 0.1.29
xxhash 3.6.0
yarl 1.23.0
zipp 3.23.0
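For anyone who wants to reproduce the check, a quick sanity test with the vLLM Python API looks roughly like this (a sketch, not the exact script used; the model id is the 27B AutoRound repo mentioned later in this thread, and tensor_parallel_size should match your GPU count):

# Minimal sketch: load an AutoRound int4 checkpoint with vLLM and generate.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Intel/Qwen3.5-27B-int4-AutoRound",
    dtype="float16",         # float16 avoids some kernel issues reported below
    tensor_parallel_size=1,  # adjust to your GPU count
)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=64)
out = llm.generate(["Hello, who are you?"], params)
print(out[0].outputs[0].text)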
@wenhuach can you share your config
RTX 3090 not working with TokenizersBackenderror.
same issue here, using the vllm/vllm-openai:latest docker image with Intel/Qwen3.5-27B-int4-AutoRound
It's working fine for me; here are my compose commands:
command: >
--model Intel/Qwen3.5-27B-int4-AutoRound
--served_model_name qwen35-27b
--tokenizer Qwen/Qwen3.5-27B
--tensor-parallel-size 2
--gpu-memory-utilization 0.90
--max-model-len 100000
--max-num-batched-tokens 4096
--kv-cache-dtype fp8
--reasoning-parser qwen3
--enable-prefix-caching
--mm-encoder-tp-mode data
--mm-processor-cache-type shm
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--max-cudagraph-capture-size 32
--dtype float16
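Note the --tokenizer Qwen/Qwen3.5-27B override above: for those hitting the TokenizersBackenderror, pointing vLLM at the base model's tokenizer instead of the quantized repo's may be what makes the difference. A quick way to check whether the base tokenizer loads on its own (a sketch, assuming the transformers package from the environment listed earlier):

# Sketch: confirm the base tokenizer loads independently of the quantized repo.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")
print(tok("hello world").input_ids)  # should succeed without a backend error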
# MarlinLinearKernel cannot implement due to: Weight output_size_per_partition = 32 is not divisible by min_thread_n = 64. Consider reducing tensor_parallel_size or running with --quantization gptq.
Seems to work with tp=2 but not tp=4 (so 3090 cannot run it).
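The arithmetic behind this: output_size_per_partition is the projection's output size divided by the tensor-parallel degree, so a shard that is 32 at tp=4 would be 64 at tp=2, which satisfies the Marlin kernel's min_thread_n = 64 requirement. A sketch of the check (the full size of 128 is inferred from the error message, not confirmed against the model):

# Sketch: why the Marlin check passes at tp=2 but fails at tp=4.
MIN_THREAD_N = 64            # requirement quoted in the error message
full_output_size = 32 * 4    # inferred: 32 per partition at tp=4

for tp in (1, 2, 4):
    per_partition = full_output_size // tp
    ok = per_partition % MIN_THREAD_N == 0
    print(f"tp={tp}: output_size_per_partition={per_partition}, marlin_ok={ok}")
# tp=4 fails the check, hence the suggestion to reduce tp or use --quantization gptq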
Which model are you running?
The 122b one
You can with -tp 2 -pp 2, but with a performance penalty.
Did you try --dtype float16? I encountered this issue, and that fixed it. If not, you can try the suggested --quantization gptq.
@nephepritou Wow, I got it working with tp=2 and pp=2! Thank you very much!
command: [
"/models/Intel/Qwen3.5-122B-A10B-int4-AutoRound",
"--max-model-len","200000",
"--max-num-seqs", "4",
"--dtype", "float16",
"--enable-auto-tool-choice",
"--reasoning-parser","qwen3",
"--tool-call-parser","qwen3_coder",
"--gpu-memory-utilization", "0.92",
"--host", "0.0.0.0",
"--port", "8000",
"--enable-chunked-prefill",
"--disable-custom-all-reduce",
"--enable-prefix-caching",
"--tensor-parallel-size", "2",
"--pipeline-parallel-size", "2",
"--max-num-batched-tokens", "8096",
"--trust-remote-code",
"--attention-backend","FLASHINFER",
'--override-generation-config={"top_p":0.95,"temperature":0.6,"top_k":20,"presence_penalty":0.0,"repetition_penalty":1.0}',
]
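Once the server is up, a smoke test against the OpenAI-compatible endpoint looks like this (a sketch; since no --served-model-name is set in the config above, vLLM serves the model under its local path):

# Sketch: query the vLLM OpenAI-compatible server started with the config above.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="/models/Intel/Qwen3.5-122B-A10B-int4-AutoRound",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)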
It runs quite fast too:
guidellm benchmark run \
--target http://127.0.0.1:8000 \
--rate 8 \
--outputs json,csv \
--profile concurrent \
--max-seconds 30 \
--data "prompt_tokens=10000,output_tokens=1000" \
--processor Intel/Qwen3.5-122B-A10B-int4-AutoRound
ℹ Server Throughput Statistics (All Requests), strategy: concurrent

| Metric                | Mdn    | Mean    |
|-----------------------|--------|---------|
| Requests per sec      | 0.1    | 0.0     |
| Concurrency           | 8.0    | 8.0     |
| Input tokens per sec  | 4020.6 | 18055.6 |
| Output tokens per sec | 401.2  | 609.2   |
| Total tokens per sec  | 4422.3 | 18664.8 |