Does the A100 work?

#1
by xz123321 - opened

Does the A100 work?

Intel org

I think so. I only tested the 35B model; both vLLM and Transformers work.

It's not working on a 4 × NVIDIA RTX 5000 Ada Generation (32 GB) machine.

@wenhuach can you share your config?

RTX 3090 is not working; it fails with a TokenizersBackend error.

Intel org
edited Mar 11

@todiadiyatmo Since my environment changes from time to time, I double-checked in my current environment on an A100, and it works well.

Package                                  Version
---------------------------------------- --------------------------------
accelerate                               1.13.0
aiohappyeyeballs                         2.6.1
aiohttp                                  3.13.3
aiosignal                                1.4.0
annotated-doc                            0.0.4
annotated-types                          0.7.0
anthropic                                0.84.0
anyio                                    4.12.1
apache-tvm-ffi                           0.1.9
astor                                    0.8.1
attrs                                    25.4.0
auto-round                               0.10.2
blake3                                   1.0.8
cachetools                               7.0.5
cbor2                                    5.8.0
certifi                                  2026.2.25
cffi                                     2.0.0
charset-normalizer                       3.4.5
click                                    8.3.1
cloudpickle                              3.1.2
compressed-tensors                       0.13.0
cryptography                             46.0.5
cuda-bindings                            12.9.4
cuda-pathfinder                          1.4.1
cuda-python                              12.9.4
datasets                                 4.6.1
depyf                                    0.20.0
dill                                     0.4.0
diskcache                                5.6.3
distro                                   1.9.0
dnspython                                2.8.0
docstring_parser                         0.17.0
einops                                   0.8.2
email-validator                          2.3.0
fastapi                                  0.135.1
fastapi-cli                              0.0.24
fastapi-cloud-cli                        0.14.1
fastar                                   0.8.0
filelock                                 3.25.0
flashinfer-python                        0.6.4
frozenlist                               1.8.0
fsspec                                   2026.2.0
gguf                                     0.18.0
gitdb                                    4.0.12
GitPython                                3.1.46
googleapis-common-protos                 1.73.0
grpcio                                   1.78.0
h11                                      0.16.0
hf-xet                                   1.3.2
httpcore                                 1.0.9
httptools                                0.7.1
httpx                                    0.28.1
httpx-sse                                0.4.3
huggingface_hub                          1.6.0
idna                                     3.11
ijson                                    3.5.0
importlib_metadata                       8.7.1
interegular                              0.3.3
Jinja2                                   3.1.6
jiter                                    0.13.0
jmespath                                 1.1.0
jsonschema                               4.26.0
jsonschema-specifications                2025.9.1
lark                                     1.2.2
llguidance                               1.3.0
llvmlite                                 0.44.0
lm-format-enforcer                       0.11.3
loguru                                   0.7.3
markdown-it-py                           4.0.0
MarkupSafe                               3.0.3
mcp                                      1.26.0
mdurl                                    0.1.2
mistral_common                           1.9.1
model-hosting-container-standards        0.1.13
mpmath                                   1.3.0
msgspec                                  0.20.0
multidict                                6.7.1
multiprocess                             0.70.18
networkx                                 3.6.1
ninja                                    1.13.0
numba                                    0.61.2
numpy                                    2.2.6
nvidia-cublas-cu12                       12.8.4.1
nvidia-cuda-cupti-cu12                   12.8.90
nvidia-cuda-nvrtc-cu12                   12.8.93
nvidia-cuda-runtime-cu12                 12.8.90
nvidia-cudnn-cu12                        9.10.2.21
nvidia-cudnn-frontend                    1.18.0
nvidia-cufft-cu12                        11.3.3.83
nvidia-cufile-cu12                       1.13.1.3
nvidia-curand-cu12                       10.3.9.90
nvidia-cusolver-cu12                     11.7.3.90
nvidia-cusparse-cu12                     12.5.8.93
nvidia-cusparselt-cu12                   0.7.1
nvidia-cutlass-dsl                       4.4.1
nvidia-cutlass-dsl-libs-base             4.4.1
nvidia-ml-py                             13.590.48
nvidia-nccl-cu12                         2.27.5
nvidia-nvjitlink-cu12                    12.8.93
nvidia-nvshmem-cu12                      3.4.5
nvidia-nvtx-cu12                         12.8.90
openai                                   2.24.0
openai-harmony                           0.0.8
opencv-python-headless                   4.13.0.92
opentelemetry-api                        1.40.0
opentelemetry-exporter-otlp              1.40.0
opentelemetry-exporter-otlp-proto-common 1.40.0
opentelemetry-exporter-otlp-proto-grpc   1.40.0
opentelemetry-exporter-otlp-proto-http   1.40.0
opentelemetry-proto                      1.40.0
opentelemetry-sdk                        1.40.0
opentelemetry-semantic-conventions       0.61b0
opentelemetry-semantic-conventions-ai    0.4.15
outlines_core                            0.2.11
packaging                                26.0
pandas                                   3.0.1
partial-json-parser                      0.2.1.1.post7
pillow                                   12.1.1
pip                                      26.0.1
platformdirs                             4.9.4
prometheus_client                        0.24.1
prometheus-fastapi-instrumentator        7.1.0
propcache                                0.4.1
protobuf                                 6.33.5
psutil                                   7.2.2
py-cpuinfo                               9.0.0
pyarrow                                  23.0.1
pybase64                                 1.4.3
pycountry                                26.2.16
pycparser                                3.0
pydantic                                 2.12.5
pydantic_core                            2.41.5
pydantic-extra-types                     2.11.0
pydantic-settings                        2.13.1
Pygments                                 2.19.2
PyJWT                                    2.11.0
python-dateutil                          2.9.0.post0
python-dotenv                            1.2.2
python-json-logger                       4.0.0
python-multipart                         0.0.22
PyYAML                                   6.0.3
pyzmq                                    27.1.0
quack-kernels                            0.3.2
referencing                              0.37.0
regex                                    2026.2.28
requests                                 2.32.5
rich                                     14.3.3
rich-toolkit                             0.19.7
rignore                                  0.7.6
rpds-py                                  0.30.0
rustbpe                                  0.1.0
safetensors                              0.7.0
sentencepiece                            0.2.1
sentry-sdk                               2.54.0
setproctitle                             1.3.7
setuptools                               80.10.2
shellingham                              1.5.4
six                                      1.17.0
smmap                                    5.0.3
sniffio                                  1.3.1
sse-starlette                            3.3.2
starlette                                0.52.1
supervisor                               4.3.0
sympy                                    1.14.0
tabulate                                 0.10.0
threadpoolctl                            3.6.0
tiktoken                                 0.12.0
tokenizers                               0.22.2
torch                                    2.10.0
torch_c_dlpack_ext                       0.1.5
torchaudio                               2.10.0
torchvision                              0.25.0
tqdm                                     4.67.3
transformers                             5.3.0
triton                                   3.6.0
typer                                    0.24.1
typing_extensions                        4.15.0
typing-inspection                        0.4.2
urllib3                                  2.6.3
uvicorn                                  0.41.0
uvloop                                   0.22.1
vllm                                     0.17.1rc1.dev28+g098d84473.cu128
wandb                                    0.25.0
watchfiles                               1.1.1
websockets                               16.0
xgrammar                                 0.1.29
xxhash                                   3.6.0
yarl                                     1.23.0
zipp                                     3.23.0

Same issue here, using the vllm/vllm-openai:latest Docker image with Intel/Qwen3.5-27B-int4-AutoRound.

It's working fine for me; here are my compose commands:

    command: >
      --model Intel/Qwen3.5-27B-int4-AutoRound
      --served_model_name qwen35-27b
      --tokenizer Qwen/Qwen3.5-27B
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.90
      --max-model-len 100000
      --max-num-batched-tokens 4096
      --kv-cache-dtype fp8
      --reasoning-parser qwen3
      --enable-prefix-caching
      --mm-encoder-tp-mode data
      --mm-processor-cache-type shm
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --max-cudagraph-capture-size 32
      --dtype float16
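For context, the flag block above sits under a docker-compose service's `command:` key. A minimal surrounding service definition is sketched below; the service name, port mapping, and GPU reservation are assumptions, and the vllm/vllm-openai:latest image is the one mentioned earlier in the thread:

```yaml
services:
  vllm:                                # hypothetical service name
    image: vllm/vllm-openai:latest     # image mentioned earlier in the thread
    ports:
      - "8000:8000"
    ipc: host                          # vLLM uses shared memory for tensor parallelism
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2                 # matches --tensor-parallel-size 2
              capabilities: [gpu]
    # command: the full flag block from the post above goes here, e.g.
    command: >
      --model Intel/Qwen3.5-27B-int4-AutoRound
      --tensor-parallel-size 2
```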
# MarlinLinearKernel cannot implement due to: Weight output_size_per_partition = 32 is not divisible by min_thread_n = 64. Consider reducing tensor_parallel_size or running with --quantization gptq.

It seems to work with tp=2 but not tp=4 (so the 3090s cannot run it).

Which model are you running?

The 122b one

You can with -tp 2 -pp 2, but with a performance penalty.
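The Marlin failure above is simple arithmetic: tensor parallelism splits each layer's output dimension across GPUs, and Marlin requires every shard to be a multiple of min_thread_n = 64. Pipeline parallelism splits whole layers instead of slicing them, so -tp 2 -pp 2 still spreads the model over 4 GPUs while keeping the shard width at 64. A minimal sketch of the check, illustrative only and not vLLM's actual code; the 128-wide projection is a hypothetical size implied by the error's 32 = 128 / 4:

```python
# Illustrative sketch of Marlin's shard-width constraint (not vLLM's actual code).
MIN_THREAD_N = 64  # minimum tile width along the output dimension

def marlin_shard_ok(output_size: int, tensor_parallel: int) -> bool:
    """Each GPU gets output_size / tp columns of the weight; Marlin needs
    that per-partition width to be a multiple of MIN_THREAD_N."""
    per_partition = output_size // tensor_parallel
    return per_partition % MIN_THREAD_N == 0

# Hypothetical 128-wide projection, as implied by the error message:
print(marlin_shard_ok(128, 2))  # True  -> 64 columns per GPU, tp=2 works
print(marlin_shard_ok(128, 4))  # False -> 32 columns per GPU, tp=4 fails
```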

Did you try --dtype float16? I encountered this issue, and that fixed it. If not, you can try the suggested --quantization gptq.

@nephepritou
Wow, I got it working with tp 2 and pp 2! Thank you very much!

    command: [
      "/models/Intel/Qwen3.5-122B-A10B-int4-AutoRound",
      "--max-model-len","200000",
      "--max-num-seqs", "4",
      "--dtype", "float16",
      "--enable-auto-tool-choice",
      "--reasoning-parser","qwen3",
      "--tool-call-parser","qwen3_coder",
      "--gpu-memory-utilization", "0.92",
      "--host", "0.0.0.0",
      "--port", "8000",
      "--enable-chunked-prefill",
      "--disable-custom-all-reduce",
      "--enable-prefix-caching",
      "--tensor-parallel-size", "2",
      "--pipeline-parallel-size", "2",
      "--max-num-batched-tokens", "8096",
      "--trust-remote-code",
      "--attention-backend","FLASHINFER",
      '--override-generation-config={"top_p":0.95,"temperature":0.6,"top_k":20,"presence_penalty":0.0,"repetition_penalty":1.0}',
    ]

It also runs quite fast:

guidellm benchmark run \
    --target http://127.0.0.1:8000 \
    --rate 8 \
    --outputs json,csv \
    --profile concurrent \
    --max-seconds 30 \
    --data "prompt_tokens=10000,output_tokens=1000" \
    --processor Intel/Qwen3.5-122B-A10B-int4-AutoRound 

ℹ Server Throughput Statistics (All Requests)
|============|=====|======|=======|======|========|=========|========|=======|========|=========|
| Benchmark  | Requests               |||| Input Tokens    || Output Tokens || Total Tokens    ||
| Strategy   | Per Sec   || Concurrency || Per Sec         || Per Sec       || Per Sec         ||
|            | Mdn | Mean | Mdn   | Mean | Mdn    | Mean    | Mdn    | Mean  | Mdn    | Mean    |
|------------|-----|------|-------|------|--------|---------|--------|-------|--------|---------|
| concurrent | 0.1 | 0.0  | 8.0   | 8.0  | 4020.6 | 18055.6 | 401.2  | 609.2 | 4422.3 | 18664.8 |
|============|=====|======|=======|======|========|=========|========|=======|========|=========|
