The program gets stuck on Windows and never stops loading vllm

#1
by Milor123 - opened


User  Qwen3.5-9B-PARO  ♥ 20:38  docker run --pull=always --rm -it --gpus all --ipc=host ghcr.io/z-lab/paroquant:chat --model z-lab/Qwen3.5-0.8B-PARO
chat: Pulling from z-lab/paroquant
Digest: sha256:228d0677c50c936787e9e56b8256cc58dc3d173d7c164a3afa0b9e603442d329
Status: Image is up to date for ghcr.io/z-lab/paroquant:chat
Loading model (vllm)...

Then, if I press Ctrl+C, this comes out:

^CTraceback (most recent call last):
  File "/app/paroquant/inference/base.py", line 150, in create_generator
    cls = getattr(importlib.import_module(module_path), class_name)
  File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/app/paroquant/inference/backends/vllm/__init__.py", line 1, in <module>
    from .generator import VllmGenerator
  File "/app/paroquant/inference/backends/vllm/generator.py", line 11, in <module>
    import paroquant.inference.backends.vllm.plugin  # noqa: F401 — registers quantization config
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/paroquant/inference/backends/vllm/plugin.py", line 20, in <module>
    import paroquant.kernels.cuda  # noqa: F401 — registers torch.ops.rotation.rotate
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/paroquant/kernels/cuda/__init__.py", line 7, in <module>
    _C = load(
         ^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1767, in load
    return _jit_compile(
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2233, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2731, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
ImportError: /root/.cache/torch_extensions/py312_cu130/paroquant_rotation/paroquant_rotation.so: cannot open shared object file: No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/paroquant/cli/chat.py", line 242, in <module>
    main()
  File "/app/paroquant/cli/chat.py", line 226, in main
    asyncio.run(
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/usr/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/app/paroquant/cli/chat.py", line 149, in run_chat_app
    generator = create_generator(backend, model)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/paroquant/inference/base.py", line 152, in create_generator
    raise ImportError(f'Backend {backend!r} requires: pip install "{install_hint}"') from e
ImportError: Backend 'vllm' requires: pip install "paroquant[vllm]"
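Note that the final "pip install" hint looks misleading here: judging from the traceback, `create_generator` in `base.py` wraps any import failure (including this missing JIT-built `.so`) in a generic install message, and the real cause only survives as the chained exception. A minimal sketch of that chaining pattern, with made-up module and package names:

```python
# Sketch of the exception-chaining pattern visible in base.py (simplified,
# hypothetical names): an ImportError raised while importing the backend
# module is re-raised with a generic "pip install" hint, and the original
# error is preserved as __cause__ via "raise ... from e".
import importlib


def create_generator(backend: str):
    try:
        # Stands in for importing the real backend module.
        importlib.import_module(f"hypothetical_backend_{backend}")
    except ImportError as e:
        raise ImportError(
            f'Backend {backend!r} requires: pip install "demo[{backend}]"'
        ) from e


try:
    create_generator("vllm")
except ImportError as err:
    print(err)                            # the generic install hint
    print(type(err.__cause__).__name__)   # the real underlying error
```

So when debugging, the chained "direct cause" above (the missing `paroquant_rotation.so`) is the part worth looking at, not the install hint.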

I have CUDA 13 on Windows, with an RTX 4070 (12 GB VRAM).

The same happens with the server:

docker run --pull=always --rm -it --gpus all --ipc=host -p 8888:8000 ghcr.io/z-lab/paroquant:serve --model z-lab/Qwen3.5-0.8B-PARO
serve: Pulling from z-lab/paroquant
Digest: sha256:3d993f27e16dc38aecdea253db3dc608dcfd9f7d6521299e3cd07d169658c69f
Status: Image is up to date for ghcr.io/z-lab/paroquant:serve
^CTraceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/paroquant/cli/serve.py", line 103, in <module>
    main()
  File "/app/paroquant/cli/serve.py", line 95, in main
    _serve_vllm()
  File "/app/paroquant/cli/serve.py", line 16, in _serve_vllm
    import paroquant.inference.backends.vllm.plugin  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/paroquant/inference/backends/vllm/__init__.py", line 1, in <module>
    from .generator import VllmGenerator
  File "/app/paroquant/inference/backends/vllm/generator.py", line 11, in <module>
    import paroquant.inference.backends.vllm.plugin  # noqa: F401 — registers quantization config
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/paroquant/inference/backends/vllm/plugin.py", line 20, in <module>
    import paroquant.kernels.cuda  # noqa: F401 — registers torch.ops.rotation.rotate
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/paroquant/kernels/cuda/__init__.py", line 7, in <module>
    _C = load(
         ^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1767, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2207, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2362, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2693, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.12/subprocess.py", line 550, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1196, in communicate
    stdout = self.stdout.read()
             ^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
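One detail from this second traceback: the Ctrl+C landed inside the ninja build step (`_write_ninja_file_and_build_library` → `subprocess.run`), so the apparent hang may just be the CUDA kernels being JIT-compiled on first start, and interrupting it midway could leave the half-built cache that the first traceback then fails to load. A hedged workaround sketch, not a confirmed fix (the volume name is made up; the cache path is taken from the traceback):

```shell
# Assumption: the JIT kernel cache lives at /root/.cache/torch_extensions
# inside the container (per the traceback paths). Persisting it in a named
# volume lets one successful build be reused across runs; removing the
# volume first clears any half-built leftovers from an interrupted run.
docker volume rm paroquant-torchext 2>/dev/null || true
docker run --pull=always --rm -it --gpus all --ipc=host \
  -v paroquant-torchext:/root/.cache/torch_extensions \
  -p 8888:8000 ghcr.io/z-lab/paroquant:serve --model z-lab/Qwen3.5-0.8B-PARO
```

Letting the first run finish the compile without interrupting it (it can take several minutes) may be all that is needed.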
liang2kl changed discussion status to closed
