Flash Attention 2 not supported
A torch error occurs when the model is loaded with `attn_implementation="flash_attention_2"`:
```
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:109: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
Traceback (most recent call last):
  File "/mnt/nvme2/cc/SakuraLLM/translate_json.py", line 233, in <module>
    result = translator.translate_json_dict(data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/nvme2/cc/SakuraLLM/translate_json.py", line 191, in translate_json_dict
    result[key] = self.translate_and_align_one(key)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/nvme2/cc/SakuraLLM/translate_json.py", line 177, in translate_and_align_one
    zh_fixed = self.align_special_tokens(jp_text, zh_draft)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cc/miniconda3/envs/sakura/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/nvme2/cc/SakuraLLM/translate_json.py", line 160, in align_special_tokens
    output_ids = self.align_model.generate(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cc/miniconda3/envs/sakura/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/cc/miniconda3/envs/sakura/lib/python3.12/site-packages/transformers/generation/utils.py", line 2535, in generate
    result = decoding_method(
             ^^^^^^^^^^^^^^^^
  File "/home/cc/miniconda3/envs/sakura/lib/python3.12/site-packages/transformers/generation/utils.py", line 2783, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
This issue likely stems from compatibility gaps between the new Qwen 3.5 hybrid architecture and current versions of transformers or flash-attention-2, which can cause numerical instability (NaN/Inf).
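If you want to confirm that the attention backend is what's producing the bad values, a quick sanity check on the logits before sampling can tell you. This is a hypothetical snippet, not part of translate_json.py: `model` and `inputs` stand for the already-loaded alignment model and a tokenized prompt.

```python
import torch

# Hypothetical spot-check: run one forward pass and verify the next-token
# logits are finite before any sampling happens.
with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]  # next-token logits

if torch.isfinite(logits).all():
    print("Logits are finite; the assert is probably triggered elsewhere.")
else:
    print("NaN/Inf in logits: the attention backend is producing bad values.")
```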
I'd recommend disabling Flash Attention 2 and using the default SDPA for now. We’ll probably need to wait for upcoming library updates to fully support this new design.
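For reference, a minimal loading sketch with Flash Attention 2 turned off (the model path and dtype are placeholders; keep whatever translate_json.py already uses):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint path; substitute the model you are actually loading.
model_name = "your-model-path"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",  # PyTorch SDPA instead of "flash_attention_2"
)
```

As a side note, greedy decoding (`do_sample=False` in `generate`) skips the `torch.multinomial` call where the assert fires, which can help narrow down whether sampling is the only place the NaNs surface.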
Accidentally closed this with my previous comment—feel free to reopen it if you still have questions.