Flash Attention 2 not supported
A torch error occurs when the model is loaded with `attn_implementation="flash_attention_2"`:
```
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:109: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
Traceback (most recent call last):
  File "/mnt/nvme2/cc/SakuraLLM/translate_json.py", line 233, in <module>
    result = translator.translate_json_dict(data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/nvme2/cc/SakuraLLM/translate_json.py", line 191, in translate_json_dict
    result[key] = self.translate_and_align_one(key)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/nvme2/cc/SakuraLLM/translate_json.py", line 177, in translate_and_align_one
    zh_fixed = self.align_special_tokens(jp_text, zh_draft)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cc/miniconda3/envs/sakura/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/nvme2/cc/SakuraLLM/translate_json.py", line 160, in align_special_tokens
    output_ids = self.align_model.generate(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cc/miniconda3/envs/sakura/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/cc/miniconda3/envs/sakura/lib/python3.12/site-packages/transformers/generation/utils.py", line 2535, in generate
    result = decoding_method(
             ^^^^^^^^^^^^^^^^
  File "/home/cc/miniconda3/envs/sakura/lib/python3.12/site-packages/transformers/generation/utils.py", line 2783, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
This issue likely stems from compatibility gaps between the new Qwen 3.5 hybrid architecture and current versions of transformers or flash-attention-2, which can cause numerical instability (NaN/Inf).
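If you want to confirm that the attention backend is what's producing the bad values, a quick sanity check on the logits before sampling can tell you. This is a hypothetical snippet, not part of translate_json.py: `model` and `inputs` stand for the already-loaded alignment model and a tokenized prompt.

```python
import torch

# Hypothetical spot-check: run one forward pass and verify the next-token
# logits are finite before any sampling happens.
with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]  # next-token logits

if torch.isfinite(logits).all():
    print("Logits are finite; the assert is probably triggered elsewhere.")
else:
    print("NaN/Inf in logits: the attention backend is producing bad values.")
```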
I'd recommend disabling Flash Attention 2 and using the default SDPA for now. We’ll probably need to wait for upcoming library updates to fully support this new design.
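For reference, a minimal loading sketch with Flash Attention 2 turned off (the model path and dtype are placeholders; keep whatever translate_json.py already uses):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint path; substitute the model you are actually loading.
model_name = "your-model-path"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",  # PyTorch SDPA instead of "flash_attention_2"
)
```

As a side note, greedy decoding (`do_sample=False` in `generate`) skips the `torch.multinomial` call where the assert fires, which can help narrow down whether sampling is the only place the NaNs surface.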
Accidentally closed this with my previous comment—feel free to reopen it if you still have questions.