[Solved] Unable to run test example (Torch OOM)
I followed the instructions and tried to run this snippet in a clean venv. I'm getting an OOM error from PyTorch on any image I have (even after downsizing one to 200x100 pixels, the result is the same).
More details
Exact code snippet
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-perception",
    trust_remote_code=True,
    device_map={"": "cuda:0"},
)

image = Image.open("photo.jpg")
preds = model.generate(image, "cat")[0]
for p in preds:
    print(p["xy"], p["hw"])
pip freeze output
accelerate==1.13.0
annotated-doc==0.0.4
anyio==4.13.0
certifi==2026.2.25
charset-normalizer==3.4.7
click==8.3.1
cuda-bindings==13.2.0
cuda-pathfinder==1.5.0
cuda-toolkit==13.0.2
einops==0.8.2
filelock==3.25.2
fsspec==2026.3.0
h11==0.16.0
hf-xet==1.4.3
httpcore==1.0.9
httpx==0.28.1
huggingface_hub==1.8.0
idna==3.11
Jinja2==3.1.6
markdown-it-py==4.0.0
MarkupSafe==3.0.3
mdurl==0.1.2
mpmath==1.3.0
networkx==3.6.1
numpy==2.4.4
nvidia-cublas==13.1.0.3
nvidia-cuda-cupti==13.0.85
nvidia-cuda-nvrtc==13.0.88
nvidia-cuda-runtime==13.0.96
nvidia-cudnn-cu13==9.19.0.56
nvidia-cufft==12.0.0.61
nvidia-cufile==1.15.1.6
nvidia-curand==10.4.0.35
nvidia-cusolver==12.0.4.66
nvidia-cusparse==12.6.3.3
nvidia-cusparselt-cu13==0.8.0
nvidia-nccl-cu13==2.28.9
nvidia-nvjitlink==13.0.88
nvidia-nvshmem-cu13==3.4.5
nvidia-nvtx==13.0.85
packaging==26.0
pillow==12.2.0
psutil==7.2.2
pycocotools==2.0.11
Pygments==2.20.0
PyYAML==6.0.3
regex==2026.3.32
requests==2.33.1
rich==14.3.3
safetensors==0.7.0
setuptools==81.0.0
shellingham==1.5.4
sympy==1.14.0
tokenizers==0.22.2
torch==2.11.0
torchaudio==2.11.0
torchvision==0.26.0
tqdm==4.67.3
transformers==5.5.0
triton==3.6.0
typer==0.24.1
typing_extensions==4.15.0
urllib3==2.6.3
I'm getting the following error
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Loading weights: 100%|██████████████████████████████████████████████████████████████████████| 231/231 [00:00<00:00, 700.48it/s]
Traceback (most recent call last):
  File "/media/fastdata/ez/llm/falcon-perception/runme.py", line 12, in <module>
    preds = model.generate(image, "cat")[0]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ez/.cache/huggingface/modules/transformers_modules/tiiuae/falcon_hyphen_perception/460b089849cbe000e02a9c28273a8ec8555362a0/modeling_falcon_perception.py", line 712, in generate
    logits_BSV, h_BSD = self.forward(
                        ^^^^^^^^^^^^^
  File "/home/ez/.cache/huggingface/modules/transformers_modules/tiiuae/falcon_hyphen_perception/460b089849cbe000e02a9c28273a8ec8555362a0/modeling_falcon_perception.py", line 607, in forward
    h_BSD = layer(
            ^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ez/.cache/huggingface/modules/transformers_modules/tiiuae/falcon_hyphen_perception/460b089849cbe000e02a9c28273a8ec8555362a0/modeling_falcon_perception.py", line 216, in forward
    x = x + self.attention(
            ^^^^^^^^^^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ez/.cache/huggingface/modules/transformers_modules/tiiuae/falcon_hyphen_perception/460b089849cbe000e02a9c28273a8ec8555362a0/modeling_falcon_perception.py", line 136, in forward
    output, aux_output = flex_fn(xq, xk, xv, block_mask=attention_masks, return_aux=AuxRequest(lse=True))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1038, in compile_wrapper
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1053, in _compile_fx_inner
raise InductorError(e, currentframe()).with_traceback(
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1037, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1798, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1570, in codegen_and_compile
compiled_module = graph.compile_to_module()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2499, in compile_to_module
return self._compile_to_module()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2509, in _compile_to_module
mod = self._compile_to_module_lines(wrapper_code)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2584, in _compile_to_module_lines
mod = PyCodeCache.load_by_key_path(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 3764, in load_by_key_path
mod = _reload_python_module(key, path, set_sys_modules=in_toplevel)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py", line 35, in _reload_python_module
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/torchinductor_ez/ea/cea3gi7xoya3mx47nktg63fj4yvsnckz2juzl4yg75pobcw4nqi3.py", line 638, in <module>
async_compile.wait(globals())
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/async_compile.py", line 699, in wait
self._wait_futures(scope)
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/async_compile.py", line 719, in _wait_futures
kernel = result.result()
^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 4361, in result
return self.result_fn()
^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/async_compile.py", line 453, in get_result
kernel.precompile(
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 503, in precompile
self._make_launchers()
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 664, in _make_launchers
raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
torch._inductor.exc.InductorError: RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_tem_fused_flex_attention_0 Required: 149248 Hardware limit:101376 Reducing block sizes or `num_stages` may help.
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
nvidia-smi output
Sat Apr 4 21:30:45 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:2D:00.0 On | N/A |
| 0% 41C P8 33W / 350W | 937MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4104 G /usr/bin/gnome-shell 404MiB |
| 0 N/A N/A 4210 G /usr/bin/Xwayland 7MiB |
| 0 N/A N/A 6011 G ...rack-uuid=3190708988185955192 319MiB |
+-----------------------------------------------------------------------------------------+
I have 64 GB of DDR4 system RAM in addition to the GPU.
Same here; in my case an RTX 5060 Ti with 16 GB VRAM and 16 GB system RAM.
Hi, sorry to hear this. This looks like it's coming from PyTorch's flex attention compilation, so it might be tricky to debug.
I wonder if you have tried our official repo instead? https://github.com/tiiuae/Falcon-Perception
It supports both batch inference (similar to this HF repo) and paged inference for efficient deployment. Both depend only on PyTorch (no transformers).
It also has MLX support if you have a Mac.
I use exactly the code from the repo: I uv add the package and run exactly the code from the README.
Hi @dheerapat38, do you have the exact same error as @earlzero above, or is it an OOM somewhere else?
Could you post the full stacktrace (preferably on GitHub if you use that code, so it's easier to track: https://github.com/tiiuae/Falcon-Perception)?
I will try to see what I can do, but since this is deep inside torch compile internals and I don't have a 16 GB card to test on, it might be a bit challenging.
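For the full internal compile trace that the error message mentions, something like this is one way to get it (just a sketch; exporting TORCHDYNAMO_VERBOSE=1 and TORCH_LOGS="+dynamo" in your shell before launching the script works just as well):

import os
# Set these before importing torch so the compile stack picks them up
os.environ["TORCHDYNAMO_VERBOSE"] = "1"   # full internal dynamo/inductor stack trace
os.environ["TORCH_LOGS"] = "+dynamo"      # extra developer-level compile logs
import torch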
I got exactly the same OOM. I had an AI (Opus 4.6) try to bypass the compilation process and it still failed.
https://github.com/tiiuae/Falcon-Perception/issues/7#issuecomment-4205874111
Hi, can you edit one line locally, as described in the comment above, and try again?
Basically, just pass flex_attn_kernel_options={"BLOCK_M": 64, "BLOCK_N": 64, "num_stages": 1}.
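For reference, here is a minimal standalone sketch of what those options roughly correspond to in plain PyTorch, assuming they are forwarded to flex_attention's kernel_options (the tensor shapes and the compile call below are illustrative, not the repo's actual code):

import torch
from torch.nn.attention.flex_attention import flex_attention

# kernel_options caps the Triton tile sizes and pipeline stages, so the generated
# attention kernel needs less on-chip (shared) memory per block.
compiled_flex = torch.compile(flex_attention)

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)

out = compiled_flex(
    q, k, v,
    kernel_options={"BLOCK_M": 64, "BLOCK_N": 64, "num_stages": 1},
)

Smaller BLOCK_M/BLOCK_N and num_stages=1 shrink the per-kernel shared-memory requirement, which is what the "Required: 149248 Hardware limit: 101376" message refers to.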
https://github.com/tiiuae/Falcon-Perception/pull/8
I added a PR that should fix this. I would be grateful if you could try it out and let me know how it goes on your hardware.
https://github.com/tiiuae/Falcon-Perception/issues/7#issuecomment-4210903932
The problem seems to be fixed there.