[Solved] Unable to run test example (Torch OOM)
I followed the instructions and tried to run this snippet in a clean venv. I'm getting an OOM error from PyTorch on any image I have (even after downsizing one to 200x100 pixels, the result is the same).
More details
Exact code snippet
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-perception",
    trust_remote_code=True,
    device_map={"": "cuda:0"},
)

image = Image.open("photo.jpg")
preds = model.generate(image, "cat")[0]
for p in preds:
    print(p["xy"], p["hw"])
pip freeze output
accelerate==1.13.0
annotated-doc==0.0.4
anyio==4.13.0
certifi==2026.2.25
charset-normalizer==3.4.7
click==8.3.1
cuda-bindings==13.2.0
cuda-pathfinder==1.5.0
cuda-toolkit==13.0.2
einops==0.8.2
filelock==3.25.2
fsspec==2026.3.0
h11==0.16.0
hf-xet==1.4.3
httpcore==1.0.9
httpx==0.28.1
huggingface_hub==1.8.0
idna==3.11
Jinja2==3.1.6
markdown-it-py==4.0.0
MarkupSafe==3.0.3
mdurl==0.1.2
mpmath==1.3.0
networkx==3.6.1
numpy==2.4.4
nvidia-cublas==13.1.0.3
nvidia-cuda-cupti==13.0.85
nvidia-cuda-nvrtc==13.0.88
nvidia-cuda-runtime==13.0.96
nvidia-cudnn-cu13==9.19.0.56
nvidia-cufft==12.0.0.61
nvidia-cufile==1.15.1.6
nvidia-curand==10.4.0.35
nvidia-cusolver==12.0.4.66
nvidia-cusparse==12.6.3.3
nvidia-cusparselt-cu13==0.8.0
nvidia-nccl-cu13==2.28.9
nvidia-nvjitlink==13.0.88
nvidia-nvshmem-cu13==3.4.5
nvidia-nvtx==13.0.85
packaging==26.0
pillow==12.2.0
psutil==7.2.2
pycocotools==2.0.11
Pygments==2.20.0
PyYAML==6.0.3
regex==2026.3.32
requests==2.33.1
rich==14.3.3
safetensors==0.7.0
setuptools==81.0.0
shellingham==1.5.4
sympy==1.14.0
tokenizers==0.22.2
torch==2.11.0
torchaudio==2.11.0
torchvision==0.26.0
tqdm==4.67.3
transformers==5.5.0
triton==3.6.0
typer==0.24.1
typing_extensions==4.15.0
urllib3==2.6.3
I'm getting the following error
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Loading weights: 100%|██████████████████████████████████████████████████████████████████████| 231/231 [00:00<00:00, 700.48it/s]
Traceback (most recent call last):
  File "/media/fastdata/ez/llm/falcon-perception/runme.py", line 12, in <module>
    preds = model.generate(image, "cat")[0]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ez/.cache/huggingface/modules/transformers_modules/tiiuae/falcon_hyphen_perception/460b089849cbe000e02a9c28273a8ec8555362a0/modeling_falcon_perception.py", line 712, in generate
    logits_BSV, h_BSD = self.forward(
                        ^^^^^^^^^^^^^
  File "/home/ez/.cache/huggingface/modules/transformers_modules/tiiuae/falcon_hyphen_perception/460b089849cbe000e02a9c28273a8ec8555362a0/modeling_falcon_perception.py", line 607, in forward
    h_BSD = layer(
            ^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ez/.cache/huggingface/modules/transformers_modules/tiiuae/falcon_hyphen_perception/460b089849cbe000e02a9c28273a8ec8555362a0/modeling_falcon_perception.py", line 216, in forward
    x = x + self.attention(
            ^^^^^^^^^^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ez/.cache/huggingface/modules/transformers_modules/tiiuae/falcon_hyphen_perception/460b089849cbe000e02a9c28273a8ec8555362a0/modeling_falcon_perception.py", line 136, in forward
    output, aux_output = flex_fn(xq, xk, xv, block_mask=attention_masks, return_aux=AuxRequest(lse=True))
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1038, in compile_wrapper
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1053, in _compile_fx_inner
raise InductorError(e, currentframe()).with_traceback(
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1037, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1798, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1570, in codegen_and_compile
compiled_module = graph.compile_to_module()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2499, in compile_to_module
return self._compile_to_module()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2509, in _compile_to_module
mod = self._compile_to_module_lines(wrapper_code)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/graph.py", line 2584, in _compile_to_module_lines
mod = PyCodeCache.load_by_key_path(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 3764, in load_by_key_path
mod = _reload_python_module(key, path, set_sys_modules=in_toplevel)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py", line 35, in _reload_python_module
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/torchinductor_ez/ea/cea3gi7xoya3mx47nktg63fj4yvsnckz2juzl4yg75pobcw4nqi3.py", line 638, in <module>
async_compile.wait(globals())
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/async_compile.py", line 699, in wait
self._wait_futures(scope)
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/async_compile.py", line 719, in _wait_futures
kernel = result.result()
^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 4361, in result
return self.result_fn()
^^^^^^^^^^^^^^^^
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/async_compile.py", line 453, in get_result
kernel.precompile(
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 503, in precompile
self._make_launchers()
File "/media/fastdata/ez/llm/falcon-perception/.env/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 664, in _make_launchers
raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
torch._inductor.exc.InductorError: RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_tem_fused_flex_attention_0 Required: 149248 Hardware limit:101376 Reducing block sizes or `num_stages` may help.
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
nvidia-smi output
Sat Apr 4 21:30:45 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:2D:00.0 On | N/A |
| 0% 41C P8 33W / 350W | 937MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 4104 G /usr/bin/gnome-shell 404MiB |
| 0 N/A N/A 4210 G /usr/bin/Xwayland 7MiB |
| 0 N/A N/A 6011 G ...rack-uuid=3190708988185955192 319MiB |
+-----------------------------------------------------------------------------------------+
I have 64 GB of DDR4 system RAM in addition to the GPU.
Same here; in my case an RTX 5060 Ti with 16 GB VRAM and 16 GB system RAM.
Hi, sorry to hear this. This looks like it's coming from PyTorch's flex attention compilation, so it might be tricky to debug.
I wonder if you have tried our official repo instead? https://github.com/tiiuae/Falcon-Perception
It supports both batch inference (similar to this HF repo) and paged inference for efficient deployment. Both depend only on PyTorch (no transformers).
It also has MLX support if you have a Mac.
I use exactly the code from the repo: I uv add the package and run exactly the code from the README.
Hi @dheerapat38, do you have the exact same error as @earlzero above, or is it an OOM somewhere else?
Could you post the full stacktrace (preferably on GitHub if you use that code, so it's easier to track: https://github.com/tiiuae/Falcon-Perception)?
I will try to see what I can do, but since this is deep inside torch compile internals and I don't have a 16 GB card to test on, it might be a bit challenging.
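For the full internal compile trace that the error message mentions, something like this is one way to get it (just a sketch; exporting TORCHDYNAMO_VERBOSE=1 and TORCH_LOGS="+dynamo" in your shell before launching the script works just as well):

import os
# Set these before importing torch so the compile stack picks them up
os.environ["TORCHDYNAMO_VERBOSE"] = "1"   # full internal dynamo/inductor stack trace
os.environ["TORCH_LOGS"] = "+dynamo"      # extra developer-level compile logs
import torch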
I got exactly the same OOM. I had an AI (Opus 4.6) try to bypass the compilation process and it still failed.
https://github.com/tiiuae/Falcon-Perception/issues/7#issuecomment-4205874111
Hi, can you edit one line locally, as described in the comment above, and try again?
Basically, just pass flex_attn_kernel_options={"BLOCK_M": 64, "BLOCK_N": 64, "num_stages": 1}.
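For reference, here is a minimal standalone sketch of what those options roughly correspond to in plain PyTorch, assuming they are forwarded to flex_attention's kernel_options (the tensor shapes and the compile call below are illustrative, not the repo's actual code):

import torch
from torch.nn.attention.flex_attention import flex_attention

# kernel_options caps the Triton tile sizes and pipeline stages, so the generated
# attention kernel needs less on-chip (shared) memory per block.
compiled_flex = torch.compile(flex_attention)

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.bfloat16)

out = compiled_flex(
    q, k, v,
    kernel_options={"BLOCK_M": 64, "BLOCK_N": 64, "num_stages": 1},
)

Smaller BLOCK_M/BLOCK_N and num_stages=1 shrink the per-kernel shared-memory requirement, which is what the "Required: 149248 Hardware limit: 101376" message refers to.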
https://github.com/tiiuae/Falcon-Perception/pull/8
I added a PR that should fix this. I would be grateful if you could try it out and let me know how it goes on your hardware.
https://github.com/tiiuae/Falcon-Perception/issues/7#issuecomment-4210903932
The problem seems to be fixed there.