Quickstart
Prerequisites
- Python 3.12 (the package requires >=3.12,<3.13)
- CUDA toolkit with nvcc on PATH (the package compiles a CUDA C++ extension at install time)
- A CUDA GPU (or set TORCH_CUDA_ARCH_LIST to cross-compile, e.g. TORCH_CUDA_ARCH_LIST="8.0 9.0")
The CUDA toolkit version must share the same major version as the CUDA
bindings in your PyTorch install (e.g. toolkit 12.4 with torch+cu128 is fine;
toolkit 12.4 with torch+cu130 will fail).
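The major-version rule above can be checked before starting a build. The following is a minimal sketch, not part of the package: the helper names (`parse_nvcc_release`, `same_cuda_major`, `check_local_install`) are hypothetical, and it assumes `nvcc --version` prints a line containing `release X.Y`, as current toolkits do.

```python
import re
import subprocess


def cuda_major(version: str) -> int:
    """Return the major component of a CUDA version string such as '12.4'."""
    return int(version.strip().split(".")[0])


def parse_nvcc_release(nvcc_output: str) -> str:
    """Extract the 'release X.Y' version from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find a CUDA release in nvcc output")
    return match.group(1)


def same_cuda_major(toolkit_version: str, torch_cuda_version: str) -> bool:
    """True when the toolkit and the PyTorch CUDA bindings share a major version."""
    return cuda_major(toolkit_version) == cuda_major(torch_cuda_version)


def check_local_install() -> bool:
    """Compare the local nvcc with the installed PyTorch (hypothetical helper)."""
    import torch  # imported lazily; assumes PyTorch is already installed

    if torch.version.cuda is None:
        # CPU-only PyTorch builds report no CUDA version at all.
        return False
    out = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    return same_cuda_major(parse_nvcc_release(out), torch.version.cuda)
```

Calling `check_local_install()` on the build machine should return `True` for a pairing such as toolkit 12.4 with torch+cu128, and `False` for toolkit 12.4 with torch+cu130.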
On Slurm clusters, run the install on a GPU node or load the CUDA module first:
module load cuda12.4/toolkit/12.4.1 # example; adjust for your cluster
export CUDA_HOME=/usr/local/cuda # or wherever the toolkit lives
Installation
Install PyTorch first with bindings matching your CUDA toolkit, then install
this package with --no-build-isolation so it builds the C++ extension against
your existing PyTorch:
# 1. Install PyTorch (adjust the index URL for your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
# 2. Install nemotron-ocr
cd nemotron-ocr
pip install --no-build-isolation -v .
Why --no-build-isolation? Without it, pip creates a temporary build environment and installs the latest PyTorch from PyPI. That PyTorch's CUDA version may not match your system's nvcc, causing the C++ extension build to fail with a CUDA version mismatch error.
Verify the installation (the C++ extension must load without errors):
python -c "from nemotron_ocr.inference.pipeline_v2 import NemotronOCRV2; print('OK')"
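If the one-liner fails, seeing the exception type can help tell a missing package from an extension that was built but cannot load. The sketch below is a generic helper, not something the package provides; `check_import` is a hypothetical name.

```python
import importlib


def check_import(module_name: str) -> tuple[bool, str]:
    """Try to import a module and report success or the error that occurred."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # ImportError, or a loader error from a compiled extension
        return False, f"{type(exc).__name__}: {exc}"
    return True, "OK"
```

For this package, the call would be `check_import("nemotron_ocr.inference.pipeline_v2")`. A `ModuleNotFoundError` suggests the install did not complete, while other errors more likely point to the compiled extension.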
Usage
NemotronOCRV2 is the recommended entry point for OCR inference:
from nemotron_ocr.inference.pipeline_v2 import NemotronOCRV2
ocr = NemotronOCRV2()
predictions = ocr("ocr-example-input-1.png")
for pred in predictions:
    print(f" - Text: '{pred['text']}', Confidence: {pred['confidence']:.2f}")
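Because each prediction exposes `text` and `confidence`, low-confidence results can be dropped before further processing. This is a minimal sketch: `filter_by_confidence` is a hypothetical helper, the 0.5 threshold is arbitrary, and the sample data is made up to match the shape used above rather than taken from real model output.

```python
def filter_by_confidence(predictions, min_confidence=0.5):
    """Keep predictions whose 'confidence' is at least min_confidence."""
    return [p for p in predictions if p["confidence"] >= min_confidence]


# Hypothetical sample data shaped like the predictions above (not real output).
sample = [
    {"text": "Invoice", "confidence": 0.98},
    {"text": "T0tal", "confidence": 0.41},
]

kept = filter_by_confidence(sample, min_confidence=0.5)
print([p["text"] for p in kept])  # prints ['Invoice']
```

In practice, the list returned by `ocr(...)` would take the place of `sample`.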
The level of detection merging can be adjusted with merge_level:
ocr(image_path, merge_level="word") # individual words
ocr(image_path, merge_level="sentence") # merged into sentences
ocr(image_path, merge_level="paragraph") # merged into paragraphs (default)
Inference modes
# Detector only — bounding boxes, no text (fastest, lowest memory)
ocr_det = NemotronOCRV2(detector_only=True)
# Skip relational — per-word text, no reading-order grouping
ocr_fast = NemotronOCRV2(skip_relational=True)
# Profiling — per-phase CUDA-synced timing in logs
ocr_profile = NemotronOCRV2(verbose_post=True)
Example script
python example.py ocr-example-input-1.png
python example.py ocr-example-input-1.png --merge-level word
python example.py ocr-example-input-1.png --detector-only
python example.py ocr-example-input-1.png --skip-relational
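The flags above correspond to the constructor and call options shown in the earlier sections. The sketch below shows one way such a command line could be wired up. It is an illustration only and not the contents of the shipped example.py; `build_parser` and `main` are hypothetical names, and it assumes `detector_only` and `skip_relational` default to `False` and that `merge_level` is accepted in every mode.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Run OCR on one image (illustrative sketch).")
    parser.add_argument("image", help="path to the input image")
    parser.add_argument(
        "--merge-level",
        choices=["word", "sentence", "paragraph"],
        default="paragraph",  # documented default
    )
    parser.add_argument("--detector-only", action="store_true")
    parser.add_argument("--skip-relational", action="store_true")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Imported lazily so the parser can be exercised without the package installed.
    from nemotron_ocr.inference.pipeline_v2 import NemotronOCRV2

    # Assumption: passing False for both options matches the default behavior.
    ocr = NemotronOCRV2(
        detector_only=args.detector_only,
        skip_relational=args.skip_relational,
    )
    for pred in ocr(args.image, merge_level=args.merge_level):
        print(pred)
```

Calling `main(["ocr-example-input-1.png", "--merge-level", "word"])` would then mirror the second command above.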