---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
base_model: nvidia/nemotron-ocr-v2
library_name: coreml
tags:
  - coreml
  - ocr
  - text-detection
  - text-recognition
  - apple-silicon
  - apple-neural-engine
  - swift
---

Nemotron OCR v2 CoreML

CoreML conversion of the English neural stages from NVIDIA Nemotron OCR v2.

SwiftPM package: github.com/mweinbach/OCRCoreML

Included Models

| stage | file | input | outputs |
| --- | --- | --- | --- |
| detector | DetectorGPUInt8_768.mlpackage | image: Float32[1, 3, 768, 768] | prob, rboxes, features |
| recognizer | RecognizerFeaturesInt8.mlpackage | regions: Float32[128, 128, 8, 32] | logits, features |
| relational | RelationalInt8.mlpackage | rectified regions, original quads, recognizer features, valid count | words, lines, line_log_var |

The recognizer emits transformer features; those are required by the relational model, so this bundle covers the full neural OCR pipeline rather than detector-only inference.
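
The detector's `image` input is a channel-first Float32 tensor of shape [1, 3, 768, 768]. As a rough sketch of what filling such a buffer involves, the helper below repacks interleaved 8-bit RGB pixels into planar floats. The function name `packCHW`, the RGB channel order, and the scaling to [0, 1] are assumptions made for illustration; the normalization the model actually expects is defined by the upstream checkpoint and the SwiftPM wrapper.

```swift
/// Hypothetical helper: repacks interleaved RGB bytes (HWC layout) into a
/// planar channel-first (CHW) Float array, the layout a [1, 3, H, W] input uses.
/// ASSUMPTION: values are scaled to [0, 1]; the real model may require a
/// different normalization or channel order.
func packCHW(rgb: [UInt8], width: Int, height: Int) -> [Float] {
    precondition(rgb.count == width * height * 3, "expected interleaved RGB bytes")
    let plane = width * height
    var out = [Float](repeating: 0, count: 3 * plane)
    for y in 0..<height {
        for x in 0..<width {
            let pixel = y * width + x
            for c in 0..<3 {
                // Source index is HWC; destination index is CHW.
                out[c * plane + pixel] = Float(rgb[pixel * 3 + c]) / 255.0
            }
        }
    }
    return out
}
```

For the detector, `width` and `height` would both be 768, giving 3 × 768 × 768 = 1,769,472 elements.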

Pipeline Boundary

The original Python package uses CUDA/C++ helpers for the non-neural stages: rotated-box NMS, rboxes to quads, quad rectification, feature-map grid sampling, sequence decoding, relation-graph decoding, and reading-order formatting. Those operations are not CoreML models. Apple apps integrating this bundle must port or replace those post-processing steps.
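
To give a sense of the geometric work involved, here is a minimal sketch of one of those steps, converting a rotated box to a four-corner quad. It assumes a center/size/angle parameterization, which is an illustrative guess; the detector's `rboxes` output may be encoded differently, so check the upstream helpers before porting. The types `RotatedBox` and `Point` and the function `quad(from:)` are hypothetical.

```swift
import Foundation

/// Hypothetical rotated-box type.
/// ASSUMPTION: center/size/angle encoding; the real `rboxes` layout may differ.
struct RotatedBox {
    var cx: Double, cy: Double          // box center in pixels
    var width: Double, height: Double   // box size in pixels
    var angle: Double                   // rotation in radians (assumed)
}

struct Point: Equatable { var x: Double; var y: Double }

/// Returns the four corners, starting with the corner that is top-left
/// before rotation and continuing around the box.
func quad(from box: RotatedBox) -> [Point] {
    let c = cos(box.angle), s = sin(box.angle)
    let hw = box.width / 2, hh = box.height / 2
    // Corner offsets in the box's local, unrotated frame.
    let local: [(Double, Double)] = [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]
    return local.map { offset in
        let (dx, dy) = offset
        // Standard 2D rotation followed by translation to the box center.
        return Point(x: box.cx + dx * c - dy * s, y: box.cy + dx * s + dy * c)
    }
}
```

Rotated-box NMS and quad rectification would then operate on these corner lists.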

The linked SwiftPM package includes wrappers for all three CoreML models and a greedy recognizer decoder. It exposes raw tensors rather than claiming complete image-to-text OCR until the geometric and graph post-processing is ported.
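
For readers unfamiliar with greedy decoding, the idea can be sketched in plain Swift: take the highest-scoring token at each step and stop at an end marker. This is not the package's decoder. The function name, the `[timestep][vocab]` logits layout, the end-token convention, and the direct index-to-character mapping are all assumptions; the real token layout comes from charset.txt and model_config.json.

```swift
/// Hypothetical greedy decoder for a single region.
/// ASSUMPTIONS: `logits` is indexed [timestep][vocab]; `endToken` terminates
/// the sequence; any other index i maps to `charset[i]`.
func greedyDecode(logits: [[Float]], charset: [Character], endToken: Int) -> String {
    var text = ""
    for step in logits {
        // Argmax over the vocabulary axis for this timestep.
        guard let best = step.indices.max(by: { step[$0] < step[$1] }) else { continue }
        if best == endToken { break }
        // Ignore indices that fall outside the charset (e.g. other special tokens).
        if best < charset.count { text.append(charset[best]) }
    }
    return text
}
```

Greedy decoding is cheap and deterministic, which makes it a reasonable first check that the recognizer logits are wired up correctly.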

Files

| file | purpose |
| --- | --- |
| DetectorGPUInt8_768.mlpackage/ | detector CoreML package |
| RecognizerFeaturesInt8.mlpackage/ | recognizer CoreML package with logits and features |
| RelationalInt8.mlpackage/ | relational CoreML package |
| charset.txt | English checkpoint charset |
| model_config.json | English checkpoint config |
| configs/ | conversion configs used for the three packages |
| benchmarks/ | local CoreML benchmark results |
| parity/ | PyTorch-vs-CoreML parity reports |
| checksums.sha256 | SHA-256 checksums for package files |
| LICENSE, NOTICE | license terms and redistribution notice |

Performance

Local median latencies after warmup:

| stage | GPU/ALL median | CPU+NE median | CPU median |
| --- | --- | --- | --- |
| detector | 10.65 ms | 50.46 ms | 157.71 ms |
| recognizer + features | 4.53 ms | 11.04 ms | 47.58 ms |
| relational | 1.72 ms | 6.38 ms | 34.53 ms |
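
The card does not describe the benchmark harness, so the following is only a generic sketch of how a "median after warmup" figure can be collected in Swift. The helper names and the choice of `DispatchTime` as the clock are assumptions, not the method used to produce the numbers above.

```swift
import Dispatch

/// Median of a non-empty list of samples.
func median(_ values: [Double]) -> Double {
    precondition(!values.isEmpty, "median of an empty list is undefined")
    let sorted = values.sorted()
    let mid = sorted.count / 2
    return sorted.count % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
}

/// Hypothetical benchmark helper: runs `body` `warmup` times untimed, then
/// `iterations` timed runs, and returns the median latency in milliseconds.
func medianLatencyMs(warmup: Int, iterations: Int, _ body: () throws -> Void) rethrows -> Double {
    precondition(iterations > 0, "need at least one timed iteration")
    // Warmup runs absorb one-time costs such as model loading and caching.
    for _ in 0..<warmup { try body() }
    var samples: [Double] = []
    samples.reserveCapacity(iterations)
    for _ in 0..<iterations {
        let start = DispatchTime.now().uptimeNanoseconds
        try body()
        let end = DispatchTime.now().uptimeNanoseconds
        samples.append(Double(end - start) / 1_000_000)
    }
    return median(samples)
}
```

A call such as `pipeline.detect(image:)` could be passed as the closure body; the warmup and iteration counts are left to the caller.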

GPU/ALL gave the lowest single-shot latency for every stage on the test machine. CPU+NE (CPU plus Neural Engine) is slower but useful when GPU time needs to be reserved for rendering or other workloads.

Swift Usage

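import OCRCoreML

// `cgImage` is a caller-supplied CGImage of the page to process.
let pipeline = try OCRPipeline(computeUnits: .cpuAndGPU)
let detectorPrediction = try pipeline.detect(image: cgImage)

// `regions` and `detectedRegionCount` come from the app's own post-processing
// of the detector output (rotated-box NMS, quads, rectification), which must
// be ported separately; see Pipeline Boundary.
let recognizerPrediction = try pipeline.recognize(regions: regions)
let decoded = try pipeline.recognizer.decode(
    logits: recognizerPrediction.output.logits,
    count: detectedRegionCount
)

// `relationalRegionFeatures` and `originalQuads` are likewise prepared by the caller.
let relationalPrediction = try pipeline.relate(
    rectifiedQuads: relationalRegionFeatures,
    originalQuads: originalQuads,
    recognizerFeatures: recognizerPrediction.output.features,
    numValid: detectedRegionCount
)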

See the SwiftPM docs for exact app integration notes: https://github.com/mweinbach/OCRCoreML

License

The converted model weights inherit the NVIDIA Open Model License Agreement. The upstream source code and helper scripts are Apache 2.0. See LICENSE and NOTICE.