license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
base_model: nvidia/nemotron-ocr-v2
library_name: coreml
tags:
  - coreml
  - ocr
  - text-detection
  - text-recognition
  - apple-silicon
  - apple-neural-engine
  - swift

Nemotron OCR v2 CoreML

CoreML conversion of the English neural stages from NVIDIA Nemotron OCR v2.

SwiftPM package: https://github.com/mweinbach/OCRCoreML

Included Models

| stage | file | input | outputs |
| --- | --- | --- | --- |
| detector | `DetectorGPUInt8_768.mlpackage` | `image: Float32[1, 3, 768, 768]` | `prob`, `rboxes`, `features` |
| recognizer | `RecognizerFeaturesInt8.mlpackage` | `regions: Float32[128, 128, 8, 32]` | `logits`, `features` |
| relational | `RelationalInt8.mlpackage` | rectified regions, original quads, recognizer features, valid count | `words`, `lines`, `line_log_var` |

The recognizer emits transformer features; those are required by the relational model, so this bundle covers the full neural OCR pipeline rather than detector-only inference.
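The recognizer input has a fixed leading dimension of 128 regions, and the relational model takes a valid count, so an app with fewer detected regions presumably has to pad the batch and track how many entries are real. A minimal sketch of that bookkeeping follows. The function name `packRegions`, the zero-padding policy, and the row-major layout are assumptions for illustration, not part of the package.

```swift
// Sketch only: the shape comes from the Included Models table;
// zero padding and row-major layout are assumptions.
let maxRegions = 128                  // leading dimension of `regions`
let valuesPerRegion = 128 * 8 * 32    // remaining dimensions of one region

/// Packs per-region feature arrays into one flat buffer shaped [128, 128, 8, 32].
/// Returns the buffer and the number of valid (non-padding) regions.
/// Regions beyond the first 128 are dropped in this sketch.
func packRegions(_ regions: [[Float]]) -> (buffer: [Float], validCount: Int) {
    let validCount = min(regions.count, maxRegions)
    var buffer = [Float](repeating: 0, count: maxRegions * valuesPerRegion)
    for index in 0..<validCount {
        precondition(regions[index].count == valuesPerRegion, "unexpected region size")
        let start = index * valuesPerRegion
        buffer.replaceSubrange(start..<(start + valuesPerRegion), with: regions[index])
    }
    return (buffer, validCount)
}
```

The returned `validCount` is the kind of value the usage example passes as `count:` and `numValid:`. A real integration would likely process overflow regions in additional batches rather than dropping them.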

Pipeline Boundary

The original Python package uses CUDA/C++ helpers for the non-neural stages: rotated-box NMS, rboxes to quads, quad rectification, feature-map grid sampling, sequence decoding, relation-graph decoding, and reading-order formatting. Those operations are not CoreML models. Apple apps integrating this bundle must port or replace those post-processing steps.
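To give a sense of what one of these ported steps might look like, here is a minimal sketch of converting a rotated box to a quad. The `RotatedBox` parameterization (center, size, angle in radians) is hypothetical; the actual layout of the `rboxes` tensor is not described here and must be taken from the upstream implementation.

```swift
import Foundation

struct Point: Equatable { var x: Double; var y: Double }

/// Hypothetical rotated-box parameterization: center, size, rotation in radians.
/// The real `rboxes` layout may differ.
struct RotatedBox { var cx, cy, width, height, angle: Double }

/// Returns the four corners, starting from the corner that is top-left
/// before rotation and proceeding to top-right, bottom-right, bottom-left.
func quad(from box: RotatedBox) -> [Point] {
    let c = cos(box.angle), s = sin(box.angle)
    let hw = box.width / 2, hh = box.height / 2
    let corners: [(Double, Double)] = [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]
    return corners.map { corner in
        // Rotate the corner offset about the origin, then translate to the center.
        Point(x: box.cx + corner.0 * c - corner.1 * s,
              y: box.cy + corner.0 * s + corner.1 * c)
    }
}
```

In image coordinates, where y grows downward, a positive angle in this sketch rotates the box clockwise on screen.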

The linked SwiftPM package includes wrappers for all three CoreML models and a greedy recognizer decoder. Until the geometric and graph post-processing is ported, it exposes raw tensors rather than complete image-to-text OCR.
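For readers unfamiliar with greedy decoding, the idea is to pick the highest-scoring class at each step and map it through the charset. The sketch below is not the package's decoder: the `[[Float]]` logits layout, the `stopIndex` end-of-sequence class, and the function name are assumptions. The real token layout is defined by `charset.txt` and `model_config.json`.

```swift
/// Greedy decoding: take the highest-scoring class at each step.
/// `stopIndex` is a hypothetical end-of-sequence class; the real token layout
/// is defined by the checkpoint's charset and config, not by this sketch.
func greedyDecode(logits: [[Float]], charset: [Character], stopIndex: Int) -> String {
    var text = ""
    for step in logits {
        // Index of the largest logit at this step.
        guard let best = step.indices.max(by: { step[$0] < step[$1] }) else { break }
        if best == stopIndex { break }
        // Ignore classes that fall outside the charset (e.g. padding).
        if best < charset.count { text.append(charset[best]) }
    }
    return text
}
```

With a three-character charset and class 3 as the assumed terminator, decoding stops at the first step whose argmax is 3 and ignores everything after it.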

Files

| file | purpose |
| --- | --- |
| `DetectorGPUInt8_768.mlpackage/` | detector CoreML package |
| `RecognizerFeaturesInt8.mlpackage/` | recognizer CoreML package with logits and features |
| `RelationalInt8.mlpackage/` | relational CoreML package |
| `charset.txt` | English checkpoint charset |
| `model_config.json` | English checkpoint config |
| `configs/` | conversion configs used for the three packages |
| `benchmarks/` | local CoreML benchmark results |
| `parity/` | PyTorch-vs-CoreML parity reports |
| `checksums.sha256` | SHA-256 checksums for package files |
| `LICENSE`, `NOTICE` | license terms and redistribution notice |

Performance

Local median latencies after warmup:

| stage | GPU/ALL median | CPU+ANE median | CPU median |
| --- | --- | --- | --- |
| detector | 10.65 ms | 50.46 ms | 157.71 ms |
| recognizer + features | 4.53 ms | 11.04 ms | 47.58 ms |
| relational | 1.72 ms | 6.38 ms | 34.53 ms |

The GPU/ALL configuration has the lowest single-shot latency for all three stages on the test machine. CPU+ANE is slower but useful when GPU time needs to be reserved for rendering or other workloads.
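To reproduce this kind of measurement on another machine, a median-after-warmup timer can be sketched as follows. This is not the script behind the numbers above; the function names and the warmup and iteration counts are assumptions.

```swift
import Foundation

/// Median of a non-empty sample list; averages the two middle values for even counts.
func median(_ samples: [Double]) -> Double {
    precondition(!samples.isEmpty, "need at least one sample")
    let sorted = samples.sorted()
    let mid = sorted.count / 2
    return sorted.count % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
}

/// Runs `work` `warmup` times untimed, then `iterations` times timed,
/// and returns the median latency in milliseconds.
func medianLatencyMs(warmup: Int, iterations: Int, _ work: () throws -> Void) rethrows -> Double {
    for _ in 0..<warmup { try work() }
    var samples: [Double] = []
    for _ in 0..<iterations {
        let start = DispatchTime.now().uptimeNanoseconds
        try work()
        let end = DispatchTime.now().uptimeNanoseconds
        samples.append(Double(end - start) / 1_000_000)  // ns to ms
    }
    return median(samples)
}
```

The closure passed as `work` would wrap a single model prediction. The median is less sensitive than the mean to occasional slow runs, which is presumably why it is the statistic reported here.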

Swift Usage

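import OCRCoreML

// Create the pipeline with the chosen compute units.
let pipeline = try OCRPipeline(computeUnits: .cpuAndGPU)

// Detector: `cgImage` is the caller's input image; the result holds raw detector outputs.
let detectorPrediction = try pipeline.detect(image: cgImage)

// `regions` and `detectedRegionCount` come from the app's own post-processing
// of the detector output; that step is not part of this bundle.
let recognizerPrediction = try pipeline.recognize(regions: regions)
let decoded = try pipeline.recognizer.decode(
    logits: recognizerPrediction.output.logits,
    count: detectedRegionCount
)

// Relational model: takes the recognizer features plus caller-supplied region inputs.
let relationalPrediction = try pipeline.relate(
    rectifiedQuads: relationalRegionFeatures,
    originalQuads: originalQuads,
    recognizerFeatures: recognizerPrediction.output.features,
    numValid: detectedRegionCount
)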

See the SwiftPM docs for exact app integration notes: https://github.com/mweinbach/OCRCoreML

License

The converted model weights inherit the NVIDIA Open Model License Agreement. The upstream source code and helper scripts are licensed under Apache 2.0. See `LICENSE` and `NOTICE`.