File size: 3,855 Bytes
ed77b73 05e0955 ed77b73 05e0955 ed77b73 05e0955 ed77b73 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 | ---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
base_model: nvidia/nemotron-ocr-v2
library_name: coreml
tags:
- coreml
- ocr
- text-detection
- text-recognition
- apple-silicon
- apple-neural-engine
- swift
---
# Nemotron OCR v2 CoreML
CoreML conversion of the English neural stages from
[NVIDIA Nemotron OCR v2](https://huggingface.co/nvidia/nemotron-ocr-v2).
SwiftPM package:
[github.com/mweinbach/OCRCoreML](https://github.com/mweinbach/OCRCoreML)
## Included Models
| stage | file | input | outputs |
|---|---|---|---|
| detector | `DetectorGPUInt8_768.mlpackage` | `image: Float32[1, 3, 768, 768]` | `prob`, `rboxes`, `features` |
| recognizer | `RecognizerFeaturesInt8.mlpackage` | `regions: Float32[128, 128, 8, 32]` | `logits`, `features` |
| relational | `RelationalInt8.mlpackage` | rectified regions, original quads, recognizer features, valid count | `words`, `lines`, `line_log_var` |
The recognizer emits transformer `features`; those are required by the
relational model, so this bundle covers the full neural OCR pipeline rather
than detector-only inference.
## Pipeline Boundary
The original Python package uses CUDA/C++ helpers for the non-neural stages:
rotated-box NMS, `rboxes` to quads, quad rectification, feature-map grid
sampling, sequence decoding, relation-graph decoding, and reading-order
formatting. Those operations are not CoreML models. Apple apps integrating this
bundle must port or replace those post-processing steps.
The linked SwiftPM package includes wrappers for all three CoreML models and a
greedy recognizer decoder. It exposes raw tensors rather than claiming complete
image-to-text OCR until the geometric and graph post-processing is ported.
## Files
| file | purpose |
|---|---|
| `DetectorGPUInt8_768.mlpackage/` | detector CoreML package |
| `RecognizerFeaturesInt8.mlpackage/` | recognizer CoreML package with logits and features |
| `RelationalInt8.mlpackage/` | relational CoreML package |
| `charset.txt` | English checkpoint charset |
| `model_config.json` | English checkpoint config |
| `configs/` | conversion configs used for the three packages |
| `benchmarks/` | local CoreML benchmark results |
| `parity/` | PyTorch-vs-CoreML parity reports |
| `checksums.sha256` | SHA-256 checksums for package files |
| `LICENSE`, `NOTICE` | license terms and redistribution notice |
## Performance
Local median latencies after warmup:
| stage | GPU/ALL median | CPU+NE median | CPU median |
|---|---:|---:|---:|
| detector | 10.65 ms | 50.46 ms | 157.71 ms |
| recognizer + features | 4.53 ms | 11.04 ms | 47.58 ms |
| relational | 1.72 ms | 6.38 ms | 34.53 ms |
GPU/CoreML `ALL` is the best single-shot latency path on the test machine.
CPU+ANE is useful when GPU time needs to be reserved for rendering or other
workloads.
## Swift Usage
```swift
import OCRCoreML
let pipeline = try OCRPipeline(computeUnits: .cpuAndGPU)
let detectorPrediction = try pipeline.detect(image: cgImage)
let recognizerPrediction = try pipeline.recognize(regions: regions)
let decoded = try pipeline.recognizer.decode(
logits: recognizerPrediction.output.logits,
count: detectedRegionCount
)
let relationalPrediction = try pipeline.relate(
rectifiedQuads: relationalRegionFeatures,
originalQuads: originalQuads,
recognizerFeatures: recognizerPrediction.output.features,
numValid: detectedRegionCount
)
```
See the SwiftPM docs for exact app integration notes:
<https://github.com/mweinbach/OCRCoreML>
## License
The converted model weights inherit the
[NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
The upstream source code and helper scripts are Apache 2.0. See `LICENSE` and
`NOTICE`.
|