---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
base_model: nvidia/nemotron-ocr-v2
library_name: coreml
tags:
- coreml
- ocr
- text-detection
- text-recognition
- apple-silicon
- apple-neural-engine
- swift
---

# Nemotron OCR v2 CoreML

CoreML conversion of the English neural stages from
[NVIDIA Nemotron OCR v2](https://huggingface.co/nvidia/nemotron-ocr-v2).

SwiftPM package:
[github.com/mweinbach/OCRCoreML](https://github.com/mweinbach/OCRCoreML)

## Included Models

| stage | file | input | outputs |
|---|---|---|---|
| detector | `DetectorGPUInt8_768.mlpackage` | `image: Float32[1, 3, 768, 768]` | `prob`, `rboxes`, `features` |
| recognizer | `RecognizerFeaturesInt8.mlpackage` | `regions: Float32[128, 128, 8, 32]` | `logits`, `features` |
| relational | `RelationalInt8.mlpackage` | rectified regions, original quads, recognizer features, valid count | `words`, `lines`, `line_log_var` |

The recognizer emits transformer `features`; those are required by the
relational model, so this bundle covers the full neural OCR pipeline rather
than detector-only inference.
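
When calling the detector package directly instead of through the SwiftPM wrapper, the `Float32[1, 3, 768, 768]` input in the table implies a planar, channel-first buffer. The sketch below packs interleaved RGBA bytes into that layout. The function name, the RGB channel order, and the 0...1 scaling are assumptions made for illustration; the real normalization should be taken from `configs/`, and resizing to 768x768 is left to the caller.

```swift
import Foundation

/// Packs interleaved 8-bit RGBA pixels into a planar [3, height, width] Float32 buffer.
/// Assumption: RGB order and 0...1 scaling; the converted model may expect otherwise.
func packRGBAToCHW(rgba: [UInt8], width: Int, height: Int) -> [Float] {
    precondition(rgba.count == width * height * 4, "expected 4 bytes per pixel")
    let planeSize = width * height
    var chw = [Float](repeating: 0, count: 3 * planeSize)
    for pixel in 0..<planeSize {
        for channel in 0..<3 {  // the alpha byte (index 3) is dropped
            chw[channel * planeSize + pixel] = Float(rgba[pixel * 4 + channel]) / 255.0
        }
    }
    return chw
}
```

For a 768x768 image the returned array has `3 * 768 * 768` elements and can be copied into an `MLMultiArray` of shape `[1, 3, 768, 768]`.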

## Pipeline Boundary

The original Python package uses CUDA/C++ helpers for the non-neural stages:
rotated-box NMS, `rboxes` to quads, quad rectification, feature-map grid
sampling, sequence decoding, relation-graph decoding, and reading-order
formatting. Those operations are not CoreML models. Apple apps integrating this
bundle must port or replace those post-processing steps.

The linked SwiftPM package includes wrappers for all three CoreML models and a
greedy recognizer decoder. Until the geometric and graph post-processing is
ported, it exposes raw tensors and does not provide complete image-to-text OCR.
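
To illustrate what greedy decoding of recognizer `logits` means in general, here is a minimal sketch that picks the highest-scoring class at each sequence position and maps it to a character. It is not the package's decoder: the function name, the `[position][class]` layout, and the `skip` set of special-token indices are all hypothetical, since the logit layout and the mapping to `charset.txt` are not described here. Prefer the package's own `decode(logits:count:)` in real code.

```swift
import Foundation

/// Greedy decoding: take the argmax class at each sequence position.
/// `logits` is [position][class]; `skip` holds hypothetical special-token indices.
func greedyDecode(logits: [[Float]], charset: [Character], skip: Set<Int> = []) -> String {
    var text = ""
    for scores in logits {
        // Index of the largest score at this position; empty rows are ignored.
        guard let best = scores.indices.max(by: { scores[$0] < scores[$1] }) else { continue }
        // Drop special tokens and any index that falls outside the charset.
        if skip.contains(best) || best >= charset.count { continue }
        text.append(charset[best])
    }
    return text
}
```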

## Files

| file | purpose |
|---|---|
| `DetectorGPUInt8_768.mlpackage/` | detector CoreML package |
| `RecognizerFeaturesInt8.mlpackage/` | recognizer CoreML package with logits and features |
| `RelationalInt8.mlpackage/` | relational CoreML package |
| `charset.txt` | English checkpoint charset |
| `model_config.json` | English checkpoint config |
| `configs/` | conversion configs used for the three packages |
| `benchmarks/` | local CoreML benchmark results |
| `parity/` | PyTorch-vs-CoreML parity reports |
| `checksums.sha256` | SHA-256 checksums for package files |
| `LICENSE`, `NOTICE` | license terms and redistribution notice |

## Performance

Local median latencies after warmup:

| stage | GPU/ALL median | CPU+ANE median | CPU median |
|---|---:|---:|---:|
| detector | 10.65 ms | 50.46 ms | 157.71 ms |
| recognizer + features | 4.53 ms | 11.04 ms | 47.58 ms |
| relational | 1.72 ms | 6.38 ms | 34.53 ms |

The GPU/`ALL` configuration gives the lowest single-shot latency on the test
machine. CPU+ANE is roughly 2.4x to 4.7x slower per stage but is useful when GPU
time needs to be reserved for rendering or other workloads.
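
For apps that load a package directly with CoreML, the compute-unit choice is made through `MLModelConfiguration`. The sketch below is one possible way to express the three benchmark columns. The `BenchmarkPath` enum and `loadModel` function are hypothetical names, and mapping the GPU/ALL column to `.all` is an assumption (the usage example in this README passes `.cpuAndGPU`).

```swift
import CoreML

/// Labels mirroring the benchmark columns; this enum is hypothetical, not part of OCRCoreML.
@available(macOS 13.0, iOS 16.0, *)
enum BenchmarkPath {
    case gpuAll, cpuAndANE, cpuOnly

    var computeUnits: MLComputeUnits {
        switch self {
        case .gpuAll: return .all                     // assumption: GPU/ALL column means `.all`
        case .cpuAndANE: return .cpuAndNeuralEngine   // requires iOS 16 / macOS 13 or later
        case .cpuOnly: return .cpuOnly
        }
    }
}

/// Compiles an .mlpackage at runtime and loads it with the requested compute units.
@available(macOS 13.0, iOS 16.0, *)
func loadModel(at packageURL: URL, path: BenchmarkPath) async throws -> MLModel {
    let configuration = MLModelConfiguration()
    configuration.computeUnits = path.computeUnits
    let compiledURL = try await MLModel.compileModel(at: packageURL)
    return try MLModel(contentsOf: compiledURL, configuration: configuration)
}
```

Compute units are a request rather than a guarantee: CoreML may still place individual layers on a different device than the one selected.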

## Swift Usage

```swift
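import OCRCoreML

// `cgImage`, `regions`, `relationalRegionFeatures`, `originalQuads`, and
// `detectedRegionCount` are supplied by the app. The recognizer and relational
// inputs come from the non-neural post-processing described under Pipeline Boundary.
let pipeline = try OCRPipeline(computeUnits: .cpuAndGPU)

// Stage 1: detector on the full image.
let detectorPrediction = try pipeline.detect(image: cgImage)

// Stage 2: recognizer on the sampled regions, then greedy decoding of its logits.
let recognizerPrediction = try pipeline.recognize(regions: regions)
let decoded = try pipeline.recognizer.decode(
    logits: recognizerPrediction.output.logits,
    count: detectedRegionCount
)

// Stage 3: relational model, which consumes the recognizer's transformer features.
let relationalPrediction = try pipeline.relate(
    rectifiedQuads: relationalRegionFeatures,
    originalQuads: originalQuads,
    recognizerFeatures: recognizerPrediction.output.features,
    numValid: detectedRegionCount
)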
```

See the SwiftPM docs for exact app integration notes:
<https://github.com/mweinbach/OCRCoreML>

## License

The converted model weights inherit the
[NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
The upstream source code and helper scripts are licensed under Apache 2.0. See
`LICENSE` and `NOTICE`.