	---
	license: other
	license_name: nvidia-open-model-license
	license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
	base_model: nvidia/nemotron-ocr-v2
	library_name: coreml
	tags:
	- coreml
	- ocr
	- text-detection
	- text-recognition
	- apple-silicon
	- apple-neural-engine
	- swift
	---

	# Nemotron OCR v2 CoreML

	CoreML conversion of the English neural stages from
	[NVIDIA Nemotron OCR v2](https://huggingface.co/nvidia/nemotron-ocr-v2).

	SwiftPM package:
	[github.com/mweinbach/OCRCoreML](https://github.com/mweinbach/OCRCoreML)

	## Included Models

	| stage | file | input | outputs |
	|---|---|---|---|
	| detector | `DetectorGPUInt8_768.mlpackage` | `image: Float32[1, 3, 768, 768]` | `prob`, `rboxes`, `features` |
	| recognizer | `RecognizerFeaturesInt8.mlpackage` | `regions: Float32[128, 128, 8, 32]` | `logits`, `features` |
	| relational | `RelationalInt8.mlpackage` | rectified regions, original quads, recognizer features, valid count | `words`, `lines`, `line_log_var` |

	The recognizer emits transformer `features`; those are required by the
	relational model, so this bundle covers the full neural OCR pipeline rather
	than detector-only inference.
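
	The recognizer input has a fixed leading dimension of 128 and the relational
	model takes a valid count, which suggests that a variable number of regions is
	packed into a fixed-size batch. A minimal sketch of that packing follows. It
	assumes each region is a flat array of `128 * 8 * 32` floats and that unused
	slots can be zero-filled; neither is confirmed by the upstream model, and
	`padRegions` is a hypothetical helper, not part of the SwiftPM package.

	```swift
	// Hypothetical helper: pack a variable number of regions into the fixed
	// [128, 128, 8, 32] recognizer batch. Zero-padding is an assumption.
	let maxRegions = 128
	let regionSize = 128 * 8 * 32  // channels * height * width per region

	func padRegions(_ regions: [[Float]]) -> (batch: [Float], validCount: Int) {
	    // Regions beyond the fixed batch size are dropped in this sketch.
	    let validCount = min(regions.count, maxRegions)
	    var batch = [Float](repeating: 0, count: maxRegions * regionSize)
	    for (index, region) in regions.prefix(validCount).enumerated() {
	        precondition(region.count == regionSize, "unexpected region size")
	        let start = index * regionSize
	        batch.replaceSubrange(start..<(start + regionSize), with: region)
	    }
	    return (batch, validCount)
	}

	// Example: three regions filled with ones, padded up to 128 slots.
	let oneRegion = [Float](repeating: 1, count: regionSize)
	let packed = padRegions([oneRegion, oneRegion, oneRegion])
	print(packed.batch.count, packed.validCount)
	```

	The returned `validCount` is the value that would be passed along as the valid
	count, so later stages can ignore the padded slots.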

	## Pipeline Boundary

	The original Python package uses CUDA/C++ helpers for the non-neural stages:
	rotated-box NMS, `rboxes` to quads, quad rectification, feature-map grid
	sampling, sequence decoding, relation-graph decoding, and reading-order
	formatting. Those operations are not CoreML models. Apple apps integrating this
	bundle must port or replace those post-processing steps.
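
	As an illustration of one of these steps, the sketch below turns a rotated box
	into four quad corners. The layout of the `rboxes` tensor is not documented
	here, so the `RotatedBox` fields (center, size, angle in radians) are an
	assumption, and `Point`, `RotatedBox`, and `quad(from:)` are hypothetical names.

	```swift
	import Foundation

	struct Point { var x: Double; var y: Double }

	// Hypothetical rotated-box layout: center, size, and rotation in radians.
	// The real `rboxes` output may use a different parameterization.
	struct RotatedBox {
	    var cx: Double
	    var cy: Double
	    var width: Double
	    var height: Double
	    var angle: Double
	}

	func quad(from box: RotatedBox) -> [Point] {
	    let c = cos(box.angle)
	    let s = sin(box.angle)
	    let hw = box.width / 2
	    let hh = box.height / 2
	    // Corner offsets in the box's local frame. With angle 0 and y pointing
	    // down, the order is top-left, top-right, bottom-right, bottom-left.
	    let offsets: [(Double, Double)] = [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]
	    return offsets.map { offset in
	        // Rotate the offset by `angle`, then translate to the box center.
	        Point(x: box.cx + offset.0 * c - offset.1 * s,
	              y: box.cy + offset.0 * s + offset.1 * c)
	    }
	}

	let corners = quad(from: RotatedBox(cx: 10, cy: 20, width: 4, height: 2, angle: 0))
	print(corners.map { "(\($0.x), \($0.y))" })
	```

	Rotated-box NMS and quad rectification would build on the same corner
	representation, but they are not sketched here.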

	The linked SwiftPM package includes wrappers for all three CoreML models and a
	greedy recognizer decoder. It exposes raw tensors and does not claim complete
	image-to-text OCR until the geometric and graph post-processing is ported.
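
	For readers unfamiliar with greedy decoding, the idea is to take the
	highest-scoring class at each sequence step and map it to a character. The
	sketch below is not the package's decoder. It assumes logits arrive as
	`[steps][classes]`, that the first `specialCount` class indices are
	non-printing tokens, and that the remaining indices map onto the charset in
	order; the real conventions are defined by the upstream checkpoint.

	```swift
	// Hypothetical greedy decode: argmax over classes at each step.
	func greedyDecode(logits: [[Float]], charset: [Character], specialCount: Int = 1) -> String {
	    var text = ""
	    for step in logits {
	        // Index of the largest logit in this step; skip empty steps.
	        guard let best = step.indices.max(by: { step[$0] < step[$1] }) else { continue }
	        // Indices below `specialCount` are treated as non-printing (assumption).
	        if best >= specialCount, best - specialCount < charset.count {
	            text.append(charset[best - specialCount])
	        }
	    }
	    return text
	}

	let demoCharset: [Character] = ["a", "b", "c"]
	let demoLogits: [[Float]] = [
	    [0.1, 2.0, 0.3, 0.1],  // argmax 1 -> "a"
	    [3.0, 0.2, 0.1, 0.4],  // argmax 0 -> special token, skipped
	    [0.0, 0.1, 0.2, 5.0],  // argmax 3 -> "c"
	]
	let demoText = greedyDecode(logits: demoLogits, charset: demoCharset)
	print(demoText)  // prints "ac"
	```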

	## Files

	| file | purpose |
	|---|---|
	| `DetectorGPUInt8_768.mlpackage/` | detector CoreML package |
	| `RecognizerFeaturesInt8.mlpackage/` | recognizer CoreML package with logits and features |
	| `RelationalInt8.mlpackage/` | relational CoreML package |
	| `charset.txt` | English checkpoint charset |
	| `model_config.json` | English checkpoint config |
	| `configs/` | conversion configs used for the three packages |
	| `benchmarks/` | local CoreML benchmark results |
	| `parity/` | PyTorch-vs-CoreML parity reports |
	| `checksums.sha256` | SHA-256 checksums for package files |
	| `LICENSE`, `NOTICE` | license terms and redistribution notice |
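
	Downloaded files can be checked against `checksums.sha256` before loading. The
	format of that file is not shown here, so the sketch below assumes
	`sha256sum`-style lines (`<hex digest>  <relative path>`). It uses Apple's
	CryptoKit; `sha256Hex`, `parseChecksumLine`, and `verify` are hypothetical
	helper names.

	```swift
	import CryptoKit
	import Foundation

	// Lowercase hex SHA-256 digest of a blob of data.
	func sha256Hex(of data: Data) -> String {
	    SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
	}

	// Assumes sha256sum-style lines: "<hex digest>  <relative path>".
	func parseChecksumLine(_ line: String) -> (digest: String, path: String)? {
	    let parts = line.split(separator: " ", maxSplits: 1, omittingEmptySubsequences: true)
	    guard parts.count == 2 else { return nil }
	    return (String(parts[0]).lowercased(), parts[1].trimmingCharacters(in: .whitespaces))
	}

	// Returns true when the file named on the line hashes to the listed digest.
	func verify(checksumLine: String, baseURL: URL) throws -> Bool {
	    guard let entry = parseChecksumLine(checksumLine) else { return false }
	    let data = try Data(contentsOf: baseURL.appendingPathComponent(entry.path))
	    return sha256Hex(of: data) == entry.digest
	}
	```

	Reading each file fully into memory is fine for a sketch; large weight files
	may be better hashed incrementally.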

	## Performance

	Local median latencies after warmup:

	| stage | GPU/ALL median | CPU+ANE median | CPU median |
	|---|---:|---:|---:|
	| detector | 10.65 ms | 50.46 ms | 157.71 ms |
	| recognizer + features | 4.53 ms | 11.04 ms | 47.58 ms |
	| relational | 1.72 ms | 6.38 ms | 34.53 ms |

	The GPU/`ALL` compute-unit setting gives the lowest single-shot latency on the
	test machine. CPU+ANE is useful when the GPU needs to stay free for rendering
	or other workloads.
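
	To reproduce this kind of measurement on other hardware, a median-after-warmup
	timer can be sketched as follows. The warmup and iteration counts are
	placeholders, since the counts behind the numbers above are not stated, and
	`medianLatencyMs` and `median` are hypothetical helpers.

	```swift
	import Foundation

	// Median of a non-empty list of samples.
	func median(_ values: [Double]) -> Double {
	    let sorted = values.sorted()
	    let mid = sorted.count / 2
	    return sorted.count % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
	}

	// Runs `body` untimed for `warmup` iterations, then returns the median
	// wall-clock time of `iterations` timed runs, in milliseconds.
	func medianLatencyMs(warmup: Int = 5, iterations: Int = 50,
	                     _ body: () throws -> Void) rethrows -> Double {
	    precondition(iterations > 0, "need at least one timed run")
	    for _ in 0..<warmup { try body() }
	    var samples: [Double] = []
	    samples.reserveCapacity(iterations)
	    for _ in 0..<iterations {
	        let start = DispatchTime.now().uptimeNanoseconds
	        try body()
	        let end = DispatchTime.now().uptimeNanoseconds
	        samples.append(Double(end - start) / 1_000_000)
	    }
	    return median(samples)
	}
	```

	Wrapping a call such as `pipeline.detect(image:)` in the closure, once per
	compute-unit setting, would give figures comparable in kind to the table.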

	## Swift Usage

	```swift
	import OCRCoreML

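	// Create the pipeline on the chosen compute units and run the detector.
	let pipeline = try OCRPipeline(computeUnits: .cpuAndGPU)
	let detectorPrediction = try pipeline.detect(image: cgImage)

	// `regions`, `detectedRegionCount`, `relationalRegionFeatures`, and
	// `originalQuads` are produced by your own post-processing of the detector
	// output (NMS, quad conversion, rectification, grid sampling).
	let recognizerPrediction = try pipeline.recognize(regions: regions)
	let decoded = try pipeline.recognizer.decode(
	    logits: recognizerPrediction.output.logits,
	    count: detectedRegionCount
	)

	let relationalPrediction = try pipeline.relate(
	    rectifiedQuads: relationalRegionFeatures,
	    originalQuads: originalQuads,
	    recognizerFeatures: recognizerPrediction.output.features,
	    numValid: detectedRegionCount
	)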
	```

	See the SwiftPM docs for exact app integration notes:
	<https://github.com/mweinbach/OCRCoreML>

	## License

	The converted model weights inherit the
	[NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
	The upstream source code and helper scripts are licensed under Apache 2.0. See
	`LICENSE` and `NOTICE`.