johaness14/PP-OCR-PyTorch

Image-to-Text

johaness14 committed 22 days ago

Commit 5ba5a2f (verified) · 1 Parent(s): 4d9f515

Update README.md

Files changed (1)

README.md +10 -597

@@ -1,598 +1,11 @@
-# PP-OCR Native PyTorch
-Native PyTorch inference implementation for PP-OCRv5 text detection and text recognition.
-This project runs OCR end-to-end without PaddlePaddle runtime and without Hugging Face runtime for inference. The runtime uses native PyTorch `.pth` weights.
-## Important Note
-This repository, including the reverse engineering work, codebase generation, refactoring, and documentation, was produced with AI assistance using GPT-5.4 through Codex TUI.
-It has not yet gone through comprehensive human verification, formal human annotation, or full manual audit. Use it as an engineering baseline, not as a claim of production-grade equivalence to the original Paddle implementation.
-## Acknowledgements
-- PaddlePaddle: https://www.paddlepaddle.org.cn/
-- PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR
-This project is a native PyTorch reimplementation inspired by PaddleOCR model structure, inference behavior, and exported assets. PaddlePaddle and PaddleOCR remain the original upstream engines and references behind the model family.
-## Features
-- Native PyTorch detection:
-  - `PP-OCRv5_mobile_det`
-  - `PP-OCRv5_server_det`
-- Native PyTorch recognition:
-  - `PP-OCRv5_mobile_rec`
-  - `PP-OCRv5_server_rec`
-- End-to-end OCR pipeline:
-  - image -> detection -> crop -> recognition -> paragraph grouping -> text
-- Lazy preset-based weight download from Hugging Face
-- Generic post-processing
-- Full text output and structured OCR JSON output
-## Runtime Scope
-The active inference runtime does not depend on:
-- PaddlePaddle runtime
-- PaddleOCR runtime
-- Hugging Face runtime
-The runtime uses:
-- PyTorch
-- OpenCV
-- detection post-processing utilities
-## Supported Presets
-### `mobile`
-Uses:
-- detection: `PP-OCRv5_mobile_det_native.pth`
-- recognition: `PP-OCRv5_mobile_rec_native.pth`
-### `server`
-Uses:
-- detection: `PP-OCRv5_server_det_native.pth`
-- recognition: `PP-OCRv5_server_rec_native.pth`
-## Installation
-## Requirements
-- Python
-- PyTorch
-- dependencies from `requirements.txt`
-Main dependencies:
-- `torch`
-- `torchvision`
-- `opencv-python`
-- `safetensors`
-- `pyclipper`
-- `packaging`
-- `shapely`
-## Setup
-If you need a fresh environment:
-```powershell
-python -m venv venv
-venv\Scripts\activate
-pip install -r requirements.txt
-```
-If your environment already exists, just activate the project `venv`.
-## Weights Layout
-The default local layout is:
-```text
-weights/
-  ppocrv5_dict.txt
-  mobile/
-    PP-OCRv5_mobile_det_native.pth
-    PP-OCRv5_mobile_rec_native.pth
-    ...
-  server/
-    PP-OCRv5_server_det_native.pth
-    PP-OCRv5_server_rec_native.pth
-    ...
-```
-If required files are missing, the pipeline will try to download only the requested preset from:
-```text
-https://huggingface.co/johaness14/PP-OCR-PyTorch
-```
-Download behavior:
-- only the requested preset is downloaded
-- only missing files are downloaded
-- the dictionary file is also checked and downloaded when missing
-## Usage
-## Basic OCR
-```powershell
-venv\Scripts\python.exe ocr_pipeline.py --image path\to\image.jpg --preset server
-```
-Mobile preset:
-```powershell
-venv\Scripts\python.exe ocr_pipeline.py --image path\to\image.jpg --preset mobile
-```
-## Raw Text Output
-```powershell
-venv\Scripts\python.exe ocr_pipeline.py --image path\to\image.jpg --preset server --raw-text
-```
-## Save Plain Text
-```powershell
-venv\Scripts\python.exe ocr_pipeline.py --image path\to\image.jpg --preset server --output result.txt
-```
-## Save Full OCR JSON
-```powershell
-venv\Scripts\python.exe ocr_pipeline.py --image path\to\image.jpg --preset server --json-output result.json
-```
-## Override Weights Manually
-```powershell
-venv\Scripts\python.exe ocr_pipeline.py ^
-  --image path\to\image.jpg ^
-  --det-weights weights\server\PP-OCRv5_server_det_native.pth ^
-  --rec-weights weights\server\PP-OCRv5_server_rec_native.pth
-```
-## Override Local Weights Folder or Repo Source
-```powershell
-venv\Scripts\python.exe ocr_pipeline.py ^
-  --image path\to\image.jpg ^
-  --preset mobile ^
-  --weights-dir weights ^
-  --repo-id johaness14/PP-OCR-PyTorch ^
-  --revision main
-```
-## CLI Arguments
-`ocr_pipeline.py` supports:
-- `--image`
-  - required input image path
-- `--preset`
-  - `mobile` or `server`
-- `--det-weights`
-  - optional detection weights override
-- `--rec-weights`
-  - optional recognition weights override
-- `--dict-path`
-  - optional dictionary override
-- `--weights-dir`
-  - local weights root directory
-- `--repo-id`
-  - Hugging Face repository id
-- `--revision`
-  - branch, tag, or commit
-- `--raw-text`
-  - return text before generic post-processing
-- `--output`
-  - save final text to file
-- `--json-output`
-  - save full OCR payload to JSON
-## Example Output
-Plain text:
-```text
-Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
-```
-JSON output contains:
-- `detections`
-- `paragraphs`
-- `full_text`
-- `raw_full_text`
-## Project Structure
-```text
-ppocr_native/
-  common.py
-  weights.py
-  detection/
-    __init__.py
-    mobile.py
-    server.py
-    pipeline.py
-  recognition/
-    __init__.py
-    mobile.py
-    server.py
-    postprocess.py
-    pipeline.py
-weights/
-  ppocrv5_dict.txt
-  mobile/
-  server/
-ocr_pipeline.py
-requirements.txt
-README.md
-```
-## Developer Notes
-This section summarizes the runtime pipeline, architecture, layers, and modules used in the native PyTorch implementation.
-## Design Goals
-The codebase is organized around:
-- clean module boundaries
-- practical separation of detection and recognition
-- no unnecessary abstraction
-- minimal shared utilities
-- inference-first implementation
-The target is clean inference code, not a full training framework.
-## End-to-End Pipeline
-```text
-Input Image
-    |
-    v
-Detection
-    |
-    v
-Polygon Boxes
-    |
-    v
-Reading Order Sort
-    |
-    v
-Perspective Crop
-    |
-    v
-Recognition
-    |
-    v
-Generic Post-Processing
-    |
-    v
-Paragraph Grouping
-    |
-    v
-Full Text
-```
-Execution flow:
-1. `ocr_pipeline.py` resolves preset assets
-2. `load_detection_model(...)` loads the native detection model
-3. `load_recognition_model(...)` loads the native recognition model
-4. `OCRPipeline.predict(...)` runs detection then recognition
-5. recognized lines are grouped into paragraphs and `full_text`
-## Shared Modules
-### `ppocr_native/common.py`
-Shared runtime utilities:
-- `pad_same`
-- `SamePadMaxPool2d`
-- `ConvNormLayer`
-- `load_model_weights`
-This file is intentionally shared because both detection and recognition use it.
-### `ppocr_native/weights.py`
-Weight asset management:
-- preset manifest
-- local path resolution
-- Hugging Face download
-- lazy fetch for missing files only
-This module is responsible for assets, not inference.
-## Detection
-Detection code lives in:
-- `ppocr_native/detection/mobile.py`
-- `ppocr_native/detection/server.py`
-- `ppocr_native/detection/pipeline.py`
-### Detection Pipeline
-The detection pipeline:
-1. resizes the image using the inference policy
-2. normalizes the image
-3. runs model forward
-4. applies DB-style post-processing
-5. returns polygon boxes
-### Mobile Detection Architecture
-High-level topology:
-```text
-Input
-  -> PPLCNetV3 backbone (scale 0.75)
-  -> projection conv
-  -> RSEFPN
-  -> DBHead
-  -> text map
-```
-Main modules:
-- `MobileBackboneEncoder`
-- `PPLCNetV3Layer`
-- `RepBranch`
-- `MobileSEModule`
-- `RSEFPN`
-- `DBHead`
-Characteristics:
-- lightweight
-- narrow channels
-- efficient grouped/depthwise-style blocks
-- simple neck
-### Server Detection Architecture
-High-level topology:
-```text
-Input
-  -> StemBlock
-  -> PPHGNetV2_B4 backbone
-  -> LKPAN
-  -> PFHeadLocal
-  -> refined text map
-```
-Main modules:
-- `StemBlock`
-- `HGV2Stage`
-- `HGV2Block`
-- `LKPAN`
-- `IntraclassBlock`
-- `PFHeadLocal`
-Characteristics:
-- much wider backbone
-- heavier neck
-- stronger feature refinement
-- higher representational capacity
-### Detection Design Difference
-Mobile detection emphasizes:
-- efficiency
-- smaller parameter count
-- cheaper feature fusion
-Server detection emphasizes:
-- richer multi-scale representation
-- heavier fusion and refinement
-- stronger local detail recovery
-## Recognition
-Recognition code lives in:
-- `ppocr_native/recognition/mobile.py`
-- `ppocr_native/recognition/server.py`
-- `ppocr_native/recognition/postprocess.py`
-- `ppocr_native/recognition/pipeline.py`
-### Recognition Pipeline
-The recognition pipeline:
-1. receives text crops from detection
-2. resizes to height `48` with dynamic width
-3. runs recognition forward
-4. applies greedy CTC decode
-5. applies generic post-processing
-6. groups lines into paragraphs
-### Mobile Recognition Architecture
-High-level topology:
-```text
-Input crop
-  -> PPLCNetV3 text backbone (scale 0.95)
-  -> avg pool to height=1
-  -> CTCEncoder
-  -> Linear classifier
-  -> CTC decode
-```
-Main modules:
-- `MobileRecEncoder`
-- `PPLCNetV3Layer`
-- `RepBranch`
-- `MobileSEModule`
-- `CTCEncoder`
-- `SVTRBlock`
-- `CTCRecognitionHead`
-Characteristics:
-- efficient
-- asymmetric stride for OCR sequences
-- smaller feature width
-### Server Recognition Architecture
-High-level topology:
-```text
-Input crop
-  -> StemBlockRec
-  -> PPHGNetV2_B4 text backbone
-  -> avg pool to height=1
-  -> CTCEncoder
-  -> Linear classifier
-  -> CTC decode
-```
-Main modules:
-- `StemBlockRec`
-- `ServerRecBackbone`
-- `HGV2Stage`
-- `HGV2Block`
-- `CTCEncoder`
-- `SVTRBlock`
-- `CTCRecognitionHead`
-Characteristics:
-- much wider backbone
-- richer visual features
-- heavier encoder input
-### Recognition Post-Processing
-`ppocr_native/recognition/postprocess.py` is intentionally generic.
-It performs:
-- whitespace cleanup
-- spacing normalization around punctuation
-- light casing normalization for clearly inconsistent tokens
-It does not perform:
-- domain-specific lexicon correction
-- language-model correction
-- strict heuristic rewriting tied to one document type
-### Layout Grouping
-After recognition, the pipeline also builds:
-- line list
-- paragraph list
-- `full_text`
-- `raw_full_text`
-Grouping is heuristic-based and uses:
-- reading order
-- box position
-- vertical gap
-- left-indent consistency
-This is suitable for ordinary single-column documents, but it is not designed for complex layouts such as:
-- multi-column pages
-- tables
-- magazines
-- heavily structured forms
-## Mobile vs Server
-In short:
-### Mobile
-- smaller
-- faster
-- more efficient
-- better for constrained environments
-### Server
-- larger
-- richer feature representation
-- heavier compute cost
-- better when capacity matters more than size
-## Parameter Summary
-Native model parameter counts:
-| Model | Parameters |
-|---|---:|
-| Mobile Detection | 3,547,457 |
-| Server Detection | 21,979,682 |
-| Mobile Recognition | 7,752,589 |
-| Server Recognition | 21,094,553 |
-Pipeline totals:
-| Pipeline | Parameters |
-|---|---:|
-| Mobile OCR total | 11,300,046 |
-| Server OCR total | 43,074,235 |
-## Why Server Is Larger
-The server preset is larger because it uses:
-- wider backbones
-- much larger stage channel widths
-- heavier detection necks
-- richer refinement modules
-- larger recognition feature representations before classification
-So the gap is not only the number of layers. It also comes from:
-- layer type
-- channel width
-- kernel size
-- fusion complexity
-- refinement complexity
-## Practical Notes
-- The active runtime does not require the `PaddleOCR` folder.
-- The project can run from `ocr_pipeline.py` directly.
-- Detection and recognition are intentionally split for maintainability.
-- The codebase is intended for native PyTorch inference, not full training.
-## Limitations
-- Paragraph grouping is heuristic-based.
-- It is not tuned for complex document layouts.
-- Post-processing is intentionally generic, so OCR typos are not force-corrected.
-- Asset download requires network access to the configured Hugging Face repository.
-- The repository has not yet received full human verification.

+---
+base_model:
+- PaddlePaddle/PP-OCRv5_server_det_safetensors
+- PaddlePaddle/PP-OCRv5_server_rec_safetensors
+- PaddlePaddle/PP-OCRv5_mobile_det_safetensors
+- PaddlePaddle/PP-OCRv5_mobile_rec_safetensors
+pipeline_tag: image-to-text
+---
+# PP-OCR Native PyTorch
+This repository contains only the weights. To run them, see the code at [JohanesSetiawan/pp-ocr-pytorch](https://github.com/JohanesSetiawan/pp-ocr-pytorch).
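
Since this repository now holds only the weights, fetching them for one preset might look roughly like the sketch below. It follows the "only the requested preset, only missing files" behavior described in the removed README. The file names are taken from that README's local weights layout; that the Hub repository stores them at the same relative paths is an assumption, and `PRESET_FILES`, `missing_files`, and `fetch_preset` are hypothetical names, not the project's own code. It relies on `hf_hub_download` from the `huggingface_hub` package.

```python
from pathlib import Path

# Hypothetical manifest. File names come from the removed README's
# "Weights Layout"; the matching paths inside the Hub repo are an assumption.
PRESET_FILES = {
    "mobile": [
        "mobile/PP-OCRv5_mobile_det_native.pth",
        "mobile/PP-OCRv5_mobile_rec_native.pth",
    ],
    "server": [
        "server/PP-OCRv5_server_det_native.pth",
        "server/PP-OCRv5_server_rec_native.pth",
    ],
}
DICT_FILE = "ppocrv5_dict.txt"


def missing_files(preset, weights_dir="weights"):
    """Return the preset files (plus the dictionary) not yet present locally."""
    root = Path(weights_dir)
    wanted = PRESET_FILES[preset] + [DICT_FILE]
    return [name for name in wanted if not (root / name).exists()]


def fetch_preset(preset, weights_dir="weights",
                 repo_id="johaness14/PP-OCR-PyTorch", revision="main"):
    """Download only the missing files of one preset into weights_dir."""
    # Imported here so the helper above works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    for name in missing_files(preset, weights_dir):
        # With local_dir set, the file is written to <weights_dir>/<name>.
        hf_hub_download(repo_id=repo_id, filename=name,
                        revision=revision, local_dir=weights_dir)
```

Calling `fetch_preset("mobile")` would then request at most three files and skip any that already exist under `weights/`.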