johaness14 committed · Commit 5ba5a2f (verified) · 1 Parent(s): 4d9f515

Update README.md

Files changed (1)
  1. README.md +10 -597
README.md CHANGED
@@ -1,598 +1,11 @@
1
- # PP-OCR Native PyTorch
2
-
3
- Native PyTorch inference implementation for PP-OCRv5 text detection and text recognition.
4
-
5
- This project runs OCR end-to-end without PaddlePaddle runtime and without Hugging Face runtime for inference. The runtime uses native PyTorch `.pth` weights.
6
-
7
- ## Important Note
8
-
9
- This repository, including the reverse engineering work, codebase generation, refactoring, and documentation, was produced with AI assistance using GPT-5.4 through Codex TUI.
10
-
11
- It has not yet gone through comprehensive human verification, formal human annotation, or full manual audit. Use it as an engineering baseline, not as a claim of production-grade equivalence to the original Paddle implementation.
12
-
13
- ## Acknowledgements
14
-
15
- - PaddlePaddle: https://www.paddlepaddle.org.cn/
16
- - PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR
17
-
18
- This project is a native PyTorch reimplementation inspired by PaddleOCR model structure, inference behavior, and exported assets. PaddlePaddle and PaddleOCR remain the original upstream engines and references behind the model family.
19
-
20
- ## Features
21
-
22
- - Native PyTorch detection:
23
- - `PP-OCRv5_mobile_det`
24
- - `PP-OCRv5_server_det`
25
- - Native PyTorch recognition:
26
- - `PP-OCRv5_mobile_rec`
27
- - `PP-OCRv5_server_rec`
28
- - End-to-end OCR pipeline:
29
- - image -> detection -> crop -> recognition -> paragraph grouping -> text
30
- - Lazy preset-based weight download from Hugging Face
31
- - Generic post-processing
32
- - Full text output and structured OCR JSON output
33
-
34
- ## Runtime Scope
35
-
36
- The active inference runtime does not depend on:
37
-
38
- - PaddlePaddle runtime
39
- - PaddleOCR runtime
40
- - Hugging Face runtime
41
-
42
- The runtime uses:
43
-
44
- - PyTorch
45
- - OpenCV
46
- - detection post-processing utilities
47
-
48
- ## Supported Presets
49
-
50
- ### `mobile`
51
-
52
- Uses:
53
-
54
- - detection: `PP-OCRv5_mobile_det_native.pth`
55
- - recognition: `PP-OCRv5_mobile_rec_native.pth`
56
-
57
- ### `server`
58
-
59
- Uses:
60
-
61
- - detection: `PP-OCRv5_server_det_native.pth`
62
- - recognition: `PP-OCRv5_server_rec_native.pth`
63
-
64
- ## Installation
65
-
66
- ## Requirements
67
-
68
- - Python
69
- - PyTorch
70
- - dependencies from `requirements.txt`
71
-
72
- Main dependencies:
73
-
74
- - `torch`
75
- - `torchvision`
76
- - `opencv-python`
77
- - `safetensors`
78
- - `pyclipper`
79
- - `packaging`
80
- - `shapely`
81
-
82
- ## Setup
83
-
84
- If you need a fresh environment:
85
-
86
- ```powershell
87
- python -m venv venv
88
- venv\Scripts\activate
89
- pip install -r requirements.txt
90
- ```
91
-
92
- If your environment already exists, just activate the project `venv`.
93
-
94
- ## Weights Layout
95
-
96
- The default local layout is:
97
-
98
- ```text
99
- weights/
100
- ppocrv5_dict.txt
101
- mobile/
102
- PP-OCRv5_mobile_det_native.pth
103
- PP-OCRv5_mobile_rec_native.pth
104
- ...
105
- server/
106
- PP-OCRv5_server_det_native.pth
107
- PP-OCRv5_server_rec_native.pth
108
- ...
109
- ```
110
-
111
- If required files are missing, the pipeline will try to download only the requested preset from:
112
-
113
- ```text
114
- https://huggingface.co/johaness14/PP-OCR-PyTorch
115
- ```
116
-
117
- Download behavior:
118
-
119
- - only the requested preset is downloaded
120
- - only missing files are downloaded
121
- - the dictionary file is also checked and downloaded when missing
122
-
123
- ## Usage
124
-
125
- ## Basic OCR
126
-
127
- ```powershell
128
- venv\Scripts\python.exe ocr_pipeline.py --image path\to\image.jpg --preset server
129
- ```
130
-
131
- Mobile preset:
132
-
133
- ```powershell
134
- venv\Scripts\python.exe ocr_pipeline.py --image path\to\image.jpg --preset mobile
135
- ```
136
-
137
- ## Raw Text Output
138
-
139
- ```powershell
140
- venv\Scripts\python.exe ocr_pipeline.py --image path\to\image.jpg --preset server --raw-text
141
- ```
142
-
143
- ## Save Plain Text
144
-
145
- ```powershell
146
- venv\Scripts\python.exe ocr_pipeline.py --image path\to\image.jpg --preset server --output result.txt
147
- ```
148
-
149
- ## Save Full OCR JSON
150
-
151
- ```powershell
152
- venv\Scripts\python.exe ocr_pipeline.py --image path\to\image.jpg --preset server --json-output result.json
153
- ```
154
-
155
- ## Override Weights Manually
156
-
157
- ```powershell
158
- venv\Scripts\python.exe ocr_pipeline.py ^
159
- --image path\to\image.jpg ^
160
- --det-weights weights\server\PP-OCRv5_server_det_native.pth ^
161
- --rec-weights weights\server\PP-OCRv5_server_rec_native.pth
162
- ```
163
-
164
- ## Override Local Weights Folder or Repo Source
165
-
166
- ```powershell
167
- venv\Scripts\python.exe ocr_pipeline.py ^
168
- --image path\to\image.jpg ^
169
- --preset mobile ^
170
- --weights-dir weights ^
171
- --repo-id johaness14/PP-OCR-PyTorch ^
172
- --revision main
173
- ```
174
-
175
- ## CLI Arguments
176
-
177
- `ocr_pipeline.py` supports:
178
-
179
- - `--image`
180
- - required input image path
181
- - `--preset`
182
- - `mobile` or `server`
183
- - `--det-weights`
184
- - optional detection weights override
185
- - `--rec-weights`
186
- - optional recognition weights override
187
- - `--dict-path`
188
- - optional dictionary override
189
- - `--weights-dir`
190
- - local weights root directory
191
- - `--repo-id`
192
- - Hugging Face repository id
193
- - `--revision`
194
- - branch, tag, or commit
195
- - `--raw-text`
196
- - return text before generic post-processing
197
- - `--output`
198
- - save final text to file
199
- - `--json-output`
200
- - save full OCR payload to JSON
201
-
202
- ## Example Output
203
-
204
- Plain text:
205
-
206
- ```text
207
- Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
208
- ```
209
-
210
- JSON output contains:
211
-
212
- - `detections`
213
- - `paragraphs`
214
- - `full_text`
215
- - `raw_full_text`
216
-
217
- ## Project Structure
218
-
219
- ```text
220
- ppocr_native/
221
- common.py
222
- weights.py
223
- detection/
224
- __init__.py
225
- mobile.py
226
- server.py
227
- pipeline.py
228
- recognition/
229
- __init__.py
230
- mobile.py
231
- server.py
232
- postprocess.py
233
- pipeline.py
234
-
235
- weights/
236
- ppocrv5_dict.txt
237
- mobile/
238
- server/
239
-
240
- ocr_pipeline.py
241
- requirements.txt
242
- README.md
243
- ```
244
-
245
- ## Developer Notes
246
-
247
- This section summarizes the runtime pipeline, architecture, layers, and modules used in the native PyTorch implementation.
248
-
249
- ## Design Goals
250
-
251
- The codebase is organized around:
252
-
253
- - clean module boundaries
254
- - practical separation of detection and recognition
255
- - no unnecessary abstraction
256
- - minimal shared utilities
257
- - inference-first implementation
258
-
259
- The target is clean inference code, not a full training framework.
260
-
261
- ## End-to-End Pipeline
262
-
263
- ```text
264
- Input Image
265
- |
266
- v
267
- Detection
268
- |
269
- v
270
- Polygon Boxes
271
- |
272
- v
273
- Reading Order Sort
274
- |
275
- v
276
- Perspective Crop
277
- |
278
- v
279
- Recognition
280
- |
281
- v
282
- Generic Post-Processing
283
- |
284
- v
285
- Paragraph Grouping
286
- |
287
- v
288
- Full Text
289
- ```
290
-
291
- Execution flow:
292
-
293
- 1. `ocr_pipeline.py` resolves preset assets
294
- 2. `load_detection_model(...)` loads the native detection model
295
- 3. `load_recognition_model(...)` loads the native recognition model
296
- 4. `OCRPipeline.predict(...)` runs detection then recognition
297
- 5. recognized lines are grouped into paragraphs and `full_text`
298
-
299
- ## Shared Modules
300
-
301
- ### `ppocr_native/common.py`
302
-
303
- Shared runtime utilities:
304
-
305
- - `pad_same`
306
- - `SamePadMaxPool2d`
307
- - `ConvNormLayer`
308
- - `load_model_weights`
309
-
310
- This file is intentionally shared because both detection and recognition use it.
311
-
312
- ### `ppocr_native/weights.py`
313
 
314
- Weight asset management:
315
-
316
- - preset manifest
317
- - local path resolution
318
- - Hugging Face download
319
- - lazy fetch for missing files only
320
-
321
- This module is responsible for assets, not inference.
322
-
323
- ## Detection
324
-
325
- Detection code lives in:
326
-
327
- - `ppocr_native/detection/mobile.py`
328
- - `ppocr_native/detection/server.py`
329
- - `ppocr_native/detection/pipeline.py`
330
-
331
- ### Detection Pipeline
332
-
333
- The detection pipeline:
334
-
335
- 1. resizes the image using the inference policy
336
- 2. normalizes the image
337
- 3. runs model forward
338
- 4. applies DB-style post-processing
339
- 5. returns polygon boxes
340
-
341
- ### Mobile Detection Architecture
342
-
343
- High-level topology:
344
-
345
- ```text
346
- Input
347
- -> PPLCNetV3 backbone (scale 0.75)
348
- -> projection conv
349
- -> RSEFPN
350
- -> DBHead
351
- -> text map
352
- ```
353
-
354
- Main modules:
355
-
356
- - `MobileBackboneEncoder`
357
- - `PPLCNetV3Layer`
358
- - `RepBranch`
359
- - `MobileSEModule`
360
- - `RSEFPN`
361
- - `DBHead`
362
-
363
- Characteristics:
364
-
365
- - lightweight
366
- - narrow channels
367
- - efficient grouped/depthwise-style blocks
368
- - simple neck
369
-
370
- ### Server Detection Architecture
371
-
372
- High-level topology:
373
-
374
- ```text
375
- Input
376
- -> StemBlock
377
- -> PPHGNetV2_B4 backbone
378
- -> LKPAN
379
- -> PFHeadLocal
380
- -> refined text map
381
- ```
382
-
383
- Main modules:
384
-
385
- - `StemBlock`
386
- - `HGV2Stage`
387
- - `HGV2Block`
388
- - `LKPAN`
389
- - `IntraclassBlock`
390
- - `PFHeadLocal`
391
-
392
- Characteristics:
393
-
394
- - much wider backbone
395
- - heavier neck
396
- - stronger feature refinement
397
- - higher representational capacity
398
-
399
- ### Detection Design Difference
400
-
401
- Mobile detection emphasizes:
402
-
403
- - efficiency
404
- - smaller parameter count
405
- - cheaper feature fusion
406
-
407
- Server detection emphasizes:
408
-
409
- - richer multi-scale representation
410
- - heavier fusion and refinement
411
- - stronger local detail recovery
412
-
413
- ## Recognition
414
-
415
- Recognition code lives in:
416
-
417
- - `ppocr_native/recognition/mobile.py`
418
- - `ppocr_native/recognition/server.py`
419
- - `ppocr_native/recognition/postprocess.py`
420
- - `ppocr_native/recognition/pipeline.py`
421
-
422
- ### Recognition Pipeline
423
-
424
- The recognition pipeline:
425
-
426
- 1. receives text crops from detection
427
- 2. resizes to height `48` with dynamic width
428
- 3. runs recognition forward
429
- 4. applies greedy CTC decode
430
- 5. applies generic post-processing
431
- 6. groups lines into paragraphs
432
-
433
- ### Mobile Recognition Architecture
434
-
435
- High-level topology:
436
-
437
- ```text
438
- Input crop
439
- -> PPLCNetV3 text backbone (scale 0.95)
440
- -> avg pool to height=1
441
- -> CTCEncoder
442
- -> Linear classifier
443
- -> CTC decode
444
- ```
445
-
446
- Main modules:
447
-
448
- - `MobileRecEncoder`
449
- - `PPLCNetV3Layer`
450
- - `RepBranch`
451
- - `MobileSEModule`
452
- - `CTCEncoder`
453
- - `SVTRBlock`
454
- - `CTCRecognitionHead`
455
-
456
- Characteristics:
457
-
458
- - efficient
459
- - asymmetric stride for OCR sequences
460
- - smaller feature width
461
-
462
- ### Server Recognition Architecture
463
-
464
- High-level topology:
465
-
466
- ```text
467
- Input crop
468
- -> StemBlockRec
469
- -> PPHGNetV2_B4 text backbone
470
- -> avg pool to height=1
471
- -> CTCEncoder
472
- -> Linear classifier
473
- -> CTC decode
474
- ```
475
-
476
- Main modules:
477
-
478
- - `StemBlockRec`
479
- - `ServerRecBackbone`
480
- - `HGV2Stage`
481
- - `HGV2Block`
482
- - `CTCEncoder`
483
- - `SVTRBlock`
484
- - `CTCRecognitionHead`
485
-
486
- Characteristics:
487
-
488
- - much wider backbone
489
- - richer visual features
490
- - heavier encoder input
491
-
492
- ### Recognition Post-Processing
493
-
494
- `ppocr_native/recognition/postprocess.py` is intentionally generic.
495
-
496
- It performs:
497
-
498
- - whitespace cleanup
499
- - spacing normalization around punctuation
500
- - light casing normalization for clearly inconsistent tokens
501
-
502
- It does not perform:
503
-
504
- - domain-specific lexicon correction
505
- - language-model correction
506
- - strict heuristic rewriting tied to one document type
507
-
508
- ### Layout Grouping
509
-
510
- After recognition, the pipeline also builds:
511
-
512
- - line list
513
- - paragraph list
514
- - `full_text`
515
- - `raw_full_text`
516
-
517
- Grouping is heuristic-based and uses:
518
-
519
- - reading order
520
- - box position
521
- - vertical gap
522
- - left-indent consistency
523
-
524
- This is suitable for ordinary single-column documents, but it is not designed for complex layouts such as:
525
-
526
- - multi-column pages
527
- - tables
528
- - magazines
529
- - heavily structured forms
530
-
531
- ## Mobile vs Server
532
-
533
- In short:
534
-
535
- ### Mobile
536
-
537
- - smaller
538
- - faster
539
- - more efficient
540
- - better for constrained environments
541
-
542
- ### Server
543
-
544
- - larger
545
- - richer feature representation
546
- - heavier compute cost
547
- - better when capacity matters more than size
548
-
549
- ## Parameter Summary
550
-
551
- Native model parameter counts:
552
-
553
- | Model | Parameters |
554
- |---|---:|
555
- | Mobile Detection | 3,547,457 |
556
- | Server Detection | 21,979,682 |
557
- | Mobile Recognition | 7,752,589 |
558
- | Server Recognition | 21,094,553 |
559
-
560
- Pipeline totals:
561
-
562
- | Pipeline | Parameters |
563
- |---|---:|
564
- | Mobile OCR total | 11,300,046 |
565
- | Server OCR total | 43,074,235 |
566
-
567
- ## Why Server Is Larger
568
-
569
- The server preset is larger because it uses:
570
-
571
- - wider backbones
572
- - much larger stage channel widths
573
- - heavier detection necks
574
- - richer refinement modules
575
- - larger recognition feature representations before classification
576
-
577
- So the gap is not only the number of layers. It also comes from:
578
-
579
- - layer type
580
- - channel width
581
- - kernel size
582
- - fusion complexity
583
- - refinement complexity
584
-
585
- ## Practical Notes
586
-
587
- - The active runtime does not require the `PaddleOCR` folder.
588
- - The project can run from `ocr_pipeline.py` directly.
589
- - Detection and recognition are intentionally split for maintainability.
590
- - The codebase is intended for native PyTorch inference, not full training.
591
-
592
- ## Limitations
593
-
594
- - Paragraph grouping is heuristic-based.
595
- - It is not tuned for complex document layouts.
596
- - Post-processing is intentionally generic, so OCR typos are not force-corrected.
597
- - Asset download requires network access to the configured Hugging Face repository.
598
- - The repository has not yet received full human verification.
 
1
+ ---
2
+ base_model:
3
+ - PaddlePaddle/PP-OCRv5_server_det_safetensors
4
+ - PaddlePaddle/PP-OCRv5_server_rec_safetensors
5
+ - PaddlePaddle/PP-OCRv5_mobile_det_safetensors
6
+ - PaddlePaddle/PP-OCRv5_mobile_rec_safetensors
7
+ pipeline_tag: image-to-text
8
+ ---
9
 
10
+ # PP-OCR Native PyTorch
11
+ This repository contains only the weights. To run them, use the inference code at [JohanesSetiawan/pp-ocr-pytorch](https://github.com/JohanesSetiawan/pp-ocr-pytorch).
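
Since this repository now ships only the weights, fetching them is the first step before running the linked inference code. The sketch below is one possible way to download just the files a preset needs, following the download behavior described in the removed README (only the requested preset, only missing files, plus the dictionary). The relative paths in `PRESET_FILES` and the helper names `missing_files` and `fetch_missing` are assumptions for illustration: the file names come from the old README's local layout, but the repository may store them under different paths. `hf_hub_download` is the `huggingface_hub` function for fetching a single file.

```python
from pathlib import Path

# Assumed relative paths, taken from the local layout in the removed README.
# The Hub repository may organise its files differently.
DICT_FILE = "ppocrv5_dict.txt"
PRESET_FILES = {
    "mobile": [
        "mobile/PP-OCRv5_mobile_det_native.pth",
        "mobile/PP-OCRv5_mobile_rec_native.pth",
    ],
    "server": [
        "server/PP-OCRv5_server_det_native.pth",
        "server/PP-OCRv5_server_rec_native.pth",
    ],
}


def missing_files(preset: str, weights_dir: str = "weights") -> list[str]:
    """Return the relative paths for `preset` that are not on disk yet."""
    if preset not in PRESET_FILES:
        raise ValueError(f"unknown preset: {preset!r}")
    root = Path(weights_dir)
    wanted = [DICT_FILE, *PRESET_FILES[preset]]
    return [rel for rel in wanted if not (root / rel).is_file()]


def fetch_missing(
    preset: str,
    weights_dir: str = "weights",
    repo_id: str = "johaness14/PP-OCR-PyTorch",
    revision: str = "main",
) -> None:
    """Download only the files that `missing_files` reports (hypothetical helper)."""
    # Imported lazily so the path logic above works without the package installed.
    from huggingface_hub import hf_hub_download

    for rel in missing_files(preset, weights_dir):
        # With local_dir set, the file is written to weights_dir/rel.
        hf_hub_download(
            repo_id=repo_id,
            filename=rel,
            revision=revision,
            local_dir=weights_dir,
        )
```

Under these assumptions, `fetch_missing("mobile")` would download at most three files into `weights/`, and a second call would download nothing once they are present.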