Commit 30c683d (verified), committed by welcomyou
1 parent: 7bd37f2

docs: model card

Files changed (1)
  1. README.md +46 -0
README.md ADDED
@@ -0,0 +1,46 @@
+ ---
+ library_name: lightgbm
+ pipeline_tag: tabular-classification
+ license: mit
+ language:
+ - vi
+ tags:
+ - lightgbm
+ - classification
+ - vietnamese
+ - document-segmentation
+ - page-splitting
+ ---
+
+ # LightGBM page splitter — Vietnamese admin batches
+
+ Two LightGBM Booster models used by [ScanIndex](https://github.com/welcomyou/scanindex) to split a multi-document scan batch into individual document boundaries:
+
+ | Model | Task |
+ |---|---|
+ | `lightgbm_splitter/doc_start/model.txt` | Binary: is this page the **start** of a new document? |
+ | `lightgbm_splitter/signer_page/model.txt` | Per-document: which page contains the signer block? |
+
+ ## Files
+
+ - `lightgbm_splitter/doc_start/model.txt` — LightGBM Booster (text format, portable)
+ - `lightgbm_splitter/doc_start/model.joblib` — sklearn wrapper (optional)
+ - `lightgbm_splitter/signer_page/model.txt`
+ - `lightgbm_splitter/signer_page/model.joblib`
+
+ ## Features
+
+ Page-level features extracted from canonical OCR JSON: header/footer signals, regime presence, signer-block markers, page index, relative position, etc. See `build_doc_start_features` and the `predict_*` helpers in [scanindex/core/digitization/page_splitter.py](https://github.com/welcomyou/scanindex/blob/main/scanindex/core/digitization/page_splitter.py).
+
+ ## Loading
+
+ ```python
+ import lightgbm as lgb
+ from huggingface_hub import snapshot_download
+ local = snapshot_download("welcomyou/lightgbm-vn-page-splitter", local_dir="models")
+ booster = lgb.Booster(model_file=f"{local}/lightgbm_splitter/doc_start/model.txt")
+ ```
+
+ ## License
+
+ MIT.
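
Note on the card: it describes what the two models predict but not how per-page scores become document boundaries. The sketch below is one plausible post-processing step, not the ScanIndex implementation. The function names `split_pages` and `pick_signer_page`, the 0.5 threshold, the rule that page 0 always opens a document, and the per-document argmax for the signer page are all assumptions made for illustration. The scores are made up.

```python
from typing import List, Sequence, Tuple


def split_pages(start_probs: Sequence[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group pages into (first_page, last_page) ranges, 0-indexed and inclusive.

    start_probs[i] is the assumed probability that page i starts a new document.
    Page 0 always opens a document, whatever its score (an assumption).
    """
    if not start_probs:
        return []
    # Every page at or above the threshold opens a new document.
    starts = [0] + [i for i, p in enumerate(start_probs) if i > 0 and p >= threshold]
    # Each document ends on the page before the next start; the last ends the batch.
    ends = [s - 1 for s in starts[1:]] + [len(start_probs) - 1]
    return list(zip(starts, ends))


def pick_signer_page(signer_scores: Sequence[float], doc: Tuple[int, int]) -> int:
    """Return the page index inside `doc` with the highest signer score (assumed rule)."""
    first, last = doc
    return max(range(first, last + 1), key=lambda i: signer_scores[i])


# Made-up scores for a 6-page batch. In practice they might come from calling
# booster.predict(...) on the feature rows of each model.
start_probs = [0.97, 0.03, 0.10, 0.88, 0.05, 0.61]
signer_scores = [0.02, 0.10, 0.91, 0.05, 0.77, 0.40]

docs = split_pages(start_probs)
signers = [pick_signer_page(signer_scores, d) for d in docs]
print(docs)     # [(0, 2), (3, 4), (5, 5)]
print(signers)  # [2, 4, 5]
```

The threshold trades over-splitting against under-splitting, so it would need tuning on labelled batches rather than being fixed at 0.5.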