---
library_name: lightgbm
pipeline_tag: tabular-classification
license: mit
language:
- vi
tags:
- lightgbm
- classification
- vietnamese
- document-segmentation
- page-splitting
---

# LightGBM page splitter for Vietnamese admin batches

Two LightGBM Booster models used by [ScanIndex](https://github.com/welcomyou/scanindex) to split a multi-document scan batch into individual documents:

| Model | Task |
|---|---|
| `lightgbm_splitter/doc_start/model.txt` | Binary: is this page the **start** of a new document? |
| `lightgbm_splitter/signer_page/model.txt` | Per-document: which page contains the signer block? |
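
The two tasks in the table can be combined roughly as follows. This is a minimal sketch, not the ScanIndex implementation: the helper names `split_batch` and `pick_signer_pages`, the 0.5 threshold, the rule that the first page always opens a document, and the highest-score rule for the signer page are all assumptions made for illustration. The scores are made-up numbers, not real model output.

```python
from typing import List, Sequence, Tuple


def split_batch(start_probs: Sequence[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Turn per-page doc-start scores into (first_page, last_page) ranges.

    Page indices are 0-based and inclusive. Page 0 always opens a document,
    whatever its score (an assumption of this sketch).
    """
    if len(start_probs) == 0:
        return []
    # A page opens a new document when its score reaches the threshold.
    starts = [0] + [i for i, p in enumerate(start_probs) if i > 0 and p >= threshold]
    # Each document ends on the page before the next start; the last one ends the batch.
    ends = [s - 1 for s in starts[1:]] + [len(start_probs) - 1]
    return list(zip(starts, ends))


def pick_signer_pages(signer_scores: Sequence[float], docs: Sequence[Tuple[int, int]]) -> List[int]:
    """For each document range, return the page index with the highest signer score."""
    picks = []
    for first, last in docs:
        pages = range(first, last + 1)
        picks.append(max(pages, key=lambda i: signer_scores[i]))
    return picks


# Illustrative scores only, not real model output.
start_probs = [0.97, 0.03, 0.10, 0.88, 0.05, 0.91]
signer_scores = [0.10, 0.20, 0.85, 0.30, 0.75, 0.60]

docs = split_batch(start_probs)
signers = pick_signer_pages(signer_scores, docs)
print(docs)     # [(0, 2), (3, 4), (5, 5)]
print(signers)  # [2, 4, 5]
```

The threshold is the main knob here: lowering it splits more aggressively, raising it merges neighbouring documents.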

## Files

- `lightgbm_splitter/doc_start/model.txt`: LightGBM Booster (text format, portable)
- `lightgbm_splitter/doc_start/model.joblib`: sklearn wrapper (optional)
- `lightgbm_splitter/signer_page/model.txt`: LightGBM Booster (text format, portable)
- `lightgbm_splitter/signer_page/model.joblib`: sklearn wrapper (optional)

## Features

The models use page-level features extracted from canonical OCR JSON, such as header/footer signals, regime presence, signer-block markers, page index, and relative position. For the exact feature definitions, see `build_doc_start_features` and the `predict_*` helpers in [scanindex/core/digitization/page_splitter.py](https://github.com/welcomyou/scanindex/blob/main/scanindex/core/digitization/page_splitter.py).
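
Because a Booster scores rows by column position, per-page feature values have to be arranged in the order the model was trained with. The sketch below shows one way to do that alignment. The helper `to_matrix` and the three feature names are hypothetical and chosen only for illustration; the real names and their order are defined by `build_doc_start_features` and can be read from a loaded model with `booster.feature_name()`.

```python
from typing import Dict, List, Sequence


def to_matrix(pages: Sequence[Dict[str, float]], feature_names: Sequence[str]) -> List[List[float]]:
    """Order each page's feature dict into a row that follows feature_names."""
    rows = []
    for n, page in enumerate(pages):
        # Fail loudly instead of silently filling a default for a missing feature.
        missing = [name for name in feature_names if name not in page]
        if missing:
            raise KeyError(f"page {n} is missing features: {missing}")
        rows.append([float(page[name]) for name in feature_names])
    return rows


# Hypothetical feature names, for illustration only. With a loaded model,
# the real column order would come from booster.feature_name().
feature_names = ["page_index", "relative_position", "has_signer_marker"]
pages = [
    {"page_index": 0, "relative_position": 0.0, "has_signer_marker": 0},
    {"has_signer_marker": 1, "page_index": 1, "relative_position": 1.0},
]

X = to_matrix(pages, feature_names)
print(X)  # [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
```

With a loaded Booster, passing `numpy.asarray(X)` to `booster.predict` would then return one score per page.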

## Loading

```python
import lightgbm as lgb
from huggingface_hub import snapshot_download

# Download the repository snapshot into ./models and get its local path.
local = snapshot_download("welcomyou/lightgbm-vn-page-splitter", local_dir="models")

# Load the doc-start model from the portable text format.
booster = lgb.Booster(model_file=f"{local}/lightgbm_splitter/doc_start/model.txt")
```

## License

MIT.