---
library_name: lightgbm
pipeline_tag: tabular-classification
license: mit
language:
- vi
tags:
- lightgbm
- classification
- vietnamese
- document-segmentation
- page-splitting
---

# LightGBM page splitter: Vietnamese admin batches

Two LightGBM Booster models used by [ScanIndex](https://github.com/welcomyou/scanindex) to split a multi-document scan batch into individual documents:

| Model | Task |
|---|---|
| `lightgbm_splitter/doc_start/model.txt` | Binary: is this page the **start** of a new document? |
| `lightgbm_splitter/signer_page/model.txt` | Per-document: which page contains the signer block? |

## Files

- `lightgbm_splitter/doc_start/model.txt`: LightGBM Booster (text format, portable)
- `lightgbm_splitter/doc_start/model.joblib`: sklearn wrapper (optional)
- `lightgbm_splitter/signer_page/model.txt`
- `lightgbm_splitter/signer_page/model.joblib`

## Features

Page-level features extracted from canonical OCR JSON: header/footer signals, regime presence, signer-block markers, page index, relative position, etc. See `build_doc_start_features` and the `predict_*` helpers in [scanindex/core/digitization/page_splitter.py](https://github.com/welcomyou/scanindex/blob/main/scanindex/core/digitization/page_splitter.py).

## Loading

```python
import lightgbm as lgb
from huggingface_hub import snapshot_download

# Download the repository snapshot into ./models and get its local path.
local = snapshot_download("welcomyou/lightgbm-vn-page-splitter", local_dir="models")

# Load the doc_start model from the portable text format.
booster = lgb.Booster(model_file=f"{local}/lightgbm_splitter/doc_start/model.txt")
```

## License

MIT.
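
## Example: from page scores to document ranges

The `doc_start` model scores each page on whether it opens a new document, and those scores still have to be turned into page ranges. The sketch below shows one simple way to do that. It is not ScanIndex's actual post-processing: the function name `split_into_documents`, the 0.5 threshold, the rule that the first page always opens a document, and the example scores are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def split_into_documents(
    start_probs: Sequence[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Group 0-based page indices into inclusive (first_page, last_page) ranges.

    Hypothetical helper: `threshold` is an assumed cut-off, not a value
    published with the models.
    """
    if len(start_probs) == 0:
        return []
    # Assumption: the first page of a batch always opens a document,
    # whatever its score.
    starts = [0] + [
        i for i, p in enumerate(start_probs) if i > 0 and p >= threshold
    ]
    # Each document ends one page before the next start; the last one
    # runs to the end of the batch.
    ends = [s - 1 for s in starts[1:]] + [len(start_probs) - 1]
    return list(zip(starts, ends))


# Made-up scores for a six-page batch (not real model output).
probs = [0.97, 0.04, 0.88, 0.10, 0.02, 0.73]
print(split_into_documents(probs))  # [(0, 1), (2, 4), (5, 5)]
```

In practice the scores would come from calling `booster.predict(...)` on a feature matrix built with the same feature extraction used at training time. The threshold is worth tuning on held-out batches rather than taken from this sketch.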