---
library_name: lightgbm
pipeline_tag: tabular-classification
license: mit
language:
- vi
tags:
- lightgbm
- classification
- vietnamese
- document-segmentation
- page-splitting
---
LightGBM page splitter — Vietnamese admin batches
Two LightGBM Booster models used by ScanIndex to split a multi-document scan batch into individual documents:
| Model | Task |
|---|---|
| lightgbm_splitter/doc_start/model.txt | Binary: is this page the start of a new document? |
| lightgbm_splitter/signer_page/model.txt | Per-document: which page contains the signer block? |
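
To illustrate how the two outputs could fit together, here is a minimal sketch that turns per-page start probabilities into page ranges and then picks one signer page per range. The function names, the 0.5 threshold, the rule that the first page always opens a document, and the assumption that the signer model yields one score per page are all hypothetical; the actual logic lives in the ScanIndex `predict_*` helpers and may differ.

```python
from typing import List, Sequence, Tuple


def split_by_doc_start(
    start_probs: Sequence[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Turn per-page start probabilities into half-open (first, end) page ranges.

    Hypothetical helper: the threshold and the forced start on page 0 are
    assumptions, not documented behaviour of the published models.
    """
    if not start_probs:
        return []
    # Page 0 always opens a document, whatever its score is.
    starts = [0] + [
        i for i, p in enumerate(start_probs) if i > 0 and p >= threshold
    ]
    # Each document ends where the next one starts; the last ends at the batch end.
    ends = starts[1:] + [len(start_probs)]
    return list(zip(starts, ends))


def pick_signer_page(page_scores: Sequence[float], doc_range: Tuple[int, int]) -> int:
    """Return the index of the highest-scoring page inside one document range.

    Assumes one signer score per page, which the model card does not confirm.
    """
    first, end = doc_range
    return max(range(first, end), key=lambda i: page_scores[i])
```

Half-open ranges keep the slicing simple: `pages[first:end]` is exactly one document.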
Files
- lightgbm_splitter/doc_start/model.txt: LightGBM Booster (text format, portable)
- lightgbm_splitter/doc_start/model.joblib: sklearn wrapper (optional)
- lightgbm_splitter/signer_page/model.txt
- lightgbm_splitter/signer_page/model.joblib
Features
Page-level features extracted from canonical OCR JSON: header/footer signals, regime presence, signer-block markers, page index, relative position, etc. See build_doc_start_features and the predict_* helpers in scanindex/core/digitization/page_splitter.py.
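
As a rough illustration of the two positional features named above, the sketch below computes a page index and a relative position for every page in a batch. The function name, the dictionary keys, and the exact definitions (zero-based index, position scaled to the range 0 to 1) are assumptions made for this example; `build_doc_start_features` may define them differently, and the OCR-derived features are not covered here.

```python
from typing import Dict, List, Union


def positional_features(num_pages: int) -> List[Dict[str, Union[int, float]]]:
    """Build hypothetical positional features for each page of a batch."""
    rows = []
    for i in range(num_pages):
        # Scale the position to [0, 1]; a single-page batch gets 0.0.
        relative = i / (num_pages - 1) if num_pages > 1 else 0.0
        rows.append({"page_index": i, "relative_position": relative})
    return rows
```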
Loading
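```python
import lightgbm as lgb
from huggingface_hub import snapshot_download

# Download the repository snapshot; snapshot_download returns the local folder path.
local = snapshot_download("welcomyou/lightgbm-vn-page-splitter", local_dir="models")

# Load the portable text-format Booster for the doc_start model.
booster = lgb.Booster(model_file=f"{local}/lightgbm_splitter/doc_start/model.txt")
```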
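
Once a Booster is loaded, scoring a batch could look like the sketch below. `doc_start_probabilities` is a hypothetical wrapper, not part of ScanIndex. It assumes the doc_start model was trained with a binary objective, so that `predict` returns one probability per row, and that the feature matrix has one row per page with columns in the training order.

```python
from typing import Any, List


def doc_start_probabilities(model: Any, feature_matrix: Any) -> List[float]:
    """Score every page and return plain Python floats, one per page.

    `model` is anything with a `predict` method, for example an
    `lgb.Booster`; `feature_matrix` is whatever that method accepts,
    for example a 2D NumPy array with one row per page.
    """
    probs = [float(p) for p in model.predict(feature_matrix)]
    # Guard against a shape mismatch (assumption: one score per input row).
    if len(probs) != len(feature_matrix):
        raise ValueError("expected exactly one score per page")
    return probs
```

With the Booster from the loading step, the call would be `doc_start_probabilities(booster, features)`, where `features` is the matrix produced by the feature-building code.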
License
MIT.