	---
	library_name: lightgbm
	pipeline_tag: tabular-classification
	license: mit
	language:
	- vi
	tags:
	- lightgbm
	- classification
	- vietnamese
	- document-segmentation
	- page-splitting
	---

	# LightGBM page splitter — Vietnamese admin batches

	Two LightGBM Booster models used by [ScanIndex](https://github.com/welcomyou/scanindex) to split a multi-document scan batch into individual documents by detecting where each document starts:

	| Model | Task |
	|---|---|
	| `lightgbm_splitter/doc_start/model.txt` | Binary: is this page the start of a new document? |
	| `lightgbm_splitter/signer_page/model.txt` | Per-document: which page contains the signer block? |
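
As a rough illustration of how the two outputs could be combined, the sketch below turns per-page `doc_start` probabilities into page ranges and picks a signer page inside one range. The function names, the 0.5 threshold, and the "highest score wins" rule are assumptions made for this example; the actual logic lives in ScanIndex's `page_splitter.py` and may differ.

```python
def split_into_documents(start_probs, threshold=0.5):
    """Group page indices into (first_page, last_page) ranges.

    start_probs: one doc_start probability per page, in scan order.
    threshold:   assumed cut-off, not a value taken from the model card.
    """
    if not start_probs:
        return []
    # The first page always opens a document, whatever the model scored.
    starts = [0] + [
        i for i, p in enumerate(start_probs) if i > 0 and p >= threshold
    ]
    # Each document ends on the page before the next start.
    ends = [s - 1 for s in starts[1:]] + [len(start_probs) - 1]
    return list(zip(starts, ends))


def pick_signer_page(page_scores, first_page=0):
    """Return the batch-level index of the best-scoring page in one document."""
    best = max(range(len(page_scores)), key=page_scores.__getitem__)
    return first_page + best


if __name__ == "__main__":
    ranges = split_into_documents([0.9, 0.1, 0.2, 0.8, 0.3])
    print(ranges)
    print(pick_signer_page([0.1, 0.7, 0.2], first_page=3))
```

Forcing page 0 to be a start keeps the ranges contiguous even when the first page scores low.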

	## Files

	- `lightgbm_splitter/doc_start/model.txt`: LightGBM Booster (text format, portable)
	- `lightgbm_splitter/doc_start/model.joblib`: sklearn wrapper (optional)
	- `lightgbm_splitter/signer_page/model.txt`: LightGBM Booster (text format, portable)
	- `lightgbm_splitter/signer_page/model.joblib`: sklearn wrapper (optional)

	## Features

	Page-level features extracted from canonical OCR JSON: header/footer signals, regime presence, signer-block markers, page index, relative position, etc. See `build_doc_start_features` and the `predict_*` helpers in [scanindex/core/digitization/page_splitter.py](https://github.com/welcomyou/scanindex/blob/main/scanindex/core/digitization/page_splitter.py).
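
A Booster scores a numeric matrix whose columns must follow the order used at training time. The sketch below shows one way to arrange per-page feature dictionaries into that order. The helper `rows_to_matrix` and the feature names in the demo are hypothetical; the real column names can be read from the loaded model with `booster.feature_name()`, and the real extraction is done by `build_doc_start_features`.

```python
def rows_to_matrix(rows, feature_names, missing=float("nan")):
    """Order per-page feature dicts into a row-major matrix.

    rows:          list of {feature_name: value} dicts, one per page.
    feature_names: column order expected by the model.
    missing:       filler for absent features; LightGBM treats NaN as
                   a missing value by default.
    """
    return [
        [float(row.get(name, missing)) for name in feature_names]
        for row in rows
    ]


if __name__ == "__main__":
    # Hypothetical names, chosen only to make the example readable.
    names = ["page_index", "relative_position", "has_signer_marker"]
    pages = [
        {"page_index": 0, "relative_position": 0.0, "has_signer_marker": False},
        {"page_index": 1, "relative_position": 0.5},
    ]
    for line in rows_to_matrix(pages, names):
        print(line)
```

Converting the result with `numpy.asarray` before calling `predict` gives LightGBM its documented 2-D array input.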

	## Loading

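	```python
	import lightgbm as lgb
	from huggingface_hub import snapshot_download

	# Download the repository into ./models; snapshot_download returns the local path.
	local = snapshot_download("welcomyou/lightgbm-vn-page-splitter", local_dir="models")

	# Load the doc_start Booster from its portable text dump.
	booster = lgb.Booster(model_file=f"{local}/lightgbm_splitter/doc_start/model.txt")
	```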
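
Once a Booster is loaded, scoring pages comes down to a `predict` call, which returns one probability per row for a binary objective. The sketch below wraps that call with a column-count check against `booster.num_feature()` so a mismatched feature matrix fails with a readable message. The wrapper name `score_pages` is hypothetical, and it assumes the matrix was built with the same features used in training.

```python
def score_pages(booster, feature_matrix):
    """Return one probability per page as plain Python floats.

    booster:        a loaded lightgbm.Booster (or any object exposing
                    num_feature() and predict()).
    feature_matrix: 2-D array-like, one row per page.
    """
    expected = booster.num_feature()
    for i, row in enumerate(feature_matrix):
        if len(row) != expected:
            raise ValueError(
                f"page {i}: got {len(row)} features, model expects {expected}"
            )
    # Booster.predict yields probabilities unless raw_score=True is passed.
    return [float(p) for p in booster.predict(feature_matrix)]
```

With the `booster` loaded above, `score_pages(booster, X)` would give the per-page `doc_start` probabilities for a feature matrix `X`.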

	## License

	MIT.