welcomyou/lightgbm-vn-page-splitter

welcomyou committed 4 days ago

Commit 30c683d (verified), 1 parent: 7bd37f2

docs: model card

Files changed (1)

README.md +46 -0

	@@ -0,0 +1,46 @@

+---
+library_name: lightgbm
+pipeline_tag: tabular-classification
+license: mit
+language:
+  - vi
+tags:
+  - lightgbm
+  - classification
+  - vietnamese
+  - document-segmentation
+  - page-splitting
+---
+# LightGBM page splitter for Vietnamese administrative scan batches
+Two LightGBM Booster models used by [ScanIndex](https://github.com/welcomyou/scanindex) to split a multi-document scan batch into individual documents and to locate the signer page of each document:
+| Model | Task |
+|---|---|
+| `lightgbm_splitter/doc_start/model.txt` | Binary: is this page the **start** of a new document? |
+| `lightgbm_splitter/signer_page/model.txt` | Per-document: which page contains the signer block? |
+## Files
+- `lightgbm_splitter/doc_start/model.txt`: LightGBM Booster (text format, portable)
+- `lightgbm_splitter/doc_start/model.joblib`: scikit-learn wrapper saved with joblib (optional)
+- `lightgbm_splitter/signer_page/model.txt`: LightGBM Booster (text format, portable)
+- `lightgbm_splitter/signer_page/model.joblib`: scikit-learn wrapper saved with joblib (optional)
+## Features
+Page-level features extracted from canonical OCR JSON: header/footer signals, regime presence, signer-block markers, page index, relative position, etc. See `build_doc_start_features` and the `predict_*` helpers in [scanindex/core/digitization/page_splitter.py](https://github.com/welcomyou/scanindex/blob/main/scanindex/core/digitization/page_splitter.py).
+## Loading
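+```python
+import lightgbm as lgb
+from huggingface_hub import snapshot_download
+# Download the whole repository into ./models; the call returns the local folder path.
+local = snapshot_download("welcomyou/lightgbm-vn-page-splitter", local_dir="models")
+# Load the doc_start model from its portable text dump.
+booster = lgb.Booster(model_file=f"{local}/lightgbm_splitter/doc_start/model.txt")
+```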
+## License
+MIT.
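
The model card says the `doc_start` model answers a per-page question and the `signer_page` model a per-document one, but it does not show how scores become document boundaries. The sketch below is one plausible post-processing step, not ScanIndex's actual code: the helper names `split_documents` and `pick_signer_page`, the 0.5 threshold, the rule that the first page always opens a document, and the highest-score rule for the signer page are all assumptions made for illustration. The scores are made-up numbers standing in for model output.

```python
from typing import List, Sequence, Tuple


def split_documents(
    start_probs: Sequence[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Turn per-page start probabilities into inclusive (first, last) page ranges.

    Assumption of this sketch: page 0 always opens a document, whatever its score.
    """
    n_pages = len(start_probs)
    if n_pages == 0:
        return []
    # Page 0 plus every later page whose score reaches the threshold.
    starts = [0] + [i for i in range(1, n_pages) if start_probs[i] >= threshold]
    # Each document ends right before the next start; the last one ends the batch.
    ends = [s - 1 for s in starts[1:]] + [n_pages - 1]
    return list(zip(starts, ends))


def pick_signer_page(signer_scores: Sequence[float], doc_range: Tuple[int, int]) -> int:
    """Return the index of the highest-scoring page inside one document range."""
    first, last = doc_range
    return max(range(first, last + 1), key=lambda i: signer_scores[i])


# Illustrative scores for a six-page batch (hypothetical values).
start_probs = [0.97, 0.04, 0.10, 0.88, 0.02, 0.91]
signer_scores = [0.10, 0.20, 0.90, 0.30, 0.80, 0.70]

docs = split_documents(start_probs)
signers = [pick_signer_page(signer_scores, d) for d in docs]
print(docs)     # [(0, 2), (3, 4), (5, 5)]
print(signers)  # [2, 4, 5]
```

In real use the scores would come from calling `booster.predict(...)` on a feature matrix built the way the training code expects. That feature construction lives in `page_splitter.py` and is not reproduced here.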