Update README.md
README.md
CHANGED
@@ -9,7 +9,7 @@ datasets:
 # FinePDFs-OCR-Quality classifier (English)
 
 ## Model summary
-This is a classifier for judging the
+This is a classifier for judging the ocr quality value of web pages. It was developed to filter and curate well extracted content from web datasets and was trained on 1304547
 [annotations](https://huggingface.co/datasets/HuggingFaceFW/finepdfs_eng_Latn_labeled) generated by [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) for web samples from [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs) dataset.
 
 ### How to use in transformers
@@ -21,8 +21,8 @@ import re
 CHUNK_SIZE = 2048 - 2
 MAX_CHARS = 10_000
 
-tokenizer = AutoTokenizer.from_pretrained("HuggingFaceFW/
-model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceFW/
+tokenizer = AutoTokenizer.from_pretrained("HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn")
+model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn")
 regex_whitespace = re.compile(r'\s')
 
 def create_text_chunks(text: str, tokenizer):
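
The body of `create_text_chunks` is not part of this diff, so the following is only a rough, hypothetical sketch of how constants like `CHUNK_SIZE`, `MAX_CHARS`, and `regex_whitespace` could be used together. It assumes the `- 2` in `CHUNK_SIZE` reserves room for two special tokens and that `MAX_CHARS` caps the text before tokenization. The helpers `trim_to_whitespace` and `split_token_ids` are illustrative names, not taken from the README.

```python
import re

CHUNK_SIZE = 2048 - 2   # from the README; assumed to leave room for two special tokens
MAX_CHARS = 10_000      # from the README; assumed character cap applied before tokenizing
regex_whitespace = re.compile(r'\s')


def trim_to_whitespace(text: str, max_chars: int = MAX_CHARS) -> str:
    """Hypothetical helper: cut text to max_chars, backing up to the last
    whitespace inside the window so a word is not split in half."""
    if len(text) <= max_chars:
        return text
    window = text[:max_chars]
    last_ws = None
    for match in regex_whitespace.finditer(window):
        last_ws = match.start()
    # Fall back to a hard cut when the window has no usable whitespace.
    return window[:last_ws] if last_ws else window


def split_token_ids(token_ids, chunk_size: int = CHUNK_SIZE):
    """Hypothetical helper: split a flat list of token ids into
    consecutive chunks of at most chunk_size ids each."""
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]
```

With these assumptions, a 5000-token input would be split into three chunks, each short enough that two special tokens can be added without exceeding 2048 positions.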