---
tags:
- ml-intern
---
# Document Re-enrichment Module

Fixes inconsistent document formatting using a local LLM (Ollama + Llama3) so your rule-based parser can correctly identify headings.

## The Problem

Your documents have inconsistent formatting:
- Titles aren't bold and use the same font size as body text
- Section headings sometimes lack bold/size formatting
- Body text is sometimes bold or underlined

Your parser relies on formatting cues (bold, font size) to classify heading levels, so inconsistent formatting causes **misclassification**.

## The Solution

```
Original Document → LLM Re-enrichment → Re-enriched Copy → Your Parser
```

The module:
1. **Extracts** all paragraphs with their formatting metadata
2. **Chunks** large documents into LLM-digestible batches (adaptive sliding window with overlap)
3. **Classifies** each paragraph as `TITLE`, `SECTION_HEADING`, or `BODY` using Llama3 via Ollama
4. **Applies** correct formatting (bold + font size + named styles) to a **copy**; the original is never modified

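The four steps above can be sketched as a minimal orchestrator. Function and parameter names here are illustrative assumptions, not the exact API of `enricher.py`; the collaborators are injected so the flow is visible without Ollama or python-docx:

```python
# Sketch of the enrichment pipeline: extract -> chunk -> classify -> apply.
# handler, classify_fn, and chunk_fn are stand-ins for the real components.

def enrich(src_path, dst_path, handler, classify_fn, chunk_fn):
    paragraphs = handler.extract_paragraphs(src_path)       # 1. extract
    labels = {}
    for chunk in chunk_fn(paragraphs):                      # 2. chunk
        for idx, label in classify_fn(chunk).items():       # 3. classify
            labels[idx] = label                             # later chunks win on overlap
    handler.apply_classifications(src_path, dst_path, labels)  # 4. apply to a copy
    return labels
```

Because classifications are keyed by paragraph index, overlapping chunks simply overwrite each other's labels for boundary paragraphs.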
## Setup

```bash
git clone https://huggingface.co/dwijverma2/doc-enricher
cd doc-enricher
pip install python-docx requests
```

Requires [Ollama](https://ollama.ai) running locally with Llama3:
```bash
ollama pull llama3
ollama serve  # if not already running
```

## Usage

### Python API
```python
from doc_enricher import DocumentEnricher, DocxHandler

enricher = DocumentEnricher(
    handler=DocxHandler(),
    model="llama3",
)

# Single file
enricher.enrich("report.docx", "report_enriched.docx")

# Batch: all .docx files in a directory
enricher.enrich_batch("./originals/", "./enriched/")
```

### CLI
```bash
# Single file
python -m doc_enricher.cli report.docx -o report_enriched.docx

# Batch mode
python -m doc_enricher.cli --batch ./originals/ -o ./enriched/

# Custom model + verbose
python -m doc_enricher.cli report.docx -o out.docx --model llama3:8b -v
```

### CLI Options
| Flag | Default | Description |
|------|---------|-------------|
| `-o, --output` | `{name}_enriched.docx` | Output path |
| `--batch` | off | Process entire directory |
| `--model` | `llama3` | Ollama model name |
| `--ollama-url` | `http://localhost:11434` | Ollama API URL |
| `--max-tokens` | `3000` | Token budget per LLM chunk |
| `--overlap` | `3` | Paragraph overlap between chunks |
| `--no-formatting-hints` | off | Don't send existing formatting to the LLM |
| `-v, --verbose` | off | Debug logging |

## How It Works

### Chunking Strategy
Documents are split into chunks that fit within Llama3's 8K-token context window (~3000 tokens of paragraph text, leaving room for the system prompt and response). Adjacent chunks overlap by 3 paragraphs so boundary paragraphs have context from both sides. This follows [Cohan et al. 2019](https://arxiv.org/abs/1909.04054): sequential sentence classification benefits from joint context.

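A sliding-window chunker with overlap can be sketched as follows. This is a simplified version of what `chunker.py` presumably does; the token estimate (roughly 4 characters per token) and the exact flush logic are assumptions:

```python
# Sketch of an adaptive sliding-window chunker with paragraph overlap.
# Assumption: ~4 characters per token, a rough heuristic rather than the
# model's real tokenizer.

def chunk_paragraphs(paragraphs, max_tokens=3000, overlap=3):
    """Split paragraph strings into chunks under a token budget,
    carrying the last `overlap` paragraphs into the next chunk."""
    estimate = lambda text: max(1, len(text) // 4)
    chunks, current, budget = [], [], 0
    for para in paragraphs:
        cost = estimate(para)
        if current and budget + cost > max_tokens:
            chunks.append(current)
            current = current[-overlap:]   # overlap gives boundary paragraphs context
            budget = sum(estimate(p) for p in current)
        current.append(para)
        budget += cost
    if current:
        chunks.append(current)
    return chunks
```

Note that `overlap` should stay small relative to the per-chunk capacity, otherwise almost every paragraph is classified twice.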
### LLM Classification
Uses Ollama's `/api/chat` endpoint with `"format": "json"` (constrained decoding) for reliable structured output. The prompt includes existing formatting metadata (style name, bold, font size) as hints.

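A sketch of the request this implies, using only the standard library. The exact prompt wording and response schema used by `llm_client.py` are assumptions; only the endpoint and the `"format": "json"` field come from the description above:

```python
import json
from urllib import request as urlrequest

def build_payload(paragraphs, model="llama3"):
    """Build an Ollama /api/chat payload with JSON-constrained output."""
    numbered = "\n".join(f"{i}: {p}" for i, p in enumerate(paragraphs))
    system = (
        "Classify each numbered paragraph as TITLE, SECTION_HEADING, or BODY. "
        'Respond with JSON like {"classifications": [{"index": 0, "label": "BODY"}]}.'
    )
    return {
        "model": model,
        "format": "json",   # constrained decoding: output must be valid JSON
        "stream": False,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": numbered},
        ],
    }

def classify(paragraphs, url="http://localhost:11434/api/chat"):
    req = urlrequest.Request(
        url,
        data=json.dumps(build_payload(paragraphs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlrequest.urlopen(req, timeout=120) as resp:
        reply = json.load(resp)
    # Non-streaming /api/chat responses carry the text in message.content.
    return json.loads(reply["message"]["content"])["classifications"]
```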
### Formatting Application
Both **named styles** (`Title`, `Heading 1`, `Normal`) and **run-level formatting** (bold + font size) are applied, so it works whether your parser checks `para.style.name` or inspects run formatting directly.

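One way to apply both layers with python-docx is a label-to-format table. The point sizes below are illustrative assumptions; only the three style names come from the text above:

```python
# Map each label to a named style plus explicit run-level formatting,
# so both style-based and run-inspecting parsers see the right cues.

LABEL_FORMATS = {
    "TITLE":           {"style": "Title",     "bold": True,  "size_pt": 20},
    "SECTION_HEADING": {"style": "Heading 1", "bold": True,  "size_pt": 14},
    "BODY":            {"style": "Normal",    "bold": False, "size_pt": 11},
}

def apply_label(paragraph, label):
    """Apply the named style and run-level bold/size to a python-docx paragraph."""
    from docx.shared import Pt  # lazy import so the table is usable without docx
    fmt = LABEL_FORMATS[label]
    paragraph.style = fmt["style"]        # python-docx accepts a style name string
    for run in paragraph.runs:
        run.bold = fmt["bold"]
        run.font.size = Pt(fmt["size_pt"])
```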
## Architecture

```
doc_enricher/
├── __init__.py          # Package entry point
├── base_handler.py      # Abstract handler interface (extend for PDF/HTML)
├── handlers/
│   ├── __init__.py
│   └── docx_handler.py  # DOCX: extract paragraphs + apply formatting
├── llm_client.py        # Ollama /api/chat with JSON-constrained output
├── chunker.py           # Adaptive sliding window with overlap
├── enricher.py          # Main orchestrator
└── cli.py               # Command-line interface
test_module.py           # Test suite (no Ollama required)
```

### Adding a New Format
Implement `BaseHandler`:
```python
from doc_enricher.base_handler import BaseHandler, ParagraphInfo

class PdfHandler(BaseHandler):
    def extract_paragraphs(self, filepath: str) -> list[ParagraphInfo]:
        ...

    def apply_classifications(self, src_path, dst_path, classifications):
        ...
```

## Tests

```bash
python test_module.py
```

Creates a DOCX with intentionally broken formatting and verifies:
- ✅ Paragraph extraction with metadata
- ✅ Chunking (single chunk, multi-chunk, 150+ paragraphs)
- ✅ Classification application (mock LLM, no Ollama needed)
- ✅ Original document unchanged
- ✅ Edge cases (empty, whitespace-only, single paragraph)


<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
|