---
tags:
- ml-intern
---
# Document Re-enrichment Module

Fixes inconsistent document formatting using a local LLM (Ollama + Llama3) so your rule-based parser can correctly identify headings.

## The Problem

Your documents have inconsistent formatting:
- Titles aren't bold and use the same font size as body text
- Section headings sometimes lack bold/size formatting
- Body text is sometimes bold or underlined

Your parser relies on formatting cues (bold, font size) to classify heading levels, so inconsistent formatting causes **misclassification**.

## The Solution

```
Original Document → LLM Re-enrichment → Re-enriched Copy → Your Parser
```

The module:
1. **Extracts** all paragraphs with their formatting metadata
2. **Chunks** large documents into LLM-digestible batches (adaptive sliding window with overlap)
3. **Classifies** each paragraph as `TITLE`, `SECTION_HEADING`, or `BODY` using Llama3 via Ollama
4. **Applies** correct formatting (bold + font size + named styles) to a **copy**; the original is never modified

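The four steps above can be sketched as a minimal orchestrator. Function and parameter names here are illustrative assumptions, not the exact API of `enricher.py`; the collaborators are injected so the flow is visible without Ollama or python-docx:

```python
# Sketch of the enrichment pipeline: extract -> chunk -> classify -> apply.
# handler, classify_fn, and chunk_fn are stand-ins for the real components.

def enrich(src_path, dst_path, handler, classify_fn, chunk_fn):
    paragraphs = handler.extract_paragraphs(src_path)       # 1. extract
    labels = {}
    for chunk in chunk_fn(paragraphs):                      # 2. chunk
        for idx, label in classify_fn(chunk).items():       # 3. classify
            labels[idx] = label                             # later chunks win on overlap
    handler.apply_classifications(src_path, dst_path, labels)  # 4. apply to a copy
    return labels
```

Because classifications are keyed by paragraph index, overlapping chunks simply overwrite each other's labels for boundary paragraphs.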
## Setup

```bash
git clone https://huggingface.co/dwijverma2/doc-enricher
cd doc-enricher
pip install python-docx requests
```

Requires [Ollama](https://ollama.ai) running locally with Llama3:
```bash
ollama pull llama3
ollama serve  # if not already running
```

## Usage

### Python API
```python
from doc_enricher import DocumentEnricher, DocxHandler

enricher = DocumentEnricher(
    handler=DocxHandler(),
    model="llama3",
)

# Single file
enricher.enrich("report.docx", "report_enriched.docx")

# Batch: all .docx files in a directory
enricher.enrich_batch("./originals/", "./enriched/")
```

### CLI
```bash
# Single file
python -m doc_enricher.cli report.docx -o report_enriched.docx

# Batch mode
python -m doc_enricher.cli --batch ./originals/ -o ./enriched/

# Custom model + verbose
python -m doc_enricher.cli report.docx -o out.docx --model llama3:8b -v
```

### CLI Options
| Flag | Default | Description |
|------|---------|-------------|
| `-o, --output` | `{name}_enriched.docx` | Output path |
| `--batch` | off | Process entire directory |
| `--model` | `llama3` | Ollama model name |
| `--ollama-url` | `http://localhost:11434` | Ollama API URL |
| `--max-tokens` | `3000` | Token budget per LLM chunk |
| `--overlap` | `3` | Paragraph overlap between chunks |
| `--no-formatting-hints` | off | Don't send existing formatting to the LLM |
| `-v, --verbose` | off | Debug logging |

## How It Works

### Chunking Strategy
Documents are split into chunks that fit within Llama3's 8K-token context window (~3000 tokens of paragraph text, leaving room for the system prompt and response). Adjacent chunks overlap by 3 paragraphs so boundary paragraphs have context from both sides. This follows [Cohan et al. 2019](https://arxiv.org/abs/1909.04054): sequential sentence classification benefits from joint context.

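A sliding-window chunker with overlap can be sketched as follows. This is a simplified version of what `chunker.py` presumably does; the token estimate (roughly 4 characters per token) and the exact flush logic are assumptions:

```python
# Sketch of an adaptive sliding-window chunker with paragraph overlap.
# Assumption: ~4 characters per token, a rough heuristic rather than the
# model's real tokenizer.

def chunk_paragraphs(paragraphs, max_tokens=3000, overlap=3):
    """Split paragraph strings into chunks under a token budget,
    carrying the last `overlap` paragraphs into the next chunk."""
    estimate = lambda text: max(1, len(text) // 4)
    chunks, current, budget = [], [], 0
    for para in paragraphs:
        cost = estimate(para)
        if current and budget + cost > max_tokens:
            chunks.append(current)
            current = current[-overlap:]   # overlap gives boundary paragraphs context
            budget = sum(estimate(p) for p in current)
        current.append(para)
        budget += cost
    if current:
        chunks.append(current)
    return chunks
```

Note that `overlap` should stay small relative to the per-chunk capacity, otherwise almost every paragraph is classified twice.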
### LLM Classification
Uses Ollama's `/api/chat` endpoint with `"format": "json"` (constrained decoding) for reliable structured output. The prompt includes existing formatting metadata (style name, bold, font size) as hints.

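A sketch of the request this implies, using only the standard library. The exact prompt wording and response schema used by `llm_client.py` are assumptions; only the endpoint and the `"format": "json"` field come from the description above:

```python
import json
from urllib import request as urlrequest

def build_payload(paragraphs, model="llama3"):
    """Build an Ollama /api/chat payload with JSON-constrained output."""
    numbered = "\n".join(f"{i}: {p}" for i, p in enumerate(paragraphs))
    system = (
        "Classify each numbered paragraph as TITLE, SECTION_HEADING, or BODY. "
        'Respond with JSON like {"classifications": [{"index": 0, "label": "BODY"}]}.'
    )
    return {
        "model": model,
        "format": "json",   # constrained decoding: output must be valid JSON
        "stream": False,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": numbered},
        ],
    }

def classify(paragraphs, url="http://localhost:11434/api/chat"):
    req = urlrequest.Request(
        url,
        data=json.dumps(build_payload(paragraphs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlrequest.urlopen(req, timeout=120) as resp:
        reply = json.load(resp)
    # Non-streaming /api/chat responses carry the text in message.content.
    return json.loads(reply["message"]["content"])["classifications"]
```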
### Formatting Application
Both **named styles** (`Title`, `Heading 1`, `Normal`) and **run-level formatting** (bold + font size) are applied, so it works whether your parser checks `para.style.name` or inspects run formatting directly.

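One way to apply both layers with python-docx is a label-to-format table. The point sizes below are illustrative assumptions; only the three style names come from the text above:

```python
# Map each label to a named style plus explicit run-level formatting,
# so both style-based and run-inspecting parsers see the right cues.

LABEL_FORMATS = {
    "TITLE":           {"style": "Title",     "bold": True,  "size_pt": 20},
    "SECTION_HEADING": {"style": "Heading 1", "bold": True,  "size_pt": 14},
    "BODY":            {"style": "Normal",    "bold": False, "size_pt": 11},
}

def apply_label(paragraph, label):
    """Apply the named style and run-level bold/size to a python-docx paragraph."""
    from docx.shared import Pt  # lazy import so the table is usable without docx
    fmt = LABEL_FORMATS[label]
    paragraph.style = fmt["style"]        # python-docx accepts a style name string
    for run in paragraph.runs:
        run.bold = fmt["bold"]
        run.font.size = Pt(fmt["size_pt"])
```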
## Architecture

```
doc_enricher/
├── __init__.py          # Package entry point
├── base_handler.py      # Abstract handler interface (extend for PDF/HTML)
├── handlers/
│   ├── __init__.py
│   └── docx_handler.py  # DOCX: extract paragraphs + apply formatting
├── llm_client.py        # Ollama /api/chat with JSON-constrained output
├── chunker.py           # Adaptive sliding window with overlap
├── enricher.py          # Main orchestrator
└── cli.py               # Command-line interface
test_module.py           # Test suite (no Ollama required)
```

### Adding a New Format
Implement `BaseHandler`:
```python
from doc_enricher.base_handler import BaseHandler, ParagraphInfo

class PdfHandler(BaseHandler):
    def extract_paragraphs(self, filepath: str) -> list[ParagraphInfo]:
        ...

    def apply_classifications(self, src_path, dst_path, classifications):
        ...
```

## Tests

```bash
python test_module.py
```

Creates a DOCX with intentionally broken formatting and verifies:
- ✅ Paragraph extraction with metadata
- ✅ Chunking (single chunk, multi-chunk, 150+ paragraphs)
- ✅ Classification application (mock LLM, no Ollama needed)
- ✅ Original document unchanged
- ✅ Edge cases (empty, whitespace-only, single paragraph)


<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
|