---
tags:
- ml-intern
---

# Document Re-enrichment Module

Fixes inconsistent document formatting using a local LLM (Ollama + Llama3) so your rule-based parser can correctly identify headings.

## The Problem

Your documents have inconsistent formatting:

- Titles aren't bold and use the same font size as body text
- Section headings sometimes lack bold/size formatting
- Body text is sometimes bold or underlined

Your parser relies on formatting cues (bold, font size) to classify heading levels, so inconsistent formatting causes **misclassification**.

## The Solution

```
Original Document → LLM Re-enrichment → Re-enriched Copy → Your Parser
```

The module:

1. **Extracts** all paragraphs with their formatting metadata
2. **Chunks** large documents into LLM-digestible batches (adaptive sliding window with overlap)
3. **Classifies** each paragraph as `TITLE`, `SECTION_HEADING`, or `BODY` using Llama3 via Ollama
4. **Applies** correct formatting (bold + font size + named styles) to a **copy** — the original is never modified

## Setup

```bash
git clone https://huggingface.co/dwijverma2/doc-enricher
cd doc-enricher
pip install python-docx requests
```

Requires [Ollama](https://ollama.ai) running locally with Llama3:

```bash
ollama pull llama3
ollama serve  # if not already running
```

## Usage

### Python API

```python
from doc_enricher import DocumentEnricher, DocxHandler

enricher = DocumentEnricher(
    handler=DocxHandler(),
    model="llama3",
)

# Single file
enricher.enrich("report.docx", "report_enriched.docx")

# Batch — all .docx files in a directory
enricher.enrich_batch("./originals/", "./enriched/")
```

### CLI

```bash
# Single file
python -m doc_enricher.cli report.docx -o report_enriched.docx

# Batch mode
python -m doc_enricher.cli --batch ./originals/ -o ./enriched/

# Custom model + verbose
python -m doc_enricher.cli report.docx -o out.docx --model llama3:8b -v
```

### CLI Options

| Flag | Default | Description |
|------|---------|-------------|
| `-o, --output` | `{name}_enriched.docx` | Output path |
| `--batch` | off | Process entire directory |
| `--model` | `llama3` | Ollama model name |
| `--ollama-url` | `http://localhost:11434` | Ollama API URL |
| `--max-tokens` | `3000` | Token budget per LLM chunk |
| `--overlap` | `3` | Paragraph overlap between chunks |
| `--no-formatting-hints` | off | Don't send existing formatting to LLM |
| `-v, --verbose` | off | Debug logging |
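The `--max-tokens` and `--overlap` flags drive the chunking step. As a rough illustration of how a sliding window with paragraph overlap behaves under those defaults, here is a minimal sketch — the function names and the 4-characters-per-token estimate are assumptions, not the actual `chunker.py` implementation:

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token (assumption, not a real tokenizer)
    return max(1, len(text) // 4)


def chunk_paragraphs(paragraphs: list[str],
                     max_tokens: int = 3000,
                     overlap: int = 3) -> list[list[str]]:
    """Split paragraphs into chunks under a token budget, with paragraph overlap."""
    chunks, start = [], 0
    while start < len(paragraphs):
        budget, end = 0, start
        while end < len(paragraphs):
            budget += estimate_tokens(paragraphs[end])
            if budget > max_tokens and end > start:
                break  # budget exceeded; keep at least one paragraph per chunk
            end += 1
        chunks.append(paragraphs[start:end])
        if end >= len(paragraphs):
            break
        # The next chunk re-reads the last `overlap` paragraphs for boundary context
        start = max(start + 1, end - overlap)
    return chunks
```

Because consecutive chunks share their boundary paragraphs, a paragraph near a chunk edge is always classified with some neighbors visible on both sides.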
## How It Works

### Chunking Strategy

Documents are split into chunks that fit within Llama3's 8K context window (~3000 tokens of paragraph text, leaving room for the system prompt and response). Adjacent chunks overlap by 3 paragraphs so boundary paragraphs have context from both sides. Based on [Cohan et al. 2019](https://arxiv.org/abs/1909.04054) — sequential sentence classification requires joint context.

### LLM Classification

Uses Ollama's `/api/chat` endpoint with `"format": "json"` (constrained decoding) for reliable structured output. The prompt includes existing formatting metadata (style name, bold, font size) as hints.

### Formatting Application

Both **named styles** (`Title`, `Heading 1`, `Normal`) and **run-level formatting** (bold + font size) are applied — works whether your parser checks `para.style.name` or inspects run formatting directly.

## Architecture

```
doc_enricher/
├── __init__.py          # Package entry point
├── base_handler.py      # Abstract handler interface (extend for PDF/HTML)
├── handlers/
│   ├── __init__.py
│   └── docx_handler.py  # DOCX: extract paragraphs + apply formatting
├── llm_client.py        # Ollama /api/chat with JSON-constrained output
├── chunker.py           # Adaptive sliding window with overlap
├── enricher.py          # Main orchestrator
└── cli.py               # Command-line interface

test_module.py           # Test suite (no Ollama required)
```

### Adding a New Format

Implement `BaseHandler`:

```python
from doc_enricher.base_handler import BaseHandler, ParagraphInfo

class PdfHandler(BaseHandler):
    def extract_paragraphs(self, filepath: str) -> list[ParagraphInfo]:
        ...

    def apply_classifications(self, src_path, dst_path, classifications):
        ...
```
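The request described under "LLM Classification" can be sketched as follows. This is a hedged illustration, not the actual `llm_client.py`: the helper name, prompt wording, and response schema are made up, but the payload fields (`model`, `messages`, `format`, `stream`) follow Ollama's `/api/chat` API:

```python
def build_classify_request(paragraphs: list[dict], model: str = "llama3") -> dict:
    """Build an Ollama /api/chat payload asking for JSON-constrained classification."""
    lines = []
    for i, p in enumerate(paragraphs):
        # Existing formatting metadata is passed along as hints (omit with --no-formatting-hints)
        hint = f" [style={p.get('style')}, bold={p.get('bold')}, size={p.get('size')}]"
        lines.append(f"{i}: {p['text']}{hint}")
    system = (
        "Classify each numbered paragraph as TITLE, SECTION_HEADING, or BODY. "
        'Respond with JSON: {"classifications": [{"index": 0, "label": "BODY"}]}'
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": "\n".join(lines)},
        ],
        "format": "json",  # constrained decoding: Ollama returns valid JSON
        "stream": False,
    }

# Sending it (requires a running Ollama server):
# resp = requests.post("http://localhost:11434/api/chat", json=build_classify_request(paras))
# result = json.loads(resp.json()["message"]["content"])
```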
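For the "Formatting Application" step, the mapping from a predicted label to both a named style and run-level formatting might look like the sketch below. The style names match those listed above, but the point sizes and the fallback behavior are illustrative assumptions, not the actual `docx_handler.py` values:

```python
# Label -> named style plus run-level formatting.
# Point sizes here are illustrative assumptions.
STYLE_MAP = {
    "TITLE":           {"style": "Title",     "bold": True,  "size_pt": 20},
    "SECTION_HEADING": {"style": "Heading 1", "bold": True,  "size_pt": 14},
    "BODY":            {"style": "Normal",    "bold": False, "size_pt": 11},
}


def formatting_for(label: str) -> dict:
    """Fall back to BODY formatting for unexpected labels."""
    return STYLE_MAP.get(label, STYLE_MAP["BODY"])

# With python-docx, applying both levels would look roughly like:
#   fmt = formatting_for(label)
#   paragraph.style = doc.styles[fmt["style"]]
#   for run in paragraph.runs:
#       run.bold = fmt["bold"]
#       run.font.size = Pt(fmt["size_pt"])
```

Applying both the named style and the run formatting is what lets downstream parsers work off either signal.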
## Tests

```bash
python test_module.py
```

Creates a DOCX with intentionally broken formatting and verifies:

- ✅ Paragraph extraction with metadata
- ✅ Chunking (single chunk, multi-chunk, 150+ paragraphs)
- ✅ Classification application (mock LLM, no Ollama needed)
- ✅ Original document unchanged
- ✅ Edge cases (empty, whitespace-only, single paragraph)

## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern