---
tags:
- ml-intern
---
# Document Re-enrichment Module
Fixes inconsistent document formatting using a local LLM (Ollama + Llama3) so your rule-based parser can correctly identify headings.
## The Problem
Your documents have inconsistent formatting:
- Titles aren't bold, same font size as body text
- Section headings sometimes lack bold/size formatting
- Body text is sometimes bold or underlined
Your parser relies on formatting cues (bold, font size) to classify heading levels → misclassification.
## The Solution
Original Document → LLM Re-enrichment → Re-enriched Copy → Your Parser
The module:
- Extracts all paragraphs with their formatting metadata
- Chunks large documents into LLM-digestible batches (adaptive sliding window with overlap)
- Classifies each paragraph as `TITLE`, `SECTION_HEADING`, or `BODY` using Llama3 via Ollama
- Applies correct formatting (bold + font size + named styles) to a copy → the original is never modified
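The steps above can be sketched as a small orchestrator. All names here are hypothetical illustrations; the real module's interfaces may differ:

```python
def enrich(src_path, dst_path, handler, chunker, classify):
    """Illustrative pipeline: extract -> chunk -> classify -> re-format a copy."""
    # 1. Extract paragraphs (text + formatting metadata) from the original.
    paragraphs = handler.extract_paragraphs(src_path)
    # 2. Split (index, text) pairs into overlapping, LLM-sized chunks.
    indexed = list(enumerate(p["text"] for p in paragraphs))
    labels = {}
    for chunk in chunker(indexed):
        # 3. One label per paragraph index; overlap labels from a later chunk
        #    simply overwrite the earlier ones.
        labels.update(classify(chunk))
    # 4. Apply formatting to a copy; the original file is never modified.
    handler.apply_classifications(src_path, dst_path, labels)
```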
## Setup
```bash
git clone https://huggingface.co/dwijverma2/doc-enricher
cd doc-enricher
pip install python-docx requests
```
Requires Ollama running locally with Llama3:
```bash
ollama pull llama3
ollama serve  # if not already running
```
## Usage
### Python API
```python
from doc_enricher import DocumentEnricher, DocxHandler

enricher = DocumentEnricher(
    handler=DocxHandler(),
    model="llama3",
)

# Single file
enricher.enrich("report.docx", "report_enriched.docx")

# Batch → all .docx files in a directory
enricher.enrich_batch("./originals/", "./enriched/")
```
### CLI
```bash
# Single file
python -m doc_enricher.cli report.docx -o report_enriched.docx

# Batch mode
python -m doc_enricher.cli --batch ./originals/ -o ./enriched/

# Custom model + verbose
python -m doc_enricher.cli report.docx -o out.docx --model llama3:8b -v
```
### CLI Options
| Flag | Default | Description |
|---|---|---|
| `-o, --output` | `{name}_enriched.docx` | Output path |
| `--batch` | off | Process entire directory |
| `--model` | `llama3` | Ollama model name |
| `--ollama-url` | `http://localhost:11434` | Ollama API URL |
| `--max-tokens` | `3000` | Token budget per LLM chunk |
| `--overlap` | `3` | Paragraph overlap between chunks |
| `--no-formatting-hints` | off | Don't send existing formatting to the LLM |
| `-v, --verbose` | off | Debug logging |
## How It Works
### Chunking Strategy
Documents are split into chunks that fit within Llama3's 8K context window (~3000 tokens of paragraph text, leaving room for the system prompt and response). Adjacent chunks overlap by 3 paragraphs so boundary paragraphs have context from both sides. This follows Cohan et al. (2019): sequential sentence classification requires joint context.
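A simplified sketch of such a sliding-window chunker. The function name and the crude word-count token estimate are assumptions for illustration, not the module's actual implementation:

```python
def chunk_paragraphs(paragraphs, max_tokens=3000, overlap=3):
    """Split paragraphs into chunks under max_tokens, carrying `overlap`
    trailing paragraphs into the next chunk for boundary context."""
    def est_tokens(text):
        # Rough heuristic: ~1.3 tokens per whitespace-separated word.
        return int(len(text.split()) * 1.3)

    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        t = est_tokens(para)
        if current and current_tokens + t > max_tokens:
            chunks.append(current)
            # Start the next chunk with the last `overlap` paragraphs,
            # so boundary paragraphs are classified with context on both sides.
            current = current[-overlap:]
            current_tokens = sum(est_tokens(p) for p in current)
        current.append(para)
        current_tokens += t
    if current:
        chunks.append(current)
    return chunks
```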
### LLM Classification
Uses Ollama's /api/chat endpoint with "format": "json" (constrained decoding) for reliable structured output. The prompt includes existing formatting metadata (style name, bold, font size) as hints.
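A sketch of what that call might look like. The prompt wording and label schema here are assumptions; `"format": "json"` and the `message.content` response field are Ollama's actual chat API:

```python
import json
import requests

def build_chat_payload(paragraphs, model="llama3"):
    """Build an /api/chat request asking for one label per numbered paragraph.

    The system prompt and {"labels": ...} schema are illustrative, not the
    module's exact prompt.
    """
    numbered = "\n".join(f"{i}: {text}" for i, text in enumerate(paragraphs))
    return {
        "model": model,
        "format": "json",   # constrained decoding: reply must be valid JSON
        "stream": False,
        "messages": [
            {"role": "system", "content": (
                "Classify each numbered paragraph as TITLE, SECTION_HEADING, "
                'or BODY. Answer with JSON: {"labels": {"0": "BODY", ...}}'
            )},
            {"role": "user", "content": numbered},
        ],
    }

def classify_chunk(paragraphs, model="llama3", url="http://localhost:11434"):
    resp = requests.post(f"{url}/api/chat",
                         json=build_chat_payload(paragraphs, model),
                         timeout=120)
    resp.raise_for_status()
    return json.loads(resp.json()["message"]["content"])["labels"]
```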
### Formatting Application
Both named styles (Title, Heading 1, Normal) and run-level formatting (bold + font size) are applied β works whether your parser checks para.style.name or inspects run formatting directly.
## Architecture
```
doc_enricher/
├── __init__.py          # Package entry point
├── base_handler.py      # Abstract handler interface (extend for PDF/HTML)
├── handlers/
│   ├── __init__.py
│   └── docx_handler.py  # DOCX: extract paragraphs + apply formatting
├── llm_client.py        # Ollama /api/chat with JSON-constrained output
├── chunker.py           # Adaptive sliding window with overlap
├── enricher.py          # Main orchestrator
└── cli.py               # Command-line interface
test_module.py           # Test suite (no Ollama required)
```
## Adding a New Format
Implement `BaseHandler`:

```python
from doc_enricher.base_handler import BaseHandler, ParagraphInfo

class PdfHandler(BaseHandler):
    def extract_paragraphs(self, filepath: str) -> list[ParagraphInfo]:
        ...

    def apply_classifications(self, src_path, dst_path, classifications):
        ...
```
## Tests
```bash
python test_module.py
```
Creates a DOCX with intentionally broken formatting and verifies:
- ✅ Paragraph extraction with metadata
- ✅ Chunking (single chunk, multi-chunk, 150+ paragraphs)
- ✅ Classification application (mock LLM, no Ollama needed)
- ✅ Original document unchanged
- ✅ Edge cases (empty, whitespace-only, single paragraph)
## Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern