Document Re-enrichment Module

Fixes inconsistent document formatting using a local LLM (Ollama + Llama3) so your rule-based parser can correctly identify headings.

The Problem

Your documents have inconsistent formatting:

  • Titles aren't bold, same font size as body text
  • Section headings sometimes lack bold/size formatting
  • Body text is sometimes bold or underlined

Your parser relies on formatting cues (bold, font size) to classify heading levels, so inconsistent formatting leads to misclassification.

The Solution

Original Document → LLM Re-enrichment → Re-enriched Copy → Your Parser

The module:

  1. Extracts all paragraphs with their formatting metadata
  2. Chunks large documents into LLM-digestible batches (adaptive sliding window with overlap)
  3. Classifies each paragraph as TITLE, SECTION_HEADING, or BODY using Llama3 via Ollama
  4. Applies correct formatting (bold + font size + named styles) to a copy; the original is never modified
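The four steps above can be sketched end-to-end in pure Python. Everything here is illustrative: the uppercase heuristic stands in for the Llama3 call, and these function names are not the module's actual internals.

```python
def classify_with_llm(paragraphs):
    """Stand-in for the LLM call: label each paragraph.

    A toy heuristic replaces Llama3 here so the sketch runs offline.
    """
    labels = []
    for text in paragraphs:
        if text.isupper():
            labels.append("SECTION_HEADING")
        else:
            labels.append("BODY")
    return labels

def enrich(paragraphs):
    """Extract -> classify -> return (text, label) pairs for re-formatting."""
    labels = classify_with_llm(paragraphs)
    return list(zip(paragraphs, labels))
```

The real module additionally chunks the paragraph list before classification and writes the labels back as DOCX formatting.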

Setup

git clone https://huggingface.co/dwijverma2/doc-enricher
cd doc-enricher
pip install python-docx requests

Requires Ollama running locally with Llama3:

ollama pull llama3
ollama serve  # if not already running

Usage

Python API

from doc_enricher import DocumentEnricher, DocxHandler

enricher = DocumentEnricher(
    handler=DocxHandler(),
    model="llama3",
)

# Single file
enricher.enrich("report.docx", "report_enriched.docx")

# Batch: all .docx files in a directory
enricher.enrich_batch("./originals/", "./enriched/")

CLI

# Single file
python -m doc_enricher.cli report.docx -o report_enriched.docx

# Batch mode
python -m doc_enricher.cli --batch ./originals/ -o ./enriched/

# Custom model + verbose
python -m doc_enricher.cli report.docx -o out.docx --model llama3:8b -v

CLI Options

Flag                     Default                  Description
-o, --output             {name}_enriched.docx     Output path
--batch                  off                      Process entire directory
--model                  llama3                   Ollama model name
--ollama-url             http://localhost:11434   Ollama API URL
--max-tokens             3000                     Token budget per LLM chunk
--overlap                3                        Paragraph overlap between chunks
--no-formatting-hints    off                      Don't send existing formatting to the LLM
-v, --verbose            off                      Debug logging

How It Works

Chunking Strategy

Documents are split into chunks that fit within Llama3's 8K context window (~3000 tokens of paragraph text, leaving room for the system prompt and response). Adjacent chunks overlap by 3 paragraphs so boundary paragraphs have context from both sides. This follows Cohan et al. (2019), which showed that sequential sentence classification benefits from joint context.
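A minimal version of this sliding window could look like the following. The 4-characters-per-token estimate is a crude assumption for the sketch; the module's actual chunker may estimate tokens differently.

```python
def chunk_paragraphs(paragraphs, max_tokens=3000, overlap=3):
    """Split paragraphs into chunks under a token budget, with overlap.

    When a chunk fills up, the last `overlap` paragraphs are carried into
    the next chunk so boundary paragraphs keep context from both sides.
    """
    est_tokens = lambda text: max(1, len(text) // 4)  # rough 4 chars/token
    chunks, current, budget = [], [], 0
    for para in paragraphs:
        cost = est_tokens(para)
        if current and budget + cost > max_tokens:
            chunks.append(current)
            current = current[-overlap:]              # carry overlap forward
            budget = sum(est_tokens(p) for p in current)
        current.append(para)
        budget += cost
    if current:
        chunks.append(current)
    return chunks
```

With the defaults above, a 150-paragraph document yields several chunks whose boundaries share 3 paragraphs each.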

LLM Classification

Uses Ollama's /api/chat endpoint with "format": "json" (constrained decoding) for reliable structured output. The prompt includes existing formatting metadata (style name, bold, font size) as hints.
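A request to Ollama's /api/chat with constrained JSON output looks roughly like this. The prompt wording and the {"labels": [...]} response schema are illustrative assumptions, not the module's exact prompt.

```python
import json

def build_chat_payload(paragraphs, model="llama3"):
    """Build an Ollama /api/chat request body asking for JSON-only output."""
    numbered = "\n".join(f"{i}: {p}" for i, p in enumerate(paragraphs))
    return {
        "model": model,
        "format": "json",   # constrained decoding: reply must be valid JSON
        "stream": False,
        "messages": [
            {"role": "system",
             "content": "Classify each numbered paragraph as TITLE, "
                        "SECTION_HEADING, or BODY. Reply as JSON: "
                        '{"labels": ["...", ...]}'},
            {"role": "user", "content": numbered},
        ],
    }

# Sending it (requires a running Ollama server; not executed here):
# import requests
# r = requests.post("http://localhost:11434/api/chat",
#                   json=build_chat_payload(["Intro", "Body text"]))
# labels = json.loads(r.json()["message"]["content"])["labels"]
```

Setting "format": "json" makes Ollama reject non-JSON tokens during decoding, which avoids most parsing failures on the client side.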

Formatting Application

Both named styles (Title, Heading 1, Normal) and run-level formatting (bold + font size) are applied, so enrichment works whether your parser checks para.style.name or inspects run formatting directly.
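A sketch of this two-layer application, assuming hypothetical style names and point sizes (the module's exact values may differ):

```python
# Map each label to a named style plus run-level formatting.
# These style names and sizes are assumptions for illustration.
FORMAT_MAP = {
    "TITLE":           {"style": "Title",     "bold": True,  "size_pt": 20},
    "SECTION_HEADING": {"style": "Heading 1", "bold": True,  "size_pt": 14},
    "BODY":            {"style": "Normal",    "bold": False, "size_pt": 11},
}

def format_spec(label):
    """Return the named style and run-level formatting for a label."""
    return FORMAT_MAP.get(label, FORMAT_MAP["BODY"])

# Applying a spec with python-docx (not executed here):
# from docx.shared import Pt
# spec = format_spec("SECTION_HEADING")
# paragraph.style = doc.styles[spec["style"]]   # named style
# for run in paragraph.runs:                    # run-level formatting
#     run.bold = spec["bold"]
#     run.font.size = Pt(spec["size_pt"])
```

Writing both layers is redundant on purpose: parsers that only read para.style.name and parsers that only inspect runs both see consistent cues.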

Architecture

doc_enricher/
├── __init__.py              # Package entry point
├── base_handler.py          # Abstract handler interface (extend for PDF/HTML)
├── handlers/
│   ├── __init__.py
│   └── docx_handler.py      # DOCX: extract paragraphs + apply formatting
├── llm_client.py            # Ollama /api/chat with JSON-constrained output
├── chunker.py               # Adaptive sliding window with overlap
├── enricher.py              # Main orchestrator
└── cli.py                   # Command-line interface
test_module.py               # Test suite (no Ollama required)

Adding a New Format

Implement BaseHandler:

from doc_enricher.base_handler import BaseHandler, ParagraphInfo

class PdfHandler(BaseHandler):
    def extract_paragraphs(self, filepath: str) -> list[ParagraphInfo]:
        ...
    def apply_classifications(self, src_path, dst_path, classifications):
        ...

Tests

python test_module.py

Creates a DOCX with intentionally broken formatting and verifies:

  • βœ… Paragraph extraction with metadata
  • βœ… Chunking (single chunk, multi-chunk, 150+ paragraphs)
  • βœ… Classification application (mock LLM, no Ollama needed)
  • βœ… Original document unchanged
  • βœ… Edge cases (empty, whitespace-only, single paragraph)

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
