---
tags:
- ml-intern
---
# Document Re-enrichment Module
Fixes inconsistent document formatting using a local LLM (Ollama + Llama3) so your rule-based parser can correctly identify headings.
## The Problem
Your documents have inconsistent formatting:
- Titles aren't bold, same font size as body text
- Section headings sometimes lack bold/size formatting
- Body text is sometimes bold or underlined
Your parser relies on formatting cues (bold, font size) to classify heading levels → misclassification.
## The Solution
Original Document → LLM Re-enrichment → Re-enriched Copy → Your Parser
The module:
- Extracts all paragraphs with their formatting metadata
- Chunks large documents into LLM-digestible batches (adaptive sliding window with overlap)
- Classifies each paragraph as `TITLE`, `SECTION_HEADING`, or `BODY` using Llama3 via Ollama
- Applies correct formatting (bold + font size + named styles) to a copy → the original is never modified
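The steps above can be sketched as a small orchestrator. All names here are hypothetical illustrations; the real module's interfaces may differ:

```python
def enrich(src_path, dst_path, handler, chunker, classify):
    """Illustrative pipeline: extract -> chunk -> classify -> re-format a copy."""
    # 1. Extract paragraphs (text + formatting metadata) from the original.
    paragraphs = handler.extract_paragraphs(src_path)
    # 2. Split (index, text) pairs into overlapping, LLM-sized chunks.
    indexed = list(enumerate(p["text"] for p in paragraphs))
    labels = {}
    for chunk in chunker(indexed):
        # 3. One label per paragraph index; overlap labels from a later chunk
        #    simply overwrite the earlier ones.
        labels.update(classify(chunk))
    # 4. Apply formatting to a copy; the original file is never modified.
    handler.apply_classifications(src_path, dst_path, labels)
```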
## Setup
```bash
git clone https://huggingface.co/dwijverma2/doc-enricher
cd doc-enricher
pip install python-docx requests
```
Requires Ollama running locally with Llama3:
```bash
ollama pull llama3
ollama serve  # if not already running
```
## Usage
### Python API
```python
from doc_enricher import DocumentEnricher, DocxHandler

enricher = DocumentEnricher(
    handler=DocxHandler(),
    model="llama3",
)

# Single file
enricher.enrich("report.docx", "report_enriched.docx")

# Batch → all .docx files in a directory
enricher.enrich_batch("./originals/", "./enriched/")
```
### CLI
```bash
# Single file
python -m doc_enricher.cli report.docx -o report_enriched.docx

# Batch mode
python -m doc_enricher.cli --batch ./originals/ -o ./enriched/

# Custom model + verbose
python -m doc_enricher.cli report.docx -o out.docx --model llama3:8b -v
```
### CLI Options
| Flag | Default | Description |
|---|---|---|
| `-o, --output` | `{name}_enriched.docx` | Output path |
| `--batch` | off | Process entire directory |
| `--model` | `llama3` | Ollama model name |
| `--ollama-url` | `http://localhost:11434` | Ollama API URL |
| `--max-tokens` | `3000` | Token budget per LLM chunk |
| `--overlap` | `3` | Paragraph overlap between chunks |
| `--no-formatting-hints` | off | Don't send existing formatting to the LLM |
| `-v, --verbose` | off | Debug logging |
## How It Works
### Chunking Strategy
Documents are split into chunks that fit within Llama3's 8K context window (~3000 tokens of paragraph text, leaving room for the system prompt and response). Adjacent chunks overlap by 3 paragraphs so boundary paragraphs have context from both sides. This follows Cohan et al. (2019): sequential sentence classification requires joint context.
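A simplified sketch of such a sliding-window chunker. The function name and the crude word-count token estimate are assumptions for illustration, not the module's actual implementation:

```python
def chunk_paragraphs(paragraphs, max_tokens=3000, overlap=3):
    """Split paragraphs into chunks under max_tokens, carrying `overlap`
    trailing paragraphs into the next chunk for boundary context."""
    def est_tokens(text):
        # Rough heuristic: ~1.3 tokens per whitespace-separated word.
        return int(len(text.split()) * 1.3)

    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        t = est_tokens(para)
        if current and current_tokens + t > max_tokens:
            chunks.append(current)
            # Start the next chunk with the last `overlap` paragraphs,
            # so boundary paragraphs are classified with context on both sides.
            current = current[-overlap:]
            current_tokens = sum(est_tokens(p) for p in current)
        current.append(para)
        current_tokens += t
    if current:
        chunks.append(current)
    return chunks
```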
### LLM Classification
Uses Ollama's /api/chat endpoint with "format": "json" (constrained decoding) for reliable structured output. The prompt includes existing formatting metadata (style name, bold, font size) as hints.
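A sketch of what that call might look like. The prompt wording and label schema here are assumptions; `"format": "json"` and the `message.content` response field are Ollama's actual chat API:

```python
import json
import requests

def build_chat_payload(paragraphs, model="llama3"):
    """Build an /api/chat request asking for one label per numbered paragraph.

    The system prompt and {"labels": ...} schema are illustrative, not the
    module's exact prompt.
    """
    numbered = "\n".join(f"{i}: {text}" for i, text in enumerate(paragraphs))
    return {
        "model": model,
        "format": "json",   # constrained decoding: reply must be valid JSON
        "stream": False,
        "messages": [
            {"role": "system", "content": (
                "Classify each numbered paragraph as TITLE, SECTION_HEADING, "
                'or BODY. Answer with JSON: {"labels": {"0": "BODY", ...}}'
            )},
            {"role": "user", "content": numbered},
        ],
    }

def classify_chunk(paragraphs, model="llama3", url="http://localhost:11434"):
    resp = requests.post(f"{url}/api/chat",
                         json=build_chat_payload(paragraphs, model),
                         timeout=120)
    resp.raise_for_status()
    return json.loads(resp.json()["message"]["content"])["labels"]
```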
### Formatting Application
Both named styles (Title, Heading 1, Normal) and run-level formatting (bold + font size) are applied β works whether your parser checks para.style.name or inspects run formatting directly.
## Architecture
```
doc_enricher/
├── __init__.py          # Package entry point
├── base_handler.py      # Abstract handler interface (extend for PDF/HTML)
├── handlers/
│   ├── __init__.py
│   └── docx_handler.py  # DOCX: extract paragraphs + apply formatting
├── llm_client.py        # Ollama /api/chat with JSON-constrained output
├── chunker.py           # Adaptive sliding window with overlap
├── enricher.py          # Main orchestrator
└── cli.py               # Command-line interface
test_module.py           # Test suite (no Ollama required)
```
## Adding a New Format
Implement `BaseHandler`:

```python
from doc_enricher.base_handler import BaseHandler, ParagraphInfo

class PdfHandler(BaseHandler):
    def extract_paragraphs(self, filepath: str) -> list[ParagraphInfo]:
        ...

    def apply_classifications(self, src_path, dst_path, classifications):
        ...
```
## Tests
```bash
python test_module.py
```
Creates a DOCX with intentionally broken formatting and verifies:
- ✅ Paragraph extraction with metadata
- ✅ Chunking (single chunk, multi-chunk, 150+ paragraphs)
- ✅ Classification application (mock LLM, no Ollama needed)
- ✅ Original document unchanged
- ✅ Edge cases (empty, whitespace-only, single paragraph)
## Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern