---
tags:
- ml-intern
---
# Document Re-enrichment Module

Fixes inconsistent document formatting using a local LLM (Ollama + Llama3) so your rule-based parser can correctly identify headings.

## The Problem

Your documents have inconsistent formatting:
- Titles aren't bold and use the same font size as body text
- Section headings sometimes lack bold/size formatting
- Body text is sometimes bold or underlined

Your parser relies on formatting cues (bold, font size) to classify heading levels → **misclassification**.

## The Solution

```
Original Document → LLM Re-enrichment → Re-enriched Copy → Your Parser
```

The module:
1. **Extracts** all paragraphs with their formatting metadata
2. **Chunks** large documents into LLM-digestible batches (adaptive sliding window with overlap)
3. **Classifies** each paragraph as `TITLE`, `SECTION_HEADING`, or `BODY` using Llama3 via Ollama
4. **Applies** correct formatting (bold + font size + named styles) to a **copy**; the original is never modified
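The four steps above compose roughly like this. This is a sketch of the control flow only; the function names here are illustrative stand-ins, not the repo's actual internals:

```python
def enrich(extract, chunk, classify, apply_fmt, src, dst):
    """Hypothetical orchestrator: extract -> chunk -> classify -> apply.

    `classify` is assumed to return {global_paragraph_index: label};
    on overlapping windows, later windows simply overwrite earlier labels.
    """
    paragraphs = extract(src)                    # 1. text + formatting metadata
    labels = {}
    for window in chunk(paragraphs):             # 2. token-budgeted windows
        for idx, label in classify(window).items():  # 3. LLM label per paragraph
            labels[idx] = label
    apply_fmt(src, dst, labels)                  # 4. write the formatted copy
    return labels
```

Because each stage is just a callable here, the loop can be exercised with stubs and no Ollama server, which mirrors how `test_module.py` reportedly mocks the LLM.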

## Setup

```bash
git clone https://huggingface.co/dwijverma2/doc-enricher
cd doc-enricher
pip install python-docx requests
```

Requires [Ollama](https://ollama.ai) running locally with Llama3:
```bash
ollama pull llama3
ollama serve  # if not already running
```

## Usage

### Python API
```python
from doc_enricher import DocumentEnricher, DocxHandler

enricher = DocumentEnricher(
    handler=DocxHandler(),
    model="llama3",
)

# Single file
enricher.enrich("report.docx", "report_enriched.docx")

# Batch: all .docx files in a directory
enricher.enrich_batch("./originals/", "./enriched/")
```

### CLI
```bash
# Single file
python -m doc_enricher.cli report.docx -o report_enriched.docx

# Batch mode
python -m doc_enricher.cli --batch ./originals/ -o ./enriched/

# Custom model + verbose
python -m doc_enricher.cli report.docx -o out.docx --model llama3:8b -v
```

### CLI Options
| Flag | Default | Description |
|------|---------|-------------|
| `-o, --output` | `{name}_enriched.docx` | Output path |
| `--batch` | off | Process entire directory |
| `--model` | `llama3` | Ollama model name |
| `--ollama-url` | `http://localhost:11434` | Ollama API URL |
| `--max-tokens` | `3000` | Token budget per LLM chunk |
| `--overlap` | `3` | Paragraph overlap between chunks |
| `--no-formatting-hints` | off | Don't send existing formatting to LLM |
| `-v, --verbose` | off | Debug logging |

## How It Works

### Chunking Strategy
Documents are split into chunks that fit within Llama3's 8K context window (~3000 tokens of paragraph text, leaving room for the system prompt and response). Adjacent chunks overlap by 3 paragraphs so boundary paragraphs have context from both sides. Based on [Cohan et al. 2019](https://arxiv.org/abs/1909.04054): sequential sentence classification requires joint context.
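A sliding window with overlap can be sketched as below. The token estimator (~4 characters per token) and function name are assumptions for illustration, not the repo's `chunker.py` verbatim:

```python
def chunk_paragraphs(paragraphs, max_tokens=3000, overlap=3):
    """Greedy sliding window: pack paragraphs until the token budget is
    hit, then restart `overlap` paragraphs back so boundary paragraphs
    are classified with context on both sides."""
    est = lambda text: max(1, len(text) // 4)  # rough ~4 chars/token heuristic
    chunks, start = [], 0
    while start < len(paragraphs):
        budget, end = 0, start
        while end < len(paragraphs):
            cost = est(paragraphs[end])
            if budget + cost > max_tokens and end > start:
                break  # budget hit; oversized single paragraphs still get a chunk
            budget += cost
            end += 1
        chunks.append(paragraphs[start:end])
        if end >= len(paragraphs):
            break
        start = max(start + 1, end - overlap)  # overlap, but always advance
    return chunks
```

The `max(start + 1, ...)` guard guarantees forward progress even when a chunk holds fewer paragraphs than the overlap.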

### LLM Classification
Uses Ollama's `/api/chat` endpoint with `"format": "json"` (constrained decoding) for reliable structured output. The prompt includes existing formatting metadata (style name, bold, font size) as hints.
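The request that classification step likely sends can be sketched as follows. The `/api/chat` endpoint, `"format": "json"`, and the response shape are Ollama's documented API; the prompt wording and helper name are illustrative assumptions, not the repo's exact code:

```python
def build_classify_payload(paragraphs, model="llama3"):
    """Build an Ollama /api/chat request asking for JSON-constrained output.

    Each paragraph is a dict with 'text' plus optional formatting hints
    ('style', 'bold') that the prompt passes along to the model.
    """
    numbered = "\n".join(
        f"{i}: {p['text']} (style={p.get('style')}, bold={p.get('bold')})"
        for i, p in enumerate(paragraphs)
    )
    return {
        "model": model,
        "stream": False,
        "format": "json",  # constrained decoding: the reply must be valid JSON
        "messages": [
            {"role": "system",
             "content": ("Classify each numbered paragraph as TITLE, "
                         "SECTION_HEADING, or BODY. Reply with JSON: "
                         '{"labels": ["BODY", ...]}')},
            {"role": "user", "content": numbered},
        ],
    }

# Sending it requires a running Ollama server, e.g.:
#   resp = requests.post("http://localhost:11434/api/chat",
#                        json=build_classify_payload(paras))
#   labels = json.loads(resp.json()["message"]["content"])["labels"]
```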

### Formatting Application
Both **named styles** (`Title`, `Heading 1`, `Normal`) and **run-level formatting** (bold + font size) are applied, so it works whether your parser checks `para.style.name` or inspects run formatting directly.
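A minimal sketch of that dual application with python-docx. The label-to-style mapping and point sizes here are hypothetical; the repo's actual values may differ:

```python
# Hypothetical mapping: LLM label -> (named style, point size, run bold).
STYLE_MAP = {
    "TITLE": ("Title", 20, True),
    "SECTION_HEADING": ("Heading 1", 14, True),
    "BODY": ("Normal", 11, False),
}

def apply_label(paragraph, label):
    """Set both the named style and run-level bold/size on a python-docx
    paragraph, so parsers reading either signal see the same answer."""
    from docx.shared import Pt  # python-docx
    style_name, size_pt, bold = STYLE_MAP[label]
    paragraph.style = style_name      # named style (checked via para.style.name)
    for run in paragraph.runs:        # run-level formatting
        run.bold = bold
        run.font.size = Pt(size_pt)
```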

## Architecture

```
doc_enricher/
├── __init__.py              # Package entry point
├── base_handler.py          # Abstract handler interface (extend for PDF/HTML)
├── handlers/
│   ├── __init__.py
│   └── docx_handler.py      # DOCX: extract paragraphs + apply formatting
├── llm_client.py            # Ollama /api/chat with JSON-constrained output
├── chunker.py               # Adaptive sliding window with overlap
├── enricher.py              # Main orchestrator
└── cli.py                   # Command-line interface
test_module.py               # Test suite (no Ollama required)
```

### Adding a New Format
Implement `BaseHandler`:
```python
from doc_enricher.base_handler import BaseHandler, ParagraphInfo

class PdfHandler(BaseHandler):
    def extract_paragraphs(self, filepath: str) -> list[ParagraphInfo]:
        ...
    def apply_classifications(self, src_path, dst_path, classifications):
        ...
```

## Tests

```bash
python test_module.py
```

Creates a DOCX with intentionally broken formatting and verifies:
- ✅ Paragraph extraction with metadata
- ✅ Chunking (single chunk, multi-chunk, 150+ paragraphs)
- ✅ Classification application (mock LLM, no Ollama needed)
- ✅ Original document unchanged
- ✅ Edge cases (empty, whitespace-only, single paragraph)

<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern