---
tags:
- ml-intern
---
# Document Re-enrichment Module
Fixes inconsistent document formatting using a local LLM (Ollama + Llama3) so your rule-based parser can correctly identify headings.
## The Problem
Your documents have inconsistent formatting:
- Titles aren't bold and use the same font size as body text
- Section headings sometimes lack bold/size formatting
- Body text is sometimes bold or underlined
Your parser relies on formatting cues (bold, font size) to classify heading levels → **misclassification**.
## The Solution
```
Original Document → LLM Re-enrichment → Re-enriched Copy → Your Parser
```
The module:
1. **Extracts** all paragraphs with their formatting metadata
2. **Chunks** large documents into LLM-digestible batches (adaptive sliding window with overlap)
3. **Classifies** each paragraph as `TITLE`, `SECTION_HEADING`, or `BODY` using Llama3 via Ollama
4. **Applies** correct formatting (bold + font size + named styles) to a **copy**; the original is never modified
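The four steps above compose roughly like this (a hypothetical sketch; the function and object names are illustrative, not the module's actual internals):

```python
def enrich(handler, chunker, llm, src_path, dst_path):
    """Illustrative orchestration of the extract → chunk → classify → apply pipeline."""
    paragraphs = handler.extract_paragraphs(src_path)   # 1. extract text + formatting metadata
    classifications = {}
    for chunk in chunker.chunk(paragraphs):             # 2. adaptive chunks with overlap
        classifications.update(llm.classify(chunk))     # 3. TITLE / SECTION_HEADING / BODY per paragraph
    handler.apply_classifications(src_path, dst_path,   # 4. format a copy; source stays untouched
                                  classifications)
```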
## Setup
```bash
git clone https://huggingface.co/dwijverma2/doc-enricher
cd doc-enricher
pip install python-docx requests
```
Requires [Ollama](https://ollama.ai) running locally with Llama3:
```bash
ollama pull llama3
ollama serve # if not already running
```
## Usage
### Python API
```python
from doc_enricher import DocumentEnricher, DocxHandler
enricher = DocumentEnricher(
    handler=DocxHandler(),
    model="llama3",
)
# Single file
enricher.enrich("report.docx", "report_enriched.docx")
# Batch: all .docx files in a directory
enricher.enrich_batch("./originals/", "./enriched/")
```
### CLI
```bash
# Single file
python -m doc_enricher.cli report.docx -o report_enriched.docx
# Batch mode
python -m doc_enricher.cli --batch ./originals/ -o ./enriched/
# Custom model + verbose
python -m doc_enricher.cli report.docx -o out.docx --model llama3:8b -v
```
### CLI Options
| Flag | Default | Description |
|------|---------|-------------|
| `-o, --output` | `{name}_enriched.docx` | Output path |
| `--batch` | off | Process entire directory |
| `--model` | `llama3` | Ollama model name |
| `--ollama-url` | `http://localhost:11434` | Ollama API URL |
| `--max-tokens` | `3000` | Token budget per LLM chunk |
| `--overlap` | `3` | Paragraph overlap between chunks |
| `--no-formatting-hints` | off | Don't send existing formatting to LLM |
| `-v, --verbose` | off | Debug logging |
## How It Works
### Chunking Strategy
Documents are split into chunks that fit within Llama3's 8K context window (~3000 tokens of paragraph text, leaving room for the system prompt and response). Adjacent chunks overlap by 3 paragraphs so boundary paragraphs have context from both sides. Based on [Cohan et al. 2019](https://arxiv.org/abs/1909.04054): sequential sentence classification requires joint context.
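The strategy can be sketched as follows (the function name and the tokens-per-character estimate are illustrative assumptions, not the real chunker):

```python
def chunk_paragraphs(paragraphs, max_tokens=3000, overlap=3):
    """Adaptive sliding window: pack paragraphs into chunks under a token
    budget, carrying the last `overlap` paragraphs across each boundary."""
    est = lambda text: max(1, len(text) // 4)  # rough estimate: ~4 chars per token
    chunks, current, budget = [], [], 0
    for para in paragraphs:
        cost = est(para)
        if current and budget + cost > max_tokens:
            chunks.append(current)
            current = current[-overlap:]       # repeat boundary paragraphs for context
            budget = sum(est(p) for p in current)
        current.append(para)
        budget += cost
    if current:
        chunks.append(current)
    return chunks
```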
### LLM Classification
Uses Ollama's `/api/chat` endpoint with `"format": "json"` (constrained decoding) for reliable structured output. The prompt includes existing formatting metadata (style name, bold, font size) as hints.
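A minimal stdlib-only sketch of that call: the `/api/chat` endpoint, the `"format": "json"` field, and the `message.content` response shape are Ollama's API; the prompt wording and the classification JSON shape are illustrative assumptions.

```python
import json
from urllib import request

def build_chat_payload(paragraphs, model="llama3"):
    """Build an Ollama /api/chat request; 'format': 'json' constrains
    decoding so the model must emit valid JSON."""
    numbered = "\n".join(f"{i}: {text}" for i, text in enumerate(paragraphs))
    return {
        "model": model,
        "messages": [
            {"role": "system", "content":
                "Classify each numbered paragraph as TITLE, SECTION_HEADING, or BODY. "
                'Reply with JSON like {"0": "TITLE", "1": "BODY"}.'},
            {"role": "user", "content": numbered},
        ],
        "format": "json",
        "stream": False,   # one complete JSON response instead of a token stream
    }

def classify_chunk(paragraphs, model="llama3", url="http://localhost:11434"):
    payload = json.dumps(build_chat_payload(paragraphs, model)).encode()
    req = request.Request(f"{url}/api/chat", data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=120) as resp:    # requires a running Ollama server
        body = json.load(resp)
    return json.loads(body["message"]["content"])      # parse the constrained JSON output
```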
### Formatting Application
Both **named styles** (`Title`, `Heading 1`, `Normal`) and **run-level formatting** (bold + font size) are applied, so it works whether your parser checks `para.style.name` or inspects run formatting directly.
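A sketch of that dual application with python-docx (the style map values and function signature are illustrative assumptions):

```python
# Assumed label -> (named style, point size, bold) mapping; values are illustrative
STYLE_MAP = {
    "TITLE": ("Title", 20, True),
    "SECTION_HEADING": ("Heading 1", 14, True),
    "BODY": ("Normal", 11, False),
}

def apply_classifications(src_path, dst_path, classifications):
    """Apply both named styles and run-level formatting to a copy of the document."""
    from docx import Document      # python-docx
    from docx.shared import Pt
    doc = Document(src_path)       # load the original...
    for idx, label in classifications.items():
        para = doc.paragraphs[idx]
        style_name, size, bold = STYLE_MAP[label]
        para.style = doc.styles[style_name]   # for parsers that check para.style.name
        for run in para.runs:                 # for parsers that inspect runs directly
            run.bold = bold
            run.font.size = Pt(size)
    doc.save(dst_path)             # ...and save the enriched copy; the source is untouched
```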
## Architecture
```
doc_enricher/
├── __init__.py          # Package entry point
├── base_handler.py      # Abstract handler interface (extend for PDF/HTML)
├── handlers/
│   ├── __init__.py
│   └── docx_handler.py  # DOCX: extract paragraphs + apply formatting
├── llm_client.py        # Ollama /api/chat with JSON-constrained output
├── chunker.py           # Adaptive sliding window with overlap
├── enricher.py          # Main orchestrator
└── cli.py               # Command-line interface
test_module.py           # Test suite (no Ollama required)
```
### Adding a New Format
Implement `BaseHandler`:
```python
from doc_enricher.base_handler import BaseHandler, ParagraphInfo
class PdfHandler(BaseHandler):
    def extract_paragraphs(self, filepath: str) -> list[ParagraphInfo]:
        ...

    def apply_classifications(self, src_path, dst_path, classifications):
        ...
```
## Tests
```bash
python test_module.py
```
Creates a DOCX with intentionally broken formatting and verifies:
- ✅ Paragraph extraction with metadata
- ✅ Chunking (single chunk, multi-chunk, 150+ paragraphs)
- ✅ Classification application (mock LLM, no Ollama needed)
- ✅ Original document unchanged
- ✅ Edge cases (empty, whitespace-only, single paragraph)
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern