---
tags:
- ml-intern
---
# Document Re-enrichment Module
Fixes inconsistent document formatting using a local LLM (Ollama + Llama3) so your rule-based parser can correctly identify headings.
## The Problem
Your documents have inconsistent formatting:
- Titles aren't bold and share the same font size as body text
- Section headings sometimes lack bold/size formatting
- Body text is sometimes bold or underlined
Your parser relies on formatting cues (bold, font size) to classify heading levels → **misclassification**.
## The Solution
```
Original Document → LLM Re-enrichment → Re-enriched Copy → Your Parser
```
The module:
1. **Extracts** all paragraphs with their formatting metadata
2. **Chunks** large documents into LLM-digestible batches (adaptive sliding window with overlap)
3. **Classifies** each paragraph as `TITLE`, `SECTION_HEADING`, or `BODY` using Llama3 via Ollama
4. **Applies** correct formatting (bold + font size + named styles) to a **copy**; the original is never modified
## Setup
```bash
git clone https://huggingface.co/dwijverma2/doc-enricher
cd doc-enricher
pip install python-docx requests
```
Requires [Ollama](https://ollama.ai) running locally with Llama3:
```bash
ollama pull llama3
ollama serve # if not already running
```
## Usage
### Python API
```python
from doc_enricher import DocumentEnricher, DocxHandler
enricher = DocumentEnricher(
handler=DocxHandler(),
model="llama3",
)
# Single file
enricher.enrich("report.docx", "report_enriched.docx")
# Batch: all .docx files in a directory
enricher.enrich_batch("./originals/", "./enriched/")
```
### CLI
```bash
# Single file
python -m doc_enricher.cli report.docx -o report_enriched.docx
# Batch mode
python -m doc_enricher.cli --batch ./originals/ -o ./enriched/
# Custom model + verbose
python -m doc_enricher.cli report.docx -o out.docx --model llama3:8b -v
```
### CLI Options
| Flag | Default | Description |
|------|---------|-------------|
| `-o, --output` | `{name}_enriched.docx` | Output path |
| `--batch` | off | Process entire directory |
| `--model` | `llama3` | Ollama model name |
| `--ollama-url` | `http://localhost:11434` | Ollama API URL |
| `--max-tokens` | `3000` | Token budget per LLM chunk |
| `--overlap` | `3` | Paragraph overlap between chunks |
| `--no-formatting-hints` | off | Don't send existing formatting to LLM |
| `-v, --verbose` | off | Debug logging |
## How It Works
### Chunking Strategy
Documents are split into chunks that fit within Llama3's 8K context window (~3000 tokens of paragraph text, leaving room for the system prompt and response). Adjacent chunks overlap by 3 paragraphs so boundary paragraphs have context from both sides. Based on [Cohan et al. 2019](https://arxiv.org/abs/1909.04054): sequential sentence classification benefits from joint context.
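The sliding window above can be sketched as follows. This is a minimal illustration, not the module's actual `chunker.py`; the character-based token estimate and the function names are assumptions.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def chunk_paragraphs(paragraphs: list[str], max_tokens: int = 3000,
                     overlap: int = 3) -> list[list[str]]:
    """Pack paragraphs into chunks under a token budget; carry the last
    `overlap` paragraphs of each chunk into the next for boundary context."""
    chunks, current, budget = [], [], 0
    for para in paragraphs:
        cost = estimate_tokens(para)
        if current and budget + cost > max_tokens:
            chunks.append(current)
            current = current[-overlap:]  # overlapping tail seeds the next chunk
            budget = sum(estimate_tokens(p) for p in current)
        current.append(para)
        budget += cost
    if current:
        chunks.append(current)
    return chunks
```

With the defaults this yields ~3000-token chunks whose first 3 paragraphs repeat the previous chunk's tail, matching the `--max-tokens` and `--overlap` CLI flags.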
### LLM Classification
Uses Ollama's `/api/chat` endpoint with `"format": "json"` (constrained decoding) for reliable structured output. The prompt includes existing formatting metadata (style name, bold, font size) as hints.
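A minimal sketch of the request this implies, assuming Ollama's `/api/chat` endpoint with `"format": "json"`. The prompt wording and the index-to-label response schema are illustrative assumptions, not the module's exact prompt.

```python
def build_chat_payload(paragraphs: list[str], model: str = "llama3") -> dict:
    """Build an Ollama /api/chat payload with JSON-constrained output."""
    numbered = "\n".join(f"{i}: {p}" for i, p in enumerate(paragraphs))
    return {
        "model": model,
        "format": "json",   # constrained decoding: response is valid JSON
        "stream": False,
        "messages": [
            {"role": "system",
             "content": ("Classify each numbered paragraph as TITLE, "
                         "SECTION_HEADING, or BODY. Reply with a JSON object "
                         'mapping index to label, e.g. {"0": "TITLE"}.')},
            {"role": "user", "content": numbered},
        ],
    }

# Sending it (requires a local Ollama server):
# import json, requests
# resp = requests.post("http://localhost:11434/api/chat",
#                      json=build_chat_payload(["Quarterly Report", "Revenue grew."]))
# labels = json.loads(resp.json()["message"]["content"])
```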
### Formatting Application
Both **named styles** (`Title`, `Heading 1`, `Normal`) and **run-level formatting** (bold + font size) are applied, so it works whether your parser checks `para.style.name` or inspects run formatting directly.
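A hedged sketch of this double application using python-docx. The label-to-style mapping and font sizes here are assumptions for illustration; only `STYLE_MAP`'s shape, not its values, is implied by the source.

```python
STYLE_MAP = {
    # label: (named style, font size in pt, bold)
    "TITLE": ("Title", 20, True),
    "SECTION_HEADING": ("Heading 1", 14, True),
    "BODY": ("Normal", 11, False),
}

def format_for(label: str) -> tuple[str, int, bool]:
    """Fall back to BODY formatting for any unexpected label."""
    return STYLE_MAP.get(label, STYLE_MAP["BODY"])

def apply_formatting(src_path: str, dst_path: str, labels: list[str]) -> None:
    """Apply named styles AND run-level bold/size to a copy of the document."""
    from docx import Document      # python-docx; imported lazily
    from docx.shared import Pt
    doc = Document(src_path)       # loads a fresh copy; src file is untouched
    for para, label in zip(doc.paragraphs, labels):
        style_name, size, bold = format_for(label)
        para.style = doc.styles[style_name]   # for parsers reading style names
        for run in para.runs:                 # for parsers reading run formatting
            run.bold = bold
            run.font.size = Pt(size)
    doc.save(dst_path)             # write the enriched copy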
## Architecture
```
doc_enricher/
├── __init__.py         # Package entry point
├── base_handler.py     # Abstract handler interface (extend for PDF/HTML)
├── handlers/
│   ├── __init__.py
│   └── docx_handler.py # DOCX: extract paragraphs + apply formatting
├── llm_client.py       # Ollama /api/chat with JSON-constrained output
├── chunker.py          # Adaptive sliding window with overlap
├── enricher.py         # Main orchestrator
└── cli.py              # Command-line interface
test_module.py # Test suite (no Ollama required)
```
### Adding a New Format
Implement `BaseHandler`:
```python
from doc_enricher.base_handler import BaseHandler, ParagraphInfo
class PdfHandler(BaseHandler):
    def extract_paragraphs(self, filepath: str) -> list[ParagraphInfo]:
        ...

    def apply_classifications(self, src_path, dst_path, classifications):
        ...
```
## Tests
```bash
python test_module.py
```
Creates a DOCX with intentionally broken formatting and verifies:
- ✅ Paragraph extraction with metadata
- ✅ Chunking (single chunk, multi-chunk, 150+ paragraphs)
- ✅ Classification application (mock LLM, no Ollama needed)
- ✅ Original document unchanged
- ✅ Edge cases (empty, whitespace-only, single paragraph)
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern