---
tags:
- ml-intern
---
# Document Re-enrichment Module

Fixes inconsistent document formatting using a local LLM (Ollama + Llama3) so your rule-based parser can correctly identify headings.

## The Problem

Your documents have inconsistent formatting:
- Titles aren't bold and use the same font size as body text
- Section headings sometimes lack bold/size formatting
- Body text is sometimes bold or underlined

Your parser relies on formatting cues (bold, font size) to classify heading levels → **misclassification**.

## The Solution

```
Original Document → LLM Re-enrichment → Re-enriched Copy → Your Parser
```

The module:
1. **Extracts** all paragraphs with their formatting metadata
2. **Chunks** large documents into LLM-digestible batches (adaptive sliding window with overlap)
3. **Classifies** each paragraph as `TITLE`, `SECTION_HEADING`, or `BODY` using Llama3 via Ollama
4. **Applies** correct formatting (bold + font size + named styles) to a **copy**; the original is never modified
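The four steps above compose roughly like this. This is a sketch of the control flow only; the function names here are illustrative stand-ins, not the repo's actual internals:

```python
def enrich(extract, chunk, classify, apply_fmt, src, dst):
    """Hypothetical orchestrator: extract -> chunk -> classify -> apply.

    `classify` is assumed to return {global_paragraph_index: label};
    on overlapping windows, later windows simply overwrite earlier labels.
    """
    paragraphs = extract(src)                    # 1. text + formatting metadata
    labels = {}
    for window in chunk(paragraphs):             # 2. token-budgeted windows
        for idx, label in classify(window).items():  # 3. LLM label per paragraph
            labels[idx] = label
    apply_fmt(src, dst, labels)                  # 4. write the formatted copy
    return labels
```

Because each stage is just a callable here, the loop can be exercised with stubs and no Ollama server, which mirrors how `test_module.py` reportedly mocks the LLM.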

## Setup

```bash
git clone https://huggingface.co/dwijverma2/doc-enricher
cd doc-enricher
pip install python-docx requests
```

Requires [Ollama](https://ollama.ai) running locally with Llama3:
```bash
ollama pull llama3
ollama serve  # if not already running
```

## Usage

### Python API
```python
from doc_enricher import DocumentEnricher, DocxHandler

enricher = DocumentEnricher(
    handler=DocxHandler(),
    model="llama3",
)

# Single file
enricher.enrich("report.docx", "report_enriched.docx")

# Batch: all .docx files in a directory
enricher.enrich_batch("./originals/", "./enriched/")
```

### CLI
```bash
# Single file
python -m doc_enricher.cli report.docx -o report_enriched.docx

# Batch mode
python -m doc_enricher.cli --batch ./originals/ -o ./enriched/

# Custom model + verbose
python -m doc_enricher.cli report.docx -o out.docx --model llama3:8b -v
```

### CLI Options
| Flag | Default | Description |
|------|---------|-------------|
| `-o, --output` | `{name}_enriched.docx` | Output path |
| `--batch` | off | Process entire directory |
| `--model` | `llama3` | Ollama model name |
| `--ollama-url` | `http://localhost:11434` | Ollama API URL |
| `--max-tokens` | `3000` | Token budget per LLM chunk |
| `--overlap` | `3` | Paragraph overlap between chunks |
| `--no-formatting-hints` | off | Don't send existing formatting to LLM |
| `-v, --verbose` | off | Debug logging |

## How It Works

### Chunking Strategy
Documents are split into chunks that fit within Llama3's 8K context window (~3000 tokens of paragraph text, leaving room for the system prompt and response). Adjacent chunks overlap by 3 paragraphs so boundary paragraphs have context from both sides. Based on [Cohan et al. 2019](https://arxiv.org/abs/1909.04054): sequential sentence classification requires joint context.
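A sliding window with overlap can be sketched as below. The token estimator (~4 characters per token) and function name are assumptions for illustration, not the repo's `chunker.py` verbatim:

```python
def chunk_paragraphs(paragraphs, max_tokens=3000, overlap=3):
    """Greedy sliding window: pack paragraphs until the token budget is
    hit, then restart `overlap` paragraphs back so boundary paragraphs
    are classified with context on both sides."""
    est = lambda text: max(1, len(text) // 4)  # rough ~4 chars/token heuristic
    chunks, start = [], 0
    while start < len(paragraphs):
        budget, end = 0, start
        while end < len(paragraphs):
            cost = est(paragraphs[end])
            if budget + cost > max_tokens and end > start:
                break  # budget hit; oversized single paragraphs still get a chunk
            budget += cost
            end += 1
        chunks.append(paragraphs[start:end])
        if end >= len(paragraphs):
            break
        start = max(start + 1, end - overlap)  # overlap, but always advance
    return chunks
```

The `max(start + 1, ...)` guard guarantees forward progress even when a chunk holds fewer paragraphs than the overlap.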

### LLM Classification
Uses Ollama's `/api/chat` endpoint with `"format": "json"` (constrained decoding) for reliable structured output. The prompt includes existing formatting metadata (style name, bold, font size) as hints.
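The request that classification step likely sends can be sketched as follows. The `/api/chat` endpoint, `"format": "json"`, and the response shape are Ollama's documented API; the prompt wording and helper name are illustrative assumptions, not the repo's exact code:

```python
def build_classify_payload(paragraphs, model="llama3"):
    """Build an Ollama /api/chat request asking for JSON-constrained output.

    Each paragraph is a dict with 'text' plus optional formatting hints
    ('style', 'bold') that the prompt passes along to the model.
    """
    numbered = "\n".join(
        f"{i}: {p['text']} (style={p.get('style')}, bold={p.get('bold')})"
        for i, p in enumerate(paragraphs)
    )
    return {
        "model": model,
        "stream": False,
        "format": "json",  # constrained decoding: the reply must be valid JSON
        "messages": [
            {"role": "system",
             "content": ("Classify each numbered paragraph as TITLE, "
                         "SECTION_HEADING, or BODY. Reply with JSON: "
                         '{"labels": ["BODY", ...]}')},
            {"role": "user", "content": numbered},
        ],
    }

# Sending it requires a running Ollama server, e.g.:
#   resp = requests.post("http://localhost:11434/api/chat",
#                        json=build_classify_payload(paras))
#   labels = json.loads(resp.json()["message"]["content"])["labels"]
```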

### Formatting Application
Both **named styles** (`Title`, `Heading 1`, `Normal`) and **run-level formatting** (bold + font size) are applied, so it works whether your parser checks `para.style.name` or inspects run formatting directly.
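A minimal sketch of that dual application with python-docx. The label-to-style mapping and point sizes here are hypothetical; the repo's actual values may differ:

```python
# Hypothetical mapping: LLM label -> (named style, point size, run bold).
STYLE_MAP = {
    "TITLE": ("Title", 20, True),
    "SECTION_HEADING": ("Heading 1", 14, True),
    "BODY": ("Normal", 11, False),
}

def apply_label(paragraph, label):
    """Set both the named style and run-level bold/size on a python-docx
    paragraph, so parsers reading either signal see the same answer."""
    from docx.shared import Pt  # python-docx
    style_name, size_pt, bold = STYLE_MAP[label]
    paragraph.style = style_name      # named style (checked via para.style.name)
    for run in paragraph.runs:        # run-level formatting
        run.bold = bold
        run.font.size = Pt(size_pt)
```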

## Architecture

```
doc_enricher/
├── __init__.py              # Package entry point
├── base_handler.py          # Abstract handler interface (extend for PDF/HTML)
├── handlers/
│   ├── __init__.py
│   └── docx_handler.py      # DOCX: extract paragraphs + apply formatting
├── llm_client.py            # Ollama /api/chat with JSON-constrained output
├── chunker.py               # Adaptive sliding window with overlap
├── enricher.py              # Main orchestrator
└── cli.py                   # Command-line interface
test_module.py               # Test suite (no Ollama required)
```

### Adding a New Format
Implement `BaseHandler`:
```python
from doc_enricher.base_handler import BaseHandler, ParagraphInfo

class PdfHandler(BaseHandler):
    def extract_paragraphs(self, filepath: str) -> list[ParagraphInfo]:
        ...
    def apply_classifications(self, src_path, dst_path, classifications):
        ...
```

## Tests

```bash
python test_module.py
```

Creates a DOCX with intentionally broken formatting and verifies:
- ✅ Paragraph extraction with metadata
- ✅ Chunking (single chunk, multi-chunk, 150+ paragraphs)
- ✅ Classification application (mock LLM, no Ollama needed)
- ✅ Original document unchanged
- ✅ Edge cases (empty, whitespace-only, single paragraph)

<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern