Add README with usage docs
Browse files
README.md
ADDED
|
@@ -0,0 +1,132 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Document Re-enrichment Module
|
| 2 |
+
|
| 3 |
+
Fixes inconsistent document formatting using a local LLM (Ollama + Llama3) so your rule-based parser can correctly identify headings.
|
| 4 |
+
|
| 5 |
+
## The Problem
|
| 6 |
+
|
| 7 |
+
Your documents have inconsistent formatting:
|
| 8 |
+
- Titles aren't bold, same font size as body text
|
| 9 |
+
- Section headings sometimes lack bold/size formatting
|
| 10 |
+
- Body text is sometimes bold or underlined
|
| 11 |
+
|
| 12 |
+
Your parser relies on formatting cues (bold, font size) to classify heading levels β **misclassification**.
|
| 13 |
+
|
| 14 |
+
## The Solution
|
| 15 |
+
|
| 16 |
+
```
|
| 17 |
+
Original Document β LLM Re-enrichment β Re-enriched Copy β Your Parser
|
| 18 |
+
```
|
| 19 |
+
|
| 20 |
+
The module:
|
| 21 |
+
1. **Extracts** all paragraphs with their formatting metadata
|
| 22 |
+
2. **Chunks** large documents into LLM-digestible batches (adaptive sliding window with overlap)
|
| 23 |
+
3. **Classifies** each paragraph as `TITLE`, `SECTION_HEADING`, or `BODY` using Llama3 via Ollama
|
| 24 |
+
4. **Applies** correct formatting (bold + font size + named styles) to a **copy** β original is never modified
|
| 25 |
+
|
| 26 |
+
## Setup
|
| 27 |
+
|
| 28 |
+
```bash
|
| 29 |
+
git clone https://huggingface.co/dwijverma2/doc-enricher
|
| 30 |
+
cd doc-enricher
|
| 31 |
+
pip install python-docx requests
|
| 32 |
+
```
|
| 33 |
+
|
| 34 |
+
Requires [Ollama](https://ollama.ai) running locally with Llama3:
|
| 35 |
+
```bash
|
| 36 |
+
ollama pull llama3
|
| 37 |
+
ollama serve # if not already running
|
| 38 |
+
```
|
| 39 |
+
|
| 40 |
+
## Usage
|
| 41 |
+
|
| 42 |
+
### Python API
|
| 43 |
+
```python
|
| 44 |
+
from doc_enricher import DocumentEnricher, DocxHandler
|
| 45 |
+
|
| 46 |
+
enricher = DocumentEnricher(
|
| 47 |
+
handler=DocxHandler(),
|
| 48 |
+
model="llama3",
|
| 49 |
+
)
|
| 50 |
+
|
| 51 |
+
# Single file
|
| 52 |
+
enricher.enrich("report.docx", "report_enriched.docx")
|
| 53 |
+
|
| 54 |
+
# Batch β all .docx files in a directory
|
| 55 |
+
enricher.enrich_batch("./originals/", "./enriched/")
|
| 56 |
+
```
|
| 57 |
+
|
| 58 |
+
### CLI
|
| 59 |
+
```bash
|
| 60 |
+
# Single file
|
| 61 |
+
python -m doc_enricher.cli report.docx -o report_enriched.docx
|
| 62 |
+
|
| 63 |
+
# Batch mode
|
| 64 |
+
python -m doc_enricher.cli --batch ./originals/ -o ./enriched/
|
| 65 |
+
|
| 66 |
+
# Custom model + verbose
|
| 67 |
+
python -m doc_enricher.cli report.docx -o out.docx --model llama3:8b -v
|
| 68 |
+
```
|
| 69 |
+
|
| 70 |
+
### CLI Options
|
| 71 |
+
| Flag | Default | Description |
|
| 72 |
+
|------|---------|-------------|
|
| 73 |
+
| `-o, --output` | `{name}_enriched.docx` | Output path |
|
| 74 |
+
| `--batch` | off | Process entire directory |
|
| 75 |
+
| `--model` | `llama3` | Ollama model name |
|
| 76 |
+
| `--ollama-url` | `http://localhost:11434` | Ollama API URL |
|
| 77 |
+
| `--max-tokens` | `3000` | Token budget per LLM chunk |
|
| 78 |
+
| `--overlap` | `3` | Paragraph overlap between chunks |
|
| 79 |
+
| `--no-formatting-hints` | off | Don't send existing formatting to LLM |
|
| 80 |
+
| `-v, --verbose` | off | Debug logging |
|
| 81 |
+
|
| 82 |
+
## How It Works
|
| 83 |
+
|
| 84 |
+
### Chunking Strategy
|
| 85 |
+
Documents are split into chunks that fit within Llama3's 8K context window (~3000 tokens of paragraph text, leaving room for the system prompt and response). Adjacent chunks overlap by 3 paragraphs so boundary paragraphs have context from both sides. Based on [Cohan et al. 2019](https://arxiv.org/abs/1909.04054) β sequential sentence classification requires joint context.
|
| 86 |
+
|
| 87 |
+
### LLM Classification
|
| 88 |
+
Uses Ollama's `/api/chat` endpoint with `"format": "json"` (constrained decoding) for reliable structured output. The prompt includes existing formatting metadata (style name, bold, font size) as hints.
|
| 89 |
+
|
| 90 |
+
### Formatting Application
|
| 91 |
+
Both **named styles** (`Title`, `Heading 1`, `Normal`) and **run-level formatting** (bold + font size) are applied β works whether your parser checks `para.style.name` or inspects run formatting directly.
|
| 92 |
+
|
| 93 |
+
## Architecture
|
| 94 |
+
|
| 95 |
+
```
|
| 96 |
+
doc_enricher/
|
| 97 |
+
βββ __init__.py # Package entry point
|
| 98 |
+
βββ base_handler.py # Abstract handler interface (extend for PDF/HTML)
|
| 99 |
+
βββ handlers/
|
| 100 |
+
β βββ __init__.py
|
| 101 |
+
β βββ docx_handler.py # DOCX: extract paragraphs + apply formatting
|
| 102 |
+
βββ llm_client.py # Ollama /api/chat with JSON-constrained output
|
| 103 |
+
βββ chunker.py # Adaptive sliding window with overlap
|
| 104 |
+
βββ enricher.py # Main orchestrator
|
| 105 |
+
βββ cli.py # Command-line interface
|
| 106 |
+
test_module.py # Test suite (no Ollama required)
|
| 107 |
+
```
|
| 108 |
+
|
| 109 |
+
### Adding a New Format
|
| 110 |
+
Implement `BaseHandler`:
|
| 111 |
+
```python
|
| 112 |
+
from doc_enricher.base_handler import BaseHandler, ParagraphInfo
|
| 113 |
+
|
| 114 |
+
class PdfHandler(BaseHandler):
|
| 115 |
+
def extract_paragraphs(self, filepath: str) -> list[ParagraphInfo]:
|
| 116 |
+
...
|
| 117 |
+
def apply_classifications(self, src_path, dst_path, classifications):
|
| 118 |
+
...
|
| 119 |
+
```
|
| 120 |
+
|
| 121 |
+
## Tests
|
| 122 |
+
|
| 123 |
+
```bash
|
| 124 |
+
python test_module.py
|
| 125 |
+
```
|
| 126 |
+
|
| 127 |
+
Creates a DOCX with intentionally broken formatting and verifies:
|
| 128 |
+
- β
Paragraph extraction with metadata
|
| 129 |
+
- β
Chunking (single chunk, multi-chunk, 150+ paragraphs)
|
| 130 |
+
- β
Classification application (mock LLM, no Ollama needed)
|
| 131 |
+
- β
Original document unchanged
|
| 132 |
+
- β
Edge cases (empty, whitespace-only, single paragraph)
|