dwijverma2 commited on
Commit
dcbb30e
Β·
verified Β·
1 Parent(s): 0fbb682

Add README with usage docs

Browse files
Files changed (1) hide show
  1. README.md +132 -0
README.md ADDED
@@ -0,0 +1,132 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Document Re-enrichment Module
2
+
3
+ Fixes inconsistent document formatting using a local LLM (Ollama + Llama3) so your rule-based parser can correctly identify headings.
4
+
5
+ ## The Problem
6
+
7
+ Your documents have inconsistent formatting:
8
+ - Titles aren't bold, same font size as body text
9
+ - Section headings sometimes lack bold/size formatting
10
+ - Body text is sometimes bold or underlined
11
+
12
+ Your parser relies on formatting cues (bold, font size) to classify heading levels β†’ **misclassification**.
13
+
14
+ ## The Solution
15
+
16
+ ```
17
+ Original Document β†’ LLM Re-enrichment β†’ Re-enriched Copy β†’ Your Parser
18
+ ```
19
+
20
+ The module:
21
+ 1. **Extracts** all paragraphs with their formatting metadata
22
+ 2. **Chunks** large documents into LLM-digestible batches (adaptive sliding window with overlap)
23
+ 3. **Classifies** each paragraph as `TITLE`, `SECTION_HEADING`, or `BODY` using Llama3 via Ollama
24
+ 4. **Applies** correct formatting (bold + font size + named styles) to a **copy** β€” original is never modified
25
+
26
+ ## Setup
27
+
28
+ ```bash
29
+ git clone https://huggingface.co/dwijverma2/doc-enricher
30
+ cd doc-enricher
31
+ pip install python-docx requests
32
+ ```
33
+
34
+ Requires [Ollama](https://ollama.ai) running locally with Llama3:
35
+ ```bash
36
+ ollama pull llama3
37
+ ollama serve # if not already running
38
+ ```
39
+
40
+ ## Usage
41
+
42
+ ### Python API
43
+ ```python
44
+ from doc_enricher import DocumentEnricher, DocxHandler
45
+
46
+ enricher = DocumentEnricher(
47
+ handler=DocxHandler(),
48
+ model="llama3",
49
+ )
50
+
51
+ # Single file
52
+ enricher.enrich("report.docx", "report_enriched.docx")
53
+
54
+ # Batch β€” all .docx files in a directory
55
+ enricher.enrich_batch("./originals/", "./enriched/")
56
+ ```
57
+
58
+ ### CLI
59
+ ```bash
60
+ # Single file
61
+ python -m doc_enricher.cli report.docx -o report_enriched.docx
62
+
63
+ # Batch mode
64
+ python -m doc_enricher.cli --batch ./originals/ -o ./enriched/
65
+
66
+ # Custom model + verbose
67
+ python -m doc_enricher.cli report.docx -o out.docx --model llama3:8b -v
68
+ ```
69
+
70
+ ### CLI Options
71
+ | Flag | Default | Description |
72
+ |------|---------|-------------|
73
+ | `-o, --output` | `{name}_enriched.docx` | Output path |
74
+ | `--batch` | off | Process entire directory |
75
+ | `--model` | `llama3` | Ollama model name |
76
+ | `--ollama-url` | `http://localhost:11434` | Ollama API URL |
77
+ | `--max-tokens` | `3000` | Token budget per LLM chunk |
78
+ | `--overlap` | `3` | Paragraph overlap between chunks |
79
+ | `--no-formatting-hints` | off | Don't send existing formatting to LLM |
80
+ | `-v, --verbose` | off | Debug logging |
81
+
82
+ ## How It Works
83
+
84
+ ### Chunking Strategy
85
+ Documents are split into chunks that fit within Llama3's 8K context window (~3000 tokens of paragraph text, leaving room for the system prompt and response). Adjacent chunks overlap by 3 paragraphs so boundary paragraphs have context from both sides. Based on [Cohan et al. 2019](https://arxiv.org/abs/1909.04054) β€” sequential sentence classification requires joint context.
86
+
87
+ ### LLM Classification
88
+ Uses Ollama's `/api/chat` endpoint with `"format": "json"` (constrained decoding) for reliable structured output. The prompt includes existing formatting metadata (style name, bold, font size) as hints.
89
+
90
+ ### Formatting Application
91
+ Both **named styles** (`Title`, `Heading 1`, `Normal`) and **run-level formatting** (bold + font size) are applied β€” works whether your parser checks `para.style.name` or inspects run formatting directly.
92
+
93
+ ## Architecture
94
+
95
+ ```
96
+ doc_enricher/
97
+ β”œβ”€β”€ __init__.py # Package entry point
98
+ β”œβ”€β”€ base_handler.py # Abstract handler interface (extend for PDF/HTML)
99
+ β”œβ”€β”€ handlers/
100
+ β”‚ β”œβ”€β”€ __init__.py
101
+ β”‚ └── docx_handler.py # DOCX: extract paragraphs + apply formatting
102
+ β”œβ”€β”€ llm_client.py # Ollama /api/chat with JSON-constrained output
103
+ β”œβ”€β”€ chunker.py # Adaptive sliding window with overlap
104
+ β”œβ”€β”€ enricher.py # Main orchestrator
105
+ └── cli.py # Command-line interface
106
+ test_module.py # Test suite (no Ollama required)
107
+ ```
108
+
109
+ ### Adding a New Format
110
+ Implement `BaseHandler`:
111
+ ```python
112
+ from doc_enricher.base_handler import BaseHandler, ParagraphInfo
113
+
114
+ class PdfHandler(BaseHandler):
115
+ def extract_paragraphs(self, filepath: str) -> list[ParagraphInfo]:
116
+ ...
117
+ def apply_classifications(self, src_path, dst_path, classifications):
118
+ ...
119
+ ```
120
+
121
+ ## Tests
122
+
123
+ ```bash
124
+ python test_module.py
125
+ ```
126
+
127
+ Creates a DOCX with intentionally broken formatting and verifies:
128
+ - βœ… Paragraph extraction with metadata
129
+ - βœ… Chunking (single chunk, multi-chunk, 150+ paragraphs)
130
+ - βœ… Classification application (mock LLM, no Ollama needed)
131
+ - βœ… Original document unchanged
132
+ - βœ… Edge cases (empty, whitespace-only, single paragraph)