dwijverma2
/

doc-enricher

ml-intern

Model card Files Files and versions

xet

Community

dwijverma2 commited on 11 days ago

Commit

dcbb30e

verified ·

1 Parent(s): 0fbb682

Add README with usage docs

Browse files

Files changed (1) hide show

README.md +132 -0

README.md ADDED Viewed

	@@ -0,0 +1,132 @@

+# Document Re-enrichment Module
+Fixes inconsistent document formatting using a local LLM (Ollama + Llama3) so your rule-based parser can correctly identify headings.
+## The Problem
+Your documents have inconsistent formatting:
+- Titles aren't bold, same font size as body text
+- Section headings sometimes lack bold/size formatting
+- Body text is sometimes bold or underlined
+Your parser relies on formatting cues (bold, font size) to classify heading levels → **misclassification**.
+## The Solution
+```
+Original Document → LLM Re-enrichment → Re-enriched Copy → Your Parser
+```
+The module:
+1. **Extracts** all paragraphs with their formatting metadata
+2. **Chunks** large documents into LLM-digestible batches (adaptive sliding window with overlap)
+3. **Classifies** each paragraph as `TITLE`, `SECTION_HEADING`, or `BODY` using Llama3 via Ollama
+4. **Applies** correct formatting (bold + font size + named styles) to a **copy** — original is never modified
+## Setup
+```bash
+git clone https://huggingface.co/dwijverma2/doc-enricher
+cd doc-enricher
+pip install python-docx requests
+```
+Requires [Ollama](https://ollama.ai) running locally with Llama3:
+```bash
+ollama pull llama3
+ollama serve  # if not already running
+```
+## Usage
+### Python API
+```python
+from doc_enricher import DocumentEnricher, DocxHandler
+enricher = DocumentEnricher(
+    handler=DocxHandler(),
+    model="llama3",
+)
+# Single file
+enricher.enrich("report.docx", "report_enriched.docx")
+# Batch — all .docx files in a directory
+enricher.enrich_batch("./originals/", "./enriched/")
+```
+### CLI
+```bash
+# Single file
+python -m doc_enricher.cli report.docx -o report_enriched.docx
+# Batch mode
+python -m doc_enricher.cli --batch ./originals/ -o ./enriched/
+# Custom model + verbose
+python -m doc_enricher.cli report.docx -o out.docx --model llama3:8b -v
+```
+### CLI Options
+| Flag | Default | Description |
+|------|---------|-------------|
+| `-o, --output` | `{name}_enriched.docx` | Output path |
+| `--batch` | off | Process entire directory |
+| `--model` | `llama3` | Ollama model name |
+| `--ollama-url` | `http://localhost:11434` | Ollama API URL |
+| `--max-tokens` | `3000` | Token budget per LLM chunk |
+| `--overlap` | `3` | Paragraph overlap between chunks |
+| `--no-formatting-hints` | off | Don't send existing formatting to LLM |
+| `-v, --verbose` | off | Debug logging |
+## How It Works
+### Chunking Strategy
+Documents are split into chunks that fit within Llama3's 8K context window (~3000 tokens of paragraph text, leaving room for the system prompt and response). Adjacent chunks overlap by 3 paragraphs so boundary paragraphs have context from both sides. Based on [Cohan et al. 2019](https://arxiv.org/abs/1909.04054) — sequential sentence classification requires joint context.
+### LLM Classification
+Uses Ollama's `/api/chat` endpoint with `"format": "json"` (constrained decoding) for reliable structured output. The prompt includes existing formatting metadata (style name, bold, font size) as hints.
+### Formatting Application
+Both **named styles** (`Title`, `Heading 1`, `Normal`) and **run-level formatting** (bold + font size) are applied — works whether your parser checks `para.style.name` or inspects run formatting directly.
+## Architecture
+```
+doc_enricher/
+├── __init__.py              # Package entry point
+├── base_handler.py          # Abstract handler interface (extend for PDF/HTML)
+├── handlers/
+│   ├── __init__.py
+│   └── docx_handler.py      # DOCX: extract paragraphs + apply formatting
+├── llm_client.py            # Ollama /api/chat with JSON-constrained output
+├── chunker.py               # Adaptive sliding window with overlap
+├── enricher.py              # Main orchestrator
+└── cli.py                   # Command-line interface
+test_module.py               # Test suite (no Ollama required)
+```
+### Adding a New Format
+Implement `BaseHandler`:
+```python
+from doc_enricher.base_handler import BaseHandler, ParagraphInfo
+class PdfHandler(BaseHandler):
+    def extract_paragraphs(self, filepath: str) -> list[ParagraphInfo]:
+        ...
+    def apply_classifications(self, src_path, dst_path, classifications):
+        ...
+```
+## Tests
+```bash
+python test_module.py
+```
+Creates a DOCX with intentionally broken formatting and verifies:
+- ✅ Paragraph extraction with metadata
+- ✅ Chunking (single chunk, multi-chunk, 150+ paragraphs)
+- ✅ Classification application (mock LLM, no Ollama needed)
+- ✅ Original document unchanged
+- ✅ Edge cases (empty, whitespace-only, single paragraph)