Space: JaydeepR/TenderIQ

JaydeepR and Claude Sonnet 4.6 committed 15 days ago

Commit f42bfb0
Parent: c419ba2

Step 5: pdf_utils and chunker — PyMuPDF extraction and text chunking

Implements specs/03_pdf_utils_and_chunker.md. extract_pages returns per-page
dicts; is_text_pdf uses avg-chars heuristic; render_page_to_image produces PIL
images for OCR. chunk_tender splits on clause headings up to 2000 chars;
chunk_bidder emits one chunk per ExtractedPage with full OCR metadata.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (3)

core/chunker.py +54 -2
core/pdf_utils.py +24 -3
specs/03_pdf_utils_and_chunker.md +80 -0

core/chunker.py CHANGED

@@ -1,11 +1,63 @@
 from core.ocr_pipeline import ExtractedPage
 def chunk_tender(pages: list[dict], tender_id: str) -> list[dict]:
-    raise NotImplementedError
 def chunk_bidder(
     pages: list[ExtractedPage], bidder_id: str, doc_name: str
 ) -> list[dict]:
-    raise NotImplementedError

+import re
 from core.ocr_pipeline import ExtractedPage
+_MAX_CHUNK_CHARS = 2000
 def chunk_tender(pages: list[dict], tender_id: str) -> list[dict]:
+    chunks = []
+    for page_dict in pages:
+        page_no = page_dict["page"]
+        text = page_dict["text"].strip()
+        if not text:
+            continue
+        if len(text) <= _MAX_CHUNK_CHARS:
+            pieces = [text]
+        else:
+            # Split before clause headings such as "3" or "3.1". The group is
+            # non-capturing so re.split() returns only text segments.
+            splits = re.split(r'(?m)(?=^\d+(?:\.\d+)*\s+)', text)
+            pieces = []
+            current = ""
+            for s in splits:
+                if len(current) + len(s) <= _MAX_CHUNK_CHARS:
+                    current += s
+                else:
+                    if current:
+                        pieces.append(current)
+                    current = s
+            if current:
+                pieces.append(current)
+        for i, piece in enumerate(pieces):
+            piece = piece.strip()
+            if not piece:
+                continue
+            chunks.append({
+                "text": piece,
+                "tender_id": tender_id,
+                "page": page_no,
+                "chunk_id": f"{tender_id}_p{page_no}_c{i}",
+            })
+    return chunks
 def chunk_bidder(
     pages: list[ExtractedPage], bidder_id: str, doc_name: str
 ) -> list[dict]:
+    chunks = []
+    for page in pages:
+        text = page.text.strip() if page.text else ""
+        if not text:
+            continue
+        safe_doc = doc_name.replace("/", "_").replace("\\", "_")
+        chunks.append({
+            "text": text,
+            "bidder_id": bidder_id,
+            "doc_name": doc_name,
+            "page": page.page,
+            "source_type": page.source_type,
+            "ocr_confidence": page.confidence,
+            "chunk_id": f"{bidder_id}_{safe_doc}_p{page.page}",
+        })
+    return chunks
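
The splitting step in `chunk_tender` can be exercised on its own. The sketch below is a minimal standalone version, assuming the same clause-heading regex idea and 2000-character cap as the commit; the names `split_on_clauses`, `CLAUSE_HEADING` and `MAX_CHUNK_CHARS` are hypothetical and not part of the repository. Note that `re.split()` returns the contents of any capturing group in its result list (as `None` when the group did not take part in the match), so this sketch uses a non-capturing group.

```python
import re

# Hypothetical names for illustration; only the regex idea and the
# 2000-character cap are taken from the commit.
MAX_CHUNK_CHARS = 2000

# Zero-width lookahead: split *before* a line that starts with a clause
# number such as "3" or "3.1". (?:...) keeps the group non-capturing.
CLAUSE_HEADING = re.compile(r"(?m)(?=^\d+(?:\.\d+)*\s+)")


def split_on_clauses(text: str, max_chars: int = MAX_CHUNK_CHARS) -> list[str]:
    """Split text at clause headings and greedily pack sections into pieces."""
    pieces: list[str] = []
    current = ""
    for section in CLAUSE_HEADING.split(text):
        if len(current) + len(section) <= max_chars:
            # Section still fits: keep growing the current piece.
            current += section
            continue
        if current:
            pieces.append(current)
        # Start a new piece; a single section longer than max_chars is
        # kept whole here, it is not split further.
        current = section
    if current:
        pieces.append(current)
    return [p.strip() for p in pieces if p.strip()]


sample = "1 Scope\nfoo\n1.1 Detail\nbar\n2 Terms\nbaz\n"
small_pieces = split_on_clauses(sample, max_chars=20)
one_piece = split_on_clauses(sample, max_chars=1000)
```

With a small cap each clause ends up in its own piece; with a large cap the sections are packed back into one. A section that is longer than the cap by itself stays oversized in this sketch, so a caller that needs a hard limit would have to add a further split.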

core/pdf_utils.py CHANGED

@@ -1,14 +1,35 @@
 from pathlib import Path
 import PIL.Image
 def extract_pages(path: Path) -> list[dict]:
-    raise NotImplementedError
 def is_text_pdf(path: Path) -> bool:
-    raise NotImplementedError
 def render_page_to_image(path: Path, page_no: int, dpi: int = 200) -> PIL.Image.Image:
-    raise NotImplementedError

 from pathlib import Path
+import fitz
 import PIL.Image
 def extract_pages(path: Path) -> list[dict]:
+    doc = fitz.open(str(path))
+    pages = []
+    for i, page in enumerate(doc):
+        text = page.get_text("text")
+        pages.append({"page": i + 1, "text": text})
+    doc.close()
+    return pages
 def is_text_pdf(path: Path) -> bool:
+    doc = fitz.open(str(path))
+    if not doc.page_count:
+        doc.close()
+        return False
+    total_chars = sum(len(page.get_text("text")) for page in doc)
+    avg = total_chars / doc.page_count
+    doc.close()
+    return avg >= 50
 def render_page_to_image(path: Path, page_no: int, dpi: int = 200) -> PIL.Image.Image:
+    doc = fitz.open(str(path))
+    page = doc[page_no - 1]
+    mat = fitz.Matrix(dpi / 72, dpi / 72)
+    pix = page.get_pixmap(matrix=mat, colorspace=fitz.csRGB)
+    img = PIL.Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
+    doc.close()
+    return img
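
The two numeric rules in this file, the average-characters heuristic and the DPI-to-zoom conversion, can be checked without opening a PDF. The sketch below is a rough illustration that works on plain strings; `avg_chars_per_page`, `looks_like_text_pdf` and `zoom_for_dpi` are hypothetical helper names, and only the threshold of 50 and the `dpi / 72` factor come from the commit.

```python
def avg_chars_per_page(page_texts: list[str]) -> float:
    """Mean number of extracted characters per page (0.0 for no pages)."""
    if not page_texts:
        return 0.0
    return sum(len(text) for text in page_texts) / len(page_texts)


def looks_like_text_pdf(page_texts: list[str], min_avg_chars: float = 50.0) -> bool:
    """Mirror of the commit's heuristic, applied to already-extracted text."""
    # An empty document is treated as "not a text PDF", as in the commit.
    return bool(page_texts) and avg_chars_per_page(page_texts) >= min_avg_chars


def zoom_for_dpi(dpi: int) -> float:
    """PDF user space uses 72 units per inch, so the scale factor is dpi / 72."""
    return dpi / 72


# One text-rich page followed by four pages with no extractable text.
mixed_document = ["x" * 500] + [""] * 4
mixed_average = avg_chars_per_page(mixed_document)
mixed_verdict = looks_like_text_pdf(mixed_document)
```

The mixed example shows a limit of averaging: a single text-rich page can lift the mean above the threshold even when most pages carry no extractable text, so a per-page check may be worth considering for mixed documents.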

specs/03_pdf_utils_and_chunker.md ADDED

@@ -0,0 +1,80 @@

+# Spec 03 — PDF Utils and Chunker
+**Step:** 5 of 15
+**Time budget:** ~15 min
+---
+## Goal
+Implement `core/pdf_utils.py` (PyMuPDF text extraction and page rendering) and `core/chunker.py` (text → chunks with metadata).
+---
+## `core/pdf_utils.py`
+### `extract_pages(path: Path) -> list[dict]`
+- Opens the PDF with `fitz.open(str(path))`.
+- For each page `i`: extracts text via `page.get_text("text")`.
+- Returns `[{"page": i+1, "text": text}, ...]` (1-indexed pages).
+### `is_text_pdf(path: Path) -> bool`
+- Opens the PDF.
+- Computes average characters per page across all pages.
+- Returns `True` if the average is ≥ 50 characters per page (a heuristic that separates text-based PDFs from scanned PDFs, whose pages have little or no extractable text).
+### `render_page_to_image(path: Path, page_no: int, dpi: int = 200) -> PIL.Image.Image`
+- Opens the PDF.
+- Gets page at index `page_no - 1` (0-indexed).
+- Creates `fitz.Matrix(dpi/72, dpi/72)` and renders via `page.get_pixmap(matrix=mat, colorspace=fitz.csRGB)`.
+- Converts pixmap to PIL Image via `Image.frombytes("RGB", (pix.width, pix.height), pix.samples)`.
+- Returns the PIL Image.
+---
+## `core/chunker.py`
+### `chunk_tender(pages: list[dict], tender_id: str) -> list[dict]`
+Input: list of `{"page": int, "text": str}` dicts.
+Strategy:
+- Join page text. Split on clause headings detected by regex `r'^\d+(\.\d+)*\s+'` (multiline).
+- Each chunk: up to ~500 tokens (~2000 chars). If a section is longer, split on `\n\n` boundaries.
+- Each chunk dict: `{"text": str, "tender_id": str, "page": int, "chunk_id": str}`.
+- `chunk_id` = `f"{tender_id}_p{page}_c{i}"`.
+Simpler implementation (sufficient for 5-page mock tender):
+- One chunk per page section: for each page, if text > 2000 chars split into ~2000-char pieces; else one chunk.
+### `chunk_bidder(pages: list[ExtractedPage], bidder_id: str, doc_name: str) -> list[dict]`
+Input: list of `ExtractedPage` objects.
+Strategy: one chunk per page.
+Each chunk dict:
+```python
+{
+    "text": page.text,
+    "bidder_id": bidder_id,
+    "doc_name": doc_name,
+    "page": page.page,
+    "source_type": page.source_type,
+    "ocr_confidence": page.confidence,
+    "chunk_id": f"{bidder_id}_{doc_name}_p{page.page}",
+}
+```
+---
+## Acceptance Criteria
+1. `extract_pages(Path("data/tender/crpf_construction_tender.pdf"))` returns a list of dicts with non-empty text on most pages.
+2. `is_text_pdf(Path("data/tender/crpf_construction_tender.pdf"))` returns `True`.
+3. `render_page_to_image(Path("data/tender/crpf_construction_tender.pdf"), 1)` returns a PIL Image with width > 0.
+4. `chunk_tender(pages, "tender_001")` returns a non-empty list of dicts each having a "text" key.
+5. Each bidder chunk has all required metadata keys.
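
Acceptance criterion 5 can be turned into a small check. The sketch below is one possible way to do it, under stated assumptions: `FakePage` is a hypothetical stand-in for `core.ocr_pipeline.ExtractedPage` that carries only the attributes the commit reads, `bidder_chunk` and `missing_keys` are hypothetical helpers, and the `"ocr"` source type and the sample values are invented for illustration. The key set and the path sanitisation follow the spec and the committed `chunk_bidder`.

```python
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for ExtractedPage (attributes used by the commit)."""
    page: int
    text: str
    source_type: str
    confidence: float


# Metadata keys listed in the spec for every bidder chunk.
REQUIRED_BIDDER_KEYS = {
    "text", "bidder_id", "doc_name", "page",
    "source_type", "ocr_confidence", "chunk_id",
}


def bidder_chunk(page: FakePage, bidder_id: str, doc_name: str) -> dict:
    """Build one chunk dict the way the committed chunk_bidder does."""
    # Path separators are replaced so the chunk_id stays a flat identifier.
    safe_doc = doc_name.replace("/", "_").replace("\\", "_")
    return {
        "text": page.text.strip(),
        "bidder_id": bidder_id,
        "doc_name": doc_name,
        "page": page.page,
        "source_type": page.source_type,
        "ocr_confidence": page.confidence,
        "chunk_id": f"{bidder_id}_{safe_doc}_p{page.page}",
    }


def missing_keys(chunk: dict) -> set[str]:
    """Return the required metadata keys that a chunk does not have."""
    return REQUIRED_BIDDER_KEYS - chunk.keys()


example = bidder_chunk(
    FakePage(page=3, text=" GST certificate ", source_type="ocr", confidence=0.91),
    bidder_id="bidder_01",
    doc_name="docs/gst.pdf",
)
```

An empty result from `missing_keys(example)` means the chunk satisfies the criterion. The same helper could be applied to every item returned by the real `chunk_bidder` in a test.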