Spaces: gauthamnairy/PageIndexAPI

gauthamnairy committed on Feb 16

Commit 59c1497 (verified) · 1 Parent(s): ad30edb

Update app.py

Files changed (1)

app.py +81 -43

app.py CHANGED

@@ -67,18 +67,24 @@ def extract_tables_from_markdown(markdown_text, token):
             context = markdown_text[:15000]
         # 4. Generate structured JSON tables
-        extraction_prompt = """You are a Petroleum Data Extraction Expert. Your task is to extract ALL tables from the provided document context and return them as a valid JSON object.
 CRITICAL INSTRUCTIONS - READ CAREFULLY:
 1. **EXTRACT ALL ROWS**: You MUST extract EVERY SINGLE ROW from each table. Do NOT skip rows, do NOT truncate, do NOT summarize.
 2. **NO PARTIAL DATA**: If a table has 10 rows, you must return all 10 rows. If it has 100 rows, return all 100 rows.
-3. **COMPLETE EXTRACTION**: Count the rows in the source table and verify you extracted the same number.
-4. **DO NOT SUMMARIZE**: Never say "etc" or "..." or truncate with "...". Every row must be fully extracted.
-**SUGGESTION FOR COMPREHENSIVE EXTRACTION**:
-When scanning the document, look for these O&G table categories (extract ALL that you find):
 - Well Headers / Well Identification / Site Data
-- Formation Tops / Lithology / Stratigraphy
 - Directional Survey / Well Path / Azimuth/Inclination data
 - Casing Records / Casing Data / Tubing specifications
 - Cementing Data / Cement Composition / Bond logs
@@ -91,17 +97,21 @@ When scanning the document, look for these O&G table categories (extract ALL tha
 - Equipment Lists / BHA / Drill string components
 - Personnel / Company representatives / Supervisors
 - Timelines / Drilling events / Days depths
-- Cost data / AFE estimates (if present)
-- Distribution lists are usually NOT useful - skip these.
 EXTRACTION REQUIREMENTS:
-- Find ALL tables in the document - Well Headers, Formation Tops, Casing, Surveys, Drilling Data, Core Analysis, Sidewall Samples, Production Tests, etc.
 - For each table, extract:
    - "title": A descriptive title for the table
-   - "headers": Array of column names exactly as they appear
-   - "rows": Array of row objects with column names as keys - MUST INCLUDE ALL ROWS
    - "page_number": The page number where this table appears
-- **BE THOROUGH**: A typical completion report has 10-20+ separate tables. If you only found 3-5, you missed some. Scan again.
 Return VALID JSON ONLY in this exact format:
@@ -119,13 +129,10 @@ Return VALID JSON ONLY in this exact format:
 }
 VERIFICATION STEP:
-Before returning, count the rows in the source table and verify your extracted rows match exactly.
-If the source shows 6 rows, your output must have 6 rows in the "rows" array.
-SUGGESTION: If you found fewer than 8-10 tables in a completion report, re-scan the document for:
-- Smaller tables embedded in text sections
-- Equipment lists, BHA details, logging summaries
-- Data tables you may have skipped as "minor"
 Return ONLY the JSON, no markdown, no explanations, no code blocks."""
@@ -140,34 +147,65 @@ Return ONLY the JSON, no markdown, no explanations, no code blocks."""
             model=model,
             messages=messages,
             stream=False,
-            max_tokens=8192,
-            temperature=0.1
         )
         response_text = response.choices[0].message.content
         print(f"[PageIndex] LLM response received: {len(response_text)} chars")
-        # Parse JSON from response
-        # Try to extract JSON block
-        json_match = re.search(r'\{.*\}', response_text, re.DOTALL)
-        if json_match:
-            try:
-                data = json.loads(json_match.group(0))
-                if "tables" in data:
-                    tables = data["tables"]
-                    # Ensure each table has required fields
-                    for table in tables:
-                        if "page_number" not in table:
-                            table["page_number"] = 1
-                        if "source" not in table:
-                            table["source"] = "PageIndex"
-                    print(f"[PageIndex] Successfully extracted {len(tables)} tables.")
-                    return json.dumps({"tables": tables})
-            except json.JSONDecodeError as e:
-                print(f"[PageIndex] JSON parse error: {e}")
-        # If no JSON found, return empty
-        print("[PageIndex] No valid JSON found in response, returning empty tables.")
         return json.dumps({"tables": []})
     except Exception as e:
@@ -330,7 +368,7 @@ Your goal is to extract precise technical data from the provided document contex
                 messages=messages,
                 stream=True,
                 max_tokens=8192,
-                temperature=0.3
             )
             full_response_text = ""

             context = markdown_text[:15000]
         # 4. Generate structured JSON tables
+        extraction_prompt = """You are a Petroleum Data Extraction Expert. Your task is to extract ALL tables AND convert structured paragraph data into tables from the provided document context.
 CRITICAL INSTRUCTIONS - READ CAREFULLY:
 1. **EXTRACT ALL ROWS**: You MUST extract EVERY SINGLE ROW from each table. Do NOT skip rows, do NOT truncate, do NOT summarize.
 2. **NO PARTIAL DATA**: If a table has 10 rows, you must return all 10 rows. If it has 100 rows, return all 100 rows.
+3. **CONVERT PARAGRAPHS TO TABLES**: If you find formation tops, lithology data, or any structured data in text paragraphs (e.g., "Formation X encountered at 1000m depth"), CONVERT it into a proper table with columns and rows.
+4. **COMPLETE EXTRACTION**: Count the rows in the source table and verify you extracted the same number.
+5. **DO NOT SUMMARIZE**: Never say "etc" or "..." or truncate with "...". Every row must be fully extracted.
+6. **SCRAPE PARAGRAPHS**: Look for:
+   - Formation tops mentioned in text (e.g., "Eleana Formation at 2594 feet")
+   - Lithology descriptions with depths
+   - Drilling events with dates/depths
+   - Equipment lists in bullet points
+   - Any sequential data that can be tabulated
+**O&G TABLE CATEGORIES TO EXTRACT (including from paragraphs):**
 - Well Headers / Well Identification / Site Data
+- Formation Tops / Lithology / Stratigraphy (LOOK IN TEXT PARAGRAPHS TOO!)
 - Directional Survey / Well Path / Azimuth/Inclination data
 - Casing Records / Casing Data / Tubing specifications
 - Cementing Data / Cement Composition / Bond logs
 - Equipment Lists / BHA / Drill string components
 - Personnel / Company representatives / Supervisors
 - Timelines / Drilling events / Days depths
+- Cost data / AFE estimates
+**PARAGRAPH-TO-TABLE CONVERSION EXAMPLES:**
+If text says: "The Eleana Dolomite was encountered at 2,594 ft MD (2,594 ft TVD)..."
+CREATE: {"title": "Formation Tops", "headers": ["Formation", "Depth_ft", "Depth_m"], "rows": [...]}
 EXTRACTION REQUIREMENTS:
+- Find ALL tables in the document
+- CONVERT paragraph data describing formations, depths, lithology INTO tables
 - For each table, extract:
    - "title": A descriptive title for the table
+   - "headers": Array of column names
+   - "rows": Array of row objects - MUST INCLUDE ALL ROWS
    - "page_number": The page number where this table appears
+- **BE THOROUGH**: A typical completion report has 15-25+ separate tables. If you only found 3-5, you missed some. Scan paragraphs too!
 Return VALID JSON ONLY in this exact format:
 }
 VERIFICATION STEP:
+1. Count tables found in explicit table format
+2. Count data found in paragraphs that could be tables
+3. Total should be 15-25+ for a completion report
+4. Before returning, verify you converted paragraph data to tables
 Return ONLY the JSON, no markdown, no explanations, no code blocks."""
             model=model,
             messages=messages,
             stream=False,
+            max_tokens=16384,
+            temperature=0
         )
         response_text = response.choices[0].message.content
         print(f"[PageIndex] LLM response received: {len(response_text)} chars")
+        # Parse JSON from response - handle markdown code blocks
+        response_text = response_text.strip()
+        # Try multiple extraction strategies
+        data = None
+        # Strategy 1: Try direct JSON parse
+        try:
+            data = json.loads(response_text)
+        except json.JSONDecodeError:
+            pass
+        # Strategy 2: Extract JSON from markdown code block
+        if data is None:
+            code_block_match = re.search(r'```(?:json)?\s*(\{.*\})\s*```', response_text, re.DOTALL)
+            if code_block_match:
+                try:
+                    data = json.loads(code_block_match.group(1))
+                except json.JSONDecodeError:
+                    pass
+        # Strategy 3: Extract JSON object directly
+        if data is None:
+            json_match = re.search(r'\{[\s\S]*"tables"[\s\S]*\}', response_text)
+            if json_match:
+                try:
+                    data = json.loads(json_match.group(0))
+                except json.JSONDecodeError:
+                    pass
+        # Strategy 4: Look for any JSON-like structure
+        if data is None:
+            json_match = re.search(r'\{.*\}', response_text, re.DOTALL)
+            if json_match:
+                try:
+                    data = json.loads(json_match.group(0))
+                except json.JSONDecodeError:
+                    pass
+        if data and "tables" in data:
+            tables = data["tables"]
+            # Ensure each table has required fields
+            for table in tables:
+                if "page_number" not in table:
+                    table["page_number"] = 1
+                if "source" not in table:
+                    table["source"] = "PageIndex"
+            print(f"[PageIndex] Successfully extracted {len(tables)} tables.")
+            return json.dumps({"tables": tables})
+        # If no valid JSON found, return empty
+        print(f"[PageIndex] No valid JSON found in response. Raw preview: {response_text[:500]}")
         return json.dumps({"tables": []})
     except Exception as e:
                 messages=messages,
                 stream=True,
                 max_tokens=8192,
+                temperature=0,
             )
             full_response_text = ""
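
The fallback parsing added in this commit can be read as one standalone helper. The following is a minimal sketch, not code from app.py: the names `parse_tables_response` and `_try_load` are hypothetical, and strategies 3 and 4 are folded into a single "outermost braces" search on the assumption that they select the same span whenever the reply contains a `"tables"` key after its first `{`. It also adds an `isinstance` check that the diff does not have, so a reply that parses to a list or string is treated as "no tables" instead of raising.

```python
import json
import re

# Built at runtime so the literal fence characters never appear in this example.
FENCE = "`" * 3


def _try_load(candidate):
    """Return the parsed JSON value, or None if the text is not valid JSON."""
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


def parse_tables_response(response_text):
    """Hypothetical helper: recover the "tables" list from a raw LLM reply."""
    text = response_text.strip()

    # Strategy 1: the whole reply is already JSON.
    candidates = [text]

    # Strategy 2: JSON wrapped in a markdown code fence.
    fenced = re.search(FENCE + r"(?:json)?\s*(\{.*\})\s*" + FENCE, text, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))

    # Strategies 3 and 4 (merged here): first "{" through last "}".
    braced = re.search(r"\{.*\}", text, re.DOTALL)
    if braced:
        candidates.append(braced.group(0))

    for candidate in candidates:
        data = _try_load(candidate)
        # Guard against replies that parse to a list, string, or number.
        if isinstance(data, dict) and isinstance(data.get("tables"), list):
            tables = data["tables"]
            for table in tables:
                if isinstance(table, dict):
                    # Same defaults the commit applies to each table.
                    table.setdefault("page_number", 1)
                    table.setdefault("source", "PageIndex")
            return tables

    # Nothing parseable: mirror the commit's empty-result fallback.
    return []
```

Trying the candidates in order keeps the cheapest check first and only falls back to regex extraction when the model ignored the "Return ONLY the JSON" instruction. A caller would wrap the result as `json.dumps({"tables": parse_tables_response(response_text)})` to match the return shape used in the diff.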