Spaces:

arjunbhargav212
/

docling-processor

Running

App Files Files Community

arjunbhargav212 commited on 6 days ago

Commit

dc23f92

verified ·

1 Parent(s): ad5d213

Upload 4 files

Browse files

Files changed (4) hide show

Dockerfile +6 -12
README.md +32 -108
app.py +374 -40
requirements.txt +7 -3

Dockerfile CHANGED Viewed

@@ -1,18 +1,12 @@
-FROM python:3.10
-# Update apt-get and install system libraries needed for Docling
-RUN apt-get update && apt-get install -y \
-    libgl1 \
-    libglib2.0-0 \
-    && rm -rf /var/lib/apt/lists/*
-WORKDIR /code
-COPY ./requirements.txt /code/requirements.txt
-RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
-COPY . .
 EXPOSE 7860
-CMD ["python", "app.py"]

+FROM python:3.10-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY app.py .
 EXPOSE 7860
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

README.md CHANGED Viewed

@@ -1,126 +1,50 @@
-# 🚀 Hugging Face Document Extraction APIs
-Complete deployment package for **Docling** and **DocStrange** on Hugging Face Spaces.
 ---
-## 📁 **Folder Structure**
-```
-huggingface_deploy/
-├── README.md                      # This file
-├── docling-api/
-│   ├── app.py                     # Docling FastAPI application
-│   ├── requirements.txt           # Python dependencies
-│   └── Dockerfile                 # Docker configuration
-├── docstrange-api/
-│   ├── app.py                     # DocStrange FastAPI application
-│   ├── requirements.txt           # Python dependencies
-│   └── Dockerfile                 # Docker configuration
-└── test-scripts/
-    ├── test_docling.py            # Test Docling API
-    └── test_docstrange.py         # Test DocStrange API
-```
 ---
-## 🎯 **Quick Start**
-### **Step 1: Create Hugging Face Spaces**
-1. Go to **https://huggingface.co/spaces**
-2. Click **Create new Space**
-3. Create TWO spaces:
-   - `docling-api` (for Docling)
-   - `docstrange-api` (for DocStrange)
-### **Step 2: Upload Files**
-For each space:
-1. Upload the corresponding `app.py` and `requirements.txt`
-2. Wait for deployment (2-5 minutes)
-3. Copy your Space URL: `https://YOUR_USERNAME-docling-api.hf.space`
-### **Step 3: Connect to DataSync**
-In DataSync → Import Data → DocStrange tab:
-1. Select engine: `🔬 Docling Hugging Face` or `🧪 DocStrange Hugging Face`
-2. Paste your HF URL
-3. Click **Extract**
----
-## 📊 **API Endpoints**
-### **Docling API**
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/` | GET | Health check |
-| `/health` | GET | Health status |
-| `/convert` | POST | Full document conversion |
-| `/convert/markdown` | POST | Markdown only |
-| `/convert/tables` | POST | Tables only |
-### **DocStrange API**
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/` | GET | Health check |
-| `/health` | GET | Health status |
-| `/extract` | POST | Full document extraction |
-| `/extract/markdown` | POST | Markdown/text only |
-| `/extract/tables` | POST | Tables only |
----
-## 🧪 **Testing**
 ```bash
-# Test Docling API
-python test-scripts/test_docling.py https://your-docling-api.hf.space
-# Test DocStrange API
-python test-scripts/test_docstrange.py https://your-docstrange-api.hf.space
 ```
----
-## 💰 **Hugging Face Tiers**
-| Tier | Cost | Memory | Best For |
-|------|------|--------|----------|
-| **CPU Basic** | Free | 16GB | Testing, small PDFs |
-| **CPU Upgrade** | Free | 32GB | Medium documents |
-| **T4 GPU** | $0.60/hr | 16GB + 16GB VRAM | Large docs, fast extraction |
-**Recommendation**: Start with **Free CPU tier** for testing, upgrade to GPU for production.
----
-## 🔐 **Private Spaces**
-If you want private APIs:
-1. Go to Space Settings → **Make Private**
-2. Create token: https://huggingface.co/settings/tokens
-3. In DataSync, enter both URL and token
----
-## 📚 **Full Documentation**
-- [Docling API Details](docling-api/README.md)
-- [DocStrange API Details](docstrange-api/README.md)
----
-## 🎯 **Integration with DataSync**
-All APIs are fully integrated with:
-- ✅ DataSync Import Data module
-- ✅ Automatic fallback on failure
-- ✅ Structured data display
-- ✅ Column mapping to ERPNext
-- ✅ CSV/JSON export
-**Ready to deploy!** 🚀

 ---
+title: Unified Document Extraction API
+emoji: 📄
+colorFrom: blue
+colorTo: indigo
+sdk: docker
+app_file: app.py
+pinned: false
 ---
+# 🚀 Unified Document Extraction API
+**One API, Two Engines: Docling + DocStrange**
+Extract structured data from any document using AI-powered engines.
+## Features
+- ✅ **Docling** - Advanced document parsing with structure preservation
+- ✅ **DocStrange** - GPU-accelerated intelligent document processing
+- ✅ **Multiple formats** - PDF, DOCX, XLSX, PPTX, Images, and more
+- ✅ **Structured output** - Markdown, JSON, Tables
+## API Endpoints
+- `GET /` - Health check
+- `GET /engines` - List available engines
+- `POST /convert` - Full document conversion
+- `POST /convert/markdown` - Markdown only
+- `POST /convert/tables` - Tables only
+## Usage
 ```bash
+# Convert with Docling
+curl -X POST "https://YOUR_SPACE.hf.space/convert?engine=docling" \
+  -F "file=@document.pdf"
+# Convert with DocStrange
+curl -X POST "https://YOUR_SPACE.hf.space/convert?engine=docstrange" \
+  -F "file=@document.pdf"
 ```
+## Integration
+Works with **DataSync** application for ERPNext integration.
+## License
+MIT

app.py CHANGED Viewed

@@ -1,51 +1,385 @@
-from flask import Flask, request, jsonify, render_template_string
-from docling.document_converter import DocumentConverter
 import os
-app = Flask(__name__)
-converter = DocumentConverter()
-@app.route('/', methods=['GET'])
-def index():
-    return render_template_string('''
-    <!doctype html>
-    <title>Docling Processor</title>
-    <h1>Upload PDF to Convert to Markdown</h1>
-    <form method="post" action="/process" enctype="multipart/form-data">
-      <input type="file" name="file">
-      <input type="submit" value="Convert">
-    </form>
-    ''')
-@app.route('/process', methods=['POST'])
-def process_document():
-    if 'file' not in request.files:
-        return jsonify({"status": "error", "message": "No file part"}), 400
-    file = request.files['file']
-    if file.filename == '':
-        return jsonify({"status": "error", "message": "No selected file"}), 400
-    # Save file temporarily
-    temp_path = f"/tmp/{file.filename}"
-    file.save(temp_path)
     try:
-        # Convert the document
-        result = converter.convert(temp_path)
-        # Return the exported text (or other format)
-        return jsonify({
-            "status": "success",
-            "content": result.document.export_to_markdown()
-        })
-    except Exception as e:
-        return jsonify({"status": "error", "message": str(e)}), 500
-    finally:
         # Cleanup
-        if os.path.exists(temp_path):
-            os.remove(temp_path)
 if __name__ == "__main__":
-    app.run(host="0.0.0.0", port=7860)

+"""
+Unified Document Extraction API - Docling + DocStrange
+Deploy this as a SINGLE app on Hugging Face Spaces
+Provides both Docling AND DocStrange extraction in one service
+"""
 import os
+import sys
+import tempfile
+from pathlib import Path
+from fastapi import FastAPI, File, UploadFile, HTTPException, Query
+from fastapi.responses import JSONResponse
+from fastapi.middleware.cors import CORSMiddleware
+import uvicorn
+# ============================================================================
+# INITIALIZATION
+# ============================================================================
+# Docling setup
+HAS_DOCLING = False
+docling_converter = None
+try:
+    from docling.document_converter import DocumentConverter
+    HAS_DOCLING = True
+except ImportError:
+    pass
+# DocStrange setup
+HAS_DOCTSTRANGE = False
+docstrange_extractor = None
+try:
+    # Add docstrange to path
+    sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'docstrange'))
+    from docstrange import DocumentExtractor
+    HAS_DOCTSTRANGE = True
+except ImportError:
+    pass
+app = FastAPI(
+    title="Unified Document Extraction API",
+    description="Extract documents using Docling OR DocStrange AI engines",
+    version="2.0.0"
+)
+# Allow CORS for DataSync integration
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+# ============================================================================
+# LAZY INITIALIZATION
+# ============================================================================
+def get_docling_converter():
+    """Get or create Docling converter"""
+    global docling_converter
+    if docling_converter is None and HAS_DOCLING:
+        docling_converter = DocumentConverter()
+    return docling_converter
+def get_docstrange_extractor():
+    """Get or create DocStrange extractor"""
+    global docstrange_extractor
+    if docstrange_extractor is None and HAS_DOCTSTRANGE:
+        # Auto-detect GPU
+        try:
+            import torch
+            gpu = torch.cuda.is_available()
+        except:
+            gpu = False
+        docstrange_extractor = DocumentExtractor(gpu=gpu)
+    return docstrange_extractor
+# ============================================================================
+# HEALTH & INFO ENDPOINTS
+# ============================================================================
+@app.get("/")
+def root():
+    """Health check"""
+    return {
+        "status": "ok",
+        "service": "Unified Document Extraction API",
+        "version": "2.0.0",
+        "engines": {
+            "docling": HAS_DOCLING,
+            "docstrange": HAS_DOCTSTRANGE
+        }
+    }
+@app.get("/health")
+def health():
+    """Detailed health check"""
+    try:
+        import torch
+        gpu = torch.cuda.is_available()
+        vram = f"{torch.cuda.get_device_properties(0).total_mem/1024**3:.1f}GB" if gpu else "N/A"
+    except:
+        gpu = False
+        vram = "N/A"
+    return {
+        "status": "ok",
+        "gpu": gpu,
+        "vram": vram,
+        "engines": {
+            "docling": HAS_DOCLING,
+            "docstrange": HAS_DOCTSTRANGE
+        }
+    }
+@app.get("/engines")
+def list_engines():
+    """List available extraction engines"""
+    return {
+        "engines": [
+            {
+                "id": "docling",
+                "name": "Docling AI",
+                "available": HAS_DOCLING,
+                "description": "Advanced document parsing with structure preservation"
+            },
+            {
+                "id": "docstrange",
+                "name": "DocStrange",
+                "available": HAS_DOCTSTRANGE,
+                "description": "GPU-accelerated intelligent document processing"
+            }
+        ]
+    }
+# ============================================================================
+# EXTRACTION ENDPOINTS
+# ============================================================================
+@app.post("/convert")
+async def convert_document(
+    file: UploadFile = File(...),
+    engine: str = Query("docling", description="Extraction engine: docling or docstrange"),
+    output_format: str = Query("markdown", description="Output format: markdown, json, tables")
+):
+    """
+    Convert document using specified engine
+    Args:
+        file: Document file (PDF, DOCX, XLSX, Images, etc.)
+        engine: docling or docstrange
+        output_format: markdown, json, tables
+    Returns: JSON with extracted data
+    """
+    if not file.filename:
+        raise HTTPException(status_code=400, detail="No file provided")
+    # Validate engine
+    if engine not in ['docling', 'docstrange']:
+        raise HTTPException(status_code=400, detail=f"Unknown engine: {engine}. Use 'docling' or 'docstrange'")
+    # Check engine availability
+    if engine == 'docling' and not HAS_DOCLING:
+        raise HTTPException(status_code=503, detail="Docling engine not available")
+    if engine == 'docstrange' and not HAS_DOCTSTRANGE:
+        raise HTTPException(status_code=503, detail="DocStrange engine not available")
+    # Validate file extension
+    supported_extensions = ['.pdf', '.docx', '.xlsx', '.pptx', '.png', '.jpg', '.jpeg',
+                           '.bmp', '.tiff', '.webp', '.gif', '.txt', '.html', '.md', '.csv']
+    ext = Path(file.filename).suffix.lower()
+    if ext not in supported_extensions:
+        raise HTTPException(status_code=400, detail=f"Unsupported format: {ext}")
     try:
+        # Save uploaded file temporarily
+        with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp:
+            content = await file.read()
+            tmp.write(content)
+            tmp_path = tmp.name
+        # Extract using selected engine
+        if engine == 'docling':
+            result = _extract_with_docling(tmp_path, output_format)
+        else:  # docstrange
+            result = _extract_with_docstrange(tmp_path, output_format)
         # Cleanup
+        os.unlink(tmp_path)
+        return JSONResponse(content=result)
+    except Exception as e:
+        # Cleanup on error
+        if 'tmp_path' in locals():
+            try:
+                os.unlink(tmp_path)
+            except:
+                pass
+        raise HTTPException(status_code=500, detail=f"Extraction failed: {str(e)}")
+@app.post("/convert/markdown")
+async def convert_to_markdown(
+    file: UploadFile = File(...),
+    engine: str = Query("docling", description="docling or docstrange")
+):
+    """Extract document to markdown only (lightweight endpoint)"""
+    try:
+        with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename).suffix.lower()) as tmp:
+            content = await file.read()
+            tmp.write(content)
+            tmp_path = tmp.name
+        if engine == 'docling' and HAS_DOCLING:
+            converter = get_docling_converter()
+            result = converter.convert(tmp_path)
+            markdown = result.document.export_to_markdown()
+        elif engine == 'docstrange' and HAS_DOCTSTRANGE:
+            ext = get_docstrange_extractor()
+            result = ext.extract_document(tmp_path, output_format='markdown')
+            markdown = result.get('data', '')
+        else:
+            raise HTTPException(status_code=503, detail=f"{engine} engine not available")
+        os.unlink(tmp_path)
+        return {
+            "success": True,
+            "markdown": markdown,
+            "engine": engine,
+            "file_name": file.filename
+        }
+    except Exception as e:
+        if 'tmp_path' in locals():
+            try:
+                os.unlink(tmp_path)
+            except:
+                pass
+        raise HTTPException(status_code=500, detail=str(e))
+@app.post("/convert/tables")
+async def convert_tables(
+    file: UploadFile = File(...),
+    engine: str = Query("docling", description="docling or docstrange")
+):
+    """Extract tables only from document"""
+    try:
+        with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename).suffix.lower()) as tmp:
+            content = await file.read()
+            tmp.write(content)
+            tmp_path = tmp.name
+        tables_data = []
+        if engine == 'docling' and HAS_DOCLING:
+            converter = get_docling_converter()
+            result = converter.convert(tmp_path)
+            for table_idx, table in enumerate(result.document.tables):
+                try:
+                    df = table.export_to_dataframe()
+                    tables_data.append({
+                        "table_index": table_idx,
+                        "headers": list(df.columns),
+                        "rows": df.to_dict('records'),
+                        "row_count": len(df)
+                    })
+                except:
+                    pass
+        os.unlink(tmp_path)
+        return {
+            "success": True,
+            "tables": tables_data,
+            "tables_count": len(tables_data),
+            "engine": engine,
+            "file_name": file.filename
+        }
+    except Exception as e:
+        if 'tmp_path' in locals():
+            try:
+                os.unlink(tmp_path)
+            except:
+                pass
+        raise HTTPException(status_code=500, detail=str(e))
+# ============================================================================
+# ENGINE-SPECIFIC EXTRACTION FUNCTIONS
+# ============================================================================
+def _extract_with_docling(file_path, output_format):
+    """Extract using Docling"""
+    converter = get_docling_converter()
+    result = converter.convert(file_path)
+    doc = result.document
+    response = {
+        "success": True,
+        "file_name": os.path.basename(file_path),
+        "engine": "docling",
+        "format": output_format,
+        "document": {
+            "markdown": doc.export_to_markdown(),
+            "num_pages": len(doc.pages) if hasattr(doc, 'pages') else 0,
+            "tables_count": len(doc.tables)
+        },
+        "metadata": {
+            "engine": "docling",
+            "model": "docling-default"
+        }
+    }
+    # Add tables if requested
+    if output_format in ['json', 'tables']:
+        tables_data = []
+        for table_idx, table in enumerate(doc.tables):
+            try:
+                df = table.export_to_dataframe()
+                tables_data.append({
+                    "table_index": table_idx,
+                    "rows": df.to_dict('records'),
+                    "row_count": len(df)
+                })
+            except:
+                pass
+        response['document']['tables'] = tables_data
+    return response
+def _extract_with_docstrange(file_path, output_format):
+    """Extract using DocStrange"""
+    ext = get_docstrange_extractor()
+    result = ext.extract_document(file_path, output_format=output_format)
+    response = {
+        "success": True,
+        "file_name": os.path.basename(file_path),
+        "engine": "docstrange",
+        "format": result.get('format', output_format),
+        "data": result.get('data', {}),
+        "metadata": {
+            "engine": "docstrange",
+            "file_size": result.get('metadata', {}).get('file_size', 0),
+            "gpu_mode": result.get('metadata', {}).get('gpu_mode', False)
+        }
+    }
+    return response
+# ============================================================================
+# MAIN ENTRY POINT
+# ============================================================================
 if __name__ == "__main__":
+    print("\n" + "="*60)
+    print("Unified Document Extraction API")
+    print("="*60)
+    print(f"Docling: {'✅ Available' if HAS_DOCLING else '❌ Not installed'}")
+    print(f"DocStrange: {'✅ Available' if HAS_DOCTSTRANGE else '❌ Not installed'}")
+    print("="*60)
+    print("URL: http://localhost:7860")
+    print("Docs: http://localhost:7860/docs")
+    print("="*60 + "\n")
+    uvicorn.run(
+        "app:app",
+        host="0.0.0.0",
+        port=7860
+    )

requirements.txt CHANGED Viewed

@@ -1,3 +1,7 @@
-flask
-docling
-docstrange

+docling>=2.88.0
+fastapi>=0.100.0
+uvicorn>=0.23.0
+python-multipart>=0.0.6
+pandas>=2.0.0
+pillow>=10.0.0
+torch>=2.0.0