Spaces:

arjunbhargav212
/

docling-processor


arjunbhargav212 committed 7 days ago

Commit

ad5d213

verified ·

1 Parent(s): 2bcba00

Upload 12 files

Files changed (12)

QUICKSTART.md +74 -0
README.md +125 -3
docling-api/Dockerfile +12 -0
docling-api/README.md +55 -0
docling-api/app.py +232 -0
docling-api/requirements.txt +6 -0
docstrange-api/Dockerfile +12 -0
docstrange-api/README.md +55 -0
docstrange-api/app.py +237 -0
docstrange-api/requirements.txt +9 -0
test-scripts/test_docling.py +74 -0
test-scripts/test_docstrange.py +73 -0

QUICKSTART.md ADDED Viewed

	@@ -0,0 +1,74 @@

+# ⚡ Quick Start Guide - Hugging Face Deployment
+## 🎯 **5-Minute Setup**
+### **Step 1: Create HF Spaces (2 min)**
+1. Go to **https://huggingface.co/spaces**
+2. Create TWO spaces:
+   - `docling-api`
+   - `docstrange-api`
+3. Use **Docker SDK** for both
+4. Set to **Public** (free) or **Private**
+### **Step 2: Upload Files (1 min)**
+For EACH space:
+1. Upload `app.py` from the corresponding folder
+2. Upload `requirements.txt` and `Dockerfile` from the same folder (the Docker SDK builds the Space from the `Dockerfile`)
+3. Wait for deployment (2-3 min)
+### **Step 3: Get Your URLs**
+After deployment:
+- Docling: `https://YOUR_USERNAME-docling-api.hf.space`
+- DocStrange: `https://YOUR_USERNAME-docstrange-api.hf.space`
+### **Step 4: Connect to DataSync (1 min)**
+1. Open **http://localhost:5000**
+2. Go to **Import Data → DocStrange tab**
+3. Select engine:
+   - `🔬 Docling Hugging Face` OR
+   - `🧪 DocStrange Hugging Face`
+4. Paste your HF URL
+5. Upload PDF and extract!
+---
+## 🧪 **Test Your APIs**
+```bash
+# Test both APIs
+cd huggingface_deploy/test-scripts
+python test_docling.py https://YOUR_USERNAME-docling-api.hf.space
+python test_docstrange.py https://YOUR_USERNAME-docstrange-api.hf.space
+```
+---
+## ✅ **You're Done!**
+Both APIs are now integrated with DataSync and ready to extract documents!
+---
+## 🆘 **Troubleshooting**
+| Problem | Solution |
+|---------|----------|
+| Space not deploying | Check Docker logs in HF Space settings |
+| API returns 500 | Verify that `requirements.txt` was uploaded and check the build logs |
+| Timeout errors | The PDF is probably too large; try a smaller file |
+| Not working in DataSync | Check URL format (no trailing slash) |
+---
+## 📚 **Next Steps**
+- Try different engines for comparison
+- Map extracted columns to ERPNext
+- Download CSV/JSON of extracted data
+**Happy extracting!** 🚀
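
Review note on the troubleshooting row "Check URL format (no trailing slash)": a client can normalize the pasted URL itself rather than rely on the user. The following is a minimal sketch, not part of this commit; `normalize_space_url` and `endpoint` are hypothetical helper names.

```python
from urllib.parse import urlsplit


def normalize_space_url(raw: str) -> str:
    """Return the base URL without surrounding whitespace or trailing slashes."""
    url = raw.strip().rstrip("/")
    parts = urlsplit(url)
    # Reject values such as "user-docling-api.hf.space" that lack a scheme.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"Not an http(s) URL: {raw!r}")
    return url


def endpoint(base_url: str, path: str) -> str:
    """Join a normalized base URL and an endpoint path such as '/convert'."""
    return f"{normalize_space_url(base_url)}/{path.lstrip('/')}"
```

With this in place, `endpoint("https://YOUR_USERNAME-docling-api.hf.space/", "/convert")` yields a URL with exactly one slash before `convert`.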

README.md CHANGED Viewed

@@ -1,4 +1,126 @@
 ---
-title: Docling Processor
-sdk: docker
----

+# 🚀 Hugging Face Document Extraction APIs
+Complete deployment package for **Docling** and **DocStrange** on Hugging Face Spaces.
 ---
+## 📁 **Folder Structure**
+```
+huggingface_deploy/
+├── README.md                      # This file
+├── docling-api/
+│   ├── app.py                     # Docling FastAPI application
+│   ├── requirements.txt           # Python dependencies
+│   └── Dockerfile                 # Docker configuration
+├── docstrange-api/
+│   ├── app.py                     # DocStrange FastAPI application
+│   ├── requirements.txt           # Python dependencies
+│   └── Dockerfile                 # Docker configuration
+└── test-scripts/
+    ├── test_docling.py            # Test Docling API
+    └── test_docstrange.py         # Test DocStrange API
+```
+---
+## 🎯 **Quick Start**
+### **Step 1: Create Hugging Face Spaces**
+1. Go to **https://huggingface.co/spaces**
+2. Click **Create new Space**
+3. Create TWO spaces:
+   - `docling-api` (for Docling)
+   - `docstrange-api` (for DocStrange)
+### **Step 2: Upload Files**
+For each space:
+1. Upload the corresponding `app.py`, `requirements.txt`, and `Dockerfile`
+2. Wait for deployment (2-5 minutes)
+3. Copy your Space URL: `https://YOUR_USERNAME-docling-api.hf.space`
+### **Step 3: Connect to DataSync**
+In DataSync → Import Data → DocStrange tab:
+1. Select engine: `🔬 Docling Hugging Face` or `🧪 DocStrange Hugging Face`
+2. Paste your HF URL
+3. Click **Extract**
+---
+## 📊 **API Endpoints**
+### **Docling API**
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/` | GET | Health check |
+| `/health` | GET | Health status |
+| `/convert` | POST | Full document conversion |
+| `/convert/markdown` | POST | Markdown only |
+| `/convert/tables` | POST | Tables only |
+### **DocStrange API**
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/` | GET | Health check |
+| `/health` | GET | Health status |
+| `/extract` | POST | Full document extraction |
+| `/extract/markdown` | POST | Markdown/text only |
+| `/extract/tables` | POST | Tables only |
+---
+## 🧪 **Testing**
+```bash
+# Test Docling API
+python test-scripts/test_docling.py https://YOUR_USERNAME-docling-api.hf.space
+# Test DocStrange API
+python test-scripts/test_docstrange.py https://YOUR_USERNAME-docstrange-api.hf.space
+```
+---
+## 💰 **Hugging Face Tiers**
+| Tier | Cost | Memory | Best For |
+|------|------|--------|----------|
+| **CPU Basic** | Free | 16GB | Testing, small PDFs |
+| **CPU Upgrade** | Paid (hourly; see Hugging Face pricing) | 32GB | Medium documents |
+| **T4 GPU** | $0.60/hr | 16GB + 16GB VRAM | Large docs, fast extraction |
+**Recommendation**: Start with **Free CPU tier** for testing, upgrade to GPU for production.
+---
+## 🔐 **Private Spaces**
+If you want private APIs:
+1. Go to Space Settings → **Make Private**
+2. Create token: https://huggingface.co/settings/tokens
+3. In DataSync, enter both URL and token
+---
+## 📚 **Full Documentation**
+- [Docling API Details](docling-api/README.md)
+- [DocStrange API Details](docstrange-api/README.md)
+---
+## 🎯 **Integration with DataSync**
+All APIs are fully integrated with:
+- ✅ DataSync Import Data module
+- ✅ Automatic fallback on failure
+- ✅ Structured data display
+- ✅ Column mapping to ERPNext
+- ✅ CSV/JSON export
+**Ready to deploy!** 🚀
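
Review note on the "Private Spaces" section: the README says to enter both URL and token but does not show how the token is sent. The sketch below assumes the token travels as a standard `Authorization: Bearer` header, which is how Hugging Face access tokens are normally passed; `build_health_request` is a hypothetical helper, and the request is only built, not sent.

```python
from typing import Optional
from urllib.request import Request


def build_health_request(base_url: str, token: Optional[str] = None) -> Request:
    """Build (but do not send) a GET request for the /health endpoint."""
    headers = {"Accept": "application/json"}
    if token:
        # Assumption: private Spaces accept the HF access token as a Bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return Request(f"{base_url.rstrip('/')}/health", headers=headers, method="GET")
```

To send it, pass the result to `urllib.request.urlopen(request, timeout=30)`. Public Spaces need no token, so the header is omitted when `token` is `None`.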

docling-api/Dockerfile ADDED Viewed

	@@ -0,0 +1,12 @@

+FROM python:3.10-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY app.py .
+EXPOSE 7860
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

docling-api/README.md ADDED Viewed

	@@ -0,0 +1,55 @@

+# 🔬 Docling API Deployment Guide
+## 📦 **Files in This Folder**
+- `app.py` - FastAPI application for document conversion
+- `requirements.txt` - Python dependencies
+- `Dockerfile` - Container configuration
+## 🚀 **Deploy to Hugging Face**
+### **Method 1: Via Web UI (Easiest)**
+1. Go to **https://huggingface.co/spaces**
+2. Click **Create new Space**
+3. **Name**: `docling-api`
+4. **SDK**: `Docker`
+5. **Visibility**: `Public` (free) or `Private` (needs token)
+6. Click **Create Space**
+7. Upload `app.py`, `requirements.txt`, and `Dockerfile`
+8. Wait 3-5 minutes for deployment
+### **Method 2: Via Git**
+```bash
+git clone https://huggingface.co/spaces/YOUR_USERNAME/docling-api
+cd docling-api
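+# Copy from wherever this package lives; adjust the source path as needed
+cp /path/to/huggingface_deploy/docling-api/{app.py,requirements.txt,Dockerfile} .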
+git add .
+git commit -m "Deploy Docling API"
+git push
+```
+## 🧪 **Test Your Deployment**
+```bash
+cd test-scripts
+python test_docling.py https://YOUR_USERNAME-docling-api.hf.space
+```
+## 📡 **API Documentation**
+Once deployed, visit: `https://YOUR_USERNAME-docling-api.hf.space/docs`
+## 🔧 **Endpoints**
+- `GET /` - Health check
+- `POST /convert` - Full document conversion
+- `POST /convert/markdown` - Markdown only
+- `POST /convert/tables` - Tables only
+## 💡 **Tips**
+- Start with **Free CPU tier** for testing
+- Upgrade to **T4 GPU** for production (faster, handles large PDFs)
+- Keep PDFs under 10MB for best performance
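
Review note on the tip "Keep PDFs under 10MB": nothing in `app.py` enforces this, so a caller may want to check the size before uploading. This is a minimal client-side sketch; `within_upload_limit` and `MAX_UPLOAD_MB` are hypothetical names, and the 10 MB figure comes only from the tip above.

```python
import os

MAX_UPLOAD_MB = 10  # from the README tip; not a limit enforced by the API


def within_upload_limit(path: str, limit_mb: float = MAX_UPLOAD_MB) -> bool:
    """Return True when the file at `path` is no larger than `limit_mb` megabytes."""
    return os.path.getsize(path) <= limit_mb * 1024 * 1024
```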

docling-api/app.py ADDED Viewed

	@@ -0,0 +1,232 @@

+"""
+Docling Hugging Face Spaces API
+Deploy this on Hugging Face Spaces to provide Docling extraction API
+"""
+import os
+import tempfile
+from pathlib import Path
+from fastapi import FastAPI, File, UploadFile, HTTPException
+from fastapi.responses import JSONResponse
+from fastapi.middleware.cors import CORSMiddleware
+from docling.document_converter import DocumentConverter
+from docling.datamodel.base_models import InputFormat
+import uvicorn
+app = FastAPI(
+    title="Docling Document Converter API",
+    description="Convert documents using Docling AI",
+    version="1.0.0"
+)
+# Allow CORS for DataSync integration
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+# Global converter instance
+converter = None
+def get_converter():
+    """Get or create DocumentConverter instance"""
+    global converter
+    if converter is None:
+        converter = DocumentConverter()
+    return converter
+@app.get("/")
+def root():
+    """Health check"""
+    return {
+        "status": "ok",
+        "service": "Docling API",
+        "version": "1.0.0"
+    }
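+@app.get("/health")
+def health():
+    """Health check"""
+    # Report the actual GPU state instead of a hardcoded "available"
+    try:
+        import torch
+        gpu = torch.cuda.is_available()
+    except Exception:
+        gpu = False
+    return {"status": "ok", "gpu": gpu}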
+@app.post("/convert")
+async def convert_document(file: UploadFile = File(...)):
+    """
+    Convert document to structured data
+    Returns: JSON with markdown, tables, and metadata
+    """
+    if not file.filename:
+        raise HTTPException(status_code=400, detail="No file provided")
+    supported_extensions = ['.pdf', '.docx', '.xlsx', '.pptx', '.html', '.txt', '.md']
+    ext = Path(file.filename).suffix.lower()
+    if ext not in supported_extensions:
+        raise HTTPException(
+            status_code=400,
+            detail=f"Unsupported format: {ext}. Supported: {supported_extensions}"
+        )
+    try:
+        # Save uploaded file temporarily
+        with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp:
+            content = await file.read()
+            tmp.write(content)
+            tmp_path = tmp.name
+        # Convert document
+        converter = get_converter()
+        result = converter.convert(tmp_path)
+        # Extract data
+        doc = result.document
+        # Get markdown
+        markdown_text = doc.export_to_markdown()
+        # Extract tables
+        tables_data = []
+        for table_idx, table in enumerate(doc.tables):
+            try:
+                df = table.export_to_dataframe()
+                table_dict = {
+                    "table_index": table_idx,
+                    "rows": df.to_dict('records'),
+                    "row_count": len(df)
+                }
+                tables_data.append(table_dict)
+            except Exception as e:
+                tables_data.append({
+                    "table_index": table_idx,
+                    "error": str(e)
+                })
+        # Build response
+        response = {
+            "success": True,
+            "file_name": file.filename,
+            "document": {
+                "markdown": markdown_text,
+                "text": doc.export_to_text() if hasattr(doc, 'export_to_text') else markdown_text,
+                "num_pages": len(doc.pages) if hasattr(doc, 'pages') else 0,
+                "tables": tables_data,
+                "tables_count": len(tables_data)
+            },
+            "metadata": {
+                "format": ext,
+                "engine": "docling",
+                "model": "docling-default"
+            }
+        }
+        # Cleanup
+        os.unlink(tmp_path)
+        return JSONResponse(content=response)
+    except Exception as e:
+        # Cleanup on error
+        if 'tmp_path' in locals():
+            try:
+                os.unlink(tmp_path)
+            except OSError:
+                pass
+        raise HTTPException(status_code=500, detail=f"Conversion failed: {str(e)}")
+@app.post("/convert/markdown")
+async def convert_to_markdown(file: UploadFile = File(...)):
+    """Convert document to markdown only (lightweight)"""
+    try:
+        with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename).suffix.lower()) as tmp:
+            content = await file.read()
+            tmp.write(content)
+            tmp_path = tmp.name
+        converter = get_converter()
+        result = converter.convert(tmp_path)
+        markdown = result.document.export_to_markdown()
+        os.unlink(tmp_path)
+        return {
+            "success": True,
+            "markdown": markdown,
+            "file_name": file.filename
+        }
+    except Exception as e:
+        if 'tmp_path' in locals():
+            try:
+                os.unlink(tmp_path)
+            except OSError:
+                pass
+        raise HTTPException(status_code=500, detail=str(e))
+@app.post("/convert/tables")
+async def convert_tables(file: UploadFile = File(...)):
+    """Extract tables only from document"""
+    try:
+        with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename).suffix.lower()) as tmp:
+            content = await file.read()
+            tmp.write(content)
+            tmp_path = tmp.name
+        converter = get_converter()
+        result = converter.convert(tmp_path)
+        tables_data = []
+        for table_idx, table in enumerate(result.document.tables):
+            try:
+                df = table.export_to_dataframe()
+                tables_data.append({
+                    "table_index": table_idx,
+                    "headers": list(df.columns),
+                    "rows": df.to_dict('records'),
+                    "row_count": len(df)
+                })
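+            except Exception as e:
+                # Keep failed tables visible instead of dropping them silently
+                tables_data.append({"table_index": table_idx, "error": str(e)})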
+        os.unlink(tmp_path)
+        return {
+            "success": True,
+            "tables": tables_data,
+            "tables_count": len(tables_data),
+            "file_name": file.filename
+        }
+    except Exception as e:
+        if 'tmp_path' in locals():
+            try:
+                os.unlink(tmp_path)
+            except OSError:
+                pass
+        raise HTTPException(status_code=500, detail=str(e))
+if __name__ == "__main__":
+    print("="*60)
+    print("Docling Document Converter API")
+    print("="*60)
+    print("URL: http://localhost:8080")
+    print("Docs: http://localhost:8080/docs")
+    print("="*60)
+    uvicorn.run(
+        "app:app",
+        host="0.0.0.0",
+        port=8080,
+        reload=True
+    )
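
Review note on the three `/convert*` handlers: each one repeats the same "write temp file, convert, unlink, unlink again on error" sequence and probes `locals()` for `tmp_path`. One way to remove the duplication is a context manager with `try`/`finally`. The sketch below uses only the standard library; `saved_upload` is a hypothetical name, not something in this commit.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def saved_upload(content: bytes, suffix: str = "") -> Iterator[str]:
    """Write `content` to a temporary file and delete it when the block exits."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(content)
        tmp_path = tmp.name
    try:
        yield tmp_path
    finally:
        # Runs on success and on error, so cleanup is written only once.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
```

Inside a handler this would read roughly `with saved_upload(await file.read(), ext) as tmp_path: result = get_converter().convert(tmp_path)`, with the response built inside the block.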

docling-api/requirements.txt ADDED Viewed

	@@ -0,0 +1,6 @@

+docling>=2.88.0
+fastapi>=0.100.0
+uvicorn>=0.23.0
+python-multipart>=0.0.6
+pandas>=2.0.0
+pillow>=10.0.0

docstrange-api/Dockerfile ADDED Viewed

	@@ -0,0 +1,12 @@

+FROM python:3.10-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY app.py .
+EXPOSE 7860
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

docstrange-api/README.md ADDED Viewed

	@@ -0,0 +1,55 @@

+# 🧪 DocStrange API Deployment Guide
+## 📦 **Files in This Folder**
+- `app.py` - FastAPI application for DocStrange extraction
+- `requirements.txt` - Python dependencies
+- `Dockerfile` - Container configuration
+## 🚀 **Deploy to Hugging Face**
+### **Method 1: Via Web UI (Easiest)**
+1. Go to **https://huggingface.co/spaces**
+2. Click **Create new Space**
+3. **Name**: `docstrange-api`
+4. **SDK**: `Docker`
+5. **Visibility**: `Public` (free) or `Private` (needs token)
+6. Click **Create Space**
+7. Upload `app.py`, `requirements.txt`, and `Dockerfile`
+8. Wait 3-5 minutes for deployment
+### **Method 2: Via Git**
+```bash
+git clone https://huggingface.co/spaces/YOUR_USERNAME/docstrange-api
+cd docstrange-api
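+# Copy from wherever this package lives; adjust the source path as needed
+cp /path/to/huggingface_deploy/docstrange-api/{app.py,requirements.txt,Dockerfile} .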
+git add .
+git commit -m "Deploy DocStrange API"
+git push
+```
+## 🧪 **Test Your Deployment**
+```bash
+cd test-scripts
+python test_docstrange.py https://YOUR_USERNAME-docstrange-api.hf.space
+```
+## 📡 **API Documentation**
+Once deployed, visit: `https://YOUR_USERNAME-docstrange-api.hf.space/docs`
+## 🔧 **Endpoints**
+- `GET /` - Health check
+- `POST /extract` - Full document extraction
+- `POST /extract/markdown` - Markdown/text only
+- `POST /extract/tables` - Tables only
+## 💡 **Tips**
+- DocStrange supports GPU mode if available
+- Automatic GPU detection in the API
+- Accepts PDF, DOCX, XLSX, PPTX, common image formats, TXT, HTML, MD, and CSV (see `supported_formats` in `app.py`)
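
Review note on "Automatic GPU detection": the detection amounts to asking PyTorch whether CUDA is usable and, if so, reading the device's total memory. The sketch below shows that check in isolation; `detect_gpu` is a hypothetical name, and it treats a missing or broken `torch` install as "no GPU".

```python
from typing import Tuple


def detect_gpu() -> Tuple[bool, str]:
    """Return (gpu_available, vram_label); falls back to (False, 'N/A')."""
    try:
        import torch  # optional dependency
    except Exception:  # torch missing or failing to load
        return False, "N/A"
    if not torch.cuda.is_available():
        return False, "N/A"
    # The device property is named `total_memory` (in bytes).
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return True, f"{total_bytes / 1024**3:.1f}GB"
```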

docstrange-api/app.py ADDED Viewed

	@@ -0,0 +1,237 @@

+"""
+DocStrange Hugging Face Spaces API
+Deploy this on Hugging Face Spaces to provide DocStrange extraction API
+"""
+import os
+import sys
+import tempfile
+from pathlib import Path
+from fastapi import FastAPI, File, UploadFile, HTTPException
+from fastapi.responses import JSONResponse
+from fastapi.middleware.cors import CORSMiddleware
+import uvicorn
+# Add docstrange to path
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'docstrange'))
+try:
+    from docstrange import DocumentExtractor
+    HAS_DOCTSTRANGE = True
+except ImportError:
+    HAS_DOCTSTRANGE = False
+app = FastAPI(
+    title="DocStrange Document Extractor API",
+    description="Extract structured data from documents using DocStrange AI",
+    version="1.0.0"
+)
+# Allow CORS for DataSync integration
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+# Global extractor instance
+extractor = None
+def get_extractor():
+    """Get or create DocumentExtractor instance"""
+    global extractor
+    if extractor is None:
+        if not HAS_DOCTSTRANGE:
+            raise HTTPException(status_code=500, detail="DocStrange not installed")
+        # Use GPU if available, otherwise cloud mode
+        try:
+            import torch
+            gpu_mode = torch.cuda.is_available()
+        except Exception:
+            gpu_mode = False
+        if gpu_mode:
+            extractor = DocumentExtractor(gpu=True)
+        else:
+            extractor = DocumentExtractor()
+    return extractor
+@app.get("/")
+def root():
+    """Health check"""
+    return {
+        "status": "ok",
+        "service": "DocStrange API",
+        "version": "1.0.0",
+        "docstrange_available": HAS_DOCTSTRANGE
+    }
+@app.get("/health")
+def health():
+    """Health check"""
+    try:
+        import torch
+        gpu = torch.cuda.is_available()
+        # The device property is `total_memory`; `total_mem` raises AttributeError
+        vram = f"{torch.cuda.get_device_properties(0).total_memory/1024**3:.1f}GB" if gpu else "N/A"
+    except Exception:
+        gpu = False
+        vram = "N/A"
+    return {
+        "status": "ok",
+        "gpu": gpu,
+        "vram": vram,
+        "docstrange": HAS_DOCTSTRANGE
+    }
+@app.post("/extract")
+async def extract_document(
+    file: UploadFile = File(...),
+    output_format: str = "markdown"
+):
+    """
+    Extract structured data from document
+    Args:
+        file: Document file (PDF, DOCX, XLSX, Images, etc.)
+        output_format: markdown, json, csv, html, text, flat-json, all
+    Returns: JSON with extracted data
+    """
+    if not file.filename:
+        raise HTTPException(status_code=400, detail="No file provided")
+    supported_formats = ['.pdf', '.docx', '.xlsx', '.pptx', '.png', '.jpg', '.jpeg',
+                        '.bmp', '.tiff', '.webp', '.gif', '.txt', '.html', '.md', '.csv']
+    ext = Path(file.filename).suffix.lower()
+    if ext not in supported_formats:
+        raise HTTPException(
+            status_code=400,
+            detail=f"Unsupported format: {ext}. Supported: {supported_formats}"
+        )
+    try:
+        # Save uploaded file temporarily
+        with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp:
+            content = await file.read()
+            tmp.write(content)
+            tmp_path = tmp.name
+        # Extract document
+        # Use a distinct name so the file extension in `ext` is not overwritten
+        doc_extractor = get_extractor()
+        result = doc_extractor.extract_document(tmp_path, output_format=output_format)
+        # Build response
+        response = {
+            "success": True,
+            "file_name": file.filename,
+            "data": result.get('data', {}),
+            "format": result.get('format', output_format),
+            "metadata": {
+                "file_size": result.get('metadata', {}).get('file_size', 0),
+                "engine": "docstrange",
+                "gpu_mode": result.get('metadata', {}).get('gpu_mode', False)
+            }
+        }
+        # Cleanup
+        os.unlink(tmp_path)
+        return JSONResponse(content=response)
+    except Exception as e:
+        # Cleanup on error
+        if 'tmp_path' in locals():
+            try:
+                os.unlink(tmp_path)
+            except OSError:
+                pass
+        raise HTTPException(status_code=500, detail=f"Extraction failed: {str(e)}")
+@app.post("/extract/markdown")
+async def extract_to_markdown(file: UploadFile = File(...)):
+    """Extract document to markdown only (lightweight)"""
+    try:
+        with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename).suffix.lower()) as tmp:
+            content = await file.read()
+            tmp.write(content)
+            tmp_path = tmp.name
+        ext = get_extractor()
+        result = ext.extract_document(tmp_path, output_format='markdown')
+        os.unlink(tmp_path)
+        return {
+            "success": True,
+            "markdown": result.get('data', ''),
+            "file_name": file.filename
+        }
+    except Exception as e:
+        if 'tmp_path' in locals():
+            try:
+                os.unlink(tmp_path)
+            except OSError:
+                pass
+        raise HTTPException(status_code=500, detail=str(e))
+@app.post("/extract/tables")
+async def extract_tables(file: UploadFile = File(...)):
+    """Extract tables only from document"""
+    try:
+        with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename).suffix.lower()) as tmp:
+            content = await file.read()
+            tmp.write(content)
+            tmp_path = tmp.name
+        # Extract with JSON format to get structured tables
+        ext = get_extractor()
+        result = ext.extract_document(tmp_path, output_format='json')
+        data = result.get('data', {})
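+        # `data` may be a string for some formats; only dicts carry a 'tables' key
+        tables = data.get('tables', []) if isinstance(data, dict) else []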
+        os.unlink(tmp_path)
+        return {
+            "success": True,
+            "tables": tables,
+            "tables_count": len(tables),
+            "file_name": file.filename
+        }
+    except Exception as e:
+        if 'tmp_path' in locals():
+            try:
+                os.unlink(tmp_path)
+            except OSError:
+                pass
+        raise HTTPException(status_code=500, detail=str(e))
+if __name__ == "__main__":
+    print("="*60)
+    print("DocStrange Document Extractor API")
+    print("="*60)
+    print("URL: http://localhost:8080")
+    print("Docs: http://localhost:8080/docs")
+    print("="*60)
+    uvicorn.run(
+        "app:app",
+        host="0.0.0.0",
+        port=8080,
+        reload=True
+    )
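
Review note on input validation: `/extract` checks the file extension, but `/extract/markdown` and `/extract/tables` do not, so unsupported or nameless uploads surface as 500 errors there. A shared helper could keep the three handlers consistent. This is a minimal sketch; `checked_extension` is a hypothetical name, and a handler would translate its `ValueError` into `HTTPException(status_code=400, ...)`.

```python
from pathlib import Path
from typing import Iterable, Optional


def checked_extension(filename: Optional[str], supported: Iterable[str]) -> str:
    """Return the lowercased extension, or raise ValueError if it is not allowed."""
    if not filename:
        raise ValueError("No file provided")
    ext = Path(filename).suffix.lower()
    if ext not in set(supported):
        raise ValueError(f"Unsupported format: {ext or '(none)'}")
    return ext
```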

docstrange-api/requirements.txt ADDED Viewed

	@@ -0,0 +1,9 @@

+# DocStrange API requirements
+# Note: docstrange itself is not listed here. The Dockerfile installs only this file,
+# so add docstrange (or copy it into the image); otherwise app.py reports "DocStrange not installed"
+fastapi>=0.100.0
+uvicorn>=0.23.0
+python-multipart>=0.0.6
+pillow>=10.0.0
+torch>=2.0.0
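
Review note on the missing `docstrange` dependency: `app.py` swallows the `ImportError` and only fails on the first request. A startup check that lists unavailable modules makes the problem visible in the build or boot logs. The sketch below uses `importlib.util.find_spec`; `missing_modules` is a hypothetical name.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]
```

For example, `missing_modules(["fastapi", "docstrange", "torch"])` could be logged once at startup.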

test-scripts/test_docling.py ADDED Viewed

	@@ -0,0 +1,74 @@

+"""
+Test Docling Hugging Face API
+Usage: python test_docling.py <HF_API_URL>
+"""
+import sys
+import requests
+import json
+if len(sys.argv) < 2:
+    print("Usage: python test_docling.py <HF_API_URL>")
+    print("Example: python test_docling.py https://YOUR_USERNAME-docling-api.hf.space")
+    sys.exit(1)
+HF_URL = sys.argv[1].rstrip('/')
+print(f"\n{'='*60}")
+print(f"Testing Docling API: {HF_URL}")
+print(f"{'='*60}\n")
+# Test 1: Health check
+print("1. Testing health check...")
+try:
+    resp = requests.get(f"{HF_URL}/", timeout=30)
+    resp.raise_for_status()  # a non-2xx status must not count as a pass
+    print(f"   Status: {resp.status_code}")
+    print(f"   Response: {resp.json()}")
+    print(f"   ✅ Health check passed!\n")
+except Exception as e:
+    print(f"   ❌ Failed: {e}\n")
+    sys.exit(1)
+# Precondition: a test PDF must exist in the current directory
+import os
+test_pdf = "test.pdf"
+if not os.path.exists(test_pdf):
+    print(f"⚠️  No test.pdf found. Please add a test PDF to this directory.")
+    print(f"   Or try the API interactively at: {HF_URL}/docs")
+    sys.exit(0)
+# Test 2: Full conversion
+print(f"2. Testing full document conversion with {test_pdf}...")
+try:
+    with open(test_pdf, 'rb') as f:
+        resp = requests.post(
+            f"{HF_URL}/convert",
+            files={"file": f},
+            timeout=120
+        )
+    print(f"   Status: {resp.status_code}")
+    if resp.status_code == 200:
+        data = resp.json()
+        print(f"   ✅ Success!")
+        print(f"   File: {data.get('file_name')}")
+        print(f"   Tables: {data.get('document', {}).get('tables_count', 0)}")
+        print(f"   Pages: {data.get('document', {}).get('num_pages', 0)}")
+        # Show first few tables
+        tables = data.get('document', {}).get('tables', [])
+        if tables:
+            print(f"\n   First table preview:")
+            for table in tables[:1]:
+                rows = table.get('rows', [])[:3]
+                for row in rows:
+                    print(f"     {row}")
+    else:
+        print(f"   ❌ Failed: {resp.text}\n")
+except Exception as e:
+    print(f"   ❌ Failed: {e}\n")
+print(f"\n{'='*60}")
+print("Test complete!")
+print(f"{'='*60}\n")
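
Review note on the result printing: the script reads the `/convert` response inline, which makes that logic hard to test without a live Space. Pulling it into a pure function allows a quick offline check. The sketch below assumes the response shape built in `docling-api/app.py` (`file_name`, `document.num_pages`, `document.tables`, `document.tables_count`); `summarize_convert_response` is a hypothetical name.

```python
from typing import Any, Dict


def summarize_convert_response(data: Dict[str, Any], preview_rows: int = 3) -> Dict[str, Any]:
    """Collect the fields the test script prints from a /convert response."""
    document = data.get("document", {})
    tables = document.get("tables", [])
    # Tables that failed to export carry an "error" key and no "rows".
    first_rows = tables[0].get("rows", [])[:preview_rows] if tables else []
    return {
        "file": data.get("file_name"),
        "pages": document.get("num_pages", 0),
        "tables": document.get("tables_count", len(tables)),
        "preview": first_rows,
    }
```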

test-scripts/test_docstrange.py ADDED Viewed

	@@ -0,0 +1,73 @@

+"""
+Test DocStrange Hugging Face API
+Usage: python test_docstrange.py <HF_API_URL>
+"""
+import sys
+import requests
+import json
+import os
+if len(sys.argv) < 2:
+    print("Usage: python test_docstrange.py <HF_API_URL>")
+    print("Example: python test_docstrange.py https://YOUR_USERNAME-docstrange-api.hf.space")
+    sys.exit(1)
+HF_URL = sys.argv[1].rstrip('/')
+print(f"\n{'='*60}")
+print(f"Testing DocStrange API: {HF_URL}")
+print(f"{'='*60}\n")
+# Test 1: Health check
+print("1. Testing health check...")
+try:
+    resp = requests.get(f"{HF_URL}/", timeout=30)
+    resp.raise_for_status()  # a non-2xx status must not count as a pass
+    print(f"   Status: {resp.status_code}")
+    print(f"   Response: {resp.json()}")
+    print(f"   ✅ Health check passed!\n")
+except Exception as e:
+    print(f"   ❌ Failed: {e}\n")
+    sys.exit(1)
+# Precondition: a test PDF must exist in the current directory
+test_pdf = "test.pdf"
+if not os.path.exists(test_pdf):
+    print(f"⚠️  No test.pdf found. Please add a test PDF to this directory.")
+    print(f"   Or check API docs at: {HF_URL}/docs")
+    sys.exit(0)
+# Test 2: Full extraction
+print(f"2. Testing document extraction with {test_pdf}...")
+try:
+    with open(test_pdf, 'rb') as f:
+        resp = requests.post(
+            f"{HF_URL}/extract",
+            files={"file": f},
+            timeout=120
+        )
+    print(f"   Status: {resp.status_code}")
+    if resp.status_code == 200:
+        data = resp.json()
+        print(f"   ✅ Success!")
+        print(f"   File: {data.get('file_name')}")
+        print(f"   Format: {data.get('format')}")
+        print(f"   Metadata: {json.dumps(data.get('metadata', {}), indent=2)}")
+        # Preview data
+        doc_data = data.get('data', {})
+        if isinstance(doc_data, str):
+            print(f"\n   Preview (first 200 chars):")
+            print(f"   {doc_data[:200]}...")
+        elif isinstance(doc_data, dict):
+            print(f"\n   Data keys: {list(doc_data.keys())}")
+    else:
+        print(f"   ❌ Failed: {resp.text}\n")
+except Exception as e:
+    print(f"   ❌ Failed: {e}\n")
+print(f"\n{'='*60}")
+print("Test complete!")
+print(f"{'='*60}\n")
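
Review note on the data preview: the script handles `str` and `dict` payloads but prints nothing for other types such as a list. A small formatter can cover every case. This is a sketch under the assumption that `data` may be any JSON value; `preview_data` is a hypothetical name.

```python
import json
from typing import Any


def preview_data(doc_data: Any, limit: int = 200) -> str:
    """Return a short, printable preview of the `data` field of an /extract response."""
    if isinstance(doc_data, dict):
        return "keys: " + ", ".join(sorted(str(k) for k in doc_data))
    # Strings are shown as-is; anything else is serialized to JSON first.
    text = doc_data if isinstance(doc_data, str) else json.dumps(doc_data, default=str)
    return text if len(text) <= limit else text[:limit] + "..."
```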