Upload 12 files
- QUICKSTART.md +74 -0
- README.md +125 -3
- docling-api/Dockerfile +12 -0
- docling-api/README.md +55 -0
- docling-api/app.py +232 -0
- docling-api/requirements.txt +6 -0
- docstrange-api/Dockerfile +12 -0
- docstrange-api/README.md +55 -0
- docstrange-api/app.py +237 -0
- docstrange-api/requirements.txt +9 -0
- test-scripts/test_docling.py +74 -0
- test-scripts/test_docstrange.py +73 -0
QUICKSTART.md
ADDED
@@ -0,0 +1,74 @@
# ⚡ Quick Start Guide - Hugging Face Deployment

## 🎯 **5-Minute Setup**

### **Step 1: Create HF Spaces (2 min)**

1. Go to **https://huggingface.co/spaces**
2. Create TWO spaces:
   - `docling-api`
   - `docstrange-api`
3. Use **Docker SDK** for both
4. Set to **Public** or **Private** (a private Space needs an access token)

### **Step 2: Upload Files (1 min)**

For EACH space:
1. Upload `app.py` from the corresponding folder
2. Upload `requirements.txt` and `Dockerfile` from the same folder (a Docker SDK Space builds from the `Dockerfile`)
3. Wait for deployment (2-3 min)

### **Step 3: Get Your URLs**

After deployment:
- Docling: `https://YOUR_USERNAME-docling-api.hf.space`
- DocStrange: `https://YOUR_USERNAME-docstrange-api.hf.space`
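
The URL pattern above can be captured in a small helper. This is a minimal sketch that assumes the `https://USERNAME-SPACENAME.hf.space` form shown in this guide holds for your account; `space_url` is a hypothetical name, and Hugging Face may normalize unusual characters in user or Space names, so copy the URL from the Space page if in doubt.

```python
def space_url(username: str, space_name: str) -> str:
    """Build a Space URL following the pattern used in this guide."""
    if not username or not space_name:
        raise ValueError("username and space_name must be non-empty")
    # Assumption: the subdomain is "<username>-<space_name>".
    return f"https://{username}-{space_name}.hf.space"


if __name__ == "__main__":
    for name in ("docling-api", "docstrange-api"):
        print(space_url("YOUR_USERNAME", name))
```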

### **Step 4: Connect to DataSync (1 min)**

1. Open **http://localhost:5000**
2. Go to **Import Data → DocStrange tab**
3. Select engine:
   - `🔬 Docling Hugging Face` OR
   - `🧪 DocStrange Hugging Face`
4. Paste your HF URL
5. Upload PDF and extract!

---

## 🧪 **Test Your APIs**

```bash
# Test both APIs
cd huggingface_deploy/test-scripts

python test_docling.py https://YOUR_USERNAME-docling-api.hf.space
python test_docstrange.py https://YOUR_USERNAME-docstrange-api.hf.space
```

---

## ✅ **You're Done!**

Both APIs are now integrated with DataSync and ready to extract documents!

---

## **Troubleshooting**

| Problem | Solution |
|---------|----------|
| Space not deploying | Check the Docker build logs for the Space |
| API returns 500 | Verify `requirements.txt` was uploaded |
| Timeout errors | PDF too large; try a smaller file |
| Not working in DataSync | Check URL format (no trailing slash) |
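
The trailing-slash row in the table is easy to guard against in client code. The sketch below is an illustration only; `normalize_base_url` and `endpoint_url` are hypothetical helpers, not part of DataSync or the test scripts.

```python
def normalize_base_url(url: str) -> str:
    """Strip surrounding whitespace and any trailing slashes."""
    return url.strip().rstrip("/")


def endpoint_url(base_url: str, path: str) -> str:
    """Join a base URL and an endpoint path with exactly one slash."""
    return f"{normalize_base_url(base_url)}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(endpoint_url("https://YOUR_USERNAME-docling-api.hf.space/", "/convert"))
```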

---

## **Next Steps**

- Try different engines for comparison
- Map extracted columns to ERPNext
- Download CSV/JSON of extracted data

**Happy extracting!**
README.md
CHANGED
@@ -1,4 +1,126 @@
# Hugging Face Document Extraction APIs

Complete deployment package for **Docling** and **DocStrange** on Hugging Face Spaces.

---

## **Folder Structure**

```
huggingface_deploy/
├── README.md                    # This file
├── docling-api/
│   ├── app.py                   # Docling FastAPI application
│   ├── requirements.txt         # Python dependencies
│   └── Dockerfile               # Docker configuration
├── docstrange-api/
│   ├── app.py                   # DocStrange FastAPI application
│   ├── requirements.txt         # Python dependencies
│   └── Dockerfile               # Docker configuration
└── test-scripts/
    ├── test_docling.py          # Test Docling API
    └── test_docstrange.py       # Test DocStrange API
```

---

## 🎯 **Quick Start**

### **Step 1: Create Hugging Face Spaces**

1. Go to **https://huggingface.co/spaces**
2. Click **Create new Space**
3. Create TWO spaces:
   - `docling-api` (for Docling)
   - `docstrange-api` (for DocStrange)

### **Step 2: Upload Files**

For each space:
1. Upload the corresponding `app.py`, `requirements.txt`, and `Dockerfile`
2. Wait for deployment (2-5 minutes)
3. Copy your Space URL, for example `https://YOUR_USERNAME-docling-api.hf.space`

### **Step 3: Connect to DataSync**

In DataSync → Import Data → DocStrange tab:
1. Select engine: `🔬 Docling Hugging Face` or `🧪 DocStrange Hugging Face`
2. Paste your HF URL
3. Click **Extract**

---

## **API Endpoints**

### **Docling API**

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Health check |
| `/health` | GET | Health status |
| `/convert` | POST | Full document conversion |
| `/convert/markdown` | POST | Markdown only |
| `/convert/tables` | POST | Tables only |

### **DocStrange API**

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Health check |
| `/health` | GET | Health status |
| `/extract` | POST | Full document extraction |
| `/extract/markdown` | POST | Markdown/text only |
| `/extract/tables` | POST | Tables only |
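
The POST endpoints in the tables above all take a multipart upload with a form field named `file` (per the `file: UploadFile = File(...)` signatures in the two `app.py` files). Below is a minimal client sketch using only the standard library. `build_multipart` and `post_document` are hypothetical helper names, and the 300-second timeout is an arbitrary assumption rather than a documented limit.

```python
import json
import urllib.request
import uuid


def build_multipart(filename: str, data: bytes, field: str = "file"):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_document(base_url: str, endpoint: str, filename: str, data: bytes) -> dict:
    """POST a file to an endpoint such as /convert and return the parsed JSON."""
    body, content_type = build_multipart(filename, data)
    request = urllib.request.Request(
        base_url.rstrip("/") + endpoint,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)
```

A call might look like `post_document("https://YOUR_USERNAME-docling-api.hf.space", "/convert/markdown", "sample.pdf", pdf_bytes)`, where `pdf_bytes` holds the file contents.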

---

## 🧪 **Testing**

```bash
# Test Docling API
python test-scripts/test_docling.py https://YOUR_USERNAME-docling-api.hf.space

# Test DocStrange API
python test-scripts/test_docstrange.py https://YOUR_USERNAME-docstrange-api.hf.space
```
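
When checking a response by hand, it can help to reduce it to a few fields. The sketch below assumes the `/convert` response shape built in `docling-api/app.py` in this commit (`success`, `document.num_pages`, `document.tables`); `summarize_conversion` is a hypothetical helper and is not taken from the test scripts.

```python
def summarize_conversion(payload: dict) -> dict:
    """Reduce a /convert response to a few fields worth asserting on."""
    document = payload.get("document", {})
    tables = document.get("tables", [])
    return {
        "success": bool(payload.get("success")),
        "pages": document.get("num_pages", 0),
        "tables": len(tables),
        # Tables that failed to export carry an "error" key instead of rows.
        "failed_tables": sum(1 for t in tables if "error" in t),
        "total_rows": sum(t.get("row_count", 0) for t in tables),
    }
```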

---

## 💰 **Hugging Face Tiers**

| Tier | Cost | Memory | Best For |
|------|------|--------|----------|
| **CPU Basic** | Free | 16GB | Testing, small PDFs |
| **CPU Upgrade** | Paid (see Hugging Face pricing) | 32GB | Medium documents |
| **T4 GPU** | $0.60/hr | 16GB + 16GB VRAM | Large docs, fast extraction |

**Recommendation**: Start with the free **CPU Basic** tier for testing, then upgrade to GPU for production.

---

## **Private Spaces**

If you want private APIs:
1. Go to Space Settings → **Make Private**
2. Create token: https://huggingface.co/settings/tokens
3. In DataSync, enter both URL and token
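
For clients other than DataSync, the token is typically sent as a bearer token. This is a minimal sketch assuming the private Space accepts an `Authorization: Bearer <token>` header; `authorized_request` is a hypothetical helper, and how DataSync itself sends the token is not shown in this commit.

```python
import urllib.request


def authorized_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request carrying a Hugging Face token as a bearer token."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


if __name__ == "__main__":
    req = authorized_request("https://YOUR_USERNAME-docling-api.hf.space/health", "hf_xxx")
    print(req.get_method(), req.full_url)
```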

---

## **Full Documentation**

- [Docling API Details](docling-api/README.md)
- [DocStrange API Details](docstrange-api/README.md)

---

## 🎯 **Integration with DataSync**

Both APIs are fully integrated with:
- ✅ DataSync Import Data module
- ✅ Automatic fallback on failure
- ✅ Structured data display
- ✅ Column mapping to ERPNext
- ✅ CSV/JSON export

**Ready to deploy!**
docling-api/Dockerfile
ADDED
@@ -0,0 +1,12 @@
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

EXPOSE 7860

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
docling-api/README.md
ADDED
@@ -0,0 +1,55 @@
# 🔬 Docling API Deployment Guide

## 📦 **Files in This Folder**

- `app.py` - FastAPI application for document conversion
- `requirements.txt` - Python dependencies
- `Dockerfile` - Container configuration

## **Deploy to Hugging Face**

### **Method 1: Via Web UI (Easiest)**

1. Go to **https://huggingface.co/spaces**
2. Click **Create new Space**
3. **Name**: `docling-api`
4. **SDK**: `Docker`
5. **Visibility**: `Public` (free) or `Private` (needs token)
6. Click **Create Space**
7. Upload `app.py`, `requirements.txt`, and `Dockerfile`
8. Wait 3-5 minutes for deployment

### **Method 2: Via Git**

```bash
git clone https://huggingface.co/spaces/YOUR_USERNAME/docling-api
cd docling-api
# Copy the three files from this folder into the clone (adjust the source path)
cp /path/to/huggingface_deploy/docling-api/{app.py,requirements.txt,Dockerfile} .
git add .
git commit -m "Deploy Docling API"
git push
```

## 🧪 **Test Your Deployment**

```bash
cd huggingface_deploy/test-scripts
python test_docling.py https://YOUR_USERNAME-docling-api.hf.space
```

## 📡 **API Documentation**

Once deployed, visit: `https://YOUR_USERNAME-docling-api.hf.space/docs`

## 🔧 **Endpoints**

- `GET /` - Health check
- `GET /health` - Health status
- `POST /convert` - Full document conversion
- `POST /convert/markdown` - Markdown only
- `POST /convert/tables` - Tables only

## 💡 **Tips**

- Start with **Free CPU tier** for testing
- Upgrade to **T4 GPU** for production (faster, handles large PDFs)
- Keep PDFs under 10MB for best performance
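
The 10MB guideline in the tips above can be checked on the client before uploading. This is a minimal sketch; it treats 10MB as 10 × 1024 × 1024 bytes, which is an assumption, and `within_size_limit` is a hypothetical helper (the API itself does not enforce this limit in `app.py`).

```python
import os

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed reading of the "10MB" guideline


def within_size_limit(num_bytes: int, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True for a non-empty payload no larger than the limit."""
    return 0 < num_bytes <= limit


def file_within_limit(path: str, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Check a file on disk without reading it into memory."""
    return within_size_limit(os.path.getsize(path), limit)
```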
docling-api/app.py
ADDED
@@ -0,0 +1,232 @@
"""
Docling Hugging Face Spaces API
Deploy this on Hugging Face Spaces to provide Docling extraction API
"""
import os
import tempfile
from pathlib import Path

from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
import uvicorn

app = FastAPI(
    title="Docling Document Converter API",
    description="Convert documents using Docling AI",
    version="1.0.0"
)

# Allow CORS for DataSync integration
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Global converter instance
converter = None


def get_converter():
    """Get or create DocumentConverter instance"""
    global converter
    if converter is None:
        converter = DocumentConverter()
    return converter


@app.get("/")
def root():
    """Health check"""
    return {
        "status": "ok",
        "service": "Docling API",
        "version": "1.0.0"
    }


@app.get("/health")
def health():
    """Health check"""
    # NOTE: "gpu" is a hard-coded string here; this endpoint does not probe for a GPU.
    return {"status": "ok", "gpu": "available"}


@app.post("/convert")
async def convert_document(file: UploadFile = File(...)):
    """
    Convert document to structured data

    Returns: JSON with markdown, tables, and metadata
    """
    if not file.filename:
        raise HTTPException(status_code=400, detail="No file provided")

    supported_extensions = ['.pdf', '.docx', '.xlsx', '.pptx', '.html', '.txt', '.md']
    ext = Path(file.filename).suffix.lower()
    if ext not in supported_extensions:
        raise HTTPException(
            status_code=400,
            detail=f"Unsupported format: {ext}. Supported: {supported_extensions}"
        )

    try:
        # Save uploaded file temporarily
        with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp:
            content = await file.read()
            tmp.write(content)
            tmp_path = tmp.name

        # Convert document
        converter = get_converter()
        result = converter.convert(tmp_path)

        # Extract data
        doc = result.document

        # Get markdown
        markdown_text = doc.export_to_markdown()

        # Extract tables
        tables_data = []
        for table_idx, table in enumerate(doc.tables):
            try:
                df = table.export_to_dataframe()
                table_dict = {
                    "table_index": table_idx,
                    "rows": df.to_dict('records'),
                    "row_count": len(df)
                }
                tables_data.append(table_dict)
            except Exception as e:
                # Report the failed table instead of aborting the whole request
                tables_data.append({
                    "table_index": table_idx,
                    "error": str(e)
                })

        # Build response
        response = {
            "success": True,
            "file_name": file.filename,
            "document": {
                "markdown": markdown_text,
                "text": doc.export_to_text() if hasattr(doc, 'export_to_text') else markdown_text,
                "num_pages": len(doc.pages) if hasattr(doc, 'pages') else 0,
                "tables": tables_data,
                "tables_count": len(tables_data)
            },
            "metadata": {
                "format": ext,
                "engine": "docling",
                "model": "docling-default"
            }
        }

        # Cleanup
        os.unlink(tmp_path)

        return JSONResponse(content=response)

    except Exception as e:
        # Cleanup on error
        if 'tmp_path' in locals():
            try:
                os.unlink(tmp_path)
            except OSError:
                pass

        raise HTTPException(status_code=500, detail=f"Conversion failed: {str(e)}")


@app.post("/convert/markdown")
async def convert_to_markdown(file: UploadFile = File(...)):
    """Convert document to markdown only (lightweight)"""
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename).suffix.lower()) as tmp:
            content = await file.read()
            tmp.write(content)
            tmp_path = tmp.name

        converter = get_converter()
        result = converter.convert(tmp_path)

        markdown = result.document.export_to_markdown()

        os.unlink(tmp_path)

        return {
            "success": True,
            "markdown": markdown,
            "file_name": file.filename
        }

    except Exception as e:
        if 'tmp_path' in locals():
            try:
                os.unlink(tmp_path)
            except OSError:
                pass
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/convert/tables")
async def convert_tables(file: UploadFile = File(...)):
    """Extract tables only from document"""
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename).suffix.lower()) as tmp:
            content = await file.read()
            tmp.write(content)
            tmp_path = tmp.name

        converter = get_converter()
        result = converter.convert(tmp_path)

        tables_data = []
        for table_idx, table in enumerate(result.document.tables):
            try:
                df = table.export_to_dataframe()
                tables_data.append({
                    "table_index": table_idx,
                    "headers": list(df.columns),
                    "rows": df.to_dict('records'),
                    "row_count": len(df)
                })
            except Exception:
                # Tables that fail to export are skipped silently here
                pass

        os.unlink(tmp_path)

        return {
            "success": True,
            "tables": tables_data,
            "tables_count": len(tables_data),
            "file_name": file.filename
        }

    except Exception as e:
        if 'tmp_path' in locals():
            try:
                os.unlink(tmp_path)
            except OSError:
                pass
        raise HTTPException(status_code=500, detail=str(e))


if __name__ == "__main__":
    # Local development entry point; the Dockerfile starts uvicorn on port 7860 instead.
    print("="*60)
    print("Docling Document Converter API")
    print("="*60)
    print("URL: http://localhost:8080")
    print("Docs: http://localhost:8080/docs")
    print("="*60)

    uvicorn.run(
        "app:app",
        host="0.0.0.0",
        port=8080,
        reload=True
    )
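
Each endpoint in `docling-api/app.py` repeats the same save-then-delete handling for the uploaded file. One way to centralize it is a context manager with `try`/`finally`, sketched below. This is an alternative illustration, not the code in this commit, and `temporary_upload` is a hypothetical name.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_upload(data: bytes, suffix: str = ""):
    """Write bytes to a named temp file, yield its path, and always delete it."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        yield path
    finally:
        try:
            os.unlink(path)
        except OSError:
            pass  # already removed; nothing left to clean up


if __name__ == "__main__":
    with temporary_upload(b"%PDF-1.4", suffix=".pdf") as tmp_path:
        print("would convert", tmp_path)
```

Inside an endpoint, the converter call would go in the `with` body, so the file is removed on both the success and the error path without checking `locals()`.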
docling-api/requirements.txt
ADDED
@@ -0,0 +1,6 @@
docling>=2.88.0
fastapi>=0.100.0
uvicorn>=0.23.0
python-multipart>=0.0.6
pandas>=2.0.0
pillow>=10.0.0
docstrange-api/Dockerfile
ADDED
@@ -0,0 +1,12 @@
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

EXPOSE 7860

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
docstrange-api/README.md
ADDED
@@ -0,0 +1,55 @@
# 🧪 DocStrange API Deployment Guide

## 📦 **Files in This Folder**

- `app.py` - FastAPI application for DocStrange extraction
- `requirements.txt` - Python dependencies
- `Dockerfile` - Container configuration

## **Deploy to Hugging Face**

### **Method 1: Via Web UI (Easiest)**

1. Go to **https://huggingface.co/spaces**
2. Click **Create new Space**
3. **Name**: `docstrange-api`
4. **SDK**: `Docker`
5. **Visibility**: `Public` (free) or `Private` (needs token)
6. Click **Create Space**
7. Upload `app.py`, `requirements.txt`, and `Dockerfile`
8. Wait 3-5 minutes for deployment

### **Method 2: Via Git**

```bash
git clone https://huggingface.co/spaces/YOUR_USERNAME/docstrange-api
cd docstrange-api
# Copy the three files from this folder into the clone (adjust the source path)
cp /path/to/huggingface_deploy/docstrange-api/{app.py,requirements.txt,Dockerfile} .
git add .
git commit -m "Deploy DocStrange API"
git push
```

## 🧪 **Test Your Deployment**

```bash
cd huggingface_deploy/test-scripts
python test_docstrange.py https://YOUR_USERNAME-docstrange-api.hf.space
```

## 📡 **API Documentation**

Once deployed, visit: `https://YOUR_USERNAME-docstrange-api.hf.space/docs`

## 🔧 **Endpoints**

- `GET /` - Health check
- `GET /health` - Health status
- `POST /extract` - Full document extraction
- `POST /extract/markdown` - Markdown/text only
- `POST /extract/tables` - Tables only

## 💡 **Tips**

- The API checks for a GPU with `torch.cuda.is_available()` and enables DocStrange GPU mode when one is found
- Without a GPU it falls back to the default `DocumentExtractor()`
- Accepts the formats listed in `supported_formats` in `app.py` (PDF, DOCX, XLSX, PPTX, common image types, TXT, HTML, Markdown, CSV)
docstrange-api/app.py
ADDED
@@ -0,0 +1,237 @@
| 1 |
+
"""
|
| 2 |
+
DocStrange Hugging Face Spaces API
|
| 3 |
+
Deploy this on Hugging Face Spaces to provide DocStrange extraction API
|
| 4 |
+
"""
|
| 5 |
+
import os
|
| 6 |
+
import sys
|
| 7 |
+
import tempfile
|
| 8 |
+
from pathlib import Path
|
| 9 |
+
|
| 10 |
+
from fastapi import FastAPI, File, UploadFile, HTTPException
|
| 11 |
+
from fastapi.responses import JSONResponse
|
| 12 |
+
from fastapi.middleware.cors import CORSMiddleware
|
| 13 |
+
import uvicorn
|
| 14 |
+
|
| 15 |
+
# Add docstrange to path
|
| 16 |
+
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'docstrange'))
|
| 17 |
+
|
| 18 |
+
try:
|
| 19 |
+
from docstrange import DocumentExtractor
|
| 20 |
+
HAS_DOCTSTRANGE = True
|
| 21 |
+
except ImportError:
|
| 22 |
+
HAS_DOCTSTRANGE = False
|
| 23 |
+
|
| 24 |
+
app = FastAPI(
|
| 25 |
+
title="DocStrange Document Extractor API",
|
| 26 |
+
description="Extract structured data from documents using DocStrange AI",
|
| 27 |
+
version="1.0.0"
|
| 28 |
+
)
|
| 29 |
+
|
| 30 |
+
# Allow CORS for DataSync integration
|
| 31 |
+
app.add_middleware(
|
| 32 |
+
CORSMiddleware,
|
| 33 |
+
allow_origins=["*"],
|
| 34 |
+
allow_credentials=True,
|
| 35 |
+
allow_methods=["*"],
|
| 36 |
+
allow_headers=["*"],
|
| 37 |
+
)
|
| 38 |
+
|
| 39 |
+
# Global extractor instance
|
| 40 |
+
extractor = None
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
def get_extractor():
|
| 44 |
+
"""Get or create DocumentExtractor instance"""
|
| 45 |
+
global extractor
|
| 46 |
+
if extractor is None:
|
| 47 |
+
if not HAS_DOCTSTRANGE:
|
| 48 |
+
raise HTTPException(status_code=500, detail="DocStrange not installed")
|
| 49 |
+
|
| 50 |
+
# Use GPU if available, otherwise cloud mode
|
| 51 |
+
try:
|
| 52 |
+
import torch
|
| 53 |
+
gpu_mode = torch.cuda.is_available()
|
| 54 |
+
except:
|
| 55 |
+
gpu_mode = False
|
| 56 |
+
|
| 57 |
+
if gpu_mode:
|
| 58 |
+
extractor = DocumentExtractor(gpu=True)
|
| 59 |
+
else:
|
| 60 |
+
extractor = DocumentExtractor()
|
| 61 |
+
|
| 62 |
+
return extractor
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
@app.get("/")
|
| 66 |
+
def root():
|
| 67 |
+
"""Health check"""
|
| 68 |
+
return {
|
| 69 |
+
"status": "ok",
|
| 70 |
+
"service": "DocStrange API",
|
| 71 |
+
"version": "1.0.0",
|
| 72 |
+
"gpu_available": HAS_DOCTSTRANGE
|
| 73 |
+
}
|
| 74 |
+
|
| 75 |
+
|
| 76 |
+
@app.get("/health")
|
| 77 |
+
def health():
|
| 78 |
+
"""Health check"""
|
| 79 |
+
try:
|
| 80 |
+
import torch
|
| 81 |
+
gpu = torch.cuda.is_available()
|
| 82 |
+
vram = f"{torch.cuda.get_device_properties(0).total_mem/1024**3:.1f}GB" if gpu else "N/A"
|
| 83 |
+
except:
|
| 84 |
+
gpu = False
|
| 85 |
+
vram = "N/A"
|
| 86 |
+
|
| 87 |
+
return {
|
| 88 |
+
"status": "ok",
|
| 89 |
+
"gpu": gpu,
|
| 90 |
+
"vram": vram,
|
| 91 |
+
"docstrange": HAS_DOCTSTRANGE
|
| 92 |
+
}
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
@app.post("/extract")
|
| 96 |
+
async def extract_document(
|
| 97 |
+
file: UploadFile = File(...),
|
| 98 |
+
output_format: str = "markdown"
|
| 99 |
+
):
|
| 100 |
+
"""
|
| 101 |
+
Extract structured data from document
|
| 102 |
+
|
| 103 |
+
Args:
|
| 104 |
+
file: Document file (PDF, DOCX, XLSX, Images, etc.)
|
| 105 |
+
output_format: markdown, json, csv, html, text, flat-json, all
|
| 106 |
+
|
| 107 |
+
Returns: JSON with extracted data
|
| 108 |
+
"""
|
| 109 |
+
if not file.filename:
|
| 110 |
+
raise HTTPException(status_code=400, detail="No file provided")
|
| 111 |
+
|
| 112 |
+
supported_formats = ['.pdf', '.docx', '.xlsx', '.pptx', '.png', '.jpg', '.jpeg',
|
| 113 |
+
'.bmp', '.tiff', '.webp', '.gif', '.txt', '.html', '.md', '.csv']
|
| 114 |
+
ext = Path(file.filename).suffix.lower()
|
| 115 |
+
if ext not in supported_formats:
|
| 116 |
+
raise HTTPException(
|
| 117 |
+
status_code=400,
|
| 118 |
+
detail=f"Unsupported format: {ext}. Supported: {supported_formats}"
|
| 119 |
+
)
|
| 120 |
+
|
| 121 |
+
try:
|
| 122 |
+
# Save uploaded file temporarily
|
| 123 |
+
with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp:
|
| 124 |
+
content = await file.read()
|
| 125 |
+
tmp.write(content)
|
| 126 |
+
tmp_path = tmp.name
|
| 127 |
+
|
| 128 |
+
# Extract document
|
| 129 |
+
ext = get_extractor()
|
| 130 |
+
result = ext.extract_document(tmp_path, output_format=output_format)
|
| 131 |
+
|
| 132 |
+
# Build response
|
| 133 |
+
response = {
|
| 134 |
+
"success": True,
|
| 135 |
+
"file_name": file.filename,
|
| 136 |
+
"data": result.get('data', {}),
|
| 137 |
+
"format": result.get('format', output_format),
|
| 138 |
+
"metadata": {
|
| 139 |
+
"file_size": result.get('metadata', {}).get('file_size', 0),
|
| 140 |
+
"engine": "docstrange",
|
| 141 |
+
"gpu_mode": result.get('metadata', {}).get('gpu_mode', False)
|
| 142 |
+
}
|
| 143 |
+
}
|
| 144 |
+
|
| 145 |
+
# Cleanup
|
| 146 |
+
os.unlink(tmp_path)
|
| 147 |
+
|
| 148 |
+
return JSONResponse(content=response)
|
| 149 |
+
|
| 150 |
+
except Exception as e:
|
| 151 |
+
# Cleanup on error
|
| 152 |
+
if 'tmp_path' in locals():
|
| 153 |
+
try:
|
| 154 |
+
os.unlink(tmp_path)
|
| 155 |
+
except:
|
| 156 |
+
pass
|
| 157 |
+
|
| 158 |
+
raise HTTPException(status_code=500, detail=f"Extraction failed: {str(e)}")
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
@app.post("/extract/markdown")
|
| 162 |
+
async def extract_to_markdown(file: UploadFile = File(...)):
|
| 163 |
+
"""Extract document to markdown only (lightweight)"""
|
| 164 |
+
try:
|
| 165 |
+
with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename).suffix.lower()) as tmp:
|
| 166 |
+
content = await file.read()
|
| 167 |
+
tmp.write(content)
|
| 168 |
+
tmp_path = tmp.name
|
| 169 |
+
|
| 170 |
+
ext = get_extractor()
|
| 171 |
+
result = ext.extract_document(tmp_path, output_format='markdown')
|
| 172 |
+
|
| 173 |
+
os.unlink(tmp_path)
|
| 174 |
+
|
| 175 |
+
return {
|
| 176 |
+
"success": True,
|
| 177 |
+
"markdown": result.get('data', ''),
|
| 178 |
+
"file_name": file.filename
|
| 179 |
+
}
|
| 180 |
+
|
| 181 |
+
except Exception as e:
|
| 182 |
+
if 'tmp_path' in locals():
|
| 183 |
+
try:
|
| 184 |
+
os.unlink(tmp_path)
|
| 185 |
+
except:
|
| 186 |
+
pass
|
| 187 |
+
raise HTTPException(status_code=500, detail=str(e))
|
| 188 |
+
|
| 189 |
+
|
| 190 |
+
@app.post("/extract/tables")
async def extract_tables(file: UploadFile = File(...)):
    """Extract tables only from document"""
    try:
        # Save the upload to a temp file, keeping its original extension
        with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename).suffix.lower()) as tmp:
            content = await file.read()
            tmp.write(content)
            tmp_path = tmp.name

        # Extract with JSON format to get structured tables
        ext = get_extractor()
        result = ext.extract_document(tmp_path, output_format='json')

        data = result.get('data', {})
        # Guard against non-dict payloads (for example, a plain string)
        tables = data.get('tables', []) if isinstance(data, dict) else []

        os.unlink(tmp_path)

        return {
            "success": True,
            "tables": tables,
            "tables_count": len(tables),
            "file_name": file.filename
        }

    except Exception as e:
        # Best-effort cleanup if the temp file was created before the failure
        if 'tmp_path' in locals():
            try:
                os.unlink(tmp_path)
            except OSError:
                pass
        raise HTTPException(status_code=500, detail=str(e))


if __name__ == "__main__":
    print("="*60)
    print("DocStrange Document Extractor API")
    print("="*60)
    print("URL: http://localhost:8080")
    print("Docs: http://localhost:8080/docs")
    print("="*60)

    uvicorn.run(
        "app:app",
        host="0.0.0.0",
        port=8080,
        reload=True
    )
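Both endpoints above repeat the same bookkeeping: write the upload to a temp file, run the extractor, then unlink the file on success and again in the error path. A minimal sketch of how that pattern could be factored into a context manager follows. The helper name `saved_upload` is hypothetical and not part of the original `app.py`; it uses only the standard library and takes raw bytes plus a filename rather than a FastAPI `UploadFile`.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def saved_upload(content: bytes, filename: str):
    """Write uploaded bytes to a temp file and always delete it afterwards.

    Hypothetical helper for illustration; not in the original app.py.
    """
    # Keep the original extension (lowercased), as the endpoints above do.
    suffix = Path(filename or "").suffix.lower()
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(content)
        tmp_path = tmp.name
    try:
        yield tmp_path
    finally:
        # Runs on success and on error, so the temp file is not leaked.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
```

Inside an endpoint this might be used as `with saved_upload(await file.read(), file.filename) as tmp_path:` followed by the `ext.extract_document(tmp_path, ...)` call, which would remove the need for the `'tmp_path' in locals()` check.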
docstrange-api/requirements.txt
ADDED
@@ -0,0 +1,9 @@
# DocStrange API requirements
# Note: This assumes docstrange is installed in the space
# Add your specific dependencies here

fastapi>=0.100.0
uvicorn>=0.23.0
python-multipart>=0.0.6
pillow>=10.0.0
torch>=2.0.0
test-scripts/test_docling.py
ADDED
@@ -0,0 +1,74 @@
"""
Test Docling Hugging Face API
Usage: python test_docling.py <HF_API_URL>
"""
import os
import sys

import requests

if len(sys.argv) < 2:
    print("Usage: python test_docling.py <HF_API_URL>")
    print("Example: python test_docling.py https://your-username-docling-api.hf.space")
    sys.exit(1)

HF_URL = sys.argv[1].rstrip('/')

print(f"\n{'='*60}")
print(f"Testing Docling API: {HF_URL}")
print(f"{'='*60}\n")

# Test 1: Health check
print("1. Testing health check...")
try:
    resp = requests.get(f"{HF_URL}/", timeout=30)
    resp.raise_for_status()  # treat non-2xx responses as failures
    print(f"   Status: {resp.status_code}")
    print(f"   Response: {resp.json()}")
    print("   ✅ Health check passed!\n")
except Exception as e:
    print(f"   ❌ Failed: {e}\n")
    sys.exit(1)

# The conversion test needs a local PDF to upload
test_pdf = "test.pdf"
if not os.path.exists(test_pdf):
    print("⚠️  No test.pdf found. Please add a test PDF to this directory.")
    print(f"   Or try the API interactively at: {HF_URL}/docs")
    sys.exit(0)

# Test 2: Full conversion
print(f"2. Testing full document conversion with {test_pdf}...")
try:
    with open(test_pdf, 'rb') as f:
        resp = requests.post(
            f"{HF_URL}/convert",
            files={"file": f},
            timeout=120
        )

    print(f"   Status: {resp.status_code}")

    if resp.status_code == 200:
        data = resp.json()
        print("   ✅ Success!")
        print(f"   File: {data.get('file_name')}")
        print(f"   Tables: {data.get('document', {}).get('tables_count', 0)}")
        print(f"   Pages: {data.get('document', {}).get('num_pages', 0)}")

        # Preview the first three rows of the first table
        tables = data.get('document', {}).get('tables', [])
        if tables:
            print("\n   First table preview:")
            for table in tables[:1]:
                rows = table.get('rows', [])[:3]
                for row in rows:
                    print(f"      {row}")
    else:
        print(f"   ❌ Failed: {resp.text}\n")

except Exception as e:
    print(f"   ❌ Failed: {e}\n")

print(f"\n{'='*60}")
print("Test complete!")
print(f"{'='*60}\n")
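The script reads several nested fields from the `/convert` response in separate `print` calls. A minimal sketch of collecting them in one place follows, assuming only the response shape the script already relies on (`file_name`, plus a `document` object with `tables_count`, `num_pages`, and `tables[].rows`). The helper name `summarize_conversion` and the sample payload are hypothetical, and the real API response may contain more or different fields.

```python
def summarize_conversion(data: dict, max_rows: int = 3) -> dict:
    """Collect the fields the test script prints from a /convert response."""
    document = data.get("document") or {}
    tables = document.get("tables") or []
    # Preview only the first table, capped at max_rows rows.
    preview = tables[0].get("rows", [])[:max_rows] if tables else []
    return {
        "file_name": data.get("file_name"),
        "tables_count": document.get("tables_count", 0),
        "num_pages": document.get("num_pages", 0),
        "first_table_preview": preview,
    }


# Illustrative payload only; not captured from a real response.
sample = {
    "file_name": "test.pdf",
    "document": {
        "num_pages": 2,
        "tables_count": 1,
        "tables": [{"rows": [["a", "b"], ["1", "2"], ["3", "4"], ["5", "6"]]}],
    },
}
print(summarize_conversion(sample))
```

Returning a plain dict keeps the summary easy to assert on if the script is later turned into an automated test.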
test-scripts/test_docstrange.py
ADDED
@@ -0,0 +1,73 @@
"""
Test DocStrange Hugging Face API
Usage: python test_docstrange.py <HF_API_URL>
"""
import json
import os
import sys

import requests

if len(sys.argv) < 2:
    print("Usage: python test_docstrange.py <HF_API_URL>")
    print("Example: python test_docstrange.py https://your-username-docstrange-api.hf.space")
    sys.exit(1)

HF_URL = sys.argv[1].rstrip('/')

print(f"\n{'='*60}")
print(f"Testing DocStrange API: {HF_URL}")
print(f"{'='*60}\n")

# Test 1: Health check
print("1. Testing health check...")
try:
    resp = requests.get(f"{HF_URL}/", timeout=30)
    resp.raise_for_status()  # treat non-2xx responses as failures
    print(f"   Status: {resp.status_code}")
    print(f"   Response: {resp.json()}")
    print("   ✅ Health check passed!\n")
except Exception as e:
    print(f"   ❌ Failed: {e}\n")
    sys.exit(1)

# The extraction test needs a local PDF to upload
test_pdf = "test.pdf"
if not os.path.exists(test_pdf):
    print("⚠️  No test.pdf found. Please add a test PDF to this directory.")
    print(f"   Or check API docs at: {HF_URL}/docs")
    sys.exit(0)

# Test 2: Full extraction
print(f"2. Testing document extraction with {test_pdf}...")
try:
    with open(test_pdf, 'rb') as f:
        resp = requests.post(
            f"{HF_URL}/extract",
            files={"file": f},
            timeout=120
        )

    print(f"   Status: {resp.status_code}")

    if resp.status_code == 200:
        data = resp.json()
        print("   ✅ Success!")
        print(f"   File: {data.get('file_name')}")
        print(f"   Format: {data.get('format')}")
        print(f"   Metadata: {json.dumps(data.get('metadata', {}), indent=2)}")

        # Preview data: text output is truncated, structured output lists its keys
        doc_data = data.get('data', {})
        if isinstance(doc_data, str):
            print("\n   Preview (first 200 chars):")
            print(f"   {doc_data[:200]}...")
        elif isinstance(doc_data, dict):
            print(f"\n   Data keys: {list(doc_data.keys())}")
    else:
        print(f"   ❌ Failed: {resp.text}\n")

except Exception as e:
    print(f"   ❌ Failed: {e}\n")

print(f"\n{'='*60}")
print("Test complete!")
print(f"{'='*60}\n")
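The preview step in the script branches on whether the `data` field is a string (for example, markdown) or a dict (structured output). A minimal sketch of the same branching as a standalone function follows, so it can be checked without a running Space. The name `preview_data` is hypothetical, and the fallback branch for other types is an addition that the original script does not have.

```python
def preview_data(doc_data, limit: int = 200) -> str:
    """Describe the `data` field of an /extract response in one line."""
    if isinstance(doc_data, str):
        # Text output: show at most `limit` characters, marking truncation.
        suffix = "..." if len(doc_data) > limit else ""
        return f"Preview (first {limit} chars): {doc_data[:limit]}{suffix}"
    if isinstance(doc_data, dict):
        # Structured output: list the top-level keys in a stable order.
        return f"Data keys: {sorted(doc_data.keys())}"
    # Assumption: anything else is reported rather than silently skipped.
    return f"Unexpected data type: {type(doc_data).__name__}"
```

Unlike the script, this version appends `...` only when the text was actually cut, which avoids implying truncation for short documents.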