arjunbhargav212 committed on
Commit ad5d213 · verified · 1 Parent(s): 2bcba00

Upload 12 files

QUICKSTART.md ADDED
@@ -0,0 +1,74 @@
+ # ⚡ Quick Start Guide - Hugging Face Deployment
+
+ ## 🎯 **5-Minute Setup**
+
+ ### **Step 1: Create HF Spaces (2 min)**
+
+ 1. Go to **https://huggingface.co/spaces**
+ 2. Create TWO spaces:
+    - `docling-api`
+    - `docstrange-api`
+ 3. Use **Docker SDK** for both
+ 4. Set to **Public** (free) or **Private**
+
+ ### **Step 2: Upload Files (1 min)**
+
+ For EACH space:
+ 1. Upload `app.py` from the corresponding folder
+ 2. Upload `requirements.txt` and `Dockerfile` from the same folder
+ 3. Wait for deployment (2-3 min)
+
+ ### **Step 3: Get Your URLs**
+
+ After deployment:
+ - Docling: `https://YOUR_USERNAME-docling-api.hf.space`
+ - DocStrange: `https://YOUR_USERNAME-docstrange-api.hf.space`
+
+ ### **Step 4: Connect to DataSync (1 min)**
+
+ 1. Open **http://localhost:5000**
+ 2. Go to **Import Data → DocStrange tab**
+ 3. Select engine:
+    - `🔬 Docling Hugging Face` OR
+    - `🧪 DocStrange Hugging Face`
+ 4. Paste your HF URL
+ 5. Upload a PDF and extract!
+
+ ---
+
+ ## 🧪 **Test Your APIs**
+
+ ```bash
+ # Test both APIs
+ cd huggingface_deploy/test-scripts
+
+ python test_docling.py https://YOUR_USERNAME-docling-api.hf.space
+ python test_docstrange.py https://YOUR_USERNAME-docstrange-api.hf.space
+ ```
+
+ ---
+
+ ## ✅ **You're Done!**
+
+ Both APIs are now integrated with DataSync and ready to extract documents!
+
+ ---
+
+ ## 🆘 **Troubleshooting**
+
+ | Problem | Solution |
+ |---------|----------|
+ | Space not deploying | Check the build and container logs on the HF Space page |
+ | API returns 500 | Verify that `requirements.txt` was uploaded |
+ | Timeout errors | PDF too large - try a smaller file |
+ | Not working in DataSync | Check the URL format (no trailing slash) |
+
+ ---
+
+ ## 📚 **Next Steps**
+
+ - Try different engines for comparison
+ - Map extracted columns to ERPNext
+ - Download CSV/JSON of extracted data
+
+ **Happy extracting!** 🚀
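
The URL pattern from Step 3 and the trailing-slash row in the troubleshooting table can be sketched as a few small helpers. This is only an illustration and not part of the commit: `space_url`, `normalize_base_url`, and `endpoint_url` are hypothetical names, and the `USERNAME-SPACE.hf.space` pattern is taken from the guide as written (Hugging Face may normalize unusual characters in names differently).

```python
def space_url(username: str, space_name: str) -> str:
    """Build a Space URL following the pattern shown in Step 3 (assumed, not verified)."""
    return f"https://{username}-{space_name}.hf.space"


def normalize_base_url(url: str) -> str:
    """Drop surrounding whitespace and any trailing slashes from a pasted URL."""
    return url.strip().rstrip("/")


def endpoint_url(base_url: str, path: str) -> str:
    """Join a base URL and an endpoint path with exactly one slash between them."""
    return f"{normalize_base_url(base_url)}/{path.lstrip('/')}"


if __name__ == "__main__":
    base = space_url("YOUR_USERNAME", "docling-api")
    print(endpoint_url(base + "/", "/convert"))
```

Normalizing the URL once, before any endpoint path is appended, avoids accidental double slashes when a pasted URL ends in `/`.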
README.md CHANGED
@@ -1,4 +1,126 @@
  ---
- title: Docling Processor
- sdk: docker
- ---
+ # 🚀 Hugging Face Document Extraction APIs
+
+ Complete deployment package for **Docling** and **DocStrange** on Hugging Face Spaces.
+
  ---
+
+ ## 📁 **Folder Structure**
+
+ ```
+ huggingface_deploy/
+ ├── README.md                  # This file
+ ├── docling-api/
+ │   ├── app.py                 # Docling FastAPI application
+ │   ├── requirements.txt       # Python dependencies
+ │   └── Dockerfile             # Docker configuration
+ ├── docstrange-api/
+ │   ├── app.py                 # DocStrange FastAPI application
+ │   ├── requirements.txt       # Python dependencies
+ │   └── Dockerfile             # Docker configuration
+ └── test-scripts/
+     ├── test_docling.py        # Test Docling API
+     └── test_docstrange.py     # Test DocStrange API
+ ```
+
+ ---
+
+ ## 🎯 **Quick Start**
+
+ ### **Step 1: Create Hugging Face Spaces**
+
+ 1. Go to **https://huggingface.co/spaces**
+ 2. Click **Create new Space**
+ 3. Create TWO spaces:
+    - `docling-api` (for Docling)
+    - `docstrange-api` (for DocStrange)
+
+ ### **Step 2: Upload Files**
+
+ For each space:
+ 1. Upload the corresponding `app.py`, `requirements.txt`, and `Dockerfile`
+ 2. Wait for deployment (2-5 minutes)
+ 3. Copy your Space URL: `https://YOUR_USERNAME-docling-api.hf.space`
+
+ ### **Step 3: Connect to DataSync**
+
+ In DataSync → Import Data → DocStrange tab:
+ 1. Select engine: `🔬 Docling Hugging Face` or `🧪 DocStrange Hugging Face`
+ 2. Paste your HF URL
+ 3. Click **Extract**
+
+ ---
+
+ ## 📊 **API Endpoints**
+
+ ### **Docling API**
+
+ | Endpoint | Method | Description |
+ |----------|--------|-------------|
+ | `/` | GET | Health check |
+ | `/health` | GET | Health status |
+ | `/convert` | POST | Full document conversion |
+ | `/convert/markdown` | POST | Markdown only |
+ | `/convert/tables` | POST | Tables only |
+
+ ### **DocStrange API**
+
+ | Endpoint | Method | Description |
+ |----------|--------|-------------|
+ | `/` | GET | Health check |
+ | `/health` | GET | Health status |
+ | `/extract` | POST | Full document extraction |
+ | `/extract/markdown` | POST | Markdown/text only |
+ | `/extract/tables` | POST | Tables only |
+
+ ---
+
+ ## 🧪 **Testing**
+
+ ```bash
+ # Test Docling API
+ python test-scripts/test_docling.py https://YOUR_USERNAME-docling-api.hf.space
+
+ # Test DocStrange API
+ python test-scripts/test_docstrange.py https://YOUR_USERNAME-docstrange-api.hf.space
+ ```
+
+ ---
+
+ ## 💰 **Hugging Face Tiers**
+
+ | Tier | Cost | Memory | Best For |
+ |------|------|--------|----------|
+ | **CPU Basic** | Free | 16GB | Testing, small PDFs |
+ | **CPU Upgrade** | Paid (hourly) | 32GB | Medium documents |
+ | **T4 GPU** | $0.60/hr | 16GB + 16GB VRAM | Large docs, fast extraction |
+
+ **Recommendation**: Start with the **free CPU tier** for testing, then upgrade to GPU for production.
+
+ ---
+
+ ## 🔐 **Private Spaces**
+
+ If you want private APIs:
+ 1. Go to Space Settings → **Make Private**
+ 2. Create a token: https://huggingface.co/settings/tokens
+ 3. In DataSync, enter both the URL and the token
+
+ ---
+
+ ## 📚 **Full Documentation**
+
+ - [Docling API Details](docling-api/README.md)
+ - [DocStrange API Details](docstrange-api/README.md)
+
+ ---
+
+ ## 🎯 **Integration with DataSync**
+
+ All APIs are fully integrated with:
+ - ✅ DataSync Import Data module
+ - ✅ Automatic fallback on failure
+ - ✅ Structured data display
+ - ✅ Column mapping to ERPNext
+ - ✅ CSV/JSON export
+
+ **Ready to deploy!** 🚀
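
For the Private Spaces steps above, a client has to send the token with every request. The sketch below shows one way to build the headers; it assumes the Space accepts a Hugging Face access token as a standard `Authorization: Bearer` header, and `build_headers` is a hypothetical helper name, not something defined in this commit.

```python
from typing import Dict, Optional


def build_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Return request headers; add a Bearer token only when one is supplied."""
    headers = {"Accept": "application/json"}
    if token:
        # Assumption: private Spaces are called with a Hugging Face access token.
        headers["Authorization"] = f"Bearer {token.strip()}"
    return headers


if __name__ == "__main__":
    # Public Space: no Authorization header is sent.
    print(build_headers())
    # Private Space: the placeholder stands in for a real token.
    print(build_headers("hf_xxx"))
```

The resulting dictionary can be passed as the `headers=` argument of an HTTP client call, for example alongside `files={"file": f}` when posting to `/convert` or `/extract`.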
docling-api/Dockerfile ADDED
@@ -0,0 +1,12 @@
+ FROM python:3.10-slim
+
+ WORKDIR /app
+
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ COPY app.py .
+
+ EXPOSE 7860
+
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
docling-api/README.md ADDED
@@ -0,0 +1,55 @@
+ # 🔬 Docling API Deployment Guide
+
+ ## 📦 **Files in This Folder**
+
+ - `app.py` - FastAPI application for document conversion
+ - `requirements.txt` - Python dependencies
+ - `Dockerfile` - Container configuration
+
+ ## 🚀 **Deploy to Hugging Face**
+
+ ### **Method 1: Via Web UI (Easiest)**
+
+ 1. Go to **https://huggingface.co/spaces**
+ 2. Click **Create new Space**
+ 3. **Name**: `docling-api`
+ 4. **SDK**: `Docker`
+ 5. **Visibility**: `Public` (free) or `Private` (needs token)
+ 6. Click **Create Space**
+ 7. Upload `app.py`, `requirements.txt`, and `Dockerfile`
+ 8. Wait 3-5 minutes for deployment
+
+ ### **Method 2: Via Git**
+
+ ```bash
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/docling-api
+ cd docling-api
+ cp /path/to/huggingface_deploy/docling-api/{app.py,requirements.txt,Dockerfile} .
+ git add .
+ git commit -m "Deploy Docling API"
+ git push
+ ```
+
+ ## 🧪 **Test Your Deployment**
+
+ ```bash
+ cd test-scripts
+ python test_docling.py https://YOUR_USERNAME-docling-api.hf.space
+ ```
+
+ ## 📡 **API Documentation**
+
+ Once deployed, visit: `https://YOUR_USERNAME-docling-api.hf.space/docs`
+
+ ## 🔧 **Endpoints**
+
+ - `GET /` - Health check
+ - `POST /convert` - Full document conversion
+ - `POST /convert/markdown` - Markdown only
+ - `POST /convert/tables` - Tables only
+
+ ## 💡 **Tips**
+
+ - Start with the **free CPU tier** for testing
+ - Upgrade to **T4 GPU** for production (faster, handles large PDFs)
+ - Keep PDFs under 10MB for best performance
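
The 10MB tip above can be checked on the client before a file is uploaded. The helper below is a minimal sketch of that idea: `within_size_limit` and `MAX_UPLOAD_BYTES` are hypothetical names, and the limit is the guide's rule of thumb (read here as 10 MiB), not something the API enforces.

```python
import os

# Guideline from the Tips list, interpreted as 10 MiB (an assumption).
MAX_UPLOAD_BYTES = 10 * 1024 * 1024


def within_size_limit(path: str, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True when the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit


if __name__ == "__main__":
    candidate = "test.pdf"
    if os.path.exists(candidate):
        print(candidate, "ok to upload:", within_size_limit(candidate))
```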
docling-api/app.py ADDED
@@ -0,0 +1,232 @@
+ """
+ Docling Hugging Face Spaces API
+ Deploy this on Hugging Face Spaces to provide Docling extraction API
+ """
+ import os
+ import tempfile
+ from pathlib import Path
+
+ from fastapi import FastAPI, File, UploadFile, HTTPException
+ from fastapi.responses import JSONResponse
+ from fastapi.middleware.cors import CORSMiddleware
+ from docling.document_converter import DocumentConverter
+ import torch  # expected to be installed as a Docling dependency; used by /health
+ import uvicorn
+
+ app = FastAPI(
+     title="Docling Document Converter API",
+     description="Convert documents using Docling AI",
+     version="1.0.0"
+ )
+
+ # Allow CORS for DataSync integration
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # Global converter instance (created lazily on first request)
+ converter = None
+
+
+ def get_converter():
+     """Get or create DocumentConverter instance"""
+     global converter
+     if converter is None:
+         converter = DocumentConverter()
+     return converter
+
+
+ @app.get("/")
+ def root():
+     """Health check"""
+     return {
+         "status": "ok",
+         "service": "Docling API",
+         "version": "1.0.0"
+     }
+
+
+ @app.get("/health")
+ def health():
+     """Health check; reports whether PyTorch can see a CUDA GPU"""
+     return {"status": "ok", "gpu": torch.cuda.is_available()}
+
+
+ @app.post("/convert")
+ async def convert_document(file: UploadFile = File(...)):
+     """
+     Convert document to structured data
+
+     Returns: JSON with markdown, tables, and metadata
+     """
+     if not file.filename:
+         raise HTTPException(status_code=400, detail="No file provided")
+
+     supported_extensions = ['.pdf', '.docx', '.xlsx', '.pptx', '.html', '.txt', '.md']
+     ext = Path(file.filename).suffix.lower()
+     if ext not in supported_extensions:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Unsupported format: {ext}. Supported: {supported_extensions}"
+         )
+
+     try:
+         # Save uploaded file temporarily
+         with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp:
+             content = await file.read()
+             tmp.write(content)
+             tmp_path = tmp.name
+
+         # Convert document
+         converter = get_converter()
+         result = converter.convert(tmp_path)
+
+         # Extract data
+         doc = result.document
+
+         # Get markdown
+         markdown_text = doc.export_to_markdown()
+
+         # Extract tables; a table that fails to export is reported, not dropped
+         tables_data = []
+         for table_idx, table in enumerate(doc.tables):
+             try:
+                 df = table.export_to_dataframe()
+                 table_dict = {
+                     "table_index": table_idx,
+                     "rows": df.to_dict('records'),
+                     "row_count": len(df)
+                 }
+                 tables_data.append(table_dict)
+             except Exception as e:
+                 tables_data.append({
+                     "table_index": table_idx,
+                     "error": str(e)
+                 })
+
+         # Build response
+         response = {
+             "success": True,
+             "file_name": file.filename,
+             "document": {
+                 "markdown": markdown_text,
+                 "text": doc.export_to_text() if hasattr(doc, 'export_to_text') else markdown_text,
+                 "num_pages": len(doc.pages) if hasattr(doc, 'pages') else 0,
+                 "tables": tables_data,
+                 "tables_count": len(tables_data)
+             },
+             "metadata": {
+                 "format": ext,
+                 "engine": "docling",
+                 "model": "docling-default"
+             }
+         }
+
+         # Cleanup
+         os.unlink(tmp_path)
+
+         return JSONResponse(content=response)
+
+     except Exception as e:
+         # Cleanup on error
+         if 'tmp_path' in locals():
+             try:
+                 os.unlink(tmp_path)
+             except OSError:
+                 pass
+
+         raise HTTPException(status_code=500, detail=f"Conversion failed: {str(e)}")
+
+
+ @app.post("/convert/markdown")
+ async def convert_to_markdown(file: UploadFile = File(...)):
+     """Convert document to markdown only (lightweight)"""
+     try:
+         with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename or "").suffix.lower()) as tmp:
+             content = await file.read()
+             tmp.write(content)
+             tmp_path = tmp.name
+
+         converter = get_converter()
+         result = converter.convert(tmp_path)
+
+         markdown = result.document.export_to_markdown()
+
+         os.unlink(tmp_path)
+
+         return {
+             "success": True,
+             "markdown": markdown,
+             "file_name": file.filename
+         }
+
+     except Exception as e:
+         if 'tmp_path' in locals():
+             try:
+                 os.unlink(tmp_path)
+             except OSError:
+                 pass
+         raise HTTPException(status_code=500, detail=str(e))
+
+
+ @app.post("/convert/tables")
+ async def convert_tables(file: UploadFile = File(...)):
+     """Extract tables only from document"""
+     try:
+         with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename or "").suffix.lower()) as tmp:
+             content = await file.read()
+             tmp.write(content)
+             tmp_path = tmp.name
+
+         converter = get_converter()
+         result = converter.convert(tmp_path)
+
+         tables_data = []
+         for table_idx, table in enumerate(result.document.tables):
+             try:
+                 df = table.export_to_dataframe()
+                 tables_data.append({
+                     "table_index": table_idx,
+                     "headers": list(df.columns),
+                     "rows": df.to_dict('records'),
+                     "row_count": len(df)
+                 })
+             except Exception:
+                 pass  # skip tables that cannot be exported
+
+         os.unlink(tmp_path)
+
+         return {
+             "success": True,
+             "tables": tables_data,
+             "tables_count": len(tables_data),
+             "file_name": file.filename
+         }
+
+     except Exception as e:
+         if 'tmp_path' in locals():
+             try:
+                 os.unlink(tmp_path)
+             except OSError:
+                 pass
+         raise HTTPException(status_code=500, detail=str(e))
+
+
+ if __name__ == "__main__":
+     print("="*60)
+     print("Docling Document Converter API")
+     print("="*60)
+     print("URL: http://localhost:8080")
+     print("Docs: http://localhost:8080/docs")
+     print("="*60)
+
+     uvicorn.run(
+         "app:app",
+         host="0.0.0.0",
+         port=8080,
+         reload=True
+     )
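
Each endpoint in `app.py` writes the upload to a temporary file and then deletes it on both the success path and the error path. One alternative, sketched below, is to put the cleanup in a context manager so it lives in a single `finally` block. `saved_upload` is a hypothetical helper, not part of the commit, and it uses only the standard library.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def saved_upload(content: bytes, suffix: str = ""):
    """Write `content` to a temporary file, yield its path, and always delete it."""
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        tmp.write(content)
        tmp.close()  # close first so other code can reopen the path on any OS
        yield tmp.name
    finally:
        tmp.close()  # harmless if the file is already closed
        try:
            os.unlink(tmp.name)
        except FileNotFoundError:
            pass  # already removed by the caller


if __name__ == "__main__":
    with saved_upload(b"%PDF-1.4 example bytes", suffix=".pdf") as path:
        print("temporary file:", path, os.path.getsize(path), "bytes")
    print("still exists after the block:", os.path.exists(path))
```

Inside an endpoint, the conversion call would sit in the body of the `with` block, so the file is removed whether the conversion returns or raises.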
docling-api/requirements.txt ADDED
@@ -0,0 +1,6 @@
+ docling>=2.88.0
+ fastapi>=0.100.0
+ uvicorn>=0.23.0
+ python-multipart>=0.0.6
+ pandas>=2.0.0
+ pillow>=10.0.0
docstrange-api/Dockerfile ADDED
@@ -0,0 +1,12 @@
+ FROM python:3.10-slim
+
+ WORKDIR /app
+
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ COPY app.py .
+
+ EXPOSE 7860
+
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
docstrange-api/README.md ADDED
@@ -0,0 +1,55 @@
+ # 🧪 DocStrange API Deployment Guide
+
+ ## 📦 **Files in This Folder**
+
+ - `app.py` - FastAPI application for DocStrange extraction
+ - `requirements.txt` - Python dependencies
+ - `Dockerfile` - Container configuration
+
+ ## 🚀 **Deploy to Hugging Face**
+
+ ### **Method 1: Via Web UI (Easiest)**
+
+ 1. Go to **https://huggingface.co/spaces**
+ 2. Click **Create new Space**
+ 3. **Name**: `docstrange-api`
+ 4. **SDK**: `Docker`
+ 5. **Visibility**: `Public` (free) or `Private` (needs token)
+ 6. Click **Create Space**
+ 7. Upload `app.py`, `requirements.txt`, and `Dockerfile`
+ 8. Wait 3-5 minutes for deployment
+
+ ### **Method 2: Via Git**
+
+ ```bash
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/docstrange-api
+ cd docstrange-api
+ cp /path/to/huggingface_deploy/docstrange-api/{app.py,requirements.txt,Dockerfile} .
+ git add .
+ git commit -m "Deploy DocStrange API"
+ git push
+ ```
+
+ ## 🧪 **Test Your Deployment**
+
+ ```bash
+ cd test-scripts
+ python test_docstrange.py https://YOUR_USERNAME-docstrange-api.hf.space
+ ```
+
+ ## 📡 **API Documentation**
+
+ Once deployed, visit: `https://YOUR_USERNAME-docstrange-api.hf.space/docs`
+
+ ## 🔧 **Endpoints**
+
+ - `GET /` - Health check
+ - `POST /extract` - Full document extraction
+ - `POST /extract/markdown` - Markdown/text only
+ - `POST /extract/tables` - Tables only
+
+ ## 💡 **Tips**
+
+ - DocStrange runs in GPU mode when a CUDA GPU is available
+ - The API detects the GPU automatically (see `GET /health`)
+ - Accepts PDF, DOCX, XLSX, PPTX, common image formats, and text formats (see `supported_formats` in `app.py`)
docstrange-api/app.py ADDED
@@ -0,0 +1,237 @@
+ """
+ DocStrange Hugging Face Spaces API
+ Deploy this on Hugging Face Spaces to provide DocStrange extraction API
+ """
+ import os
+ import sys
+ import tempfile
+ from pathlib import Path
+
+ from fastapi import FastAPI, File, UploadFile, HTTPException
+ from fastapi.responses import JSONResponse
+ from fastapi.middleware.cors import CORSMiddleware
+ import uvicorn
+
+ # Add docstrange to path
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'docstrange'))
+
+ try:
+     from docstrange import DocumentExtractor
+     HAS_DOCSTRANGE = True
+ except ImportError:
+     HAS_DOCSTRANGE = False
+
+ app = FastAPI(
+     title="DocStrange Document Extractor API",
+     description="Extract structured data from documents using DocStrange AI",
+     version="1.0.0"
+ )
+
+ # Allow CORS for DataSync integration
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # Global extractor instance (created lazily on first request)
+ extractor = None
+
+
+ def get_extractor():
+     """Get or create DocumentExtractor instance"""
+     global extractor
+     if extractor is None:
+         if not HAS_DOCSTRANGE:
+             raise HTTPException(status_code=500, detail="DocStrange not installed")
+
+         # Use GPU if available, otherwise cloud mode
+         try:
+             import torch
+             gpu_mode = torch.cuda.is_available()
+         except ImportError:
+             gpu_mode = False
+
+         if gpu_mode:
+             extractor = DocumentExtractor(gpu=True)
+         else:
+             extractor = DocumentExtractor()
+
+     return extractor
+
+
+ @app.get("/")
+ def root():
+     """Health check"""
+     return {
+         "status": "ok",
+         "service": "DocStrange API",
+         "version": "1.0.0",
+         "docstrange": HAS_DOCSTRANGE
+     }
+
+
+ @app.get("/health")
+ def health():
+     """Health check with GPU and VRAM details"""
+     try:
+         import torch
+         gpu = torch.cuda.is_available()
+         vram = f"{torch.cuda.get_device_properties(0).total_memory/1024**3:.1f}GB" if gpu else "N/A"
+     except Exception:
+         gpu = False
+         vram = "N/A"
+
+     return {
+         "status": "ok",
+         "gpu": gpu,
+         "vram": vram,
+         "docstrange": HAS_DOCSTRANGE
+     }
+
+
+ @app.post("/extract")
+ async def extract_document(
+     file: UploadFile = File(...),
+     output_format: str = "markdown"
+ ):
+     """
+     Extract structured data from document
+
+     Args:
+         file: Document file (PDF, DOCX, XLSX, Images, etc.)
+         output_format: markdown, json, csv, html, text, flat-json, all
+
+     Returns: JSON with extracted data
+     """
+     if not file.filename:
+         raise HTTPException(status_code=400, detail="No file provided")
+
+     supported_formats = ['.pdf', '.docx', '.xlsx', '.pptx', '.png', '.jpg', '.jpeg',
+                          '.bmp', '.tiff', '.webp', '.gif', '.txt', '.html', '.md', '.csv']
+     ext = Path(file.filename).suffix.lower()
+     if ext not in supported_formats:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Unsupported format: {ext}. Supported: {supported_formats}"
+         )
+
+     try:
+         # Save uploaded file temporarily
+         with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp:
+             content = await file.read()
+             tmp.write(content)
+             tmp_path = tmp.name
+
+         # Extract document (separate name so the file extension `ext` is not shadowed)
+         doc_extractor = get_extractor()
+         result = doc_extractor.extract_document(tmp_path, output_format=output_format)
+
+         # Build response
+         response = {
+             "success": True,
+             "file_name": file.filename,
+             "data": result.get('data', {}),
+             "format": result.get('format', output_format),
+             "metadata": {
+                 "file_size": result.get('metadata', {}).get('file_size', 0),
+                 "engine": "docstrange",
+                 "gpu_mode": result.get('metadata', {}).get('gpu_mode', False)
+             }
+         }
+
+         # Cleanup
+         os.unlink(tmp_path)
+
+         return JSONResponse(content=response)
+
+     except Exception as e:
+         # Cleanup on error
+         if 'tmp_path' in locals():
+             try:
+                 os.unlink(tmp_path)
+             except OSError:
+                 pass
+
+         raise HTTPException(status_code=500, detail=f"Extraction failed: {str(e)}")
+
+
+ @app.post("/extract/markdown")
+ async def extract_to_markdown(file: UploadFile = File(...)):
+     """Extract document to markdown only (lightweight)"""
+     try:
+         with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename or "").suffix.lower()) as tmp:
+             content = await file.read()
+             tmp.write(content)
+             tmp_path = tmp.name
+
+         doc_extractor = get_extractor()
+         result = doc_extractor.extract_document(tmp_path, output_format='markdown')
+
+         os.unlink(tmp_path)
+
+         return {
+             "success": True,
+             "markdown": result.get('data', ''),
+             "file_name": file.filename
+         }
+
+     except Exception as e:
+         if 'tmp_path' in locals():
+             try:
+                 os.unlink(tmp_path)
+             except OSError:
+                 pass
+         raise HTTPException(status_code=500, detail=str(e))
+
+
+ @app.post("/extract/tables")
+ async def extract_tables(file: UploadFile = File(...)):
+     """Extract tables only from document"""
+     try:
+         with tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename or "").suffix.lower()) as tmp:
+             content = await file.read()
+             tmp.write(content)
+             tmp_path = tmp.name
+
+         # Extract with JSON format to get structured tables
+         doc_extractor = get_extractor()
+         result = doc_extractor.extract_document(tmp_path, output_format='json')
+
+         data = result.get('data', {})
+         tables = data.get('tables', [])
+
+         os.unlink(tmp_path)
+
+         return {
+             "success": True,
+             "tables": tables,
+             "tables_count": len(tables),
+             "file_name": file.filename
+         }
+
+     except Exception as e:
+         if 'tmp_path' in locals():
+             try:
+                 os.unlink(tmp_path)
+             except OSError:
+                 pass
+         raise HTTPException(status_code=500, detail=str(e))
+
+
+ if __name__ == "__main__":
+     print("="*60)
+     print("DocStrange Document Extractor API")
+     print("="*60)
+     print("URL: http://localhost:8080")
+     print("Docs: http://localhost:8080/docs")
+     print("="*60)
+
+     uvicorn.run(
+         "app:app",
+         host="0.0.0.0",
+         port=8080,
+         reload=True
+     )
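
The GPU check used by `get_extractor()` and `/health` can be isolated into one function, which makes it easier to reason about on CPU-only Spaces where PyTorch may or may not be installed. The sketch below is an illustration only: `detect_gpu` is a hypothetical helper, and it relies on `torch.cuda.is_available()` and the `total_memory` attribute of `torch.cuda.get_device_properties(0)`.

```python
def detect_gpu() -> dict:
    """Report CUDA availability and total VRAM, degrading gracefully without PyTorch."""
    try:
        import torch  # optional dependency; absent on minimal CPU images
    except ImportError:
        return {"gpu": False, "vram": "N/A"}

    if not torch.cuda.is_available():
        return {"gpu": False, "vram": "N/A"}

    # total_memory is reported in bytes; convert to GiB for display.
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return {"gpu": True, "vram": f"{total_bytes / 1024**3:.1f}GB"}


if __name__ == "__main__":
    print(detect_gpu())
```

Catching only `ImportError` around the import keeps genuine CUDA errors visible instead of silently reporting that no GPU is present.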
docstrange-api/requirements.txt ADDED
@@ -0,0 +1,9 @@
+ # DocStrange API requirements
+ # Note: This assumes docstrange is installed in the space
+ # Add your specific dependencies here
+
+ fastapi>=0.100.0
+ uvicorn>=0.23.0
+ python-multipart>=0.0.6
+ pillow>=10.0.0
+ torch>=2.0.0
test-scripts/test_docling.py ADDED
@@ -0,0 +1,74 @@
+ """
+ Test Docling Hugging Face API
+ Usage: python test_docling.py <HF_API_URL>
+ """
+ import os
+ import sys
+ import requests
+
+ if len(sys.argv) < 2:
+     print("Usage: python test_docling.py <HF_API_URL>")
+     print("Example: python test_docling.py https://YOUR_USERNAME-docling-api.hf.space")
+     sys.exit(1)
+
+ HF_URL = sys.argv[1].rstrip('/')
+
+ print(f"\n{'='*60}")
+ print(f"Testing Docling API: {HF_URL}")
+ print(f"{'='*60}\n")
+
+ # Test 1: Health check
+ print("1. Testing health check...")
+ try:
+     resp = requests.get(f"{HF_URL}/", timeout=30)
+     resp.raise_for_status()  # a non-2xx status counts as a failed health check
+     print(f" Status: {resp.status_code}")
+     print(f" Response: {resp.json()}")
+     print(" ✅ Health check passed!\n")
+ except Exception as e:
+     print(f" ❌ Failed: {e}\n")
+     sys.exit(1)
+
+ # Precondition for Test 2: a local test PDF must exist
+ test_pdf = "test.pdf"
+ if not os.path.exists(test_pdf):
+     print("⚠️ No test.pdf found. Please add a test PDF to this directory.")
+     print(f" Or try the API interactively at: {HF_URL}/docs")
+     sys.exit(0)
+
+ # Test 2: Full conversion
+ print(f"2. Testing full document conversion with {test_pdf}...")
+ try:
+     with open(test_pdf, 'rb') as f:
+         resp = requests.post(
+             f"{HF_URL}/convert",
+             files={"file": f},
+             timeout=120
+         )
+
+     print(f" Status: {resp.status_code}")
+
+     if resp.status_code == 200:
+         data = resp.json()
+         print(" ✅ Success!")
+         print(f" File: {data.get('file_name')}")
+         print(f" Tables: {data.get('document', {}).get('tables_count', 0)}")
+         print(f" Pages: {data.get('document', {}).get('num_pages', 0)}")
+
+         # Show the first three rows of the first table
+         tables = data.get('document', {}).get('tables', [])
+         if tables:
+             print("\n First table preview:")
+             for table in tables[:1]:
+                 rows = table.get('rows', [])[:3]
+                 for row in rows:
+                     print(f" {row}")
+     else:
+         print(f" ❌ Failed: {resp.text}\n")
+
+ except Exception as e:
+     print(f" ❌ Failed: {e}\n")
+
+ print(f"\n{'='*60}")
+ print("Test complete!")
+ print(f"{'='*60}\n")
test-scripts/test_docstrange.py ADDED
@@ -0,0 +1,73 @@
+ """
+ Test DocStrange Hugging Face API
+ Usage: python test_docstrange.py <HF_API_URL>
+ """
+ import sys
+ import requests
+ import json
+ import os
+
+ if len(sys.argv) < 2:
+     print("Usage: python test_docstrange.py <HF_API_URL>")
+     print("Example: python test_docstrange.py https://YOUR_USERNAME-docstrange-api.hf.space")
+     sys.exit(1)
+
+ HF_URL = sys.argv[1].rstrip('/')
+
+ print(f"\n{'='*60}")
+ print(f"Testing DocStrange API: {HF_URL}")
+ print(f"{'='*60}\n")
+
+ # Test 1: Health check
+ print("1. Testing health check...")
+ try:
+     resp = requests.get(f"{HF_URL}/", timeout=30)
+     resp.raise_for_status()  # a non-2xx status counts as a failed health check
+     print(f" Status: {resp.status_code}, response: {resp.json()}")
+     print(" ✅ Health check passed!\n")
+ except Exception as e:
+     print(f" ❌ Failed: {e}\n")
+     sys.exit(1)
+
+ # Precondition for Test 2: a local test PDF must exist
+ test_pdf = "test.pdf"
+ if not os.path.exists(test_pdf):
+     print("⚠️ No test.pdf found. Please add a test PDF to this directory.")
+     print(f" Or check API docs at: {HF_URL}/docs")
+     sys.exit(0)
+
+ # Test 2: Full extraction
+ print(f"2. Testing document extraction with {test_pdf}...")
+ try:
+     with open(test_pdf, 'rb') as f:
+         resp = requests.post(
+             f"{HF_URL}/extract",
+             files={"file": f},
+             timeout=120
+         )
+
+     print(f" Status: {resp.status_code}")
+
+     if resp.status_code == 200:
+         data = resp.json()
+         print(" ✅ Success!")
+         print(f" File: {data.get('file_name')}")
+         print(f" Format: {data.get('format')}")
+         print(f" Metadata: {json.dumps(data.get('metadata', {}), indent=2)}")
+
+         # Preview data
+         doc_data = data.get('data', {})
+         if isinstance(doc_data, str):
+             print("\n Preview (first 200 chars):")
+             print(f" {doc_data[:200]}...")
+         elif isinstance(doc_data, dict):
+             print(f"\n Data keys: {list(doc_data.keys())}")
+     else:
+         print(f" ❌ Failed: {resp.text}\n")
+
+ except Exception as e:
+     print(f" ❌ Failed: {e}\n")
+
+ print(f"\n{'='*60}")
+ print("Test complete!")
+ print(f"{'='*60}\n")
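
Both test scripts run the health check once and exit on failure, but a freshly created Space can take a few minutes to build. A retry loop around the health check is one way to handle that. The sketch below is a hypothetical helper (`wait_until_healthy` is not part of the commit); it takes the probe as a callable so the retry logic can be exercised without any network access.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times; return True as soon as it succeeds."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            pass  # treat connection errors like an unhealthy response
        if attempt < attempts - 1:
            sleep(delay)  # no sleep after the final attempt
    return False


if __name__ == "__main__":
    # Stand-in probe that succeeds on the third call.
    answers = iter([False, False, True])
    print(wait_until_healthy(lambda: next(answers), delay=0.1))
```

In the test scripts, the probe could be something like `lambda: requests.get(f"{HF_URL}/health", timeout=10).ok`, with `attempts` and `delay` chosen to cover the expected build time.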