GraziePrego committed · Commit 2d58347 · verified · 1 Parent(s): 8203ad9

Add comprehensive README for HF Space

Files changed (1): README.md ADDED (+178 −0)
---
title: Scrapling - Web Scraping API
emoji: 🕷️
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: mit
---

# Scrapling - Advanced Web Scraping API

A powerful web scraping API with AI-powered content extraction, session management, and multiple scraping modes (HTTP, JavaScript rendering, and stealthy browser automation).

## Features

- 🚀 **REST API** - FastAPI-based endpoints for programmatic access
- 🤖 **AI-Powered Extraction** - Natural language queries for content extraction
- 🔄 **Session Management** - Persistent sessions for efficient batch processing
- 🌐 **Multiple Scraping Modes**:
  - Standard HTTP (fast, low protection)
  - Dynamic fetching (JavaScript support)
  - Stealthy browser (anti-bot bypass)
- 📊 **Structured Output** - Returns data in JSON, Markdown, HTML, or Text formats
- 🎨 **Gradio UI** - Interactive web interface for testing

## API Endpoints

### Base URL

```
https://grazieprego-scrapling.hf.space
```

### Quick Reference

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Check API status |
| `/api/scrape` | POST | Stateless scrape request |
| `/api/session` | POST | Create persistent session |
| `/api/session/{id}/scrape` | POST | Scrape using session |
| `/api/session/{id}` | DELETE | Close session |
| `/docs` | GET | API documentation (HTML) |
| `/api-docs` | GET | API documentation (JSON) |

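For programmatic access, the endpoints in the table above can be wrapped in a thin client. This is a hypothetical sketch (the `ScraplingClient` name and structure are assumptions, not part of the Space), shown for the stateless endpoints only:

```python
import requests

BASE_URL = "https://grazieprego-scrapling.hf.space"


class ScraplingClient:
    """Minimal convenience wrapper around the REST endpoints above (illustrative)."""

    def __init__(self, base_url: str = BASE_URL):
        self.base_url = base_url.rstrip("/")

    def endpoint(self, path: str) -> str:
        # Compose a full URL for any path in the quick-reference table.
        return f"{self.base_url}/{path.lstrip('/')}"

    def health(self) -> dict:
        return requests.get(self.endpoint("/health")).json()

    def scrape(self, url: str, query: str, model_name: str = "alias-fast") -> dict:
        payload = {"url": url, "query": query, "model_name": model_name}
        return requests.post(self.endpoint("/api/scrape"), json=payload).json()
```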
## Usage Examples

### 1. Stateless Scrape (One-off requests)

```bash
curl -X POST https://grazieprego-scrapling.hf.space/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "query": "Extract all product prices",
    "model_name": "alias-fast"
  }'
```

### 2. Session-Based Scraping (Multiple requests)

```python
import requests

# Create session
session = requests.post(
    'https://grazieprego-scrapling.hf.space/api/session',
    json={'model_name': 'alias-fast'}
)
session_id = session.json()['session_id']

try:
    # Multiple scrapes using the same session
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ]

    for url in urls:
        result = requests.post(
            f'https://grazieprego-scrapling.hf.space/api/session/{session_id}/scrape',
            json={'url': url, 'query': 'Extract product data'}
        )
        print(f"Scraped {url}: {result.json()}")
finally:
    # Always close the session
    requests.delete(f'https://grazieprego-scrapling.hf.space/api/session/{session_id}')
```
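When a session must always be released even if a scrape raises, the try/finally pattern above can be folded into a context manager. A sketch under the same assumptions; `scrapling_session` is a hypothetical helper, not part of the API:

```python
import contextlib

import requests

BASE = "https://grazieprego-scrapling.hf.space"


@contextlib.contextmanager
def scrapling_session(model_name: str = "alias-fast", base_url: str = BASE):
    """Yield a session id and guarantee the DELETE call on exit (hypothetical helper)."""
    resp = requests.post(f"{base_url}/api/session", json={"model_name": model_name})
    session_id = resp.json()["session_id"]
    try:
        yield session_id
    finally:
        # Runs on normal exit and on exceptions alike.
        requests.delete(f"{base_url}/api/session/{session_id}")
```

Usage mirrors the loop above: `with scrapling_session() as session_id:` and the cleanup happens automatically.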

### 3. Using the Gradio UI

Visit the Space URL and use the interactive interface:
- **Fetch (HTTP)** tab: for standard HTTP scraping
- **Stealthy Fetch (Browser)** tab: for sites with bot protection

## API Documentation

- **HTML Docs**: https://grazieprego-scrapling.hf.space/docs
- **JSON Docs**: https://grazieprego-scrapling.hf.space/api-docs

## Request Parameters

### `/api/scrape` & `/api/session/{id}/scrape`

```json
{
  "url": "https://example.com",
  "query": "Extract all headings and prices",
  "model_name": "alias-fast"
}
```

**Parameters:**
- `url` (string, required): The URL to scrape
- `query` (string, required): Natural language extraction instruction
- `model_name` (string, optional): AI model to use (default: "alias-fast")

### `/api/session`

```json
{
  "model_name": "alias-fast"
}
```

## Response Format

```json
{
  "url": "https://example.com",
  "query": "Extract prices",
  "response": {
    "status": 200,
    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
    "url": "https://example.com"
  }
}
```
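Client code typically only needs the `content` list. A hedged sketch of pulling it out of a body shaped like the example above; `extract_content` is a hypothetical helper:

```python
import json

# Sample body copied from the documented response format.
sample = json.loads("""
{
  "url": "https://example.com",
  "query": "Extract prices",
  "response": {
    "status": 200,
    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
    "url": "https://example.com"
  }
}
""")


def extract_content(result: dict) -> list[str]:
    """Return the content list, failing loudly on a non-200 nested status."""
    response = result.get("response", {})
    if response.get("status") != 200:
        raise RuntimeError(f"scrape failed with status {response.get('status')}")
    return response.get("content", [])
```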

## Best Practices

1. **Use stateless endpoints** for one-off requests
2. **Use sessions** for batch processing multiple URLs
3. **Always close sessions** when finished to free resources
4. **Implement error handling** - 500 errors may occur on complex sites
5. **Add retry logic** for production use
6. **Respect rate limits** - use responsibly

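Points 4 and 5 can be combined in a small retry wrapper with exponential backoff. A minimal sketch, not part of the API; tune `attempts` and `base_delay` to your workload:

```python
import time


def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            time.sleep(base_delay * (2 ** attempt))
```

For example, `with_retries(lambda: requests.post(...))` retries a flaky scrape up to three times.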
## Error Handling

- **404**: Session not found
- **500**: Internal server error (check the `detail` field for specifics)
- **Common issues**:
  - URL unreachable or timeout
  - JavaScript-heavy sites may need `stealthy_fetch`
  - Bot protection may block requests

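One way to act on these codes is a simple triage function. Purely illustrative; the action names and the mapping itself are assumptions, not server behaviour:

```python
def classify_error(status: int) -> str:
    """Rough triage for the status codes listed above (hypothetical helper)."""
    if 200 <= status < 300:
        return "ok"
    if status == 404:
        # Session expired or never existed: create a fresh one.
        return "recreate-session"
    if status == 500:
        # Transient failure, or the site needs the stealthy fetch mode.
        return "retry-or-stealthy"
    return "inspect"
```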
## Deployment

This Space uses Docker with:
- Python 3.11
- FastAPI + Uvicorn
- Gradio 5.x
- Playwright for browser automation
- Scrapling for advanced scraping

## License

MIT License - see the LICENSE file for details

## Credits

Built with [Scrapling](https://github.com/D4Vinci/Scrapling), an advanced web scraping library.

---

**Note**: This is a demonstration Space. For production use, consider self-hosting with appropriate rate limiting and authentication.