GraziePrego/scrapling

GraziePrego committed 14 days ago

Commit 2d58347 (verified) · 1 parent: 8203ad9

Add comprehensive README for HF Space

README.md +178 -0

README.md ADDED

	@@ -0,0 +1,178 @@

+---
+title: Scrapling - Web Scraping API
+emoji: 🕷️
+colorFrom: purple
+colorTo: blue
+sdk: docker
+pinned: false
+license: mit
+---
+# Scrapling - Advanced Web Scraping API
+A powerful web scraping API with AI-powered content extraction, session management, and multiple scraping modes (HTTP, JavaScript rendering, and stealthy browser automation).
+## Features
+- 🚀 **REST API** - FastAPI-based endpoints for programmatic access
+- 🤖 **AI-Powered Extraction** - Natural language queries for content extraction
+- 🔐 **Session Management** - Persistent sessions for efficient batch processing
+- 🌐 **Multiple Scraping Modes**:
+  - Standard HTTP (fast; for sites with little or no bot protection)
+  - Dynamic fetching (JavaScript support)
+  - Stealthy browser (anti-bot bypass)
+- 📊 **Structured Output** - Returns data in JSON, Markdown, HTML, or Text formats
+- 🎨 **Gradio UI** - Interactive web interface for testing
+## API Endpoints
+### Base URL
+```
+https://grazieprego-scrapling.hf.space
+```
+### Quick Reference
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/health` | GET | Check API status |
+| `/api/scrape` | POST | Stateless scrape request |
+| `/api/session` | POST | Create persistent session |
+| `/api/session/{id}/scrape` | POST | Scrape using session |
+| `/api/session/{id}` | DELETE | Close session |
+| `/docs` | GET | API documentation (HTML) |
+| `/api-docs` | GET | API documentation (JSON) |
+## Usage Examples
+### 1. Stateless Scrape (One-off requests)
+```bash
+curl -X POST https://grazieprego-scrapling.hf.space/api/scrape \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://example.com",
+    "query": "Extract all product prices",
+    "model_name": "alias-fast"
+  }'
+```
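The same stateless call can be made from Python. The sketch below is a minimal, standard-library-only equivalent of the curl command above. The helper names `build_scrape_request` and `scrape_once` are hypothetical, and the 60-second timeout is an assumption rather than a documented limit.

```python
import json
import urllib.request

BASE_URL = "https://grazieprego-scrapling.hf.space"


def build_scrape_request(url, query, model_name="alias-fast"):
    """Build (but do not send) a POST request for /api/scrape."""
    payload = {"url": url, "query": query, "model_name": model_name}
    return urllib.request.Request(
        f"{BASE_URL}/api/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape_once(url, query, model_name="alias-fast", timeout=60):
    """Send the request and return the decoded JSON body."""
    request = build_scrape_request(url, query, model_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `scrape_once("https://example.com", "Extract all product prices")` performs the actual network request; `build_scrape_request` alone is enough to inspect the URL, method, and body before sending anything.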
+### 2. Session-Based Scraping (Multiple requests)
+```python
+import requests
+
+# Create a session; fail early if the API returns an error status
+session = requests.post(
+    'https://grazieprego-scrapling.hf.space/api/session',
+    json={'model_name': 'alias-fast'}
+)
+session.raise_for_status()
+session_id = session.json()['session_id']
+try:
+    # Multiple scrapes using the same session
+    urls = [
+        'https://example.com/page1',
+        'https://example.com/page2',
+        'https://example.com/page3'
+    ]
+    for url in urls:
+        result = requests.post(
+            f'https://grazieprego-scrapling.hf.space/api/session/{session_id}/scrape',
+            json={'url': url, 'query': 'Extract product data'}
+        )
+        print(f"Scraped {url}: {result.json()}")
+finally:
+    # Always close the session
+    requests.delete(f'https://grazieprego-scrapling.hf.space/api/session/{session_id}')
+```
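The create, scrape, and close steps above can also be wrapped in a context manager so that the session is closed even when a scrape raises. This is a sketch, not part of the Space's API: `scraping_session` is a hypothetical helper, and the `http` argument is assumed to be any object with requests-style `post()` and `delete()` methods, such as the `requests` module itself.

```python
from contextlib import contextmanager

BASE_URL = "https://grazieprego-scrapling.hf.space"


@contextmanager
def scraping_session(http, model_name="alias-fast"):
    """Create an API session, yield a scrape function, and always close it.

    `http` is any object exposing requests-style post() and delete(),
    for example the `requests` module.
    """
    created = http.post(f"{BASE_URL}/api/session", json={"model_name": model_name})
    created.raise_for_status()
    session_id = created.json()["session_id"]

    def scrape(url, query):
        result = http.post(
            f"{BASE_URL}/api/session/{session_id}/scrape",
            json={"url": url, "query": query},
        )
        result.raise_for_status()
        return result.json()

    try:
        yield scrape
    finally:
        # Runs even if a scrape raised, so the server-side session is released
        http.delete(f"{BASE_URL}/api/session/{session_id}")
```

With `requests` installed, `with scraping_session(requests) as scrape:` yields a `scrape(url, query)` callable for the duration of the block. Passing the HTTP client in as an argument also makes the helper easy to exercise with a stub instead of the live Space.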
+### 3. Using the Gradio UI
+Open the Space URL in a browser and use the interactive interface:
+- **Fetch (HTTP)** tab: For standard HTTP scraping
+- **Stealthy Fetch (Browser)** tab: For sites with bot protection
+## API Documentation
+- **HTML Docs**: https://grazieprego-scrapling.hf.space/docs
+- **JSON Docs**: https://grazieprego-scrapling.hf.space/api-docs
+## Request Parameters
+### `/api/scrape` & `/api/session/{id}/scrape`
+```json
+{
+  "url": "https://example.com",
+  "query": "Extract all headings and prices",
+  "model_name": "alias-fast"
+}
+```
+**Parameters:**
+- `url` (string, required): The URL to scrape
+- `query` (string, required): Natural language extraction instruction
+- `model_name` (string, optional): AI model to use (default: "alias-fast")
+### `/api/session`
+```json
+{
+  "model_name": "alias-fast"
+}
+```
+## Response Format
+```json
+{
+  "url": "https://example.com",
+  "query": "Extract prices",
+  "response": {
+    "status": 200,
+    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
+    "url": "https://example.com"
+  }
+}
+```
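A short sketch of reading that response in Python follows. It assumes the body always has the shape shown above, and that a nested `status` other than 200 means the scrape failed; neither is confirmed by the README. `extract_content` is a hypothetical helper name.

```python
import json

SAMPLE = """
{
  "url": "https://example.com",
  "query": "Extract prices",
  "response": {
    "status": 200,
    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
    "url": "https://example.com"
  }
}
"""


def extract_content(body):
    """Return the list of extracted strings, or raise if the scrape failed."""
    inner = body.get("response", {})
    status = inner.get("status")
    if status != 200:  # assumption: non-200 nested status signals failure
        raise ValueError(f"scrape of {body.get('url')} returned status {status}")
    return inner.get("content", [])


items = extract_content(json.loads(SAMPLE))
for item in items:
    # Drop the Markdown heading marker before printing
    print(item.removeprefix("# "))
```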
+## Best Practices
+1. **Use stateless endpoints** for one-off requests
+2. **Use sessions** for batch processing multiple URLs
+3. **Always close sessions** when finished to free resources
+4. **Implement error handling** - 500 errors may occur on complex sites
+5. **Add retry logic** for production use
+6. **Respect rate limits** - use responsibly
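The retry advice above could look like the following sketch, which uses exponential backoff with a little jitter. `with_retries` is a hypothetical helper, and the attempt count and delays are illustrative defaults, not limits documented by the Space. It retries on any exception for brevity; production code would likely narrow this to timeouts and 5xx responses.

```python
import random
import time


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `call()` until it succeeds or `attempts` tries are used up.

    Waits base_delay * 2**n seconds (plus up to 10% jitter) between tries.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

Any zero-argument callable can be wrapped, for example a `lambda` that posts to `/api/scrape` and calls `raise_for_status()` on the result. The `sleep` parameter exists so the waiting can be replaced in tests.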
+## Error Handling
+- **404**: Session not found
+- **500**: Internal server error (check `detail` field for specifics)
+- **Common issues**:
+  - URL unreachable or timeout
+  - JavaScript-heavy sites may need the stealthy browser mode (`stealthy_fetch`)
+  - Bot protection may block requests
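One way to act on these status codes is a small helper that turns an error response into a readable message. This is a sketch: `describe_error` is a hypothetical name, and it assumes that 404 responses carry a `detail` field in the same way the README describes for 500 responses.

```python
def describe_error(status_code, body):
    """Turn an API error response into a short, actionable message."""
    if isinstance(body, dict):
        detail = body.get("detail", "no detail provided")
    else:
        detail = str(body)
    if status_code == 404:
        return f"Session not found (create a new session): {detail}"
    if status_code == 500:
        return f"Server error while scraping: {detail}"
    return f"Unexpected status {status_code}: {detail}"
```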
+## Deployment
+This space uses Docker with:
+- Python 3.11
+- FastAPI + Uvicorn
+- Gradio 5.x
+- Playwright for browser automation
+- Scrapling for advanced scraping
+## License
+MIT License - See LICENSE file for details
+## Credits
+Built with [Scrapling](https://github.com/D4Vinci/Scrapling) - Advanced web scraping library
+---
+**Note**: This is a demonstration space. For production use, consider self-hosting with appropriate rate limiting and authentication.