---
title: Scrapling - Web Scraping API
emoji: 🕷️
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: mit
---

# Scrapling - Advanced Web Scraping API

A powerful web scraping API with AI-powered content extraction, session management, and multiple scraping modes (HTTP, JavaScript rendering, and stealthy browser automation).

## Features

- 🚀 **REST API** - FastAPI-based endpoints for programmatic access
- 🤖 **AI-Powered Extraction** - Natural language queries for content extraction
- 🔐 **Session Management** - Persistent sessions for efficient batch processing
- 🌐 **Multiple Scraping Modes**:
  - Standard HTTP (fast, low protection)
  - Dynamic fetching (JavaScript support)
  - Stealthy browser (anti-bot bypass)
- 📊 **Structured Output** - Returns data in JSON, Markdown, HTML, or Text formats
- 🎨 **Gradio UI** - Interactive web interface for testing

## API Endpoints

### Base URL

```
https://grazieprego-scrapling.hf.space
```

### Quick Reference

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Check API status |
| `/api/scrape` | POST | Stateless scrape request |
| `/api/session` | POST | Create persistent session |
| `/api/session/{id}/scrape` | POST | Scrape using session |
| `/api/session/{id}` | DELETE | Close session |
| `/docs` | GET | API documentation (HTML) |
| `/api-docs` | GET | API documentation (JSON) |

## Usage Examples

### 1. Stateless Scrape (One-off requests)

```bash
curl -X POST https://grazieprego-scrapling.hf.space/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "query": "Extract all product prices",
    "model_name": "alias-fast"
  }'
```
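The same stateless call can also be made from Python. A minimal sketch using `requests` — the `build_payload` and `scrape_once` helper names are ours, not part of the API:

```python
import requests

BASE = "https://grazieprego-scrapling.hf.space"

def build_payload(url, query, model_name="alias-fast"):
    """Assemble the JSON body expected by POST /api/scrape."""
    return {"url": url, "query": query, "model_name": model_name}

def scrape_once(url, query, model_name="alias-fast"):
    """One-off scrape via the stateless endpoint; raises on HTTP errors."""
    resp = requests.post(
        f"{BASE}/api/scrape",
        json=build_payload(url, query, model_name),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

# Example (network call, so not executed here):
# data = scrape_once("https://example.com", "Extract all product prices")
```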
### 2. Session-Based Scraping (Multiple requests)

```python
import requests

# Create session
session = requests.post(
    'https://grazieprego-scrapling.hf.space/api/session',
    json={'model_name': 'alias-fast'}
)
session_id = session.json()['session_id']

try:
    # Multiple scrapes using the same session
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ]
    for url in urls:
        result = requests.post(
            f'https://grazieprego-scrapling.hf.space/api/session/{session_id}/scrape',
            json={'url': url, 'query': 'Extract product data'}
        )
        print(f"Scraped {url}: {result.json()}")
finally:
    # Always close the session
    requests.delete(f'https://grazieprego-scrapling.hf.space/api/session/{session_id}')
```

### 3. Using the Gradio UI

Visit the space URL and use the interactive interface:

- **Fetch (HTTP)** tab: For standard HTTP scraping
- **Stealthy Fetch (Browser)** tab: For sites with bot protection

## API Documentation

- **HTML Docs**: https://grazieprego-scrapling.hf.space/docs
- **JSON Docs**: https://grazieprego-scrapling.hf.space/api-docs

## Request Parameters

### `/api/scrape` & `/api/session/{id}/scrape`

```json
{
  "url": "https://example.com",
  "query": "Extract all headings and prices",
  "model_name": "alias-fast"
}
```

**Parameters:**

- `url` (string, required): The URL to scrape
- `query` (string, required): Natural language extraction instruction
- `model_name` (string, optional): AI model to use (default: `"alias-fast"`)

### `/api/session`

```json
{
  "model_name": "alias-fast"
}
```

## Response Format

```json
{
  "url": "https://example.com",
  "query": "Extract prices",
  "response": {
    "status": 200,
    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
    "url": "https://example.com"
  }
}
```

## Best Practices

1. **Use stateless endpoints** for one-off requests
2. **Use sessions** for batch processing multiple URLs
3. **Always close sessions** when finished to free resources
4. **Implement error handling** - 500 errors may occur on complex sites
5. **Add retry logic** for production use
6. **Respect rate limits** - use responsibly
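Best practices 3–5 above can be combined into a small client-side wrapper. A sketch, assuming a session has already been created; the `backoff_delays` and `scrape_with_retry` helpers are illustrative, not part of the API:

```python
import time
import requests

BASE = "https://grazieprego-scrapling.hf.space"

def backoff_delays(retries=3, base=1.0):
    """Exponential backoff schedule in seconds: 1, 2, 4, ... (pure helper)."""
    return [base * (2 ** i) for i in range(retries)]

def scrape_with_retry(session_id, url, query, retries=3):
    """POST to the session scrape endpoint, retrying transient failures.

    Retries only 5xx responses and connection errors; 4xx errors
    (e.g. an expired session) are raised immediately.
    """
    last_error = None
    for delay in backoff_delays(retries):
        try:
            resp = requests.post(
                f"{BASE}/api/session/{session_id}/scrape",
                json={"url": url, "query": query},
                timeout=60,
            )
            if resp.status_code < 500:      # success or client error: no retry
                resp.raise_for_status()
                return resp.json()
            last_error = requests.HTTPError(f"{resp.status_code}: {resp.text}")
        except requests.ConnectionError as exc:
            last_error = exc
        time.sleep(delay)                   # back off before the next attempt
    raise last_error
```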
## Error Handling

- **404**: Session not found
- **500**: Internal server error (check the `detail` field for specifics)
- **Common issues**:
  - URL unreachable or timeout
  - JavaScript-heavy sites may need `stealthy_fetch`
  - Bot protection may block requests

## Deployment

This space uses Docker with:

- Python 3.11
- FastAPI + Uvicorn
- Gradio 5.x
- Playwright for browser automation
- Scrapling for advanced scraping

## License

MIT License - see the LICENSE file for details.

## Credits

Built with [Scrapling](https://github.com/D4Vinci/Scrapling), an advanced web scraping library.

---

**Note**: This is a demonstration space. For production use, consider self-hosting with appropriate rate limiting and authentication.
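As a client-side complement to the Error Handling section above, a small helper can translate the documented status codes into actionable messages. The `explain_error` name is ours, not part of the API:

```python
def explain_error(status_code, body):
    """Map the API's documented error codes to readable messages (sketch).

    `body` is the parsed JSON error response, which may carry a
    `detail` field with specifics on 500 errors.
    """
    detail = (body or {}).get("detail", "no detail provided")
    if status_code == 404:
        return f"Session not found: {detail}"
    if status_code == 500:
        return (f"Server error - the site may be JavaScript-heavy "
                f"(try stealthy fetch) or blocking bots: {detail}")
    return f"HTTP {status_code}: {detail}"
```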