# ScrapeRL Documentation
Welcome to ScrapeRL - an advanced Reinforcement Learning-powered web scraping environment. This documentation covers all aspects of using and configuring ScrapeRL.
## Table of Contents
- Getting Started
- Dashboard Overview
- Agents
- Plugins
- Memory System
- Models & Providers
- Settings
- API Reference
- Troubleshooting
## Getting Started

### What is ScrapeRL?
ScrapeRL is an intelligent web scraping system that uses Reinforcement Learning (RL) to learn and adapt scraping strategies. Unlike traditional scrapers, ScrapeRL can:
- Learn from experience - improve scraping strategies over time
- Adapt to changes - handle website structure changes automatically
- Coordinate multiple agents - use specialized agents for different tasks
- Leverage memory - remember patterns and optimize future runs
### Quick Start
- Enter a Target URL - Provide the webpage you want to scrape
- Write an Instruction - Describe what data you want to extract
- Configure Options - Select model, agents, and plugins
- Start Episode - Click Start and watch the magic happen!
### Example Task

```
URL: https://example.com/products
Instruction: Extract all product names, prices, and descriptions
Task Type: Medium
```
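The example task maps onto the episode `reset` endpoint shown later in the API Reference. A minimal sketch of building the request body, assuming the `config` keys (`target_url`, `instruction`, `task_type`) since the exact schema is not spelled out here:

```python
import json

def build_reset_payload(url, instruction, task_type="medium"):
    """Build a request body for POST /api/episode/reset.

    "task_id" and "config" come from the API Reference; the keys
    inside "config" are illustrative assumptions.
    """
    if task_type.lower() not in {"low", "medium", "high"}:
        raise ValueError(f"unknown task type: {task_type}")
    return {
        "task_id": "scrape-products",
        "config": {
            "target_url": url,
            "instruction": instruction,
            "task_type": task_type.lower(),
        },
    }

payload = build_reset_payload(
    "https://example.com/products",
    "Extract all product names, prices, and descriptions",
    task_type="Medium",
)
print(json.dumps(payload, indent=2))
```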
## Dashboard Overview
The dashboard is your command center for monitoring and controlling scraping operations.
### Layout Structure
| Section | Description |
|---|---|
| Input Bar | Enter URL, instruction, and configure task |
| Left Sidebar | View active agents, MCPs, skills, and tools |
| Center Area | Main visualization and current observation |
| Right Sidebar | Memory stats, extracted data, recent actions |
| Bottom Logs | Real-time terminal-style log output |
### Stats Header
The header shows key metrics with expandable details:
- Episodes - Total scraping sessions completed
- Steps - Actions taken in current/total sessions
- Reward - Performance score (higher is better)
- Time - Current time and session duration
Click the ⋯ icon on any stat to see detailed statistics (min, max, average).
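The expanded view boils down to a simple summary over the per-step values. An illustrative computation, not ScrapeRL's internal code:

```python
def stat_details(values):
    """min / max / average summary for a stat's per-step values."""
    if not values:
        return {"min": None, "max": None, "avg": None}
    return {
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }

# e.g. rewards collected over four steps of an episode
step_rewards = [0.2, 0.8, 0.5, 1.0]
details = stat_details(step_rewards)  # min 0.2, max 1.0, avg 0.625
```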
### Task Configuration

#### Task Types
| Type | Description | Use Case |
|---|---|---|
| Low | Simple single-page scraping | Product page, article text |
| Medium | Multi-page with navigation | Search results, listings |
| High | Complex interactive tasks | Login-required, forms |
## Agents
ScrapeRL uses a multi-agent architecture where specialized agents handle different aspects of scraping.
### Available Agents
| Agent | Role | Description |
|---|---|---|
| Coordinator | Orchestrator | Manages all other agents, decides strategy |
| Scraper | Extractor | Extracts data from page content |
| Navigator | Navigation | Handles page navigation, clicking, scrolling |
| Analyzer | Analysis | Analyzes extracted data for patterns |
| Validator | Validation | Validates data quality and completeness |
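A hypothetical sketch of how the Coordinator might route work to the specialized agents in the table; the class and method names are illustrative, not ScrapeRL's actual internals:

```python
class Agent:
    """One specialized agent with the status values listed below."""
    def __init__(self, name):
        self.name = name
        self.status = "Ready"

    def handle(self, task):
        self.status = "Active"
        result = f"{self.name} handled {task}"
        self.status = "Idle"
        return result

class Coordinator:
    """Orchestrator: decides which agent handles each action."""
    ROUTES = {
        "extract": "Scraper",
        "navigate": "Navigator",
        "analyze": "Analyzer",
        "validate": "Validator",
    }

    def __init__(self, agents):
        self.agents = {a.name: a for a in agents}

    def dispatch(self, action, task):
        agent = self.agents[self.ROUTES[action]]
        return agent.handle(task)

team = [Agent(n) for n in ("Scraper", "Navigator", "Analyzer", "Validator")]
coord = Coordinator(team)
print(coord.dispatch("extract", "product names"))  # Scraper handled product names
```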
### Agent Selection
- Click the Agents button in the input bar
- Select agents you want to enable
- Active agents appear in the left sidebar accordion
- Monitor agent activity in real time
### Agent Status Indicators
- Active - Currently processing
- Ready - Waiting for task
- Idle - Not currently in use
- Error - Encountered an issue
## Plugins
Extend ScrapeRL's capabilities with plugins organized by category.
### Plugin Categories

#### MCPs (Model Context Protocols)

Tools that provide browser automation and page interaction:

| Plugin | Description |
|---|---|
| Browser Use | AI-powered browser automation |
| Puppeteer MCP | Headless Chrome control |
| Playwright MCP | Cross-browser automation |
#### Skills

Specialized capabilities for specific tasks:

| Plugin | Description |
|---|---|
| Web Scraping | Core extraction algorithms |
| Data Extraction | Structured data parsing |
| Form Filling | Automated form completion |
#### APIs

External service integrations:

| Plugin | Description |
|---|---|
| Firecrawl | High-performance web crawler |
| Jina Reader | Content reader API |
| Serper | Search engine results API |
#### Vision

Visual understanding capabilities:

| Plugin | Description |
|---|---|
| GPT-4 Vision | OpenAI visual analysis |
| Gemini Vision | Google visual AI |
| Claude Vision | Anthropic visual models |
### Managing Plugins
- Go to Plugins tab
- Browse by category
- Click Install to add a plugin
- Enable plugins in Dashboard via the Plugins popup
## Memory System
ScrapeRL uses a hierarchical memory system for context retention.
### Memory Layers
| Layer | Purpose | Retention |
|---|---|---|
| Working | Current task context | Session |
| Episodic | Experience records | Persistent |
| Semantic | Learned patterns | Persistent |
| Procedural | Action sequences | Persistent |
### Memory Features
- Auto-consolidation - Promotes important data between layers
- Similarity search - Find related memories quickly
- Pattern recognition - Learn from past experiences
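The layers and auto-consolidation can be sketched as follows; the importance threshold and promotion rule here are assumptions for illustration, not ScrapeRL's documented policy:

```python
class HierarchicalMemory:
    """Sketch of the four memory layers from the table above."""
    def __init__(self):
        # "working" is session-scoped; the other three layers persist.
        self.layers = {
            "working": [], "episodic": [], "semantic": [], "procedural": []
        }

    def store(self, content, memory_type="working", importance=0.0):
        self.layers[memory_type].append(
            {"content": content, "importance": importance}
        )

    def consolidate(self, threshold=0.7):
        """Promote important working-memory entries to the episodic layer."""
        working = self.layers["working"]
        self.layers["episodic"].extend(
            e for e in working if e["importance"] >= threshold
        )
        self.layers["working"] = [
            e for e in working if e["importance"] < threshold
        ]

mem = HierarchicalMemory()
mem.store("prices live in .price spans", importance=0.9)
mem.store("page background is white", importance=0.1)
mem.consolidate()
print(len(mem.layers["episodic"]), len(mem.layers["working"]))  # 1 1
```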
## Models & Providers
### Supported Providers

| Provider | Models | Best For |
|---|---|---|
| Groq | GPT-OSS 120B | Fast inference (default) |
| Google | Gemini 2.5 Flash | Balanced performance |
| OpenAI | GPT-4 Turbo | High accuracy |
| Anthropic | Claude 3 Opus | Complex reasoning |
### Model Selection
- Click Model button in input bar
- Select from available models
- Models require appropriate API keys
### API Keys
Configure API keys in Settings > API Keys:
- Select provider
- Enter your API key
- Click Save
- Key status shows as "Active" when configured
## Settings

### General Settings
| Setting | Description |
|---|---|
| WebSocket Updates | Enable real-time updates |
| Memory Persistence | Save memory across sessions |
| Auto-save Episodes | Automatically save completed episodes |
| Debug Mode | Enable verbose logging |
### Budget & Limits
Control API usage costs:
- Daily Limit - Maximum spend per day
- Monthly Limit - Maximum spend per month
- Max Tokens - Token limit per request
- Alert Threshold - Warning at 80% usage
Budget limits are disabled by default. Enable in Settings to control spending.
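The alert threshold reduces to a simple ratio check. A sketch using the 80% default above; the function name and signature are illustrative, not ScrapeRL internals:

```python
def should_alert(spend, limit, threshold=0.80):
    """True once spend reaches the threshold fraction of the limit."""
    return limit > 0 and spend / limit >= threshold

print(should_alert(8.0, 10.0))   # True  (80% of the daily limit used)
print(should_alert(3.0, 10.0))   # False (only 30% used)
```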
### Appearance
- Theme - Dark (default), Light, Auto
- Compact Mode - Reduce UI spacing
- Animations - Enable/disable transitions
## API Reference

### Health Check

```
GET /api/health
```

Response:

```json
{
  "status": "healthy",
  "version": "0.1.0",
  "timestamp": "2026-03-28T00:00:00Z"
}
```
episode-management
# Start new episode
POST /api/episode/reset
{
"task_id": "scrape-products",
"config": { ... }
}
# Take action
POST /api/episode/step
{
"action": "navigate",
"params": { "url": "..." }
}
# Get current state
GET /api/episode/state
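A minimal Python client for these endpoints using only the standard library; the base URL `http://localhost:8000` is an assumed local deployment, so adjust it to match yours:

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # assumed local ScrapeRL backend address

def build_request(path, body=None):
    """Build a urllib Request: JSON POST when a body is given, GET otherwise."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(
        BASE + path,
        data=data,
        headers=headers,
        method="POST" if body is not None else "GET",
    )

def api_call(path, body=None):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(path, body)) as resp:
        return json.load(resp)

# Typical episode lifecycle (requires a running backend):
# api_call("/api/episode/reset", {"task_id": "scrape-products", "config": {}})
# api_call("/api/episode/step", {"action": "navigate",
#                                "params": {"url": "https://example.com/products"}})
# state = api_call("/api/episode/state")
```

The same helper covers the Memory and Plugins endpoints below, since they follow the same JSON-over-POST shape.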
### Memory API

```
# Store entry
POST /api/memory/store
{
  "content": "...",
  "memory_type": "working",
  "metadata": { ... }
}

# Query memories
POST /api/memory/query
{
  "query": "product prices",
  "memory_type": "semantic",
  "limit": 10
}
```
### Plugins API

```
# List plugins
GET /api/plugins/

# Install plugin
POST /api/plugins/install
{ "plugin_id": "firecrawl" }

# Uninstall plugin
POST /api/plugins/uninstall
{ "plugin_id": "firecrawl" }
```
## Troubleshooting

### Common Issues

#### "API Key Required" Error

Solution: Configure at least one API key in Settings > API Keys.
#### Episode Not Starting
Checklist:
- Valid URL entered
- At least one agent selected
- API key configured
- System status shows "Online"
#### Slow Performance
Tips:
- Use Groq for faster inference
- Reduce enabled plugins
- Lower task complexity if possible
#### Memory Full
Solution: Clear memory layers in Settings > Advanced > Clear Cache
### Getting Help
- Check the logs panel for error details
- View episode history for past issues
- Report bugs on GitHub
## Keyboard Shortcuts

| Shortcut | Action |
|---|---|
| Ctrl + Enter | Start/Stop episode |
| Ctrl + L | Clear logs |
| Ctrl + , | Open settings |
| Escape | Close popups |
## Version History

### v0.1.0 (Current)
- Initial release
- Multi-agent architecture
- Plugin system
- Memory layers
- Dashboard with real-time monitoring
Documentation last updated: March 2026
Built by NeerajCodz