
# ScrapeRL Documentation

Welcome to ScrapeRL - an advanced Reinforcement Learning-powered web scraping environment. This documentation covers all aspects of using and configuring ScrapeRL.


## Table of Contents

  1. Getting Started
  2. Dashboard Overview
  3. Agents
  4. Plugins
  5. Memory System
  6. Models & Providers
  7. Settings
  8. API Reference
  9. Troubleshooting

## Getting Started

### What is ScrapeRL?

ScrapeRL is an intelligent web scraping system that uses Reinforcement Learning (RL) to learn and adapt scraping strategies. Unlike traditional scrapers, ScrapeRL can:

- **Learn from experience** - improve scraping strategies over time
- **Adapt to changes** - handle website structure changes automatically
- **Coordinate multiple agents** - use specialized agents for different tasks
- **Build on memory** - remember patterns and optimize future runs

### Quick Start

1. **Enter a Target URL** - provide the webpage you want to scrape
2. **Write an Instruction** - describe what data you want to extract
3. **Configure Options** - select the model, agents, and plugins
4. **Start Episode** - click Start and watch the magic happen!

#### Example Task

```
URL: https://example.com/products
Instruction: Extract all product names, prices, and descriptions
Task Type: Medium
```
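The same task can also be submitted programmatically through the episode API documented in the API Reference. Below is a minimal sketch: the `/api/episode/reset` endpoint and its `task_id`/`config` fields come from that reference, but the keys inside `config` (`url`, `instruction`, `task_type`) are illustrative assumptions, not a documented schema.

```python
import json

def build_episode_request(url: str, instruction: str, task_type: str,
                          task_id: str = "scrape-products") -> dict:
    """Assemble a JSON body for POST /api/episode/reset (config keys assumed)."""
    if task_type not in {"low", "medium", "high"}:
        raise ValueError(f"unknown task type: {task_type}")
    return {
        "task_id": task_id,
        "config": {
            "url": url,
            "instruction": instruction,
            "task_type": task_type,
        },
    }

payload = build_episode_request(
    "https://example.com/products",
    "Extract all product names, prices, and descriptions",
    "medium",
)
print(json.dumps(payload, indent=2))
```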

## Dashboard Overview

The dashboard is your command center for monitoring and controlling scraping operations.

### Layout Structure

| Section | Description |
|---------|-------------|
| Input Bar | Enter the URL and instruction, and configure the task |
| Left Sidebar | View active agents, MCPs, skills, and tools |
| Center Area | Main visualization and current observation |
| Right Sidebar | Memory stats, extracted data, recent actions |
| Bottom Logs | Real-time terminal-style log output |

### Stats Header

The header shows key metrics with expandable details:

- **Episodes** - total scraping sessions completed
- **Steps** - actions taken in current/total sessions
- **Reward** - performance score (higher is better)
- **Time** - current time and session duration

Click the icon on any stat to see detailed statistics (min, max, average).

### Task Configuration

#### Task Types

| Type | Description | Use Case |
|------|-------------|----------|
| Low | Simple single-page scraping | Product pages, article text |
| Medium | Multi-page scraping with navigation | Search results, listings |
| High | Complex interactive tasks | Login-required pages, forms |

## Agents

ScrapeRL uses a multi-agent architecture where specialized agents handle different aspects of scraping.

### Available Agents

| Agent | Role | Description |
|-------|------|-------------|
| Coordinator | Orchestrator | Manages all other agents and decides strategy |
| Scraper | Extractor | Extracts data from page content |
| Navigator | Navigation | Handles page navigation, clicking, and scrolling |
| Analyzer | Analysis | Analyzes extracted data for patterns |
| Validator | Validation | Validates data quality and completeness |

### Agent Selection

  1. Click the Agents button in the input bar
  2. Select agents you want to enable
  3. Active agents appear in the left sidebar accordion
  4. Monitor agent activity in real-time

### Agent Status Indicators

- **Active** - currently processing
- **Ready** - waiting for a task
- **Idle** - not currently in use
- **Error** - encountered an issue

## Plugins

Extend ScrapeRL's capabilities with plugins organized by category.

### Plugin Categories

#### MCPs (Model Context Protocols)

Tools that provide browser automation and page interaction:

| Plugin | Description |
|--------|-------------|
| Browser Use | AI-powered browser automation |
| Puppeteer MCP | Headless Chrome control |
| Playwright MCP | Cross-browser automation |

#### Skills

Specialized capabilities for specific tasks:

| Plugin | Description |
|--------|-------------|
| Web Scraping | Core extraction algorithms |
| Data Extraction | Structured data parsing |
| Form Filling | Automated form completion |

#### APIs

External service integrations:

| Plugin | Description |
|--------|-------------|
| Firecrawl | High-performance web crawler |
| Jina Reader | Content reader API |
| Serper | Search engine results API |

#### Vision

Visual understanding capabilities:

| Plugin | Description |
|--------|-------------|
| GPT-4 Vision | OpenAI visual analysis |
| Gemini Vision | Google visual AI |
| Claude Vision | Anthropic visual models |

### Managing Plugins

  1. Go to Plugins tab
  2. Browse by category
  3. Click Install to add a plugin
  4. Enable plugins in Dashboard via the Plugins popup

## Memory System

ScrapeRL uses a hierarchical memory system for context retention.

### Memory Layers

| Layer | Purpose | Retention |
|-------|---------|-----------|
| Working | Current task context | Session |
| Episodic | Experience records | Persistent |
| Semantic | Learned patterns | Persistent |
| Procedural | Action sequences | Persistent |

### Memory Features

- **Auto-consolidation** - promotes important data between layers
- **Similarity search** - finds related memories quickly
- **Pattern recognition** - learns from past experiences
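To illustrate the idea behind similarity search (not ScrapeRL's actual implementation, which is undocumented here), the toy sketch below ranks stored memory snippets against a query using bag-of-words cosine similarity:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(memories: list, query: str, limit: int = 3) -> list:
    """Return the `limit` memories most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(memories,
                    key=lambda m: cosine(Counter(m.lower().split()), q),
                    reverse=True)
    return ranked[:limit]

memories = [
    "product prices on example.com listed in USD",
    "login form requires CSRF token",
    "pagination uses ?page=N query parameter",
]
print(search(memories, "product prices", limit=1))
# -> ['product prices on example.com listed in USD']
```

Real systems typically replace the bag-of-words vectors with learned embeddings, but the ranking idea is the same.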

## Models & Providers

### Supported Providers

| Provider | Models | Best For |
|----------|--------|----------|
| Groq | GPT-OSS 120B | Fast inference (default) |
| Google | Gemini 2.5 Flash | Balanced performance |
| OpenAI | GPT-4 Turbo | High accuracy |
| Anthropic | Claude 3 Opus | Complex reasoning |

### Model Selection

1. Click the **Model** button in the input bar
2. Select from the available models

Each model requires the matching provider's API key (see below).

### API Keys

Configure API keys in Settings > API Keys:

  1. Select provider
  2. Enter your API key
  3. Click Save
  4. Key status shows as "Active" when configured

## Settings

### General Settings

| Setting | Description |
|---------|-------------|
| WebSocket Updates | Enable real-time updates |
| Memory Persistence | Save memory across sessions |
| Auto-save Episodes | Automatically save completed episodes |
| Debug Mode | Enable verbose logging |

### Budget & Limits

Control API usage costs:

- **Daily Limit** - maximum spend per day
- **Monthly Limit** - maximum spend per month
- **Max Tokens** - token limit per request
- **Alert Threshold** - warning at 80% usage

Budget limits are disabled by default. Enable in Settings to control spending.
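The alert logic described above can be sketched as follows; the function name and return values are illustrative, not ScrapeRL's actual API:

```python
def budget_status(spend: float, limit: float, alert_at: float = 0.80) -> str:
    """Classify current spend against a daily or monthly limit."""
    if limit <= 0:
        return "disabled"      # budget limits are off by default
    if spend >= limit:
        return "over-limit"
    if spend >= alert_at * limit:
        return "warning"       # e.g. $8.00 spent of a $10.00 daily limit
    return "ok"

print(budget_status(8.0, 10.0))   # -> warning
print(budget_status(0.0, 0.0))    # -> disabled
```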

### Appearance

- **Theme** - Dark (default), Light, Auto
- **Compact Mode** - reduce UI spacing
- **Animations** - enable/disable transitions

## API Reference

### Health Check

```http
GET /api/health
```

Response:

```json
{
  "status": "healthy",
  "version": "0.1.0",
  "timestamp": "2026-03-28T00:00:00Z"
}
```

### Episode Management

```
# Start a new episode
POST /api/episode/reset
{
  "task_id": "scrape-products",
  "config": { ... }
}

# Take an action
POST /api/episode/step
{
  "action": "navigate",
  "params": { "url": "..." }
}

# Get the current state
GET /api/episode/state
```
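The endpoints above can be driven from a small client. In this sketch the HTTP transport is injected so the flow can be exercised without a live server; `api_post` shows one stdlib way to make the real calls. The host/port and the `"navigate"` action payload are assumptions for illustration.

```python
import json
from urllib import request as urlrequest

BASE_URL = "http://localhost:8000"  # assumed ScrapeRL backend address

def api_post(path: str, body: dict) -> dict:
    """POST a JSON body to the backend and decode the JSON response."""
    req = urlrequest.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlrequest.urlopen(req) as resp:
        return json.load(resp)

def run_episode(post=api_post) -> list:
    """Reset an episode, then take a single navigation step."""
    responses = []
    responses.append(post("/api/episode/reset",
                          {"task_id": "scrape-products", "config": {}}))
    responses.append(post("/api/episode/step",
                          {"action": "navigate",
                           "params": {"url": "https://example.com/products"}}))
    return responses
```

Passing a fake `post` callable makes the reset/step sequence testable offline.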

### Memory API

```
# Store an entry
POST /api/memory/store
{
  "content": "...",
  "memory_type": "working",
  "metadata": { ... }
}

# Query memories
POST /api/memory/query
{
  "query": "product prices",
  "memory_type": "semantic",
  "limit": 10
}
```
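Request bodies for the two memory endpoints can be built with small helpers. Field names mirror the documented examples; the set of valid memory types is taken from the Memory Layers table, and the helper names themselves are illustrative.

```python
VALID_TYPES = {"working", "episodic", "semantic", "procedural"}

def store_body(content: str, memory_type: str, metadata: dict = None) -> dict:
    """Build the JSON body for POST /api/memory/store."""
    if memory_type not in VALID_TYPES:
        raise ValueError(f"unknown memory type: {memory_type}")
    return {
        "content": content,
        "memory_type": memory_type,
        "metadata": metadata or {},
    }

def query_body(query: str, memory_type: str = "semantic",
               limit: int = 10) -> dict:
    """Build the JSON body for POST /api/memory/query."""
    return {"query": query, "memory_type": memory_type, "limit": limit}

print(query_body("product prices"))
# -> {'query': 'product prices', 'memory_type': 'semantic', 'limit': 10}
```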

### Plugins API

```
# List plugins
GET /api/plugins/

# Install a plugin
POST /api/plugins/install
{ "plugin_id": "firecrawl" }

# Uninstall a plugin
POST /api/plugins/uninstall
{ "plugin_id": "firecrawl" }
```

## Troubleshooting

### Common Issues

#### "API Key Required" Error

**Solution:** configure at least one API key under Settings > API Keys.

#### Episode Not Starting

Checklist:

  • Valid URL entered
  • At least one agent selected
  • API key configured
  • System status shows "Online"

#### Slow Performance

Tips:

  • Use Groq for faster inference
  • Reduce enabled plugins
  • Lower task complexity if possible

#### Memory Full

**Solution:** clear memory layers under Settings > Advanced > Clear Cache.

### Getting Help

  • Check the logs panel for error details
  • View episode history for past issues
  • Report bugs on GitHub

## Keyboard Shortcuts

| Shortcut | Action |
|----------|--------|
| `Ctrl + Enter` | Start/stop episode |
| `Ctrl + L` | Clear logs |
| `Ctrl + ,` | Open settings |
| `Escape` | Close popups |

## Version History

### v0.1.0 (Current)

  • Initial release
  • Multi-agent architecture
  • Plugin system
  • Memory layers
  • Dashboard with real-time monitoring

Documentation last updated: March 2026

Built by NeerajCodz

## Related

- API Reference: [api-reference.md](api-reference.md)