
# ScrapeRL Documentation

Welcome to ScrapeRL - an advanced Reinforcement Learning-powered web scraping environment. This documentation covers all aspects of using and configuring ScrapeRL.


## Table of Contents

  1. Getting Started
  2. Dashboard Overview
  3. Agents
  4. Plugins
  5. Memory System
  6. Models & Providers
  7. Settings
  8. API Reference
  9. Troubleshooting

## Getting Started

### What is ScrapeRL?

ScrapeRL is an intelligent web scraping system that uses Reinforcement Learning (RL) to learn and adapt scraping strategies. Unlike traditional scrapers, ScrapeRL can:

- **Learn from experience** - improve scraping strategies over time
- **Adapt to changes** - handle website structure changes automatically
- **Coordinate multiple agents** - use specialized agents for different tasks
- **Build on memory** - remember patterns and optimize future runs

### Quick Start

1. **Enter a Target URL** - provide the webpage you want to scrape
2. **Write an Instruction** - describe what data you want to extract
3. **Configure Options** - select the model, agents, and plugins
4. **Start Episode** - click Start and watch the magic happen!

#### Example Task

```
URL: https://example.com/products
Instruction: Extract all product names, prices, and descriptions
Task Type: Medium
```
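The same task can also be submitted programmatically through the episode API documented in the API Reference. Below is a minimal sketch: the `/api/episode/reset` endpoint and its `task_id`/`config` fields come from that reference, but the keys inside `config` (`url`, `instruction`, `task_type`) are illustrative assumptions, not a documented schema.

```python
import json

def build_episode_request(url: str, instruction: str, task_type: str,
                          task_id: str = "scrape-products") -> dict:
    """Assemble a JSON body for POST /api/episode/reset (config keys assumed)."""
    if task_type not in {"low", "medium", "high"}:
        raise ValueError(f"unknown task type: {task_type}")
    return {
        "task_id": task_id,
        "config": {
            "url": url,
            "instruction": instruction,
            "task_type": task_type,
        },
    }

payload = build_episode_request(
    "https://example.com/products",
    "Extract all product names, prices, and descriptions",
    "medium",
)
print(json.dumps(payload, indent=2))
```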

## Dashboard Overview

The dashboard is your command center for monitoring and controlling scraping operations.

### Layout Structure

| Section | Description |
|---------|-------------|
| Input Bar | Enter the URL and instruction, and configure the task |
| Left Sidebar | View active agents, MCPs, skills, and tools |
| Center Area | Main visualization and current observation |
| Right Sidebar | Memory stats, extracted data, recent actions |
| Bottom Logs | Real-time terminal-style log output |

### Stats Header

The header shows key metrics with expandable details:

- **Episodes** - total scraping sessions completed
- **Steps** - actions taken in current/total sessions
- **Reward** - performance score (higher is better)
- **Time** - current time and session duration

Click the icon on any stat to see detailed statistics (min, max, average).

### Task Configuration

#### Task Types

| Type | Description | Use Case |
|------|-------------|----------|
| Low | Simple single-page scraping | Product pages, article text |
| Medium | Multi-page scraping with navigation | Search results, listings |
| High | Complex interactive tasks | Login-required pages, forms |

## Agents

ScrapeRL uses a multi-agent architecture where specialized agents handle different aspects of scraping.

### Available Agents

| Agent | Role | Description |
|-------|------|-------------|
| Coordinator | Orchestrator | Manages all other agents and decides strategy |
| Scraper | Extractor | Extracts data from page content |
| Navigator | Navigation | Handles page navigation, clicking, and scrolling |
| Analyzer | Analysis | Analyzes extracted data for patterns |
| Validator | Validation | Validates data quality and completeness |

### Agent Selection

  1. Click the Agents button in the input bar
  2. Select agents you want to enable
  3. Active agents appear in the left sidebar accordion
  4. Monitor agent activity in real-time

### Agent Status Indicators

- **Active** - currently processing
- **Ready** - waiting for a task
- **Idle** - not currently in use
- **Error** - encountered an issue

## Plugins

Extend ScrapeRL's capabilities with plugins organized by category.

### Plugin Categories

#### MCPs (Model Context Protocols)

Tools that provide browser automation and page interaction:

| Plugin | Description |
|--------|-------------|
| Browser Use | AI-powered browser automation |
| Puppeteer MCP | Headless Chrome control |
| Playwright MCP | Cross-browser automation |

#### Skills

Specialized capabilities for specific tasks:

| Plugin | Description |
|--------|-------------|
| Web Scraping | Core extraction algorithms |
| Data Extraction | Structured data parsing |
| Form Filling | Automated form completion |

#### APIs

External service integrations:

| Plugin | Description |
|--------|-------------|
| Firecrawl | High-performance web crawler |
| Jina Reader | Content reader API |
| Serper | Search engine results API |

#### Vision

Visual understanding capabilities:

| Plugin | Description |
|--------|-------------|
| GPT-4 Vision | OpenAI visual analysis |
| Gemini Vision | Google visual AI |
| Claude Vision | Anthropic visual models |

### Managing Plugins

  1. Go to Plugins tab
  2. Browse by category
  3. Click Install to add a plugin
  4. Enable plugins in Dashboard via the Plugins popup

## Memory System

ScrapeRL uses a hierarchical memory system for context retention.

### Memory Layers

| Layer | Purpose | Retention |
|-------|---------|-----------|
| Working | Current task context | Session |
| Episodic | Experience records | Persistent |
| Semantic | Learned patterns | Persistent |
| Procedural | Action sequences | Persistent |

### Memory Features

- **Auto-consolidation** - promotes important data between layers
- **Similarity search** - finds related memories quickly
- **Pattern recognition** - learns from past experiences
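To illustrate the idea behind similarity search (not ScrapeRL's actual implementation, which is undocumented here), the toy sketch below ranks stored memory snippets against a query using bag-of-words cosine similarity:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(memories: list, query: str, limit: int = 3) -> list:
    """Return the `limit` memories most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(memories,
                    key=lambda m: cosine(Counter(m.lower().split()), q),
                    reverse=True)
    return ranked[:limit]

memories = [
    "product prices on example.com listed in USD",
    "login form requires CSRF token",
    "pagination uses ?page=N query parameter",
]
print(search(memories, "product prices", limit=1))
# -> ['product prices on example.com listed in USD']
```

Real systems typically replace the bag-of-words vectors with learned embeddings, but the ranking idea is the same.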

## Models & Providers

### Supported Providers

| Provider | Models | Best For |
|----------|--------|----------|
| Groq | GPT-OSS 120B | Fast inference (default) |
| Google | Gemini 2.5 Flash | Balanced performance |
| OpenAI | GPT-4 Turbo | High accuracy |
| Anthropic | Claude 3 Opus | Complex reasoning |

### Model Selection

1. Click the **Model** button in the input bar
2. Select from the available models

Each model requires the matching provider's API key (see below).

### API Keys

Configure API keys in Settings > API Keys:

  1. Select provider
  2. Enter your API key
  3. Click Save
  4. Key status shows as "Active" when configured

## Settings

### General Settings

| Setting | Description |
|---------|-------------|
| WebSocket Updates | Enable real-time updates |
| Memory Persistence | Save memory across sessions |
| Auto-save Episodes | Automatically save completed episodes |
| Debug Mode | Enable verbose logging |

### Budget & Limits

Control API usage costs:

- **Daily Limit** - maximum spend per day
- **Monthly Limit** - maximum spend per month
- **Max Tokens** - token limit per request
- **Alert Threshold** - warning at 80% usage

Budget limits are disabled by default. Enable in Settings to control spending.
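The alert logic described above can be sketched as follows; the function name and return values are illustrative, not ScrapeRL's actual API:

```python
def budget_status(spend: float, limit: float, alert_at: float = 0.80) -> str:
    """Classify current spend against a daily or monthly limit."""
    if limit <= 0:
        return "disabled"      # budget limits are off by default
    if spend >= limit:
        return "over-limit"
    if spend >= alert_at * limit:
        return "warning"       # e.g. $8.00 spent of a $10.00 daily limit
    return "ok"

print(budget_status(8.0, 10.0))   # -> warning
print(budget_status(0.0, 0.0))    # -> disabled
```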

### Appearance

- **Theme** - Dark (default), Light, Auto
- **Compact Mode** - reduce UI spacing
- **Animations** - enable/disable transitions

## API Reference

### Health Check

```http
GET /api/health
```

Response:

```json
{
  "status": "healthy",
  "version": "0.1.0",
  "timestamp": "2026-03-28T00:00:00Z"
}
```

### Episode Management

```
# Start a new episode
POST /api/episode/reset
{
  "task_id": "scrape-products",
  "config": { ... }
}

# Take an action
POST /api/episode/step
{
  "action": "navigate",
  "params": { "url": "..." }
}

# Get the current state
GET /api/episode/state
```
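The endpoints above can be driven from a small client. In this sketch the HTTP transport is injected so the flow can be exercised without a live server; `api_post` shows one stdlib way to make the real calls. The host/port and the `"navigate"` action payload are assumptions for illustration.

```python
import json
from urllib import request as urlrequest

BASE_URL = "http://localhost:8000"  # assumed ScrapeRL backend address

def api_post(path: str, body: dict) -> dict:
    """POST a JSON body to the backend and decode the JSON response."""
    req = urlrequest.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlrequest.urlopen(req) as resp:
        return json.load(resp)

def run_episode(post=api_post) -> list:
    """Reset an episode, then take a single navigation step."""
    responses = []
    responses.append(post("/api/episode/reset",
                          {"task_id": "scrape-products", "config": {}}))
    responses.append(post("/api/episode/step",
                          {"action": "navigate",
                           "params": {"url": "https://example.com/products"}}))
    return responses
```

Passing a fake `post` callable makes the reset/step sequence testable offline.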

### Memory API

```
# Store an entry
POST /api/memory/store
{
  "content": "...",
  "memory_type": "working",
  "metadata": { ... }
}

# Query memories
POST /api/memory/query
{
  "query": "product prices",
  "memory_type": "semantic",
  "limit": 10
}
```
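Request bodies for the two memory endpoints can be built with small helpers. Field names mirror the documented examples; the set of valid memory types is taken from the Memory Layers table, and the helper names themselves are illustrative.

```python
VALID_TYPES = {"working", "episodic", "semantic", "procedural"}

def store_body(content: str, memory_type: str, metadata: dict = None) -> dict:
    """Build the JSON body for POST /api/memory/store."""
    if memory_type not in VALID_TYPES:
        raise ValueError(f"unknown memory type: {memory_type}")
    return {
        "content": content,
        "memory_type": memory_type,
        "metadata": metadata or {},
    }

def query_body(query: str, memory_type: str = "semantic",
               limit: int = 10) -> dict:
    """Build the JSON body for POST /api/memory/query."""
    return {"query": query, "memory_type": memory_type, "limit": limit}

print(query_body("product prices"))
# -> {'query': 'product prices', 'memory_type': 'semantic', 'limit': 10}
```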

### Plugins API

```
# List plugins
GET /api/plugins/

# Install a plugin
POST /api/plugins/install
{ "plugin_id": "firecrawl" }

# Uninstall a plugin
POST /api/plugins/uninstall
{ "plugin_id": "firecrawl" }
```

## Troubleshooting

### Common Issues

#### "API Key Required" Error

**Solution:** configure at least one API key under Settings > API Keys.

#### Episode Not Starting

Checklist:

  • Valid URL entered
  • At least one agent selected
  • API key configured
  • System status shows "Online"

#### Slow Performance

Tips:

  • Use Groq for faster inference
  • Reduce enabled plugins
  • Lower task complexity if possible

#### Memory Full

**Solution:** clear memory layers under Settings > Advanced > Clear Cache.

### Getting Help

  • Check the logs panel for error details
  • View episode history for past issues
  • Report bugs on GitHub

## Keyboard Shortcuts

| Shortcut | Action |
|----------|--------|
| `Ctrl + Enter` | Start/stop episode |
| `Ctrl + L` | Clear logs |
| `Ctrl + ,` | Open settings |
| `Escape` | Close popups |

## Version History

### v0.1.0 (Current)

  • Initial release
  • Multi-agent architecture
  • Plugin system
  • Memory layers
  • Dashboard with real-time monitoring

Documentation last updated: March 2026

Built by NeerajCodz

## Related

- API Reference: [api-reference.md](api-reference.md)