---
title: ScrapeRL
emoji: 🌖
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
---

# ScrapeRL

ScrapeRL is an AI-first web-scraping platform that combines reinforcement-learning-style episodes, multi-agent planning, dynamic tool/plugin calls, and multi-provider LLM routing. It supports synchronous and streaming scrape APIs, session-based execution, real-time frontend updates, and OpenEnv-compatible inference.

## What This Project Delivers

| Area | Capability |
| --- | --- |
| Scraping runtime | Endpoint-driven scraping with JSON, CSV, Markdown, and text output modes |
| AI routing | Provider/model routing across OpenAI, Anthropic, Google, Groq, and NVIDIA |
| Agentic tooling | Registry-based runtime tool planning and execution with streamed `tool_call` steps |
| Memory | Short-term, working, long-term, and shared memory layers |
| Interface | React + Vite dashboard with live stream progress and session visibility |
| Deployment | Local dev, Docker Compose, and Hugging Face Space-compatible Docker setup |
| Evaluation | Root `inference.py` following a strict `[START]`/`[STEP]`/`[END]` OpenEnv output contract |

## System Topology

```mermaid
flowchart TD
    A[frontend-dashboard] --> B[fastapi-control-plane]
    B --> C[episode-runtime]
    B --> D[scrape-runtime]
    B --> E[agent-runtime]
    E --> F[model-router]
    E --> G[tool-and-plugin-registry]
    E --> H[memory-manager]
    D --> G
    D --> H
    B --> I[websocket-and-sse-streams]
```

## Repository Layout

```
scrapeRL/
  backend/
    app/
      api/routes/        # FastAPI route modules
      agents/            # agent planning/runtime logic
      models/            # model router + provider adapters
      plugins/           # plugin registry + runtime integrations
      memory/            # memory layers and manager
      core/              # env/reward/observation/action foundations
    requirements.txt
  frontend/
    src/                 # React app
    package.json
  docs/                  # modular technical documentation
  inference.py           # OpenEnv-compliant inference runner
  docker-compose.yml
  .env.example
```

## Quick Start

### Docker Compose

```bash
git clone https://github.com/NeerajCodz/scrapeRL.git
cd scrapeRL
cp .env.example .env
# set api keys in .env
docker compose up --build
```
| Service | URL |
| --- | --- |
| Frontend | http://localhost:3000 |
| Backend API | http://localhost:8000 |
| Swagger | http://localhost:8000/swagger |

### Local Development

Backend:

```bash
cd backend
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

Frontend:

```bash
cd frontend
npm install
npm run dev -- --host 0.0.0.0 --port 3000
```

## Configuration

Root configuration lives in `.env` (template: `.env.example`).

### Provider and Model Keys

| Variable | Purpose |
| --- | --- |
| `OPENAI_API_KEY` | OpenAI chat + embeddings access |
| `ANTHROPIC_API_KEY` | Anthropic model access |
| `GOOGLE_API_KEY` | Google provider and embeddings access |
| `GEMINI_API_KEY` | Alias key used by tests/compose for Gemini |
| `GROQ_API_KEY` | Groq provider access |
| `NVIDIA_API_KEY` | NVIDIA provider access |
| `NVIDIA_BASE_URL` | NVIDIA OpenAI-compatible endpoint base URL |
| `GEMINI_MODEL_EMBEDDING` | Embedding model ID for Google embeddings |
| `HF_TOKEN` | Required token for `inference.py` OpenAI client auth |
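As a quick sanity check before starting the stack, a small script can report which provider keys are set. This is an illustrative sketch, not part of the repository; only the environment variable names come from the table above.

```python
import os

# Provider API-key variables, taken from the table above.
PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google": "GOOGLE_API_KEY",
    "Groq": "GROQ_API_KEY",
    "NVIDIA": "NVIDIA_API_KEY",
}

def available_providers(env=os.environ):
    """Return providers whose API key is set and non-empty."""
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var)]

if __name__ == "__main__":
    print("configured providers:", available_providers() or "none")
```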

### App Runtime

| Variable | Default |
| --- | --- |
| `DEBUG` | `false` |
| `LOG_LEVEL` | `INFO` |
| `HOST` | `0.0.0.0` |
| `PORT` | `8000` |
| `CORS_ORIGINS` | `["http://localhost:5173","http://localhost:3000"]` |
| `SESSION_TIMEOUT` | `3600` |
| `MEMORY_TTL` | `86400` |
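Note that `CORS_ORIGINS` is a JSON array encoded as a string, while the numeric values arrive as strings, so any consumer has to coerce types. A minimal loader sketch with defaults mirroring the table above (the function name is illustrative, not the app's actual config code):

```python
import json
import os

def load_runtime_settings(env=os.environ):
    """Read app-runtime settings from the environment, coercing types.

    Defaults mirror the app-runtime table above.
    """
    return {
        "debug": env.get("DEBUG", "false").lower() == "true",
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "host": env.get("HOST", "0.0.0.0"),
        "port": int(env.get("PORT", "8000")),
        # CORS_ORIGINS is stored as a JSON array string.
        "cors_origins": json.loads(
            env.get("CORS_ORIGINS",
                    '["http://localhost:5173","http://localhost:3000"]')
        ),
        "session_timeout": int(env.get("SESSION_TIMEOUT", "3600")),
        "memory_ttl": int(env.get("MEMORY_TTL", "86400")),
    }
```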

### Inference Runtime

| Variable | Default |
| --- | --- |
| `API_BASE_URL` | `https://api.openai.com/v1` |
| `MODEL_NAME` | `gpt-4.1-mini` |
| `ENV_API_BASE_URL` | `http://localhost:8000/api` |
| `TASK_NAME` | `task_001` |
| `BENCHMARK` | `openenv` |
| `MAX_STEPS` | `12` |
| `EPISODE_SEED` | `42` |
| `LLM_TEMPERATURE` | `0.0` |
| `PROMPT_HTML_LIMIT` | `5000` |
| `REQUEST_TIMEOUT_SECONDS` | `30` |
| `USE_OPENENV_SDK` | `true` |
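A subset of these variables can be gathered into a typed config object before an inference run; the sketch below is illustrative (the dataclass and loader are not the repo's actual code, only the variable names and defaults come from the table):

```python
import os
from dataclasses import dataclass

@dataclass
class InferenceConfig:
    """A subset of inference-runtime settings; defaults mirror the table."""
    api_base_url: str
    model_name: str
    max_steps: int
    episode_seed: int
    llm_temperature: float
    use_openenv_sdk: bool

def load_inference_config(env=os.environ) -> InferenceConfig:
    return InferenceConfig(
        api_base_url=env.get("API_BASE_URL", "https://api.openai.com/v1"),
        model_name=env.get("MODEL_NAME", "gpt-4.1-mini"),
        max_steps=int(env.get("MAX_STEPS", "12")),
        episode_seed=int(env.get("EPISODE_SEED", "42")),
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0.0")),
        use_openenv_sdk=env.get("USE_OPENENV_SDK", "true").lower() == "true",
    )
```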

## `inference.py` OpenEnv Contract

The root `inference.py` uses `from openai import OpenAI` for all LLM calls and emits strict structured logs:

```
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>
```

Run:

```bash
python inference.py --task task_001 --benchmark openenv
```
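Downstream tooling that consumes these logs can pull the `[STEP]` lines apart with a small regex. An illustrative sketch (not part of the repo) assuming the exact field order shown in the contract above:

```python
import re

# Matches the [STEP] log format emitted by inference.py.
STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>.+?) "
    r"reward=(?P<reward>[-\d.]+) done=(?P<done>true|false) "
    r"error=(?P<error>.+)"
)

def parse_step(line):
    """Parse one [STEP] line into a dict, or return None if it doesn't match."""
    m = STEP_RE.match(line)
    if not m:
        return None
    return {
        "step": int(m.group("step")),
        "action": m.group("action"),
        "reward": float(m.group("reward")),
        "done": m.group("done") == "true",
        "error": None if m.group("error") == "null" else m.group("error"),
    }
```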

## API Quick Map

Use `docs/api-reference.md` for the full endpoint inventory. Core surfaces:

| Surface | Endpoints |
| --- | --- |
| Health | `/api/health`, `/api/ready`, `/api/ping` |
| Episode | `/api/episode/reset`, `/api/episode/step`, `/api/episode/state/{episode_id}` |
| Scrape | `/api/scrape/stream`, `/api/scrape/{session_id}/status`, `/api/scrape/{session_id}/result` |
| Agents/Tools/Memory | `/api/agents/*`, `/api/tools/*`, `/api/plugins/*`, `/api/memory/*` |
| Realtime | `/ws/episode/{episode_id}` |
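For ad-hoc testing it can help to build these endpoint URLs programmatically. An illustrative helper (not part of the repo; the base URL default matches the Docker Compose quick-start, and the path templates come from the table above):

```python
BASE_URL = "http://localhost:8000"  # backend URL from the quick-start table

# Path templates taken from the quick map above.
ENDPOINTS = {
    "health": "/api/health",
    "episode_state": "/api/episode/state/{episode_id}",
    "scrape_status": "/api/scrape/{session_id}/status",
    "scrape_result": "/api/scrape/{session_id}/result",
}

def endpoint_url(name, **params):
    """Fill a path template and prefix the backend base URL."""
    return BASE_URL + ENDPOINTS[name].format(**params)
```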

## Documentation Map

| Document | Purpose |
| --- | --- |
| `docs/overview.md` | Platform overview and navigation |
| `docs/api-reference.md` | Authoritative HTTP and WebSocket reference |
| `docs/architecture.md` | System architecture and runtime planes |
| `docs/openenv.md` | OpenEnv environment contract |
| `docs/tool-calls.md` | Streamed tool-call event patterns |
| `docs/plugins.md` | Plugin registry and dynamic tool model |
| `docs/memory.md` | Memory design and operations |
| `docs/readme.md` | Docs index |

## Testing and Validation

Backend:

```bash
cd backend
pytest
```

Frontend:

```bash
cd frontend
npm run test
```

## Deployment Notes

| Mode | Notes |
| --- | --- |
| Docker Compose | Preferred local full-stack run |
| Hugging Face Space | Root `README.md` front matter + Docker SDK config is compatible |
| Direct backend | Run `uvicorn app.main:app` with `.env` configured |

## Troubleshooting

| Symptom | Likely Cause | Check |
| --- | --- | --- |
| Provider not available | Missing API key | Verify the provider key in `.env` |
| Streaming has no step events | Scrape runtime failed early | Inspect `/api/scrape/{session_id}/status` |
| Inference exits with failure | Missing `HF_TOKEN` or endpoint mismatch | Verify `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME` |
| No frontend data | Backend not reachable from frontend | Check `VITE_API_PROXY_TARGET` / backend health |

## License

MIT.