---
title: ScrapeRL
emoji: 🌖
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
---
# scraperl
ScrapeRL is an AI-first web-scraping platform that combines reinforcement-learning-style episodes, multi-agent planning, dynamic tool/plugin calls, and multi-provider LLM routing. It supports synchronous and streaming scrape APIs, session-based execution, real-time frontend updates, and OpenEnv-compatible inference.
## what-this-project-delivers
| area | capability |
|---|---|
| scraping-runtime | endpoint-driven scraping with json, csv, markdown, and text output modes |
| ai-routing | provider/model routing across OpenAI, Anthropic, Google, Groq, and NVIDIA |
| agentic-tooling | registry-based runtime tool planning and execution with streamed tool_call steps |
| memory | short-term, working, long-term, and shared memory layers |
| interface | React + Vite dashboard with live stream progress and session visibility |
| deployment | local dev, Docker Compose, and Hugging Face Space-compatible Docker setup |
| evaluation | root inference.py following strict [START]/[STEP]/[END] OpenEnv output contract |
## system-topology

```mermaid
flowchart TD
    A[frontend-dashboard] --> B[fastapi-control-plane]
    B --> C[episode-runtime]
    B --> D[scrape-runtime]
    B --> E[agent-runtime]
    E --> F[model-router]
    E --> G[tool-and-plugin-registry]
    E --> H[memory-manager]
    D --> G
    D --> H
    B --> I[websocket-and-sse-streams]
```
## repository-layout

```text
scrapeRL/
  backend/
    app/
      api/routes/   # FastAPI route modules
      agents/       # agent planning/runtime logic
      models/       # model router + provider adapters
      plugins/      # plugin registry + runtime integrations
      memory/       # memory layers and manager
      core/         # env/reward/observation/action foundations
    requirements.txt
  frontend/
    src/            # React app
    package.json
  docs/             # modular technical documentation
  inference.py      # OpenEnv-compliant inference runner
  docker-compose.yml
  .env.example
```
## quick-start

### docker-compose

```bash
git clone https://github.com/NeerajCodz/scrapeRL.git
cd scrapeRL
cp .env.example .env
# set api keys in .env
docker compose up --build
```
| service | url |
|---|---|
| frontend | http://localhost:3000 |
| backend-api | http://localhost:8000 |
| swagger | http://localhost:8000/swagger |
## local-development

Backend:

```bash
cd backend
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

Frontend:

```bash
cd frontend
npm install
npm run dev -- --host 0.0.0.0 --port 3000
```
## configuration

Root configuration lives in `.env` (template: `.env.example`).

### provider-and-model-keys

| variable | purpose |
|---|---|
| OPENAI_API_KEY | OpenAI chat + embeddings access |
| ANTHROPIC_API_KEY | Anthropic model access |
| GOOGLE_API_KEY | Google provider and embeddings access |
| GEMINI_API_KEY | alias key used by tests/compose for Gemini |
| GROQ_API_KEY | Groq provider access |
| NVIDIA_API_KEY | NVIDIA provider access |
| NVIDIA_BASE_URL | NVIDIA OpenAI-compatible endpoint base URL |
| GEMINI_MODEL_EMBEDDING | embedding model id for Google embeddings |
| HF_TOKEN | required token for inference.py OpenAI client auth |
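A minimal `.env` sketch with illustrative placeholder values (the variable names follow the table above; the authoritative template is `.env.example`):

```ini
# provider keys — placeholders, replace with real values
OPENAI_API_KEY=sk-your-key-here
ANTHROPIC_API_KEY=your-anthropic-key
GOOGLE_API_KEY=your-google-key
GROQ_API_KEY=your-groq-key
HF_TOKEN=hf_your_token_here
```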
### app-runtime

| variable | default |
|---|---|
| DEBUG | false |
| LOG_LEVEL | INFO |
| HOST | 0.0.0.0 |
| PORT | 8000 |
| CORS_ORIGINS | ["http://localhost:5173","http://localhost:3000"] |
| SESSION_TIMEOUT | 3600 |
| MEMORY_TTL | 86400 |
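These settings can be read with plain `os.getenv`, falling back to the documented defaults; a sketch (the variable names and defaults come from the table above, the helper function itself is illustrative, not part of the codebase):

```python
import json
import os


def load_app_settings(env=None):
    """Read app-runtime settings, falling back to the documented defaults."""
    if env is None:
        env = os.environ
    return {
        "debug": env.get("DEBUG", "false").lower() == "true",
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "host": env.get("HOST", "0.0.0.0"),
        "port": int(env.get("PORT", "8000")),
        # CORS_ORIGINS is stored as a JSON array string
        "cors_origins": json.loads(
            env.get("CORS_ORIGINS", '["http://localhost:5173","http://localhost:3000"]')
        ),
        "session_timeout": int(env.get("SESSION_TIMEOUT", "3600")),
        "memory_ttl": int(env.get("MEMORY_TTL", "86400")),
    }
```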
### inference-runtime

| variable | default |
|---|---|
| API_BASE_URL | https://api.openai.com/v1 |
| MODEL_NAME | gpt-4.1-mini |
| ENV_API_BASE_URL | http://localhost:8000/api |
| TASK_NAME | task_001 |
| BENCHMARK | openenv |
| MAX_STEPS | 12 |
| EPISODE_SEED | 42 |
| LLM_TEMPERATURE | 0.0 |
| PROMPT_HTML_LIMIT | 5000 |
| REQUEST_TIMEOUT_SECONDS | 30 |
| USE_OPENENV_SDK | true |
## inferencepy-openenv-contract

The root `inference.py` uses `from openai import OpenAI` for all LLM calls and emits strict structured logs:

```text
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>
```

Run:

```bash
python inference.py --task task_001 --benchmark openenv
```
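A consumer of these logs (e.g. an evaluation harness) can recover the per-step records with a small regex; a sketch, where the line format follows the contract above and the parser itself is illustrative:

```python
import re

# matches the [STEP] line format from the OpenEnv output contract
STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>.+?) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) "
    r"error=(?P<error>.+)"
)


def parse_step(line):
    """Parse one [STEP] log line into a dict, or return None if it doesn't match."""
    m = STEP_RE.match(line)
    if m is None:
        return None
    return {
        "step": int(m.group("step")),
        "action": m.group("action"),
        "reward": float(m.group("reward")),
        "done": m.group("done") == "true",
        "error": None if m.group("error") == "null" else m.group("error"),
    }
```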
## api-quick-map

Use `docs/api-reference.md` for the full endpoint inventory. Core surfaces:
| surface | endpoints |
|---|---|
| health | /api/health, /api/ready, /api/ping |
| episode | /api/episode/reset, /api/episode/step, /api/episode/state/{episode_id} |
| scrape | /api/scrape/stream, /api/scrape/{session_id}/status, /api/scrape/{session_id}/result |
| agents-tools-memory | /api/agents/*, /api/tools/*, /api/plugins/*, /api/memory/* |
| realtime | /ws/episode/{episode_id} |
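A minimal polling sketch against the scrape surface. The endpoint paths come from the table above; the base URL, the JSON `status` field, and the terminal state names are assumptions to be verified against `docs/api-reference.md`:

```python
import json
import time
import urllib.request
from urllib.parse import urljoin

BASE_URL = "http://localhost:8000"  # assumed local backend (see quick-start table)


def status_url(session_id, base=BASE_URL):
    """Build the status endpoint URL for a scrape session."""
    return urljoin(base, f"/api/scrape/{session_id}/status")


def poll_status(session_id, interval=1.0, max_polls=30):
    """Poll a scrape session until it reports a terminal state.

    Assumes the response is JSON with a 'status' field and terminal values
    'completed' / 'failed' — check docs/api-reference.md for the real shape.
    """
    for _ in range(max_polls):
        with urllib.request.urlopen(status_url(session_id)) as resp:
            body = json.load(resp)
        if body.get("status") in ("completed", "failed"):
            return body
        time.sleep(interval)
    raise TimeoutError(f"session {session_id} did not finish")
```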
## documentation-map

| document | purpose |
|---|---|
| docs/overview.md | platform overview and navigation |
| docs/api-reference.md | authoritative HTTP and WebSocket reference |
| docs/architecture.md | system architecture and runtime planes |
| docs/openenv.md | OpenEnv environment contract |
| docs/tool-calls.md | streamed tool-call event patterns |
| docs/plugins.md | plugin registry and dynamic tool model |
| docs/memory.md | memory design and operations |
| docs/readme.md | docs index |
## testing-and-validation

Backend:

```bash
cd backend
pytest
```

Frontend:

```bash
cd frontend
npm run test
```
## deployment-notes

| mode | notes |
|---|---|
| docker-compose | preferred local full-stack run |
| hugging-face-space | the root README.md front matter configures a Docker SDK Space |
| direct-backend | run uvicorn app.main:app with .env configured |
## troubleshooting

| symptom | likely-cause | check |
|---|---|---|
| provider not available | missing api key | verify .env provider key |
| streaming has no step events | scrape runtime failed early | inspect /api/scrape/{session_id}/status |
| inference exits with failure | missing HF_TOKEN or endpoint mismatch | verify HF_TOKEN, API_BASE_URL, MODEL_NAME |
| no frontend data | backend not reachable from frontend | check VITE_API_PROXY_TARGET / backend health |
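The backend-reachability check in the last row can be scripted against the health endpoints from the api-quick-map; a sketch, where the probing helper and its error handling are illustrative:

```python
import urllib.error
import urllib.request

# health endpoints listed in the api-quick-map table
HEALTH_ENDPOINTS = ("/api/health", "/api/ready", "/api/ping")


def probe(base_url, path, timeout=5):
    """Return (path, HTTP status) on success, or (path, error string) if unreachable."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return (path, resp.status)
    except (urllib.error.URLError, OSError) as exc:
        return (path, f"unreachable: {exc}")


def check_backend(base_url="http://localhost:8000"):
    """Probe all health endpoints; any error result suggests the backend is down."""
    return [probe(base_url, p) for p in HEALTH_ENDPOINTS]
```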
## license
MIT.