---
title: ScrapeRL
emoji: 🌖
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
---

# scraperl

ScrapeRL is an AI-first web-scraping platform that combines reinforcement-learning-style episodes, multi-agent planning, dynamic tool/plugin calls, and multi-provider LLM routing. It supports synchronous and streaming scrape APIs, session-based execution, real-time frontend updates, and OpenEnv-compatible inference.

## what-this-project-delivers

| area | capability |
| --- | --- |
| scraping-runtime | endpoint-driven scraping with `json`, `csv`, `markdown`, and `text` output modes |
| ai-routing | provider/model routing across OpenAI, Anthropic, Google, Groq, and NVIDIA |
| agentic-tooling | registry-based runtime tool planning and execution with streamed `tool_call` steps |
| memory | short-term, working, long-term, and shared memory layers |
| interface | React + Vite dashboard with live stream progress and session visibility |
| deployment | local dev, Docker Compose, and Hugging Face Space-compatible Docker setup |
| evaluation | root `inference.py` following the strict `[START]/[STEP]/[END]` OpenEnv output contract |

## system-topology

```mermaid
flowchart TD
    A[frontend-dashboard] --> B[fastapi-control-plane]
    B --> C[episode-runtime]
    B --> D[scrape-runtime]
    B --> E[agent-runtime]
    E --> F[model-router]
    E --> G[tool-and-plugin-registry]
    E --> H[memory-manager]
    D --> G
    D --> H
    B --> I[websocket-and-sse-streams]
```

## repository-layout

```text
scrapeRL/
  backend/
    app/
      api/routes/   # FastAPI route modules
      agents/       # agent planning/runtime logic
      models/       # model router + provider adapters
      plugins/      # plugin registry + runtime integrations
      memory/       # memory layers and manager
      core/         # env/reward/observation/action foundations
    requirements.txt
  frontend/
    src/            # React app
    package.json
  docs/             # modular technical documentation
  inference.py      # OpenEnv-compliant inference runner
  docker-compose.yml
  .env.example
```

## quick-start

### docker-compose

```bash
git clone https://github.com/NeerajCodz/scrapeRL.git
cd scrapeRL
cp .env.example .env   # set api keys in .env
docker compose up --build
```

| service | url |
| --- | --- |
| frontend | `http://localhost:3000` |
| backend-api | `http://localhost:8000` |
| swagger | `http://localhost:8000/swagger` |

### local-development

Backend:

```bash
cd backend
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

Frontend:

```bash
cd frontend
npm install
npm run dev -- --host 0.0.0.0 --port 3000
```

## configuration

Root configuration lives in `.env` (template: `.env.example`).

### provider-and-model-keys

| variable | purpose |
| --- | --- |
| `OPENAI_API_KEY` | OpenAI chat + embeddings access |
| `ANTHROPIC_API_KEY` | Anthropic model access |
| `GOOGLE_API_KEY` | Google provider and embeddings access |
| `GEMINI_API_KEY` | alias key used by tests/compose for Gemini |
| `GROQ_API_KEY` | Groq provider access |
| `NVIDIA_API_KEY` | NVIDIA provider access |
| `NVIDIA_BASE_URL` | NVIDIA OpenAI-compatible endpoint base URL |
| `GEMINI_MODEL_EMBEDDING` | embedding model id for Google embeddings |
| `HF_TOKEN` | required token for `inference.py` OpenAI client auth |

### app-runtime

| variable | default |
| --- | --- |
| `DEBUG` | `false` |
| `LOG_LEVEL` | `INFO` |
| `HOST` | `0.0.0.0` |
| `PORT` | `8000` |
| `CORS_ORIGINS` | `["http://localhost:5173","http://localhost:3000"]` |
| `SESSION_TIMEOUT` | `3600` |
| `MEMORY_TTL` | `86400` |

### inference-runtime

| variable | default |
| --- | --- |
| `API_BASE_URL` | `https://api.openai.com/v1` |
| `MODEL_NAME` | `gpt-4.1-mini` |
| `ENV_API_BASE_URL` | `http://localhost:8000/api` |
| `TASK_NAME` | `task_001` |
| `BENCHMARK` | `openenv` |
| `MAX_STEPS` | `12` |
| `EPISODE_SEED` | `42` |
| `LLM_TEMPERATURE` | `0.0` |
| `PROMPT_HTML_LIMIT` | `5000` |
| `REQUEST_TIMEOUT_SECONDS` | `30` |
| `USE_OPENENV_SDK` | `true` |

## inferencepy-openenv-contract

The root `inference.py` uses `from openai import OpenAI` for all LLM
calls and emits strict structured logs:

```text
[START] task= env= model=
[STEP] step= action= reward=<0.00> done= error=
[END] success= steps= rewards=
```

Run:

```bash
python inference.py --task task_001 --benchmark openenv
```

## api-quick-map

Use `docs/api-reference.md` for the full endpoint inventory. Core surfaces:

| surface | endpoints |
| --- | --- |
| health | `/api/health`, `/api/ready`, `/api/ping` |
| episode | `/api/episode/reset`, `/api/episode/step`, `/api/episode/state/{episode_id}` |
| scrape | `/api/scrape/stream`, `/api/scrape/{session_id}/status`, `/api/scrape/{session_id}/result` |
| agents-tools-memory | `/api/agents/*`, `/api/tools/*`, `/api/plugins/*`, `/api/memory/*` |
| realtime | `/ws/episode/{episode_id}` |

## documentation-map

| document | purpose |
| --- | --- |
| `docs/overview.md` | platform overview and navigation |
| `docs/api-reference.md` | authoritative HTTP and WebSocket reference |
| `docs/architecture.md` | system architecture and runtime planes |
| `docs/openenv.md` | OpenEnv environment contract |
| `docs/tool-calls.md` | streamed tool-call event patterns |
| `docs/plugins.md` | plugin registry and dynamic tool model |
| `docs/memory.md` | memory design and operations |
| `docs/readme.md` | docs index |

## testing-and-validation

Backend:

```bash
cd backend
pytest
```

Frontend:

```bash
cd frontend
npm run test
```

## deployment-notes

| mode | notes |
| --- | --- |
| docker-compose | preferred local full-stack run |
| hugging-face-space | root `README.md` front matter + Docker SDK config is compatible |
| direct-backend | run `uvicorn app.main:app` with `.env` configured |

## troubleshooting

| symptom | likely-cause | check |
| --- | --- | --- |
| provider not available | missing api key | verify `.env` provider key |
| streaming has no step events | scrape runtime failed early | inspect `/api/scrape/{session_id}/status` |
| inference exits with failure | missing `HF_TOKEN` or endpoint mismatch | verify `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME` |
| no frontend data | backend not reachable from frontend | check `VITE_API_PROXY_TARGET` / backend health |

## license

MIT.
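## appendix-log-contract-sketch

As an illustration of the `[START]/[STEP]/[END]` output contract described above, here is a minimal, hypothetical parser for those structured log lines. The function name and the assumption that each line carries space-separated `key=value` pairs after the bracketed tag are ours, not part of the project; the exact value formats emitted by `inference.py` may differ.

```python
import re

def parse_openenv_log(lines):
    """Collect [START]/[STEP]/[END] log lines into a single episode dict.

    Assumes (hypothetically) that each tagged line is followed by
    space-separated key=value pairs, matching the contract template.
    """
    episode = {"steps": []}
    tag_re = re.compile(r"^\[(START|STEP|END)\]\s*(.*)$")
    for line in lines:
        m = tag_re.match(line.strip())
        if not m:
            continue  # ignore anything outside the contract
        tag, rest = m.groups()
        # Split "k=v" tokens; a bare "error=" yields an empty-string value.
        fields = dict(
            pair.split("=", 1) for pair in rest.split() if "=" in pair
        )
        if tag == "START":
            episode["start"] = fields
        elif tag == "STEP":
            episode["steps"].append(fields)
        else:  # END
            episode["end"] = fields
    return episode

# Example with made-up values following the template above.
log = [
    "[START] task=task_001 env=openenv model=gpt-4.1-mini",
    "[STEP] step=1 action=scrape reward=0.50 done=false error=",
    "[END] success=true steps=1 rewards=0.50",
]
result = parse_openenv_log(log)
```

A consumer that validates benchmark runs could assert, for example, that `result["end"]` is present and that `len(result["steps"])` matches the reported step count.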