
agentic-scraper-sandbox-plugin-execution-report

goal

Enable the scraper to act as an agent that can:

  • search from non-URL prompts,
  • navigate and scrape links,
  • execute plugin-based Python analysis (numpy, pandas, bs4) safely,
  • run in a sandboxed per-request environment with cleanup.

what-was-implemented

  • Added sandbox plugin executor: backend/app/plugins/python_sandbox.py
    • AST safety validation (restricted imports and blocked dangerous calls/attributes)
    • isolated execution with python -I
    • per-request temp workspace
    • automatic cleanup after execution
  • Wired sandbox plugin execution into scrape flow (/api/scrape/stream and /api/scrape/ via shared pipeline):
    • mcp-python-sandbox
    • proc-python
    • proc-pandas
    • proc-numpy
    • proc-bs4
  • Added optional request field:
    • python_code (sandboxed code, must assign result)
  • Enhanced non-URL asset resolution:
    • MCP search attempt via DuckDuckGo provider
    • deterministic fallback resolution for scraper workflows
  • Updated plugin registry and installed plugin set for new plugins.
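
The executor behaviour listed above (isolated `python -I` run, per-request temp workspace, cleanup in `finally`) can be sketched roughly as follows. This is a minimal illustration, not the contents of `backend/app/plugins/python_sandbox.py`: the function name `run_in_sandbox`, the wrapper script, the JSON hand-off of `result`, and the timeout value are all assumptions made for the example.

```python
import json
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_sandbox(code: str, session_id: str = "demo", timeout: float = 10.0) -> dict:
    """Run `code` in a separate isolated interpreter and return its `result` value.

    Hypothetical helper: names and return shape are illustrative only.
    """
    # Per-request workspace, using the prefix pattern mentioned in the report.
    workspace = Path(tempfile.mkdtemp(prefix=f"scraperl-sandbox-{session_id}-"))
    try:
        script = workspace / "user_code.py"
        # Wrapper executes the user code, then prints `result` as JSON on the last line.
        wrapper = (
            "import json\n"
            "namespace = {}\n"
            f"exec(compile({code!r}, 'user_code', 'exec'), namespace)\n"
            "print(json.dumps({'result': namespace.get('result')}))\n"
        )
        script.write_text(wrapper, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workspace,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "error": "timeout", "workspace": str(workspace)}
        if proc.returncode != 0:
            return {"ok": False, "error": proc.stderr.strip(), "workspace": str(workspace)}
        payload = json.loads(proc.stdout.strip().splitlines()[-1])
        return {"ok": True, "result": payload["result"], "workspace": str(workspace)}
    finally:
        # Workspace is removed whether the run succeeded, failed, or timed out.
        shutil.rmtree(workspace, ignore_errors=True)
```

Calling `run_in_sandbox("result = sum(range(5))")` would return a dict whose `result` key holds the value assigned by the snippet; the workspace path is included only so a caller can confirm it no longer exists.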

safety-model

  • The sandbox runs in an isolated temporary directory per request (scraperl-sandbox-<session>-*).
  • Dangerous operations are blocked by static AST checks before execution (open, exec, eval, subprocess, os-style operations, dunder access, etc.).
  • No persistent artifacts are kept after a run (the workspace is removed in a finally cleanup step).
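
A static check of the kind described above might look like the sketch below. The allow-list, the blocked-call set, and the function name `validate_code` are assumptions chosen for illustration; the actual rules in the plugin may differ. The sketch also checks that the code assigns `result`, as the `python_code` field requires.

```python
import ast

# Assumed allow-list: the report names numpy, pandas, and bs4; the rest are illustrative.
ALLOWED_IMPORTS = {"numpy", "pandas", "bs4", "math", "json", "re", "statistics"}
# Assumed blocked builtins, based on the examples given in the safety model.
BLOCKED_CALLS = {"open", "exec", "eval", "compile", "__import__", "input"}


def validate_code(source: str) -> list[str]:
    """Return a list of problems found in `source`; an empty list means it passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    assigns_result = False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import not allowed: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            # Relative imports (level > 0) are rejected as well.
            if node.level or root not in ALLOWED_IMPORTS:
                problems.append(f"import not allowed: {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                problems.append(f"call not allowed: {node.func.id}()")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"dunder attribute not allowed: {node.attr}")
        elif isinstance(node, ast.Name):
            if node.id == "result" and isinstance(node.ctx, ast.Store):
                assigns_result = True

    if not assigns_result:
        problems.append("code must assign `result`")
    return problems
```

Because disallowed modules such as `os` or `subprocess` never make it past the import check, attribute calls on them are rejected indirectly. A static pass like this is typically paired with process isolation rather than relied on alone.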

one-request-validation-real-curl-n-runs

Each test was executed as a single POST request to /api/scrape/stream.

| Test | Status | Errors | URLs Processed | Python Analysis Present | Dataset Row Count |
| --- | --- | --- | --- | --- | --- |
| gold-csv-agentic | completed | 0 | 2 | true | 123 |
| ev-data-search-json | completed | 0 | 6 | true | - |
| direct-dataset-python-analysis | completed | 0 | 1 | true | 123 |
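
For orientation, a single validation request of this shape could be built as in the sketch below. Only the endpoint path and the `python_code` field come from this report; the `prompt` field name, the base URL, and the helper `build_scrape_request` are hypothetical, and the real request schema should be taken from the API reference. The request is constructed but not sent.

```python
import json
import urllib.request


def build_scrape_request(base_url: str, prompt: str, python_code: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the streaming scrape endpoint."""
    payload = {
        "prompt": prompt,            # hypothetical field name for the non-URL prompt
        "python_code": python_code,  # sandboxed code; must assign `result`
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/scrape/stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example: the base URL and prompt text are placeholders, not values from the test runs.
request = build_scrape_request(
    "http://localhost:8000",
    "monthly gold price trend",
    "result = {'rows': 0}",
)
```

Passing the returned object to `urllib.request.urlopen` would send it; how the streamed response is framed is not specified here.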

notes

  • The gold trend request produced monthly dataset rows from 2016 onward, with source links, in a single stream request.
  • Python plugin analysis was present in all validation scenarios.
  • The agent step stream included planner, search, navigator, extractor, and verifier events, plus sandbox analysis events.

document-flow

flowchart TD
    A[document] --> B[key-sections]
    B --> C[implementation]
    B --> D[operations]
    B --> E[validation]

related-api-reference

| item | value |
| --- | --- |
| api-reference | api-reference.md |