system-architecture

overview

WebScraper-OpenEnv is designed as a modular, dashboard-first RL environment with extensible APIs, MCP tools, and multi-model routing.

high-level-topology

Frontend Dashboard (React/Vite)
        |
        v
FastAPI Control Plane
  - episode lifecycle
  - action dispatch
  - reward engine
  - tool registry API
  - settings + policy
        |
        +--> Agent Runtime
        |      - planner/navigator/extractor/verifier
        |      - memory manager
        |      - model router
        |
        +--> MCP Gateway
        |      - tool discovery
        |      - lazy install/load
        |      - schema + timeout + retries
        |
        +--> Search Layer
        |      - provider routing
        |      - query optimization
        |      - credibility scoring
        |
        +--> Memory Layer
        |      - short/working/long/shared
        |      - vector index + persistent storage
        |
        +--> Observability
               - traces/logs/metrics/cost dashboard

core-subsystems

1-control-plane

Responsibilities:

reset/step/state APIs
request validation
action authorization and policy checks
deterministic episode management

2-agent-runtime

Responsibilities:

policy inference
strategy execution
fallback handling
action explainability

3-tooling-plane-mcp

Responsibilities:

dynamic tool registry
server health checks
lazy installation
composition workflows

3-5-site-template-layer

Responsibilities:

maintain inbuilt domain templates (backend/app/sites/)
map instructions/assets to known site behavior
provide reusable navigation goals/fields for planner and navigator agents
expose template catalog through /api/sites* endpoints

4-data-plane

Responsibilities:

HTML ingestion and chunking
extraction and normalization
verification and reconciliation
output persistence

5-analytics-plane

Responsibilities:

reward component logging
model/token/cost accounting
tool usage telemetry
memory quality analytics

processing-pipeline

1. reset(task_id, seed)
2. observation emitted
3. policy selects action
4. action executes (native/MCP/search/memory)
5. reward computed and logged
6. done check
7. repeat until terminal
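
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: `ToyEpisode`, `choose_action`, and `run_episode` are hypothetical names, the reward is a seeded random placeholder, and the policy is hard-coded. It only shows how a seed makes the reset/step cycle reproducible.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyEpisode:
    """Hypothetical stand-in for an episode; not the project's real class."""
    task_id: str
    seed: int
    max_steps: int = 5
    steps: int = 0
    rewards: list = field(default_factory=list)

    def __post_init__(self):
        # A per-episode RNG keeps runs reproducible for a given seed.
        self.rng = random.Random(self.seed)

    def observe(self) -> dict:
        return {"task_id": self.task_id, "step": self.steps}

    def step(self, action: str) -> tuple:
        self.steps += 1
        # Placeholder reward: a seeded random score, logged per step.
        reward = round(self.rng.random(), 3)
        self.rewards.append(reward)
        # Done check: explicit finish action or step budget exhausted.
        done = action == "finish" or self.steps >= self.max_steps
        return self.observe(), reward, done


def choose_action(observation: dict) -> str:
    # Placeholder policy: extract twice, then finish.
    return "finish" if observation["step"] >= 2 else "extract"


def run_episode(task_id: str, seed: int) -> list:
    episode = ToyEpisode(task_id, seed)      # reset(task_id, seed)
    observation = episode.observe()          # observation emitted
    done = False
    while not done:                          # repeat until terminal
        action = choose_action(observation)  # policy selects action
        observation, reward, done = episode.step(action)
    return episode.rewards
```

Calling `run_episode` twice with the same `task_id` and `seed` yields the same reward list, which is the property the control plane's deterministic episode management is meant to guarantee.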

batch-and-parallel-design

batch

large HTML split into semantic chunks
chunk extraction batched with bounded size
merge + dedupe + confidence rank
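
As a rough sketch of the batch path, the snippet below takes chunks that have already been split, processes them in bounded batches, and then merges, dedupes, and ranks the records by confidence. The `key=value@confidence` chunk format, the `extract` placeholder, and the function names are assumptions made for illustration; semantic chunking itself is not shown.

```python
from itertools import islice


def batched(items, size):
    """Yield lists of at most `size` items (bounded batch size)."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def extract(chunk: str) -> list:
    # Placeholder extractor: each "key=value@confidence" token becomes a record.
    records = []
    for token in chunk.split():
        pair, confidence = token.split("@")
        key, value = pair.split("=")
        records.append({"field": key, "value": value, "confidence": float(confidence)})
    return records


def merge_ranked(chunks, batch_size=2):
    best = {}
    for batch in batched(chunks, batch_size):
        for chunk in batch:
            for record in extract(chunk):
                key = (record["field"], record["value"])
                # Dedupe: keep the highest-confidence copy of each (field, value).
                if key not in best or record["confidence"] > best[key]["confidence"]:
                    best[key] = record
    # Confidence rank: most confident first.
    return sorted(best.values(), key=lambda r: r["confidence"], reverse=True)
```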

parallel

independent chunk tasks run concurrently
search and verification can run in parallel branches
configurable worker limits and queue priorities
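
One way to run independent chunk tasks concurrently under a configurable worker limit is an `asyncio.Semaphore`. The sketch below assumes an asyncio-based worker; `process_chunk`, `run_parallel`, and the `max_workers` parameter are hypothetical names, and the sleep stands in for real extraction work.

```python
import asyncio


async def process_chunk(chunk_id: int, limit: asyncio.Semaphore, stats: dict) -> int:
    async with limit:  # at most `max_workers` chunks run at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stands in for real extraction work
        stats["active"] -= 1
        return chunk_id * 2


async def run_parallel(chunk_ids, max_workers: int = 3):
    limit = asyncio.Semaphore(max_workers)
    stats = {"active": 0, "peak": 0}
    # gather preserves input order even though tasks finish independently.
    results = await asyncio.gather(
        *(process_chunk(c, limit, stats) for c in chunk_ids)
    )
    return results, stats["peak"]
```

`asyncio.run(run_parallel(range(10), max_workers=3))` returns the results in input order together with the peak number of concurrently active tasks, which never exceeds the limit.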

queue-and-scheduler

Task queue supports:

priority classes (high, normal, low)
cancellation tokens
retry policy with backoff
dead-letter queue for repeated failures
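
The four queue features can be illustrated with a small in-process scheduler. This is a sketch under stated assumptions, not the project's queue: the `TaskQueue` class, its method names, and the use of task names as cancellation tokens are all hypothetical, and the backoff delay is computed but not actually slept.

```python
import heapq
import itertools

PRIORITY = {"high": 0, "normal": 1, "low": 2}


class TaskQueue:
    def __init__(self, max_attempts=3, base_delay=0.5):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a class
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.cancelled = set()   # cancellation tokens (task names in this sketch)
        self.dead_letter = []    # tasks that exhausted their retries

    def submit(self, name, func, priority="normal", attempt=1):
        entry = (PRIORITY[priority], next(self._counter), name, func, priority, attempt)
        heapq.heappush(self._heap, entry)

    def cancel(self, name):
        self.cancelled.add(name)

    def backoff(self, attempt):
        # Exponential backoff: base, 2x base, 4x base, ...
        return self.base_delay * 2 ** (attempt - 1)

    def run(self):
        completed = []
        while self._heap:
            _, _, name, func, priority, attempt = heapq.heappop(self._heap)
            if name in self.cancelled:
                continue  # cancelled tasks are dropped without running
            try:
                func()
                completed.append(name)
            except Exception:
                if attempt >= self.max_attempts:
                    self.dead_letter.append(name)
                else:
                    # A real scheduler would wait self.backoff(attempt) before re-queueing.
                    self.submit(name, func, priority, attempt + 1)
        return completed
```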

storage-architecture

Episode state: in-memory + optional persistence
Long-term memory: vector DB + metadata store
Logs/metrics: append-only time-series-friendly sink
Exports: JSON/CSV trace packs

backend-folder-notes-template-system

backend/app/sites/
  - models.py      # SiteTemplate dataclass
  - templates.py   # 50+ inbuilt site templates
  - registry.py    # list/get/match/serialize helpers
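
To make the split between `models.py` and `registry.py` concrete, here is a minimal sketch of what a template dataclass and its list/get/match/serialize helpers might look like. Only the `SiteTemplate` name and the four helper roles come from the notes above; the field names, the two example templates, and the domain-matching rule are assumptions for illustration.

```python
from dataclasses import asdict, dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class SiteTemplate:
    # All field names here are illustrative assumptions.
    site_id: str
    domains: tuple
    goals: tuple = ()
    fields: tuple = ()


# Hypothetical catalog; the real one is described as holding 50+ templates.
_TEMPLATES = {
    t.site_id: t
    for t in (
        SiteTemplate("example-shop", ("shop.example.com",), ("find product",), ("title", "price")),
        SiteTemplate("example-news", ("news.example.org",), ("open article",), ("headline", "author")),
    )
}


def list_templates() -> list:
    return sorted(_TEMPLATES)


def get_template(site_id: str):
    return _TEMPLATES.get(site_id)


def match_template(url: str):
    # Match on exact host or any subdomain of a template's domains.
    host = urlparse(url).hostname or ""
    for template in _TEMPLATES.values():
        if any(host == d or host.endswith("." + d) for d in template.domains):
            return template
    return None


def serialize(template: SiteTemplate) -> dict:
    return asdict(template)
```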

reliability

per-tool timeout and retry
per-step safety budget
circuit breaker for failing providers
deterministic fallback chains
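
The circuit breaker and fallback chain can be combined as in the sketch below. It is deliberately simplified: the `CircuitBreaker` class and `call_with_fallback` function are hypothetical, the breaker trips after a fixed number of consecutive failures, and the reset timeout (half-open state) that a production breaker would need is omitted.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=2):
        self.failure_threshold = failure_threshold
        self.failures = 0  # consecutive failures seen so far

    @property
    def open(self):
        return self.failures >= self.failure_threshold

    def call(self, func, *args):
        if self.open:
            # Fail fast instead of hitting a provider that keeps failing.
            raise RuntimeError("circuit open")
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result


def call_with_fallback(providers, query):
    """Try providers in a fixed order so the chain is deterministic."""
    for name, breaker, func in providers:
        try:
            return name, breaker.call(func, query)
        except Exception:
            continue  # move on to the next provider in the chain
    raise RuntimeError("all providers failed")
```

Because the provider list is ordered and walked front to back, the same failure pattern always selects the same fallback.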

security

API key vaulting via env/config secrets
MCP allowlist
output sanitization
redaction of sensitive tokens in logs
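
Log redaction could be implemented as a standard `logging.Filter`, as in this minimal sketch. The regular expression is an assumption that only covers `key=value` and `key: value` style secrets whose names contain `api_key`, `token`, or `secret`; a real deployment would need patterns that match its own credential formats.

```python
import logging
import re

# Assumed pattern: names containing api_key/token/secret, followed by = or : and a value.
SECRET_PATTERN = re.compile(r"(?i)(api[_-]?key|token|secret)(\s*[=:]\s*)(\S+)")


def redact(text: str) -> str:
    # Keep the key name and separator, replace only the value.
    return SECRET_PATTERN.sub(r"\1\2[REDACTED]", text)


class RedactingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so secrets passed as args are caught too.
        record.msg = redact(record.getMessage())
        record.args = None
        return True  # never drop the record, only rewrite it
```

Attaching the filter with `handler.addFilter(RedactingFilter())` applies it to every record that handler emits.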

deployment

Single-container baseline:

frontend static build served by API backend
optional sidecars for DB/vector/MCP infra

Scale-out profile:

separate API and worker pools
managed vector DB
queue-backed distributed execution
central observability backend

compatibility-goals

local dev mode with minimal dependencies
cloud mode with managed infra
optional self-hosted LLM endpoints

future-architecture-extensions

distributed multi-agent graph execution
adaptive autoscaling by queue pressure
global memory federation across projects

api-reference-alignment

| architecture-plane | primary-endpoints |
| --- | --- |
| control-plane | `/api/health`, `/api/ready`, `/api/settings`, `/api/tasks` |
| episode-runtime | `/api/episode/reset`, `/api/episode/step`, `/api/episode/state/{episode_id}` |
| agent-runtime | `/api/agents/`, `/api/providers/` |
| tooling-memory | `/api/tools/`, `/api/plugins/`, `/api/memory/*` |
| scraping-runtime | `/api/scrape/stream`, `/api/scrape/{session_id}/result`, `/ws/episode/{episode_id}` |

Use api-reference.md as the authoritative endpoint inventory.

document-metadata

| key | value |
| --- | --- |
| document | `architecture.md` |
| status | active |

document-flow

flowchart TD
    A[document] --> B[key-sections]
    B --> C[implementation]
    B --> D[operations]
    B --> E[validation]