Spaces:

hjerpe
/

sql_env

Running

App Files Files Community

sql_env / docs /ARCHITECTURE.md

hjerpe

Upload folder using huggingface_hub

a001a97 verified 4 days ago

preview code

raw

history blame contribute delete

18.7 kB

	# Architecture

	> Last updated: 2026-03-29

	System map for SQLEnv, an RL environment where agents learn interactive SQL exploration via the OpenEnv framework.

	Goals:
	- Show how components connect (system map + key flows)
	- Make hidden state explicit (what lives where)
	- Define shared interfaces (Pydantic models, HTTP API)
	- Keep invariants legible (what must stay true)

	Non-goals:
	- Exhaustive API reference
	- Training hyperparameter tuning guide

	---

	## System Map

	```text
	SQLEnv System
	================================================================

	RL Training SQLEnv Server (Docker)
	───────────── ──────────────────────
	+──────────────+ +─────────────────────+
	│ TRL GRPO │ │ server/app.py │
	│ Trainer │ HTTP (JSON) │ FastAPI + OpenEnv │
	│ │<========================>│ │
	│ training/ │ SQLAction -> server +──────────┬──────────+
	│ trl_adapter │ SQLObs <- server │
	│ .py │ v
	+──────────────+ +─────────────────────+
	│ │ SQLEnvironment │
	│ OR │ (sql_environment.py)│
	v │ │
	+──────────────+ │ reset() / step() │
	│ Custom │ │ action dispatch │
	│ rollout_func │ +──┬──────┬──────┬────+
	│ (rollout.py) │ │ │ │
	+──────────────+ v v v
	+────────────────────────+
	Evaluation │ Action Handlers │
	────────── │ DESCRIBE → PRAGMA │
	+──────────────+ │ SAMPLE → SELECT N │
	│ evaluate() │──> env.reset/step │ QUERY → SQL exec │
	│ policies │ │ ANSWER → verifier │
	│ .py │ +────────┬───────────────+
	+──────────────+ │
	│ v
	+──────────────+ +────────────────────────+
	│ Policies │ │ SQLite (read-only) │
	│ RandomPolicy │ │ data/databases/ │
	│ OraclePolicy │ │ {db_id}/{db_id}.sqlite │
	+──────────────+ +────────────────────────+
	│
	┌───────┴───────┐
	v v
	+───────────+ +───────────+
	│ reward.py │ │verifier.py│
	│ 3-layer │ │ type-aware│
	│ dense │ │ comparison│
	+───────────+ +───────────+

	Data (committed) Synthetic (optional)
	──────────────── ────────────────────
	data/questions/ server/synthetic/
	questions_train.json (473 Q) generate.py
	questions_eval.json (203 Q) mutations.py
	db_list.json (10 databases) validate.py
	```

	---

	## Component Inventory

	\| Component \| Owns \| Entrypoint \| State / Output \|
	\|-----------\|------\|------------\|----------------\|
	\| SQLEnvironment \| Episode lifecycle, action dispatch, step budget \| `server/sql_environment.py` \| `EpisodeContext` (in-memory, per episode) \|
	\| FastAPI app \| HTTP endpoints, tokenizer factory \| `server/app.py` \| Stateless (delegates to environment) \|
	\| SQLEnvClient \| HTTP transport, payload serialization \| `client.py` \| Stateless (wraps server) \|
	\| Pydantic models \| Type contracts (action, observation, state) \| `models.py` \| N/A (data classes) \|
	\| Reward engine \| 3-layer dense reward computation \| `server/reward.py` \| Mutates `EpisodeContext` accumulators \|
	\| Answer verifier \| Type-aware answer comparison \| `server/verifier.py` \| Stateless (pure function) \|
	\| GRPO pipeline \| Training orchestration, rollout, reward callables \| `training/` (6 modules) \| Training artifacts in `outputs/` \|
	\| TRL adapter \| `environment_factory` for TRL GRPOTrainer \| `training/trl_adapter.py` \| Per-session environment instances \|
	\| Evaluation \| Policy protocol, evaluate() runner \| `evaluation/policies.py` \| `EvaluationResult` metrics \|
	\| Oracle policy \| Deterministic upper-bound baseline \| `evaluation/oracle_policy.py` \| Stateless per-step \|
	\| Synthetic DB gen \| Metamorphic testing via data mutations \| `server/synthetic/` \| Variant SQLite files \|
	\| Question dataset \| 676 curated Spider questions across 10 DBs \| `data/questions/` \| JSON files \|

	### External Dependencies

	\| Dependency \| Purpose \| Required \|
	\|------------\|---------\|----------\|
	\| SQLite (stdlib) \| Database execution \| Yes \|
	\| OpenEnv (`openenv-core`) \| Environment protocol, `create_app` \| Yes \|
	\| TRL (`trl`) \| GRPO training \| Only for training \|
	\| HuggingFace Transformers \| Tokenizer loading \| Only for production server \|

	---

	## Key Flows

	### Flow: Episode (Reset + Multi-Turn Steps)

	```text
	Client / Policy SQLEnvironment
	│ │
	│── reset(seed=42) ────────────────> │
	│ │── pick question (random or seeded)
	│ │── open read-only SQLite connection
	│ │── execute gold_sql → store gold_rows
	│ │── init EpisodeContext (budget=15)
	│ <── SQLObservation ────────────────│
	│ .question="How many students?" │
	│ .schema_info="Tables: student" │ (column details hidden)
	│ .budget_remaining=15 │
	│ │
	│── step(DESCRIBE student) ────────> │
	│ │── PRAGMA table_info(student)
	│ │── add to described_tables
	│ │── compute_step_reward()
	│ <── SQLObservation ────────────────│
	│ .schema_info="student: id INT" │ (columns now revealed)
	│ .result="5 columns, 20 rows" │
	│ .reward=0.02 │
	│ .budget_remaining=14 │
	│ │
	│── step(QUERY "SELECT COUNT(*)...") │
	│ │── validate (SELECT-only, single stmt)
	│ │── execute with 5s timeout
	│ │── compute_step_reward() (L1 + L2)
	│ <── SQLObservation ────────────────│
	│ .result="\| COUNT(*) \|\n\| 20 \|" │
	│ .reward=0.035 │
	│ │
	│── step(ANSWER "20") ─────────────> │
	│ │── verify_answer("20", gold, type)
	│ │── terminal reward: +1.0 or 0.0
	│ <── SQLObservation ────────────────│
	│ .done=true │
	│ .reward=1.0 │
	```

	### Flow: 3-Layer Reward Computation

	```text
	step() called with action
	│
	v
	Layer 1: Operational Shaping (every action)
	├── exec_ok? → +0.02
	├── new SQL hash? → +0.01 (per unique query, no cumulative cap)
	├── repeated SQL? → -0.01
	└── step cost → -0.005
	│
	v (only if action_type == QUERY and no error)
	Layer 2: Progress Shaping (delta-from-previous, PBRS)
	├── cardinality score (25%) — \|pred_rows - gold_rows\| / max
	├── value overlap (50%) — Jaccard of cell values
	└── numeric range (25%) — log-distance proximity
	│
	v
	bin to {0.0, 0.25, 0.5, 0.75, 1.0}
	delta = binned - previous_progress → delta * 0.15
	(positive = improvement, negative = regression)
	│
	v
	Clip per step to [-0.05, +0.15]
	No cumulative tracking
	│
	v (on ANSWER action)
	Layer 3: Terminal Correctness
	└── verify_answer() → +1.0 (correct) or 0.0 (wrong)
	```

	### Flow: TRL Training Integration

	```text
	GRPOTrainer
	│
	│── discovers tool methods via docstrings
	│ (describe, sample, query, answer)
	│
	│── per rollout:
	│ SQLEnvTRL() → SQLEnvironment (internal)
	│ .reset() → observation string
	│ .describe(table) → schema string
	│ .query(sql) → result string
	│ .answer(value) → final string
	│
	│── reward:
	│ sql_env_reward_func() → accumulated .reward
	│
	v
	Training loop (GRPO: generate N completions, rank by reward)
	```

	---

	## Shared Data Models

	Defined in `models.py`. These cross the HTTP boundary between client and server.

	### SQLAction (agent -> server)

	```python
	class SQLAction(Action):
	action_type: str # DESCRIBE \| SAMPLE \| QUERY \| ANSWER
	argument: str # table name, SQL string, or answer value
	```

	### SQLObservation (server -> agent)

	```python
	class SQLObservation(Observation):
	question: str # NL question to answer
	schema_info: str # known schema (incrementally revealed)
	result: str # last action result (truncated)
	error: str # error message if action failed
	step_count: int # current step number
	budget_remaining: int # steps left
	action_history: list[str] # summary of prior actions
	# Inherited: done (bool), reward (float \| None)
	```

	### SQLState (metadata endpoint)

	```python
	class SQLState(State):
	history_messages: list[Message]
	current_action_type: str
	```

	### Server-Only Types (never sent to agent)

	```python
	@dataclass
	class QuestionRecord:
	question_id: str
	question_text: str
	database_name: str
	gold_sql: str
	gold_answer: str
	answer_type: str # integer \| float \| string \| list
	difficulty: str # easy \| medium \| hard
	tables_involved: list[str]

	@dataclass
	class EpisodeContext:
	episode_id: str
	db_connection: sqlite3.Connection
	question_record: QuestionRecord
	step_count: int = 0
	budget: int = 15
	described_tables: set[str]
	action_log: list[str]
	done: bool = False
	gold_answer: str \| None
	gold_rows: list[tuple]
	# Reward accumulators
	query_hashes: set[str]
	best_progress: float = 0.0
	cumulative_step_reward: float = 0.0
	cumulative_new_info_reward: float = 0.0
	```

	POMDP design: The agent sees `SQLObservation`. The server holds `EpisodeContext`. The agent never sees gold answers, progress scores, or the full database. This separation forces exploration.

	---

	## API Contracts

	### HTTP (OpenEnv Protocol)

	The server exposes HTTP endpoints via `openenv.core.env_server.create_app()`.

	\| Operation \| Method \| Payload \| Response \|
	\|-----------\|--------\|---------\|----------\|
	\| Reset \| `POST /reset` \| `{seed: int}` (optional) \| `SQLObservation` (JSON) \|
	\| Step \| `POST /step` \| `{action_type, argument, metadata}` \| `{observation, reward, done, info}` \|
	\| State \| `GET /state` \| — \| `SQLState` (JSON) \|

	### Evaluation API

	```python
	# Policy protocol
	class Policy(Protocol):
	def select_action(self, observation: SQLObservation) -> SQLAction: ...

	# Built-in policies
	RandomPolicy() # random baseline
	OraclePolicy(questions) # gold-answer upper bound

	# Runner
	evaluate(env, policy, n_episodes, seed) -> EvaluationResult
	# .success_rate, .avg_reward, .avg_steps, .episodes[]
	```

	### TRL Adapter API

	```python
	SQLEnvTRL.configure(questions_path, db_dir, step_budget) # class method
	# Tool methods (auto-discovered by TRL):
	SQLEnvTRL.describe(table_name: str) -> str
	SQLEnvTRL.sample(table_name: str) -> str
	SQLEnvTRL.query(sql: str) -> str
	SQLEnvTRL.answer(value: str) -> str
	```

	---

	## Cross-Cutting Concerns

	### SQL Safety

	All database access enforces:
	- Read-only SQLite connections (`file:...?mode=ro`)
	- SELECT-only — rejects INSERT, UPDATE, DELETE, ALTER, DROP
	- Single statement — rejects `; ...` (no stacked queries)
	- 5-second timeout via SQLite progress handler
	- 20-row truncation on all result sets

	### POMDP Structure

	The partial observability is deliberate and load-bearing:
	- Agent sees table names at reset but not column details (must DESCRIBE)
	- Query results are truncated (at most 20 rows)
	- Agent never sees `gold_answer`, `best_progress`, or `gold_rows`
	- Step budget (default 15) forces strategic allocation of exploration

	### Import Compatibility

	Dual import paths throughout for local vs Docker execution:
	```python
	try:
	from sql_env.models import SQLAction # local / pip install
	except ImportError:
	from models import SQLAction # Docker (PYTHONPATH=/app/env)
	```

	### Configuration

	\| Variable \| Required \| Default \| Purpose \|
	\|----------\|----------\|---------\|---------\|
	\| `QUESTIONS_PATH` \| No \| `data/questions/student_assessment.json` \| Questions JSON \|
	\| `DB_DIR` \| No \| `data/databases/` \| SQLite database directory \|
	\| `TOKENIZER_NAME` \| No \| `mistralai/Mistral-7B-Instruct-v0.1` \| HuggingFace tokenizer \|
	\| `PORT` \| No \| `8000` \| Server port \|

	---

	## Data, State, and Storage

	### Committed Data

	\| Path \| Contents \|
	\|------\|----------\|
	\| `data/questions/questions_train.json` \| 473 training questions across 10 DBs \|
	\| `data/questions/questions_eval.json` \| 203 evaluation questions across 10 DBs \|
	\| `data/questions/db_list.json` \| 10 Spider database IDs \|
	\| `data/databases/models.py` \| Legacy SQLAlchemy ORM models \|

	### Downloaded Data (gitignored)

	Spider SQLite databases in `data/databases/{db_id}/{db_id}.sqlite`. Downloaded via `scripts/download_spider_databases.py`. The 10 databases: student_assessment, concert_singer, world_1, car_1, employee_hire_evaluation, pets_1, cre_Doc_Template_Mgt, dog_kennels, flight_2, poker_player.

	### Runtime State (in-memory, per episode)

	`EpisodeContext` holds all episode state: DB connection, gold data, reward accumulators, action history. Created on `reset()`, discarded when episode ends. Nothing persists between episodes.

	---

	<!-- ARCHITECTURE-SNAPSHOT-BEGIN -->

	Snapshot (auto-managed)

	- Repo signals: Python (pyproject.toml)
	- Roots: tests/
	- Entrypoint candidates: (none detected)

	```text
	tests/
	e2e/
	test_training_e2e.py
	integration/
	test_training_pipeline.py
	unit/
	test_error_handling.py
	test_grpo_config.py
	test_oracle_policy.py
	test_prompts.py
	test_reward.py
	test_rewards.py
	test_rollout.py
	test_sft_terminal_message.py
	test_trl_adapter.py
	test_evaluation.py
	test_smoke.py
	test_synthetic.py
	test_trl_adapter.py
	test_verifier.py
	test_verifier_integration.py
	```

	<!-- ARCHITECTURE-SNAPSHOT-END -->

	---

	## Infrastructure

	### Development

	Prerequisites: Python 3.11-3.12, `uv`, Docker (for deployment)

	```bash
	git clone <repo-url> && cd sql-env
	uv sync
	uv run python scripts/download_spider_databases.py
	uv run pytest tests/ -v
	```

	### Deployment

	Target: HuggingFace Spaces (Docker, free tier)

	```bash
	uv run openenv build # build Docker image
	uv run openenv push # push to HF Spaces
	```

	The Dockerfile uses multi-stage build with `openenv-base`, runs as non-root `appuser`, bundles Spider databases, and exposes port 8000.

	---

	## Invariants

	- Token tensors in `SQLState` grow monotonically across turns (never shrink mid-episode)
	- `EpisodeContext` is server-only — leaking gold data to the agent breaks the POMDP
	- Per-step rewards clipped to `[-0.05, 0.15]` — terminal reward (+1.0) always dominates exploration (~0.3 max)
	- `tests/` must pass without GPU, without network, without downloaded databases (mocks/fixtures)
	- SQL execution never mutates the database (read-only mode enforced at connection level)

	---

	## Glossary

	\| Term \| Definition \|
	\|------\|------------\|
	\| Episode \| One question-answering session: reset -> N steps -> terminal \|
	\| Action type \| One of: DESCRIBE, SAMPLE, QUERY, ANSWER \|
	\| POMDP \| Partially observable MDP. Agent acts under uncertainty \|
	\| Spider \| Academic text-to-SQL benchmark dataset (10 DBs used) \|
	\| OpenEnv \| Meta's RL environment framework (Environment, EnvClient) \|
	\| Green Agent \| OpenEnv's evaluation wrapper pattern \|
	\| Oracle policy \| Baseline that uses gold SQL/answer for reward ceiling validation \|
	\| TRL \| HuggingFace Transformer Reinforcement Learning library \|
	\| GRPO \| Group Relative Policy Optimization (RL algorithm used for training) \|
	\| Dense reward \| Per-step reward signal (vs sparse terminal-only reward) \|

	---

	## References

	- OpenEnv framework: https://github.com/meta-pytorch/OpenEnv
	- Spider dataset: https://huggingface.co/datasets/xlangai/spider
	- TRL OpenEnv docs: https://huggingface.co/docs/trl/openenv