name: sql-agent-openenv version: "1.0.0" description: > A SQL generation and repair environment where an AI agent learns to write correct SQL queries through a self-debugging loop powered by a LinUCB contextual bandit and GEPA prompt evolution. Models real-world data analyst workflows — querying databases with natural language, handling errors, and iteratively improving. author: sql-agent-openenv-team tags: - openenv - sql - rl - nlp - contextual-bandit # ── Endpoints ──────────────────────────────────────────────────────────────── api: reset: /reset step: /step state: /state # ── Action Space ───────────────────────────────────────────────────────────── action_space: type: discrete n: 9 actions: - name: generate description: "Generate SQL from scratch (first attempt)" - name: rewrite_full description: "Completely rewrite the query from scratch" - name: fix_column description: "Fix wrong column name references using schema" - name: fix_table description: "Fix wrong table name references or JOIN structure" - name: add_groupby description: "Add or fix GROUP BY / aggregation clauses" - name: rewrite_cte description: "Restructure CTEs or subqueries" - name: fix_syntax description: "Fix syntax errors (brackets, commas, keywords)" - name: change_dialect description: "Convert to SQLite-compatible functions" - name: relax_filter description: "Broaden or remove overly strict WHERE conditions" # ── Observation Space ──────────────────────────────────────────────────────── observation_space: type: dict fields: - name: question type: string description: "Natural language question to answer with SQL" - name: schema_info type: string description: "Full database schema (tables, columns, types, FK relationships)" - name: current_sql type: string nullable: true description: "The SQL generated on the last attempt (null on first step)" - name: error_message type: string nullable: true description: "SQLite error message from the last attempt (null on success)" - name: error_class type: string nullable: true description: "Classified error type (e.g. no_such_column, syntax_error)" - name: attempt_number type: integer description: "Current attempt number (0 at reset, increments each step)" - name: max_attempts type: integer description: "Maximum allowed attempts per episode (5)" - name: task_id type: string description: "Active task identifier" - name: task_difficulty type: string description: "Task difficulty level: easy | medium | hard" # ── Reward ─────────────────────────────────────────────────────────────────── reward: range: [0.05, 0.95] description: > Task score is the grader output clamped strictly inside (0, 1). Graders score partial progress (column correctness, row-count match) and apply attempt penalties for multi-step repair episodes. # ── Tasks ──────────────────────────────────────────────────────────────────── tasks: - id: simple_queries name: Simple SQL Queries difficulty: easy description: > Single-table SELECT queries. Agent must retrieve correct rows by applying basic filters and projections on the marketplace database. question_count: 5 grader: > Checks that required output columns are present and row count falls within expected bounds. Attempt penalty not applied. - id: join_queries name: SQL Join Queries difficulty: medium description: > Multi-table JOIN queries with GROUP BY and aggregation. Agent must correctly join tables and compute aggregates over the marketplace data. question_count: 5 grader: > Correct columns + row count score multiplied by (1.0 - 0.1*(attempts-1)). Rewards efficient, first-try solutions. - id: complex_queries name: Complex SQL Queries difficulty: hard description: > Advanced queries using CTEs, window functions, nested aggregations, and multi-level joins. Requires precise SQLite syntax knowledge. question_count: 5 grader: > Strict correctness required. Score capped at 0.8 without first-attempt bonus. Attempt penalty of 0.1*(attempts-1) applied. Hard tasks genuinely challenge frontier models. # ── Environment Metadata ───────────────────────────────────────────────────── metadata: max_steps_per_episode: 5 database: SQLite (marketplace schema — users, products, orders, reviews, sellers) rl_algorithm: LinUCB contextual bandit (feature_dim=20, 8 repair actions) prompt_optimizer: GEPA (Generative Evolutionary Prompt Adaptation) runtime_estimate_minutes: 5 compute_requirements: vcpu: 2 memory_gb: 4