spec_version: 1
name: medusa_env
type: space
runtime: fastapi
app: server.app:app
port: 8000
tasks:
  - id: clean_pipeline
    name: Clean Pipeline
    difficulty: easy
    seed: 0
    description: >
      Both sources are fresh. Join keys are clean and unique. The agent must
      verify freshness, prepare keys, join, apply SCD, and commit without
      triggering a row explosion.
    success_criteria:
      - COMMIT issued (episode finalized)
      - No Cartesian explosion detected
      - Silver row count <= Source A row count
      - match_rate > 0.80 after join
    scoring_rubric:
      committed: 0.20
      no_explosion: 0.25
      volume_ok: 0.20
      high_match: 0.20
      grader_pass: 0.15
    grader:
      type: weighted_rubric
      source: tasks.score_episode
      score_range: [0.0, 1.0]
      pass_threshold: 0.55
      criteria:
        - id: committed
          weight: 0.20
        - id: no_explosion
          weight: 0.25
        - id: volume_ok
          weight: 0.20
        - id: high_match
          weight: 0.20
        - id: grader_pass
          weight: 0.15
  - id: dirty_integration
    name: Dirty Key Integration
    difficulty: medium
    seed: 1
    description: >
      Source A has NULLs and whitespace in join keys. Source B has duplicate
      keys that can cause row explosion. The agent must PREP_KEYS and
      DEDUPLICATE before joining, and correctly quarantine unresolvable
      orphans.
    success_criteria:
      - PREP_KEYS_A issued before EXECUTE_JOIN
      - PREP_KEYS_B issued before EXECUTE_JOIN
      - DEDUPLICATE_B issued before EXECUTE_JOIN
      - No row explosion
      - Quarantine integrity check passes
    scoring_rubric:
      committed: 0.10
      prepped_before_join: 0.20
      deduped_before_join: 0.20
      no_explosion: 0.25
      integrity_ok: 0.15
      grader_pass: 0.10
    grader:
      type: weighted_rubric
      source: tasks.score_episode
      score_range: [0.0, 1.0]
      pass_threshold: 0.55
      criteria:
        - id: committed
          weight: 0.10
        - id: prepped_before_join
          weight: 0.20
        - id: deduped_before_join
          weight: 0.20
        - id: no_explosion
          weight: 0.25
        - id: integrity_ok
          weight: 0.15
        - id: grader_pass
          weight: 0.10
  - id: full_medallion
    name: Full Medallion Integration
    difficulty: hard
    seed: 2
    description: >
      Source A is stale (>6h old). Source B has new schema columns not
      registered in Silver. The agent must check freshness, evolve the schema,
      clean keys, deduplicate, execute a left join, apply SCD-2 for tracked
      columns, and pass all grader checks.
    success_criteria:
      - SYNC_CHECK issued before any join
      - EVOLVE_SCHEMA issued before COMMIT
      - SCD-2 applied (not SCD-1) for tracked column
      - Silver schema contains new columns from drift
      - All 4 grader checks pass
    scoring_rubric:
      committed: 0.05
      sync_checked: 0.15
      schema_evolved: 0.15
      used_scd2: 0.20
      schema_ok: 0.20
      grader_pass: 0.25
    grader:
      type: weighted_rubric
      source: tasks.score_episode
      score_range: [0.0, 1.0]
      pass_threshold: 0.55
      criteria:
        - id: committed
          weight: 0.05
        - id: sync_checked
          weight: 0.15
        - id: schema_evolved
          weight: 0.15
        - id: used_scd2
          weight: 0.20
        - id: schema_ok
          weight: 0.20
        - id: grader_pass
          weight: 0.25
  - id: schema_bootstrap
    name: Schema Bootstrap
    difficulty: easy
    seed: 3
    description: >
      Fresh sources arrive with new columns in both Bronze tables. The agent
      must evolve the Silver schema, execute a clean join, land a non-empty
      Silver table, and commit without row explosion.
    success_criteria:
      - EVOLVE_SCHEMA issued before COMMIT
      - No row explosion
      - Silver contains the joined columns after drift
      - Silver table is non-empty
    scoring_rubric:
      committed: 0.15
      no_explosion: 0.20
      schema_evolved: 0.25
      schema_materialized: 0.20
      silver_built: 0.20
    grader:
      type: weighted_rubric
      source: tasks.score_episode
      score_range: [0.0, 1.0]
      pass_threshold: 0.55
      criteria:
        - id: committed
          weight: 0.15
        - id: no_explosion
          weight: 0.20
        - id: schema_evolved
          weight: 0.25
        - id: schema_materialized
          weight: 0.20
        - id: silver_built
          weight: 0.20
  - id: dedup_guardrail
    name: Dedup Guardrail
    difficulty: medium
    seed: 4
    description: >
      Dirty join keys and duplicate Dimension rows increase the risk of row
      explosion. The agent must prep keys, deduplicate Source B, produce a
      non-empty Silver table, and commit cleanly.
    success_criteria:
      - PREP_KEYS_A and PREP_KEYS_B issued before join
      - DEDUPLICATE_B issued before join
      - No row explosion
      - Silver table is non-empty
      - Grader passes
    scoring_rubric:
      committed: 0.10
      prepped_before_join: 0.15
      deduped_before_join: 0.25
      no_explosion: 0.25
      silver_built: 0.10
      grader_pass: 0.15
    grader:
      type: weighted_rubric
      source: tasks.score_episode
      score_range: [0.0, 1.0]
      pass_threshold: 0.55
      criteria:
        - id: committed
          weight: 0.10
        - id: prepped_before_join
          weight: 0.15
        - id: deduped_before_join
          weight: 0.25
        - id: no_explosion
          weight: 0.25
        - id: silver_built
          weight: 0.10
        - id: grader_pass
          weight: 0.15
  - id: stale_sync_recovery
    name: Stale Sync Recovery
    difficulty: hard
    seed: 5
    description: >
      Source A is stale and the pipeline must not proceed blindly. The agent
      must verify freshness, recover a high-match join, build Silver, and
      still pass the final audit.
    success_criteria:
      - SYNC_CHECK issued before any join
      - No row explosion
      - match_rate > 0.80 after join
      - Silver table is non-empty
      - Grader passes
    scoring_rubric:
      committed: 0.05
      sync_checked: 0.30
      no_explosion: 0.20
      high_match: 0.15
      silver_built: 0.15
      grader_pass: 0.15
    grader:
      type: weighted_rubric
      source: tasks.score_episode
      score_range: [0.0, 1.0]
      pass_threshold: 0.55
      criteria:
        - id: committed
          weight: 0.05
        - id: sync_checked
          weight: 0.30
        - id: no_explosion
          weight: 0.20
        - id: high_match
          weight: 0.15
        - id: silver_built
          weight: 0.15
        - id: grader_pass
          weight: 0.15
  - id: fresh_join_baseline
    name: Fresh Join Baseline
    difficulty: easy
    seed: 6
    description: >
      A clean baseline task that rewards a simple, efficient Bronze-to-Silver
      run. The agent should avoid unnecessary actions while producing a
      high-match, non-exploding join and a usable Silver table.
    success_criteria:
      - COMMIT issued
      - No row explosion
      - match_rate > 0.80 after join
      - Silver table is non-empty
      - Episode completed efficiently
    scoring_rubric:
      committed: 0.15
      no_explosion: 0.25
      high_match: 0.25
      silver_built: 0.20
      efficient_run: 0.15
    grader:
      type: weighted_rubric
      source: tasks.score_episode
      score_range: [0.0, 1.0]
      pass_threshold: 0.55
      criteria:
        - id: committed
          weight: 0.15
        - id: no_explosion
          weight: 0.25
        - id: high_match
          weight: 0.25
        - id: silver_built
          weight: 0.20
        - id: efficient_run
          weight: 0.15
  - id: stale_history_guard
    name: Stale History Guard
    difficulty: hard
    seed: 7
    description: >
      A stale-source episode where the agent must both verify freshness and
      preserve historical correctness. The task emphasizes SCD-2 usage and
      proper history columns in Silver.
    success_criteria:
      - SYNC_CHECK issued before any join
      - SCD-2 used instead of SCD-1
      - Silver table is non-empty
      - Silver contains history columns
      - Grader passes
    scoring_rubric:
      committed: 0.05
      sync_checked: 0.20
      used_scd2: 0.25
      silver_built: 0.15
      history_columns: 0.15
      grader_pass: 0.20
    grader:
      type: weighted_rubric
      source: tasks.score_episode
      score_range: [0.0, 1.0]
      pass_threshold: 0.55
      criteria:
        - id: committed
          weight: 0.05
        - id: sync_checked
          weight: 0.20
        - id: used_scd2
          weight: 0.25
        - id: silver_built
          weight: 0.15
        - id: history_columns
          weight: 0.15
        - id: grader_pass
          weight: 0.20
  - id: orphan_quarantine
    name: Orphan Quarantine
    difficulty: medium
    seed: 8
    description: >
      Dirty keys create unmatched Fact rows that should not be silently
      dropped. The agent must prep keys, choose a left join, preserve a
      meaningful quarantine set, and keep audit integrity intact.
    success_criteria:
      - PREP_KEYS_A and PREP_KEYS_B issued before join
      - Left join used
      - Quarantine contains rows
      - No row explosion
      - Integrity checks pass
    scoring_rubric:
      committed: 0.10
      prepped_before_join: 0.15
      left_join_used: 0.20
      quarantine_nonempty: 0.20
      integrity_ok: 0.20
      no_explosion: 0.15
    grader:
      type: weighted_rubric
      source: tasks.score_episode
      score_range: [0.0, 1.0]
      pass_threshold: 0.55
      criteria:
        - id: committed
          weight: 0.10
        - id: prepped_before_join
          weight: 0.15
        - id: left_join_used
          weight: 0.20
        - id: quarantine_nonempty
          weight: 0.20
        - id: integrity_ok
          weight: 0.20
        - id: no_explosion
          weight: 0.15
  - id: drift_alignment
    name: Drift Alignment
    difficulty: medium
    seed: 9
    description: >
      Schema drift introduces new columns, but the pipeline is otherwise
      clean. The agent must evolve the schema, use the audited left-join
      path, materialize the new shape in Silver, and commit successfully.
    success_criteria:
      - EVOLVE_SCHEMA issued before COMMIT
      - Left join used
      - Silver contains the joined columns after drift
      - Silver table is non-empty
      - Grader passes
    scoring_rubric:
      committed: 0.10
      schema_evolved: 0.25
      left_join_used: 0.15
      schema_materialized: 0.25
      silver_built: 0.10
      grader_pass: 0.15
    grader:
      type: weighted_rubric
      source: tasks.score_episode
      score_range: [0.0, 1.0]
      pass_threshold: 0.55
      criteria:
        - id: committed
          weight: 0.10
        - id: schema_evolved
          weight: 0.25
        - id: left_join_used
          weight: 0.15
        - id: schema_materialized
          weight: 0.25
        - id: silver_built
          weight: 0.10
        - id: grader_pass
          weight: 0.15
  - id: snapshot_upsert
    name: Snapshot Upsert
    difficulty: easy
    seed: 10
    description: >
      A clean snapshot-style load where SCD-1 is sufficient. The agent should
      choose overwrite semantics, maintain safe volume, and land a non-empty
      Silver table without introducing join problems.
    success_criteria:
      - SCD-1 used instead of SCD-2
      - No row explosion
      - Silver row count <= Source A row count
      - Silver table is non-empty
      - Grader passes
    scoring_rubric:
      committed: 0.10
      no_explosion: 0.20
      used_scd1: 0.25
      volume_ok: 0.20
      silver_built: 0.15
      grader_pass: 0.10
    grader:
      type: weighted_rubric
      source: tasks.score_episode
      score_range: [0.0, 1.0]
      pass_threshold: 0.55
      criteria:
        - id: committed
          weight: 0.10
        - id: no_explosion
          weight: 0.20
        - id: used_scd1
          weight: 0.25
        - id: volume_ok
          weight: 0.20
        - id: silver_built
          weight: 0.15
        - id: grader_pass
          weight: 0.10
  - id: schema_history_guard
    name: Schema History Guard
    difficulty: hard
    seed: 11
    description: >
      Schema drift and historical tracking requirements arrive together. The
      agent must evolve schema, materialize the merged columns in Silver, use
      SCD-2, and preserve history metadata through commit.
    success_criteria:
      - EVOLVE_SCHEMA issued before COMMIT
      - SCD-2 used instead of SCD-1
      - Silver contains the joined columns after drift
      - Silver contains history columns
      - Grader passes
    scoring_rubric:
      committed: 0.05
      schema_evolved: 0.20
      used_scd2: 0.20
      schema_materialized: 0.20
      history_columns: 0.15
      grader_pass: 0.20
    grader:
      type: weighted_rubric
      source: tasks.score_episode
      score_range: [0.0, 1.0]
      pass_threshold: 0.55
      criteria:
        - id: committed
          weight: 0.05
        - id: schema_evolved
          weight: 0.20
        - id: used_scd2
          weight: 0.20
        - id: schema_materialized
          weight: 0.20
        - id: history_columns
          weight: 0.15
        - id: grader_pass
          weight: 0.20