
Verification Specification

Feature: F007
Generated from: specs/F007-VERIFICATION_INPUT.json
Generated: 2026-03-27


1. Unit Tests

Dockerfile Validation

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_dockerfile_exists | Dockerfile exists at server/Dockerfile | N/A | File exists | happy |
| test_dockerfile_has_base_image_arg | BASE_IMAGE ARG is declared | Parse Dockerfile | ARG BASE_IMAGE present | happy |
| test_dockerfile_port_env_variable | PORT env var with fallback to 8000 | Parse Dockerfile | ENV PORT or CMD reads $PORT | happy |
| test_dockerfile_cmd_uses_port_env | CMD respects PORT env override | Set PORT=7860 | Server binds to 7860 | happy |
| test_dockerfile_non_root_user | Container runs as non-root user | Parse Dockerfile | USER appuser or equivalent non-root USER directive | security |
| test_dockerfile_copies_databases | Spider databases are bundled | Parse Dockerfile | COPY instruction includes data/databases/ | happy |
| test_dockerfile_healthcheck | Health check endpoint configured | Parse Dockerfile | HEALTHCHECK directive present | happy |
| test_dockerfile_no_dev_dependencies | No test/dev packages in final image | Inspect final stage | No pytest, ruff, etc. | edge |

Run: uv run pytest tests/unit/test_dockerfile.py -v
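
The "Parse Dockerfile" checks above can be approximated with a small line-based parser. The following is a minimal sketch, not the project's actual test code: the helper names and `SAMPLE_DOCKERFILE` are hypothetical, and the sample content is an assumption about what a passing Dockerfile might look like.

```python
import re

# Hypothetical Dockerfile text used only to exercise the helpers below.
SAMPLE_DOCKERFILE = """\
ARG BASE_IMAGE=example/base:latest
FROM ${BASE_IMAGE}
ENV PORT=8000
COPY data/databases/ /app/env/data/databases/
RUN useradd --create-home appuser
USER appuser
HEALTHCHECK CMD curl -f http://localhost:${PORT}/health || exit 1
CMD ["sh", "-c", "uvicorn server.app:app --host 0.0.0.0 --port ${PORT:-8000}"]
"""


def parse_instructions(dockerfile_text):
    """Return (KEYWORD, arguments) pairs, with backslash continuations joined."""
    joined = re.sub(r"\\[ \t]*\n", " ", dockerfile_text)
    instructions = []
    for raw_line in joined.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword, _, arguments = line.partition(" ")
        instructions.append((keyword.upper(), arguments.strip()))
    return instructions


def has_instruction(instructions, keyword, contains=""):
    """True if any instruction has this keyword and its arguments contain the text."""
    return any(k == keyword and contains in a for k, a in instructions)


def runs_as_non_root(instructions):
    """True if the final build stage ends with a USER other than root."""
    user = None
    for keyword, arguments in instructions:
        if keyword == "FROM":
            user = None  # every stage starts as root until USER is set
        elif keyword == "USER":
            user = arguments
    return user is not None and user.split(":")[0] not in {"root", "0"}


instructions = parse_instructions(SAMPLE_DOCKERFILE)
print(has_instruction(instructions, "ARG", "BASE_IMAGE"))
print(has_instruction(instructions, "COPY", "data/databases/"))
print(runs_as_non_root(instructions))
```

Resetting the user at each `FROM` means a `USER` directive in an earlier build stage does not count toward the final image, which matters for the non-root security check.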

openenv.yaml Manifest

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_manifest_exists | openenv.yaml exists at project root | N/A | File exists | happy |
| test_manifest_spec_version | spec_version field equals 1 | Parse YAML | spec_version: 1 | happy |
| test_manifest_name | name field is sql_env | Parse YAML | name: sql_env | happy |
| test_manifest_type_space | type field is 'space' | Parse YAML | type: space | happy |
| test_manifest_runtime_fastapi | runtime field is 'fastapi' | Parse YAML | runtime: fastapi | happy |
| test_manifest_app_entrypoint | app field points to valid module | Parse YAML | app: server.app:app | happy |
| test_manifest_port | port field is 8000 | Parse YAML | port: 8000 | happy |
| test_manifest_no_extra_fields | No unrecognized fields | Parse YAML | Only spec_version, name, type, runtime, app, port | edge |
| test_manifest_missing_required_field | Missing field produces validation error | Remove name | Validation error | error |

Run: uv run pytest tests/unit/test_manifest.py -v
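
The manifest expectations in the table can be collected into one validation function. This is a sketch under assumptions: `validate_manifest` is a hypothetical helper (it is not how `openenv validate` works), and it takes an already-parsed mapping, which in practice would likely come from a YAML loader such as PyYAML's `yaml.safe_load`.

```python
# Expected field values, taken from the table above.
EXPECTED_FIELDS = {
    "spec_version": 1,
    "name": "sql_env",
    "type": "space",
    "runtime": "fastapi",
    "app": "server.app:app",
    "port": 8000,
}


def validate_manifest(manifest):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, expected in EXPECTED_FIELDS.items():
        if field not in manifest:
            errors.append(f"missing required field: {field}")
            continue
        actual = manifest[field]
        # Compare types as well, so that `true` is not accepted in place of 1.
        if type(actual) is not type(expected) or actual != expected:
            errors.append(f"{field}: expected {expected!r}, got {actual!r}")
    for field in sorted(set(manifest) - set(EXPECTED_FIELDS)):
        errors.append(f"unrecognized field: {field}")
    return errors


valid = dict(EXPECTED_FIELDS)
without_name = {k: v for k, v in valid.items() if k != "name"}
print(validate_manifest(valid))
print(validate_manifest(without_name))
```

Returning a list of problems instead of raising on the first one lets a single test report every manifest issue at once.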

Blog Outline (docs/blog-outline.md)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_blog_outline_exists | Blog outline file exists | N/A | File at docs/blog-outline.md | happy |
| test_blog_has_hook_section | Hook section present | Parse markdown | Section heading for hook/intro | happy |
| test_blog_has_problem_section | Problem section present | Parse markdown | Section about static benchmarks | happy |
| test_blog_has_solution_section | Solution/architecture section present | Parse markdown | Section about SQLEnv architecture | happy |
| test_blog_has_results_placeholder | Results placeholder for F006 | Parse markdown | Placeholder text for training results | happy |
| test_blog_has_try_it_section | Try-it-yourself section with links | Parse markdown | Links to HF Space, notebook, GitHub | happy |
| test_blog_links_not_broken | All links in blog are valid or marked placeholder | Parse markdown | No dead internal links | edge |
| test_blog_minimum_length | Blog outline has substantive content | Parse markdown | At least 200 words | edge |

Run: uv run pytest tests/unit/test_blog_outline.py -v
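
The two edge-case rows (link validity and minimum length) could be implemented roughly as follows. This is a sketch with assumptions: the helper names are hypothetical, "internal link" is taken to mean a relative file path, and a bare `#` target is treated as the placeholder convention, which the spec does not define.

```python
import re
from pathlib import Path

# Matches inline markdown links of the form [text](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches a URL scheme such as "https:" or "mailto:".
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def word_count(markdown):
    """Count word-like tokens in the markdown text."""
    return len(re.findall(r"\b\w+\b", markdown))


def dead_internal_links(markdown, base_dir):
    """Return relative link targets that do not exist under base_dir."""
    dead = []
    for target in LINK_RE.findall(markdown):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # external URL, in-page anchor, or assumed placeholder
        file_part = target.split("#", 1)[0]
        if not (Path(base_dir) / file_part).exists():
            dead.append(target)
    return dead


text = "See the [Space](https://example.com) and the [results](#)."
print(word_count(text))
print(dead_internal_links(text, "."))
```

External URLs are skipped on purpose: checking them would make a unit test depend on network access.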

Training Notebook (notebooks/train_grpo.ipynb)

| Test | Description | Input | Expected | Category |
|------|-------------|-------|----------|----------|
| test_notebook_exists | Notebook file exists | N/A | File at notebooks/train_grpo.ipynb | happy |
| test_notebook_valid_json | Notebook is valid JSON / ipynb format | Parse file | Valid nbformat structure | happy |
| test_notebook_has_setup_cell | Setup cell with pip install | Inspect cells | Cell containing pip install | happy |
| test_notebook_has_connect_cell | Connect cell using SQLEnvClient | Inspect cells | Cell importing/using SQLEnvClient | happy |
| test_notebook_has_train_cell | Training cell with GRPO loop | Inspect cells | Cell with training logic | happy |
| test_notebook_has_eval_cell | Evaluation cell for held-out questions | Inspect cells | Cell with evaluation logic | happy |
| test_notebook_has_plot_cell | Plotting cell with matplotlib | Inspect cells | Cell importing matplotlib and plotting | happy |
| test_notebook_colab_compatible | Colab badge or runtime metadata | Inspect metadata | colab in metadata or Colab badge in first cell | happy |
| test_notebook_no_hardcoded_paths | No absolute local paths | Inspect all cells | No /Users/, /home/, C:\\ paths | edge |
| test_notebook_cells_ordered | Setup before connect before train | Inspect cell order | Correct logical ordering | edge |
| test_notebook_empty_outputs | Notebook shipped with cleared outputs | Inspect cells | All outputs arrays empty | edge |

Run: uv run pytest tests/unit/test_notebook.py -v
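
Because an .ipynb file is plain JSON, the cleared-outputs and hardcoded-path checks need only the standard library. The sketch below uses hypothetical helper names. The path pattern is a heuristic that can also flag URLs containing `/home/`.

```python
import json
import re
from pathlib import Path

# Heuristic for absolute local paths: macOS, Linux, and Windows drive prefixes.
ABSOLUTE_PATH_RE = re.compile(r"/Users/|/home/|[A-Za-z]:\\")


def load_notebook(path):
    """Read an .ipynb file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def cell_source(cell):
    """nbformat stores source as either a string or a list of lines."""
    source = cell.get("source", "")
    return "".join(source) if isinstance(source, list) else source


def cells_with_outputs(notebook):
    """Indices of code cells whose outputs array is not empty."""
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if cell.get("cell_type") == "code" and cell.get("outputs")
    ]


def cells_with_hardcoded_paths(notebook):
    """Indices of cells whose source mentions an absolute local path."""
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if ABSOLUTE_PATH_RE.search(cell_source(cell))
    ]


# A tiny in-memory notebook, used here instead of the real file.
sample = {
    "metadata": {"colab": {}},
    "cells": [
        {"cell_type": "code", "source": ["!pip install trl\n"], "outputs": []},
        {"cell_type": "code", "source": "open('/home/me/data.csv')",
         "outputs": [{"output_type": "stream"}]},
    ],
}
print(cells_with_outputs(sample))
print(cells_with_hardcoded_paths(sample))
```

Returning offending cell indices, not a bare boolean, makes a failing test message point at the cell to fix.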


2. Integration Tests

Flow: Local Docker Build and Run

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | docker build -t sql-env:test -f server/Dockerfile . | Build succeeds with exit code 0 | Check exit code |
| 2 | docker run -d -p 8000:8000 --name sql-env-test sql-env:test | Container starts | Container running (docker ps) |
| 3 | Wait for health check (up to 30s) | /health returns 200 | curl -f http://localhost:8000/health |
| 4 | Connect WebSocket client, call reset | Episode starts, observation returned | Valid SQLObservation JSON |
| 5 | Send DESCRIBE action via WebSocket | Column info returned | Non-empty result field |
| 6 | Send ANSWER action via WebSocket | Episode ends, reward returned | done: true, reward is numeric |
| 7 | Stop container | Container stops cleanly | docker stop sql-env-test exits 0 |

Run: uv run pytest tests/integration/test_docker_local.py -v
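
Step 3 of the flow above (wait up to 30s for /health) is a polling loop with a deadline. The following is a minimal sketch. The function names are hypothetical, and the probe, sleep, and clock are injectable so the loop can be tested without a running container.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_status(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status code, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, reset, or timed out


def wait_for_health(
    url: str,
    deadline_s: float = 30.0,
    interval_s: float = 1.0,
    probe: Callable[[str], Optional[int]] = http_status,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the probe returns 200 or the deadline passes."""
    end = clock() + deadline_s
    while True:
        if probe(url) == 200:
            return True
        if clock() >= end:
            return False
        sleep(interval_s)


# Simulated startup: unreachable, then 503, then healthy.
statuses = iter([None, 503, 200])
print(wait_for_health("http://localhost:8000/health",
                      probe=lambda _url: next(statuses),
                      sleep=lambda _seconds: None))
```

Using `time.monotonic` for the deadline avoids false timeouts if the wall clock is adjusted during the wait.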

Flow: PORT Override for HF Spaces

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | docker run -d -p 7860:7860 -e PORT=7860 --name sql-env-port sql-env:test | Container starts on port 7860 | Container running |
| 2 | curl -f http://localhost:7860/health | Health check passes | HTTP 200 |
| 3 | Port 8000 is NOT listening | No response on 8000 | curl fails on port 8000 |

Run: uv run pytest tests/integration/test_port_override.py -v
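
The "port 8000 is NOT listening" step can also be checked with a plain TCP connect instead of curl. This sketch assumes a hypothetical helper name, `is_port_listening`. Because the container in step 1 publishes only 7860, a host-side probe of 8000 is expected to fail regardless of what the server binds inside the container, so an in-container check would be a stronger assertion.

```python
import socket


def is_port_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


# Demonstration against a throwaway local listener on an ephemeral port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
print(is_port_listening("127.0.0.1", demo_port))
listener.close()
```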

Flow: Database Bundling Verification

| Step | Action | Expected | Verification |
|------|--------|----------|--------------|
| 1 | Build Docker image | Build succeeds | Exit code 0 |
| 2 | docker run --rm sql-env:test ls /app/env/data/databases/ | Spider databases present | At least one database directory listed |
| 3 | docker run --rm sql-env:test find /app/env/data/databases/ -name "*.sqlite" | SQLite files present | At least one .sqlite file found |
| 4 | Start container and reset episode | Episode loads a bundled database | No "database not found" error |

Run: uv run pytest tests/integration/test_db_bundling.py -v
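
Finding a .sqlite file does not prove it is a usable database. As a possible extra check beyond the steps above, the sketch below (with hypothetical helper names) locates the files and opens each one read-only to confirm that SQLite accepts it.

```python
import sqlite3
from pathlib import Path


def find_sqlite_files(root):
    """Return all *.sqlite files under root, sorted for stable output."""
    return sorted(Path(root).rglob("*.sqlite"))


def is_readable_sqlite(path):
    """True if the file opens read-only and its schema table can be queried."""
    try:
        uri = Path(path).resolve().as_uri() + "?mode=ro"
        connection = sqlite3.connect(uri, uri=True)
        try:
            connection.execute("SELECT name FROM sqlite_master LIMIT 1").fetchall()
        finally:
            connection.close()
        return True
    except sqlite3.Error:
        return False  # missing file, permission problem, or not a database


# Usage against a local checkout; the directory may not exist on every machine.
for database in find_sqlite_files("data/databases"):
    print(database, is_readable_sqlite(database))
```

Opening with `mode=ro` mirrors the non-root, read-only access the container is expected to have and never creates a file by accident.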


3. API Tests

No new API endpoints are introduced by F007. The existing /health, WebSocket, and REST endpoints from prior features are tested via integration tests above.


4. E2E Tests

Scenario: Judge Experience -- Visit HF Space and Play Episode

Setup: Docker container running (locally simulating HF Space)

Actions:

  1. Open health endpoint URL -- confirm service is up
  2. Connect via WebSocket
  3. Call reset -- receive initial observation with question and schema
  4. Call step with DESCRIBE action -- receive column details
  5. Call step with QUERY action -- receive query results
  6. Call step with ANSWER action -- receive terminal observation with reward

Expected: Full episode completes without errors; reward is 0.0 or 1.0

Run: uv run pytest tests/e2e/test_judge_experience.py -v
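
The expected outcome of this scenario can be expressed as a check over the sequence of observations collected during the episode. This is only a sketch: `validate_episode` is hypothetical, and it assumes each observation is a dict with `done` and `reward` keys, which may not match the real SQLObservation schema.

```python
def validate_episode(observations):
    """Check a transcript: the reset observation followed by one per step.

    Returns a list of problems; an empty list means the episode is valid.
    """
    if len(observations) < 2:
        return ["episode needs a reset observation and at least one step"]

    problems = []
    for index, observation in enumerate(observations[:-1]):
        if observation.get("done"):
            problems.append(f"observation {index} ended the episode early")

    final = observations[-1]
    if final.get("done") is not True:
        problems.append("final observation is not terminal")

    reward = final.get("reward")
    # Exclude bools explicitly, because True == 1.0 in Python.
    if isinstance(reward, bool) or reward not in (0.0, 1.0):
        problems.append(f"terminal reward must be 0.0 or 1.0, got {reward!r}")
    return problems


transcript = [
    {"done": False, "reward": None},  # reset
    {"done": False, "reward": None},  # DESCRIBE
    {"done": False, "reward": None},  # QUERY
    {"done": True, "reward": 1.0},    # ANSWER
]
print(validate_episode(transcript))
```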

Scenario: Notebook Cell Sequence Validation

Setup: Notebook file at notebooks/train_grpo.ipynb

Actions:

  1. Parse notebook JSON
  2. Validate each cell type and content markers in order:
    • Cell with pip install (setup)
    • Cell with SQLEnvClient (connect)
    • Cell with training loop keywords: grpo, train, optimizer (train)
    • Cell with eval or accuracy or held-out (evaluate)
    • Cell with matplotlib or plt. (plot)
  3. Validate no syntax errors in code cells (compile check)

Expected: All five cell categories present in correct order; no syntax errors

Run: uv run pytest tests/e2e/test_notebook_validation.py -v
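
The ordered-marker check and the compile check in this scenario can be sketched as below. The helper names are hypothetical. The marker patterns follow the keywords listed above, and IPython magics (lines starting with `!` or `%`) are replaced with `pass` before compiling, since they are not valid Python.

```python
import re

# Category markers in the required order, based on the keywords listed above.
CATEGORY_MARKERS = [
    ("setup", re.compile(r"pip install")),
    ("connect", re.compile(r"SQLEnvClient")),
    ("train", re.compile(r"grpo|train|optimizer", re.IGNORECASE)),
    ("evaluate", re.compile(r"eval|accuracy|held-out", re.IGNORECASE)),
    ("plot", re.compile(r"matplotlib|plt\.")),
]


def ordered_matches(sources):
    """Greedily match each category to the first cell after the previous match."""
    found = {}
    start = 0
    for name, pattern in CATEGORY_MARKERS:
        for index in range(start, len(sources)):
            if pattern.search(sources[index]):
                found[name] = index
                start = index + 1
                break
    return found


def syntax_errors(sources):
    """Return (cell index, message) for every cell that fails to compile."""
    errors = []
    for index, source in enumerate(sources):
        lines = []
        for line in source.splitlines():
            stripped = line.lstrip()
            if stripped.startswith(("!", "%")):
                indent = line[: len(line) - len(stripped)]
                line = indent + "pass"  # neutralize shell and magic commands
            lines.append(line)
        try:
            compile("\n".join(lines), f"<cell {index}>", "exec")
        except SyntaxError as exc:
            errors.append((index, exc.msg))
    return errors


cells = [
    "!pip install trl",
    "client = SQLEnvClient()",
    "for step in range(3):\n    loss = grpo_step()",
    "accuracy = evaluate()",
    "import matplotlib.pyplot as plt\nplt.plot([1])",
]
print(ordered_matches(cells))
print(syntax_errors(cells))
```

A category missing from the result is either absent or out of order; the greedy scan does not distinguish the two, which is enough for a pass/fail assertion.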

Scenario: README Has Competition-Ready Content

Setup: README.md at project root

Actions:

  1. Verify README contains project description
  2. Verify README contains quickstart / getting started section
  3. Verify README contains link to HF Space (or placeholder)
  4. Verify README contains link to training notebook
  5. Verify README contains architecture or how-it-works section

Expected: All five content sections present

Run: uv run pytest tests/e2e/test_readme_completeness.py -v


5. Edge Cases Checklist

  • Dockerfile builds on CPU-only machine (no CUDA dependencies in final image)
  • Container memory stays under HF Spaces free tier limit (~16GB)
  • PORT env variable with non-numeric value handled gracefully
  • PORT env variable with value 0 or negative handled gracefully
  • Missing data/databases/ directory causes clear error at startup, not silent failure
  • openenv.yaml with wrong spec_version is rejected by openenv validate
  • Blog outline contains no TODO/FIXME/placeholder markers except in the results section
  • Notebook code cells have no import errors when dependencies are installed
  • Notebook does not require GPU (runs on Colab free tier CPU)
  • Container starts within 60 seconds (reasonable cold start)
  • Docker image size is under 2GB (reasonable for free tier)
  • .dockerignore excludes test files, .git, __pycache__, .env
  • Non-root user can read database files (file permissions correct)
  • Container handles SIGTERM gracefully (clean shutdown)
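
The two PORT items in the checklist leave "handled gracefully" open. One reading is that the server falls back to 8000 only when PORT is unset and otherwise fails fast with a clear message. The sketch below follows that reading; `resolve_port` is a hypothetical helper, not the server's actual startup code.

```python
import os
from typing import Optional

DEFAULT_PORT = 8000


def resolve_port(raw: Optional[str], default: int = DEFAULT_PORT) -> int:
    """Parse a PORT value, falling back to the default when it is unset or empty."""
    if raw is None or raw.strip() == "":
        return default
    try:
        port = int(raw.strip())
    except ValueError:
        raise ValueError(f"PORT must be an integer, got {raw!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT must be between 1 and 65535, got {port}")
    return port


# Assumed usage at startup: read the environment once and fail early.
try:
    print(resolve_port(os.environ.get("PORT")))
except ValueError as exc:
    print(f"invalid PORT: {exc}")
```

Rejecting 0 is deliberate: binding to port 0 asks the OS for a random free port, which would not match the published container port.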

6. Evidence Requirements

| Category | Evidence Type | Example |
|----------|---------------|---------|
| Unit tests | pytest output | X passed |
| Integration | pytest + docker logs | Container healthy, episode complete |
| Dockerfile | docker build output | Successfully built <hash> |
| Port override | curl output | HTTP 200 on port 7860 |
| Database bundling | docker exec output | ls shows .sqlite files |
| Blog outline | File exists + content check | 5 sections present |
| Notebook | nbformat validation | Valid ipynb, 5+ cells in order |
| README | Content grep | All required sections present |
| E2E | Full episode log | reset -> steps -> answer, reward=1.0 |
| Image size | docker images output | < 2GB |

7. External Deployment Prerequisites and Remediation

Use this checklist when deployment verification fails with external auth/access errors.

GHCR Base Image Access (403 Forbidden)

  1. Authenticate Docker to GHCR:
    • echo "$GITHUB_TOKEN" | docker login ghcr.io -u <github-username> --password-stdin
  2. Ensure GITHUB_TOKEN has the read:packages scope, which is required to pull ghcr.io/meta-pytorch/openenv-base.
  3. Retry build using explicit lowercase tag:
    • uv run openenv build -t openenv-sql-env-f007-hf-submission

Hugging Face Push Readiness

  1. Authenticate Hugging Face CLI:
    • huggingface-cli login
  2. Confirm target Space repo exists and token has write access.
  3. Run push:
    • uv run openenv push

Verification Outcome Rules for External Failures

  • If local tests pass but GHCR/HF auth fails, record status as partial verification (external blocker) and include exact remediation commands above.
  • Do not mark verifier result as approved until at least one authenticated build+push attempt is documented.
  • Record authenticated evidence in specs/F007-DEMO.md under ## Live Local Proof with separate Authenticated Build Evidence and Hugging Face Push Evidence subsections containing raw command output.