---
title: Smart Contract Audit RL Environment
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
tags:
  - openenv
  - reinforcement-learning
  - smart-contracts
  - solidity
  - security
  - evaluation
license: mit
short_description: OpenEnv RL environment for smart contract security auditing
---

# 🔍 Smart Contract Audit RL Environment

> An OpenEnv-compliant reinforcement learning environment for training and evaluating AI agents on real-world Solidity smart contract security auditing tasks.

---

## Overview

Smart contract auditing is a high-stakes, expert-level task performed by professional security researchers. Mistakes cost millions — the Ethereum ecosystem has lost over **$3 billion** to exploits in audited and unaudited contracts alike. This environment simulates the core reasoning loop of a smart contract auditor, enabling RL agents to learn structured exploration strategies for vulnerability detection, property discovery, and rule checking.

The dataset is derived from real audit reports published by **[Certora](https://www.certora.com/)**, covering three production-grade DeFi protocols:

| Source | Protocol |
|---|---|
| Certora Audit | AaveVault |
| Certora Audit | AaveVaultV2 |
| Certora Audit | Lido Finance |

Each episode exposes a fragment of a real Solidity contract. The agent must use a structured action API — mirroring how a human auditor would methodically inspect a codebase — to accomplish a defined objective within a fixed step budget.
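That interaction loop can be sketched as a thin HTTP client. The `/reset` and `/step` endpoints are the ones served by the environment; the `post` and `make_step` helpers and the localhost URL are illustrative, not part of the environment's API.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # local dev server; adjust for a deployed Space

def post(path: str, payload: dict) -> dict:
    """POST a JSON payload to the environment and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def make_step(action_type: str, **params) -> dict:
    """Build a /step request body: {"action_type": ..., "params": {...}}."""
    return {"action_type": action_type, "params": {k: str(v) for k, v in params.items()}}

def run_episode(task_id: str = "task1_vuln_detection", seed: int = 42) -> dict:
    """Reset the environment, take one exploratory step, and return the observation."""
    post("/reset", {"task_id": task_id, "seed": seed})
    return post("/step", make_step("list_functions"))
```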
---

## Environment Architecture

```
SmartContractEnv/
├── agents/
│   ├── task1.py
│   ├── task2.py
│   └── task3.py
├── data/
│   ├── __init__.py
│   ├── contracts.json
│   ├── data_loader.py
│   ├── properties.csv
│   ├── Template.json
│   ├── vulnerabilities.json
│   └── vulnerabilities.md
├── env/
│   ├── __init__.py
│   ├── base_env.py
│   └── schemas.py
├── server/
│   ├── tasks/
│   │   ├── task1/
│   │   │   ├── __init__.py
│   │   │   ├── actions.py
│   │   │   ├── environment.py
│   │   │   └── grader.py
│   │   ├── task2/
│   │   │   ├── __init__.py
│   │   │   ├── actions.py
│   │   │   ├── environment.py
│   │   │   └── grader.py
│   │   ├── task3/
│   │   │   ├── __init__.py
│   │   │   ├── actions.py
│   │   │   ├── environment.py
│   │   │   └── grader.py
│   │   └── __init__.py
│   ├── __init__.py
│   └── app.py
├── utils/
│   ├── __init__.py
│   ├── prompts.py
│   ├── propertyretriever.py
│   └── semanticmatcher.py
├── .env
├── .gitignore
├── demo.py
├── Dockerfile
├── eval.py
├── inference.py
├── LICENSE.txt
├── openenv.yaml
├── pyproject.toml
├── README.md
├── requirements.txt
└── validate-submission.sh
```

---

## Tasks

### Task 1 — Targeted Vulnerability Detection *(Medium)*

**Real-world analogue:** A security auditor is handed a Solidity file and asked to pinpoint the vulnerable function and describe the class of bug.

**Setup:** The agent receives a single Solidity file. The episode selects one vulnerable function at random from the dataset (7–8 available) on each `reset()`.

**Objective:** Identify the vulnerable function and describe its issue in 2–3 words (e.g., `"reentrancy"`, `"integer overflow"`, `"unchecked return value"`). Submit `"NO"` if no vulnerability exists.
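Free-text labels like these are graded with keyword plus semantic matching against a curated synonym list (see Design Notes & Known Limitations). A minimal sketch of the synonym-normalisation step, using a hypothetical synonym table:

```python
# Hypothetical synonym table — the real grader uses a curated list plus semantic matching.
SYNONYMS = {
    "re-entrancy": "reentrancy",
    "reentrance": "reentrancy",
    "int overflow": "integer overflow",
}

def normalize_label(label: str) -> str:
    """Lower-case, collapse whitespace, and map known synonyms onto a canonical name."""
    key = " ".join(label.lower().split())
    return SYNONYMS.get(key, key)

def labels_match(submitted: str, ground_truth: str) -> bool:
    """Compare a submitted vulnerability label against the ground-truth label."""
    return normalize_label(submitted) == normalize_label(ground_truth)
```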
**Action Space:**

| Action | Reward | Notes |
|---|---|---|
| `list_functions` | −0.05 | Returns all function signatures in the file |
| `get_function_code` | −0.10 (wrong fn) / +0.05 (correct fn) | Returns raw Solidity source of one function |
| `get_function_summary` | −0.05 (wrong) / +0.03 (correct) | Returns NatSpec comments for a function |
| `get_file_metadata` | −0.04 | Returns the file's header comment / pragma / imports |
| `get_state_variables` | −0.05 | Returns all contract-level state variable declarations |
| `get_call_graph` | −0.08 | Returns the inter-function call graph |
| `get_task_state` | 0.00 | Returns current step count and cumulative reward |
| `submit` | +5.00 (correct) / −1.50 (wrong) | One submission allowed per episode |
| *(repeated query)* | −0.40 | Penalty for querying the exact same action+params twice |
| *(unknown action)* | −0.20 | Any unrecognised action type |

**Episode terminates** on `submit` or when the step budget is exhausted.

---

### Task 2 — Property Discovery *(Hard)*

**Real-world analogue:** A formal verification engineer must derive an invariant or safety property for a contract function — the kind written as a Certora Verification Language (CVL) spec.

**Setup:** The agent receives a single function extracted from a Solidity file, along with a brief description of the broader contract. The episode targets a function that has a known, labelled property in the dataset.

**Objective:** Produce a natural-language description of the function's key safety property (e.g., *"The total shares minted must never exceed the total underlying assets deposited"*).
**Action Space:**

| Action | Reward | Notes |
|---|---|---|
| `get_file_natspec` | −0.03 | File-level NatSpec documentation |
| `get_function_natspec` | −0.08 | Function-level NatSpec comments |
| `get_function_code` | −0.06 | Raw Solidity source of the target function |
| `get_related_functions` | −0.06 | Functions that call or are called by the target |
| `get_input_output` | −0.04 | Parameter names/types and return values |
| `get_similar_property` | −0.20 | Hard-coded reference property from a different contract |
| `submit_property` | 0–5 (graded) | **One attempt per episode.** Scored by a deterministic similarity checker |

**Grading:** Submission reward is computed by a deterministic checker that combines keyword overlap and structural similarity against the ground-truth property. The score is normalised to `[0, 5]` and then scaled to `[0.0, 1.0]` for the episode return.

---

### Task 3 — Rule Checker *(Easy)*

**Real-world analogue:** Given a known security rule (e.g., *"functions that transfer funds must emit a Transfer event"*), identify which function in the contract violates it.

**Setup:** The agent receives a Solidity file and a natural-language description of a property/rule. At least one function in the file violates this rule.

**Objective:** Identify the name of the rule-breaking function.
**Action Space:**

| Action | Reward | Notes |
|---|---|---|
| `get_property_specification` | −0.03 | Returns a pseudo-formal (CVL-like) version of the property |
| `list_functions` | −0.05 | All function signatures in the file |
| `get_function_metadata` | −0.05 | Visibility, modifiers, and signature for a function |
| `get_function_code` | −0.10 | Raw Solidity source of one function |
| `get_state_variables` | −0.05 | Contract-level state variable declarations |
| `get_call_graph` | −0.08 | Inter-function call graph |
| `submit` | +5.00 (exact) / +1.50 (sub-caller) / −1.50 (wrong) | One submission per episode |

**Partial credit:** If the agent names a function that *calls* the true violating function, it receives +1.50 rather than the full +5.00. This rewards reasoning that reaches the right area of the call graph.

---

## Reward Design

Rewards are shaped to encourage **efficient, targeted exploration** and discourage two failure modes: aimless browsing and brute-force guessing.

```
R_episode = Σ(step_rewards) + final_submission_reward
```

- **Exploration costs** are small and graduated by information value — cheap actions (metadata) cost less than expensive ones (full code retrieval).
- **Correct-direction bonuses** on `get_function_code` in Task 1 reward navigating toward the vulnerable function before committing.
- **Repetition penalty** (−0.40) discourages looping over the same queries.
- **Wrong submission** (−1.50) is painful enough to deter random guessing but recoverable through efficient prior exploration.
- **Episode score** is normalised to `[0.0, 1.0]` for the OpenEnv grader: `score = max(0, R_episode) / 5.0`.
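The shaping above can be sketched end to end. The penalty values and the normalisation formula come from the tables and bullets in this document; the bookkeeping class itself is illustrative.

```python
class EpisodeReward:
    """Accumulates shaped step rewards and normalises the final episode score."""

    REPEAT_PENALTY = -0.40  # repeated identical query (Task 1 table)
    MAX_SCORE = 5.0         # maximum submission reward

    def __init__(self) -> None:
        self.total = 0.0
        self.seen: set = set()

    def add_step(self, action_type: str, params: dict, base_reward: float) -> float:
        """Apply the step's base reward, or the repetition penalty for an exact duplicate."""
        key = (action_type, tuple(sorted(params.items())))
        reward = self.REPEAT_PENALTY if key in self.seen else base_reward
        self.seen.add(key)
        self.total += reward
        return reward

    def final_score(self, submission_reward: float) -> float:
        """score = max(0, R_episode) / 5.0, as used by the OpenEnv grader."""
        r_episode = self.total + submission_reward
        return max(0.0, r_episode) / self.MAX_SCORE
```

For example, calling `list_functions` twice costs −0.05 then −0.40; a correct +5.00 submission then yields a score of (5.00 − 0.45) / 5.0 = 0.91.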
---

## Observation Space

Every `step()` and `reset()` returns a typed `Observation` object:

```python
from typing import Any

from pydantic import BaseModel

class Observation(BaseModel):
    task_id: str              # "task1_vuln_detection" | "task2_property_discovery" | "task3_rule_checker"
    step: int                 # Current step index (0-indexed)
    max_steps: int            # Episode step budget
    cumulative_reward: float  # Running reward total
    done: bool                # Episode terminal flag
    content: str              # Main textual payload (code, summary, error, etc.)
    metadata: dict[str, Any]  # Extra context (function name, contract name, etc.)
    initial_description: str  # Persistent contract/task description shown every step
```

---

## Action Space

Actions are typed `Action` objects passed to `step()`:

```python
class Action(BaseModel):
    action_type: str        # One of the action names listed per task above
    params: dict[str, str]  # e.g. {"function_name": "withdraw"}
```

All unknown `action_type` values return a penalty observation without terminating the episode.

---

## OpenEnv Interface

The environment exposes a standard HTTP API:

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Liveness probe — returns `{"status": "ok"}` |
| `GET` | `/tasks` | Lists all tasks with ID, difficulty, and status |
| `POST` | `/reset` | Starts a new episode. Body: `{"task_id": str, "seed": int}` |
| `POST` | `/step` | Takes one action. Body: `{"action_type": str, "params": {}}` |
| `GET` | `/state` | Returns full internal episode state (debug) |
| `GET` | `/action_space` | Returns JSON schema of valid actions |
| `GET` | `/observation_space` | Returns JSON schema of observation structure |

### Quick Start

```bash
SPACE_URL=http://localhost:7860

# Start a new episode for Task 1
curl -X POST $SPACE_URL/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task1_vuln_detection", "seed": 42}'

# List all functions in the contract
curl -X POST $SPACE_URL/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "list_functions", "params": {}}'

# Inspect a specific function
curl -X POST $SPACE_URL/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "get_function_code", "params": {"function_name": "withdraw"}}'

# Submit your answer
curl -X POST $SPACE_URL/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "submit", "params": {"function_name": "withdraw", "vulnerability_type": "reentrancy"}}'
```

---

## Setup & Installation

### Prerequisites

- Docker ≥ 20.10
- Python 3.11+ (for local development)
- `OPENAI_API_KEY`, `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN` environment variables

### Run with Docker

```bash
# Build the image
docker build -t sc-audit-env .

# Run the container
docker run -p 7860:7860 \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e API_BASE_URL=$API_BASE_URL \
  -e MODEL_NAME=$MODEL_NAME \
  sc-audit-env

# Verify it's running
curl http://localhost:7860/health
```

### Run Locally (Development)

```bash
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
```

---

## Baseline Inference Script

The `inference.py` script runs an OpenAI-compatible model against all three tasks and reports episode scores. It reads credentials from environment variables and completes in under 20 minutes on a 2 vCPU / 8 GB machine.
```bash
export OPENAI_API_KEY=your_key
export API_BASE_URL=your_endpoint
export MODEL_NAME=your_model

python inference.py
```

**Expected output:**

```
=== Smart Contract Audit RL — Baseline Evaluation ===
Task 1 | Targeted Vulnerability Detection | Score: 0.41 | Steps used: 8/15
Task 2 | Property Discovery               | Score: 0.28 | Steps used: 6/10
Task 3 | Rule Checker                     | Score: 0.72 | Steps used: 4/10

Overall average: 0.47
```

> **Note:** Scores are stochastic due to random episode selection on `reset()`. Run with a fixed seed (`--seed 42`) for reproducible results.

### Agent System Prompt

The inference script injects the following system prompt to guide output format:

```
You are a smart contract security auditor. You will be given access to a
Solidity contract via a structured action API. Use the available actions to
investigate the contract, then submit your answer.

Always respond with a single JSON object:
{"action_type": "<action_name>", "params": {"<param>": "<value>"}}

Do not include any other text outside the JSON object.
```

---

## openenv.yaml

```yaml
name: smart-contract-audit-env
version: "1.2.0"
description: >
  OpenEnv RL environment for Solidity smart contract security auditing.
  Agents explore real-world DeFi contracts using a structured action API
  to detect vulnerabilities, discover properties, and check rule compliance.
tasks:
  - id: task1_vuln_detection
    name: Targeted Vulnerability Detection
    difficulty: medium
    max_steps: 40
    max_score: 1.0
  - id: task2_property_discovery
    name: Property Discovery
    difficulty: hard
    max_steps: 40
    max_score: 1.0
  - id: task3_rule_checker
    name: Rule Checker
    difficulty: easy
    max_steps: 20
    max_score: 1.0
observation_schema: models/observation.py
action_schema: models/action.py
app_port: 7860
```

---

## Data

The dataset (`data/dataset.json`) contains **7–8 labelled entries** per contract, each formatted according to `data/Template.json`. Ground truth is **never exposed** to the agent via any action.
The `submit` action is the only path to positive reward.

---

## Design Notes & Known Limitations

- **Reward calibration:** Step penalties and submission rewards may need tuning based on empirical agent performance. Current values are derived from initial design rationale, not from extensive ablation.
- **Call graph granularity:** The current `get_call_graph` action returns the entire graph at once. A future revision could expose it incrementally (per-function neighbours) to make the action more informative and cost-proportional.
- **Vulnerability naming:** Vulnerability types do not follow a fixed taxonomy. Grading uses keyword + semantic matching against a curated synonym list (e.g., `"re-entrancy"` ≡ `"reentrancy"`).
- **Dataset size:** The current dataset covers 3 contracts with 7–8 vulnerabilities each. Expanding to more Certora audit reports would improve task diversity and reduce overfitting risk.
- **`get_function_code` decomposition:** This action could be split into finer-grained sub-actions (`get_parameters`, `get_return_values`, `get_modifiers`) to give agents a more gradual information ladder.
- **Property similarity scoring (Task 2):** Sentence transformer models cannot be used in the containerised environment due to memory constraints. The checker instead uses TF-IDF cosine similarity combined with keyword matching against the ground-truth property.

---

## License

MIT — see `LICENSE` for details. Data sourced from public Certora audit reports. Solidity source files are reproduced for research and evaluation purposes.

---

## Citation

```bibtex
@misc{sc-audit-openenv-2025,
  title = {Smart Contract Audit RL Environment},
  year  = {2025},
  note  = {OpenEnv-compliant RL environment for Solidity security analysis.
           Data sourced from Certora audit reports (AaveVault, AaveVaultV2,
           Lido Finance).}
}
```