jeromerichard committed on
Commit 74e3b5e · 0 Parent(s):

Trust & Safety RL Environment - OpenEnv Hackathon

Files changed (15)
  1. .gitignore +20 -0
  2. Dockerfile +15 -0
  3. abouthack +838 -0
  4. app.py +157 -0
  5. client.py +89 -0
  6. inference.py +295 -0
  7. models.py +63 -0
  8. openenv.yaml +64 -0
  9. pyproject.toml +33 -0
  10. readme.md +24 -0
  11. requirements.txt +6 -0
  12. simpley +1716 -0
  13. tasks.py +296 -0
  14. train.py +222 -0
  15. your_environment.py +440 -0
.gitignore ADDED
# Environment variables (contains API keys)
.env

# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
*.egg-info/
dist/
build/
.venv/
venv/

# IDE
.vscode/
.idea/
*.swp
*.swo
Dockerfile ADDED
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 7860

ENV PORT=7860
ENV HOST=0.0.0.0

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
abouthack ADDED
https://www.scaler.com/school-of-technology/meta-pytorch-hackathon

Code Review by Meta Engineers

Get your work reviewed by engineers shaping agentic AI at Meta.

Real Open Source Contribution

Your code ships to a Meta-backed project, visible on your GitHub profile.

ROUND-1

Round 1 - Build Your Mini RL Environment:

Wednesday, 25th March - Wednesday, 8th April

Build a mini RL environment with defined tasks, graders, and reward logic. Evaluation includes programmatic checks and LLM scoring.

OpenEnv is an open-source framework by Meta & Hugging Face for creating standardized, isolated, and reusable environments for training and deploying AI agents.

Think of it as the universal language for AI training environments. It uses a Gymnasium-style API, containerized execution via Docker, and a central hub on Hugging Face for sharing environments.

Teams at Meta use OpenEnv to define environments once and run them consistently across training, post-training, and evaluation. Now, you get to build on the same infrastructure.

Read this repository for the reference code to be used in this hackathon:
https://github.com/huggingface/openenv-course

Round 1 — Problem Statement:
Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard step() / reset() / state() API.

Key Requirements at a Glance

Must simulate a real-world task (not games or toys)

Implement the full OpenEnv spec: typed models, step()/reset()/state(), openenv.yaml

Minimum 3 tasks with agent graders (easy → medium → hard, scores 0.0–1.0)

Meaningful reward function with partial progress signals

Baseline inference script with reproducible scores

Deploy to Hugging Face Spaces + working Dockerfile

README with environment description, action/observation spaces, setup instructions

Functional Requirements:

Real-world task simulation

The environment must simulate a task humans actually do. Not games, not toys. Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.

OpenEnv spec compliance

Implement the full OpenEnv interface: typed Observation, Action, and Reward Pydantic models. step(action) returns observation, reward, done, info. reset() returns the initial observation. state() returns the current state. openenv.yaml with metadata. Tested via openenv validate.

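As an illustration of that interface, here is a minimal sketch of an environment with typed models. The task, class, and field names are invented for this example, and plain dataclasses stand in for the Pydantic models the spec requires:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TriageAction:
    # Example action: label the current email (hypothetical task)
    label: str  # e.g. "urgent", "normal", "spam"


@dataclass
class TriageObservation:
    # Example observation returned by reset() and step()
    email_text: str
    step_number: int = 0
    done: bool = False
    reward: float = 0.0


class EmailTriageEnv:
    """Minimal step()/reset()/state() shape (not the real OpenEnv base class)."""

    def __init__(self) -> None:
        self._inbox: List[str] = ["Server down, fix ASAP", "Lunch menu for Friday"]
        self._urgent = {0}  # indices of genuinely urgent emails
        self._idx = 0

    def reset(self) -> TriageObservation:
        self._idx = 0
        return TriageObservation(email_text=self._inbox[0])

    def step(self, action: TriageAction) -> TriageObservation:
        # Reward 1.0 for a correct label, 0.0 otherwise
        expected = "urgent" if self._idx in self._urgent else "normal"
        reward = 1.0 if action.label == expected else 0.0
        self._idx += 1
        done = self._idx >= len(self._inbox)
        text = "" if done else self._inbox[self._idx]
        return TriageObservation(text, self._idx, done, reward)

    def state(self) -> Dict[str, int]:
        return {"step_count": self._idx}
```

A real submission would subclass the OpenEnv base classes and use Pydantic; only the shape of the three methods is the point here.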
Minimum 3 tasks with agent graders

Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range: easy → medium → hard. Graders must have clear, deterministic success/failure criteria.

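A grader satisfying these constraints can be as small as a single pure function; the email-labelling task here is a made-up example:

```python
from typing import List


def grade_labels(predicted: List[str], expected: List[str]) -> float:
    """Deterministic grader: fraction of correct labels, always in [0.0, 1.0]."""
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)
```

Because the score is a simple ratio over a fixed answer key, repeated runs on the same trajectory always produce the same value, which is what the determinism requirement asks for.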
Meaningful reward function

Provides signal over the full trajectory (not just binary end-of-episode). Rewards partial progress toward task completion. Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions).

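One way to satisfy all three properties is a shaped per-step reward; the coefficients below are arbitrary illustrations, not prescribed values:

```python
def shaped_reward(
    progress: float,       # task completion in [0, 1] after this step
    prev_progress: float,  # completion before this step
    destructive: bool,     # did the agent take a clearly harmful action?
) -> float:
    reward = progress - prev_progress  # dense signal: credit partial progress
    reward -= 0.01                     # small step cost discourages infinite loops
    if destructive:
        reward -= 0.5                  # explicit penalty for undesirable behavior
    return reward
```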
Baseline inference script

Uses the OpenAI API client to run a model against the environment. Reads API credentials from environment variables (OPENAI_API_KEY). Produces a reproducible baseline score on all 3 tasks.

Non-Functional Requirements:

Deploys to a Hugging Face Space

Environment must run as a containerized HF Space tagged with openenv.

Containerized execution

Must include a working Dockerfile. The environment should start cleanly with docker build + docker run.

Documentation

README must include: environment description and motivation, action and observation space definitions, task descriptions with expected difficulty, setup and usage instructions, baseline scores.

Evaluation Criteria:

Real-world utility (30%): Does the environment model a genuine task? Would someone actually use this to train or evaluate agents?

Task & grader quality (25%): Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression?

Environment design (20%): Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries.

Code quality & spec compliance (15%): Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works.

Creativity & novelty (10%): Novel problem domain, interesting mechanics, clever reward design, original approach.

Scoring Breakdown

Real-world utility (30%)

• 0–5: Toy/artificial problem with no practical application
• 6–15: Valid domain but shallow modeling of the real task
• 16–25: Good domain modeling, would be useful for agent evaluation
• 26–30: Excellent — fills a real gap, immediate value for the RL/agent community

Task & grader quality (25%)

• 3+ tasks with difficulty range?
• Graders produce scores between 0.0–1.0?
• Graders deterministic and reproducible?
• Hard task genuinely challenges frontier models?

Environment design (20%)

• reset() produces clean state?
• Action/observation types well-designed and documented?
• Reward function provides useful varying signal (not just sparse)?
• Episode boundaries sensible?

Code quality & spec compliance (15%)

• openenv validate passes?
• docker build && docker run works?
• HF Space deploys and responds?
• Baseline script runs and reproduces scores?

Creativity & novelty (10%)

• Domain we haven't seen in OpenEnv before?
• Reward design has interesting properties?
• Clever mechanics that make the environment engaging?

How Judging Works:

Phase 1: Automated Validation

Pass/fail gate — HF Space deploys, OpenEnv spec compliance, Dockerfile builds, baseline reproduces, 3+ tasks with graders.

Phase 2: Agentic Evaluation

Scored — baseline agent re-run, standard open LLM agent (e.g. Nemotron 3 Super) run against all environments, score variance check.

Phase 3: Human Review

Top submissions reviewed by Meta and Hugging Face engineers for real-world utility, creativity, and exploit checks.

Disqualification Criteria

Environment does not deploy or respond

Plagiarized or trivially modified existing environments

Graders that always return the same score

No baseline inference script

Pre-Submission Checklist — all must pass or you're disqualified:

HF Space deploys

Automated ping to the Space URL — must return 200 and respond to reset()

OpenEnv spec compliance

Validate openenv.yaml, typed models, step()/reset()/state() endpoints

Dockerfile builds

Automated docker build on the submitted repo

Baseline reproduces

Run the submitted inference script — must complete without error and produce scores

3+ tasks with graders

Enumerate tasks, run each grader, verify scores in the 0.0–1.0 range

Additional Instructions

Before submitting, ensure the following variables are defined in your environment configuration:

API_BASE_URL: the API endpoint for the LLM.

MODEL_NAME: the model identifier to use for inference.

HF_TOKEN: your Hugging Face / API key.

The inference script must be named `inference.py` and placed in the root directory of the project.

Participants must use the OpenAI client for all LLM calls, using the variables above.

Infra Restrictions

The inference script must finish in under 20 minutes.

Make sure your environment and inference script can run on a machine with 2 vCPUs and 8 GB of memory.

Validator

Run the pre-submission validation script before submitting.

Sample Inference Script:

"""
Inference Script Example
===================================
MANDATORY
- Before submitting, ensure the following variables are defined in your environment configuration:
  API_BASE_URL  The API endpoint for the LLM.
  MODEL_NAME    The model identifier to use for inference.
  HF_TOKEN      Your Hugging Face / API key.

- The inference script must be named `inference.py` and placed in the root directory of the project
- Participants must use the OpenAI client for all LLM calls using the above variables
"""

import os
import re
import base64
import textwrap
from io import BytesIO
from typing import List, Optional, Dict

from openai import OpenAI
import numpy as np
from PIL import Image

from browsergym_env import BrowserGymAction, BrowserGymEnv

API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
MODEL_NAME = os.getenv("MODEL_NAME")
MAX_STEPS = 8
MAX_DOM_CHARS = 3500
TEMPERATURE = 0.2
MAX_TOKENS = 200
FALLBACK_ACTION = "noop()"

DEBUG = True
ACTION_PREFIX_RE = re.compile(
    r"^(action|next action)\s*[:\-]\s*",
    re.IGNORECASE,
)
ACTION_PATTERN = re.compile(r"[A-Za-z_]+\s*\(.*\)", re.DOTALL)


SYSTEM_PROMPT = textwrap.dedent(
    """
    You control a web browser through BrowserGym.
    Reply with exactly one action string.
    The action must be a valid BrowserGym command such as:
    - noop()
    - click('<BID>')
    - type('selector', 'text to enter')
    - fill('selector', 'text to enter')
    - send_keys('Enter')
    - scroll('down')
    Use single quotes around string arguments.
    When clicking, use the BrowserGym element IDs (BIDs) listed in the user message.
    If you are unsure, respond with noop().
    Do not include explanations or additional text.
    """
).strip()


def build_history_lines(history: List[str]) -> str:
    if not history:
        return "None"
    return "\n".join(history[-4:])


def extract_screenshot_uri(observation) -> Optional[str]:
    if observation.screenshot is None:
        return None
    screen_array = np.array(observation.screenshot, dtype=np.uint8)
    image = Image.fromarray(screen_array)
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    buffer.seek(0)
    data_uri = base64.b64encode(buffer.read()).decode("utf-8")
    return f"data:image/png;base64,{data_uri}"


def extract_clickable_elements(observation) -> List[Dict[str, str]]:
    """Collect BrowserGym element IDs that can be clicked."""
    metadata = getattr(observation, "metadata", {}) or {}
    obs_dict = metadata.get("browsergym_obs", {}) or {}
    extra_props = obs_dict.get("extra_element_properties", {}) or {}

    clickables: List[Dict[str, str]] = []
    for bid, props in extra_props.items():
        if not props.get("clickable"):
            continue

        bbox = props.get("bbox") or []
        bbox_str = ", ".join(str(v) for v in bbox) if bbox else "?"
        clickables.append(
            {
                "bid": str(bid),
                "bbox": bbox_str,
            }
        )

    # Keep a stable ordering for readability
    clickables.sort(key=lambda item: item["bid"])
    return clickables


def build_user_prompt(step: int, observation, history: List[str]) -> str:
    goal = observation.goal or "(not provided)"
    url = observation.url or "(unknown)"
    error_note = "Yes" if observation.last_action_error else "No"

    clickables = extract_clickable_elements(observation)
    if clickables:
        actions_hint = "\n".join(
            f"  - {item['bid']} (bbox: {item['bbox']})" for item in clickables
        )
    else:
        actions_hint = "  (none detected)"

    prompt = textwrap.dedent(
        f"""
        Step: {step}
        Goal: {goal}
        Current URL: {url}
        Previous steps:
        {build_history_lines(history)}
        Last action error: {error_note}
        Available clickable element IDs: {actions_hint}
        Reply with exactly one BrowserGym action string.
        """
    ).strip()
    return prompt


def parse_model_action(response_text: str) -> str:
    if not response_text:
        return FALLBACK_ACTION

    # Prefer the first line that looks like an action string
    lines = response_text.splitlines()
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        line = ACTION_PREFIX_RE.sub("", line)
        match = ACTION_PATTERN.search(line)
        if match:
            action = match.group(0).strip()
            # Collapse internal whitespace
            action = re.sub(r"\s+", " ", action)
            return action

    # Fall back to searching the whole response
    match = ACTION_PATTERN.search(response_text)
    if match:
        action = match.group(0).strip()
        action = re.sub(r"\s+", " ", action)
        return action

    return FALLBACK_ACTION


def main() -> None:
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)

    env = BrowserGymEnv.from_docker_image(
        image="browsergym-env:latest",
        env_vars={
            "BROWSERGYM_BENCHMARK": "miniwob",
            "BROWSERGYM_TASK_NAME": "click-test",
        },
    )

    history: List[str] = []

    try:
        result = env.reset()
        observation = result.observation
        print(f"Episode goal: {observation.goal}")

        for step in range(1, MAX_STEPS + 1):
            if result.done:
                print("Environment signalled done. Stopping early.")
                break

            user_prompt = build_user_prompt(step, observation, history)
            user_content = [{"type": "text", "text": user_prompt}]
            screenshot_uri = extract_screenshot_uri(observation)
            if screenshot_uri:
                user_content.append(
                    {
                        "type": "image_url",
                        "image_url": {"url": screenshot_uri},
                    }
                )

            messages = [
                {
                    "role": "system",
                    "content": [{"type": "text", "text": SYSTEM_PROMPT}],
                },
                {
                    "role": "user",
                    "content": user_content,
                },
            ]

            try:
                completion = client.chat.completions.create(
                    model=MODEL_NAME,
                    messages=messages,
                    temperature=TEMPERATURE,
                    max_tokens=MAX_TOKENS,
                    stream=False,
                )
                response_text = completion.choices[0].message.content or ""
            # pylint: disable=broad-except
            except Exception as exc:  # noqa: BLE001
                failure_msg = f"Model request failed ({exc}). Using fallback action."
                print(failure_msg)
                response_text = FALLBACK_ACTION

            action_str = parse_model_action(response_text)
            print(f"Step {step}: model suggested -> {action_str}")

            result = env.step(BrowserGymAction(action_str=action_str))
            observation = result.observation

            reward = result.reward or 0.0
            error_flag = " ERROR" if observation.last_action_error else ""
            history_line = (
                f"Step {step}: {action_str} -> reward {reward:+.2f}{error_flag}"
            )
            history.append(history_line)
            print(
                "  Reward: "
                f"{reward:+.2f} | Done: {result.done} | Last action error: "
                f"{observation.last_action_error}"
            )

            if result.done:
                print("Episode complete.")
                break

        else:
            print(f"Reached max steps ({MAX_STEPS}).")

    finally:
        env.close()


if __name__ == "__main__":
    main()

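The action-parsing helper in the script above can be exercised in isolation. This trimmed re-implementation uses the same two regexes to show what the parser accepts and when it falls back to noop():

```python
import re

ACTION_PREFIX_RE = re.compile(r"^(action|next action)\s*[:\-]\s*", re.IGNORECASE)
ACTION_PATTERN = re.compile(r"[A-Za-z_]+\s*\(.*\)", re.DOTALL)
FALLBACK_ACTION = "noop()"


def parse_model_action(response_text: str) -> str:
    # Empty responses fall back immediately
    if not response_text:
        return FALLBACK_ACTION
    # Prefer the first line that looks like an action call
    for raw_line in response_text.splitlines():
        line = ACTION_PREFIX_RE.sub("", raw_line.strip())
        match = ACTION_PATTERN.search(line)
        if match:
            return re.sub(r"\s+", " ", match.group(0).strip())
    # Otherwise search the whole response
    match = ACTION_PATTERN.search(response_text)
    if match:
        return re.sub(r"\s+", " ", match.group(0).strip())
    return FALLBACK_ACTION
```

Note how the "Action:" / "Next action:" prefixes a chatty model tends to emit are stripped before matching, and anything without a function-call shape degrades to a safe noop().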
Pre-Validation Script:

#!/usr/bin/env bash
#
# validate-submission.sh — OpenEnv Submission Validator
#
# Checks that your HF Space is live, the Docker image builds, and openenv validate passes.
#
# Prerequisites:
#   - Docker: https://docs.docker.com/get-docker/
#   - openenv-core: pip install openenv-core
#   - curl (usually pre-installed)
#
# Run:
#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
#
# Or download and run locally:
#   chmod +x validate-submission.sh
#   ./validate-submission.sh <ping_url> [repo_dir]
#
# Arguments:
#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
#   repo_dir   Path to your repo (default: current directory)
#
# Examples:
#   ./validate-submission.sh https://my-team.hf.space
#   ./validate-submission.sh https://my-team.hf.space ./my-repo
#

set -uo pipefail

DOCKER_BUILD_TIMEOUT=600
if [ -t 1 ]; then
  RED='\033[0;31m'
  GREEN='\033[0;32m'
  YELLOW='\033[1;33m'
  BOLD='\033[1m'
  NC='\033[0m'
else
  RED='' GREEN='' YELLOW='' BOLD='' NC=''
fi

run_with_timeout() {
  local secs="$1"; shift
  if command -v timeout &>/dev/null; then
    timeout "$secs" "$@"
  elif command -v gtimeout &>/dev/null; then
    gtimeout "$secs" "$@"
  else
    "$@" &
    local pid=$!
    ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
    local watcher=$!
    wait "$pid" 2>/dev/null
    local rc=$?
    kill "$watcher" 2>/dev/null
    wait "$watcher" 2>/dev/null
    return $rc
  fi
}

portable_mktemp() {
  local prefix="${1:-validate}"
  mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
}

CLEANUP_FILES=()
cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
trap cleanup EXIT

PING_URL="${1:-}"
REPO_DIR="${2:-.}"

if [ -z "$PING_URL" ]; then
  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
  printf "\n"
  printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
  printf "  repo_dir   Path to your repo (default: current directory)\n"
  exit 1
fi

if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
  printf "Error: directory '%s' not found\n" "${2:-.}"
  exit 1
fi
PING_URL="${PING_URL%/}"
export PING_URL
PASS=0

log() { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
fail() { log "${RED}FAILED${NC} -- $1"; }
hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
stop_at() {
  printf "\n"
  printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
  exit 1
}

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${BOLD} OpenEnv Submission Validator${NC}\n"
printf "${BOLD}========================================${NC}\n"
log "Repo:     $REPO_DIR"
log "Ping URL: $PING_URL"
printf "\n"

log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."

CURL_OUTPUT=$(portable_mktemp "validate-curl")
CLEANUP_FILES+=("$CURL_OUTPUT")
HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
  -H "Content-Type: application/json" -d '{}' \
  "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")

if [ "$HTTP_CODE" = "200" ]; then
  pass "HF Space is live and responds to /reset"
elif [ "$HTTP_CODE" = "000" ]; then
  fail "HF Space not reachable (connection failed or timed out)"
  hint "Check your network connection and that the Space is running."
  hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
  stop_at "Step 1"
else
  fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
  hint "Make sure your Space is running and the URL is correct."
  hint "Try opening $PING_URL in your browser first."
  stop_at "Step 1"
fi

log "${BOLD}Step 2/3: Running docker build${NC} ..."

if ! command -v docker &>/dev/null; then
  fail "docker command not found"
  hint "Install Docker: https://docs.docker.com/get-docker/"
  stop_at "Step 2"
fi

if [ -f "$REPO_DIR/Dockerfile" ]; then
  DOCKER_CONTEXT="$REPO_DIR"
elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
  DOCKER_CONTEXT="$REPO_DIR/server"
else
  fail "No Dockerfile found in repo root or server/ directory"
  stop_at "Step 2"
fi

log "  Found Dockerfile in $DOCKER_CONTEXT"

BUILD_OK=false
BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true

if [ "$BUILD_OK" = true ]; then
  pass "Docker build succeeded"
else
  fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
  printf "%s\n" "$BUILD_OUTPUT" | tail -20
  stop_at "Step 2"
fi

log "${BOLD}Step 3/3: Running openenv validate${NC} ..."

if ! command -v openenv &>/dev/null; then
  fail "openenv command not found"
  hint "Install it: pip install openenv-core"
  stop_at "Step 3"
fi

VALIDATE_OK=false
VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true

if [ "$VALIDATE_OK" = true ]; then
  pass "openenv validate passed"
  [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
else
  fail "openenv validate failed"
  printf "%s\n" "$VALIDATE_OUTPUT"
  stop_at "Step 3"
fi

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${GREEN}${BOLD} All 3/3 checks passed!${NC}\n"
printf "${GREEN}${BOLD} Your submission is ready to submit.${NC}\n"
printf "${BOLD}========================================${NC}\n"
printf "\n"

exit 0

When Round 1 opens, you'll choose 1 of 4–5 problem statements and build an OpenEnv environment around it.

Example of what a problem statement looks like:

"Build a mini-game RL environment with clearly defined tasks, automated graders, and reward logic using the OpenEnv framework."

→ Create a mini-game an AI agent can play

→ Define tasks with increasing difficulty

→ Write graders that verify task completion

→ Define reward logic for scoring

→ Package using OpenEnv for automated evaluation

Evaluation Criteria

Runtime correctness: runs without errors

Interface compliance: follows the OpenEnv standard

Task design: clear, realistic, testable

Grading logic: the reward system makes sense

Install before April 1st.

Python 3.10+

Install 3.10, 3.11, or 3.12.

$ python --version

Git + GitHub account

Push your submission to GitHub or HF.

$ git --version

Hugging Face CLI

Deploy to HF Spaces.

$ pip install huggingface_hub
$ huggingface-cli login

OpenEnv

The framework.

$ pip install openenv-core

Google Colab

The prep course runs in Colab. The free tier works.

→ colab.research.google.com

Docker

Isolated container testing.

$ docker --version

Recommended: VS Code

Best Python + Docker support.

When Round 1 starts on 1 April:

Step 1: Application Form
Choose 1 of the 4–5 problem statements revealed on the platform.

Step 2: Scaffold

$ openenv init my_env

Generate the project structure.

Step 3: Build
Define your environment in the generated files.

Step 4: Test locally

$ uv run server

Step 5: Deploy

$ openenv push --repo-id your-username/my-env

Step 6: Submit
Paste your HF Spaces URL here before the deadline.

Deadline: 8 April 2026, 11:59 PM IST
app.py ADDED
from __future__ import annotations

from typing import Any, Dict, Optional

from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from pydantic import BaseModel

from models import TrustAction, TrustObservation, TrustState, ContentSignals
from your_environment import TrustSafetyEnvironment

# ── Force manual FastAPI (openenv_core create_app causes 422 on /step) ────────
print("[app] Using manual FastAPI")

_env = TrustSafetyEnvironment(seed=42)

app = FastAPI(
    title="Trust & Safety RL Environment",
    description="Risk-aware content moderation environment for agent training.",
    version="1.0.0",
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)


# ── Serializers ───────────────────────────────────────────────────────────────

def _obs_to_dict(obs: TrustObservation) -> Dict[str, Any]:
    return {
        "ticket_id": obs.ticket_id,
        "post_text": obs.post_text,
        "image_description": obs.image_description,
        "comments_found": obs.comments_found,
        "user_history_found": obs.user_history_found,
        "entity_status_found": obs.entity_status_found,
        "policy_found": obs.policy_found,
        "extracted_signals": obs.extracted_signals,
        "validation_result": obs.validation_result,
        "step_number": obs.step_number,
        "info": obs.info,
        "done": obs.done,
        "reward": obs.reward,
    }


def _state_to_dict(s: TrustState) -> Dict[str, Any]:
    return {
        "episode_id": s.episode_id,
        "step_count": s.step_count,
        "current_task_id": s.current_task_id,
        "difficulty": s.difficulty,
        "ambiguity_level": s.ambiguity_level,
        "risk_level": s.risk_level,
        "tools_used": s.tools_used,
        "signals_extracted": s.signals_extracted,
        "is_done": s.is_done,
    }


# ── Request bodies ─────────────────────────────────────────────────────────────

class ResetRequest(BaseModel):
    seed: Any = None
    episode_id: Any = None

    model_config = {"extra": "ignore"}


class ActionRequest(BaseModel):
    action_type: str = ""
    tool_name: Optional[str] = None
    signals: Optional[Dict[str, Any]] = None  # raw dict — validated below
    final_decision: Optional[str] = None

    model_config = {"extra": "ignore"}  # ignore unknown keys from the LLM


# ── Helpers ────────────────────────────────────────────────────────────────────

def _parse_signals(raw: Dict[str, Any]) -> ContentSignals:
    """Defensively normalise LLM signal output before Pydantic validation."""
    # Clamp floats
    raw["toxicity_level"] = float(raw.get("toxicity_level", 0.5))
    raw["confidence"] = float(raw.get("confidence", 0.5))

    # content_flags must be a list of strings
    flags = raw.get("content_flags", [])
    if not isinstance(flags, list):
        flags = [flags] if isinstance(flags, str) else []
    raw["content_flags"] = [str(f) for f in flags]

    # Boolean coercion
    raw["is_protected_class"] = bool(raw.get("is_protected_class", False))
    raw["is_direct_attack"] = bool(raw.get("is_direct_attack", False))
    raw["abusive_language_present"] = bool(raw.get("abusive_language_present", False))

    # String fields — fall back to sensible defaults
    raw.setdefault("target", "none")
    raw.setdefault("intent", "ambiguous")
    raw.setdefault("context_type", "statement")

    return ContentSignals(**raw)


# ── Routes ─────────────────────────────────────────────────────────────────────

@app.get("/health")
async def health():
    return {"status": "ok", "environment": "trust-safety-env", "version": "1.0.0"}


@app.get("/")
async def root():
    return {"status": "ok", "docs": "/docs"}


@app.post("/reset")
async def reset(body: ResetRequest = ResetRequest()):
    obs = _env.reset(seed=body.seed, episode_id=body.episode_id)
    return JSONResponse(_obs_to_dict(obs))


@app.post("/step")
async def step(body: ActionRequest):
    # Parse + validate signals defensively
    signals: Optional[ContentSignals] = None
    if body.signals:
        try:
            signals = _parse_signals(dict(body.signals))  # copy so we don't mutate
        except Exception as e:
            raise HTTPException(status_code=400, detail=f"Invalid signals payload: {e}")

    action = TrustAction(
        action_type=body.action_type,
        tool_name=body.tool_name,
        signals=signals,
        final_decision=body.final_decision,
    )

    try:
        obs = _env.step(action)
    except (RuntimeError, ValueError) as e:
        raise HTTPException(status_code=400, detail=str(e))

    return JSONResponse(_obs_to_dict(obs))


@app.get("/state")
async def state():
    return JSONResponse(_state_to_dict(_env.state))
client.py ADDED
@@ -0,0 +1,89 @@
+ from __future__ import annotations
+ from typing import Any
+
+ from openenv.core.env_client import EnvClient  # ✅ correct import
+ from openenv.core.client_types import StepResult  # ✅ correct import
+
+ from models import TrustAction, TrustObservation, TrustState, ContentSignals
+
+
+ class TrustSafetyEnv(EnvClient[TrustAction, TrustObservation, TrustState]):  # ✅ EnvClient, 3 generics
+     """
+     Typed WebSocket/HTTP client for the Trust & Safety RL Environment.
+
+     Usage (sync — for scripts, GRPOTrainer):
+         env = TrustSafetyEnv(base_url="http://localhost:8000").sync()
+         result = env.reset()
+         result = env.reset(episode_id="T-001")
+         result = env.step(TrustAction(action_type="use_tool", tool_name="view_policy"))
+         result = env.step(TrustAction(action_type="final_decision", final_decision="REMOVE"))
+         state = env.state()
+         env.close()
+
+     Usage (async):
+         async with TrustSafetyEnv(base_url="http://localhost:8000") as env:
+             result = await env.reset()
+     """
+
+     def step_payload(self, action: TrustAction) -> dict:  # ✅ NO underscore
+         payload: dict[str, Any] = {"action_type": action.action_type}
+
+         if action.tool_name is not None:
+             payload["tool_name"] = action.tool_name
+
+         if action.signals is not None:
+             s = action.signals
+             payload["signals"] = {
+                 "target": s.target,
+                 "is_protected_class": s.is_protected_class,
+                 "toxicity_level": float(s.toxicity_level),
+                 "is_direct_attack": s.is_direct_attack,
+                 "context_type": s.context_type,
+                 "intent": s.intent,
+                 "confidence": float(s.confidence),
+                 "abusive_language_present": s.abusive_language_present,
+                 "content_flags": list(s.content_flags),
+             }
+
+         if action.final_decision is not None:
+             payload["final_decision"] = action.final_decision
+
+         return payload
+
+     def parse_result(self, payload: dict) -> StepResult[TrustObservation]:  # ✅ NO underscore
+         obs_data = payload.get("observation", payload)
+
+         obs = TrustObservation(
+             ticket_id=obs_data.get("ticket_id", ""),
+             post_text=obs_data.get("post_text", ""),
+             image_description=obs_data.get("image_description", ""),
+             comments_found=obs_data.get("comments_found"),
+             user_history_found=obs_data.get("user_history_found"),
+             entity_status_found=obs_data.get("entity_status_found"),
+             policy_found=obs_data.get("policy_found"),
+             extracted_signals=obs_data.get("extracted_signals"),
+             validation_result=obs_data.get("validation_result"),
+             step_number=obs_data.get("step_number", 0),
+             info=obs_data.get("info"),
+             done=payload.get("done", obs_data.get("done", False)),
+             reward=payload.get("reward", obs_data.get("reward")),
+         )
+
+         return StepResult(
+             observation=obs,
+             reward=payload.get("reward", obs_data.get("reward")),
+             done=payload.get("done", obs_data.get("done", False)),
+         )
+
+     def parse_state(self, payload: dict) -> TrustState:  # ✅ NO underscore
+         return TrustState(
+             episode_id=payload.get("episode_id"),
+             step_count=payload.get("step_count", 0),
+             current_task_id=payload.get("current_task_id"),
+             difficulty=payload.get("difficulty"),
+             ambiguity_level=payload.get("ambiguity_level"),
+             risk_level=payload.get("risk_level"),
+             tools_used=payload.get("tools_used", []),
+             signals_extracted=payload.get("signals_extracted", False),
+             is_done=payload.get("is_done", False),
+         )
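
The payload shape `step_payload` emits can be illustrated without the `openenv` dependency. The sketch below uses plain dataclasses as stand-ins for the Pydantic models (the class names `Action` and `Signals` are hypothetical, not the real `TrustAction`/`ContentSignals`), and mirrors the include-only-set-fields logic above:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Signals:
    target: str = "none"
    toxicity_level: float = 0.5
    confidence: float = 0.5
    content_flags: list = field(default_factory=list)

@dataclass
class Action:
    action_type: str
    tool_name: Optional[str] = None
    signals: Optional[Signals] = None
    final_decision: Optional[str] = None

def step_payload(action: Action) -> dict:
    # Mirror of TrustSafetyEnv.step_payload: serialize only the fields that are set.
    payload = {"action_type": action.action_type}
    if action.tool_name is not None:
        payload["tool_name"] = action.tool_name
    if action.signals is not None:
        s = action.signals
        payload["signals"] = {
            "target": s.target,
            "toxicity_level": float(s.toxicity_level),
            "confidence": float(s.confidence),
            "content_flags": list(s.content_flags),
        }
    if action.final_decision is not None:
        payload["final_decision"] = action.final_decision
    return payload

p1 = step_payload(Action(action_type="use_tool", tool_name="view_policy"))
p2 = step_payload(Action(action_type="final_decision", final_decision="REMOVE"))
```

Keeping unset fields out of the payload means the server's `ActionRequest` defaults apply, rather than explicit nulls overwriting them.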
inference.py ADDED
@@ -0,0 +1,295 @@
+ """
+ inference.py — Trust & Safety RL Environment Evaluation
+ ========================================================
+ MANDATORY env vars:
+     API_BASE_URL    LLM endpoint (e.g. https://router.huggingface.co/v1)
+     MODEL_NAME      Model ID (e.g. meta-llama/Llama-3.1-8B-Instruct)
+     HF_TOKEN        API key
+     ENV_BASE_URL    Environment server URL (default: http://localhost:8000)
+ """
+
+ import os, json, time, requests
+ from openai import OpenAI
+
+ API_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1")
+ API_KEY = os.environ.get("HF_TOKEN") or os.environ.get("API_KEY", "")
+ MODEL_NAME = os.environ.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct")
+ ENV_BASE_URL = os.environ.get("ENV_BASE_URL", "http://localhost:8000")
+
+ if not API_KEY:
+     raise EnvironmentError("Set HF_TOKEN (your Hugging Face / API key).")
+ if not MODEL_NAME:
+     raise EnvironmentError("Set MODEL_NAME.")
+
+ SYSTEM_PROMPT = """You are a Trust & Safety analyst. Follow this EXACT 3-stage pipeline.
+ Skipping any stage results in a score penalty.
+
+ ════════════════════════════════════════════
+ STAGE 1 — Investigate (optional)
+ ════════════════════════════════════════════
+ Use tools only when context is needed. For OBVIOUS violations (spam, phishing), skip.
+
+ {"action_type": "use_tool", "tool_name": "read_comments"}
+ {"action_type": "use_tool", "tool_name": "check_user_history"}
+ {"action_type": "use_tool", "tool_name": "check_entity_status"}
+ {"action_type": "use_tool", "tool_name": "view_policy"}
+
+ ════════════════════════════════════════════
+ STAGE 2 — Extract signals (MANDATORY)
+ ════════════════════════════════════════════
+ You MUST call extract_signals before final_decision. No exceptions.
+ You are a FEATURE EXTRACTOR — output what the content shows, do NOT decide.
+
+ {"action_type": "extract_signals", "signals": {
+   "target": "individual|group|religion|race|gender|political|none",
+   "is_protected_class": true|false,
+   "toxicity_level": 0.0-1.0,
+   "is_direct_attack": true|false,
+   "context_type": "statement|satire|news|meme|personal_attack|protest|education|spam",
+   "intent": "harmful|descriptive|ambiguous|satirical|educational",
+   "confidence": 0.0-1.0,
+   "abusive_language_present": true|false,
+   "content_flags": ["spam","phishing","slur","threat","misinformation",
+                     "political_content","public_figure","competitive_language"]
+ }}
+
+ After submitting signals you receive VALIDATION. If needs_escalation_hint is true → prefer ESCALATE.
+
+ ════════════════════════════════════════════
+ STAGE 3 — Final decision (MANDATORY)
+ ════════════════════════════════════════════
+ {"action_type": "final_decision", "final_decision": "ALLOW|REMOVE|ALLOW_WITH_WARNING|ESCALATE"}
+
+ DECISION RULES (apply in order):
+ 1. Phishing / spam / impersonation → REMOVE
+ 2. Explicit hate speech, direct threats → REMOVE
+ 3. Political satire of a PUBLIC FIGURE → ALLOW_WITH_WARNING
+ 4. News / journalism / education → ALLOW_WITH_WARNING
+ 5. Gaming / sports banter (confirmed) → ALLOW
+ 6. Adjusted confidence after validation < 0.6 → ESCALATE
+ 7. Genuinely ambiguous, no clear rule → ESCALATE
+ 8. Safe, no signals triggered → ALLOW
+
+ NEVER jump from Stage 1 directly to final_decision.
+ ALWAYS: [optional tools] → extract_signals → final_decision
+ Respond in strict JSON only. No markdown."""
+
+ SIGNAL_DEFAULTS = {
+     "target": "none", "is_protected_class": False,
+     "toxicity_level": 0.5, "is_direct_attack": False,
+     "context_type": "statement", "intent": "ambiguous",
+     "confidence": 0.5, "abusive_language_present": False,
+     "content_flags": [],
+ }
+ VALID_TOOLS = {"read_comments", "check_user_history", "check_entity_status", "view_policy"}
+ VALID_DECISIONS = {"ALLOW", "REMOVE", "ALLOW_WITH_WARNING", "ESCALATE"}
+ VALID_TYPES = {"use_tool", "extract_signals", "final_decision"}
+
+
+ def _clamp_signals(raw: dict) -> dict:
+     result = SIGNAL_DEFAULTS.copy()
+     for k in SIGNAL_DEFAULTS:
+         if k in raw:
+             result[k] = raw[k]
+     try:
+         result["toxicity_level"] = max(0.0, min(1.0, float(result["toxicity_level"])))
+         result["confidence"] = max(0.0, min(1.0, float(result["confidence"])))
+     except (TypeError, ValueError):
+         result["toxicity_level"] = 0.5
+         result["confidence"] = 0.5
+     if not isinstance(result["content_flags"], list):
+         result["content_flags"] = []
+     return result
+
+
+ def _parse(text: str) -> dict:
+     text = text.strip()
+     s, e = text.find("{"), text.rfind("}") + 1
+     if s == -1 or e == 0:
+         raise ValueError(f"No JSON in: {text}")
+     return json.loads(text[s:e])
+
+
+ def _normalize(raw: dict) -> dict:
+     t = raw.get("action_type", "")
+     if t not in VALID_TYPES:
+         return {"action_type": "final_decision", "final_decision": "ESCALATE"}
+     if t == "use_tool":
+         tool = raw.get("tool_name", "")
+         return {"action_type": "use_tool", "tool_name": tool} if tool in VALID_TOOLS \
+             else {"action_type": "final_decision", "final_decision": "ESCALATE"}
+     if t == "extract_signals":
+         sigs = raw.get("signals")
+         return {"action_type": "extract_signals", "signals": _clamp_signals(sigs)} \
+             if sigs else {"action_type": "final_decision", "final_decision": "ESCALATE"}
+     dec = raw.get("final_decision", "ESCALATE")
+     return {"action_type": "final_decision",
+             "final_decision": dec if dec in VALID_DECISIONS else "ESCALATE"}
+
+
+ def _obs_to_prompt(obs: dict) -> str:
+     lines = [
+         f"=== TICKET {obs.get('ticket_id','')} (Step {obs.get('step_number',0)}) ===",
+         f"\nPOST TEXT:\n{obs.get('post_text','')}",
+         f"\nIMAGE:\n{obs.get('image_description','')}",
+     ]
+     for key, label in [
+         ("comments_found", "COMMENTS"), ("user_history_found", "USER HISTORY"),
+         ("entity_status_found", "ENTITY STATUS"), ("policy_found", "POLICY"),
+     ]:
+         if obs.get(key):
+             lines.append(f"\n{label}:\n{obs[key]}")
+     if obs.get("extracted_signals"):
+         lines.append(f"\nYOUR EXTRACTED SIGNALS:\n{json.dumps(obs['extracted_signals'], indent=2)}")
+     if obs.get("validation_result"):
+         v = obs["validation_result"]
+         hint = "⚠️ YES — prefer ESCALATE" if v.get("needs_escalation_hint") else "No"
+         lines.append(
+             f"\n📋 VALIDATION:\n"
+             f"  Adj. Confidence : {v.get('adjusted_confidence')}\n"
+             f"  Issues          : {v.get('consistency_issues')}\n"
+             f"  Escalation Hint : {hint}"
+         )
+     if not obs.get("extracted_signals"):
+         lines.append("\n⚠️ REMINDER: Call extract_signals before final_decision.")
+     lines.append("\nYour next action (strict JSON only):")
+     return "\n".join(lines)
+
+
+ def run_task(client: OpenAI, task_id: str) -> float:
+     for _ in range(30):
+         # CORRECT ✅ — pass task ID directly
+         r = requests.post(
+             f"{ENV_BASE_URL}/reset",
+             json={"episode_id": task_id},  # ← this is the only change
+             timeout=10,
+         )
+         r.raise_for_status()
+         obs = r.json()
+         # Handle both flat (TrustObservation) and wrapped response
+         if isinstance(obs, dict) and "observation" in obs:
+             obs = obs["observation"]
+         if obs.get("ticket_id") == task_id:
+             break
+     else:
+         raise RuntimeError(f"Could not get task {task_id} after 30 resets.")
+
+     print(f"\n{'='*62}\nTask: {task_id} | Starting...\n{'='*62}")
+     messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+     final_reward = 0.0
+
+     for step_num in range(14):
+         messages.append({"role": "user", "content": _obs_to_prompt(obs)})
+         time.sleep(0.5)
+
+         resp = client.chat.completions.create(
+             model=MODEL_NAME, messages=messages, temperature=0.0,
+             response_format={"type": "json_object"},
+         )
+         llm_text = resp.choices[0].message.content or ""
+         messages.append({"role": "assistant", "content": llm_text})
+
+         try:
+             action = _normalize(_parse(llm_text))
+         except Exception as ex:
+             print(f"  [Step {step_num+1}] Parse error: {ex}"); break
+
+         atype = action["action_type"]
+         if atype == "use_tool":
+             print(f"  [Step {step_num+1}] 🔧 use_tool → {action.get('tool_name')}")
+         elif atype == "extract_signals":
+             s = action.get("signals", {})
+             print(f"  [Step {step_num+1}] 🔍 extract_signals → "
+                   f"intent={s.get('intent')} | ctx={s.get('context_type')} | "
+                   f"tox={s.get('toxicity_level')} | conf={s.get('confidence')}")
+         else:
+             print(f"  [Step {step_num+1}] ⚖️ final_decision → {action.get('final_decision')}")
+
+         r2 = requests.post(f"{ENV_BASE_URL}/step", json=action, timeout=30)
+         r2.raise_for_status()
+         result = r2.json()
+
+         # Handle flat (TrustObservation) and wrapped response
+         if "observation" in result:
+             obs = result["observation"]
+             done = result.get("done", obs.get("done", False))
+             final_reward = float(result.get("reward") or obs.get("reward") or 0.0)
+         else:
+             obs = result
+             done = result.get("done", False)
+             final_reward = float(result.get("reward") or 0.0)
+
+         if done:
+             info = obs.get("info") or {}
+             bd = info.get("reward_breakdown", {})
+             pol = info.get("policy_recommendation", {})
+             vr = obs.get("validation_result") or {}
+
+             print(f"\n  ── EPISODE COMPLETE {'─'*42}")
+             print(f"  Decision:          {info.get('final_decision','N/A')}")
+             print(f"  Ground Truth:      {info.get('ground_truth','N/A')}")
+             print(f"  Policy Engine:     {pol.get('recommended','N/A')} "
+                   f"[{pol.get('rule_strength','?')} rule] ({pol.get('reason','?')})")
+             print(f"  Signals Extracted: {'✅' if info.get('signals_extracted') else '❌ SKIPPED'}")
+             print(f"  Tools Used:        {info.get('tools_used', [])}")
+             print(f"  Required Tools:    {info.get('required_tools', [])}")
+             print(f"  Adj. Confidence:   {vr.get('adjusted_confidence','N/A')}")
+             print(f"  Issues:            {vr.get('consistency_issues',[])}")
+             print(f"  Ambiguity / Risk:  {info.get('ambiguity_level','?')} / {info.get('risk_level','?')}")
+             if bd:
+                 print(f"\n  ── Reward Breakdown {'─'*42}")
+                 print(f"  1. Base Decision Score:    {bd.get('base_score',0):+.4f}")
+                 print(f"  2. Policy Alignment:       {bd.get('policy_alignment',0):+.4f}")
+                 print(f"  3. Signal Accuracy Bonus:  {bd.get('signal_accuracy_bonus',0):+.4f}")
+                 print(f"  4. Escalation Adjustment:  {bd.get('escalation_adj',0):+.4f}")
+                 print(f"  5. Signal Process Bonus:   {bd.get('signal_bonus',0):+.4f}")
+                 print(f"     Tool Cost:             -{bd.get('tool_cost',0):.4f}")
+                 print(f"     Tool Miss Penalty:     -{bd.get('tool_miss_penalty',0):.4f}")
+                 print(f"     Validation Penalty:    -{bd.get('validation_penalty',0):.4f}")
+                 print(f"     Risk Penalty:          -{bd.get('risk_penalty',0):.4f}")
+                 print(f"     Confidence Discipline: -{bd.get('confidence_penalty',0):.4f}")
+                 print(f"  {'─'*60}")
+                 print(f"  FINAL REWARD:              {bd.get('final_reward',0):.4f}")
+             print(f"\n  SCORE: {final_reward:.4f}")
+             break
+
+     return final_reward
+
+
+ def main() -> None:
+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+     print("=" * 62)
+     print("Trust & Safety RL Environment — Baseline Evaluation")
+     print("=" * 62)
+     print(f"Model      : {MODEL_NAME}")
+     print(f"LLM API    : {API_BASE_URL}")
+     print(f"Env Server : {ENV_BASE_URL}")
+     print("Reward     : Accuracy · Policy · Signals · Escalation")
+     print("             Tools · Consistency · Risk · Confidence")
+
+     tasks = [
+         ("T-001", "Easy — Phishing Spam", "low"),
+         ("T-002", "Medium — Gaming Banter", "low"),
+         ("T-003", "Hard — Political Satire", "high"),
+     ]
+     scores = []
+     for tid, desc, risk in tasks:
+         print(f"\n\n>>> {tid} | {desc} | Risk: {risk}")
+         scores.append((tid, desc, run_task(client, tid)))
+
+     print("\n" + "=" * 62)
+     print("FINAL BASELINE RESULTS")
+     print("=" * 62)
+     total = 0.0
+     for tid, desc, s in scores:
+         print(f"  {tid} | {desc:<32} | {s:.4f} {'✅ PASS' if s >= 0.6 else '❌ FAIL'}")
+         total += s
+     vals = [s for _, _, s in scores]
+     print(f"\n  Average : {total/len(scores):.4f}")
+     print(f"  Min     : {min(vals):.4f} | Max : {max(vals):.4f}")
+     print("=" * 62)
+
+
+ if __name__ == "__main__":
+     main()
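
The two defensive helpers above (`_parse` and `_clamp_signals`) reduce to a small, dependency-free pattern: find the outermost `{...}` in raw LLM text, then coerce numeric fields into [0, 1] with a safe fallback. A standalone sketch (simplified re-implementation, not the exact functions):

```python
import json

def clamp01(v, default=0.5):
    """Coerce v to a float in [0, 1]; fall back to default on bad input."""
    try:
        return max(0.0, min(1.0, float(v)))
    except (TypeError, ValueError):
        return default

def extract_json(text):
    """Pull the first-to-last brace span out of raw LLM output, as _parse does."""
    s, e = text.find("{"), text.rfind("}") + 1
    if s == -1 or e == 0:
        raise ValueError("no JSON object found")
    return json.loads(text[s:e])

# An LLM reply with chatter around the JSON and an out-of-range float:
reply = 'Sure! {"action_type": "extract_signals", "signals": {"toxicity_level": 1.7}}'
action = extract_json(reply)
tox = clamp01(action["signals"]["toxicity_level"])  # 1.7 clamps to 1.0
```

The find/rfind slice tolerates leading chatter and trailing punctuation but still raises on replies with no JSON at all, which is why `run_task` wraps the parse in a try/except.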
models.py ADDED
@@ -0,0 +1,63 @@
+ from __future__ import annotations
+ from typing import Optional, List, Dict, Any
+ from pydantic import BaseModel, Field, field_validator
+
+
+ class ContentSignals(BaseModel):
+     target: str = "none"
+     is_protected_class: bool = False
+     toxicity_level: float = 0.5
+     is_direct_attack: bool = False
+     context_type: str = "statement"
+     intent: str = "ambiguous"
+     confidence: float = 0.5
+     abusive_language_present: bool = False
+     content_flags: List[str] = Field(default_factory=list)
+
+     @field_validator("toxicity_level", "confidence")
+     @classmethod
+     def clamp_0_1(cls, v: float) -> float:
+         return max(0.0, min(1.0, float(v)))
+
+     model_config = {"extra": "ignore"}
+
+
+ class TrustAction(BaseModel):
+     action_type: str = ""
+     tool_name: Optional[str] = None
+     signals: Optional[ContentSignals] = None
+     final_decision: Optional[str] = None
+
+     model_config = {"extra": "ignore"}
+
+
+ class TrustObservation(BaseModel):
+     ticket_id: str = ""
+     post_text: str = ""
+     image_description: str = ""
+     comments_found: Optional[str] = None
+     user_history_found: Optional[str] = None
+     entity_status_found: Optional[str] = None
+     policy_found: Optional[str] = None
+     extracted_signals: Optional[Dict[str, Any]] = None
+     validation_result: Optional[Dict[str, Any]] = None
+     step_number: int = 0
+     info: Optional[Dict[str, Any]] = None
+     done: bool = False
+     reward: Optional[float] = None
+
+     model_config = {"extra": "ignore"}
+
+
+ class TrustState(BaseModel):
+     episode_id: Optional[str] = None
+     step_count: int = 0
+     current_task_id: Optional[str] = None
+     difficulty: Optional[str] = None
+     ambiguity_level: Optional[str] = None
+     risk_level: Optional[str] = None
+     tools_used: List[str] = Field(default_factory=list)
+     signals_extracted: bool = False
+     is_done: bool = False
+
+     model_config = {"extra": "ignore"}
openenv.yaml ADDED
@@ -0,0 +1,64 @@
+ spec_version: 1
+ name: trust-safety-env
+ type: environment
+ runtime: python
+ app: app:app
+ port: 8000
+
+ description: >
+   Risk-aware content moderation RL environment for Trust & Safety decision-making.
+   Agents investigate content, extract structured signals, and make policy-aligned
+   decisions under uncertainty across hate speech, political sensitivity, and
+   cultural nuance. Models real-world moderation at scale (Meta-style).
+
+ author: Jerome Richard D
+ version: "1.0.0"
+ license: MIT
+
+ action_space:
+   type: TrustAction
+   description: "use_tool | extract_signals | final_decision"
+
+ observation_space:
+   type: TrustObservation
+   description: "Content ticket with progressive context revelation"
+
+ tasks:
+   - id: T-001
+     name: Phishing Spam Detection
+     difficulty: easy
+     description: Identify and remove clear phishing / impersonation content
+
+   - id: T-002
+     name: Gaming Banter Classification
+     difficulty: medium
+     description: Distinguish competitive gaming banter from genuine harassment
+
+   - id: T-003
+     name: Political Satire Review
+     difficulty: hard
+     description: Handle editorial satire of public figures with high-risk sensitivity
+
+   - id: T-004
+     name: Hate Speech Disguised as Education
+     difficulty: medium
+     description: Detect hate speech hidden behind pseudoscientific or educational framing
+
+   - id: T-005
+     name: Political News with Protest Violence
+     difficulty: hard
+     description: Protect legitimate journalism on sensitive political events without over-censorship
+
+   - id: T-006
+     name: Religious Expression False Flag
+     difficulty: hard
+     description: Distinguish protected religious expression from an automated false-positive flag
+
+ tags:
+   - content-moderation
+   - trust-safety
+   - hate-speech
+   - political-sensitivity
+   - cultural-nuance
+   - real-world
+   - openenv
pyproject.toml ADDED
@@ -0,0 +1,33 @@
+ [build-system]
+ requires = ["setuptools>=68.0", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "trust-safety-env"
+ version = "1.0.0"
+ description = "Risk-aware Trust & Safety content moderation RL environment — OpenEnv compatible"
+ readme = "README.md"
+ requires-python = ">=3.11"
+ dependencies = [
+     "openenv-core>=0.2.0",
+     "fastapi>=0.110.0",
+     "uvicorn[standard]>=0.29.0",
+     "pydantic>=2.6.0",
+     "openai>=1.30.0",
+     "requests>=2.31.0",
+     "python-dotenv>=1.0.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = ["pytest>=8.0"]
+
+ [tool.setuptools.packages.find]
+ where = ["."]
+ include = ["*"]
+
+ [tool.openenv]
+ name = "trust-safety-env"
+ environment_class = "your_environment.TrustSafetyEnvironment"
+ action_model = "models.TrustAction"
+ observation_model = "models.TrustObservation"
+ state_model = "models.TrustState"
readme.md ADDED
@@ -0,0 +1,24 @@
+ ---
+ title: Trust Safety RL Environment
+ emoji: 🛡️
+ colorFrom: blue
+ colorTo: indigo
+ sdk: docker
+ app_port: 7860
+ tags:
+   - openenv
+   - reinforcement-learning
+   - content-moderation
+   - trust-safety
+ ---
+
+ # Trust & Safety RL Environment
+
+ 3-layer risk-aware content moderation RL environment built on OpenEnv.
+
+ ## Endpoints
+ - `POST /reset` — start a new episode
+ - `POST /step` — take an action
+ - `GET /state` — current episode state
+ - `GET /health` — health check
+ - `GET /docs` — interactive API docs
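
For reference, the request bodies that `POST /step` accepts can be sketched as plain JSON, one per `action_type` (illustrative values only; in practice each would be POSTed with something like `requests.post(f"{base_url}/step", json=body)`, with the Space assumed to serve on port 7860):

```python
import json

# Example /step bodies for each of the three action types (no HTTP call made here).
use_tool = {"action_type": "use_tool", "tool_name": "view_policy"}
extract = {
    "action_type": "extract_signals",
    "signals": {
        "target": "none", "is_protected_class": False,
        "toxicity_level": 0.1, "is_direct_attack": False,
        "context_type": "news", "intent": "descriptive",
        "confidence": 0.9, "abusive_language_present": False,
        "content_flags": [],
    },
}
decide = {"action_type": "final_decision", "final_decision": "ALLOW"}

# Serialize in pipeline order: investigate → extract → decide.
bodies = [json.dumps(b) for b in (use_tool, extract, decide)]
```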
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ fastapi>=0.115.0
+ uvicorn[standard]>=0.30.0
+ pydantic>=2.0.0
+ requests>=2.31.0
+ openenv-core>=0.2.2
+ python-dotenv>=1.0.0
simpley ADDED
@@ -0,0 +1,1716 @@
1
+ app.py:
2
+ from __future__ import annotations
3
+
4
+ import json
5
+ from typing import Any, Dict, Optional
6
+
7
+ from fastapi import FastAPI, HTTPException
8
+ from fastapi.middleware.cors import CORSMiddleware
9
+ from fastapi.responses import JSONResponse
10
+ from pydantic import BaseModel
11
+
12
+ from models import TrustAction, TrustObservation, TrustState, ContentSignals
13
+ from your_environment import TrustSafetyEnvironment
14
+
15
+ # ── Force manual FastAPI (openenv_core create_app causes 422 on /step) ────────
16
+ print("[app] Using manual FastAPI ✅")
17
+
18
+ _env = TrustSafetyEnvironment(seed=42)
19
+
20
+ app = FastAPI(
21
+ title="Trust & Safety RL Environment",
22
+ description="Risk-aware content moderation environment for agent training.",
23
+ version="1.0.0",
24
+ )
25
+
26
+ app.add_middleware(
27
+ CORSMiddleware,
28
+ allow_origins=["*"],
29
+ allow_methods=["*"],
30
+ allow_headers=["*"],
31
+ )
32
+
33
+
34
+ # ── Serializers ───────────────────────────────────────────────────────────────
35
+
36
+ def _obs_to_dict(obs: TrustObservation) -> Dict[str, Any]:
37
+ return {
38
+ "ticket_id": obs.ticket_id,
39
+ "post_text": obs.post_text,
40
+ "image_description": obs.image_description,
41
+ "comments_found": obs.comments_found,
42
+ "user_history_found": obs.user_history_found,
43
+ "entity_status_found": obs.entity_status_found,
44
+ "policy_found": obs.policy_found,
45
+ "extracted_signals": obs.extracted_signals,
46
+ "validation_result": obs.validation_result,
47
+ "step_number": obs.step_number,
48
+ "info": obs.info,
49
+ "done": obs.done,
50
+ "reward": obs.reward,
51
+ }
52
+
53
+
54
+ def _state_to_dict(s: TrustState) -> Dict[str, Any]:
55
+ return {
56
+         "episode_id": s.episode_id,
+         "step_count": s.step_count,
+         "current_task_id": s.current_task_id,
+         "difficulty": s.difficulty,
+         "ambiguity_level": s.ambiguity_level,
+         "risk_level": s.risk_level,
+         "tools_used": s.tools_used,
+         "signals_extracted": s.signals_extracted,
+         "is_done": s.is_done,
+     }
+
+
+ # ── Request bodies ─────────────────────────────────────────────────────────────
+
+ class ResetRequest(BaseModel):
+     seed: Any = None
+     episode_id: Any = None
+
+     model_config = {"extra": "ignore"}
+
+
+ class ActionRequest(BaseModel):
+     action_type: str = ""
+     tool_name: Optional[str] = None
+     signals: Optional[Dict[str, Any]] = None  # raw dict — validated below
+     final_decision: Optional[str] = None
+
+     model_config = {"extra": "ignore"}  # ← ignore unknown keys from LLM
+
+
+ # ── Helpers ────────────────────────────────────────────────────────────────────
+
+ def _parse_signals(raw: Dict[str, Any]) -> ContentSignals:
+     """Defensively normalise LLM signal output before Pydantic validation."""
+     # Clamp floats into [0.0, 1.0]
+     raw["toxicity_level"] = max(0.0, min(1.0, float(raw.get("toxicity_level", 0.5))))
+     raw["confidence"] = max(0.0, min(1.0, float(raw.get("confidence", 0.5))))
+
+     # content_flags must be a list of strings
+     flags = raw.get("content_flags", [])
+     if not isinstance(flags, list):
+         flags = [flags] if isinstance(flags, str) else []
+     raw["content_flags"] = [str(f) for f in flags]
+
+     # boolean coercion
+     raw["is_protected_class"] = bool(raw.get("is_protected_class", False))
+     raw["is_direct_attack"] = bool(raw.get("is_direct_attack", False))
+     raw["abusive_language_present"] = bool(raw.get("abusive_language_present", False))
+
+     # string fields — fall back to sensible defaults
+     raw.setdefault("target", "none")
+     raw.setdefault("intent", "ambiguous")
+     raw.setdefault("context_type", "statement")
+
+     return ContentSignals(**raw)
+
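For reference, the coercion rules in `_parse_signals` can be exercised standalone. A minimal sketch that applies the same normalisation to a plain dict (the `coerce_signals` name and the dict return type are illustrative, not part of app.py):

```python
def coerce_signals(raw: dict) -> dict:
    """Sketch of the defensive normalisation: clamp floats, coerce types, default strings."""
    out = dict(raw)  # work on a copy so the caller's dict is untouched
    for key in ("toxicity_level", "confidence"):
        try:
            out[key] = max(0.0, min(1.0, float(out.get(key, 0.5))))
        except (TypeError, ValueError):
            out[key] = 0.5  # unparseable value falls back to the neutral default
    flags = out.get("content_flags", [])
    if not isinstance(flags, list):
        flags = [flags] if isinstance(flags, str) else []
    out["content_flags"] = [str(f) for f in flags]
    for key in ("is_protected_class", "is_direct_attack", "abusive_language_present"):
        out[key] = bool(out.get(key, False))
    out.setdefault("target", "none")
    out.setdefault("intent", "ambiguous")
    out.setdefault("context_type", "statement")
    return out
```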
+
+ # ── Routes ─────────────────────────────────────────────────────────────────────
+
+ @app.get("/health")
+ async def health():
+     return {"status": "ok", "environment": "trust-safety-env", "version": "1.0.0"}
+
+
+ @app.get("/")
+ async def root():
+     return {"status": "ok", "docs": "/docs"}
+
+
+ @app.post("/reset")
+ async def reset(body: ResetRequest = ResetRequest()):
+     obs = _env.reset(seed=body.seed, episode_id=body.episode_id)
+     return JSONResponse(_obs_to_dict(obs))
+
+
+ @app.post("/step")
+ async def step(body: ActionRequest):
+     # Parse + validate signals defensively
+     signals: Optional[ContentSignals] = None
+     if body.signals:
+         try:
+             signals = _parse_signals(dict(body.signals))  # copy so we don't mutate
+         except Exception as e:
+             raise HTTPException(status_code=400, detail=f"Invalid signals payload: {e}")
+
+     action = TrustAction(
+         action_type=body.action_type,
+         tool_name=body.tool_name,
+         signals=signals,
+         final_decision=body.final_decision,
+     )
+
+     try:
+         obs = _env.step(action)
+     except (RuntimeError, ValueError) as e:
+         raise HTTPException(status_code=400, detail=str(e))
+
+     return JSONResponse(_obs_to_dict(obs))
+
+
+ @app.get("/state")
+ async def state():
+     return JSONResponse(_state_to_dict(_env.state))
+
+
+ client.py:
+ from __future__ import annotations
+ from typing import Any
+ from openenv.core.http_env_client import HTTPEnvClient
+ from openenv.core.types import StepResult
+ from models import TrustAction, TrustObservation, TrustState, ContentSignals
+
+
+ class TrustSafetyEnv(HTTPEnvClient[TrustAction, TrustObservation]):
+     """
+     Typed HTTP client for the Trust & Safety RL Environment.
+
+     Usage:
+         client = TrustSafetyEnv(base_url="http://localhost:8000")
+         result = client.reset()
+         result = client.step(TrustAction(action_type="final_decision",
+                                          final_decision="ALLOW"))
+         state = client.state()
+         client.close()
+     """
+
+     def _step_payload(self, action: TrustAction) -> dict:
+         payload: dict = {"action_type": action.action_type}
+         if action.tool_name is not None:
+             payload["tool_name"] = action.tool_name
+         if action.signals is not None:
+             s = action.signals
+             payload["signals"] = {
+                 "target": s.target,
+                 "is_protected_class": s.is_protected_class,
+                 "toxicity_level": s.toxicity_level,
+                 "is_direct_attack": s.is_direct_attack,
+                 "context_type": s.context_type,
+                 "intent": s.intent,
+                 "confidence": s.confidence,
+                 "abusive_language_present": s.abusive_language_present,
+                 "content_flags": s.content_flags,
+             }
+         if action.final_decision is not None:
+             payload["final_decision"] = action.final_decision
+         return payload
+
+     def _parse_result(self, payload: dict) -> StepResult[TrustObservation]:
+         obs_data = payload.get("observation", payload)  # handle flat or nested
+         signals_raw = obs_data.get("extracted_signals")
+         signals = None
+         if isinstance(signals_raw, dict):
+             try:
+                 signals = ContentSignals(**signals_raw)
+             except Exception:
+                 signals = None
+
+         obs = TrustObservation(
+             ticket_id=obs_data.get("ticket_id", ""),
+             post_text=obs_data.get("post_text", ""),
+             image_description=obs_data.get("image_description", ""),
+             comments_found=obs_data.get("comments_found"),
+             user_history_found=obs_data.get("user_history_found"),
+             entity_status_found=obs_data.get("entity_status_found"),
+             policy_found=obs_data.get("policy_found"),
+             extracted_signals=obs_data.get("extracted_signals"),
+             validation_result=obs_data.get("validation_result"),
+             step_number=obs_data.get("step_number", 0),
+             info=obs_data.get("info"),
+             done=payload.get("done", obs_data.get("done", False)),
+             reward=payload.get("reward", obs_data.get("reward")),
+         )
+         return StepResult(
+             observation=obs,
+             reward=payload.get("reward", obs_data.get("reward")),
+             done=payload.get("done", obs_data.get("done", False)),
+         )
+
+     def _parse_state(self, payload: dict) -> TrustState:
+         return TrustState(
+             episode_id=payload.get("episode_id"),
+             step_count=payload.get("step_count", 0),
+             current_task_id=payload.get("current_task_id"),
+             difficulty=payload.get("difficulty"),
+             ambiguity_level=payload.get("ambiguity_level"),
+             risk_level=payload.get("risk_level"),
+             tools_used=payload.get("tools_used", []),
+             signals_extracted=payload.get("signals_extracted", False),
+             is_done=payload.get("is_done", False),
+         )
+
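Both `_parse_result` above and the evaluation script accept either a flat observation or a wrapped `{"observation": ..., "reward": ..., "done": ...}` payload. The unwrapping rule can be sketched standalone (the `unwrap_step` helper is hypothetical, not part of client.py):

```python
def unwrap_step(payload: dict) -> tuple[dict, bool, float]:
    """Return (observation, done, reward) from a flat or wrapped /step response."""
    obs = payload.get("observation", payload)  # wrapped → inner dict, flat → payload itself
    done = bool(payload.get("done", obs.get("done", False)))
    reward = float(payload.get("reward") or obs.get("reward") or 0.0)
    return obs, done, reward
```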
+
+ Dockerfile:
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ RUN apt-get update \
+     && apt-get install -y --no-install-recommends curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ COPY pyproject.toml .
+ RUN pip install --no-cache-dir -e "."
+
+ COPY . .
+
+ ENV PYTHONPATH="/app:$PYTHONPATH"
+
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+     CMD curl -f http://localhost:8000/health || exit 1
+
+ EXPOSE 8000
+
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
+
+
+
+ inference.py:
+ """
+ inference.py — Trust & Safety RL Environment Evaluation
+ ========================================================
+ Env vars (HF_TOKEN is mandatory; the rest have defaults):
+     API_BASE_URL   LLM endpoint (e.g. https://router.huggingface.co/v1)
+     MODEL_NAME     Model ID (e.g. meta-llama/Llama-3.1-8B-Instruct)
+     HF_TOKEN       API key
+     ENV_BASE_URL   Environment server URL (default: http://localhost:8000)
+ """
+
+ import os, json, time, requests
+ from openai import OpenAI
+
+ API_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1")
+ API_KEY      = os.environ.get("HF_TOKEN") or os.environ.get("API_KEY", "")
+ MODEL_NAME   = os.environ.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct")
+ ENV_BASE_URL = os.environ.get("ENV_BASE_URL", "http://localhost:8000")
+
+ if not API_KEY:
+     raise EnvironmentError("Set HF_TOKEN (your Hugging Face / API key).")
+ if not MODEL_NAME:
+     raise EnvironmentError("Set MODEL_NAME.")
+
+ SYSTEM_PROMPT = """You are a Trust & Safety analyst. Follow this EXACT 3-stage pipeline.
+ Skipping any stage results in a score penalty.
+
+ ════════════════════════════════════════════
+ STAGE 1 — Investigate (optional)
+ ════════════════════════════════════════════
+ Use tools only when context is needed. For OBVIOUS violations (spam, phishing), skip.
+
+ {"action_type": "use_tool", "tool_name": "read_comments"}
+ {"action_type": "use_tool", "tool_name": "check_user_history"}
+ {"action_type": "use_tool", "tool_name": "check_entity_status"}
+ {"action_type": "use_tool", "tool_name": "view_policy"}
+
+ ════════════════════════════════════════════
+ STAGE 2 — Extract signals (MANDATORY)
+ ════════════════════════════════════════════
+ You MUST call extract_signals before final_decision. No exceptions.
+ You are a FEATURE EXTRACTOR — output what the content shows, do NOT decide.
+
+ {"action_type": "extract_signals", "signals": {
+     "target": "individual|group|religion|race|gender|political|none",
+     "is_protected_class": true|false,
+     "toxicity_level": 0.0-1.0,
+     "is_direct_attack": true|false,
+     "context_type": "statement|satire|news|meme|personal_attack|protest|education|spam",
+     "intent": "harmful|descriptive|ambiguous|satirical|educational",
+     "confidence": 0.0-1.0,
+     "abusive_language_present": true|false,
+     "content_flags": ["spam","phishing","slur","threat","misinformation",
+                       "political_content","public_figure","competitive_language"]
+ }}
+
+ After submitting signals you receive VALIDATION. If needs_escalation_hint is true → prefer ESCALATE.
+
+ ════════════════════════════════════════════
+ STAGE 3 — Final decision (MANDATORY)
+ ════════════════════════════════════════════
+ {"action_type": "final_decision", "final_decision": "ALLOW|REMOVE|ALLOW_WITH_WARNING|ESCALATE"}
+
+ DECISION RULES (apply in order):
+ 1. Phishing / spam / impersonation            → REMOVE
+ 2. Explicit hate speech, direct threats       → REMOVE
+ 3. Political satire of a PUBLIC FIGURE        → ALLOW_WITH_WARNING
+ 4. News / journalism / education              → ALLOW_WITH_WARNING
+ 5. Gaming / sports banter (confirmed)         → ALLOW
+ 6. Adjusted confidence after validation < 0.6 → ESCALATE
+ 7. Genuinely ambiguous, no clear rule         → ESCALATE
+ 8. Safe, no signals triggered                 → ALLOW
+
+ NEVER jump from Stage 1 directly to final_decision.
+ ALWAYS: [optional tools] → extract_signals → final_decision
+ Respond in strict JSON only. No markdown."""
+
+ SIGNAL_DEFAULTS = {
+     "target": "none", "is_protected_class": False,
+     "toxicity_level": 0.5, "is_direct_attack": False,
+     "context_type": "statement", "intent": "ambiguous",
+     "confidence": 0.5, "abusive_language_present": False,
+     "content_flags": [],
+ }
+ VALID_TOOLS     = {"read_comments", "check_user_history", "check_entity_status", "view_policy"}
+ VALID_DECISIONS = {"ALLOW", "REMOVE", "ALLOW_WITH_WARNING", "ESCALATE"}
+ VALID_TYPES     = {"use_tool", "extract_signals", "final_decision"}
+
+
+ def _clamp_signals(raw: dict) -> dict:
+     result = SIGNAL_DEFAULTS.copy()
+     for k in SIGNAL_DEFAULTS:
+         if k in raw:
+             result[k] = raw[k]
+     try:
+         result["toxicity_level"] = max(0.0, min(1.0, float(result["toxicity_level"])))
+         result["confidence"] = max(0.0, min(1.0, float(result["confidence"])))
+     except (TypeError, ValueError):
+         result["toxicity_level"] = 0.5
+         result["confidence"] = 0.5
+     if not isinstance(result["content_flags"], list):
+         result["content_flags"] = []
+     return result
+
+
+ def _parse(text: str) -> dict:
+     text = text.strip()
+     s, e = text.find("{"), text.rfind("}") + 1
+     if s == -1 or e == 0:
+         raise ValueError(f"No JSON in: {text}")
+     return json.loads(text[s:e])
+
+
+ def _normalize(raw: dict) -> dict:
+     t = raw.get("action_type", "")
+     if t not in VALID_TYPES:
+         return {"action_type": "final_decision", "final_decision": "ESCALATE"}
+     if t == "use_tool":
+         tool = raw.get("tool_name", "")
+         return {"action_type": "use_tool", "tool_name": tool} if tool in VALID_TOOLS \
+             else {"action_type": "final_decision", "final_decision": "ESCALATE"}
+     if t == "extract_signals":
+         sigs = raw.get("signals")
+         return {"action_type": "extract_signals", "signals": _clamp_signals(sigs)} \
+             if sigs else {"action_type": "final_decision", "final_decision": "ESCALATE"}
+     dec = raw.get("final_decision", "ESCALATE")
+     return {"action_type": "final_decision",
+             "final_decision": dec if dec in VALID_DECISIONS else "ESCALATE"}
+
+
+ def _obs_to_prompt(obs: dict) -> str:
+     lines = [
+         f"=== TICKET {obs.get('ticket_id','')} (Step {obs.get('step_number',0)}) ===",
+         f"\nPOST TEXT:\n{obs.get('post_text','')}",
+         f"\nIMAGE:\n{obs.get('image_description','')}",
+     ]
+     for key, label in [
+         ("comments_found", "COMMENTS"), ("user_history_found", "USER HISTORY"),
+         ("entity_status_found", "ENTITY STATUS"), ("policy_found", "POLICY"),
+     ]:
+         if obs.get(key):
+             lines.append(f"\n{label}:\n{obs[key]}")
+     if obs.get("extracted_signals"):
+         lines.append(f"\nYOUR EXTRACTED SIGNALS:\n{json.dumps(obs['extracted_signals'], indent=2)}")
+     if obs.get("validation_result"):
+         v = obs["validation_result"]
+         hint = "⚠️ YES — prefer ESCALATE" if v.get("needs_escalation_hint") else "No"
+         lines.append(
+             f"\n📋 VALIDATION:\n"
+             f"  Adj. Confidence : {v.get('adjusted_confidence')}\n"
+             f"  Issues          : {v.get('consistency_issues')}\n"
+             f"  Escalation Hint : {hint}"
+         )
+     if not obs.get("extracted_signals"):
+         lines.append("\n⚠️ REMINDER: Call extract_signals before final_decision.")
+     lines.append("\nYour next action (strict JSON only):")
+     return "\n".join(lines)
+
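The `_parse` then `_normalize` chain above guarantees that every model reply becomes a legal action, degrading to a safe ESCALATE on anything malformed. A simplified standalone sketch of that fallback chain (not the module itself; the tool and signal branches are omitted):

```python
import json

VALID_DECISIONS = {"ALLOW", "REMOVE", "ALLOW_WITH_WARNING", "ESCALATE"}

def parse_json_blob(text: str) -> dict:
    """Extract the outermost {...} span from a possibly noisy model reply."""
    s, e = text.find("{"), text.rfind("}") + 1
    if s == -1 or e == 0:
        raise ValueError("no JSON object found")
    return json.loads(text[s:e])

def safe_action(text: str) -> dict:
    """Parse a reply; anything malformed degrades to a safe ESCALATE decision."""
    try:
        raw = parse_json_blob(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return {"action_type": "final_decision", "final_decision": "ESCALATE"}
    dec = raw.get("final_decision")
    if raw.get("action_type") == "final_decision" and dec in VALID_DECISIONS:
        return {"action_type": "final_decision", "final_decision": dec}
    return {"action_type": "final_decision", "final_decision": "ESCALATE"}
```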
+ def run_task(client: OpenAI, task_id: str) -> float:
+     for _ in range(30):
+         # Request the specific task by passing its ID as the episode_id
+         r = requests.post(
+             f"{ENV_BASE_URL}/reset",
+             json={"episode_id": task_id},
+             timeout=10,
+         )
+         r.raise_for_status()
+         obs = r.json()
+         # Handle both flat (TrustObservation) and wrapped response
+         if isinstance(obs, dict) and "observation" in obs:
+             obs = obs["observation"]
+         if obs.get("ticket_id") == task_id:
+             break
+     else:
+         raise RuntimeError(f"Could not get task {task_id} after 30 resets.")
+
+     print(f"\n{'='*62}\nTask: {task_id} | Starting...\n{'='*62}")
+     messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+     final_reward = 0.0
+
+     for step_num in range(14):
+         messages.append({"role": "user", "content": _obs_to_prompt(obs)})
+         time.sleep(0.5)
+
+         resp = client.chat.completions.create(
+             model=MODEL_NAME, messages=messages, temperature=0.0,
+             response_format={"type": "json_object"},
+         )
+         llm_text = resp.choices[0].message.content or ""
+         messages.append({"role": "assistant", "content": llm_text})
+
+         try:
+             action = _normalize(_parse(llm_text))
+         except Exception as ex:
+             print(f"  [Step {step_num+1}] Parse error: {ex}")
+             break
+
+         atype = action["action_type"]
+         if atype == "use_tool":
+             print(f"  [Step {step_num+1}] 🔧 use_tool → {action.get('tool_name')}")
+         elif atype == "extract_signals":
+             s = action.get("signals", {})
+             print(f"  [Step {step_num+1}] 🔍 extract_signals → "
+                   f"intent={s.get('intent')} | ctx={s.get('context_type')} | "
+                   f"tox={s.get('toxicity_level')} | conf={s.get('confidence')}")
+         else:
+             print(f"  [Step {step_num+1}] ⚖️ final_decision → {action.get('final_decision')}")
+
+         r2 = requests.post(f"{ENV_BASE_URL}/step", json=action, timeout=30)
+         r2.raise_for_status()
+         result = r2.json()
+
+         # Handle flat (TrustObservation) and wrapped response
+         if "observation" in result:
+             obs = result["observation"]
+             done = result.get("done", obs.get("done", False))
+             final_reward = float(result.get("reward") or obs.get("reward") or 0.0)
+         else:
+             obs = result
+             done = result.get("done", False)
+             final_reward = float(result.get("reward") or 0.0)
+
+         if done:
+             info = obs.get("info") or {}
+             bd = info.get("reward_breakdown", {})
+             pol = info.get("policy_recommendation", {})
+             vr = obs.get("validation_result") or {}
+
+             print(f"\n  ── EPISODE COMPLETE {'─'*42}")
+             print(f"  Decision:          {info.get('final_decision','N/A')}")
+             print(f"  Ground Truth:      {info.get('ground_truth','N/A')}")
+             print(f"  Policy Engine:     {pol.get('recommended','N/A')} "
+                   f"[{pol.get('rule_strength','?')} rule] ({pol.get('reason','?')})")
+             print(f"  Signals Extracted: {'✅' if info.get('signals_extracted') else '❌ SKIPPED'}")
+             print(f"  Tools Used:        {info.get('tools_used', [])}")
+             print(f"  Required Tools:    {info.get('required_tools', [])}")
+             print(f"  Adj. Confidence:   {vr.get('adjusted_confidence','N/A')}")
+             print(f"  Issues:            {vr.get('consistency_issues',[])}")
+             print(f"  Ambiguity / Risk:  {info.get('ambiguity_level','?')} / {info.get('risk_level','?')}")
+             if bd:
+                 print(f"\n  ── Reward Breakdown {'─'*42}")
+                 print(f"  1. Base Decision Score:    {bd.get('base_score',0):+.4f}")
+                 print(f"  2. Policy Alignment:       {bd.get('policy_alignment',0):+.4f}")
+                 print(f"  3. Signal Accuracy Bonus:  {bd.get('signal_accuracy_bonus',0):+.4f}")
+                 print(f"  4. Escalation Adjustment:  {bd.get('escalation_adj',0):+.4f}")
+                 print(f"  5. Signal Process Bonus:   {bd.get('signal_bonus',0):+.4f}")
+                 print(f"     Tool Cost:              -{bd.get('tool_cost',0):.4f}")
+                 print(f"     Tool Miss Penalty:      -{bd.get('tool_miss_penalty',0):.4f}")
+                 print(f"     Validation Penalty:     -{bd.get('validation_penalty',0):.4f}")
+                 print(f"     Risk Penalty:           -{bd.get('risk_penalty',0):.4f}")
+                 print(f"     Confidence Discipline:  -{bd.get('confidence_penalty',0):.4f}")
+                 print(f"  {'─'*60}")
+                 print(f"  FINAL REWARD:              {bd.get('final_reward',0):.4f}")
+             print(f"\n  SCORE: {final_reward:.4f}")
+             break
+
+     return final_reward
+
+
+ def main() -> None:
+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+     print("=" * 62)
+     print("Trust & Safety RL Environment — Baseline Evaluation")
+     print("=" * 62)
+     print(f"Model      : {MODEL_NAME}")
+     print(f"LLM API    : {API_BASE_URL}")
+     print(f"Env Server : {ENV_BASE_URL}")
+     print("Reward     : Accuracy · Policy · Signals · Escalation")
+     print("             Tools · Consistency · Risk · Confidence")
+
+     tasks = [
+         ("T-001", "Easy — Phishing Spam", "low"),
+         ("T-002", "Medium — Gaming Banter", "low"),
+         ("T-003", "Hard — Political Satire", "high"),
+     ]
+     scores = []
+     for tid, desc, risk in tasks:
+         print(f"\n\n>>> {tid} | {desc} | Risk: {risk}")
+         scores.append((tid, desc, run_task(client, tid)))
+
+     print("\n" + "=" * 62)
+     print("FINAL BASELINE RESULTS")
+     print("=" * 62)
+     total = 0.0
+     for tid, desc, s in scores:
+         print(f"  {tid} | {desc:<32} | {s:.4f} {'✅ PASS' if s >= 0.6 else '❌ FAIL'}")
+         total += s
+     vals = [s for _, _, s in scores]
+     print(f"\n  Average : {total/len(scores):.4f}")
+     print(f"  Min : {min(vals):.4f} | Max : {max(vals):.4f}")
+     print("=" * 62)
+
+
+ if __name__ == "__main__":
+     main()
+
+
+ openenv.yaml:
+ spec_version: 1
+ name: trust-safety-env
+ type: environment
+ runtime: python
+ app: app:app
+ port: 8000
+
+ description: >
+   Risk-aware content moderation RL environment for Trust & Safety decision-making.
+   Agents investigate content, extract structured signals, and make policy-aligned
+   decisions under uncertainty across hate speech, political sensitivity, and
+   cultural nuance. Models real-world moderation at scale (Meta-style).
+
+ author: Jerome Richard D
+ version: "1.0.0"
+ license: MIT
+
+ action_space:
+   type: TrustAction
+   description: "use_tool | extract_signals | final_decision"
+
+ observation_space:
+   type: TrustObservation
+   description: "Content ticket with progressive context revelation"
+
+ tasks:
+   - id: T-001
+     name: Phishing Spam Detection
+     difficulty: easy
+     description: Identify and remove clear phishing / impersonation content
+
+   - id: T-002
+     name: Gaming Banter Classification
+     difficulty: medium
+     description: Distinguish competitive gaming banter from genuine harassment
+
+   - id: T-003
+     name: Political Satire Review
+     difficulty: hard
+     description: Handle editorial satire of public figures with high-risk sensitivity
+
+   - id: T-004
+     name: Hate Speech Disguised as Education
+     difficulty: medium
+     description: Detect hate speech hidden behind pseudoscientific or educational framing
+
+   - id: T-005
+     name: Political News with Protest Violence
+     difficulty: hard
+     description: Protect legitimate journalism on sensitive political events without over-censorship
+
+   - id: T-006
+     name: Religious Expression False Flag
+     difficulty: hard
+     description: Distinguish protected religious expression from an automated false-positive flag
+
+ tags:
+   - content-moderation
+   - trust-safety
+   - hate-speech
+   - political-sensitivity
+   - cultural-nuance
+   - real-world
+   - openenv
+
933
+ pyproject.toml:
+ [build-system]
+ requires = ["setuptools>=68.0", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "trust-safety-env"
+ version = "1.0.0"
+ description = "Risk-aware Trust & Safety content moderation RL environment — OpenEnv compatible"
+ readme = "readme.md"
+ requires-python = ">=3.11"
+ dependencies = [
+     "openenv-core>=0.2.0",
+     "fastapi>=0.110.0",
+     "uvicorn[standard]>=0.29.0",
+     "pydantic>=2.6.0",
+     "openai>=1.30.0",
+     "requests>=2.31.0",
+     "python-dotenv>=1.0.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = ["pytest>=8.0"]
+
+ [tool.setuptools.packages.find]
+ where = ["."]
+ include = ["*"]
+
+ [tool.openenv]
+ name = "trust-safety-env"
+ environment_class = "your_environment.TrustSafetyEnvironment"
+ action_model = "models.TrustAction"
+ observation_model = "models.TrustObservation"
+ state_model = "models.TrustState"
+
968
+ requirements.txt:
+
+ fastapi>=0.110.0
+ uvicorn[standard]>=0.29.0
+ pydantic>=2.6.0
+ openai>=1.30.0
+ requests>=2.31.0
+ python-dotenv>=1.0.0
+
+
978
+ tasks.py:
+ TASKS = [
+     {
+         "task_id": "T-001",
+         "difficulty": "easy",
+         "ambiguity_level": "low",
+         "risk_level": "low",
+         "post_text": (
+             "🚨 URGENT: Your account has been COMPROMISED! "
+             "Click here immediately: http://meta-security-verify-now.xyz/login?token=abc123 "
+             "Act within 24 hours or your account will be PERMANENTLY BANNED. — Meta Security Team"
+         ),
+         "image_description": (
+             "Poorly formatted alert page with mismatched Meta logos, "
+             "red urgent warning banners, and a suspicious external URL."
+         ),
+         "tool_responses": {
+             "read_comments": "Comments disabled. Zero engagement.",
+             "check_user_history": "Account created 2 days ago. Zero posts. Flagged by spam filter 3×.",
+             "check_entity_status": "Domain 'meta-security-verify-now.xyz' registered 1 day ago — confirmed phishing domain.",
+             "view_policy": "Policy 3.2 — Phishing: Impersonating Meta to harvest credentials = immediate REMOVE.",
+         },
+         "required_tools": [],
+         "ground_truth": "REMOVE",
+         "ground_truth_signals": {
+             "target": "individual",
+             "is_protected_class": False,
+             "toxicity_level": 0.90,
+             "is_direct_attack": True,
+             "context_type": "spam",
+             "intent": "harmful",
+             "abusive_language_present": False,
+             "content_flags": ["phishing", "spam", "impersonation"],
+         },
+     },
+     {
+         "task_id": "T-002",
+         "difficulty": "medium",
+         "ambiguity_level": "medium",
+         "risk_level": "low",
+         "post_text": (
+             "LMAO I will literally SNIPE you the second the game starts, "
+             "no mercy, you're getting absolutely destroyed 💀🎮 "
+             "1v1 me and I'll end your whole career bro. Your team is DONE."
+         ),
+         "image_description": (
+             "Call of Duty: Warzone — squad wipe in progress. "
+             "Kill notifications, sniper rifle, enemy team eliminated. Classic gaming screenshot."
+         ),
+         "tool_responses": {
+             "read_comments": (
+                 "47 comments: 'GG bro!', 'You always trash talk before CoD 😂', "
+                 "'rematch tonight??', 'your aim is mid bro'. "
+                 "All mutual gaming friends. Clear competitive banter context."
+             ),
+             "check_user_history": (
+                 "Account 3 years active. 500+ posts, 90% gaming. "
+                 "Zero harassment reports. Regular CoD tournament player."
+             ),
+             "check_entity_status": "Private individual. Standard gaming community member.",
+             "view_policy": (
+                 "Policy 7.1 — Harassment: Hyperbolic language in established gaming communities "
+                 "is NOT harassment. 'Snipe', 'destroy', 'end your career' are non-literal expressions."
+             ),
+         },
+         "required_tools": ["read_comments"],
+         "ground_truth": "ALLOW",
+         "ground_truth_signals": {
+             "target": "individual",
+             "is_protected_class": False,
+             "toxicity_level": 0.30,
+             "is_direct_attack": False,
+             "context_type": "meme",
+             "intent": "ambiguous",
+             "abusive_language_present": False,
+             "content_flags": ["competitive_language"],
+         },
+     },
+     {
+         "task_id": "T-003",
+         "difficulty": "hard",
+         "ambiguity_level": "high",
+         "risk_level": "high",
+         "post_text": (
+             "Sharing this political cartoon I drew. "
+             "Mayor Davidson cutting the homeless shelter budget while sitting on a throne of gold coins. "
+             "Caption: 'Priorities.' #LocalPolitics #HoldThemAccountable #SatiricalArt #MayorDavidson"
+         ),
+         "image_description": (
+             "Hand-drawn political cartoon. Exaggerated caricature of a public official on a coin throne, "
+             "homeless families pushed away. Deliberately satirical editorial style. Artist signature visible."
+         ),
+         "tool_responses": {
+             "read_comments": (
+                 "23 comments: 'Powerful commentary', 'Free speech!', 'Great art', 'Defamation!'. "
+                 "No threats or coordinated harassment detected."
+             ),
+             "check_user_history": (
+                 "Account 5 years. Verified local journalist and artist. "
+                 "Regular political commentary. No violations. Posts cited in local news."
+             ),
+             "check_entity_status": (
+                 "Mayor Davidson: CONFIRMED PUBLIC FIGURE. "
+                 "Elected Mayor of Springfield. Active public interest in recent social services budget cuts. "
+                 "Multiple credible news articles confirm the cuts are real and newsworthy."
+             ),
+             "view_policy": (
+                 "Policy 12.4 — Political Satire: Editorial cartoons targeting PUBLIC OFFICIALS "
+                 "on PUBLIC INTEREST matters are PROTECTED. Caricature is a recognized technique. "
+                 "Not realistic impersonation. Correct action: ALLOW_WITH_WARNING."
+             ),
+         },
+         "required_tools": ["check_entity_status", "view_policy"],
+         "ground_truth": "ALLOW_WITH_WARNING",
+         "ground_truth_signals": {
+             "target": "political",
+             "is_protected_class": False,
+             "toxicity_level": 0.20,
+             "is_direct_attack": False,
+             "context_type": "satire",
+             "intent": "satirical",
+             "abusive_language_present": False,
+             "content_flags": ["political_content", "public_figure"],
+         },
+     },
+     # ── Append these to the TASKS list ──────────────────────────────────────
+     {
+         "task_id": "T-004",
+         "difficulty": "medium",
+         "ambiguity_level": "medium",
+         "risk_level": "high",
+         "post_text": (
+             "Sharing this 'educational' video about why [ethnic group] people "
+             "are genetically predisposed to criminal behaviour. "
+             "The science doesn't lie. Share before it gets deleted. 👀 "
+             "#HiddenTruths #CensoredFacts #JustAsking"
+         ),
+         "image_description": (
+             "Screenshot of a YouTube-style video thumbnail with a white lab coat figure "
+             "and a graph. Title: 'The Uncomfortable Biology of Crime Rates'. "
+             "Uploader has 0 verified credentials. Comments disabled."
+         ),
+         "tool_responses": {
+             "read_comments": (
+                 "Comments disabled by uploader. "
+                 "Zero engagement from credible academic or scientific accounts."
+             ),
+             "check_user_history": (
+                 "Account 6 months old. 48 posts — all promoting pseudoscientific race content. "
+                 "Previously received 2 policy warnings for hate content. "
+                 "No verified credentials. Frequently uses 'just asking questions' framing."
+             ),
+             "check_entity_status": (
+                 "Ethnic group mentioned: confirmed PROTECTED CLASS under platform policy. "
+                 "Claim about genetic criminality: DEBUNKED by mainstream biology and criminology. "
+                 "This framing is a known pseudoscience hate vector (scientific racism)."
+             ),
+             "view_policy": (
+                 "Policy 4.3 — Hate Speech: Content attributing criminal or negative traits "
+                 "to a protected class based on race or ethnicity = REMOVE. "
+                 "Disguising hate speech as 'educational content' or 'science' does not "
+                 "change the violation. Action: REMOVE."
+             ),
+         },
+         "required_tools": ["check_entity_status", "view_policy"],
+         "ground_truth": "REMOVE",
+         "ground_truth_signals": {
+             "target": "group",
+             "is_protected_class": True,
+             "toxicity_level": 0.85,
+             "is_direct_attack": True,
+             "context_type": "statement",
+             "intent": "harmful",
+             "abusive_language_present": False,
+             "content_flags": ["slur", "misinformation", "hate_speech_disguised_as_education"],
+         },
+     },
+     {
+         "task_id": "T-005",
+         "difficulty": "hard",
+         "ambiguity_level": "high",
+         "risk_level": "high",
+         "post_text": (
+             "The Prime Minister announced new economic reforms today. "
+             "Opposition leaders called it 'economic warfare on the poor.' "
+             "Protests erupted in 3 cities. Security forces used tear gas. "
+             "Government says the crackdown was 'necessary to restore order.' "
+             "At least 12 protesters arrested. #Politics #Protest #Reform"
+         ),
+         "image_description": (
+             "News-style post with a photo of a large protest crowd holding signs. "
+             "Tear gas clouds visible in background. "
+             "Caption: 'Thousands take to streets against new austerity measures.' "
+             "Shared by a verified news outlet handle. No graphic violence."
+         ),
+         "tool_responses": {
+             "read_comments": (
+                 "340 comments split sharply: "
+                 "'State violence against citizens!', 'Finally someone fixing the economy', "
+                 "'This is propaganda!', 'Biased reporting!', 'Necessary action'. "
+                 "High engagement, deep political polarization. No incitement to violence detected."
+             ),
+             "check_user_history": (
+                 "Posted by @NationalNewsNetwork — verified news organization. "
+                 "12 years active. Consistently neutral coverage. "
+                 "No policy violations. Registered press entity."
+             ),
+             "check_entity_status": (
+                 "Prime Minister: CONFIRMED PUBLIC FIGURE, head of government. "
+                 "Economic reforms: CONFIRMED real policy, actively debated in parliament. "
+                 "Protest: CONFIRMED real event, covered by 6+ independent news agencies. "
+                 "Political sensitivity: VERY HIGH — active governance dispute in a democracy."
+             ),
+             "view_policy": (
+                 "Policy 12.1 — News & Political Content: Factual reporting on real government "
+                 "actions, protests, and police responses by VERIFIED NEWS entities is PROTECTED. "
+                 "Even politically contentious factual reporting is protected expression. "
+                 "Suppressing legitimate news = censorship risk. "
+                 "Action: ALLOW_WITH_WARNING (politically sensitive content label)."
+             ),
+         },
+         "required_tools": ["check_user_history", "check_entity_status", "view_policy"],
+         "ground_truth": "ALLOW_WITH_WARNING",
+         "ground_truth_signals": {
+             "target": "political",
+             "is_protected_class": False,
+             "toxicity_level": 0.20,
+             "is_direct_attack": False,
+             "context_type": "news",
+             "intent": "descriptive",
+             "abusive_language_present": False,
+             "content_flags": ["political_content", "public_figure", "protest"],
+         },
+     },
+     {
+         "task_id": "T-006",
+         "difficulty": "hard",
+         "ambiguity_level": "high",
+         "risk_level": "high",
+         "post_text": (
+             "Sending Eid Mubarak wishes to my community 🌙✨ "
+             "During this holy month let us remember those suffering in conflict zones. "
+             "Our brothers and sisters deserve peace and dignity. "
+             "May Allah guide us all. #EidMubarak #Ramadan #Peace #Solidarity"
+         ),
+         "image_description": (
+             "Warm celebratory post with a crescent moon and stars graphic. "
+             "Soft gold and green colors. "
+             "Comment section shows replies in Arabic, Urdu, and English. "
+             "Emojis: praying hands, crescent moon, dove."
+         ),
+         "tool_responses": {
+             "read_comments": (
+                 "89 comments: 'Eid Mubarak!', 'JazakAllah Khair', "
+                 "'prayers for peace', 'beautiful message', "
+                 "1 comment flagged by automated system: 'why is this allowed, "
+                 "these people support terrorism' — that comment is from a third party, "
+                 "not the original poster. "
+                 "No incitement, no threats, no extremist content in original post."
+             ),
+             "check_user_history": (
+                 "Account 4 years active. 200+ posts — community announcements, "
+                 "religious holiday greetings, charity drives. "
+                 "Zero prior violations. Community leader in local Muslim cultural centre."
+             ),
+             "check_entity_status": (
+                 "Islam / Muslim community: CONFIRMED PROTECTED CLASS (religion). "
+                 "Eid al-Fitr: globally recognized Islamic religious holiday. "
+                 "Content: religious greeting + humanitarian solidarity message. "
+                 "No extremist symbols, no prohibited organisations mentioned."
+             ),
+             "view_policy": (
+                 "Policy 8.1 — Religious Expression: Greetings and solidarity messages "
+                 "for recognized religious holidays are PROTECTED. "
+                 "Vague 'conflict zone' solidarity without naming a prohibited group "
+                 "is NOT a violation. "
+                 "Automated flag was triggered by a COMMENT, not the original post. "
+                 "Action: ALLOW — original content is clearly safe religious expression."
+             ),
+         },
+         "required_tools": ["read_comments", "check_entity_status", "view_policy"],
+         "ground_truth": "ALLOW",
+         "ground_truth_signals": {
+             "target": "religion",
+             "is_protected_class": True,
+             "toxicity_level": 0.05,
+             "is_direct_attack": False,
+             "context_type": "statement",
+             "intent": "educational",
+             "abusive_language_present": False,
+             "content_flags": ["political_content"],
+         },
+     },
+ ]
+
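+ Hand-edited entries like these drift out of shape easily; a quick structural check over each TASKS item catches missing keys or a mistyped decision label before an episode silently misbehaves. A minimal sketch, assuming the schema shown above — the `validate_task` helper and its key set are illustrative, not part of the repo:

```python
ALLOWED_DECISIONS = {"ALLOW", "ALLOW_WITH_WARNING", "REMOVE", "ESCALATE"}
REQUIRED_KEYS = {"task_id", "difficulty", "ambiguity_level", "risk_level",
                 "post_text", "tool_responses", "required_tools",
                 "ground_truth", "ground_truth_signals"}

def validate_task(task: dict) -> list[str]:
    """Return a list of schema problems for one TASKS entry (empty = OK)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - task.keys())]
    # Ground truth must be one of the four decisions the environment scores.
    if task.get("ground_truth") not in ALLOWED_DECISIONS:
        problems.append(f"bad ground_truth: {task.get('ground_truth')!r}")
    # Every required tool should have a canned response to reveal.
    if not set(task.get("required_tools", [])) <= task.get("tool_responses", {}).keys():
        problems.append("required_tools not covered by tool_responses")
    return problems
```

+ Running `[validate_task(t) for t in TASKS]` at import time (or in a pytest) would make a broken task fail loudly instead of scoring oddly.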
+ your_environment.py:
+ from __future__ import annotations
+
+ import random
+ import uuid
+ from typing import Optional, Dict, Any, Set
+
+ try:
+     from openenv_core.env_server import Environment
+     print("[env] Inheriting from openenv_core.env_server.Environment ✅")
+ except ImportError:
+     try:
+         from openenv.core.env_server import Environment
+         print("[env] Inheriting from openenv.core.env_server.Environment ✅")
+     except ImportError:
+         Environment = object
+         print("[env] openenv_core not found — using plain object base ⚠️")
+
+ from models import TrustObservation, TrustAction, TrustState, ContentSignals
+ from tasks import TASKS
+
+
+ TOOL_COSTS: Dict[str, float] = {
+     "read_comments": 0.05,
+     "check_user_history": 0.05,
+     "check_entity_status": 0.10,
+     "view_policy": 0.10,
+ }
+
+ MAX_STEPS = 7
+
+ DECISION_MATRIX: Dict[tuple, float] = {
+     ("REMOVE", "REMOVE"): 1.00,
+     ("ALLOW", "ALLOW"): 1.00,
+     ("ALLOW_WITH_WARNING", "ALLOW_WITH_WARNING"): 1.00,
+     ("ESCALATE", "ESCALATE"): 1.00,
+     ("ALLOW_WITH_WARNING", "ALLOW"): 0.75,
+     ("ALLOW", "ALLOW_WITH_WARNING"): 0.55,
+     ("ESCALATE", "ALLOW_WITH_WARNING"): 0.65,
+     ("ESCALATE", "ALLOW"): 0.45,
+     ("ESCALATE", "REMOVE"): 0.45,
+     ("REMOVE", "ALLOW"): 0.10,
+     ("REMOVE", "ALLOW_WITH_WARNING"): 0.20,
+     ("ALLOW", "REMOVE"): 0.00,
+     ("ALLOW_WITH_WARNING", "REMOVE"): 0.15,
+ }
+
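+ The matrix above can be exercised on its own. A minimal sketch of the accuracy lookup, including the default for unlisted pairs and the high-ambiguity ESCALATE floor that `_compute_components` applies later — the `base_score` helper name is illustrative, and the matrix below is deliberately truncated:

```python
DECISION_MATRIX = {
    ("REMOVE", "REMOVE"): 1.00,
    ("ALLOW", "ALLOW"): 1.00,
    ("ESCALATE", "ALLOW"): 0.45,
    ("ALLOW", "REMOVE"): 0.00,
    # ... remaining (decision, ground_truth) pairs as defined above
}

def base_score(decision: str, ground_truth: str, ambiguity: str) -> float:
    """Accuracy component: matrix lookup, 0.20 default for unlisted pairs,
    and ESCALATE on high-ambiguity tickets is floored at 0.70."""
    score = DECISION_MATRIX.get((decision, ground_truth), 0.20)
    if decision == "ESCALATE" and ambiguity == "high":
        score = max(score, 0.70)
    return score
```

+ The asymmetry is the point: wrongly allowing content that should be removed scores 0.00, while wrongly removing allowable content still earns 0.10, and escalating is never catastrophic.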
+
1324
+ class TrustSafetyEnvironment(Environment):
1325
+ """
1326
+ 3-Layer Risk-Aware Trust & Safety RL Environment.
1327
+
1328
+ Layer 1 — Evidence gathering : agent uses investigation tools (optional)
1329
+ Layer 2 — Signal extraction : agent outputs ContentSignals as feature extractor
1330
+ Layer 3 — Policy engine : validates signals, applies rules, computes reward
1331
+
1332
+ 8-Component Reward: Accuracy · Policy Alignment · Signal Quality · Escalation
1333
+ Tool Usage · Consistency · Risk Sensitivity · Confidence
1334
+ """
1335
+
1336
+ def __init__(self, seed: int = 42) -> None:
1337
+ super().__init__()
1338
+ self._rng = random.Random(seed)
1339
+ self._current_task: Optional[Dict[str, Any]] = None
1340
+ self._tools_used: Set[str] = set()
1341
+ self._step_count: int = 0
1342
+ self._extracted_signals: Optional[ContentSignals] = None
1343
+ self._validation_result: Optional[Dict[str, Any]] = None
1344
+ self._signals_extracted: bool = False
1345
+ self._obs: Optional[TrustObservation]= None
1346
+ self._state = TrustState()
1347
+
1348
+ # ✅ FIX 3 — build a dict keyed by task_id for O(1) lookup
1349
+ self._tasks: Dict[str, Dict[str, Any]] = {
1350
+ t["task_id"]: t for t in TASKS
1351
+ }
1352
+
1353
+ # -----------------------------------------------------------------------
1354
+ # OpenEnv interface
1355
+ # -----------------------------------------------------------------------
1356
+
1357
+ def reset(self, seed=None, episode_id=None, **kwargs) -> TrustObservation:
1358
+ # ✅ FIX 1 — reset() is now correctly INSIDE the class
1359
+ if seed is not None:
1360
+ self._rng.seed(seed)
1361
+
1362
+ # Pick task by episode_id if provided, else random from all 6
1363
+ if episode_id and episode_id in self._tasks:
1364
+ task = self._tasks[episode_id]
1365
+ else:
1366
+ task = self._rng.choice(list(self._tasks.values()))
1367
+
1368
+ self._current_task = task
1369
+ self._tools_used = set()
1370
+ self._step_count = 0
1371
+ self._extracted_signals = None
1372
+ self._validation_result = None
1373
+ self._signals_extracted = False
1374
+
1375
+ self._state = TrustState(
1376
+ episode_id=task["task_id"],
1377
+ step_count=0,
1378
+ current_task_id=task["task_id"],
1379
+ difficulty=task.get("difficulty", "medium"),
1380
+ risk_level=task.get("risk_level", "medium"),
1381
+ is_done=False,
1382
+ tools_used=[],
1383
+ signals_extracted=False,
1384
+ )
1385
+
1386
+ self._obs = TrustObservation(
1387
+ ticket_id=task["task_id"],
1388
+ post_text=task["post_text"],
1389
+ image_description=task.get("image_description", ""),
1390
+ step_number=0,
1391
+ done=False,
1392
+ )
1393
+ return self._obs # ✅ FIX 2 — single clean return, stray return removed
1394
+
1395
+ def step(self, action: TrustAction, timeouts: Optional[Any] = None,
1396
+ **kwargs) -> TrustObservation:
1397
+ if self._current_task is None or self._obs is None:
1398
+ raise RuntimeError("Call reset() before step().")
1399
+
1400
+ if self._step_count >= MAX_STEPS:
1401
+ self._obs = TrustObservation(
1402
+ ticket_id=self._current_task["task_id"],
1403
+ post_text=self._obs.post_text,
1404
+ image_description=self._obs.image_description,
1405
+ step_number=self._step_count,
1406
+ done=True,
1407
+ reward=0.0,
1408
+ info={"reason": "timeout", "tools_used": list(self._tools_used)},
1409
+ )
1410
+ return self._obs
1411
+
1412
+ atype = action.action_type
1413
+ if atype == "use_tool":
1414
+ return self._handle_tool(action)
1415
+ if atype == "extract_signals":
1416
+ return self._handle_signal_extraction(action)
1417
+ if atype == "final_decision":
1418
+ return self._handle_final_decision(action)
1419
+ raise ValueError(f"Unknown action_type: {atype!r}")
1420
+
1421
+ @property
1422
+ def state(self) -> TrustState:
1423
+ return self._state
1424
+
1425
+ # -----------------------------------------------------------------------
1426
+ # Layer 1 — Tool handling
1427
+ # -----------------------------------------------------------------------
1428
+
1429
+ def _handle_tool(self, action: TrustAction) -> TrustObservation:
1430
+ tool = action.tool_name
1431
+ if tool not in TOOL_COSTS:
1432
+ raise ValueError(f"Unknown tool: {tool!r}")
1433
+ self._tools_used.add(tool)
1434
+ response = self._current_task["tool_responses"].get(tool, "No data found.")
1435
+ field_map = {
1436
+ "read_comments": "comments_found",
1437
+ "check_user_history": "user_history_found",
1438
+ "check_entity_status": "entity_status_found",
1439
+ "view_policy": "policy_found",
1440
+ }
1441
+ self._step_count += 1
1442
+ self._state.step_count = self._step_count
1443
+ self._state.tools_used = list(self._tools_used)
1444
+
1445
+ obs_kwargs = {
1446
+ k: getattr(self._obs, k)
1447
+ for k in ("ticket_id", "post_text", "image_description",
1448
+ "comments_found", "user_history_found",
1449
+ "entity_status_found", "policy_found",
1450
+ "extracted_signals", "validation_result")
1451
+ }
1452
+ obs_kwargs[field_map[tool]] = response
1453
+ obs_kwargs["step_number"] = self._step_count
1454
+ obs_kwargs["done"] = False
1455
+ obs_kwargs["reward"] = None
1456
+
1457
+ self._obs = TrustObservation(**obs_kwargs)
1458
+ return self._obs
1459
+
1460
+ # -----------------------------------------------------------------------
1461
+ # Layer 2 — Signal extraction + validation
1462
+ # -----------------------------------------------------------------------
1463
+
1464
+ def _handle_signal_extraction(self, action: TrustAction) -> TrustObservation:
1465
+ raw = action.signals
1466
+ raw.toxicity_level = max(0.0, min(1.0, float(raw.toxicity_level)))
1467
+ raw.confidence = max(0.0, min(1.0, float(raw.confidence)))
1468
+ if not isinstance(raw.content_flags, list):
1469
+ raw.content_flags = []
1470
+
1471
+ self._extracted_signals = raw
1472
+ self._signals_extracted = True
1473
+ self._validation_result = self._validate_signals(raw)
1474
+ self._step_count += 1
1475
+ self._state.step_count = self._step_count
1476
+ self._state.signals_extracted = True
1477
+
1478
+ obs_kwargs = {
1479
+ k: getattr(self._obs, k)
1480
+ for k in ("ticket_id", "post_text", "image_description",
1481
+ "comments_found", "user_history_found",
1482
+ "entity_status_found", "policy_found")
1483
+ }
1484
+ obs_kwargs["extracted_signals"] = {
1485
+ "target": raw.target,
1486
+ "is_protected_class": raw.is_protected_class,
1487
+ "toxicity_level": raw.toxicity_level,
1488
+ "is_direct_attack": raw.is_direct_attack,
1489
+ "context_type": raw.context_type,
1490
+ "intent": raw.intent,
1491
+ "confidence": raw.confidence,
1492
+ "abusive_language_present": raw.abusive_language_present,
1493
+ "content_flags": raw.content_flags,
1494
+ }
1495
+ obs_kwargs["validation_result"] = self._validation_result
1496
+ obs_kwargs["step_number"] = self._step_count
1497
+ obs_kwargs["done"] = False
1498
+ obs_kwargs["reward"] = None
1499
+
1500
+ self._obs = TrustObservation(**obs_kwargs)
1501
+ return self._obs
1502
+
1503
+ def _validate_signals(self, s: ContentSignals) -> Dict[str, Any]:
1504
+ issues = []
1505
+ conf = s.confidence
1506
+
1507
+ if not s.abusive_language_present and s.toxicity_level > 0.75:
1508
+ issues.append("high_toxicity_without_abusive_language"); conf -= 0.15
1509
+ if s.context_type in ("satire", "education") and s.intent == "harmful":
1510
+ issues.append("harmful_intent_contradicts_satire_context"); conf -= 0.12
1511
+ if s.is_protected_class and s.target == "none":
1512
+ issues.append("protected_class_flagged_without_target"); conf -= 0.10
1513
+ if s.is_direct_attack and s.toxicity_level < 0.25:
1514
+ issues.append("direct_attack_with_low_toxicity"); conf -= 0.12
1515
+ if s.context_type == "spam" and "spam" not in s.content_flags \
1516
+ and "phishing" not in s.content_flags:
1517
+ issues.append("spam_context_missing_spam_flag"); conf -= 0.08
1518
+
1519
+ adj = round(max(0.0, min(1.0, conf)), 3)
1520
+ return {
1521
+ "original_confidence": s.confidence,
1522
+ "adjusted_confidence": adj,
1523
+ "consistency_issues": issues,
1524
+ "issue_count": len(issues),
1525
+ "is_consistent": len(issues) == 0,
1526
+ "needs_escalation_hint": adj < 0.55 or len(issues) >= 2,
1527
+ }
1528
+
1529
+ # -----------------------------------------------------------------------
1530
+ # Layer 3 — Policy engine + final decision
1531
+ # -----------------------------------------------------------------------
1532
+
1533
+ def _policy_engine(self) -> Dict[str, Any]:
1534
+ if not self._extracted_signals or not self._validation_result:
1535
+ return {"recommended": "ESCALATE", "reason": "no_signals_extracted",
1536
+ "rule_strength": "weak"}
1537
+
1538
+ s = self._extracted_signals
1539
+ adj = self._validation_result["adjusted_confidence"]
1540
+
1541
+ if adj < 0.50:
1542
+ return {"recommended": "ESCALATE", "reason": "low_adjusted_confidence",
1543
+ "rule_strength": "medium"}
1544
+ if "spam" in s.content_flags or "phishing" in s.content_flags:
1545
+ return {"recommended": "REMOVE", "reason": "spam_or_phishing_detected",
1546
+ "rule_strength": "strong"}
1547
+ if s.is_protected_class and s.is_direct_attack \
1548
+ and s.toxicity_level > 0.70 and adj > 0.70:
1549
+ return {"recommended": "REMOVE",
1550
+ "reason": "protected_class_direct_attack_high_toxicity",
1551
+ "rule_strength": "strong"}
1552
+ if s.toxicity_level > 0.85 and s.intent == "harmful":
1553
+ return {"recommended": "REMOVE", "reason": "high_toxicity_harmful_intent",
1554
+ "rule_strength": "strong"}
1555
+ if s.context_type in ("satire", "news", "education", "protest") \
1556
+ and s.intent in ("satirical", "descriptive", "educational"):
1557
+ return {"recommended": "ALLOW_WITH_WARNING",
1558
+ "reason": "protected_expression_context",
1559
+ "rule_strength": "medium"}
1560
+ if 0.40 < s.toxicity_level < 0.75 and s.intent == "ambiguous":
1561
+ return {"recommended": "ESCALATE", "reason": "ambiguous_moderate_toxicity",
1562
+ "rule_strength": "medium"}
1563
+ return {"recommended": "ALLOW", "reason": "no_policy_violation_detected",
1564
+ "rule_strength": "medium"}
1565
+
1566
+ def _handle_final_decision(self, action: TrustAction) -> TrustObservation:
1567
+ decision = action.final_decision
1568
+ components = self._compute_components(decision)
1569
+ policy_rec = components.pop("_policy_rec")
1570
+ reward = self._finalize_reward(components)
1571
+
1572
+ self._step_count += 1
1573
+ self._state.step_count = self._step_count
1574
+ self._state.is_done = True
1575
+ components["final_reward"] = reward
1576
+
1577
+ obs_kwargs = {
1578
+ k: getattr(self._obs, k)
1579
+ for k in ("ticket_id", "post_text", "image_description",
1580
+ "comments_found", "user_history_found",
1581
+ "entity_status_found", "policy_found",
1582
+ "extracted_signals", "validation_result")
1583
+ }
1584
+ obs_kwargs["step_number"] = self._step_count
1585
+ obs_kwargs["done"] = True
1586
+ obs_kwargs["reward"] = reward
1587
+ obs_kwargs["info"] = {
1588
+ "final_decision": decision,
1589
+ "ground_truth": self._current_task["ground_truth"],
1590
+ "policy_recommendation": policy_rec,
1591
+ "signals_extracted": self._signals_extracted,
1592
+ "tools_used": list(self._tools_used),
1593
+ "required_tools": self._current_task["required_tools"],
1594
+ "ambiguity_level": self._current_task["ambiguity_level"],
1595
+ "risk_level": self._current_task["risk_level"],
1596
+ "task_id": self._current_task["task_id"],
1597
+ "reward_breakdown": components,
1598
+ }
1599
+
1600
+ self._obs = TrustObservation(**obs_kwargs)
1601
+ return self._obs
1602
+
1603
+ # -----------------------------------------------------------------------
1604
+ # 8-Component Reward Engine
1605
+ # -----------------------------------------------------------------------
1606
+
1607
+ def _compute_components(self, final_decision: str) -> Dict[str, Any]:
1608
+ gt = self._current_task["ground_truth"]
1609
+ required_tools = self._current_task["required_tools"]
1610
+ ambiguity = self._current_task["ambiguity_level"]
1611
+ risk_level = self._current_task["risk_level"]
1612
+ policy_rec = self._policy_engine()
1613
+
1614
+ base_score = DECISION_MATRIX.get((final_decision, gt), 0.20)
1615
+ if final_decision == "ESCALATE" and ambiguity == "high":
1616
+ base_score = max(base_score, 0.70)
1617
+ is_correct = base_score >= 0.90
1618
+
1619
+ rule_weight = {"strong": 1.0, "medium": 0.70, "weak": 0.40}.get(
1620
+ policy_rec.get("rule_strength", "medium"), 0.70)
1621
+ policy_alignment = round(
1622
+ (+0.12 if final_decision == policy_rec["recommended"] else -0.18) * rule_weight, 4)
1623
+
1624
+ signal_accuracy_bonus = self._compute_signal_accuracy()
1625
+
1626
+ adj_conf = (self._validation_result["adjusted_confidence"]
1627
+ if self._validation_result else 0.50)
1628
+ should_escalate = adj_conf < 0.50
1629
+ if should_escalate and final_decision == "ESCALATE":
1630
+ escalation_adj = +0.15
1631
+ elif should_escalate and final_decision != "ESCALATE":
1632
+ escalation_adj = -0.18
1633
+ elif not should_escalate and final_decision == "ESCALATE" and ambiguity == "low":
1634
+ escalation_adj = -0.20
1635
+ elif not should_escalate and final_decision == "ESCALATE":
1636
+ escalation_adj = -0.10
1637
+ else:
1638
+ escalation_adj = 0.0
1639
+
1640
+ signal_bonus = +0.05 if self._signals_extracted else -0.10
1641
+ tool_cost = round(sum(TOOL_COSTS.get(t, 0.0) for t in self._tools_used), 4)
1642
+ missing_required = set(required_tools) - self._tools_used
1643
+         tool_miss_penalty = round(len(missing_required) * 0.25, 4)
+
+         if self._validation_result:
+             n = self._validation_result["issue_count"]
+             validation_penalty = {0: 0.00, 1: 0.05, 2: 0.12}.get(n, 0.20)
+         else:
+             validation_penalty = 0.12
+
+         risk_penalty = 0.0
+         if not is_correct:
+             risk_penalty = {"high": 0.20, "medium": 0.10, "low": 0.0}.get(risk_level, 0.0)
+
+         if base_score < 0.50 and adj_conf > 0.80:
+             confidence_penalty = 0.22
+         elif base_score < 0.50 and adj_conf > 0.65:
+             confidence_penalty = 0.12
+         elif self._signals_extracted and final_decision == "ESCALATE" and adj_conf < 0.55:
+             confidence_penalty = -0.10
+         else:
+             confidence_penalty = 0.0
+
+         return {
+             "base_score": base_score,
+             "policy_alignment": policy_alignment,
+             "signal_accuracy_bonus": signal_accuracy_bonus,
+             "escalation_adj": escalation_adj,
+             "signal_bonus": signal_bonus,
+             "tool_cost": tool_cost,
+             "tool_miss_penalty": tool_miss_penalty,
+             "validation_penalty": validation_penalty,
+             "risk_penalty": risk_penalty,
+             "confidence_penalty": confidence_penalty,
+             "_policy_rec": policy_rec,
+         }
+
+     def _finalize_reward(self, components: Dict[str, Any]) -> float:
+         raw = (
+             components["base_score"]
+             + components["policy_alignment"]
+             + components["signal_accuracy_bonus"]
+             + components["escalation_adj"]
+             + components["signal_bonus"]
+             - components["tool_cost"]
+             - components["tool_miss_penalty"]
+             - components["validation_penalty"]
+             - components["risk_penalty"]
+             - components["confidence_penalty"]
+         )
+         return round(max(0.0, min(1.0, raw)), 4)
+
+     def _compute_signal_accuracy(self) -> float:
+         if not self._extracted_signals:
+             return 0.0
+         gt = self._current_task.get("ground_truth_signals", {})
+         if not gt:
+             return 0.05
+
+         s = self._extracted_signals
+         score = 0.0
+         if s.target == gt.get("target"): score += 0.20
+         if s.intent == gt.get("intent"): score += 0.20
+         if s.context_type == gt.get("context_type"): score += 0.20
+
+         tox_diff = abs(s.toxicity_level - gt.get("toxicity_level", 0.5))
+         score += 0.20 if tox_diff <= 0.20 else (0.10 if tox_diff <= 0.35 else 0.0)
+
+         gt_flags = set(gt.get("content_flags", []))
+         s_flags = set(s.content_flags)
+         if gt_flags:
+             score += 0.20 * min(1.0, len(gt_flags & s_flags) / len(gt_flags))
+         else:
+             score += 0.20 if not s_flags else 0.10
+
+         return round(score * 0.15, 4)
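The reward composition in `_finalize_reward` (bonuses add, penalties subtract, result clamped to [0, 1]) can be sanity-checked in isolation. A minimal standalone sketch, with a hypothetical component breakdown for illustration (the keys mirror the dict built above; `_policy_rec` is metadata and excluded from the sum):

```python
def finalize_reward(components):
    # Mirrors _finalize_reward: five bonus components add, five penalty
    # components subtract, and the raw sum is clamped to [0, 1].
    bonuses = ("base_score", "policy_alignment", "signal_accuracy_bonus",
               "escalation_adj", "signal_bonus")
    penalties = ("tool_cost", "tool_miss_penalty", "validation_penalty",
                 "risk_penalty", "confidence_penalty")
    raw = sum(components[k] for k in bonuses) - sum(components[k] for k in penalties)
    return round(max(0.0, min(1.0, raw)), 4)

# Hypothetical breakdown: correct REMOVE, two tools used, one consistency issue.
example = {
    "base_score": 1.00, "policy_alignment": 0.05, "signal_accuracy_bonus": 0.12,
    "escalation_adj": 0.0, "signal_bonus": 0.05,
    "tool_cost": 0.15, "tool_miss_penalty": 0.0, "validation_penalty": 0.05,
    "risk_penalty": 0.0, "confidence_penalty": 0.0,
}
print(finalize_reward(example))  # → 1.0 (raw 1.02 is clamped to the [0, 1] range)
```

Note the clamp: strong episodes can exceed 1.0 before clamping, so the components are shaping signals rather than an exact budget.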
tasks.py ADDED
@@ -0,0 +1,296 @@
+ TASKS = [
+     {
+         "task_id": "T-001",
+         "difficulty": "easy",
+         "ambiguity_level": "low",
+         "risk_level": "low",
+         "post_text": (
+             "🚨 URGENT: Your account has been COMPROMISED! "
+             "Click here immediately: http://meta-security-verify-now.xyz/login?token=abc123 "
+             "Act within 24 hours or your account will be PERMANENTLY BANNED. — Meta Security Team"
+         ),
+         "image_description": (
+             "Poorly formatted alert page with mismatched Meta logos, "
+             "red urgent warning banners, and a suspicious external URL."
+         ),
+         "tool_responses": {
+             "read_comments": "Comments disabled. Zero engagement.",
+             "check_user_history": "Account created 2 days ago. Zero posts. Flagged by spam filter 3×.",
+             "check_entity_status": "Domain 'meta-security-verify-now.xyz' registered 1 day ago — confirmed phishing domain.",
+             "view_policy": "Policy 3.2 — Phishing: Impersonating Meta to harvest credentials = immediate REMOVE.",
+         },
+         "required_tools": [],
+         "ground_truth": "REMOVE",
+         "ground_truth_signals": {
+             "target": "individual",
+             "is_protected_class": False,
+             "toxicity_level": 0.90,
+             "is_direct_attack": True,
+             "context_type": "spam",
+             "intent": "harmful",
+             "abusive_language_present": False,
+             "content_flags": ["phishing", "spam", "impersonation"],
+         },
+     },
+     {
+         "task_id": "T-002",
+         "difficulty": "medium",
+         "ambiguity_level": "medium",
+         "risk_level": "low",
+         "post_text": (
+             "LMAO I will literally SNIPE you the second the game starts, "
+             "no mercy, you're getting absolutely destroyed 💀🎮 "
+             "1v1 me and I'll end your whole career bro. Your team is DONE."
+         ),
+         "image_description": (
+             "Call of Duty: Warzone — squad wipe in progress. "
+             "Kill notifications, sniper rifle, enemy team eliminated. Classic gaming screenshot."
+         ),
+         "tool_responses": {
+             "read_comments": (
+                 "47 comments: 'GG bro!', 'You always trash talk before CoD 😂', "
+                 "'rematch tonight??', 'your aim is mid bro'. "
+                 "All mutual gaming friends. Clear competitive banter context."
+             ),
+             "check_user_history": (
+                 "Account 3 years active. 500+ posts, 90% gaming. "
+                 "Zero harassment reports. Regular CoD tournament player."
+             ),
+             "check_entity_status": "Private individual. Standard gaming community member.",
+             "view_policy": (
+                 "Policy 7.1 — Harassment: Hyperbolic language in established gaming communities "
+                 "is NOT harassment. 'Snipe', 'destroy', 'end your career' are non-literal expressions."
+             ),
+         },
+         "required_tools": ["read_comments"],
+         "ground_truth": "ALLOW",
+         "ground_truth_signals": {
+             "target": "individual",
+             "is_protected_class": False,
+             "toxicity_level": 0.30,
+             "is_direct_attack": False,
+             "context_type": "meme",
+             "intent": "ambiguous",
+             "abusive_language_present": False,
+             "content_flags": ["competitive_language"],
+         },
+     },
+     {
+         "task_id": "T-003",
+         "difficulty": "hard",
+         "ambiguity_level": "high",
+         "risk_level": "high",
+         "post_text": (
+             "Sharing this political cartoon I drew. "
+             "Mayor Davidson cutting the homeless shelter budget while sitting on a throne of gold coins. "
+             "Caption: 'Priorities.' #LocalPolitics #HoldThemAccountable #SatiricalArt #MayorDavidson"
+         ),
+         "image_description": (
+             "Hand-drawn political cartoon. Exaggerated caricature of a public official on a coin throne, "
+             "homeless families pushed away. Deliberately satirical editorial style. Artist signature visible."
+         ),
+         "tool_responses": {
+             "read_comments": (
+                 "23 comments: 'Powerful commentary', 'Free speech!', 'Great art', 'Defamation!'. "
+                 "No threats or coordinated harassment detected."
+             ),
+             "check_user_history": (
+                 "Account 5 years. Verified local journalist and artist. "
+                 "Regular political commentary. No violations. Posts cited in local news."
+             ),
+             "check_entity_status": (
+                 "Mayor Davidson: CONFIRMED PUBLIC FIGURE. "
+                 "Elected Mayor of Springfield. Active public interest in recent social services budget cuts. "
+                 "Multiple credible news articles confirm the cuts are real and newsworthy."
+             ),
+             "view_policy": (
+                 "Policy 12.4 — Political Satire: Editorial cartoons targeting PUBLIC OFFICIALS "
+                 "on PUBLIC INTEREST matters are PROTECTED. Caricature is a recognized technique. "
+                 "Not realistic impersonation. Correct action: ALLOW_WITH_WARNING."
+             ),
+         },
+         "required_tools": ["check_entity_status", "view_policy"],
+         "ground_truth": "ALLOW_WITH_WARNING",
+         "ground_truth_signals": {
+             "target": "political",
+             "is_protected_class": False,
+             "toxicity_level": 0.20,
+             "is_direct_attack": False,
+             "context_type": "satire",
+             "intent": "satirical",
+             "abusive_language_present": False,
+             "content_flags": ["political_content", "public_figure"],
+         },
+     },
+     {
+         "task_id": "T-004",
+         "difficulty": "medium",
+         "ambiguity_level": "medium",
+         "risk_level": "high",
+         "post_text": (
+             "Sharing this 'educational' video about why [ethnic group] people "
+             "are genetically predisposed to criminal behaviour. "
+             "The science doesn't lie. Share before it gets deleted. 👀 "
+             "#HiddenTruths #CensoredFacts #JustAsking"
+         ),
+         "image_description": (
+             "Screenshot of a YouTube-style video thumbnail with a white lab coat figure "
+             "and a graph. Title: 'The Uncomfortable Biology of Crime Rates'. "
+             "Uploader has 0 verified credentials. Comments disabled."
+         ),
+         "tool_responses": {
+             "read_comments": (
+                 "Comments disabled by uploader. "
+                 "Zero engagement from credible academic or scientific accounts."
+             ),
+             "check_user_history": (
+                 "Account 6 months old. 48 posts — all promoting pseudoscientific race content. "
+                 "Previously received 2 policy warnings for hate content. "
+                 "No verified credentials. Frequently uses 'just asking questions' framing."
+             ),
+             "check_entity_status": (
+                 "Ethnic group mentioned: confirmed PROTECTED CLASS under platform policy. "
+                 "Claim about genetic criminality: DEBUNKED by mainstream biology and criminology. "
+                 "This framing is a known pseudoscience hate vector (scientific racism)."
+             ),
+             "view_policy": (
+                 "Policy 4.3 — Hate Speech: Content attributing criminal or negative traits "
+                 "to a protected class based on race or ethnicity = REMOVE. "
+                 "Disguising hate speech as 'educational content' or 'science' does not "
+                 "change the violation. Action: REMOVE."
+             ),
+         },
+         "required_tools": ["check_entity_status", "view_policy"],
+         "ground_truth": "REMOVE",
+         "ground_truth_signals": {
+             "target": "group",
+             "is_protected_class": True,
+             "toxicity_level": 0.85,
+             "is_direct_attack": True,
+             "context_type": "statement",
+             "intent": "harmful",
+             "abusive_language_present": False,
+             "content_flags": ["slur", "misinformation", "hate_speech_disguised_as_education"],
+         },
+     },
+     {
+         "task_id": "T-005",
+         "difficulty": "hard",
+         "ambiguity_level": "high",
+         "risk_level": "high",
+         "post_text": (
+             "The Prime Minister announced new economic reforms today. "
+             "Opposition leaders called it 'economic warfare on the poor.' "
+             "Protests erupted in 3 cities. Security forces used tear gas. "
+             "Government says the crackdown was 'necessary to restore order.' "
+             "At least 12 protesters arrested. #Politics #Protest #Reform"
+         ),
+         "image_description": (
+             "News-style post with a photo of a large protest crowd holding signs. "
+             "Tear gas clouds visible in background. "
+             "Caption: 'Thousands take to streets against new austerity measures.' "
+             "Shared by a verified news outlet handle. No graphic violence."
+         ),
+         "tool_responses": {
+             "read_comments": (
+                 "340 comments split sharply: "
+                 "'State violence against citizens!', 'Finally someone fixing the economy', "
+                 "'This is propaganda!', 'Biased reporting!', 'Necessary action'. "
+                 "High engagement, deep political polarization. No incitement to violence detected."
+             ),
+             "check_user_history": (
+                 "Posted by @NationalNewsNetwork — verified news organization. "
+                 "12 years active. Consistently neutral coverage. "
+                 "No policy violations. Registered press entity."
+             ),
+             "check_entity_status": (
+                 "Prime Minister: CONFIRMED PUBLIC FIGURE, head of government. "
+                 "Economic reforms: CONFIRMED real policy, actively debated in parliament. "
+                 "Protest: CONFIRMED real event, covered by 6+ independent news agencies. "
+                 "Political sensitivity: VERY HIGH — active governance dispute in a democracy."
+             ),
+             "view_policy": (
+                 "Policy 12.1 — News & Political Content: Factual reporting on real government "
+                 "actions, protests, and police responses by VERIFIED NEWS entities is PROTECTED. "
+                 "Even politically contentious factual reporting is protected expression. "
+                 "Suppressing legitimate news = censorship risk. "
+                 "Action: ALLOW_WITH_WARNING (politically sensitive content label)."
+             ),
+         },
+         "required_tools": ["check_user_history", "check_entity_status", "view_policy"],
+         "ground_truth": "ALLOW_WITH_WARNING",
+         "ground_truth_signals": {
+             "target": "political",
+             "is_protected_class": False,
+             "toxicity_level": 0.20,
+             "is_direct_attack": False,
+             "context_type": "news",
+             "intent": "descriptive",
+             "abusive_language_present": False,
+             "content_flags": ["political_content", "public_figure", "protest"],
+         },
+     },
+     {
+         "task_id": "T-006",
+         "difficulty": "hard",
+         "ambiguity_level": "high",
+         "risk_level": "high",
+         "post_text": (
+             "Sending Eid Mubarak wishes to my community 🌙✨ "
+             "During this holy month let us remember those suffering in conflict zones. "
+             "Our brothers and sisters deserve peace and dignity. "
+             "May Allah guide us all. #EidMubarak #Ramadan #Peace #Solidarity"
+         ),
+         "image_description": (
+             "Warm celebratory post with a crescent moon and stars graphic. "
+             "Soft gold and green colors. "
+             "Comment section shows replies in Arabic, Urdu, and English. "
+             "Emojis: praying hands, crescent moon, dove."
+         ),
+         "tool_responses": {
+             "read_comments": (
+                 "89 comments: 'Eid Mubarak!', 'JazakAllah Khair', "
+                 "'prayers for peace', 'beautiful message', "
+                 "1 comment flagged by automated system: 'why is this allowed, "
+                 "these people support terrorism' — that comment is from a third party, "
+                 "not the original poster. "
+                 "No incitement, no threats, no extremist content in original post."
+             ),
+             "check_user_history": (
+                 "Account 4 years active. 200+ posts — community announcements, "
+                 "religious holiday greetings, charity drives. "
+                 "Zero prior violations. Community leader in local Muslim cultural centre."
+             ),
+             "check_entity_status": (
+                 "Islam / Muslim community: CONFIRMED PROTECTED CLASS (religion). "
+                 "Eid al-Fitr: globally recognized Islamic religious holiday. "
+                 "Content: religious greeting + humanitarian solidarity message. "
+                 "No extremist symbols, no prohibited organisations mentioned."
+             ),
+             "view_policy": (
+                 "Policy 8.1 — Religious Expression: Greetings and solidarity messages "
+                 "for recognized religious holidays are PROTECTED. "
+                 "Vague 'conflict zone' solidarity without naming a prohibited group "
+                 "is NOT a violation. "
+                 "Automated flag was triggered by a COMMENT, not the original post. "
+                 "Action: ALLOW — original content is clearly safe religious expression."
+             ),
+         },
+         "required_tools": ["read_comments", "check_entity_status", "view_policy"],
+         "ground_truth": "ALLOW",
+         "ground_truth_signals": {
+             "target": "religion",
+             "is_protected_class": True,
+             "toxicity_level": 0.05,
+             "is_direct_attack": False,
+             "context_type": "statement",
+             "intent": "educational",
+             "abusive_language_present": False,
+             "content_flags": ["political_content"],
+         },
+     },
+ ]
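The task records in `tasks.py` share a fixed schema, so a small consistency check can catch authoring mistakes before training (an unknown ground-truth label, a required tool with no canned response, an out-of-range toxicity value). A sketch under the schema shown above; `validate_task` is a hypothetical helper, not part of the repo:

```python
VALID_DECISIONS = {"REMOVE", "ALLOW", "ALLOW_WITH_WARNING", "ESCALATE"}

def validate_task(task):
    # Collect schema violations instead of raising, so all problems
    # in a task file can be reported in one pass.
    errors = []
    if task["ground_truth"] not in VALID_DECISIONS:
        errors.append(f"bad ground_truth: {task['ground_truth']!r}")
    for tool in task["required_tools"]:
        if tool not in task["tool_responses"]:
            errors.append(f"required tool {tool!r} has no tool_response")
    tox = task["ground_truth_signals"]["toxicity_level"]
    if not 0.0 <= tox <= 1.0:
        errors.append(f"toxicity_level out of range: {tox}")
    return errors

# Hypothetical minimal task record for illustration:
task = {
    "task_id": "T-999",
    "ground_truth": "REMOVE",
    "required_tools": ["view_policy"],
    "tool_responses": {"view_policy": "Policy 1.0 — example."},
    "ground_truth_signals": {"toxicity_level": 0.9},
}
print(validate_task(task))  # → []
```

Running `validate_task` over `TASKS` at import time would make a bad entry fail fast rather than silently skew rewards.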
train.py ADDED
@@ -0,0 +1,222 @@
+ from __future__ import annotations
+
+ import json
+ import random
+ import re
+
+ from datasets import Dataset
+ from trl import GRPOConfig, GRPOTrainer
+ from trl.trainer.grpo_trainer import generate_rollout_completions
+
+ from client import TrustSafetyEnv
+ from models import TrustAction, ContentSignals
+
+ # ── Config ────────────────────────────────────────────────────────────────────
+ MODEL_NAME = "Qwen/Qwen3-1.7B"
+ ENV_BASE_URL = "https://YOUR-USERNAME-trust-safety-env.hf.space"  # ← your HF Space URL
+ MAX_STEPS = 7
+ TASKS = ["T-001", "T-002", "T-003", "T-004", "T-005", "T-006"]
+
+ SYSTEM_PROMPT = """You are a Trust & Safety content moderation agent.
+
+ MANDATORY 3-STEP PROCESS — follow this EXACT order every time:
+
+ STEP 1 — USE TOOLS (always before anything else)
+ Call investigation tools to gather evidence before deciding:
+ - read_comments → check comment context
+ - check_user_history → check prior violations
+ - check_entity_status → verify public figure / org status
+ - view_policy → retrieve relevant platform policy
+ Respond with JSON: {"action_type": "use_tool", "tool_name": "<tool>"}
+
+ STEP 2 — EXTRACT SIGNALS
+ After gathering evidence, respond with JSON:
+ {"action_type": "extract_signals", "signals": {
+   "target": "<none|individual|group|political|religion>",
+   "is_protected_class": <true|false>,
+   "toxicity_level": <0.0-1.0>,
+   "is_direct_attack": <true|false>,
+   "context_type": "<spam|meme|satire|news|education|statement|protest>",
+   "intent": "<harmful|satirical|descriptive|educational|ambiguous>",
+   "confidence": <0.0-1.0>,
+   "abusive_language_present": <true|false>,
+   "content_flags": ["<spam|phishing|hate_speech|harassment|...>"]
+ }}
+
+ STEP 3 — FINAL DECISION
+ Respond with JSON: {"action_type": "final_decision", "final_decision": "<REMOVE|ALLOW|ALLOW_WITH_WARNING|ESCALATE>"}
+
+ ⚠️ Skipping tools costs -0.25 reward per missed required tool.
+ ⚠️ Always output valid JSON. Nothing else."""
+
+
+ # ── Dataset — one row per task, shuffled once at build time ──────────────────
+ def make_dataset() -> Dataset:
+     rows = [{"task_id": tid, "prompt": f"Process moderation ticket {tid}"} for tid in TASKS]
+     random.shuffle(rows)
+     return Dataset.from_list(rows * 50)  # 300 rows ≈ 50 passes over all 6 tasks
+
+
+ # ── Action parser — extract JSON action from model output ────────────────────
+ def parse_action(text: str) -> TrustAction | None:
+     try:
+         match = re.search(r'\{.*\}', text, re.DOTALL)
+         if not match:
+             return None
+         d = json.loads(match.group())
+         atype = d.get("action_type", "")
+         if atype == "use_tool":
+             return TrustAction(action_type="use_tool", tool_name=d.get("tool_name"))
+         if atype == "extract_signals":
+             raw = d.get("signals", {})
+             flags = raw.get("content_flags", [])
+             if not isinstance(flags, list):
+                 flags = []
+             signals = ContentSignals(
+                 target=raw.get("target", "none"),
+                 is_protected_class=bool(raw.get("is_protected_class", False)),
+                 toxicity_level=float(raw.get("toxicity_level", 0.5)),
+                 is_direct_attack=bool(raw.get("is_direct_attack", False)),
+                 context_type=raw.get("context_type", "statement"),
+                 intent=raw.get("intent", "ambiguous"),
+                 confidence=float(raw.get("confidence", 0.5)),
+                 abusive_language_present=bool(raw.get("abusive_language_present", False)),
+                 content_flags=flags,
+             )
+             return TrustAction(action_type="extract_signals", signals=signals)
+         if atype == "final_decision":
+             return TrustAction(action_type="final_decision",
+                                final_decision=d.get("final_decision", "ESCALATE"))
+     except Exception:
+         pass
+     return None
+
+
+ # ── Rollout — one full episode per prompt ────────────────────────────────────
+ def rollout_once(trainer, env, tokenizer, task_id: str):
+     result = env.reset(episode_id=task_id)
+     obs = result.observation
+     ep_reward = 0.0
+
+     all_prompt_ids = []
+     all_completion_ids = []
+     all_logprobs = []
+     reward_components = {"accuracy": 0.0, "tools": 0.0, "signals": 0.0}
+
+     for step in range(MAX_STEPS):
+         if result.done:
+             break
+
+         # Build context from current observation
+         context_parts = [f"TICKET: {obs.ticket_id}\n{obs.post_text}"]
+         if obs.comments_found: context_parts.append(f"Comments: {obs.comments_found}")
+         if obs.user_history_found: context_parts.append(f"User History: {obs.user_history_found}")
+         if obs.entity_status_found: context_parts.append(f"Entity: {obs.entity_status_found}")
+         if obs.policy_found: context_parts.append(f"Policy: {obs.policy_found}")
+         if obs.extracted_signals: context_parts.append(f"Signals: {obs.extracted_signals}")
+
+         messages = [
+             {"role": "system", "content": SYSTEM_PROMPT},
+             {"role": "user", "content": "\n\n".join(context_parts)},
+         ]
+
+         rollout = generate_rollout_completions(trainer, [messages])
+         text = rollout["text"][0] if rollout["text"] else ""
+
+         action = parse_action(text)
+         if action is None:
+             action = TrustAction(action_type="final_decision", final_decision="ESCALATE")
+
+         result = env.step(action)
+         obs = result.observation
+
+         if result.reward is not None:
+             ep_reward = result.reward
+
+         all_prompt_ids.extend(rollout.get("prompt_ids", []))
+         all_completion_ids.extend(rollout.get("completion_ids", []))
+         all_logprobs.extend(rollout.get("logprobs", []))
+
+     # Reward components for GRPO shaping signal (read from the final observation)
+     info = obs.info or {}
+     breakdown = info.get("reward_breakdown", {})
+     reward_components["accuracy"] = breakdown.get("base_score", ep_reward)
+     reward_components["tools"] = -breakdown.get("tool_miss_penalty", 0.0)
+     reward_components["signals"] = breakdown.get("signal_accuracy_bonus", 0.0)
+
+     return {
+         "prompt_ids": all_prompt_ids,
+         "completion_ids": all_completion_ids,
+         "logprobs": all_logprobs,
+         "accuracy_reward": reward_components["accuracy"],
+         "tool_reward": reward_components["tools"],
+         "signal_reward": reward_components["signals"],
+     }
+
+
+ # ── Rollout wrapper — called by GRPOTrainer ───────────────────────────────────
+ def rollout_func(trainer, prompts, **kwargs):
+     env = TrustSafetyEnv(base_url=ENV_BASE_URL).sync()
+     results = []
+     for p in prompts:
+         task_id = p.get("task_id", random.choice(TASKS))
+         results.append(rollout_once(trainer, env, None, task_id))
+     env.close()
+     return results
+
+
+ # ── Reward functions (called by GRPOTrainer separately) ──────────────────────
+ def reward_accuracy(completions, **kwargs):
+     """Base decision correctness from environment."""
+     return [r.get("accuracy_reward", 0.0) for r in completions]
+
+ def reward_tools(completions, **kwargs):
+     """Penalise skipping required tools."""
+     return [r.get("tool_reward", 0.0) for r in completions]
+
+ def reward_signals(completions, **kwargs):
+     """Bonus for accurate signal extraction."""
+     return [r.get("signal_reward", 0.0) for r in completions]
+
+
+ # ── GRPOConfig ────────────────────────────────────────────────────────────────
+ grpo_config = GRPOConfig(
+     output_dir="./trust-safety-grpo",
+     num_train_epochs=1,
+     learning_rate=5e-6,
+     gradient_accumulation_steps=64,
+     per_device_train_batch_size=1,
+     num_generations=2,
+     max_completion_length=256,  # JSON actions are longer than Wordle guesses
+     max_prompt_length=1024,
+     use_vllm=True,
+     vllm_mode="colocate",
+     vllm_gpu_memory_utilization=0.15,
+     gradient_checkpointing=True,
+     report_to="trackio",
+     logging_steps=5,
+     save_steps=50,
+ )
+
+
+ # ── Main ──────────────────────────────────────────────────────────────────────
+ if __name__ == "__main__":
+     dataset = make_dataset()
+
+     trainer = GRPOTrainer(
+         model=MODEL_NAME,
+         reward_funcs=[reward_accuracy, reward_tools, reward_signals],
+         rollout_func=rollout_func,
+         train_dataset=dataset,
+         args=grpo_config,
+     )
+
+     print("=" * 60)
+     print(f"Training {MODEL_NAME} on Trust & Safety environment")
+     print(f"Tasks : {TASKS}")
+     print(f"Epochs: {grpo_config.num_train_epochs}")
+     print("=" * 60)
+
+     trainer.train()
+     trainer.save_model("./trust-safety-grpo-final")
+     print("✅ Model saved to ./trust-safety-grpo-final")
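The training loop depends on one brittle point: if the model's output contains no parseable JSON object, `parse_action` returns `None` and the rollout substitutes a forced ESCALATE. A minimal standalone sketch of just that extraction-and-fallback logic (the regex and fallback mirror `parse_action` above; the helper name `extract_action_type` is hypothetical):

```python
import json
import re

def extract_action_type(text):
    # Grab the first {...} span in the completion and read action_type.
    # Anything missing or unparseable yields None, which the rollout
    # converts into a final ESCALATE decision.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group()).get("action_type")
    except json.JSONDecodeError:
        return None

print(extract_action_type('Sure! {"action_type": "use_tool", "tool_name": "view_policy"}'))
# → use_tool
print(extract_action_type("no json here"))  # → None
```

Note the greedy `.*` with `re.DOTALL` spans from the first `{` to the last `}`, so a completion containing two JSON objects will fail to parse and fall back to ESCALATE; that conservative failure mode is what the loop relies on.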
your_environment.py ADDED
@@ -0,0 +1,440 @@
1
+ from __future__ import annotations
2
+
3
+ import random
4
+ import uuid
5
+ from typing import Optional, Dict, Any, Set
6
+
7
+ try:
8
+ from openenv_core.env_server import Environment
9
+ print("[env] Inheriting from openenv_core.env_server.Environment ✅")
10
+ except ImportError:
11
+ try:
12
+ from openenv.core.env_server import Environment
13
+ print("[env] Inheriting from openenv.core.env_server.Environment ✅")
14
+ except ImportError:
15
+ Environment = object
16
+ print("[env] openenv_core not found — using plain object base ⚠️")
17
+
18
+ from models import TrustObservation, TrustAction, TrustState, ContentSignals
19
+ from tasks import TASKS
20
+
21
+
22
+ TOOL_COSTS: Dict[str, float] = {
23
+ "read_comments": 0.05,
24
+ "check_user_history": 0.05,
25
+ "check_entity_status": 0.10,
26
+ "view_policy": 0.10,
27
+ }
28
+
29
+ MAX_STEPS = 7
30
+
31
+ DECISION_MATRIX: Dict[tuple, float] = {
32
+ ("REMOVE", "REMOVE"): 1.00,
33
+ ("ALLOW", "ALLOW"): 1.00,
34
+ ("ALLOW_WITH_WARNING", "ALLOW_WITH_WARNING"): 1.00,
35
+ ("ESCALATE", "ESCALATE"): 1.00,
36
+ ("ALLOW_WITH_WARNING", "ALLOW"): 0.75,
37
+ ("ALLOW", "ALLOW_WITH_WARNING"): 0.55,
38
+ ("ESCALATE", "ALLOW_WITH_WARNING"): 0.65,
39
+ ("ESCALATE", "ALLOW"): 0.45,
40
+ ("ESCALATE", "REMOVE"): 0.45,
41
+ ("REMOVE", "ALLOW"): 0.10,
42
+ ("REMOVE", "ALLOW_WITH_WARNING"): 0.20,
43
+ ("ALLOW", "REMOVE"): 0.00,
44
+ ("ALLOW_WITH_WARNING", "REMOVE"): 0.15,
45
+ }
46
+
47
+
48
+ class TrustSafetyEnvironment(Environment):
49
+ """
50
+ 3-Layer Risk-Aware Trust & Safety RL Environment.
51
+
52
+ Layer 1 — Evidence gathering : agent uses investigation tools (optional)
53
+ Layer 2 — Signal extraction : agent outputs ContentSignals as feature extractor
54
+ Layer 3 — Policy engine : validates signals, applies rules, computes reward
55
+
56
+ 8-Component Reward: Accuracy · Policy Alignment · Signal Quality · Escalation
57
+ Tool Usage · Consistency · Risk Sensitivity · Confidence
58
+ """
59
+
60
+ def __init__(self, seed: int = 42) -> None:
61
+ super().__init__()
62
+ self._rng = random.Random(seed)
63
+ self._current_task: Optional[Dict[str, Any]] = None
64
+ self._tools_used: Set[str] = set()
65
+ self._step_count: int = 0
66
+ self._extracted_signals: Optional[ContentSignals] = None
67
+ self._validation_result: Optional[Dict[str, Any]] = None
68
+ self._signals_extracted: bool = False
69
+ self._obs: Optional[TrustObservation]= None
70
+ self._state = TrustState()
71
+
72
+ # ✅ FIX 3 — build a dict keyed by task_id for O(1) lookup
73
+ self._tasks: Dict[str, Dict[str, Any]] = {
74
+ t["task_id"]: t for t in TASKS
75
+ }
76
+
77
+ # -----------------------------------------------------------------------
78
+ # OpenEnv interface
79
+ # -----------------------------------------------------------------------
80
+
81
+ def reset(self, seed=None, episode_id=None, **kwargs) -> TrustObservation:
82
+ # ✅ FIX 1 — reset() is now correctly INSIDE the class
83
+ if seed is not None:
84
+ self._rng.seed(seed)
85
+
86
+ # Pick task by episode_id if provided, else random from all 6
87
+ if episode_id and episode_id in self._tasks:
88
+ task = self._tasks[episode_id]
89
+ else:
90
+ task = self._rng.choice(list(self._tasks.values()))
91
+
92
+ self._current_task = task
93
+ self._tools_used = set()
94
+ self._step_count = 0
95
+ self._extracted_signals = None
96
+ self._validation_result = None
97
+ self._signals_extracted = False
98
+
99
+ self._state = TrustState(
100
+ episode_id=task["task_id"],
101
+ step_count=0,
102
+ current_task_id=task["task_id"],
103
+ difficulty=task.get("difficulty", "medium"),
104
+ risk_level=task.get("risk_level", "medium"),
105
+ is_done=False,
106
+ tools_used=[],
107
+ signals_extracted=False,
108
+ )
109
+
110
+ self._obs = TrustObservation(
111
+ ticket_id=task["task_id"],
112
+ post_text=task["post_text"],
113
+ image_description=task.get("image_description", ""),
114
+ step_number=0,
115
+ done=False,
116
+ )
117
+ return self._obs # ✅ FIX 2 — single clean return, stray return removed
118
+
119
+ def step(self, action: TrustAction, timeouts: Optional[Any] = None,
120
+ **kwargs) -> TrustObservation:
121
+ if self._current_task is None or self._obs is None:
122
+ raise RuntimeError("Call reset() before step().")
123
+
124
+ if self._step_count >= MAX_STEPS:
125
+ self._obs = TrustObservation(
126
+ ticket_id=self._current_task["task_id"],
127
+ post_text=self._obs.post_text,
128
+ image_description=self._obs.image_description,
129
+ step_number=self._step_count,
130
+ done=True,
131
+ reward=0.0,
132
+ info={"reason": "timeout", "tools_used": list(self._tools_used)},
133
+ )
134
+ return self._obs
135
+
136
+ atype = action.action_type
137
+ if atype == "use_tool":
138
+ return self._handle_tool(action)
139
+ if atype == "extract_signals":
140
+ return self._handle_signal_extraction(action)
141
+ if atype == "final_decision":
142
+ return self._handle_final_decision(action)
143
+ raise ValueError(f"Unknown action_type: {atype!r}")
144
+
145
+ @property
146
+ def state(self) -> TrustState:
147
+ return self._state
148
+
149
+ # -----------------------------------------------------------------------
150
+ # Layer 1 — Tool handling
151
+ # -----------------------------------------------------------------------
152
+
153
+ def _handle_tool(self, action: TrustAction) -> TrustObservation:
154
+ tool = action.tool_name
155
+ if tool not in TOOL_COSTS:
156
+ raise ValueError(f"Unknown tool: {tool!r}")
157
+ self._tools_used.add(tool)
158
+ response = self._current_task["tool_responses"].get(tool, "No data found.")
159
+ field_map = {
160
+ "read_comments": "comments_found",
161
+ "check_user_history": "user_history_found",
162
+ "check_entity_status": "entity_status_found",
163
+ "view_policy": "policy_found",
164
+ }
165
+ self._step_count += 1
166
+ self._state.step_count = self._step_count
167
+ self._state.tools_used = list(self._tools_used)
168
+
169
+ obs_kwargs = {
170
+ k: getattr(self._obs, k)
171
+ for k in ("ticket_id", "post_text", "image_description",
172
+ "comments_found", "user_history_found",
173
+ "entity_status_found", "policy_found",
174
+ "extracted_signals", "validation_result")
175
+ }
176
+ obs_kwargs[field_map[tool]] = response
177
+ obs_kwargs["step_number"] = self._step_count
178
+ obs_kwargs["done"] = False
179
+ obs_kwargs["reward"] = None
180
+
181
+ self._obs = TrustObservation(**obs_kwargs)
182
+ return self._obs
183
+
+    # -----------------------------------------------------------------------
+    # Layer 2 — Signal extraction + validation
+    # -----------------------------------------------------------------------
+
+    def _handle_signal_extraction(self, action: TrustAction) -> TrustObservation:
+        raw = action.signals
+        raw.toxicity_level = max(0.0, min(1.0, float(raw.toxicity_level)))
+        raw.confidence = max(0.0, min(1.0, float(raw.confidence)))
+        if not isinstance(raw.content_flags, list):
+            raw.content_flags = []
+
+        self._extracted_signals = raw
+        self._signals_extracted = True
+        self._validation_result = self._validate_signals(raw)
+        self._step_count += 1
+        self._state.step_count = self._step_count
+        self._state.signals_extracted = True
+
+        obs_kwargs = {
+            k: getattr(self._obs, k)
+            for k in ("ticket_id", "post_text", "image_description",
+                      "comments_found", "user_history_found",
+                      "entity_status_found", "policy_found")
+        }
+        obs_kwargs["extracted_signals"] = {
+            "target": raw.target,
+            "is_protected_class": raw.is_protected_class,
+            "toxicity_level": raw.toxicity_level,
+            "is_direct_attack": raw.is_direct_attack,
+            "context_type": raw.context_type,
+            "intent": raw.intent,
+            "confidence": raw.confidence,
+            "abusive_language_present": raw.abusive_language_present,
+            "content_flags": raw.content_flags,
+        }
+        obs_kwargs["validation_result"] = self._validation_result
+        obs_kwargs["step_number"] = self._step_count
+        obs_kwargs["done"] = False
+        obs_kwargs["reward"] = None
+
+        self._obs = TrustObservation(**obs_kwargs)
+        return self._obs
226
+
+    def _validate_signals(self, s: ContentSignals) -> Dict[str, Any]:
+        issues = []
+        conf = s.confidence
+
+        if not s.abusive_language_present and s.toxicity_level > 0.75:
+            issues.append("high_toxicity_without_abusive_language"); conf -= 0.15
+        if s.context_type in ("satire", "education") and s.intent == "harmful":
+            issues.append("harmful_intent_contradicts_satire_context"); conf -= 0.12
+        if s.is_protected_class and s.target == "none":
+            issues.append("protected_class_flagged_without_target"); conf -= 0.10
+        if s.is_direct_attack and s.toxicity_level < 0.25:
+            issues.append("direct_attack_with_low_toxicity"); conf -= 0.12
+        if s.context_type == "spam" and "spam" not in s.content_flags \
+                and "phishing" not in s.content_flags:
+            issues.append("spam_context_missing_spam_flag"); conf -= 0.08
+
+        adj = round(max(0.0, min(1.0, conf)), 3)
+        return {
+            "original_confidence": s.confidence,
+            "adjusted_confidence": adj,
+            "consistency_issues": issues,
+            "issue_count": len(issues),
+            "is_consistent": len(issues) == 0,
+            "needs_escalation_hint": adj < 0.55 or len(issues) >= 2,
+        }
252
+
+    # -----------------------------------------------------------------------
+    # Layer 3 — Policy engine + final decision
+    # -----------------------------------------------------------------------
+
+    def _policy_engine(self) -> Dict[str, Any]:
+        if not self._extracted_signals or not self._validation_result:
+            return {"recommended": "ESCALATE", "reason": "no_signals_extracted",
+                    "rule_strength": "weak"}
+
+        s = self._extracted_signals
+        adj = self._validation_result["adjusted_confidence"]
+
+        if adj < 0.50:
+            return {"recommended": "ESCALATE", "reason": "low_adjusted_confidence",
+                    "rule_strength": "medium"}
+        if "spam" in s.content_flags or "phishing" in s.content_flags:
+            return {"recommended": "REMOVE", "reason": "spam_or_phishing_detected",
+                    "rule_strength": "strong"}
+        if s.is_protected_class and s.is_direct_attack \
+                and s.toxicity_level > 0.70 and adj > 0.70:
+            return {"recommended": "REMOVE",
+                    "reason": "protected_class_direct_attack_high_toxicity",
+                    "rule_strength": "strong"}
+        if s.toxicity_level > 0.85 and s.intent == "harmful":
+            return {"recommended": "REMOVE", "reason": "high_toxicity_harmful_intent",
+                    "rule_strength": "strong"}
+        if s.context_type in ("satire", "news", "education", "protest") \
+                and s.intent in ("satirical", "descriptive", "educational"):
+            return {"recommended": "ALLOW_WITH_WARNING",
+                    "reason": "protected_expression_context",
+                    "rule_strength": "medium"}
+        if 0.40 < s.toxicity_level < 0.75 and s.intent == "ambiguous":
+            return {"recommended": "ESCALATE", "reason": "ambiguous_moderate_toxicity",
+                    "rule_strength": "medium"}
+        return {"recommended": "ALLOW", "reason": "no_policy_violation_detected",
+                "rule_strength": "medium"}
+
290
+    def _handle_final_decision(self, action: TrustAction) -> TrustObservation:
+        decision = action.final_decision
+        components = self._compute_components(decision)
+        policy_rec = components.pop("_policy_rec")
+        reward = self._finalize_reward(components)
+
+        self._step_count += 1
+        self._state.step_count = self._step_count
+        self._state.is_done = True
+        components["final_reward"] = reward
+
+        obs_kwargs = {
+            k: getattr(self._obs, k)
+            for k in ("ticket_id", "post_text", "image_description",
+                      "comments_found", "user_history_found",
+                      "entity_status_found", "policy_found",
+                      "extracted_signals", "validation_result")
+        }
+        obs_kwargs["step_number"] = self._step_count
+        obs_kwargs["done"] = True
+        obs_kwargs["reward"] = reward
+        obs_kwargs["info"] = {
+            "final_decision": decision,
+            "ground_truth": self._current_task["ground_truth"],
+            "policy_recommendation": policy_rec,
+            "signals_extracted": self._signals_extracted,
+            "tools_used": list(self._tools_used),
+            "required_tools": self._current_task["required_tools"],
+            "ambiguity_level": self._current_task["ambiguity_level"],
+            "risk_level": self._current_task["risk_level"],
+            "task_id": self._current_task["task_id"],
+            "reward_breakdown": components,
+        }
+
+        self._obs = TrustObservation(**obs_kwargs)
+        return self._obs
326
+
+    # -----------------------------------------------------------------------
+    # 8-Component Reward Engine
+    # -----------------------------------------------------------------------
+
+    def _compute_components(self, final_decision: str) -> Dict[str, Any]:
+        gt = self._current_task["ground_truth"]
+        required_tools = self._current_task["required_tools"]
+        ambiguity = self._current_task["ambiguity_level"]
+        risk_level = self._current_task["risk_level"]
+        policy_rec = self._policy_engine()
+
+        base_score = DECISION_MATRIX.get((final_decision, gt), 0.20)
+        if final_decision == "ESCALATE" and ambiguity == "high":
+            base_score = max(base_score, 0.70)
+        is_correct = base_score >= 0.90
+
+        rule_weight = {"strong": 1.0, "medium": 0.70, "weak": 0.40}.get(
+            policy_rec.get("rule_strength", "medium"), 0.70)
+        policy_alignment = round(
+            (+0.12 if final_decision == policy_rec["recommended"] else -0.18) * rule_weight, 4)
+
+        signal_accuracy_bonus = self._compute_signal_accuracy()
+
+        adj_conf = (self._validation_result["adjusted_confidence"]
+                    if self._validation_result else 0.50)
+        should_escalate = adj_conf < 0.50
+        if should_escalate and final_decision == "ESCALATE":
+            escalation_adj = +0.15
+        elif should_escalate and final_decision != "ESCALATE":
+            escalation_adj = -0.18
+        elif not should_escalate and final_decision == "ESCALATE" and ambiguity == "low":
+            escalation_adj = -0.20
+        elif not should_escalate and final_decision == "ESCALATE":
+            escalation_adj = -0.10
+        else:
+            escalation_adj = 0.0
+
+        signal_bonus = +0.05 if self._signals_extracted else -0.10
+        tool_cost = round(sum(TOOL_COSTS.get(t, 0.0) for t in self._tools_used), 4)
+        missing_required = set(required_tools) - self._tools_used
+        tool_miss_penalty = round(len(missing_required) * 0.25, 4)
+
+        if self._validation_result:
+            n = self._validation_result["issue_count"]
+            validation_penalty = {0: 0.00, 1: 0.05, 2: 0.12}.get(n, 0.20)
+        else:
+            validation_penalty = 0.12
+
+        risk_penalty = 0.0
+        if not is_correct:
+            risk_penalty = {"high": 0.20, "medium": 0.10, "low": 0.0}.get(risk_level, 0.0)
+
+        if base_score < 0.50 and adj_conf > 0.80:
+            confidence_penalty = 0.22
+        elif base_score < 0.50 and adj_conf > 0.65:
+            confidence_penalty = 0.12
+        elif self._signals_extracted and final_decision == "ESCALATE" and adj_conf < 0.55:
+            confidence_penalty = -0.10
+        else:
+            confidence_penalty = 0.0
+
+        return {
+            "base_score": base_score,
+            "policy_alignment": policy_alignment,
+            "signal_accuracy_bonus": signal_accuracy_bonus,
+            "escalation_adj": escalation_adj,
+            "signal_bonus": signal_bonus,
+            "tool_cost": tool_cost,
+            "tool_miss_penalty": tool_miss_penalty,
+            "validation_penalty": validation_penalty,
+            "risk_penalty": risk_penalty,
+            "confidence_penalty": confidence_penalty,
+            "_policy_rec": policy_rec,
+        }
401
+
+    def _finalize_reward(self, components: Dict[str, Any]) -> float:
+        raw = (
+            components["base_score"]
+            + components["policy_alignment"]
+            + components["signal_accuracy_bonus"]
+            + components["escalation_adj"]
+            + components["signal_bonus"]
+            - components["tool_cost"]
+            - components["tool_miss_penalty"]
+            - components["validation_penalty"]
+            - components["risk_penalty"]
+            - components["confidence_penalty"]
+        )
+        return round(max(0.0, min(1.0, raw)), 4)
416
+
+    def _compute_signal_accuracy(self) -> float:
+        if not self._extracted_signals:
+            return 0.0
+        gt = self._current_task.get("ground_truth_signals", {})
+        if not gt:
+            return 0.05
+
+        s = self._extracted_signals
+        score = 0.0
+        if s.target == gt.get("target"): score += 0.20
+        if s.intent == gt.get("intent"): score += 0.20
+        if s.context_type == gt.get("context_type"): score += 0.20
+
+        tox_diff = abs(s.toxicity_level - gt.get("toxicity_level", 0.5))
+        score += 0.20 if tox_diff <= 0.20 else (0.10 if tox_diff <= 0.35 else 0.0)
+
+        gt_flags = set(gt.get("content_flags", []))
+        s_flags = set(s.content_flags)
+        if gt_flags:
+            score += 0.20 * min(1.0, len(gt_flags & s_flags) / len(gt_flags))
+        else:
+            score += 0.20 if not s_flags else 0.10
+
+        return round(score * 0.15, 4)
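
For reference, the reward engine above is purely additive: five bonus components minus five penalty components, clamped to [0, 1]. A minimal standalone sketch of that sum-then-clamp (component names copied from the breakdown dict; the example values are hypothetical, not taken from any task):

```python
# Sum-then-clamp reward shaping, mirroring _finalize_reward above.
BONUSES = ("base_score", "policy_alignment", "signal_accuracy_bonus",
           "escalation_adj", "signal_bonus")
PENALTIES = ("tool_cost", "tool_miss_penalty", "validation_penalty",
             "risk_penalty", "confidence_penalty")

def finalize_reward(components: dict) -> float:
    # Missing components default to 0.0, so partial breakdowns still score.
    raw = sum(components.get(k, 0.0) for k in BONUSES) \
        - sum(components.get(k, 0.0) for k in PENALTIES)
    return round(max(0.0, min(1.0, raw)), 4)

# Hypothetical breakdown: correct decision, aligned with a strong policy rule.
example = {"base_score": 0.95, "policy_alignment": 0.12,
           "signal_accuracy_bonus": 0.09, "escalation_adj": 0.0,
           "signal_bonus": 0.05, "tool_cost": 0.06,
           "tool_miss_penalty": 0.0, "validation_penalty": 0.05,
           "risk_penalty": 0.0, "confidence_penalty": 0.0}
print(finalize_reward(example))  # → 1.0 (raw 1.10 is clamped to the ceiling)
```

Because of the clamp, stacked bonuses cannot push the reward above 1.0, and heavy penalties on a wrong decision bottom out at 0.0 rather than going negative.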