Spaces:
Running
Running
Upload folder using huggingface_hub
Browse files- DEMO_SCRIPT.md +21 -12
- Dockerfile +5 -4
- README.md +282 -181
- __init__.py +32 -16
- analyzers/ds_analyzer.py +16 -14
- analyzers/dsa_analyzer.py +8 -7
- analyzers/ml_analyzer.py +16 -14
- analyzers/web_analyzer.py +8 -7
- api/main.py +6 -6
- app/env/__init__.py +3 -3
- app/env/runner.py +55 -102
- app/examples.py +31 -31
- app/models/inference.py +21 -8
- app/services/openai_service.py +6 -2
- app/streamlit_app.py +135 -74
- app/utils/runtime.py +31 -20
- client.py +17 -10
- graders/bug_fix.py +37 -9
- graders/optimization.py +38 -13
- graders/shared.py +104 -23
- graders/syntax.py +37 -9
- models/pytorch_model.py +227 -149
- openenv_python_code_review_env.egg-info/PKG-INFO +34 -25
- openenv_python_code_review_env.egg-info/SOURCES.txt +12 -0
- openenv_python_code_review_env.egg-info/requires.txt +0 -1
- openenv_python_code_review_env.egg-info/top_level.txt +1 -14
- pyproject.toml +22 -6
- schemas/request.py +51 -19
- schemas/response.py +109 -73
- server/Dockerfile +3 -2
- server/app.py +18 -24
- server/demo.py +1 -1
- server/env.py +45 -36
- server/requirements.txt +6 -8
- services/analysis_service.py +258 -139
- services/reward_service.py +56 -38
- services/suggestion_service.py +113 -28
- tests/test_inference_runner.py +53 -6
- tests/test_scoring.py +11 -0
- utils/ast_parser.py +248 -144
- utils/complexity.py +70 -37
- uv.lock +0 -2
DEMO_SCRIPT.md
CHANGED
|
@@ -1,12 +1,21 @@
|
|
| 1 |
-
# TorchReview Copilot Demo Script
|
| 2 |
-
|
| 3 |
-
## 60-90 Second Walkthrough
|
| 4 |
-
|
| 5 |
-
1.
|
| 6 |
-
2.
|
| 7 |
-
3.
|
| 8 |
-
4.
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# TorchReview Copilot Demo Script
|
| 2 |
+
|
| 3 |
+
## 60-90 Second Walkthrough
|
| 4 |
+
|
| 5 |
+
1. Introduce TorchReview Copilot as an AI-powered code review system that helps developers find bugs, reduce complexity, and improve maintainability faster.
|
| 6 |
+
2. Frame the problem clearly: manual code reviews are slow, inconsistent, and hard to scale across growing teams and codebases.
|
| 7 |
+
3. Open the Streamlit app and load the `Boundary Bug` example to show a realistic Python regression with failing behavior.
|
| 8 |
+
4. Point out the pipeline on-screen:
|
| 9 |
+
input code, static analysis, PyTorch scoring, suggestions, and RL-ready reward output.
|
| 10 |
+
5. Highlight the PyTorch story:
|
| 11 |
+
the app uses CodeBERTa embeddings through PyTorch to score code quality, maintainability, and domain fit.
|
| 12 |
+
6. Show the headline metrics:
|
| 13 |
+
detected domain, ML score, lint score, and final reward.
|
| 14 |
+
7. Scroll to the reward breakdown and explain that the reward is not arbitrary; it combines ML quality, maintainability, security, lint signals, and complexity penalties.
|
| 15 |
+
8. Open the Suggestions tab and show the prioritized fixes plus the three-step improvement plan.
|
| 16 |
+
9. Switch to the `Performance Hotspot` example to demonstrate that the system adapts to a different issue profile and pushes optimization hints instead of only syntax guidance.
|
| 17 |
+
10. Close by emphasizing that the same repo also works as an OpenEnv environment, so the project is both a usable developer product and an RL-ready benchmark component.
|
| 18 |
+
|
| 19 |
+
## 20-Second Closing Line
|
| 20 |
+
|
| 21 |
+
TorchReview Copilot turns code review into a measurable AI workflow: PyTorch handles semantic scoring, deterministic analyzers keep it grounded, and OpenEnv makes it trainable and benchmarkable.
|
Dockerfile
CHANGED
|
@@ -6,14 +6,16 @@ ENV PYTHONDONTWRITEBYTECODE=1 \
|
|
| 6 |
PYTHONIOENCODING=utf-8 \
|
| 7 |
PIP_NO_CACHE_DIR=1 \
|
| 8 |
PIP_DISABLE_PIP_VERSION_CHECK=1 \
|
| 9 |
-
|
|
|
|
|
|
|
| 10 |
|
| 11 |
WORKDIR /app
|
| 12 |
|
| 13 |
COPY server/requirements.txt /tmp/requirements.txt
|
| 14 |
|
| 15 |
RUN python -m pip install --upgrade pip && \
|
| 16 |
-
pip install -r /tmp/requirements.txt
|
| 17 |
|
| 18 |
COPY . /app
|
| 19 |
|
|
@@ -24,5 +26,4 @@ EXPOSE 8000
|
|
| 24 |
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
|
| 25 |
CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/health', timeout=3).read()"
|
| 26 |
|
| 27 |
-
|
| 28 |
-
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000", "--no-access-log"]
|
|
|
|
| 6 |
PYTHONIOENCODING=utf-8 \
|
| 7 |
PIP_NO_CACHE_DIR=1 \
|
| 8 |
PIP_DISABLE_PIP_VERSION_CHECK=1 \
|
| 9 |
+
PIP_DEFAULT_TIMEOUT=120 \
|
| 10 |
+
ENABLE_GRADIO_DEMO=false \
|
| 11 |
+
ENABLE_WEB_INTERFACE=false
|
| 12 |
|
| 13 |
WORKDIR /app
|
| 14 |
|
| 15 |
COPY server/requirements.txt /tmp/requirements.txt
|
| 16 |
|
| 17 |
RUN python -m pip install --upgrade pip && \
|
| 18 |
+
pip install --prefer-binary -r /tmp/requirements.txt
|
| 19 |
|
| 20 |
COPY . /app
|
| 21 |
|
|
|
|
| 26 |
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
|
| 27 |
CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/health', timeout=3).read()"
|
| 28 |
|
| 29 |
+
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
|
|
|
README.md
CHANGED
|
@@ -1,181 +1,282 @@
|
|
| 1 |
-
---
|
| 2 |
-
title:
|
| 3 |
-
sdk: docker
|
| 4 |
-
app_port: 8000
|
| 5 |
-
base_path: /web
|
| 6 |
-
pinned: false
|
| 7 |
-
tags:
|
| 8 |
-
- openenv
|
| 9 |
-
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
``
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
```
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
-
|
| 54 |
-
-
|
| 55 |
-
|
| 56 |
-
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
- The
|
| 64 |
-
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
``
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
``
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
-
|
| 147 |
-
-
|
| 148 |
-
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: TorchReview Copilot
|
| 3 |
+
sdk: docker
|
| 4 |
+
app_port: 8000
|
| 5 |
+
base_path: /web
|
| 6 |
+
pinned: false
|
| 7 |
+
tags:
|
| 8 |
+
- openenv
|
| 9 |
+
- pytorch
|
| 10 |
+
- code-review
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
# TorchReview Copilot
|
| 14 |
+
|
| 15 |
+
TorchReview Copilot is an AI-powered code review and improvement system built for the Meta PyTorch OpenEnv Hackathon. It combines deterministic static analysis, a real PyTorch code encoder, domain-aware review logic, and RL-ready reward shaping to help developers catch bugs, reduce complexity, and improve maintainability faster.
|
| 16 |
+
|
| 17 |
+
## Problem Statement
|
| 18 |
+
|
| 19 |
+
Manual code review is slow, inconsistent, and difficult to scale. Small logic bugs slip through, performance hotspots hide in otherwise correct code, and review quality changes from reviewer to reviewer.
|
| 20 |
+
|
| 21 |
+
## Solution
|
| 22 |
+
|
| 23 |
+
TorchReview Copilot accepts Python code, analyzes it with AST and complexity heuristics, scores it with a PyTorch model, and returns:
|
| 24 |
+
|
| 25 |
+
- A code quality score
|
| 26 |
+
- Domain-aware review feedback
|
| 27 |
+
- Actionable improvement suggestions
|
| 28 |
+
- An RL-ready reward signal for OpenEnv environments
|
| 29 |
+
|
| 30 |
+
## Why This Is Hackathon-Worthy
|
| 31 |
+
|
| 32 |
+
- Solves a real developer productivity problem
|
| 33 |
+
- Uses PyTorch meaningfully for model inference, not as a placeholder
|
| 34 |
+
- Produces a measurable reward signal for RL workflows
|
| 35 |
+
- Ships as a usable product with API, UI, docs, tests, and OpenEnv compatibility
|
| 36 |
+
|
| 37 |
+
## Tech Stack
|
| 38 |
+
|
| 39 |
+
- `PyTorch` for model execution and similarity scoring
|
| 40 |
+
- `transformers` with `huggingface/CodeBERTa-small-v1` for pretrained code embeddings
|
| 41 |
+
- `FastAPI` for the analysis API
|
| 42 |
+
- `Streamlit` for the interactive review UI
|
| 43 |
+
- `Pydantic` for request and response validation
|
| 44 |
+
- `OpenAI` Python client for hackathon-compliant LLM action planning in `inference.py`
|
| 45 |
+
- `OpenEnv` for environment, reward, and validator integration
|
| 46 |
+
|
| 47 |
+
## Pipeline
|
| 48 |
+
|
| 49 |
+
```text
|
| 50 |
+
Input Python Code
|
| 51 |
+
-> AST Parsing + Structural Signals
|
| 52 |
+
-> Complexity + Lint Heuristics
|
| 53 |
+
-> PyTorch Model Inference (CodeBERTa / torch fallback)
|
| 54 |
+
-> Domain Analysis + Suggestion Engine
|
| 55 |
+
-> RL Reward Shaping
|
| 56 |
+
-> UI + API + OpenEnv Environment Output
|
| 57 |
+
```
|
| 58 |
+
|
| 59 |
+
## PyTorch Integration
|
| 60 |
+
|
| 61 |
+
PyTorch is used in the core scoring path:
|
| 62 |
+
|
| 63 |
+
- The app loads `huggingface/CodeBERTa-small-v1` through `transformers`
|
| 64 |
+
- Input code, repository context, traceback text, and static-analysis hints are embedded with the encoder
|
| 65 |
+
- The resulting embedding is compared against quality, maintainability, domain, and issue prototypes
|
| 66 |
+
- The model produces:
|
| 67 |
+
- `ml_quality_score`
|
| 68 |
+
- `maintainability_score`
|
| 69 |
+
- domain confidences
|
| 70 |
+
- issue probabilities
|
| 71 |
+
|
| 72 |
+
If pretrained weights are unavailable, the project falls back to a torch-native hashed embedding backend so local demos and CI still work offline.
|
| 73 |
+
|
| 74 |
+
## Reward System
|
| 75 |
+
|
| 76 |
+
The system is RL-ready by design. Reward shaping blends model confidence, code quality, security, maintainability, and complexity into a bounded signal.
|
| 77 |
+
|
| 78 |
+
Core reward:
|
| 79 |
+
|
| 80 |
+
```text
|
| 81 |
+
reward = 0.50*ml_score
|
| 82 |
+
+ 0.18*lint_score
|
| 83 |
+
+ 0.12*maintainability_score
|
| 84 |
+
+ 0.10*domain_score
|
| 85 |
+
+ 0.10*security_score
|
| 86 |
+
- 0.20*complexity_penalty
|
| 87 |
+
```
|
| 88 |
+
|
| 89 |
+
The OpenEnv environment adds step-level shaping for:
|
| 90 |
+
|
| 91 |
+
- public test progress
|
| 92 |
+
- syntax recovery
|
| 93 |
+
- runtime improvements
|
| 94 |
+
- error reduction
|
| 95 |
+
- final submission success
|
| 96 |
+
- regressions and invalid actions
|
| 97 |
+
|
| 98 |
+
All task and step rewards are normalized into a strict safe interval for OpenEnv validation and printed in a validator-safe two-decimal band.
|
| 99 |
+
|
| 100 |
+
## Features
|
| 101 |
+
|
| 102 |
+
- Real PyTorch-backed code quality inference
|
| 103 |
+
- Static analysis with syntax, lint, AST, and complexity signals
|
| 104 |
+
- Domain-aware review for DSA, data science, ML/DL, and web code
|
| 105 |
+
- Prioritized suggestions and a compact 3-step improvement plan
|
| 106 |
+
- Auto-fix preview hints for quick wins
|
| 107 |
+
- Real-time Streamlit scoring mode
|
| 108 |
+
- OpenEnv-compatible environment and `inference.py`
|
| 109 |
+
- Deterministic benchmark tasks for syntax fixes, bug fixes, and optimization
|
| 110 |
+
|
| 111 |
+
## WOW Features
|
| 112 |
+
|
| 113 |
+
- Real-time scoring in the Streamlit interface
|
| 114 |
+
- Auto-fix preview panel
|
| 115 |
+
- Reward visualization and score breakdown
|
| 116 |
+
- OpenEnv environment with transparent reward decomposition
|
| 117 |
+
|
| 118 |
+
## Project Structure
|
| 119 |
+
|
| 120 |
+
```text
|
| 121 |
+
root
|
| 122 |
+
|- inference.py
|
| 123 |
+
|- api/
|
| 124 |
+
|- app/
|
| 125 |
+
| |- agents/
|
| 126 |
+
| |- env/
|
| 127 |
+
| |- models/
|
| 128 |
+
| |- services/
|
| 129 |
+
| `- utils/
|
| 130 |
+
|- analyzers/
|
| 131 |
+
|- graders/
|
| 132 |
+
|- models/
|
| 133 |
+
|- schemas/
|
| 134 |
+
|- services/
|
| 135 |
+
|- tasks/
|
| 136 |
+
|- tests/
|
| 137 |
+
`- utils/
|
| 138 |
+
```
|
| 139 |
+
|
| 140 |
+
Key modules:
|
| 141 |
+
|
| 142 |
+
- `models/pytorch_model.py`: PyTorch + transformer inference
|
| 143 |
+
- `services/analysis_service.py`: end-to-end review pipeline
|
| 144 |
+
- `services/reward_service.py`: RL-friendly reward shaping
|
| 145 |
+
- `services/suggestion_service.py`: actionable recommendations
|
| 146 |
+
- `app/streamlit_app.py`: interactive UI
|
| 147 |
+
- `server/env.py`: OpenEnv environment implementation
|
| 148 |
+
- `app/env/runner.py`: strict `inference.py` runner
|
| 149 |
+
|
| 150 |
+
## API
|
| 151 |
+
|
| 152 |
+
Run the analysis API:
|
| 153 |
+
|
| 154 |
+
```bash
|
| 155 |
+
python -m uvicorn api.main:app --host 0.0.0.0 --port 7860
|
| 156 |
+
```
|
| 157 |
+
|
| 158 |
+
Main endpoint:
|
| 159 |
+
|
| 160 |
+
- `POST /analyze`
|
| 161 |
+
|
| 162 |
+
The API returns:
|
| 163 |
+
|
| 164 |
+
- detected domain
|
| 165 |
+
- static-analysis summary
|
| 166 |
+
- model prediction
|
| 167 |
+
- score breakdown
|
| 168 |
+
- suggestions
|
| 169 |
+
- improvement plan
|
| 170 |
+
|
| 171 |
+
## Streamlit UI
|
| 172 |
+
|
| 173 |
+
Run the product UI locally:
|
| 174 |
+
|
| 175 |
+
```bash
|
| 176 |
+
streamlit run app/streamlit_app.py
|
| 177 |
+
```
|
| 178 |
+
|
| 179 |
+
The UI includes:
|
| 180 |
+
|
| 181 |
+
- code input editor
|
| 182 |
+
- example snippets
|
| 183 |
+
- real-time scoring toggle
|
| 184 |
+
- ML score, lint score, and reward display
|
| 185 |
+
- domain confidence chart
|
| 186 |
+
- reward-signal visualization
|
| 187 |
+
- suggestion list and auto-fix preview
|
| 188 |
+
|
| 189 |
+
## OpenEnv Compatibility
|
| 190 |
+
|
| 191 |
+
This repository is also a valid OpenEnv submission:
|
| 192 |
+
|
| 193 |
+
- `inference.py` is in the repo root
|
| 194 |
+
- `API_BASE_URL` and `MODEL_NAME` have defaults
|
| 195 |
+
- `HF_TOKEN` is read from the environment
|
| 196 |
+
- The runner uses the official `OpenAI` Python client
|
| 197 |
+
- Output follows the required `[START]`, `[STEP]`, `[END]` contract
|
| 198 |
+
|
| 199 |
+
Example:
|
| 200 |
+
|
| 201 |
+
```text
|
| 202 |
+
[START] task=syntax_fix_invoice_totals env=python_code_review_env model=Qwen/Qwen2.5-3B-Instruct
|
| 203 |
+
[STEP] step=1 action=run_tests reward=0.34 done=false error=null
|
| 204 |
+
[STEP] step=2 action=edit_code reward=0.42 done=false error=null
|
| 205 |
+
[STEP] step=3 action=submit_solution reward=0.99 done=true error=null
|
| 206 |
+
[END] success=true steps=3 rewards=0.34,0.42,0.99
|
| 207 |
+
```
|
| 208 |
+
|
| 209 |
+
## Setup
|
| 210 |
+
|
| 211 |
+
Install dependencies:
|
| 212 |
+
|
| 213 |
+
```bash
|
| 214 |
+
pip install -e .[dev]
|
| 215 |
+
```
|
| 216 |
+
|
| 217 |
+
Run tests:
|
| 218 |
+
|
| 219 |
+
```bash
|
| 220 |
+
pytest -q
|
| 221 |
+
```
|
| 222 |
+
|
| 223 |
+
Run the OpenEnv server:
|
| 224 |
+
|
| 225 |
+
```bash
|
| 226 |
+
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
|
| 227 |
+
```
|
| 228 |
+
|
| 229 |
+
Run the demo UI mounted into the server:
|
| 230 |
+
|
| 231 |
+
```bash
|
| 232 |
+
set ENABLE_GRADIO_DEMO=true
|
| 233 |
+
set ENABLE_WEB_INTERFACE=true
|
| 234 |
+
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
|
| 235 |
+
```
|
| 236 |
+
|
| 237 |
+
## Hugging Face Spaces
|
| 238 |
+
|
| 239 |
+
This repo is designed to run on a Docker-based Hugging Face Space under a `2 vCPU / 8 GB RAM` budget.
|
| 240 |
+
|
| 241 |
+
Recommended Space settings:
|
| 242 |
+
|
| 243 |
+
- SDK: `Docker`
|
| 244 |
+
- Port: `8000`
|
| 245 |
+
- Secret: `HF_TOKEN`
|
| 246 |
+
- Optional vars:
|
| 247 |
+
- `API_BASE_URL`
|
| 248 |
+
- `MODEL_NAME`
|
| 249 |
+
- `ENABLE_GRADIO_DEMO=false`
|
| 250 |
+
- `ENABLE_WEB_INTERFACE=false`
|
| 251 |
+
|
| 252 |
+
## Screenshots
|
| 253 |
+
|
| 254 |
+
Add these before final submission:
|
| 255 |
+
|
| 256 |
+
- Main review UI with code editor and reward metrics
|
| 257 |
+
- Suggestions tab with improvement plan
|
| 258 |
+
- OpenEnv task loop or validator output snippet
|
| 259 |
+
|
| 260 |
+
## Demo Link
|
| 261 |
+
|
| 262 |
+
Add your live Hugging Face Space URL here before final submission.
|
| 263 |
+
|
| 264 |
+
## Demo Script
|
| 265 |
+
|
| 266 |
+
See [DEMO_SCRIPT.md](DEMO_SCRIPT.md) for a concise hackathon walkthrough.
|
| 267 |
+
|
| 268 |
+
## Testing
|
| 269 |
+
|
| 270 |
+
The repo includes coverage for:
|
| 271 |
+
|
| 272 |
+
- score normalization into the strict OpenEnv-safe interval
|
| 273 |
+
- inference output formatting
|
| 274 |
+
- API response structure
|
| 275 |
+
- multi-domain analysis behavior
|
| 276 |
+
- triage and embedding behavior
|
| 277 |
+
|
| 278 |
+
## Notes for Judges
|
| 279 |
+
|
| 280 |
+
- This is not a toy wrapper around an LLM. The review pipeline includes deterministic analysis, PyTorch-based code scoring, and explicit reward shaping.
|
| 281 |
+
- The system is useful both as a developer-facing application and as a benchmark-friendly RL environment.
|
| 282 |
+
- The design intentionally balances product polish with validator reliability.
|
__init__.py
CHANGED
|
@@ -1,19 +1,35 @@
|
|
| 1 |
-
"""Public package exports for python_code_review_env."""
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
from .
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
from .
|
| 15 |
-
from .
|
| 16 |
-
from .
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 17 |
|
| 18 |
__all__ = [
|
| 19 |
"PythonAction",
|
|
|
|
| 1 |
+
"""Public package exports for python_code_review_env."""
|
| 2 |
+
|
| 3 |
+
try:
|
| 4 |
+
from .client import PythonCodeReviewEnv, PythonEnv
|
| 5 |
+
from .models import (
|
| 6 |
+
PyTorchCodeAnalyzerModel,
|
| 7 |
+
PythonAction,
|
| 8 |
+
PythonCodeReviewAction,
|
| 9 |
+
PythonCodeReviewObservation,
|
| 10 |
+
PythonCodeReviewState,
|
| 11 |
+
PythonObservation,
|
| 12 |
+
PythonState,
|
| 13 |
+
)
|
| 14 |
+
from .schemas import AnalyzeCodeRequest, AnalyzeCodeResponse
|
| 15 |
+
from .services import AnalysisService
|
| 16 |
+
from .triage import CodeTriageEngine, HashingEmbeddingBackend, TransformersEmbeddingBackend, get_default_engine
|
| 17 |
+
from .triage_models import TriageResult
|
| 18 |
+
except ImportError: # pragma: no cover
|
| 19 |
+
from client import PythonCodeReviewEnv, PythonEnv
|
| 20 |
+
from models import (
|
| 21 |
+
PyTorchCodeAnalyzerModel,
|
| 22 |
+
PythonAction,
|
| 23 |
+
PythonCodeReviewAction,
|
| 24 |
+
PythonCodeReviewObservation,
|
| 25 |
+
PythonCodeReviewState,
|
| 26 |
+
PythonObservation,
|
| 27 |
+
PythonState,
|
| 28 |
+
)
|
| 29 |
+
from schemas import AnalyzeCodeRequest, AnalyzeCodeResponse
|
| 30 |
+
from services import AnalysisService
|
| 31 |
+
from triage import CodeTriageEngine, HashingEmbeddingBackend, TransformersEmbeddingBackend, get_default_engine
|
| 32 |
+
from triage_models import TriageResult
|
| 33 |
|
| 34 |
__all__ = [
|
| 35 |
"PythonAction",
|
analyzers/ds_analyzer.py
CHANGED
|
@@ -15,13 +15,14 @@ def analyze_data_science_code(code: str, parsed: Dict[str, Any], complexity: Dic
|
|
| 15 |
score = 0.72
|
| 16 |
|
| 17 |
if "iterrows(" in code or "itertuples(" in code:
|
| 18 |
-
issues.append(
|
| 19 |
-
AnalysisIssue(
|
| 20 |
-
title="Row-wise dataframe iteration detected",
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
|
|
|
| 25 |
suggestions.append("Use vectorized pandas or numpy expressions instead of row-wise iteration.")
|
| 26 |
score -= 0.18
|
| 27 |
|
|
@@ -30,13 +31,14 @@ def analyze_data_science_code(code: str, parsed: Dict[str, Any], complexity: Dic
|
|
| 30 |
score -= 0.05
|
| 31 |
|
| 32 |
if "fit_transform(" in code and "train_test_split" not in code:
|
| 33 |
-
issues.append(
|
| 34 |
-
AnalysisIssue(
|
| 35 |
-
title="Potential data leakage risk",
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
|
|
|
| 40 |
suggestions.append("Split train and validation data before fitting stateful preprocessing steps.")
|
| 41 |
score -= 0.2
|
| 42 |
|
|
|
|
| 15 |
score = 0.72
|
| 16 |
|
| 17 |
if "iterrows(" in code or "itertuples(" in code:
|
| 18 |
+
issues.append(
|
| 19 |
+
AnalysisIssue(
|
| 20 |
+
title="Row-wise dataframe iteration detected",
|
| 21 |
+
category="performance",
|
| 22 |
+
severity="medium",
|
| 23 |
+
description="Looping through dataframe rows is usually slower and less scalable than vectorized operations.",
|
| 24 |
+
)
|
| 25 |
+
)
|
| 26 |
suggestions.append("Use vectorized pandas or numpy expressions instead of row-wise iteration.")
|
| 27 |
score -= 0.18
|
| 28 |
|
|
|
|
| 31 |
score -= 0.05
|
| 32 |
|
| 33 |
if "fit_transform(" in code and "train_test_split" not in code:
|
| 34 |
+
issues.append(
|
| 35 |
+
AnalysisIssue(
|
| 36 |
+
title="Potential data leakage risk",
|
| 37 |
+
category="correctness",
|
| 38 |
+
severity="high",
|
| 39 |
+
description="Feature transforms appear before an explicit train/test split.",
|
| 40 |
+
)
|
| 41 |
+
)
|
| 42 |
suggestions.append("Split train and validation data before fitting stateful preprocessing steps.")
|
| 43 |
score -= 0.2
|
| 44 |
|
analyzers/dsa_analyzer.py
CHANGED
|
@@ -15,13 +15,14 @@ def analyze_dsa_code(code: str, parsed: Dict[str, Any], complexity: Dict[str, An
|
|
| 15 |
score = 0.7
|
| 16 |
|
| 17 |
if parsed.get("max_loop_depth", 0) >= 2:
|
| 18 |
-
issues.append(
|
| 19 |
-
AnalysisIssue(
|
| 20 |
-
title="Nested loops suggest brute-force behavior",
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
|
|
|
| 25 |
suggestions.append("Consider replacing nested scans with a hashmap, prefix table, or sorted search strategy.")
|
| 26 |
score -= 0.15
|
| 27 |
|
|
|
|
| 15 |
score = 0.7
|
| 16 |
|
| 17 |
if parsed.get("max_loop_depth", 0) >= 2:
|
| 18 |
+
issues.append(
|
| 19 |
+
AnalysisIssue(
|
| 20 |
+
title="Nested loops suggest brute-force behavior",
|
| 21 |
+
category="performance",
|
| 22 |
+
severity="medium",
|
| 23 |
+
description="The implementation scans the input multiple times, which is often avoidable in DSA problems.",
|
| 24 |
+
)
|
| 25 |
+
)
|
| 26 |
suggestions.append("Consider replacing nested scans with a hashmap, prefix table, or sorted search strategy.")
|
| 27 |
score -= 0.15
|
| 28 |
|
analyzers/ml_analyzer.py
CHANGED
|
@@ -15,13 +15,14 @@ def analyze_ml_code(code: str, parsed: Dict[str, Any], complexity: Dict[str, Any
|
|
| 15 |
score = 0.74
|
| 16 |
|
| 17 |
if "torch" in code and "model.eval()" not in code and "predict" in code.lower():
|
| 18 |
-
issues.append(
|
| 19 |
-
AnalysisIssue(
|
| 20 |
-
title="Inference path may be missing eval mode",
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
|
|
|
| 25 |
suggestions.append("Call model.eval() before inference to disable training-time behavior such as dropout.")
|
| 26 |
score -= 0.18
|
| 27 |
|
|
@@ -30,13 +31,14 @@ def analyze_ml_code(code: str, parsed: Dict[str, Any], complexity: Dict[str, Any
|
|
| 30 |
score -= 0.12
|
| 31 |
|
| 32 |
if parsed.get("calls_backward") and not parsed.get("calls_optimizer_step"):
|
| 33 |
-
issues.append(
|
| 34 |
-
AnalysisIssue(
|
| 35 |
-
title="Backward pass without optimizer step",
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
|
|
|
| 40 |
suggestions.append("Ensure optimizer.step() and optimizer.zero_grad() are placed correctly in the training loop.")
|
| 41 |
score -= 0.12
|
| 42 |
|
|
|
|
| 15 |
score = 0.74
|
| 16 |
|
| 17 |
if "torch" in code and "model.eval()" not in code and "predict" in code.lower():
|
| 18 |
+
issues.append(
|
| 19 |
+
AnalysisIssue(
|
| 20 |
+
title="Inference path may be missing eval mode",
|
| 21 |
+
category="correctness",
|
| 22 |
+
severity="high",
|
| 23 |
+
description="Inference code should place the model in eval mode before prediction.",
|
| 24 |
+
)
|
| 25 |
+
)
|
| 26 |
suggestions.append("Call model.eval() before inference to disable training-time behavior such as dropout.")
|
| 27 |
score -= 0.18
|
| 28 |
|
|
|
|
| 31 |
score -= 0.12
|
| 32 |
|
| 33 |
if parsed.get("calls_backward") and not parsed.get("calls_optimizer_step"):
|
| 34 |
+
issues.append(
|
| 35 |
+
AnalysisIssue(
|
| 36 |
+
title="Backward pass without optimizer step",
|
| 37 |
+
category="correctness",
|
| 38 |
+
severity="medium",
|
| 39 |
+
description="Gradients are computed, but the optimizer step is not obvious in the snippet.",
|
| 40 |
+
)
|
| 41 |
+
)
|
| 42 |
suggestions.append("Ensure optimizer.step() and optimizer.zero_grad() are placed correctly in the training loop.")
|
| 43 |
score -= 0.12
|
| 44 |
|
analyzers/web_analyzer.py
CHANGED
|
@@ -16,13 +16,14 @@ def analyze_web_code(code: str, parsed: Dict[str, Any], complexity: Dict[str, An
|
|
| 16 |
|
| 17 |
route_decorators = set(parsed.get("route_decorators", []))
|
| 18 |
if route_decorators and not parsed.get("uses_pydantic"):
|
| 19 |
-
issues.append(
|
| 20 |
-
AnalysisIssue(
|
| 21 |
-
title="Request validation model is missing",
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
|
|
|
| 26 |
suggestions.append("Add Pydantic request and response models for strict validation and type-safe contracts.")
|
| 27 |
score -= 0.2
|
| 28 |
|
|
|
|
| 16 |
|
| 17 |
route_decorators = set(parsed.get("route_decorators", []))
|
| 18 |
if route_decorators and not parsed.get("uses_pydantic"):
|
| 19 |
+
issues.append(
|
| 20 |
+
AnalysisIssue(
|
| 21 |
+
title="Request validation model is missing",
|
| 22 |
+
category="security",
|
| 23 |
+
severity="high",
|
| 24 |
+
description="Route handlers appear present, but no obvious Pydantic validation layer was detected.",
|
| 25 |
+
)
|
| 26 |
+
)
|
| 27 |
suggestions.append("Add Pydantic request and response models for strict validation and type-safe contracts.")
|
| 28 |
score -= 0.2
|
| 29 |
|
api/main.py
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
"""FastAPI backend for the
|
| 2 |
|
| 3 |
from __future__ import annotations
|
| 4 |
|
|
@@ -9,7 +9,7 @@ from schemas.response import AnalyzeCodeResponse
|
|
| 9 |
from services.analysis_service import AnalysisService
|
| 10 |
|
| 11 |
|
| 12 |
-
app = FastAPI(title="
|
| 13 |
analysis_service = AnalysisService()
|
| 14 |
|
| 15 |
|
|
@@ -21,7 +21,7 @@ def health() -> dict[str, str]:
|
|
| 21 |
|
| 22 |
|
| 23 |
@app.post("/analyze", response_model=AnalyzeCodeResponse)
|
| 24 |
-
def analyze_code(payload: AnalyzeCodeRequest) -> AnalyzeCodeResponse:
|
| 25 |
-
"""Analyze code
|
| 26 |
-
|
| 27 |
-
return analysis_service.analyze(payload)
|
|
|
|
| 1 |
+
"""FastAPI backend for the AI-powered Python code review platform."""
|
| 2 |
|
| 3 |
from __future__ import annotations
|
| 4 |
|
|
|
|
| 9 |
from services.analysis_service import AnalysisService
|
| 10 |
|
| 11 |
|
| 12 |
+
app = FastAPI(title="TorchReview Copilot API", version="3.0.0")
|
| 13 |
analysis_service = AnalysisService()
|
| 14 |
|
| 15 |
|
|
|
|
| 21 |
|
| 22 |
|
| 23 |
@app.post("/analyze", response_model=AnalyzeCodeResponse)
|
| 24 |
+
def analyze_code(payload: AnalyzeCodeRequest) -> AnalyzeCodeResponse:
|
| 25 |
+
"""Analyze Python code and return review scores, suggestions, and reward signals."""
|
| 26 |
+
|
| 27 |
+
return analysis_service.analyze(payload)
|
app/env/__init__.py
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
from .runner import main
|
| 4 |
|
| 5 |
-
__all__ = ["main"]
|
|
|
|
| 1 |
+
"""OpenEnv inference runtime package."""
|
| 2 |
|
| 3 |
+
from .runner import InferenceRunner, main
|
| 4 |
|
| 5 |
+
__all__ = ["InferenceRunner", "main"]
|
app/env/runner.py
CHANGED
|
@@ -1,25 +1,14 @@
|
|
| 1 |
-
"""Strict
|
| 2 |
|
| 3 |
from __future__ import annotations
|
| 4 |
|
|
|
|
| 5 |
from typing import Any
|
| 6 |
|
| 7 |
-
from compat import install_openenv_fastmcp_compat
|
| 8 |
-
|
| 9 |
from app.agents.review_agent import ReviewAgent
|
| 10 |
-
from app.models.inference import
|
| 11 |
from app.services.openai_service import OpenAIActionPlanner
|
| 12 |
-
from app.utils.runtime import
|
| 13 |
-
compact_text,
|
| 14 |
-
format_bool,
|
| 15 |
-
format_error,
|
| 16 |
-
format_reward,
|
| 17 |
-
observation_attr,
|
| 18 |
-
parse_task_ids,
|
| 19 |
-
suppress_output,
|
| 20 |
-
)
|
| 21 |
-
|
| 22 |
-
install_openenv_fastmcp_compat()
|
| 23 |
|
| 24 |
try:
|
| 25 |
from models import PythonCodeReviewAction
|
|
@@ -30,107 +19,71 @@ except ImportError: # pragma: no cover
|
|
| 30 |
|
| 31 |
|
| 32 |
class InferenceRunner:
|
| 33 |
-
"""
|
| 34 |
|
| 35 |
def __init__(self, config: InferenceConfig) -> None:
|
| 36 |
self.config = config
|
| 37 |
self.agent = ReviewAgent(OpenAIActionPlanner(config))
|
| 38 |
|
| 39 |
-
def
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
|
|
|
| 43 |
|
| 44 |
-
|
| 45 |
rewards: list[str] = []
|
| 46 |
-
|
| 47 |
success = False
|
| 48 |
-
fatal_error: str | None = None
|
| 49 |
-
|
| 50 |
-
self._emit_start(task_name)
|
| 51 |
|
|
|
|
| 52 |
try:
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
1,
|
| 58 |
-
min(
|
| 59 |
-
self.config.max_episode_steps,
|
| 60 |
-
int(observation_attr(observation, "attempts_remaining", self.config.max_episode_steps) or self.config.max_episode_steps),
|
| 61 |
-
),
|
| 62 |
-
)
|
| 63 |
-
while not done and step_count < max_steps:
|
| 64 |
decision = self.agent.act(observation)
|
| 65 |
-
|
| 66 |
-
|
|
|
|
| 67 |
rewards.append(format_reward(reward))
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 74 |
except Exception as exc:
|
| 75 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 76 |
finally:
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
with suppress_output():
|
| 85 |
-
return env.reset(task_id=task_name)
|
| 86 |
-
|
| 87 |
-
def _step_env(
|
| 88 |
-
self,
|
| 89 |
-
env: PythonCodeReviewEnvironment,
|
| 90 |
-
decision: AgentDecision,
|
| 91 |
-
) -> tuple[Any, float, bool, dict[str, Any]]:
|
| 92 |
-
action = PythonCodeReviewAction(action_type=decision.action_type, code=decision.code)
|
| 93 |
-
with suppress_output():
|
| 94 |
-
observation, reward, done, info = env.step_result(action)
|
| 95 |
-
return observation, float(reward), bool(done), dict(info or {})
|
| 96 |
-
|
| 97 |
-
def _resolve_step_error(
|
| 98 |
-
self,
|
| 99 |
-
info: dict[str, Any],
|
| 100 |
-
observation: Any,
|
| 101 |
-
decision: AgentDecision,
|
| 102 |
-
) -> str | None:
|
| 103 |
-
env_error = compact_text(
|
| 104 |
-
info.get("last_action_error") or observation_attr(observation, "last_action_error", None),
|
| 105 |
-
default="",
|
| 106 |
-
)
|
| 107 |
-
if env_error:
|
| 108 |
-
return env_error
|
| 109 |
-
if decision.error:
|
| 110 |
-
return compact_text(decision.error, default="")
|
| 111 |
-
return None
|
| 112 |
-
|
| 113 |
-
def _emit_start(self, task_name: str) -> None:
|
| 114 |
-
print(
|
| 115 |
-
f"[START] task={task_name} env={self.config.benchmark_name} model={self.config.model_name}",
|
| 116 |
-
flush=True,
|
| 117 |
-
)
|
| 118 |
-
|
| 119 |
-
def _emit_step(self, step_count: int, action: str, reward: float, done: bool, error: str | None) -> None:
|
| 120 |
-
print(
|
| 121 |
-
f"[STEP] step={step_count} action={compact_text(action, default='analyze_code')} "
|
| 122 |
-
f"reward={format_reward(reward)} done={format_bool(done)} error={format_error(error)}",
|
| 123 |
-
flush=True,
|
| 124 |
-
)
|
| 125 |
-
|
| 126 |
-
def _emit_end(self, *, success: bool, step_count: int, rewards: list[str]) -> None:
|
| 127 |
-
print(
|
| 128 |
-
f"[END] success={format_bool(success)} steps={step_count} rewards={','.join(rewards)}",
|
| 129 |
-
flush=True,
|
| 130 |
-
)
|
| 131 |
|
| 132 |
|
| 133 |
def main() -> int:
|
| 134 |
-
"""
|
| 135 |
-
|
| 136 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Strict OpenEnv inference runner for TorchReview Copilot."""
|
| 2 |
|
| 3 |
from __future__ import annotations
|
| 4 |
|
| 5 |
+
import os
|
| 6 |
from typing import Any
|
| 7 |
|
|
|
|
|
|
|
| 8 |
from app.agents.review_agent import ReviewAgent
|
| 9 |
+
from app.models.inference import InferenceConfig
|
| 10 |
from app.services.openai_service import OpenAIActionPlanner
|
| 11 |
+
from app.utils.runtime import format_bool, format_error, format_reward, parse_task_ids
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 12 |
|
| 13 |
try:
|
| 14 |
from models import PythonCodeReviewAction
|
|
|
|
| 19 |
|
| 20 |
|
| 21 |
class InferenceRunner:
|
| 22 |
+
"""Execute one OpenEnv episode and emit the required stdout contract."""
|
| 23 |
|
| 24 |
def __init__(self, config: InferenceConfig) -> None:
|
| 25 |
self.config = config
|
| 26 |
self.agent = ReviewAgent(OpenAIActionPlanner(config))
|
| 27 |
|
| 28 |
+
def _create_env(self) -> PythonCodeReviewEnvironment:
|
| 29 |
+
return PythonCodeReviewEnvironment(verbose=False)
|
| 30 |
+
|
| 31 |
+
def run_task(self, task_id: str) -> int:
|
| 32 |
+
"""Run one task and print strict [START]/[STEP]/[END] lines."""
|
| 33 |
|
| 34 |
+
env = self._create_env()
|
| 35 |
rewards: list[str] = []
|
| 36 |
+
steps = 0
|
| 37 |
success = False
|
|
|
|
|
|
|
|
|
|
| 38 |
|
| 39 |
+
print(f"[START] task={task_id} env={self.config.benchmark_name} model={self.config.model_name}")
|
| 40 |
try:
|
| 41 |
+
observation = env.reset(task_id=task_id)
|
| 42 |
+
done = bool(getattr(observation, "done", False))
|
| 43 |
+
|
| 44 |
+
while not done and steps < self.config.max_episode_steps:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 45 |
decision = self.agent.act(observation)
|
| 46 |
+
action = PythonCodeReviewAction(action_type=decision.action_type, code=decision.code)
|
| 47 |
+
observation, reward, done, info = env.step_result(action)
|
| 48 |
+
steps += 1
|
| 49 |
rewards.append(format_reward(reward))
|
| 50 |
+
error_value = info.get("last_action_error") if isinstance(info, dict) else None
|
| 51 |
+
if error_value is None:
|
| 52 |
+
error_value = getattr(observation, "last_action_error", None)
|
| 53 |
+
print(
|
| 54 |
+
f"[STEP] step={steps} action={decision.action_type} "
|
| 55 |
+
f"reward={format_reward(reward)} done={format_bool(done)} error={format_error(error_value)}"
|
| 56 |
+
)
|
| 57 |
+
|
| 58 |
+
final_score = float(getattr(observation, "score", 0.0))
|
| 59 |
+
success = bool(done and final_score >= self.config.success_threshold)
|
| 60 |
+
return 0 if success else 1
|
| 61 |
except Exception as exc:
|
| 62 |
+
if steps == 0:
|
| 63 |
+
print(
|
| 64 |
+
f"[STEP] step=1 action=bootstrap reward=0.00 done=true "
|
| 65 |
+
f"error={format_error(f'{type(exc).__name__}: {exc}')}"
|
| 66 |
+
)
|
| 67 |
+
rewards.append("0.00")
|
| 68 |
+
steps = 1
|
| 69 |
+
return 1
|
| 70 |
finally:
|
| 71 |
+
try:
|
| 72 |
+
close_method = getattr(env, "close", None)
|
| 73 |
+
if callable(close_method):
|
| 74 |
+
close_method()
|
| 75 |
+
except Exception:
|
| 76 |
+
pass
|
| 77 |
+
print(f"[END] success={format_bool(success)} steps={steps} rewards={','.join(rewards)}")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 78 |
|
| 79 |
|
| 80 |
def main() -> int:
|
| 81 |
+
"""Run a single validator episode using environment defaults."""
|
| 82 |
+
|
| 83 |
+
config = InferenceConfig.from_env()
|
| 84 |
+
task_id = (
|
| 85 |
+
str(os.getenv("OPENENV_TASK_ID") or os.getenv("TASK_ID") or "").strip()
|
| 86 |
+
or parse_task_ids()[0]
|
| 87 |
+
)
|
| 88 |
+
runner = InferenceRunner(config)
|
| 89 |
+
return runner.run_task(task_id)
|
app/examples.py
CHANGED
|
@@ -1,31 +1,31 @@
|
|
| 1 |
-
"""Example snippets for
|
| 2 |
-
|
| 3 |
-
from __future__ import annotations
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
EXAMPLES = {
|
| 7 |
-
"
|
| 8 |
-
"domain_hint": "dsa",
|
| 9 |
-
"context_window": "
|
| 10 |
-
"traceback_text": "",
|
| 11 |
-
"code": """def
|
| 12 |
-
},
|
| 13 |
-
"
|
| 14 |
-
"domain_hint": "
|
| 15 |
-
"context_window": "
|
| 16 |
-
"traceback_text": "",
|
| 17 |
-
"code": """
|
| 18 |
-
},
|
| 19 |
-
"ML
|
| 20 |
-
"domain_hint": "ml_dl",
|
| 21 |
-
"context_window": "
|
| 22 |
-
"traceback_text": "",
|
| 23 |
-
"code": """import torch\n\nclass Predictor:\n def __init__(self, model):\n self.model = model\n\n def predict(self, batch):\n outputs = self.model(batch)\n return outputs.argmax(dim=1)\n""",
|
| 24 |
-
},
|
| 25 |
-
"
|
| 26 |
-
"domain_hint": "web",
|
| 27 |
-
"context_window": "Backend endpoint for creating review tasks from user-submitted payloads.",
|
| 28 |
-
"traceback_text": "",
|
| 29 |
-
"code": """from fastapi import FastAPI, Request\n\napp = FastAPI()\n\n@app.post('/tasks')\ndef create_task(request: Request):\n payload = request.json()\n return {'task': payload}\n""",
|
| 30 |
-
},
|
| 31 |
-
}
|
|
|
|
| 1 |
+
"""Example snippets for the code review UI."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
EXAMPLES = {
|
| 7 |
+
"Boundary Bug": {
|
| 8 |
+
"domain_hint": "dsa",
|
| 9 |
+
"context_window": "Analytics helper that groups sorted events into session windows.",
|
| 10 |
+
"traceback_text": "AssertionError: expected [(1, 3), (8, 8)] but got [(1, 8)] on the boundary case.",
|
| 11 |
+
"code": """def collapse_sessions(events, idle_timeout_minutes):\n if not events:\n return []\n\n sessions = []\n current_start = events[0]['minute']\n current_end = current_start\n\n for event in events[1:]:\n minute = event['minute']\n if minute - current_end > idle_timeout_minutes:\n sessions.append((current_start, current_end))\n current_start = minute\n current_end = minute\n\n return sessions\n""",
|
| 12 |
+
},
|
| 13 |
+
"Performance Hotspot": {
|
| 14 |
+
"domain_hint": "dsa",
|
| 15 |
+
"context_window": "Nightly export job running on a small CPU box with rising traffic volume.",
|
| 16 |
+
"traceback_text": "BenchmarkWarning: function exceeded latency budget due to repeated full-list scans.",
|
| 17 |
+
"code": """def rank_active_users(events):\n users = []\n for event in events:\n if event['status'] == 'active':\n found = False\n for existing in users:\n if existing == event['user_id']:\n found = True\n if not found:\n users.append(event['user_id'])\n\n totals = []\n for user in users:\n count = 0\n for event in events:\n if event['status'] == 'active' and event['user_id'] == user:\n count += 1\n totals.append((user, count))\n\n totals.sort(key=lambda item: (-item[1], item[0]))\n return totals\n""",
|
| 18 |
+
},
|
| 19 |
+
"ML Inference": {
|
| 20 |
+
"domain_hint": "ml_dl",
|
| 21 |
+
"context_window": "Batch inference helper for a PyTorch image classifier.",
|
| 22 |
+
"traceback_text": "",
|
| 23 |
+
"code": """import torch\n\nclass Predictor:\n def __init__(self, model):\n self.model = model\n\n def predict(self, batch):\n outputs = self.model(batch)\n return outputs.argmax(dim=1)\n""",
|
| 24 |
+
},
|
| 25 |
+
"FastAPI Endpoint": {
|
| 26 |
+
"domain_hint": "web",
|
| 27 |
+
"context_window": "Backend endpoint for creating review tasks from user-submitted payloads.",
|
| 28 |
+
"traceback_text": "",
|
| 29 |
+
"code": """from fastapi import FastAPI, Request\n\napp = FastAPI()\n\n@app.post('/tasks')\ndef create_task(request: Request):\n payload = request.json()\n return {'task': payload}\n""",
|
| 30 |
+
},
|
| 31 |
+
}
|
app/models/inference.py
CHANGED
|
@@ -11,25 +11,38 @@ DEFAULT_MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"
|
|
| 11 |
DEFAULT_BENCHMARK_NAME = "python_code_review_env"
|
| 12 |
|
| 13 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 14 |
@dataclass(slots=True)
|
| 15 |
class InferenceConfig:
|
| 16 |
"""Runtime configuration loaded from environment variables."""
|
| 17 |
|
| 18 |
api_base_url: str
|
| 19 |
model_name: str
|
| 20 |
-
|
| 21 |
-
benchmark_name: str = DEFAULT_BENCHMARK_NAME
|
| 22 |
-
request_timeout_s: float = 12.0
|
| 23 |
-
max_retries: int = 2
|
| 24 |
-
max_episode_steps: int = 12
|
| 25 |
-
success_threshold: float = 0.
|
| 26 |
|
| 27 |
@classmethod
|
| 28 |
def from_env(cls) -> "InferenceConfig":
|
|
|
|
| 29 |
return cls(
|
| 30 |
-
api_base_url=
|
| 31 |
model_name=str(os.getenv("MODEL_NAME") or DEFAULT_MODEL_NAME),
|
| 32 |
-
|
| 33 |
benchmark_name=str(os.getenv("OPENENV_BENCHMARK") or DEFAULT_BENCHMARK_NAME),
|
| 34 |
)
|
| 35 |
|
|
|
|
| 11 |
DEFAULT_BENCHMARK_NAME = "python_code_review_env"
|
| 12 |
|
| 13 |
|
| 14 |
+
def _resolve_api_key(api_base_url: str) -> str:
|
| 15 |
+
"""Choose the correct provider token for the configured endpoint."""
|
| 16 |
+
|
| 17 |
+
normalized = api_base_url.strip().lower()
|
| 18 |
+
hf_token = str(os.getenv("HF_TOKEN") or "").strip()
|
| 19 |
+
openai_api_key = str(os.getenv("OPENAI_API_KEY") or "").strip()
|
| 20 |
+
|
| 21 |
+
if "api.openai.com" in normalized:
|
| 22 |
+
return openai_api_key or hf_token
|
| 23 |
+
return hf_token or openai_api_key
|
| 24 |
+
|
| 25 |
+
|
| 26 |
@dataclass(slots=True)
|
| 27 |
class InferenceConfig:
|
| 28 |
"""Runtime configuration loaded from environment variables."""
|
| 29 |
|
| 30 |
api_base_url: str
|
| 31 |
model_name: str
|
| 32 |
+
api_key: str
|
| 33 |
+
benchmark_name: str = DEFAULT_BENCHMARK_NAME
|
| 34 |
+
request_timeout_s: float = 12.0
|
| 35 |
+
max_retries: int = 2
|
| 36 |
+
max_episode_steps: int = 12
|
| 37 |
+
success_threshold: float = 0.88
|
| 38 |
|
| 39 |
@classmethod
|
| 40 |
def from_env(cls) -> "InferenceConfig":
|
| 41 |
+
api_base_url = str(os.getenv("API_BASE_URL") or DEFAULT_API_BASE_URL)
|
| 42 |
return cls(
|
| 43 |
+
api_base_url=api_base_url,
|
| 44 |
model_name=str(os.getenv("MODEL_NAME") or DEFAULT_MODEL_NAME),
|
| 45 |
+
api_key=_resolve_api_key(api_base_url),
|
| 46 |
benchmark_name=str(os.getenv("OPENENV_BENCHMARK") or DEFAULT_BENCHMARK_NAME),
|
| 47 |
)
|
| 48 |
|
app/services/openai_service.py
CHANGED
|
@@ -20,11 +20,15 @@ class OpenAIActionPlanner:
|
|
| 20 |
|
| 21 |
def __init__(self, config: InferenceConfig) -> None:
|
| 22 |
self.config = config
|
| 23 |
-
self.client =
|
|
|
|
|
|
|
|
|
|
|
|
|
| 24 |
|
| 25 |
def propose_action(self, observation: Any) -> AgentDecision:
|
| 26 |
if self.client is None:
|
| 27 |
-
return AgentDecision(action_type="run_tests", source="fallback", error="
|
| 28 |
|
| 29 |
prompt = self._build_prompt(observation)
|
| 30 |
for attempt in range(self.config.max_retries + 1):
|
|
|
|
| 20 |
|
| 21 |
def __init__(self, config: InferenceConfig) -> None:
|
| 22 |
self.config = config
|
| 23 |
+
self.client = (
|
| 24 |
+
OpenAI(base_url=config.api_base_url, api_key=config.api_key, timeout=config.request_timeout_s)
|
| 25 |
+
if config.api_key
|
| 26 |
+
else None
|
| 27 |
+
)
|
| 28 |
|
| 29 |
def propose_action(self, observation: Any) -> AgentDecision:
|
| 30 |
if self.client is None:
|
| 31 |
+
return AgentDecision(action_type="run_tests", source="fallback", error="API key missing")
|
| 32 |
|
| 33 |
prompt = self._build_prompt(observation)
|
| 34 |
for attempt in range(self.config.max_retries + 1):
|
app/streamlit_app.py
CHANGED
|
@@ -1,44 +1,75 @@
|
|
| 1 |
-
"""Streamlit frontend for the
|
| 2 |
-
|
| 3 |
-
from __future__ import annotations
|
| 4 |
-
|
| 5 |
-
import streamlit as st
|
| 6 |
|
| 7 |
from app.examples import EXAMPLES
|
| 8 |
from schemas.request import AnalyzeCodeRequest
|
| 9 |
from services.analysis_service import AnalysisService
|
| 10 |
|
| 11 |
|
| 12 |
-
analysis_service = AnalysisService()
|
| 13 |
|
| 14 |
|
| 15 |
-
def _analyze(code: str, context_window: str, traceback_text: str, domain_hint: str):
|
| 16 |
-
"""Run the analysis service with validated request payloads."""
|
| 17 |
|
| 18 |
request = AnalyzeCodeRequest(
|
| 19 |
code=code,
|
| 20 |
context_window=context_window,
|
| 21 |
traceback_text=traceback_text,
|
| 22 |
domain_hint=domain_hint, # type: ignore[arg-type]
|
| 23 |
-
)
|
| 24 |
-
return analysis_service.analyze(request)
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
def
|
| 28 |
-
"""
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 42 |
traceback_text = st.text_area("Optional traceback / runtime hint", value=example["traceback_text"], height=100)
|
| 43 |
domain_hint = st.selectbox("Domain hint", ["auto", "dsa", "data_science", "ml_dl", "web"], index=["auto", "dsa", "data_science", "ml_dl", "web"].index(example["domain_hint"]))
|
| 44 |
analyze_clicked = st.button("Analyze Code", type="primary")
|
|
@@ -47,53 +78,83 @@ def main() -> None:
|
|
| 47 |
if code and (analyze_clicked or auto_analyze):
|
| 48 |
result = _analyze(code, context_window, traceback_text, domain_hint)
|
| 49 |
|
| 50 |
-
with right:
|
| 51 |
-
if result is None:
|
| 52 |
-
st.info("Paste code or load an example to start analysis.")
|
| 53 |
-
else:
|
| 54 |
-
metric_cols = st.columns(4)
|
| 55 |
-
metric_cols[0].metric("Detected domain", result.detected_domain)
|
| 56 |
-
metric_cols[1].metric("ML score", f"{result.score_breakdown.ml_score:.0%}")
|
| 57 |
-
metric_cols[2].metric("
|
| 58 |
-
metric_cols[3].metric("Reward", f"{result.score_breakdown.reward:.0%}")
|
| 59 |
-
st.
|
| 60 |
-
st.
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
st.
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
)
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
st.
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
st.subheader("
|
| 91 |
-
st.
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 97 |
|
| 98 |
|
| 99 |
if __name__ == "__main__":
|
|
|
|
| 1 |
+
"""Streamlit frontend for the AI-powered Python code review platform."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import streamlit as st
|
| 6 |
|
| 7 |
from app.examples import EXAMPLES
|
| 8 |
from schemas.request import AnalyzeCodeRequest
|
| 9 |
from services.analysis_service import AnalysisService
|
| 10 |
|
| 11 |
|
| 12 |
+
analysis_service = AnalysisService()
|
| 13 |
|
| 14 |
|
| 15 |
+
def _analyze(code: str, context_window: str, traceback_text: str, domain_hint: str):
|
| 16 |
+
"""Run the analysis service with validated request payloads."""
|
| 17 |
|
| 18 |
request = AnalyzeCodeRequest(
|
| 19 |
code=code,
|
| 20 |
context_window=context_window,
|
| 21 |
traceback_text=traceback_text,
|
| 22 |
domain_hint=domain_hint, # type: ignore[arg-type]
|
| 23 |
+
)
|
| 24 |
+
return analysis_service.analyze(request)
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
def _score_chart_data(result) -> dict[str, float]:
|
| 28 |
+
"""Prepare the most useful score signals for visual display."""
|
| 29 |
+
|
| 30 |
+
return {
|
| 31 |
+
"reward": result.score_breakdown.reward,
|
| 32 |
+
"ml_quality": result.score_breakdown.ml_score,
|
| 33 |
+
"lint": result.score_breakdown.lint_score,
|
| 34 |
+
"maintainability": result.score_breakdown.maintainability_score,
|
| 35 |
+
"readability": result.score_breakdown.readability_score,
|
| 36 |
+
"security": result.score_breakdown.security_score,
|
| 37 |
+
}
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
def main() -> None:
|
| 41 |
+
"""Render the Streamlit UI."""
|
| 42 |
+
|
| 43 |
+
st.set_page_config(page_title="TorchReview Copilot", layout="wide")
|
| 44 |
+
st.title("TorchReview Copilot")
|
| 45 |
+
st.caption(
|
| 46 |
+
"AI-powered Python code review with static analysis, PyTorch scoring, "
|
| 47 |
+
"RL-ready rewards, and actionable code-improvement guidance."
|
| 48 |
+
)
|
| 49 |
+
|
| 50 |
+
with st.sidebar:
|
| 51 |
+
st.subheader("Review Pipeline")
|
| 52 |
+
st.markdown(
|
| 53 |
+
"\n".join(
|
| 54 |
+
[
|
| 55 |
+
"1. Input Python code",
|
| 56 |
+
"2. Parse AST + estimate complexity",
|
| 57 |
+
"3. Score with a PyTorch encoder",
|
| 58 |
+
"4. Generate suggestions and auto-fix hints",
|
| 59 |
+
"5. Compute an RL-ready reward",
|
| 60 |
+
]
|
| 61 |
+
)
|
| 62 |
+
)
|
| 63 |
+
example_name = st.selectbox("Example input", list(EXAMPLES.keys()))
|
| 64 |
+
auto_analyze = st.toggle("Real-time scoring", value=True)
|
| 65 |
+
st.info("The PyTorch layer uses CodeBERTa embeddings when weights are available, with a torch-native fallback for offline demos.")
|
| 66 |
+
|
| 67 |
+
example = EXAMPLES[example_name]
|
| 68 |
+
|
| 69 |
+
left, right = st.columns([1.2, 1.0])
|
| 70 |
+
with left:
|
| 71 |
+
code = st.text_area("Code input", value=example["code"], height=420)
|
| 72 |
+
context_window = st.text_area("Context window", value=example["context_window"], height=100)
|
| 73 |
traceback_text = st.text_area("Optional traceback / runtime hint", value=example["traceback_text"], height=100)
|
| 74 |
domain_hint = st.selectbox("Domain hint", ["auto", "dsa", "data_science", "ml_dl", "web"], index=["auto", "dsa", "data_science", "ml_dl", "web"].index(example["domain_hint"]))
|
| 75 |
analyze_clicked = st.button("Analyze Code", type="primary")
|
|
|
|
| 78 |
if code and (analyze_clicked or auto_analyze):
|
| 79 |
result = _analyze(code, context_window, traceback_text, domain_hint)
|
| 80 |
|
| 81 |
+
with right:
|
| 82 |
+
if result is None:
|
| 83 |
+
st.info("Paste code or load an example to start analysis.")
|
| 84 |
+
else:
|
| 85 |
+
metric_cols = st.columns(4)
|
| 86 |
+
metric_cols[0].metric("Detected domain", result.detected_domain)
|
| 87 |
+
metric_cols[1].metric("ML score", f"{result.score_breakdown.ml_score:.0%}")
|
| 88 |
+
metric_cols[2].metric("Lint score", f"{result.score_breakdown.lint_score:.0%}")
|
| 89 |
+
metric_cols[3].metric("Reward", f"{result.score_breakdown.reward:.0%}")
|
| 90 |
+
st.subheader("Domain Confidence")
|
| 91 |
+
st.bar_chart(result.domain_confidences)
|
| 92 |
+
st.subheader("Review Signal Radar")
|
| 93 |
+
st.bar_chart(_score_chart_data(result))
|
| 94 |
+
st.code(
|
| 95 |
+
"reward = 0.50*ml_score + 0.18*lint + 0.12*maintainability "
|
| 96 |
+
"+ 0.10*domain + 0.10*security - 0.20*complexity",
|
| 97 |
+
language="text",
|
| 98 |
+
)
|
| 99 |
+
st.caption(result.summary)
|
| 100 |
+
|
| 101 |
+
if result is not None:
|
| 102 |
+
overview_tab, suggestions_tab, domain_tab, static_tab = st.tabs(
|
| 103 |
+
["Overview", "Suggestions", "Domain Detail", "Static Analysis"]
|
| 104 |
+
)
|
| 105 |
+
|
| 106 |
+
with overview_tab:
|
| 107 |
+
st.subheader("Reward Breakdown")
|
| 108 |
+
st.json(result.score_visualization)
|
| 109 |
+
st.subheader("Top Signals")
|
| 110 |
+
signal_cols = st.columns(3)
|
| 111 |
+
signal_cols[0].progress(result.score_breakdown.quality_signal, text="Quality signal")
|
| 112 |
+
signal_cols[1].progress(result.score_breakdown.error_reduction_signal, text="Error reduction")
|
| 113 |
+
signal_cols[2].progress(result.score_breakdown.completion_signal, text="Completion")
|
| 114 |
+
st.subheader("Improvement Plan")
|
| 115 |
+
for step in result.improvement_plan:
|
| 116 |
+
st.write(f"- {step}")
|
| 117 |
+
if result.auto_fix_preview:
|
| 118 |
+
st.subheader("Auto-Fix Preview")
|
| 119 |
+
for hint in result.auto_fix_preview:
|
| 120 |
+
st.write(f"- {hint}")
|
| 121 |
+
st.subheader("Complexity")
|
| 122 |
+
st.write(
|
| 123 |
+
{
|
| 124 |
+
"time_complexity": result.static_analysis.time_complexity,
|
| 125 |
+
"space_complexity": result.static_analysis.space_complexity,
|
| 126 |
+
"cyclomatic_complexity": result.static_analysis.cyclomatic_complexity,
|
| 127 |
+
"max_nesting_depth": result.static_analysis.max_nesting_depth,
|
| 128 |
+
}
|
| 129 |
+
)
|
| 130 |
+
|
| 131 |
+
with suggestions_tab:
|
| 132 |
+
st.subheader("Suggestions")
|
| 133 |
+
for suggestion in result.suggestions:
|
| 134 |
+
st.write(f"- [{suggestion.priority}] {suggestion.title}: {suggestion.action}")
|
| 135 |
+
if result.domain_analysis.suggestions:
|
| 136 |
+
st.subheader("Domain Hints")
|
| 137 |
+
for item in result.domain_analysis.suggestions:
|
| 138 |
+
st.write(f"- {item}")
|
| 139 |
+
if result.domain_analysis.issues or result.static_analysis.issues:
|
| 140 |
+
st.subheader("Issues")
|
| 141 |
+
for issue in result.domain_analysis.issues + result.static_analysis.issues:
|
| 142 |
+
st.write(f"- [{issue.severity}] {issue.title}: {issue.description}")
|
| 143 |
+
|
| 144 |
+
with domain_tab:
|
| 145 |
+
st.subheader("Domain Highlights")
|
| 146 |
+
st.json(result.domain_analysis.highlights)
|
| 147 |
+
st.write(f"Domain score: {result.domain_analysis.domain_score:.0%}")
|
| 148 |
+
st.write(f"Model label: {result.model_prediction.quality_label}")
|
| 149 |
+
st.write(f"Model backend: `{result.model_backend}`")
|
| 150 |
+
if result.model_prediction.notes:
|
| 151 |
+
st.subheader("Model Notes")
|
| 152 |
+
for note in result.model_prediction.notes:
|
| 153 |
+
st.write(f"- {note}")
|
| 154 |
+
|
| 155 |
+
with static_tab:
|
| 156 |
+
st.subheader("Static Analysis")
|
| 157 |
+
st.json(result.static_analysis.model_dump())
|
| 158 |
|
| 159 |
|
| 160 |
if __name__ == "__main__":
|
app/utils/runtime.py
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
"""Formatting, parsing, and IO-suppression helpers for inference."""
|
| 2 |
|
| 3 |
from __future__ import annotations
|
| 4 |
|
|
@@ -7,10 +7,14 @@ from collections.abc import Iterable
|
|
| 7 |
from contextlib import contextmanager, redirect_stderr, redirect_stdout
|
| 8 |
from typing import Any, Iterator
|
| 9 |
|
| 10 |
-
try:
|
| 11 |
-
from tasks import task_ids
|
| 12 |
-
except ImportError: # pragma: no cover
|
| 13 |
-
from python_env.tasks import task_ids # type: ignore[no-redef]
|
|
|
|
|
|
|
|
|
|
|
|
|
| 14 |
|
| 15 |
|
| 16 |
def compact_text(
|
|
@@ -51,21 +55,28 @@ def observation_attr(observation: Any, name: str, default: Any = None, *, preser
|
|
| 51 |
return value
|
| 52 |
|
| 53 |
|
| 54 |
-
def format_bool(value: Any) -> str:
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
return
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 69 |
|
| 70 |
|
| 71 |
def parse_task_ids() -> list[str]:
|
|
|
|
| 1 |
+
"""Formatting, parsing, and IO-suppression helpers for inference."""
|
| 2 |
|
| 3 |
from __future__ import annotations
|
| 4 |
|
|
|
|
| 7 |
from contextlib import contextmanager, redirect_stderr, redirect_stdout
|
| 8 |
from typing import Any, Iterator
|
| 9 |
|
| 10 |
+
try:
|
| 11 |
+
from tasks import task_ids
|
| 12 |
+
except ImportError: # pragma: no cover
|
| 13 |
+
from python_env.tasks import task_ids # type: ignore[no-redef]
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
MIN_DISPLAY_REWARD = 0.01
|
| 17 |
+
MAX_DISPLAY_REWARD = 0.99
|
| 18 |
|
| 19 |
|
| 20 |
def compact_text(
|
|
|
|
| 55 |
return value
|
| 56 |
|
| 57 |
|
| 58 |
+
def format_bool(value: Any) -> str:
|
| 59 |
+
"""Render booleans in the lowercase form required by OpenEnv."""
|
| 60 |
+
|
| 61 |
+
return "true" if bool(value) else "false"
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
def format_reward(value: Any) -> str:
|
| 65 |
+
"""Render rewards in a validator-safe two-decimal open interval."""
|
| 66 |
+
|
| 67 |
+
try:
|
| 68 |
+
reward = float(value)
|
| 69 |
+
except Exception:
|
| 70 |
+
reward = MIN_DISPLAY_REWARD
|
| 71 |
+
reward = max(MIN_DISPLAY_REWARD, min(MAX_DISPLAY_REWARD, reward))
|
| 72 |
+
return f"{reward:.2f}"
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def format_error(value: Any) -> str:
|
| 76 |
+
"""Render nullable error strings in the stdout contract format."""
|
| 77 |
+
|
| 78 |
+
text = compact_text(value, default="")
|
| 79 |
+
return text if text else "null"
|
| 80 |
|
| 81 |
|
| 82 |
def parse_task_ids() -> list[str]:
|
client.py
CHANGED
|
@@ -2,16 +2,23 @@
|
|
| 2 |
|
| 3 |
from __future__ import annotations
|
| 4 |
|
| 5 |
-
from typing import Dict
|
| 6 |
-
|
| 7 |
-
from openenv.core import EnvClient
|
| 8 |
-
from openenv.core.client_types import StepResult
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 15 |
|
| 16 |
|
| 17 |
class PythonCodeReviewEnv(
|
|
|
|
| 2 |
|
| 3 |
from __future__ import annotations
|
| 4 |
|
| 5 |
+
from typing import Dict
|
| 6 |
+
|
| 7 |
+
from openenv.core import EnvClient
|
| 8 |
+
from openenv.core.client_types import StepResult
|
| 9 |
+
|
| 10 |
+
try:
|
| 11 |
+
from .models import (
|
| 12 |
+
PythonCodeReviewAction,
|
| 13 |
+
PythonCodeReviewObservation,
|
| 14 |
+
PythonCodeReviewState,
|
| 15 |
+
)
|
| 16 |
+
except ImportError: # pragma: no cover
|
| 17 |
+
from models import (
|
| 18 |
+
PythonCodeReviewAction,
|
| 19 |
+
PythonCodeReviewObservation,
|
| 20 |
+
PythonCodeReviewState,
|
| 21 |
+
)
|
| 22 |
|
| 23 |
|
| 24 |
class PythonCodeReviewEnv(
|
graders/bug_fix.py
CHANGED
|
@@ -12,10 +12,10 @@ except ImportError:
|
|
| 12 |
from .shared import (
|
| 13 |
base_grade,
|
| 14 |
compile_code,
|
|
|
|
| 15 |
component_score,
|
| 16 |
execute_cases,
|
| 17 |
quality_metrics,
|
| 18 |
-
shaped_score,
|
| 19 |
similarity_score,
|
| 20 |
summarize_results,
|
| 21 |
)
|
|
@@ -32,6 +32,7 @@ def grade_bug_fix_task(
|
|
| 32 |
|
| 33 |
compiled, compile_error = compile_code(code)
|
| 34 |
quality = quality_metrics(code, task.function_name)
|
|
|
|
| 35 |
details = {
|
| 36 |
"compile_error": compile_error,
|
| 37 |
"quality_notes": quality["quality_notes"],
|
|
@@ -40,11 +41,18 @@ def grade_bug_fix_task(
|
|
| 40 |
}
|
| 41 |
|
| 42 |
if not compiled:
|
| 43 |
-
progress = 0.02 + 0.12 * similarity_score(code, task.reference_code)
|
| 44 |
details["test_results"] = []
|
| 45 |
details["test_summary"] = "Code does not compile."
|
| 46 |
return base_grade(
|
| 47 |
-
score=
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 48 |
syntax_score=component_score(0.01),
|
| 49 |
tests_passed=0,
|
| 50 |
tests_total=len(task.public_cases) + (len(task.hidden_cases) if include_hidden else 0),
|
|
@@ -59,9 +67,16 @@ def grade_bug_fix_task(
|
|
| 59 |
if result.get("timed_out"):
|
| 60 |
details["test_results"] = []
|
| 61 |
details["test_summary"] = result["error"]
|
| 62 |
-
progress = 0.12 + 0.18 * quality["score"]
|
| 63 |
return base_grade(
|
| 64 |
-
score=
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 65 |
syntax_score=component_score(0.95),
|
| 66 |
tests_passed=0,
|
| 67 |
tests_total=len(cases),
|
|
@@ -73,9 +88,16 @@ def grade_bug_fix_task(
|
|
| 73 |
if "error" in result:
|
| 74 |
details["test_results"] = []
|
| 75 |
details["test_summary"] = result["error"]
|
| 76 |
-
progress = 0.1 + 0.2 * quality["score"]
|
| 77 |
return base_grade(
|
| 78 |
-
score=
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 79 |
syntax_score=component_score(0.95),
|
| 80 |
tests_passed=0,
|
| 81 |
tests_total=len(cases),
|
|
@@ -89,9 +111,15 @@ def grade_bug_fix_task(
|
|
| 89 |
pass_rate = data["passed"] / max(data["total"], 1)
|
| 90 |
details["test_results"] = data["results"]
|
| 91 |
details["test_summary"] = summarize_results("Test results", data["results"])
|
| 92 |
-
progress = min(1.0, 0.05 + 0.8 * pass_rate + 0.15 * quality["score"])
|
| 93 |
return base_grade(
|
| 94 |
-
score=
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 95 |
syntax_score=component_score(0.95),
|
| 96 |
tests_passed=data["passed"],
|
| 97 |
tests_total=data["total"],
|
|
|
|
| 12 |
from .shared import (
|
| 13 |
base_grade,
|
| 14 |
compile_code,
|
| 15 |
+
composite_grade_score,
|
| 16 |
component_score,
|
| 17 |
execute_cases,
|
| 18 |
quality_metrics,
|
|
|
|
| 19 |
similarity_score,
|
| 20 |
summarize_results,
|
| 21 |
)
|
|
|
|
| 32 |
|
| 33 |
compiled, compile_error = compile_code(code)
|
| 34 |
quality = quality_metrics(code, task.function_name)
|
| 35 |
+
similarity = similarity_score(code, task.reference_code)
|
| 36 |
details = {
|
| 37 |
"compile_error": compile_error,
|
| 38 |
"quality_notes": quality["quality_notes"],
|
|
|
|
| 41 |
}
|
| 42 |
|
| 43 |
if not compiled:
|
|
|
|
| 44 |
details["test_results"] = []
|
| 45 |
details["test_summary"] = "Code does not compile."
|
| 46 |
return base_grade(
|
| 47 |
+
score=composite_grade_score(
|
| 48 |
+
correctness=0.0,
|
| 49 |
+
quality=0.05,
|
| 50 |
+
runtime=0.05,
|
| 51 |
+
syntax=0.0,
|
| 52 |
+
similarity=similarity,
|
| 53 |
+
baseline=0.04,
|
| 54 |
+
penalty=0.05,
|
| 55 |
+
),
|
| 56 |
syntax_score=component_score(0.01),
|
| 57 |
tests_passed=0,
|
| 58 |
tests_total=len(task.public_cases) + (len(task.hidden_cases) if include_hidden else 0),
|
|
|
|
| 67 |
if result.get("timed_out"):
|
| 68 |
details["test_results"] = []
|
| 69 |
details["test_summary"] = result["error"]
|
|
|
|
| 70 |
return base_grade(
|
| 71 |
+
score=composite_grade_score(
|
| 72 |
+
correctness=0.10,
|
| 73 |
+
quality=quality["score"],
|
| 74 |
+
runtime=0.0,
|
| 75 |
+
syntax=0.95,
|
| 76 |
+
similarity=similarity,
|
| 77 |
+
baseline=0.06,
|
| 78 |
+
penalty=0.12,
|
| 79 |
+
),
|
| 80 |
syntax_score=component_score(0.95),
|
| 81 |
tests_passed=0,
|
| 82 |
tests_total=len(cases),
|
|
|
|
| 88 |
if "error" in result:
|
| 89 |
details["test_results"] = []
|
| 90 |
details["test_summary"] = result["error"]
|
|
|
|
| 91 |
return base_grade(
|
| 92 |
+
score=composite_grade_score(
|
| 93 |
+
correctness=0.12,
|
| 94 |
+
quality=quality["score"],
|
| 95 |
+
runtime=0.0,
|
| 96 |
+
syntax=0.95,
|
| 97 |
+
similarity=similarity,
|
| 98 |
+
baseline=0.06,
|
| 99 |
+
penalty=0.08,
|
| 100 |
+
),
|
| 101 |
syntax_score=component_score(0.95),
|
| 102 |
tests_passed=0,
|
| 103 |
tests_total=len(cases),
|
|
|
|
| 111 |
pass_rate = data["passed"] / max(data["total"], 1)
|
| 112 |
details["test_results"] = data["results"]
|
| 113 |
details["test_summary"] = summarize_results("Test results", data["results"])
|
|
|
|
| 114 |
return base_grade(
|
| 115 |
+
score=composite_grade_score(
|
| 116 |
+
correctness=pass_rate,
|
| 117 |
+
quality=quality["score"],
|
| 118 |
+
runtime=0.05,
|
| 119 |
+
syntax=0.95,
|
| 120 |
+
similarity=similarity,
|
| 121 |
+
baseline=0.08,
|
| 122 |
+
),
|
| 123 |
syntax_score=component_score(0.95),
|
| 124 |
tests_passed=data["passed"],
|
| 125 |
tests_total=data["total"],
|
graders/optimization.py
CHANGED
|
@@ -13,10 +13,10 @@ from .shared import (
|
|
| 13 |
base_grade,
|
| 14 |
benchmark_candidate,
|
| 15 |
compile_code,
|
|
|
|
| 16 |
component_score,
|
| 17 |
execute_cases,
|
| 18 |
quality_metrics,
|
| 19 |
-
shaped_score,
|
| 20 |
similarity_score,
|
| 21 |
summarize_results,
|
| 22 |
)
|
|
@@ -33,6 +33,7 @@ def grade_optimization_task(
|
|
| 33 |
|
| 34 |
compiled, compile_error = compile_code(code)
|
| 35 |
quality = quality_metrics(code, task.function_name)
|
|
|
|
| 36 |
details = {
|
| 37 |
"compile_error": compile_error,
|
| 38 |
"quality_notes": quality["quality_notes"],
|
|
@@ -41,11 +42,18 @@ def grade_optimization_task(
|
|
| 41 |
}
|
| 42 |
|
| 43 |
if not compiled:
|
| 44 |
-
progress = 0.02 + 0.1 * similarity_score(code, task.reference_code)
|
| 45 |
details["test_results"] = []
|
| 46 |
details["test_summary"] = "Code does not compile."
|
| 47 |
return base_grade(
|
| 48 |
-
score=
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 49 |
syntax_score=component_score(0.01),
|
| 50 |
tests_passed=0,
|
| 51 |
tests_total=len(task.public_cases) + (len(task.hidden_cases) if include_hidden else 0),
|
|
@@ -60,9 +68,16 @@ def grade_optimization_task(
|
|
| 60 |
if result.get("timed_out"):
|
| 61 |
details["test_results"] = []
|
| 62 |
details["test_summary"] = result["error"]
|
| 63 |
-
progress = 0.1 + 0.18 * quality["score"]
|
| 64 |
return base_grade(
|
| 65 |
-
score=
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 66 |
syntax_score=component_score(0.95),
|
| 67 |
tests_passed=0,
|
| 68 |
tests_total=len(cases),
|
|
@@ -74,9 +89,16 @@ def grade_optimization_task(
|
|
| 74 |
if "error" in result:
|
| 75 |
details["test_results"] = []
|
| 76 |
details["test_summary"] = result["error"]
|
| 77 |
-
progress = 0.1 + 0.2 * quality["score"]
|
| 78 |
return base_grade(
|
| 79 |
-
score=
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 80 |
syntax_score=component_score(0.95),
|
| 81 |
tests_passed=0,
|
| 82 |
tests_total=len(cases),
|
|
@@ -105,13 +127,16 @@ def grade_optimization_task(
|
|
| 105 |
details["benchmark"] = benchmark_summary
|
| 106 |
|
| 107 |
runtime_progress = 0.0 if benchmark_summary == "Benchmark deferred until hidden evaluation." else runtime_score
|
| 108 |
-
if include_hidden:
|
| 109 |
-
progress = min(1.0, 0.05 + 0.6 * pass_rate + 0.2 * quality["score"] + 0.15 * runtime_progress)
|
| 110 |
-
else:
|
| 111 |
-
progress = min(1.0, 0.05 + 0.7 * pass_rate + 0.25 * quality["score"])
|
| 112 |
-
|
| 113 |
return base_grade(
|
| 114 |
-
score=
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 115 |
syntax_score=component_score(0.95),
|
| 116 |
tests_passed=data["passed"],
|
| 117 |
tests_total=data["total"],
|
|
|
|
| 13 |
base_grade,
|
| 14 |
benchmark_candidate,
|
| 15 |
compile_code,
|
| 16 |
+
composite_grade_score,
|
| 17 |
component_score,
|
| 18 |
execute_cases,
|
| 19 |
quality_metrics,
|
|
|
|
| 20 |
similarity_score,
|
| 21 |
summarize_results,
|
| 22 |
)
|
|
|
|
| 33 |
|
| 34 |
compiled, compile_error = compile_code(code)
|
| 35 |
quality = quality_metrics(code, task.function_name)
|
| 36 |
+
similarity = similarity_score(code, task.reference_code)
|
| 37 |
details = {
|
| 38 |
"compile_error": compile_error,
|
| 39 |
"quality_notes": quality["quality_notes"],
|
|
|
|
| 42 |
}
|
| 43 |
|
| 44 |
if not compiled:
|
|
|
|
| 45 |
details["test_results"] = []
|
| 46 |
details["test_summary"] = "Code does not compile."
|
| 47 |
return base_grade(
|
| 48 |
+
score=composite_grade_score(
|
| 49 |
+
correctness=0.0,
|
| 50 |
+
quality=0.05,
|
| 51 |
+
runtime=0.0,
|
| 52 |
+
syntax=0.0,
|
| 53 |
+
similarity=similarity,
|
| 54 |
+
baseline=0.04,
|
| 55 |
+
penalty=0.06,
|
| 56 |
+
),
|
| 57 |
syntax_score=component_score(0.01),
|
| 58 |
tests_passed=0,
|
| 59 |
tests_total=len(task.public_cases) + (len(task.hidden_cases) if include_hidden else 0),
|
|
|
|
| 68 |
if result.get("timed_out"):
|
| 69 |
details["test_results"] = []
|
| 70 |
details["test_summary"] = result["error"]
|
|
|
|
| 71 |
return base_grade(
|
| 72 |
+
score=composite_grade_score(
|
| 73 |
+
correctness=0.08,
|
| 74 |
+
quality=quality["score"],
|
| 75 |
+
runtime=0.0,
|
| 76 |
+
syntax=0.95,
|
| 77 |
+
similarity=similarity,
|
| 78 |
+
baseline=0.05,
|
| 79 |
+
penalty=0.14,
|
| 80 |
+
),
|
| 81 |
syntax_score=component_score(0.95),
|
| 82 |
tests_passed=0,
|
| 83 |
tests_total=len(cases),
|
|
|
|
| 89 |
if "error" in result:
|
| 90 |
details["test_results"] = []
|
| 91 |
details["test_summary"] = result["error"]
|
|
|
|
| 92 |
return base_grade(
|
| 93 |
+
score=composite_grade_score(
|
| 94 |
+
correctness=0.10,
|
| 95 |
+
quality=quality["score"],
|
| 96 |
+
runtime=0.0,
|
| 97 |
+
syntax=0.95,
|
| 98 |
+
similarity=similarity,
|
| 99 |
+
baseline=0.05,
|
| 100 |
+
penalty=0.08,
|
| 101 |
+
),
|
| 102 |
syntax_score=component_score(0.95),
|
| 103 |
tests_passed=0,
|
| 104 |
tests_total=len(cases),
|
|
|
|
| 127 |
details["benchmark"] = benchmark_summary
|
| 128 |
|
| 129 |
runtime_progress = 0.0 if benchmark_summary == "Benchmark deferred until hidden evaluation." else runtime_score
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 130 |
return base_grade(
|
| 131 |
+
score=composite_grade_score(
|
| 132 |
+
correctness=pass_rate,
|
| 133 |
+
quality=quality["score"],
|
| 134 |
+
runtime=runtime_progress if include_hidden else 0.10,
|
| 135 |
+
syntax=0.95,
|
| 136 |
+
similarity=similarity,
|
| 137 |
+
baseline=0.08 if include_hidden else 0.07,
|
| 138 |
+
penalty=0.10 if timed_out else 0.0,
|
| 139 |
+
),
|
| 140 |
syntax_score=component_score(0.95),
|
| 141 |
tests_passed=data["passed"],
|
| 142 |
tests_total=data["total"],
|
graders/shared.py
CHANGED
|
@@ -23,6 +23,7 @@ STRICT_SCORE_MIN = 0.01
|
|
| 23 |
STRICT_SCORE_MAX = 0.99
|
| 24 |
POOR_SCORE = 0.1
|
| 25 |
NEAR_PERFECT_SCORE = 0.95
|
|
|
|
| 26 |
|
| 27 |
|
| 28 |
def finite_float(value: Any, fallback: float = STRICT_SCORE_MIN) -> float:
|
|
@@ -44,22 +45,45 @@ def clamp(value: float, lower: float = 0.0, upper: float = 1.0) -> float:
|
|
| 44 |
return max(lower, min(upper, numeric))
|
| 45 |
|
| 46 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 47 |
def strict_score(value: Any, lower: float = STRICT_SCORE_MIN, upper: float = STRICT_SCORE_MAX) -> float:
|
| 48 |
"""Clamp a score to the OpenEnv-safe open interval (0, 1)."""
|
| 49 |
|
| 50 |
score = max(lower, min(upper, finite_float(value, fallback=lower)))
|
| 51 |
-
score =
|
| 52 |
assert 0 < score < 1, f"Invalid score: {score}"
|
| 53 |
return score
|
| 54 |
|
| 55 |
|
| 56 |
def shaped_score(progress: Any, floor: float = POOR_SCORE, ceiling: float = NEAR_PERFECT_SCORE) -> float:
|
| 57 |
-
"""Map progress in [0, 1] to a
|
| 58 |
|
| 59 |
bounded_progress = clamp(finite_float(progress, fallback=0.0))
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
score =
|
|
|
|
| 63 |
assert 0 < score < 1, f"Invalid score: {score}"
|
| 64 |
return score
|
| 65 |
|
|
@@ -83,7 +107,56 @@ def safe_ratio(numerator: Any, denominator: Any) -> float:
|
|
| 83 |
def component_score(value: Any) -> float:
|
| 84 |
"""Normalize component scores such as syntax, quality, and runtime."""
|
| 85 |
|
| 86 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 87 |
|
| 88 |
|
| 89 |
def compile_code(code: str) -> tuple[bool, str]:
|
|
@@ -121,23 +194,31 @@ def _queue_worker(
|
|
| 121 |
)
|
| 122 |
|
| 123 |
|
| 124 |
-
def run_with_timeout(
|
| 125 |
-
worker: Callable[[Dict[str, Any]], Dict[str, Any]],
|
| 126 |
-
payload: Dict[str, Any],
|
| 127 |
-
timeout_s: float,
|
| 128 |
-
) -> Dict[str, Any]:
|
| 129 |
-
"""Execute a worker in a subprocess and terminate on timeout.
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
process.
|
| 140 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 141 |
|
| 142 |
if queue.empty():
|
| 143 |
return {"timed_out": False, "error": "Worker exited without returning a result."}
|
|
|
|
| 23 |
STRICT_SCORE_MAX = 0.99
|
| 24 |
POOR_SCORE = 0.1
|
| 25 |
NEAR_PERFECT_SCORE = 0.95
|
| 26 |
+
EPS = 1e-6
|
| 27 |
|
| 28 |
|
| 29 |
def finite_float(value: Any, fallback: float = STRICT_SCORE_MIN) -> float:
|
|
|
|
| 45 |
return max(lower, min(upper, numeric))
|
| 46 |
|
| 47 |
|
| 48 |
+
def safe_score(score: Any) -> float:
|
| 49 |
+
"""Clamp any score to the strict OpenEnv-safe open interval (0, 1)."""
|
| 50 |
+
|
| 51 |
+
bounded = max(EPS, min(1.0 - EPS, finite_float(score, fallback=EPS)))
|
| 52 |
+
assert 0 < bounded < 1, f"Score must be strictly between 0 and 1: {bounded}"
|
| 53 |
+
return bounded
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
def normalize_score(x: Any) -> float:
|
| 57 |
+
"""Sigmoid-normalize a raw score and clamp it safely into (0, 1)."""
|
| 58 |
+
|
| 59 |
+
numeric = finite_float(x, fallback=0.0)
|
| 60 |
+
bounded = max(-20.0, min(20.0, numeric))
|
| 61 |
+
return safe_score(1.0 / (1.0 + math.exp(-bounded)))
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
def final_score_pipeline(raw_score: Any) -> float:
|
| 65 |
+
"""Normalize arbitrary raw scoring signals into a strict OpenEnv-safe score."""
|
| 66 |
+
|
| 67 |
+
return normalize_score(raw_score)
|
| 68 |
+
|
| 69 |
+
|
| 70 |
def strict_score(value: Any, lower: float = STRICT_SCORE_MIN, upper: float = STRICT_SCORE_MAX) -> float:
|
| 71 |
"""Clamp a score to the OpenEnv-safe open interval (0, 1)."""
|
| 72 |
|
| 73 |
score = max(lower, min(upper, finite_float(value, fallback=lower)))
|
| 74 |
+
score = safe_score(score)
|
| 75 |
assert 0 < score < 1, f"Invalid score: {score}"
|
| 76 |
return score
|
| 77 |
|
| 78 |
|
| 79 |
def shaped_score(progress: Any, floor: float = POOR_SCORE, ceiling: float = NEAR_PERFECT_SCORE) -> float:
|
| 80 |
+
"""Map progress in [0, 1] to a smooth score band within (0, 1)."""
|
| 81 |
|
| 82 |
bounded_progress = clamp(finite_float(progress, fallback=0.0))
|
| 83 |
+
centered_progress = (bounded_progress - 0.5) * 6.0
|
| 84 |
+
smoothed_progress = final_score_pipeline(centered_progress)
|
| 85 |
+
score = floor + (ceiling - floor) * smoothed_progress
|
| 86 |
+
score = safe_score(score)
|
| 87 |
assert 0 < score < 1, f"Invalid score: {score}"
|
| 88 |
return score
|
| 89 |
|
|
|
|
| 107 |
def component_score(value: Any) -> float:
|
| 108 |
"""Normalize component scores such as syntax, quality, and runtime."""
|
| 109 |
|
| 110 |
+
bounded_value = clamp(finite_float(value, fallback=0.0))
|
| 111 |
+
return shaped_score(bounded_value, floor=0.02, ceiling=0.98)
|
| 112 |
+
|
| 113 |
+
|
| 114 |
+
def composite_progress(
|
| 115 |
+
*,
|
| 116 |
+
correctness: Any = 0.0,
|
| 117 |
+
quality: Any = 0.0,
|
| 118 |
+
runtime: Any = 0.0,
|
| 119 |
+
syntax: Any = 0.0,
|
| 120 |
+
similarity: Any = 0.0,
|
| 121 |
+
baseline: float = 0.05,
|
| 122 |
+
penalty: Any = 0.0,
|
| 123 |
+
) -> float:
|
| 124 |
+
"""Blend multiple progress signals into a stable scalar progress estimate."""
|
| 125 |
+
|
| 126 |
+
progress = (
|
| 127 |
+
finite_float(baseline, fallback=0.05)
|
| 128 |
+
+ 0.45 * clamp(correctness)
|
| 129 |
+
+ 0.20 * clamp(quality)
|
| 130 |
+
+ 0.15 * clamp(runtime)
|
| 131 |
+
+ 0.15 * clamp(syntax)
|
| 132 |
+
+ 0.05 * clamp(similarity)
|
| 133 |
+
- 0.20 * clamp(penalty)
|
| 134 |
+
)
|
| 135 |
+
return clamp(progress)
|
| 136 |
+
|
| 137 |
+
|
| 138 |
+
def composite_grade_score(
|
| 139 |
+
*,
|
| 140 |
+
correctness: Any = 0.0,
|
| 141 |
+
quality: Any = 0.0,
|
| 142 |
+
runtime: Any = 0.0,
|
| 143 |
+
syntax: Any = 0.0,
|
| 144 |
+
similarity: Any = 0.0,
|
| 145 |
+
baseline: float = 0.05,
|
| 146 |
+
penalty: Any = 0.0,
|
| 147 |
+
) -> float:
|
| 148 |
+
"""Create a smooth task score from multiple bounded signals."""
|
| 149 |
+
|
| 150 |
+
progress = composite_progress(
|
| 151 |
+
correctness=correctness,
|
| 152 |
+
quality=quality,
|
| 153 |
+
runtime=runtime,
|
| 154 |
+
syntax=syntax,
|
| 155 |
+
similarity=similarity,
|
| 156 |
+
baseline=baseline,
|
| 157 |
+
penalty=penalty,
|
| 158 |
+
)
|
| 159 |
+
return shaped_score(progress)
|
| 160 |
|
| 161 |
|
| 162 |
def compile_code(code: str) -> tuple[bool, str]:
|
|
|
|
| 194 |
)
|
| 195 |
|
| 196 |
|
| 197 |
+
def run_with_timeout(
|
| 198 |
+
worker: Callable[[Dict[str, Any]], Dict[str, Any]],
|
| 199 |
+
payload: Dict[str, Any],
|
| 200 |
+
timeout_s: float,
|
| 201 |
+
) -> Dict[str, Any]:
|
| 202 |
+
"""Execute a worker in a subprocess and terminate on timeout.
|
| 203 |
+
|
| 204 |
+
Some constrained Windows environments disallow spawned pipes or child
|
| 205 |
+
processes. In those cases, fall back to the inline timeout path so local
|
| 206 |
+
demos and tests still work deterministically.
|
| 207 |
+
"""
|
| 208 |
+
|
| 209 |
+
try:
|
| 210 |
+
ctx = mp.get_context("spawn")
|
| 211 |
+
queue = ctx.Queue()
|
| 212 |
+
process = ctx.Process(target=_queue_worker, args=(worker, payload, queue))
|
| 213 |
+
process.start()
|
| 214 |
+
process.join(timeout_s)
|
| 215 |
+
except (PermissionError, OSError):
|
| 216 |
+
return run_inline_with_timeout(worker, payload, timeout_s)
|
| 217 |
+
|
| 218 |
+
if process.is_alive():
|
| 219 |
+
process.terminate()
|
| 220 |
+
process.join()
|
| 221 |
+
return {"timed_out": True, "error": f"Execution exceeded {timeout_s:.1f}s timeout."}
|
| 222 |
|
| 223 |
if queue.empty():
|
| 224 |
return {"timed_out": False, "error": "Worker exited without returning a result."}
|
graders/syntax.py
CHANGED
|
@@ -12,10 +12,10 @@ except ImportError:
|
|
| 12 |
from .shared import (
|
| 13 |
base_grade,
|
| 14 |
compile_code,
|
|
|
|
| 15 |
component_score,
|
| 16 |
execute_cases,
|
| 17 |
quality_metrics,
|
| 18 |
-
shaped_score,
|
| 19 |
similarity_score,
|
| 20 |
summarize_results,
|
| 21 |
)
|
|
@@ -26,6 +26,7 @@ def grade_syntax_task(task: ReviewTask, code: str, timeout_s: float = 2.0) -> Ta
|
|
| 26 |
|
| 27 |
compiled, compile_error = compile_code(code)
|
| 28 |
quality = quality_metrics(code, task.function_name)
|
|
|
|
| 29 |
details = {
|
| 30 |
"compile_error": compile_error,
|
| 31 |
"quality_notes": quality["quality_notes"],
|
|
@@ -33,11 +34,18 @@ def grade_syntax_task(task: ReviewTask, code: str, timeout_s: float = 2.0) -> Ta
|
|
| 33 |
}
|
| 34 |
|
| 35 |
if not compiled:
|
| 36 |
-
progress = 0.05 + 0.2 * similarity_score(code, task.reference_code)
|
| 37 |
details["test_results"] = []
|
| 38 |
details["test_summary"] = "Code does not compile yet."
|
| 39 |
return base_grade(
|
| 40 |
-
score=
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 41 |
syntax_score=component_score(0.01),
|
| 42 |
tests_passed=0,
|
| 43 |
tests_total=len(task.public_cases) + len(task.hidden_cases),
|
|
@@ -52,9 +60,16 @@ def grade_syntax_task(task: ReviewTask, code: str, timeout_s: float = 2.0) -> Ta
|
|
| 52 |
if result.get("timed_out"):
|
| 53 |
details["test_results"] = []
|
| 54 |
details["test_summary"] = result["error"]
|
| 55 |
-
progress = 0.2 + 0.25 * quality["score"]
|
| 56 |
return base_grade(
|
| 57 |
-
score=
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 58 |
syntax_score=component_score(0.95),
|
| 59 |
tests_passed=0,
|
| 60 |
tests_total=len(cases),
|
|
@@ -66,9 +81,16 @@ def grade_syntax_task(task: ReviewTask, code: str, timeout_s: float = 2.0) -> Ta
|
|
| 66 |
if "error" in result:
|
| 67 |
details["test_results"] = []
|
| 68 |
details["test_summary"] = result["error"]
|
| 69 |
-
progress = 0.18 + 0.2 * quality["score"]
|
| 70 |
return base_grade(
|
| 71 |
-
score=
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 72 |
syntax_score=component_score(0.95),
|
| 73 |
tests_passed=0,
|
| 74 |
tests_total=len(cases),
|
|
@@ -82,9 +104,15 @@ def grade_syntax_task(task: ReviewTask, code: str, timeout_s: float = 2.0) -> Ta
|
|
| 82 |
details["test_results"] = data["results"]
|
| 83 |
details["test_summary"] = summarize_results("Validation checks", data["results"])
|
| 84 |
pass_rate = data["passed"] / max(data["total"], 1)
|
| 85 |
-
progress = min(1.0, 0.15 + 0.75 * pass_rate + 0.1 * quality["score"])
|
| 86 |
return base_grade(
|
| 87 |
-
score=
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 88 |
syntax_score=component_score(0.95),
|
| 89 |
tests_passed=data["passed"],
|
| 90 |
tests_total=data["total"],
|
|
|
|
| 12 |
from .shared import (
|
| 13 |
base_grade,
|
| 14 |
compile_code,
|
| 15 |
+
composite_grade_score,
|
| 16 |
component_score,
|
| 17 |
execute_cases,
|
| 18 |
quality_metrics,
|
|
|
|
| 19 |
similarity_score,
|
| 20 |
summarize_results,
|
| 21 |
)
|
|
|
|
| 26 |
|
| 27 |
compiled, compile_error = compile_code(code)
|
| 28 |
quality = quality_metrics(code, task.function_name)
|
| 29 |
+
similarity = similarity_score(code, task.reference_code)
|
| 30 |
details = {
|
| 31 |
"compile_error": compile_error,
|
| 32 |
"quality_notes": quality["quality_notes"],
|
|
|
|
| 34 |
}
|
| 35 |
|
| 36 |
if not compiled:
|
|
|
|
| 37 |
details["test_results"] = []
|
| 38 |
details["test_summary"] = "Code does not compile yet."
|
| 39 |
return base_grade(
|
| 40 |
+
score=composite_grade_score(
|
| 41 |
+
correctness=0.0,
|
| 42 |
+
quality=0.05,
|
| 43 |
+
runtime=0.05,
|
| 44 |
+
syntax=0.0,
|
| 45 |
+
similarity=similarity,
|
| 46 |
+
baseline=0.05,
|
| 47 |
+
penalty=0.05,
|
| 48 |
+
),
|
| 49 |
syntax_score=component_score(0.01),
|
| 50 |
tests_passed=0,
|
| 51 |
tests_total=len(task.public_cases) + len(task.hidden_cases),
|
|
|
|
| 60 |
if result.get("timed_out"):
|
| 61 |
details["test_results"] = []
|
| 62 |
details["test_summary"] = result["error"]
|
|
|
|
| 63 |
return base_grade(
|
| 64 |
+
score=composite_grade_score(
|
| 65 |
+
correctness=0.15,
|
| 66 |
+
quality=quality["score"],
|
| 67 |
+
runtime=0.0,
|
| 68 |
+
syntax=0.95,
|
| 69 |
+
similarity=similarity,
|
| 70 |
+
baseline=0.08,
|
| 71 |
+
penalty=0.12,
|
| 72 |
+
),
|
| 73 |
syntax_score=component_score(0.95),
|
| 74 |
tests_passed=0,
|
| 75 |
tests_total=len(cases),
|
|
|
|
| 81 |
if "error" in result:
|
| 82 |
details["test_results"] = []
|
| 83 |
details["test_summary"] = result["error"]
|
|
|
|
| 84 |
return base_grade(
|
| 85 |
+
score=composite_grade_score(
|
| 86 |
+
correctness=0.18,
|
| 87 |
+
quality=quality["score"],
|
| 88 |
+
runtime=0.0,
|
| 89 |
+
syntax=0.95,
|
| 90 |
+
similarity=similarity,
|
| 91 |
+
baseline=0.08,
|
| 92 |
+
penalty=0.08,
|
| 93 |
+
),
|
| 94 |
syntax_score=component_score(0.95),
|
| 95 |
tests_passed=0,
|
| 96 |
tests_total=len(cases),
|
|
|
|
| 104 |
details["test_results"] = data["results"]
|
| 105 |
details["test_summary"] = summarize_results("Validation checks", data["results"])
|
| 106 |
pass_rate = data["passed"] / max(data["total"], 1)
|
|
|
|
| 107 |
return base_grade(
|
| 108 |
+
score=composite_grade_score(
|
| 109 |
+
correctness=pass_rate,
|
| 110 |
+
quality=quality["score"],
|
| 111 |
+
runtime=0.05,
|
| 112 |
+
syntax=0.95,
|
| 113 |
+
similarity=similarity,
|
| 114 |
+
baseline=0.10,
|
| 115 |
+
),
|
| 116 |
syntax_score=component_score(0.95),
|
| 117 |
tests_passed=data["passed"],
|
| 118 |
tests_total=data["total"],
|
models/pytorch_model.py
CHANGED
|
@@ -1,149 +1,227 @@
|
|
| 1 |
-
"""PyTorch + transformers model wrapper for
|
| 2 |
-
|
| 3 |
-
from __future__ import annotations
|
| 4 |
-
|
| 5 |
-
import hashlib
|
| 6 |
-
from typing import Dict, List, Sequence
|
| 7 |
-
|
| 8 |
-
import torch
|
| 9 |
-
import torch.nn.functional as F
|
| 10 |
-
|
| 11 |
-
try:
|
| 12 |
-
from transformers import AutoModel, AutoTokenizer
|
| 13 |
-
except Exception:
|
| 14 |
-
AutoModel = None # type: ignore[assignment]
|
| 15 |
-
AutoTokenizer = None # type: ignore[assignment]
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
DOMAIN_PROTOTYPES: Dict[str, List[str]] = {
|
| 19 |
-
"dsa": [
|
| 20 |
-
"
|
| 21 |
-
"Competitive programming
|
| 22 |
-
],
|
| 23 |
-
"data_science": [
|
| 24 |
-
"Pandas dataframe transformation, numpy vectorization, feature
|
| 25 |
-
"
|
| 26 |
-
],
|
| 27 |
-
"ml_dl": [
|
| 28 |
-
"PyTorch model
|
| 29 |
-
"Machine learning
|
| 30 |
-
],
|
| 31 |
-
"web": [
|
| 32 |
-
"FastAPI endpoint
|
| 33 |
-
"
|
| 34 |
-
],
|
| 35 |
-
"general": [
|
| 36 |
-
"General Python utility code with
|
| 37 |
-
],
|
| 38 |
-
}
|
| 39 |
-
|
| 40 |
-
QUALITY_ANCHORS: Dict[str, List[str]] = {
|
| 41 |
-
"high": [
|
| 42 |
-
"
|
| 43 |
-
"
|
| 44 |
-
],
|
| 45 |
-
"low": [
|
| 46 |
-
"
|
| 47 |
-
"
|
| 48 |
-
],
|
| 49 |
-
}
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
self.
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""PyTorch + transformers model wrapper for code-quality scoring."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import hashlib
|
| 6 |
+
from typing import Dict, List, Sequence
|
| 7 |
+
|
| 8 |
+
import torch
|
| 9 |
+
import torch.nn.functional as F
|
| 10 |
+
|
| 11 |
+
try:
|
| 12 |
+
from transformers import AutoModel, AutoTokenizer
|
| 13 |
+
except Exception:
|
| 14 |
+
AutoModel = None # type: ignore[assignment]
|
| 15 |
+
AutoTokenizer = None # type: ignore[assignment]
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
DOMAIN_PROTOTYPES: Dict[str, List[str]] = {
|
| 19 |
+
"dsa": [
|
| 20 |
+
"Algorithmic Python with nested loops, recursion, dynamic programming, maps, and asymptotic analysis.",
|
| 21 |
+
"Competitive programming utility focused on arrays, graphs, search, and runtime complexity.",
|
| 22 |
+
],
|
| 23 |
+
"data_science": [
|
| 24 |
+
"Pandas dataframe transformation, numpy vectorization, feature engineering, data cleaning, and leakage prevention.",
|
| 25 |
+
"Notebook-style data pipeline using joins, aggregations, and columnar operations.",
|
| 26 |
+
],
|
| 27 |
+
"ml_dl": [
|
| 28 |
+
"PyTorch model inference or training loop with eval mode, no_grad, tensors, optimizer, and loss functions.",
|
| 29 |
+
"Machine learning code with torch, sklearn, batches, checkpoints, and metrics.",
|
| 30 |
+
],
|
| 31 |
+
"web": [
|
| 32 |
+
"FastAPI backend endpoint with pydantic validation, dependency injection, request parsing, and API safety.",
|
| 33 |
+
"Python web-service route handling, serialization, authentication, and response contracts.",
|
| 34 |
+
],
|
| 35 |
+
"general": [
|
| 36 |
+
"General Python utility code with readability, typing, small functions, tests, and maintainable abstractions.",
|
| 37 |
+
],
|
| 38 |
+
}
|
| 39 |
+
|
| 40 |
+
QUALITY_ANCHORS: Dict[str, List[str]] = {
|
| 41 |
+
"high": [
|
| 42 |
+
"Production-ready Python code with clear naming, docstrings, validation, efficient loops, and low complexity.",
|
| 43 |
+
"Clean code with explicit error handling, typing, modular design, and testable functions.",
|
| 44 |
+
],
|
| 45 |
+
"low": [
|
| 46 |
+
"Bug-prone Python with nested loops, missing validation, weak naming, duplicated logic, and hard-to-review structure.",
|
| 47 |
+
"Risky code with syntax drift, unclear behavior, mutable side effects, and repeated scans over data.",
|
| 48 |
+
],
|
| 49 |
+
}
|
| 50 |
+
|
| 51 |
+
MAINTAINABILITY_ANCHORS: Dict[str, List[str]] = {
|
| 52 |
+
"high": [
|
| 53 |
+
"Readable functions, small logical units, strong typing, comments only where needed, and simple control flow.",
|
| 54 |
+
"Maintainable Python service with clean architecture, cohesive modules, and explicit contracts.",
|
| 55 |
+
],
|
| 56 |
+
"low": [
|
| 57 |
+
"Large unstructured function, missing docstrings, weak names, deeply nested branches, and difficult debugging.",
|
| 58 |
+
"Hard-to-maintain script with inconsistent style, brittle branching, and hidden side effects.",
|
| 59 |
+
],
|
| 60 |
+
}
|
| 61 |
+
|
| 62 |
+
ISSUE_ANCHORS: Dict[str, List[str]] = {
|
| 63 |
+
"correctness": [
|
| 64 |
+
"Off-by-one bug, missing final append, incorrect boundary handling, failing assertions, wrong return value.",
|
| 65 |
+
"Logic regression caused by a missing branch, incorrect state update, or unhandled edge case.",
|
| 66 |
+
],
|
| 67 |
+
"performance": [
|
| 68 |
+
"Repeated full-list scans, brute-force nested loops, iterrows misuse, avoidable O(n^2) behavior, slow pipeline.",
|
| 69 |
+
"Performance regression from redundant iteration, poor data structures, or missing vectorization.",
|
| 70 |
+
],
|
| 71 |
+
"security": [
|
| 72 |
+
"Unsafe input handling, unchecked request payload, eval usage, missing validation, insecure backend pattern.",
|
| 73 |
+
"Security risk caused by trusting raw user input or bypassing schema validation.",
|
| 74 |
+
],
|
| 75 |
+
"style": [
|
| 76 |
+
"Readability issues from long lines, missing docstrings, inconsistent spacing, tabs, and trailing whitespace.",
|
| 77 |
+
"Style drift that makes code review harder and maintenance slower.",
|
| 78 |
+
],
|
| 79 |
+
}
|
| 80 |
+
|
| 81 |
+
|
| 82 |
+
class _HashEmbeddingBackend:
|
| 83 |
+
"""Torch-native fallback when pretrained weights cannot be loaded."""
|
| 84 |
+
|
| 85 |
+
def __init__(self, dimensions: int = 128) -> None:
|
| 86 |
+
self.dimensions = dimensions
|
| 87 |
+
self.model_id = "hashed-token-fallback"
|
| 88 |
+
self.backend_name = "hashed-token-fallback"
|
| 89 |
+
self.notes = ["Using hashed embeddings because pretrained transformer weights are unavailable."]
|
| 90 |
+
|
| 91 |
+
def embed_texts(self, texts: Sequence[str]) -> torch.Tensor:
|
| 92 |
+
matrix = torch.zeros((len(texts), self.dimensions), dtype=torch.float32)
|
| 93 |
+
for row_index, text in enumerate(texts):
|
| 94 |
+
tokens = text.lower().split()[:512]
|
| 95 |
+
if not tokens:
|
| 96 |
+
matrix[row_index, 0] = 1.0
|
| 97 |
+
continue
|
| 98 |
+
for token in tokens:
|
| 99 |
+
digest = hashlib.md5(token.encode("utf-8")).hexdigest()
|
| 100 |
+
bucket = int(digest[:8], 16) % self.dimensions
|
| 101 |
+
sign = -1.0 if int(digest[8:10], 16) % 2 else 1.0
|
| 102 |
+
matrix[row_index, bucket] += sign
|
| 103 |
+
return F.normalize(matrix + 1e-6, dim=1)
|
| 104 |
+
|
| 105 |
+
|
| 106 |
+
class PyTorchCodeAnalyzerModel:
|
| 107 |
+
"""Score code using pretrained transformer embeddings plus prototype similarity."""
|
| 108 |
+
|
| 109 |
+
def __init__(self, model_id: str = "huggingface/CodeBERTa-small-v1") -> None:
|
| 110 |
+
self.model_id = model_id
|
| 111 |
+
self.backend_name = model_id
|
| 112 |
+
self.notes: List[str] = []
|
| 113 |
+
self._tokenizer = None
|
| 114 |
+
self._model = None
|
| 115 |
+
self._fallback = _HashEmbeddingBackend()
|
| 116 |
+
self._prototype_cache: Dict[str, torch.Tensor] = {}
|
| 117 |
+
|
| 118 |
+
def _ensure_loaded(self) -> None:
|
| 119 |
+
if self._model is not None or self.notes:
|
| 120 |
+
return
|
| 121 |
+
if AutoTokenizer is None or AutoModel is None:
|
| 122 |
+
self.backend_name = self._fallback.backend_name
|
| 123 |
+
self.notes = list(self._fallback.notes)
|
| 124 |
+
return
|
| 125 |
+
try:
|
| 126 |
+
self._tokenizer = AutoTokenizer.from_pretrained(self.model_id)
|
| 127 |
+
self._model = AutoModel.from_pretrained(self.model_id)
|
| 128 |
+
self._model.eval()
|
| 129 |
+
self.notes.append(f"Loaded pretrained encoder `{self.model_id}`.")
|
| 130 |
+
except Exception as exc:
|
| 131 |
+
self.backend_name = self._fallback.backend_name
|
| 132 |
+
self.notes = list(self._fallback.notes) + [f"Pretrained load failed: {type(exc).__name__}: {exc}"]
|
| 133 |
+
|
| 134 |
+
def _embed_texts(self, texts: Sequence[str]) -> torch.Tensor:
|
| 135 |
+
self._ensure_loaded()
|
| 136 |
+
if self._model is None or self._tokenizer is None:
|
| 137 |
+
return self._fallback.embed_texts(texts)
|
| 138 |
+
encoded = self._tokenizer(list(texts), padding=True, truncation=True, max_length=256, return_tensors="pt")
|
| 139 |
+
with torch.no_grad():
|
| 140 |
+
outputs = self._model(**encoded)
|
| 141 |
+
hidden = outputs.last_hidden_state
|
| 142 |
+
mask = encoded["attention_mask"].unsqueeze(-1)
|
| 143 |
+
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
|
| 144 |
+
return F.normalize(pooled, dim=1)
|
| 145 |
+
|
| 146 |
+
def _prototype_matrix(self, bucket: str, texts: Sequence[str]) -> torch.Tensor:
|
| 147 |
+
if bucket not in self._prototype_cache:
|
| 148 |
+
self._prototype_cache[bucket] = self._embed_texts(texts)
|
| 149 |
+
return self._prototype_cache[bucket]
|
| 150 |
+
|
| 151 |
+
@staticmethod
|
| 152 |
+
def _unit_similarity(candidate: torch.Tensor, matrix: torch.Tensor) -> float:
|
| 153 |
+
similarity = torch.matmul(candidate, matrix.T).max().item()
|
| 154 |
+
return round((similarity + 1.0) / 2.0, 4)
|
| 155 |
+
|
| 156 |
+
@staticmethod
|
| 157 |
+
def _quality_label(score: float) -> str:
|
| 158 |
+
if score >= 0.82:
|
| 159 |
+
return "excellent"
|
| 160 |
+
if score >= 0.66:
|
| 161 |
+
return "good"
|
| 162 |
+
if score >= 0.45:
|
| 163 |
+
return "needs_work"
|
| 164 |
+
return "risky"
|
| 165 |
+
|
| 166 |
+
def predict(
|
| 167 |
+
self,
|
| 168 |
+
code: str,
|
| 169 |
+
context_window: str,
|
| 170 |
+
traceback_text: str,
|
| 171 |
+
static_summary: Dict[str, object],
|
| 172 |
+
) -> Dict[str, object]:
|
| 173 |
+
"""Predict domain probabilities, quality, and issue risks for Python code."""
|
| 174 |
+
|
| 175 |
+
document = (
|
| 176 |
+
f"Code:\n{code.strip()[:4000]}\n\n"
|
| 177 |
+
f"Context:\n{context_window.strip()[:1000]}\n\n"
|
| 178 |
+
f"Traceback:\n{traceback_text.strip()[:1000]}\n\n"
|
| 179 |
+
f"Static hints:\n{static_summary}\n"
|
| 180 |
+
)
|
| 181 |
+
candidate = self._embed_texts([document])
|
| 182 |
+
|
| 183 |
+
domain_scores: Dict[str, float] = {}
|
| 184 |
+
for domain, texts in DOMAIN_PROTOTYPES.items():
|
| 185 |
+
domain_scores[domain] = self._unit_similarity(candidate, self._prototype_matrix(f"domain:{domain}", texts))
|
| 186 |
+
|
| 187 |
+
high_matrix = self._prototype_matrix("quality:high", QUALITY_ANCHORS["high"])
|
| 188 |
+
low_matrix = self._prototype_matrix("quality:low", QUALITY_ANCHORS["low"])
|
| 189 |
+
high_similarity = torch.matmul(candidate, high_matrix.T).max().item()
|
| 190 |
+
low_similarity = torch.matmul(candidate, low_matrix.T).max().item()
|
| 191 |
+
ml_quality_score = round(float(torch.sigmoid(torch.tensor((high_similarity - low_similarity) * 4.0)).item()), 4)
|
| 192 |
+
|
| 193 |
+
high_maintainability = torch.matmul(
|
| 194 |
+
candidate,
|
| 195 |
+
self._prototype_matrix("maintainability:high", MAINTAINABILITY_ANCHORS["high"]).T,
|
| 196 |
+
).max().item()
|
| 197 |
+
low_maintainability = torch.matmul(
|
| 198 |
+
candidate,
|
| 199 |
+
self._prototype_matrix("maintainability:low", MAINTAINABILITY_ANCHORS["low"]).T,
|
| 200 |
+
).max().item()
|
| 201 |
+
maintainability_score = round(
|
| 202 |
+
float(torch.sigmoid(torch.tensor((high_maintainability - low_maintainability) * 4.0)).item()),
|
| 203 |
+
4,
|
| 204 |
+
)
|
| 205 |
+
|
| 206 |
+
issue_logits = []
|
| 207 |
+
issue_labels = list(ISSUE_ANCHORS.keys())
|
| 208 |
+
for label in issue_labels:
|
| 209 |
+
similarity = torch.matmul(candidate, self._prototype_matrix(f"issue:{label}", ISSUE_ANCHORS[label]).T).max().item()
|
| 210 |
+
issue_logits.append(similarity)
|
| 211 |
+
probabilities = torch.softmax(torch.tensor(issue_logits) * 3.0, dim=0)
|
| 212 |
+
issue_probabilities = {
|
| 213 |
+
label: round(float(probabilities[index].item()), 4)
|
| 214 |
+
for index, label in enumerate(issue_labels)
|
| 215 |
+
}
|
| 216 |
+
|
| 217 |
+
return {
|
| 218 |
+
"domain_scores": domain_scores,
|
| 219 |
+
"ml_quality_score": ml_quality_score,
|
| 220 |
+
"quality_score": ml_quality_score,
|
| 221 |
+
"quality_label": self._quality_label(ml_quality_score),
|
| 222 |
+
"maintainability_score": maintainability_score,
|
| 223 |
+
"issue_probabilities": issue_probabilities,
|
| 224 |
+
"backend_name": self.backend_name,
|
| 225 |
+
"model_id": self.model_id,
|
| 226 |
+
"notes": list(self.notes),
|
| 227 |
+
}
|
openenv_python_code_review_env.egg-info/PKG-INFO
CHANGED
|
@@ -6,7 +6,6 @@ Requires-Python: >=3.10
|
|
| 6 |
Description-Content-Type: text/markdown
|
| 7 |
Requires-Dist: fastapi>=0.111.0
|
| 8 |
Requires-Dist: gradio>=5.26.0
|
| 9 |
-
Requires-Dist: hf-xet>=1.4.3
|
| 10 |
Requires-Dist: openai>=1.76.0
|
| 11 |
Requires-Dist: openenv-core[core]>=0.2.2
|
| 12 |
Requires-Dist: streamlit>=1.44.0
|
|
@@ -35,25 +34,26 @@ Production-ready hackathon submission for OpenEnv evaluation, deterministic vali
|
|
| 35 |
|
| 36 |
```text
|
| 37 |
root
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
|
|
|
| 57 |
```
|
| 58 |
|
| 59 |
Runtime flow:
|
|
@@ -71,8 +71,8 @@ inference.py
|
|
| 71 |
|
| 72 |
- `inference.py` now lives at the repo root and delegates to a strict runner under `app/env`.
|
| 73 |
- OpenAI usage is limited to the official Python client:
|
| 74 |
-
`client = OpenAI(base_url=API_BASE_URL, api_key=
|
| 75 |
-
- Defaulted env vars are enforced for `API_BASE_URL` and `MODEL_NAME`; `HF_TOKEN`
|
| 76 |
- Output now matches the required single-line contract exactly and always emits `[END]`, including failure paths.
|
| 77 |
- The RL loop now uses `reset()` plus `step_result()` in a proper `while not done` loop.
|
| 78 |
- Step errors now surface through `last_action_error` and are printed in `[STEP]`.
|
|
@@ -120,7 +120,9 @@ Required environment variables:
|
|
| 120 |
- `MODEL_NAME`
|
| 121 |
Default: `Qwen/Qwen2.5-3B-Instruct`
|
| 122 |
- `HF_TOKEN`
|
| 123 |
-
|
|
|
|
|
|
|
| 124 |
|
| 125 |
Example:
|
| 126 |
|
|
@@ -131,6 +133,13 @@ set HF_TOKEN=hf_xxx
|
|
| 131 |
python inference.py
|
| 132 |
```
|
| 133 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 134 |
Expected stdout shape:
|
| 135 |
|
| 136 |
```text
|
|
@@ -147,7 +156,7 @@ Expected stdout shape:
|
|
| 147 |
Build from the project root:
|
| 148 |
|
| 149 |
```bash
|
| 150 |
-
docker build -
|
| 151 |
```
|
| 152 |
|
| 153 |
Run locally:
|
|
@@ -173,7 +182,7 @@ Recommended deployment steps:
|
|
| 173 |
|
| 174 |
1. Create a Docker Space.
|
| 175 |
2. Push this repository as-is.
|
| 176 |
-
3. Let Spaces build
|
| 177 |
4. Set Space secrets:
|
| 178 |
`HF_TOKEN`
|
| 179 |
5. Set Space variables as needed:
|
|
|
|
| 6 |
Description-Content-Type: text/markdown
|
| 7 |
Requires-Dist: fastapi>=0.111.0
|
| 8 |
Requires-Dist: gradio>=5.26.0
|
|
|
|
| 9 |
Requires-Dist: openai>=1.76.0
|
| 10 |
Requires-Dist: openenv-core[core]>=0.2.2
|
| 11 |
Requires-Dist: streamlit>=1.44.0
|
|
|
|
| 34 |
|
| 35 |
```text
|
| 36 |
root
|
| 37 |
+
|- inference.py # Root validator entrypoint
|
| 38 |
+
|- openenv.yaml # OpenEnv manifest
|
| 39 |
+
|- app/
|
| 40 |
+
| |- agents/ # Action policy and fallback strategy
|
| 41 |
+
| |- env/ # RL loop runner and stdout contract
|
| 42 |
+
| |- models/ # Inference dataclasses/config
|
| 43 |
+
| |- services/ # OpenAI client wrapper with retries
|
| 44 |
+
| `- utils/ # Formatting, task loading, log suppression
|
| 45 |
+
|- server/
|
| 46 |
+
| |- env.py # OpenEnv environment and reward shaping
|
| 47 |
+
| |- app.py # FastAPI/OpenEnv app, optional Gradio mount
|
| 48 |
+
| `- Dockerfile # Alternate Docker build path
|
| 49 |
+
|- Dockerfile # Root deployment Docker image
|
| 50 |
+
|- graders/ # Syntax, bug-fix, optimization graders
|
| 51 |
+
|- tasks/ # Deterministic benchmark tasks and references
|
| 52 |
+
|- services/ # Multi-domain analysis services
|
| 53 |
+
|- analyzers/ # Domain-specific analyzers
|
| 54 |
+
|- models/ # Lazy-loaded PyTorch scoring model
|
| 55 |
+
|- schemas/ # API request/response contracts
|
| 56 |
+
`- tests/ # Local validation coverage
|
| 57 |
```
|
| 58 |
|
| 59 |
Runtime flow:
|
|
|
|
| 71 |
|
| 72 |
- `inference.py` now lives at the repo root and delegates to a strict runner under `app/env`.
|
| 73 |
- OpenAI usage is limited to the official Python client:
|
| 74 |
+
`client = OpenAI(base_url=API_BASE_URL, api_key=provider_token)`.
|
| 75 |
+
- Defaulted env vars are enforced for `API_BASE_URL` and `MODEL_NAME`; the runtime now selects `HF_TOKEN` for the Hugging Face router and `OPENAI_API_KEY` for direct OpenAI usage.
|
| 76 |
- Output now matches the required single-line contract exactly and always emits `[END]`, including failure paths.
|
| 77 |
- The RL loop now uses `reset()` plus `step_result()` in a proper `while not done` loop.
|
| 78 |
- Step errors now surface through `last_action_error` and are printed in `[STEP]`.
|
|
|
|
| 120 |
- `MODEL_NAME`
|
| 121 |
Default: `Qwen/Qwen2.5-3B-Instruct`
|
| 122 |
- `HF_TOKEN`
|
| 123 |
+
Required for `https://router.huggingface.co/v1`
|
| 124 |
+
- `OPENAI_API_KEY`
|
| 125 |
+
Required for `https://api.openai.com/v1`
|
| 126 |
|
| 127 |
Example:
|
| 128 |
|
|
|
|
| 133 |
python inference.py
|
| 134 |
```
|
| 135 |
|
| 136 |
+
```bash
|
| 137 |
+
set API_BASE_URL=https://api.openai.com/v1
|
| 138 |
+
set MODEL_NAME=gpt-4.1-mini
|
| 139 |
+
set OPENAI_API_KEY=sk-xxx
|
| 140 |
+
python inference.py
|
| 141 |
+
```
|
| 142 |
+
|
| 143 |
Expected stdout shape:
|
| 144 |
|
| 145 |
```text
|
|
|
|
| 156 |
Build from the project root:
|
| 157 |
|
| 158 |
```bash
|
| 159 |
+
docker build -t openenv-python-code-review-env .
|
| 160 |
```
|
| 161 |
|
| 162 |
Run locally:
|
|
|
|
| 182 |
|
| 183 |
1. Create a Docker Space.
|
| 184 |
2. Push this repository as-is.
|
| 185 |
+
3. Let Spaces build from the root `Dockerfile`.
|
| 186 |
4. Set Space secrets:
|
| 187 |
`HF_TOKEN`
|
| 188 |
5. Set Space variables as needed:
|
openenv_python_code_review_env.egg-info/SOURCES.txt
CHANGED
|
@@ -1,5 +1,15 @@
|
|
| 1 |
README.md
|
| 2 |
pyproject.toml
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 3 |
analyzers/__init__.py
|
| 4 |
analyzers/ds_analyzer.py
|
| 5 |
analyzers/dsa_analyzer.py
|
|
@@ -12,6 +22,8 @@ app/examples.py
|
|
| 12 |
app/streamlit_app.py
|
| 13 |
app/agents/__init__.py
|
| 14 |
app/agents/review_agent.py
|
|
|
|
|
|
|
| 15 |
app/models/__init__.py
|
| 16 |
app/models/inference.py
|
| 17 |
app/services/__init__.py
|
|
|
|
| 1 |
README.md
|
| 2 |
pyproject.toml
|
| 3 |
+
./__init__.py
|
| 4 |
+
./client.py
|
| 5 |
+
./compat.py
|
| 6 |
+
./inference.py
|
| 7 |
+
./launch.py
|
| 8 |
+
./models.py
|
| 9 |
+
./sitecustomize.py
|
| 10 |
+
./triage.py
|
| 11 |
+
./triage_catalog.py
|
| 12 |
+
./triage_models.py
|
| 13 |
analyzers/__init__.py
|
| 14 |
analyzers/ds_analyzer.py
|
| 15 |
analyzers/dsa_analyzer.py
|
|
|
|
| 22 |
app/streamlit_app.py
|
| 23 |
app/agents/__init__.py
|
| 24 |
app/agents/review_agent.py
|
| 25 |
+
app/env/__init__.py
|
| 26 |
+
app/env/runner.py
|
| 27 |
app/models/__init__.py
|
| 28 |
app/models/inference.py
|
| 29 |
app/services/__init__.py
|
openenv_python_code_review_env.egg-info/requires.txt
CHANGED
|
@@ -1,6 +1,5 @@
|
|
| 1 |
fastapi>=0.111.0
|
| 2 |
gradio>=5.26.0
|
| 3 |
-
hf-xet>=1.4.3
|
| 4 |
openai>=1.76.0
|
| 5 |
openenv-core[core]>=0.2.2
|
| 6 |
streamlit>=1.44.0
|
|
|
|
| 1 |
fastapi>=0.111.0
|
| 2 |
gradio>=5.26.0
|
|
|
|
| 3 |
openai>=1.76.0
|
| 4 |
openenv-core[core]>=0.2.2
|
| 5 |
streamlit>=1.44.0
|
openenv_python_code_review_env.egg-info/top_level.txt
CHANGED
|
@@ -1,14 +1 @@
|
|
| 1 |
-
|
| 2 |
-
api
|
| 3 |
-
app
|
| 4 |
-
build
|
| 5 |
-
graders
|
| 6 |
-
models
|
| 7 |
-
outputs
|
| 8 |
-
schemas
|
| 9 |
-
server
|
| 10 |
-
services
|
| 11 |
-
tasks
|
| 12 |
-
tests
|
| 13 |
-
utils
|
| 14 |
-
venv
|
|
|
|
| 1 |
+
python_env
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
pyproject.toml
CHANGED
|
@@ -8,11 +8,9 @@ version = "1.0.0"
|
|
| 8 |
description = "TorchReview Copilot: AI-powered Python code triage with PyTorch and OpenEnv validation."
|
| 9 |
readme = "README.md"
|
| 10 |
requires-python = ">=3.10"
|
| 11 |
-
|
| 12 |
dependencies = [
|
| 13 |
"fastapi>=0.111.0",
|
| 14 |
"gradio>=5.26.0",
|
| 15 |
-
"hf-xet>=1.4.3",
|
| 16 |
"openai>=1.76.0",
|
| 17 |
"openenv-core[core]>=0.2.2",
|
| 18 |
"streamlit>=1.44.0",
|
|
@@ -30,9 +28,27 @@ dev = [
|
|
| 30 |
[project.scripts]
|
| 31 |
server = "python_env.server.app:main"
|
| 32 |
|
|
|
|
|
|
|
|
|
|
| 33 |
[tool.setuptools]
|
| 34 |
include-package-data = true
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
description = "TorchReview Copilot: AI-powered Python code triage with PyTorch and OpenEnv validation."
|
| 9 |
readme = "README.md"
|
| 10 |
requires-python = ">=3.10"
|
|
|
|
| 11 |
dependencies = [
|
| 12 |
"fastapi>=0.111.0",
|
| 13 |
"gradio>=5.26.0",
|
|
|
|
| 14 |
"openai>=1.76.0",
|
| 15 |
"openenv-core[core]>=0.2.2",
|
| 16 |
"streamlit>=1.44.0",
|
|
|
|
| 28 |
[project.scripts]
|
| 29 |
server = "python_env.server.app:main"
|
| 30 |
|
| 31 |
+
[tool.pytest.ini_options]
|
| 32 |
+
pythonpath = ["."]
|
| 33 |
+
|
| 34 |
[tool.setuptools]
|
| 35 |
include-package-data = true
|
| 36 |
+
packages = [
|
| 37 |
+
"python_env",
|
| 38 |
+
"python_env.server",
|
| 39 |
+
"python_env.tasks",
|
| 40 |
+
"python_env.graders",
|
| 41 |
+
"python_env.api",
|
| 42 |
+
"python_env.app",
|
| 43 |
+
"python_env.app.agents",
|
| 44 |
+
"python_env.app.env",
|
| 45 |
+
"python_env.app.models",
|
| 46 |
+
"python_env.app.services",
|
| 47 |
+
"python_env.app.utils",
|
| 48 |
+
"python_env.analyzers",
|
| 49 |
+
"python_env.models",
|
| 50 |
+
"python_env.schemas",
|
| 51 |
+
"python_env.services",
|
| 52 |
+
"python_env.utils",
|
| 53 |
+
]
|
| 54 |
+
package-dir = { "python_env" = ".", "python_env.server" = "server", "python_env.tasks" = "tasks", "python_env.graders" = "graders", "python_env.api" = "api", "python_env.app" = "app", "python_env.app.agents" = "app/agents", "python_env.app.env" = "app/env", "python_env.app.models" = "app/models", "python_env.app.services" = "app/services", "python_env.app.utils" = "app/utils", "python_env.analyzers" = "analyzers", "python_env.models" = "models", "python_env.schemas" = "schemas", "python_env.services" = "services", "python_env.utils" = "utils" }
|
schemas/request.py
CHANGED
|
@@ -1,19 +1,51 @@
|
|
| 1 |
-
"""Request schemas for
|
| 2 |
-
|
| 3 |
-
from __future__ import annotations
|
| 4 |
-
|
| 5 |
-
from typing import Literal
|
| 6 |
-
|
| 7 |
-
from pydantic import BaseModel, Field
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
DomainHint = Literal["auto", "dsa", "data_science", "ml_dl", "web"]
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
class AnalyzeCodeRequest(BaseModel):
|
| 14 |
-
"""Validated input payload for
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Request schemas for the AI-powered code review workflow."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
from typing import Literal
|
| 6 |
+
|
| 7 |
+
from pydantic import BaseModel, ConfigDict, Field, field_validator
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
DomainHint = Literal["auto", "general", "dsa", "data_science", "ml_dl", "web"]
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
class AnalyzeCodeRequest(BaseModel):
|
| 14 |
+
"""Validated input payload for Python code review requests."""
|
| 15 |
+
|
| 16 |
+
model_config = ConfigDict(str_strip_whitespace=True)
|
| 17 |
+
|
| 18 |
+
code: str = Field(..., min_length=1, description="Python source code to analyze.")
|
| 19 |
+
context_window: str = Field(
|
| 20 |
+
default="",
|
| 21 |
+
max_length=4000,
|
| 22 |
+
description="Optional repository, pull request, or runtime context.",
|
| 23 |
+
)
|
| 24 |
+
traceback_text: str = Field(
|
| 25 |
+
default="",
|
| 26 |
+
max_length=4000,
|
| 27 |
+
description="Optional traceback or failing test output.",
|
| 28 |
+
)
|
| 29 |
+
domain_hint: DomainHint = Field(
|
| 30 |
+
default="auto",
|
| 31 |
+
description="Optional analysis lens for domain-aware suggestions.",
|
| 32 |
+
)
|
| 33 |
+
filename: str = Field(default="snippet.py", max_length=255, description="Virtual filename for display.")
|
| 34 |
+
enable_suggestions: bool = Field(
|
| 35 |
+
default=True,
|
| 36 |
+
description="Whether the service should return a prioritized improvement plan.",
|
| 37 |
+
)
|
| 38 |
+
|
| 39 |
+
@field_validator("code")
|
| 40 |
+
@classmethod
|
| 41 |
+
def _reject_empty_code(cls, value: str) -> str:
|
| 42 |
+
stripped = value.strip()
|
| 43 |
+
if not stripped:
|
| 44 |
+
raise ValueError("code must not be empty")
|
| 45 |
+
return stripped
|
| 46 |
+
|
| 47 |
+
@field_validator("filename")
|
| 48 |
+
@classmethod
|
| 49 |
+
def _normalize_filename(cls, value: str) -> str:
|
| 50 |
+
candidate = value.strip() or "snippet.py"
|
| 51 |
+
return candidate[:255]
|
schemas/response.py
CHANGED
|
@@ -1,73 +1,109 @@
|
|
| 1 |
-
"""Response schemas for the
|
| 2 |
-
|
| 3 |
-
from __future__ import annotations
|
| 4 |
-
|
| 5 |
-
from typing import Dict, List, Literal
|
| 6 |
-
|
| 7 |
-
from pydantic import BaseModel, Field
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Response schemas for the AI-powered code review platform."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
from typing import Dict, List, Literal
|
| 6 |
+
|
| 7 |
+
from pydantic import BaseModel, Field
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
Severity = Literal["low", "medium", "high"]
|
| 11 |
+
IssueCategory = Literal["correctness", "maintainability", "performance", "security", "style"]
|
| 12 |
+
QualityLabel = Literal["excellent", "good", "needs_work", "risky"]
|
| 13 |
+
DetectedDomain = Literal["general", "dsa", "data_science", "ml_dl", "web"]
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
class AnalysisIssue(BaseModel):
|
| 17 |
+
"""One detected issue or risk in the code snippet."""
|
| 18 |
+
|
| 19 |
+
title: str
|
| 20 |
+
category: IssueCategory = "maintainability"
|
| 21 |
+
severity: Severity
|
| 22 |
+
description: str
|
| 23 |
+
line_hint: int | None = None
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
class StaticAnalysisSummary(BaseModel):
|
| 27 |
+
"""Python-specific static-analysis signals."""
|
| 28 |
+
|
| 29 |
+
syntax_valid: bool
|
| 30 |
+
syntax_error: str = ""
|
| 31 |
+
cyclomatic_complexity: int = Field(..., ge=1)
|
| 32 |
+
line_count: int = Field(..., ge=0)
|
| 33 |
+
max_nesting_depth: int = Field(..., ge=0)
|
| 34 |
+
max_loop_depth: int = Field(..., ge=0)
|
| 35 |
+
time_complexity: str = "Unknown"
|
| 36 |
+
space_complexity: str = "Unknown"
|
| 37 |
+
lint_score: float = Field(..., ge=0.0, le=1.0)
|
| 38 |
+
docstring_coverage: float = Field(..., ge=0.0, le=1.0)
|
| 39 |
+
detected_imports: List[str] = Field(default_factory=list)
|
| 40 |
+
code_smells: List[str] = Field(default_factory=list)
|
| 41 |
+
issues: List[AnalysisIssue] = Field(default_factory=list)
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
class DomainAnalysis(BaseModel):
|
| 45 |
+
"""Domain-aware review signals used for context-specific suggestions."""
|
| 46 |
+
|
| 47 |
+
domain: DetectedDomain
|
| 48 |
+
domain_score: float = Field(..., ge=0.0, le=1.0)
|
| 49 |
+
issues: List[AnalysisIssue] = Field(default_factory=list)
|
| 50 |
+
suggestions: List[str] = Field(default_factory=list)
|
| 51 |
+
highlights: Dict[str, float | str] = Field(default_factory=dict)
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
class ModelPrediction(BaseModel):
|
| 55 |
+
"""PyTorch model output derived from pretrained code embeddings."""
|
| 56 |
+
|
| 57 |
+
quality_label: QualityLabel
|
| 58 |
+
quality_score: float = Field(..., ge=0.0, le=1.0)
|
| 59 |
+
maintainability_score: float = Field(..., ge=0.0, le=1.0)
|
| 60 |
+
issue_probabilities: Dict[str, float] = Field(default_factory=dict)
|
| 61 |
+
notes: List[str] = Field(default_factory=list)
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
class ScoreBreakdown(BaseModel):
|
| 65 |
+
"""Reward inputs and the final RL-ready scalar reward."""
|
| 66 |
+
|
| 67 |
+
ml_score: float = Field(..., ge=0.0, le=1.0)
|
| 68 |
+
domain_score: float = Field(..., ge=0.0, le=1.0)
|
| 69 |
+
lint_score: float = Field(..., ge=0.0, le=1.0)
|
| 70 |
+
complexity_penalty: float = Field(..., ge=0.0, le=1.0)
|
| 71 |
+
maintainability_score: float = Field(..., ge=0.0, le=1.0)
|
| 72 |
+
security_score: float = Field(..., ge=0.0, le=1.0)
|
| 73 |
+
readability_score: float = Field(..., ge=0.0, le=1.0)
|
| 74 |
+
quality_signal: float = Field(..., ge=0.0, le=1.0)
|
| 75 |
+
error_reduction_signal: float = Field(..., ge=0.0, le=1.0)
|
| 76 |
+
completion_signal: float = Field(..., ge=0.0, le=1.0)
|
| 77 |
+
reward: float = Field(..., ge=0.0, le=1.0)
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
class SuggestionItem(BaseModel):
|
| 81 |
+
"""One prioritized improvement suggestion."""
|
| 82 |
+
|
| 83 |
+
priority: Literal["P0", "P1", "P2"]
|
| 84 |
+
title: str
|
| 85 |
+
rationale: str
|
| 86 |
+
action: str
|
| 87 |
+
category: IssueCategory
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
class AnalyzeCodeResponse(BaseModel):
|
| 91 |
+
"""Top-level structured output for API and UI consumers."""
|
| 92 |
+
|
| 93 |
+
language: Literal["python"] = "python"
|
| 94 |
+
detected_domain: DetectedDomain
|
| 95 |
+
domain_confidences: Dict[str, float] = Field(default_factory=dict)
|
| 96 |
+
score_breakdown: ScoreBreakdown
|
| 97 |
+
static_analysis: StaticAnalysisSummary
|
| 98 |
+
model_prediction: ModelPrediction
|
| 99 |
+
domain_analysis: DomainAnalysis
|
| 100 |
+
suggestions: List[SuggestionItem] = Field(default_factory=list)
|
| 101 |
+
improvement_plan: List[str] = Field(default_factory=list)
|
| 102 |
+
auto_fix_preview: List[str] = Field(default_factory=list)
|
| 103 |
+
score_visualization: Dict[str, float] = Field(default_factory=dict)
|
| 104 |
+
model_backend: str
|
| 105 |
+
model_id: str
|
| 106 |
+
summary: str
|
| 107 |
+
context_window: str = ""
|
| 108 |
+
filename: str = "snippet.py"
|
| 109 |
+
analysis_time_ms: float = Field(..., ge=0.0)
|
server/Dockerfile
CHANGED
|
@@ -6,7 +6,8 @@ ENV PYTHONDONTWRITEBYTECODE=1 \
|
|
| 6 |
PYTHONIOENCODING=utf-8 \
|
| 7 |
PIP_NO_CACHE_DIR=1 \
|
| 8 |
PIP_DISABLE_PIP_VERSION_CHECK=1 \
|
| 9 |
-
ENABLE_GRADIO_DEMO=false
|
|
|
|
| 10 |
|
| 11 |
WORKDIR /app
|
| 12 |
|
|
@@ -24,4 +25,4 @@ EXPOSE 8000
|
|
| 24 |
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
|
| 25 |
CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/health', timeout=3).read()"
|
| 26 |
|
| 27 |
-
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"
|
|
|
|
| 6 |
PYTHONIOENCODING=utf-8 \
|
| 7 |
PIP_NO_CACHE_DIR=1 \
|
| 8 |
PIP_DISABLE_PIP_VERSION_CHECK=1 \
|
| 9 |
+
ENABLE_GRADIO_DEMO=false \
|
| 10 |
+
ENABLE_WEB_INTERFACE=false
|
| 11 |
|
| 12 |
WORKDIR /app
|
| 13 |
|
|
|
|
| 25 |
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
|
| 26 |
CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/health', timeout=3).read()"
|
| 27 |
|
| 28 |
+
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
|
server/app.py
CHANGED
|
@@ -18,12 +18,12 @@ try:
|
|
| 18 |
except Exception:
|
| 19 |
gr = None # type: ignore[assignment]
|
| 20 |
|
| 21 |
-
try:
|
| 22 |
-
from ..models import PythonCodeReviewAction, PythonCodeReviewObservation
|
| 23 |
-
from .env import PythonCodeReviewEnvironment
|
| 24 |
-
except ImportError:
|
| 25 |
-
from models import PythonCodeReviewAction, PythonCodeReviewObservation
|
| 26 |
-
from server.env import PythonCodeReviewEnvironment
|
| 27 |
|
| 28 |
|
| 29 |
def _gradio_enabled() -> bool:
|
|
@@ -40,7 +40,7 @@ def _max_concurrent_envs() -> int:
|
|
| 40 |
return 2
|
| 41 |
|
| 42 |
|
| 43 |
-
def build_application():
|
| 44 |
"""Compose the OpenEnv API with the Gradio demo frontend."""
|
| 45 |
|
| 46 |
api_app = create_app(
|
|
@@ -50,19 +50,13 @@ def build_application():
|
|
| 50 |
env_name="python_code_review_env",
|
| 51 |
max_concurrent_envs=_max_concurrent_envs(),
|
| 52 |
)
|
| 53 |
-
served_app = api_app
|
| 54 |
-
if gr is not None and _gradio_enabled():
|
| 55 |
-
try:
|
| 56 |
-
from .demo import
|
| 57 |
-
except ImportError:
|
| 58 |
-
from server.demo import
|
| 59 |
-
served_app = gr.mount_gradio_app(
|
| 60 |
-
api_app,
|
| 61 |
-
build_demo(),
|
| 62 |
-
path="/",
|
| 63 |
-
theme=gr.themes.Soft(primary_hue="orange", secondary_hue="amber"),
|
| 64 |
-
css=CSS,
|
| 65 |
-
)
|
| 66 |
|
| 67 |
wrapper_app = FastAPI(title="python_code_review_env", version="1.0.0")
|
| 68 |
|
|
@@ -77,10 +71,10 @@ def build_application():
|
|
| 77 |
app = build_application()
|
| 78 |
|
| 79 |
|
| 80 |
-
def main(host: str = "0.0.0.0", port: int = 8000) -> None:
|
| 81 |
-
import uvicorn
|
| 82 |
-
|
| 83 |
-
uvicorn.run(app, host=host, port=port
|
| 84 |
|
| 85 |
|
| 86 |
if __name__ == "__main__":
|
|
|
|
| 18 |
except Exception:
|
| 19 |
gr = None # type: ignore[assignment]
|
| 20 |
|
| 21 |
+
try:
|
| 22 |
+
from ..models import PythonCodeReviewAction, PythonCodeReviewObservation
|
| 23 |
+
from .env import PythonCodeReviewEnvironment
|
| 24 |
+
except ImportError:
|
| 25 |
+
from models import PythonCodeReviewAction, PythonCodeReviewObservation
|
| 26 |
+
from server.env import PythonCodeReviewEnvironment
|
| 27 |
|
| 28 |
|
| 29 |
def _gradio_enabled() -> bool:
|
|
|
|
| 40 |
return 2
|
| 41 |
|
| 42 |
|
| 43 |
+
def build_application():
|
| 44 |
"""Compose the OpenEnv API with the Gradio demo frontend."""
|
| 45 |
|
| 46 |
api_app = create_app(
|
|
|
|
| 50 |
env_name="python_code_review_env",
|
| 51 |
max_concurrent_envs=_max_concurrent_envs(),
|
| 52 |
)
|
| 53 |
+
served_app = api_app
|
| 54 |
+
if gr is not None and _gradio_enabled():
|
| 55 |
+
try:
|
| 56 |
+
from .demo import build_demo
|
| 57 |
+
except ImportError:
|
| 58 |
+
from server.demo import build_demo
|
| 59 |
+
served_app = gr.mount_gradio_app(api_app, build_demo(), path="/")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 60 |
|
| 61 |
wrapper_app = FastAPI(title="python_code_review_env", version="1.0.0")
|
| 62 |
|
|
|
|
| 71 |
app = build_application()
|
| 72 |
|
| 73 |
|
| 74 |
+
def main(host: str = "0.0.0.0", port: int = 8000) -> None:
|
| 75 |
+
import uvicorn
|
| 76 |
+
|
| 77 |
+
uvicorn.run(app, host=host, port=port)
|
| 78 |
|
| 79 |
|
| 80 |
if __name__ == "__main__":
|
server/demo.py
CHANGED
|
@@ -347,7 +347,7 @@ def build_demo() -> gr.Blocks:
|
|
| 347 |
examples = get_default_engine().example_map()
|
| 348 |
first_example = next(iter(examples.values()))
|
| 349 |
|
| 350 |
-
with gr.Blocks(title="TorchReview Copilot") as demo:
|
| 351 |
gr.HTML(
|
| 352 |
"""
|
| 353 |
<div class="hero-card">
|
|
|
|
| 347 |
examples = get_default_engine().example_map()
|
| 348 |
first_example = next(iter(examples.values()))
|
| 349 |
|
| 350 |
+
with gr.Blocks(theme=gr.themes.Soft(primary_hue="orange", secondary_hue="amber"), css=CSS, title="TorchReview Copilot") as demo:
|
| 351 |
gr.HTML(
|
| 352 |
"""
|
| 353 |
<div class="hero-card">
|
server/env.py
CHANGED
|
@@ -10,7 +10,7 @@ from openenv.core.env_server.types import EnvironmentMetadata
|
|
| 10 |
|
| 11 |
try:
|
| 12 |
from ..graders import grade_task
|
| 13 |
-
from ..graders.shared import component_score, safe_ratio,
|
| 14 |
from ..models import (
|
| 15 |
HistoryEntry,
|
| 16 |
PythonCodeReviewAction,
|
|
@@ -22,7 +22,7 @@ try:
|
|
| 22 |
from ..tasks import ReviewTask, list_tasks, select_task
|
| 23 |
except ImportError:
|
| 24 |
from graders import grade_task
|
| 25 |
-
from graders.shared import component_score, safe_ratio,
|
| 26 |
from models import (
|
| 27 |
HistoryEntry,
|
| 28 |
PythonCodeReviewAction,
|
|
@@ -46,7 +46,7 @@ def _empty_grade() -> TaskGrade:
|
|
| 46 |
|
| 47 |
|
| 48 |
def _reward_value(value: float) -> float:
|
| 49 |
-
return
|
| 50 |
|
| 51 |
|
| 52 |
class PythonCodeReviewEnvironment(
|
|
@@ -300,36 +300,45 @@ class PythonCodeReviewEnvironment(
|
|
| 300 |
) -> RewardDetails:
|
| 301 |
prev_score = previous_grade.score
|
| 302 |
curr_score = current_grade.score
|
|
|
|
|
|
|
|
|
|
|
|
|
| 303 |
prev_rate = safe_ratio(previous_grade.tests_passed, previous_grade.tests_total)
|
| 304 |
curr_rate = safe_ratio(current_grade.tests_passed, current_grade.tests_total)
|
| 305 |
prev_runtime = previous_grade.runtime_score
|
| 306 |
curr_runtime = current_grade.runtime_score
|
| 307 |
-
|
| 308 |
-
|
| 309 |
-
|
| 310 |
-
syntax_reward =
|
| 311 |
-
test_reward =
|
| 312 |
-
progress_delta =
|
| 313 |
-
quality_bonus =
|
| 314 |
-
runtime_bonus =
|
| 315 |
-
error_reduction_bonus =
|
| 316 |
-
completion_bonus = 0.
|
| 317 |
-
correctness_bonus =
|
| 318 |
-
|
| 319 |
-
invalid_action_penalty =
|
| 320 |
-
timeout_penalty =
|
| 321 |
-
regression_penalty =
|
| 322 |
-
stagnation_penalty =
|
| 323 |
|
| 324 |
raw_value = (
|
| 325 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 326 |
+ syntax_reward
|
| 327 |
+ test_reward
|
| 328 |
+ progress_delta
|
| 329 |
+ quality_bonus
|
|
|
|
| 330 |
+ error_reduction_bonus
|
| 331 |
+ completion_bonus
|
| 332 |
-
+ runtime_bonus
|
| 333 |
+ correctness_bonus
|
| 334 |
- invalid_action_penalty
|
| 335 |
- timeout_penalty
|
|
@@ -367,22 +376,22 @@ class PythonCodeReviewEnvironment(
|
|
| 367 |
reason_parts.append("no meaningful state change")
|
| 368 |
|
| 369 |
return RewardDetails(
|
| 370 |
-
value=value,
|
| 371 |
-
syntax_reward=syntax_reward,
|
| 372 |
-
test_reward=test_reward,
|
| 373 |
-
correctness_bonus=correctness_bonus,
|
| 374 |
-
quality_bonus=quality_bonus,
|
| 375 |
-
error_reduction_bonus=error_reduction_bonus,
|
| 376 |
-
completion_bonus=completion_bonus,
|
| 377 |
-
runtime_bonus=runtime_bonus,
|
| 378 |
-
progress_delta=progress_delta,
|
| 379 |
-
invalid_action_penalty=invalid_action_penalty,
|
| 380 |
-
timeout_penalty=timeout_penalty,
|
| 381 |
-
regression_penalty=regression_penalty,
|
| 382 |
-
stagnation_penalty=stagnation_penalty,
|
| 383 |
reason=", ".join(reason_parts),
|
| 384 |
-
prev_score=prev_score,
|
| 385 |
-
curr_score=curr_score,
|
| 386 |
code_changed=code_changed,
|
| 387 |
)
|
| 388 |
|
|
|
|
| 10 |
|
| 11 |
try:
|
| 12 |
from ..graders import grade_task
|
| 13 |
+
from ..graders.shared import component_score, final_score_pipeline, safe_ratio, safe_score
|
| 14 |
from ..models import (
|
| 15 |
HistoryEntry,
|
| 16 |
PythonCodeReviewAction,
|
|
|
|
| 22 |
from ..tasks import ReviewTask, list_tasks, select_task
|
| 23 |
except ImportError:
|
| 24 |
from graders import grade_task
|
| 25 |
+
from graders.shared import component_score, final_score_pipeline, safe_ratio, safe_score
|
| 26 |
from models import (
|
| 27 |
HistoryEntry,
|
| 28 |
PythonCodeReviewAction,
|
|
|
|
| 46 |
|
| 47 |
|
| 48 |
def _reward_value(value: float) -> float:
|
| 49 |
+
return final_score_pipeline(value)
|
| 50 |
|
| 51 |
|
| 52 |
class PythonCodeReviewEnvironment(
|
|
|
|
| 300 |
) -> RewardDetails:
|
| 301 |
prev_score = previous_grade.score
|
| 302 |
curr_score = current_grade.score
|
| 303 |
+
prev_syntax = previous_grade.syntax_score
|
| 304 |
+
curr_syntax = current_grade.syntax_score
|
| 305 |
+
prev_quality = previous_grade.quality_score
|
| 306 |
+
curr_quality = current_grade.quality_score
|
| 307 |
prev_rate = safe_ratio(previous_grade.tests_passed, previous_grade.tests_total)
|
| 308 |
curr_rate = safe_ratio(current_grade.tests_passed, current_grade.tests_total)
|
| 309 |
prev_runtime = previous_grade.runtime_score
|
| 310 |
curr_runtime = current_grade.runtime_score
|
| 311 |
+
prev_compile_health = 0.1 if str(previous_grade.details.get("compile_error", "")).strip() else 0.95
|
| 312 |
+
curr_compile_health = 0.1 if str(current_grade.details.get("compile_error", "")).strip() else 0.95
|
| 313 |
+
|
| 314 |
+
syntax_reward = max(curr_syntax - prev_syntax, 0.0) * 0.18
|
| 315 |
+
test_reward = max(curr_rate - prev_rate, 0.0) * 0.22
|
| 316 |
+
progress_delta = max(curr_score - prev_score, 0.0) * 0.24
|
| 317 |
+
quality_bonus = max(curr_quality - prev_quality, 0.0) * 0.12
|
| 318 |
+
runtime_bonus = max(curr_runtime - prev_runtime, 0.0) * 0.10
|
| 319 |
+
error_reduction_bonus = max(curr_compile_health - prev_compile_health, 0.0) * 0.14
|
| 320 |
+
completion_bonus = (0.04 + 0.10 * curr_rate) * float(final_submission)
|
| 321 |
+
correctness_bonus = max(curr_score - 0.5, 0.0) * 0.12 * float(final_submission)
|
| 322 |
+
|
| 323 |
+
invalid_action_penalty = (0.04 + (0.08 * (1.0 - prev_score))) if invalid_action else 0.0
|
| 324 |
+
timeout_penalty = (0.05 + (0.06 * max(curr_runtime, prev_runtime))) if timed_out else 0.0
|
| 325 |
+
regression_penalty = max(prev_score - curr_score, 0.0) * 0.24
|
| 326 |
+
stagnation_penalty = (0.02 + (0.04 * prev_score)) if action.action_type == "edit_code" and not code_changed else 0.0
|
| 327 |
|
| 328 |
raw_value = (
|
| 329 |
+
2.0 * (curr_score - 0.5)
|
| 330 |
+
+ 1.2 * (curr_rate - prev_rate)
|
| 331 |
+
+ 0.8 * (curr_quality - prev_quality)
|
| 332 |
+
+ 0.7 * (curr_runtime - prev_runtime)
|
| 333 |
+
+ 0.9 * (curr_syntax - prev_syntax)
|
| 334 |
+
+ 0.6 * (curr_compile_health - prev_compile_health)
|
| 335 |
+ syntax_reward
|
| 336 |
+ test_reward
|
| 337 |
+ progress_delta
|
| 338 |
+ quality_bonus
|
| 339 |
+
+ runtime_bonus
|
| 340 |
+ error_reduction_bonus
|
| 341 |
+ completion_bonus
|
|
|
|
| 342 |
+ correctness_bonus
|
| 343 |
- invalid_action_penalty
|
| 344 |
- timeout_penalty
|
|
|
|
| 376 |
reason_parts.append("no meaningful state change")
|
| 377 |
|
| 378 |
return RewardDetails(
|
| 379 |
+
value=safe_score(value),
|
| 380 |
+
syntax_reward=round(syntax_reward, 6),
|
| 381 |
+
test_reward=round(test_reward, 6),
|
| 382 |
+
correctness_bonus=round(correctness_bonus, 6),
|
| 383 |
+
quality_bonus=round(quality_bonus, 6),
|
| 384 |
+
error_reduction_bonus=round(error_reduction_bonus, 6),
|
| 385 |
+
completion_bonus=round(completion_bonus, 6),
|
| 386 |
+
runtime_bonus=round(runtime_bonus, 6),
|
| 387 |
+
progress_delta=round(progress_delta, 6),
|
| 388 |
+
invalid_action_penalty=round(invalid_action_penalty, 6),
|
| 389 |
+
timeout_penalty=round(timeout_penalty, 6),
|
| 390 |
+
regression_penalty=round(regression_penalty, 6),
|
| 391 |
+
stagnation_penalty=round(stagnation_penalty, 6),
|
| 392 |
reason=", ".join(reason_parts),
|
| 393 |
+
prev_score=safe_score(prev_score),
|
| 394 |
+
curr_score=safe_score(curr_score),
|
| 395 |
code_changed=code_changed,
|
| 396 |
)
|
| 397 |
|
server/requirements.txt
CHANGED
|
@@ -1,8 +1,6 @@
|
|
| 1 |
-
openenv-core[core]>=0.2.2
|
| 2 |
-
fastapi>=0.111.0
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
torch>=2.2.0
|
| 8 |
-
transformers>=4.45.0
|
|
|
|
| 1 |
+
openenv-core[core]>=0.2.2
|
| 2 |
+
fastapi>=0.111.0
|
| 3 |
+
uvicorn>=0.30.0
|
| 4 |
+
openai>=1.76.0
|
| 5 |
+
torch>=2.2.0
|
| 6 |
+
transformers>=4.45.0
|
|
|
|
|
|
services/analysis_service.py
CHANGED
|
@@ -1,139 +1,258 @@
|
|
| 1 |
-
"""Orchestration layer for
|
| 2 |
-
|
| 3 |
-
from __future__ import annotations
|
| 4 |
-
|
| 5 |
-
import time
|
| 6 |
-
from typing import Any, Callable
|
| 7 |
-
|
| 8 |
-
from analyzers import analyze_data_science_code, analyze_dsa_code, analyze_ml_code, analyze_web_code
|
| 9 |
-
from models import PyTorchCodeAnalyzerModel
|
| 10 |
-
from schemas.request import AnalyzeCodeRequest
|
| 11 |
-
from schemas.response import
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Orchestration layer for AI-powered Python code review."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import time
|
| 6 |
+
from typing import Any, Callable
|
| 7 |
+
|
| 8 |
+
from analyzers import analyze_data_science_code, analyze_dsa_code, analyze_ml_code, analyze_web_code
|
| 9 |
+
from models import PyTorchCodeAnalyzerModel
|
| 10 |
+
from schemas.request import AnalyzeCodeRequest
|
| 11 |
+
from schemas.response import (
|
| 12 |
+
AnalysisIssue,
|
| 13 |
+
AnalyzeCodeResponse,
|
| 14 |
+
DomainAnalysis,
|
| 15 |
+
ModelPrediction,
|
| 16 |
+
StaticAnalysisSummary,
|
| 17 |
+
)
|
| 18 |
+
from services.reward_service import RewardService
|
| 19 |
+
from services.suggestion_service import SuggestionService
|
| 20 |
+
from utils import estimate_complexity, parse_code_structure
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
def _clamp_unit(value: float) -> float:
|
| 24 |
+
return max(0.0, min(1.0, float(value)))
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
def _lint_score(parsed: dict[str, Any]) -> float:
|
| 28 |
+
"""Convert structural smells into a normalized lint-style score."""
|
| 29 |
+
|
| 30 |
+
score = 1.0
|
| 31 |
+
if not parsed.get("syntax_valid", True):
|
| 32 |
+
score -= 0.45
|
| 33 |
+
score -= min(int(parsed.get("long_lines", 0) or 0), 5) * 0.03
|
| 34 |
+
if parsed.get("tabs_used"):
|
| 35 |
+
score -= 0.1
|
| 36 |
+
if parsed.get("trailing_whitespace_lines"):
|
| 37 |
+
score -= 0.05
|
| 38 |
+
if parsed.get("docstring_ratio", 0.0) == 0.0 and parsed.get("function_names"):
|
| 39 |
+
score -= 0.08
|
| 40 |
+
return round(_clamp_unit(score), 4)
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
def _static_issues(parsed: dict[str, Any], complexity: dict[str, Any]) -> list[AnalysisIssue]:
|
| 44 |
+
"""Turn parser and complexity heuristics into review issues."""
|
| 45 |
+
|
| 46 |
+
issues: list[AnalysisIssue] = []
|
| 47 |
+
if not parsed.get("syntax_valid", True):
|
| 48 |
+
issues.append(
|
| 49 |
+
AnalysisIssue(
|
| 50 |
+
title="Syntax error blocks execution",
|
| 51 |
+
category="correctness",
|
| 52 |
+
severity="high",
|
| 53 |
+
description=str(parsed.get("syntax_error", "Python failed to parse the snippet.")),
|
| 54 |
+
)
|
| 55 |
+
)
|
| 56 |
+
if int(parsed.get("max_loop_depth", 0) or 0) >= 2:
|
| 57 |
+
issues.append(
|
| 58 |
+
AnalysisIssue(
|
| 59 |
+
title="Nested loops increase runtime risk",
|
| 60 |
+
category="performance",
|
| 61 |
+
severity="medium",
|
| 62 |
+
description="The current control flow suggests a brute-force path that may not scale on larger inputs.",
|
| 63 |
+
)
|
| 64 |
+
)
|
| 65 |
+
if int(complexity.get("cyclomatic_complexity", 1) or 1) >= 7:
|
| 66 |
+
issues.append(
|
| 67 |
+
AnalysisIssue(
|
| 68 |
+
title="Cyclomatic complexity is elevated",
|
| 69 |
+
category="maintainability",
|
| 70 |
+
severity="medium",
|
| 71 |
+
description="Branch-heavy code is harder to review, test, and optimize confidently.",
|
| 72 |
+
)
|
| 73 |
+
)
|
| 74 |
+
if parsed.get("docstring_ratio", 0.0) == 0.0 and parsed.get("function_names"):
|
| 75 |
+
issues.append(
|
| 76 |
+
AnalysisIssue(
|
| 77 |
+
title="Missing public-function documentation",
|
| 78 |
+
category="style",
|
| 79 |
+
severity="low",
|
| 80 |
+
description="Short docstrings would make the expected contract and edge cases easier to review.",
|
| 81 |
+
)
|
| 82 |
+
)
|
| 83 |
+
return issues
|
| 84 |
+
|
| 85 |
+
|
| 86 |
+
class AnalysisService:
|
| 87 |
+
"""End-to-end analysis pipeline shared by API and UI."""
|
| 88 |
+
|
| 89 |
+
def __init__(self) -> None:
|
| 90 |
+
self._model: PyTorchCodeAnalyzerModel | None = None
|
| 91 |
+
self.reward_service = RewardService()
|
| 92 |
+
self.suggestion_service = SuggestionService()
|
| 93 |
+
self._analyzers: dict[str, Callable[[str, dict[str, Any], dict[str, Any]], DomainAnalysis]] = {
|
| 94 |
+
"dsa": analyze_dsa_code,
|
| 95 |
+
"data_science": analyze_data_science_code,
|
| 96 |
+
"ml_dl": analyze_ml_code,
|
| 97 |
+
"web": analyze_web_code,
|
| 98 |
+
}
|
| 99 |
+
|
| 100 |
+
@property
|
| 101 |
+
def model(self) -> PyTorchCodeAnalyzerModel:
|
| 102 |
+
if self._model is None:
|
| 103 |
+
self._model = PyTorchCodeAnalyzerModel()
|
| 104 |
+
return self._model
|
| 105 |
+
|
| 106 |
+
def _heuristic_domain_scores(self, parsed: dict[str, Any], code: str) -> dict[str, float]:
|
| 107 |
+
"""Derive domain priors from imports and syntax-level hints."""
|
| 108 |
+
|
| 109 |
+
scores = {
|
| 110 |
+
"dsa": 0.22
|
| 111 |
+
+ (0.18 if parsed.get("uses_recursion") else 0.0)
|
| 112 |
+
+ (0.18 if int(parsed.get("max_loop_depth", 0) or 0) >= 1 else 0.0),
|
| 113 |
+
"data_science": 0.22 + (0.38 if parsed.get("uses_pandas") or parsed.get("uses_numpy") else 0.0),
|
| 114 |
+
"ml_dl": 0.22 + (0.38 if parsed.get("uses_torch") or parsed.get("uses_sklearn") else 0.0),
|
| 115 |
+
"web": 0.22
|
| 116 |
+
+ (0.38 if parsed.get("uses_fastapi") or parsed.get("uses_flask") else 0.0)
|
| 117 |
+
+ (0.12 if parsed.get("route_decorators") else 0.0),
|
| 118 |
+
"general": 0.26,
|
| 119 |
+
}
|
| 120 |
+
lowered = code.lower()
|
| 121 |
+
if "fastapi" in lowered:
|
| 122 |
+
scores["web"] += 0.12
|
| 123 |
+
if "pandas" in lowered or "numpy" in lowered:
|
| 124 |
+
scores["data_science"] += 0.1
|
| 125 |
+
if "torch" in lowered or "sklearn" in lowered:
|
| 126 |
+
scores["ml_dl"] += 0.1
|
| 127 |
+
if "while" in code or "for" in code:
|
| 128 |
+
scores["dsa"] += 0.06
|
| 129 |
+
return {key: round(min(value, 0.99), 4) for key, value in scores.items()}
|
| 130 |
+
|
| 131 |
+
def _general_domain_analysis(self, parsed: dict[str, Any], complexity: dict[str, Any]) -> DomainAnalysis:
|
| 132 |
+
"""Fallback analysis when no specialized domain is strongly selected."""
|
| 133 |
+
|
| 134 |
+
suggestions = [
|
| 135 |
+
"Keep functions small, validate inputs explicitly, and add focused tests for edge cases.",
|
| 136 |
+
]
|
| 137 |
+
if int(parsed.get("max_loop_depth", 0) or 0) >= 2:
|
| 138 |
+
suggestions.append("Consider replacing repeated scans with a precomputed dictionary or set.")
|
| 139 |
+
return DomainAnalysis(
|
| 140 |
+
domain="general",
|
| 141 |
+
domain_score=round(_clamp_unit(0.62 - (0.12 * float(complexity["complexity_penalty"]))), 4),
|
| 142 |
+
issues=_static_issues(parsed, complexity)[:2],
|
| 143 |
+
suggestions=suggestions,
|
| 144 |
+
highlights={
|
| 145 |
+
"cyclomatic_complexity": float(complexity["cyclomatic_complexity"]),
|
| 146 |
+
"max_loop_depth": float(parsed.get("max_loop_depth", 0) or 0),
|
| 147 |
+
"lint_score": float(_lint_score(parsed)),
|
| 148 |
+
},
|
| 149 |
+
)
|
| 150 |
+
|
| 151 |
+
def analyze(self, request: AnalyzeCodeRequest) -> AnalyzeCodeResponse:
|
| 152 |
+
"""Run the complete static-plus-ML code review pipeline."""
|
| 153 |
+
|
| 154 |
+
started = time.perf_counter()
|
| 155 |
+
parsed = parse_code_structure(request.code)
|
| 156 |
+
complexity = estimate_complexity(parsed, request.code)
|
| 157 |
+
lint_score = _lint_score(parsed)
|
| 158 |
+
model_prediction = self.model.predict(
|
| 159 |
+
request.code,
|
| 160 |
+
request.context_window,
|
| 161 |
+
request.traceback_text,
|
| 162 |
+
parsed,
|
| 163 |
+
)
|
| 164 |
+
heuristic_scores = self._heuristic_domain_scores(parsed, request.code)
|
| 165 |
+
|
| 166 |
+
combined_scores: dict[str, float] = {}
|
| 167 |
+
for domain, heuristic_score in heuristic_scores.items():
|
| 168 |
+
model_score = float(model_prediction["domain_scores"].get(domain, 0.2))
|
| 169 |
+
combined_scores[domain] = round((0.65 * model_score) + (0.35 * heuristic_score), 4)
|
| 170 |
+
|
| 171 |
+
detected_domain = request.domain_hint if request.domain_hint != "auto" else max(combined_scores, key=combined_scores.get)
|
| 172 |
+
analyzer = self._analyzers.get(detected_domain)
|
| 173 |
+
domain_analysis = (
|
| 174 |
+
analyzer(request.code, parsed, complexity)
|
| 175 |
+
if analyzer is not None
|
| 176 |
+
else self._general_domain_analysis(parsed, complexity)
|
| 177 |
+
)
|
| 178 |
+
static_issues = _static_issues(parsed, complexity)
|
| 179 |
+
static_analysis = StaticAnalysisSummary(
|
| 180 |
+
syntax_valid=bool(parsed["syntax_valid"]),
|
| 181 |
+
syntax_error=str(parsed["syntax_error"]),
|
| 182 |
+
cyclomatic_complexity=int(complexity["cyclomatic_complexity"]),
|
| 183 |
+
line_count=int(parsed["line_count"]),
|
| 184 |
+
max_nesting_depth=int(parsed["max_nesting_depth"]),
|
| 185 |
+
max_loop_depth=int(parsed["max_loop_depth"]),
|
| 186 |
+
time_complexity=str(complexity["time_complexity"]),
|
| 187 |
+
space_complexity=str(complexity["space_complexity"]),
|
| 188 |
+
lint_score=lint_score,
|
| 189 |
+
docstring_coverage=float(parsed["docstring_ratio"]),
|
| 190 |
+
detected_imports=list(parsed["imports"]),
|
| 191 |
+
code_smells=list(parsed["code_smells"]),
|
| 192 |
+
issues=static_issues,
|
| 193 |
+
)
|
| 194 |
+
|
| 195 |
+
score_breakdown = self.reward_service.compute(
|
| 196 |
+
ml_score=float(model_prediction["ml_quality_score"]),
|
| 197 |
+
domain_score=domain_analysis.domain_score,
|
| 198 |
+
lint_score=lint_score,
|
| 199 |
+
complexity_penalty=float(complexity["complexity_penalty"]),
|
| 200 |
+
maintainability_score=float(model_prediction["maintainability_score"]),
|
| 201 |
+
issue_probabilities=dict(model_prediction["issue_probabilities"]),
|
| 202 |
+
)
|
| 203 |
+
suggestions = self.suggestion_service.build_suggestions(
|
| 204 |
+
domain_analysis=domain_analysis,
|
| 205 |
+
static_analysis=static_analysis,
|
| 206 |
+
)
|
| 207 |
+
improvement_plan = self.suggestion_service.build_improvement_plan(
|
| 208 |
+
domain_analysis=domain_analysis,
|
| 209 |
+
static_analysis=static_analysis,
|
| 210 |
+
)
|
| 211 |
+
auto_fix_preview = self.suggestion_service.build_auto_fix_preview(
|
| 212 |
+
domain_analysis=domain_analysis,
|
| 213 |
+
static_analysis=static_analysis,
|
| 214 |
+
)
|
| 215 |
+
|
| 216 |
+
summary = (
|
| 217 |
+
f"Reviewed Python code as `{detected_domain}` with an ML quality score of {score_breakdown.ml_score:.0%}, "
|
| 218 |
+
f"lint score {score_breakdown.lint_score:.0%}, and RL-ready reward {score_breakdown.reward:.0%}."
|
| 219 |
+
)
|
| 220 |
+
model_notes = list(model_prediction["notes"])
|
| 221 |
+
if static_issues:
|
| 222 |
+
model_notes.append(f"Static analyzer found {len(static_issues)} review issue(s).")
|
| 223 |
+
|
| 224 |
+
return AnalyzeCodeResponse(
|
| 225 |
+
detected_domain=detected_domain, # type: ignore[arg-type]
|
| 226 |
+
domain_confidences=combined_scores,
|
| 227 |
+
score_breakdown=score_breakdown,
|
| 228 |
+
static_analysis=static_analysis,
|
| 229 |
+
model_prediction=ModelPrediction(
|
| 230 |
+
quality_label=str(model_prediction["quality_label"]), # type: ignore[arg-type]
|
| 231 |
+
quality_score=float(model_prediction["quality_score"]),
|
| 232 |
+
maintainability_score=float(model_prediction["maintainability_score"]),
|
| 233 |
+
issue_probabilities=dict(model_prediction["issue_probabilities"]),
|
| 234 |
+
notes=model_notes,
|
| 235 |
+
),
|
| 236 |
+
domain_analysis=domain_analysis,
|
| 237 |
+
suggestions=suggestions if request.enable_suggestions else [],
|
| 238 |
+
improvement_plan=improvement_plan if request.enable_suggestions else [],
|
| 239 |
+
auto_fix_preview=auto_fix_preview if request.enable_suggestions else [],
|
| 240 |
+
score_visualization={
|
| 241 |
+
"reward": score_breakdown.reward,
|
| 242 |
+
"ml_quality": score_breakdown.ml_score,
|
| 243 |
+
"lint_score": score_breakdown.lint_score,
|
| 244 |
+
"maintainability": score_breakdown.maintainability_score,
|
| 245 |
+
"security": score_breakdown.security_score,
|
| 246 |
+
"readability": score_breakdown.readability_score,
|
| 247 |
+
"quality_signal": score_breakdown.quality_signal,
|
| 248 |
+
"error_reduction_signal": score_breakdown.error_reduction_signal,
|
| 249 |
+
"completion_signal": score_breakdown.completion_signal,
|
| 250 |
+
"complexity_penalty": score_breakdown.complexity_penalty,
|
| 251 |
+
},
|
| 252 |
+
model_backend=str(model_prediction["backend_name"]),
|
| 253 |
+
model_id=str(model_prediction["model_id"]),
|
| 254 |
+
summary=summary,
|
| 255 |
+
context_window=request.context_window,
|
| 256 |
+
filename=request.filename,
|
| 257 |
+
analysis_time_ms=round((time.perf_counter() - started) * 1000.0, 2),
|
| 258 |
+
)
|
services/reward_service.py
CHANGED
|
@@ -1,38 +1,56 @@
|
|
| 1 |
-
"""Reward shaping logic for RL-ready code analysis scores."""
|
| 2 |
-
|
| 3 |
-
from __future__ import annotations
|
| 4 |
-
|
| 5 |
-
from schemas.response import ScoreBreakdown
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
)
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Reward shaping logic for RL-ready code analysis scores."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
from schemas.response import ScoreBreakdown
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
def _clamp_unit(value: float) -> float:
|
| 9 |
+
return max(0.0, min(1.0, float(value)))
|
| 10 |
+
|
| 11 |
+
|
| 12 |
+
class RewardService:
|
| 13 |
+
"""Compute reward scores from model, lint, complexity, and issue-risk signals."""
|
| 14 |
+
|
| 15 |
+
def compute(
|
| 16 |
+
self,
|
| 17 |
+
*,
|
| 18 |
+
ml_score: float,
|
| 19 |
+
domain_score: float,
|
| 20 |
+
lint_score: float,
|
| 21 |
+
complexity_penalty: float,
|
| 22 |
+
maintainability_score: float,
|
| 23 |
+
issue_probabilities: dict[str, float],
|
| 24 |
+
) -> ScoreBreakdown:
|
| 25 |
+
"""Apply RL-friendly reward shaping to the code review analysis signals."""
|
| 26 |
+
|
| 27 |
+
security_score = _clamp_unit(1.0 - issue_probabilities.get("security", 0.0))
|
| 28 |
+
readability_score = _clamp_unit((0.6 * lint_score) + (0.4 * maintainability_score))
|
| 29 |
+
quality_signal = _clamp_unit((0.55 * ml_score) + (0.25 * maintainability_score) + (0.20 * domain_score))
|
| 30 |
+
error_reduction_signal = _clamp_unit((0.7 * lint_score) + (0.3 * (1.0 - complexity_penalty)))
|
| 31 |
+
completion_signal = _clamp_unit(
|
| 32 |
+
(0.4 * quality_signal) + (0.25 * readability_score) + (0.2 * security_score) + (0.15 * domain_score)
|
| 33 |
+
)
|
| 34 |
+
|
| 35 |
+
reward = _clamp_unit(
|
| 36 |
+
(0.5 * ml_score)
|
| 37 |
+
+ (0.18 * lint_score)
|
| 38 |
+
+ (0.12 * maintainability_score)
|
| 39 |
+
+ (0.10 * domain_score)
|
| 40 |
+
+ (0.10 * security_score)
|
| 41 |
+
- (0.20 * complexity_penalty)
|
| 42 |
+
)
|
| 43 |
+
|
| 44 |
+
return ScoreBreakdown(
|
| 45 |
+
ml_score=round(ml_score, 4),
|
| 46 |
+
domain_score=round(domain_score, 4),
|
| 47 |
+
lint_score=round(lint_score, 4),
|
| 48 |
+
complexity_penalty=round(complexity_penalty, 4),
|
| 49 |
+
maintainability_score=round(maintainability_score, 4),
|
| 50 |
+
security_score=round(security_score, 4),
|
| 51 |
+
readability_score=round(readability_score, 4),
|
| 52 |
+
quality_signal=round(quality_signal, 4),
|
| 53 |
+
error_reduction_signal=round(error_reduction_signal, 4),
|
| 54 |
+
completion_signal=round(completion_signal, 4),
|
| 55 |
+
reward=round(reward, 4),
|
| 56 |
+
)
|
services/suggestion_service.py
CHANGED
|
@@ -1,28 +1,113 @@
|
|
| 1 |
-
"""Suggestion and improvement-plan generation for analyzed code."""
|
| 2 |
-
|
| 3 |
-
from __future__ import annotations
|
| 4 |
-
|
| 5 |
-
from schemas.response import DomainAnalysis, StaticAnalysisSummary
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
class SuggestionService:
|
| 9 |
-
"""Build high-signal improvement
|
| 10 |
-
|
| 11 |
-
def
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Suggestion and improvement-plan generation for analyzed code."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
from schemas.response import DomainAnalysis, StaticAnalysisSummary, SuggestionItem
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
class SuggestionService:
|
| 9 |
+
"""Build high-signal improvement suggestions from analysis output."""
|
| 10 |
+
|
| 11 |
+
def build_suggestions(
|
| 12 |
+
self,
|
| 13 |
+
*,
|
| 14 |
+
domain_analysis: DomainAnalysis,
|
| 15 |
+
static_analysis: StaticAnalysisSummary,
|
| 16 |
+
) -> list[SuggestionItem]:
|
| 17 |
+
"""Return prioritized fixes tailored to the detected review signals."""
|
| 18 |
+
|
| 19 |
+
suggestions: list[SuggestionItem] = []
|
| 20 |
+
|
| 21 |
+
if not static_analysis.syntax_valid:
|
| 22 |
+
suggestions.append(
|
| 23 |
+
SuggestionItem(
|
| 24 |
+
priority="P0",
|
| 25 |
+
title="Fix the syntax error",
|
| 26 |
+
rationale="Static parsing failed, so downstream tests and model signals are less reliable.",
|
| 27 |
+
action=f"Resolve the parser issue first: {static_analysis.syntax_error}.",
|
| 28 |
+
category="correctness",
|
| 29 |
+
)
|
| 30 |
+
)
|
| 31 |
+
|
| 32 |
+
if static_analysis.cyclomatic_complexity >= 6 or static_analysis.max_loop_depth >= 2:
|
| 33 |
+
suggestions.append(
|
| 34 |
+
SuggestionItem(
|
| 35 |
+
priority="P1",
|
| 36 |
+
title="Reduce branching or nested loops",
|
| 37 |
+
rationale="Higher structural complexity makes bugs more likely and lowers the RL reward.",
|
| 38 |
+
action="Extract helper functions or replace repeated scans with a dictionary, set, Counter, or vectorized operation.",
|
| 39 |
+
category="performance",
|
| 40 |
+
)
|
| 41 |
+
)
|
| 42 |
+
|
| 43 |
+
if static_analysis.docstring_coverage == 0 and static_analysis.line_count > 0:
|
| 44 |
+
suggestions.append(
|
| 45 |
+
SuggestionItem(
|
| 46 |
+
priority="P2",
|
| 47 |
+
title="Add function-level documentation",
|
| 48 |
+
rationale="Docstrings improve review speed and make behavior clearer for future edits.",
|
| 49 |
+
action="Document the expected inputs, outputs, and edge cases in a short function docstring.",
|
| 50 |
+
category="style",
|
| 51 |
+
)
|
| 52 |
+
)
|
| 53 |
+
|
| 54 |
+
for issue in domain_analysis.issues[:2]:
|
| 55 |
+
suggestions.append(
|
| 56 |
+
SuggestionItem(
|
| 57 |
+
priority="P1" if issue.severity != "high" else "P0",
|
| 58 |
+
title=issue.title,
|
| 59 |
+
rationale=issue.description,
|
| 60 |
+
action=domain_analysis.suggestions[0] if domain_analysis.suggestions else "Refactor the risky section and re-run analysis.",
|
| 61 |
+
category=issue.category,
|
| 62 |
+
)
|
| 63 |
+
)
|
| 64 |
+
|
| 65 |
+
if not suggestions:
|
| 66 |
+
suggestions.append(
|
| 67 |
+
SuggestionItem(
|
| 68 |
+
priority="P2",
|
| 69 |
+
title="Strengthen review confidence",
|
| 70 |
+
rationale="No severe issues were detected, but explicit edge-case coverage still improves maintainability.",
|
| 71 |
+
action="Add targeted tests for empty input, boundary values, and malformed payloads.",
|
| 72 |
+
category="maintainability",
|
| 73 |
+
)
|
| 74 |
+
)
|
| 75 |
+
return suggestions[:4]
|
| 76 |
+
|
| 77 |
+
def build_improvement_plan(self, *, domain_analysis: DomainAnalysis, static_analysis: StaticAnalysisSummary) -> list[str]:
|
| 78 |
+
"""Return a compact three-step plan optimized for developer action."""
|
| 79 |
+
|
| 80 |
+
primary_issue = (
|
| 81 |
+
domain_analysis.issues[0].description
|
| 82 |
+
if domain_analysis.issues
|
| 83 |
+
else "Stabilize correctness first and keep the public behavior explicit."
|
| 84 |
+
)
|
| 85 |
+
|
| 86 |
+
step_one = f"Step 1 - Correctness and safety: {primary_issue}"
|
| 87 |
+
step_two = "Step 2 - Edge cases: test empty inputs, boundary values, malformed payloads, and failure-mode behavior explicitly."
|
| 88 |
+
step_three = "Step 3 - Scalability: reduce repeated scans, lower cyclomatic complexity, and benchmark the path on realistic input sizes."
|
| 89 |
+
|
| 90 |
+
if domain_analysis.suggestions:
|
| 91 |
+
step_three = f"{step_three} Priority hint: {domain_analysis.suggestions[0]}"
|
| 92 |
+
if not static_analysis.syntax_valid:
|
| 93 |
+
step_one = f"Step 1 - Correctness and safety: fix the syntax error first ({static_analysis.syntax_error})."
|
| 94 |
+
return [step_one, step_two, step_three]
|
| 95 |
+
|
| 96 |
+
def build_auto_fix_preview(
|
| 97 |
+
self,
|
| 98 |
+
*,
|
| 99 |
+
domain_analysis: DomainAnalysis,
|
| 100 |
+
static_analysis: StaticAnalysisSummary,
|
| 101 |
+
) -> list[str]:
|
| 102 |
+
"""Generate compact auto-fix hints for the UI preview panel."""
|
| 103 |
+
|
| 104 |
+
preview: list[str] = []
|
| 105 |
+
if not static_analysis.syntax_valid:
|
| 106 |
+
preview.append(f"Repair parser failure: {static_analysis.syntax_error}")
|
| 107 |
+
if static_analysis.max_loop_depth >= 2:
|
| 108 |
+
preview.append("Replace nested scans with a precomputed lookup table or aggregation structure.")
|
| 109 |
+
if static_analysis.docstring_coverage == 0:
|
| 110 |
+
preview.append("Add a short docstring describing the function contract and edge cases.")
|
| 111 |
+
if domain_analysis.suggestions:
|
| 112 |
+
preview.append(domain_analysis.suggestions[0])
|
| 113 |
+
return preview[:3]
|
tests/test_inference_runner.py
CHANGED
|
@@ -1,11 +1,12 @@
|
|
| 1 |
"""Smoke tests for the strict inference output contract."""
|
| 2 |
|
| 3 |
-
from __future__ import annotations
|
| 4 |
-
|
| 5 |
-
from dataclasses import dataclass, field
|
| 6 |
-
|
| 7 |
-
from app.env.runner import InferenceRunner
|
| 8 |
-
from app.models.inference import AgentDecision, InferenceConfig
|
|
|
|
| 9 |
|
| 10 |
|
| 11 |
@dataclass
|
|
@@ -56,6 +57,17 @@ class _FakeAgent:
|
|
| 56 |
return AgentDecision(action_type="submit_solution")
|
| 57 |
|
| 58 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 59 |
def test_inference_runner_emits_strict_lines(capsys) -> None:
|
| 60 |
runner = InferenceRunner(InferenceConfig.from_env())
|
| 61 |
runner.agent = _FakeAgent()
|
|
@@ -69,3 +81,38 @@ def test_inference_runner_emits_strict_lines(capsys) -> None:
|
|
| 69 |
"[STEP] step=2 action=submit_solution reward=0.97 done=true error=null",
|
| 70 |
"[END] success=true steps=2 rewards=0.45,0.97",
|
| 71 |
]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
"""Smoke tests for the strict inference output contract."""
|
| 2 |
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
from dataclasses import dataclass, field
|
| 6 |
+
|
| 7 |
+
from app.env.runner import InferenceRunner
|
| 8 |
+
from app.models.inference import AgentDecision, InferenceConfig
|
| 9 |
+
from app.utils.runtime import format_reward
|
| 10 |
|
| 11 |
|
| 12 |
@dataclass
|
|
|
|
| 57 |
return AgentDecision(action_type="submit_solution")
|
| 58 |
|
| 59 |
|
| 60 |
+
class _LowScoreEnv(_FakeEnv):
|
| 61 |
+
def step_result(self, action: object) -> tuple[_FakeObservation, float, bool, dict[str, object]]:
|
| 62 |
+
self._step += 1
|
| 63 |
+
return (
|
| 64 |
+
_FakeObservation("demo_task", 2, 0.60, True, current_code="candidate"),
|
| 65 |
+
0.60,
|
| 66 |
+
True,
|
| 67 |
+
{"last_action_error": None},
|
| 68 |
+
)
|
| 69 |
+
|
| 70 |
+
|
| 71 |
def test_inference_runner_emits_strict_lines(capsys) -> None:
|
| 72 |
runner = InferenceRunner(InferenceConfig.from_env())
|
| 73 |
runner.agent = _FakeAgent()
|
|
|
|
| 81 |
"[STEP] step=2 action=submit_solution reward=0.97 done=true error=null",
|
| 82 |
"[END] success=true steps=2 rewards=0.45,0.97",
|
| 83 |
]
|
| 84 |
+
|
| 85 |
+
|
| 86 |
+
def test_inference_runner_marks_low_score_submission_unsuccessful(capsys) -> None:
|
| 87 |
+
runner = InferenceRunner(InferenceConfig.from_env())
|
| 88 |
+
runner.agent = _FakeAgent()
|
| 89 |
+
runner._create_env = lambda: _LowScoreEnv() # type: ignore[method-assign]
|
| 90 |
+
runner.run_task("demo_task")
|
| 91 |
+
|
| 92 |
+
captured = capsys.readouterr().out.strip().splitlines()
|
| 93 |
+
assert captured[-1] == "[END] success=false steps=1 rewards=0.60"
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
def test_inference_config_prefers_openai_key_for_openai_base_url(monkeypatch) -> None:
|
| 97 |
+
monkeypatch.setenv("API_BASE_URL", "https://api.openai.com/v1")
|
| 98 |
+
monkeypatch.setenv("OPENAI_API_KEY", "openai-key")
|
| 99 |
+
monkeypatch.setenv("HF_TOKEN", "hf-key")
|
| 100 |
+
|
| 101 |
+
config = InferenceConfig.from_env()
|
| 102 |
+
|
| 103 |
+
assert config.api_key == "openai-key"
|
| 104 |
+
|
| 105 |
+
|
| 106 |
+
def test_inference_config_prefers_hf_key_for_hf_router(monkeypatch) -> None:
|
| 107 |
+
monkeypatch.setenv("API_BASE_URL", "https://router.huggingface.co/v1")
|
| 108 |
+
monkeypatch.setenv("OPENAI_API_KEY", "openai-key")
|
| 109 |
+
monkeypatch.setenv("HF_TOKEN", "hf-key")
|
| 110 |
+
|
| 111 |
+
config = InferenceConfig.from_env()
|
| 112 |
+
|
| 113 |
+
assert config.api_key == "hf-key"
|
| 114 |
+
|
| 115 |
+
|
| 116 |
+
def test_reward_formatting_stays_in_strict_two_decimal_interval() -> None:
|
| 117 |
+
assert format_reward(0.999999) == "0.99"
|
| 118 |
+
assert format_reward(0.000001) == "0.01"
|
tests/test_scoring.py
CHANGED
|
@@ -1,6 +1,7 @@
|
|
| 1 |
from __future__ import annotations
|
| 2 |
|
| 3 |
from graders import grade_task
|
|
|
|
| 4 |
from models import PythonCodeReviewAction
|
| 5 |
from server.env import PythonCodeReviewEnvironment
|
| 6 |
from tasks import list_tasks
|
|
@@ -10,6 +11,16 @@ def assert_open_unit_interval(value: float) -> None:
|
|
| 10 |
assert 0 < value < 1, f"Invalid score: {value}"
|
| 11 |
|
| 12 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
def test_task_grades_stay_strictly_between_zero_and_one() -> None:
|
| 14 |
for task in list_tasks():
|
| 15 |
starter_grade = grade_task(task, task.starter_code, include_hidden=False)
|
|
|
|
| 1 |
from __future__ import annotations
|
| 2 |
|
| 3 |
from graders import grade_task
|
| 4 |
+
from graders.shared import component_score, final_score_pipeline, safe_score, shaped_score
|
| 5 |
from models import PythonCodeReviewAction
|
| 6 |
from server.env import PythonCodeReviewEnvironment
|
| 7 |
from tasks import list_tasks
|
|
|
|
| 11 |
assert 0 < value < 1, f"Invalid score: {value}"
|
| 12 |
|
| 13 |
|
| 14 |
+
def test_score_helpers_clamp_extremes_into_open_interval() -> None:
|
| 15 |
+
for value in (0.0, 1.0, -999999.0, 999999.0):
|
| 16 |
+
assert_open_unit_interval(safe_score(value))
|
| 17 |
+
assert_open_unit_interval(final_score_pipeline(value))
|
| 18 |
+
|
| 19 |
+
for progress in (0.0, 0.5, 1.0):
|
| 20 |
+
assert_open_unit_interval(shaped_score(progress))
|
| 21 |
+
assert_open_unit_interval(component_score(progress))
|
| 22 |
+
|
| 23 |
+
|
| 24 |
def test_task_grades_stay_strictly_between_zero_and_one() -> None:
|
| 25 |
for task in list_tasks():
|
| 26 |
starter_grade = grade_task(task, task.starter_code, include_hidden=False)
|
utils/ast_parser.py
CHANGED
|
@@ -1,144 +1,248 @@
|
|
| 1 |
-
"""
|
| 2 |
-
|
| 3 |
-
from __future__ import annotations
|
| 4 |
-
|
| 5 |
-
import ast
|
| 6 |
-
from
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
def
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
if isinstance(node, ast.
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""AST-based parsing helpers for Python code review."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import ast
|
| 6 |
+
from dataclasses import dataclass, field
|
| 7 |
+
from typing import Any
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
@dataclass(slots=True)
class _StructureVisitor(ast.NodeVisitor):
    """Collect lightweight structural signals from Python source.

    The visitor walks one parsed module and accumulates counters that the
    downstream review heuristics consume: top-level imports, loop/nesting
    depth, branch counts, recursion, docstring coverage, and a few
    framework-specific call patterns (PyTorch backward/optimizer steps,
    HTTP route decorators).
    """

    # Top-level package name of every import (e.g. "numpy" from "numpy.linalg").
    imports: set[str] = field(default_factory=set)
    # HTTP verb decorator names (get/post/put/patch/delete) seen on functions.
    route_decorators: set[str] = field(default_factory=set)
    # All function/async-function names, in definition order (may repeat).
    function_names: list[str] = field(default_factory=list)
    # All class names, in definition order.
    class_names: list[str] = field(default_factory=list)
    # Free-form smell strings (currently only appended to by callers).
    code_smells: list[str] = field(default_factory=list)
    # Count of decision points: if/try and every loop (incl. comprehensions).
    branch_count: int = 0
    # Deepest simultaneous loop nesting observed anywhere in the module.
    max_loop_depth: int = 0
    # Deepest structural nesting (loops, if, try, with) observed.
    max_nesting_depth: int = 0
    # Live counters used while descending; restored on the way back up.
    current_loop_depth: int = 0
    current_nesting_depth: int = 0
    # Functions that call themselves by their own (direct) name.
    recursive_functions: set[str] = field(default_factory=set)
    # Name of the function currently being visited, for recursion detection.
    current_function: str | None = None
    # Docstring coverage: total function defs vs. defs with a docstring.
    docstring_total: int = 0
    docstring_with_docs: int = 0
    # Calls ending in ".backward" / "backward" (PyTorch training signal).
    backward_calls: int = 0
    # ".step" calls whose dotted path mentions "optimizer".
    optimizer_step_calls: int = 0
    # Direct calls to the builtin container constructors list/dict/set/tuple.
    container_builds: int = 0

    def visit_Import(self, node: ast.Import) -> None:  # noqa: N802
        """Record the top-level package of each `import pkg[.sub]` alias."""
        for alias in node.names:
            self.imports.add(alias.name.split(".")[0])
        self.generic_visit(node)

    def visit_ImportFrom(self, node: ast.ImportFrom) -> None:  # noqa: N802
        """Record the top-level package of `from pkg[.sub] import ...`.

        NOTE(review): relative imports (`from . import x`) have module=None
        and are deliberately skipped here.
        """
        if node.module:
            self.imports.add(node.module.split(".")[0])
        self.generic_visit(node)

    def _push_nesting(self) -> None:
        """Enter one structural nesting level and track the high-water mark."""
        self.current_nesting_depth += 1
        self.max_nesting_depth = max(self.max_nesting_depth, self.current_nesting_depth)

    def _pop_nesting(self) -> None:
        """Leave one structural nesting level (never going below zero)."""
        self.current_nesting_depth = max(0, self.current_nesting_depth - 1)

    def _visit_loop(self, node: ast.AST) -> None:
        """Shared bookkeeping for all loop-like nodes.

        Counts the loop as a branch, tracks both loop depth and structural
        nesting depth around the recursive descent, then restores the
        counters so sibling nodes are measured independently.
        """
        self.branch_count += 1
        self.current_loop_depth += 1
        self.max_loop_depth = max(self.max_loop_depth, self.current_loop_depth)
        self._push_nesting()
        self.generic_visit(node)
        self._pop_nesting()
        self.current_loop_depth = max(0, self.current_loop_depth - 1)

    def visit_For(self, node: ast.For) -> None:  # noqa: N802
        """Treat `for` as a loop for depth/branch accounting."""
        self._visit_loop(node)

    def visit_AsyncFor(self, node: ast.AsyncFor) -> None:  # noqa: N802
        """Treat `async for` the same as a synchronous loop."""
        self._visit_loop(node)

    def visit_While(self, node: ast.While) -> None:  # noqa: N802
        """Treat `while` as a loop for depth/branch accounting."""
        self._visit_loop(node)

    def visit_If(self, node: ast.If) -> None:  # noqa: N802
        """Count `if` as a branch and one structural nesting level."""
        self.branch_count += 1
        self._push_nesting()
        self.generic_visit(node)
        self._pop_nesting()

    def visit_Try(self, node: ast.Try) -> None:  # noqa: N802
        """Count `try` as a branch and one structural nesting level."""
        self.branch_count += 1
        self._push_nesting()
        self.generic_visit(node)
        self._pop_nesting()

    def visit_With(self, node: ast.With) -> None:  # noqa: N802
        """`with` adds nesting but is not counted as a branch."""
        self._push_nesting()
        self.generic_visit(node)
        self._pop_nesting()

    def visit_AsyncWith(self, node: ast.AsyncWith) -> None:  # noqa: N802
        """`async with` adds nesting but is not counted as a branch."""
        self._push_nesting()
        self.generic_visit(node)
        self._pop_nesting()

    def visit_comprehension(self, node: ast.comprehension) -> None:  # noqa: N802
        """Each comprehension generator clause counts as one loop."""
        self._visit_loop(node)

    def visit_FunctionDef(self, node: ast.FunctionDef) -> None:  # noqa: N802
        """Record the function, its docstring coverage, and route decorators.

        `current_function` is saved and restored around the descent so that
        direct self-calls inside nested functions are attributed correctly.
        """
        self.function_names.append(node.name)
        self.docstring_total += 1
        if ast.get_docstring(node):
            self.docstring_with_docs += 1
        prior = self.current_function
        self.current_function = node.name
        for decorator in node.decorator_list:
            decorator_name = self._decorator_name(decorator)
            if decorator_name in {"get", "post", "put", "patch", "delete"}:
                self.route_decorators.add(decorator_name)
        self.generic_visit(node)
        self.current_function = prior

    def visit_AsyncFunctionDef(self, node: ast.AsyncFunctionDef) -> None:  # noqa: N802
        """Async defs share all the bookkeeping of sync defs."""
        self.visit_FunctionDef(node)

    def visit_ClassDef(self, node: ast.ClassDef) -> None:  # noqa: N802
        """Record the class name and keep walking its body."""
        self.class_names.append(node.name)
        self.generic_visit(node)

    def visit_Call(self, node: ast.Call) -> None:  # noqa: N802
        """Classify calls by their dotted name.

        Detects `.backward()` (any receiver), `optimizer...step()`, direct
        builtin container constructors, and direct recursion (a call whose
        bare name equals the enclosing function's name).
        """
        dotted_name = self._call_name(node.func)
        if dotted_name.endswith(".backward") or dotted_name == "backward":
            self.backward_calls += 1
        if dotted_name.endswith(".step") or dotted_name == "step":
            # Only count step() when the receiver path mentions "optimizer",
            # to avoid e.g. scheduler.step() — a deliberate heuristic.
            if "optimizer" in dotted_name:
                self.optimizer_step_calls += 1
        if dotted_name in {"list", "dict", "set", "tuple"}:
            self.container_builds += 1
        if self.current_function and dotted_name == self.current_function:
            self.recursive_functions.add(self.current_function)
        self.generic_visit(node)

    @staticmethod
    def _call_name(node: ast.AST) -> str:
        """Flatten a call target to a dotted name ("a.b.c"); "" if opaque.

        Subscript/lambda/other callees collapse to "", which also empties
        the left side of an attribute chain rooted at them.
        """
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Attribute):
            left = _StructureVisitor._call_name(node.value)
            return f"{left}.{node.attr}" if left else node.attr
        return ""

    @staticmethod
    def _decorator_name(node: ast.AST) -> str:
        """Resolve a decorator to its lowercase trailing name.

        Handles bare names, attributes (`app.get`), and calls
        (`app.get("/x")`); anything else yields "".
        """
        if isinstance(node, ast.Call):
            return _StructureVisitor._decorator_name(node.func)
        if isinstance(node, ast.Attribute):
            return node.attr.lower()
        if isinstance(node, ast.Name):
            return node.id.lower()
        return ""
|
| 144 |
+
|
| 145 |
+
|
| 146 |
+
def _line_smells(lines: list[str]) -> tuple[int, list[int], bool]:
|
| 147 |
+
long_lines = sum(1 for line in lines if len(line) > 88)
|
| 148 |
+
trailing_whitespace_lines = [index + 1 for index, line in enumerate(lines) if line.rstrip() != line]
|
| 149 |
+
tabs_used = any("\t" in line for line in lines)
|
| 150 |
+
return long_lines, trailing_whitespace_lines, tabs_used
|
| 151 |
+
|
| 152 |
+
|
| 153 |
+
def parse_code_structure(code: str) -> dict[str, Any]:
    """Extract deterministic syntax, import, and structure signals from Python code.

    Args:
        code: Python source text; ``None`` or empty input is treated as an
            empty module.

    Returns:
        A dict of structural signals: syntax validity, line metrics, import
        names, function/class names, loop and nesting depth, framework-usage
        flags, and human-readable code smells. When the source does not
        parse, only the line-level metrics are populated, ``syntax_valid``
        is ``False``, and ``syntax_error`` describes the failure.
    """
    normalized_code = code or ""
    lines = normalized_code.splitlines()
    long_lines, trailing_whitespace_lines, tabs_used = _line_smells(lines)

    # Defaults cover the syntax-error path, where no AST walk happens.
    result: dict[str, Any] = {
        "syntax_valid": True,
        "syntax_error": "",
        "line_count": len(lines),
        "imports": [],
        "function_names": [],
        "class_names": [],
        "long_lines": long_lines,
        "trailing_whitespace_lines": trailing_whitespace_lines,
        "tabs_used": tabs_used,
        "docstring_ratio": 0.0,
        "uses_recursion": False,
        "max_loop_depth": 0,
        "max_nesting_depth": 0,
        "route_decorators": [],
        "code_smells": [],
        "uses_pandas": False,
        "uses_numpy": False,
        "uses_torch": False,
        "uses_sklearn": False,
        "uses_fastapi": False,
        "uses_flask": False,
        "uses_pydantic": False,
        "calls_backward": False,
        "calls_optimizer_step": False,
        "branch_count": 0,
        "container_builds": 0,
    }

    try:
        tree = ast.parse(normalized_code or "\n")
    except SyntaxError as exc:
        result["syntax_valid"] = False
        result["syntax_error"] = f"{exc.msg} (line {exc.lineno}, column {exc.offset})"
        result["code_smells"] = ["Code does not parse.", "Fix syntax before deeper review."]
        return result

    visitor = _StructureVisitor()
    visitor.visit(tree)

    imports = sorted(visitor.imports)
    # Fix: require the dotted alias ("pd.") rather than the bare substring
    # "pd", which false-positived on common words such as "update"/"pdf".
    # This matches the existing "np." and "torch." checks below.
    uses_pandas = "pandas" in imports or "pd." in normalized_code
    uses_numpy = "numpy" in imports or "np." in normalized_code
    uses_torch = "torch" in imports or "torch." in normalized_code
    uses_sklearn = "sklearn" in imports
    uses_fastapi = "fastapi" in imports
    uses_flask = "flask" in imports
    uses_pydantic = "pydantic" in imports or "BaseModel" in normalized_code

    code_smells = list(visitor.code_smells)
    if visitor.max_loop_depth >= 2:
        code_smells.append("Nested loops may create avoidable performance pressure.")
    if long_lines:
        code_smells.append("Long lines reduce readability and reviewability.")
    if trailing_whitespace_lines:
        code_smells.append("Trailing whitespace suggests style drift.")
    if visitor.docstring_total and visitor.docstring_with_docs == 0:
        code_smells.append("Public functions are missing docstrings.")
    if not visitor.function_names:
        code_smells.append("Encapsulate behavior in functions for testability.")

    result.update(
        {
            "imports": imports,
            "function_names": visitor.function_names,
            "class_names": visitor.class_names,
            # Ratio of defs carrying a docstring; max(..., 1) avoids a
            # ZeroDivisionError on modules with no functions.
            "docstring_ratio": round(
                visitor.docstring_with_docs / max(visitor.docstring_total, 1),
                4,
            ),
            "uses_recursion": bool(visitor.recursive_functions),
            "max_loop_depth": visitor.max_loop_depth,
            "max_nesting_depth": visitor.max_nesting_depth,
            "route_decorators": sorted(visitor.route_decorators),
            "code_smells": code_smells,
            "uses_pandas": uses_pandas,
            "uses_numpy": uses_numpy,
            "uses_torch": uses_torch,
            "uses_sklearn": uses_sklearn,
            "uses_fastapi": uses_fastapi,
            "uses_flask": uses_flask,
            "uses_pydantic": uses_pydantic,
            "calls_backward": visitor.backward_calls > 0,
            "calls_optimizer_step": visitor.optimizer_step_calls > 0,
            "branch_count": visitor.branch_count,
            "container_builds": visitor.container_builds,
        }
    )
    return result
|
utils/complexity.py
CHANGED
|
@@ -1,37 +1,70 @@
|
|
| 1 |
-
"""Complexity heuristics for
|
| 2 |
-
|
| 3 |
-
from __future__ import annotations
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
return
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Complexity heuristics for Python code review."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import ast
|
| 6 |
+
from typing import Any
|
| 7 |
+
|
| 8 |
+
|
| 9 |
+
def _clamp_unit(value: float) -> float:
|
| 10 |
+
return max(0.0, min(1.0, float(value)))
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
def _estimate_time_complexity(loop_depth: int, uses_recursion: bool) -> str:
|
| 14 |
+
if uses_recursion and loop_depth >= 1:
|
| 15 |
+
return "O(n^2)"
|
| 16 |
+
if loop_depth >= 3:
|
| 17 |
+
return "O(n^3)"
|
| 18 |
+
if loop_depth == 2:
|
| 19 |
+
return "O(n^2)"
|
| 20 |
+
if loop_depth == 1:
|
| 21 |
+
return "O(n)"
|
| 22 |
+
if uses_recursion:
|
| 23 |
+
return "O(n)"
|
| 24 |
+
return "O(1)"
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
def _estimate_space_complexity(code: str, uses_recursion: bool) -> str:
|
| 28 |
+
if uses_recursion:
|
| 29 |
+
return "O(n)"
|
| 30 |
+
if any(token in code for token in ("[]", "{}", "set(", "dict(", "list(", "Counter(")):
|
| 31 |
+
return "O(n)"
|
| 32 |
+
return "O(1)"
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
def _cyclomatic_complexity(code: str) -> int:
|
| 36 |
+
try:
|
| 37 |
+
tree = ast.parse(code or "\n")
|
| 38 |
+
except SyntaxError:
|
| 39 |
+
return 1
|
| 40 |
+
decision_points = sum(
|
| 41 |
+
isinstance(node, (ast.If, ast.For, ast.AsyncFor, ast.While, ast.Try, ast.ExceptHandler, ast.Match, ast.BoolOp))
|
| 42 |
+
for node in ast.walk(tree)
|
| 43 |
+
)
|
| 44 |
+
return max(1, 1 + decision_points)
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
def estimate_complexity(parsed: dict[str, Any], code: str) -> dict[str, Any]:
    """Estimate Python complexity signals from parsed structure plus source text.

    Args:
        parsed: Output of the structure parser; missing keys default to 0/False.
        code: The raw source, used for cyclomatic and space estimates.

    Returns:
        Dict with ``cyclomatic_complexity``, ``time_complexity``,
        ``space_complexity``, and a ``complexity_penalty`` in [0, 1].
    """
    cyclomatic = _cyclomatic_complexity(code)
    loop_depth = int(parsed.get("max_loop_depth", 0) or 0)
    nesting_depth = int(parsed.get("max_nesting_depth", 0) or 0)
    recursive = bool(parsed.get("uses_recursion", False))
    total_lines = int(parsed.get("line_count", 0) or 0)

    # Weighted, capped contributions keep any single signal from
    # dominating the penalty; the result is clamped into [0, 1].
    penalty = _clamp_unit(
        0.08
        + min(cyclomatic, 12) * 0.045
        + min(loop_depth, 4) * 0.11
        + min(nesting_depth, 4) * 0.06
        + (0.06 if recursive else 0.0)
        + min(total_lines, 200) * 0.0009
    )

    return {
        "cyclomatic_complexity": cyclomatic,
        "time_complexity": _estimate_time_complexity(loop_depth, recursive),
        "space_complexity": _estimate_space_complexity(code, recursive),
        "complexity_penalty": round(penalty, 4),
    }
|
uv.lock
CHANGED
|
@@ -1926,7 +1926,6 @@ source = { editable = "." }
|
|
| 1926 |
dependencies = [
|
| 1927 |
{ name = "fastapi" },
|
| 1928 |
{ name = "gradio" },
|
| 1929 |
-
{ name = "hf-xet" },
|
| 1930 |
{ name = "openai" },
|
| 1931 |
{ name = "openenv-core", extra = ["core"] },
|
| 1932 |
{ name = "streamlit" },
|
|
@@ -1945,7 +1944,6 @@ dev = [
|
|
| 1945 |
requires-dist = [
|
| 1946 |
{ name = "fastapi", specifier = ">=0.111.0" },
|
| 1947 |
{ name = "gradio", specifier = ">=5.26.0" },
|
| 1948 |
-
{ name = "hf-xet", specifier = ">=1.4.3" },
|
| 1949 |
{ name = "openai", specifier = ">=1.76.0" },
|
| 1950 |
{ name = "openenv-core", extras = ["core"], specifier = ">=0.2.2" },
|
| 1951 |
{ name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0.0" },
|
|
|
|
| 1926 |
dependencies = [
|
| 1927 |
{ name = "fastapi" },
|
| 1928 |
{ name = "gradio" },
|
|
|
|
| 1929 |
{ name = "openai" },
|
| 1930 |
{ name = "openenv-core", extra = ["core"] },
|
| 1931 |
{ name = "streamlit" },
|
|
|
|
| 1944 |
requires-dist = [
|
| 1945 |
{ name = "fastapi", specifier = ">=0.111.0" },
|
| 1946 |
{ name = "gradio", specifier = ">=5.26.0" },
|
|
|
|
| 1947 |
{ name = "openai", specifier = ">=1.76.0" },
|
| 1948 |
{ name = "openenv-core", extras = ["core"], specifier = ">=0.2.2" },
|
| 1949 |
{ name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0.0" },
|