
Pratap-K committed 15 days ago

Commit

98fc9b6

0 Parent(s):

AutoMathReasoner


Files changed (16)

.gitignore +79 -0
Dockerfile +83 -0
README.md +133 -0
__init__.py +16 -0
client.py +101 -0
config/openenv.yaml +16 -0
openenv.yaml +7 -0
pyproject.toml +45 -0
requirements.txt +368 -0
server/__init__.py +11 -0
server/app.py +80 -0
tests/test_env.py +79 -0
train/colab_train.py +143 -0
train/sft_warm_start.py +57 -0
train/train_grpo.py +188 -0
uv.lock +0 -0

.gitignore ADDED

	@@ -0,0 +1,79 @@

+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+# Virtual environments
+.venv/
+venv/
+env/
+ENV/
+env.bak/
+venv.bak/
+# Environment variables
+.env
+.env.local
+# Build/distribution directories
+build/
+dist/
+*.egg-info/
+.eggs/
+eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+# C extensions
+*.so
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+pytest_out*
+# Machine Learning / Outputs
+outputs/
+colab_outputs/
+wandb/
+checkpoints/
+*.pt
+*.pth
+*.safetensors
+*.ckpt
+# IDEs and Editors
+.idea/
+.vscode/
+*.swp
+*.swo
+*~
+.spyderproject
+.spyproject
+# OS generated files
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+#docs
+docs

Dockerfile ADDED

	@@ -0,0 +1,83 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+# Multi-stage build using openenv-base
+# This Dockerfile is flexible and works for both:
+# - In-repo environments (with local OpenEnv sources)
+# - Standalone environments (with openenv from PyPI/Git)
+# The build script (openenv build) handles context detection and sets appropriate build args.
+ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+FROM ${BASE_IMAGE} AS builder
+WORKDIR /app
+# Ensure git is available (required for installing dependencies from VCS)
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+# Build argument to control whether we're building standalone or in-repo
+ARG BUILD_MODE=in-repo
+ARG ENV_NAME=AutoMathReasoner
+# Copy environment code (always at root of build context)
+COPY . /app/env
+# For in-repo builds, openenv is already vendored in the build context
+# For standalone builds, openenv will be installed via pyproject.toml
+WORKDIR /app/env
+# Ensure uv is available (for local builds where base image lacks it)
+RUN if ! command -v uv >/dev/null 2>&1; then \
+        curl -LsSf https://astral.sh/uv/install.sh | sh && \
+        mv /root/.local/bin/uv /usr/local/bin/uv && \
+        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+    fi
+# Install dependencies using uv sync
+# If uv.lock exists, use it; otherwise resolve on the fly
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-install-project --no-editable; \
+    else \
+        uv sync --no-install-project --no-editable; \
+    fi
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-editable; \
+    else \
+        uv sync --no-editable; \
+    fi
+# Final runtime stage
+FROM ${BASE_IMAGE}
+WORKDIR /app
+# Copy the virtual environment from builder
+COPY --from=builder /app/env/.venv /app/.venv
+# Copy the environment code
+COPY --from=builder /app/env /app/env
+# Set PATH to use the virtual environment
+ENV PATH="/app/.venv/bin:$PATH"
+# Set PYTHONPATH so imports work correctly
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+#Enable Web Interface
+ENV ENABLE_WEB_INTERFACE=true
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:7860/health || exit 1
+# Run the FastAPI server
+# The module path is constructed to work with the /app/env structure
+CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 7860"]

README.md ADDED

	@@ -0,0 +1,133 @@

+---
+title: AutoMathReasoner Environment
+emoji: 🧠
+colorFrom: indigo
+colorTo: purple
+sdk: docker
+app_port: 7860
+pinned: false
+---
+# ♾️ AutoMathReasoner: Self-Improving Mathematics RL Environment
+**AutoMathReasoner** is an OpenEnv-compliant reinforcement learning server for training Large Language Models (LLMs) on mathematical reasoning. Building on ideas from DeepSeekMath and Group Relative Policy Optimization (GRPO), it supports autonomous self-improvement through dense reward shaping, an exploration bonus, and curriculum scaling.
+This repository wraps the environment in a lightweight Docker-backed REST API that can be called from Google Colab, SageMaker, or a distributed training cluster.
+---
+## 🏗️ Architecture Overview
+The system strictly decouples the interactive RL environment from the learning engine. The `FastAPI` instance serves purely as the mathematical world simulation.
+```mermaid
+graph TD
+    subgraph EnvAPI [OpenEnv API Space]
+        GE["Task Generator Engine"] -->|"Yields Math"| Server["FastAPI Server"]
+        Server -->|"Computes"| VR["Verifier System & Reward Logic"]
+        VR --> Server
+    end
+    subgraph ClientNode [Training Node e.g. Colab]
+        MD["Language Model Policy"] -->|"Action: Reason & Answer"| HG["HF GRPOTrainer"]
+        HG -->|"REST HTTP POST"| Server
+        Server -->|"Observation: Rewards"| HG
+        HG -->|"Log diff"| MD
+    end
+    classDef space fill:transparent,stroke:#9370DB,stroke-width:2px,stroke-dasharray: 5 5;
+    classDef client fill:transparent,stroke:#008B8B,stroke-width:2px,stroke-dasharray: 5 5;
+    class EnvAPI space
+    class ClientNode client
+```
+---
+## 🎯 Reward Composite Hierarchy (Graders)
+Instead of a binary reward (0 for incorrect, 1 for correct), AutoMathReasoner uses a dense composite reward that scores the structure of the reasoning as well as the final answer.
+The total reward is the weighted sum:
+$$R = 0.35C + 0.15\tanh(Q) + 0.1P + 0.1R_{\text{ref}} + 0.15D - 0.05E + 0.1X + \mathcal{N}(0, \sigma^2)$$
+### Individual Mathematical Graders
+- **Correctness ($C$):** $C \in \{0.0, 1.0\}$. Determined by an exact-match check, a numeric tolerance check, and Python expression evaluation. For example, `3.1415` is accepted as matching `3.14159`.
+- **Reasoning Squashing ($Q_{\text{smooth}}$):** $Q_{\text{smooth}} = \tanh(Q)$. Squashes the heuristic step-formatting score $Q$ with a hyperbolic tangent so that extreme verbosity cannot dominate correctness.
+- **Process Supervision ($P$):** A step-level structural check that assigns a $-0.5$ penalty for unsupported inferential jumps.
+- **Reflection Parsing ($R_{\text{ref}}$):** Detects self-reflection markers such as "Wait" or "What could be wrong". Gives $+1.0$ for a successful self-correction and $-0.5$ when the reflection ends in a contradiction.
+- **Entropic Exploration ($X$):** Rewards a high ratio of unique tokens in the reasoning trace, discounted by how often the problem has already been seen:
+  $$X = \frac{\log(1 + \text{unique\_ratio})}{\sqrt{1 + \text{times\_seen\_problem}}}$$
+- **Token Efficiency Penalty ($E$):** Penalizes traces whose length deviates from a $50$-token target, using a Gaussian curve shifted into the range $(-1, 0]$:
+  $$E = \exp\left(-\left(\frac{\text{approx\_tokens} - 50}{50}\right)^2\right) - 1.0$$
+- **History Diversity ($D$):** Guards against reward hacking through verbatim repetition of an earlier answer:
+  $$D = \begin{cases} -\exp(1.0) & \text{if answer repeats exactly} \\ 1.0 & \text{otherwise} \end{cases}$$
+---
+## 🔄 Self-Curriculum Training Loop
+The pipeline scales problem difficulty automatically and applies ReST-style trajectory filtering, so that only correct rollouts with a high reasoning score are stored as positive examples.
+```mermaid
+sequenceDiagram
+    participant Model as P-Model
+    participant Buffer as Replay/LADDER Buffer
+    participant Env as AutoMath Env OpenEnv
+    loop Episodic Batch GRPO
+        Env->>Model: Emit Algebra Prompt (Diff=2.0)
+        Model->>Env: Rollout K=4 Completion Traces
+        Note over Env: Execute Process Supervision<br>Determine Majority Sample Output
+        Env-->>Model: Return Normalized Reward Arrays
+        Model->>Model: Compute Relative Log Likelihood
+        Model->>Model: LoRA Gradient Step
+        alt is_correct == 1 AND Q_reasoning > 0.6
+            Model->>Buffer: Store Trajectory (ReST/LADDER)
+        else
+            Model->>Buffer: Store as Hard Negative Mine
+        end
+    end
+    loop Curriculum Scaling Tick
+        Note over Env: If Mean Rolling Accuracy >= 65%
+        Env->>Env: Diff = Diff + 0.5 (Generate advanced word problems)
+    end
+```
+---
+## 💻 Steps to Get the Code Running on Your System
+### 1. Initialize the Environment Server Locally
+Launch the OpenEnv FastAPI server with `uv`, which resolves and installs the dependencies. The server holds all environment state.
+```bash
+# Clone the repository
+git clone https://github.com/yourusername/AutoMathReasoner.git
+cd AutoMathReasoner
+# Install the package in editable mode via uv
+uv pip install -e .
+# Launch the FastAPI Server Engine
+uv run server
+```
+_The server is now live at `http://localhost:7860`. You can visit `http://localhost:7860/docs` to view the raw interactive environment endpoints._
+### 2. Begin Reinforcement Learning (GRPO)
+Once your server is running (either locally or deployed to Hugging Face Spaces), execute the automated GRPO rollout.
+To execute the free-tier Colab notebook simulation pointing back at your running server:
+```bash
+# In an entirely separate terminal
+python train/colab_train.py
+```
+*(Ensure `HF_SPACE_URL` in `train/colab_train.py` points to your `http://localhost:7860` or deployed Space domain!)*
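
The composite reward in the README can be sketched in a few lines of Python. This is a minimal illustration of the formulas as written there, not the contents of `env/rewards.py` (which is not part of this commit view). The function names and the `WEIGHTS` table are assumptions, the noise term $\mathcal{N}(0, \sigma^2)$ is left out so the result is deterministic, and `tests/test_env.py` asserts a diversity value of `-1.0` on a repeat, so the shipped value may differ from the `-exp(1.0)` stated in the README.

```python
import math

# Weights copied from the README formula. The E weight is negative because
# the formula subtracts 0.05 * E. The Gaussian noise term is omitted here.
WEIGHTS = {"C": 0.35, "Q": 0.15, "P": 0.10, "R_ref": 0.10, "D": 0.15, "E": -0.05, "X": 0.10}


def efficiency(approx_tokens: float, target: float = 50.0, width: float = 50.0) -> float:
    """E = exp(-((tokens - 50) / 50)^2) - 1, which lies in (-1, 0]."""
    return math.exp(-(((approx_tokens - target) / width) ** 2)) - 1.0


def exploration(unique_ratio: float, times_seen_problem: int) -> float:
    """X = log(1 + unique_ratio) / sqrt(1 + times_seen_problem)."""
    return math.log1p(unique_ratio) / math.sqrt(1 + times_seen_problem)


def diversity(answer: str, previous_answers: list[str]) -> float:
    """D = -e if the answer repeats exactly, else 1.0 (README definition)."""
    return -math.exp(1.0) if answer in previous_answers else 1.0


def total_reward(c: float, q: float, p: float, r_ref: float, d: float, e: float, x: float) -> float:
    """Weighted sum from the README; Q is squashed with tanh before weighting."""
    return (
        WEIGHTS["C"] * c
        + WEIGHTS["Q"] * math.tanh(q)
        + WEIGHTS["P"] * p
        + WEIGHTS["R_ref"] * r_ref
        + WEIGHTS["D"] * d
        + WEIGHTS["E"] * e
        + WEIGHTS["X"] * x
    )
```

Taken literally, $E \le 0$ combined with the $-0.05E$ term yields a small non-negative contribution for off-target lengths, so the sign convention is worth checking against the actual reward code.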

__init__.py ADDED

	@@ -0,0 +1,16 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""Automathreasoner Environment."""
+from .client import AutomathreasonerEnv
+from .env.models import AutomathreasonerAction, AutomathreasonerObservation
+__all__ = [
+    "AutomathreasonerAction",
+    "AutomathreasonerObservation",
+    "AutomathreasonerEnv",
+]

client.py ADDED

	@@ -0,0 +1,101 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""Automathreasoner Environment Client."""
+from typing import Dict
+from openenv.core import EnvClient
+from openenv.core.client_types import StepResult
+from openenv.core.env_server.types import State
+from .env.models import AutomathreasonerAction, AutomathreasonerObservation
+class AutomathreasonerEnv(
+    EnvClient[AutomathreasonerAction, AutomathreasonerObservation, State]
+):
+    """
+    Client for the Automathreasoner Environment.
+    This client maintains a persistent WebSocket connection to the environment server,
+    enabling efficient multi-step interactions with lower latency.
+    Each client instance has its own dedicated environment session on the server.
+    Example:
+        >>> # Connect to a running server
+        >>> with AutomathreasonerEnv(base_url="http://localhost:7860") as client:
+        ...     result = client.reset()
+        ...     print(result.observation.problem_text)
+        ...
+        ...     result = client.step(AutomathreasonerAction(reasoning="2 + 2 = 4", final_answer="4"))
+        ...     print(result.reward)
+    Example with Docker:
+        >>> # Automatically start container and connect
+        >>> client = AutomathreasonerEnv.from_docker_image("AutoMathReasoner-env:latest")
+        >>> try:
+        ...     result = client.reset()
+        ...     result = client.step(AutomathreasonerAction(reasoning="Test", final_answer="0"))
+        ... finally:
+        ...     client.close()
+    """
+    def _step_payload(self, action: AutomathreasonerAction) -> Dict:
+        """
+        Convert AutomathreasonerAction to JSON payload for step message.
+        Args:
+            action: AutomathreasonerAction instance
+        Returns:
+            Dictionary representation suitable for JSON encoding
+        """
+        return {
+            "reasoning": action.reasoning,
+            "final_answer": action.final_answer,
+        }
+    def _parse_result(self, payload: Dict) -> StepResult[AutomathreasonerObservation]:
+        """
+        Parse server response into StepResult[AutomathreasonerObservation].
+        Args:
+            payload: JSON response data from server
+        Returns:
+            StepResult with AutomathreasonerObservation
+        """
+        obs_data = payload.get("observation", {})
+        observation = AutomathreasonerObservation(
+            problem_text=obs_data.get("problem_text", ""),
+            difficulty_level=obs_data.get("difficulty_level", 1.0),
+            history=obs_data.get("history", []),
+            done=payload.get("done", False),
+            reward=payload.get("reward", 0.0),
+            metadata=obs_data.get("metadata", {}),
+        )
+        return StepResult(
+            observation=observation,
+            reward=payload.get("reward"),
+            done=payload.get("done", False),
+        )
+    def _parse_state(self, payload: Dict) -> State:
+        """
+        Parse server response into State object.
+        Args:
+            payload: JSON response from state request
+        Returns:
+            State object with episode_id and step_count
+        """
+        return State(
+            episode_id=payload.get("episode_id"),
+            step_count=payload.get("step_count", 0),
+        )

config/openenv.yaml ADDED

	@@ -0,0 +1,16 @@

+env:
+  name: "AutoMathReasoner"
+  author: "Meta Hackathon User"
+  description: "A self-improving math reasoning environment that dynamically generates tasks, tracking accuracy to provide curriculum learning for RL agents."
+  version: "1.0.0"
+server:
+  host: "0.0.0.0"
+  port: 7860
+  workers: 4
+  module: "server.app:app"
+features:
+  multi_reward: true
+  prevent_hacking: true
+  curriculum_scheduler: true
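
The `curriculum_scheduler` flag corresponds to the rule in the README's sequence diagram: when mean rolling accuracy reaches 65%, difficulty increases by 0.5. A minimal sketch of such a scheduler follows. The class name, the window size, and the choice to clear the window after a bump are assumptions made for illustration; the scheduler in this repository is not shown in the commit view.

```python
from collections import deque


class CurriculumScheduler:
    """Hypothetical scheduler: raise difficulty by `step` once rolling accuracy >= `threshold`."""

    def __init__(self, start: float = 1.0, step: float = 0.5,
                 threshold: float = 0.65, window: int = 20) -> None:
        self.difficulty = start
        self.step = step
        self.threshold = threshold
        # Fixed-size window of recent outcomes (1.0 = correct, 0.0 = incorrect)
        self.results: deque = deque(maxlen=window)

    def record(self, correct: bool) -> float:
        """Record one episode outcome and return the (possibly updated) difficulty."""
        self.results.append(1.0 if correct else 0.0)
        window_full = len(self.results) == self.results.maxlen
        if window_full and sum(self.results) / len(self.results) >= self.threshold:
            self.difficulty += self.step
            # Start measuring accuracy afresh at the new difficulty level
            self.results.clear()
        return self.difficulty
```

Waiting for a full window before bumping avoids raising difficulty on the strength of one or two lucky answers.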

openenv.yaml ADDED

	@@ -0,0 +1,7 @@

+spec_version: 1
+name: AutoMathReasoner
+type: space
+runtime: fastapi
+app: server.app:app
+port: 7860

pyproject.toml ADDED

	@@ -0,0 +1,45 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+[build-system]
+requires = ["setuptools>=45", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "openenv-AutoMathReasoner"
+version = "0.1.0"
+description = "Automathreasoner environment for OpenEnv"
+requires-python = ">=3.10"
+dependencies = [
+    # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+    # install from github
+    # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+    "openenv-core[core]>=0.2.2",
+    # Environment-specific dependencies
+    # Add all dependencies needed for your environment here
+    # Examples:
+    # "numpy>=1.19.0",
+    # "torch>=2.0.0",
+    # "gymnasium>=0.29.0",
+    # "openspiel>=1.0.0",
+    # "smolagents>=1.22.0,<2",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-cov>=4.0.0",
+]
+[project.scripts]
+# Server entry point - enables running via: uv run --project . server
+# or: python -m AutoMathReasoner.server.app
+server = "AutoMathReasoner.server.app:main"
+[tool.setuptools]
+include-package-data = true
+packages = ["AutoMathReasoner", "AutoMathReasoner.server", "AutoMathReasoner.env"]
+package-dir = { "AutoMathReasoner" = ".", "AutoMathReasoner.server" = "server", "AutoMathReasoner.env" = "env" }

requirements.txt ADDED

	@@ -0,0 +1,368 @@

+# This file was autogenerated by uv via the following command:
+#    uv export --no-hashes -o requirements.txt
+-e .
+aiofile==3.9.0
+    # via py-key-value-aio
+annotated-doc==0.0.4
+    # via
+    #   fastapi
+    #   typer
+annotated-types==0.7.0
+    # via pydantic
+anyio==4.13.0
+    # via
+    #   gradio
+    #   httpx
+    #   mcp
+    #   openai
+    #   py-key-value-aio
+    #   sse-starlette
+    #   starlette
+    #   watchfiles
+attrs==26.1.0
+    # via
+    #   cyclopts
+    #   jsonschema
+    #   referencing
+audioop-lts==0.2.2 ; python_full_version >= '3.13'
+    # via gradio
+authlib==1.7.0
+    # via fastmcp
+backports-tarfile==1.2.0 ; python_full_version < '3.12'
+    # via jaraco-context
+beartype==0.22.9
+    # via py-key-value-aio
+brotli==1.2.0
+    # via gradio
+cachetools==7.0.6
+    # via py-key-value-aio
+caio==0.9.25
+    # via aiofile
+certifi==2026.4.22
+    # via
+    #   httpcore
+    #   httpx
+    #   requests
+cffi==2.0.0 ; platform_python_implementation != 'PyPy'
+    # via cryptography
+charset-normalizer==3.4.7
+    # via requests
+click==8.3.3
+    # via
+    #   typer
+    #   uvicorn
+colorama==0.4.6 ; sys_platform == 'win32'
+    # via
+    #   click
+    #   tqdm
+cryptography==46.0.7
+    # via
+    #   authlib
+    #   joserfc
+    #   pyjwt
+    #   secretstorage
+cyclopts==4.11.0
+    # via fastmcp
+distro==1.9.0
+    # via openai
+dnspython==2.8.0
+    # via email-validator
+docstring-parser==0.18.0
+    # via cyclopts
+docutils==0.22.4
+    # via rich-rst
+email-validator==2.3.0
+    # via pydantic
+exceptiongroup==1.3.1
+    # via
+    #   anyio
+    #   fastmcp
+fastapi==0.136.0
+    # via
+    #   gradio
+    #   openenv-core
+fastmcp==3.2.4
+    # via openenv-core
+filelock==3.29.0
+    # via huggingface-hub
+fsspec==2026.3.0
+    # via
+    #   gradio-client
+    #   huggingface-hub
+gradio==6.13.0
+    # via openenv-core
+gradio-client==2.5.0
+    # via
+    #   gradio
+    #   hf-gradio
+griffelib==2.0.2
+    # via fastmcp
+groovy==0.1.2
+    # via gradio
+h11==0.16.0
+    # via
+    #   httpcore
+    #   uvicorn
+hf-gradio==0.4.1
+    # via gradio
+hf-xet==1.4.3 ; platform_machine == 'AMD64' or platform_machine == 'aarch64' or platform_machine == 'amd64' or platform_machine == 'arm64' or platform_machine == 'x86_64'
+    # via huggingface-hub
+httpcore==1.0.9
+    # via httpx
+httpx==0.28.1
+    # via
+    #   fastmcp
+    #   gradio
+    #   gradio-client
+    #   huggingface-hub
+    #   mcp
+    #   openai
+    #   openenv-core
+    #   safehttpx
+httpx-sse==0.4.3
+    # via mcp
+huggingface-hub==1.11.0
+    # via
+    #   gradio
+    #   gradio-client
+    #   openenv-core
+idna==3.13
+    # via
+    #   anyio
+    #   email-validator
+    #   httpx
+    #   requests
+importlib-metadata==8.7.1
+    # via
+    #   keyring
+    #   opentelemetry-api
+jaraco-classes==3.4.0
+    # via keyring
+jaraco-context==6.1.2
+    # via keyring
+jaraco-functools==4.4.0
+    # via keyring
+jeepney==0.9.0 ; sys_platform == 'linux'
+    # via
+    #   keyring
+    #   secretstorage
+jinja2==3.1.6
+    # via gradio
+jiter==0.14.0
+    # via openai
+joserfc==1.6.4
+    # via authlib
+jsonref==1.1.0
+    # via fastmcp
+jsonschema==4.26.0
+    # via mcp
+jsonschema-path==0.4.5
+    # via fastmcp
+jsonschema-specifications==2025.9.1
+    # via jsonschema
+keyring==25.7.0
+    # via py-key-value-aio
+markdown-it-py==4.0.0
+    # via rich
+markupsafe==3.0.3
+    # via
+    #   gradio
+    #   jinja2
+mcp==1.27.0
+    # via fastmcp
+mdurl==0.1.2
+    # via markdown-it-py
+more-itertools==11.0.2
+    # via
+    #   jaraco-classes
+    #   jaraco-functools
+numpy==2.2.6 ; python_full_version < '3.11'
+    # via
+    #   gradio
+    #   pandas
+numpy==2.4.4 ; python_full_version >= '3.11'
+    # via
+    #   gradio
+    #   pandas
+openai==2.32.0
+    # via openenv-core
+openapi-pydantic==0.5.1
+    # via fastmcp
+openenv-core==0.2.3
+    # via openenv-automathreasoner
+opentelemetry-api==1.41.0
+    # via fastmcp
+orjson==3.11.8
+    # via gradio
+packaging==26.1
+    # via
+    #   fastmcp
+    #   gradio
+    #   gradio-client
+    #   huggingface-hub
+pandas==2.3.3 ; python_full_version < '3.11'
+    # via gradio
+pandas==3.0.2 ; python_full_version >= '3.11'
+    # via gradio
+pathable==0.5.0
+    # via jsonschema-path
+pillow==12.2.0
+    # via gradio
+platformdirs==4.9.6
+    # via fastmcp
+py-key-value-aio==0.4.4
+    # via fastmcp
+pycparser==3.0 ; implementation_name != 'PyPy' and platform_python_implementation != 'PyPy'
+    # via cffi
+pydantic==2.13.3
+    # via
+    #   fastapi
+    #   fastmcp
+    #   gradio
+    #   mcp
+    #   openai
+    #   openapi-pydantic
+    #   openenv-core
+    #   pydantic-settings
+pydantic-core==2.46.3
+    # via pydantic
+pydantic-settings==2.14.0
+    # via mcp
+pydub==0.25.1
+    # via gradio
+pygments==2.20.0
+    # via rich
+pyjwt==2.12.1
+    # via mcp
+pyperclip==1.11.0
+    # via fastmcp
+python-dateutil==2.9.0.post0
+    # via pandas
+python-dotenv==1.2.2
+    # via
+    #   fastmcp
+    #   pydantic-settings
+python-multipart==0.0.26
+    # via
+    #   gradio
+    #   mcp
+pytz==2026.1.post1
+    # via
+    #   gradio
+    #   pandas
+pywin32==311 ; sys_platform == 'win32'
+    # via mcp
+pywin32-ctypes==0.2.3 ; sys_platform == 'win32'
+    # via keyring
+pyyaml==6.0.3
+    # via
+    #   fastmcp
+    #   gradio
+    #   huggingface-hub
+    #   jsonschema-path
+    #   openenv-core
+referencing==0.37.0
+    # via
+    #   jsonschema
+    #   jsonschema-path
+    #   jsonschema-specifications
+requests==2.33.1
+    # via openenv-core
+rich==15.0.0
+    # via
+    #   cyclopts
+    #   fastmcp
+    #   openenv-core
+    #   rich-rst
+    #   typer
+rich-rst==1.3.2
+    # via cyclopts
+rpds-py==0.30.0
+    # via
+    #   jsonschema
+    #   referencing
+safehttpx==0.1.7
+    # via gradio
+secretstorage==3.5.0 ; sys_platform == 'linux'
+    # via keyring
+semantic-version==2.10.0
+    # via gradio
+shellingham==1.5.4
+    # via typer
+six==1.17.0
+    # via python-dateutil
+sniffio==1.3.1
+    # via openai
+sse-starlette==3.3.4
+    # via mcp
+starlette==1.0.0
+    # via
+    #   fastapi
+    #   gradio
+    #   mcp
+    #   sse-starlette
+tomli==2.4.1
+    # via
+    #   cyclopts
+    #   openenv-core
+tomli-w==1.2.0
+    # via openenv-core
+tomlkit==0.14.0
+    # via gradio
+tqdm==4.67.3
+    # via
+    #   huggingface-hub
+    #   openai
+typer==0.24.2
+    # via
+    #   gradio
+    #   hf-gradio
+    #   huggingface-hub
+    #   openenv-core
+typing-extensions==4.15.0
+    # via
+    #   anyio
+    #   cryptography
+    #   cyclopts
+    #   exceptiongroup
+    #   fastapi
+    #   gradio
+    #   gradio-client
+    #   huggingface-hub
+    #   mcp
+    #   openai
+    #   opentelemetry-api
+    #   py-key-value-aio
+    #   pydantic
+    #   pydantic-core
+    #   pyjwt
+    #   referencing
+    #   starlette
+    #   typing-inspection
+    #   uvicorn
+typing-inspection==0.4.2
+    # via
+    #   fastapi
+    #   mcp
+    #   pydantic
+    #   pydantic-settings
+tzdata==2026.1 ; python_full_version < '3.11' or sys_platform == 'emscripten' or sys_platform == 'win32'
+    # via pandas
+uncalled-for==0.3.1
+    # via fastmcp
+urllib3==2.6.3
+    # via requests
+uvicorn==0.46.0
+    # via
+    #   fastmcp
+    #   gradio
+    #   mcp
+    #   openenv-core
+watchfiles==1.1.1
+    # via fastmcp
+websockets==16.0
+    # via
+    #   fastmcp
+    #   openenv-core
+zipp==3.23.1
+    # via importlib-metadata

server/__init__.py ADDED

	@@ -0,0 +1,11 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""Automathreasoner environment server components."""
+from AutoMathReasoner.env.environment import AutomathreasonerEnvironment
+__all__ = ["AutomathreasonerEnvironment"]

server/app.py ADDED

	@@ -0,0 +1,80 @@

+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+"""
+FastAPI application for the Automathreasoner Environment.
+This module creates an HTTP server that exposes the AutomathreasonerEnvironment
+over HTTP and WebSocket endpoints, compatible with EnvClient.
+Endpoints:
+    - POST /reset: Reset the environment
+    - POST /step: Execute an action
+    - GET /state: Get current environment state
+    - GET /schema: Get action/observation schemas
+    - WS /ws: WebSocket endpoint for persistent sessions
+Usage:
+    # Development (with auto-reload):
+    uvicorn server.app:app --reload --host 0.0.0.0 --port 7860
+    # Production:
+    uvicorn server.app:app --host 0.0.0.0 --port 7860 --workers 4
+    # Or run directly:
+    python -m server.app
+"""
+try:
+    from openenv.core.env_server.http_server import create_app
+except Exception as e:  # pragma: no cover
+    raise ImportError(
+        "openenv is required for the web interface. Install dependencies with '\n    uv sync\n'"
+    ) from e
+from AutoMathReasoner.env.models import AutomathreasonerAction, AutomathreasonerObservation
+from AutoMathReasoner.env.environment import AutomathreasonerEnvironment
+# Create the app with web interface and README integration
+app = create_app(
+    AutomathreasonerEnvironment,
+    AutomathreasonerAction,
+    AutomathreasonerObservation,
+    env_name="AutoMathReasoner",
+    max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
+)
+def main(host: str = "0.0.0.0", port: int = 7860):
+    """
+    Entry point for direct execution via uv run or python -m.
+    This function enables running the server without Docker:
+        uv run --project . server
+        uv run --project . server --port 8001
+        python -m AutoMathReasoner.server.app
+    Args:
+        host: Host address to bind to (default: "0.0.0.0")
+        port: Port number to listen on (default: 7860)
+    For production deployments, consider using uvicorn directly with
+    multiple workers:
+        uvicorn AutoMathReasoner.server.app:app --workers 4
+    """
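+    import argparse
+    import uvicorn
+    # Parse CLI flags inside main() so the `server` console script
+    # (which calls main() with no arguments) also honors --host/--port.
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--host", default=host)
+    parser.add_argument("--port", type=int, default=port)
+    args, _ = parser.parse_known_args()
+    uvicorn.run(app, host=args.host, port=args.port)
+if __name__ == "__main__":
+    main()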

tests/test_env.py ADDED

	@@ -0,0 +1,79 @@

+import sys
+import os
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from env.generator import TaskGenerationEngine
+from env.verifier import VerifierSystem
+from env.rewards import RewardSystem
+from env.environment import AutomathreasonerEnvironment
+from env.models import AutomathreasonerAction
+def test_generator():
+    engine = TaskGenerationEngine()
+    # Test arithmetic
+    prob, diff, ans = engine.generate_arithmetic(complexity=1)
+    assert prob and ans
+    # Test overall generate task
+    task = engine.generate_task(target_difficulty_band=2.0)
+    assert "problem" in task
+    assert "solution" in task
+    assert "difficulty" in task
+def test_verifier():
+    verifier = VerifierSystem()
+    # Exact match
+    assert verifier.check_exact_match("42", "42")
+    assert verifier.check_exact_match(" 42 ", "42")
+    # Numeric tolerance
+    assert verifier.check_numeric_tolerance("3.14159", "3.1415")
+    assert not verifier.check_numeric_tolerance("4.1415", "3.1415")
+    # Python execution
+    assert verifier.check_python_execution("2 + 2", "4")
+    # Full verification
+    c, q = verifier.verify("Because 2 + 2 is 4", "4", "4")
+    assert c == 1.0
+    assert q > 0.0  # Should have some mock reasoning score
+def test_rewards():
+    reward_sys = RewardSystem(max_len=1000)
+    history = [{"final_answer": "42"}]
+    # Test diversity drop on repeat
+    d = reward_sys.compute_diversity("42", history)
+    assert d == -1.0
+    # Normal compute
+    r, comps = reward_sys.compute_reward(
+        correctness=1.0,
+        reasoning_quality=1.0,
+        action_str="step 1: do math. = 42",
+        final_answer="42",
+        history=[],
+        times_seen_problem=0
+    )
+    assert r > 0.0
+def test_environment_step():
+    env = AutomathreasonerEnvironment()
+    obs = env.reset()
+    assert obs.problem_text != ""
+    assert obs.difficulty_level > 0
+    assert len(obs.history) == 0
+    # Create action where they just pass dummy stuff
+    action = AutomathreasonerAction(
+        reasoning="I am guessing the answer.",
+        final_answer="0"
+    )
+    obs_after = env.step(action)
+    assert obs_after.reward is not None
+    assert len(obs_after.history) == 1
+    assert "reward_components" in obs_after.metadata
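
The verifier assertions in `test_verifier` pin down the expected behavior of the exact-match and numeric-tolerance checks. Below is one hypothetical implementation that satisfies those assertions. `env/verifier.py` is not shown in this commit view, so the standalone functions and the `rel_tol` default are assumptions chosen only to match the test cases.

```python
import math


def check_exact_match(predicted: str, expected: str) -> bool:
    """Compare answers as strings after trimming surrounding whitespace."""
    return predicted.strip() == expected.strip()


def check_numeric_tolerance(predicted: str, expected: str, rel_tol: float = 1e-3) -> bool:
    """Accept answers that parse as floats and agree within a relative tolerance."""
    try:
        p, e = float(predicted), float(expected)
    except ValueError:
        # Non-numeric answers cannot pass the numeric check
        return False
    return math.isclose(p, e, rel_tol=rel_tol, abs_tol=1e-9)
```

With `rel_tol=1e-3`, `3.14159` and `3.1415` differ by about `9e-5`, well inside the allowed `3e-3`, while `4.1415` and `3.1415` differ by `1.0` and are rejected.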

train/colab_train.py ADDED

	@@ -0,0 +1,143 @@

+"""
+Colab Training Script for AutoMathReasoner (Hugging Face Space + Free T4 GPU)
+Instructions for Colab:
+1. Create a new Google Colab notebook (Free Tier: T4 GPU is supported by Unsloth)
+2. Run the following installation commands in your first cell:
+!pip install unsloth "trl>=0.14.0"
+!pip install openenv-core pydantic httpx
+!git clone <YOUR-GITHUB-REPO-URL>
+!cd AutoMathReasoner && pip install -e .
+3. Run the following Python script in the next cell.
+"""
+import collections
+import random
+from datasets import Dataset
+import torch
+# Unsloth & TRL
+from unsloth import FastLanguageModel
+from trl import GRPOConfig, GRPOTrainer
+# AutoMathReasoner OpenEnv Client
+import sys
+sys.path.append("./AutoMathReasoner")
+from AutoMathReasoner.client import AutomathreasonerEnv
+from AutoMathReasoner.env.models import AutomathreasonerAction
+# 1. Configuration
+# Replace with your actual Hugging Face Space URL!
+HF_SPACE_URL = "https://your-username-automathreasoner.hf.space"
+env = AutomathreasonerEnv(base_url=HF_SPACE_URL)
+max_seq_length = 1024 # Fits well within Colab T4 16GB VRAM limit
+lora_rank = 16
+# 2. Load Model via Unsloth (optimized for Free Colab VRAM)
+print("Loading model via Unsloth...")
+model, tokenizer = FastLanguageModel.from_pretrained(
+    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit", # Pre-quantized 4bit for fast download
+    max_seq_length = max_seq_length,
+    dtype = None,
+    load_in_4bit = True,
+)
+# Enable LoRA fine-tuning
+model = FastLanguageModel.get_peft_model(
+    model,
+    r = lora_rank,
+    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
+                      "gate_proj", "up_proj", "down_proj"],
+    lora_alpha = lora_rank,
+    use_gradient_checkpointing = "unsloth", # Crucial for fitting into T4
+)
+# 3. Prepare Dummy Prompts from the Remote Environment
+print("Gathering initial prompts from HF Space environment...")
+initial_prompts = []
+for _ in range(30):
+    # This fires an HTTP request to your Hugging Face Space
+    # reset() returns a StepResult; the problem text lives on its observation
+    result = env.reset()
+    initial_prompts.append({"prompt": result.observation.problem_text})
+dataset = Dataset.from_list(initial_prompts)
+# 4. Define Reward Function for TRL
+def compute_rewards(prompts, completions, **kwargs):
+    """
+    Interfaces with the OpenEnv running on Hugging Face Spaces.
+    Extracts the generation, passes it via HTTP to the env, and yields the dense reward.
+    """
+    rewards = []
+    parsed_actions = []
+    prompt_answers = collections.defaultdict(list)
+    # Track completion variants
+    for prompt, completion in zip(prompts, completions):
+        try:
+            parts = completion.split("Answer:")
+            reasoning = parts[0].strip()
+            answer = parts[1].strip() if len(parts) > 1 else ""
+        except Exception:
+            reasoning = completion
+            answer = ""
+        parsed_actions.append((prompt, completion, reasoning, answer))
+        prompt_answers[prompt].append(answer)
+    majority_answers = {}
+    for p, ans_list in prompt_answers.items():
+        if ans_list:
+            majority_answers[p] = collections.Counter(ans_list).most_common(1)[0][0]
+    for p, c, r, a in parsed_actions:
+        action = AutomathreasonerAction(reasoning=r, final_answer=a)
+        # Ideally the episode would be initialized with the specific prompt `p`.
+        # This script does not do that: it simply resets and steps, so if reset()
+        # draws a new problem, the reward is scored against that problem rather than `p`.
+        obs = env.reset()
+        # Step through HTTP API
+        step_obs = env.step(action)
+        r_total = step_obs.reward
+        # Self-consistency matching bonus
+        majority = majority_answers.get(p, "")
+        if (a == majority) and len(a) > 0:
+            r_total += 0.2
+        rewards.append(r_total)
+    return rewards
+# 5. Execute Training
+training_args = GRPOConfig(
+    output_dir="colab_outputs",
+    learning_rate=2e-5,
+    per_device_train_batch_size=1, # 1 for Colab GPUs to prevent OOM
+    gradient_accumulation_steps=4,
+    max_prompt_length=128,
+    max_completion_length=256,
+    num_generations=4, # K=4 (Reduced from 8 for Colab T4 Memory limitations)
+    max_steps=150,
+    logging_steps=10,
+    optim="adamw_8bit", # 8-bit optimizer saves VRAM
+)
+trainer = GRPOTrainer(
+    model=model,
+    reward_funcs=[compute_rewards],
+    args=training_args,
+    train_dataset=dataset,
+)
+print("Starting GRPO Training in Colab using Remote HF Environment...")
+# Metrics are logged every `logging_steps` steps (to wandb/tensorboard if configured),
+# so the reward trend can be inspected to check whether the policy is improving
+trainer.train()
+# 6. Push to Hugging Face
+# Optional: save locally or push to Hub after it learns
+# model.push_to_hub("your-name/AutoMathReasoner-Trained")
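
The self-consistency bonus used in `compute_rewards` can be looked at in isolation. The following is a minimal sketch, not the repository's code: the helper names `split_answer` and `majority_bonus` are hypothetical, and it uses `str.partition` instead of `split`, so everything after the first `Answer:` marker is treated as the answer.

```python
import collections


def split_answer(completion: str) -> tuple[str, str]:
    """Split a completion into (reasoning, answer) at the first 'Answer:' marker."""
    reasoning, marker, answer = completion.partition("Answer:")
    # Without a marker there is no answer; the whole text counts as reasoning.
    return reasoning.strip(), (answer.strip() if marker else "")


def majority_bonus(prompts, completions, bonus=0.2):
    """Return `bonus` for each completion whose non-empty answer equals the
    most common answer among completions of the same prompt, else 0.0."""
    answers = [split_answer(c)[1] for c in completions]

    # Group the extracted answers by prompt.
    by_prompt = collections.defaultdict(list)
    for prompt, answer in zip(prompts, answers):
        by_prompt[prompt].append(answer)

    # Most common answer per prompt (ties resolve to the first one seen).
    majority = {
        prompt: collections.Counter(group).most_common(1)[0][0]
        for prompt, group in by_prompt.items()
    }
    return [
        bonus if answer and answer == majority[prompt] else 0.0
        for prompt, answer in zip(prompts, answers)
    ]
```

Note that an empty answer never earns the bonus, even when most completions for a prompt fail to produce an `Answer:` marker, which matches the `len(a) > 0` check in the script.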

train/sft_warm_start.py ADDED Viewed

	@@ -0,0 +1,57 @@

+# Import unsloth before trl so that its patches are applied
+from unsloth import FastLanguageModel
+from datasets import load_dataset
+from trl import SFTTrainer, SFTConfig
+def main():
+    max_seq_length = 1024
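+    # Load model and tokenizer (same pre-quantized 4-bit checkpoint as colab_train.py)
+    model, tokenizer = FastLanguageModel.from_pretrained(
+        model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
+        max_seq_length = max_seq_length,
+        dtype = None,
+        load_in_4bit = True,
+    )
+    # A 4-bit quantized base cannot be fine-tuned directly; attach LoRA adapters first
+    # (same LoRA settings as colab_train.py)
+    model = FastLanguageModel.get_peft_model(
+        model,
+        r = 16,
+        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
+                          "gate_proj", "up_proj", "down_proj"],
+        lora_alpha = 16,
+        use_gradient_checkpointing = "unsloth",
+    )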
+    # We use a 5% subset of GSM8K to warm-start the reasoning format
+    # In practice, this would load a custom generated dataset locally
+    try:
+        dataset = load_dataset("openai/gsm8k", "main", split="train[:5%]")
+    except Exception:
+        # Fallback: dummy.json must exist locally and provide 'question' and 'answer' fields
+        dataset = load_dataset("json", data_files={"train": ["dummy.json"]}, split="train")
+    def formatting_prompts_func(examples):
+        texts = []
+        for q, a in zip(examples['question'], examples['answer']):
+            # GSM8K answers contain the reasoning followed by '#### <final answer>'
+            parts = a.split("####")
+            reasoning = parts[0].strip()
+            final_answer = parts[1].strip() if len(parts) > 1 else ""
+            text = f"Problem: {q}\nReasoning: {reasoning}\nAnswer: {final_answer}"
+            texts.append(text)
+        return { "text" : texts }
+    dataset = dataset.map(formatting_prompts_func, batched = True)
+    training_args = SFTConfig(
+        output_dir="sft_outputs",
+        dataset_text_field="text",
+        max_seq_length=max_seq_length,
+        per_device_train_batch_size=2,
+        max_steps=100,
+        learning_rate=2e-5,
+    )
+    trainer = SFTTrainer(
+        model=model,
+        train_dataset=dataset,
+        args=training_args,
+    )
+    print("Starting SFT Warm-Start...")
+    trainer.train()
+if __name__ == "__main__":
+    main()
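
The formatting step in `formatting_prompts_func` can be checked on a single record without loading a model or dataset. This is a small sketch under stated assumptions: `format_gsm8k_example` is a hypothetical helper name, the sample question and answer are made up for illustration, and `str.partition` is used so that only the first `####` separator is considered.

```python
def format_gsm8k_example(question: str, answer: str) -> str:
    """Turn one GSM8K-style record into the 'Problem / Reasoning / Answer' text
    used for the SFT warm start."""
    # GSM8K answers hold the reasoning, then '####', then the final answer.
    reasoning, _, final_answer = answer.partition("####")
    return (
        f"Problem: {question}\n"
        f"Reasoning: {reasoning.strip()}\n"
        f"Answer: {final_answer.strip()}"
    )


# Made-up record in the GSM8K answer layout (not taken from the dataset).
example_text = format_gsm8k_example("What is 2 + 3?", "2 + 3 = 5\n#### 5")
print(example_text)
```

The trailing `Answer:` line is what the GRPO reward function later splits on, so the warm-start text and the reward parser need to agree on that marker.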

train/train_grpo.py ADDED Viewed

	@@ -0,0 +1,188 @@

+import random
+import collections
+import sys
+import os
+# Import unsloth before trl so that its patches are applied
+from unsloth import FastLanguageModel
+from datasets import Dataset
+from trl import GRPOTrainer, GRPOConfig
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from env.environment import AutomathreasonerEnvironment
+from env.models import AutomathreasonerAction
+class ReplayBuffer:
+    def __init__(self):
+        self.ladder_buffer = []  # A. LADDER-STYLE self-bootstrapping buffer
+        self.failed = []         # F. HARD NEGATIVE MINING buffer
+        self.all_history = []
+    def add_ladder(self, item):
+        """
+        [PAPER TRACEABILITY: LADDER-Style Self-Bootstrapping]
+        Stores only high-quality trajectories.
+        """
+        self.ladder_buffer.append(item)
+        # Simplistic cap: once the buffer exceeds 200 items,
+        # keep only the 100 highest-reward trajectories
+        if len(self.ladder_buffer) > 200:
+            self.ladder_buffer.sort(key=lambda x: x['reward'], reverse=True)
+            self.ladder_buffer = self.ladder_buffer[:100]
+    def add(self, problem, best_solution, failed_attempts, reward=0.0):
+        item = {
+            "prompt": problem,
+            "best_solution": best_solution,
+            "failed_attempts": failed_attempts,
+            "reward": reward
+        }
+        self.all_history.append(item)
+        # F. HARD NEGATIVE MINING
+        # Prioritize tracking failed problems
+        if failed_attempts:
+            # We explicitly track failures to reintroduce them
+            self.failed.append(item)
+            if len(self.failed) > 200:
+                self.failed.pop(0)
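+    def sample(self, batch_size) -> list:
+        """
+        [PAPER TRACEABILITY: Hard Negative Mining]
+        Samples with replacement: 50% from the ladder (high-quality) buffer,
+        30% from failed problems, and the remainder from the full history.
+        """
+        if len(self.all_history) < batch_size:
+            # Return a copy so callers cannot mutate the internal history
+            return list(self.all_history)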
+        n_ladder = int(batch_size * 0.5)
+        n_failed = int(batch_size * 0.3)
+        n_random = batch_size - n_ladder - n_failed
+        batch = []
+        batch.extend(random.choices(self.ladder_buffer if self.ladder_buffer else self.all_history, k=n_ladder))
+        batch.extend(random.choices(self.failed if self.failed else self.all_history, k=n_failed))
+        batch.extend(random.choices(self.all_history, k=n_random))
+        return batch
+def main():
+    max_seq_length = 1024
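+    # Load model via Unsloth (same pre-quantized 4-bit checkpoint as colab_train.py)
+    model, tokenizer = FastLanguageModel.from_pretrained(
+        model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
+        max_seq_length = max_seq_length,
+        dtype = None,
+        load_in_4bit = True,
+    )
+    # A 4-bit quantized base cannot be fine-tuned directly; attach LoRA adapters first
+    # (same LoRA settings as colab_train.py)
+    model = FastLanguageModel.get_peft_model(
+        model,
+        r = 16,
+        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
+                          "gate_proj", "up_proj", "down_proj"],
+        lora_alpha = 16,
+        use_gradient_checkpointing = "unsloth",
+    )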
+    env = AutomathreasonerEnvironment()
+    replay_buffer = ReplayBuffer()
+    # Generate some initial experiences
+    initial_prompts = []
+    for _ in range(50):
+        obs = env.reset()
+        initial_prompts.append({"prompt": obs.problem_text})
+    dataset = Dataset.from_list(initial_prompts)
+    def compute_rewards(prompts, completions, **kwargs):
+        """
+        [PAPER TRACEABILITY: GRPO (Group Relative Policy Optimization)]
+        D. GROUP-RELATIVE TRAINING
+        This function returns raw per-completion rewards. TRL's GRPOTrainer handles
+        the group-relative part: it normalizes the rewards within each group of K
+        completions for the same prompt to obtain the advantages.
+        """
+        rewards = []
+        # C. SELF-CONSISTENCY SAMPLING
+        # We group generated outputs by prompt to find the majority answer
+        # TRL provides completions aligned with prompts. Usually completions are batched by K per prompt.
+        prompt_answers = collections.defaultdict(list)
+        parsed_actions = []
+        for prompt, completion in zip(prompts, completions):
+            try:
+                parts = completion.split("Answer:")
+                reasoning = parts[0].strip()
+                answer = parts[1].strip() if len(parts) > 1 else ""
+            except Exception:
+                reasoning = completion
+                answer = ""
+            parsed_actions.append((prompt, completion, reasoning, answer))
+            prompt_answers[prompt].append(answer)
+        majority_answers = {}
+        for p, ans_list in prompt_answers.items():
+            if ans_list:
+                majority_answers[p] = collections.Counter(ans_list).most_common(1)[0][0]
+        for p, c, r, a in parsed_actions:
+            action = AutomathreasonerAction(reasoning=r, final_answer=a)
+            # Simulate a one-step episode: reset, override the current problem with this prompt, then step
+            env.reset()
+            env.current_problem = p
+            step_obs = env.step(action)
+            r_total = step_obs.reward
+            # [PAPER TRACEABILITY: Self-Consistency Sampling]
+            # Verify majority match
+            majority = majority_answers.get(p, "")
+            is_majority = (a == majority) and len(a) > 0
+            if is_majority:
+                r_total += 0.2  # Bonus for agreeing with the majority answer
+            rewards.append(r_total)
+            is_correct = step_obs.metadata.get('is_correct', False)
+            q_score = step_obs.metadata.get('reward_components', {}).get('Q_reasoning', 0.0)
+            # B. ReST-STYLE FILTERING (SELF-TRAINING)
+            # Keep only samples that are correct AND have reasoning quality > 0.6
+            # [PAPER TRACEABILITY: ReST (Reinforced Self-Training)]
+            if is_correct and q_score > 0.6:
+                # Store as High Quality trajectory in Ladder buffer
+                ladder_item = {
+                    "prompt": p,
+                    "best_solution": c,
+                    "failed_attempts": [],
+                    "reward": r_total
+                }
+                replay_buffer.add_ladder(ladder_item)
+            # Record every attempt in the general buffer (failures feed hard negative mining)
+            if is_correct:
+                replay_buffer.add(p, c, [], reward=r_total)
+            else:
+                replay_buffer.add(p, "", [c], reward=r_total)
+        return rewards
+    training_args = GRPOConfig(
+        output_dir="outputs",
+        learning_rate=1e-5,
+        per_device_train_batch_size=1,
+        gradient_accumulation_steps=4,
+        max_prompt_length=128,
+        max_completion_length=256,
+        num_generations=8, # K=8 outputs per problem (Allows Self-consistency majority to work)
+        max_steps=100,
+        logging_steps=10,
+    )
+    trainer = GRPOTrainer(
+        model=model,
+        reward_funcs=[compute_rewards],
+        args=training_args,
+        train_dataset=dataset,
+    )
+    print("Starting GRPO Training with Research-Aligned Modules...")
+    trainer.train()
+if __name__ == "__main__":
+    main()
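
The group-relative step mentioned in the `compute_rewards` docstring can be illustrated without TRL. The sketch below shows the general idea only and is not TRL's implementation: `group_relative_advantages` is a hypothetical name, the population standard deviation and the `eps` value are assumptions, and rewards are assumed to arrive ordered so that each consecutive run of `num_generations` entries belongs to one prompt.

```python
import statistics


def group_relative_advantages(rewards, num_generations, eps=1e-4):
    """Normalize rewards within each group of `num_generations` completions.

    Each completion's advantage is its reward minus the group mean, divided by
    the group standard deviation (plus `eps` to avoid division by zero).
    """
    if len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations")

    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mean = statistics.fmean(group)
        std = statistics.pstdev(group)  # assumption: population std
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages


# Two prompts with K=2 completions each (made-up rewards).
example_advantages = group_relative_advantages([1.2, 0.0, 0.5, 0.5], num_generations=2)
```

When every completion in a group receives the same reward, as in the second group above, all advantages in that group are zero and the group contributes no learning signal. This is one reason the dense reward and the majority bonus matter: they spread rewards apart within a group.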
