| # OpenEnv: Production RL Made Simple | |
| <div align="center"> | |
| <img src="https://upload.wikimedia.org/wikipedia/commons/1/10/PyTorch_logo_icon.svg" width="200" alt="PyTorch"> | |
| ### *From "Hello World" to RL Training in 5 Minutes* ✨ | |
| --- | |
| **What if RL environments were as easy to use as REST APIs?** | |
| That's OpenEnv. Type-safe. Isolated. Production-ready. 🎯 | |
| [Open in Colab](https://colab.research.google.com/github/meta-pytorch/OpenEnv/blob/main/examples/OpenEnv_Tutorial.ipynb) | |
| [GitHub repository](https://github.com/meta-pytorch/OpenEnv) | |
| [License: BSD-3-Clause](https://opensource.org/licenses/BSD-3-Clause) | |
| [PyTorch](https://pytorch.org/) | |
| Author: [Sanyam Bhutani](http://twitter.com/bhutanisanyam1/) | |
| </div> | |
| --- | |
| ## Why OpenEnv? | |
| Let's take a trip down memory lane. | |
| It's 2016 and RL is popular. You read some papers, and it looks promising. | |
| But in the real world, CartPole is about the best you can run on a gaming GPU. | |
| What do you do beyond CartPole? | |
| Fast-forward to 2025: GRPO is awesome, and this time it's not just theory. It works well in practice and is really here! | |
| The problem remains: how do you take these RL algorithms beyond CartPole? | |
| A huge part of RL is giving your algorithms access to environments they can learn from. | |
| We are excited to introduce an environment spec for adding open environments for RL training. It lets you focus on your experiments while everyone brings their own environments. | |
| Focus on experiments, use OpenEnv environments, and build agents that go beyond CartPole on a single spec. | |
| --- | |
| ## What You'll Learn | |
| <table> | |
| <tr> | |
| <td width="50%"> | |
| **🎯 Part 1-2: The Fundamentals** | |
| - ⚡ RL in 60 seconds | |
| - 🤔 Why existing solutions fall short | |
| - 💡 The OpenEnv solution | |
| </td> | |
| <td width="50%"> | |
| **🏗️ Part 3-5: The Architecture** | |
| - 🔧 How OpenEnv works | |
| - Exploring real code | |
| - 🎮 OpenSpiel integration example | |
| </td> | |
| </tr> | |
| <tr> | |
| <td width="50%"> | |
| **🎮 Part 6-8: Hands-On Demo** | |
| - Use existing OpenSpiel environment | |
| - 🤖 Test 4 different policies | |
| - Watch learning happen live | |
| </td> | |
| <td width="50%"> | |
| **🔧 Part 9-10: Going Further** | |
| - 🎮 Switch to other OpenSpiel games | |
| - ✨ Build your own integration | |
| - Deploy to production | |
| </td> | |
| </tr> | |
| </table> | |
| !!! tip "Pro Tip" | |
|     This notebook is designed to run top-to-bottom in Google Colab with zero setup! | |
| ⏱️ **Time**: ~5 minutes | **Difficulty**: Beginner-friendly | 🎯 **Outcome**: Production-ready RL knowledge | |
| --- | |
| ## Table of Contents | |
| ### Foundation | |
| - [Part 1: RL in 60 Seconds ⏱️](#part-1-rl-in-60-seconds) | |
| - [Part 2: The Problem with Traditional RL 🤔](#part-2-the-problem-with-traditional-rl) | |
| - [Part 3: Setup 🛠️](#part-3-setup) | |
| ### Architecture | |
| - [Part 4: The OpenEnv Pattern 🏗️](#part-4-the-openenv-pattern) | |
| - [Part 5: Example Integration - OpenSpiel 🎮](#part-5-example-integration---openspiel) | |
| ### Hands-On Demo | |
| - [Part 6: Using Real OpenSpiel 🎮](#part-6-using-real-openspiel) | |
| - [Part 7: Four Policies 🤖](#part-7-four-policies) | |
| - [Part 8: Policy Competition!](#part-8-policy-competition) | |
| ### Advanced | |
| - [Part 9: Switching to Other Games 🎮](#part-9-switching-to-other-games) | |
| - [Part 10: Create Your Own Integration 🛠️](#part-10-create-your-own-integration) | |
| ### Wrap Up | |
| - [Summary: Your Journey](#summary-your-journey) | |
| - [Resources](#resources) | |
| --- | |
| ## Part 1: RL in 60 Seconds ⏱️ | |
| **Reinforcement Learning is simpler than you think.** | |
| It's just a loop: | |
| ```python | |
| # Pseudocode: `environment` and `policy` stand for any RL environment and agent | |
| done = False | |
| while not done: | |
|     observation = environment.observe() | |
|     action = policy.choose(observation) | |
|     reward, done = environment.step(action) | |
|     policy.learn(reward) | |
| ``` | |
| That's it. That's RL. | |
| Let's see it in action: | |
| ```python | |
| import random | |
| print("🎲 " + "="*58 + " 🎲") | |
| print(" Number Guessing Game - The Simplest RL Example") | |
| print("🎲 " + "="*58 + " 🎲") | |
| # Environment setup | |
| target = random.randint(1, 10) | |
| guesses_left = 3 | |
| print(f"\n🎯 I'm thinking of a number between 1 and 10...") | |
| print(f"💭 You have {guesses_left} guesses. Let's see how random guessing works!\n") | |
| # The RL Loop - Pure random policy (no learning!) | |
| while guesses_left > 0: | |
|     # Policy: Random guessing (no learning yet!) | |
|     guess = random.randint(1, 10) | |
|     guesses_left -= 1 | |
|     print(f"💭 Guess #{3-guesses_left}: {guess}", end=" → ") | |
|     # Reward signal (but we're not using it!) | |
|     if guess == target: | |
|         print("Correct! +10 points") | |
|         break | |
|     elif abs(guess - target) <= 2: | |
|         print("🔥 Warm! (close)") | |
|     else: | |
|         print("❄️ Cold! (far)") | |
| else: | |
|     # while-else: runs only when the loop ends without a break | |
|     print(f"\nOut of guesses. The number was {target}.") | |
| print("\n" + "="*62) | |
| print("💡 This is RL: Observe → Act → Reward → Repeat") | |
| print(" But this policy is terrible! It doesn't learn from rewards.") | |
| print("="*62 + "\n") | |
| ``` | |
| **Output:** | |
| ``` | |
| 🎲 ========================================================== 🎲 | |
|  Number Guessing Game - The Simplest RL Example | |
| 🎲 ========================================================== 🎲 | |
| 🎯 I'm thinking of a number between 1 and 10... | |
| 💭 You have 3 guesses. Let's see how random guessing works! | |
| 💭 Guess #1: 2 → ❄️ Cold! (far) | |
| 💭 Guess #2: 10 → Correct! +10 points | |
| ============================================================== | |
| 💡 This is RL: Observe → Act → Reward → Repeat | |
|  But this policy is terrible! It doesn't learn from rewards. | |
| ============================================================== | |
| ``` | |
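The random guesser ignores the warm/cold signal entirely. As a minimal sketch of what using that signal could look like, the block below keeps only the numbers that are still consistent with the feedback received so far. This is plain candidate elimination rather than a learned policy, and `feedback` and `play_with_feedback` are illustrative names that are not part of OpenEnv.

```python
import random


def feedback(guess: int, target: int) -> str:
    """Same signal as the game above: correct, warm (within 2), or cold."""
    if guess == target:
        return "correct"
    return "warm" if abs(guess - target) <= 2 else "cold"


def play_with_feedback(target: int, low: int = 1, high: int = 10) -> int:
    """Guess until correct, discarding numbers the feedback rules out.

    Returns the number of guesses used.
    """
    candidates = list(range(low, high + 1))
    guesses = 0
    while candidates:
        guess = random.choice(candidates)
        guesses += 1
        signal = feedback(guess, target)
        if signal == "correct":
            return guesses
        # Keep only numbers that would have produced the same signal.
        candidates = [
            c for c in candidates
            if c != guess and feedback(guess, c) == signal
        ]
    raise RuntimeError("feedback was inconsistent with every candidate")


random.seed(0)
counts = [play_with_feedback(t) for t in range(1, 11)]
print(f"guesses needed per target: {counts}")
```

Because the true target always matches its own feedback, it is never discarded, so the loop ends with a correct guess.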
| --- | |
| ## Part 2: The Problem with Traditional RL 🤔 | |
| ### 🤔 Why Can't We Just Use OpenAI Gym? | |
| Good question! Gym is great for research, but production needs more... | |
| | Challenge | Traditional Approach | OpenEnv Solution | | |
| |-----------|---------------------|------------------| | |
| | **Type Safety** | ❌ `obs[0][3]` - what is this? | ✅ `obs.info_state` - IDE knows! | | |
| | **Isolation** | ❌ Same process (can crash your training) | ✅ Docker containers (fully isolated) | | |
| | **Deployment** | ❌ "Works on my machine" 🤷 | ✅ Same container everywhere 🐳 | | |
| | **Scaling** | ❌ Hard to distribute | ✅ Deploy to Kubernetes ☸️ | | |
| | **Language** | ❌ Python only | ✅ Any language (HTTP API) | | |
| | **Debugging** | ❌ Cryptic numpy errors | ✅ Clear type errors | | |
| ### 💡 The OpenEnv Philosophy | |
| **"RL environments should be like microservices"** | |
| Think of it like this: You don't run your database in the same process as your web server, right? Same principle! | |
| - **Isolated**: Run in containers (security + stability) | |
| - **Standard**: HTTP API, works everywhere | |
| - 📦 **Versioned**: Docker images (reproducibility!) | |
| - **Scalable**: Deploy to cloud with one command | |
| - 🛡️ **Type-safe**: Catch bugs before they happen | |
| - **Portable**: Works on Mac, Linux, Windows, Cloud | |
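To make the Type Safety row concrete, here is a small sketch that contrasts positional access with named fields. `TypedObservation` is a hypothetical stand-in written for this example, not the OpenEnv class, and its fields are only a subset of what a real observation carries.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TypedObservation:
    """Hypothetical stand-in for a typed observation (not the OpenEnv class)."""
    info_state: List[float] = field(default_factory=list)
    legal_actions: List[int] = field(default_factory=list)
    done: bool = False


# Untyped style: positions carry the meaning, so a wrong index fails silently.
raw_obs = [[0.0, 0.0, 1.0, 0.0], [0, 1, 2], False]
legal_actions_untyped = raw_obs[1]

# Typed style: field names carry the meaning, so a typo fails loudly.
obs = TypedObservation(info_state=[0.0, 0.0, 1.0, 0.0], legal_actions=[0, 1, 2])
legal_actions_typed = obs.legal_actions

try:
    obs.legal_action  # typo: raises AttributeError, and a type checker flags it earlier
except AttributeError as exc:
    print(f"caught: {exc}")
```

With the list version, `raw_obs[2]` would quietly hand back the `done` flag instead of the legal actions, which is the kind of mistake the table refers to.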
| ### The Architecture | |
| ``` | |
| ┌────────────────────────────────────────────────────────────┐ | |
| │                    YOUR TRAINING CODE                      │ | |
| │                                                            │ | |
| │  env = OpenSpielEnv(...)         ← Import the client       │ | |
| │  result = env.reset()            ← Type-safe!              │ | |
| │  result = env.step(action)       ← Type-safe!              │ | |
| │                                                            │ | |
| └─────────────────┬──────────────────────────────────────────┘ | |
|                   │ | |
|                   │  HTTP/JSON (Language-Agnostic) | |
|                   │  POST /reset, POST /step, GET /state | |
|                   │ | |
| ┌─────────────────▼──────────────────────────────────────────┐ | |
| │                    DOCKER CONTAINER                        │ | |
| │                                                            │ | |
| │    ┌──────────────────────────────────────────────┐        │ | |
| │    │  FastAPI Server                              │        │ | |
| │    │   └─ Environment (reset, step, state)        │        │ | |
| │    │       └─ Your Game/Simulation Logic          │        │ | |
| │    └──────────────────────────────────────────────┘        │ | |
| │                                                            │ | |
| │         Isolated • Reproducible • Secure                   │ | |
| └────────────────────────────────────────────────────────────┘ | |
| ``` | |
| !!! info "Key Insight" | |
| You never see HTTP details - just clean Python methods! | |
| ```python | |
| env.reset() # Under the hood: HTTP POST to /reset | |
| env.step(...) # Under the hood: HTTP POST to /step | |
| env.state() # Under the hood: HTTP GET to /state | |
| ``` | |
| The magic? OpenEnv handles all the plumbing. You focus on RL! ✨ | |
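The mapping from Python methods to HTTP endpoints can be sketched without a network. In the block below, `SketchEnvClient`, the `Transport` callable, and the `{"action": ...}` request body are assumptions made for illustration. They are not OpenEnv's actual client or wire format; only the three endpoint paths come from the text above.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# A transport takes (method, path, json_body) and returns the decoded JSON reply.
Transport = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]


class SketchEnvClient:
    """Hypothetical client showing the method-to-endpoint mapping."""

    def __init__(self, transport: Transport) -> None:
        self._send = transport

    def reset(self) -> Dict[str, Any]:
        return self._send("POST", "/reset", {})

    def step(self, action: Dict[str, Any]) -> Dict[str, Any]:
        # The body layout is an assumption, not OpenEnv's wire format.
        return self._send("POST", "/step", {"action": action})

    def state(self) -> Dict[str, Any]:
        return self._send("GET", "/state", None)


calls: List[Tuple[str, str]] = []


def fake_transport(method: str, path: str,
                   body: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Stands in for a real HTTP call so the sketch runs offline."""
    calls.append((method, path))
    return {"observation": {}, "reward": 0.0, "done": False}


sketch_client = SketchEnvClient(fake_transport)
sketch_client.reset()
sketch_client.step({"action_id": 1})
sketch_client.state()
print(calls)  # [('POST', '/reset'), ('POST', '/step'), ('GET', '/state')]
```

Swapping `fake_transport` for a function that performs real HTTP requests would leave the calling code unchanged, which is the point of hiding the transport behind the client.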
| --- | |
| ## Part 3: Setup 🛠️ | |
| **Running in Colab?** This cell will clone OpenEnv and install dependencies automatically. | |
| **Running locally?** Run the notebook from the `examples/` directory of your OpenEnv checkout; the cell adds `../src` to `sys.path`. | |
| ```python | |
| import sys | |
| from pathlib import Path | |
| # Detect environment | |
| try: | |
|     import google.colab | |
|     IN_COLAB = True | |
|     print("Running in Google Colab - Perfect!") | |
| except ImportError: | |
|     IN_COLAB = False | |
|     print("💻 Running locally - Nice!") | |
| if IN_COLAB: | |
|     print("\n📦 Cloning OpenEnv repository...") | |
|     !git clone https://github.com/meta-pytorch/OpenEnv.git > /dev/null 2>&1 | |
|     %cd OpenEnv | |
|     print("Installing dependencies (this takes ~10 seconds)...") | |
|     !pip install -q fastapi uvicorn requests | |
|     sys.path.insert(0, './src') | |
|     print("\n✅ Setup complete! Everything is ready to go!") | |
| else: | |
|     # Locally the notebook lives in examples/, so src/ is one level up | |
|     sys.path.insert(0, str(Path.cwd().parent / 'src')) | |
|     print("✅ Using local OpenEnv installation") | |
| print("\nReady to explore OpenEnv and build amazing things!") | |
| print("💡 Tip: Run cells top-to-bottom for the best experience.\n") | |
| ``` | |
| **Output:** | |
| ``` | |
| 💻 Running locally - Nice! | |
| ✅ Using local OpenEnv installation | |
| Ready to explore OpenEnv and build amazing things! | |
| 💡 Tip: Run cells top-to-bottom for the best experience. | |
| ``` | |
| --- | |
| ## Part 4: The OpenEnv Pattern 🏗️ | |
| ### Every OpenEnv Environment Has 3 Components: | |
| ``` | |
| src/envs/your_env/ | |
| ├── models.py            ← Type-safe contracts | |
| │                          (Action, Observation, State) | |
| │ | |
| ├── 📱 client.py         ← What YOU import | |
| │                          (HTTPEnvClient implementation) | |
| │ | |
| └── 🖥️ server/ | |
|     ├── environment.py   ← Game/simulation logic | |
|     ├── app.py           ← FastAPI server | |
|     └── Dockerfile       ← Container definition | |
| ``` | |
| Let's explore the actual OpenEnv code to see how this works: | |
| ```python | |
| # Import OpenEnv's core abstractions | |
| from core.env_server import Environment, Action, Observation, State | |
| from core.http_env_client import HTTPEnvClient | |
| print("="*70) | |
| print(" 🧩 OPENENV CORE ABSTRACTIONS") | |
| print("="*70) | |
| print(""" | |
| 🖥️ SERVER SIDE (runs in Docker): | |
|     class Environment(ABC): | |
|         '''Base class for all environment implementations''' | |
|         @abstractmethod | |
|         def reset(self) -> Observation: | |
|             '''Start new episode''' | |
|         @abstractmethod | |
|         def step(self, action: Action) -> Observation: | |
|             '''Execute action, return observation''' | |
|         @property | |
|         def state(self) -> State: | |
|             '''Get episode metadata''' | |
| 📱 CLIENT SIDE (your training code): | |
|     class HTTPEnvClient(ABC): | |
|         '''Base class for HTTP clients''' | |
|         def reset(self) -> StepResult: | |
|             # HTTP POST /reset | |
|         def step(self, action) -> StepResult: | |
|             # HTTP POST /step | |
|         def state(self) -> State: | |
|             # HTTP GET /state | |
| """) | |
| print("="*70) | |
| print("\n✨ Same interface on both sides - communication via HTTP!") | |
| print("🎯 You focus on RL, OpenEnv handles the infrastructure.\n") | |
| ``` | |
| **Output:** | |
| ``` | |
| ====================================================================== | |
|  🧩 OPENENV CORE ABSTRACTIONS | |
| ====================================================================== | |
| 🖥️ SERVER SIDE (runs in Docker): | |
|     class Environment(ABC): | |
|         '''Base class for all environment implementations''' | |
|         @abstractmethod | |
|         def reset(self) -> Observation: | |
|             '''Start new episode''' | |
|         @abstractmethod | |
|         def step(self, action: Action) -> Observation: | |
|             '''Execute action, return observation''' | |
|         @property | |
|         def state(self) -> State: | |
|             '''Get episode metadata''' | |
| 📱 CLIENT SIDE (your training code): | |
|     class HTTPEnvClient(ABC): | |
|         '''Base class for HTTP clients''' | |
|         def reset(self) -> StepResult: | |
|             # HTTP POST /reset | |
|         def step(self, action) -> StepResult: | |
|             # HTTP POST /step | |
|         def state(self) -> State: | |
|             # HTTP GET /state | |
| ====================================================================== | |
| ✨ Same interface on both sides - communication via HTTP! | |
| 🎯 You focus on RL, OpenEnv handles the infrastructure. | |
| ``` | |
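The server-side contract printed above (`reset`, `step`, and a `state` property) can be tried in-process with a toy environment. Everything in this sketch is a stand-in defined locally: `SketchEnvironment`, `SketchAction`, `SketchObservation`, `SketchState`, and `CounterEnvironment` are hypothetical names that mirror the shape of the interface and do not import or reproduce OpenEnv's `core` classes.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SketchAction:
    delta: int = 0


@dataclass
class SketchObservation:
    value: int = 0
    reward: float = 0.0
    done: bool = False


@dataclass
class SketchState:
    step_count: int = 0


class SketchEnvironment(ABC):
    """Local stand-in with the same three-member shape as the printed base class."""

    @abstractmethod
    def reset(self) -> SketchObservation: ...

    @abstractmethod
    def step(self, action: SketchAction) -> SketchObservation: ...

    @property
    @abstractmethod
    def state(self) -> SketchState: ...


class CounterEnvironment(SketchEnvironment):
    """Toy task: reach `goal` by adding deltas to a counter."""

    def __init__(self, goal: int = 3) -> None:
        self.goal = goal
        self._value = 0
        self._state = SketchState()

    def reset(self) -> SketchObservation:
        self._value = 0
        self._state = SketchState()
        return SketchObservation(value=0)

    def step(self, action: SketchAction) -> SketchObservation:
        self._value += action.delta
        self._state.step_count += 1
        done = self._value >= self.goal
        return SketchObservation(self._value, 1.0 if done else 0.0, done)

    @property
    def state(self) -> SketchState:
        return self._state


env = CounterEnvironment(goal=2)
obs = env.reset()
while not obs.done:
    obs = env.step(SketchAction(delta=1))
print(obs, env.state)
```

Note that the observation carries `reward` and `done`, so `step` returns a single object, as in the printed signatures.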
| --- | |
| ## Part 5: Example Integration - OpenSpiel 🎮 | |
| ### What is OpenSpiel? | |
| **OpenSpiel** is a library from DeepMind with **70+ game environments** for RL research. | |
| ### OpenEnv's Integration | |
| We've wrapped **6 OpenSpiel games** following the OpenEnv pattern: | |
| | **🎯 Single-Player** | **👥 Multi-Player** | | |
| |---------------------|---------------------| | |
| | 1. **Catch** - Catch falling ball | 5. **Tic-Tac-Toe** - Classic 3×3 | | |
| | 2. **Cliff Walking** - Navigate grid | 6. **Kuhn Poker** - Imperfect info poker | | |
| | 3. **2048** - Tile puzzle | | | |
| | 4. **Blackjack** - Card game | | | |
| This shows how OpenEnv can wrap an **existing** RL library behind the same client/server pattern! | |
| ```python | |
| from envs.openspiel_env.client import OpenSpielEnv | |
| print("="*70) | |
| print(" HOW OPENENV WRAPS OPENSPIEL") | |
| print("="*70) | |
| print(""" | |
| class OpenSpielEnv(HTTPEnvClient[OpenSpielAction, OpenSpielObservation]): | |
|     def _step_payload(self, action: OpenSpielAction) -> dict: | |
|         '''Convert typed action to JSON for HTTP''' | |
|         return { | |
|             "action_id": action.action_id, | |
|             "game_name": action.game_name, | |
|         } | |
|     def _parse_result(self, payload: dict) -> StepResult: | |
|         '''Parse HTTP JSON response into typed observation''' | |
|         return StepResult( | |
|             observation=OpenSpielObservation(...), | |
|             reward=payload['reward'], | |
|             done=payload['done'] | |
|         ) | |
| """) | |
| print("─" * 70) | |
| print("\n✨ Usage (works for ALL OpenEnv environments):") | |
| print(""" | |
|   env = OpenSpielEnv(base_url="http://localhost:8000") | |
|   result = env.reset() | |
|   # Returns StepResult[OpenSpielObservation] - Type safe! | |
|   result = env.step(OpenSpielAction(action_id=2, game_name="catch")) | |
|   # Type checker knows this is valid! | |
|   state = env.state() | |
|   # Returns OpenSpielState | |
| """) | |
| print("─" * 70) | |
| print("\n🎯 This pattern works for ANY environment you want to wrap!\n") | |
| ``` | |
| **Output:** | |
| ``` | |
| ====================================================================== | |
|  HOW OPENENV WRAPS OPENSPIEL | |
| ====================================================================== | |
| class OpenSpielEnv(HTTPEnvClient[OpenSpielAction, OpenSpielObservation]): | |
|     def _step_payload(self, action: OpenSpielAction) -> dict: | |
|         '''Convert typed action to JSON for HTTP''' | |
|         return { | |
|             "action_id": action.action_id, | |
|             "game_name": action.game_name, | |
|         } | |
|     def _parse_result(self, payload: dict) -> StepResult: | |
|         '''Parse HTTP JSON response into typed observation''' | |
|         return StepResult( | |
|             observation=OpenSpielObservation(...), | |
|             reward=payload['reward'], | |
|             done=payload['done'] | |
|         ) | |
| ────────────────────────────────────────────────────────────────────── | |
| ✨ Usage (works for ALL OpenEnv environments): | |
|   env = OpenSpielEnv(base_url="http://localhost:8000") | |
|   result = env.reset() | |
|   # Returns StepResult[OpenSpielObservation] - Type safe! | |
|   result = env.step(OpenSpielAction(action_id=2, game_name="catch")) | |
|   # Type checker knows this is valid! | |
|   state = env.state() | |
|   # Returns OpenSpielState | |
| ────────────────────────────────────────────────────────────────────── | |
| 🎯 This pattern works for ANY environment you want to wrap! | |
| ``` | |
| ### Type-Safe Models | |
| ```python | |
| # Import OpenSpiel integration models | |
| from envs.openspiel_env.models import ( | |
|     OpenSpielAction, | |
|     OpenSpielObservation, | |
|     OpenSpielState | |
| ) | |
| from dataclasses import fields | |
| print("="*70) | |
| print(" 🎮 OPENSPIEL INTEGRATION - TYPE-SAFE MODELS") | |
| print("="*70) | |
| print("\n📤 OpenSpielAction (what you send):") | |
| print(" " + "─" * 64) | |
| for field in fields(OpenSpielAction): | |
|     print(f" • {field.name:20s} : {field.type}") | |
| print("\n📥 OpenSpielObservation (what you receive):") | |
| print(" " + "─" * 64) | |
| for field in fields(OpenSpielObservation): | |
|     print(f" • {field.name:20s} : {field.type}") | |
| print("\nOpenSpielState (episode metadata):") | |
| print(" " + "─" * 64) | |
| for field in fields(OpenSpielState): | |
|     print(f" • {field.name:20s} : {field.type}") | |
| print("\n" + "="*70) | |
| print("\n💡 Type safety means:") | |
| print(" ✅ Your IDE autocompletes these fields") | |
| print(" ✅ Typos are caught before running") | |
| print(" ✅ Refactoring is safe") | |
| print(" ✅ Self-documenting code\n") | |
| ``` | |
| **Output:** | |
| ``` | |
| ====================================================================== | |
|  🎮 OPENSPIEL INTEGRATION - TYPE-SAFE MODELS | |
| ====================================================================== | |
| 📤 OpenSpielAction (what you send): | |
|  ──────────────────────────────────────────────────────────────── | |
|  • metadata             : typing.Dict[str, typing.Any] | |
|  • action_id            : int | |
|  • game_name            : str | |
|  • game_params          : Dict[str, Any] | |
| 📥 OpenSpielObservation (what you receive): | |
|  ──────────────────────────────────────────────────────────────── | |
|  • done                 : <class 'bool'> | |
|  • reward               : typing.Union[bool, int, float, NoneType] | |
|  • metadata             : typing.Dict[str, typing.Any] | |
|  • info_state           : List[float] | |
|  • legal_actions        : List[int] | |
|  • game_phase           : str | |
|  • current_player_id    : int | |
|  • opponent_last_action : Optional[int] | |
| OpenSpielState (episode metadata): | |
|  ──────────────────────────────────────────────────────────────── | |
|  • episode_id           : typing.Optional[str] | |
|  • step_count           : <class 'int'> | |
|  • game_name            : str | |
|  • agent_player         : int | |
|  • opponent_policy      : str | |
|  • game_params          : Dict[str, Any] | |
|  • num_players          : int | |
| ====================================================================== | |
| 💡 Type safety means: | |
|  ✅ Your IDE autocompletes these fields | |
|  ✅ Typos are caught before running | |
|  ✅ Refactoring is safe | |
|  ✅ Self-documenting code | |
| ``` | |
| ### How the Client Works | |
| The client **inherits from HTTPEnvClient** and implements 3 methods: | |
| 1. `_step_payload()` - Convert action โ JSON | |
| 2. `_parse_result()` - Parse JSON โ typed observation | |
| 3. `_parse_state()` - Parse JSON โ state | |
| That's it! The base class handles all HTTP communication. | |
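The three hooks listed above are pure conversions between typed objects and JSON-style dictionaries, so they can be exercised without any server. The sketch below uses locally defined stand-ins (`SketchAction`, `SketchObservation`, `SketchState`, `SketchStepResult`, `SketchCodec`), and the reply layout with a nested `"observation"` object is an assumption for illustration rather than OpenEnv's documented wire format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchAction:
    action_id: int
    game_name: str = "catch"


@dataclass
class SketchObservation:
    info_state: List[float] = field(default_factory=list)
    legal_actions: List[int] = field(default_factory=list)


@dataclass
class SketchState:
    episode_id: Optional[str] = None
    step_count: int = 0


@dataclass
class SketchStepResult:
    observation: SketchObservation
    reward: Optional[float] = None
    done: bool = False


class SketchCodec:
    """Only the three conversion hooks; no HTTP is performed here."""

    def _step_payload(self, action: SketchAction) -> Dict[str, Any]:
        return {"action_id": action.action_id, "game_name": action.game_name}

    def _parse_result(self, payload: Dict[str, Any]) -> SketchStepResult:
        obs = payload.get("observation", {})  # assumed reply layout
        return SketchStepResult(
            observation=SketchObservation(
                info_state=list(obs.get("info_state", [])),
                legal_actions=list(obs.get("legal_actions", [])),
            ),
            reward=payload.get("reward"),
            done=bool(payload.get("done", False)),
        )

    def _parse_state(self, payload: Dict[str, Any]) -> SketchState:
        return SketchState(
            episode_id=payload.get("episode_id"),
            step_count=int(payload.get("step_count", 0)),
        )


codec = SketchCodec()
reply = {
    "observation": {"info_state": [0.0, 1.0], "legal_actions": [0, 1, 2]},
    "reward": 0.0,
    "done": False,
}
result = codec._parse_result(reply)
print(codec._step_payload(SketchAction(action_id=2)), result.observation.legal_actions)
```

Keeping the hooks free of I/O is what lets a base class own the HTTP calls while each environment only describes its own data shapes.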
| --- | |
| ## Part 6: Using Real OpenSpiel 🎮 | |
| <div style="text-align: center; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 30px; border-radius: 15px; margin: 30px 0;"> | |
| ### Now let's USE a production environment! | |
| We'll play **Catch** using OpenEnv's **OpenSpiel integration** 🎯 | |
| This is a REAL OpenSpiel environment served over HTTP, not a mock written for this tutorial! | |
| **Get ready for:** | |
| - Using existing environments (not building) | |
| - 🤖 Testing policies against real games | |
| - Live gameplay visualization | |
| - 🎯 Production-ready patterns | |
| </div> | |
| ### The Game: Catch 🔴🏓 | |
| ``` | |
| ⬜ ⬜ 🔴 ⬜ ⬜ | |
| ⬜ ⬜ ⬜ ⬜ ⬜ | |
| ⬜ ⬜ ⬜ ⬜ ⬜   Ball | |
| ⬜ ⬜ ⬜ ⬜ ⬜ | |
| ⬜ ⬜ ⬜ ⬜ ⬜   falls | |
| ⬜ ⬜ ⬜ ⬜ ⬜ | |
| ⬜ ⬜ ⬜ ⬜ ⬜   down | |
| ⬜ ⬜ ⬜ ⬜ ⬜ | |
| ⬜ ⬜ ⬜ ⬜ ⬜ | |
| ⬜ ⬜ 🏓 ⬜ ⬜ | |
|     Paddle | |
| ``` | |
| **Rules:** | |
| - 10×5 grid | |
| - Ball falls from random column | |
| - Move paddle left/right to catch it | |
| **Actions:** | |
| - `0` = Move LEFT ⬅️ | |
| - `1` = STAY | |
| - `2` = Move RIGHT ➡️ | |
| **Reward:** | |
| - `+1` if caught | |
| - `0` if missed 😢 | |
| !!! note "Why Catch?" | |
|     - Simple rules (easy to understand) | |
|     - Fast episodes (roughly one step per grid row) | |
|     - Clear success/failure | |
|     - Part of OpenSpiel's 70+ games! | |
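The rules above fit in a few lines of plain Python, which makes the observation layout easier to picture. `MiniCatch` below is a toy re-implementation written for this tutorial, not OpenSpiel's code. It assumes the paddle starts in the centre column, the ball drops one row per step, and the grid is flattened row by row with `1.0` marking the ball and the paddle.

```python
import random
from typing import List, Optional, Tuple

ROWS, COLS = 10, 5
LEFT, STAY, RIGHT = 0, 1, 2


class MiniCatch:
    """Toy version of the Catch rules (assumptions noted in the text above)."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        self.reset()

    def reset(self) -> List[float]:
        self.ball_row = 0
        self.ball_col = self._rng.randrange(COLS)
        self.paddle_col = COLS // 2  # assumed starting position
        return self.observation()

    def observation(self) -> List[float]:
        """Flattened ROWS x COLS grid: 1.0 at the ball and at the paddle."""
        grid = [0.0] * (ROWS * COLS)
        grid[self.ball_row * COLS + self.ball_col] = 1.0
        grid[(ROWS - 1) * COLS + self.paddle_col] = 1.0
        return grid

    def step(self, action: int) -> Tuple[List[float], float, bool]:
        # action - 1 maps LEFT/STAY/RIGHT to -1/0/+1; clamp to the grid.
        self.paddle_col = min(COLS - 1, max(0, self.paddle_col + (action - 1)))
        self.ball_row += 1
        done = self.ball_row == ROWS - 1
        caught = done and self.ball_col == self.paddle_col
        return self.observation(), (1.0 if caught else 0.0), done


game = MiniCatch(seed=0)
game.reset()
done = False
while not done:
    _, reward, done = game.step(STAY)
print(f"ball column {game.ball_col}, paddle column {game.paddle_col}, reward {reward}")
```

Under these assumptions the paddle is never more than two columns from the ball, so a policy that always steps toward the ball's column has time to line up before the ball lands.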
| **💡 The Big Idea:** | |
| Instead of building this from scratch, we'll USE OpenEnv's existing OpenSpiel integration. Same interface, but production-ready! | |
| ```python | |
| from envs.openspiel_env import OpenSpielEnv | |
| from envs.openspiel_env.models import ( | |
|     OpenSpielAction, | |
|     OpenSpielObservation, | |
|     OpenSpielState | |
| ) | |
| from dataclasses import fields | |
| print("🎮 " + "="*64 + " 🎮") | |
| print(" ✅ Importing Real OpenSpiel Environment!") | |
| print("🎮 " + "="*64 + " 🎮\n") | |
| print("📦 What we just imported:") | |
| print(" • OpenSpielEnv - HTTP client for OpenSpiel games") | |
| print(" • OpenSpielAction - Type-safe actions") | |
| print(" • OpenSpielObservation - Type-safe observations") | |
| print(" • OpenSpielState - Episode metadata\n") | |
| print("OpenSpielObservation fields:") | |
| print(" " + "─" * 60) | |
| for field in fields(OpenSpielObservation): | |
|     print(f" • {field.name:25s} : {field.type}") | |
| print("\n" + "="*70) | |
| print("\n💡 This is REAL OpenEnv code - used in production!") | |
| print(" • Wraps 6 OpenSpiel games (Catch, Tic-Tac-Toe, Poker, etc.)") | |
| print(" • Type-safe actions and observations") | |
| print(" • Works via HTTP (we'll see that next!)\n") | |
| ``` | |
| **Output:** | |
| ``` | |
| 🎮 ================================================================ 🎮 | |
|  ✅ Importing Real OpenSpiel Environment! | |
| 🎮 ================================================================ 🎮 | |
| 📦 What we just imported: | |
|  • OpenSpielEnv - HTTP client for OpenSpiel games | |
|  • OpenSpielAction - Type-safe actions | |
|  • OpenSpielObservation - Type-safe observations | |
|  • OpenSpielState - Episode metadata | |
| OpenSpielObservation fields: | |
|  ──────────────────────────────────────────────────────────── | |
|  • done                      : <class 'bool'> | |
|  • reward                    : typing.Union[bool, int, float, NoneType] | |
|  • metadata                  : typing.Dict[str, typing.Any] | |
|  • info_state                : List[float] | |
|  • legal_actions             : List[int] | |
|  • game_phase                : str | |
|  • current_player_id         : int | |
|  • opponent_last_action      : Optional[int] | |
| ====================================================================== | |
| 💡 This is REAL OpenEnv code - used in production! | |
|  • Wraps 6 OpenSpiel games (Catch, Tic-Tac-Toe, Poker, etc.) | |
|  • Type-safe actions and observations | |
|  • Works via HTTP (we'll see that next!) | |
| ``` | |
| --- | |
| ## Part 7: Four Policies 🤖 | |
| Let's test 4 different AI strategies: | |
| | Policy | Strategy | Expected Performance | | |
| |--------|----------|----------------------| | |
| | **🎲 Random** | Pick random action every step | ~20% (pure luck) | | |
| | **Always Stay** | Never move, hope ball lands in center | ~20% (terrible!) | | |
| | **🧠 Smart** | Move paddle toward ball | 100% (optimal!) | | |
| | **Learning** | Start random, learn smart strategy | ~85% (improves over time) | | |
| **💡 All four policies read only `OpenSpielObservation` fields. Random works with any OpenSpiel game; the other three hard-code Catch's action IDs or grid layout.** | |
| ```python | |
| import random | |
| # ============================================================================ | |
| # POLICIES - Different AI strategies (adapted for OpenSpiel) | |
| # ============================================================================ | |
| class RandomPolicy: | |
|     """Baseline: Pure random guessing.""" | |
|     name = "🎲 Random Guesser" | |
|     def select_action(self, obs: OpenSpielObservation) -> int: | |
|         return random.choice(obs.legal_actions) | |
| class AlwaysStayPolicy: | |
|     """Bad strategy: Never moves.""" | |
|     name = "Always Stay" | |
|     def select_action(self, obs: OpenSpielObservation) -> int: | |
|         return 1  # STAY | |
| class SmartPolicy: | |
|     """Optimal: Move paddle toward ball.""" | |
|     name = "🧠 Smart Heuristic" | |
|     def select_action(self, obs: OpenSpielObservation) -> int: | |
|         # For Catch, info_state is a flattened 10x5 grid (50 values, row by row). | |
|         # The ball and the paddle are each encoded as 1.0. | |
|         info_state = list(obs.info_state) | |
|         grid_size = 5  # number of columns | |
|         def is_set(val): | |
|             return abs(val - 1.0) < 0.01 | |
|         # Paddle: the 1.0 in the last row (None if absent, instead of raising ValueError) | |
|         last_row = info_state[-grid_size:] | |
|         paddle_col = next((col for col, val in enumerate(last_row) if is_set(val)), None) | |
|         # Ball: the first 1.0 above the last row | |
|         upper_rows = info_state[:-grid_size] | |
|         ball_col = next((idx % grid_size for idx, val in enumerate(upper_rows) if is_set(val)), None) | |
|         if ball_col is not None and paddle_col is not None: | |
|             if paddle_col < ball_col: | |
|                 return 2  # Move RIGHT | |
|             elif paddle_col > ball_col: | |
|                 return 0  # Move LEFT | |
|         return 1  # STAY (aligned, or positions not found) | |
| class LearningPolicy: | |
|     """Simulated RL: Epsilon-greedy exploration.""" | |
|     name = "Learning Agent" | |
|     def __init__(self): | |
|         self.steps = 0 | |
|         self.smart_policy = SmartPolicy() | |
|     def select_action(self, obs: OpenSpielObservation) -> int: | |
|         self.steps += 1 | |
|         # Decay exploration rate over time | |
|         epsilon = max(0.1, 1.0 - (self.steps / 100)) | |
|         if random.random() < epsilon: | |
|             # Explore: random action | |
|             return random.choice(obs.legal_actions) | |
|         else: | |
|             # Exploit: use smart strategy | |
|             return self.smart_policy.select_action(obs) | |
| print("🤖 " + "="*64 + " 🤖") | |
| print(" ✅ 4 Policies Created (Adapted for OpenSpiel)!") | |
| print("🤖 " + "="*64 + " 🤖\n") | |
| policies = [RandomPolicy(), AlwaysStayPolicy(), SmartPolicy(), LearningPolicy()] | |
| for i, policy in enumerate(policies, 1): | |
|     print(f" {i}. {policy.name}") | |
| print("\n💡 These policies work with OpenSpielObservation!") | |
| print(" • Read info_state (flattened grid)") | |
| print(" • Use legal_actions") | |
| print(" • Work with ANY OpenSpiel game that exposes these!\n") | |
| ``` | |
| **Output:** | |
| ``` | |
| 🤖 ================================================================ 🤖 | |
|  ✅ 4 Policies Created (Adapted for OpenSpiel)! | |
| 🤖 ================================================================ 🤖 | |
|  1. 🎲 Random Guesser | |
|  2. Always Stay | |
|  3. 🧠 Smart Heuristic | |
|  4. Learning Agent | |
| 💡 These policies work with OpenSpielObservation! | |
|  • Read info_state (flattened grid) | |
|  • Use legal_actions | |
|  • Work with ANY OpenSpiel game that exposes these! | |
| ``` | |
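The exploration schedule inside `LearningPolicy` is easier to judge with a few numbers. The helper below, `epsilon_at` (an illustrative name), reproduces the same formula, `max(0.1, 1.0 - steps / 100)`, and prints the exploration rate at selected step counts.

```python
def epsilon_at(step: int, floor: float = 0.1, horizon: int = 100) -> float:
    """Exploration rate after `step` calls to select_action."""
    return max(floor, 1.0 - step / horizon)


for step in (1, 25, 50, 90, 200):
    print(f"step {step:3d}: epsilon = {epsilon_at(step):.2f}")
# step   1: epsilon = 0.99
# step  25: epsilon = 0.75
# step  50: epsilon = 0.50
# step  90: epsilon = 0.10
# step 200: epsilon = 0.10
```

The counter is never reset between episodes, so the rate reaches its 10% floor after 90 action selections and stays there. From then on the agent follows the smart heuristic nine times out of ten, which is why its success rate ends up below the heuristic's.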
| --- | |
| ## Part 8: Policy Competition! | |
| Let's run **50 episodes** for each policy against **REAL OpenSpiel** and see who wins! | |
| This is production code - every action is an HTTP call to the OpenSpiel server! | |
| ```python | |
| def evaluate_policies(env, num_episodes=50): | |
|     """Compare all policies over many episodes using real OpenSpiel.""" | |
|     policies = [ | |
|         RandomPolicy(), | |
|         AlwaysStayPolicy(), | |
|         SmartPolicy(), | |
|         LearningPolicy(), | |
|     ] | |
|     print("\n" + "="*70) | |
|     print(f" POLICY SHOWDOWN - {num_episodes} Episodes Each") | |
|     print(f" Playing against REAL OpenSpiel Catch!") | |
|     print("="*70 + "\n") | |
|     results = [] | |
|     for policy in policies: | |
|         print(f"⚡ Testing {policy.name}...", end=" ") | |
|         # run_episode returns a truthy value when the ball was caught | |
|         successes = sum(run_episode(env, policy, visualize=False) | |
|                         for _ in range(num_episodes)) | |
|         success_rate = (successes / num_episodes) * 100 | |
|         results.append((policy.name, success_rate, successes)) | |
|         print(f"✓ Done!") | |
|     print("\n" + "="*70) | |
|     print(" FINAL RESULTS") | |
|     print("="*70 + "\n") | |
|     # Sort by success rate (descending) | |
|     results.sort(key=lambda x: x[1], reverse=True) | |
|     # Award medals to top 3 | |
|     medals = ["🥇", "🥈", "🥉", "  "] | |
|     for i, (name, rate, successes) in enumerate(results): | |
|         medal = medals[i] | |
|         bar = "█" * int(rate / 2) | |
|         print(f"{medal} {name:25s} [{bar:<50}] {rate:5.1f}% ({successes}/{num_episodes})") | |
|     print("\n" + "="*70) | |
|     print("\n✨ Key Insights:") | |
|     print(" • Random (~20%): Baseline - pure luck 🎲") | |
|     print(" • Always Stay (~20%): Bad strategy - stays center") | |
|     print(" • Smart (100%): Optimal - perfect play! 🧠") | |
|     print(" • Learning (~85%): Improves over time") | |
|     print("\nThis is Reinforcement Learning + OpenEnv in action:") | |
|     print(" 1. We USED existing OpenSpiel environment (didn't build it)") | |
|     print(" 2. Type-safe communication over HTTP") | |
|     print(" 3. Same code works for ANY OpenSpiel game") | |
|     print(" 4. Production-ready architecture\n") | |
| # Run the epic competition! | |
| # `client` and `run_episode` come from earlier notebook cells that are not shown on this page | |
| print("🎮 Starting the showdown against REAL OpenSpiel...\n") | |
| evaluate_policies(client, num_episodes=50) | |
| ``` | |
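`evaluate_policies` relies on a `run_episode(env, policy, visualize=False)` helper that this page does not show. A minimal sketch of what such a helper could look like is given below; it is an assumption about its behaviour, not the notebook's actual implementation. It treats an episode as a success when the final reward is positive. `SketchAction`, `FakeEnv`, `FakeResult`, `FakeObservation`, and `AlwaysStay` are offline stand-ins, and with the real client you would pass `make_action=lambda a: OpenSpielAction(action_id=a, game_name="catch")`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class SketchAction:
    """Stand-in for OpenSpielAction so the sketch runs offline."""
    action_id: int
    game_name: str = "catch"


def run_episode(env: Any, policy: Any, visualize: bool = False,
                make_action: Optional[Callable[[int], Any]] = None,
                max_steps: int = 100) -> bool:
    """Play one episode; True if it ended with a positive reward."""
    if make_action is None:
        make_action = lambda action_id: SketchAction(action_id=action_id)
    result = env.reset()
    for step in range(max_steps):
        if result.done:
            break
        action_id = policy.select_action(result.observation)
        result = env.step(make_action(action_id))
        if visualize:
            print(f"step {step + 1}: action={action_id} reward={result.reward}")
    return bool(result.done and (result.reward or 0) > 0)


# --- Offline stand-ins used only to exercise the sketch ---
@dataclass
class FakeObservation:
    legal_actions: List[int] = field(default_factory=lambda: [0, 1, 2])


@dataclass
class FakeResult:
    observation: FakeObservation
    reward: Optional[float] = None
    done: bool = False


class FakeEnv:
    """Three-step episode that pays +1 only if the last action was STAY (1)."""

    def reset(self) -> FakeResult:
        self._steps = 0
        return FakeResult(FakeObservation())

    def step(self, action: SketchAction) -> FakeResult:
        self._steps += 1
        done = self._steps >= 3
        reward = 1.0 if done and action.action_id == 1 else 0.0
        return FakeResult(FakeObservation(), reward=reward, done=done)


class AlwaysStay:
    def select_action(self, obs: FakeObservation) -> int:
        return 1


print(run_episode(FakeEnv(), AlwaysStay()))  # True
```

Returning a boolean lets the caller count successes with `sum(...)`, which matches how `evaluate_policies` uses the helper.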
| --- | |
| ## Part 9: Switching to Other Games 🎮 | |
| ### What We Just Used: Real OpenSpiel! | |
| In Parts 6-8, we **USED** the existing OpenSpiel Catch environment: | |
| | What We Did | How It Works | | |
| |-------------|--------------| | |
| | **Imported** | OpenSpielEnv client (pre-built) | | |
| | **Started** | OpenSpiel server via uvicorn | | |
| | **Connected** | HTTP client to server | | |
| | **Played** | Real OpenSpiel Catch game | | |
| **🎯 This is production code!** Every action was an HTTP call to a real OpenSpiel environment. | |
| ### 🎮 6 Games Available - Same Interface! | |
| The beauty of OpenEnv? **Same code, different games!** | |
| ```python | |
| # We just used Catch | |
| env = OpenSpielEnv(base_url="http://localhost:8000") | |
| # game_name="catch" was set via environment variable | |
| # Want Tic-Tac-Toe instead? Just change the game! | |
| # Start server with: OPENSPIEL_GAME=tic_tac_toe uvicorn ... | |
| # Same client code works! | |
| ``` | |
| **🎮 All 6 Games:** | |
| 1. ✅ **`catch`** - What we just used! | |
| 2. **`tic_tac_toe`** - Classic 3×3 | |
| 3. **`kuhn_poker`** - Imperfect information poker | |
| 4. **`cliff_walking`** - Grid navigation | |
| 5. **`2048`** - Tile puzzle | |
| 6. **`blackjack`** - Card game | |
| **All use the exact same OpenSpielEnv client!** | |
| ### Try Another Game (Optional): | |
| ```python | |
| import os | |
| import subprocess | |
| import sys | |
| # `server_process` and `work_dir` are assumed to come from the earlier server-start cell (not shown on this page) | |
| # Stop the current server (kill the server_process), e.g. server_process.terminate() | |
| # Then start a new game: | |
| server_process = subprocess.Popen( | |
|     [sys.executable, "-m", "uvicorn", | |
|      "envs.openspiel_env.server.app:app", | |
|      "--host", "0.0.0.0", | |
|      "--port", "8000"], | |
|     env={**os.environ, | |
|          "PYTHONPATH": f"{work_dir}/src", | |
|          "OPENSPIEL_GAME": "tic_tac_toe",  # Changed! | |
|          "OPENSPIEL_AGENT_PLAYER": "0", | |
|          "OPENSPIEL_OPPONENT_POLICY": "random"}, | |
|     # ... rest of config | |
| ) | |
| # Same client works! | |
| client = OpenSpielEnv(base_url="http://localhost:8000") | |
| result = client.reset()  # Now playing Tic-Tac-Toe! | |
| ``` | |
| **💡 Key Insight**: You don't rebuild anything - you just USE different games with the same client! | |
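Since the game is selected purely through environment variables, the per-game configuration can be built by a small helper and validated before any server is launched. `build_server_env` and `SUPPORTED_GAMES` are illustrative names; the variable names and game identifiers are the ones used in this section, and nothing here starts a process.

```python
import os
from typing import Dict, Mapping, Optional

# The six game names listed in this section.
SUPPORTED_GAMES = (
    "catch", "tic_tac_toe", "kuhn_poker",
    "cliff_walking", "2048", "blackjack",
)


def build_server_env(game: str, src_dir: str, agent_player: int = 0,
                     opponent_policy: str = "random",
                     base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the OpenSpiel settings applied."""
    if game not in SUPPORTED_GAMES:
        raise ValueError(f"unknown game {game!r}; expected one of {SUPPORTED_GAMES}")
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "PYTHONPATH": src_dir,
        "OPENSPIEL_GAME": game,
        "OPENSPIEL_AGENT_PLAYER": str(agent_player),
        "OPENSPIEL_OPPONENT_POLICY": opponent_policy,
    })
    return env


server_env = build_server_env("tic_tac_toe", src_dir="src", base_env={})
print(server_env["OPENSPIEL_GAME"])  # tic_tac_toe
```

The resulting dictionary can be passed as the `env=` argument of `subprocess.Popen`, so a typo in the game name fails immediately instead of producing a server that does not start.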
| --- | |
| ## Part 10: Create Your Own Integration ๐ ๏ธ | |
| ### The 5-Step Pattern | |
| Want to wrap your own environment in OpenEnv? Here's how: | |
| ### Step 1: Define Types (`models.py`) | |
| ```python | |
| from dataclasses import dataclass | |
| from typing import List | |
| from core.env_server import Action, Observation, State | |
| @dataclass | |
| class YourAction(Action): | |
|     action_value: int | |
|     # Add your action fields | |
| @dataclass | |
| class YourObservation(Observation): | |
|     state_data: List[float] | |
|     done: bool | |
|     reward: float | |
|     # Add your observation fields | |
| @dataclass | |
| class YourState(State): | |
|     episode_id: str | |
|     step_count: int | |
|     # Add your state fields | |
| ``` | |
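To see how typed models like these behave without installing OpenEnv, here is a minimal sketch. The `Action` and `Observation` base classes below are empty stand-ins (an assumption: the real ones live in `core.env_server` and may define fields of their own), and `GuessAction` / `GuessObservation` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


# Empty stand-ins for the base classes in core.env_server (assumption).
@dataclass
class Action:
    pass


@dataclass
class Observation:
    pass


@dataclass
class GuessAction(Action):
    action_value: int


@dataclass
class GuessObservation(Observation):
    state_data: List[float] = field(default_factory=list)
    done: bool = False
    reward: float = 0.0


action = GuessAction(action_value=3)
obs = GuessObservation(state_data=[0.5, 1.0], done=False, reward=1.0)

# Dataclasses convert to plain dicts, which is what can travel as JSON.
wire = json.dumps(asdict(obs))
restored = GuessObservation(**json.loads(wire))

print(asdict(action))
print(restored == obs)
```

The round trip through JSON returns an equal object, which is the property a typed client and server rely on.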
| ### Step 2: Implement Environment (`server/environment.py`) | |
| ```python | |
| from core.env_server import Action, Environment, Observation, State | |
| from ..models import YourObservation  # types from Step 1; adjust to your package layout | |
| class YourEnvironment(Environment): | |
|     def reset(self) -> Observation: | |
|         # Initialize your game/simulation | |
|         return YourObservation(...) | |
|     def step(self, action: Action) -> Observation: | |
|         # Execute action, update state | |
|         return YourObservation(...) | |
|     @property | |
|     def state(self) -> State: | |
|         # self._state must be set in reset() (or __init__) before it is read | |
|         return self._state | |
| ``` | |
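The template above can be filled in with a toy game. The sketch below is self-contained: `Environment` is a stand-in abstract class with the same three members (an assumption about the real `core.env_server.Environment`), and `CounterEnvironment` with its types is hypothetical. The episode ends once the running total reaches a target.

```python
import uuid
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CounterAction:
    action_value: int


@dataclass
class CounterObservation:
    total: int
    done: bool
    reward: float


@dataclass
class CounterState:
    episode_id: str
    step_count: int


class Environment(ABC):
    """Stand-in for core.env_server.Environment (assumed interface)."""

    @abstractmethod
    def reset(self): ...

    @abstractmethod
    def step(self, action): ...

    @property
    @abstractmethod
    def state(self): ...


class CounterEnvironment(Environment):
    TARGET = 5  # episode ends when the total reaches this value

    def __init__(self) -> None:
        self._total = 0
        self._state = CounterState(episode_id="", step_count=0)

    def reset(self) -> CounterObservation:
        # Start a fresh episode with a new id and zeroed counters.
        self._total = 0
        self._state = CounterState(episode_id=str(uuid.uuid4()), step_count=0)
        return CounterObservation(total=0, done=False, reward=0.0)

    def step(self, action: CounterAction) -> CounterObservation:
        self._total += action.action_value
        self._state.step_count += 1
        done = self._total >= self.TARGET
        return CounterObservation(self._total, done, 1.0 if done else 0.0)

    @property
    def state(self) -> CounterState:
        return self._state


env = CounterEnvironment()
obs = env.reset()
while not obs.done:
    obs = env.step(CounterAction(action_value=2))
print(obs.total, obs.reward, env.state.step_count)  # 6 1.0 3
```

Creating `_state` in `__init__` as well as in `reset()` means reading `state` before the first reset does not raise.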
| ### Step 3: Create Client (`client.py`) | |
| ```python | |
| from core.http_env_client import HTTPEnvClient | |
| from core.types import StepResult | |
| from .models import YourAction, YourObservation, YourState  # types from Step 1 | |
| class YourEnv(HTTPEnvClient[YourAction, YourObservation]): | |
|     def _step_payload(self, action: YourAction) -> dict: | |
|         """Convert action to JSON""" | |
|         return {"action_value": action.action_value} | |
|     def _parse_result(self, payload: dict) -> StepResult: | |
|         """Parse JSON to observation""" | |
|         return StepResult( | |
|             observation=YourObservation(...), | |
|             reward=payload['reward'], | |
|             done=payload['done'] | |
|         ) | |
|     def _parse_state(self, payload: dict) -> YourState: | |
|         return YourState(...) | |
| ``` | |
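The two conversion methods are the whole job of the client, so they are worth trying in isolation. The sketch below uses plain functions and a stand-in `StepResult` (its fields are assumed from the template above). The payload shape, with the observation nested under an `"observation"` key, is also an assumption and not a documented wire format.

```python
from dataclasses import dataclass
from typing import Any, Dict, Generic, List, Optional, TypeVar

ObsT = TypeVar("ObsT")


@dataclass
class StepResult(Generic[ObsT]):
    """Stand-in for the library's StepResult (assumed fields)."""
    observation: ObsT
    reward: Optional[float]
    done: bool


@dataclass
class YourAction:
    action_value: int


@dataclass
class YourObservation:
    state_data: List[float]


def step_payload(action: YourAction) -> Dict[str, Any]:
    """Typed action -> JSON-ready dict."""
    return {"action_value": action.action_value}


def parse_result(payload: Dict[str, Any]) -> StepResult[YourObservation]:
    """JSON dict -> typed result, tolerating missing optional keys."""
    obs_data = payload.get("observation", {})
    observation = YourObservation(state_data=list(obs_data.get("state_data", [])))
    return StepResult(
        observation=observation,
        reward=payload.get("reward"),
        done=bool(payload.get("done", False)),
    )


# Hypothetical server response.
sample = {"observation": {"state_data": [0.0, 1.0]}, "reward": 1.0, "done": True}
result = parse_result(sample)
print(step_payload(YourAction(action_value=2)))
print(result.observation.state_data, result.reward, result.done)
```

Using `.get()` with defaults keeps the parser from raising `KeyError` on a partial response. Whether to be that lenient is a design choice, and a strict parser that fails loudly is equally defensible.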
| ### Step 4: Create Server (`server/app.py`) | |
| ```python | |
| from core.env_server import create_fastapi_app | |
| from .environment import YourEnvironment  # server/environment.py from Step 2 | |
| env = YourEnvironment() | |
| app = create_fastapi_app(env) | |
| # That's it! OpenEnv creates all endpoints for you. | |
| ``` | |
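To make "creates all endpoints" less of a black box, the sketch below shows the general idea in plain Python with no web framework: each route maps to one environment method, and JSON goes in and out. The route names `/reset`, `/step` and `/state`, the `EchoEnvironment` class, and the `make_routes` / `handle` helpers are all illustrative assumptions, not the code that `create_fastapi_app` generates.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict


@dataclass
class EchoObservation:
    echoed: int
    done: bool
    reward: float


class EchoEnvironment:
    """Tiny hypothetical environment that echoes the action back."""

    def __init__(self) -> None:
        self.step_count = 0

    def reset(self) -> EchoObservation:
        self.step_count = 0
        return EchoObservation(echoed=0, done=False, reward=0.0)

    def step(self, action_value: int) -> EchoObservation:
        self.step_count += 1
        return EchoObservation(echoed=action_value, done=False, reward=float(action_value))


Handler = Callable[[Dict[str, Any]], Dict[str, Any]]


def make_routes(env: EchoEnvironment) -> Dict[str, Handler]:
    # One route per environment method; names are assumptions.
    return {
        "/reset": lambda body: asdict(env.reset()),
        "/step": lambda body: asdict(env.step(body["action_value"])),
        "/state": lambda body: {"step_count": env.step_count},
    }


def handle(routes: Dict[str, Handler], path: str, raw_body: str) -> str:
    """Decode the JSON body, call the handler, encode the JSON response."""
    body = json.loads(raw_body) if raw_body else {}
    return json.dumps(routes[path](body))


routes = make_routes(EchoEnvironment())
print(handle(routes, "/reset", ""))
print(handle(routes, "/step", '{"action_value": 3}'))
print(handle(routes, "/state", ""))
```

A real server adds validation, error responses and concurrency handling on top of this dispatch step.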
| ### Step 5: Dockerize (`server/Dockerfile`) | |
| ```dockerfile | |
| FROM python:3.11-slim | |
| WORKDIR /app | |
| COPY requirements.txt . | |
| RUN pip install --no-cache-dir -r requirements.txt | |
| COPY . . | |
| CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"] | |
| ``` | |
| ### Examples to Study | |
| OpenEnv includes 3 complete examples: | |
| 1. **`src/envs/echo_env/`** | |
| - Simplest possible environment | |
| - Great for testing and learning | |
| 2. **`src/envs/openspiel_env/`** | |
| - Wraps external library (OpenSpiel) | |
| - Shows integration pattern | |
| - 6 games in one integration | |
| 3. **`src/envs/coding_env/`** | |
| - Python code execution environment | |
| - Shows complex use case | |
| - Security considerations | |
| **💡 Study these to understand the patterns!** | |
| --- | |
| ## Summary: Your Journey | |
| ### What You Learned | |
| <table> | |
| <tr> | |
| <td width="50%" style="vertical-align: top;"> | |
| ### Concepts | |
| ✅ **RL Fundamentals** | |
| - The observe-act-reward loop | |
| - What makes good policies | |
| - Exploration vs exploitation | |
| ✅ **OpenEnv Architecture** | |
| - Client-server separation | |
| - Type-safe contracts | |
| - HTTP communication layer | |
| ✅ **Production Patterns** | |
| - Docker isolation | |
| - API design | |
| - Reproducible deployments | |
| </td> | |
| <td width="50%" style="vertical-align: top;"> | |
| ### 🛠️ Skills | |
| ✅ **Using Environments** | |
| - Import OpenEnv clients | |
| - Call reset/step/state | |
| - Work with typed observations | |
| ✅ **Building Environments** | |
| - Define type-safe models | |
| - Implement Environment class | |
| - Create HTTPEnvClient | |
| ✅ **Testing & Debugging** | |
| - Compare policies | |
| - Visualize episodes | |
| - Measure performance | |
| </td> | |
| </tr> | |
| </table> | |
| ### OpenEnv vs Traditional RL | |
| | Feature | Traditional (Gym) | OpenEnv | Winner | | |
| |---------|------------------|---------|--------| | |
| | **Type Safety** | ❌ Arrays, dicts | ✅ Dataclasses | 🏆 OpenEnv | | |
| | **Isolation** | ❌ Same process | ✅ Docker | 🏆 OpenEnv | | |
| | **Deployment** | ❌ Manual setup | ✅ K8s-ready | 🏆 OpenEnv | | |
| | **Language** | ❌ Python only | ✅ Any (HTTP) | 🏆 OpenEnv | | |
| | **Reproducibility** | ❌ "Works on my machine" | ✅ Same everywhere | 🏆 OpenEnv | | |
| | **Community** | ✅ Large ecosystem | 🟡 Growing | 🤝 Both! | | |
| !!! success "The Bottom Line" | |
| OpenEnv brings **production engineering** to RL: | |
| - Same environments work locally and in production | |
| - Type safety catches bugs early | |
| - Docker isolation prevents conflicts | |
| - HTTP API works with any language | |
| **It's RL for 2025 and beyond.** | |
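The "type safety catches bugs early" point can be seen in a few lines. This is a minimal sketch with a hypothetical `Observation` dataclass: the same typo passes silently on a dict but fails at once on a dataclass.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    reward: float
    done: bool


dict_obs = {"reward": 1.0, "done": False}
typed_obs = Observation(reward=1.0, done=False)

# Typo in a dict key: .get() quietly returns None and the bug travels on.
silent = dict_obs.get("rewrad")

# Same typo on a dataclass attribute: fails immediately with AttributeError.
try:
    typed_obs.rewrad
    caught = False
except AttributeError:
    caught = True

print(silent, caught)  # None True
```

A static type checker would flag the dataclass typo before the code runs at all, which a dict lookup does not allow.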
| --- | |
| ## Resources | |
| ### Essential Links | |
| - **OpenEnv GitHub**: https://github.com/meta-pytorch/OpenEnv | |
| - **🎮 OpenSpiel**: https://github.com/google-deepmind/open_spiel | |
| - **⚡ FastAPI Docs**: https://fastapi.tiangolo.com/ | |
| - **🐳 Docker Guide**: https://docs.docker.com/get-started/ | |
| - **🔥 PyTorch**: https://pytorch.org/ | |
| ### Documentation Deep Dives | |
| - **Environment Creation Guide**: `src/envs/README.md` | |
| - **OpenSpiel Integration**: `src/envs/openspiel_env/README.md` | |
| - **Example Scripts**: `examples/` | |
| - **RFC 001**: [Baseline API Specs](https://github.com/meta-pytorch/OpenEnv/pull/26) | |
| ### Community & Support | |
| **Supported by amazing organizations:** | |
| - 🔥 Meta PyTorch | |
| - 🤗 Hugging Face | |
| - ⚡ Unsloth AI | |
| - Reflection AI | |
| - And many more! | |
| **License**: BSD 3-Clause (very permissive!) | |
| **Contributions**: Always welcome! Check out the issues tab. | |
| --- | |
| ### What's Next? | |
| 1. ⭐ **Star the repo** to show support and stay updated | |
| 2. **Try modifying** the Catch game (make it harder? bigger grid?) | |
| 3. 🎮 **Explore** other OpenSpiel games | |
| 4. 🛠️ **Build** your own environment integration | |
| 5. 💬 **Share** what you build with the community! | |