# RL Training with OpenEnv: 2048 Game

This tutorial covers training a language model to play the 2048 game using
reinforcement learning with GRPO (Group Relative Policy Optimization).

```{note}
**Time**: ~45 minutes | **Difficulty**: Advanced | **GPU Required**: Yes (T4 or better)
```

## What You'll Learn

- **Model Setup**: Load and configure LLMs with Unsloth for efficient RL
- **Environment Connection**: Connect to the 2048 OpenEnv environment
- **Reward Design**: Create effective reward functions
- **GRPO Training**: Train models with reinforcement learning
- **Deployment**: Save and deploy trained models

## Prerequisites

Before starting this tutorial, you should have completed the
[Getting Started](/auto_getting_started/index) series to understand:

- How OpenEnv environments work
- The reset/step/state API pattern
- How to connect to environments

You'll also need:

- A GPU (free T4 on Google Colab works)
- Basic understanding of PyTorch
- ~30 minutes for training

## Part 1: Environment Setup

### Installation

```bash
# Install required packages
!pip install -q unsloth openenv-core trl

# For Google Colab, also run:
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
```

### Imports

```python
import random
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import torch

# Check GPU availability
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
```

## Part 2: Model Configuration

We use Unsloth for memory-efficient training with LoRA adapters.
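
As a quick refresher (this is standard LoRA, nothing Unsloth-specific): the pretrained weight matrix stays frozen, and only two low-rank factors are trained:

```{math}
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

With the `r=16` and `lora_alpha=32` defaults below, the update is scaled by alpha/r = 2, and the number of trainable parameters grows with `r` rather than with the full weight dimensions.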

### Configuration Classes

```python
@dataclass
class ModelConfig:
    """Configuration for loading LLM models."""
    model_name: str = "unsloth/Qwen2.5-1.5B"
    max_seq_length: int = 768
    load_in_4bit: bool = True
    dtype: Optional[str] = None  # Auto-detect


@dataclass
class LoRAConfig:
    """Configuration for LoRA fine-tuning."""
    r: int = 16
    lora_alpha: int = 32
    target_modules: Optional[List[str]] = None  # Filled in __post_init__
    lora_dropout: float = 0.0

    def __post_init__(self):
        if self.target_modules is None:
            self.target_modules = [
                "q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj",
            ]
```

### Loading the Model

```python
from unsloth import FastLanguageModel

# Create configurations
model_config = ModelConfig()
lora_config = LoRAConfig()

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_config.model_name,
    max_seq_length=model_config.max_seq_length,
    load_in_4bit=model_config.load_in_4bit,
    dtype=model_config.dtype,
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_config.r,
    target_modules=lora_config.target_modules,
    lora_alpha=lora_config.lora_alpha,
    lora_dropout=lora_config.lora_dropout,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Check parameter counts
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({trainable/total*100:.2f}%)")
```
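
Loading the base weights in 4-bit while training LoRA adapters on top is the QLoRA recipe: only the small adapter matrices receive gradients, so the trainable fraction printed above typically lands on the order of 1% of the full model.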

## Part 3: The 2048 Environment

### Game Overview

2048 is a sliding puzzle game played on a 4x4 grid: each move slides all tiles in one direction, equal adjacent tiles merge, and the goal is to build a tile worth 2048.

**Actions:**
- `0` = UP
- `1` = RIGHT
- `2` = DOWN
- `3` = LEFT

**Goal:** Create a tile with value 2048 (or higher!)
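
For readable logs it helps to map these ids to names; `ACTION_NAMES` below is a tutorial-local helper, not part of the environment API:

```python
# Human-readable labels for the four action ids used throughout this tutorial
ACTION_NAMES = {0: "UP", 1: "RIGHT", 2: "DOWN", 3: "LEFT"}

def action_name(action_id: int) -> str:
    """Return a readable label for an action id, e.g. for logging."""
    return ACTION_NAMES.get(action_id, f"UNKNOWN({action_id})")
```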

### Connecting to the Environment

```python
from envs.openspiel_env import OpenSpielEnv, OpenSpielAction

# Connect to the 2048 environment
# Option 1: From the Hub
env = OpenSpielEnv.from_hub("openenv/openspiel-env")

# Option 2: From a running server
# env = OpenSpielEnv(base_url="http://localhost:8000")

# Test the connection
with env:
    result = env.reset()
    print("Game started!")
    print(f"Legal actions: {result.observation.legal_actions}")

    # Take a test action
    action = OpenSpielAction(action_id=0, game_name="2048")
    result = env.step(action)
    print(f"After UP: reward={result.reward}, done={result.done}")
```

### Board Utilities

```python
from typing import List

import numpy as np

def info_state_to_board(info_state: List[int], size: int = 4) -> List[List[int]]:
    """Convert flat info_state to 2D board."""
    return np.array(info_state, dtype=int).reshape(size, size).tolist()

def render_board(board: List[List[int]]) -> str:
    """Render board as ASCII string."""
    lines = ["+------" * len(board[0]) + "+"]
    for row in board:
        cells = [f"{v:5d}" if v > 0 else "    ." for v in row]
        lines.append("|" + " |".join(cells) + " |")
        lines.append("+------" * len(row) + "+")
    return "\n".join(lines)

def get_max_tile(board: List[List[int]]) -> int:
    """Get highest tile value."""
    return max(cell for row in board for cell in row)
```
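
A quick sanity check with a hand-made position (the flat vector below is illustrative; depending on the OpenSpiel build, `info_state` may encode tiles differently, so treat this as a shape check):

```python
# A made-up mid-game position, flattened row by row
sample_info_state = [
    0, 2, 4, 8,
    0, 0, 2, 16,
    0, 0, 4, 32,
    2, 2, 8, 64,
]

board = info_state_to_board(sample_info_state)
print(render_board(board))
print(f"Max tile: {get_max_tile(board)}")  # -> 64
```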

## Part 4: Reward Function Design

The reward function is crucial for RL. We consider:

1. **Success**: Did we reach 2048?
2. **Progress**: What's the highest tile achieved?
3. **Code Quality**: Did the generated code execute correctly?

### Reward Implementation

```python
import math

def calculate_reward(
    max_tile: int,
    success: bool,
    code_error: bool = False,
) -> float:
    """
    Calculate the reward for a 2048 game outcome.

    Args:
        max_tile: Highest tile achieved (2, 4, 8, ..., 2048)
        success: Whether we reached 2048
        code_error: Whether the generated code had errors

    Returns:
        Float reward value
    """
    if code_error:
        return -0.5  # Penalty for invalid code

    if success:
        return 1.0  # Full reward for winning

    # Progress reward: log scale, capped at 0.9 so it never beats a win
    if max_tile > 0:
        progress = math.log2(max_tile) / math.log2(2048)
        return min(0.9, progress)

    return 0.0

# Test the reward function
test_cases = [
    (2048, True, False, "Won!"),
    (1024, False, False, "Got to 1024"),
    (512, False, False, "Got to 512"),
    (64, False, False, "Early game"),
]

for max_tile, success, error, desc in test_cases:
    reward = calculate_reward(max_tile, success, error)
    print(f"{desc:20s} -> Reward: {reward:+.3f}")
```
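
Since the progress term is `log2(max_tile) / log2(2048)`, the test cases above come out to 1.000, 0.900 (capped from 10/11 ≈ 0.909), 0.818, and 0.545: each doubling of the best tile earns the same increment, so the agent gets a useful gradient signal long before it ever wins.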

## Part 5: Strategy Generation

We'll train the model to generate Python strategy functions.

### Prompt Template

```python
SYSTEM_PROMPT = """You are an expert at playing 2048. Generate a Python function
that takes a board state and returns the best action (0=UP, 1=RIGHT, 2=DOWN, 3=LEFT).

The board is a 4x4 list of integers. Empty cells are 0.
Your function should analyze the board and return an optimal move.
"""

def create_prompt(board: List[List[int]]) -> str:
    """Create a prompt for strategy generation."""
    board_str = "\n".join(str(row) for row in board)
    return f"""{SYSTEM_PROMPT}

Current board:
{board_str}

Generate a strategy function:
```python
def strategy(board):
    # Your code here
    return action  # 0, 1, 2, or 3
```"""
```

### Executing Generated Strategies

```python
import ast

def extract_and_execute_strategy(
    generated_code: str,
    board: List[List[int]],
) -> tuple[int, bool]:
    """
    Extract and execute a generated strategy function.

    Returns:
        (action, success): The action to take and whether execution succeeded
    """
    try:
        # Extract the code block, if the model wrapped its answer in one
        if "```python" in generated_code:
            code = generated_code.split("```python")[1].split("```")[0]
        else:
            code = generated_code

        # Parse and validate the AST before executing anything
        tree = ast.parse(code)

        # Execute in an isolated namespace (see "Preventing Reward Hacking"
        # below for timeouts and resource limits)
        namespace = {"board": board}
        exec(compile(tree, "<strategy>", "exec"), namespace)

        # Call the strategy function
        if "strategy" in namespace:
            action = namespace["strategy"](board)
            if action in [0, 1, 2, 3]:
                return action, True

        return 0, False  # Default action on failure

    except Exception as e:
        print(f"Strategy execution error: {e}")
        return 0, False
```

## Part 6: GRPO Training

GRPO (Group Relative Policy Optimization) is an RL algorithm designed for language models: instead of learning a separate value function, it samples a group of completions per prompt and computes each completion's advantage relative to the group's average reward.
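
A minimal sketch of that core idea, using the `calculate_reward` outputs from Part 4 (this helper is illustrative, not TRL's internal implementation):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its group's mean and std (the GRPO advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical completions for one board, scored by calculate_reward:
print(group_relative_advantages([0.9, 0.545, -0.5, 0.818]))
# Completions above the group mean get positive advantages; below, negative.
```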

### Training Configuration

```python
from trl import GRPOConfig, GRPOTrainer

grpo_config = GRPOConfig(
    # Learning rate
    learning_rate=2e-6,

    # Batch sizes
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,

    # Training duration
    max_steps=200,

    # Memory optimization
    bf16=True,
    gradient_checkpointing=True,

    # Logging
    logging_steps=1,
    output_dir="./2048_grpo_output",
    report_to="none",
)
```
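
With `per_device_train_batch_size=4` and `gradient_accumulation_steps=4`, each optimizer step sees an effective batch of 16 sequences, and `max_steps=200` bounds the run at 200 optimizer steps regardless of dataset size.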

### Training Loop

```python
def train_2048_agent(
    model,
    tokenizer,
    env,
    config: GRPOConfig,
    num_episodes: int = 100,
):
    """
    Roll out 2048 episodes and collect (prompt, response, reward) records
    for GRPO training.
    """
    # Prepare the model for training
    FastLanguageModel.for_training(model)

    training_data = []

    for episode in range(num_episodes):
        # Reset the environment
        result = env.reset()
        board = info_state_to_board(result.observation.info_state)

        prompt, generated = "", ""  # In case an episode ends immediately
        episode_reward = 0.0
        steps = 0

        while not result.done and steps < 1000:
            # Generate a strategy
            prompt = create_prompt(board)
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                temperature=0.7,
                do_sample=True,
            )
            generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Execute the strategy
            action, success = extract_and_execute_strategy(generated, board)

            # Take the action in the environment
            env_action = OpenSpielAction(action_id=action, game_name="2048")
            result = env.step(env_action)

            # Update the board
            board = info_state_to_board(result.observation.info_state)
            episode_reward += result.reward if result.reward else 0
            steps += 1

        # Calculate the final reward from the end-of-episode board
        max_tile = get_max_tile(board)
        final_reward = calculate_reward(max_tile, max_tile >= 2048)

        # Store the last prompt/response pair with the episode-level reward
        training_data.append({
            "prompt": prompt,
            "response": generated,
            "reward": final_reward,
        })

        if episode % 10 == 0:
            print(
                f"Episode {episode}: Max tile={max_tile}, "
                f"Env reward={episode_reward:.1f}, Final reward={final_reward:.3f}"
            )

    return training_data
```
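
To close the loop, the collected prompts can be handed to `GRPOTrainer`, which samples its own groups of completions and scores them with a reward function. Below is a minimal sketch, assuming TRL's documented `(completions, **kwargs)` reward-function signature and a prompt-only dataset; `strategy_reward` and `probe_board` are illustrative names of our own, not library APIs:

```python
from datasets import Dataset

# Build a prompt-only dataset from the rollouts collected above
train_dataset = Dataset.from_dict({
    "prompt": [item["prompt"] for item in training_data],
})

def strategy_reward(completions, **kwargs):
    """Score each sampled completion: does it produce a legal action at all?"""
    rewards = []
    for completion in completions:
        # An empty board is enough to check the code runs and returns 0-3
        probe_board = [[0] * 4 for _ in range(4)]
        _, ok = extract_and_execute_strategy(completion, probe_board)
        rewards.append(0.5 if ok else -0.5)
    return rewards

trainer = GRPOTrainer(
    model=model,
    reward_funcs=strategy_reward,
    args=grpo_config,
    train_dataset=train_dataset,
)
trainer.train()
```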

## Part 7: Deployment

After training, save and deploy your model.

### Saving the Model

```python
# Save LoRA adapters only
model.save_pretrained("./2048_strategy_model")
tokenizer.save_pretrained("./2048_strategy_model")

# Save a merged model for inference
model.save_pretrained_merged(
    "./2048_strategy_model_merged",
    tokenizer,
    save_method="merged_16bit",
)
```
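
Saving only the adapters keeps the artifact small (megabytes rather than gigabytes) but requires the base model at load time; the merged 16-bit copy is self-contained and loads with plain `transformers`, as shown further below.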

### Push to Hugging Face Hub

```python
# Push the merged model to the Hub (Unsloth's counterpart to
# save_pretrained_merged above)
model.push_to_hub_merged(
    "your-username/2048-strategy-model",
    tokenizer,
    save_method="merged_16bit",
    private=False,
)

print("Model deployed to: huggingface.co/your-username/2048-strategy-model")
```

### Using the Trained Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the trained model
model = AutoModelForCausalLM.from_pretrained("your-username/2048-strategy-model")
tokenizer = AutoTokenizer.from_pretrained("your-username/2048-strategy-model")

# Generate a strategy
def get_action(board: List[List[int]]) -> int:
    prompt = create_prompt(board)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256)
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    action, _ = extract_and_execute_strategy(generated, board)
    return action

# Play a game
with OpenSpielEnv.from_hub("openenv/openspiel-env") as env:
    result = env.reset()
    board = info_state_to_board(result.observation.info_state)

    while not result.done:
        action = get_action(board)
        result = env.step(OpenSpielAction(action_id=action, game_name="2048"))
        board = info_state_to_board(result.observation.info_state)

    print(f"Final max tile: {get_max_tile(board)}")
```

## Preventing Reward Hacking

Because the model writes code that we then execute, watch for reward hacking:

1. **Code that tampers with rewards** - run it in a sandboxed namespace with restricted builtins
2. **Infinite loops** - set execution timeouts
3. **Memory exhaustion** - cap resource usage

```python
import resource
import signal

def safe_execute(code: str, board: List[List[int]], timeout: float = 5.0) -> int:
    """Execute a strategy with safety limits (Unix-only: relies on SIGALRM)."""

    def handler(signum, frame):
        raise TimeoutError("Strategy timed out")

    # Set a timeout
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(int(timeout))

    try:
        # Set a memory limit (100 MB soft limit)
        resource.setrlimit(resource.RLIMIT_AS, (100 * 1024 * 1024, resource.RLIM_INFINITY))

        # Execute in a restricted namespace with only a few safe builtins
        namespace = {"board": board, "__builtins__": {"len": len, "max": max, "min": min}}
        exec(code, namespace)

        return namespace.get("strategy", lambda b: 0)(board)
    finally:
        signal.alarm(0)
```
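
Note that `signal.alarm` works only on Unix and only in the main thread, and the `RLIMIT_AS` cap applies to the whole process and persists after the call. For real training runs, executing generated strategies in a separate subprocess (or container) gives a cleaner isolation boundary.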

## Summary

In this tutorial, you learned:

1. **Model Setup**: Loading LLMs with Unsloth and LoRA
2. **Environment Connection**: Using OpenEnv's 2048 environment
3. **Reward Design**: Creating balanced reward functions
4. **GRPO Training**: Training with reinforcement learning
5. **Deployment**: Saving and sharing trained models

## Next Steps

- Try different model architectures
- Experiment with reward function designs
- Train on other OpenEnv environments
- Share your trained models on Hugging Face Hub!

## Related Resources

- [OpenEnv Getting Started](../auto_getting_started/index)
- [Building Custom Environments](../auto_getting_started/plot_03_building_environments)
- [GRPO Documentation](https://huggingface.co/docs/trl/grpo_trainer)
- [Unsloth Documentation](https://github.com/unslothai/unsloth)