# RL Training with OpenEnv: 2048 Game

This tutorial covers training a language model to play the 2048 game using
reinforcement learning with GRPO (Group Relative Policy Optimization).

```{note}
**Time**: ~45 minutes | **Difficulty**: Advanced | **GPU Required**: Yes (T4 or better)
```
## What You'll Learn

- **Model Setup**: Load and configure LLMs with Unsloth for efficient RL
- **Environment Connection**: Connect to the 2048 OpenEnv environment
- **Reward Design**: Create effective reward functions
- **GRPO Training**: Train models with reinforcement learning
- **Deployment**: Save and deploy trained models
## Prerequisites

Before starting this tutorial, you should have completed the
[Getting Started](/auto_getting_started/index) series to understand:

- How OpenEnv environments work
- The reset/step/state API pattern
- How to connect to environments

You'll also need:

- A GPU (free T4 on Google Colab works)
- Basic understanding of PyTorch
- ~30 minutes for training
## Part 1: Environment Setup

### Installation

```bash
# Install required packages (prefix with ! when running inside a notebook)
pip install -q unsloth openenv-core trl

# For Google Colab, also run:
pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
```
### Imports

```python
import torch
from dataclasses import dataclass
from typing import List, Optional, Dict, Any
import random

# Check GPU availability
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
```
## Part 2: Model Configuration

We use Unsloth for memory-efficient training with LoRA adapters.
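
To see why LoRA is cheap, here is a back-of-the-envelope parameter count (a sketch: it approximates every target module as a square 1536×1536 projection, which is Qwen2.5-1.5B's hidden size, and assumes 28 layers; real per-module shapes vary):

```python
# LoRA adds r * (d_in + d_out) parameters per adapted matrix instead of
# training the full d_in * d_out weights.
# Assumption: every module treated as a square 1536x1536 projection.
d, r, modules_per_layer, num_layers = 1536, 16, 7, 28

full = d * d * modules_per_layer * num_layers
lora = r * (d + d) * modules_per_layer * num_layers

print(f"Full fine-tune (approx): {full:,} params")
print(f"LoRA adapters (approx):  {lora:,} params ({lora / full:.2%})")
```

The ratio in the low single digits of a percent is why 4-bit loading plus LoRA fits on a free T4.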
### Configuration Classes

```python
@dataclass
class ModelConfig:
    """Configuration for loading LLM models."""
    model_name: str = "unsloth/Qwen2.5-1.5B"
    max_seq_length: int = 768
    load_in_4bit: bool = True
    dtype: Optional[str] = None  # Auto-detect

@dataclass
class LoRAConfig:
    """Configuration for LoRA fine-tuning."""
    r: int = 16
    lora_alpha: int = 32
    target_modules: Optional[List[str]] = None
    lora_dropout: float = 0.0

    def __post_init__(self):
        if self.target_modules is None:
            self.target_modules = [
                "q_proj", "k_proj", "v_proj", "o_proj",
                "gate_proj", "up_proj", "down_proj",
            ]
```
### Loading the Model

```python
from unsloth import FastLanguageModel

# Create configurations
model_config = ModelConfig()
lora_config = LoRAConfig()

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_config.model_name,
    max_seq_length=model_config.max_seq_length,
    load_in_4bit=model_config.load_in_4bit,
    dtype=model_config.dtype,
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_config.r,
    target_modules=lora_config.target_modules,
    lora_alpha=lora_config.lora_alpha,
    lora_dropout=lora_config.lora_dropout,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Check parameter counts
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({trainable/total*100:.2f}%)")
```
## Part 3: The 2048 Environment

### Game Overview

2048 is a sliding puzzle game played on a 4x4 grid: each move slides every tile in one direction, and equal tiles merge into their sum.

**Actions:**

- `0` = UP
- `1` = RIGHT
- `2` = DOWN
- `3` = LEFT

**Goal:** Create a tile with value 2048 (or higher!)
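
A small lookup table keeps logs readable. `ACTION_NAMES` and `describe_action` are helpers we define for this tutorial, not part of the environment API:

```python
# Tutorial helper: map action ids to readable names for logging.
ACTION_NAMES = {0: "UP", 1: "RIGHT", 2: "DOWN", 3: "LEFT"}

def describe_action(action_id: int) -> str:
    """Return a readable label like '0 (UP)'."""
    return f"{action_id} ({ACTION_NAMES.get(action_id, 'UNKNOWN')})"

print(describe_action(2))  # -> "2 (DOWN)"
```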
### Connecting to the Environment

```python
from envs.openspiel_env import OpenSpielEnv, OpenSpielAction

# Connect to the 2048 environment.
# Option 1: From the Hub
env = OpenSpielEnv.from_hub("openenv/openspiel-env")

# Option 2: From a running server
# env = OpenSpielEnv(base_url="http://localhost:8000")

# Test the connection
with env:
    result = env.reset()
    print("Game started!")
    print(f"Legal actions: {result.observation.legal_actions}")

    # Take a test action
    action = OpenSpielAction(action_id=0, game_name="2048")
    result = env.step(action)
    print(f"After UP: reward={result.reward}, done={result.done}")
```
### Board Utilities

```python
import numpy as np
from typing import List

def info_state_to_board(info_state: List[int], size: int = 4) -> List[List[int]]:
    """Convert a flat info_state into a 2D board."""
    return np.array(info_state, dtype=int).reshape(size, size).tolist()

def render_board(board: List[List[int]]) -> str:
    """Render the board as an ASCII string."""
    lines = ["+------" * len(board[0]) + "+"]
    for row in board:
        cells = [f"{v:5d}" if v > 0 else "    ." for v in row]
        lines.append("|" + " |".join(cells) + " |")
        lines.append("+------" * len(row) + "+")
    return "\n".join(lines)

def get_max_tile(board: List[List[int]]) -> int:
    """Get the highest tile value."""
    return max(cell for row in board for cell in row)
```
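
A quick smoke test on a hand-written board (the tile values below are made up for illustration):

```python
sample_board = [
    [2, 4, 0, 0],
    [0, 16, 8, 0],
    [0, 0, 32, 2],
    [0, 0, 0, 4],
]

print(render_board(sample_board))
print(f"Max tile: {get_max_tile(sample_board)}")  # -> Max tile: 32
```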
## Part 4: Reward Function Design

The reward function is crucial for RL. Ours considers three signals:

1. **Success**: Did we reach 2048?
2. **Progress**: What's the highest tile achieved?
3. **Code Quality**: Did the generated code execute correctly?

### Reward Implementation
```python
import math

def calculate_reward(
    max_tile: int,
    success: bool,
    code_error: bool = False
) -> float:
    """
    Calculate the reward for a 2048 game outcome.

    Args:
        max_tile: Highest tile achieved (2, 4, 8, ..., 2048)
        success: Whether we reached 2048
        code_error: Whether the generated code had errors

    Returns:
        Float reward value
    """
    if code_error:
        return -0.5  # Penalty for invalid code

    if success:
        return 1.0  # Full reward for winning

    # Progress reward: log scale from 0 to 0.9
    if max_tile > 0:
        progress = math.log2(max_tile) / math.log2(2048)
        return min(0.9, progress)

    return 0.0

# Test the reward function
test_cases = [
    (2048, True, False, "Won!"),
    (1024, False, False, "Got to 1024"),
    (512, False, False, "Got to 512"),
    (64, False, False, "Early game"),
]

for max_tile, success, error, desc in test_cases:
    reward = calculate_reward(max_tile, success, error)
    print(f"{desc:20s} -> Reward: {reward:+.3f}")
```
## Part 5: Strategy Generation

We'll train the model to generate Python strategy functions.

### Prompt Template

````python
SYSTEM_PROMPT = """You are an expert at playing 2048. Generate a Python function
that takes a board state and returns the best action (0=UP, 1=RIGHT, 2=DOWN, 3=LEFT).
The board is a 4x4 list of integers. Empty cells are 0.
Your function should analyze the board and return an optimal move.
"""

def create_prompt(board: List[List[int]]) -> str:
    """Create the prompt for strategy generation."""
    board_str = "\n".join(str(row) for row in board)
    return f"""{SYSTEM_PROMPT}
Current board:
{board_str}

Generate a strategy function:
```python
def strategy(board):
    # Your code here
    return action  # 0, 1, 2, or 3
```"""
````
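
To see exactly what the model is conditioned on, print a prompt for the sample board from earlier (this assumes `sample_board` and the `tokenizer` defined above):

```python
prompt = create_prompt(sample_board)
print(prompt[:400])  # truncated for brevity
print(f"\nPrompt length: {len(tokenizer(prompt)['input_ids'])} tokens")
```

Keeping the prompt well under `max_seq_length` (768 here) leaves room for the 256 generated tokens.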
### Executing Generated Strategies

```python
import ast

def extract_and_execute_strategy(
    generated_code: str,
    board: List[List[int]],
) -> tuple[int, bool]:
    """
    Extract and execute a generated strategy function.

    Returns:
        (action, success): The action to take and whether execution succeeded
    """
    try:
        # Extract the code block
        if "```python" in generated_code:
            code = generated_code.split("```python")[1].split("```")[0]
        else:
            code = generated_code

        # Parse and validate the AST
        tree = ast.parse(code)

        # Execute in a fresh namespace (not a real sandbox; see the
        # reward-hacking section below for hardened execution)
        namespace = {"board": board}
        exec(compile(tree, "<strategy>", "exec"), namespace)

        # Call the strategy function
        if "strategy" in namespace:
            action = namespace["strategy"](board)
            if action in [0, 1, 2, 3]:
                return action, True
        return 0, False  # Default action on failure
    except Exception as e:
        print(f"Strategy execution error: {e}")
        return 0, False
```
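
A quick check with a hand-written "model output" (the string below is fabricated for illustration; DOWN-biased play is a common 2048 heuristic):

````python
fake_output = '''Here is my strategy:
```python
def strategy(board):
    # Always prefer DOWN
    return 2
```'''

action, ok = extract_and_execute_strategy(fake_output, sample_board)
print(f"Action: {action}, success: {ok}")  # -> Action: 2, success: True
````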
## Part 6: GRPO Training

GRPO (Group Relative Policy Optimization) is a policy-gradient method designed for language models: for each prompt it samples a group of responses, scores them with a reward function, and uses each response's reward relative to the rest of its group as the advantage, so no separate value model is needed.
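
The core idea fits in a few lines. This is an illustrative sketch of the group-relative advantage computation only; TRL handles it internally:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize rewards within a group: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled strategies for the same board, scored by calculate_reward:
rewards = torch.tensor([0.9, 0.5, -0.5, 0.5])
print(group_relative_advantages(rewards))
# Positive entries are reinforced; negative entries are pushed down.
```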
### Training Configuration

```python
from trl import GRPOConfig, GRPOTrainer

grpo_config = GRPOConfig(
    # Learning rate
    learning_rate=2e-6,

    # Batch sizes
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,

    # Training duration
    max_steps=200,

    # Memory optimization
    bf16=True,
    gradient_checkpointing=True,

    # Logging
    logging_steps=1,
    output_dir="./2048_grpo_output",
    report_to="none",
)
```
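
At the time of writing, `GRPOConfig` also exposes `num_generations`, the group size that the advantage normalization above operates over; if you set it, keep it a divisor of the effective batch size (the value below is a guess for this setup, not a tuned choice):

```python
# Optional: make the group size explicit. num_generations should divide the
# effective batch size (per_device_train_batch_size * number of processes).
grpo_config.num_generations = 4  # 4 sampled strategies per prompt
```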
### Training Loop

```python
def train_2048_agent(
    model,
    tokenizer,
    env,
    config: GRPOConfig,
    num_episodes: int = 100,
):
    """
    Collect 2048 rollouts for GRPO training.
    """
    # Prepare the model for training
    FastLanguageModel.for_training(model)

    training_data = []

    for episode in range(num_episodes):
        # Reset the environment
        result = env.reset()
        board = info_state_to_board(result.observation.info_state)

        episode_reward = 0  # accumulated env reward, kept for logging
        steps = 0

        while not result.done and steps < 1000:
            # Generate a strategy
            prompt = create_prompt(board)
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                temperature=0.7,
                do_sample=True,
            )
            generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Execute the strategy
            action, success = extract_and_execute_strategy(generated, board)

            # Take the action in the environment
            env_action = OpenSpielAction(action_id=action, game_name="2048")
            result = env.step(env_action)

            # Update the board
            board = info_state_to_board(result.observation.info_state)
            episode_reward += result.reward if result.reward else 0
            steps += 1

        # Calculate the final reward
        max_tile = get_max_tile(board)
        final_reward = calculate_reward(max_tile, max_tile >= 2048)

        # Store the episode's last prompt/response pair with its outcome
        training_data.append({
            "prompt": prompt,
            "response": generated,
            "reward": final_reward,
        })

        if episode % 10 == 0:
            print(f"Episode {episode}: Max tile={max_tile}, Reward={final_reward:.3f}")

    return training_data
```
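
The loop above collects rollouts but never updates the weights. One way to close the loop is to hand the collected prompts to `GRPOTrainer`; this is a sketch assuming TRL's standard prompt-dataset plus reward-function interface, and `executable_reward` is an illustrative stand-in (a fuller setup would roll each strategy out in the environment and reuse `calculate_reward`):

```python
from datasets import Dataset

# Hypothetical reward function: score each completion by whether it contains
# a runnable strategy.
def executable_reward(prompts, completions, **kwargs):
    scores = []
    for completion in completions:
        _, ok = extract_and_execute_strategy(completion, sample_board)
        scores.append(0.5 if ok else -0.5)
    return scores

# Build a prompt dataset from the rollouts collected above
rollouts = train_2048_agent(model, tokenizer, env, grpo_config, num_episodes=20)
train_dataset = Dataset.from_dict({"prompt": [r["prompt"] for r in rollouts]})

trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    train_dataset=train_dataset,
    reward_funcs=executable_reward,
    processing_class=tokenizer,
)
trainer.train()
```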
## Part 7: Deployment

After training, save and deploy your model.

### Saving the Model

```python
# Save the LoRA adapters only
model.save_pretrained("./2048_strategy_model")
tokenizer.save_pretrained("./2048_strategy_model")

# Save a merged model for inference
model.save_pretrained_merged(
    "./2048_strategy_model_merged",
    tokenizer,
    save_method="merged_16bit",
)
```
### Push to Hugging Face Hub

```python
# Push the merged model to the Hub
model.push_to_hub_merged(
    "your-username/2048-strategy-model",
    tokenizer,
    save_method="merged_16bit",
    private=False,
)

print("Model deployed to: huggingface.co/your-username/2048-strategy-model")
```
### Using the Trained Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the trained model
model = AutoModelForCausalLM.from_pretrained("your-username/2048-strategy-model")
tokenizer = AutoTokenizer.from_pretrained("your-username/2048-strategy-model")

# Generate a strategy
def get_action(board: List[List[int]]) -> int:
    prompt = create_prompt(board)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256)
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    action, _ = extract_and_execute_strategy(generated, board)
    return action

# Play a game
with OpenSpielEnv.from_hub("openenv/openspiel-env") as env:
    result = env.reset()
    board = info_state_to_board(result.observation.info_state)

    while not result.done:
        action = get_action(board)
        result = env.step(OpenSpielAction(action_id=action, game_name="2048"))
        board = info_state_to_board(result.observation.info_state)

    print(f"Final max tile: {get_max_tile(board)}")
```
## Preventing Reward Hacking

Be aware of potential reward-hacking strategies; the `safe_execute` sketch below addresses each one:

1. **Code that modifies rewards** - Run in a sandboxed environment
2. **Infinite loops** - Set execution timeouts
3. **Memory exhaustion** - Limit resource usage

```python
import resource
import signal

def safe_execute(code: str, board: List[List[int]], timeout: float = 5.0) -> int:
    """Execute a strategy with safety limits (Unix only: relies on SIGALRM)."""
    def handler(signum, frame):
        raise TimeoutError("Strategy timed out")

    # Set a timeout
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(int(timeout))

    try:
        # Set a memory limit (100 MB soft limit)
        resource.setrlimit(resource.RLIMIT_AS, (100 * 1024 * 1024, resource.RLIM_INFINITY))

        # Execute in a restricted namespace with minimal builtins
        namespace = {"board": board, "__builtins__": {"len": len, "max": max, "min": min}}
        exec(code, namespace)
        return namespace.get("strategy", lambda b: 0)(board)
    finally:
        signal.alarm(0)
```
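
Usage mirrors `extract_and_execute_strategy`. Note that the memory limit is process-wide and `SIGALRM` is Unix-only, so a production setup would run this in a subprocess; the snippet below is purely illustrative:

```python
risky_code = """
def strategy(board):
    while True:  # infinite loop: cut off by the alarm
        pass
"""

try:
    action = safe_execute(risky_code, sample_board, timeout=2.0)
except TimeoutError as e:
    print(f"Caught: {e}")  # -> Caught: Strategy timed out
```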
## Summary

In this tutorial, you learned:

1. **Model Setup**: Loading LLMs with Unsloth and LoRA
2. **Environment Connection**: Using OpenEnv's 2048 environment
3. **Reward Design**: Creating balanced reward functions
4. **GRPO Training**: Training with reinforcement learning
5. **Deployment**: Saving and sharing trained models
## Next Steps

- Try different model architectures
- Experiment with reward function designs
- Train on other OpenEnv environments
- Share your trained models on Hugging Face Hub!
## Related Resources

- [OpenEnv Getting Started](../auto_getting_started/index)
- [Building Custom Environments](../auto_getting_started/plot_03_building_environments)
- [GRPO Documentation](https://huggingface.co/docs/trl/grpo_trainer)
- [Unsloth Documentation](https://github.com/unslothai/unsloth)