# PPO Agent for LunarLander-v3
Trained with Stable-Baselines3 PPO on LunarLander-v3. Hyperparameters optimized via an Optuna TPE sweep of 20 trials. Part of a 12-project Deep RL portfolio.
## Performance
| Version | Score | Notes |
|---|---|---|
| Baseline | 210.15 | Default SB3 params |
| Optuna v1 | 259.15 | 20-trial HPO sweep |
| Improvement | +49.00 | Lower clip_range, more stable |
Target: ≥ 200, cleared by 59.1 points.
## Reproduce
```python
from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
import gymnasium as gym

# Load model
checkpoint = load_from_hub(
    repo_id="muhrivandysetiawan/ppo-LunarLander-v3",
    filename="ppo-LunarLander-v3.zip",
)
model = PPO.load(checkpoint)

# Run one episode
env = gym.make("LunarLander-v3", render_mode="human")
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()
```
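To check a score against the numbers reported above, SB3's `evaluate_policy` helper averages the policy's return over several episodes. The sketch below reuses the same checkpoint; the episode count of 10 is an illustrative choice, not necessarily the card's evaluation protocol.

```python
# Hedged evaluation sketch: average the deterministic policy with SB3's helper.
# n_eval_episodes=10 is illustrative; the card's own protocol may differ.
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from huggingface_sb3 import load_from_hub

checkpoint = load_from_hub(
    repo_id="muhrivandysetiawan/ppo-LunarLander-v3",
    filename="ppo-LunarLander-v3.zip",
)
model = PPO.load(checkpoint)

eval_env = gym.make("LunarLander-v3")
mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
eval_env.close()
```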
## Architecture
OOP class-based pipeline: 6 classes, 3 entry points. Zero hardcoded hyperparameters; config-driven throughout.

- `LunarLanderEnv`: env wrapping + seed control
- `PPOAgent`: model lifecycle
- `Trainer`: training loop
- `HyperparamSweeper`: Optuna HPO
- `Evaluator`: separate eval pipeline
- `WandBLogger`: centralized tracking
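The project's source is not reproduced on this card, so the following is only a minimal, hypothetical skeleton of how a config-driven pipeline with these class names could be wired. The `Config` fields mirror the baseline values in the table below; everything else (method names, config schema) is an assumption for illustration.

```python
# Hypothetical skeleton: the class names come from the card, but their contents,
# the Config schema, and the method names are assumptions for illustration.
from dataclasses import dataclass

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env


@dataclass
class Config:
    env_id: str = "LunarLander-v3"
    n_envs: int = 16
    seed: int = 42
    total_timesteps: int = 1_000_000
    learning_rate: float = 3e-4
    n_steps: int = 1024
    batch_size: int = 64
    n_epochs: int = 4
    clip_range: float = 0.20


class LunarLanderEnv:
    """Env wrapping + seed control."""

    def __init__(self, cfg: Config):
        self.vec_env = make_vec_env(cfg.env_id, n_envs=cfg.n_envs, seed=cfg.seed)


class PPOAgent:
    """Model lifecycle: build, save, load."""

    def __init__(self, cfg: Config, env):
        self.model = PPO(
            "MlpPolicy",
            env,
            learning_rate=cfg.learning_rate,
            n_steps=cfg.n_steps,
            batch_size=cfg.batch_size,
            n_epochs=cfg.n_epochs,
            clip_range=cfg.clip_range,
            seed=cfg.seed,
        )


class Trainer:
    """Training loop: one of the entry points."""

    def __init__(self, cfg: Config):
        self.cfg = cfg
        self.env = LunarLanderEnv(cfg)
        self.agent = PPOAgent(cfg, self.env.vec_env)

    def run(self) -> None:
        self.agent.model.learn(total_timesteps=self.cfg.total_timesteps)
        self.agent.model.save("ppo-LunarLander-v3")


if __name__ == "__main__":
    Trainer(Config()).run()
```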
## Training Details
| Parameter | Baseline | Optuna Best |
|---|---|---|
| learning_rate | 3e-4 | optimized |
| n_steps | 1024 | optimized |
| clip_range | 0.20 | 0.161 |
| batch_size | 64 | optimized |
| n_epochs | 4 | optimized |
| total_timesteps | 1M | 1M |
| seed | 42 | 42 |
| n_envs | 16 | 16 |
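The card marks several parameters as "optimized" without publishing the search space. The sketch below shows what a 20-trial Optuna TPE sweep over those parameters could look like; the search bounds and the shortened per-trial budget are assumptions, not the card's actual sweep configuration.

```python
# Hypothetical Optuna TPE sweep over the parameters marked "optimized" above.
# Bounds and the per-trial timestep budget are illustrative assumptions.
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy


def objective(trial: optuna.Trial) -> float:
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "n_steps": trial.suggest_categorical("n_steps", [512, 1024, 2048]),
        "clip_range": trial.suggest_float("clip_range", 0.1, 0.3),
        "batch_size": trial.suggest_categorical("batch_size", [64, 128, 256]),
        "n_epochs": trial.suggest_int("n_epochs", 3, 10),
    }
    env = make_vec_env("LunarLander-v3", n_envs=16, seed=42)
    model = PPO("MlpPolicy", env, seed=42, **params)
    model.learn(total_timesteps=100_000)  # shortened budget for the sketch
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=10)
    return mean_reward


study = optuna.create_study(
    direction="maximize", sampler=optuna.samplers.TPESampler(seed=42)
)
study.optimize(objective, n_trials=20)
print(study.best_params)
```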
## Experiment Tracking
W&B Dashboard: https://wandb.ai/muhrivandysetiawan-muh-rivandy-setiawan/deep-rl-portfolio
22 tracked runs: baseline, 20 Optuna trials, and the final run.
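The `WandBLogger` class itself is not shown on the card. As a minimal sketch, a run on that dashboard could be registered and logged with the standard `wandb` client as below; the config fields and logged keys are assumptions, only the project name comes from the dashboard URL.

```python
# Minimal tracking sketch with the standard wandb client.
# Config fields and logged keys are assumptions; project name is from the URL.
import wandb

run = wandb.init(
    project="deep-rl-portfolio",
    config={"algo": "PPO", "env_id": "LunarLander-v3", "clip_range": 0.161},
)
# During or after training, metrics would be logged per step or per evaluation:
wandb.log({"eval/mean_reward": 259.15})
run.finish()
```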
## Portfolio
Project 1/12 in the HuggingFace Deep RL Curriculum. Author: muhrivandysetiawan
## Evaluation Results
- mean_reward on LunarLander-v3 (self-reported): 285.10 +/- 23.36