# A2C Agent on PandaReachDense-v3
This repository contains a trained Advantage Actor-Critic (A2C) agent that successfully plays the PandaReachDense-v3 environment using the Stable-Baselines3 library.
## Model Card
- Model Name: a2c-PandaReachDense-v3
- Environment: PandaReachDense-v3
- Algorithm: A2C (Advantage Actor-Critic)
- Mean Reward: 2.50, demonstrating convergence toward stable reaching behavior
## Usage (with Stable-Baselines3)
```python
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub
import gymnasium as gym
import panda_gym  # noqa: F401 -- registers the Panda environments with gymnasium

# Download the trained checkpoint from the Hugging Face Hub
# (load_from_hub returns the local path to the downloaded file)
checkpoint = load_from_hub(
    repo_id="KraTUZen/a2c-PandaReachDense-v3",
    filename="a2c.pkl",
)

# Load the trained A2C model
model = A2C.load(checkpoint)

# Initialize the environment
env = gym.make("PandaReachDense-v3")
```
## Notes
- The agent is trained using A2C, a synchronous Actor-Critic method that reduces variance compared to vanilla policy gradient.
- The environment is PandaReachDense-v3, where the agent must control a robotic arm to reach a target position.
- The serialized policy is stored in `a2c.pkl`.
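The variance reduction mentioned above comes from the critic's baseline: the actor is updated with an *advantage* rather than a raw return. A minimal, framework-free sketch of the one-step TD advantage (the function name and all numbers are invented for illustration, not taken from this agent):

```python
def advantage(reward: float, value_s: float, value_next: float,
              gamma: float = 0.99, done: bool = False) -> float:
    """One-step TD advantage A(s, a) = r + gamma * V(s') - V(s):
    how much better the taken action was than the critic's baseline V(s)."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

# Illustrative values: a dense (distance-based) reward of -0.3, with the
# critic estimating V(s) = -1.0 and V(s') = -0.5
adv = advantage(reward=-0.3, value_s=-1.0, value_next=-0.5)
# adv = -0.3 + 0.99 * (-0.5) - (-1.0) = 0.205
```

A positive advantage increases the probability of the taken action; a negative one decreases it.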
## Repository Structure
- `a2c.pkl`: trained policy weights
- `README.md`: documentation and usage guide
## Results
- The agent learns to move the Panda robotic arm toward target positions.
- Demonstrates stable convergence using A2C, though performance metrics show room for further optimization.
## Environment Overview
- Observation Space: Continuous (robot joint positions, target coordinates, gripper state)
- Action Space: Continuous (joint torques, gripper control)
- Objective: Reach the target position efficiently
- Reward: Dense reward shaping to guide the agent toward the target
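The dense reward shaping above can be sketched as the negative Euclidean distance between the end-effector and the target, which is the general shape of panda-gym's dense reach reward; the function name and coordinates below are invented for illustration:

```python
import math

def dense_reward(ee_pos, target_pos):
    """Illustrative dense reaching reward: negative distance to the target,
    so the reward approaches 0 as the arm gets closer."""
    return -math.dist(ee_pos, target_pos)

# The closer the end-effector, the less negative the reward
far = dense_reward((0.0, 0.0, 0.0), (0.3, 0.4, 0.0))    # -0.5
near = dense_reward((0.29, 0.39, 0.0), (0.3, 0.4, 0.0))  # close to 0
```

Because every step produces an informative gradient toward the target, dense shaping makes the reach task much easier to learn than a sparse success/failure signal.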
## Learning Highlights
- Algorithm: A2C (Advantage Actor-Critic)
- Update Rule: Actor updates policy, Critic estimates value function to reduce variance
- Strengths: More sample-efficient than vanilla policy gradient, stable learning
- Limitations: Sensitive to hyperparameter tuning (learning rate, entropy coefficient)
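The update rule and the entropy-coefficient sensitivity listed above can be summarized in the scalar loss that A2C minimizes per transition. A minimal, framework-free sketch (all inputs are invented; the coefficient defaults mirror common Stable-Baselines3 A2C settings, `vf_coef=0.5` and `ent_coef=0.0`, but are assumptions here):

```python
import math

def a2c_loss(log_prob, advantage, value, ret, entropy,
             vf_coef=0.5, ent_coef=0.0):
    """Illustrative per-transition A2C loss:
    -log_prob * advantage       -> actor (policy gradient) term
    + vf_coef * (ret - value)^2 -> critic (value regression) term
    - ent_coef * entropy        -> exploration bonus (entropy)"""
    policy_loss = -log_prob * advantage
    value_loss = (ret - value) ** 2
    return policy_loss + vf_coef * value_loss - ent_coef * entropy

# Invented example values for a single transition
loss = a2c_loss(log_prob=math.log(0.5), advantage=0.2,
                value=-1.0, ret=-0.8, entropy=1.0)
```

A higher `ent_coef` keeps the policy more stochastic (aiding exploration), which is one reason the algorithm is sensitive to this hyperparameter.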
## Evaluation Results
- mean_reward on PandaReachDense-v3 (self-reported): 2.500