# A2C Agent on PandaReachDense-v3
This repository contains a trained Advantage Actor-Critic (A2C) agent that successfully plays the PandaReachDense-v3 environment using the Stable-Baselines3 library.
## Model Card
- Model Name: a2c-PandaReachDense-v3
- Environment: PandaReachDense-v3
- Algorithm: A2C (Advantage Actor-Critic)
- Mean Reward: 2.50, demonstrating convergence toward stable reaching behavior
## Usage (with Stable-Baselines3)
```python
from stable_baselines3 import A2C
from huggingface_sb3 import load_from_hub
import gymnasium as gym
import panda_gym  # noqa: F401 -- registers the Panda environments with gymnasium

# Download the trained checkpoint from the Hugging Face Hub
# (load_from_hub returns the local path to the downloaded file)
checkpoint = load_from_hub(
    repo_id="KraTUZen/a2c-PandaReachDense-v3",
    filename="a2c.pkl",
)

# Load the trained A2C model
model = A2C.load(checkpoint)

# Initialize the environment
env = gym.make("PandaReachDense-v3")
```
## Notes
- The agent is trained using A2C, a synchronous Actor-Critic method that reduces variance compared to vanilla policy gradient.
- The environment is PandaReachDense-v3, where the agent must control a robotic arm to reach a target position.
- The serialized policy is stored in `a2c.pkl`.
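The variance reduction mentioned above comes from the critic's baseline: the actor is updated with an *advantage* rather than a raw return. A minimal, framework-free sketch of the one-step TD advantage (the function name and all numbers are invented for illustration, not taken from this agent):

```python
def advantage(reward: float, value_s: float, value_next: float,
              gamma: float = 0.99, done: bool = False) -> float:
    """One-step TD advantage A(s, a) = r + gamma * V(s') - V(s):
    how much better the taken action was than the critic's baseline V(s)."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

# Illustrative values: a dense (distance-based) reward of -0.3, with the
# critic estimating V(s) = -1.0 and V(s') = -0.5
adv = advantage(reward=-0.3, value_s=-1.0, value_next=-0.5)
# adv = -0.3 + 0.99 * (-0.5) - (-1.0) = 0.205
```

A positive advantage increases the probability of the taken action; a negative one decreases it.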
## Repository Structure
- `a2c.pkl`: trained policy weights
- `README.md`: documentation and usage guide
## Results
- The agent learns to move the Panda robotic arm toward target positions.
- Demonstrates stable convergence using A2C, though performance metrics show room for further optimization.
## Environment Overview
- Observation Space: Continuous (robot joint positions, target coordinates, gripper state)
- Action Space: Continuous (joint torques, gripper control)
- Objective: Reach the target position efficiently
- Reward: Dense reward shaping to guide the agent toward the target
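The dense reward shaping above can be sketched as the negative Euclidean distance between the end-effector and the target, which is the general shape of panda-gym's dense reach reward; the function name and coordinates below are invented for illustration:

```python
import math

def dense_reward(ee_pos, target_pos):
    """Illustrative dense reaching reward: negative distance to the target,
    so the reward approaches 0 as the arm gets closer."""
    return -math.dist(ee_pos, target_pos)

# The closer the end-effector, the less negative the reward
far = dense_reward((0.0, 0.0, 0.0), (0.3, 0.4, 0.0))    # -0.5
near = dense_reward((0.29, 0.39, 0.0), (0.3, 0.4, 0.0))  # close to 0
```

Because every step produces an informative gradient toward the target, dense shaping makes the reach task much easier to learn than a sparse success/failure signal.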
## Learning Highlights
- Algorithm: A2C (Advantage Actor-Critic)
- Update Rule: Actor updates policy, Critic estimates value function to reduce variance
- Strengths: More sample-efficient than vanilla policy gradient, stable learning
- Limitations: Sensitive to hyperparameter tuning (learning rate, entropy coefficient)
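The update rule and the entropy-coefficient sensitivity listed above can be summarized in the scalar loss that A2C minimizes per transition. A minimal, framework-free sketch (all inputs are invented; the coefficient defaults mirror common Stable-Baselines3 A2C settings, `vf_coef=0.5` and `ent_coef=0.0`, but are assumptions here):

```python
import math

def a2c_loss(log_prob, advantage, value, ret, entropy,
             vf_coef=0.5, ent_coef=0.0):
    """Illustrative per-transition A2C loss:
    -log_prob * advantage       -> actor (policy gradient) term
    + vf_coef * (ret - value)^2 -> critic (value regression) term
    - ent_coef * entropy        -> exploration bonus (entropy)"""
    policy_loss = -log_prob * advantage
    value_loss = (ret - value) ** 2
    return policy_loss + vf_coef * value_loss - ent_coef * entropy

# Invented example values for a single transition
loss = a2c_loss(log_prob=math.log(0.5), advantage=0.2,
                value=-1.0, ret=-0.8, entropy=1.0)
```

A higher `ent_coef` keeps the policy more stochastic (aiding exploration), which is one reason the algorithm is sensitive to this hyperparameter.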
## Evaluation Results
- mean_reward on PandaReachDense-v3 (self-reported): 2.500