# A2C Agent on PandaPickAndPlace-v3
This repository contains a trained Advantage Actor-Critic (A2C) agent that solves the PandaPickAndPlace-v3 task.
## Model Card
- **Model name:** a2c-PandaPickAndPlace-v3
- **Environment:** PandaPickAndPlace-v3
- **Algorithm:** A2C (Advantage Actor-Critic)
- **Performance:**
  - Learns stable pick-and-place behavior across evaluation runs
  - Demonstrates convergence to an effective policy
## Usage
```python
import pickle

import gymnasium as gym
import panda_gym  # noqa: F401 -- registers the Panda environments
from huggingface_hub import hf_hub_download

# Download the trained A2C checkpoint from the Hub
checkpoint_path = hf_hub_download(
    repo_id="KraTUZen/a2c-PandaPickAndPlace-v3",
    filename="a2c.pkl",
)
with open(checkpoint_path, "rb") as f:
    model = pickle.load(f)

# Initialize the environment
env = gym.make(model["env_id"])
```
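With the model and environment loaded, a rollout can be sketched as follows. This is a minimal sketch that assumes a gymnasium-style `step` API and a `predict` method on the unpickled policy; `run_episode` and `max_steps` are illustrative names, not part of this repository.

```python
def run_episode(env, model, max_steps=200):
    """Roll out one episode with a trained policy (gymnasium-style API).

    `model` is assumed to expose a `predict(obs)` method returning an action;
    adapt this call to however the unpickled policy is actually invoked.
    """
    obs, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = model.predict(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```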
## Notes
- The agent is trained with A2C, the synchronous variant of the A3C actor-critic algorithm.
- The environment is PandaPickAndPlace-v3, where the agent must control a robotic arm to pick up and place objects.
- The serialized policy is stored in `a2c.pkl`.
## Repository Structure
| File | Description |
| --- | --- |
| `a2c.pkl` | Trained policy weights |
| `README.md` | Documentation and usage guide |
## Results
- The agent learns to control the Panda robotic arm effectively.
- Demonstrates stable convergence using A2C with reduced variance compared to vanilla policy gradient methods.
## Environment Overview
- **Observation space:** Continuous (robot joint positions, object positions, gripper state)
- **Action space:** Continuous (joint torques, gripper control)
- **Objective:** Pick up the target object and place it at the designated location
- **Reward:** Positive reward for successful placement, penalties for failed attempts
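Goal-conditioned environments like this one typically expose the observation as a dict of robot state, current object position, and target position, with success judged by distance to the goal. The sketch below illustrates that convention; the dict keys, the field sizes, and the 5 cm threshold are assumptions based on common panda-gym defaults, not values taken from this repository.

```python
import numpy as np

# Hypothetical goal-conditioned observation, in the dict layout panda-gym
# environments typically use (field sizes here are illustrative only)
obs = {
    "observation": np.zeros(19),                    # robot and object state
    "achieved_goal": np.array([0.10, 0.00, 0.20]),  # current object position
    "desired_goal": np.array([0.10, 0.00, 0.25]),   # target placement position
}

def is_success(achieved, desired, threshold=0.05):
    """Placement succeeds when the object lies within `threshold` of the goal."""
    return float(np.linalg.norm(achieved - desired)) < threshold
```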
## Learning Highlights
- **Algorithm:** A2C (Advantage Actor-Critic)
- **Update rule:** Actor updates the policy; Critic estimates the value function to reduce variance
- **Strengths:** More sample-efficient than vanilla policy gradient, stable learning
- **Limitations:** Requires careful tuning of the learning rate and entropy coefficient
## Evaluation Results

- **mean_reward** on PandaPickAndPlace-v3: 2.5 +/- 0.00 (self-reported)