SmolVLA Fine-tuned on SO-101 (Stratified Split)

87.66% success rate on SO-101 pick-and-place task through data-centric approach.

Model Description

This is a fine-tuned version of SmolVLA trained on the SO-101 pick-and-place dataset.

Key Achievement: Improved from 60.92% to 87.66% success rate (+44%) by implementing position-aware stratified data splitting instead of hyperparameter tuning.

Model Details

  • Model type: Vision-Language-Action (VLA) policy
  • Base model: lerobot/smolvla_base
  • Training data: 40 episodes (stratified across 5 cube positions)
  • Validation performance: 87.66% success rate (within 5% tolerance per joint)
  • Training/validation gap: 1.7x (healthy generalization)

Intended Use

This model is designed for:

  • Primary use: SO-101 robotic arm pick-and-place tasks
  • Research: Studying data-centric approaches in robot learning
  • Education: Understanding the impact of proper data splitting

Out-of-scope: This model is fine-tuned specifically for SO-101 pick-and-place. Performance on other tasks or robots is not guaranteed.

Performance

Success Rate (Within 5% of Joint Range)

Joint Success Rate
shoulder_pan 98.13%
shoulder_lift 82.91%
elbow_flex 83.86%
wrist_flex 80.38%
wrist_roll 95.69%
gripper 84.99%
Average 87.66%

Comparison to Initial Approach

Metric Sequential Split Stratified Split Improvement
Success Rate 60.92% 87.66% +44%
Train/Val Gap 5.0x 1.7x -66%
shoulder_pan 45.81% 98.13% +114%
wrist_roll 60.86% 95.69% +57%

Training Details

Training Data

  • Dataset: lerobot/svla_so101_pickplace
  • Split strategy: Position-aware stratified sampling
    • 40 training episodes (8 per cube position)
    • 10 validation episodes (2 per cube position)
  • Total episodes: 50 (5 positions × 10 episodes each)

Training Procedure

Key Innovation: Stratified splitting by cube position instead of sequential episode splitting.

# Stratified split ensures all 5 positions in train AND val
for position in range(5):
    position_episodes = episodes[position*10:(position+1)*10]
    train: 8 episodes per position (80%)
    val: 2 episodes per position (20%)

Training hyperparameters:

  • Base model: lerobot/smolvla_base (pretrained)
  • Training steps: 15,000
  • Batch size: 24
  • Weight decay: 0.001
  • Best checkpoint: Step 6000 (validation loss: 0.0360)

Data augmentation:

  • Color jitter (brightness, contrast, saturation, hue)
  • Training: Standard variance (±20%)
  • Validation: Shifted distribution (darker/higher contrast) for robustness testing

How to Use

Installation

pip install lerobot

Inference

from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
import torch

# Load model
policy = SmolVLAPolicy.from_pretrained("your-username/smolvla-so101-stratified")
policy.eval()
policy.to("cuda")

# Load dataset for testing
dataset = LeRobotDataset("lerobot/svla_so101_pickplace")

# Get observation
obs = dataset[0]

# Prepare input
batch = {
    'observation.images.camera1': obs['observation.images.up'].unsqueeze(0),
    'observation.images.camera2': obs['observation.images.side'].unsqueeze(0),
    'observation.state': obs['observation.state'].unsqueeze(0),
}

# Predict actions
with torch.no_grad():
    actions = policy.predict_action_chunk(batch)  # (1, 50, 6)
    next_action = actions[0, 0, :]  # First action in sequence

Limitations and Bias

Limitations

  1. Task-specific: Trained only on SO-101 pick-and-place. May not generalize to other manipulation tasks.
  2. Single robot: Fine-tuned for SO-101 6-DOF arm. Performance on other robot embodiments not guaranteed.
  3. Limited scenarios: 5 cube positions in training. Novel positions far from training distribution may have lower success.
  4. Simulation-trained: Trained on recorded demonstrations. Real-world deployment may require additional adaptation.

Bias Considerations

  • Position bias: Model has most experience with positions 0-3 (8 examples each), slightly less with position 4.
  • Lighting conditions: Trained with specific augmentation ranges. Extreme lighting outside this range may degrade performance.

Key Lesson: Data Quality Over Model Tuning

This model demonstrates that proper data handling can have larger impact than hyperparameter optimization.

What didn't work:

  • Increasing weight decay (0.001 to 0.2): ~5% improvement
  • Adjusting augmentation: minimal impact
  • Training longer: made validation worse

What worked:

  • Stratified data splitting: 44% improvement

The initial 5x train/val gap wasn't overfitting - it was unfair evaluation on underrepresented data (validation set was only position 4, which had few training examples).

More Information

Downloads last month
192
Video Preview
loading

Model tree for Sa74ll/smolvla_so101_pickandplace

Finetuned
(4994)
this model

Dataset used to train Sa74ll/smolvla_so101_pickandplace

Paper for Sa74ll/smolvla_so101_pickandplace

Evaluation results

  • Per-Joint Success Rate (5% tolerance) on SO-101 Pick & Place
    self-reported
    87.660