# SmolVLA Fine-tuned on SO-101 (Stratified Split)

87.66% success rate on the SO-101 pick-and-place task through a data-centric approach.

## Model Description

This is a fine-tuned version of SmolVLA trained on the SO-101 pick-and-place dataset.

**Key Achievement:** Improved from a 60.92% to an 87.66% success rate (+44% relative) by implementing position-aware stratified data splitting instead of hyperparameter tuning.
## Model Details
- Model type: Vision-Language-Action (VLA) policy
- Base model: lerobot/smolvla_base
- Training data: 40 episodes (stratified across 5 cube positions)
- Validation performance: 87.66% success rate (within 5% tolerance per joint)
- Training/validation gap: 1.7x (healthy generalization)
## Intended Use
This model is designed for:
- Primary use: SO-101 robotic arm pick-and-place tasks
- Research: Studying data-centric approaches in robot learning
- Education: Understanding the impact of proper data splitting
**Out-of-scope:** This model is fine-tuned specifically for SO-101 pick-and-place. Performance on other tasks or robots is not guaranteed.
## Performance

### Success Rate (Within 5% of Joint Range)
| Joint | Success Rate |
|---|---|
| shoulder_pan | 98.13% |
| shoulder_lift | 82.91% |
| elbow_flex | 83.86% |
| wrist_flex | 80.38% |
| wrist_roll | 95.69% |
| gripper | 84.99% |
| Average | 87.66% |
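The per-joint numbers above follow a "within tolerance" metric: a prediction counts as a success for a joint if its absolute error is at most 5% of that joint's range of motion. A minimal sketch of how such a metric can be computed — the joint ranges and helper name here are illustrative assumptions, not the SO-101 calibration values or this repo's evaluation code:

```python
import numpy as np

# Joint order as listed in the table above
JOINTS = ["shoulder_pan", "shoulder_lift", "elbow_flex",
          "wrist_flex", "wrist_roll", "gripper"]

def per_joint_success(pred, gt, joint_ranges, tol=0.05):
    """pred, gt: (N, 6) arrays of joint targets; joint_ranges: (6,) span of each joint."""
    err = np.abs(pred - gt)             # (N, 6) absolute error per sample and joint
    within = err <= tol * joint_ranges  # boolean: error within 5% of joint range
    return within.mean(axis=0)          # success rate per joint, averaged over samples

# Toy example: perfect predictions succeed on every joint
ranges = np.array([3.5, 3.5, 3.5, 3.5, 6.0, 1.0])  # hypothetical joint spans (rad)
gt = np.zeros((2, 6))
rates = per_joint_success(gt.copy(), gt, ranges)
print(dict(zip(JOINTS, rates)))
```

The reported "Average" row is then simply the mean of the six per-joint rates.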
### Comparison to Initial Approach
| Metric | Sequential Split | Stratified Split | Improvement |
|---|---|---|---|
| Success Rate | 60.92% | 87.66% | +44% |
| Train/Val Gap | 5.0x | 1.7x | -66% |
| shoulder_pan | 45.81% | 98.13% | +114% |
| wrist_roll | 60.86% | 95.69% | +57% |
## Training Details

### Training Data
- Dataset: lerobot/svla_so101_pickplace
- Split strategy: Position-aware stratified sampling
- 40 training episodes (8 per cube position)
- 10 validation episodes (2 per cube position)
- Total episodes: 50 (5 positions × 10 episodes each)
### Training Procedure

**Key Innovation:** Stratified splitting by cube position instead of sequential episode splitting.

```python
# Stratified split ensures all 5 positions appear in train AND val
train_episodes, val_episodes = [], []
for position in range(5):
    position_episodes = episodes[position * 10:(position + 1) * 10]
    train_episodes += position_episodes[:8]  # 8 episodes per position (80%)
    val_episodes += position_episodes[8:]    # 2 episodes per position (20%)
```
Training hyperparameters:
- Base model: lerobot/smolvla_base (pretrained)
- Training steps: 15,000
- Batch size: 24
- Weight decay: 0.001
- Best checkpoint: Step 6000 (validation loss: 0.0360)
Data augmentation:
- Color jitter (brightness, contrast, saturation, hue)
- Training: Standard variance (±20%)
- Validation: Shifted distribution (darker/higher contrast) for robustness testing
## How to Use

### Installation

```bash
pip install lerobot
```
### Inference

```python
import torch

from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load the fine-tuned policy
policy = SmolVLAPolicy.from_pretrained("your-username/smolvla-so101-stratified")
policy.eval()
policy.to("cuda")

# Load the dataset for testing
dataset = LeRobotDataset("lerobot/svla_so101_pickplace")

# Get a single observation
obs = dataset[0]

# Prepare the input: add a batch dimension and move tensors to the same device as the policy
batch = {
    "observation.images.camera1": obs["observation.images.up"].unsqueeze(0).to("cuda"),
    "observation.images.camera2": obs["observation.images.side"].unsqueeze(0).to("cuda"),
    "observation.state": obs["observation.state"].unsqueeze(0).to("cuda"),
}

# Predict a chunk of actions
with torch.no_grad():
    actions = policy.predict_action_chunk(batch)  # (1, 50, 6)
next_action = actions[0, 0, :]  # first action in the sequence
```
## Limitations and Bias

### Limitations
- Task-specific: Trained only on SO-101 pick-and-place. May not generalize to other manipulation tasks.
- Single robot: Fine-tuned for SO-101 6-DOF arm. Performance on other robot embodiments not guaranteed.
- Limited scenarios: 5 cube positions in training. Novel positions far from training distribution may have lower success.
- Offline-trained: Trained on recorded demonstrations. Real-world deployment may require additional adaptation.
### Bias Considerations
- Position bias: Model has most experience with positions 0-3 (8 examples each), slightly less with position 4.
- Lighting conditions: Trained with specific augmentation ranges. Extreme lighting outside this range may degrade performance.
## Key Lesson: Data Quality Over Model Tuning

This model demonstrates that proper data handling can have a larger impact than hyperparameter optimization.
What didn't work:
- Increasing weight decay (0.001 to 0.2): ~5% improvement
- Adjusting augmentation: minimal impact
- Training longer: made validation worse
What worked:
- Stratified data splitting: 44% improvement
The initial 5x train/val gap was not overfitting; it was unfair evaluation on underrepresented data (the validation set contained only position 4, which had few training examples).
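Note that the percentages quoted here are relative changes, not absolute percentage-point differences; a quick check:

```python
# Sanity-check the relative improvements quoted in the tables above
seq, strat = 60.92, 87.66
rel = (strat - seq) / seq * 100
print(round(rel))   # relative success-rate improvement, quoted as +44%

gap = (1.7 - 5.0) / 5.0 * 100
print(round(gap))   # relative change in train/val gap, quoted as -66%
```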
## More Information
- Base Model: lerobot/smolvla_base
- Dataset: lerobot/svla_so101_pickplace
- Full Analysis: GitHub Repository
- SmolVLA Paper: arXiv:2506.01844