6Net — 6-Axis Visual Robot Policy (~115M)
Custom transformer policy for visual 6-DoF robot arm control. Trained from scratch (no LoRA).
| Component | Detail | Params |
|---|---|---|
| Visual Encoder | ResNet-18 fine-tuned | ~11.7M |
| Visual Projection | Linear(512→768) | ~0.4M |
| State Encoder | MLP(6→256→768) | ~0.2M |
| Transformer | 14L · d=768 · 12h · ffn=3072 | ~99.1M |
| Action Head | MLP(768→256→6) | ~0.2M |
| Total | ~111M |
Dataset: synthetic · Steps: 455 · Eff. batch: 32
Inference
import torch
from train_6net_local import SixNet, Config
import torchvision.transforms as T
from PIL import Image
model = SixNet(Config())
ckpt = torch.load("6net_final.pt", map_location="cpu")
model.load_state_dict(ckpt["model_state"])
model.eval()
tf = T.Compose([T.Resize((224,224)), T.ToTensor(),
T.Normalize([.485,.456,.406],[.229,.224,.225])])
img = tf(Image.open("cam.jpg")).unsqueeze(0) # (1,3,224,224)
jts = torch.zeros(1, 6) # current joint angles (rad)
with torch.no_grad():
action = model(img, jts) # (1,6) predicted targets