| --- |
| language: |
| - en |
| license: cc-by-nc-sa-4.0 |
| pipeline_tag: image-to-video |
| tags: |
| - motion-generation |
| - trajectory-prediction |
| - robotics |
| - computer-vision |
| - pytorch |
| - torch-hub |
| --- |
| |
| # ZipMo (Learning Long-term Motion Embeddings for Efficient Kinematics Generation) |
|
|
| [](https://compvis.github.io/long-term-motion) |
| [](https://arxiv.org/abs/2604.11737) |
| [](https://github.com/CompVis/long-term-motion) |
| [](https://compvis.github.io/long-term-motion) |
|
|
| ZipMo is a motion-space model for efficient long-horizon kinematics generation. It learns compact long-term motion embeddings from large-scale tracker-derived trajectories and generates plausible future motion directly in this learned motion space. The model supports spatial-poke conditioning for open-domain videos and task/text-embedding conditioning for LIBERO robotics evaluation. |
|
|
| ## Paper and Abstract |
|
|
| ZipMo was introduced in the CVPR 2026 paper **Learning Long-term Motion Embeddings for Efficient Kinematics Generation**. |
|
|
| Understanding and predicting motion is a fundamental component of visual intelligence. Although video models can synthesize scene dynamics, exploring many possible futures through full video generation is expensive. ZipMo instead operates directly on long-term motion embeddings learned from tracker trajectories, enabling efficient generation of long, realistic motions while preserving dense reconstruction at arbitrary spatial query points. |
|
|
|  |
| *ZipMo generates long-horizon motion in a compact learned motion space, supporting spatial-poke conditioning for open-domain videos and task-conditioned action prediction on LIBERO.* |
|
|
| ## Usage |
|
|
| For programmatic use, the simplest way to use ZipMo is via `torch.hub`: |
|
|
| ```python |
| import torch |
| |
| repo = "CompVis/long-term-motion" |
| |
| # Open-domain motion prediction |
| planner_sparse = torch.hub.load(repo, "zipmo_planner_sparse") |
| planner_dense = torch.hub.load(repo, "zipmo_planner_dense") |
| |
| # Motion autoencoder |
| vae = torch.hub.load(repo, "zipmo_vae") |
| ``` |
|
|
| LIBERO planning and policy components can be loaded in the same way: |
|
|
| ```python |
| import torch |
| |
| repo = "CompVis/long-term-motion" |
| |
| # LIBERO planners |
| libero_atm_planner = torch.hub.load(repo, "zipmo_planner_libero", "atm") |
| libero_tramoe_planner = torch.hub.load(repo, "zipmo_planner_libero", "tramoe") |
| |
| # LIBERO policy heads |
| policy_head_atm = torch.hub.load(repo, "zipmo_policy_head", "atm") |
| policy_head_tramoe_goal = torch.hub.load(repo, "zipmo_policy_head", "tramoe", "goal") |
| ``` |
|
|
| Available Torch Hub entries: |
|
|
| - `zipmo_planner_sparse`: sparse-poke planner for open-domain motion prediction. |
| - `zipmo_planner_dense`: dense-conditioning planner for open-domain motion prediction. |
| - `zipmo_vae`: long-term motion autoencoder. |
| - `zipmo_planner_libero`: LIBERO planner with mode `atm` or `tramoe`. |
| - `zipmo_policy_head`: LIBERO policy head with mode `atm` or `tramoe`. For `tramoe`, pass one of `10`, `goal`, `object`, or `spatial`. |
|
|
| For the interactive demo, standard track prediction evaluation, LIBERO rollout evaluation, and training instructions, see the [GitHub repository](https://github.com/CompVis/long-term-motion). |
|
|
| ## Citation |
|
|
| If you find our model or code useful, please cite our paper: |
|
|
| ```bibtex |
| @inproceedings{stracke2026motionembeddings, |
| title = {Learning Long-term Motion Embeddings for Efficient Kinematics Generation}, |
| author = {Stracke, Nick and Bauer, Kolja and Baumann, Stefan Andreas and Bautista, Miguel Angel and Susskind, Josh and Ommer, Bj{\"o}rn}, |
| booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, |
| year = {2026} |
| } |
| ``` |