Text-to-Motion Retrieval (TMR) Model
Description:
The Text-to-Motion Retrieval (TMR) model is a multimodal motion-and-language model that embeds text prompts and human motion clips into a shared latent space [Petrovich et al., 2023]. This is useful for retrieval tasks as well as for computing motion generation evaluation metrics such as R-precision and FID. This version of TMR is trained on the Bones Rigplay dataset with the SOMA skeleton, making it useful for evaluating the Kimodo Motion Diffusion Model and similar motion generation models. The model is integrated into the Kimodo Motion Generation Benchmark for computing metrics.
This model is ready for commercial use.
License:
This model is released under the NVIDIA Open Model License.
Deployment Geography:
Global
Use Case:
The main use case for TMR is evaluating human motion generation models through retrieval metrics like R-precision and through latent similarity metrics such as FID. The model can also be used to search large human motion databases (e.g., character animation libraries) through text-motion retrieval. These capabilities are useful in domains such as character animation and humanoid robotics.
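As a concrete illustration of the retrieval use case, the sketch below ranks a database of motion embeddings against a text-query embedding by cosine similarity in the shared latent space. The `rank_motions` helper and the random stand-in embeddings are hypothetical; the actual encoder-loading API is not described in this card.

```python
import torch
import torch.nn.functional as F

def rank_motions(text_emb: torch.Tensor, motion_embs: torch.Tensor) -> torch.Tensor:
    """Return motion indices sorted by cosine similarity to a text query.

    text_emb:    (256,)   TMR embedding of the text prompt
    motion_embs: (N, 256) TMR embeddings of N database motion clips
    """
    text_emb = F.normalize(text_emb, dim=-1)
    motion_embs = F.normalize(motion_embs, dim=-1)
    sims = motion_embs @ text_emb               # (N,) cosine similarities
    return torch.argsort(sims, descending=True)

# Toy usage with random stand-in embeddings (real ones come from the TMR encoders):
query_emb = torch.randn(256)
database_embs = torch.randn(1000, 256)
top10 = rank_motions(query_emb, database_embs)[:10]
```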
Release Date:
GitHub [04/10/2026] via link
HuggingFace [04/10/2026] via link
References:
- TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis, Petrovich et al., ICCV 2023 [link]
- Kimodo Project Webpage: link
Model Architecture:
Architecture Type: Dual Encoder
Network Architecture: Transformer
Model Size:
- Motion encoder: 4.8 M parameters
- Text encoder: 5.8 M parameters
Inputs:
Input Types: Text, Motion
Input Formats:
- Text: String
- Motion: Matrix of Joint Positions
Input Parameters:
- Text: One-Dimensional (1D)
- Motion: Three-Dimensional (num_frames x 30 x 3)
Other Properties Related to Input: Maximum motion duration is 10 sec (300 frames at 30 frames per second).
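As a minimal illustration of this input contract, the sketch below validates a motion clip against the documented shape and duration limits; the constants and the `validate_motion` helper are ours, not part of a released API.

```python
import torch

# Constants follow the Inputs section above: 30 joints x 3 coordinates per
# frame, at most 300 frames (10 s at 30 fps).
NUM_JOINTS, MAX_FRAMES = 30, 300

def validate_motion(motion: torch.Tensor) -> torch.Tensor:
    """Check that a clip matches the documented (num_frames, 30, 3) layout."""
    if motion.ndim != 3 or motion.shape[1:] != (NUM_JOINTS, 3):
        raise ValueError(f"expected (num_frames, {NUM_JOINTS}, 3), got {tuple(motion.shape)}")
    if motion.shape[0] > MAX_FRAMES:
        raise ValueError(f"clip exceeds the 10 s / {MAX_FRAMES}-frame limit")
    return motion

# A 4-second clip at 30 fps: 120 frames of 30 joint positions each.
clip = torch.zeros(120, NUM_JOINTS, 3)
validate_motion(clip)
```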
Output:
Output Type: Latent Embeddings
Output Formats: Vector
Output Parameters: 256-Dimensional
Other Properties Related to Output:
- One embedding for text and one for motion
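Because latent similarity metrics such as FID are computed directly on these 256-dimensional embeddings, a minimal latent-space FID sketch is shown below, assuming two sets of embeddings are already available; the exact protocol used by the Kimodo benchmark may differ.

```python
import numpy as np
from scipy import linalg

def latent_fid(real_embs: np.ndarray, gen_embs: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two (N, 256) embedding sets."""
    mu_r, mu_g = real_embs.mean(axis=0), gen_embs.mean(axis=0)
    cov_r = np.cov(real_embs, rowvar=False)
    cov_g = np.cov(gen_embs, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # sqrtm can introduce tiny imaginary noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```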
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engines:
- PyTorch
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Lovelace
Supported Operating Systems:
- Linux
- Windows
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version:
TMR-SOMA-RP-v1
Training Dataset:
Name: Proprietary Bones Rigplay Dataset
Data Modalities:
- Text
- Human Motion Capture
Data Size:
- Less than 1 Billion tokens of text
- 700 hours of human motion capture
Data Collection Method:
Automatic/Sensors
Labeling Method:
Hybrid: Automatic/Sensors, Human
Properties: 700 hours of captured human body motions on the SOMA skeleton with corresponding text descriptions. The model was trained on the full dataset (train and test splits) so that it is most useful for evaluating motion generation models that are trained only on the train split. Various augmentations were employed to expand text and motion variety.
Testing/Evaluation Dataset:
Name: Proprietary Bones Rigplay Dataset
Data Modalities:
- Text
- Human Motion Capture
Data Size:
- Less than 1 Billion tokens of text
- Roughly 10 hours of human motion capture
Data Collection Method:
Automatic/Sensors
Labeling Method:
Hybrid: Automatic/Sensors, Human
Properties: Our internal evaluation dataset contains around 5k motions sampled from the full training dataset described above. Each motion contains unique motion content, and therefore a unique text description, making the set well suited for evaluating TMR on the retrieval task.
Quantitative Evaluation:
The TMR model has also been evaluated on the smaller Kimodo Motion Generation Benchmark. Please refer to the "Ground Truth" results in the Kimodo benchmark documentation, which indicate the retrieval accuracy obtained when TMR is applied to ground-truth motions (rather than generated ones).
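For reference, R-precision can be computed from paired TMR embeddings as in the sketch below; the benchmark's exact protocol (e.g., the candidate-pool size per query) may differ from this simplified version.

```python
import torch
import torch.nn.functional as F

def r_precision(text_embs: torch.Tensor, motion_embs: torch.Tensor, k: int = 3) -> float:
    """Fraction of text queries whose paired motion ranks in the top-k.

    Row i of `text_embs` is assumed to describe row i of `motion_embs`.
    """
    t = F.normalize(text_embs, dim=-1)    # (N, 256)
    m = F.normalize(motion_embs, dim=-1)  # (N, 256)
    sims = t @ m.T                        # (N, N) text-to-motion similarities
    topk = sims.topk(k, dim=-1).indices   # (N, k) best-matching motion ids
    targets = torch.arange(len(t)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# Toy usage: near-identical pairs should score close to 1.0.
motions = torch.randn(100, 256)
texts = motions + 0.01 * torch.randn_like(motions)
print(r_precision(texts, motions, k=3))
```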
Inference:
Acceleration Engine: N/A
Test Hardware:
- GeForce RTX 3090
- GeForce RTX 4090
- GeForce RTX 5090
- NVIDIA A100
- NVIDIA L40S
- NVIDIA L4
- NVIDIA RTX 6000 Ada
- NVIDIA RTX A6000
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Bias, Explainability, Safety & Security, and Privacy Subcards below.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
Bias
| Field | Response |
|---|---|
| Participation considerations from adversely impacted groups protected classes in model design and testing: | Gender |
| Measures taken to mitigate against unwanted bias: | Our training data contains motion captured from a roughly equal number of male and female actors |
Explainability
| Field | Response |
|---|---|
| Intended Task/Domain: | Character Animation |
| Model Type: | Multimodal Encoder |
| Intended Users: | The model is intended for researchers and developers working on human motion generation to evaluate their models, and to use the retrieval capabilities of the model for motion datasets. |
| Output: | Latent embedding of text or human motion |
| Describe how the model works: | The model consists of dual encoders, one transformer for text and one for motion, each of which embeds its input into a 256-D vector. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Gender |
| Technical Limitations & Mitigation: | The model is trained specifically for certain types of actions (e.g., locomotion, gestures, combat, dancing, and everyday activities); as such, it may produce unreliable embeddings for actions outside of the training distribution. The model is also specific to the SOMA skeleton with a single set of body proportions. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Retrieval accuracy |
| Potential Known Risks: | The model is not always sensitive to small details in motion, which may make retrieval difficult for some combinations of text and motion. For example, the text "Waving right hand" may be encoded close to a motion containing waving with the left hand, making it difficult to retrieve motions with the correct handedness. |
| Licensing: | This model is released under the NVIDIA Open Model License |
Privacy
| Field | Response |
|---|---|
| Generatable or reverse engineerable personal data? | No |
| Personal data used to create this model? | No |
| How often is dataset reviewed? | During dataset creation, model training, evaluation and before release |
| Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? | No |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | Not Applicable |
| Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |
Safety
| Field | Response |
|---|---|
| Model Application Field(s): | Media & Entertainment, Industrial/Machinery and Robotics |
| Describe the life critical impact (if present). | Not Applicable |
| Use Case Restrictions: | Abide by the NVIDIA Open Model License |
| Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Access restrictions to the dataset were enforced during training, and dataset license constraints were adhered to. |