
Text-to-Motion Retrieval (TMR) Model

Description:

The Text-to-Motion Retrieval (TMR) model is a multimodal motion and language model that embeds text prompts and human motion clips into a shared latent space [Petrovich et al., 2023]. This is useful for retrieval tasks as well as for computing motion generation evaluation metrics such as R-precision and FID. This version of TMR is trained on the Bones Rigplay dataset with the SOMA skeleton, making it suitable for evaluating the Kimodo Motion Diffusion Model and similar motion generation models. The model is integrated into the Kimodo Motion Generation Benchmark for computing metrics.

This model is ready for commercial use.

License:

This model is released under the NVIDIA Open Model License.

Deployment Geography:

Global

Use Case:

The main use case for TMR is evaluating human motion generation models through retrieval metrics like R-precision and through latent similarity metrics such as FID. The model can also be used to easily search large human motion (e.g., character animation) databases through text-motion retrieval. These capabilities are useful in domains such as character animation and humanoid robotics.
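Text-motion retrieval over a shared latent space reduces to nearest-neighbor search by cosine similarity. The sketch below is illustrative only: the function names are hypothetical, and random vectors stand in for actual TMR embeddings (the encoder calls themselves are not shown).

```python
import numpy as np

def cosine_similarity_matrix(text_emb: np.ndarray, motion_emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of two embedding matrices."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    return t @ m.T

def retrieve_top_k(query_emb: np.ndarray, database_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k database motions most similar to a single text query."""
    sims = cosine_similarity_matrix(query_emb[None, :], database_emb)[0]
    return np.argsort(-sims)[:k]

# Stand-in embeddings: one text query against 100 motions, 256-D as in TMR.
rng = np.random.default_rng(0)
query = rng.normal(size=256)
database = rng.normal(size=(100, 256))
top5 = retrieve_top_k(query, database, k=5)
```

In practice the database embeddings would be precomputed once with the motion encoder, so each query costs only one text-encoder forward pass plus a matrix product.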

Release Date:

GitHub [04/10/2026] via link
Hugging Face [04/10/2026] via link

References:

  • TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis, Petrovich et al., ICCV 2023 [link]
  • Kimodo Project Webpage: link

Model Architecture:

Architecture Type: Dual Encoder
Network Architecture: Transformer
Model Size:

  • Motion encoder: 4.8 M parameters
  • Text encoder: 5.8 M parameters

Inputs:

Input Types: Text, Motion

Input Formats:

  • Text: String
  • Motion: Matrix of Joint Positions

Input Parameters:

  • Text: One-Dimensional (1D)
  • Motion: Three-Dimensional (num_frames x 30 x 3)

Other Properties Related to Input: Maximum motion duration is 10 sec (300 frames at 30 frames per second).
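A minimal sketch of validating a motion input against the shapes and limits stated above (30 joints, 3-D positions, at most 300 frames at 30 fps). The helper name is hypothetical and not part of the released code.

```python
import numpy as np

FPS = 30          # model card: 30 frames per second
MAX_FRAMES = 300  # 10 s maximum motion duration
NUM_JOINTS = 30   # joints per frame, each a 3-D position

def prepare_motion(motion: np.ndarray) -> np.ndarray:
    """Validate a motion clip of shape (num_frames, 30, 3) and truncate to 10 s."""
    if motion.ndim != 3 or motion.shape[1:] != (NUM_JOINTS, 3):
        raise ValueError(f"expected (num_frames, {NUM_JOINTS}, 3), got {motion.shape}")
    return motion[:MAX_FRAMES]

clip = np.zeros((450, 30, 3))   # 15 s of motion at 30 fps
trimmed = prepare_motion(clip)  # truncated to the 10 s maximum (300 frames)
```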

Output:

Output Type: Latent Embeddings

Output Formats: Vector

Output Parameters: 256-Dimensional

Other Properties Related to Output:

  • One embedding for text and one for motion

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engines:

  • PyTorch

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Lovelace

Supported Operating Systems:

  • Linux
  • Windows

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version:

TMR-SOMA-RP-v1

Training Dataset:

Name: Proprietary Bones Rigplay Dataset

Data Modalities:

  • Text
  • Human Motion Capture

Data Size:

  • Less than 1 Billion tokens of text
  • 700 hours of human motion capture

Data Collection Method:
Automatic/Sensors

Labeling Method:
Hybrid: Automatic/Sensors, Human

Properties: 700 hours of captured human body motions on the SOMA skeleton with corresponding text descriptions. Trained on the full dataset (train+test split) to be most useful for evaluating motion generation models that are trained only on the train split. Various augmentations were employed to expand text and motion variety.

Testing/Evaluation Dataset:

Name: Proprietary Bones Rigplay Dataset

Data Modalities:

  • Text
  • Human Motion Capture

Data Size:

  • Less than 1 Billion tokens of text
  • Roughly 10 hours of human motion capture

Data Collection Method:
Automatic/Sensors

Labeling Method:
Hybrid: Automatic/Sensors, Human

Properties: Our internal evaluation dataset contains around 5k motions sampled from the full training dataset described above. Each motion has unique motion content, and therefore a unique text description, making the set well-suited for evaluating TMR on the retrieval task.

Quantitative Evaluation:
The TMR model has also been evaluated on the smaller Kimodo Motion Generation Benchmark. Please refer to the "Ground Truth" results in the Kimodo benchmark documentation, which indicate retrieval accuracy when using TMR with ground-truth motions (rather than generated ones).
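As a rough illustration of how such retrieval accuracy is scored, the sketch below computes R-precision from paired text and motion embeddings: the fraction of motions whose matching text ranks in the top r by cosine similarity. This is a generic formulation, not the benchmark's exact protocol (which may use fixed candidate pools), and random vectors stand in for TMR embeddings.

```python
import numpy as np

def r_precision(text_emb: np.ndarray, motion_emb: np.ndarray, r: int = 3) -> float:
    """Fraction of motions whose paired text (same row index) ranks in the
    top-r most similar texts under cosine similarity."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    sims = m @ t.T                     # (N, N) motion-to-text similarity
    ranks = np.argsort(-sims, axis=1)  # texts sorted by similarity, per motion
    hits = (ranks[:, :r] == np.arange(len(sims))[:, None]).any(axis=1)
    return float(hits.mean())

# Identical embeddings give perfect retrieval (1.0) at any r.
emb = np.random.default_rng(0).normal(size=(32, 256))
score = r_precision(emb, emb, r=1)
```

Running the same function on embeddings of generated motions against their prompt embeddings, and comparing to the ground-truth score, is the usual way such metrics quantify how faithfully a generator follows its text conditioning.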

Inference:

Acceleration Engine: N/A

Test Hardware:

  • GeForce RTX 3090
  • GeForce RTX 4090
  • GeForce RTX 5090
  • NVIDIA A100
  • NVIDIA L40S
  • NVIDIA L4
  • NVIDIA RTX 6000 Ada
  • NVIDIA RTX A6000

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Bias, Explainability, Safety & Security, and Privacy Subcards below.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Bias

Participation considerations from adversely impacted groups (protected classes) in model design and testing: Gender
Measures taken to mitigate against unwanted bias: Our training data contains motion captured from a roughly equal number of male and female actors.

Explainability

Intended Task/Domain: Character Animation
Model Type: Multimodal Encoder
Intended Users: The model is intended for researchers and developers working on human motion generation to evaluate their models, and to use the retrieval capabilities of the model for motion datasets.
Output: Latent embedding of text or human motion
Describe how the model works: The model comprises dual encoders, one transformer for text and one for motion, each of which embeds its input into a 256-dimensional vector.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: Gender
Technical Limitations & Mitigation: The model is trained specifically for certain types of actions (e.g., locomotion, gestures, combat, dancing, and everyday activities) and as such, it may generate incorrect predictions for actions outside of the training distribution. The model is specific to the SOMA skeleton with a single set of body proportions.
Verified to have met prescribed NVIDIA quality standards: Yes
Performance Metrics: Retrieval accuracy
Potential Known Risks: The model is not always sensitive to small details in motion, which may make retrieval difficult for some combinations of text and motion. For example, the text "Waving right hand" may be encoded close to a motion containing waving with the left hand, making it difficult to retrieve motions with the correct handedness.
Licensing: This model is released under the NVIDIA Open Model License

Privacy

Generatable or reverse engineerable personal data? No
Personal data used to create this model? No
How often is dataset reviewed? During dataset creation, model training, evaluation and before release
Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? No
Is there provenance for all datasets used in training? Yes
Does data labeling (annotation, metadata) comply with privacy laws? Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made? Not Applicable
Applicable Privacy Policy https://www.nvidia.com/en-us/about-nvidia/privacy-policy/

Safety

Model Application Field(s): Media & Entertainment, Industrial/Machinery and Robotics
Describe the life critical impact (if present): Not Applicable
Use Case Restrictions: Abide by the NVIDIA Open Model License
Model and dataset restrictions: The principle of least privilege (PoLP) is applied to limit access for dataset generation and model development. Access restrictions on the dataset are enforced during training, and dataset license constraints are adhered to.
