khania committed on
Commit 873cf13 · verified · 1 Parent(s): 900f45e

MotionCLIP model update

Files changed (6)
  1. README.md +224 -0
  2. config.json +10 -0
  3. mean.npy +3 -0
  4. motion_clip_hf.py +327 -0
  5. pytorch_model.bin +3 -0
  6. std.npy +3 -0
README.md ADDED
@@ -0,0 +1,224 @@
---
license: cc-by-nc-4.0
tags:
- motion
- clip
- text-to-motion
- motion-retrieval
- multimodal
- human-motion
- motion-generation
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
datasets:
- MotionMillion
---

# MotionCLIP

A motion-text CLIP model trained on the MotionMillion dataset for motion-text retrieval, zero-shot motion classification, and motion understanding.

> ⚠️ **License Notice**: This model is released under **CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0). The training data includes datasets with mixed licensing terms, some of which restrict commercial use. **This model is for research and non-commercial use only.**

> 📋 **Body Model**: This model was trained on motion data using the **SMPL body model** (22 joints). Input motions must be in SMPL skeleton format.

## Model Description

MotionCLIP learns a joint embedding space between human motion sequences and natural-language descriptions. Given a motion sequence (272-dimensional features per frame) and text descriptions, the model can:

- **Retrieve** the most relevant text for a motion (and vice versa)
- **Classify** motions zero-shot using text labels
- **Compute similarity** between motions and text descriptions

## Usage

### Installation

```bash
pip install torch transformers huggingface_hub numpy
```

### Download the Model Code

Download `motion_clip_hf.py` from this repository or copy it into your project.

### Quick Start

```python
from motion_clip_hf import MotionCLIP
import numpy as np

# Load model (auto-downloads from the Hugging Face Hub)
model = MotionCLIP.from_pretrained("khania/motion-clip")

# Encode text
text_emb = model.encode_text(["a person walks forward", "someone is running fast"])
print(f"Text embeddings: {text_emb.shape}")  # (2, 512)

# Encode motion (272-dim absolute root format, variable length)
motion = np.random.randn(120, 272).astype(np.float32)  # Replace with real motion
motion_emb = model.encode_motion(motion)
print(f"Motion embedding: {motion_emb.shape}")  # (512,)

# Compute similarity
labels = ["walking", "running", "jumping", "sitting"]
similarity = model.compute_similarity(motion, labels)
predicted = labels[similarity.argmax()]
print(f"Predicted action: {predicted}")
```

### Text-to-Motion Retrieval

```python
# Find the most similar motions for a text query
results = model.retrieve_motion(
    text="a person waves their hand",
    candidate_motions=[motion1, motion2, motion3],  # List of (T, 272) arrays
    top_k=3
)
for r in results:
    print(f"#{r['rank']}: Motion {r['index']} (score: {r['score']:.4f})")
```

### Motion-to-Text Retrieval

```python
# Find the most similar texts for a motion
results = model.retrieve_text(
    motion=my_motion,  # (T, 272) array
    candidate_texts=["walking", "running", "jumping", "waving", "sitting"],
    top_k=3
)
for r in results:
    print(f"#{r['rank']}: {r['text']} (score: {r['score']:.4f})")
```

### Zero-Shot Motion Classification

```python
# Define action categories
actions = ["walking", "running", "jumping", "sitting", "waving",
           "kicking", "punching", "dancing", "stretching", "bowing"]

# Classify a motion; compute_similarity returns scaled logits,
# so apply softmax to get a probability-like confidence
similarity = model.compute_similarity(motion, actions)
probs = similarity.softmax(dim=-1)
predicted_action = actions[probs.argmax()]
print(f"Predicted: {predicted_action} (confidence: {probs.max().item():.3f})")
```

## Model Architecture

| Component | Details |
|-----------|---------|
| **Motion Encoder** | 8-layer Transformer |
| **Hidden Dimension** | 768 |
| **Attention Heads** | 12 |
| **Text Encoder** | CLIP ViT-B/32 (fine-tuned) |
| **Embedding Dimension** | 512 |
| **Max Sequence Length** | 1024 frames |

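As a sanity check on the table above, the motion encoder's transformer weight count can be estimated from these hyperparameters. This is a back-of-envelope sketch counting only the attention and feed-forward weight matrices (ignoring biases, LayerNorms, and the input/output projections); the feed-forward width follows `dim_feedforward=hidden_dim * 4` in `motion_clip_hf.py`:

```python
hidden = 768          # hidden dimension
layers = 8            # transformer layers
ffn = 4 * hidden      # feed-forward width, as in motion_clip_hf.py

attn_per_layer = 4 * hidden * hidden   # Q, K, V, and output projections
ffn_per_layer = 2 * hidden * ffn       # two feed-forward matrices
total = layers * (attn_per_layer + ffn_per_layer)
print(f"~{total / 1e6:.1f}M weights in the transformer stack")  # → ~56.6M
```

The fine-tuned CLIP text encoder adds its own parameters on top of this.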
## Motion Format

The model expects **272-dimensional motion features in absolute root format** based on the **SMPL body model** (22 joints).

### SMPL Body Model Requirement

This model was trained exclusively on motion data represented using the [SMPL body model](https://smpl.is.tue.mpg.de/). Your input motions must:

- Use the **SMPL skeleton** with 22 joints
- Follow the SMPL joint ordering
- Be converted to the 272-dimensional HumanML3D-style representation

If your motion data uses a different skeleton (e.g., CMU, Mixamo, custom rigs), you must first retarget it to SMPL before using this model.

### Feature Dimensions

| Dimensions | Description |
|------------|-------------|
| `[0:2]` | Root XZ velocities |
| `[2:8]` | Absolute heading rotation (6D representation) |
| `[8:74]` | Local joint positions (22 joints × 3) |
| `[74:140]` | Local joint velocities (22 joints × 3) |
| `[140:272]` | Joint rotations in 6D (22 joints × 6) |

The model automatically normalizes input motions using the bundled mean/std statistics.
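The slice boundaries above can be verified with a quick sketch (synthetic values as a stand-in; real features come from an SMPL-based extraction pipeline):

```python
import numpy as np

frame = np.zeros(272, dtype=np.float32)  # one synthetic frame as a stand-in

root_vel   = frame[0:2]                     # root XZ velocities
heading_6d = frame[2:8]                     # absolute heading rotation (6D)
joint_pos  = frame[8:74].reshape(22, 3)     # local joint positions
joint_vel  = frame[74:140].reshape(22, 3)   # local joint velocities
joint_rot  = frame[140:272].reshape(22, 6)  # joint rotations (6D)

# the five blocks partition the 272 dimensions exactly
assert 2 + 6 + 22 * 3 + 22 * 3 + 22 * 6 == 272
print(root_vel.shape, joint_pos.shape, joint_rot.shape)  # (2,) (22, 3) (22, 6)
```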
## Training Details

| Parameter | Value |
|-----------|-------|
| **Dataset** | MotionMillion (~884K training motions) |
| **Batch Size** | 256 |
| **Training Iterations** | 100,000 |
| **Learning Rate (Motion Encoder)** | 1e-4 |
| **Learning Rate (Text Encoder)** | 5e-5 |
| **Loss Function** | Symmetric InfoNCE |
| **Temperature** | Learnable (initialized at 0.07) |

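The symmetric InfoNCE objective from the table can be sketched as follows. This is a minimal illustration of the loss, not the actual training script: each in-batch pair (motion_i, text_i) is the positive, and every other pairing in the batch serves as a negative. The temperature is fixed here for brevity; in the model it is a learnable `logit_scale` parameter.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(motion_embeds, text_embeds, logit_scale):
    """CLIP-style contrastive loss averaged over both retrieval directions."""
    # embeddings are assumed L2-normalized, so logits are scaled cosine similarities
    logits_per_motion = logit_scale * motion_embeds @ text_embeds.T
    logits_per_text = logits_per_motion.T
    targets = torch.arange(motion_embeds.shape[0])  # i-th motion matches i-th text
    return (F.cross_entropy(logits_per_motion, targets)
            + F.cross_entropy(logits_per_text, targets)) / 2

# random stand-in embeddings for a batch of 8 motion-text pairs
motion = F.normalize(torch.randn(8, 512), dim=-1)
text = F.normalize(torch.randn(8, 512), dim=-1)
loss = symmetric_infonce(motion, text, logit_scale=torch.tensor(1 / 0.07))
print(f"loss: {loss.item():.3f}")
```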
## Performance

Retrieval performance (R@k) on random test subsets:

| Subset Size | Motion→Text R@1 | Motion→Text R@5 | Text→Motion R@1 | Text→Motion R@5 |
|-------------|-----------------|-----------------|-----------------|-----------------|
| 1,000 | 36.2% | 67.8% | 36.4% | 68.1% |
| 5,000 | 17.7% | 42.1% | 17.8% | 42.3% |
| 10,000 | 12.4% | 31.5% | 12.5% | 31.6% |

*Note: Lower R@k on larger subsets is expected, since retrieval becomes harder as the candidate pool grows.*

## Files in This Repository

| File | Size | Description |
|------|------|-------------|
| `config.json` | 239 B | Model configuration |
| `pytorch_model.bin` | 482 MB | Model weights |
| `mean.npy` | 1.2 KB | Motion normalization mean (272,) |
| `std.npy` | 1.2 KB | Motion normalization std (272,) |

## Limitations

- Trained on English text descriptions only
- The motion format is specific to the HumanML3D-style 272-dim representation
- Best performance on motions close to the training distribution (daily activities, sports, etc.)

## Citation

```bibtex
@article{motionmillion2026,
  title={MotionMillion: A Large-Scale Motion-Language Dataset},
  author={...},
  year={2026}
}
```

## License

**CC BY-NC 4.0** (Creative Commons Attribution-NonCommercial 4.0 International)

This model is released for **research and non-commercial use only**.

### Why Non-Commercial?

The MotionMillion training dataset aggregates motion data from multiple sources with varying licenses:
- Some datasets permit commercial use
- Some datasets restrict commercial use (e.g., AMASS, BABEL, certain MoCap databases)

To comply with the most restrictive terms, this model is released under CC BY-NC 4.0.

### What This Means

✅ **Allowed:**
- Academic research
- Personal projects
- Non-commercial applications
- Sharing and adapting with attribution

❌ **Not Allowed:**
- Commercial products or services
- Selling access to the model
- Use in revenue-generating applications

For commercial licensing inquiries, please contact the authors.
config.json ADDED
@@ -0,0 +1,10 @@
{
  "motion_input_dim": 272,
  "motion_hidden_dim": 768,
  "embed_dim": 512,
  "motion_num_heads": 12,
  "motion_num_layers": 8,
  "motion_max_seq_len": 784,
  "motion_dropout": 0.1,
  "text_encoder_name": "openai/clip-vit-base-patch32"
}
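Keys in `config.json` override the in-code `DEFAULT_CONFIG` of `motion_clip_hf.py` through a plain dict merge (`{**DEFAULT_CONFIG, **config}`), so the shipped `motion_max_seq_len` of 784 wins over the class default of 1024. A minimal illustration using an excerpt of the defaults:

```python
DEFAULT_CONFIG = {"motion_max_seq_len": 1024, "embed_dim": 512}  # excerpt of the class defaults
shipped = {"motion_max_seq_len": 784}                            # as in config.json above

merged = {**DEFAULT_CONFIG, **shipped}  # later entries win on key collisions
assert merged == {"motion_max_seq_len": 784, "embed_dim": 512}
print(merged["motion_max_seq_len"])  # → 784
```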
mean.npy ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a3e3ce8012ec7085209c805c3d9f8deb56bc447e8901b8f30fea8da6a841f302
size 1216
motion_clip_hf.py ADDED
@@ -0,0 +1,327 @@
"""
MotionCLIP - Motion-Text CLIP Model
Load and use the MotionCLIP model for motion-text retrieval and similarity computation.

Usage:
    from motion_clip_hf import MotionCLIP

    # Load from HuggingFace Hub
    model = MotionCLIP.from_pretrained("khania/motion-clip")

    # Encode text and motion
    text_emb = model.encode_text(["a person walks forward"])
    motion_emb = model.encode_motion(motion_array)  # (T, 272) numpy array

    # Compute similarity
    similarity = model.compute_similarity(motion_array, ["walking", "running"])
"""

import os
import json
import math
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List, Optional, Union
from pathlib import Path

try:
    from transformers import CLIPTextModel, CLIPTokenizer
    TRANSFORMERS_AVAILABLE = True
except ImportError:
    TRANSFORMERS_AVAILABLE = False

try:
    from huggingface_hub import hf_hub_download
    HF_HUB_AVAILABLE = True
except ImportError:
    HF_HUB_AVAILABLE = False


def sinusoidal_positional_encoding(seq_len: int, dim: int, device: torch.device) -> torch.Tensor:
    """Generate sinusoidal positional encoding (matches original training code)."""
    pe = torch.zeros(seq_len, dim, device=device)
    position = torch.arange(0, seq_len, dtype=torch.float32, device=device).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32, device=device) * (-math.log(10000.0) / dim)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


class MotionTransformerEncoder(nn.Module):
    """Transformer encoder for motion sequences.

    Architecture matches original training code exactly:
    - Sinusoidal positional encoding (not learnable)
    - Masked mean pooling (no cls token)
    - Simple Linear output projection
    - Pre-LayerNorm architecture (norm_first=True to match _SDPATransformerEncoderLayer)
    """

    def __init__(
        self,
        input_dim: int = 272,
        hidden_dim: int = 768,
        embed_dim: int = 512,
        num_heads: int = 12,
        num_layers: int = 8,
        max_seq_len: int = 1024,
        dropout: float = 0.1
    ):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.embed_dim = embed_dim
        self.max_seq_len = max_seq_len

        self.input_proj = nn.Linear(input_dim, hidden_dim)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim,
            nhead=num_heads,
            dim_feedforward=hidden_dim * 4,
            dropout=dropout,
            activation='gelu',
            batch_first=True,
            norm_first=True  # Pre-LayerNorm to match _SDPATransformerEncoderLayer
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.output_proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        B, T, _ = x.shape
        x = self.input_proj(x)

        pe = sinusoidal_positional_encoding(T, self.hidden_dim, x.device)
        x = x + pe.unsqueeze(0)

        if mask is not None:
            key_padding_mask = ~mask
        else:
            key_padding_mask = None

        x = self.transformer(x, src_key_padding_mask=key_padding_mask)

        if mask is not None:
            mask_expanded = mask.unsqueeze(-1).float()
            x = (x * mask_expanded).sum(dim=1) / mask_expanded.sum(dim=1).clamp(min=1e-6)
        else:
            x = x.mean(dim=1)

        output = self.output_proj(x)
        return output


class TextEncoderCLIP(nn.Module):
    """HuggingFace CLIP text encoder - matches original training code."""

    def __init__(self, model_name: str = "openai/clip-vit-base-patch32", output_dim: int = 512):
        super().__init__()
        if not TRANSFORMERS_AVAILABLE:
            raise ImportError("transformers required: pip install transformers")

        self.tokenizer = CLIPTokenizer.from_pretrained(model_name)
        self.model = CLIPTextModel.from_pretrained(model_name)
        self.hidden_size = self.model.config.hidden_size
        self.output_dim = output_dim

        if self.hidden_size != output_dim:
            self.proj = nn.Linear(self.hidden_size, output_dim)
        else:
            self.proj = nn.Identity()

    def forward(self, texts: List[str], device: torch.device) -> torch.Tensor:
        inputs = self.tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=self.tokenizer.model_max_length,
            return_tensors="pt"
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        out = self.model(**inputs)

        if hasattr(out, "pooler_output") and out.pooler_output is not None:
            feat = out.pooler_output
        else:
            feat = out.last_hidden_state[:, 0]

        return self.proj(feat)


class MotionCLIP(nn.Module):
    """Motion-Text CLIP Model with fine-tuned text encoder."""

    DEFAULT_CONFIG = {
        "motion_input_dim": 272,
        "motion_hidden_dim": 768,
        "embed_dim": 512,
        "motion_num_heads": 12,
        "motion_num_layers": 8,
        "motion_max_seq_len": 1024,
        "motion_dropout": 0.1,
        "text_encoder_name": "openai/clip-vit-base-patch32"
    }

    def __init__(self, config: dict = None):
        super().__init__()
        self.config = {**self.DEFAULT_CONFIG, **(config or {})}

        self.motion_encoder = MotionTransformerEncoder(
            input_dim=self.config["motion_input_dim"],
            hidden_dim=self.config["motion_hidden_dim"],
            embed_dim=self.config["embed_dim"],
            num_heads=self.config["motion_num_heads"],
            num_layers=self.config["motion_num_layers"],
            max_seq_len=self.config["motion_max_seq_len"],
            dropout=self.config["motion_dropout"]
        )

        self.text_encoder = TextEncoderCLIP(
            model_name=self.config["text_encoder_name"],
            output_dim=self.config["embed_dim"]
        )

        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))
        self.register_buffer("mean", torch.zeros(self.config["motion_input_dim"]))
        self.register_buffer("std", torch.ones(self.config["motion_input_dim"]))

    def encode_text(self, texts: List[str], normalize: bool = True) -> torch.Tensor:
        device = next(self.parameters()).device
        text_embeds = self.text_encoder(texts, device)
        if normalize:
            text_embeds = F.normalize(text_embeds, dim=-1)
        return text_embeds

    def encode_motion(
        self,
        motion: Union[np.ndarray, torch.Tensor, List[np.ndarray]],
        normalize: bool = True,
        mask: Optional[torch.Tensor] = None,
        apply_motion_norm: bool = True
    ) -> torch.Tensor:
        """Encode motion sequences to embeddings.

        Args:
            motion: Motion input as numpy array, torch tensor, or list of arrays.
                Shape: (T, 272) for single motion or (B, T, 272) for batch.
            normalize: Whether to L2-normalize the output embeddings.
            mask: Optional boolean mask for padded sequences. Shape: (B, T).
            apply_motion_norm: Whether to apply mean/std normalization to input.
                Set to False if input is already normalized.

        Returns:
            Motion embeddings of shape (B, embed_dim) or (embed_dim,) for single input.
        """
        device = next(self.parameters()).device

        if isinstance(motion, list):
            # Pad variable-length motions into one batch and build the mask
            max_len = max(m.shape[0] for m in motion)
            batch = torch.zeros(len(motion), max_len, motion[0].shape[-1])
            mask = torch.zeros(len(motion), max_len, dtype=torch.bool)
            for i, m in enumerate(motion):
                if isinstance(m, np.ndarray):
                    m = torch.from_numpy(m)
                batch[i, :m.shape[0]] = m
                mask[i, :m.shape[0]] = True
            motion = batch
            mask = mask.to(device)
        elif isinstance(motion, np.ndarray):
            motion = torch.from_numpy(motion)

        single_input = motion.dim() == 2
        if single_input:
            motion = motion.unsqueeze(0)

        motion = motion.float().to(device)

        if apply_motion_norm:
            motion = (motion - self.mean) / (self.std + 1e-8)

        motion_embeds = self.motion_encoder(motion, mask=mask)

        if normalize:
            motion_embeds = F.normalize(motion_embeds, dim=-1)
        if single_input:
            # Return (embed_dim,) for a single (T, 272) input, as documented
            motion_embeds = motion_embeds.squeeze(0)
        return motion_embeds

    def compute_similarity(
        self,
        motion: Union[np.ndarray, torch.Tensor, List[np.ndarray]],
        texts: List[str]
    ) -> torch.Tensor:
        motion_embeds = self.encode_motion(motion, normalize=True)
        text_embeds = self.encode_text(texts, normalize=True)
        logit_scale = self.logit_scale.exp()
        similarity = logit_scale * motion_embeds @ text_embeds.T
        return similarity

    @torch.no_grad()
    def retrieve_motion(
        self,
        text: str,
        candidate_motions: List[np.ndarray],
        top_k: int = 5
    ) -> List[dict]:
        """Rank candidate motions by similarity to a text query."""
        motion_embeds = self.encode_motion(candidate_motions, normalize=True)  # (N, D)
        text_embed = self.encode_text([text], normalize=True)                  # (1, D)
        sims = (text_embed @ motion_embeds.T).flatten()
        top_k = min(top_k, len(candidate_motions))
        scores, indices = sims.topk(top_k)
        return [
            {"rank": r + 1, "index": int(i), "score": float(s)}
            for r, (s, i) in enumerate(zip(scores, indices))
        ]

    @torch.no_grad()
    def retrieve_text(
        self,
        motion: Union[np.ndarray, torch.Tensor],
        candidate_texts: List[str],
        top_k: int = 5
    ) -> List[dict]:
        """Rank candidate texts by similarity to a motion query."""
        sims = self.compute_similarity(motion, candidate_texts).flatten()
        top_k = min(top_k, len(candidate_texts))
        scores, indices = sims.topk(top_k)
        return [
            {"rank": r + 1, "index": int(i), "text": candidate_texts[int(i)], "score": float(s)}
            for r, (s, i) in enumerate(zip(scores, indices))
        ]

    def forward(
        self,
        motion: torch.Tensor,
        texts: List[str],
        motion_mask: Optional[torch.Tensor] = None
    ) -> dict:
        motion_embeds = self.encode_motion(motion, normalize=True, mask=motion_mask)
        text_embeds = self.encode_text(texts, normalize=True)
        logit_scale = self.logit_scale.exp()
        logits_per_motion = logit_scale * motion_embeds @ text_embeds.T
        logits_per_text = logits_per_motion.T
        return {"logits_per_motion": logits_per_motion, "logits_per_text": logits_per_text}

    @classmethod
    def from_pretrained(cls, path_or_repo: str, device: str = None, **kwargs):
        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"

        path = Path(path_or_repo)
        if path.exists():
            config_file = path / "config.json"
            weights_file = path / "pytorch_model.bin"
        else:
            if not HF_HUB_AVAILABLE:
                raise ImportError("huggingface_hub required: pip install huggingface_hub")
            config_file = hf_hub_download(path_or_repo, "config.json", **kwargs)
            weights_file = hf_hub_download(path_or_repo, "pytorch_model.bin", **kwargs)
            config_file = Path(config_file)
            weights_file = Path(weights_file)

        with open(config_file, 'r') as f:
            config = json.load(f)

        model = cls(config)

        print(f"Loading weights from: {weights_file.name}")
        state_dict = torch.load(weights_file, map_location="cpu")

        missing, unexpected = model.load_state_dict(state_dict, strict=False)
        if missing:
            print(f"Missing keys: {len(missing)}")
        if unexpected:
            print(f"Unexpected keys: {len(unexpected)}")

        model = model.to(device)
        model.eval()

        print(f"Loaded MotionCLIP (embed_dim={config.get('embed_dim', 512)}) on {device}")
        return model

    def save_pretrained(self, save_dir: str):
        save_dir = Path(save_dir)
        save_dir.mkdir(parents=True, exist_ok=True)

        with open(save_dir / "config.json", 'w') as f:
            json.dump(self.config, f, indent=2)

        torch.save(self.state_dict(), save_dir / "pytorch_model.bin")
        print(f"Saved MotionCLIP to {save_dir}")


if __name__ == "__main__":
    model = MotionCLIP()
    print(f"MotionCLIP created with {sum(p.numel() for p in model.parameters()):,} parameters")

    dummy_motion = torch.randn(2, 64, 272)
    motion_emb = model.encode_motion(dummy_motion)
    print(f"Motion embedding shape: {motion_emb.shape}")
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fd889ff01f6f22ead7f0a8ed5859f4358547dfe4899012c061f0c2dbb323c3af
size 481994447
std.npy ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:86c79a66805f80a5219047235536aee339de3accc4aa6de4a1857ff6ff61fc41
size 1216