Refine selected image regions using a text prompt
Generate a talking face video from an image and audio
Official Space for SpatialTrackerV2