Segment and caption objects in images and videos
Generate speech in a cloned voice from reference audio