Multimodal embedding model, supporting datasets, and a paper describing how both the datasets and the models were built 🤗