C-MET: Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
Chanhyuk Choi, Taesoo Kim, Donggyu Lee, Siyeol Jung, Taehwan Kim
Ulsan National Institute of Science and Technology
CVPR 2026
Model Description
Talking face generation has become impressive, but making the generated face express a desired emotion remains an open problem. Label-based methods are too coarse, audio-based methods entangle emotion with speech content, and image-based methods require hard-to-obtain reference photos.
C-MET addresses this by learning cross-modal emotion semantic vectors that bridge the speech and visual feature spaces, so the model can transfer emotions from speech to face without any reference image, even for extended emotions such as sarcasm that rarely appear in training data.
On MEAD and CREMA-D benchmarks, C-MET improves emotion accuracy by 14% over state-of-the-art methods.
This repository contains the Connector_exp module β the core C-MET model that generates emotion semantic vectors from cross-modal embeddings.
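Conceptually, the Connector_exp module maps a speech-derived emotion embedding into the visual emotion feature space. The toy sketch below illustrates only that idea in pure Python; the dimensions, the single linear projection, and the `connect` helper are illustrative assumptions, not the released architecture:

```python
import math
import random

random.seed(0)

# Illustrative dimensions (assumptions, not the paper's actual sizes).
AUDIO_DIM, VISUAL_DIM = 8, 4

# Toy "connector": one linear projection from the speech emotion-embedding
# space into the visual emotion-semantic space (for illustration only).
W = [[random.gauss(0, 1) / math.sqrt(AUDIO_DIM) for _ in range(VISUAL_DIM)]
     for _ in range(AUDIO_DIM)]

def connect(audio_emb):
    """Map a speech emotion embedding to a unit-norm visual emotion vector."""
    v = [sum(a * w for a, w in zip(audio_emb, col)) for col in zip(*W)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

emotion_vector = connect([random.gauss(0, 1) for _ in range(AUDIO_DIM)])
print(len(emotion_vector))  # 4
```

The released model learns this mapping from cross-modal embeddings rather than using a fixed random projection.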
Repository Structure
coldhyuk/C-MET
├── model.safetensors   # C-MET Connector_exp weights
├── config.json         # Model config
├── pretrained_weights/
│   ├── Audio2Lip.pt    # EDTalk dependency (Apache-2.0)
│   ├── EDTalk.pt       # EDTalk dependency (Apache-2.0)
│   └── EDTalk-V.pt     # EDTalk dependency (Apache-2.0)
└── checkpoints/
    └── _epoch_2105_checkpoint_step000200000.pth  # Full training checkpoint
Usage
Installation
git clone https://github.com/ChanHyeok-Choi/C-MET
cd C-MET
conda create -n C_MET python=3.8
conda activate C_MET
pip install -r requirements.txt
Load model with from_pretrained
from src.connector import Connector_exp
model = Connector_exp.from_pretrained("coldhyuk/C-MET")
model.eval()
Inference
Pretrained dependency weights (pretrained_weights/*.pt) are automatically downloaded from this repo on first run.
python inference.py \
--num_samples 10 \
--connector_exp_path ./checkpoints/_epoch_2105_checkpoint_step000200000.pth \
--source_path ./asset/identity/ChatGPT_man3_crop.png \
--audio_driving_path ./asset/audio/W009_038.wav \
--pose_driving_path ./asset/video/W009_038.mp4 \
--save_path ./res/output_happy.mp4 \
--neu_e2v_path ./audios/MEAD/neutral/emotion2vec+large_features/ \
--emo_e2v_path ./audios/MEAD/happy/emotion2vec+large_features/
Supported emotions:
- Standard (7): angry, contempt, disgusted, fear, happy, sad, surprised
- Extended (6): charismatic, desirous, empathetic, envious, romantic, sarcastic
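The `--neu_e2v_path` / `--emo_e2v_path` arguments follow a per-emotion directory layout, as in the command above. A small helper for assembling these paths (the `emotion_feature_dir` function is illustrative and not part of the repository; the path layout mirrors the inference example):

```python
from pathlib import Path

# Supported emotion names, copied from the list above.
STANDARD = {"angry", "contempt", "disgusted", "fear", "happy", "sad", "surprised"}
EXTENDED = {"charismatic", "desirous", "empathetic", "envious", "romantic", "sarcastic"}
SUPPORTED = STANDARD | EXTENDED

def emotion_feature_dir(emotion: str, root: str = "./audios/MEAD") -> str:
    """Build the emotion2vec feature directory for --emo_e2v_path.

    Illustrative helper (not shipped with the repo); the layout follows
    the inference example: <root>/<emotion>/emotion2vec+large_features/.
    """
    if emotion not in SUPPORTED:
        raise ValueError(f"Unsupported emotion: {emotion!r}")
    return (Path(root) / emotion / "emotion2vec+large_features").as_posix()

print(emotion_feature_dir("happy"))  # audios/MEAD/happy/emotion2vec+large_features
```

Pass the neutral directory (e.g. `emotion_feature_dir("neutral")` with validation relaxed) to `--neu_e2v_path` and the target-emotion directory to `--emo_e2v_path`.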
Training
python train.py
tensorboard --logdir ./tensorboard_runs
For data preprocessing, refer to the GitHub README.
Third-Party Pretrained Weights
The following pretrained weights are from EDTalk, licensed under Apache-2.0:
- pretrained_weights/Audio2Lip.pt
- pretrained_weights/EDTalk.pt
- pretrained_weights/EDTalk-V.pt
Citation
@inproceedings{choi2026cross,
title={Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video},
author={Choi, Chanhyuk and Kim, Taesoo and Lee, Donggyu and Jung, Siyeol and Kim, Taehwan},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}
Acknowledgements
This project builds on EDTalk; its pretrained weights (Apache-2.0) are redistributed as listed above.