C-MET: Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
Chanhyuk Choi, Taesoo Kim, Donggyu Lee, Siyeol Jung, Taehwan Kim
Ulsan National Institute of Science and Technology
CVPR 2026
Model Description
Talking face generation has become impressive, but making the generated face express a desired emotion remains an open problem. Label-based methods are too coarse, audio-based methods entangle emotion with speech content, and image-based methods require hard-to-obtain reference photos.
C-MET addresses this by learning cross-modal emotion semantic vectors that bridge the speech and visual feature spaces, so the model can transfer emotions from speech to face without any reference image, even for extended emotions such as sarcasm that rarely appear in training data.
On MEAD and CREMA-D benchmarks, C-MET improves emotion accuracy by 14% over state-of-the-art methods.
This repository contains the Connector_exp module β the core C-MET model that generates emotion semantic vectors from cross-modal embeddings.
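Conceptually, the Connector_exp module maps a speech-derived emotion embedding into the visual emotion feature space. The toy sketch below illustrates only that idea in pure Python; the dimensions, the single linear projection, and the `connect` helper are illustrative assumptions, not the released architecture:

```python
import math
import random

random.seed(0)

# Illustrative dimensions (assumptions, not the paper's actual sizes).
AUDIO_DIM, VISUAL_DIM = 8, 4

# Toy "connector": one linear projection from the speech emotion-embedding
# space into the visual emotion-semantic space (for illustration only).
W = [[random.gauss(0, 1) / math.sqrt(AUDIO_DIM) for _ in range(VISUAL_DIM)]
     for _ in range(AUDIO_DIM)]

def connect(audio_emb):
    """Map a speech emotion embedding to a unit-norm visual emotion vector."""
    v = [sum(a * w for a, w in zip(audio_emb, col)) for col in zip(*W)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

emotion_vector = connect([random.gauss(0, 1) for _ in range(AUDIO_DIM)])
print(len(emotion_vector))  # 4
```

The released model learns this mapping from cross-modal embeddings rather than using a fixed random projection.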
Repository Structure
coldhyuk/C-MET
├── model.safetensors   # C-MET Connector_exp weights
├── config.json         # Model config
├── pretrained_weights/
│   ├── Audio2Lip.pt    # EDTalk dependency (Apache-2.0)
│   ├── EDTalk.pt       # EDTalk dependency (Apache-2.0)
│   └── EDTalk-V.pt     # EDTalk dependency (Apache-2.0)
└── checkpoints/
    └── _epoch_2105_checkpoint_step000200000.pth  # Full training checkpoint
Usage
Installation
git clone https://github.com/ChanHyeok-Choi/C-MET
cd C-MET
conda create -n C_MET python=3.8
conda activate C_MET
pip install -r requirements.txt
Load model with from_pretrained
from src.connector import Connector_exp
model = Connector_exp.from_pretrained("coldhyuk/C-MET")
model.eval()
Inference
Pretrained dependency weights (pretrained_weights/*.pt) are automatically downloaded from this repo on first run.
python inference.py \
--num_samples 10 \
--connector_exp_path ./checkpoints/_epoch_2105_checkpoint_step000200000.pth \
--source_path ./asset/identity/ChatGPT_man3_crop.png \
--audio_driving_path ./asset/audio/W009_038.wav \
--pose_driving_path ./asset/video/W009_038.mp4 \
--save_path ./res/output_happy.mp4 \
--neu_e2v_path ./audios/MEAD/neutral/emotion2vec+large_features/ \
--emo_e2v_path ./audios/MEAD/happy/emotion2vec+large_features/
Supported emotions:
- Standard (7): angry, contempt, disgusted, fear, happy, sad, surprised
- Extended (6): charismatic, desirous, empathetic, envious, romantic, sarcastic
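The `--neu_e2v_path` / `--emo_e2v_path` arguments follow a per-emotion directory layout, as in the command above. A small helper for assembling these paths (the `emotion_feature_dir` function is illustrative and not part of the repository; the path layout mirrors the inference example):

```python
from pathlib import Path

# Supported emotion names, copied from the list above.
STANDARD = {"angry", "contempt", "disgusted", "fear", "happy", "sad", "surprised"}
EXTENDED = {"charismatic", "desirous", "empathetic", "envious", "romantic", "sarcastic"}
SUPPORTED = STANDARD | EXTENDED

def emotion_feature_dir(emotion: str, root: str = "./audios/MEAD") -> str:
    """Build the emotion2vec feature directory for --emo_e2v_path.

    Illustrative helper (not shipped with the repo); the layout follows
    the inference example: <root>/<emotion>/emotion2vec+large_features/.
    """
    if emotion not in SUPPORTED:
        raise ValueError(f"Unsupported emotion: {emotion!r}")
    return (Path(root) / emotion / "emotion2vec+large_features").as_posix()

print(emotion_feature_dir("happy"))  # audios/MEAD/happy/emotion2vec+large_features
```

Pass the neutral directory (e.g. `emotion_feature_dir("neutral")` with validation relaxed) to `--neu_e2v_path` and the target-emotion directory to `--emo_e2v_path`.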
Training
python train.py
tensorboard --logdir ./tensorboard_runs
For data preprocessing, refer to the GitHub README.
Third-Party Pretrained Weights
The following pretrained weights are from EDTalk, licensed under Apache-2.0:
- pretrained_weights/Audio2Lip.pt
- pretrained_weights/EDTalk.pt
- pretrained_weights/EDTalk-V.pt
Citation
@inproceedings{choi2026cross,
title={Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video},
author={Choi, Chanhyuk and Kim, Taesoo and Lee, Donggyu and Jung, Siyeol and Kim, Taehwan},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}
Acknowledgements
This project builds on EDTalk; its pretrained weights (Apache-2.0) are redistributed as listed above.