MER-Factory: Your Open-Source Factory for Multimodal Emotion Datasets
Today, we're excited to introduce MER-Factory, a new open-source Python framework designed to be your automated factory for building Multimodal Emotion Recognition and Reasoning (MERR) datasets. It systematically analyzes video or image files, extracts multimodal features, and uses the power of Large Language Models (including those on the Hugging Face Hub) to generate detailed emotional analysis and reasoning.
This tool aims to empower the Affective Computing community by providing an open, scalable, and scientifically grounded framework for dataset creation.
- GitHub Repo: https://github.com/Lum1104/MER-Factory
- Full Documentation: https://lum1104.github.io/MER-Factory/
What Can MER-Factory Do?
At its core, MER-Factory automates the entire pipeline from raw media to a structured, LLM-annotated dataset. It takes a video, identifies key emotional moments, and then analyzes the scene from every angle:
- 🗣️ Audio Analysis: Transcribes speech and analyzes vocal tone.
- 😊 Facial Analysis: Uses OpenFace to extract Facial Action Units (AUs) based on the scientific Facial Action Coding System (FACS).
- 🎬 Visual Analysis: Describes the visual context of the scene or image.
Finally, it synthesizes all this information into a cohesive, reasoned summary, explaining why a particular emotion is being expressed.
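The facial branch rests on FACS, in which combinations of Action Units signal prototypical emotions; for example, AU6 (cheek raiser) plus AU12 (lip corner puller) is the classic signature of happiness. As a toy illustration of that idea (the mapping table and function below are ours, not MER-Factory's actual code):

```python
# Illustrative only: a tiny FACS-style lookup, not MER-Factory's API.
# Prototype AU combinations from the Facial Action Coding System, e.g.
# AU6 (cheek raiser) + AU12 (lip corner puller) -> happiness.
AU_PROTOTYPES = {
    frozenset({"AU6", "AU12"}): "happiness",
    frozenset({"AU1", "AU4", "AU15"}): "sadness",
    frozenset({"AU4", "AU5", "AU7", "AU23"}): "anger",
}

def label_from_aus(active_aus):
    """Return the first emotion whose prototype AUs are all active."""
    active = set(active_aus)
    for prototype, emotion in AU_PROTOTYPES.items():
        if prototype <= active:
            return emotion
    return "unknown"

print(label_from_aus(["AU6", "AU12", "AU25"]))  # happiness
```

In the real pipeline, such AU evidence is only one signal; the LLM weighs it against the audio and visual context before producing the final reasoned summary.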
The framework is built on a robust set of technologies, including LangGraph for managing complex, stateful workflows, FFmpeg for media processing, and a pluggable architecture for AI model integration.
Hugging Face Integration and Model Flexibility 🤗
A key design principle of MER-Factory is model flexibility. You are not locked into a single proprietary API. The framework features a pluggable architecture that seamlessly supports:
- Hugging Face Hub Models: Run inference with the latest open-source models like `google/gemma-3n-E4B-it`. This is perfect for custom implementations and leveraging cutting-edge research.
- Ollama (Local Models): Run popular models like `Llama 3.2` and `LLaVA` locally for privacy, cost savings, and impressive performance on many tasks.
- API-based Models: Integrates with Google Gemini and OpenAI (GPT-4o, etc.) for tasks requiring the most advanced reasoning capabilities.
This allows you to choose the best tool for the job, whether you prioritize open-source principles, privacy, cost, or state-of-the-art performance.
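A pluggable backend layer like this usually boils down to a shared interface plus a small factory. The sketch below is our own minimal illustration of the pattern, under assumed names (`TextModel`, `build_model`, the `provider:model` spec); it is not MER-Factory's real class hierarchy:

```python
# Minimal sketch of a pluggable model layer (illustrative names only).
from abc import ABC, abstractmethod

class TextModel(ABC):
    """Shared interface every backend must implement."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class HuggingFaceModel(TextModel):
    def __init__(self, model_id: str):
        self.model_id = model_id
    def generate(self, prompt: str) -> str:
        # A real backend would call the model here.
        return f"[{self.model_id}] response to: {prompt}"

class OllamaModel(TextModel):
    def __init__(self, model_name: str):
        self.model_name = model_name
    def generate(self, prompt: str) -> str:
        return f"[ollama/{self.model_name}] response to: {prompt}"

def build_model(spec: str) -> TextModel:
    """Pick a backend from a hypothetical 'provider:model' spec."""
    provider, _, model = spec.partition(":")
    backends = {"hf": HuggingFaceModel, "ollama": OllamaModel}
    return backends[provider](model)

model = build_model("hf:google/gemma-3n-E4B-it")
```

Because every backend honors the same `generate` interface, the rest of the pipeline never needs to know which provider is answering.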
Getting Started in 3 Steps
Let's get you up and running with MER-Factory using a Hugging Face model.
1. Installation
First, ensure you have the prerequisites, FFmpeg and OpenFace, installed. Then, clone the repository and set up your environment.
```bash
# Clone the repository
git clone https://github.com/Lum1104/MER-Factory.git
cd MER-Factory

# Create a Python 3.12+ environment and install dependencies
conda create -n mer-factory python=3.12
conda activate mer-factory
pip install -r requirements.txt
```
2. Configuration
Copy the example environment file and add the absolute path to your OpenFace FeatureExtraction executable. This is crucial for facial analysis.
```bash
# Copy the example environment file
cp .env.example .env
```
Now, edit your new .env file:
```bash
# Required for AU and MER pipelines
OPENFACE_EXECUTABLE=/absolute/path/to/OpenFace/build/bin/FeatureExtraction

# Optional: Add API keys if you plan to use them
# GOOGLE_API_KEY=your_google_api_key_here
# OPENAI_API_KEY=your_openai_api_key_here
```
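In case the `.env` mechanics are unfamiliar: each line is a simple `KEY=VALUE` pair that ends up in the process environment. MER-Factory most likely loads it via a library such as python-dotenv; the loader below is only a minimal sketch of what that file means, not the project's actual code:

```python
# Illustrative only: a minimal .env loader to show what the file does.
import os

def load_env(path=".env"):
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            # Don't clobber variables already set in the real environment.
            os.environ.setdefault(key.strip(), value.strip())

# load_env()  # afterwards os.environ["OPENFACE_EXECUTABLE"] is available
```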
3. Run Your First Pipeline
You're ready to go! Let's analyze a video using a Hugging Face model. MER-Factory will download the model and run it locally.
```bash
# Analyze a video with a Hugging Face model
python main.py path/to/your/video.mp4 output/ --type MER --huggingface-model google/gemma-3n-E4B-it
```
Want to use local models via Ollama instead? It's just as easy.
```bash
# First, pull the models you want to use
ollama pull llava-llama3:latest
ollama pull llama3.2

# Run the pipeline with Ollama
python main.py path/to/your/video.mp4 output/ --type MER \
  --ollama-vision-model llava-llama3:latest \
  --ollama-text-model llama3.2
```
After running, you'll find a detailed _merr_data.json file in your output/ directory containing the full multimodal analysis.
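The exact schema of that file is best discovered by opening it; a quick way to peek at its top-level structure (the path below is a placeholder, and `peek` is our own helper, not part of MER-Factory):

```python
import json

def peek(path):
    """Return the sorted top-level keys of a JSON output file."""
    with open(path) as f:
        data = json.load(f)
    return sorted(data)  # for a JSON object, this lists its keys

# peek("output/your_video_merr_data.json")  # placeholder path
```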
The "Best-of-Breed" Caching Workflow 💰
One of the most powerful features is the caching system. It saves the output of each processing step, allowing you to build a dataset using the best model for each modality.
For example, you could:
- Run Video Analysis using a powerful API model like GPT-4o, which excels at temporal reasoning.
- Run AU and Audio Analysis using local Ollama or Hugging Face models to save costs.
- Run the Final Synthesis by re-running the MER pipeline with the `--cache` flag. MER-Factory will detect the existing results and only run the final synthesis step, combining your high-quality, pre-computed analyses into one summary.
This workflow provides maximum flexibility and cost-efficiency, allowing you to create "best-of-breed" datasets without being locked into a single provider.
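Conceptually, this caching behaviour amounts to checking for a step's saved output before recomputing it. A simplified sketch of the idea (not MER-Factory's actual cache layout or function names):

```python
# Illustrative sketch of result caching per pipeline step.
import json
import os

def run_step(name, compute, cache_dir="cache", use_cache=True):
    """Return a step's result, reusing its cached JSON file when allowed."""
    path = os.path.join(cache_dir, f"{name}.json")
    if use_cache and os.path.exists(path):
        with open(path) as f:
            return json.load(f)  # cache hit: skip recomputation
    result = compute()  # cache miss: run the (possibly expensive) step
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)
    return result
```

With caching enabled, only steps whose result file is missing are executed, which is how a final synthesis run can reuse analyses produced by earlier, cheaper runs.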
Why This Matters for the Community
MER-Factory is more than just a tool; it's a contribution to open and reproducible science in Affective Computing.
- Democratizes Dataset Creation: Lowers the barrier for researchers and smaller teams to create large-scale, high-quality MERR datasets.
- Enables Reproducibility: Provides an open-source, configurable pipeline for standardized dataset construction.
- A Testbed for MLLMs: Offers a practical framework for evaluating how well different MLLMs (open-source and proprietary) perform on nuanced emotional reasoning tasks.
We believe that by providing robust, open tools, we can collectively accelerate progress in understanding and modeling human emotion.
Join Us!
We invite you to try MER-Factory for your own research or projects. Explore the code, generate your dataset, and join the conversation!
- ⭐ Star us on GitHub: https://github.com/Lum1104/MER-Factory
- 📖 Read the Docs: https://lum1104.github.io/MER-Factory/
- 🐛 Report Issues: GitHub Issues
- 💬 Start a Discussion: GitHub Discussions
Let's build the future of Affective Computing together!