Dynin-Omni: Omnimodal Unified Large Diffusion Language Model
Introduction
Unified masked-diffusion modeling across textual reasoning, image generation, image editing, multi-modal understanding, text-to-speech, and speech-to-text.
Dynin-Omni is an 8B-scale masked-diffusion foundation model that unifies text, image, video, and speech understanding and generation within a single architecture.
Unlike autoregressive (AR) unified models that serialize heterogeneous modalities into a left-to-right sequence, Dynin-Omni models all modalities as discrete tokens in a shared vocabulary and performs generation via iterative masked denoising. This enables bidirectional context modeling, parallel multi-token prediction, and globally conditioned any-to-any inference without modality-specific expert decoders.
Training proceeds in three stages: (1) modality adaptation, (2) omni-modal supervised fine-tuning with model merging, and (3) continual capability scaling.
vLLM-Omni
Dynin-Omni support is planned to be integrated into vLLM-Omni in version 0.18.0. Once the integration is released, this section will be updated with the official setup and usage instructions.
Prerequisites
Direct local-machine inference and training with Dynin-Omni are supported. Follow the instructions below.
Environment
Clone this repository:
git clone https://github.com/AIDASLab/Dynin-Omni.git
cd Dynin-Omni
Create and activate a conda environment:
conda create -n dynin_omni python=3.10
conda activate dynin_omni
Initialize the environment (installs and builds Python packages):
bash scripts/init_env.sh --overwrite
`--overwrite` forces the Hugging Face cache root to `datasets/huggingface` under the project root. Without `--overwrite`, the cache root is resolved with the precedence `HF_CACHE_DIR` > `HF_HOME` > project default.
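The cache-root precedence above can be sketched as a small helper. This is an illustration only: `resolve_hf_cache_root` is a hypothetical function, not part of the repository; the actual resolution happens inside `scripts/init_env.sh`.

```python
import os

def resolve_hf_cache_root(project_root: str) -> str:
    """Resolve the Hugging Face cache root with the documented precedence:
    HF_CACHE_DIR > HF_HOME > project default (datasets/huggingface)."""
    for var in ("HF_CACHE_DIR", "HF_HOME"):
        value = os.environ.get(var)
        if value:
            return value
    # Neither variable is set: fall back to the project-local default.
    return os.path.join(project_root, "datasets", "huggingface")
```

Passing `--overwrite` corresponds to skipping the environment lookup entirely and always using the project default.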
Inference
Dynin-Omni performs multimodal inference through iterative masked denoising. Target tokens are initialized as masks and refined over diffusion steps.
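The denoising procedure can be illustrated with a toy loop, assuming a confidence-based unmasking schedule (a common choice for masked-diffusion decoders; the model's exact schedule is configured elsewhere). `predict` is a stand-in for the real network: it returns a (token, confidence) pair for one masked position.

```python
MASK = -1  # sentinel for a still-masked position

def denoise(length, steps, predict):
    """Toy masked-denoising loop: start from an all-mask sequence and,
    at each step, commit the highest-confidence predictions in parallel."""
    tokens = [MASK] * length
    per_step = max(1, -(-length // steps))  # ceil(length / steps) per step
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Query the (stand-in) model at every masked position.
        preds = {i: predict(tokens, i) for i in masked}
        # Commit only the most confident predictions this step.
        for i in sorted(masked, key=lambda i: -preds[i][1])[:per_step]:
            tokens[i] = preds[i][0]
    return tokens
```

A real model would score the full vocabulary at every masked position from bidirectional context; the loop structure (all-mask initialization, parallel refinement over a fixed number of steps) is the part this sketch shows.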
Entrypoint script:
bash scripts/inference.sh [--text|--i2i|--mmu|--speech|--t2i] [options]
The default configuration is `configs/dynin_omni_demo.yaml`. `--result` defaults to `results/<mode>`.
1. Text Generation
Masked-diffusion text generation with block-wise decoding.
Validation script: validation/generate.py.
bash scripts/inference.sh --text
- Input questions (default): `validation/data/text/lm_questions.jsonl`. `jsonl` format: one sample per line (e.g., `{"question":"..."}`).
- Optional override: `--questions-file`.
- Fallback behavior: a built-in demo question is used when the file is missing or empty.
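A custom questions file can be produced with a few lines of Python. `write_questions` is a hypothetical helper shown here only to make the expected `jsonl` schema concrete.

```python
import json
import pathlib

def write_questions(path, questions):
    """Write one JSON object per line, matching the expected
    {"question": "..."} schema of lm_questions.jsonl."""
    path = pathlib.Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for q in questions:
            f.write(json.dumps({"question": q}) + "\n")
```

The resulting file can then be passed via `--questions-file`.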
2. Multi-Modal Understanding (Image and Video)
Validation script: validation/mmu_generate.py.
bash scripts/inference.sh --mmu
- Image directory (`.jpg`/`.jpeg`/`.png`/`.webp`): default `validation/data/image` (override with `--mmu-image-root`).
- Video directory (`.mp4`/`.mov`/`.avi`/`.mkv`/`.webm`): default `validation/data/video` (create if absent, or override with `--video-image-root`).
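File discovery by extension can be sketched as follows; this is an illustrative helper, not the repository's actual loader, and assumes a flat directory as the defaults above suggest.

```python
import pathlib

# Extensions accepted by the image and video directories above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv", ".webm"}

def collect_media(root, exts):
    """Return sorted files under `root` whose suffix (case-insensitive)
    is in `exts`; an absent directory yields an empty list."""
    root = pathlib.Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.suffix.lower() in exts)
```

Returning an empty list for a missing directory mirrors the "create if absent" note for the video directory.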
3. Text-to-Image Generation
Discrete image tokens are generated via parallel masked refinement, followed by deterministic detokenization.
Validation script: validation/t2i_generate.py.
bash scripts/inference.sh --t2i
- Input data (default): `validation/data/text/t2i_metadata.jsonl`. `jsonl` format: one sample per line, e.g. `{"id":"t2i-00000","prompt":"..."}` (`prompt` is required).
- Optional alternative: `--validation-prompts-file` with a plain-text file (one prompt per line) instead of `jsonl`.
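A metadata file in the expected shape can be generated as below. `write_t2i_metadata` is a hypothetical helper; the `t2i-00000` numbering follows the example record above.

```python
import json
import pathlib

def write_t2i_metadata(path, prompts):
    """One sample per line; `prompt` is required and `id` follows
    the zero-padded t2i-00000 numbering of the default file."""
    path = pathlib.Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for i, prompt in enumerate(prompts):
            f.write(json.dumps({"id": f"t2i-{i:05d}", "prompt": prompt}) + "\n")
```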
4. Image-to-Image Generation (Image Editing)
Validation script: validation/i2i_generate.py.
bash scripts/inference.sh --i2i
- Input `json` (default): `validation/data/text/i2i_edits.json`. `json` format: each item includes `id` (source image filename) and `prompt`.
- Source image directory (default): `validation/data/image` (override with `--origin-img-root`).
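An edits file can be assembled as sketched below. `write_i2i_edits` is an illustrative helper (not part of the repo) that also drops entries whose source image is missing, so every `id` in the output resolves under the image root.

```python
import json
import pathlib

def write_i2i_edits(path, edits, image_root):
    """`edits` maps a source image filename (`id`) to an edit `prompt`.
    Entries whose source image is missing under `image_root` are skipped."""
    image_root = pathlib.Path(image_root)
    items = [
        {"id": name, "prompt": prompt}
        for name, prompt in edits.items()
        if (image_root / name).is_file()
    ]
    pathlib.Path(path).write_text(json.dumps(items, indent=2), encoding="utf-8")
    return items
```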
5. Speech (ASR and TTS)
Speech recognition and synthesis are performed within the same token-level diffusion backbone without a modality-specific decoder.
Validation script: validation/speech.py.
bash scripts/inference.sh --speech
- Default source: LibriSpeech ASR test split from Hugging Face (`openslr/librispeech_asr`).
- Optional local audio root: `--librispeech-root` (directory containing LibriSpeech `.flac` files).
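A local root can be sanity-checked before running by listing its `.flac` utterances. This illustrative helper assumes the standard LibriSpeech speaker/chapter/utterance directory layout.

```python
import pathlib

def find_flac_files(librispeech_root):
    """Recursively collect the .flac utterances a local LibriSpeech
    root is expected to contain (speaker/chapter/utterance layout)."""
    root = pathlib.Path(librispeech_root)
    return sorted(root.rglob("*.flac"))
```

An empty result usually means the path passed to `--librispeech-root` points above or below the actual corpus directory.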
Training
Training configurations (datasets, hyperparameters, etc.) are defined in configs/*.yaml.
scripts/train.sh path variables (CONFIG_FILE, TRAIN_SCRIPT, EXPERIMENT_CFG, LOG_DIR) must be specified as project-root-relative paths.
The examples below assume a single-node setup; host/runtime variables should be adapted to the target environment.
Accelerate configuration can be prepared by running:
accelerate config
Predefined configurations are also available in accelerate_configs/:
accelerate_configs/
├── 1_gpu.yaml
├── 1_node_8_gpus_deepspeed_zero2.yaml
├── 1_node_8_gpus_deepspeed_zero3.yaml
└── 8_node_8_gpus_deepspeed_zero2.yaml
Stage 1 Omni-Modal Pretraining
Stage 1 adapts newly introduced modalities (video and speech) to the masked-diffusion backbone. The following modality directions are activated:
- Video → Text (Video Captioning)
- Speech → Text (ASR)
- Text → Speech (TTS)
This stage anchors video and speech tokens into the shared semantic token space under text supervision.
CONFIG_FILE=accelerate_configs/1_node_8_gpus_deepspeed_zero2.yaml \
EXPERIMENT_CFG=configs/dynin_omni_stage1_llada_instruct.yaml \
TRAIN_SCRIPT=training/train_dynin_omni_stage1.py \
./scripts/train.sh
Stage 1 starts from the MMaDA-8B-MixCoT backbone checkpoint and extends it to support video and speech modalities through vocabulary expansion and text-centric alignment.
Stage 2 Omni-Modal Supervised Fine-Tuning
Stage 2 continues from the Stage 1 checkpoint and performs full omni-modal supervised fine-tuning.
Activated modality directions:
- Text → Text (Chat & Reasoning)
- Image → Text, Video → Text (Multi-Modal Understanding)
- Text → Image (Image Generation)
- Image → Image (Image Editing)
- Speech → Text (ASR)
- Text → Speech (TTS)
Before training, model merging is applied between the original backbone and the Stage 1 checkpoint to mitigate catastrophic forgetting. Explicit <EOS> supervision enables stable variable-length generation across modalities.
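The merging step can be illustrated with a linear weight interpolation over matching parameters, a common baseline for mitigating forgetting. This is a sketch under that assumption; the exact merge recipe Dynin-Omni uses is defined by its training code, and parameters are modeled here as plain lists of floats rather than tensors.

```python
def merge_checkpoints(base, finetuned, alpha=0.5):
    """Linear interpolation of two state dicts:
    merged[p] = (1 - alpha) * base[p] + alpha * finetuned[p]."""
    assert base.keys() == finetuned.keys(), "checkpoints must share parameters"
    merged = {}
    for name, base_w in base.items():
        merged[name] = [(1 - alpha) * b + alpha * f
                        for b, f in zip(base_w, finetuned[name])]
    return merged
```

With `alpha=0.5` the merge averages the original backbone and the Stage 1 checkpoint; smaller `alpha` keeps the merged model closer to the backbone.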
CONFIG_FILE=accelerate_configs/1_node_8_gpus_deepspeed_zero2.yaml \
EXPERIMENT_CFG=configs/dynin_omni_stage2_llada_instruct.yaml \
TRAIN_SCRIPT=training/train_dynin_omni_stage2.py \
./scripts/train.sh
Stage 3 Continual Omni-Modal Supervised Fine-Tuning
Stage 3 continues from the Stage 2 checkpoint, retaining all modality directions while further scaling model capabilities.
Key enhancements include:
- Extended context length
- Higher-resolution image modeling
- Long-form speech generation (up to 21 seconds)
- Thinking-mode control (`\think`/`o_think`)
- Chain-of-thought supervision
- Increased synthetic data for reasoning and generation
This stage improves reasoning depth, perception granularity, and long-form generation while preserving the unified masked-diffusion objective.
CONFIG_FILE=accelerate_configs/1_node_8_gpus_deepspeed_zero2.yaml \
EXPERIMENT_CFG=configs/dynin_omni_stage3_llada_instruct.yaml \
TRAIN_SCRIPT=training/train_dynin_omni_stage3.py \
./scripts/train.sh
Stage 3 starts from the Stage 2 checkpoint specified in configs/dynin_omni_stage3_llada_instruct.yaml and performs continual capability scaling under the same unified diffusion objective.
Evaluation
Dynin-Omni achieves the following results across several multimodal benchmarks as reported in the paper:
- Language reasoning: 87.6 on GSM8K
- Multimodal understanding: 1733.6 on MME-P
- Video understanding: 61.4 on VideoMME
- Image generation: 0.87 on GenEval
- Speech recognition: 2.1 WER on LibriSpeech test-clean
Detailed evaluation scripts and setups are provided in evaluation/README.md.
Citation
@article{aidaslab2026dyninomni,
title={Dynin-Omni: Omnimodal Unified Large Diffusion Language Model},
author={Kim, Jaeik and Kim, Woojin and Hong, Jihwan and Lee, Yejoon and Hyeon, Sieun and Lim, Mintaek and Han, Yunseok and Kim, Dogeun and Lee, Hoeun and Kim, Hyunggeun and Do, Jaeyoung},
journal={arXiv preprint arXiv:2604.00007},
year={2026}
}