ICCV 2023 Oral

DDFM

Denoising Diffusion Model for Multi-Modality Image Fusion

¹ Xi'an Jiaotong University · ² Computer Vision Lab, ETH Zürich · ³ Northwestern Polytechnical University · ⁴ University of Würzburg

Core Idea

Generative priors, rectified by source images.

DDFM treats multi-modality image fusion as conditional generation. A pretrained unconditional DDPM provides natural image priors, while likelihood rectification injects infrared-visible or medical source information into each sampling step.
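The conditional-generation view can be made concrete with Bayes' rule. What follows is the generic conditional-diffusion identity, not an equation copied from the paper. It uses assumed notation: $\mathbf{f}_t$ is the noisy fused image at step $t$, and $\mathbf{i}$, $\mathbf{v}$ are the infrared and visible sources. Because $p(\mathbf{i}, \mathbf{v})$ does not depend on $\mathbf{f}_t$, its gradient vanishes:

```latex
\nabla_{\mathbf{f}_t} \log p(\mathbf{f}_t \mid \mathbf{i}, \mathbf{v})
  = \underbrace{\nabla_{\mathbf{f}_t} \log p(\mathbf{f}_t)}_{\text{unconditional DDPM prior}}
  + \underbrace{\nabla_{\mathbf{f}_t} \log p(\mathbf{i}, \mathbf{v} \mid \mathbf{f}_t)}_{\text{source-image likelihood}}
```

A pretrained unconditional DDPM supplies the first term. The second term is where source information enters, and DDFM accounts for it through likelihood rectification rather than by retraining the network.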

Motivation
GAN Fusion

Adversarial training is fragile

GAN-based fusion can suffer from unstable optimization, opaque behavior, and mode collapse.

No Target

Fusion lacks ground truth

Infrared-visible image fusion (IVF) and medical image fusion (MIF) must preserve complementary cues from every source, but no ground-truth fused image exists to supervise against.

Prior

Diffusion models know images

Pretrained DDPMs provide a powerful natural image manifold for stable generation.

Condition

Source fidelity still matters

The generative prior must be steered toward thermal targets, textures, and medical structures.

DDFM Contributions

Posterior Sampling

Fusion is formulated as conditional DDPM posterior sampling over the fused image.

Likelihood Rectification

Source-image constraints refine each denoised estimate inside the sampling loop.

EM Inference

A hierarchical Bayesian model turns fusion losses into tractable latent-variable inference.

No Fine-Tuning

DDFM directly uses an unconditional pretrained diffusion model for IVF and MIF.
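The EM idea behind likelihood rectification can be illustrated on a deliberately simplified objective. This sketch is not DDFM's hierarchical Bayesian model. It assumes a per-pixel loss |f - i| + φ|f - v| plus a quadratic tie (λ/2)(f - f̂₀)² to the DDPM estimate f̂₀, and all of these terms are illustrative choices. If each L1 (Laplace) term is written as a Gaussian scale mixture, EM reduces to iteratively reweighted least squares. The E-step computes precision weights from the current residuals, and the M-step solves a weighted least-squares problem in closed form. The function name `rectify_l1` and the parameters `phi`, `lam`, `n_iter` and `eps` are hypothetical.

```python
import numpy as np


def rectify_l1(f0, ir, vis, phi=1.0, lam=0.1, n_iter=20, eps=1e-6):
    """Hypothetical EM/IRLS rectification on a simplified per-pixel objective.

    Approximately minimizes, independently at every pixel,
        |f - ir| + phi * |f - vis| + (lam / 2) * (f - f0) ** 2,
    where f0 is the DDPM's denoised estimate. This is an illustrative
    stand-in, not DDFM's actual hierarchical Bayesian model.
    """
    f0 = np.asarray(f0, dtype=float)
    ir = np.asarray(ir, dtype=float)
    vis = np.asarray(vis, dtype=float)
    f = f0.copy()
    for _ in range(n_iter):
        # E-step: each L1 term is a Gaussian scale mixture; the expected
        # inverse variance of its latent scale is proportional to 1/|residual|.
        # Residuals are clamped at eps to avoid division by zero.
        w_ir = 1.0 / np.maximum(np.abs(f - ir), eps)
        w_vis = 1.0 / np.maximum(np.abs(f - vis), eps)
        # M-step: the weighted least-squares surrogate has a closed-form
        # minimizer, a positively weighted average of ir, vis and f0.
        f = (w_ir * ir + phi * w_vis * vis + lam * f0) / (w_ir + phi * w_vis + lam)
    return f


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ir, vis = rng.random((4, 4)), rng.random((4, 4))  # toy source images in [0, 1]
    f0 = 0.5 * (ir + vis)                              # toy "DDPM estimate"
    fused = rectify_l1(f0, ir, vis, phi=1.0)
    print(fused.shape)
```

Each M-step output is a positively weighted average of the two sources and the DDPM estimate, so the result always stays within their pixelwise range. A larger `phi` pulls pixels toward the visible image.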

Overview

Diffusion prior for better cross-modality fusion

Fusion from GAN to Diffusion

Figure: DDFM overview comparing GAN fusion, likelihood rectification, and the DDFM workflow. DDFM replaces adversarial generation with diffusion sampling and rectifies the generated image with source-image likelihood constraints.

Architecture

Unconditional DDPM with one-step EM rectification

Figure: DDFM computational graph for one diffusion sampling iteration, showing DDPM denoising, the E-step, the M-step, and likelihood rectification.
Algorithm: DDFM sampling procedure, which rectifies the DDPM estimate using source-image likelihood constraints via EM.

DDFM decomposes conditional fusion into unconditional diffusion generation and likelihood rectification. The EM update steers the denoised estimate toward source-image information before the next sampling step.

DDPM estimate: predict the denoised fused image from the current noisy state.
EM rectification: infer latent variables and update the estimate with the source-image likelihood.
Diffusion update: sample the next state and repeat until the final fused image is generated.
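The three steps above can be sketched as a single sampling loop. This is a minimal NumPy sketch under stated assumptions, and it is not the code in sample.py. It assumes a linear 1000-step noise schedule, an ε-predicting network, images scaled to [-1, 1] with the clean-image estimate clipped to that range, and a diffusion update that samples x_{t-1} from the DDPM posterior q(x_{t-1} | x_t, x̂₀) given the rectified estimate. `sample_fused`, `eps_model` and `rectify` are hypothetical names. `eps_model` stands in for the pretrained DDPM and `rectify` stands in for the EM step.

```python
import numpy as np


def linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (assumed setting) and its cumulative product."""
    betas = np.linspace(beta_start, beta_end, T)
    return betas, np.cumprod(1.0 - betas)


def sample_fused(eps_model, rectify, shape, T=1000, seed=0):
    """Sketch of a rectified DDPM sampling loop.

    eps_model(x_t, t) -> predicted noise (hypothetical stand-in for the DDPM)
    rectify(x0_hat)   -> rectified estimate (hypothetical stand-in for EM)
    """
    rng = np.random.default_rng(seed)
    betas, abar = linear_schedule(T)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise x_T
    for t in range(T - 1, -1, -1):
        abar_t = abar[t]
        abar_prev = abar[t - 1] if t > 0 else 1.0
        # 1) DDPM estimate: predict the clean image from the noisy state
        eps = eps_model(x, t)
        x0_hat = (x - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
        x0_hat = np.clip(x0_hat, -1.0, 1.0)
        # 2) Rectification: steer the estimate toward the source images
        x0_rect = rectify(x0_hat)
        # 3) Diffusion update: mean of q(x_{t-1} | x_t, x0_rect)
        coef_x0 = np.sqrt(abar_prev) * betas[t] / (1.0 - abar_t)
        coef_xt = np.sqrt(1.0 - betas[t]) * (1.0 - abar_prev) / (1.0 - abar_t)
        mean = coef_x0 * x0_rect + coef_xt * x
        if t > 0:
            var = betas[t] * (1.0 - abar_prev) / (1.0 - abar_t)
            x = mean + np.sqrt(var) * rng.standard_normal(shape)
        else:
            x = mean  # at t = 0 the mean equals the rectified estimate
    return x


if __name__ == "__main__":
    ir = np.full((8, 8), 0.6)    # toy infrared image in [-1, 1]
    vis = np.full((8, 8), -0.2)  # toy visible image in [-1, 1]
    fused = sample_fused(
        eps_model=lambda x, t: np.zeros_like(x),          # dummy network
        rectify=lambda x0: 0.5 * x0 + 0.25 * (ir + vis),  # crude placeholder
        shape=ir.shape,
    )
    print(fused.shape)
```

Rectification is applied to the clean-image estimate rather than to the noisy state, so the network never sees source images directly. The sketch keeps this separation, which mirrors the page's point that the pretrained DDPM is used without fine-tuning.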

Quantitative Results

Infrared-visible and medical fusion benchmarks

Figure: DDFM quantitative comparison for infrared-visible image fusion on MSRS, M3FD, RoadScene, and TNO.
Figure: DDFM quantitative comparison for medical image fusion on Harvard MRI-CT.

Release

Code, configs, and sampling scripts

Pretrained prior: uses the public 256×256 unconditional guided-diffusion checkpoint.
Sampling code: inference is provided through sample.py and YAML configs.
IVF + MIF: the same sampling framework supports infrared-visible and medical fusion.
Code repository: hosted on GitHub.

Citation

BibTeX

@InProceedings{Zhao_2023_ICCV,
  author    = {Zhao, Zixiang and Bai, Haowen and Zhu, Yuanzhi and Zhang, Jiangshe and Xu, Shuang and Zhang, Yulun and Zhang, Kai and Meng, Deyu and Timofte, Radu and Van Gool, Luc},
  title     = {DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2023},
  pages     = {8082-8093}
}