ICCV 2023 Oral

DDFM

Denoising Diffusion Model for Multi-Modality Image Fusion

Zixiang Zhao¹,², Haowen Bai¹, Yuanzhi Zhu², Jiangshe Zhang¹, Shuang Xu³, Yulun Zhang², Kai Zhang², Deyu Meng¹, Radu Timofte²,⁴, Luc Van Gool²

¹Xi'an Jiaotong University; ²Computer Vision Lab, ETH Zürich; ³Northwestern Polytechnical University; ⁴University of Würzburg

Core Idea

Generative priors, rectified by source images.

DDFM treats multi-modality image fusion as conditional generation. A pretrained unconditional DDPM provides natural image priors, while likelihood rectification injects infrared-visible or medical source information into each sampling step.
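
The split between a generative prior and a source-driven correction can be written with Bayes' rule on the score function. The notation below is an assumption introduced for illustration and does not appear in the text above: $f_t$ is the noisy fused image at diffusion step $t$, and $i$ and $v$ are the two source images (for example, infrared and visible).

```latex
\nabla_{f_t} \log p_t(f_t \mid i, v)
  = \nabla_{f_t} \log p_t(f_t)
  + \nabla_{f_t} \log p_t(i, v \mid f_t)
```

Read this way, the first term is the score of the unconditional prior, which a pretrained DDPM can supply, and the second is the likelihood term that the rectification step has to approximate from the source images.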

Motivation

GAN Fusion

Adversarial training is fragile

GAN-based fusion can suffer from unstable training, limited interpretability, and mode collapse.

No Target

Fusion lacks ground truth

Infrared-visible fusion (IVF) and medical image fusion (MIF) must preserve complementary cues from both sources, yet no ground-truth fused image exists to supervise training.

Prior

Diffusion models know images

Pretrained DDPMs provide a strong natural-image prior and a stable generation process.

Condition

Source fidelity still matters

The generative prior must be steered toward thermal targets, textures, and medical structures.

DDFM Contributions

Posterior Sampling

Fusion is formulated as conditional DDPM posterior sampling over the fused image.

Likelihood Rectification

Source-image constraints refine each denoised estimate inside the sampling loop.

EM Inference

A hierarchical Bayesian model turns fusion losses into tractable latent-variable inference.

No Fine-Tuning

DDFM directly uses an unconditional pretrained diffusion model for IVF and MIF.
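
To give a feel for how an EM-style scheme can make a non-smooth fusion loss tractable (the EM Inference item above), here is a minimal per-pixel sketch. It assumes an L1-style objective `|y - x| + phi * |x|`. The objective, the names `y`, `x`, `phi`, and the function `l1_fusion_irls` are all illustrative assumptions, not the paper's exact model. The sketch relies on the standard fact that a Laplace density is a scale mixture of Gaussians, so EM reduces to iteratively reweighted least squares.

```python
def l1_fusion_irls(y, phi=0.5, num_iters=60, eps=1e-8):
    """Minimise |y - x| + phi * |x| over a scalar x by EM-style reweighting.

    Illustrative only: y, x, and phi are hypothetical stand-ins for a
    per-pixel fusion residual and a balance weight.
    """
    x = 0.5 * y  # neutral starting point between 0 and y
    for _ in range(num_iters):
        # E-step analogue: each absolute-value term is replaced by a
        # weighted quadratic whose weight is the inverse current residual.
        w_fit = 1.0 / max(abs(y - x), eps)
        w_reg = phi / max(abs(x), eps)
        # M-step analogue: closed-form minimiser of
        # w_fit * (y - x)**2 + w_reg * x**2.
        x = w_fit * y / (w_fit + w_reg)
    return x


if __name__ == "__main__":
    print(l1_fusion_irls(2.0, phi=0.5))  # phi < 1: the iterate moves toward y
    print(l1_fusion_irls(2.0, phi=2.0))  # phi > 1: the iterate moves toward 0
```

Each iteration only solves a weighted quadratic, so every update is closed-form even though the underlying loss is not differentiable everywhere.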

Overview

Diffusion prior for better cross-modality fusion

Architecture

Unconditional DDPM with one-step EM rectification

DDFM computational graph for one diffusion sampling iteration: DDPM denoising, E-step, M-step, and likelihood rectification.

DDFM algorithm with DDPM denoising and EM likelihood rectification: the sampling procedure rectifies the DDPM estimate using source-image likelihood constraints.

DDFM decomposes conditional fusion into unconditional diffusion generation and likelihood rectification. The EM update steers the denoised estimate toward source-image information before the next sampling step.

DDPM estimate: Predict the denoised fused image from the current noisy state.

EM rectification: Infer latent variables and update the estimate with source-image likelihood.

Diffusion update: Sample the next state and repeat until the final fused image is generated.
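
The three steps above can be sketched as a toy sampling loop. This is a minimal illustration under stated assumptions, not the released implementation: `toy_eps_model` is a hypothetical stand-in for the pretrained noise predictor, `rectify` is a placeholder squared-error step rather than the EM update, and the linear noise schedule and all function names are assumptions chosen for the example.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule with cumulative alpha products (standard DDPM form)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)


def toy_eps_model(x_t, t):
    """Hypothetical stand-in for a pretrained noise predictor (always predicts zero)."""
    return np.zeros_like(x_t)


def rectify(f0_hat, src_a, src_b, step_size=0.5):
    """Placeholder rectification (NOT the EM update): one gradient step on
    0.5 * ||f - src_a||^2 + 0.5 * ||f - src_b||^2."""
    grad = (f0_hat - src_a) + (f0_hat - src_b)
    return f0_hat - step_size * grad


def sample_fused(src_a, src_b, eps_model=toy_eps_model, num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x_t = rng.standard_normal(src_a.shape)  # start from Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        # 1. DDPM estimate: predict the clean image from the noisy state.
        eps = eps_model(x_t, t)
        f0_hat = (x_t - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
        # 2. Rectification: pull the estimate toward the source images.
        f0_hat = rectify(f0_hat, src_a, src_b)
        if t == 0:
            return f0_hat
        # 3. Diffusion update: sample x_{t-1} from the DDPM posterior
        #    q(x_{t-1} | x_t, x_0) with x_0 replaced by the rectified estimate.
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        mean = (np.sqrt(ab_prev) * betas[t] * f0_hat
                + np.sqrt(alphas[t]) * (1.0 - ab_prev) * x_t) / (1.0 - ab_t)
        var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
        x_t = mean + np.sqrt(var) * rng.standard_normal(x_t.shape)
    return x_t
```

With the default `step_size` of 0.5, the placeholder `rectify` lands exactly on the average of the two sources, so the loop only demonstrates control flow. Replacing `rectify` with a likelihood-based update, and `toy_eps_model` with a real pretrained network, is where the actual fusion behavior would come from.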

Qualitative Results

Infrared-visible and medical fusion examples

DDFM infrared-visible fusion comparison on M³FD

M³FD: Infrared-Visible Fusion

DDFM preserves thermal targets while keeping visible-scene texture and natural image appearance.

DDFM quantitative comparison for infrared-visible image fusion: metrics on MSRS, M³FD, RoadScene, and TNO.

DDFM quantitative comparison for medical image fusion: Harvard MRI-CT.

Release

Code, configs, and sampling scripts

Pretrained Prior: Uses the public 256×256 unconditional guided-diffusion checkpoint.

Sampling Code: Inference is provided through sample.py and YAML configs.

IVF + MIF: The same sampling framework supports infrared-visible and medical fusion.

Citation

BibTeX

@InProceedings{Zhao_2023_ICCV,
  author    = {Zhao, Zixiang and Bai, Haowen and Zhu, Yuanzhi and Zhang, Jiangshe and Xu, Shuang and Zhang, Yulun and Zhang, Kai and Meng, Deyu and Timofte, Radu and Van Gool, Luc},
  title     = {DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2023},
  pages     = {8082-8093}
}