Image Fusion via Vision-Language Model

1Xi'an Jiaotong University 2ETH Zürich 3Northwestern Polytechnical University
4Shanghai Jiao Tong University 5Heriot-Watt University 6KU Leuven 7INSAIT
ICML 2024

Figure 1: Workflow of our Fusion via vIsion-Language Model (FILM). Input images are first processed to create prompts for ChatGPT, which then generates detailed textual descriptions. These descriptions are encoded into textual features by the frozen BLIP2 model. The textual features are then fused and, via cross-attention, guide the extraction and fusion of visual features, enriching contextual understanding with text-based semantic information. Finally, the fused image is produced by the image decoder.
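The description-generation step of this workflow can be pictured with a short sketch. The following uses the OpenAI Python client; the model name, prompt wording, and the describe_image helper are illustrative assumptions, not the exact prompts or interface used by FILM.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_image(semantic_prompt: str) -> str:
    # Turn per-image semantic cues (e.g. detected objects, short captions)
    # into one comprehensive paragraph describing the scene.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You describe the content of an image in one detailed paragraph."},
            {"role": "user",
             "content": f"Image content cues: {semantic_prompt}. "
                        "Write a comprehensive paragraph describing the scene."},
        ],
    )
    return response.choices[0].message.content

# Example with cues taken from the infrared image of an infrared-visible pair
print(describe_image("a pedestrian crossing a dim street; a car with its headlights on"))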

Abstract

Image fusion integrates essential information from multiple images into a single composite, enhancing structures and textures while refining imperfections. Existing methods predominantly focus on pixel-level and semantic visual features for recognition, but often overlook the deeper text-level semantic information beyond vision. Therefore, we introduce a novel fusion paradigm, image Fusion via vIsion-Language Model (FILM), which for the first time utilizes explicit textual information from source images to guide the fusion process. Specifically, FILM generates semantic prompts from the source images and feeds them into ChatGPT to obtain comprehensive textual descriptions. These descriptions are fused in the textual domain and, via cross-attention, guide the extraction and fusion of visual features, enhancing contextual understanding with textual semantic information. FILM shows promising results on four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion. We also propose a vision-language dataset containing ChatGPT-generated paragraph descriptions for eight image fusion datasets across these four tasks, facilitating future research on vision-language model-based image fusion.

Architecture of FILM

Figure 2: Network pipeline of our FILM, which comprises three components: (1) text paragraph generation and text feature fusion, (2) language-guided vision feature fusion via cross-attention, and (3) vision feature decoding, corresponding to the first, second, and third columns, respectively.
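To make the second column concrete, here is a minimal PyTorch sketch of language-guided vision feature fusion, in which visual tokens attend to the fused textual features through cross-attention. The class name, dimensions, and residual/normalization choices are illustrative assumptions and do not correspond to the released FILM implementation.

import torch
import torch.nn as nn

class TextGuidedCrossAttention(nn.Module):
    # Visual tokens are the queries; fused text features are the keys/values,
    # so textual semantics re-weight the visual features.
    def __init__(self, vis_dim=256, txt_dim=768, num_heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)  # map text features to the vision width
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, N_v, vis_dim) visual features from the image encoder
        # txt_tokens: (B, N_t, txt_dim) fused textual features of the descriptions
        txt = self.txt_proj(txt_tokens)
        attended, _ = self.attn(query=vis_tokens, key=txt, value=txt)
        return self.norm(vis_tokens + attended)  # residual connection

block = TextGuidedCrossAttention()
v = torch.randn(2, 1024, 256)  # e.g. a 32x32 feature map flattened to tokens
t = torch.randn(2, 32, 768)    # e.g. 32 text tokens from the fused descriptions
print(block(v, t).shape)       # torch.Size([2, 1024, 256])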

Vision-Language Fusion (VLF) Dataset

Considering the high computational cost of invoking various vision-language components, and to facilitate subsequent research on image fusion based on vision-language models, we propose the VLF dataset. It contains paired paragraph descriptions generated by ChatGPT, covering all image pairs from the training and test sets of eight widely used fusion datasets.

These include paragraph descriptions of:

  • Infrared-visible image fusion (IVF): MSRS, M3FD, and RoadScene datasets;
  • Medical image fusion (MIF): Harvard dataset;
  • Multi-exposure image fusion (MEF): SICE and MEFB datasets;
  • Multi-focus image fusion (MFF): RealMFF and Lytro datasets.

The dataset is available for download via Google Drive.
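Once downloaded, each description can be paired with its corresponding source images. The sketch below assumes a hypothetical layout in which every image xxx.png has a matching description xxx.txt per modality; the folder names, file extensions, and grayscale conversion are assumptions to be adjusted to the actual release.

from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class VLFPairs(Dataset):
    # Pairs the two source images of a scene with their ChatGPT paragraph
    # descriptions; directory layout and file naming are assumptions.
    def __init__(self, img_dir_a, img_dir_b, txt_dir_a, txt_dir_b):
        self.dirs = (Path(img_dir_a), Path(img_dir_b), Path(txt_dir_a), Path(txt_dir_b))
        self.names = sorted(p.stem for p in self.dirs[0].glob("*.png"))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, i):
        name = self.names[i]
        dir_a, dir_b, txt_a, txt_b = self.dirs
        return {
            "img_a": Image.open(dir_a / f"{name}.png").convert("L"),
            "img_b": Image.open(dir_b / f"{name}.png").convert("L"),
            "desc_a": (txt_a / f"{name}.txt").read_text(),
            "desc_b": (txt_b / f"{name}.txt").read_text(),
        }

# Example with placeholder paths for an infrared-visible split:
# data = VLFPairs("MSRS/ir", "MSRS/vi", "VLF/MSRS/ir_text", "VLF/MSRS/vi_text")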

[Notice]: Given the immense workload involved in creating this dataset, some errors may remain, so we have opened a Google Form for error-correction feedback. Please submit your suggestions for correcting any errors in the VLF dataset there. If you have any questions regarding the Google Form, please contact Zixiang Zhao via email.

Visualization of the VLF dataset:

Figure 3: Visualization of the VLF dataset creation process and representative data displays.

More detailed images of the VLF dataset:

Experimental Results

Infrared-visible image fusion (IVF):

Medical image fusion (MIF):

Multi-exposure image fusion (MEF):

Multi-focus image fusion (MFF):

BibTeX


      @inproceedings{Zhao_2024_ICML,
        title={Image Fusion via Vision-Language Model},
        author={Zixiang Zhao and Lilun Deng and Haowen Bai and Yukun Cui and Zhipeng Zhang and Yulun Zhang and Haotong Qin and Dongdong Chen and Jiangshe Zhang and Peng Wang and Luc Van Gool},
        booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
        year={2024},
      }
        

Related Works

  • Equivariant Multi-Modality Image Fusion. CVPR 2024.
    Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Kai Zhang, Shuang Xu, Dongdong Chen, Radu Timofte, Luc Van Gool.
    @inproceedings{Zhao_2024_CVPR,
      author = {Zhao, Zixiang and Bai, Haowen and Zhang, Jiangshe and Zhang, Yulun and Zhang, Kai and Xu, Shuang and Chen, Dongdong and Timofte, Radu and Van Gool, Luc},
      title = {Equivariant Multi-Modality Image Fusion},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month = {June},
      year = {2024},
      pages = {25912-25921}
    }
  • DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion. ICCV 2023 (Oral).
    Zixiang Zhao, Haowen Bai, Yuanzhi Zhu, Jiangshe Zhang, Shuang Xu, Yulun Zhang, Kai Zhang, Deyu Meng, Radu Timofte, Luc Van Gool.
    @inproceedings{Zhao_2023_ICCV,
      author = {Zhao, Zixiang and Bai, Haowen and Zhu, Yuanzhi and Zhang, Jiangshe and Xu, Shuang and Zhang, Yulun and Zhang, Kai and Meng, Deyu and Timofte, Radu and Van Gool, Luc},
      title = {DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion},
      booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
      month = {October},
      year = {2023},
      pages = {8082-8093}
    }
  • CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion. CVPR 2023.
    Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Shuang Xu, Zudi Lin, Radu Timofte, Luc Van Gool.
    @inproceedings{Zhao_2023_CVPR,
      author = {Zhao, Zixiang and Bai, Haowen and Zhang, Jiangshe and Zhang, Yulun and Xu, Shuang and Lin, Zudi and Timofte, Radu and Van Gool, Luc},
      title = {CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month = {June},
      year = {2023},
      pages = {5906-5916}
    }
  • DIDFuse: Deep Image Decomposition for Infrared and Visible Image Fusion. IJCAI 2020.
    Zixiang Zhao, Shuang Xu, Chunxia Zhang, Junmin Liu, Jiangshe Zhang and Pengfei Li.
    @inproceedings{DBLP:conf/ijcai/ZhaoXZLZL20,
      author = {Zixiang Zhao and Shuang Xu and Chunxia Zhang and Junmin Liu and Jiangshe Zhang and Pengfei Li},
      title = {DIDFuse: Deep Image Decomposition for Infrared and Visible Image Fusion},
      booktitle = {Proceedings of the International Joint Conference on Artificial Intelligence ({IJCAI})},
      pages = {970--976},
      year = {2020}
    }
  • Efficient and Model-Based Infrared and Visible Image Fusion via Algorithm Unrolling. IEEE Transactions on Circuits and Systems for Video Technology 2021.
    Zixiang Zhao, Shuang Xu, Jiangshe Zhang, Chengyang Liang, Chunxia Zhang and Junmin Liu.
    @article{zhao2021efficient,
      title = {Efficient and model-based infrared and visible image fusion via algorithm unrolling},
      author = {Zhao, Zixiang and Xu, Shuang and Zhang, Jiangshe and Liang, Chengyang and Zhang, Chunxia and Liu, Junmin},
      journal = {IEEE Transactions on Circuits and Systems for Video Technology},
      volume = {32},
      number = {3},
      pages = {1186--1196},
      year = {2021},
      publisher = {IEEE}
    }

License

FILM is licensed under a CC BY-NC-SA 4.0 License.