Image Fusion via Vision-Language Model

1Xi'an Jiaotong University 2ETH Zürich 3Northwestern Polytechnical University
4Shanghai Jiao Tong University 5Heriot-Watt University 6KU Leuven 7INSAIT
ICML 2024

Figure 1: Workflow of our Fusion via vIsion-Language Model (FILM). Input images are first processed to create prompts for ChatGPT, which then generates detailed textual descriptions. These descriptions are encoded into textual features by the frozen BLIP2 model. The textual features are then fused and, via cross-attention, guide the extraction and fusion of visual features, enriching contextual understanding with text-based semantic information. Finally, the fused image is produced by the image decoder.
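The description-generation step of this workflow can be pictured with a short sketch. The following uses the OpenAI Python client; the model name, prompt wording, and the describe_image helper are illustrative assumptions, not the exact prompts or interface used by FILM.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_image(semantic_prompt: str) -> str:
    # Turn per-image semantic cues (e.g. detected objects, short captions)
    # into one comprehensive paragraph describing the scene.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You describe the content of an image in one detailed paragraph."},
            {"role": "user",
             "content": f"Image content cues: {semantic_prompt}. "
                        "Write a comprehensive paragraph describing the scene."},
        ],
    )
    return response.choices[0].message.content

# Example with cues taken from the infrared image of an infrared-visible pair
print(describe_image("a pedestrian crossing a dim street; a car with its headlights on"))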

Abstract

Image fusion integrates essential information from multiple images into a single composite, enhancing structures and textures while refining imperfections. Existing methods predominantly focus on pixel-level and semantic visual features for recognition, but often overlook the deeper text-level semantic information beyond vision. Therefore, we introduce a novel fusion paradigm, image Fusion via vIsion-Language Model (FILM), which for the first time utilizes explicit textual information from source images to guide the fusion process. Specifically, FILM generates semantic prompts from the source images and feeds them into ChatGPT to obtain comprehensive textual descriptions. These descriptions are fused in the textual domain and, via cross-attention, guide the extraction and fusion of visual features, enhancing contextual understanding with textual semantic information. FILM shows promising results on four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion. We also propose a vision-language dataset containing ChatGPT-generated paragraph descriptions for eight image fusion datasets across these four tasks, facilitating future research on vision-language model-based image fusion.

Architecture of FILM

Figure 2: Network pipeline of our FILM, which comprises three components: (1) text paragraph generation and text feature fusion, (2) language-guided vision feature fusion via cross-attention, and (3) vision feature decoding, corresponding to the first, second, and third columns, respectively.
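To make the second column concrete, here is a minimal PyTorch sketch of language-guided vision feature fusion, in which visual tokens attend to the fused textual features through cross-attention. The class name, dimensions, and residual/normalization choices are illustrative assumptions and do not correspond to the released FILM implementation.

import torch
import torch.nn as nn

class TextGuidedCrossAttention(nn.Module):
    # Visual tokens are the queries; fused text features are the keys/values,
    # so textual semantics re-weight the visual features.
    def __init__(self, vis_dim=256, txt_dim=768, num_heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)  # map text features to the vision width
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, N_v, vis_dim) visual features from the image encoder
        # txt_tokens: (B, N_t, txt_dim) fused textual features of the descriptions
        txt = self.txt_proj(txt_tokens)
        attended, _ = self.attn(query=vis_tokens, key=txt, value=txt)
        return self.norm(vis_tokens + attended)  # residual connection

block = TextGuidedCrossAttention()
v = torch.randn(2, 1024, 256)  # e.g. a 32x32 feature map flattened to tokens
t = torch.randn(2, 32, 768)    # e.g. 32 text tokens from the fused descriptions
print(block(v, t).shape)       # torch.Size([2, 1024, 256])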

Vision-Language Fusion (VLF) Dataset

Considering the high computational cost of invoking various vision-language components, and to facilitate subsequent research on image fusion based on vision-language models, we propose the VLF dataset. It contains paired paragraph descriptions generated by ChatGPT, covering all image pairs from the training and test sets of eight widely used fusion datasets.

These include paragraph descriptions of:

  • Infrared-visible image fusion (IVF): MSRS, M3FD, and RoadScene datasets;
  • Medical image fusion (MIF): Harvard dataset;
  • Multi-exposure image fusion (MEF): SICE and MEFB datasets;
  • Multi-focus image fusion (MFF): RealMFF and Lytro datasets.

The dataset is available for download via Google Drive.
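Once downloaded, each description can be paired with its corresponding source images. The sketch below assumes a hypothetical layout in which every image xxx.png has a matching description xxx.txt per modality; the folder names, file extensions, and grayscale conversion are assumptions to be adjusted to the actual release.

from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class VLFPairs(Dataset):
    # Pairs the two source images of a scene with their ChatGPT paragraph
    # descriptions; directory layout and file naming are assumptions.
    def __init__(self, img_dir_a, img_dir_b, txt_dir_a, txt_dir_b):
        self.dirs = (Path(img_dir_a), Path(img_dir_b), Path(txt_dir_a), Path(txt_dir_b))
        self.names = sorted(p.stem for p in self.dirs[0].glob("*.png"))

    def __len__(self):
        return len(self.names)

    def __getitem__(self, i):
        name = self.names[i]
        dir_a, dir_b, txt_a, txt_b = self.dirs
        return {
            "img_a": Image.open(dir_a / f"{name}.png").convert("L"),
            "img_b": Image.open(dir_b / f"{name}.png").convert("L"),
            "desc_a": (txt_a / f"{name}.txt").read_text(),
            "desc_b": (txt_b / f"{name}.txt").read_text(),
        }

# Example with placeholder paths for an infrared-visible split:
# data = VLFPairs("MSRS/ir", "MSRS/vi", "VLF/MSRS/ir_text", "VLF/MSRS/vi_text")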

[Notice]: Given the immense workload involved in creating this dataset, some errors may remain, so we have opened a Google Form for error-correction feedback. Please submit your suggestions for correcting any errors in the VLF dataset there. If you have any questions regarding the Google Form, please contact Zixiang Zhao via email.

Visualization of the VLF dataset:

Figure 3: Visualization of the VLF dataset creation process and representative data displays.

More detailed images of the VLF dataset:

Experimental Results

Infrared-visible image fusion (IVF):

Medical image fusion (MIF):

Multi-exposure image fusion (MEF):

Multi-focus image fusion (MFF):

BibTeX


      @inproceedings{Zhao_2024_ICML,
        title={Image Fusion via Vision-Language Model},
        author={Zixiang Zhao and Lilun Deng and Haowen Bai and Yukun Cui and Zhipeng Zhang and Yulun Zhang and Haotong Qin and Dongdong Chen and Jiangshe Zhang and Peng Wang and Luc Van Gool},
        booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
        year={2024},
      }
        

Related Works

  • Equivariant Multi-Modality Image Fusion. CVPR 2024.
    Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Kai Zhang, Shuang Xu, Dongdong Chen, Radu Timofte, Luc Van Gool.
    @inproceedings{Zhao_2024_CVPR,
      author = {Zhao, Zixiang and Bai, Haowen and Zhang, Jiangshe and Zhang, Yulun and Zhang, Kai and Xu, Shuang and Chen, Dongdong and Timofte, Radu and Van Gool, Luc},
      title = {Equivariant Multi-Modality Image Fusion},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month = {June},
      year = {2024},
      pages = {25912-25921}
    }
  • DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion. ICCV 2023 (Oral).
    Zixiang Zhao, Haowen Bai, Yuanzhi Zhu, Jiangshe Zhang, Shuang Xu, Yulun Zhang, Kai Zhang, Deyu Meng, Radu Timofte, Luc Van Gool.
    @inproceedings{Zhao_2023_ICCV,
      author = {Zhao, Zixiang and Bai, Haowen and Zhu, Yuanzhi and Zhang, Jiangshe and Xu, Shuang and Zhang, Yulun and Zhang, Kai and Meng, Deyu and Timofte, Radu and Van Gool, Luc},
      title = {DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion},
      booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
      month = {October},
      year = {2023},
      pages = {8082-8093}
    }
  • CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion. CVPR 2023.
    Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Shuang Xu, Zudi Lin, Radu Timofte, Luc Van Gool.
    @inproceedings{Zhao_2023_CVPR,
      author = {Zhao, Zixiang and Bai, Haowen and Zhang, Jiangshe and Zhang, Yulun and Xu, Shuang and Lin, Zudi and Timofte, Radu and Van Gool, Luc},
      title = {CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month = {June},
      year = {2023},
      pages = {5906-5916}
    }
  • DIDFuse: Deep Image Decomposition for Infrared and Visible Image Fusion. IJCAI 2020.
    Zixiang Zhao, Shuang Xu, Chunxia Zhang, Junmin Liu, Jiangshe Zhang and Pengfei Li.
    @inproceedings{DBLP:conf/ijcai/ZhaoXZLZL20,
      author = {Zixiang Zhao and Shuang Xu and Chunxia Zhang and Junmin Liu and Jiangshe Zhang and Pengfei Li},
      title = {DIDFuse: Deep Image Decomposition for Infrared and Visible Image Fusion},
      booktitle = {Proceedings of the International Joint Conference on Artificial Intelligence ({IJCAI})},
      pages = {970--976},
      year = {2020}
    }
  • Efficient and Model-Based Infrared and Visible Image Fusion via Algorithm Unrolling. IEEE Transactions on Circuits and Systems for Video Technology 2021.
    Zixiang Zhao, Shuang Xu, Jiangshe Zhang, Chengyang Liang, Chunxia Zhang and Junmin Liu.
    @article{zhao2021efficient,
      title = {Efficient and model-based infrared and visible image fusion via algorithm unrolling},
      author = {Zhao, Zixiang and Xu, Shuang and Zhang, Jiangshe and Liang, Chengyang and Zhang, Chunxia and Liu, Junmin},
      journal = {IEEE Transactions on Circuits and Systems for Video Technology},
      volume = {32},
      number = {3},
      pages = {1186--1196},
      year = {2021},
      publisher = {IEEE}
    }

License

FILM is licensed under a CC BY-NC-SA 4.0 License.