Segment Anything with Multiple Modalities
Aoran Xiao1✮
Weihao Xuan2,3✮
Heli Qi4
Yun Xing1
Naoto Yokoya2,3✉
Shijian Lu1✉
Nanyang Technological University, Singapore1
The University of Tokyo, Japan2
RIKEN AIP, Japan3
Nara Institute of Science and Technology, Japan4
✮ Equally contributing first authors; ✉ Co-corresponding authors
[arXiv]
[code]

Abstract

Robust and accurate segmentation of scenes has become a core functionality in various visual recognition and navigation tasks. This has inspired the recent development of the Segment Anything Model (SAM), a foundation model for general mask segmentation. However, SAM is largely tailored for single-modal RGB images, limiting its applicability to multi-modal data captured with widely adopted sensor suites, such as LiDAR plus RGB, depth plus RGB, and thermal plus RGB. We develop MM-SAM, an extension and expansion of SAM that supports cross-modal and multi-modal processing for robust and enhanced segmentation with different sensor suites. MM-SAM features two key designs, namely unsupervised cross-modal transfer and weakly-supervised multi-modal fusion, which enable label-efficient and parameter-efficient adaptation toward various sensor modalities. It addresses three main challenges: 1) adaptation toward diverse non-RGB sensors for single-modal processing, 2) synergistic processing of multi-modal data via sensor fusion, and 3) mask-free training for different downstream tasks. Extensive experiments show that MM-SAM consistently outperforms SAM by large margins, demonstrating its effectiveness and robustness across various sensors and data modalities.


Method

Objective

MM-SAM extends and expands SAM toward multi-modal data captured with various sensor suites, enabling cross-modal and multi-modal segmentation without requiring mask annotations across different downstream tasks.


Overview of MM-SAM. MM-SAM freezes the entire SAM architecture and tunes lightweight add-on modules with multi-modal pairs (RGB plus a non-RGB modality X) to achieve cross-modal and multi-modal segmentation. It consists of two novel tuning modules: 1) Unsupervised Cross-Modal Transfer (UCMT) introduces a modality-specific patch embedding module and low-rank adaptation (LoRA) structures into SAM's image encoder for extracting modality-specific X embeddings, where an embedding unification loss ($L_U$) aligns the X embeddings with SAM's RGB image embeddings to ensure segmentation compatibility; 2) Weakly-supervised Multi-Modal Fusion (WMMF) incorporates a Selective Fusion Gate (SFG) that generates multi-modal embeddings and is trained with multi-modal pseudo-labeling for adaptive sensor fusion. The whole training is mask-free. During inference, MM-SAM supports segmentation with either a single modality or multiple modalities.
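
The PyTorch sketch below illustrates, under stated assumptions, one way the ingredients named above could be realized: a LoRA branch attached to a frozen SAM linear layer, a modality-specific patch embedding for the X input, an embedding unification loss $L_U$, and the Selective Fusion Gate. All class and function names, the choice of an L1 form for $L_U$, and the embedding dimensions (768 for the ViT patch embedding, 256 for SAM's output image embeddings) are illustrative assumptions for this sketch, not the official MM-SAM implementation.

# Minimal sketch of the UCMT and WMMF tuning modules (assumed names/shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank (LoRA) residual branch."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # SAM weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # branch starts as a no-op

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))

class XPatchEmbed(nn.Module):
    """Modality-specific patch embedding for a non-RGB input with C_x channels."""
    def __init__(self, in_channels: int, embed_dim: int = 768, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # (B, C_x, H, W) -> (B, H/p, W/p, D)
        return self.proj(x).permute(0, 2, 3, 1)

def embedding_unification_loss(x_emb, rgb_emb):
    """L_U (assumed L1 form): pull X embeddings toward SAM's frozen RGB embeddings."""
    return F.l1_loss(x_emb, rgb_emb.detach())

class SelectiveFusionGate(nn.Module):
    """WMMF: predict per-location weights and blend the RGB and X embeddings."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * embed_dim, embed_dim, 1), nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, 2, 1),
        )

    def forward(self, rgb_emb, x_emb):         # both (B, D, H', W')
        w = self.gate(torch.cat([rgb_emb, x_emb], dim=1)).softmax(dim=1)
        return w[:, :1] * rgb_emb + w[:, 1:] * x_emb

In this sketch only the X patch embedding, the LoRA branches, and the SFG would receive gradients, which matches the parameter-efficient, mask-free tuning described above; the fused embeddings would then be passed to SAM's unchanged prompt encoder and mask decoder.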

Experiments



Citation

@article{mmsam,
  title={Segment Anything with Multiple Modalities},
  author={Aoran Xiao and Weihao Xuan and Heli Qi and Yun Xing and Naoto Yokoya and Shijian Lu},
  journal={arXiv preprint arXiv:2408.09085},
  year={2024}
}