[ECCV2024 (Oral)] CAT-SAM: Conditional Tuning
for Few-Shot Adaptation of Segment Anything Model
Aoran Xiao*1
Weihao Xuan*2,3
Heli Qi4
Yun Xing1
Ruijie Ren5
Xiaoqin Zhang6
Ling Shao7
Shijian Lu1
Nanyang Technological University1
The University of Tokyo2
RIKEN AIP3
Nara Institute of Science and Technology4
Waseda University5
Wenzhou University6
UCAS-Terminus AI Lab, UCAS7
*Equally contributing first authors
[arXiv]
[code]

Video


Abstract

The Segment Anything Model (SAM) has demonstrated remarkable zero-shot capability and flexible geometric prompting in general image segmentation. However, it often struggles in domains that are either sparsely represented in or lie outside its training distribution, such as aerial, medical, and non-RGB images. Recent efforts have predominantly focused on adapting SAM to these domains with fully supervised methods, which require large amounts of annotated training data and pose practical challenges in data collection. This paper presents CAT-SAM, a ConditionAl Tuning network that enables few-shot adaptation of SAM to various challenging downstream domains in a data-efficient manner. The core design is a prompt bridge structure that enables decoder-conditioned joint tuning of the heavyweight image encoder and the lightweight mask decoder. The bridge maps the domain-specific features of the mask decoder into the image encoder, fostering synergistic adaptation of both components using only few-shot target samples and ultimately leading to superior segmentation across various downstream tasks. We develop two CAT-SAM variants that adopt different tuning strategies for the image encoder: one injects learnable prompt tokens into the input space, while the other inserts lightweight adapter networks. Extensive experiments over 11 downstream tasks show that CAT-SAM achieves superior segmentation consistently, even under the very challenging one-shot adaptation setup.
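The two encoder-tuning strategies mentioned in the abstract can be sketched in a few lines. The snippet below is a minimal NumPy illustration under assumed shapes (`embed_dim`, `bottleneck`, `num_prompt_tokens` are illustrative, not the released configuration), not the actual CAT-SAM implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 256          # illustrative; SAM's ViT encoders use larger dims
num_patches = 64
bottleneck = 16          # adapter bottleneck width (assumed)
num_prompt_tokens = 4    # number of injected prompt tokens (assumed)

patch_tokens = rng.standard_normal((num_patches, embed_dim))

# CAT-SAM-T style: prepend learnable prompt tokens to the encoder's input
# token sequence; only these tokens (and the bridge) would be trained.
prompt_tokens = rng.standard_normal((num_prompt_tokens, embed_dim)) * 0.02
tokens_with_prompts = np.concatenate([prompt_tokens, patch_tokens], axis=0)

# CAT-SAM-A style: a lightweight bottleneck adapter with a residual
# connection, inserted between frozen encoder blocks.
W_down = rng.standard_normal((embed_dim, bottleneck)) * 0.02
W_up = np.zeros((bottleneck, embed_dim))  # zero-init: adapter starts as identity

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def adapter(x):
    # residual keeps the frozen encoder's behavior at initialization
    return x + gelu(x @ W_down) @ W_up

adapted = adapter(patch_tokens)
print(tokens_with_prompts.shape)            # (68, 256)
print(np.allclose(adapted, patch_tokens))   # True (zero-init up-projection)
```

In both cases the pretrained SAM weights stay frozen; only the small added parameters are updated, which is what makes few-shot adaptation feasible.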


Method


Experiments

Model      | #Tuning Samples | WHU (IoU↑) | Kvasir (mIoU↑) | SBU-Shadow (BER↓) | JSRT (mIoU↑) | FLS (mIoU↑) | HRSID (AP↑)
SAM        | -               | 43.5       | 79.0           | 62.4              | 78.5         | 69.7        | 38.2
CAT-SAM-T  | 1-shot          | 86.8       | 83.4           | 78.0              | 93.0         | N/A         | 46.2
CAT-SAM-A  | 1-shot          | 88.2       | 85.4           | 81.9              | 92.6         | N/A         | 44.9
CAT-SAM-T  | 16-shot         | 89.6       | 93.1           | 4.04              | 94.2         | 73.2        | 46.2
CAT-SAM-A  | 16-shot         | 90.3       | 93.6           | 3.80              | 93.5         | 71.4        | 45.7
CAT-SAM-T  | Full-shot       | 93.3       | 94.5           | 2.54              | 94.4         | 81.7        | 51.4
CAT-SAM-A  | Full-shot       | 93.6       | 94.3           | 2.39              | 94.6         | 82.0        | 52.9

Visual Results

Visual comparisons of SAM (top row) and CAT-SAM (bottom row). We illustrate samples from WHU for building segmentation, MA. Roads for road segmentation, SBU-Shadow for shadow segmentation, Kvasir for polyp segmentation, JSRT for chest organ segmentation (X-ray images), FLS for marine debris segmentation (sonar images), and HRSID for ship instance segmentation (SAR images). CAT-SAM achieves effective one-shot adaptation across most datasets, except for FLS, which uses 16-shot adaptation. Red boxes and stars denote geometric prompts, colored regions are mask predictions, and lines show the ground-truth segmentation boundaries.

Citation

		@article{xiao2024cat,
		  title={CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model},
		  author={Xiao, Aoran and Xuan, Weihao and Qi, Heli and Xing, Yun and Ren, Ruijie and Zhang, Xiaoqin and Shao, Ling and Lu, Shijian},
		  journal={arXiv preprint arXiv:2402.03631},
		  year={2024}
		}