CAT-SAM: Conditional Tuning Network for Few-Shot Adaptation of Segmentation Anything Model
Aoran Xiao*1
Weihao Xuan*2
Heli Qi3
Yun Xing1
Ruijie Ren4
Xiaoqin Zhang5
Ling Shao6
Shijian Lu1
Nanyang Technological University1
The University of Tokyo2
Nara Institute of Science and Technology3
Waseda University4
Wenzhou University5
UCAS-Terminus AI Lab, UCAS6
*Equally contributing first authors
[arXiv]
[code]
[video]

Abstract

The Segment Anything Model (SAM) has demonstrated remarkable zero-shot capability and flexible geometric prompting in general image segmentation. However, it often struggles in domains that are either sparsely represented or lie outside its training distribution, such as aerial, medical, and non-RGB images. Recent efforts have predominantly focused on adapting SAM to these domains using fully supervised methods, which necessitate large amounts of annotated training data and pose practical challenges in data collection. This paper presents CAT-SAM, a ConditionAl Tuning network that explores few-shot adaptation of SAM toward various challenging downstream domains in a data-efficient manner. The core design is a prompt bridge structure that enables decoder-conditioned joint tuning of the heavyweight image encoder and the lightweight mask decoder. The bridge maps the domain-specific features of the mask decoder to the image encoder, fostering synergistic adaptation of both components with mutual benefits, using only few-shot target samples and ultimately leading to superior segmentation across various downstream tasks. We develop two CAT-SAM variants that adopt different tuning strategies for the image encoder: one injects learnable prompt tokens in the input space, while the other inserts lightweight adapter networks. Extensive experiments over 11 downstream tasks show that CAT-SAM achieves superior segmentation consistently, even under the very challenging one-shot adaptation setup.
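To make the core design more concrete, below is a minimal PyTorch sketch of the prompt-bridge idea as described in the abstract: a lightweight module maps a mask-decoder feature to prompt tokens that are injected into a frozen encoder block (the prompt-token variant). All module and parameter names (PromptBridge, PromptedEncoderBlock, num_prompts, etc.) are illustrative assumptions, not the authors' actual implementation; see the released code for the real architecture.

	# Hypothetical sketch of decoder-conditioned prompt bridging, NOT the
	# official CAT-SAM code. Dimensions and module names are assumptions.
	import torch
	import torch.nn as nn


	class PromptBridge(nn.Module):
	    """Maps domain-specific mask-decoder features to prompt tokens for the
	    image encoder, enabling decoder-conditioned joint tuning."""

	    def __init__(self, decoder_dim: int, encoder_dim: int, num_prompts: int):
	        super().__init__()
	        self.num_prompts = num_prompts
	        # Lightweight projection from decoder feature space to encoder prompts.
	        self.proj = nn.Linear(decoder_dim, encoder_dim * num_prompts)

	    def forward(self, decoder_feat: torch.Tensor) -> torch.Tensor:
	        # decoder_feat: (B, decoder_dim) pooled feature from the mask decoder.
	        b = decoder_feat.shape[0]
	        return self.proj(decoder_feat).view(b, self.num_prompts, -1)


	class PromptedEncoderBlock(nn.Module):
	    """A frozen transformer block whose input is augmented with learnable
	    prompt tokens plus bridged tokens from the decoder."""

	    def __init__(self, encoder_dim: int, num_prompts: int):
	        super().__init__()
	        self.block = nn.TransformerEncoderLayer(
	            d_model=encoder_dim, nhead=8, batch_first=True
	        )
	        for p in self.block.parameters():  # keep the SAM encoder frozen
	            p.requires_grad = False
	        # Learnable prompt tokens injected in the input space.
	        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, encoder_dim))

	    def forward(self, x: torch.Tensor, bridged: torch.Tensor) -> torch.Tensor:
	        # x: (B, N, D) patch tokens; bridged: (B, P, D) from PromptBridge.
	        b = x.shape[0]
	        tokens = torch.cat([self.prompts.expand(b, -1, -1) + bridged, x], dim=1)
	        out = self.block(tokens)
	        return out[:, self.prompts.shape[1]:]  # drop prompt tokens afterwards


	if __name__ == "__main__":
	    bridge = PromptBridge(decoder_dim=256, encoder_dim=768, num_prompts=4)
	    layer = PromptedEncoderBlock(encoder_dim=768, num_prompts=4)
	    patches = torch.randn(2, 196, 768)  # dummy image patch tokens
	    dec_feat = torch.randn(2, 256)      # dummy mask-decoder feature
	    print(layer(patches, bridge(dec_feat)).shape)  # torch.Size([2, 196, 768])

In this reading, only the bridge, the prompt tokens, and the mask decoder carry trainable parameters, which is what keeps the adaptation data-efficient enough for the few-shot and one-shot setups. The adapter-based CAT-SAM variant would instead insert small trainable networks inside the encoder blocks rather than prepending tokens.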


Citation

		@article{xiao2024cat,
		  title={CAT-SAM: Conditional Tuning Network for Few-Shot Adaptation of Segmentation Anything Model},
		  author={Xiao, Aoran and Xuan, Weihao and Qi, Heli and Xing, Yun and Ren, Ruijie and Zhang, Xiaoqin and Shao, Ling and Lu, Shijian},
		  journal={arXiv preprint arXiv:2402.03631},
		  year={2024}
		}