¹Nanyang Technological University
²The University of Tokyo
³RIKEN AIP
⁴Nara Institute of Science and Technology
⁵Waseda University
⁶Wenzhou University
⁷UCAS-Terminus AI Lab, UCAS
*Equally contributing first authors

The Segment Anything Model (SAM) has demonstrated remarkable zero-shot capability and flexible geometric prompting in general image segmentation. However, it often struggles in domains that are sparsely represented in, or lie outside, its training distribution, such as aerial, medical, and non-RGB images. Recent efforts have predominantly adapted SAM to these domains with fully supervised methods, which require large amounts of annotated training data and make data collection a practical bottleneck. This paper presents CAT-SAM, a ConditionAl Tuning network that explores few-shot adaptation of SAM toward various challenging downstream domains in a data-efficient manner. The core design is a prompt bridge structure that enables decoder-conditioned joint tuning of the heavyweight image encoder and the lightweight mask decoder. The bridge maps the domain-specific features of the mask decoder to the image encoder, fostering synergistic adaptation of both components from only a few target samples and ultimately leading to superior segmentation in various downstream tasks. We develop two CAT-SAM variants that adopt different tuning strategies for the image encoder: one injects learnable prompt tokens in the input space and the other inserts lightweight adapter networks. Extensive experiments over 11 downstream tasks show that CAT-SAM consistently achieves superior segmentation, even under the very challenging one-shot adaptation setup.
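
To make the decoder-conditioned idea more concrete, the following is a minimal, dependency-free sketch of how a bridge could map decoder-side tokens into the encoder's input space for the prompt-token variant. It is an illustration under assumptions, not the CAT-SAM implementation: the names `make_bridge`, `bridge_prompts`, and `encoder_input`, the single linear layer used as the bridge, and the toy dimensions are all hypothetical, and the abstract does not specify the bridge's actual architecture or where the tokens are injected.

```python
import random


def linear(x, weight, bias):
    """Compute y = x @ weight + bias for one token (a list of floats)."""
    return [
        sum(xi * w_row[j] for xi, w_row in zip(x, weight)) + bias[j]
        for j in range(len(bias))
    ]


def make_bridge(d_dec, d_enc, seed=0):
    """Hypothetical bridge: one linear map from decoder width to encoder width.

    The weights are random placeholders here; in training they would be
    among the few learnable parameters.
    """
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_enc)] for _ in range(d_dec)]
    bias = [0.0] * d_enc
    return weight, bias


def bridge_prompts(decoder_tokens, bridge):
    """Map each decoder-side token to an encoder-side prompt token."""
    weight, bias = bridge
    return [linear(tok, weight, bias) for tok in decoder_tokens]


def encoder_input(patch_tokens, decoder_tokens, bridge):
    """Prepend the bridged prompt tokens to the encoder's patch tokens."""
    return bridge_prompts(decoder_tokens, bridge) + patch_tokens


# Toy example: 2 decoder tokens of width 3, 6 patch tokens of width 4.
decoder_tokens = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4]]
patch_tokens = [[0.0] * 4 for _ in range(6)]
bridge = make_bridge(d_dec=3, d_enc=4)
tokens = encoder_input(patch_tokens, decoder_tokens, bridge)
```

In this sketch the encoder's own weights are never touched. Only the decoder-side tokens and the bridge would be updated, which is one way to read "decoder-conditioned joint tuning" with very few trainable parameters.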

Model | Tuning samples | WHU (IoU↑) | Kvasir (mIoU↑) | SBU-Shadow (BER↓) | JSRT (mIoU↑) | FLS (mIoU↑) | HRSID (AP↑)
---|---|---|---|---|---|---|---
SAM | - | 43.5 | 79.0 | 62.4 | 78.5 | 69.7 | 38.2
CAT-SAM-T | 1-shot | 86.8 | 83.4 | 78.0 | 93.0 | N/A | 46.2
CAT-SAM-A | 1-shot | 88.2 | 85.4 | 81.9 | 92.6 | N/A | 44.9
CAT-SAM-T | 16-shot | 89.6 | 93.1 | 4.04 | 94.2 | 73.2 | 46.2
CAT-SAM-A | 16-shot | 90.3 | 93.6 | 3.80 | 93.5 | 71.4 | 45.7
CAT-SAM-T | Full-shot | 93.3 | 94.5 | 2.54 | 94.4 | 81.7 | 51.4
CAT-SAM-A | Full-shot | 93.6 | 94.3 | 2.39 | 94.6 | 82.0 | 52.9
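
For readers unfamiliar with the metrics in the table header, the sketch below computes IoU and the balanced error rate (BER) for a pair of binary masks using their standard definitions. It is only an illustration: the function names, the nested-list mask format, and the handling of empty masks are assumptions made here, and the benchmarks' official evaluation scripts may differ in details such as thresholding or per-image versus dataset-level averaging. mIoU is the mean of per-class IoU values; AP is not covered.

```python
def confusion(pred, target):
    """Count TP, FP, FN, TN over two binary masks given as nested lists."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def iou(pred, target):
    """Intersection over union in percent (higher is better)."""
    tp, fp, fn, _ = confusion(pred, target)
    union = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return 100.0 * tp / union if union else 100.0


def ber(pred, target):
    """Balanced error rate in percent (lower is better)."""
    tp, fp, fn, tn = confusion(pred, target)
    # Assumed convention: a class absent from the target contributes no error.
    pos_rate = tp / (tp + fn) if (tp + fn) else 1.0
    neg_rate = tn / (tn + fp) if (tn + fp) else 1.0
    return 100.0 * (1.0 - 0.5 * (pos_rate + neg_rate))


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(iou(pred, target))            # 50.0
print(round(ber(pred, target), 2))  # 33.33
```

Because BER averages the error on foreground and background pixels separately, it is not dominated by the larger class, which is why it is the usual choice for shadow detection.
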
@article{xiao2024cat,
  title={CAT-SAM: Conditional Tuning Network for Few-Shot Adaptation of Segmentation Anything Model},
  author={Xiao, Aoran and Xuan, Weihao and Qi, Heli and Xing, Yun and Ren, Ruijie and Zhang, Xiaoqin and Ling, Shao and Lu, Shijian},
  journal={arXiv preprint arXiv:2402.03631},
  year={2024}
}