Overall architecture of ERR-Seg. Initially, redundancy-reduced hierarchical cost maps are generated by extracting cost maps from middle-layer features and eliminating class redundancy. Subsequently, the sequence length is reduced before cost aggregation to speed up the computation. Finally, the upsampling decoder restores the high-rank information of cost maps by incorporating image details from the middle-layer features of CLIP's visual encoder.
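The three stages in this caption can be illustrated with a minimal NumPy sketch. It is not the ERR-Seg implementation: the function names, the top-k class selection criterion, and the use of average pooling to shorten the sequence are all assumptions chosen for illustration, and the upsampling decoder is omitted. The sketch only shows how the cost-map shape changes as class redundancy and spatial resolution are reduced.

```python
import numpy as np

def cost_map(image_feats, text_embeds):
    """Cosine similarity between every pixel feature and every class embedding.

    image_feats: (H, W, D) array, text_embeds: (N, D) array -> (H, W, N) array.
    """
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return img @ txt.T

def reduce_classes(cost, k):
    """Keep the k classes with the highest peak similarity in the image.

    Hypothetical criterion: the paper's actual selection rule may differ.
    """
    scores = cost.max(axis=(0, 1))           # one score per class, shape (N,)
    keep = np.sort(np.argsort(scores)[-k:])  # indices of the k best classes
    return cost[..., keep], keep

def reduce_sequence(cost, stride):
    """Shorten the spatial sequence by averaging non-overlapping patches.

    Assumed stand-in for the sequence-length reduction before aggregation.
    """
    h, w, n = cost.shape
    assert h % stride == 0 and w % stride == 0
    blocks = cost.reshape(h // stride, stride, w // stride, stride, n)
    return blocks.mean(axis=(1, 3))

# Random stand-ins for CLIP middle-layer features and class text embeddings.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 16))   # H=8, W=8, D=16
texts = rng.normal(size=(20, 16))     # N=20 candidate classes

cost = cost_map(feats, texts)
reduced, kept = reduce_classes(cost, k=5)
pooled = reduce_sequence(reduced, stride=2)
print(cost.shape, reduced.shape, pooled.shape)
```

With these toy sizes, the cost volume shrinks from 8×8×20 to 4×4×5 before any aggregation would run, which is the source of the speed-up the caption describes.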
Visualization of segmentation results in various domains. ERR-Seg segments capybaras (a rare category in public datasets) across various domains, including (a) synthesized images, (b) cartoon images, (c) natural images, and (d) capybara dolls. Moreover, ERR-Seg produces more precise masks than SAN and CAT-Seg.
ERR-Seg can correctly distinguish between a yellow dog and a white dog and between a lying capybara and a standing capybara.
@article{chen2025efficient,
title={Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation},
author={Chen, Lin and Yang, Qi and Ding, Kun and Li, Zhihao and Shen, Gang and Li, Fei and Cao, Qiyuan and Xiang, Shiming},
journal={arXiv preprint arXiv:2501.17642},
year={2025}
}