Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation

Lin Chen1, Qi Yang1, Kun Ding1, Zhihao Li2, Gang Shen3, Fei Li3, Qiyuan Cao1, Shiming Xiang1
1Institute of Automation, Chinese Academy of Sciences
2Shandong University, 3Tower Corporation Limited

Abstract

Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel within an image to a specific class defined by arbitrary text descriptions. While large-scale vision-language models have shown remarkable open-vocabulary capabilities, their image-level pretraining limits effectiveness on pixel-wise dense prediction tasks like OVSS. Recent cost-based methods narrow this granularity gap by constructing pixel-text cost maps and refining them via cost aggregation mechanisms. Despite achieving promising performance, these approaches suffer from high computational cost and long inference latency. In this paper, we identify two major sources of redundancy in the cost-based OVSS framework: redundant information introduced during cost map construction and inefficient sequence modeling in cost aggregation. To address these issues, we propose ERR-Seg, an efficient architecture that incorporates Redundancy-Reduced Hierarchical Cost maps (RRHC) and Redundancy-Reduced Cost Aggregation (RRCA). Specifically, RRHC reduces redundant class channels by customizing a compact class vocabulary for each image and integrates hierarchical cost maps to enrich semantic representation. RRCA alleviates the computational burden by performing both spatial-level and class-level sequence reduction before aggregation. Overall, ERR-Seg results in a lightweight structure for OVSS, characterized by substantial memory and computational savings without compromising accuracy. Compared to previous state-of-the-art methods on the ADE20K-847 benchmark, ERR-Seg improves performance by $5.6\%$ while achieving a 3.1× speedup.
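The two ideas in the abstract — building a pixel-text cost map and pruning redundant class channels into a compact per-image vocabulary — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cosine-similarity cost map is standard for CLIP-style features, while the top-k selection by peak response is an assumed, simplified criterion for choosing the compact vocabulary.

```python
import numpy as np

def build_cost_map(pixel_feats, text_feats):
    """Pixel-text cost map: cosine similarity between every pixel
    embedding (H*W, D) and every class text embedding (C, D)."""
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return p @ t.T  # shape (H*W, C)

def reduce_class_channels(cost_map, k):
    """Compact per-image vocabulary: keep only the k class channels
    with the strongest peak response (hypothetical selection rule)."""
    scores = cost_map.max(axis=0)         # best response of each class
    keep = np.argsort(scores)[::-1][:k]   # indices of the top-k classes
    return cost_map[:, keep], keep
```

Dropping class channels this way shrinks the cost tensor before any aggregation runs, which is where the memory and compute savings in the abstract come from.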

Main Architecture

Overall architecture of ERR-Seg. Initially, redundancy-reduced hierarchical cost maps are generated by extracting cost maps from middle-layer features and eliminating class redundancy. Subsequently, the sequence length is reduced before cost aggregation to speed up the computation. Finally, the upsampling decoder restores the high-rank information of cost maps by incorporating image details from the middle-layer features of CLIP's visual encoder.
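The second step above — shortening the sequence before cost aggregation — can be illustrated with a simple spatial pooling sketch. This is only an assumed stand-in for the paper's spatial-level sequence reduction: average pooling over windows is one plausible way to cut the token count; the actual RRCA selection rule may differ.

```python
import numpy as np

def spatial_reduce(cost_map, h, w, factor=2):
    """Shorten the spatial sequence of a flattened cost map (H*W, C)
    by average-pooling over factor x factor windows, so that the
    subsequent aggregation operates on (H*W / factor^2) tokens."""
    c = cost_map.reshape(h, w, -1)
    hf, wf = h // factor, w // factor
    c = c[:hf * factor, :wf * factor]  # crop to a multiple of factor
    pooled = c.reshape(hf, factor, wf, factor, -1).mean(axis=(1, 3))
    return pooled.reshape(hf * wf, -1)
```

Because self-attention cost grows quadratically with sequence length, reducing H*W by a factor of 4 (factor=2) cuts attention compute by roughly 16×, which is consistent with the speedups the method targets.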

Results


Qualitative Results


Visualization of segmentation results in various domains. Our proposed ERR-Seg is capable of segmenting capybaras (a rare category in public datasets) from various domains, including (a) synthesized images, (b) cartoon images, (c) natural images, and (d) capybara dolls. Moreover, ERR-Seg achieves more precise masks than SAN and CAT-Seg.


ERR-Seg can correctly distinguish between a yellow dog and a white dog and between a lying capybara and a standing capybara.

Quantitative results on the ADE20K-150 (A-150) and ADE20K-847 (A-847) benchmarks.

BibTeX

@article{chen2025efficient,
  title={Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation},
  author={Chen, Lin and Yang, Qi and Ding, Kun and Li, Zhihao and Shen, Gang and Li, Fei and Cao, Qiyuan and Xiang, Shiming},
  journal={arXiv preprint arXiv:2501.17642},
  year={2025}
}