In this work, we propose a lightweight approach to enhancing real-time semantic segmentation by leveraging pre-trained vision-language models, specifically using the text encoder of Contrastive Language-Image Pretraining (CLIP) to generate rich semantic embeddings for text labels. Our method then distills this textual knowledge into the segmentation model, integrating the image and text embeddings to align visual and textual information. Additionally, we introduce learnable prompt embeddings for better class-specific semantic comprehension. For efficient learning, we propose a two-stage training strategy: the segmentation backbone first learns from fixed text embeddings and subsequently optimizes the prompt embeddings to streamline the learning process. Extensive evaluations and ablation studies validate that our approach effectively improves the semantic segmentation model’s performance over the compared methods.
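To make the described pipeline concrete, the following is a minimal sketch, written with PyTorch and OpenAI's CLIP package, of how fixed text embeddings (stage one) and learnable prompt embeddings (stage two) might be aligned with per-pixel backbone features. The class names, the prompt template, the temperature `tau`, and the random stand-in for the backbone output are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of text-to-segmentation distillation (illustrative only).
# Assumes: PyTorch, OpenAI's CLIP package (github.com/openai/CLIP), and a
# segmentation backbone whose per-pixel feature dimension matches the
# CLIP text-embedding dimension.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

class_names = ["road", "car", "person"]  # illustrative label set

# Stage 1: fixed text embeddings from the frozen CLIP text encoder.
with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_emb = F.normalize(clip_model.encode_text(tokens).float(), dim=-1)  # (C, D)

# Stage 2: learnable prompt embeddings, initialized from the fixed text
# embeddings and optimized jointly with the backbone.
prompt_emb = torch.nn.Parameter(text_emb.clone())  # (C, D), trainable

def alignment_logits(pixel_feats, class_emb, tau=0.07):
    """Cosine similarity between pixel features and class embeddings.

    pixel_feats: (B, D, H, W) features from the segmentation backbone.
    class_emb:   (C, D) fixed text or learnable prompt embeddings.
    Returns per-pixel class logits of shape (B, C, H, W).
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)
    class_emb = F.normalize(class_emb, dim=-1)
    return torch.einsum("bdhw,cd->bchw", pixel_feats, class_emb) / tau

# Usage with a random stand-in for backbone features: cross-entropy on
# these logits distills the textual knowledge into the backbone
# (stage 1 uses text_emb; stage 2 swaps in prompt_emb).
pixel_feats = torch.randn(2, text_emb.shape[1], 64, 64, device=device)
logits = alignment_logits(pixel_feats, text_emb)  # (2, 3, 64, 64)
```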