Referring image segmentation segments an image according to a language expression. With the aim of producing high-quality masks, existing methods often adopt iterative learning approaches that rely on RNNs or stacked attention layers to refine vision-language features. Despite their complexity, RNN-based methods are subject to specific encoder choices, while attention-based methods offer limited gains. In this work, we introdu...