Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, in previous works, Transformer training and inference can be prohibitively expensive due to the quadratic complexity of self-attention over long sequence representations, especially for high-resolution dense prediction tasks. To ...