Since their introduction, Transformer architectures have emerged as the dominant choice for natural language processing and, more recently, computer vision applications. An intrinsic limitation of this family of "fully-attentive" models arises from the computation of dot-product attention, whose memory consumption and number of operations grow as $O(n^2)$, where $n$ denotes the input sequence length, thus limiting application...
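For reference, the quadratic cost can be read directly from the standard scaled dot-product attention of Vaswani et al.; the formulation below is a generic restatement of that operation, not of any specific variant considered here:
$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,
\qquad Q, K, V \in \mathbb{R}^{n \times d_k},
$$
where the intermediate matrix $QK^\top \in \mathbb{R}^{n \times n}$ must be computed over all pairs of positions, yielding $O(n^2 d_k)$ operations and $O(n^2)$ memory for the attention weights.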