Convolutional neural networks (CNNs) have dominated the field of computer vision for nearly a decade. However, due to their limited receptive field, CNNs fail model global context. On other hand, transformers, an attention-based architecture, can context easily. Despite this, there are studies that investigate effectiveness transformers in crowd counting. In addition, majority existing crowd-co...