What Makes for Hierarchical Vision Transformer?

Abstract

Recent studies indicate that a hierarchical Vision Transformer (ViT), with a macro architecture that interleaves non-overlapped window-based self-attention and shifted-window operations, can achieve state-of-the-art performance in various visual recognition tasks and challenges the ubiquitous convolutional neural networks (CNNs) that use densely slid kernels. In most recently proposed hierarchical ViTs, self-attention is the de facto standard for spatial information aggregation. In this paper, we question whether self-attention is the only choice for a hierarchical ViT to attain strong performance, and we study the effects of different kinds of cross-window communication methods. To this end, we replace the self-attention layers with embarrassingly simple linear mapping layers, and the resulting proof-of-concept architecture, termed TransLinear, can achieve very strong performance in ImageNet-1k image recognition. Moreover, we find that TransLinear is able to leverage ImageNet pre-trained weights and demonstrates competitive transfer learning properties on downstream dense prediction tasks such as object detection and instance segmentation. We also experiment with other alternatives to self-attention for content aggregation inside each non-overlapped window under different cross-window communication approaches. Our results reveal that the macro architecture, rather than the specific aggregation layers or cross-window communication mechanisms, is more responsible for the hierarchical ViT's strong performance and is the real challenger to the ubiquitous CNN's dense sliding-window paradigm.
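The macro architecture described in the abstract, token aggregation inside non-overlapped windows followed by a shifted-window pass for cross-window communication, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it assumes that a "linear mapping layer" means one weight matrix over the token positions of a window, shared by every window, and it uses a single channel while omitting normalization, residual connections, MLP blocks, and any boundary masking. The function names (`cyclic_shift`, `window_linear_mix`, `block_pair`) are hypothetical.

```python
def cyclic_shift(fmap, shift):
    """Roll a 2-D feature map up and left by `shift` positions, with wrap-around."""
    h, w = len(fmap), len(fmap[0])
    return [[fmap[(i + shift) % h][(j + shift) % w] for j in range(w)]
            for i in range(h)]


def window_linear_mix(fmap, weight, window):
    """Mix the tokens of each non-overlapped window with one shared linear map.

    fmap:   H x W list of floats (single channel, an assumption for brevity)
    weight: (window*window) x (window*window) matrix, reused for every window
    """
    h, w = len(fmap), len(fmap[0])
    assert h % window == 0 and w % window == 0, "map must tile into windows"
    n = window * window
    out = [[0.0] * w for _ in range(h)]
    for top in range(0, h, window):
        for left in range(0, w, window):
            # Flatten this window's tokens in row-major order.
            tokens = [fmap[top + k // window][left + k % window]
                      for k in range(n)]
            # Each output token is a fixed weighted sum of the window's tokens;
            # unlike self-attention, the weights do not depend on the input.
            mixed = [sum(weight[r][c] * tokens[c] for c in range(n))
                     for r in range(n)]
            for k in range(n):
                out[top + k // window][left + k % window] = mixed[k]
    return out


def block_pair(fmap, w_regular, w_shifted, window):
    """One regular-window layer followed by one shifted-window layer."""
    x = window_linear_mix(fmap, w_regular, window)
    shift = window // 2
    x = cyclic_shift(x, shift)        # move window boundaries
    x = window_linear_mix(x, w_shifted, window)
    return cyclic_shift(x, -shift)    # restore the original layout


if __name__ == "__main__":
    avg = [[0.25] * 4 for _ in range(4)]      # averaging weights, 2x2 window
    impulse = [[0.0] * 4 for _ in range(4)]
    impulse[0][0] = 1.0
    for row in block_pair(impulse, avg, avg, 2):
        print(row)
```

With averaging weights on a 4x4 map and 2x2 windows, a single nonzero token stays confined to its own window after the regular layer and reaches every window only after the shifted layer, which is the cross-window communication role the abstract attributes to the shifted-window operation.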

Similar resources

What makes space-time interactions in human vision asymmetrical?

The interaction of space and time affects perception of extents: (1) the longer the exposure duration, the longer the line length is perceived and vice versa; (2) the shorter the line length is, the shorter the exposure duration is perceived. Previous studies have shown that space-time interactions in human vision are asymmetrical; spatial cognition has a larger effect on temporal cognition rat...

What makes a Pollock Pollock: a machine vision approach

Jackson Pollock introduced a revolutionary artistic style of dripping paint on a horizontal canvas. Here we study Pollock’s unique artistic style by using computational methods for characterizing the low-level numerical differences between original Pollock drip paintings and drip paintings of other painters who attempted to mimic his signature drip painting style. Four thousands and twenty four...

Hierarchical Spatial Transformer Network

Computer vision researchers have been expecting that neural networks have spatial transformation ability to eliminate the interference caused by geometric distortion for a long time. Emergence of spatial transformer network makes dream come true. Spatial transformer network and its variants can handle global displacement well, but lack the ability to deal with local spatial variance. Hence how ...

Journal

Journal title: IEEE Transactions on Pattern Analysis and Machine Intelligence

Year: 2023

ISSN: 1939-3539, 2160-9292, 0162-8828

DOI: https://doi.org/10.1109/tpami.2023.3282019