LGANet: Local and global attention are both you need for action recognition
Authors
Abstract
Due to redundancy in the spatiotemporal neighborhood and the global dependency between video frames, video recognition remains a challenge. Some prior works have been mainly driven by 3D convolutional neural networks (CNNs) or 2D CNNs with well-designed modules for temporal information. However, convolution-based networks lack the capability to capture global dependency due to their limited receptive field. Alternatively, the transformer for video recognition is proposed to build long-range dependency between frame patches. Nevertheless, most transformer-based networks have significant computational costs because attention is calculated among all the tokens. Based on these observations, we propose an efficient network which we dub LGANet. Unlike conventional CNNs and transformers for video recognition, the LGANet can tackle both spatiotemporal redundancy and dependency by learning local and global token affinity in shallow and deep layers, respectively. Specifically, local attention is implemented in the shallow layers to reduce parameters and eliminate redundancy. In the deep layers, spatial-wise and channel-wise self-attention are embedded to realize the global dependency of high-level features. Moreover, several key designs are made in the multi-head self-attention (MSA) and the feed-forward network (FFN). Extensive experiments are conducted on popular video benchmarks, such as Kinetics-400 and Something-Something V1&V2. Without any bells and whistles, the LGANet achieves state-of-the-art performance. The code will be released soon.
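The contrast between local attention in shallow layers and global spatial-wise and channel-wise attention in deep layers can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the windowing scheme, the projections, or the MSA and FFN designs, so the sketch assumes non-overlapping windows, identity query/key/value projections, and a single head. All function names (`attend`, `local_attention`, `global_attention`, `channel_attention`, `pairwise_scores`) are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(tokens):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections), which a real network would replace with
    learned linear layers.
    """
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * tok[j] for w, tok in zip(weights, tokens))
                    for j in range(dim)])
    return out

def local_attention(tokens, window):
    # Attention restricted to non-overlapping windows (assumed scheme).
    out = []
    for start in range(0, len(tokens), window):
        out.extend(attend(tokens[start:start + window]))
    return out

def global_attention(tokens):
    # Spatial-wise global attention: every token attends to every token.
    return attend(tokens)

def channel_attention(tokens):
    # Channel-wise attention: transpose so each channel acts as a token,
    # attend across channels, then transpose back.
    channels = [list(col) for col in zip(*tokens)]
    mixed = attend(channels)
    return [list(row) for row in zip(*mixed)]

def pairwise_scores(n_tokens, window):
    # Number of query-key scores computed with the given window size.
    full, rest = divmod(n_tokens, window)
    return full * window * window + rest * rest

tokens = [[float(i), float(i % 3)] for i in range(8)]
print(pairwise_scores(8, 2), pairwise_scores(8, 8))  # 16 64
```

Under these assumptions, windowing cuts the number of attention scores for 8 tokens from 64 to 16, which is the kind of saving the abstract attributes to local attention in the shallow layers, while the global variants keep the full token-to-token or channel-to-channel interaction.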
Similar resources
Global and Local Attention Processing in Depressed Mood
Background: Attention impairments are the hallmark feature of subclinical depression. The present study used Navon task to compare the allocation of attention to the local and global stimuli in depressed and nondepressed participants. Method: The primary sample included 186 female high school students from Shiraz city who were selected using cluster sampl...
Attention is All you Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. E...
Need for global action for cancer control.
When the Millennium Development Goals (MDGs) [1] were being developed, priority was given to the problems of the poorest billion people in the world. In terms of health, this was translated into a set of targets of indicators in health that give visibility to maternal and child health, (under) nutrition, acquired immunodeficiency syndrome (AIDS), malaria, and tuberculosis, and a vague catch-all...
Joint Network based Attention for Action Recognition
By extracting spatial and temporal characteristics in one network, the two-stream ConvNets can achieve the state-of-the-art performance in action recognition. However, such a framework typically suffers from the separate processing of spatial and temporal information between the two standalone streams and is hard to capture long-term temporal dependence of an action. More importantly, it is in...
Action Recognition using Visual Attention
We propose a soft attention based model for the task of action recognition in videos. We use multi-layered Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units which are deep both spatially and temporally. Our model learns to focus selectively on parts of the video frames and classifies videos after taking a few glimpses. The model essentially learns which parts in the fram...
Journal
Journal title: IET Image Processing
Year: 2023
ISSN: 1751-9659, 1751-9667
DOI: https://doi.org/10.1049/ipr2.12876