ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization
Abstract
The objective of Weakly-supervised Temporal Action Localization (WS-TAL) is to localize all action instances in an untrimmed video with only video-level supervision. Due to the lack of frame-level annotations during training, current WS-TAL methods rely on attention mechanisms to localize the foreground snippets or frames that contribute to the video-level classification task. This strategy frequently confuses context with the actual action in the localization result. Separating action and context is a core problem for precise WS-TAL, but it is very challenging and has been largely ignored in the literature. In this paper, we introduce an Action-Context Separation Network (ACSNet) that explicitly takes context into account for accurate action localization. It consists of two branches (i.e., the Foreground-Background branch and the Action-Context branch). The Foreground-Background branch first distinguishes foreground from background within the entire video, while the Action-Context branch further separates the foreground into action and context. We associate video snippets with two latent components (i.e., a positive component and a negative component), and their different combinations can effectively characterize foreground, action, and context. Furthermore, we introduce extended labels with auxiliary context categories to facilitate the learning of action-context separation. Experiments on the THUMOS14 and ActivityNet v1.2/v1.3 datasets demonstrate that ACSNet outperforms existing state-of-the-art WS-TAL methods by a large margin.
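To make the two-latent-component idea concrete, here is a minimal toy sketch in Python. It is not the ACSNet implementation: the abstract does not say how the components are combined, so the rule used here (foreground = positive + negative, action = positive, context = negative) is purely an assumption for illustration, and the function names `segments_above` and `split_foreground` as well as the numeric scores are hypothetical.

```python
from typing import Dict, List, Tuple


def segments_above(values: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return [start, end) index ranges of consecutive snippets whose value exceeds threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, v in enumerate(values):
        if v > threshold and start is None:
            start = i  # a new segment opens at snippet i
        elif v <= threshold and start is not None:
            segments.append((start, i))  # the open segment closes before snippet i
            start = None
    if start is not None:
        segments.append((start, len(values)))  # segment runs to the end of the video
    return segments


def split_foreground(positive: List[float], negative: List[float]) -> Dict[str, List[float]]:
    """Combine per-snippet latent components into foreground/action/context scores.

    ASSUMPTION (illustrative only, not taken from the paper):
    foreground = positive + negative, action = positive, context = negative.
    """
    return {
        "foreground": [p + n for p, n in zip(positive, negative)],
        "action": list(positive),
        "context": list(negative),
    }


# Hypothetical per-snippet latent scores for a six-snippet video.
positive = [0.0, 0.1, 0.8, 0.9, 0.1, 0.0]
negative = [0.1, 0.7, 0.1, 0.0, 0.8, 0.1]

scores = split_foreground(positive, negative)
fg_segments = segments_above(scores["foreground"], 0.5)   # [(1, 5)]
act_segments = segments_above(scores["action"], 0.5)      # [(2, 4)]
ctx_segments = segments_above(scores["context"], 0.5)     # [(1, 2), (4, 5)]
```

In this toy setting, thresholding the foreground score alone yields one segment covering snippets 1 to 4, which absorbs the context snippets on either side of the action. Thresholding the action score gives the tighter segment covering snippets 2 and 3, which is the kind of confusion the abstract says action-context separation is meant to remove.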
Similar resources
Weakly Supervised Action Localization by Sparse Temporal Pooling Network
We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video usin...
Towards Weakly-Supervised Action Localization
This paper presents a novel approach for weakly-supervised action localization, i.e., that does not require per-frame spatial annotations for training. We first introduce an effective method for extracting human tubes by combining a state-of-the-art human detector with a tracking-by-detection approach. Our tube extraction leverages the large amount of annotated humans available today and outper...
Weakly Supervised Action Detection
Detection of human action in videos has many applications, such as video surveillance and content-based video retrieval. Actions can be considered as spatio-temporal objects corresponding to spatio-temporal volumes in a video. The problem of action detection can thus be solved similarly to object detection in 2D images [3], where typically an object classifier is trained using positive and negati...
Connectionist Temporal Modeling for Weakly Supervised Action Labeling
We propose a weakly-supervised framework for action labeling in video, where only the order of occurring actions is required during training time. The key challenge is that the per-frame alignments between the input (video) and label (action) sequences are unknown during training. We address this by introducing the Extended Connectionist Temporal Classification (ECTC) framework to efficiently e...
ContextLocNet: Context-Aware Deep Network Models for Weakly Supervised Localization
We aim to localize objects in images using image-level supervision only. Previous approaches to this problem mainly focus on discriminative object regions and often fail to locate precise object boundaries. We address this problem by introducing two types of context-aware guidance models, additive and contrastive models, that leverage their surrounding context regions to improve localization. T...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2021
ISSN: ['2159-5399', '2374-3468']
DOI: https://doi.org/10.1609/aaai.v35i3.16322