ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization

Authors

Abstract

The object of Weakly-supervised Temporal Action Localization (WS-TAL) is to localize all action instances in an untrimmed video with only video-level supervision. Due to the lack of frame-level annotations during training, current WS-TAL methods rely on attention mechanisms to localize the foreground snippets or frames that contribute to the video-level classification task. This strategy frequently confuses context with the actual action, degrading the localization result. Separating action and context is a core problem for precise WS-TAL, but it is very challenging and has been largely ignored in the literature. In this paper, we introduce an Action-Context Separation Network (ACSNet) that explicitly takes context into account for accurate action localization. It consists of two branches (i.e., the Foreground-Background branch and the Action-Context branch). The first branch distinguishes foreground from background within the entire video, while the second further separates the foreground into action and context. We associate video snippets with two latent components (i.e., a positive component and a negative component), whose different combinations can effectively characterize foreground, action, and context. Furthermore, extended labels with auxiliary context categories are introduced to facilitate the learning of action-context separation. Experiments on the THUMOS14 and ActivityNet v1.2/v1.3 datasets demonstrate that ACSNet outperforms existing state-of-the-art methods by a large margin.
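
As a rough illustration of the two-branch design described in the abstract, the sketch below shows one possible way to organize a Foreground-Background branch and an Action-Context branch over snippet features. It is a minimal sketch based only on the abstract, not the authors' released code; the module names, layer sizes, and the way the positive/negative components are combined are assumptions.

```python
# Illustrative sketch (not the authors' implementation) of the two-branch idea:
# a Foreground-Background branch producing a per-snippet foreground attention,
# and an Action-Context branch whose positive/negative latent components split
# the foreground into action and context. Sizes and names are assumptions.
import torch
import torch.nn as nn

class ACSNetSketch(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        # Foreground-Background branch: per-snippet foreground attention.
        self.fg_attention = nn.Sequential(
            nn.Conv1d(feat_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 1, 1), nn.Sigmoid())
        # Action-Context branch: two latent components per snippet whose
        # combinations characterize foreground / action / context.
        self.latent = nn.Conv1d(feat_dim, 2, 1)
        # Classifier over action classes plus auxiliary context categories
        # (the "extended labels" mentioned in the abstract).
        self.classifier = nn.Conv1d(feat_dim, 2 * num_classes, 1)

    def forward(self, x):                               # x: (B, feat_dim, T)
        fg_att = self.fg_attention(x)                   # (B, 1, T) foreground score
        pos_neg = torch.softmax(self.latent(x), dim=1)  # (B, 2, T) latent components
        action_att = fg_att * pos_neg[:, :1]            # foreground + positive -> action
        context_att = fg_att * pos_neg[:, 1:]           # foreground + negative -> context
        cas = self.classifier(x)                        # class activation sequence
        return fg_att, action_att, context_att, cas
```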

Similar papers

Weakly Supervised Action Localization by Sparse Temporal Pooling Network

We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video usin...
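
The abstract above describes attention over a sparse subset of key segments, trained from video-level labels only. A minimal sketch of that idea (attention-weighted temporal pooling with a sparsity penalty) is shown below; the class `SparsePoolingSketch`, the layer sizes, and the loss weights are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: attention-weighted temporal pooling trained only with
# video-level labels, plus an L1 sparsity penalty on the attention.
import torch
import torch.nn as nn

class SparsePoolingSketch(nn.Module):
    def __init__(self, feat_dim=1024, num_classes=20):
        super().__init__()
        self.attention = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                       nn.Linear(256, 1), nn.Sigmoid())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):                     # feats: (T, feat_dim) segment features
        att = self.attention(feats)               # (T, 1) per-segment attention
        video_feat = (att * feats).sum(0) / (att.sum() + 1e-6)
        return self.classifier(video_feat), att   # video-level class scores

# Training signal: video-level classification loss + sparsity on attention.
model = SparsePoolingSketch()
feats = torch.randn(100, 1024)                    # 100 segments of dummy features
label = torch.zeros(20); label[3] = 1             # multi-hot video-level label
logits, att = model(feats)
loss = nn.BCEWithLogitsLoss()(logits, label) + 1e-4 * att.abs().mean()
```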

Towards Weakly-Supervised Action Localization

This paper presents a novel approach for weakly-supervised action localization, i.e., that does not require per-frame spatial annotations for training. We first introduce an effective method for extracting human tubes by combining a state-of-the-art human detector with a tracking-by-detection approach. Our tube extraction leverages the large amount of annotated humans available today and outper...
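
The tube extraction mentioned above links per-frame human detections over time (tracking-by-detection). A simple, hedged sketch of such a linking step, greedy IoU association between consecutive frames, is given below; the `link_tubes` helper, the box format, and the threshold are assumptions for illustration only.

```python
# Greedy IoU linking of per-frame detections into tubes (illustrative only).
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def link_tubes(detections_per_frame, iou_thresh=0.3):
    """detections_per_frame: list over frames, each a list of person boxes."""
    tubes = []                                    # each tube: list of (frame_idx, box)
    for t, boxes in enumerate(detections_per_frame):
        for box in boxes:
            # Extend the best-matching tube that ended at the previous frame.
            best, best_iou = None, iou_thresh
            for tube in tubes:
                if tube[-1][0] == t - 1 and iou(tube[-1][1], box) > best_iou:
                    best, best_iou = tube, iou(tube[-1][1], box)
            if best is not None:
                best.append((t, box))
            else:
                tubes.append([(t, box)])          # start a new tube
    return tubes
```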

Weakly Supervised Action Detection

Detection of human action in videos has many applications such as video surveillance and content based video retrieval. Actions can be considered as spatio-temporal objects corresponding to spatio-temporal volumes in a video. The problem of action detection can thus be solved similarly to object detection in 2D images [3] where typically an object classifier is trained using positive and negati...

Connectionist Temporal Modeling for Weakly Supervised Action Labeling

We propose a weakly-supervised framework for action labeling in video, where only the order of occurring actions is required during training time. The key challenge is that the per-frame alignments between the input (video) and label (action) sequences are unknown during training. We address this by introducing the Extended Connectionist Temporal Classification (ECTC) framework to efficiently e...
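
ECTC builds on Connectionist Temporal Classification (CTC), which aligns an ordered label sequence to per-frame predictions without frame-level annotation. The sketch below shows plain CTC (via `torch.nn.CTCLoss`) applied to this weak-supervision setting; it does not implement the paper's ECTC extension, and the shapes and label values are illustrative.

```python
# Plain CTC on per-frame action scores when only the ordered action sequence
# (not per-frame alignment) is known. Shapes and labels are assumptions.
import torch
import torch.nn as nn

T, num_actions = 50, 10                            # 50 frames, 10 actions + blank(0)
frame_logits = torch.randn(T, 1, num_actions + 1, requires_grad=True)  # (T, batch, classes)
log_probs = frame_logits.log_softmax(dim=-1)

# Weak label: only the ordered sequence of actions occurring in the video.
ordered_actions = torch.tensor([[3, 7, 2]])        # e.g. "pour" -> "stir" -> "drink"
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, ordered_actions,
           input_lengths=torch.tensor([T]),
           target_lengths=torch.tensor([3]))
loss.backward()                                    # gradients flow to the frame scores
```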

ContextLocNet: Context-Aware Deep Network Models for Weakly Supervised Localization

We aim to localize objects in images using image-level supervision only. Previous approaches to this problem mainly focus on discriminative object regions and often fail to locate precise object boundaries. We address this problem by introducing two types of context-aware guidance models, additive and contrastive models, that leverage their surrounding context regions to improve localization. T...
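
The contrastive guidance model mentioned above scores a candidate region against its surrounding context, so that boxes covering the whole object are preferred over boxes that only capture a discriminative part. A hedged sketch of that contrast is shown below; `ContrastiveGuidanceSketch`, the feature dimensions, and the softmax-over-regions weighting are assumptions, not the paper's exact formulation.

```python
# Illustrative contrastive guidance: a region's localization score is its class
# score minus the score of its surrounding context region.
import torch
import torch.nn as nn

class ContrastiveGuidanceSketch(nn.Module):
    def __init__(self, feat_dim=4096, num_classes=20):
        super().__init__()
        self.score = nn.Linear(feat_dim, num_classes)

    def forward(self, roi_feats, context_feats):
        # roi_feats / context_feats: (num_regions, feat_dim), e.g. pooled from the
        # candidate box and from a ring around it, respectively (assumed inputs).
        loc_scores = self.score(roi_feats) - self.score(context_feats)
        # Softmax over regions turns the contrast into a localization weighting.
        return torch.softmax(loc_scores, dim=0)
```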


Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2021

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v35i3.16322