Grained Classification and Captioning Tasks
Authors
Abstract
Understanding concepts in the world remains one of the long-sought goals of machine learning. Whereas ImageNet enabled success in object recognition and various related tasks via transfer learning, the ability to understand the physical concepts prevalent in the world remains an unattained, yet desirable, goal. Video as a vision modality encodes how objects change over time with respect to pose, position, observer distance, and so on; it has therefore been studied extensively both as a data domain and as a way to probe “common sense” physical concepts of objects.
Similar Resources
Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning
Modern neural image captioning systems typically adopt the encoder-decoder framework consisting of two principal components: a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for caption generation. Inspired by the robustness analysis of CNN-based image classifiers to adversarial perturbations, we propose Show-and-Fool, a novel algorithm for ...
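For readers unfamiliar with the framework, the sketch below shows a minimal CNN encoder feeding an LSTM decoder, matching the generic two-component pipeline this abstract describes. All module names, dimensions, and the toy data are illustrative assumptions, not the paper's implementation.

```python
# Minimal CNN encoder + RNN decoder captioner; names and sizes are illustrative.
import torch
import torch.nn as nn

class CaptionEncoder(nn.Module):
    """Tiny CNN standing in for a pretrained image feature extractor."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, images):                # images: (B, 3, H, W)
        h = self.conv(images).flatten(1)      # (B, 64)
        return self.fc(h)                     # (B, feat_dim)

class CaptionDecoder(nn.Module):
    """LSTM that generates a caption conditioned on the image feature."""
    def __init__(self, vocab_size=1000, feat_dim=256, hid=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid)
        self.init_h = nn.Linear(feat_dim, hid)
        self.lstm = nn.LSTM(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, feats, tokens):         # tokens: (B, T)
        h0 = self.init_h(feats).unsqueeze(0)  # (1, B, hid)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(out)                  # (B, T, vocab) logits

# Usage: one forward pass with random data.
enc, dec = CaptionEncoder(), CaptionDecoder()
imgs = torch.randn(2, 3, 64, 64)
toks = torch.randint(0, 1000, (2, 12))
print(dec(enc(imgs), toks).shape)  # torch.Size([2, 12, 1000])
```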
Video Captioning via Hierarchical Reinforcement Learning
Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g., sequence-to-sequence models) has shown promising results in producing a coarse description of a short video, it is still very challenging to caption a video containing multiple fine-grained actions with a detailed description. This paper aims to address the cha...
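One way to make the hierarchical idea concrete is a two-level decoder in which a slow "manager" RNN emits a goal vector every few steps and a fast "worker" RNN emits words conditioned on that goal. The sketch below is a loose rendering of that structure; the names, dimensions, and goal interval are assumptions, and it omits the reinforcement-learning training the paper's title refers to.

```python
# Loose sketch of a hierarchical caption decoder: a slow manager RNN sets
# sub-goals, a fast worker RNN emits words given the current goal.
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, vocab=1000, dim=128, goal_every=4):
        super().__init__()
        self.goal_every = goal_every
        self.embed = nn.Embedding(vocab, dim)
        self.manager = nn.LSTMCell(dim, dim)      # updates the goal
        self.worker = nn.LSTMCell(2 * dim, dim)   # emits words given goal
        self.out = nn.Linear(dim, vocab)

    def forward(self, video_feat, tokens):        # video_feat: (B, dim)
        B, T = tokens.shape
        hm = cm = hw = cw = torch.zeros(B, self.manager.hidden_size)
        goal = torch.zeros(B, self.manager.hidden_size)
        logits = []
        for t in range(T):
            if t % self.goal_every == 0:           # manager ticks slowly
                hm, cm = self.manager(video_feat, (hm, cm))
                goal = hm
            x = torch.cat([self.embed(tokens[:, t]), goal], dim=-1)
            hw, cw = self.worker(x, (hw, cw))      # worker ticks every step
            logits.append(self.out(hw))
        return torch.stack(logits, dim=1)          # (B, T, vocab)

dec = HierarchicalDecoder()
feats = torch.randn(2, 128)                        # stand-in video feature
toks = torch.randint(0, 1000, (2, 10))
print(dec(feats, toks).shape)  # torch.Size([2, 10, 1000])
```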
Image Representations and New Domains in Neural Image Captioning
We examine the possibility that recent promising results in automatic caption generation are due primarily to language models. By varying the quality of image representations produced by a convolutional neural network, we find that a state-of-the-art neural captioning algorithm is able to produce quality captions even when provided with surprisingly poor image representations. We replicate this result i...
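The core experimental move, degrading the image representation before decoding, can be sketched as below. The specific corruption scheme (randomly masking feature dimensions and adding noise) is an assumption chosen for illustration, not necessarily what the study used.

```python
# Illustrative way to "vary image representation quality": corrupt a CNN
# feature vector to several degrees before handing it to a fixed decoder.
import torch

def degrade(feats: torch.Tensor, keep_frac: float, noise_std: float = 0.0):
    """Zero out a random (1 - keep_frac) fraction of feature dims, add noise."""
    mask = (torch.rand_like(feats) < keep_frac).float()
    return feats * mask + noise_std * torch.randn_like(feats)

feats = torch.randn(1, 256)                   # stand-in CNN feature
for keep in (1.0, 0.5, 0.1):
    f = degrade(feats, keep_frac=keep, noise_std=0.1)
    # In the actual experiment one would decode a caption from `f` and
    # score it (e.g. BLEU) against reference captions.
    print(keep, f.abs().mean().item())
```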
Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization
We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent. Our approach, Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say, logits for ‘dog’ or even a caption) flowing into the final convolutional layer to produce a coarse localiz...
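The described procedure can be sketched directly: pool the gradients of a target score over the final convolutional feature maps to obtain channel weights, form the weighted sum of those maps, and apply a ReLU. The code below is a minimal Grad-CAM sketch using an untrained torchvision resnet18 and a random input purely for demonstration.

```python
# Minimal Grad-CAM: GAP the gradients of a target score w.r.t. the final
# conv feature maps into channel weights, then take a ReLU'd weighted sum.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
feats, grads = {}, {}

def fwd_hook(_, __, output):
    feats["a"] = output                        # (B, C, H, W) activations

def bwd_hook(_, __, grad_output):
    grads["a"] = grad_output[0]                # gradients w.r.t. activations

model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224, requires_grad=True)
scores = model(x)
scores[0, scores.argmax()].backward()          # score of the predicted class

weights = grads["a"].mean(dim=(2, 3), keepdim=True)   # GAP of gradients
cam = F.relu((weights * feats["a"]).sum(dim=1))       # (B, H, W) coarse map
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[2:],
                    mode="bilinear", align_corners=False)
print(cam.shape)  # torch.Size([1, 1, 224, 224]), overlay on the input image
```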
Seeing with Humans: Gaze-Assisted Neural Image Captioning
Gaze reflects how humans process visual scenes and is therefore increasingly used in computer vision systems. Previous works demonstrated the potential of gaze for object-centric tasks, such as object localization and recognition, but it remains unclear if gaze can also be beneficial for scene-centric tasks, such as image captioning. We present a new perspective on gaze-assisted image captionin...
Publication date: 2018