Multi-View Attention Network for Visual Dialog
Abstract
Visual dialog is a challenging vision-language task in which a series of questions visually grounded by a given image are answered. To resolve the visual dialog task, a high-level understanding of various multimodal inputs (e.g., question, dialog history, and image) is required. Specifically, it is necessary for an agent to (1) determine the semantic intent of the question and (2) align question-relevant textual and visual contents among heterogeneous modality inputs. In this paper, we propose the Multi-View Attention Network (MVAN), which leverages multiple views about heterogeneous inputs based on attention mechanisms. MVAN effectively captures the question-relevant information from the dialog history with two complementary modules (i.e., Topic Aggregation and Context Matching), and builds multimodal representations through sequential alignment processes (i.e., Modality Alignment). Experimental results on the VisDial v1.0 dataset show the effectiveness of our proposed model, which outperforms previous state-of-the-art methods under both single model and ensemble model settings.
Similar resources
Visual Reference Resolution using Attention Memory for Visual Dialog
Visual dialog is a task of answering a series of inter-dependent questions given an input image, and often requires to resolve visual references among the questions. This problem is different from visual question answering (VQA), which relies on spatial attention (a.k.a. visual grounding) estimated from an image and question pair. We propose a novel attention mechanism that exploits visual atte...
Dual Attention Network for Visual Question Answering
Visual Question Answering (VQA) is a popular research problem that involves inferring answers to natural language questions about a given visual scene. Recent neural network approaches to VQA use attention to select relevant image features based on the question. In this paper, we propose a novel Dual Attention Network (DAN) that not only attends to image features, but also to question features....
Visual Specification of Multi-View Visual Environments
We describe a set of visual tools for specifying and generating multi-view visual environments. JComposer provides an architecture description language for defining environment repositories, view models, and view-repository mappings. A visual event-flow language permits annotation of JComposer diagrams with event handlers specifying environment semantics. BuildByWire supports constraint-based v...
Multi-level Gated Recurrent Neural Network for dialog act classification
In this paper we focus on the problem of dialog act (DA) labelling. This problem has recently attracted a lot of attention as it is an important sub-part of an automatic dialog model, which is currently in great demand. Traditional methods tend to see this problem as a sequence labelling task and deal with it by applying classifiers with rich features. Most of the current neural network models ...
Fast and adaptive network of spiking neurons for multi-view visual pattern recognition
In this paper, we describe and evaluate a new spiking neural network (SNN) architecture and its corresponding learning procedure to perform fast and adaptive multi-view visual pattern recognition. The network is composed of a simplified type of integrate-and-fire neurons arranged hierarchically in four layers of two-dimensional neuronal maps. Using a Hebbian-based training, the network adaptive...
Journal
Journal title: Applied Sciences
Year: 2021
ISSN: 2076-3417
DOI: https://doi.org/10.3390/app11073009