Visual dialog is a challenging vision-language task in which a series of questions visually grounded in a given image are answered. To resolve this task, a high-level understanding of various multimodal inputs (e.g., question, history, and image) is required. Specifically, it is necessary for an agent to (1) determine the semantic intent of the question, (2) align question-relevant textual contents among heterogene...