Vision-to-language tasks aim to integrate computer vision and natural language processing together, which has attracted the attention of many researchers. For typical approaches, they encode image into feature representations decode it sentences. While neglect high-level semantic concepts subtle relationships between regions elements. To make full use these information, this paper attempt explo...