Unsupervised Generative Adversarial Cross-modal Hashing

Authors

  • Jian Zhang
  • Yuxin Peng
  • Mingkuan Yuan
Abstract

Cross-modal hashing aims to map heterogeneous multimedia data into a common Hamming space, which enables fast and flexible retrieval across different modalities. Unsupervised cross-modal hashing is more flexible and applicable than supervised methods, since no intensive labeling work is involved. However, existing unsupervised methods learn hashing functions by preserving inter- and intra-modal correlations, while ignoring the underlying manifold structure across different modalities, which is extremely helpful for capturing meaningful nearest neighbors of different modalities in cross-modal retrieval. To address this problem, in this paper we propose an Unsupervised Generative Adversarial Cross-modal Hashing approach (UGACH), which makes full use of the GAN's ability for unsupervised representation learning to exploit the underlying manifold structure of cross-modal data. The main contributions can be summarized as follows: (1) We propose a generative adversarial network to model cross-modal hashing in an unsupervised fashion. In the proposed UGACH, given a data item of one modality, the generative model tries to fit the distribution over the manifold structure and selects informative data of another modality to challenge the discriminative model. The discriminative model learns to distinguish the generated data from the true positive data sampled from the correlation graph, so as to achieve better retrieval accuracy. These two models are trained in an adversarial way to improve each other and promote hashing function learning. (2) We propose a correlation graph based approach to capture the underlying manifold structure across different modalities, so that data of different modalities but within the same manifold can have smaller Hamming distances, which promotes retrieval accuracy. Extensive experiments compared with 6 state-of-the-art methods on 2 widely-used datasets verify the effectiveness of our proposed approach.

Introduction

Multimedia retrieval, which returns the multimedia content that users are interested in, has become an important application over the past decades. However, retrieving multimedia data efficiently from large-scale databases is a big challenge, due to the explosive growth of multimedia information. To address this issue, many hashing methods (Wang et al. 2016; Gionis, Indyk, and Motwani 1999; Zhang and Peng 2017) have been proposed to achieve efficient retrieval. The goal of hashing methods is to map high-dimensional representations in the original space to short binary codes in the Hamming space. With these binary hash codes, Hamming distances can be computed with bit operations that can be implemented very efficiently (a minimal sketch of such bit-level distance computation is given below). Moreover, binary codes take much less storage than the original high-dimensional representations.

Many hashing methods have been applied to single-modality retrieval (Wang et al. 2016), in which users can only retrieve data with a query of the same modality, such as text retrieval (Baeza-Yates and Ribeiro-Neto 1999) and image retrieval (Wang et al. 2016). Nevertheless, single-modality retrieval cannot meet users' increasing demands, because multimedia data come in different modalities. For example, with single-modality retrieval it is impracticable to search for an image using a textual sentence that describes the semantic content of the image.
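As noted above, binary hash codes allow Hamming distances to be computed with simple bit operations. The snippet below is a minimal illustration, not taken from the paper: it assumes 64-bit codes, packs them into bytes with NumPy, and compares a query code against a small database via XOR followed by a popcount.

```python
import numpy as np

def pack_codes(bits: np.ndarray) -> np.ndarray:
    """Pack an (n, n_bits) array of {0, 1} hash bits into bytes (8 bits per uint8)."""
    return np.packbits(bits.astype(np.uint8), axis=1)

def hamming_distances(query: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Hamming distance between one packed query code and every packed database code."""
    xor = np.bitwise_xor(database, query)           # differing bits, byte by byte
    return np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per row

# toy usage: five database items and one query, 64-bit codes drawn at random
rng = np.random.default_rng(0)
db = pack_codes(rng.integers(0, 2, size=(5, 64)))
q = pack_codes(rng.integers(0, 2, size=(1, 64)))[0]
print(hamming_distances(q, db))                     # smaller = closer in the Hamming space
```

Real systems typically apply a hardware popcount to packed 64-bit words; the byte-wise unpacking here only keeps the example dependency-free.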
Therefore, cross-modal hashing has been proposed to meet such retrieval demands on large-scale cross-modal databases. Owing to the effectiveness and flexibility of cross-modal hashing, users can submit whatever they have to retrieve whatever they want (Peng, Huang, and Zhao 2017; Peng et al. 2017). The "heterogeneous gap" is the key challenge of cross-modal hashing, meaning that the similarity between different modalities cannot be measured directly. Consequently, a number of cross-modal hashing methods (Kumar and Udupa 2011; Rastegari et al. 2013; Ding et al. 2016; Zhang and Li 2014; Zhuang et al. 2014) have been proposed to bridge this gap. Existing cross-modal hashing methods can be categorized into traditional methods and Deep Neural Network (DNN) based methods. Traditional methods can be further divided into unsupervised and supervised methods, according to whether semantic information is leveraged.

Unsupervised cross-modal hashing methods usually project data from different modalities into a common Hamming space to maximize their correlations, which follows a similar idea to Canonical Correlation Analysis (CCA) (Hardoon, Szedmak, and Shawe-Taylor 2004). Song et al. propose Inter-Media Hashing (IMH) (Song et al. 2013) to establish a common Hamming space by preserving inter-media and intra-media consistency. Cross-view Hashing (CVH) (Kumar and Udupa 2011) considers both intra-view and inter-view similarities, and is an extension of the image hashing method Spectral Hashing (SH) (Weiss, Torralba, and Fergus 2009). Predictable Dual-view Hashing (PDH) (Rastegari et al. 2013) designs an objective function to keep the predictability of pre-generated binary codes. Ding et al. propose Collective Matrix Factorization Hashing (CMFH) (Ding et al. 2016) to learn unified hash codes by collective matrix factorization. Composite Correlation Quantization (CCQ) (Long et al. 2016) jointly learns correlation-maximal mappings that transform different modalities into an isomorphic latent space, and composite quantizers that convert the isomorphic latent features into compact binary codes.

Supervised cross-modal hashing methods utilize labeled semantic information to learn hashing functions. Bronstein et al. propose Cross-Modality Similarity Sensitive Hashing (CMSSH) (Bronstein et al. 2010), which models hash learning as a classification problem trained in a boosting manner. Wei et al. propose Heterogeneous Translated Hashing (HTH) (Wei et al. 2014), which learns translators to align the separate Hamming spaces of different modalities for cross-modal hashing. Semantic Correlation Maximization (SCM) (Zhang and Li 2014) learns hashing functions by constructing and preserving a semantic similarity matrix. Semantics-Preserving Hashing (SePH) (Lin et al. 2015) transforms the semantic matrix into a probability distribution and minimizes the KL-divergence to approximate that distribution with the learned hash codes in the Hamming space.

DNN based methods are inspired by the successful applications of deep learning, such as image classification (Krizhevsky, Sutskever, and Hinton 2012). Cross-Media Neural Network Hashing (CMNNH) (Zhuang et al. 2014) learns cross-modal hashing functions by preserving intra-modal discriminative ability and inter-modal pairwise correlation. Cross Autoencoder Hashing (CAH) (Cao et al.
2016b) is based on a deep autoencoder structure that maximizes the feature correlation and the semantic correlation between different modalities. Cao et al. propose Deep Visual-semantic Hashing (DVH) (Cao et al. 2016a) as an end-to-end framework that combines representation learning and hashing function learning. Jiang et al. propose Deep Cross-modal Hashing (DCMH) (Jiang and Li 2017), which performs feature learning and hashing function learning simultaneously.

Compared with the unsupervised paradigm, supervised methods use labeled semantic information that requires massive labor to collect, resulting in high labor costs in real-world applications. In contrast, unsupervised cross-modal hashing methods can leverage unlabeled data to realize efficient cross-modal retrieval, which is more flexible and applicable in real-world applications. However, most unsupervised methods learn hashing functions by preserving inter- and intra-modal correlations, while ignoring the underlying manifold structure across different modalities, which is extremely helpful for capturing meaningful nearest neighbors of different modalities. To address this problem, in this paper we exploit correlation information from the underlying manifold structure of unlabeled data across different modalities to enhance cross-modal hashing learning. Inspired by the recent progress of Generative Adversarial Networks (GANs) (Goodfellow et al. 2014; Reed et al. 2016; Zhao and Gao 2017; Wang et al. 2017), which have shown their ability to model data distributions in an unsupervised fashion, we propose an unsupervised generative adversarial cross-modal hashing (UGACH) approach. We design a graph-based unsupervised correlation method to capture the underlying manifold structure across different modalities, and a generative adversarial network to learn this manifold structure and further enhance performance through an adversarial boosting paradigm. The main contributions of this paper can be summarized as follows:

• We propose a generative adversarial network to model cross-modal hashing in an unsupervised fashion. In the proposed UGACH, given data of any modality, the generative model tries to fit the distribution over the manifold structure and selects informative data of another modality to challenge the discriminative model, while the discriminative model learns to distinguish the generated data from the true positive data sampled from the correlation graph, so as to achieve better retrieval accuracy.

• We propose a correlation graph based learning approach to capture the underlying manifold structure across different modalities, so that data of different modalities but within the same manifold can have smaller Hamming distances, which promotes retrieval accuracy. We also integrate the proposed correlation graph into the proposed generative adversarial network to provide manifold correlation guidance and further promote cross-modal retrieval accuracy.

Extensive experiments compared with 6 state-of-the-art methods on 2 widely-used datasets verify the effectiveness of our proposed approach. The rest of this paper is organized as follows. In the "The Proposed Approach" section, we present our UGACH approach in detail. The experimental results and analyses are reported in the "Experiment" section. Finally, we conclude this paper in the "Conclusion" section.

The Proposed Approach

Figure 1 presents the overview of our proposed approach, which consists of three parts, namely feature extraction, the generative model G, and the discriminative model D.
The feature extraction part employs image and text feature extraction to represent unlabeled data of different modalities as original features; its detailed implementation is described in the "Experiment" section. Given a data item of one modality, G attempts to select informative data from another modality to form a generated pair, which is then sent to D. In D, we construct a correlation graph that captures the manifold structure among the original features. D receives the generated pairs as inputs, and also samples positive data from the constructed graph to form true manifold pairs. D then tries to distinguish the manifold pairs from the generated pairs in order to obtain better discriminative ability. These two models play a minimax game to boost each other, and the finally trained D can be used as the cross-modal hashing model (a simplified sketch of this adversarial scheme is given below). We denote the cross-modal dataset as D = {I, T}, where I represents the image modality and T represents the text modality.
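As a non-authoritative illustration of the scheme just described, and not the authors' implementation, the sketch below builds a k-nearest-neighbor correlation graph over original features, then runs one adversarial step: a hypothetical generator scores and samples text partners for image queries, while the discriminator is trained to rank graph-sampled manifold pairs above generator-selected pairs; the generator is then updated with a REINFORCE-style reward. The feature sizes, the cosine kNN graph, the single-layer networks, the margin ranking loss, and the policy-gradient update are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# --- toy setup; every size here is an illustrative assumption ---
n, d_img, d_txt, code_len, k = 500, 4096, 1000, 64, 5
img_feat = torch.randn(n, d_img)   # stand-in for extracted image features
txt_feat = torch.randn(n, d_txt)   # stand-in for extracted text features

def knn_graph(feat: torch.Tensor, k: int) -> torch.Tensor:
    """Correlation graph: indices of the k nearest neighbors (cosine similarity) per item."""
    normed = F.normalize(feat, dim=1)
    sim = normed @ normed.T
    sim.fill_diagonal_(float("-inf"))        # exclude self-matches
    return sim.topk(k, dim=1).indices        # shape (n, k)

img_graph = knn_graph(img_feat, k)           # manifold neighbors within the image modality

# simple one-layer "hashing" networks mapping each modality into a common code space
gen_img, gen_txt = torch.nn.Linear(d_img, code_len), torch.nn.Linear(d_txt, code_len)
dis_img, dis_txt = torch.nn.Linear(d_img, code_len), torch.nn.Linear(d_txt, code_len)
opt_g = torch.optim.Adam(list(gen_img.parameters()) + list(gen_txt.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(list(dis_img.parameters()) + list(dis_txt.parameters()), lr=1e-4)

def relevance(code_a: torch.Tensor, code_b: torch.Tensor) -> torch.Tensor:
    """Discriminator relevance of a cross-modal pair: higher means more related."""
    return -(torch.tanh(code_a) - torch.tanh(code_b)).pow(2).sum(dim=1)

def train_step(query_idx: torch.Tensor):
    """One adversarial step with image queries and text candidates (one retrieval direction)."""
    q_img = img_feat[query_idx]

    # G scores every text item for each query and samples "informative" partners
    g_scores = torch.tanh(gen_img(q_img)) @ torch.tanh(gen_txt(txt_feat)).T
    g_probs = F.softmax(g_scores, dim=1)
    sampled = torch.multinomial(g_probs, 1).squeeze(1)       # generated pair partners

    # true manifold partner: the text of a random graph neighbor of each query image
    nbr_col = torch.randint(k, (len(query_idx), 1))
    nbr = img_graph[query_idx].gather(1, nbr_col).squeeze(1)

    # D step: rank graph-sampled (manifold) pairs above generated pairs by a margin
    d_true = relevance(dis_img(q_img), dis_txt(txt_feat[nbr]))
    d_fake = relevance(dis_img(q_img), dis_txt(txt_feat[sampled]))
    d_loss = F.relu(1.0 - d_true + d_fake).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # G step: REINFORCE-style update that rewards samples D still finds relevant
    with torch.no_grad():
        reward = torch.sigmoid(relevance(dis_img(q_img), dis_txt(txt_feat[sampled])))
    log_p = torch.log(g_probs[torch.arange(len(query_idx)), sampled] + 1e-8)
    g_loss = -(reward * log_p).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

for _ in range(3):                                           # a few toy iterations
    print(train_step(torch.randint(n, (32,))))
```

In the full method, both retrieval directions (image-to-text and text-to-image) and the binarization of the learned codes would also need to be handled; they are omitted here to keep the sketch short.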


Similar Articles

SCH-GAN: Semi-supervised Cross-modal Hashing by Generative Adversarial Network

Cross-modal hashing aims to map heterogeneous multimedia data into a common Hamming space, which can realize fast and flexible retrieval across different modalities. Supervised cross-modal hashing methods have achieved considerable progress by incorporating semantic side information. However, they mainly have two limitations: (1) Heavily rely on large-scale labeled cross-modal training data whi...

HashGAN: Attention-aware Deep Adversarial Hashing for Cross Modal Retrieval

As the rapid growth of multi-modal data, hashing methods for cross-modal retrieval have received considerable attention. Deep-networks-based cross-modal hashing methods are appealing as they can integrate feature learning and hash coding into end-to-end trainable frameworks. However, it is still challenging to find content similarities between different modalities of data due to the heterogenei...

TUCH: Turning Cross-view Hashing into Single-view Hashing via Generative Adversarial Nets

Cross-view retrieval, which focuses on searching images as response to text queries or vice versa, has received increasing attention recently. Crossview hashing is to efficiently solve the cross-view retrieval problem with binary hash codes. Most existing works on cross-view hashing exploit multiview embedding method to tackle this problem, which inevitably causes the information loss in both i...

CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning

It is known that the inconsistent distribution and representation of different modalities, such as image and text, cause the heterogeneity gap, which makes it very challenging to correlate such heterogeneous data. Recently, generative adversarial networks (GANs) have been proposed and shown its strong ability of modeling data distribution and learning discriminative representation, and most of ...

In2I : Unsupervised Multi-Image-to-Image Translation Using Generative Adversarial Networks

In unsupervised image-to-image translation, the goal is to learn the mapping between an input image and an output image using a set of unpaired training images. In this paper, we propose an extension of the unsupervised image-toimage translation problem to multiple input setting. Given a set of paired images from multiple modalities, a transformation is learned to translate the input into a spe...


Journal:
  • CoRR

Volume: abs/1712.00358

Pages: -

Publication date: 2017