Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment

Authors

Abstract

Image-text retrieval requires the system to bridge the heterogeneous gap between vision and language for accurate retrieval while keeping the network lightweight enough for efficient retrieval. Existing trade-off solutions mainly study the problem from the view of incorporating cross-modal interactions with the independent-embedding framework or leveraging stronger pretrained encoders, which still demand time-consuming similarity measurement or a heavyweight model structure in the retrieval stage. In this work, we propose an image-text alignment module, SelfAlign, on top of the independent-embedding framework, which improves retrieval accuracy while maintaining retrieval efficiency without extra supervision. SelfAlign contains two collaborative sub-modules that force image-text alignment at both the concept level and the context level by self-supervised contrastive learning. It does not require cross-modal embedding interactions during training while maintaining independent image and text encoders during retrieval. With comparable time cost, SelfAlign consistently boosts the accuracy of state-of-the-art non-pretraining independent-embedding models by 9.1%, 4.2% and 6.6%, respectively, in terms of R@sum score on the Flickr30K, MS-COCO 1K and MS-COCO 5K datasets. The retrieval accuracy also outperforms most existing interactive-embedding models with an orders-of-magnitude decrease in retrieval time. The source code is available at: https://github.com/Zjamie813/SelfAlign.
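To make the evaluation setting concrete, the sketch below shows how retrieval with independent embeddings and an R@sum score could be computed. It is not taken from the SelfAlign code base: the function names, the toy 2-D embeddings, and the choice of cosine similarity are illustrative assumptions, and R@sum is assumed to follow the common convention of summing R@1, R@5 and R@10 over both retrieval directions (image-to-text and text-to-image).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text embedding.

    Both sets of embeddings are computed independently, so at retrieval
    time the only cross-modal operation is this dot product.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def recall_at_k(sim, relevant, k):
    """Percentage of queries (rows of sim) with a relevant item in the top k."""
    hits = 0
    for q, row in enumerate(sim):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if any(j in relevant[q] for j in top_k):
            hits += 1
    return 100.0 * hits / len(sim)

def r_sum(sim, img_to_txt, ks=(1, 5, 10)):
    """Sum of R@k over both directions (assumed definition of R@sum).

    img_to_txt[i] is the set of text indices that match image i.
    """
    n_img, n_txt = len(sim), len(sim[0])
    # Transpose the matrix so that texts become the queries.
    sim_t = [[sim[i][j] for i in range(n_img)] for j in range(n_txt)]
    txt_to_img = [{i for i, ts in enumerate(img_to_txt) if j in ts}
                  for j in range(n_txt)]
    i2t = sum(recall_at_k(sim, img_to_txt, k) for k in ks)
    t2i = sum(recall_at_k(sim_t, txt_to_img, k) for k in ks)
    return i2t + t2i

# Toy data (hypothetical): three images, three captions, one match each.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[2.0, 0.1], [0.1, 3.0], [1.0, 1.1]]
ground_truth = [{0}, {1}, {2}]

sim = similarity_matrix(image_embs, text_embs)
score = r_sum(sim, ground_truth)
```

On this toy data every query ranks its match first, so the score reaches the maximum of 600 (six recall values of 100 each). On the real benchmarks each image has several captions, which the set-valued `img_to_txt` mapping is meant to accommodate.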

Similar resources

Fast Fine-grained Image Classification via Weakly Supervised Discriminative Localization

Fine-grained image classification is to recognize hundreds of subcategories in each basic-level category. Existing methods employ discriminative localization to find the key distinctions among similar subcategories. However, existing methods generally have two limitations: (1) Discriminative localization relies on region proposal methods to hypothesize the locations of discriminative regions, w...

Fine-Grained Image Retrieval: the Text/Sketch Input Dilemma

Fine-grained image retrieval (FGIR) enables a user to search for a photo of an object instance based on a mental picture. Depending on how the object is described by the user, two general approaches exist: sketch-based FGIR or text-based FGIR, each of which has its own pros and cons. However, no attempt has been made to systematically investigate how informative each of these two input modaliti...

Weakly Supervised Fine-Grained Image Categorization

In this paper, we categorize fine-grained images without using any object / part annotation in either the training or the testing stage, a step towards making it suitable for deployments. Fine-grained image categorization aims to classify objects with subtle distinctions. Most existing works heavily rely on object / part detectors to build the correspondence between object parts by using o...

Multidimensional interactive fine-grained image retrieval

We propose an image retrieval methodology for a collection of similar images. By similar, we mean that one can define, for the collection, a set of dimensions, and for each of which a set of features. The dimensions are used to capture the essential characteristics of the images in the collection, and the features are for describing each image to a certain degree. We call this strategy fine-gra...

PatchIt: Self-Supervised Network Weight Initialization for Fine-grained Recognition

ConvNet training is highly sensitive to initialization of the weights. A widespread approach is to initialize the network with weights trained for a different task, an auxiliary task. The ImageNet-based ILSVRC classification task is a very popular choice for this, as it has been shown to produce powerful feature representations applicable to a wide variety of tasks. However, this creates a significa...

Journal

Journal title: IEEE Transactions on Multimedia

Year: 2023

ISSN: 1520-9210, 1941-0077

DOI: https://doi.org/10.1109/tmm.2023.3280734