General Document Retrieval in Compact Space

Authors

GONZALO NAVARRO

SIMON J. PUGLISI

DANIEL VALENZUELA

Abstract

Given a collection of documents and a query pattern, document retrieval is the problem of obtaining documents that are relevant to the query. The collection is available beforehand so that a data structure, called an index, can be built on it to speed up queries. While initially restricted to natural language text collections, document retrieval problems arise nowadays in applications like bioinformatics, multimedia databases, and Web mining. This requires a more general setup where text and pattern can be general sequences of symbols, and the classical inverted indexes developed for words cannot be applied. While linear-space time-optimal solutions have been developed for most interesting queries in this general case, space usage is a serious problem in practice. In this article we develop compact data structures that solve various important document retrieval problems on general text collections. More specifically, we provide practical solutions for listing the documents where a query pattern appears, together with its frequency in each document, and for listing k documents where a query pattern appears most frequently. Some of our techniques build on existing theoretical proposals, while others are new. In particular, we introduce a novel grammar-based compressed bitmap representation that may be of independent interest when dealing with repetitive sequences. Ours are the first practical indexes that use less space when the text collection is compressible. Our experimental results show that, on various real-life text collections, our data structures are significantly smaller than the most space-efficient previous solutions, using up to half the space without noticeably increasing the query time. Overall, document listing can be carried out in 10 to 40 milliseconds for patterns that appear 100 to 10,000 times in the collection, whereas top-k retrieval is carried out in k to 10k milliseconds.
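To make the two query types in the abstract concrete, the following brute-force Python sketch computes document listing with frequencies and top-k retrieval by scanning every document. It is only an illustration of what the queries return, not the compact index the article develops; the function names (`count_occurrences`, `document_listing`, `top_k`) and the choice to count overlapping occurrences and break frequency ties by document id are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps (an assumption)."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are also found.
        start = text.find(pattern, start + 1)
    return count


def document_listing(docs: List[str], pattern: str) -> Dict[int, int]:
    """Map each document id where pattern appears to its frequency there."""
    freqs: Dict[int, int] = {}
    for doc_id, doc in enumerate(docs):
        c = count_occurrences(doc, pattern)
        if c > 0:
            freqs[doc_id] = c
    return freqs


def top_k(docs: List[str], pattern: str, k: int) -> List[Tuple[int, int]]:
    """Return up to k (doc_id, frequency) pairs, most frequent first."""
    freqs = document_listing(docs, pattern)
    # Ties are broken by smaller document id (hypothetical choice).
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


if __name__ == "__main__":
    # General symbol sequences (here DNA-like strings) rather than words.
    collection = ["ACGTACGT", "TTTT", "ACAC", "GGG"]
    print(document_listing(collection, "AC"))
    print(top_k(collection, "T", 2))
```

This baseline scans the whole collection on every query, which is exactly the cost an index built beforehand is meant to avoid.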

Similar resources

Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function, and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...

Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback

Keyword spotting is a well-known method in document image retrieval. In this method, search in document images is based on a query word image. In this paper, an approach for document image retrieval based on keyword spotting is proposed. In the proposed method, a framework using relevance feedback is presented. Relevance feedback, an interactive and efficient method, is used in this paper to imp...

On Character Space of the Algebra of BSE-functions

Suppose that $A$ is a semi-simple and commutative Banach algebra. In this paper we try to characterize the character space of the Banach algebra $C_{\rm{BSE}}(\Delta(A))$ consisting of all BSE-functions on $\Delta(A)$, where $\Delta(A)$ denotes the character space of $A$. Indeed, in the case that $A=C_0(X)$, where $X$ is a non-empty locally compact Hausdorff space, we give a complete characterizatio...

Construction of Compact Retrieval Models Unifying Framework and Analysis

In similarity search we are given a query document dq and a document collection D, and the task is to retrieve from D the most similar documents with respect to dq. For this task the vector space model, which represents a document d as a vector d, is a common starting point. Due to the high dimensionality of d, the similarity search cannot be accelerated with space- or data-partitioning indexes; d...

Publication date: 2013
