An Efficient and Versatile Query Engine for TopX Search

Authors

  • Martin Theobald
  • Ralf Schenkel
  • Gerhard Weikum
Abstract

This paper presents a novel engine, coined TopX, for efficient ranked retrieval of XML documents over semistructured but nonschematic data collections. The algorithm follows the paradigm of threshold algorithms for top-k query processing with a focus on inexpensive sequential accesses to index lists and only a few judiciously scheduled random accesses. The difficulties in applying the existing top-k algorithms to XML data lie in 1) the need to consider scores for XML elements while aggregating them at the document level, 2) the combination of vague content conditions with XML path conditions, 3) the need to relax query conditions if too few results satisfy all conditions, and 4) the selectivity estimation for both content and structure conditions and their impact on evaluation strategies. TopX addresses these issues by precomputing score and path information in an appropriately designed index structure, by largely avoiding or postponing the evaluation of expensive path conditions so as to preserve the sequential access pattern on index lists, and by selectively scheduling random accesses when they are cost-beneficial. In addition, TopX can compute approximate top-k results using probabilistic score estimators, thus speeding up queries with a small and controllable loss in retrieval precision.
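
The threshold-algorithm paradigm mentioned in the abstract can be illustrated with a minimal sketch of a generic no-random-access (NRA) style top-k scan. This is not the TopX implementation: it ignores XML elements, path conditions, random-access scheduling, and probabilistic pruning. The function name `nra_top_k`, the list layout (one score-sorted list of `(doc_id, score)` pairs per query term), summation as the monotonic aggregation function, and the sample scores are all assumptions made for illustration.

```python
import heapq

def nra_top_k(index_lists, k):
    """Generic NRA-style top-k scan (illustrative sketch, not TopX itself).

    index_lists: one list per query term, each holding (doc_id, score)
    pairs sorted by descending, non-negative score.
    Returns (top_k, depth): the top-k (doc_id, lower-bound score) pairs
    and the number of sequential-access rounds that were needed.
    """
    m = len(index_lists)
    seen = {}                    # doc_id -> {list index: score seen there}
    high = [0] * m               # last score read from each list
    max_depth = max((len(lst) for lst in index_lists), default=0)
    worst, top, depth = {}, [], 0

    while depth < max_depth:
        # One round of sequential accesses: read the next entry of every list.
        for i, lst in enumerate(index_lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0      # list exhausted: nothing more to gain here
        depth += 1

        # worst = score known so far; best = worst plus the most that
        # could still be added from lists where the doc is not yet seen.
        worst = {d: sum(sc.values()) for d, sc in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in sc)
                for d, sc in seen.items()}

        top = heapq.nlargest(k, worst, key=worst.get)
        if len(top) == k:
            min_k = worst[top[-1]]
            unseen_best = sum(high)   # bound for docs not seen in any list
            others_best = max((best[d] for d in seen if d not in top),
                              default=0)
            # Stop once no other candidate can overtake the current top-k.
            if min_k >= max(unseen_best, others_best):
                break

    return [(d, worst[d]) for d in top], depth


# Hypothetical index lists for a two-term query (integer scores for clarity).
list_a = [("d1", 9), ("d2", 8), ("d3", 3), ("d4", 1)]
list_b = [("d1", 7), ("d3", 6), ("d2", 2), ("d4", 1)]

top1, depth1 = nra_top_k([list_a, list_b], k=1)
top2, depth2 = nra_top_k([list_a, list_b], k=2)
```

With this sample data the scan for `k=1` stops after a single round, and the scan for `k=2` stops after three of the four rounds, which is the kind of early termination that makes sequential index scans inexpensive. The returned scores are lower bounds; they equal the exact scores only for documents that have been seen in every list.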

Similar resources

TopX - Efficient and Versatile Top-k Query Processing for Text, Semistructured, and Structured Data

This paper presents a comprehensive overview of the TopX search engine, an extensive framework for unified indexing and querying large collections of unstructured, semistructured, and structured data. Residing at the very synapse of database (DB) engineering and information retrieval (IR), it integrates efficient scheduling algorithms for top-k-style ranked retrieval with powerful scoring model...

TopX: efficient and versatile top-k query processing for text, structured, and semistructured data

TopX is a top-k retrieval engine for text and XML data. Unlike Boolean engines, it stops query processing as soon as it can safely determine the k top-ranked result objects according to a monotonic score aggregation function with respect to a multidimensional query. The main contributions of the thesis unfold into four main points, confirmed by previous publications at international conference...

Incremental Relevance Feedback for TopX submitted by Osama Sammodi

TopX is a highly efficient and effective search engine for ranked retrieval of XML and plain text data. However, for some difficult queries, the results provided by TopX are not yet completely satisfying. Towards the solution of this problem, an extensible framework has been proposed that incorporates feedback from the user to generate a better, expanded query. In this thesis, we integrate the ...

Similarity Measures for Query Expansion in TopX

TopX is a top-k retrieval engine for text and XML data. Unlike some other engines, TopX includes an ontology. This ontology allows TopX to use techniques like word sense disambiguation and query expansion, to search for words similar to the original query terms. These techniques allow finding data items which would be ignored for the original source query, due to missing of words similar to the...

P2P Web Search: Make It Light, Make It Fly (Demo)

We propose a live demonstration of MinervaLight, a P2P Web search engine. MinervaLight combines the (previously separate) focused crawler BINGO! (to harvest Web data), the local search engine TopX, and our P2P Web search system MINERVA under one common user interface. The crawler unattendedly downloads and indexes Web data, where the scope of the focused crawl can be tailored to the thematic in...

Publication date: 2005