Multi-Stage Search Architectures for Streaming Documents

نویسنده

  • Nima Asadi
چکیده

Title of dissertation: Multi-Stage Search Architectures for Streaming Documents Nima Asadi, Doctor of Philosophy, 2013 Dissertation directed by: Associate Professor Jimmy Lin Department of Computer Science and College of Information Studies The web is becoming more dynamic due to the increasing engagement and contribution of Internet users in the age of social media. A more dynamic web presents new challenges for web search—an important application of Information Retrieval (IR). A stream of new documents constantly flows into the web at a high rate, adding to the old content. In many cases, documents quickly lose their relevance. In these time-sensitive environments, finding relevant content in response to user queries requires a real-time search service; immediate availability of content for search and a fast ranking, which requires an optimized search architecture. These aspects of today’s web are at odds with how academic IR researchers have traditionally viewed the web, as a collection of static documents. Moreover, search architectures have received little attention in the IR literature. Therefore, academic IR research, for the most part, does not provide a mechanism to efficiently handle a high-velocity stream of documents, nor does it facilitate real-time ranking. This dissertation addresses the aforementioned shortcomings. We present an efficient mechanism to index a stream of documents, thereby enabling immediate availability of content. Our indexer works entirely in main memory and provides a mechanism to control inverted list contiguity, thereby enabling faster retrieval. Additionally, we consider document ranking with a machine-learned model, dubbed “Learning to Rank” (LTR), and introduce a novel multi-stage search architecture that enables fast retrieval and allows for more design flexibility. The stages of our architecture include candidate generation (top k retrieval), feature extraction, and document re-ranking. We compare this architecture with a traditional monolithic architecture where candidate generation and feature extraction occur together. As we lay out our architecture, we present optimizations to each stage to facilitate low-latency ranking. These optimizations include a fast approximate top k retrieval algorithm, document vectors for feature extraction, architectureconscious implementations of tree ensembles for LTR using predication and vectorization, and algorithms to train tree-based LTR models that are fast to evaluate. We also study the efficiencyeffectiveness tradeoffs of these techniques, and empirically evaluate our end-to-end architecture on microblog document collections. We show that our techniques improve efficiency without degrading quality. Multi-Stage Search Architectures for Streaming Documents by Nima Asadi Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2013 Advisory Committee: Associate Professor Jimmy Lin, Chair/Advisor Assistant Professor Hal Daumé III Associate Professor Amol Deshpande Professor Alan Sussman Professor Carol Y. Espy-Wilson c © Copyright by Nima Asadi 2013 Dedication To Katherine, Sina, Christina, and my parents.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Lot Streaming in No-wait Multi Product Flowshop Considering Sequence Dependent Setup Times and Position Based Learning Factors

This paper considers a no-wait multi product flowshop scheduling problem with sequence dependent setup times. Lot streaming divide the lots of products into portions called sublots in order to reduce the lead times and work-in-process, and increase the machine utilization rates. The objective is to minimize the makespan. To clarify the system, mathematical model of the problem is presented. Sin...

متن کامل

Hybrid algorithms for Job shop Scheduling Problem with Lot streaming and A Parallel Assembly Stage

In this paper, a Job shop scheduling problem with a parallel assembly stage and Lot Streaming (LS) is considered for the first time in both machining and assembly stages. Lot Streaming technique is a process of splitting jobs into smaller sub-jobs such that successive operations can be overlapped. Hence, to solve job shop scheduling problem with a parallel assembly stage and lot streaming, deci...

متن کامل

Two Architectures for Parallel Processing of Huge Amounts of Text

This paper presents two alternative NLP architectures to analyze massive amounts of documents, using parallel processing. The two architectures focus on different processing scenarios, namely batch-processing and streaming processing. The batch-processing scenario aims at optimizing the overall throughput of the system, i.e., minimizing the overall time spent on processing all documents. The st...

متن کامل

A heuristic approach for multi-stage sequence-dependent group scheduling problems

We present several heuristic algorithms based on tabu search for solving the multi-stage sequence-dependent group scheduling (SDGS) problem by considering minimization of makespan as the criterion. As the problem is recognized to be strongly NP-hard, several meta (tabu) search-based solution algorithms are developed to efficiently solve industry-size problem instances. Also, two different initi...

متن کامل

Semantic Search over XML Document Streams

A large number of web data sources, such as blogs, news sites and podcast hosts, are currently disseminating their content in the form of streaming XML documents. The variability and heterogeneity of those sources make the employment of traditional querying schemes, which are based on structured query languages, cumbersome for the end user (those languages require precise knowledge of the under...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013