Processing XML Streams with Deterministic Automata
Authors
Abstract
We consider the problem of evaluating a large number of XPath expressions on an XML stream. Our main contribution is to show that Deterministic Finite Automata (DFAs) can be used effectively for this problem: in our experiments we achieve a throughput of about 5.4 MB/s, independent of the number of XPath expressions (up to 1,000,000 in our tests). The major problem we face is the size of the DFA. Since the number of states grows exponentially with the number of XPath expressions, it was previously believed that DFAs cannot be used to process large sets of expressions. We give a theoretical analysis of the number of states in the DFA resulting from XPath expressions, and consider both the case when it is constructed eagerly and the case when it is constructed lazily. Our analysis indicates that, when the automaton is constructed lazily, and under certain assumptions about the structure of the input XML data, the number of states in the lazy DFA is manageable. We also validate our findings experimentally, on both synthetic and real XML data sets.
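The lazy construction mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes linear XPath expressions that use only child (/), descendant (//) and wildcard (*) steps, with no predicates, and the names `parse_xpath`, `LazyDFA` and `run` are hypothetical. Each DFA state is a set of NFA states, and a transition is computed only the first time the stream actually requires it.

```python
import io
import re
import xml.etree.ElementTree as ET


def parse_xpath(expr):
    """Split a linear XPath such as '/a//b/*' into (axis, label) steps."""
    return [("desc" if sep == "//" else "child", label)
            for sep, label in re.findall(r"(//|/)([^/]+)", expr)]


class LazyDFA:
    """Hypothetical helper: determinizes the XPath NFA on demand."""

    def __init__(self, xpaths):
        self.queries = [parse_xpath(x) for x in xpaths]
        # An NFA state is (query index, number of steps matched so far).
        self.initial = frozenset((q, 0) for q in range(len(self.queries)))
        self.table = {}  # (dfa_state, tag) -> dfa_state, filled lazily

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:  # compute a transition only when first needed
            nxt = set()
            for q, i in state:
                steps = self.queries[q]
                if i == len(steps):
                    continue  # an accepting NFA state has no outgoing moves
                axis, label = steps[i]
                if axis == "desc":
                    nxt.add((q, i))  # '//' may skip over any element
                if label in ("*", tag):
                    nxt.add((q, i + 1))
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def matches(self, state):
        """Indices of the queries accepted in this DFA state."""
        return sorted(q for q, i in state if i == len(self.queries[q]))


def run(dfa, xml_bytes):
    """Stream the document, keeping one DFA state per open element."""
    stack = [dfa.initial]
    hits = []
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            stack.append(dfa.step(stack[-1], elem.tag))
            for q in dfa.matches(stack[-1]):
                hits.append((q, elem.tag))
        else:
            stack.pop()
    return hits
```

In this sketch the transition table holds only the (state, tag) pairs that occur in the data, which is the intuition behind the claim that the lazy DFA stays manageable when the input XML has a regular structure. The per-element cost is one dictionary lookup once a transition has been built, regardless of how many expressions are loaded.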
Similar resources
Automaton Meets Query Algebra: Towards a Unified Model for XQuery Evaluation over XML Data Streams
In this work, we address the efficient evaluation of XQuery expressions over continuous XML data streams, which is essential for a broad range of applications including monitoring systems and information dissemination systems. While previous work has shown that automata theory is suited for on-the-fly pattern retrieval over XML data streams, we find that automata-based approaches suffer from be...
Online Dictionary Matching for Streams of XML Documents
We consider the online multiple-pattern matching problem for streams of XML documents, when the patterns are expressed as linear XPath expressions containing child operators (/), descendant operators (//) and wildcards (∗) but no predicates. For each document in the stream, the task is to determine all occurrences in the document of all the patterns. We present a general multiple-pattern-matchi...
Semantic Query Optimization in an Automata-Algebra Combined XQuery Engine over XML Streams
Our Raindrop framework [6, 9] aims at tackling challenges of stream processing that are particular to XML. In contrast to the tuple-based or object-based data streams, XML streams are usually modeled as a sequence of primitive tokens, such as a start tag, an end tag or a PCDATA item. Unlike a self-contained tuple or object whose semantics are completely determined by its own values, a token lac...
Optimizing The Lazy DFA Approach for XML Stream Processing
The lazy DFA (Deterministic Finite Automata) approach has recently been proposed for efficient XML stream data processing. This paper discusses the drawbacks of the approach, suggests several optimizations as solutions, and presents a detailed analysis of the processing model. The experiments show that our proposed approach is indeed effective and scalable.
Efficient XML Stream Processing with Automata and Query Algebra
XML Stream Processing is an emerging technology designed to support declarative queries over continuous streams of data. The interest in this novel technology is growing due to the increasing number of real world applications such as monitoring systems for stock, email, and sensor data that need to analyze incoming data streams. There are however several open challenges. One, we must develop ef...
Publication date: 2003