Self-adjusting trees in practice for large text collections
Authors
Abstract
Splay and randomized search trees (RSTs) are self-balancing binary tree structures with little or no space overhead compared to a standard binary search tree (BST). Both trees are intended for use in applications where node accesses are skewed, for example in gathering the distinct words in a large text collection for index construction. We investigate the efficiency of these trees for such vocabulary accumulation. Surprisingly, unmodified splaying and RSTs are on average around 25% slower than using a standard binary tree. We investigate heuristics to limit splay tree reorganization costs and show their effectiveness in practice. In particular, a periodic rotation scheme improves the speed of splaying by 27%, while other proposed heuristics are less effective. We also report the performance of efficient bit-wise hashing and red-black trees for comparison. Copyright 2001 John Wiley & Sons, Ltd.
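The abstract does not spell out how the periodic rotation scheme works, so the following is only a minimal sketch under a stated assumption: the tree is splayed on every k-th access and searched or updated as a plain BST otherwise. The names `Node`, `splay`, `Vocabulary`, and `splay_interval` are hypothetical and are not taken from the paper.

```python
class Node:
    """One distinct word and its occurrence count."""
    __slots__ = ("key", "count", "left", "right")

    def __init__(self, key):
        self.key = key
        self.count = 1
        self.left = None
        self.right = None


def splay(root, key):
    """Top-down splay: move `key` (or a neighbour of it) to the root."""
    header = Node(None)          # scratch node holding the split-off subtrees
    left = right = header
    t = root
    while True:
        if key < t.key:
            if t.left is None:
                break
            if key < t.left.key:             # zig-zig: rotate right
                y = t.left
                t.left, y.right = y.right, t
                t = y
                if t.left is None:
                    break
            right.left = t                   # link into the right subtree
            right = t
            t = t.left
        elif key > t.key:
            if t.right is None:
                break
            if key > t.right.key:            # zig-zig: rotate left
                y = t.right
                t.right, y.left = y.left, t
                t = y
                if t.right is None:
                    break
            left.right = t                   # link into the left subtree
            left = t
            t = t.right
        else:
            break
    left.right, right.left = t.left, t.right  # reassemble around t
    t.left, t.right = header.right, header.left
    return t


class Vocabulary:
    """Accumulates distinct words; splays only on every k-th access (assumed scheme)."""

    def __init__(self, splay_interval=4):
        self.root = None
        self.interval = splay_interval
        self.accesses = 0
        self.size = 0

    def add(self, word):
        self.accesses += 1
        if self.root is None:
            self.root, self.size = Node(word), 1
        elif self.accesses % self.interval == 0:
            self._add_with_splay(word)
        else:
            self._add_plain(word)

    def _add_plain(self, word):
        """Ordinary BST search/insert with no restructuring."""
        t = self.root
        while True:
            if word == t.key:
                t.count += 1
                return
            side = "left" if word < t.key else "right"
            child = getattr(t, side)
            if child is None:
                setattr(t, side, Node(word))
                self.size += 1
                return
            t = child

    def _add_with_splay(self, word):
        """Splay first, then count the word or insert it as the new root."""
        t = splay(self.root, word)
        if t.key == word:
            t.count += 1
        else:
            n = Node(word)
            if word < t.key:
                n.left, n.right, t.left = t.left, t, None
            else:
                n.right, n.left, t.right = t.right, t, None
            t = n
            self.size += 1
        self.root = t

    def items(self):
        """Yield (word, count) pairs in sorted order (iterative in-order walk)."""
        stack, t = [], self.root
        while stack or t:
            while t:
                stack.append(t)
                t = t.left
            t = stack.pop()
            yield t.key, t.count
            t = t.right
```

With `splay_interval=1` this behaves like an ordinary splay tree, and a very large interval approaches a plain BST, so the parameter trades reorganization cost against how quickly frequent words migrate toward the root. Which setting is fastest depends on the data and would have to be measured.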
Similar resources
Self-Indexing Based on LZ77
We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that sou...
Self-Index Based on LZ77
We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that sou...
MUCK: A toolkit for extracting and visualizing semantic dimensions of large text collections
Users with large text collections are often faced with one of two problems: either they wish to retrieve a semantically relevant subset of data from the collection for further scrutiny (needle-in-a-haystack) or they wish to glean a high-level understanding of how a subset compares to the parent corpus in the context of aforementioned semantic dimensions (forest-for-the-trees). In this paper, I de...
Self-Indexing XML
Self-indexing is a technology that integrates text compression and text indexing, such that a text collection can be simultaneously compressed and indexed. The resulting representation, called a self-index of the text, takes space close to that of the compressed text, is capable of reproducing any text substring, and offers indexed searching of the collection. This has been a major breakthrough in t...
An Evaluation of Self-adjusting Binary Search Tree Techniques
Much has been said in praise of self-adjusting data structures, particularly self-adjusting binary search trees. Self-adjusting trees are most suited to skewed key-access distributions as the techniques attempt to place the most commonly accessed keys near the root of the tree. Theoretical bounds on worst-case and amortized performance (i.e. performance over a sequence of operations) have been ...
Journal:
- Softw., Pract. Exper.
Volume 31, Issue
Pages -
Publication date: 2001