Archiving Temporal Web Information: Organization of Web Contents for Fast Access and Compact Storage
Abstract
We address the problem of archiving dynamic web contents over significant time spans. Current schemes crawl the web at regular intervals and archive the contents after every crawl, whether or not the contents have changed between consecutive crawls. Our goal is to store newly crawled contents only when they differ from those of the previous crawl, while still ensuring accurate and quick retrieval of archived contents for arbitrary temporal queries over the archived time period. In this paper, we develop a scheme that stores unique temporal web contents in containers following the widely used ARC/WARC format and that provides quick access to the archived contents for arbitrary temporal queries. A novel component of our scheme is a new indexing structure based on the concept of persistent, or multi-version, data structures. The scheme can be shown to be asymptotically optimal in both storage utilization and insert/retrieval time. We illustrate the performance of our method on two very different data sets from the Stanford WebBase project, the first reflecting very dynamic web contents and the second relatively static web contents. The experimental results show the substantial storage savings achieved by eliminating duplicate contents detected between consecutive crawls, as well as the speed at which our method finds the archived contents specified through arbitrary temporal queries.
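The two ideas in the abstract, storing a page only when it differs from the previous crawl and answering "what did this URL look like at time t" queries, can be illustrated with a minimal Python sketch. This is not the paper's method: it swaps the persistent (multi-version) index for a plain sorted list per URL searched with binary search, keeps everything in memory instead of ARC/WARC containers, and assumes duplicates are detected by comparing SHA-256 digests. The class name `TemporalArchive` and its methods are hypothetical names chosen for illustration.

```python
import bisect
import hashlib
from collections import defaultdict


class TemporalArchive:
    """Toy in-memory archive: one version list per URL, sorted by crawl time.

    Illustrative only. A sorted list with binary search stands in for the
    persistent index described in the abstract.
    """

    def __init__(self):
        # url -> crawl times at which a new version was first seen
        self._times = defaultdict(list)
        # url -> stored payloads, parallel to self._times[url]
        self._versions = defaultdict(list)
        # url -> digest of the most recently crawled content
        self._last_digest = {}

    def add_crawl(self, url, crawl_time, content):
        """Store content only if it differs from the previous crawl of url.

        Assumes crawls of a given URL arrive in increasing time order.
        Returns True when a new version was stored, False when skipped.
        """
        digest = hashlib.sha256(content).hexdigest()  # assumed change detector
        if self._last_digest.get(url) == digest:
            return False  # unchanged since the previous crawl: store nothing
        self._last_digest[url] = digest
        self._times[url].append(crawl_time)
        self._versions[url].append(content)
        return True

    def get(self, url, query_time):
        """Return the version of url current at query_time, or None."""
        times = self._times.get(url)
        if not times:
            return None
        # Index of the last stored version with first-seen time <= query_time.
        i = bisect.bisect_right(times, query_time) - 1
        return self._versions[url][i] if i >= 0 else None

    def stored_versions(self, url):
        """Number of distinct versions actually kept for url."""
        return len(self._versions.get(url, []))


archive = TemporalArchive()
archive.add_crawl("http://example.org/", 1, b"v1")
archive.add_crawl("http://example.org/", 2, b"v1")  # duplicate, not stored
archive.add_crawl("http://example.org/", 3, b"v2")
snapshot = archive.get("http://example.org/", 2)  # the version seen at time 1
```

In this sketch three crawls produce two stored versions, and a query for time 2 is answered by the version first seen at time 1, since nothing changed in between. A lookup costs one binary search over the versions of a single URL; the sketch makes no claim about the asymptotic bounds stated in the abstract.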
Similar resources
Search and Access Strategies for Web Archives
The Web has become the main publication medium worldwide, covering almost every facet of human activity. In many cases, the Web is the only medium where such information is recorded. However, the Web is an ephemeral medium whose contents are constantly changing and new information is rapidly replacing old information, and hence the critical importance of establishing web archives to capture at ...
ArcLink: Optimization techniques to build and retrieve the Temporal Web Graph
Archiving the web is socially and culturally critical, but presents problems of scale. In this paper, we present ArcLink, an exemplary system to optimize the construction, storage, and access to the temporal web graph from large-scale web archive. We divide the web archive construction into four stages (filtering, extraction, storage, and access) and explore optimizations for each stage. We wer...
Temporal multi-page summarization
With the increasing popularity of the Web, efficient approaches to the information overload are becoming more necessary. Summarization of web pages aims at detecting the most important contents from pages so that a user can obtain a compact version of a web document or a group of pages. Traditionally, summaries are constructed on static snapshots of web pages. However, web pages are dynamic obj...
Reputation-based Contents Crawling in Web Archiving System
The size of the web archive is increasing exponentially, and many national libraries are making efforts to preserve born-digital scientific, artistic and cultural contents. However, in order to crawl and store a huge volume of digital information, it is very hard to resolve various problems from the social, legal and technical viewpoints. In this paper, from the view points of long-term preserving d...
A model for specification, composition and verification of access control policies and its application to web services
Despite significant advances in the access control domain, requirements of new computational environments like web services still raise new challenges. Lack of appropriate method for specification of access control policies (ACPs), composition, verification and analysis of them have all made the access control in the composition of web services a complicated problem. In this paper, a new indepe...
Publication date: 2008