WICE- Web Informative Content Extraction
Authors
Abstract
With the rapid development of the Internet, a huge amount of data has accumulated on the Web. Web pages usually contain varied content, some relevant and some irrelevant to the main topic. Extracting useful or relevant information from this mass of information has become more complex and time-consuming. Identifying the useful data region is therefore a significant problem for information extraction from Web documents. In this paper, we propose a system that can extract informative or useful content from Web pages across different sites. XPath-based extraction rules are generated to facilitate later extraction from other similar pages. We have performed experimental studies using real Web pages from several kinds of Web sites, namely commerce, business directory, and publication sites. The extraction accuracy is also compared with that of prior research, and the extraction results demonstrate the validity of the approach.
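As a rough illustration of what reusing an XPath-based extraction rule on similar pages can look like, here is a minimal sketch. It is not the system described in the abstract: the rule string, the page snippets, and the `extract` helper are all hypothetical, and the sketch assumes well-formed markup so that Python's standard `xml.etree.ElementTree` (which supports only a subset of XPath) can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule, assumed to have been derived from one sample page:
# the informative text sits in <p> children of <div id="main">.
RULE = ".//div[@id='main']/p"


def extract(page_source, rule=RULE):
    """Apply an XPath rule to a well-formed page and return the matched text."""
    root = ET.fromstring(page_source)
    return ["".join(node.itertext()).strip() for node in root.findall(rule)]


# Two made-up pages that share a template but differ in content.
PAGE_A = (
    "<html><body>"
    "<div id='nav'><p>Home | About</p></div>"
    "<div id='main'><p>Product A costs 10 USD.</p><p>In stock.</p></div>"
    "<div id='footer'><p>Copyright notice</p></div>"
    "</body></html>"
)
PAGE_B = (
    "<html><body>"
    "<div id='nav'><p>Home | About</p></div>"
    "<div id='main'><p>Product B costs 25 USD.</p></div>"
    "<div id='footer'><p>Copyright notice</p></div>"
    "</body></html>"
)

if __name__ == "__main__":
    for page in (PAGE_A, PAGE_B):
        print(extract(page))
```

The same rule picks out the main block on both pages and skips the navigation and footer blocks. Real Web pages are rarely well-formed XML, so a practical version would likely need a tolerant HTML parser in place of `ElementTree`.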
Similar resources
Identifying Informative Web Content Blocks using Web Page Segmentation
Information Extraction has become an important task for discovering useful knowledge or information from the Web. A crawler system, which gathers the information from the Web, is one of the fundamental necessities of Information Extraction. A search engine uses a crawler to crawl and index web pages. A search engine takes into account only the informative content for indexing. In addition to info...
Extracting Various Types of Informative Web Content via Fuzzy Sequential Pattern Mining
In this paper, we present a web content extraction method to extract different types of informative web content for news web pages. A fuzzy sequential pattern mining method, namely FSP, is developed to gradually discover fuzzy sequential patterns for various types of informative web content. To avoid the situation that the usage of HTML tags may be changed with the development of web technology...
Recognising Informative Web Page Blocks Using Visual Segmentation for Efficient Information Extraction
As web sites are getting more complicated, the construction of web information extraction systems becomes more troublesome and time-consuming. A common theme is the difficulty in locating the segments of a page in which the target information is contained, which we call the informative blocks. This article reports on the Recognising Informative Page Blocks algorithm (RIPB), which is able to ide...
A Web Agent for Conceptual Cost Estimation of Highway Construction Projects
Accurate cost estimation in the early stage of project lifecycle is essential for both effective financial planning and cost control of construction projects. Previous efforts by the research team have developed a Web-based Intelligent Cost Estimator (WICE) system. This paper describes a further step of WICE research to develop a special-purpose web agent for various highway construction projec...
Main Content Extraction from Detailed Web Pages
As we know, detailed web pages on the Internet contain information that is not considered primary content, such as advertisements, headers, footers, navigation links and copyright information. Information on web pages such as comments and reviews is also not preferred by search engines for indexing as informative content, so having an algorithm to extract only main content could help better qua...
Journal title:
Volume, issue:
Pages: -
Publication date: 2013