WICE: Web Informative Content Extraction

Authors

  • Swe Swe Nyein
  • Myat Myat Min
Abstract

With the accelerated development of the Internet, a huge amount of data has been accumulated and stored on the Web. Web pages usually contain various contents, some relevant and some irrelevant to the main topic, so extracting the useful information from this mass of information has become more complex and time-consuming. Identifying the useful data region is a significant problem for information extraction from Web documents. In this paper, we propose a system that can extract informative content from Web pages across different sites. XPath-based extraction rules are generated to facilitate later extraction from other similar pages. We performed experimental studies using real Web pages from several kinds of Web sites, namely commerce, business directory, and publication sites. The extraction accuracy is also compared with prior research, and the extraction results convincingly support the validity of the approach.
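
The idea of reusing an XPath-based rule on similar pages can be illustrated with a minimal sketch. It does not reproduce the system proposed in the paper: the rule string, the two sample pages, and the `extract` helper are all hypothetical, and the rule is written by hand rather than generated. It relies only on the limited XPath subset in Python's standard `xml.etree.ElementTree`, so the sample pages are assumed to be well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-written rule (NOT generated by the paper's system):
# select paragraphs that are direct children of the block with id="content".
RULE = ".//div[@id='content']/p"

# Two hypothetical pages sharing the same template; only the data differs.
PAGE_A = """<html><body>
<div id="nav"><a href="/">Home</a></div>
<div id="content"><h1>Item A</h1><p>Price: 10</p><p>In stock</p></div>
<div id="footer"><p>Copyright</p></div>
</body></html>"""

PAGE_B = """<html><body>
<div id="nav"><a href="/">Home</a></div>
<div id="content"><h1>Item B</h1><p>Price: 25</p></div>
<div id="footer"><p>Copyright</p></div>
</body></html>"""


def extract(page: str, rule: str) -> list[str]:
    """Apply one XPath rule to a well-formed page and return the matched text."""
    root = ET.fromstring(page)
    return ["".join(node.itertext()).strip() for node in root.findall(rule)]


if __name__ == "__main__":
    for page in (PAGE_A, PAGE_B):
        print(extract(page, RULE))
```

Because both pages share one template, the same rule selects the informative paragraphs in each and skips the navigation and footer blocks. Real Web pages are often not well-formed, so a practical system would likely need a tolerant HTML parser in place of `ElementTree`.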

Similar resources

Identifying Informative Web Content Blocks using Web Page Segmentation

Information Extraction has become an important task for discovering useful knowledge or information from the Web. A crawler system, which gathers information from the Web, is one of the fundamental necessities of Information Extraction. A search engine uses a crawler to crawl and index web pages. The search engine takes into account only the informative content for indexing. In addition to info...

Extracting Various Types of Informative Web Content via Fuzzy Sequential Pattern Mining

In this paper, we present a web content extraction method to extract different types of informative web content for news web pages. A fuzzy sequential pattern mining method, namely FSP, is developed to gradually discover fuzzy sequential patterns for various types of informative web content. To avoid the situation that the usage of HTML tags may be changed with the development of web technology...

Recognising Informative Web Page Blocks Using Visual Segmentation for Efficient Information Extraction

As web sites are getting more complicated, the construction of web information extraction systems becomes more troublesome and time-consuming. A common theme is the difficulty in locating the segments of a page in which the target information is contained, which we call the informative blocks. This article reports on the Recognising Informative Page Blocks algorithm (RIPB), which is able to ide...

A Web Agent for Conceptual Cost Estimation of Highway Construction Projects

Accurate cost estimation in the early stage of project lifecycle is essential for both effective financial planning and cost control of construction projects. Previous efforts by the research team have developed a Web-based Intelligent Cost Estimator (WICE) system. This paper describes a further step of WICE research to develop a special-purpose web agent for various highway construction projec...

Main Content Extraction from Detailed Web Pages

As we know, detailed web pages on the internet contain information that is not considered primary content, such as advertisements, headers, footers, navigation links, and copyright information. Information on web pages such as comments and reviews is also not preferred by search engines for indexing as informative content, so having an algorithm that extracts only the main content could help better qua...

Publication date: 2013