On Mining DOM Trees to build Information Extractors

Authors

Gretel Fernández

Hassan A. Sleiman

Rafael Corchuelo

Rafael Z. Frantz

Abstract

The Web is the largest information repository, but the information it contains is usually available in human-friendly formats. Companies that want to use this information need it in structured formats so that they can feed it into automated business processes. The literature offers many proposals to infer information extractors; they build on machine learning techniques that attempt to infer a pattern in the HTML or XPath sources. To the best of our knowledge, no one has explored using data mining techniques on DOM trees. In this paper, we report on a methodology that builds on data mining CSS features and a few other DOM features. Our results suggest that this methodology is promising.

Similar resources

DOMISA: DOM-based Information Space Adsorption for Web Information Hierarchy Mining

Due to the growth of dynamic page generation techniques, the amount and the complexity of Web pages has been increasing explosively, as has the information contained within Web pages. Redundant and irrelevant information is distributed and mixed throughout a page, making it difficult to automatically identify the useful information in that page. Consequently, we propose an information hierarchy...

DOMISA: DOM-Based Information Space Adsorption of Web Information Hierarchy Mining

A DOM Tree Alignment Model for Mining Parallel Data from the Web

This paper presents a new web mining scheme for parallel data acquisition. Based on the Document Object Model (DOM), a web page is represented as a DOM tree. Then a DOM tree alignment model is proposed to identify the translationally equivalent texts and hyperlinks between two parallel DOM trees. By tracing the identified parallel hyperlinks, parallel web documents are recursively mined. Compar...

A New Frequent Similar Tree Algorithm Motivated by Dom Mining - Using RTDM and its New Variant - SiSTeR

The importance of recognizing repeating structures in web applications has generated a large body of work on algorithms for mining the HTML Document Object Model (DOM). A restricted tree edit distance metric, called the Restricted Top Down Metric (RTDM), is most suitable for DOM mining as well as less computationally expensive than the general tree edit distance. Given two trees with input size...

Clustering for Web Information Hierarchy Mining

Benefiting from the growth of dynamic page generation techniques, the number and the complexity of Web pages are increasing explosively. The structures of Web pages that are dynamically generated from the same templates are thus similar to one another and are usually assembled from a set of fundamental information clusters. These neighboring information clusters usually represent the similar semantics...

Publication date: 2011
