On Mining DOM Trees to build Information Extractors
Abstract
The Web is the largest information repository. The information it contains is usually available in human-friendly formats. Companies are interested in using this information, but they need it in structured formats so that they can use it in automated business processes. In the literature, there are many proposals to infer information extractors. They build on machine learning techniques that attempt to infer a pattern in the HTML or XPath sources. To the best of our knowledge, no one has explored using data mining techniques on DOM trees. In this paper, we report on a methodology that builds on data mining CSS features and a few other DOM features. Our results suggest that this methodology is promising.
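To make the idea of mining CSS and DOM features more concrete, the following is a minimal sketch of how a page could be turned into a table of per-node features that a data mining tool might then consume. It is not the authors' method: the function and class names (`extract_features`, `FeatureCollector`), the chosen features (tag, depth, class, font weight, color), and the use of inline `style` attributes as a stand-in for rendered CSS properties are all assumptions made for illustration. Obtaining real rendered CSS values would normally require a browser engine.

```python
from html.parser import HTMLParser

# Elements that never have content or an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def parse_inline_style(style):
    """Turn 'color: red; font-weight: bold' into a property dict."""
    props = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip().lower()] = value.strip()
    return props


class FeatureCollector(HTMLParser):
    """Hypothetical helper: emits one feature row per non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open elements as (tag, attrs, effective style)
        self.rows = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumption: children inherit the parent's inline style unless they
        # override it. This only approximates the CSS cascade, and is only
        # reasonable for inherited properties such as color and font-weight.
        style = dict(self.stack[-1][2]) if self.stack else {}
        style.update(parse_inline_style(attrs.get("style") or ""))
        if tag not in VOID_TAGS:
            self.stack.append((tag, attrs, style))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, tolerating
        # unclosed tags in sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        tag, attrs, style = self.stack[-1]
        self.rows.append({
            "text": text,
            "tag": tag,
            "depth": len(self.stack),
            "class": attrs.get("class") or "",
            "font-weight": style.get("font-weight", "normal"),
            "color": style.get("color", ""),
        })


def extract_features(html):
    """Return a list of feature dicts, one per text node in the page."""
    collector = FeatureCollector()
    collector.feed(html)
    collector.close()
    return collector.rows


# Usage: each row could be labeled (e.g. "title", "price", "other") and
# passed to a rule learner or classifier.
sample = ('<div class="book"><span style="font-weight: bold">Dune</span> '
          '<span class="price" style="color: gray">12.50</span></div>')
for row in extract_features(sample):
    print(row)
```

The design choice here is to describe each text node by properties of its enclosing element rather than by its position in the HTML source, so that a learned rule depends on how a node is presented rather than on the exact markup path that leads to it.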
Similar resources
DOMISA: DOM-based Information Space Adsorption for Web Information Hierarchy Mining
Due to the growth of dynamic page generation techniques, the amount and the complexity of Web pages have been increasing explosively, as has the information contained within Web pages. Redundant and irrelevant information is distributed and mixed throughout a page, making it difficult to automatically identify the useful information in that page. Consequently, we propose an information hierarchy...
DOMISA: DOM-Based Information Space Adsorption of Web Information Hierarchy Mining
Due to the growth of dynamic page generation techniques, the amount and the complexity of Web pages have been increasing explosively, as has the information contained within Web pages. Redundant and irrelevant information is distributed and mixed throughout a page, making it difficult to automatically identify the useful information in that page. Consequently, we propose an information hierarchy...
A DOM Tree Alignment Model for Mining Parallel Data from the Web
This paper presents a new web mining scheme for parallel data acquisition. Based on the Document Object Model (DOM), a web page is represented as a DOM tree. Then a DOM tree alignment model is proposed to identify the translationally equivalent texts and hyperlinks between two parallel DOM trees. By tracing the identified parallel hyperlinks, parallel web documents are recursively mined. Compar...
A New Frequent Similar Tree Algorithm Motivated by Dom Mining - Using RTDM and its New Variant - SiSTeR
The importance of recognizing repeating structures in web applications has generated a large body of work on algorithms for mining the HTML Document Object Model (DOM). A restricted tree edit distance metric, called the Restricted Top Down Metric (RTDM), is most suitable for DOM mining as well as less computationally expensive than the general tree edit distance. Given two trees with input size...
Clustering for Web Information Hierarchy Mining
Benefiting from the growth of techniques of dynamic page generation, the amount and the complexity of Web pages increase explosively. The structures of Web pages which are dynamically generated by the same templates are thus similar to one another and are usually assembled by a set of fundamental information clusters. These neighboring information clusters usually represent the similar semantics...
Publication date: 2011