Reverse Engineering for Web Data: From Visual to Semantic Structures

Authors

  • Christina Yip Chung
  • Michael Gertz
  • Neel Sundaresan

Abstract

Despite the advancement of XML, the majority of documents on the Web are still marked up with HTML for visual rendering purposes only, thus building up a huge amount of "legacy" data. To facilitate querying Web-based data in a way that is more efficient and effective than keyword-based retrieval alone, such Web documents must be enriched with both structure and semantics. This paper describes a novel approach to integrating topic-specific HTML documents into a repository of XML documents. In particular, we describe how topic-specific HTML documents are transformed into XML documents. The proposed document transformation and semantic element tagging process uses document restructuring rules and minimal information about the topic in the form of concepts. For the resulting XML documents, a majority schema is derived that describes common structures among the documents in the form of a DTD. We explore and discuss different techniques and rules for document conversion and majority schema discovery. Finally, we demonstrate the feasibility and effectiveness of our approach by applying it to a set of résumé HTML documents gathered by a Web crawler.
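
The pipeline outlined in the abstract (tag visually marked-up HTML with topic concepts, then look for structure shared by most documents) can be illustrated with a small Python sketch. This is not the authors' algorithm: the concept keywords, the heading tags, the `item` element, the function names, and the 50% threshold are all assumptions made for illustration, and the result is a plain list of element names rather than a DTD.

```python
from collections import Counter
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical concept keywords for the résumé topic (not taken from the paper).
CONCEPTS = {
    "education": ("education", "degree"),
    "experience": ("experience", "employment"),
    "skills": ("skills",),
}

# Assumption: these visual tags are treated as section headings.
HEADING_TAGS = {"h1", "h2", "h3", "b"}


def match_concept(text):
    """Return the first concept whose keyword occurs in text, else None."""
    lowered = text.lower()
    for concept, keywords in CONCEPTS.items():
        if any(keyword in lowered for keyword in keywords):
            return concept
    return None


class ConceptTagger(HTMLParser):
    """Groups text under the most recent heading that names a concept."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("resume")
        self.current = self.root
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self.in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        concept = match_concept(text) if self.in_heading else None
        if concept:
            # A heading that names a concept opens a new semantic element.
            self.current = ET.SubElement(self.root, concept)
        else:
            # All other text becomes an item of the current element.
            ET.SubElement(self.current, "item").text = text


def html_to_xml(html):
    """Convert one HTML page into a concept-tagged XML tree."""
    tagger = ConceptTagger()
    tagger.feed(html)
    tagger.close()
    return tagger.root


def majority_schema(docs, threshold=0.5):
    """Return concept elements found in more than `threshold` of the documents."""
    counts = Counter()
    for doc in docs:
        # Count each concept at most once per document.
        counts.update({child.tag for child in doc if child.tag in CONCEPTS})
    return sorted(tag for tag, n in counts.items() if n / len(docs) > threshold)


pages = [
    "<h2>Education</h2><p>BSc, 1998</p><h2>Skills</h2><p>XML</p>",
    "<b>Work Experience</b><ul><li>Engineer</li></ul><h3>Education</h3><p>MSc</p>",
    "<h2>Education</h2><p>PhD</p>",
]
docs = [html_to_xml(page) for page in pages]
print(ET.tostring(docs[0], encoding="unicode"))
print(majority_schema(docs))
```

In this toy input only `education` appears in more than half of the pages, so it is the only element in the sketched majority schema. A fuller treatment would also have to capture nesting and ordering, which is what a DTD expresses.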

Similar resources

Graph Technology and Semantic Web in Reverse Engineering - A Comparison -

Reverse engineering tools are mostly based on analyzing code repositories. Various technological spaces for realizing these repositories including appropriate analysis techniques exist. Graph technology and semantic web based technologies provide elaborated and sufficient means to analyze software structures. This paper elaborates differences and similarities of both technological spaces by com...

Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology

As the Web grows, there are many challenges in establishing a general framework for data mining and retrieving structured data from the Web. Creating an ontology is a step towards solving this problem. The ontology raises the main entity and the concept of any data in data mining. In this paper, we tried to propose a method for applying the "meaning" of the search system, but the problem ...

Extracting Personalised Ontology from Data-Intensive Web Application: an HTML Forms-Based Reverse Engineering Approach

The advance of the Web has significantly and rapidly changed the way of information organization, sharing and distribution. The next generation of the web, the semantic web, seeks to make information more usable by machines by introducing a more rigorous structure based on ontologies. In this context we try to propose a novel and integrated approach for a semi-automated extraction of ontology-b...

OWL Ontology Extraction from Relational Databases via Database Reverse Engineering

The main purpose of the Semantic Web is driving the evolution of the current Web by enabling users to find, share, and combine information more easily. OWL ontologies play a key role in this effort. It is widely believed that the majority of current Web data sources are powered by relational databases (RDB). Thus developing approaches and tools for extracting OWL ontologies from RDB is helpful ...

Publication date: 2002