Grammar Inference for Web Documents

Authors

  • Shahab Kamali
  • Frank Wm. Tompa
Abstract

Presentational XML documents, such as XHTML or Presentation MathML, use XML tags mainly for formatting purposes, while descriptive XML applications, such as a well-structured movie database, use tags to structure data items in a semantically meaningful way. There is little semantic connection between tags in a presentational XML document and its content, so the tagging is often complex and seemingly ambiguous. These differences make inference of the underlying structure more difficult for presentational XML. The problem of schema or grammar inference has been studied mostly for descriptive XML, and proposed solutions are often ineffective for presentational XML. On the other hand, there are many applications, such as data extraction tools and special-purpose search engines, that need to infer structure from presentational XML. Current proposals for such systems provide only partial solutions to this problem. Restrictions imposed by DTDs and XML Schemas make them insufficient to describe many presentational XML documents effectively. In this paper we use regular tree grammars to define a class of grammars that is able to model many published presentational XML documents. We also propose an algorithm to infer such grammars, and prove that we can infer an appropriate grammar with high probability from given samples. We also empirically evaluate our algorithm by applying it to various types of presentational XML and comparing it to other algorithms.
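
To make the contrast between DTDs and regular tree grammars concrete, here is a minimal sketch in Python. It is not the grammar class or the inference algorithm from the paper; the table example, the rule names (Table, HeaderRow, DataCell, and so on), and the helper functions are all hypothetical and chosen only for illustration. A DTD ties each element name to a single content model, whereas a regular tree grammar lets several non-terminals share one tag. Here td plays two roles: a header cell that wraps a b element, and a data cell with no child elements. Text content is ignored for simplicity.

```python
import re
from itertools import product
import xml.etree.ElementTree as ET

# Hypothetical grammar: non-terminal -> (element tag, regex over child non-terminals).
# Child non-terminal names are joined with single spaces before matching.
RULES = {
    "Table":      ("table", r"HeaderRow( DataRow)*"),
    "HeaderRow":  ("tr",    r"HeaderCell( HeaderCell)*"),
    "DataRow":    ("tr",    r"DataCell( DataCell)*"),
    "HeaderCell": ("td",    r"Bold"),   # td used as a header cell
    "DataCell":   ("td",    r""),       # td used as a data cell (no child elements)
    "Bold":       ("b",     r""),
}

def derivable(node):
    """Return the set of non-terminals that can derive this element (bottom-up)."""
    child_sets = [derivable(child) for child in node]
    found = set()
    for name, (tag, content) in RULES.items():
        if tag != node.tag:
            continue
        # Try every assignment of one candidate non-terminal per child.
        for combo in product(*child_sets):
            if re.fullmatch(content, " ".join(combo)):
                found.add(name)
                break
    return found

def accepts(xml_text, start="Table"):
    """True if the document can be derived from the start non-terminal."""
    return start in derivable(ET.fromstring(xml_text))

good = ("<table><tr><td><b>Title</b></td><td><b>Year</b></td></tr>"
        "<tr><td>Alien</td><td>1979</td></tr></table>")
bad = "<table><tr><td>Alien</td></tr><tr><td><b>Title</b></td></tr></table>"
print(accepts(good), accepts(bad))
```

The sketch accepts a table whose first row contains only header cells and rejects one where a data row precedes the header row, a constraint that cannot be stated when td is limited to one content model.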


Similar resources

Context-free Grammar Learning from Text Document using Sequential Pattern

The World Wide Web and information systems have made significant advances over the last two decades, as shown by their dominance in various business and scientific applications. As estimated by Blumberg and Atre, more than 85% of all business information exists in the form of unstructured and semi-structured documents, typically formatted for human viewing, not for system processing. Extracti...


Querying Large Collections of Semistructured Data

An increasing amount of data is published as semistructured documents formatted with presentational markup. Examples include data objects such as mathematical expressions encoded with MathML or web pages encoded with XHTML. Our intention is to improve the state of the art in retrieving, manipulating, or mining such data. We focus first on mathematics retrieval, which is appealing in various dom...


STAN: Structural Analysis for Web Documents

In this paper we present STAN, a structural analysis tool used for classifying web documents while at the same time extracting meaningful information from them. The extraction and classification rules are defined in terms of a structural grammar operating on both layout properties and content properties of the document. STAN was designed to accept HTML as input and is able to process documents...


Information retrieval on the Semantic Web: Integrating inference and retrieval

One vision of the Semantic Web is that it will be much like the Web we know today, except that documents will be enriched by annotations in machine understandable markup. These annotations will provide metadata about the documents as well as machine interpretable statements capturing some of the meaning of document content. We discuss how the information retrieval paradigm might be recast in su...


Segmentation of Document Using Discriminative Context-free Grammar Inference and Alignment Similarities

Text documents present a great challenge to the field of document recognition. Automatic segmentation and layout analysis of documents are used for interpretation and machine translation of documents. Documents such as research papers, address books, news, etc. are available in unstructured form. Extracting relevant knowledge from such documents has been recognized as a promising task. E...



Publication date: 2011