Automatic Data Extraction from Template-Generated Web Pages
Similar resources
Automatic Data Extraction from Template Generated Web Pages
Information retrieval calls for accurate web page data extraction. To enhance retrieval precision, irrelevant data such as navigation bars and advertisements should be identified and removed prior to indexing. We propose a novel approach that identifies web page templates and extracts the unstructured data. Our experimental results on several different web sites demonstrate the feasibility ...
Unsupervised Structured Data Extraction from Template-generated Web Pages
This paper studies structured data extraction from template-generated Web pages. Such pages contain most of the structured data on the Web. Extracted structured data can later be integrated and reused in a wide range of applications, such as price comparison portals, business intelligence tools, and various mashups. This encourages industry and academia to seek automatic solutions. To tackle t...
Automatic Data Extraction from Data-Rich Web Pages
Extracting data from web pages using wrappers is a fundamental problem arising in a large variety of applications of great practical interest. In this paper, we propose a novel technique for the problem of differentiating the roles of data items from Web pages, which is one of the key problems in our automatic extraction approach. The problem is resolved at various levels: semantic blocks, sections ...
Experiences regarding Automatic Data Extraction from Web Pages
Existing methods of information extraction from HTML documents include manual approaches, supervised learning, and automatic techniques. The manual method has high precision and recall, but it is difficult to apply to a large number of pages. Supervised learning involves human interaction to create positive and negative samples. Automatic techniques benefit from less human effort but they ...
Automatic Web Pages Author Extraction
This paper addresses the problem of automatically extracting the author from heterogeneous HTML resources as a subproblem of automatic metadata extraction from (Web) documents. We take a supervised machine learning approach to the problem, using a C4.5 decision tree algorithm. The particularity of our approach is that it focuses on both structural and contextual information. A semi-auto...
Journal
Journal title: Journal of Software
Year: 2008
ISSN: 1000-9825
DOI: 10.3724/sp.j.1001.2008.00209