Data Extraction using Content-Based Handles

نویسندگان

  • A. Pouramini Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.
  • S. Khaje Hassani Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.
  • Sh. Nasiri Department of Computer Engineering, University of Sirjan Technology, Sirjan, Iran.
چکیده مقاله:

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we inspired by the way a human user scans the page content for specific data. In particular, we use text features such as textual delimiters, keywords, constants or text patterns, which we call handles, to construct patterns for the target data regions and data records. We offer a polynomial algorithm, in which these patterns are checked against the page elements in a mixed bottom-up and top-down traverse of the DOM-tree. The extracted data is directly mapped onto a hierarchical XML structure, which forms the output of the wrapper. The wrappers that are generated by this method are robust and independent of the HTML structure. Therefore, they can be adapted to similar websites to gather and integrate information.

برای دانلود باید عضویت طلایی داشته باشید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Model-Based Analysis of Semiautomated Data Discovery and Entry Using Automated Content-Extraction

JADE GOLDSTEIN-STEWART and JUSTIN GROSSMAN United States Department of Defense, Washington, DC ________________________________________________________________________ Content extraction systems can automatically extract entities and relations from raw text and use the information to populate knowledge bases, potentially eliminating the need for manual data discovery and entry. Unfortunately, c...

متن کامل

Content-Based Authentication Using Digital Speech Data

A watermarking technique for speech content and speaker authentication scheme, which is based on using abstracts of speech features relevant to semantic meaning and combined with an ID for the speaker is proposed in this paper. The ID which, represents the watermark for the speaker, is embedded using spread spectrum technique. While the extracted abstracts of speech features are used to represe...

متن کامل

XSLT-based Web-Content Extraction

In this paper, we describe the Semantic

متن کامل

Content Based Video Retrieval Using Integrated Feature Extraction

Traditional video retrieval methods fail to meet technical challenges due to large and rapid growth of multimedia data, demanding effective retrieval systems. In the last decade Content Based Video Retrieval (CBVR) has become more and more popular. The amount of lecture video data on the Worldwide Web (WWW) is growing rapidly. Therefore, a more efficient method for video retrieval in WWW or wit...

متن کامل

EXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS

Due to the explosive growth of the world-wide web, automatictext summarization has become an essential tool for web users. In this paperwe present a novel approach for creating text summaries. Using fuzzy logicand word-net, our model extracts the most relevant sentences from an originaldocument. The approach utilizes fuzzy measures and inference on theextracted textual information from the docu...

متن کامل

Deep Web Data Extraction by Using Vision-Based Item and Data Extraction Algorithms

Deep Web contents are accessed by queries submitted to Web databases and the returned data records are enwrapped in dynamically generated Web pages (they will be called deep Web pages in this paper). Extracting structured data from deep Web pages is a challenging problem due to the underlying intricate structures of such pages. Until now, a large number of techniques have been proposed to addre...

متن کامل

منابع من

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}


عنوان ژورنال

دوره 6  شماره 2

صفحات  399- 407

تاریخ انتشار 2018-07-01

با دنبال کردن یک ژورنال هنگامی که شماره جدید این ژورنال منتشر می شود به شما از طریق ایمیل اطلاع داده می شود.

میزبانی شده توسط پلتفرم ابری doprax.com

copyright © 2015-2023