Automatic Wrapper System for Semi-Structured Documents Based on Data Mining
Abstract
The world we live in requires us to understand and accumulate an immense quantity of information, spread across different sources that need to be integrated and synthesized. This has created a need for intelligent applications capable of automatically processing or collecting the desired information. Such applications use clustering algorithms to discover groups. At the same time, given the experience accumulated over the years in software development, the prevailing trend is toward process automation, which saves developers valuable time that can instead be spent designing new concepts and architectures. This paper proposes combining information discovery in documents with the processing of that information, with the aim of automating software processes.
Similar resources
A Tool for Semi-Automatic Generation and Maintenance of Taxonomies from Semi-Structured Documents
This chapter introduces OntoExtractor, a tool for the semi-automatic generation of the taxonomy from a set of documents or data sources. The tool generates the taxonomy in a bottom-up fashion. Starting from structural analysis of the documents, it produces a set of clusters, which can be refined by a further grouping created by content analysis. Metadata describing the content of each cluster i...
Learning Information Extraction Rules for Web Data Mining
The explosive growth and popularity of the World Wide Web has resulted in a huge number of information sources on the Internet. However, due to the heterogeneity and the lack of structure of Web information sources, access to this huge collection of information has been limited to browsing and keyword searching. Sophisticated Web mining applications, such as comparison shopping, require expensiv...
Automatic Extraction of Information Blocks Using PAT Trees
Information extraction from semi-structured Web documents is a critical issue for software agents on the Internet. Previous work in wrapper induction aims to solve this problem by applying machine learning to automatically generate extractors, but this approach still requires human intervention to provide training examples. In this paper, we present a novel approach that extracts information blo...
Populating Ontologies with Data from OCRed Lists
A flexible, accurate, and efficient method of automatically extracting facts from lists in OCRed documents and inserting them into an ontology would help make those facts machine searchable, queryable, and linkable and expose their rich ontological interrelationships. To work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selectio...
A Structured Wrapper Induction System for Extracting Information from Semi-Structured Documents
We propose an extensible architecture which allows wrapper-learning systems to be easily constructed and tuned. In this architecture the bias of the wrapper-learning system is encoded as an ordered set of “builders”, each associated with some restricted extraction language L. To implement a new builder it is only necessary to implement a small set of core operations for L. Builders can also be ...
Publication date: 2012