Categorisation of web documents using extraction ontologies
Abstract
Automatically recognising which HTML documents on the Web contain items of interest for a user is non-trivial. As a step toward solving this problem, we propose an approach based on information-extraction ontologies. Given HTML documents, tables, and forms, our document recognition system extracts expected ontological vocabulary (keywords and keyword phrases) and expected ontological instance data (particular values for ontological concepts). We then use machine-learned rules over this extracted information to determine whether an HTML document contains items of interest. Experimental results show that our ontological approach to categorisation works well, having achieved F-measures above 90% for all applications we tried.
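The two stages described in the abstract (extracting ontological vocabulary and instance data, then applying a rule over the extracted information) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the car-advertisement ontology, its regular expressions, and the helper names `extract_features` and `is_relevant` do not come from the paper. In particular, the hand-written threshold rule only stands in for the machine-learned rules the paper uses.

```python
import re
from html import unescape

# Hypothetical mini extraction ontology for car advertisements (not from the paper).
# Each concept lists keyword patterns (vocabulary) and value patterns (instance data).
CAR_AD_ONTOLOGY = {
    "Price": {
        "keywords": [r"\bprice\b", r"\basking\b"],
        "values": [r"\$\s?\d{1,3}(?:,\d{3})*"],
    },
    "Year": {
        "keywords": [r"\byear\b"],
        "values": [r"\b(?:19|20)\d{2}\b"],
    },
    "Mileage": {
        "keywords": [r"\bmileage\b", r"\bmiles\b"],
        "values": [r"\b\d{1,3}(?:,\d{3})+\s?(?:miles|mi)\b"],
    },
}


def strip_tags(html_text):
    """Crudely remove HTML tags and decode entities; a real system would parse the HTML."""
    return unescape(re.sub(r"<[^>]+>", " ", html_text))


def extract_features(html_text, ontology):
    """Count keyword matches and value matches per ontology concept."""
    text = strip_tags(html_text)
    features = {}
    for concept, patterns in ontology.items():
        for kind in ("keywords", "values"):
            features[f"{concept}.{kind}"] = sum(
                len(re.findall(pattern, text, flags=re.IGNORECASE))
                for pattern in patterns[kind]
            )
    return features


def is_relevant(features, min_concepts=2):
    """Stand-in for a learned rule: enough concepts must have instance values."""
    concepts_with_values = sum(
        1
        for name, count in features.items()
        if name.endswith(".values") and count > 0
    )
    return concepts_with_values >= min_concepts
```

In a setting closer to the paper's, the feature dictionary would be passed to a trained classifier or rule learner instead of the fixed `min_concepts` threshold, and the ontology would be considerably richer than three concepts.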
Similar resources
CCReSD: concept-based categorisation of Hidden Web databases
Hidden Web databases dynamically generate results in response to users’ queries. The categorisation of such databases into a category scheme has been widely employed in information searches. We present a Concept-based Categorisation over Refined Sampled Documents (CCReSD) approach that effectively handles information extraction, summarisation and categorisation of such databases. CCReSD detects...
Various Approaches to Web Information Processing
The paper focuses on the automatic extraction of information from texts and on the categorisation of text documents found on the Internet, including their pre-processing. Within the presented work, we have devoted our attention to the following issues related to text categorisation: increasing the precision of categorisation algorithm results with the aid of a bo...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
A new text-mining method for extracting user context information to improve the ranking of search engine results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need effective ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the ones most similar to the user ...
Now at Department of Computer Science, Queen Mary & Westfield College, University of London.
The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available on the Internet. We are facing a new challenge because web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) using a representation of the content of web documents that captures these two characteristics ...
Journal title:
- IJMSO
Volume 3, issue
Pages -
Publication date: 2008