web documents

Text Mining Techniques: A Semantic Approach in order to classify the documents

2014

Manoj Kumar Singh Mohammad Kemal Ahmad Mohammad Iqbal

Over the past two decades, the automatic management of electronic documents has been a major research field in computer science. Text documents have become the most common type of information repository, especially with the increased popularity of the Internet and the World Wide Web. Internet and web documents such as web pages, emails, newsgroup messages, and internet news feeds contain million...

Using Data Mining to Construct an Intelligent Web Search System

Journal: Int. J. Comput. Proc. Oriental Lang. 2003

Yu-Ru Chen Ming-Chuan Hung Don-Lin Yang

In this paper, we present a new ranking algorithm and an intelligent Web search system using data mining techniques to search and analyze Web documents in a more flexible and effective way. Our method takes advantage of the characteristics of Web documents to extract, find, and rank data in a more meaningful manner. We utilize hyperlink structures with Web document content to intelligently rank...

Machine learning on Web documents

2004

Lawrence Kai Shih

The Web is a tremendous source of information: so tremendous that it becomes difficult for human beings to select meaningful information without support. We discuss tools that help people deal with web information, by, for example, blocking advertisements, recommending interesting news, and automatically sorting and compiling documents. We adapt and create machine learning algorithms for use wi...

Genre Classification of Web Documents

2005

Elizabeth Sugar Boese Adele E. Howe

Retrieving relevant documents over the Web is an overwhelming task when search engines return thousands of Web documents. Sifting through these documents is time-consuming and sometimes leads to an unsuccessful search. One problem is that most search engines rely on matching a query to documents based solely on topical keywords. However, many users of search engines have a particular genre in m...

Use of Linked Data principles for semantic management of scanned documents / Emprego dos princípios Linked Data para gestão semântica de documentos digitalizados

2016

Luciane Lena Pessanha MONTEIRO Mark Douglas de Azevedo JACYNTHO

The study addresses the use of the Semantic Web and Linked Data principles proposed by the World Wide Web Consortium for the development of a Web application for the semantic management of scanned documents. The main goal is to record scanned documents and describe them in a way that a machine can understand and process, filtering content and assisting us in searching for such documents when a...

Swoogle: A Semantic Web Search and Metadata Engine

2004

Li Ding Tim Finin Anupam Joshi Yun Peng Joel Sachs Rong Pan Pavan Reddivari Vishal Doshi

Swoogle is a crawler-based indexing and retrieval system for the Semantic Web, i.e., for Web documents in RDF or OWL. It extracts metadata for each discovered document, and computes relations between documents. Discovered documents are also indexed by an information retrieval system which can use either character N-Gram or URIrefs as keywords to find relevant documents and to compute the simila...

Efficient Algorithm for Removing Duplicate Documents

2014

Suresh Subramanian Sivaprakasam

The Internet, or Web world, holds a large amount of information, which may be HTML documents, Word and PDF files, audio and video files, images, etc. Researchers face huge challenges in providing the required and related documents to users according to the user query. Researchers also face additional overhead in identifying duplicate and near-duplicate web documents....

Information Discovery based on Multi-granularity Text Fusion

2013

Qiaoyi HUANG Yi WEI

In this paper we introduce a new information discovery algorithm, Multi-granularity Text Fusion (MGTF), on the Web. Granularity means the length of news-relevant web documents, such as news web pages, blogs, and microblogs, which come from web users. The longer the text, the higher its granularity. Given a topic query on the Internet and the results of different granularity and time-s...

A Technique for Generating Semantic Web Documents

2006

Qasim Akram Abad Shah Amjad Farooq

The huge number of available web documents makes it increasingly difficult for users to find and access required information because their semantics are not understandable by machines. The Semantic Web addresses this issue by providing machine-understandable metadata for web documents so that they can be processed automatically. XML is widely used on the Web to specify the structure of documents in the syntactic dimension...

Aligning Needles in a Haystack: Paraphrase Acquisition Across the Web

2005

Marius Pasca Péter Dienes

This paper presents a lightweight method for unsupervised extraction of paraphrases from arbitrary textual Web documents. The method differs from previous approaches to paraphrase acquisition in that 1) it removes the assumptions on the quality of the input data, by using inherently noisy, unreliable Web documents rather than clean, trustworthy, properly formatted documents; and 2) it does not ...
