Wikipedia mining

Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary

2008

Torsten Zesch Christof Müller Iryna Gurevych

Recently, collaboratively constructed resources such as Wikipedia and Wiktionary have been recognized as valuable lexical semantic knowledge bases with high potential for diverse Natural Language Processing (NLP) tasks. Collaborative knowledge bases, however, differ significantly from traditional linguistic knowledge bases in various respects, and this constitutes both an asset and an impediment...


Did You Know? - Mining Interesting Trivia for Entities from Wikipedia

2015

Abhay Prakash Manoj Kumar Chinnakotla Dhaval Patel Puneet Garg

Trivia is any fact about an entity which is interesting due to its unusualness, uniqueness, unexpectedness or weirdness. In this paper, we propose a novel approach for mining entity trivia from their Wikipedia pages. Given an entity, our system extracts relevant sentences from its Wikipedia page and produces a list of sentences ranked based on their interestingness as trivia. At the heart of ou...


Automatically Classifying Edit Categories in Wikipedia Revisions

2013

Johannes Daxenberger Iryna Gurevych

In this paper, we analyze a novel set of features for the task of automatic edit category classification. Edit category classification assigns categories such as spelling error correction, paraphrase or vandalism to edits in a document. Our features are based on differences between two versions of a document, including metadata, textual and language properties, and markup. In a supervised machin...
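As a rough illustration of what "features based on differences between two versions of a document" can look like, the following minimal sketch derives a few counts from a token-level diff. The function name `edit_features` and the four features are hypothetical choices made for this example; they are not the feature set used in the paper, whose details are truncated here.

```python
import difflib


def edit_features(old: str, new: str) -> dict:
    """Hypothetical diff-based features for one edit (illustrative only)."""
    old_tokens, new_tokens = old.split(), new.split()
    # Token-level diff between the two revisions.
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    inserted = deleted = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted += i2 - i1
        if tag in ("replace", "insert"):
            inserted += j2 - j1
    return {
        "tokens_inserted": inserted,
        "tokens_deleted": deleted,
        # Character-length change between revisions (a simple textual property).
        "length_delta": len(new) - len(old),
        # Change in the number of wiki-link openers (a simple markup property).
        "markup_delta": new.count("[[") - old.count("[["),
    }


if __name__ == "__main__":
    print(edit_features("The cat sat on teh mat", "The cat sat on the mat"))
```

A feature dictionary of this kind could then be passed to any supervised classifier; which classifier and which features work best is a question the paper itself addresses.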


Bilingual Dictionary Extraction from Wikipedia

2009

Kun Yu Junichi Tsujii

How comparable corpora are mined and how dictionary entries are extracted from them are the two essential elements of bilingual dictionary extraction from comparable corpora. This paper first proposes a method that uses the interlanguage links in Wikipedia to build comparable corpora. The large scale of Wikipedia ensures the quantity of the collected comparable corpora. Besides, because the inter-language...
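The idea of using interlanguage links to build comparable corpora can be sketched as follows. This is a toy, in-memory illustration under stated assumptions: the function name `build_comparable_corpus`, the dictionary layout, and the sample articles are all hypothetical, and the paper's actual pipeline is not described in the truncated abstract.

```python
def build_comparable_corpus(source_articles, target_articles, interlanguage_links):
    """Pair article texts that are connected by an interlanguage link.

    source_articles / target_articles: dict mapping title -> article text.
    interlanguage_links: dict mapping source title -> target title.
    All three structures are assumptions made for this sketch.
    """
    pairs = []
    for source_title, target_title in interlanguage_links.items():
        # Keep a pair only when both linked articles are actually available.
        if source_title in source_articles and target_title in target_articles:
            pairs.append(
                (source_articles[source_title], target_articles[target_title])
            )
    return pairs


if __name__ == "__main__":
    english = {"Dog": "The dog is a domesticated animal.", "Moon": "The Moon orbits Earth."}
    german = {"Hund": "Der Hund ist ein Haustier."}
    links = {"Dog": "Hund", "Moon": "Mond"}
    for en_text, de_text in build_comparable_corpus(english, german, links):
        print(en_text, "|", de_text)
```

Each resulting pair covers the same topic in two languages without being a translation, which is what makes the collection comparable rather than parallel.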


Automatic Document Topic Identification Using Hierarchical Ontology Extracted from Human Background Knowledge

2013

Hassan Mostafa

The rapid growth in the number of documents available to end users from around the world has led to a greatly increased need for machine understanding of their topics, as well as for automatic grouping of related documents. This constitutes one of the main current challenges in text mining. In this work, a novel technique is proposed to automatically construct a background knowledge structure ...


Resolving Surface Forms to Wikipedia Topics

2010

Yiping Zhou Lan Nie Omid Rouhani-Kalleh Flavian Vasile Scott Gaffney

Ambiguity of entity mentions and concept references is a challenge to mining text beyond surface-level keywords. We describe an effective method of disambiguating surface forms and resolving them to Wikipedia entities and concepts. Our method employs an extensive set of features mined from Wikipedia and other large data sources, and combines the features using a machine learning approach with a...


Suggesting Subject Headings Using Web Information Sources

2008

Shun-Feng Su Hiroshi Ueda Harumi Murakami Shoji Tatsumi

We proposed a method that suggests subject headings based on user queries when a pattern-matching algorithm fails to locate subject searches for Online Public Access Catalogs (OPAC). We combined information obtained from Wikipedia, Amazon, and Google for query expansion. Our method has two main advantages: (1) availability for any library without customizing OPACs, and (2) ability to suggest su...


Utilizing the Structure and Data Information for XML Document Clustering

2008

Tien Tran Sangeetha Kutty Richi Nayak

This paper reports on the experiments and results of a clustering approach used in the INEX 2008 Document Mining Challenge. The clustering approach utilizes both the structure and the content information of the XML documents in the Wikipedia collection. The content of the XML documents is measured using the latent semantic kernel (LSK). A well-known problem with the construction of latent seman...


Improving revision graph extraction in Wikipedia based on supergram decomposition

2013

As one of the popular social media platforms that many people have turned to in recent years, the collaborative encyclopedia Wikipedia provides information from a more "Neutral Point of View" than others. In pursuit of this core principle, considerable effort has been put into collaborative contribution and editing. The trajectories of how such collaboration unfolds across revisions are valuable for group dynamics and socia...


Scalable Text Mining with Sparse Generative Models

Journal: CoRR 2016

Antti Puurula

The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a n...
