text similarity

A Component Histogram Map Based Text Similarity Detection Algorithm

Journal: I. J. Network Security 2015

Huajun Huang Shuang Pang Qiong Deng Jiaohua Qin

Conventional text similarity detection usually uses word frequency vectors to represent texts, but this representation is high-dimensional and sparse. In this research, a new text similarity detection algorithm using a component histogram map (CHM-TSD) is proposed. This method is based on the mathematical expression of Chinese characters, with which Chinese characters can be split into components. Then each ...
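
The sparsity problem mentioned in this abstract can be illustrated with a small sketch. It is not the CHM-TSD method, whose details are truncated above. It only shows the conventional word-frequency representation, with hypothetical helper names (`term_frequency_vectors`, `sparsity`) and toy English sentences chosen for illustration.

```python
from collections import Counter

def term_frequency_vectors(texts):
    """Build word-frequency vectors over the vocabulary shared by all texts."""
    counts = [Counter(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*counts))
    # One dimension per vocabulary word; a missing word counts as 0.
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors

def sparsity(vector):
    """Fraction of vector entries that are zero."""
    return sum(1 for x in vector if x == 0) / len(vector)

# Toy corpus (assumed for illustration only).
texts = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "rain is expected over the weekend",
]
vocab, vectors = term_frequency_vectors(texts)
for text, vec in zip(texts, vectors):
    print(f"{len(vocab)} dimensions, {sparsity(vec):.0%} zeros: {text!r}")
```

Even with three short sentences, most entries of each vector are zero, and the number of dimensions grows with every new word in the corpus.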

Improved K-Means Algorithm in Text Semantic Clustering

2015

Junhong Ma

Text clustering is a very important technology in the area of text data mining. The semantic calculation method can greatly improve the computational. The aim of this paper is to improve the existing text clustering algorithms for Chinese text by using a semantic clustering method. First, in the similarity calculation module of the clustering, a staged and integrated semantic similarity algorit...

Text Categorization with Semantic Commonsense Knowledge

2007

Most text categorization research exploits the bag-of-words text representation. In this approach, however, all contextual information contained in the text is neglected. Therefore, capturing semantic similarity between text documents that share very little or even no vocabulary is not possible. In this paper we present an approach that combines well-established kernel text classifiers with external ...

Text Similarity Using Google Tri-grams

2012

Aminul Islam Evangelos E. Milios Vlado Keselj

The purpose of this paper is to propose an unsupervised approach for measuring the similarity of texts that can compete with supervised approaches. Finding the inherent properties of similarity between texts using a corpus in the form of a word n-gram data set is competitive with other text similarity techniques in terms of performance and practicality. Experimental results on a standard data s...

TakeLab: Systems for Measuring Semantic Text Similarity

2012

Frane Saric Goran Glavas Mladen Karan Jan Snajder Bojana Dalbelo Basic

This paper describes the two systems for determining the semantic similarity of short texts submitted to the SemEval 2012 Task 6. Most of the research on semantic similarity of textual content focuses on large documents. However, a fair amount of information is condensed into short text snippets such as social media posts, image captions, and scientific abstracts. We predict the human ratings o...

Semantic Similarity Match for Data Quality

2007

Fernando Martins André Falcão Francisco Couto

Data quality is a critical aspect of applications that support business operations. Often entities are represented more than once in data repositories. Since duplicate records do not share a common key, they are hard to detect. Duplicate detection over text is usually performed using lexical approaches, which do not capture text sense. The difficulties increase when the duplicate detection must...

Automatic Text Decomposition and Structuring

Journal: Inf. Process. Manage. 1994

Gerard Salton James Allan

Sophisticated text similarity measurements are used to determine relationships between natural-language texts and text segments. The resulting linked hypertext maps are used to identify different text types and text structures, leading to improved text access and utilization. Examples of text decomposition are given for expository and non-expository texts. The vector processing model of retriev...

Learning Image Similarity Measures from Choice Data

2012

Matthias Scheller Lichtenauer Peter Zolliker Ingmar Lissner Jens Preiss Philipp Urban

We present a corpus of experimental data from psychometric studies on gamut mapping and demonstrate its use to develop image similarity measures. We investigate whether similarity measures based on luminance (SSIM) can be improved when features based on chroma and hue are added. Image similarity measures can be applied to automatically select a good image from a sample of transformed images.

Document Similarity Search Based on Manifold-Ranking of TextTiles

2006

Xiaojun Wan Jianwu Yang Jianguo Xiao

Document similarity search aims to find documents similar to a query document in a text corpus and return a ranked list of similar documents. Most existing approaches to document similarity search compute similarity scores between the query and the documents based on a retrieval function (e.g., cosine) and then rank the documents by their similarity scores. In this paper, we propose a novel ret...
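
The baseline this abstract describes (score each document against the query with a retrieval function such as cosine, then sort by score) can be sketched as follows. This is not the manifold-ranking approach the paper proposes, which is truncated above. The function names, whitespace tokenization, and toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, documents):
    """Return (score, document) pairs sorted from most to least similar."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(d.lower().split())), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy query and corpus (assumed for illustration only).
query = "document similarity search in a text corpus"
documents = [
    "text similarity search over a corpus",
    "image similarity with color features",
    "cooking recipes for the weekend",
]
ranked = rank_by_similarity(query, documents)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

Documents that share more words with the query rank higher, and a document with no shared words scores zero.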

On Paraphrase Identification Corpora

2014

Vasile Rus Rajendra Banjade Mihai C. Lintean

We analyze in this paper a number of data sets proposed over the last decade or so for the task of paraphrase identification. The goal of the analysis is to identify the advantages as well as shortcomings of the previously proposed data sets. Based on the analysis, we then make recommendations about how to improve the process of creating and using such data sets for evaluating in the future app...
