edit distance

String Distances and Uniformities

2009

David W. Pearson Jean-Christophe Janodet

The Levenshtein or edit distance was developed as a metric for calculating distances between character strings. We are looking at weighting the different edit operations (insertion, deletion, substitution) to obtain different types of classifications of sets of strings. As a more general and less constrained approach, we introduce topological notions and in particular uniformities.
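The weighting of edit operations mentioned in this abstract can be illustrated with a minimal sketch. It is the standard dynamic-programming edit distance with one cost per operation, not the authors' own formulation; the function name and the parameters `ins_cost`, `del_cost` and `sub_cost` are assumptions made for illustration.

```python
def weighted_edit_distance(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Edit distance from string a to string b with per-operation weights.

    Hypothetical helper: the three cost parameters are illustrative and
    default to the unweighted Levenshtein distance.
    """
    # prev[j] holds the cost of turning the first i-1 characters of a
    # into the first j characters of b; row 0 is insertions only.
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        # Turning i characters of a into the empty string: deletions only.
        curr = [i * del_cost]
        for j, ch_b in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if ch_a == ch_b else sub_cost)
            delete = prev[j] + del_cost
            insert = curr[j - 1] + ins_cost
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]
```

With the default weights this reduces to the ordinary Levenshtein distance. Raising `sub_cost` above `ins_cost + del_cost` makes every substitution more expensive than a deletion followed by an insertion, which changes which strings count as close to one another.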

CharacTer: Translation Edit Rate on Character Level

2016

Weiyue Wang Jan-Thorsten Peter Hendrik Rosendahl Hermann Ney

Recently, the capability of character-level evaluation measures for machine translation output has been confirmed by several metrics. This work proposes translation edit rate on character level (CharacTER), which calculates the character level edit distance while performing the shift edit on word level. The novel metric shows high system-level correlation with human rankings, especially for mor...

Motion Primitives for Action Recognition

2007

P. Fihl

The number of potential applications has made automatic recognition of human actions a very active research area. Different approaches have been followed based on trajectories through some state space. In this paper we also model an action as a trajectory through a state space, but we represent the actions as a sequence of temporal isolated instances, denoted primitives. These primitives are ea...

Searching for repeated words in a

1995

Marie-France Sagot Vincent Escalier Alain Viari Henri Soldano

We present in this paper an algorithm that locates similar words common to a set of strings defined over an alphabet, where the similarity is stated in terms of a Levenshtein edit distance. The comparison of the words in the strings is realized by using a reference object called a model, which is a word over the same alphabet. This allows us to perform a multiple comparison of the strings as opposed to pairwise c...

The Relative Divergence of Dutch Dialect Pronunciations from their Common Source: An Exploratory Study

2007

Wilbert Heeringa Brian Joseph

In this paper we use the Reeks Nederlandse Dialectatlassen as a source for the reconstruction of a ‘proto-language’ of Dutch dialects. We used 360 dialects from locations in the Netherlands, the northern part of Belgium and French-Flanders. The density of dialect locations is about the same everywhere. For each dialect we reconstructed 85 words. For the reconstruction of vowels we used knowledg...

On the Correlation of Graph Edit Distance and L1 Distance in the Attribute Statistics Embedding Space

2012

Jaume Gibert Ernest Valveny Horst Bunke Alicia Fornés

Graph embeddings in vector spaces aim at assigning a pattern vector to every graph so that the problems of graph classification and clustering can be solved by using data processing algorithms originally developed for statistical feature vectors. An important requirement graph features should fulfil is that they reproduce as much as possible the properties among objects in the graph domain. In ...

Coûts de distance d'édition pour la Recherche d'Information XML

2012

Cyril Laitang Karen Pinel-Sauvagnat Mohand Boughanem

Structured information retrieval (SIR) on XML documents makes it possible to retrieve focused parts of documents that match the user's needs. These needs can be expressed through content and structured queries, which, like XML documents, can be represented as trees. Our approach uses these trees through tree edit distance to estimate the relevance of XML elements. Tree edit distance is the minimum set o...

Studying Evolution of a Branch of Knowledge by Constructing and Analyzing Its Ontology

2006

Pavel Makagonov Alejandro Ruiz Figueroa Alexander F. Gelbukh

We propose a method for semi-automatic construction of an ontology of a given branch of science for measuring its evolution in time. The method relies on a collection of documents in the given thematic domain. We observe that the words of different levels of abstraction are located within different parts of a document: say, the title or abstract contains more general words than the body of the ...

Shape-Space from Tree-Union

2002

Andrea Torsello Edwin R. Hancock

In this paper we investigate how to construct a shape space for sets of shock trees. To do this we construct a super-tree to span the union of the set of shock trees. This super-tree is constructed so that it both minimizes the total tree edit distance and preserves edge consistency constraints. Each node of the super-tree corresponds to a dimension of the pattern space. Individual such trees a...

Scaling Similarity Joins over Tree-Structured Data

Journal: PVLDB 2015

Yu Tang Yilun Cai Nikos Mamoulis

Given a large collection of tree-structured objects (e.g., XML documents), the similarity join finds the pairs of objects that are similar to each other, based on a similarity threshold and a tree edit distance measure. The state-of-the-art similarity join methods compare simpler approximations of the objects (e.g., strings), in order to prune pairs that cannot be part of the similarity join res...
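The pruning idea described here, comparing cheap string approximations before any exact tree comparison, can be sketched roughly as follows. This illustrates the general filter step only, not the method proposed in the paper. It assumes unit-cost edit operations, trees encoded as `(label, [children])` tuples, and two commonly cited lower bounds on tree edit distance: the difference in node counts, and the edit distance between preorder label sequences. All function names are hypothetical.

```python
from itertools import combinations


def preorder_labels(tree):
    """Flatten a (label, [children]) tree into its preorder label list."""
    label, children = tree
    labels = [label]
    for child in children:
        labels.extend(preorder_labels(child))
    return labels


def sequence_edit_distance(a, b):
    """Unit-cost edit distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (x != y),  # match or rename
                            prev[j] + 1,             # delete x
                            curr[j - 1] + 1))        # insert y
        prev = curr
    return prev[-1]


def candidate_pairs(trees, threshold):
    """Return index pairs that survive both lower-bound filters.

    Survivors are only candidates: an exact tree edit distance
    computation (not shown here) would still have to verify them.
    """
    seqs = [preorder_labels(t) for t in trees]
    survivors = []
    for i, j in combinations(range(len(trees)), 2):
        # Filter 1: node-count difference is a lower bound (assumed unit costs).
        if abs(len(seqs[i]) - len(seqs[j])) > threshold:
            continue
        # Filter 2: preorder-string edit distance, assumed to be a lower bound.
        if sequence_edit_distance(seqs[i], seqs[j]) > threshold:
            continue
        survivors.append((i, j))
    return survivors
```

The cheap size check runs first so that the quadratic string comparison is only paid for pairs of similar size.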