Processing of Huffman Compressed Texts with a Super-Alphabet

2003

Kimmo Fredriksson Jorma Tarhio

We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in O(n log₂ σ / b) time, where n is the size of the compressed text in bytes, σ is the size of the alphabet, and b is a user-specified parameter. The method uses a variable-size super-alphabet, with an average size of O(b / (H log₂ σ)) symbols, where H is the entropy of the text. Each ...
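
As a rough illustration of consuming Huffman-coded input b bits at a time, the sketch below precomputes, for every decoder state and every b-bit chunk, the symbols that chunk completes and the leftover partial codeword. It is a generic table-driven decoder, not necessarily the authors' construction. The names `build_chunk_table` and `decode_chunks`, the use of '0'/'1' strings instead of packed bytes, and the requirement that the code is a complete prefix code are all assumptions made for readability.

```python
from itertools import product


def build_chunk_table(codes, b):
    """Map (pending_prefix, b-bit chunk) -> (decoded symbols, new pending prefix).

    codes: dict symbol -> codeword as a '0'/'1' string (assumed complete and prefix-free).
    """
    inverse = {code: sym for sym, code in codes.items()}
    # Decoder states are the proper prefixes of codewords (internal tree nodes).
    states = {code[:i] for code in codes.values() for i in range(len(code))}
    table = {}
    for state in states:
        for bits in product("01", repeat=b):
            buf, out = state, []
            for bit in bits:
                buf += bit
                if buf in inverse:  # a full codeword has been read
                    out.append(inverse[buf])
                    buf = ""
            table[state, "".join(bits)] = ("".join(out), buf)
    return table


def decode_chunks(bitstring, table, b):
    """Decode a '0'/'1' string whose length is a multiple of b, one chunk per lookup."""
    if len(bitstring) % b:
        raise ValueError("bitstring length must be a multiple of b")
    state, out = "", []
    for pos in range(0, len(bitstring), b):
        symbols, state = table[state, bitstring[pos:pos + b]]
        out.append(symbols)
    return "".join(out)
```

Each lookup emits a variable number of symbols, which is the sense in which a fixed chunk width induces a variable-size group of original symbols; the table grows with 2^b per decoder state.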

On the Approximation Ratio of Lempel-Ziv Parsing

2017

Travis Gagie Gonzalo Navarro Nicola Prezza

Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is b, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing b is NP-complete, a popular gold standard is z, the number of phrases in th...
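
To make the notion of a phrase count concrete, here is a minimal sketch of a greedy left-to-right Lempel-Ziv parse. It assumes the common variant in which each phrase is either the longest prefix of the remaining text that also starts at an earlier position (overlaps allowed) or a single new character; the truncated abstract does not spell out the exact variant, and the function name `lz_parse` is hypothetical. The quadratic scan is for clarity only.

```python
def lz_parse(text):
    """Greedy Lempel-Ziv parse; returns the list of phrases."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        longest = 0
        # Try every earlier starting position as the copy source.
        for start in range(i):
            length = 0
            while i + length < n and text[start + length] == text[i + length]:
                length += 1
            longest = max(longest, length)
        step = max(longest, 1)  # a character with no earlier occurrence is its own phrase
        phrases.append(text[i:i + step])
        i += step
    return phrases
```

Under these assumptions, `len(lz_parse(text))` plays the role of z for the chosen variant.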

A Fast Algorithm for Making Suffix Arrays and for Burrows-Wheeler Transformation

1998

Kunihiko Sadakane

We propose a fast and memory-efficient algorithm for sorting suffixes of a text in lexicographic order. It is important to sort suffixes because an array of indexes of suffixes is called a suffix array, and it is a memory-efficient alternative to the suffix tree. Sorting suffixes is also used for the Burrows-Wheeler transformation in the Block Sorting text compression, therefore fast sorting algorithms are desire...
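
The relationship between suffix sorting and the Burrows-Wheeler transformation can be sketched in a few lines. The code below uses a naive comparison sort of the suffixes, not the fast algorithm the abstract proposes; the function names and the sentinel convention (a character smaller than every text character, appended to the text) are assumptions for illustration.

```python
def suffix_array(text):
    """Starting positions of the suffixes of text, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text, sentinel="\0"):
    """Burrows-Wheeler transform of text + sentinel, read off the suffix array."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in text")
    s = text + sentinel
    # For each sorted suffix, output the character that precedes it (cyclically).
    return "".join(s[i - 1] for i in suffix_array(s))
```

For example, `suffix_array("banana")` gives `[5, 3, 1, 0, 4, 2]`, and `bwt("banana", "$")` gives `"annb$aa"`.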

Distributional Term Representations for Short-Text Categorization

2013

Juan Manuel Cabrera Hugo Jair Escalante Manuel Montes-y-Gómez

Every day, millions of short texts are generated for which effective tools for organization and retrieval are required. Because of the tiny length of these documents and their extremely sparse representations, the direct application of standard text categorization methods is not effective. In this work we propose using distributional term representations (DTRs) for short-text categorization. ...

Text Classification Using String Kernels (produced as part of the ESPRIT Working Group in Neural and Computational Learning II, NeuroCOLT2 27150)

2000

Huma Lodhi John Shawe-Taylor Nello Cristianini Chris Watkins

We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those oc...
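
The kernel described above can be written out directly from its definition. The sketch below enumerates every length-k subsequence of each string explicitly and weights it by a decay factor raised to the span it covers; this brute-force form is exponential in k and is meant only to make the definition concrete, not to reproduce any efficient evaluation. The names `subsequence_features`, `subsequence_kernel`, and the parameter `lam` are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def subsequence_features(s, k, lam):
    """Feature map: subsequence u -> sum of lam**span over all occurrences of u in s."""
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), k):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # full length covered in s, gaps included
        feats[u] += lam ** span
    return feats


def subsequence_kernel(s, t, k, lam=0.5):
    """Inner product of the two feature maps."""
    fs = subsequence_features(s, k, lam)
    ft = subsequence_features(t, k, lam)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)
```

With k = 2, "cat" and "car" share only the subsequence "ca", which spans two characters in each string, so the kernel value is lam**4.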

Classifying Short Text in Social Media: Twitter as Case Study

2015

Faris Kateb Jugal Kalita Fabian Abel Qi Gao Geert-Jan Houben Grigoris Antoniou Marko Grobelnik Elena Simperl Bijan Parsia Dimitris Plexousakis Pieter Leenheer Jeff Pan Brian Babcock Shivnath Babu Mayur Datar Rajeev Motwani Jennifer Widom Adam Bermingham Johan Bollen Huina Mao Meeyoung Cha Hamed Haddadi Fabricio Benevenuto

With the huge growth of social media, and in particular with 500 million Twitter messages posted per day, analyzing these messages has attracted intense interest from researchers. Topics of interest include micro-blog summarization, breaking news detection, opinion mining and discovering trending topics. In information extraction, researchers face challenges in applying data mining techniques due to ...

O(k) Parallel Algorithms for Approximate String Matching

1993

Alden H. Wright Yi Jiang

Given a text string T of length n, a shorter pattern string A of length m, and an integer k, a simple, straightforward O(k) parallel algorithm for finding all occurrences of the pattern string in the text string with at most k differences (as defined by edit distance) is presented. The algorithm uses the priority CRCW-PRAM model of computation and (n − m + k + 2) m = O(n m) processors. Over recent dec...
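
For reference, the problem itself (not the parallel algorithm of the abstract) can be solved sequentially with the classical column-by-column dynamic program, in which a match may start anywhere in the text. The sketch below reports the 1-based text positions at which some substring within k differences of the pattern ends; the function name is an assumption.

```python
def k_difference_matches(text, pattern, k):
    """1-based end positions in text of substrings within edit distance k of pattern."""
    m = len(pattern)
    # col[i]: least edit distance between pattern[:i] and a substring ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0  # an occurrence may start at any text position
        for i in range(1, m + 1):
            old = col[i]  # value from the previous text column
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

This runs in O(nm) time on one processor, which is the amount of work the O(nm) processor count in the abstract distributes.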

Pattern Matching Algorithms with Don't Cares

2007

Mohammad Sohel Rahman Costas S. Iliopoulos

In this paper, we present algorithms for pattern matching, where either the pattern P or the text T can contain “don’t care” characters. If the pattern P contains don’t care characters, then we can solve the pattern matching problem in O(n + m + α) time, where α is the total number of occurrences of the component subpatterns. We can also handle online queries, given an O(n) preprocessing time, r...
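
As a baseline for what matching with don't cares means, the following sketch checks every alignment directly in O(nm) time; it does not implement the faster algorithms of the abstract. The wildcard symbol `?`, the function name, and the 0-based result positions are assumptions.

```python
def dont_care_matches(text, pattern, wildcard="?"):
    """0-based start positions where pattern matches text; wildcard matches any character."""
    n, m = len(text), len(pattern)

    def compatible(a, b):
        # A don't care on either side matches anything.
        return a == b or a == wildcard or b == wildcard

    return [i for i in range(n - m + 1)
            if all(compatible(text[i + j], pattern[j]) for j in range(m))]
```

For example, `dont_care_matches("abcabd", "ab?")` returns `[0, 3]`.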

A New Algorithm for Subset Matching Problem

2007

Yangjun Chen

The subset matching problem is to find all occurrences of a pattern string p of length m in a text string t of length n, where each pattern and text position is a set of characters drawn from some alphabet Σ. The pattern is said to occur at text position i if the set p[j] is a subset of the set t[i + j − 1], for all j (1 ≤ j ≤ m). This is a generalization of the ordinary string matching and can b...
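
The definition translates directly into a naive check of every alignment. The sketch below is that baseline, not the new algorithm of the abstract; it represents each position as a Python set, returns 1-based positions to match the indexing in the definition, and uses a hypothetical function name.

```python
def subset_matches(text, pattern):
    """1-based positions i such that pattern[j] is a subset of text[i + j - 1] for all j."""
    n, m = len(text), len(pattern)
    return [i + 1 for i in range(n - m + 1)
            # `<=` on sets is the subset test
            if all(pattern[j] <= text[i + j] for j in range(m))]
```

Ordinary string matching is the special case in which every set contains exactly one character.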
