Parallel Suffix Arrays for Corpus Exploration

نویسنده

  • Johannes Goller
چکیده

This paper describes how recently developed techniques for suffix array construction and compression can be expanded to bring a new data structure, called parallel suffix array, into existence, which is suitable as an in-memory representation of large annotated corpora, enabling complex queries and fast extractions of the context of matching substrings. It is also shown how parallel suffix arrays are superior to existing corpus search engines, in particular when sequential queries and corpora that are hard to tokenize are involved.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Parallel Suffix Arrays for Linguistic Pattern Search

The paper presents the results of an analysis of the merits and problems of using suffix arrays as an index data structure for annotated natural-language corpora. It shows how multiple suffix arrays can be combined to represent layers of annotation, and how this enables matches for complex linguistic patterns to be identified in the corpus quickly and, for a large subclass of patterns, with gre...

متن کامل

K-Best Suffix Arrays

Suppose we have a large dictionary of strings. Each entry starts with a figure of merit (popularity). We wish to find the kbest matches for a substring, s, in a dictinoary, dict. That is, grep s dict | sort –n | head –k, but we would like to do this in sublinear time. Example applications: (1) web queries with popularities, (2) products with prices and (3) ads with click through rates. This pap...

متن کامل

Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs

Translation models in statistical machine translation can be scaled to large corpora and arbitrarily-long phrases by looking up translations of source phrases “on the fly” in an indexed parallel corpus using suffix arrays. However, this can be slow because on-demand extraction of phrase tables is computationally expensive. We address this problem by developing novel algorithms for general purpo...

متن کامل

Translation of Multiword Expressions Using Parallel Suffix Arrays

Accurately translating multiword expressions is important to obtain good performance in machine translation, crosslanguage information retrieval, and other multilingual tasks in human language technology. Existing approaches to inducing translation equivalents of multiword units have focused on agglomerating individual words or on aligning words in a statistical machine translation system. We p...

متن کامل

Suffix Arrays in Parallel

Suffix arrays are powerful data structures for text indexing. In this paper we present parallel algorithms devised to increase throughput of suffix arrays on a multiple-query setting. Experimental results show that efficient performance is indeed feasible in this strongly sequential and very poor locality data structure.

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010