Experimental Study of Higher-gram Index Length for N-gram Full Text Search System
Similar resources
n-Gram-Based Text Compression
We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It has a significant compression ratio in comparison with those of state-of-the-art methods on the same dataset. Given a text, first, the proposed method splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window with a size that ranges from bi...
N-gram-based Text Attribution
Quantitative authorship attribution refers to the task of identifying the author of a text based on measurable features of the author’s style—a problem that has practical application in areas as diverse as literary scholarship, plagiarism detection, and criminal forensics. Attribution methods generally follow a generative approach, wherein a statistical “profile” is created for a set of candida...
N-Gram-Based Text Categorization
Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization ...
A Study Using n-gram Features for Text Categorization
In this paper, we study the effect of using n-grams (sequences of words of length n) for text categorization. We use an efficient algorithm for generating such n-gram features in two benchmark domains, the 20 newsgroups data set and 21,578 REUTERS newswire articles. Our results with the rule learning algorithm RIPPER indicate that, after the removal of stop words, word sequences of length 2 or...
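As a rough illustration of what "sequences of words of length n" means, the sketch below generates word n-grams from a string. The function name `word_ngrams`, the whitespace tokenization, and the lowercasing are assumptions made for this example; it is not the feature-generation algorithm from the paper, and it performs no stop-word removal.

```python
def word_ngrams(text, n):
    """Return all contiguous word sequences of length n as tuples.

    Assumption: words are separated by whitespace and compared in lowercase.
    """
    words = text.lower().split()
    # Slide a window of n words across the token list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


if __name__ == "__main__":
    for gram in word_ngrams("Text categorization with n-grams", 2):
        print(gram)
```

A text with fewer than n words yields an empty list, since no full window fits.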
n-Gram/2L-approximation: a two-level n-gram inverted index structure for approximate string matching
Approximate string matching is to find all the occurrences of a query string in a text database allowing a specified number of errors. Approximate string matching based on the n-gram inverted index (simply, n-gram Matching) has been widely used. A major reason is that it is scalable for large databases since it is not a main memory algorithm. Nevertheless, n-gram Matching also has drawbacks: th...
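To make the idea of an n-gram inverted index concrete, here is a minimal single-level sketch that maps each character n-gram to the positions where it occurs and answers exact substring queries by intersecting aligned postings. The names `char_ngrams`, `build_index`, and `search` are hypothetical, the index lives in memory, and the sketch handles neither errors (approximate matching) nor the two-level structure named in the title above.

```python
from collections import defaultdict


def char_ngrams(text, n):
    """Return (gram, offset) pairs for every overlapping character n-gram."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]


def build_index(docs, n):
    """Build an inverted index: n-gram -> list of (doc_id, offset) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for gram, offset in char_ngrams(text, n):
            index[gram].append((doc_id, offset))
    return index


def search(index, query, n):
    """Return sorted (doc_id, start) pairs where the query occurs exactly.

    Each posting is shifted back by the gram's position in the query, so
    postings that belong to the same occurrence map to the same start.
    """
    grams = char_ngrams(query, n)
    if not grams:
        return []  # a query shorter than n cannot be answered by this index
    first_gram, first_shift = grams[0]
    candidates = {(d, o - first_shift) for d, o in index.get(first_gram, [])}
    for gram, shift in grams[1:]:
        candidates &= {(d, o - shift) for d, o in index.get(gram, [])}
    return sorted(candidates)


if __name__ == "__main__":
    docs = ["full text search", "text index"]
    idx = build_index(docs, 2)
    print(search(idx, "text", 2))
```

A larger n produces more distinct grams with shorter posting lists, but queries shorter than n can no longer be served directly, which is the kind of trade-off an index-length study has to weigh.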
Journal
Journal title: IEEJ Transactions on Electronics, Information and Systems
Year: 2006
ISSN: 0385-4221,1348-8155
DOI: 10.1541/ieejeiss.126.1173