Compressing Multisets With Large Alphabets
Authors
Abstract
Current methods which compress multisets at an optimal rate have computational complexity that scales linearly with alphabet size, making them too slow to be practical in many real-world settings. We show how to convert a compression algorithm for sequences into one for multisets, in exchange for an additional complexity term that is quasi-linear in sequence length. This allows us to compress multisets of exchangeable symbols at an optimal rate, with computational complexity decoupled from the alphabet size. The key insight is to avoid encoding the multiset directly, and instead compress a proxy sequence, using a technique called bits-back coding. We demonstrate the method experimentally on tasks which are intractable with previous optimal-rate methods: compression of multisets of images and JavaScript Object Notation (JSON) files. Code for our experiments is available at https://github.com/facebookresearch/multiset-compression.
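As a rough illustration of what is at stake when a multiset is compressed through a proxy sequence: a multiset of n symbols with multiplicities m_1, ..., m_k has n! / (m_1! · ... · m_k!) distinct orderings, and for exchangeable symbols all of them are equally likely, so an ordered encoding spends log2 of that count on order information the multiset does not carry. The sketch below only computes this quantity. It is not the paper's algorithm, and the function names `num_orderings` and `order_information_bits` are hypothetical.

```python
from collections import Counter
from math import factorial, log2


def num_orderings(symbols):
    """Number of distinct sequences that share the multiset of `symbols`.

    This is the multinomial coefficient n! / (m_1! * ... * m_k!), where
    m_i is the multiplicity of the i-th distinct symbol.
    """
    counts = Counter(symbols)
    total = factorial(len(symbols))
    for multiplicity in counts.values():
        # Exact integer division: the multinomial coefficient is an integer.
        total //= factorial(multiplicity)
    return total


def order_information_bits(symbols):
    """Bits that an ordered encoding spends on order alone (illustrative)."""
    return log2(num_orderings(symbols))


if __name__ == "__main__":
    example = ["a", "a", "b"]
    print(num_orderings(example))           # aab, aba, baa
    print(order_information_bits(example))  # log2 of the count above
```

The count depends only on the multiplicities, not on how large the underlying alphabet is, which is consistent with the abstract's emphasis on decoupling from alphabet size.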
Similar resources
Visual Mining of Powersets with Large Alphabets
We present the PowerSetViewer visualization system for the lattice-based mining of powersets. Searching for items within the powerset of a universe occurs in many large dataset knowledge discovery contexts. Using a spatial layout based on a powerset provides a unified visual framework at three different levels: data mining on the filtered dataset, browsing the entire dataset, and comparing mult...
Optimal Suffix Tree Construction with Large Alphabets
The suffix tree of a string is the fundamental data structure of combinatorial pattern matching. Weiner [Wei73], who introduced the data structure, gave an O(n) time algorithm for building the suffix tree of an n character string drawn from a constant size alphabet. In the comparison model, there is a trivial Ω(n log n) time lower bound based on sorting, and Weiner's algorithm matches this ...
Large Alphabets and Incompressibility
We briefly survey some concepts related to empirical entropy (normal numbers, de Bruijn sequences and Markov processes) and investigate how well it approximates Kolmogorov complexity. Our results suggest lth-order empirical entropy stops being a reasonable complexity metric for almost all strings of length m over alphabets of size n about when n^l surpasses m.
Capacity of Random Channels with Large Alphabets
We consider discrete memoryless channels with input alphabet size n and output alphabet size m, where m = ⌈γn⌉ for some constant γ > 0. The channel transition matrix consists of entries that, before being normalized, are independent and identically distributed nonnegative random variables V and such that E [ (V logV ) ] < ∞. We prove that in the limit as n → ∞ the capacity of such a channel con...
Enumeration of sequences with large alphabets
A binary sequence of length n with w ones can be identified by its lexicographical rank in the set of all binary sequences with the same number of ones and zeros, which is of size n! / (w! · (n−w)!). Although that enumeration has been deeply studied for the binary case, it is less addressed for σ-ary sequences, where σ > 2. Assuming n is a fixed predetermined parameter, the enumerative coding of a given n-s...
Journal
Journal title: IEEE Journal on Selected Areas in Information Theory
Year: 2022
ISSN: 2641-8770
DOI: https://doi.org/10.1109/jsait.2023.3245417