Compressing Multisets With Large Alphabets

Abstract

Current methods which compress multisets at an optimal rate have computational complexity that scales linearly with alphabet size, making them too slow to be practical in many real-world settings. We show how to convert a compression algorithm for sequences into one for multisets, in exchange for an additional complexity term that is quasi-linear in sequence length. This allows us to compress multisets of exchangeable symbols at an optimal rate, with computational complexity decoupled from the alphabet size. The key insight is to avoid encoding the multiset directly, and instead compress a proxy sequence, using a technique called bits-back coding. We demonstrate the method experimentally on tasks which are intractable with previous optimal-rate methods: compression of multisets of images and JavaScript Object Notation (JSON) files. Code for our experiments is available at https://github.com/facebookresearch/multiset-compression.
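To give a feel for what is at stake, the following minimal sketch (not the authors' code; the function name `ordering_bits` is hypothetical) counts the bits that describe only the order of a sequence once its multiset is known, namely log2(n! / ∏ m_i!), where m_i are the symbol multiplicities. These are the bits a multiset code does not need to spend. The sketch only touches symbols that actually occur, so its cost does not depend on the alphabet size.

```python
from collections import Counter
from math import factorial, log2


def ordering_bits(sequence):
    """Bits needed to specify the order of `sequence` given its multiset.

    Illustrative helper (an assumption for this sketch, not an API from
    the paper's repository). A multiset with multiplicities m_1, ..., m_k
    and n = sum(m_i) elements corresponds to n! / (m_1! * ... * m_k!)
    distinct orderings.
    """
    counts = Counter(sequence)  # only symbols that occur are stored
    n_orderings = factorial(len(sequence))
    for m in counts.values():
        n_orderings //= factorial(m)  # exact: partial quotients stay integers
    return log2(n_orderings)


if __name__ == "__main__":
    # Each item could be an image or a JSON document; hashable stand-ins here.
    print(ordering_bits("abracadabra"))
```

When all n elements are distinct the count reduces to log2(n!), which is the largest possible saving for a collection of that size.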

Related resources

Visual Mining of Powersets with Large Alphabets

We present the PowerSetViewer visualization system for the lattice-based mining of powersets. Searching for items within the powerset of a universe occurs in many large dataset knowledge discovery contexts. Using a spatial layout based on a powerset provides a unified visual framework at three different levels: data mining on the filtered dataset, browsing the entire dataset, and comparing mult...

Optimal Suffix Tree Construction with Large Alphabets

The suffix tree of a string is the fundamental data structure of combinatorial pattern matching. Weiner [Wei73], who introduced the data structure, gave an O(n) time algorithm for building the suffix tree of an n-character string drawn from a constant-size alphabet. In the comparison model, there is a trivial Ω(n log n) time lower bound based on sorting, and Weiner's algorithm matches this ...

Large Alphabets and Incompressibility

We briefly survey some concepts related to empirical entropy (normal numbers, de Bruijn sequences and Markov processes) and investigate how well it approximates Kolmogorov complexity. Our results suggest that lth-order empirical entropy stops being a reasonable complexity metric for almost all strings of length m over alphabets of size n about when n^l surpasses m.

Capacity of Random Channels with Large Alphabets

We consider discrete memoryless channels with input alphabet size n and output alphabet size m, where m = ⌈γn⌉ for some constant γ > 0. The channel transition matrix consists of entries that, before being normalized, are independent and identically distributed nonnegative random variables V and such that E [ (V logV ) ] < ∞. We prove that in the limit as n → ∞ the capacity of such a channel con...

Enumeration of sequences with large alphabets

A binary sequence of length n with w ones can be identified by its lexicographical rank in the set of all binary sequences with the same number of ones and zeros, which is of size n! / (w! · (n−w)!). Although this enumeration has been studied in depth for the binary case, it is less addressed for σ-ary sequences, where σ > 2. Assuming n is a fixed predetermined parameter, the enumerative coding of a given n-s...
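For the binary case described in this abstract, the rank can be computed with the standard enumerative-coding recurrence. The sketch below is only an illustration of that binary case, not the σ-ary method of the listed paper; the function name `lex_rank` is hypothetical.

```python
from math import comb


def lex_rank(bits):
    """Zero-based lexicographic rank of `bits` among all binary sequences
    of the same length with the same number of ones.

    Illustrative helper (name chosen for this sketch only).
    """
    n = len(bits)
    ones_left = sum(bits)
    rank = 0
    for i, b in enumerate(bits):
        if b == 1:
            # Sequences with the same prefix but a 0 at position i come first;
            # they place all `ones_left` ones in the remaining n - i - 1 slots.
            rank += comb(n - i - 1, ones_left)
            ones_left -= 1
    return rank


if __name__ == "__main__":
    for seq in ([0, 0, 1, 1], [0, 1, 0, 1], [1, 1, 0, 0]):
        print(seq, lex_rank(seq))
```

The rank always lies between 0 and C(n, w) − 1, so it can be stored in ⌈log2 C(n, w)⌉ bits.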

Journal

Journal title: IEEE Journal on Selected Areas in Information Theory

Year: 2022

ISSN: 2641-8770

DOI: https://doi.org/10.1109/jsait.2023.3245417