Set Shaping Theory Applied to Universal Coding

2026-06-15Information Theory

Information Theory
AI summary

The authors study how to improve the compression of completely random sequences, which are hard to compress because they have no patterns. They introduce a method called Set Shaping Theory (SST) that transforms each sequence into a slightly longer one in a way that reduces the overall coding length compared to a known standard method (Krichevsky-Trofimov coding). This transformation works for various compression algorithms without changing their inner workings, acting like a preprocessing step. Their findings suggest SST can make universal coders more efficient for these difficult cases.

Universal codingSet Shaping TheoryKrichevsky-Trofimov codingEmpirical frequencyEntropyAdaptive arithmetic codingEnumerative codingLZ78Adaptive Huffman codingAdaptive ANS
Authors
Alix Petit, Aida Koch, Logan Lewis, Lily Scott
Abstract
Universal coders process individual sequences without assuming that the source distribution is known. In this setting, uniformly generated sequences represent the most difficult test case: the source simulates pure randomness, contains no exploitable bias, and forces a frequency-estimating universal coder to infer the empirical composition entirely from the sequence itself. This paper reports that a Set Shaping Theory (SST) transformation systematically reduces the average universal coding length of uniformly generated sequences below the Krichevsky-Trofimov baseline N H_0(s) + R_KT(s). The transformation maps each input sequence s in A^N into an expanded sequence f(s) in A^(N+1), while storing the transformation index in the additional symbol in order to preserve reversibility. The comparison evaluates the exact Krichevsky-Trofimov baseline N H_0(s) + R_KT(s) against the shaped score (N+1)H_0(f(s)) + R_KT(f(s)), where H_0 is computed from the empirical frequencies of the individual sequence. A single unified transformation also yields reductions across distinct compression architectures, including adaptive arithmetic coding, enumerative coding, LZ78, adaptive Huffman coding, and adaptive ANS. These results support the interpretation of SST as a representation-level preprocessing layer that can structurally improve existing universal coders without requiring internal modifications to their coding mechanisms. All results reported in the article can be reproduced with the simulator available at https://sst-simulator.github.io/Set-Shaping-Theory-Simulator/.