Thursday, November 8, 2012

Compressing permuted data

Mark Nelson posed a challenge:
[...] Challenge 2 is more along the lines of troll bait, because it is patently impossible: create a system to compress and then decompress any file of size 415,241 bytes. In other words, create a compressed file that is smaller than the input, then use only that compressed file to restore the original data.
Unlike Challenge 1, there are no size limitations on Challenge 2. Your compressor and decompressor can be as large as you like. [...]
To keep it simple, the files will be simple permutations of the million digit file – scrambled with an encryption system, or perhaps a random number generator, or whatever. The point is, they should have all the same numeric characteristics as the original file, just organized slightly differently.
He also supplied a file AMillionRandomDigits.bin, which contains 415,241 random bytes.
Of course it is well-known that it is impossible to write a compressor that can compress every possible data file.  But the last paragraph states that the files to be compressed "will be simple permutations of the million digit file" [1].  As I will show below, it is possible to compress files that are bytewise permutations of a known file, albeit by only a small amount.

The key point is that the number of permutations of AMillionRandomDigits.bin is considerably smaller than the number of files of length 415241. In fact, it is possible to compute the numbers exactly:

Number of files of length N = 256^N
Number of permutations of a file of length N = N! / (n_0! n_1! ... n_255!)

where n_b is the number of times that the byte value b appears in the original file.
The minimum size, in bits, to which a file drawn from an ensemble of equally likely files can be compressed is the logarithm base 2 of the number of files in the ensemble. Evaluating the above formulas with the byte frequencies in AMillionRandomDigits.bin, we find that an optimal compressor could compress any of these permutations by 235 bytes, or about 0.06%. So a tiny extra crumb of information can be leveraged to effect a tiny bit of compression.
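This calculation is easy to reproduce. Here is a minimal Python sketch (assuming AMillionRandomDigits.bin is in the working directory); it tallies the byte counts and evaluates the two formulas above in log space:

    import math
    from collections import Counter

    # Tally byte frequencies in the reference file.
    data = open("AMillionRandomDigits.bin", "rb").read()
    counts = Counter(data)
    N = len(data)                      # 415241

    # log2 of the number of permutations, N! / (n_0! n_1! ... n_255!),
    # computed with lgamma to avoid gigantic integers.
    log2_perms = (math.lgamma(N + 1)
                  - sum(math.lgamma(n + 1) for n in counts.values())) / math.log(2)

    # An optimal code for this ensemble needs log2_perms bits; the raw file is 8*N bits.
    savings = 8 * N - log2_perms
    print("%.0f bits = %.1f bytes = %.4f%%" % (savings, savings / 8, 100 * savings / (8 * N)))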
I was initially surprised by how little compression is made possible by knowledge of the byte frequency counts, though in retrospect it is pretty obvious:

N = 415241
Each byte value appears approximately N/256 = 1622 times
The variation in byte frequencies is about ±sqrt(1622) ≈ ±40
The fact that byte b appears n_b times therefore gives you (very roughly)
    lg(80) or 6.3 bits of information.
Since we have independent counts for 255 of the 256 byte values (the last count is determined by the total), that gives about
    6.3 bits * 255 ≈ 1600 bits ≈ 200 bytes of information

Thus this simple estimate suggests that the file can be compressed by about 200 bytes, not far off the correct value of 235 computed above.
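For the record, the same back-of-the-envelope arithmetic fits in a few lines of Python:

    import math

    N = 415241
    mean = N / 256                    # each byte value appears ~1622 times
    spread = 2 * math.sqrt(mean)      # counts vary by roughly +/- 40, so ~80 plausible values
    bits = 255 * math.log2(spread)    # ~6.3 bits per count, 255 (nearly) independent counts
    print("%.0f bits = about %.0f bytes" % (bits, bits / 8))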

Can we actually write a compressor/decompressor for such files?  In fact, it is easy to compress the file by one byte, as sketched below.  The compressor is trivial: simply drop the last byte from the file. To decompress,
  1. Compute the byte frequencies in AMillionRandomDigits.bin
  2. Compute the byte frequencies in the compressed file
  3. Determine which byte value appears one fewer time in the compressed file than in the original
  4. Append that byte to the end of the compressed file to produce the original
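Here is a minimal sketch of that one-byte scheme (again assuming the original AMillionRandomDigits.bin is available to both ends as the reference file):

    from collections import Counter

    REFERENCE = open("AMillionRandomDigits.bin", "rb").read()
    REF_COUNTS = Counter(REFERENCE)

    def compress(data):
        # "Compression" is just dropping the final byte.
        return data[:-1]

    def decompress(compressed):
        # The byte value whose count fell one short of the reference counts
        # must be the byte that was dropped; append it to restore the original.
        counts = Counter(compressed)
        for b in range(256):
            if counts[b] < REF_COUNTS[b]:
                return compressed + bytes([b])
        raise ValueError("input is not a permutation of the reference file minus one byte")

For any permutation p of the reference file, decompress(compress(p)) reproduces p, and the compressed output is one byte shorter than the input.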
Implementing optimal compression for such a file is left as an exercise for the reader.


[1] After I submitted my solution, Mark claimed that what he meant by "permutations" was not the mathematical meaning of the word, but rather... well, I don't really know; if it isn't meant precisely, then the whole paragraph is kind of meaningless.