How can I estimate the compressibility of a file without compressing it?

I'm using an event-loop-based server in Twisted Python that stores files, and I'd like to be able to classify the files according to their compressibility.

If the probability that they'd benefit from compression is high, they would go to a directory with btrfs compression switched on; otherwise they'd go elsewhere.

I do not need to be certain - 80% accuracy would be plenty, and would save a lot of disk space. But since there is a CPU and filesystem performance cost too, I cannot just store everything compressed.

The files are in the low megabytes. I cannot test-compress them in full without using a huge chunk of CPU and unduly delaying the event loop, or without refactoring a compression algorithm to fit into the event loop.

Is there any best practice for getting a quick estimate of compressibility? What I came up with is taking a small chunk (a few kB) of data from the beginning of the file, test-compressing it (with a presumably tolerable delay) and basing my decision on that.
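
Roughly what I have in mind, as a sketch (the 4 kB sample size, the 0.9 cutoff and the directory names are placeholders, and the test compression is pushed onto a worker thread with deferToThread so the reactor isn't held up):

    import zlib
    from twisted.internet import defer, threads

    def _compression_ratio(chunk):
        # Compressing a few kB is cheap, but do it in a worker thread
        # anyway so the event loop is never blocked.
        return float(len(zlib.compress(chunk))) / len(chunk)

    def classify(path, sample_size=4096, threshold=0.9):
        """Return a Deferred that fires with the target directory name."""
        with open(path, "rb") as f:
            chunk = f.read(sample_size)
        if not chunk:
            return defer.succeed("plain-dir")  # empty file: nothing to gain
        d = threads.deferToThread(_compression_ratio, chunk)
        d.addCallback(lambda r: "compressed-dir" if r < threshold else "plain-dir")
        return d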

Any suggestions? Hints? Flaws in my reasoning and/or problem?



Solution 1:[1]

Just 10K from the middle of the file will do the trick. You don't want the beginning or the end, since they may contain header or trailer information that is not representative of the rest of the file. 10K is enough to get some amount of compression with any typical algorithm. That will predict a relative amount of compression for the whole file, to the extent that that middle 10K is representative. The absolute ratio you get will not be the same as for the whole file, but the amount that it differs from no compression will allow you to set a threshold. Just experiment with many files to see where to set the threshold.

As noted, you can save time by doing nothing for files that are obviously already compressed, e.g. .png, .jpg, .mov, .pdf, .zip, etc.

Measuring entropy is not necessarily a good indicator, since it only gives the zeroth-order estimate of compressibility. If the entropy indicates that it is compressible enough, then it is right. If the entropy indicates that it is not compressible enough, then it may or may not be right. Your actual compressor is a much better estimator of compressibility. Running it on 10K won't take long.
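
A minimal sketch of this approach, assuming zlib as the test compressor (the 10K sample size follows the advice above; the 0.9 threshold and the example.dat path are purely illustrative, and the threshold should be set by experimenting with your own files):

    import os
    import zlib

    def sample_ratio(path, sample_size=10 * 1024, level=6):
        """Compress ~10K from the middle of the file and return
        compressed size / sample size (lower means more compressible)."""
        file_size = os.path.getsize(path)
        offset = max(0, (file_size - sample_size) // 2)
        with open(path, "rb") as f:
            f.seek(offset)
            sample = f.read(sample_size)
        if not sample:
            return 1.0  # empty file: no benefit from compression
        return float(len(zlib.compress(sample, level))) / len(sample)

    # Route the file based on an empirically chosen threshold.
    if sample_ratio("example.dat") < 0.9:
        print("store in the btrfs-compressed directory")
    else:
        print("store uncompressed")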

Solution 2:[2]

I think what you are looking for is How to calculate the entropy of a file?

That question collects all kinds of methods for calculating the entropy of a file (and from that you can gauge the 'compressibility' of a file). Here's a quote from the abstract of the article Relationship Between Entropy and Test Data Compression by Kedarnath J. Balakrishnan, Member, IEEE, and Nur A. Touba, Senior Member, IEEE:

The entropy of a set of data is a measure of the amount of information contained in it. Entropy calculations for fully specified data have been used to get a theoretical bound on how much that data can be compressed. This paper extends the concept of entropy for incompletely specified test data (i.e., that has unspecified or don't care bits) and explores the use of entropy to show how bounds on the maximum amount of compression for a particular symbol partitioning can be calculated. The impact of different ways of partitioning the test data into symbols on entropy is studied. For a class of partitions that use fixed-length symbols, a greedy algorithm for specifying the don't cares to reduce entropy is described. It is shown to be equivalent to the minimum entropy set cover problem and thus is within an additive constant error with respect to the minimum entropy possible among all ways of specifying the don't cares. A polynomial time algorithm that can be used to approximate the calculation of entropy is described. Different test data compression techniques proposed in the literature are analyzed with respect to the entropy bounds. The limitations and advantages of certain types of test data encoding strategies are studied using entropy theory

And to be more constructive, check out this site for a Python implementation of entropy calculations over chunks of data.
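
For reference, a zeroth-order (byte-frequency) Shannon entropy estimate takes only a few lines of Python; this sketch returns bits per byte, so values close to 8 suggest data that is already compressed or encrypted:

    import math
    from collections import Counter

    def shannon_entropy(data):
        """Zeroth-order Shannon entropy of a byte string, in bits per byte."""
        if not data:
            return 0.0
        total = float(len(data))
        counts = Counter(bytearray(data))
        return -sum((n / total) * math.log(n / total, 2)
                    for n in counts.values())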

Solution 3:[3]

Compressed files usually don't compress well. This means that just about any media file is not going to compress very well, since most media formats already include compression. Clearly there are exceptions, such as BMP and TIFF images, but you can probably build a whitelist of already-compressed file types (PNG, MPEG, and, venturing away from visual media, gzip, bzip2, etc.) to skip, and then assume the rest of the files you encounter will compress well.

If you feel like getting fancy, you could build feedback into the system (observe the results of any compression you do and associate the resulting ratio with the filetype). If you come across a filetype that has consistently poor compression, you could add it to the whitelist.

These ideas depend on being able to identify a file's type, but there are standard utilities which do a pretty good job of this (generally much better than 80%) - file(1), /etc/mime.types, etc.
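
A rough sketch of such a check using Python's standard mimetypes module (extension-based only; the skip set below is illustrative rather than exhaustive). file(1) inspects the content itself (magic numbers), so it is more robust when file names are unreliable:

    import mimetypes

    # MIME types whose payload is normally already compressed (illustrative).
    ALREADY_COMPRESSED = {
        "image/jpeg", "image/png", "image/gif",
        "video/mp4", "video/quicktime", "audio/mpeg",
        "application/zip", "application/pdf",
    }

    def worth_compressing(path):
        mime, encoding = mimetypes.guess_type(path)
        # guess_type reports gzip/bzip2/compress wrappers via 'encoding'.
        if encoding in ("gzip", "bzip2", "compress"):
            return False
        return mime not in ALREADY_COMPRESSED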

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution   | Source
Solution 1 |
Solution 2 | Community
Solution 3 | Jean-Paul Calderone