Compressing a collection of ISO files with similar content

I have a large collection of ISO files (around 1 GB each) that have shared 'runs of data' between them. For example, one of the audio tracks may be identical (same length and content) across 5 ISOs, but it may not have the same name or location in each.

Is there some compression technique I can apply that will detect and losslessly deduplicate this information across multiple files?



Solution 1:[1]

For anyone reading this: after some experimentation, it turns out that by putting all the similar ISO or CHD files into a single 7zip archive (solid archive, with the maximum dictionary size of 1536 MB), I was able to achieve extremely high compression through deduplication, even on already compressed data.
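
To see why a solid archive helps, here is a minimal sketch using Python's standard `lzma` module (the same compressor family 7zip uses by default, though not 7zip itself). The two "ISOs" are hypothetical stand-ins made of random bytes, and the helper name `xz_size` is only for illustration.

```python
import lzma
import os


def xz_size(data: bytes) -> int:
    """Return the size in bytes of data after LZMA (xz) compression."""
    return len(lzma.compress(data, preset=6))


# Hypothetical stand-ins for two ISOs: random (incompressible) bytes, with one
# identical 512 KiB "track" stored at a different offset in each image.
shared_track = os.urandom(512 * 1024)
iso_a = os.urandom(64 * 1024) + shared_track
iso_b = shared_track + os.urandom(64 * 1024)

# Compressing each file on its own: the shared track is stored twice.
separate_total = xz_size(iso_a) + xz_size(iso_b)

# Compressing both files as one stream (what a solid archive does):
# the second copy of the track can be encoded as references to the first.
solid_total = xz_size(iso_a + iso_b)

print(f"separate: {separate_total} bytes, solid: {solid_total} bytes")
```

The solid total should come out close to the size of the unique data only, because the duplicate track falls inside the compressor's dictionary. That is also why the large dictionary matters: duplicates further apart than the dictionary size cannot be matched.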

Solution 2:[2]

The lrzip program is designed for this kind of thing. It is available through the package managers of most Linux/BSD systems, or via Cygwin for Windows.

It uses an extended version of rzip to first de-duplicate the source files and then compress them. Because it uses mmap(), it does not have issues with the size of your RAM, like 7zip does.
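
The reason a long-range de-duplication pass matters can be illustrated with a small experiment. This is not lrzip's or rzip's algorithm, only a sketch using Python's standard `lzma` module to show that a duplicate lying further back than the compressor's match window is missed. The data is synthetic and the helper name `xz_size_with_dict` is hypothetical.

```python
import lzma
import os


def xz_size_with_dict(data: bytes, dict_size: int) -> int:
    """Compress data with LZMA2 using the given dictionary (window) size."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


# Synthetic input: the same 256 KiB block appears twice, separated by 1 MiB
# of unrelated random data (all of it incompressible on its own).
shared = os.urandom(256 * 1024)
filler = os.urandom(1024 * 1024)
data = shared + filler + shared

# A 64 KiB window cannot reach back 1.25 MiB to the first copy.
small_window = xz_size_with_dict(data, 64 * 1024)

# A 4 MiB window covers the whole input, so the second copy is matched.
large_window = xz_size_with_dict(data, 4 * 1024 * 1024)

print(f"64 KiB window: {small_window} bytes, 4 MiB window: {large_window} bytes")
```

With ISOs of around 1 GB each, duplicates can sit gigabytes apart, which is the distance a tool built for long-range matching is meant to cover.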

In my tests lrzip was able to massively de-duplicate similar ISOs, bringing a 32GB set of OS installation discs down to around 5GB.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 dvasdekis
Solution 2 user11567957