'Comparing MD5 hashes between XLSX files with identical content

We have an internal web application that accepts a file of varying formats from a user in order to import large amounts of data into our systems.

One of the more recent upgrades that we implemented was to add a way to detect if a file was previously uploaded, and if so, to present the user with a warning and an option to resubmit the file, or to cancel the upload.

To accomplish this, we're computing the MD5 of the uploaded file, and comparing that against a database table containing the previously uploaded file information to determine if it is a duplicate. If there was a match on the MD5, the warning is displayed, otherwise it inserts the new file information in the table and carries on with the file processing.

The following is the C# code used to generate the MD5 hash:

private static string GetHash(byte[] input)
{
    using (MD5 md5 = MD5.Create())
    {
        byte[] data = md5.ComputeHash(input);

        StringBuilder bob = new StringBuilder();

        for (int i = 0; i < data.Length; i++)
            bob.Append(data[i].ToString("x2").ToUpper());

        return bob.ToString();
    }
}

Everything is working fine and well... with one exception.

Users are allowed to upload .xlsx files for this process, and unfortunately this file type also stores the metadata of the file within the file contents. (This can easily be seen by changing the extension of the .xlsx file to a .zip and extracting the contents [see below].)

Excel Metadata

Because of this, the MD5 hash of the .xlsx files will change with each subsequent save, even if the contents of the file are identical (simply opening and saving the file with no modifications will refresh the metadata and lead to a different MD5 hash).

In this situation, a file with identical records, but created at different times or by different users will slip past the duplicate file detection, and be processed.

My question: is there a way to determine if the content of an .xlsx file matches that of a previous file without storing the file content? In other words: is there a way to generate an MD5 hash of just the contents of an .xlsx file?



Solution 1:[1]

You can remove the pieces from the document that should not influence the hash before you compute the hash.

This can be achieved by extracting all parts of the Open XML package into a single XML document, removing the undesired nodes and computing a hash of the resulting XML document. Note that you will have to recompute the hash for already uploaded Excel files for this to work because the hash now no longer is based on the binary file contents.

Here is a simple sample program (add a reference to WindowsBase.dll):

using System;
using System.IO;
using System.IO.Packaging;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Xml.Linq;

internal class Program
{
    private static readonly XNamespace dcterms = "http://purl.org/dc/terms/";

    private static void Main(string[] args)
    {
        var fileName = args[0];

        // open the ZIP package
        var package = Package.Open(fileName);

        // convert the package to a single XML document
        var xdoc = OpcToFlatOpc(package);

        // remove the nodes we are not interested in
        // here you can add other nodes as well
        xdoc.Descendants(dcterms + "modified").Remove();

        // get a stream of the XML and compute the hash
        using (var ms = new MemoryStream())
        {
            xdoc.Save(ms);
            ms.Position = 0;

            string md5 = GetHash(ms);
            Console.WriteLine(md5);
        }
    }

    private static string GetHash(Stream stream)
    {
        using (var md5 = MD5.Create())
        {
            var data = md5.ComputeHash(stream);

            var bob = new StringBuilder();

            for (int i = 0; i < data.Length; i++)
            {
                bob.Append(data[i].ToString("X2"));
            }

            return bob.ToString();
        }
    }

    private static XDocument OpcToFlatOpc(Package package)
    {
        XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";
        var declaration = new XDeclaration("1.0", "UTF-8", "yes");
        var doc = new XDocument(
            declaration,
            new XProcessingInstruction("mso-application", "progid=\"Word.Document\""),
            new XElement(
                pkg + "package",
                new XAttribute(XNamespace.Xmlns + "pkg", pkg.ToString()),
                package.GetParts().Select(GetContentsAsXml)));

        return doc;
    }

    private static XElement GetContentsAsXml(PackagePart part)
    {
        XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";
        if (part.ContentType.EndsWith("xml"))
        {
            using (var partstream = part.GetStream())
            {
                using (var streamReader = new StreamReader(partstream))
                {
                    string streamString = streamReader.ReadToEnd();
                    if (!string.IsNullOrEmpty(streamString))
                    {
                      var newXElement =
                        new XElement(
                          pkg + "part",
                          new XAttribute(pkg + "name", part.Uri),
                          new XAttribute(pkg + "contentType", part.ContentType),
                          new XElement(pkg 
                            + "xmlData", XElement.Parse(streamString)));
                        return newXElement;
                    }

                    return null;
                }
            }
        }

        using (var str = part.GetStream())
        {
            using (var binaryReader = new BinaryReader(str))
            {
                int len = (int)binaryReader.BaseStream.Length;
                var byteArray = binaryReader.ReadBytes(len);

                // the following expression creates the base64String, then chunks
                // it to lines of 76 characters long
                string base64String = Convert.ToBase64String(byteArray)
                       .Select((c, i) => new { Character = c, Chunk = i / 76 })
                       .GroupBy(c => c.Chunk)
                       .Aggregate(
                           new StringBuilder(),
                           (s, i) =>
                               s.Append(
                                   i.Aggregate(
                                       new StringBuilder(),
                                       (seed, it) => seed.Append(it.Character),
                                       sb => sb.ToString()))
                                .Append(Environment.NewLine),
                           s => s.ToString());

                return new XElement(
                    pkg + "part",
                    new XAttribute(pkg + "name", part.Uri),
                    new XAttribute(pkg + "contentType", part.ContentType),
                    new XAttribute(pkg + "compression", "store"),
                    new XElement(pkg + "binaryData", base64String));
            }
        }
    }
}

Solution 2:[2]

A simpler strategy takes advantage of the fact that the .xlsx file extension is the same as the .zip file extension. A .zip file includes a CRC-32 hash of un-compressed version of its content.

See: wikipedia entry on ZIP format file headers

Specifically, if you start at an offset of 14 bytes and take 4 bytes, you get a hash (really, a CRC-32 checksum) of the file's contents at the time the file was created.

For strategies on extracting bytes look here: StackO post on reading bytes from a file in C# . The best solution appears to be something like:

byte[] crc32 = new byte[4];
using (BinaryReader reader = new BinaryReader(new FileStream(xlsxFile, FileMode.Open)))
{
    reader.BaseStream.Seek(14, SeekOrigin.Begin);
    reader.Read(crc32, 0, 4);
}

But -- warning -- this is an untested suggestion. YMMV.

Another caveat -- Many applications that open .xlsx files make seemingly arbitrary changes to ids within the xml sub-files that get compressed into the larger spreadsheet file. Two files can have have effectively identical content but their checksums will differ because of these differences.

For users on Unix / Linux / MacOS users who want to compare .xlsx with scriptable shell utilities (xxd, unzip, diff) see: StackO post on shell utilities for comparing XLSX files .

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Dirk Vollmar
Solution 2 ptmalcolm