'Comparing MD5 hashes between XLSX files with identical content
We have an internal web application that accepts a file of varying formats from a user in order to import large amounts of data into our systems.
One of the more recent upgrades that we implemented was to add a way to detect if a file was previously uploaded, and if so, to present the user with a warning and an option to resubmit the file, or to cancel the upload.
To accomplish this, we're computing the MD5
of the uploaded file, and comparing that against a database table containing the previously uploaded file information to determine if it is a duplicate. If there was a match on the MD5
, the warning is displayed, otherwise it inserts the new file information in the table and carries on with the file processing.
The following is the C#
code used to generate the MD5
hash:
private static string GetHash(byte[] input)
{
using (MD5 md5 = MD5.Create())
{
byte[] data = md5.ComputeHash(input);
StringBuilder bob = new StringBuilder();
for (int i = 0; i < data.Length; i++)
bob.Append(data[i].ToString("x2").ToUpper());
return bob.ToString();
}
}
Everything is working fine and well... with one exception.
Users are allowed to upload .xlsx
files for this process, and unfortunately this file type also stores the metadata of the file within the file contents. (This can easily be seen by changing the extension of the .xlsx
file to a .zip
and extracting the contents [see below].)
Because of this, the MD5
hash of the .xlsx
files will change with each subsequent save, even if the contents of the file are identical (simply opening and saving the file with no modifications will refresh the metadata and lead to a different MD5
hash).
In this situation, a file with identical records, but created at different times or by different users will slip past the duplicate file detection, and be processed.
My question: is there a way to determine if the content of an .xlsx
file matches that of a previous file without storing the file content? In other words: is there a way to generate an MD5
hash of just the contents of an .xlsx
file?
Solution 1:[1]
You can remove the pieces from the document that should not influence the hash before you compute the hash.
This can be achieved by extracting all parts of the Open XML package into a single XML document, removing the undesired nodes and computing a hash of the resulting XML document. Note that you will have to recompute the hash for already uploaded Excel files for this to work because the hash now no longer is based on the binary file contents.
Here is a simple sample program (add a reference to WindowsBase.dll):
using System;
using System.IO;
using System.IO.Packaging;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Xml.Linq;
internal class Program
{
private static readonly XNamespace dcterms = "http://purl.org/dc/terms/";
private static void Main(string[] args)
{
var fileName = args[0];
// open the ZIP package
var package = Package.Open(fileName);
// convert the package to a single XML document
var xdoc = OpcToFlatOpc(package);
// remove the nodes we are not interested in
// here you can add other nodes as well
xdoc.Descendants(dcterms + "modified").Remove();
// get a stream of the XML and compute the hash
using (var ms = new MemoryStream())
{
xdoc.Save(ms);
ms.Position = 0;
string md5 = GetHash(ms);
Console.WriteLine(md5);
}
}
private static string GetHash(Stream stream)
{
using (var md5 = MD5.Create())
{
var data = md5.ComputeHash(stream);
var bob = new StringBuilder();
for (int i = 0; i < data.Length; i++)
{
bob.Append(data[i].ToString("X2"));
}
return bob.ToString();
}
}
private static XDocument OpcToFlatOpc(Package package)
{
XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";
var declaration = new XDeclaration("1.0", "UTF-8", "yes");
var doc = new XDocument(
declaration,
new XProcessingInstruction("mso-application", "progid=\"Word.Document\""),
new XElement(
pkg + "package",
new XAttribute(XNamespace.Xmlns + "pkg", pkg.ToString()),
package.GetParts().Select(GetContentsAsXml)));
return doc;
}
private static XElement GetContentsAsXml(PackagePart part)
{
XNamespace pkg = "http://schemas.microsoft.com/office/2006/xmlPackage";
if (part.ContentType.EndsWith("xml"))
{
using (var partstream = part.GetStream())
{
using (var streamReader = new StreamReader(partstream))
{
string streamString = streamReader.ReadToEnd();
if (!string.IsNullOrEmpty(streamString))
{
var newXElement =
new XElement(
pkg + "part",
new XAttribute(pkg + "name", part.Uri),
new XAttribute(pkg + "contentType", part.ContentType),
new XElement(pkg
+ "xmlData", XElement.Parse(streamString)));
return newXElement;
}
return null;
}
}
}
using (var str = part.GetStream())
{
using (var binaryReader = new BinaryReader(str))
{
int len = (int)binaryReader.BaseStream.Length;
var byteArray = binaryReader.ReadBytes(len);
// the following expression creates the base64String, then chunks
// it to lines of 76 characters long
string base64String = Convert.ToBase64String(byteArray)
.Select((c, i) => new { Character = c, Chunk = i / 76 })
.GroupBy(c => c.Chunk)
.Aggregate(
new StringBuilder(),
(s, i) =>
s.Append(
i.Aggregate(
new StringBuilder(),
(seed, it) => seed.Append(it.Character),
sb => sb.ToString()))
.Append(Environment.NewLine),
s => s.ToString());
return new XElement(
pkg + "part",
new XAttribute(pkg + "name", part.Uri),
new XAttribute(pkg + "contentType", part.ContentType),
new XAttribute(pkg + "compression", "store"),
new XElement(pkg + "binaryData", base64String));
}
}
}
}
Solution 2:[2]
A simpler strategy takes advantage of the fact that the .xlsx
file extension is the same as the .zip
file extension. A .zip
file includes a CRC-32 hash of un-compressed version of its content.
See: wikipedia entry on ZIP format file headers
Specifically, if you start at an offset of 14 bytes and take 4 bytes, you get a hash (really, a CRC-32 checksum) of the file's contents at the time the file was created.
For strategies on extracting bytes look here: StackO post on reading bytes from a file in C# . The best solution appears to be something like:
byte[] crc32 = new byte[4];
using (BinaryReader reader = new BinaryReader(new FileStream(xlsxFile, FileMode.Open)))
{
reader.BaseStream.Seek(14, SeekOrigin.Begin);
reader.Read(crc32, 0, 4);
}
But -- warning -- this is an untested suggestion. YMMV.
Another caveat -- Many applications that open .xlsx
files make seemingly arbitrary changes to ids within the xml sub-files that get compressed into the larger spreadsheet file. Two files can have have effectively identical content but their checksums will differ because of these differences.
For users on Unix / Linux / MacOS users who want to compare .xlsx
with scriptable shell utilities (xxd
, unzip
, diff
) see: StackO post on shell utilities for comparing XLSX files .
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Dirk Vollmar |
Solution 2 | ptmalcolm |