md5sum in the shell and Python's hashlib.md5 give different results

I am comparing two qcow2 image files at two different locations to see whether they differ: /opt/images/file.qcow2 and /mnt/images/file.qcow2.

When I run

md5sum /opt/images/file.qcow2
md5sum /mnt/images/file.qcow2

the checksum is the same for both files.

But when I try to compute the md5sum using the following piece of code

def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))

    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = hashlib.md5(file1).hexdigest()
        md5File2 = hashlib.md5(file2).hexdigest()
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False

    if md5File1 == md5File2:
        return True
    else:
        return False

It says the checksums are not the same.

UPDATE: The files can be as large as 8 GB.



Solution 1:[1]

You are hashing the path of the file, not the content ...

hashlib.md5(file1).hexdigest()  # file1 == '/path/to/file.ext'
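
To make the difference concrete, here is a minimal sketch that hashes a path string and then the bytes stored at that path. The temporary file and its contents are made up for illustration. Note that on Python 3, hashlib.md5() only accepts bytes-like input, so passing the path as a str raises TypeError; the path has to be encoded before the mistake even reproduces.

```python
import hashlib
import os
import tempfile

# Create a throwaway file with known contents (illustrative only).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world\n")
    path = tmp.name

# Hash of the path string: depends only on the file's name and location.
path_digest = hashlib.md5(path.encode("utf-8")).hexdigest()

# Hash of the file's bytes: this is what md5sum computes.
with open(path, "rb") as f:
    content_digest = hashlib.md5(f.read()).hexdigest()

print("path hash:   ", path_digest)
print("content hash:", content_digest)

os.remove(path)
```

Two copies of the same image in different directories have different paths, so their path hashes differ even though their content hashes match.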

To hash the content:

import hashlib
import os

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(16384), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))

    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = md5(file1)
        md5File2 = md5(file2)
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False

    if md5File1 == md5File2:
        return True
    else:
        return False

Side note: you probably want to use hashlib.sha1() (paired with Unix's sha1sum) instead of MD5, which is cryptographically broken and deprecated.
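
Switching algorithms does not require a second function. The sketch below passes the algorithm name to hashlib.new(); the helper name hash_file and its defaults are my own choices for illustration, not part of the answer above.

```python
import hashlib

def hash_file(path, algorithm="sha1", chunk_size=16384):
    """Return the hex digest of a file, read in fixed-size chunks."""
    hasher = hashlib.new(algorithm)  # e.g. "md5", "sha1", "sha256"
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

With algorithm="sha1" the result should match sha1sum, and with algorithm="sha256" it should match sha256sum, assuming both tools read the same bytes.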

Edit: Benchmark with various buffer sizes, MD5 vs. SHA-1, using a 100 MB random file on a crappy server (Atom N2800 @ 1.86 GHz):

+-----------+----------------+---------------+
| Algorithm | Buffer (bytes) | Time (s)      |
+-----------+----------------+---------------+
| md5sum    | ---            | 0.387         |
| MD5       | 2⁶             | 21.5670549870 |
| MD5       | 2⁸             | 6.64844799042 |
| MD5       | 2¹⁰            | 3.12886619568 |
| MD5       | 2¹²            | 1.82865810394 |
| MD5       | 2¹⁴            | 1.27349495888 |
| MD5       | 128¹           | 11.5235209465 |
| MD5       | 128²           | 1.27280807495 |
| MD5       | 128³           | 1.16839885712 |
| sha1sum   | ---            | 1.013         |
| SHA1      | 2⁶             | 23.4520659447 |
| SHA1      | 2⁸             | 7.75686216354 |
| SHA1      | 2¹⁰            | 3.82775402069 |
| SHA1      | 2¹²            | 2.52755594254 |
| SHA1      | 2¹⁴            | 1.93437695503 |
| SHA1      | 128¹           | 12.9430441856 |
| SHA1      | 128²           | 1.93382811546 |
| SHA1      | 128³           | 1.81412386894 |
+-----------+----------------+---------------+

So md5sum is faster than sha1sum, and Python's implementations show the same ordering. A bigger buffer improves performance, but only up to a point; 16384 bytes seems a good tradeoff (efficient without being too big).
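
If you want to repeat this kind of measurement on your own hardware, a rough sketch could look like the following. It is not the script that produced the numbers above: the helper names, the 8 MiB random test file, and the chosen buffer sizes are assumptions for illustration, and timings will vary with disk cache and CPU.

```python
import hashlib
import os
import tempfile
import time

def md5_with_buffer(path, buffer_size):
    """Hex MD5 of a file, read buffer_size bytes at a time."""
    hasher = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(buffer_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def time_buffer_sizes(path, buffer_sizes):
    """Return {buffer_size: (seconds, hex digest)} for each size."""
    results = {}
    for size in buffer_sizes:
        start = time.perf_counter()
        digest = md5_with_buffer(path, size)
        results[size] = (time.perf_counter() - start, digest)
    return results

if __name__ == "__main__":
    # 8 MiB of random data stands in for a larger benchmark file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    sizes = [2**6, 2**10, 2**14, 128**3]
    for size, (seconds, _) in time_buffer_sizes(tmp.name, sizes).items():
        print("{:>8} bytes: {:.3f} s".format(size, seconds))
    os.remove(tmp.name)
```

The digest is identical for every buffer size; only the running time changes.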

Solution 2:[2]

Try this:

import os
from hashlib import md5

def md5File(filename):
    hasher = md5()
    blockSize = 16 * 1024 * 1024

    with open(filename, 'rb') as f:
        while True:
            fileBuffer = f.read(blockSize)
            if not fileBuffer:
                break

            hasher.update(fileBuffer)

    return hasher.hexdigest()

def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))

    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = md5File(file1)
        md5File2 = md5File(file2)
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False

    return md5File1 == md5File2

When you just do hashlib.md5(file1).hexdigest(), you're literally just md5'ing the name of the file. You actually want to md5 the content, which requires opening and reading the file using Python file operations. The method I've posted above can hash a large file without reading the whole thing into memory.
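
On Python 3.11 and later, the standard library can do the chunked reading for you through hashlib.file_digest(). The sketch below uses it when available and otherwise falls back to a manual loop; the function name md5_of_file and the 1 MiB fallback block size are assumptions made for this example.

```python
import hashlib

def md5_of_file(path, block_size=1024 * 1024):
    """Hex MD5 of a file without loading it all into memory."""
    with open(path, "rb") as f:
        if hasattr(hashlib, "file_digest"):
            # Python 3.11+: hashlib reads the file object in chunks itself.
            return hashlib.file_digest(f, "md5").hexdigest()
        # Older versions: fall back to a manual chunked loop.
        hasher = hashlib.md5()
        for block in iter(lambda: f.read(block_size), b""):
            hasher.update(block)
        return hasher.hexdigest()
```

Either branch should give the same value that md5sum prints for the file.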

Solution 3:[3]

How about using the code below:

import hashlib
import os

def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))

    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = hashlib.md5(open(file1,"rb").read()).hexdigest()
        md5File2 = hashlib.md5(open(file2,"rb").read()).hexdigest()
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False

    if md5File1 == md5File2:
        return True
    else:
        return False

Please note that this works well for smaller files, because it reads each file into memory in one go. If the file is large, it is better to read it chunk by chunk, as in the examples given above. For that case, you could use the following code:

import hashlib
import time

with open("Some_Very_Large_File", "rb") as f:
    hasher = hashlib.md5()
    a = time.time()
    while True:
        data = f.read(3 * 1024)  # read 3 KiB at a time
        if not data:
            break
        hasher.update(data)
    print(hasher.hexdigest())
    b = time.time()
    print("Done hashing in", b - a, "seconds")

These are the benchmarks I observed:

3.26 GB media file: hash calculated in 11.26 s.
4.8 GB file: hash calculated in 16.47 s.
10.8 GB file: hash calculated in 102.36 s.

Please try the code and do let me know.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3