md5sum in the shell and Python's hashlib.md5 give different results
I am comparing two qcow2 image files at two different locations to see whether they differ: /opt/images/file.qcow2 and /mnt/images/file.qcow2
When I run
md5sum /opt/images/file.qcow2
md5sum /mnt/images/file.qcow2
the checksum is the same for both files.
But when I try to compute the MD5 checksum using the following piece of code
def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))
    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = hashlib.md5(file1).hexdigest()
        md5File2 = hashlib.md5(file2).hexdigest()
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False
    if md5File1 == md5File2:
        return True
    else:
        return False
it says the checksums are not the same.
UPDATE: The files can be up to 8 GB in size.
Solution 1:[1]
You are hashing the path of the file, not the content ...
hashlib.md5(file1).hexdigest() # file1 == '/path/to/file.ext'
import hashlib
import os

def md5(fname):
    # Hash the file's content in 16 KiB chunks so large files fit in memory
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(16384), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))
    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = md5(file1)
        md5File2 = md5(file2)
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False
    if md5File1 == md5File2:
        return True
    else:
        return False
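Sidenote: You probably want to use hashlib.sha1() (with Unix's sha1sum) instead of MD5, which is cryptographically broken and deprecated. Note that SHA-1 also has known collision attacks, so hashlib.sha256() (with sha256sum) is the safer choice if tampering is a concern.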
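A minimal sketch of how the hash algorithm could be made a parameter, so the same chunked reader works with MD5, SHA-1 or SHA-256. The name file_digest_hex and its parameters are illustrative and not part of the answer; hashlib.new() is the standard-library constructor that takes an algorithm name.

```python
import hashlib


def file_digest_hex(path, algorithm="sha1", chunk_size=16384):
    """Return the hex digest of a file's content, read in fixed-size chunks.

    `algorithm` is any name accepted by hashlib.new(), e.g. "md5",
    "sha1" or "sha256". The function name and defaults are assumptions
    made for this example.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

The result of file_digest_hex(path, "sha1") should match the first field printed by sha1sum for the same file, and likewise "md5" with md5sum.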
Edit: Benchmark with various buffer sizes, md5 vs sha1
Using a 100 MB random file on a crappy server (Atom N2800 @1.86GHz):
Algorithm | Buffer (bytes) | Time (s) |
---|---|---|
md5sum | --- | 0.387 |
MD5 | 2^6 | 21.5670549870 |
MD5 | 2^8 | 6.64844799042 |
MD5 | 2^10 | 3.12886619568 |
MD5 | 2^12 | 1.82865810394 |
MD5 | 2^14 | 1.27349495888 |
MD5 | 128^1 | 11.5235209465 |
MD5 | 128^2 | 1.27280807495 |
MD5 | 128^3 | 1.16839885712 |
sha1sum | --- | 1.013 |
SHA1 | 2^6 | 23.4520659447 |
SHA1 | 2^8 | 7.75686216354 |
SHA1 | 2^10 | 3.82775402069 |
SHA1 | 2^12 | 2.52755594254 |
SHA1 | 2^14 | 1.93437695503 |
SHA1 | 128^1 | 12.9430441856 |
SHA1 | 128^2 | 1.93382811546 |
SHA1 | 128^3 | 1.81412386894 |
So md5sum is faster than sha1sum, and Python's implementations show the same pattern. A bigger buffer improves performance, but only up to a point: 16384 bytes seems a good tradeoff (efficient without being too big).
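A rough sketch of how such a buffer-size comparison could be reproduced on your own machine. The helper names time_hash and compare_buffer_sizes and the default list of buffer sizes are assumptions made for this example, not the script used for the numbers above. Timings will also vary with disk cache state, so run it more than once.

```python
import hashlib
import time


def time_hash(path, algorithm="md5", buffer_size=16384):
    """Hash a file in chunks and return (hex digest, elapsed seconds)."""
    hasher = hashlib.new(algorithm)
    start = time.perf_counter()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(buffer_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest(), time.perf_counter() - start


def compare_buffer_sizes(path, algorithm="md5",
                         buffer_sizes=(2**6, 2**8, 2**10, 2**12, 2**14, 128**3)):
    """Return {buffer_size: seconds} for one algorithm on one file.

    The default sizes are illustrative; pass your own tuple to test others.
    """
    results = {}
    for size in buffer_sizes:
        _, seconds = time_hash(path, algorithm, size)
        results[size] = seconds
    return results
```

For example, compare_buffer_sizes("/opt/images/file.qcow2", "sha1") would return one timing per buffer size for the first image from the question.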
Solution 2:[2]
Try this:
import os
from hashlib import md5

def md5File(filename):
    hasher = md5()
    blockSize = 16 * 1024 * 1024  # read 16 MiB at a time
    with open(filename, 'rb') as f:
        while True:
            fileBuffer = f.read(blockSize)
            if not fileBuffer:
                break
            hasher.update(fileBuffer)
    return hasher.hexdigest()

def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))
    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = md5File(file1)
        md5File2 = md5File(file2)
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False
    return md5File1 == md5File2
When you just do hashlib.md5(file1).hexdigest(), you're literally just md5'ing the name of the file. You actually want to md5 the content, which requires opening and reading the file using Python file operations. The method I've posted above can hash a large file without reading the whole thing into memory.
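To make the difference concrete, here is a small sketch that contrasts hashing the path string with hashing the file's bytes. The function names md5_of_path_string and md5_of_content are made up for this illustration. Note that on Python 3, passing a str straight to hashlib.md5() raises a TypeError, so the path has to be encoded explicitly to reproduce what the question's code computed.

```python
import hashlib


def md5_of_path_string(path):
    """MD5 of the path text itself (what the question's code effectively hashed)."""
    return hashlib.md5(path.encode("utf-8")).hexdigest()


def md5_of_content(path, chunk_size=16384):
    """MD5 of the bytes stored in the file, read in chunks."""
    hasher = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

For two identical copies stored at different paths, md5_of_path_string returns different values because the strings differ, while md5_of_content returns the same value and agrees with md5sum.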
Solution 3:[3]
How about using the code below:
def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))
    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = hashlib.md5(open(file1,"rb").read()).hexdigest()
        md5File2 = hashlib.md5(open(file2,"rb").read()).hexdigest()
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False
    if md5File1 == md5File2:
        return True
    else:
        return False
Please note that this reads each file fully into memory, so it is only suitable for smaller files. If the file is large, it is better to read it chunk by chunk, as in the examples given above. For this case, you could use the following code:
import hashlib
import time

with open("Some_Very_Large_File", "rb") as f:
    hasher = hashlib.md5()
    a = time.time()
    while True:
        data = f.read(3 * 1024)  # read 3 KiB at a time
        if not data:
            break
        hasher.update(data)
    print(hasher.hexdigest())
    b = time.time()
    print("Done hashing in", b - a, "seconds")
Following are the benchmarks I observed:
3.26 GB media file: hash calculated in 11.26 sec.
4.8 GB file: hash calculated in 16.47 sec.
10.8 GB file: hash calculated in 102.36 sec.
Please try the code and do let me know.
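As a side note to the chunked loop above, newer Python versions (3.11 and later) provide hashlib.file_digest(), which does the chunked reading for you. A hedged sketch that uses it when present and otherwise falls back to a manual loop; the function name md5_file and the 1 MiB fallback chunk size are assumptions for this example.

```python
import hashlib


def md5_file(path, chunk_size=1024 * 1024):
    """MD5 of a file's content without loading the whole file into memory."""
    with open(path, "rb") as f:
        if hasattr(hashlib, "file_digest"):
            # Python 3.11+: the standard library reads the file in chunks
            return hashlib.file_digest(f, "md5").hexdigest()
        # Older versions: manual chunked loop
        hasher = hashlib.md5()
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
        return hasher.hexdigest()
```

Either branch produces the same digest, so isImageLatest could call md5_file(file1) and md5_file(file2) in place of the read-everything version.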
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 |