'How to convert hashbytes string from sql to spark equivalent

I have a process using the following select statement in sql server

SELECT  HASHBYTES('SHA1', CAST('4100119300' AS NVARCHAR(100))) AS StringConverted

This give you: 0x66A2F63C04A3A85347AD2F5CD99F1113F1BDD9CE

I have been trying to re-create this same result in Spark SQL without luck.

I tried this sha1(encode('4100119300','utf-8')) in Spark

But the result of this is: b4cf5aae8ce3dc1673da4949cfdf2edfa33fdba4

During my test if I remove the cast in the sql area the result is the same in spark. The problem I see is that in spark you can't specify the size of the string or maybe is changing the encoding in the process. I already have data in sql hashed with the nvarchar(100) and is not possible to remove it from the spark equivalent.

Any suggestions ?



Solution 1:[1]

Have You seen those differences?

SELECT HASHBYTES('SHA1', CAST('4100119300' AS NVARCHAR(100))) AS StringConverted
-- 0x66A2F63C04A3A85347AD2F5CD99F1113F1BDD9CE
SELECT HASHBYTES('SHA1', '4100119300') AS StringConverted
-- 0xB4CF5AAE8CE3DC1673DA4949CFDF2EDFA33FDBA4

To store varbytes as string I use CONVERT with style = 1 flag (CAST & CONVERT)

SELECT CONVERT(VARCHAR(100), HASHBYTES('SHA1', '4100119300'), 1) AS StringConverted

And this is what You are looking for. It's simply Sparks b4cf5aae8ce3dc1673da4949cfdf2edfa33fdba4 in lowercase and without 0x prefix.

Solution 2:[2]

The first thing to highlight here is that Varchar to NVarchar relates to encoding so you just need the same encoding to regenerate the Hash Key which is 'utf_16_le' encoding.

For regeneration of :

SELECT CONVERT(VARCHAR(254), HASHBYTES('SHA2_512', CONVERT(NVARCHAR(24), '2020-05-27 00:00:00.000', 127)), 2)

You will need something like this in Pyspark :

hashlib.sha512('2020-05-27 00:00:00.000'.encode('utf_16_le')).hexdigest().upper()

Link to related issue: How to reproduce the behavior of SQL NVARCHAR in Python when generating a SHA-512 hash?

Hope that Helps. Thanks

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jan Madeyski
Solution 2 Waqar