'How to convert hashbytes string from sql to spark equivalent
I have a process using the following select statement in sql server
SELECT HASHBYTES('SHA1', CAST('4100119300' AS NVARCHAR(100))) AS StringConverted
This give you: 0x66A2F63C04A3A85347AD2F5CD99F1113F1BDD9CE
I have been trying to re-create this same result in Spark SQL without luck.
I tried this sha1(encode('4100119300','utf-8'))
in Spark
But the result of this is: b4cf5aae8ce3dc1673da4949cfdf2edfa33fdba4
During my test if I remove the cast in the sql area the result is the same in spark. The problem I see is that in spark you can't specify the size of the string or maybe is changing the encoding in the process. I already have data in sql hashed with the nvarchar(100) and is not possible to remove it from the spark equivalent.
Any suggestions ?
Solution 1:[1]
Have You seen those differences?
SELECT HASHBYTES('SHA1', CAST('4100119300' AS NVARCHAR(100))) AS StringConverted
-- 0x66A2F63C04A3A85347AD2F5CD99F1113F1BDD9CE
SELECT HASHBYTES('SHA1', '4100119300') AS StringConverted
-- 0xB4CF5AAE8CE3DC1673DA4949CFDF2EDFA33FDBA4
To store varbytes as string I use CONVERT with style = 1
flag (CAST & CONVERT)
SELECT CONVERT(VARCHAR(100), HASHBYTES('SHA1', '4100119300'), 1) AS StringConverted
And this is what You are looking for. It's simply Sparks b4cf5aae8ce3dc1673da4949cfdf2edfa33fdba4
in lowercase and without 0x
prefix.
Solution 2:[2]
The first thing to highlight here is that Varchar to NVarchar relates to encoding so you just need the same encoding to regenerate the Hash Key which is 'utf_16_le' encoding.
For regeneration of :
SELECT CONVERT(VARCHAR(254), HASHBYTES('SHA2_512', CONVERT(NVARCHAR(24), '2020-05-27 00:00:00.000', 127)), 2)
You will need something like this in Pyspark :
hashlib.sha512('2020-05-27 00:00:00.000'.encode('utf_16_le')).hexdigest().upper()
Link to related issue: How to reproduce the behavior of SQL NVARCHAR in Python when generating a SHA-512 hash?
Hope that Helps. Thanks
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Jan Madeyski |
Solution 2 | Waqar |