'SQL - UTF-8 to varchar/nvarchar Encoding issue
Problem background - I am receiving response data from a website, formatted in json and UTF-8 encoded. A body attribute of json has values in a base64binary type, and I store it as nvarchar on ms sql server.
That base64binary data when converted to varchar or nvarchar contains funny characters(in places of double quotes etc.), indicating that there is an encoding issue - This is reason I post this question.
See dissected code as follows and a runnable example at the bottom, with comments.
Notice funny characters during conversion.
eg. On behalf of IRB Holding Corp (the “Company")
The following query fixes above issue - I see quotes as they should appear, but then it fails on rows containing '&' , which is a special character in xml.
select convert(xml, '<?xml version="1.0" encoding="UTF-8"?>' + convert(varchar(max),cast('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)')))
The following query handles above issue by using replace
statements and I am able to completely see all the rows as expected. But this solution will only handle the '&'
s.
Example code to run:
declare @t table ( [body] nvarchar(max) )
insert into @t(body)
select 'REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA=='
select convert(varchar(max),cast('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)'))
, convert(xml, '<?xml version="1.0" encoding="UTF-8"?>'+ replace(convert(varchar(max),convert(varchar(max),cast('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)'))),'&','&'))
from @t
The question - Will I have to add more replace statements for other xml special characters - < , >
?
Solution 1:[1]
The XML trick works fine, just let the XML engine handle the character entities:
declare @t table ([body] nvarchar(max));
insert into @t(body)
values ('REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA==');
select
cast(
cast('<?xml version="1.0" encoding="UTF-8"?><root><![CDATA[' as varbinary(max))
+
CAST('' as xml).value('xs:base64Binary(sql:column("body"))', 'VARBINARY(MAX)')
+
cast(']]></root>' as varbinary(max))
as xml).value('.', 'nvarchar(max)')
from
@t;
The important parts here are:
- The absence of
N
in front of the string literals - The
encoding="UTF-8"
- The fact that we know that the characters from the XML declaration element have the same UTF-8 representation as they do in latin1, so casting them to
varbinary
gives valid UTF-8 - The
<![CDATA]]>
block.
Note that it is still no more than a hack. As soon as you involve XML, you are subject to the XML limitations, and if your string contains characters not representable in XML, that type of XML conversion is going to fail with
XML parsing: line 1, character 54, illegal xml character
Solution 2:[2]
UPDATE: I just learnd something new, which is - uhm - great :-)
Try this function
CREATE FUNCTION dbo.Convert_utf8(@utf8 VARBINARY(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
DECLARE @rslt NVARCHAR(MAX);
SELECT @rslt=
CAST(
--'<?xml version="1.0" encoding="UTF-8"?><![CDATA['
0x3C3F786D6C2076657273696F6E3D22312E302220656E636F64696E673D225554462D38223F3E3C215B43444154415B
--the content goes within CDATA
+ @utf8
--']]>'
+ 0x5D5D3E
AS XML).value('.', 'nvarchar(max)');
RETURN @rslt;
END
GO
And call it like this
SELECT *
,dbo.Convert_utf8(CAST(t.body AS XML).value('.','varbinary(max)'))
FROM @t t;
The result is
DALLAS, TX – May 7, 2019 – Covey & Park Energy Holdings LLC (“Covey Park”
GSerg, thank you very much ! for your answer below. I tried around and simplified this to work within an UDF.
It looks as if the cast of a varbinary(max)
to XML is completely done in a CLR environment, where the XML's encoding declaration is taken into account. This seems to work with other encodings too, but I don't have the time right now, to test this generically.
Now the rest of the answer
As it contains some background about string encoding, which might be worth to read.
I simplified your code a bit:
declare @t table ( [body] nvarchar(max) )
insert into @t(body)
select 'REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA==';
SELECT CAST(t.body AS XML).value('.','varbinary(max)')
,CAST(CAST(t.body AS XML).value('.','varbinary(max)') AS VARCHAR(MAX))
FROM @t t;
You will see this result
0x44414C4C41532C20545820E28093204D617920372C203230313920E2809320436F7665792026205061726B20456E6572677920486F6C64696E6773204C4C432028E2809C436F766579205061726BE2809D20
DALLAS, TX – May 7, 2019 – Covey & Park Energy Holdings LLC (“Covey Parkâ€
I'll place the first characters more reader friendly
0x44414C4C41532C20545820E28093
D A L L A S , T X â € “
The 0x44
is the D
, twice the 0x4C
is the doubled LL
, and after the space 0x20
we get to E28093
. This is a 3-byte encoded code point for the en dash. SQL-Server will not help you with this... It will interpret this into 3 characters of 1 byte each...
I'm afraid, you'r out of luck...
SQL-Server does not support utf-8
strings. There is limited support with BCP / BULK
to enable input from the file system, but a string within T-SQL
must be one of the two supported options:
(var)char
, which is extended ASCII. It is strictly one-byte-per-character and will need a collation to deal with a limited set of foreign characters.n(var)char
, which is UCS-2 (very similar toUTF-16
). It is strictly two-bytes-per-character and will encode (almost) any known character at the price of doubled size in memory.
UTF-8
is compatible with (var)char
, as long as we stick with plain latin and to one-byte-codes. But any ASCII code above 127 will lead into troubles (might work with the right collation). But - this is your case here - your string uses multi-byte-code-points. UTF-8
will encode a lot of characters with two or even more bytes (up to 4!) for one single character.
What you can do
You will have to use some engine capable to deal with UTF-8
- a CLR-function
- Export to a file and re-import using the limited support (needs v2014 SP2 or higher)
- Use an external tool (PowerShell, C#, any programming language you know)
And - thx to @GSerg - two more options:
- Wait for v2019. There will be special collations allowing for native support of
utf-8
in T-SQL-strings - This answer provides an UDF, which can transform UTF8 to NVARCHAR. It won't be fast, but it works.
General remark
A database can hold storage-data just as is, or working-data, you want to use in the one or the other way. Storing a picture as VARBINARY(MAX)
is just a chunk of bits. You would not try to use SQL-Server to perform image reckognition.
This is the same with text data. If you just store a chunk of text, it won't matter, how you do this. But if you want to use this text for filtering, searching or if you want to use SQL-Server to display this text, you must think about the format and the needs for performance.
An enocding with variable byte lengths will not allow a simple SUBSTRING('blahblah',2,3)
. With fixed length the engine can just take the string as an array, jump to the second index and pick the next three characters. But with variable bytes the engine will have to compute the index by checking all characters before, if there might be any multi-byte code point. This will slow down a lot of string methods extremly...
Best was, not to store data in a format, the SQL-Server cannot handle (well)...
Solution 3:[3]
If you had SQL server 2019, you could create another database with UTF8 as default collation and create simple function there:
USE UTF8_DATABASE
GO
CREATE OR ALTER FUNCTION dbo.VarBinaryToUTF8
(@UTF8 VARBINARY(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
RETURN CAST(@UTF8 AS VARCHAR(MAX));
END;
You would than call
SELECT
UTF8_DATABASE.dbo.VarBinaryToUTF8
(
CAST('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)')
)
FROM
@t
This works because SQL server uses the default collation of specific database for its variables and function return values. You have to store the result into NVARCHAR
or UTF8
collated 'VARCHAR in your non-
UTF8` database.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Community |
Solution 3 | andowero |